
ORIGINAL RESEARCH article

Front. Psychol., 06 December 2022
Sec. Perception Science
This article is part of the Research Topic Advances in Color Science: From Color Perception to Color Metrics and its Applications in Illuminated Environments

Cross-modal association analysis and matching model construction of perceptual attributes of multiple colors and combined tones

Shuang Wang1,2,3, Jingyu Liu1,2,3, Xuedan Lan1,2,3, Qihang Hu1,2,3, Jian Jiang4 and Jingjing Zhang1,2,3,5*
  • 1State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing, China
  • 2Key Laboratory of Acoustic Visual Technology and Intelligent Control System, Ministry of Culture and Tourism, Communication University of China, Beijing, China
  • 3Beijing Key Laboratory of Modern Entertainment Technology, Communication University of China, Beijing, China
  • 4China Digital Culture Group Co., Ltd, Beijing, China
  • 5School of Computer and Cyber Sciences, Communication University of China, Beijing, China

Audio-visual correlation is a common phenomenon in real life. In this article, aiming at analyzing the correlation between multiple colors and combined tones, we comprehensively used methods from experimental psychology, audio-visual information processing technology, and machine learning to study the correlation mechanism between multi-color perceptual attributes and the interval consonance attribute of musical sounds, and to construct audio-visual cross-modal matching models. Specifically, this article first constructed a multi-color perceptual attribute dataset through a subjective evaluation experiment, covering the attributes “cool/warm,” “soft/hard,” “transparent/turbid,” “far/near,” “weak/strong,” pleasure, arousal, and dominance, and constructed an interval consonance attribute dataset by calculating objective audio parameters. Secondly, a subjective evaluation experiment of cross-modal matching was designed and carried out to analyze the audio-visual correlation and to obtain the cross-modal matched and mismatched data between the audio-visual perceptual attributes. On this basis, through visualization and correlation analysis of the matched and mismatched data, this article showed that there is a certain correlation between multiple colors and combined tones from the perspective of perceptual attributes. Finally, this article used linear and non-linear machine learning algorithms to construct audio-visual cross-modal matching models, so as to realize mutual prediction between the audio-visual perceptual attributes, with the highest prediction accuracy reaching 79.1%. The contributions of our research are: (1) the cross-modal matched and mismatched dataset can provide basic data support for audio-visual cross-modal research; (2) the constructed audio-visual cross-modal matching models can provide a theoretical basis for audio-visual interaction technology; and (3) the research method of audio-visual cross-modal matching proposed in this article can provide new research ideas for related research.

Introduction

Synesthesia (Grossenbacher and Lovelace, 2001; Ward, 2013) refers to the mental activity in which stimulation of one sense induces a sensation in another sense. People’s cognition of the world comes from a variety of sensory information, such as vision, hearing, smell, taste, and touch. The integration of multiple sensory channels helps us form a more comprehensive understanding of things and reduces our dependence on any single sensory channel (Jia et al., 2018). Audio-visual synesthesia is one instance of this. For example, when listening to music, different pictures often emerge in our mind along with the music, and color is the most intuitive description of pictures and emotions; conversely, when we appreciate a painting, there seems to be a piece of music in our ears. This cross-modal association is automatic, involuntary, and irrepressible (Li, 2021). At present, a large amount of research has been conducted on the phenomenon, process, and mechanism of multi-sensory information integration in neurobiology, brain science, psychology, and other related fields, resulting in a series of theoretical hypotheses and models about synesthesia. Multi-sensory information integration has been scientifically verified in many fields, showing that the phenomenon of synesthesia is universal and stable, and thus can be studied (Rouw and Scholte, 2007; Kim et al., 2013; Arturo and James, 2016).

The connection between music and color studied in this article is an audio-visual synesthesia phenomenon. Music and painting are two art forms closely related to our lives, and the relationship between music and color is the closest. There have been many studies on audio-visual synesthesia exploring the basic attributes of single tones (including duration, pitch, loudness, etc.) and the basic attributes of single colors (including hue, chroma, lightness, etc.). In a Positron Emission Tomography (PET) study, Bushara et al. (2001) asked subjects to determine whether pure auditory tones and visually presented color rings appeared at the same time, and found that the prefrontal and posterior parietal lobes of the brain are part of the network of multi-sensory brain regions that detects the synchronization of visual and auditory stimuli. Eimer and Schröger (2003) used functional Magnetic Resonance Imaging (fMRI) to investigate the neurophysiological mechanism of audio-visual cross-modal integration of speech signals, and found a significant integration effect in the posterior part of the left superior temporal sulcus. In addition, the McGurk effect (Mcgurk and Macdonald, 1976) shows that forming cognition of external information is a process of building a holistic understanding of things based on different sensory information; the lack or inaccuracy of any sensory information will lead the brain to misinterpret external information. Under certain circumstances, the sound obtained by relying on the ear alone differs from the sound obtained by combining visual and auditory sensory information. This phenomenon provides support for cross-modal interaction among people’s various senses and is also strong evidence of audio-visual synesthesia. Moreover, many studies have shown that the audio-visual cross-modal integration effect is universal and stable for people of different ages, genders, and cultural backgrounds.

In recent years, research on audio-visual cross-modal association has gradually become a hotspot. Zhou (2004) used the method of experimental psychology, took synesthesia as the starting point, and showed through qualitative research that there is a certain relationship between visual attributes and auditory attributes. Palmer et al. (2013) demonstrated that the cross-modal matches between music and colors are mediated by emotional associations. In the field of information science, Jiang et al. (2019) put forward a method and framework for cross-modal research on the audio-visual integration effect, which regarded the human brain as a “black box” and divided the “input” visual and auditory information into low-level features, middle-level features, and high-level features. The low-level features refer to objective physical features, including visual color, shape, texture, etc., and auditory pitch, loudness, etc. Due to the different data representations of different modalities, the low-level features differ greatly, so direct audio-visual association cannot be carried out at this layer. The middle-level features refer to perceptual features, including visual perceptual features (e.g., a color’s “warm/cool,” “swell/shrink,” “dynamic/static,” and the harmony of a color combination) and auditory perceptual features (e.g., fullness, roughness, and the interval consonance of musical sounds). Perceptual features refer to physiological responses directly generated by physical stimuli, and there are certain similarities among different modalities. High-level features refer to semantic features, including emotion, aesthetic feeling, etc. Semantic features are the result of the human brain integrating information from different modalities, and information from different modalities is shared at this layer. On this basis, Liu et al. (2021) extracted timbre perceptual features and quantitatively analyzed them together with the basic attributes of single colors (hue, chroma, and lightness) to construct a timbre-color cross-modal matching model. The research results showed that certain attributes of timbre have a strong correlation with certain attributes of color.

Most current studies focus on the cross-modal correlation between single colors and single tones. However, in real life, pictures are composed of multiple colors and music is composed of combined tones, which are more common and more widely applicable. Moreover, studies have shown that there are differences in perceptual attributes between single colors and multiple colors, and between single tones and combined tones (Griscom and Palmer, 2012). Therefore, it is necessary to further study the cross-modal association between multiple colors and combined tones on the basis of the correlation between single colors and single tones, so as to enrich the theory and methodology of audio-visual synesthesia research and reveal the essence of synesthesia. The classic color matching mode for multiple colors is the three-color combination (Kato, 2010). Therefore, this article took three-color combinations as the visual materials and quantified the corresponding perceptual attributes. On the other hand, the basic unit of combined tones is the interval, which refers to the pitch relationship between two tones (including harmonic intervals and melodic intervals), and interval consonance is the basic perceptual attribute reflecting the auditory perception of an interval (Wang and Meng, 2013; Lu and Meng, 2016). Therefore, this article took intervals as the auditory materials and quantified the interval consonance attribute.

To sum up, this article took multiple colors and combined tones as the research object, designed and implemented the audio-visual cross-modal matching subjective evaluation experiment between three-color combinations and intervals, analyzed the audio-visual cross-modal correlation on the basis of quantifying the relevant perceptual attributes of audio-visual materials, and finally constructed the audio-visual cross-modal matching models through the linear and non-linear machine learning algorithms. The section arrangement of this article is shown in Figure 1.


Figure 1. The section arrangement of this article.

Perceptual attribute dataset construction

This section introduces the construction of the multi-color perceptual attribute dataset and the interval consonance attribute dataset, mainly including the construction of the audio-visual material sets and the quantification of the audio-visual perceptual attributes, which provides data support for the research on the association between multi-color perception and interval consonance.

Multi-color perceptual attribute dataset construction

Multi-color material set construction

Fifty three-color combinations selected from the research results of the Nippon Color & Design Research Institute (NCD) were used as the multi-color materials. NCD selected 130 representative colors which accurately represented psychological experiences based on the “Hue and Tone System,” and these colors were evaluated with 180 image descriptive words. They then combined the 130 single colors into 50 three-color combinations based on color image and further classified them into 16 categories, including “lovely,” “romantic,” and “refreshing,” as shown in Supplementary Appendix Table A1. On this basis, cluster analysis was adopted to reduce the dimensionality of the 180 image descriptive words, so as to construct a three-dimensional “color image scale” with the dimensions “cool/warm,” “soft/hard,” and “transparent/turbid,” and the 16 categories were mapped into this color image space (Kobayashi, 2006, 2010). These 50 materials and their L*, a*, and b* values in the CIELAB color space are shown in Supplementary Appendix Table A2.

Subjective evaluation experiment on multi-color perception

The multi-color perceptual attributes are quantitative descriptions of three-color combinations from the standpoint of human perception, such as the subjective senses of “cool/warm,” “far/near,” and pleasure inspired by color. In this article, the data on multi-color perceptual attributes were obtained through a subjective evaluation experiment on color perception. The experimental data were then used in the correlation analysis between the multi-color perceptual attributes and the interval consonance attribute, and in the construction of the audio-visual cross-modal matching models.

Subjects

A total of 16 subjects, aged 18–22, with a gender ratio close to 1:1, took part in the experiment. None of them majored in visual science or aesthetic design. Before the formal experiment, each subject was given an Ishihara Color Blindness Test (Marey et al., 2015) to ensure that their color vision was normal. They then signed the informed-consent form and were compensated for participating.

Perceptual descriptive words selection

A total of 260 perceptual descriptive words were collected from the literature, questionnaires, and dictionaries (Gao and Xin, 2006; Wang et al., 2006, 2020), and then they were further screened according to the following principles: (1) remove words that are susceptible to individual preference; (2) combine words with similar meanings; (3) remove words with ambiguous explanations. On this basis, a preliminary experiment (Kobayashi, 2010) was carried out to select perceptual descriptive words based on the multi-color materials, in which the subjects were asked to select the words suitable for describing a given three-color combination. The selection frequency of each word was recorded to determine which words were retained, so that the selected perceptual descriptive words were relatively unaffected by individual preference and had good objectivity and universality. Finally, as shown in Table 1, five pairs of mid-level descriptive words were selected, namely “cool/warm,” “soft/hard,” “transparent/turbid,” “far/near,” and “weak/strong.” In addition, three high-level descriptive words were taken from the PAD emotion model, with P representing pleasure, A representing arousal, and D representing dominance (Russell and Mehrabian, 1977), so as to acquire a more complete description of human multi-color perception.


Table 1. The selected perceptual descriptive words of multi-color perceptual attributes.

Experimental condition

The experiment was carried out in an underground standard listening room with a reverberation time of 0.3 s. The sound field distribution was uniform, and there were no adverse acoustic phenomena or background noise, so as to avoid interference with the perceptual evaluation. The laboratory area was 5.37 m × 6 m, and the wall sound-absorbing materials and the main experimental facilities were all gray (Munsell: N4). A 75-inch Sony KD-75X9400D HD display with a 4K resolution (3840 × 2160) was adopted. According to the Methodology for the Subjective Assessment of the Quality of Television Pictures (ITU-R BT.500-14), the ambient illumination at the display was set to 200 lx. The “slideshow” function of the ACDSee software (official free version; ACD Systems International Inc., Shanghai, China) was used to present the stimuli in random order, centered on a gray background (Munsell: N2). The luminance was 596.5 cd/m2, and the luminance uniformity (standard deviation, nine-point test) was 0.03 cd/m2 for the black-level input signal and 10.69 cd/m2 for the white-level input signal.

Experimental procedure

The experiment was divided into two steps. The first step was to fill in basic personal information and sign the informed-consent form. The second step was to rate the perceptual descriptive words on a five-level scale. In the experiment, the scoring time for each material was fixed at 30 s, and the interval between two pictures was 5 s. In order to avoid eye fatigue caused by long-term viewing of the screen, the materials were divided into two groups: after the first 25 materials were evaluated, a 5-min break was taken before the remaining 25 materials were evaluated.

Reliability analysis and multi-color perceptual attributes quantification

Cronbach’s alpha was adopted to evaluate the reliability of the experimental data; it measures the internal consistency of the evaluation results, and a value above 0.7 is usually considered reliable.

\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_x^2}\right) \quad (1)

where k represents the number of subjects, σi2 represents the score variance of all the subjects on the ith measurement item, and σx2 represents the variance of the total scores obtained by all the subjects. All the perceptual descriptive words have very good internal consistency among the 16 subjects, and the Cronbach’s alpha values are: 0.94 (“cool/warm”), 0.908 (“soft/hard”), 0.946 (“transparent/turbid”), 0.715 (“far/near”), 0.921 (“weak/strong”), 0.893 (pleasure), 0.917 (arousal), and 0.875 (dominance). The Cronbach’s alpha values are all greater than 0.7, with the highest reaching 0.946, which meets the reliability requirements.
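As an illustration, the following Python sketch evaluates Equation (1) for one descriptive word from a materials × subjects score matrix; the matrix shape and the random ratings are hypothetical placeholders, not the experimental data.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Equation (1): Cronbach's alpha for a (materials x raters) score matrix."""
    k = scores.shape[1]                          # number of rating columns (subjects)
    item_vars = scores.var(axis=0, ddof=1)       # variance of each column
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the row totals
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# toy example: 50 materials rated by 16 subjects on a 5-point scale (random data)
rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(50, 16)).astype(float)
print(round(cronbach_alpha(ratings), 3))
```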

Finally, as shown in Equation (2), an eight-dimensional multi-color perceptual attribute vector was constructed based on the evaluation results.

F = [f_1, f_2, \ldots, f_8] \quad (2)

where fi represents the average of all the subjects’ 5-point-scale scores on the ith multi-color perceptual descriptive word.

Interval consonance attribute dataset construction

Interval material set construction

The composition of the interval materials is shown in Figure 2, with a total of 52 interval materials, including melodic intervals and harmonic intervals. A melodic interval refers to a two-tone combination played successively, and a harmonic interval refers to a two-tone combination played simultaneously. Each interval category was produced with two timbres, namely piano (unsustainable sound) and violin (sustainable sound). The envelope of a musical instrument includes four phases: the Attack, Decay, Sustain, and Release phases, which are called the ADSR model for short. An “unsustainable sound” has only a simplified ADSR envelope consisting of the attack and decay phases, such as plucked instruments (e.g., harp) and struck or percussion instruments (e.g., piano, marimba). A “sustainable sound” includes the sustain phase of the ADSR envelope, such as the violin, clarinet, and trumpet (Jiang et al., 2020); a minimal sketch of such an envelope is given below. In total, there are 13 two-tone relationships with different interval consonance indexes, as shown in Table 2. Therefore, there are four categories, namely piano-melodic intervals, violin-melodic intervals, piano-harmonic intervals, and violin-harmonic intervals, with 13 intervals each.
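To make the distinction between unsustainable and sustainable sounds concrete, the following Python sketch builds a piecewise-linear ADSR amplitude envelope; the time constants and levels are illustrative assumptions, not the parameters of the recorded materials.

```python
import numpy as np

def adsr_envelope(sr=44100, attack=0.01, decay=0.3,
                  sustain_level=0.0, sustain_time=0.0, release=0.05):
    """Piecewise-linear ADSR amplitude envelope (illustrative values only)."""
    a = np.linspace(0.0, 1.0, int(sr * attack), endpoint=False)           # Attack
    d = np.linspace(1.0, sustain_level, int(sr * decay), endpoint=False)  # Decay
    s = np.full(int(sr * sustain_time), sustain_level)                    # Sustain
    r = np.linspace(sustain_level, 0.0, int(sr * release))                # Release
    return np.concatenate([a, d, s, r])

# piano-like "unsustainable" envelope (attack + decay only)
piano_env = adsr_envelope(sustain_level=0.0, sustain_time=0.0)
# violin-like "sustainable" envelope (held sustain phase)
violin_env = adsr_envelope(sustain_level=0.7, sustain_time=1.0)
```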


Figure 2. The composition of the interval materials.


Table 2. The 13 common two-tone relationships (intervals).

A MacBook Pro personal computer and Sony 8506 monitor earphones were used to record the interval materials. The steps were as follows: (1) open the digital audio workstation software Logic Pro (X; Apple Inc., Cupertino, CA, USA) and create a new project file; (2) create two audio tracks for the two instruments, piano and violin; taking the M3 harmonic interval played by piano as an example, load the Kontakt 6 sampler on these two audio tracks, and add the piano sound source (Cine Piano, set to the default value) and the violin sound source (Chris Hein Solo Violin, set to Dynamic Expression Long); (3) write the melodic intervals and harmonic intervals at a loudness of mezzo forte (mf) in the two audio tracks respectively, and export all the interval materials.

Interval consonance attribute quantification

Based on the results of previous literature (Chen and Lu, 1994; Xue, 2012) and the principle of harmony, a quantitative method was adopted for calculating the interval consonance index under the pure-tone temperament and the twelve-tone equal temperament, and the 13 intervals were classified into six categories based on the interval consonance index. The principle of the calculation is: if two tones share a common partial, that is, if the frequency ratio of the two tones is a simple integer ratio, then an effect of consonance is produced. The calculation steps are as follows.

Step 1: First calculate the consonance degree of each pure-tone-temperament interval. The reciprocal of the product of the ordinal numbers of the first consonant partial of the two tones constituting the interval, taken in their respective partial columns, is used as the basis for evaluating the consonance degree; this is called the interval consonance coefficient K:

K = \frac{1}{mn} \quad (3)

where m represents the ordinal number of the first consonant partial in the root-tone partial column and n represents its ordinal number in the crown-tone partial column. The partial columns of the tones in the pure-tone temperament are shown in Table 3. It can be seen that tones with different frequencies can share the same partial, which is called a consonant partial. Take the interval composed of g and c (P5) as an example: the first consonant partial is g1, which is the second partial in the g partial column and the third partial in the c partial column. Therefore, the interval consonance coefficient is K = 1/(2 × 3) = 1/6.


Table 3. The partial column of the tones in the pure-tone temperament relative to c (from the first partial to the 16th partial).

Step 2: Since the distribution of K is not uniform, a logarithmic transform was adopted to define the interval consonance index of the pure-tone temperament, Ip, with the unit of decibels (dB):

I_p = 20\log_{10}\frac{1000}{mn}\ \mathrm{dB} \quad (4)
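The following Python sketch evaluates Equations (3) and (4) for the P5 example above; it is only an illustration of the two formulas, with the values of m and n taken from the text.

```python
import math

def consonance_coefficient(m: int, n: int) -> float:
    """Equation (3): K = 1/(m*n) for the first consonant partial."""
    return 1.0 / (m * n)

def consonance_index_pure(m: int, n: int) -> float:
    """Equation (4): Ip = 20*log10(1000/(m*n)) in dB."""
    return 20.0 * math.log10(1000.0 / (m * n))

# P5 (g over c): first consonant partial is the 2nd partial of g and the 3rd of c
print(consonance_coefficient(2, 3))           # 1/6
print(round(consonance_index_pure(2, 3), 1))  # about 44.4 dB
```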

Step 3: Since the frequency ratio of each interval of the twelve-tone equal temperament currently in use is an irrational number (2^(1/12))^k, where k = 1, 2, …, 12, the calculation method for the pure-tone temperament cannot be applied directly to the twelve-tone equal temperament. In practice, the similar interval of the pure-tone temperament is used to calculate the interval consonance index I of the corresponding twelve-tone equal temperament interval:

I = I_p - \Delta I \quad (5)

where Ip represents the interval consonance index of the similar pure-tone-temperament interval, and ΔI represents the correction corresponding to the cent deviation δ of the twelve-tone equal temperament interval from the similar pure-tone-temperament interval. −ΔI is calculated as follows:

\log\eta = \frac{\delta\log 2}{1200} \quad (6)
A = \frac{1}{1 + Q^2\left(\eta - \frac{1}{\eta}\right)^2} \quad (7)
-\Delta I = 20\log_{10}A \quad (8)

where δ represents the cent value by which a certain interval deviates from the similar pure-tone-temperament interval; η represents the frequency ratio corresponding to δ; A represents the relative amplitude; and Q represents the quality factor of the resonant circuit, which is set to 100. A small numerical sketch of Equations (4)–(8) is given below, and the calculation results are shown in Table 4.
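As a worked illustration of Steps 2 and 3, the following Python sketch combines Equations (4)–(8); the example deviation of about 1.955 cents of the equal-tempered fifth from the pure 3:2 fifth is a standard value, but the printed number should not be read as a reproduction of Table 4.

```python
import math

def consonance_index_equal(m: int, n: int, delta_cents: float, q: float = 100.0) -> float:
    """Equations (5)-(8): consonance index of a twelve-tone equal temperament
    interval, derived from the similar pure-tone interval (first consonant
    partials m, n) and its cent deviation delta_cents."""
    i_pure = 20.0 * math.log10(1000.0 / (m * n))        # Eq. (4)
    eta = 2.0 ** (delta_cents / 1200.0)                 # Eq. (6): log(eta) = delta*log(2)/1200
    a = 1.0 / (1.0 + q ** 2 * (eta - 1.0 / eta) ** 2)   # Eq. (7)
    delta_i = -20.0 * math.log10(a)                     # Eq. (8)
    return i_pure - delta_i                             # Eq. (5)

# equal-tempered P5 (700 cents) deviates by ~1.955 cents from the pure 3:2 fifth
print(round(consonance_index_equal(2, 3, 1.955), 1))
```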


Table 4. The interval consonance index of each interval of the 12-tone equal temperament (Q = 100).

Audio-visual cross-modal matching subjective evaluation experiment

Firstly, on the basis of the research results of Section “Perceptual attribute dataset construction,” this section introduces the design and implementation of the audio-visual cross-modal matching subjective evaluation experiment between the audio-visual perceptual attributes. Then, the data preprocessing and reliability analysis are introduced, which provide the input and output data for the correlation analysis and model construction.

The experiment was carried out by selecting the corresponding multi-color materials after listening to the interval materials. In the experiment, the independent variable was the playing interval material, and the dependent variable was the corresponding matched or mismatched multi-color material.

Subjects

A total of 23 subjects participated in the experiment, including 7 males and 16 females, all aged between 18 and 22 years old, and their majors were not related to audio-visual science. Before the experiment, each subject was given the Ishihara Color Blindness Test to ensure that their color vision was normal. Everyone then signed an informed-consent form and was compensated for their participation. In addition, these subjects were different from those of the multi-color perception experiment carried out before.

Experimental condition

The experiment was carried out in the standard listening room. The experimental conditions for the sound field and the presentation of the visual materials were the same as those of the multi-color perception experiment (see section “Subjective evaluation experiment on multi-color perception”). The three-way midfield monitoring speaker Genelec 1038B was used to play the interval materials. Throughout the experiment, the sound pressure level of the sound signal remained unchanged, and the actual sound pressure level was 75 dB (A), in line with the Standard of Acoustics Measurement in Hall (GB 50371-2006). The connection of the experimental system is shown in Figure 3A. One end of the laptop was connected to the display to present the multi-color materials, and the other end was connected to the left and right monitoring speakers to play the interval materials. The Adobe Audition software (version 14.1; Adobe Systems Incorporated, San Jose, CA, USA) was used to play the interval materials. During the experiment, the subjects’ ears were kept at the same height as the midpoint of the vertical line between the high- and low-frequency units of the speakers.


Figure 3. The experimental condition. (A) The connection of the experimental system. (B) The arrangement of seats. (C) The presentation of the multi-color materials.

In addition, as shown in Figure 3B, the seats were arranged in an arc centered on the display, so that the distance between each subject and the display was fixed at 2 m. The horizontal viewing angle of the display is 178°, which is the maximum angle at which a user can clearly see the screen, and the actual viewing angle for subjects sitting at either side was 80°, so as to ensure the consistency of the displayed colors. The presentation of the multi-color materials is shown in Figure 3C, with a gray background (Munsell: N2).

Experimental procedure

The experiment was conducted in two groups with 11 or 12 subjects in each group. During the experiment, the subjects were prohibited from talking or signaling to each other. The experiment was divided into three stages, namely the familiarization stage, the training stage, and the formal experiment stage, as described below.

(1) Familiarization stage: The interval materials were played in order, so as to avoid excessive concentration of subjective scores during the formal experiment.

(2) Training stage: Three interval materials were randomly selected from the material set, and the subjects were asked to select corresponding multi-color materials according to their subjective feelings, so as to avoid the influence caused by unfamiliarity with the experimental procedure.

(3) Formal experiment stage: Each interval material was played in random order, and the subjects were asked to select the first, second, and third best-matched multi-color materials. Similarly, the subjects were then asked to select the first, second, and third most mismatched multi-color materials.

In order to ensure the accuracy of the results, the subjects were asked to make immediate choices based on their subjective feelings, and the playing time of each interval material was limited to 30 s. In addition, the 64 interval materials (including 12 repeated interval materials selected randomly to verify the test–retest reliability) were divided equally into four groups, with a 5-min break between groups, so as to avoid the fatigue effect.

Data preprocessing and reliability analysis

Firstly, inspection of the experimental data showed that each set of matching data contained multi-color materials that were selected only once. Therefore, in order to avoid the influence of abnormal data on the experimental results, for each set of experimental data, all the data corresponding to the categories with the lowest selection frequency were removed; a minimal sketch of this filtering step is given below. Taking the histogram of the P1 piano-harmonic interval as an example, as shown in Figure 4, categories 10 and 15 of the multi-color materials were each selected only once, so the data from categories 10 and 15 were removed.
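The following Python sketch illustrates this preprocessing rule on a hypothetical list of selected category indices; the guard for the case where all categories are equally frequent is an added assumption.

```python
from collections import Counter

def drop_rare_categories(selections):
    """Remove selections whose category has the lowest selection frequency."""
    counts = Counter(selections)
    min_count = min(counts.values())
    if min_count == max(counts.values()):
        return list(selections)              # nothing stands out; keep everything
    return [c for c in selections if counts[c] > min_count]

# toy example: categories 10 and 15 appear only once and are removed
print(drop_rare_categories([3, 3, 7, 7, 7, 10, 15, 3]))
```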


Figure 4. The example of the histogram of the selection frequency of 16 multi-color categories which match the interval material 1 (P1 piano-harmonic interval). Among them, the gray bar represents the removed data in the histogram.

Then, the Cronbach’s alpha was adopted to evaluate the reliability of the experimental data, and the Cronbach’s alpha are: 0.861 (piano-harmonic intervals), 0.795 (piano-melodic intervals), 0.874 (violin-harmonic intervals), and 0.716 (violin-melodic intervals) respectively, which meets the reliability requirements.

Finally, the linearly weighted average values (Palmer et al., 2013) of the multi-color perceptual attributes of the matched and mismatched materials were used as the values of the multi-color perceptual attributes. The calculation methods are shown in Equations (9) and (10).

C_{d,m} = (3 \times C_{1,d,m} + 2 \times C_{2,d,m} + C_{3,d,m})/6 \quad (9)
I_{d,m} = (3 \times I_{1,d,m} + 2 \times I_{2,d,m} + I_{3,d,m})/6 \quad (10)

where Cj,d,m represents the value of the multi-color perceptual attribute d of the jth matched multi-color material for the mth interval material selected by the subjects. For example, C1,cool/warm,1 represents the “cool/warm” value of the first matched multi-color material selected by the subjects for the P1 piano-harmonic interval. Ij,d,m represents the value of the perceptual attribute d of the jth mismatched multi-color material for the mth interval material selected by the subjects. For example, I2,soft/hard,2 represents the “soft/hard” value of the second mismatched multi-color material for the m2 piano-harmonic interval. In addition, j ∈ [1,3], and m ∈ [1,52].
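A one-line Python sketch of this 3:2:1 weighting, with hypothetical attribute scores, is shown below.

```python
def weighted_attribute(values, weights=(3, 2, 1)):
    """Equations (9)/(10): linearly weighted average of the attribute values of
    the 1st, 2nd and 3rd (mis)matched multi-color materials."""
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# hypothetical "cool/warm" scores of the three matched materials
print(weighted_attribute([2.1, 2.8, 3.4]))   # (3*2.1 + 2*2.8 + 3.4) / 6
```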

Results and discussion

This section presents the correlation analysis of the matched and mismatched relationships between the perceptual attributes of multiple colors and intervals, including drawing scatterplots and calculating correlation coefficients. Then, linear and non-linear machine learning algorithms were adopted to construct the audio-visual cross-modal matching models between multiple colors and intervals, and the relationship between the two modalities was further analyzed.

Correlation analysis between multiple colors and intervals

After obtaining the correlation data of the multi-color perceptual attributes and the interval consonance attribute through the subjective matching experiment, scatterplots and fitting curves were drawn to visually analyze the correlation, with abnormally distributed data removed. Then, the Pearson’s correlation coefficient was calculated for the quantitative analysis.

Scatterplot analysis

In order to study the correlation between multiple colors and intervals, scatterplots were first drawn according to the interval consonance index (I), the perceptual attributes of multiple colors, and the matched and mismatched relationships between them, as shown in Figure 5. The X-axis represents the interval consonance index (I) and the Y-axis represents a certain perceptual attribute of the matched (or mismatched) multi-color material corresponding to the interval material. Due to the large number of scatterplots, only those showing a significant linear matched or mismatched relationship between the multi-color perceptual attributes and the piano-harmonic interval consonance index (I) are shown below.


Figure 5. The scatterplots between audio-visual perceptual attributes. Among them, the X-axis represents the interval consonance index (I) and the Y-axis represents a certain perceptual attribute of matched (or mismatched) multi-color material corresponding to the interval material.

As can be seen from Figure 5, there is a significant correlation between some multi-color perceptual attributes and the interval consonance attribute, and the direction of the relationship is opposite for the matched and mismatched cases. For example, the more consonant a piano-harmonic interval is, the cooler, softer, more transparent, farther, weaker, less arousing, and less dominant the matched multi-color perception is; on the contrary, the corresponding mismatched multi-color perception is harder, more turbid, nearer, stronger, and less pleasant.

Correlation coefficient analysis

To further analyze the relationship between multi-color perceptual attributes and the interval consonance attribute, the Pearson’s correlation coefficient (r) was adopted to analyze the correlation between the interval consonance attribute and its corresponding matched (or mismatched) multi-color perceptual attributes. If |r|≥0.8, there is a strong correlation; if 0.5≤|r| < 0.8, there is a medium correlation; if 0.3≤|r| < 0.5, there is a weak correlation; if |r| < 0.3, there is no correlation. The calculation results of the Pearson’s correlation coefficient are shown in Table 5.
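The following Python sketch shows how such a coefficient and its strength label can be computed with SciPy; the input series are toy placeholders rather than the experimental data.

```python
import numpy as np
from scipy import stats

def correlation_strength(x, y):
    """Pearson's r with the strength labels used in this section
    (|r| >= 0.8 strong, 0.5-0.8 medium, 0.3-0.5 weak, otherwise none)."""
    r, p = stats.pearsonr(x, y)
    a = abs(r)
    label = ("strong" if a >= 0.8 else
             "medium" if a >= 0.5 else
             "weak" if a >= 0.3 else "none")
    return r, p, label

# toy data: consonance indexes vs. one matched multi-color attribute
x = np.array([44.0, 30.1, 25.2, 18.7, 12.3])
y = np.array([2.1, 2.6, 2.9, 3.3, 3.8])
print(correlation_strength(x, y))
```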


Table 5. The Pearson’s correlation coefficient matrix between multi-color perceptual attributes and the interval consonance attribute.

As can be seen from Table 5, some multi-color perceptual attributes are significantly correlated with the interval consonance attribute. First of all, under the matched relationship with the piano-harmonic interval consonance attribute, the “far/near” multi-color perceptual attribute shows a strong correlation, with the highest Pearson’s correlation coefficient (r = −0.829, p = 0.001). The multi-color perceptual attributes “cool/warm,” “soft/hard,” “transparent/turbid,” “weak/strong,” arousal, and dominance show a medium correlation with the piano-harmonic interval consonance attribute, with Pearson’s correlation coefficients of r = −0.666, r = −0.701, r = −0.753, r = −0.775, r = −0.648, and r = −0.793 respectively (all p = 0.001), and the pleasure attribute shows a weak correlation (r = 0.411, p = 0.001). Under the matched relationship with the piano-melodic interval consonance attribute, “soft/hard” and “weak/strong” show a medium correlation (r = −0.626 and r = −0.674, p = 0.001), while “transparent/turbid” and “far/near” show a weak correlation (r = −0.457 and r = −0.385, p = 0.001). By contrast, there is no correlation between the multi-color perceptual attributes and the violin-harmonic interval consonance attribute under the matched relationship. For the violin-melodic interval consonance attribute, “transparent/turbid” shows a medium correlation (r = −0.552, p = 0.001), and “cool/warm” and “soft/hard” show a weak correlation (r = −0.342 and r = −0.34 respectively, p = 0.001).

Under the mismatched relationship, the multi-color perceptual attributes “soft/hard,” “transparent/turbid,” “far/near,” “weak/strong,” and pleasure show a medium correlation with the piano-harmonic interval consonance attribute, with Pearson’s correlation coefficients of r = 0.671, r = 0.658, r = 0.655, r = 0.67, and r = −0.568 respectively (all p = 0.001), and dominance shows a weak correlation (r = 0.436, p = 0.001). For the piano-melodic intervals, “far/near” has a medium correlation with the interval consonance attribute (r = 0.641, p = 0.001), while “cool/warm,” “soft/hard,” “transparent/turbid,” arousal, and dominance show weak correlations (r = 0.449, r = 0.395, r = 0.308, r = 0.317, and r = 0.497 respectively, all p = 0.001). Similar to the matched relationship, there is no correlation between the multi-color perceptual attributes and the violin-harmonic interval consonance attribute. Finally, for the violin-melodic intervals, “soft/hard” and “transparent/turbid” show weak correlations with the interval consonance attribute (r = 0.477 and r = 0.428, p = 0.001).

To sum up, the unsustainable sound signals (piano intervals) are more strongly correlated with multi-color perception than the sustainable sound signals (violin intervals). Among them, the piano-harmonic interval consonance attribute is most significantly correlated with the multi-color perceptual attributes. In addition, for the piano intervals, the harmonic interval consonance attribute is more strongly correlated with multi-color perception than the melodic interval consonance attribute, whereas the opposite holds for the violin. On the other hand, the association between the audio-visual perceptual attributes is more significant under the matched relationship than under the mismatched relationship. Specifically, among the multi-color perceptual attributes, “weak/strong” is most strongly correlated with the piano interval consonance attribute under the matched relationship, and “soft/hard” is most strongly correlated with the violin interval consonance attribute under the matched relationship.

Audio-visual cross-modal matching model construction

Based on the research results in sections “Perceptual attribute dataset construction” and “Audio-visual cross-modal matching subjective evaluation experiment,” machine learning algorithms were adopted to construct the cross-modal matching models between the multi-color perceptual attributes and the interval consonance attribute. In this section, linear and non-linear machine learning algorithms were adopted respectively, so as to further analyze the relationship between the two modalities. In addition, taking the actual application scenarios into consideration, this article only constructed the audio-visual cross-modal matching models under the matched relationship.

Linear model construction

The Multiple Linear Regression (MLR) algorithm (Hidalgo and Goodman, 2013) has the advantage of strong interpretability. Therefore, this section first used the MLR algorithm to construct the audio-visual matching models, namely the visual perception prediction model (input: the interval consonance attribute; output: the multi-color perceptual attributes) and the audio perception prediction model (input: the multi-color perceptual attributes; output: the interval consonance attribute). The principle of the MLR algorithm is as follows.

Step 1: Assume the linear relationship between independent variables X1,X2,…,Xp and the dependent variable y as shown in Equation (11):

y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon \quad (11)

where β0, β1, …, βp represent the regression parameters, and ε represents the random error, which obeys the distribution ε ~ N(0, σ2).

Step 2: Use the Ordinary Least Square (OLS) method to estimate the regression parameters and obtain the values of β in order to minimize the objective function Q(β), as shown in Equation (12):

Q(\beta) = \min\sum_{i=1}^{n}\|y_i - x_i\beta\|^2 \quad (12)

Step 3: Use K-fold cross-validation to optimize the regression parameters and reduce the random error ε; this is suitable for a small dataset, since the data is partitioned only once and requires relatively little computation. We selected 10-fold cross-validation to evaluate the accuracy of the model: the original data was divided into 10 equal parts, 9 of which were used as the training set and the remaining 1 as the testing set; then another part was used as the testing set, and the above steps were repeated until each part had served as the testing set once. Finally, the average performance was calculated, the model was retrained on the whole dataset, and this final model is the one actually used. A minimal sketch of this procedure is given below.
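The sketch below uses scikit-learn as a stand-in for the procedure described above (OLS fit plus 10-fold cross-validation); the data arrays are random placeholders for the 52 interval materials and 8 perceptual attributes, not the study’s data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# placeholder data: 52 interval materials x 8 multi-color perceptual attributes
rng = np.random.default_rng(0)
X = rng.uniform(1, 5, size=(52, 8))    # attribute vectors on a 5-point scale
y = rng.uniform(0, 60, size=52)        # interval consonance indexes

model = LinearRegression()

# 10-fold cross-validation to estimate prediction error
mae_scores = -cross_val_score(model, X, y, cv=10, scoring="neg_mean_absolute_error")
print("mean MAE over 10 folds:", mae_scores.mean())

# final model refit on the whole dataset, as described above
model.fit(X, y)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
```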

Audio perception prediction model

This section mainly introduces the construction of the audio perception prediction model (input: the multi-color perceptual attributes; output: the interval consonance attribute) using the linear algorithm.

In this article, the Pearson’s correlation coefficient (r), the Mean Absolute Error (MAE), and the Root Mean Squared Error (RMSE) were used to evaluate the prediction accuracy of each regression model. r represents the statistical correlation between the true value and the predicted value, ranging from –1 to 1; the closer r is to 1, the better the prediction ability. To make the evaluation indexes more comparable, r was normalized to the interval [0,1]. The calculation method of MAE is shown in Equation (13); the closer the MAE is to 0, the more accurate the model. The calculation method of RMSE is shown in Equation (14); the closer the RMSE is to 0, the more accurate the model.

MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - f(x_i)| \quad (13)
RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(x_i)\right)^2} \quad (14)

where yi represents the true value, f(xi) represents the predicted value, and n represents the number of training samples.
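A small Python sketch of these three evaluation indexes is given below; the toy arrays are illustrative only.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Pearson's r, MAE (Eq. 13) and RMSE (Eq. 14) for one regression model."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    r = np.corrcoef(y_true, y_pred)[0, 1]
    mae = np.mean(np.abs(y_true - y_pred))
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return r, mae, rmse

# toy example
print(evaluate([10.0, 20.0, 30.0], [12.0, 18.0, 33.0]))
```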

The prediction results of the interval consonance attribute by the multi-color perceptual attributes are shown in Table 6.


Table 6. The linear modeling results of the interval consonance index predicted by the multi-color perceptual attributes.

It can be seen that the prediction performance for the piano-harmonic interval consonance index is good, with evaluation indexes of r = 0.745, MAE = 8.486, and RMSE = 10.541. The prediction performance for the piano-melodic interval consonance attribute is not as good, with evaluation indexes of r = 0.546, MAE = 9.089, and RMSE = 12.718. The others perform poorly, especially for the violin-melodic intervals. The prediction models of the piano interval consonance attributes are shown in Equations (15) and (16).

CI_{p\_hi} = -25.944 \times CW + 13.918 \times SH - 25.791 \times TT - 43.541 \times FN - 12.917 \times WS - 15.421 \times P + 53.862 \times A - 27.861 \times D + 298.87 \quad (15)
CI_{p\_mi} = -68.951 \times CW - 78.879 \times SH + 4.823 \times TT + 118.165 \times FN - 29.515 \times D + 168.613 \quad (16)

where CIp_hi represents the piano-harmonic interval consonance index and CIp_mi represents the piano-melodic interval consonance index. A small sketch of applying Equation (15) is given below.
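The following Python function simply evaluates Equation (15); the attribute values passed in the example are hypothetical 5-point-scale scores, not measured data.

```python
def predict_piano_harmonic_ci(cw, sh, tt, fn, ws, p, a, d):
    """Equation (15): piano-harmonic interval consonance index predicted from
    the eight multi-color perceptual attributes (5-point scale)."""
    return (-25.944 * cw + 13.918 * sh - 25.791 * tt - 43.541 * fn
            - 12.917 * ws - 15.421 * p + 53.862 * a - 27.861 * d + 298.87)

# hypothetical attribute values (cool/warm, soft/hard, ..., dominance)
print(round(predict_piano_harmonic_ci(2.5, 3.0, 2.8, 3.1, 3.2, 3.4, 2.9, 2.7), 2))
```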

Visual perception prediction model

This section mainly introduces the construction of the visual perception prediction model (input: the interval consonance attribute; output: one multi-color perceptual attribute) using the linear algorithm.

Figure 6 shows the prediction results of each multi-color perceptual attribute by the interval consonance index. Among them, Figure 6A shows the prediction results with the piano-harmonic interval consonance index as the input, Figure 6B shows the prediction results with the piano-melodic interval consonance index as the input, Figure 6C shows the prediction results with the violin-harmonic interval consonance index as the input, and Figure 6D shows the prediction results with the violin-melodic interval consonance index as the input.


Figure 6. The prediction results of each multi-color perceptual attribute by the interval consonance index. (A) The prediction results by piano-harmonic intervals. (B) The prediction results by piano-melodic intervals. (C) The prediction results by violin-harmonic intervals. (D) The prediction results by violin-melodic intervals.

It can be seen that when the input was the consonance index of the piano-harmonic interval, the prediction performance for the multi-color perceptual attributes “cool/warm,” “transparent/turbid,” “far/near,” “weak/strong,” arousal, and dominance is good. When the input was the consonance index of the piano-melodic interval, both the “soft/hard” and “weak/strong” multi-color perceptual attributes are well predicted. When the input was the consonance index of the intervals played by violin, the prediction results are poor. The satisfactory models predicted by the consonance index of the piano intervals are shown in Equations (17)–(23).

CW = -0.008 \times CI_{p\_hi} + 3.175 \quad (17)
TT = -0.023 \times CI_{p\_hi} + 3.506 \quad (18)
FN = -0.008 \times CI_{p\_hi} + 3.364 \quad (19)
WS = -0.013 \times CI_{p\_hi} + 3.380 \quad (20)
A = -0.007 \times CI_{p\_hi} + 2.958 \quad (21)
D = -0.009 \times CI_{p\_hi} + 2.264 \quad (22)
SH = -0.012 \times CI_{p\_mi} + 3.272 \quad (23)

where CIp_hi represents the consonance index of the piano-harmonic interval; CIp_mi represents the consonance index of the piano-melodic interval.

Among the audio-visual cross-modal matching models between the multi-color perceptual attributes and the interval consonance attribute constructed by the MLR algorithm, the prediction results of some attributes are relatively good, especially the prediction models involving the piano-harmonic interval consonance index, which outperform those of the other three types of intervals. In addition, the accuracy of the interval consonance index predicted from the multi-color perceptual attributes is better than that of the multi-color perceptual attributes predicted from the interval consonance index; that is, the prediction accuracy of the audio perception models is better than that of the visual perception models, indicating that the one-way synesthesia channel from multiple colors to combined tones is easier to model. Therefore, more audio perceptual attributes need to be extracted as input besides the interval consonance attribute.

Non-linear model construction

In section “Linear model construction,” some audio-visual cross-modal matching models constructed by the MLR algorithm have low accuracy, and the possible reason is that there is not a simple linear relationship between some audio-visual perceptual attributes. Therefore, this article adopted the classical non-linear machine learning algorithms, namely the Support Vector Regression (SVR) algorithm (Drucker et al., 1996), the Random Forest (RF) algorithm (Breiman, 2001), and the Back Propagation (BP) neural network algorithm (Goodfellow et al., 2016) to further explore the non-linear audio–visual relationship.

Audio perception prediction model

This section mainly introduces the construction of the audio perception prediction model (input: the multi-color perceptual attributes; output: the interval consonance attribute) using the non-linear algorithms.

The Weka software (version 3.8.3; The University of Waikato, Hamilton, New Zealand), which provides machine learning and data mining functions, was adopted to construct the models. For the SVR algorithm, the RBF kernel K(x,y) = exp(−γ‖x − y‖2) was selected, with an initial value of γ = 0.01. For the RF algorithm, the training samples of each decision tree were randomly drawn by the Bootstrap algorithm in the bagging strategy. For the BP neural network algorithm, there were three layers in total with one hidden layer, and the hidden layer contained four nodes. Hyper-parameter optimization was carried out in two steps: first, random search was used to match and select the hyper-parameters; then, based on the optimal result of the random search, several values in the adjacent range were selected, and the final value of each hyper-parameter was determined by grid search. A rough sketch of comparable models is given below.
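The authors used Weka; as a rough scikit-learn analogue of the three non-linear models (RBF-kernel SVR, a random forest, and a BP-style network with one hidden layer of four nodes), the sketch below runs 10-fold cross-validation on placeholder data. Hyper-parameter values other than those stated above are assumptions.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score

# placeholder data standing in for the 52 interval materials
rng = np.random.default_rng(0)
X = rng.uniform(1, 5, size=(52, 8))   # eight multi-color perceptual attributes
y = rng.uniform(0, 60, size=52)       # interval consonance index

models = {
    "SVR (RBF, gamma=0.01)": SVR(kernel="rbf", gamma=0.01),
    "Random Forest (bagging/bootstrap)": RandomForestRegressor(n_estimators=100, random_state=0),
    "BP network (1 hidden layer, 4 nodes)": MLPRegressor(hidden_layer_sizes=(4,),
                                                          max_iter=5000, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="r2")
    print(name, "mean R^2:", scores.mean())
```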

The Pearson’s correlation coefficient (r) was used to evaluate the prediction accuracy of these three machine learning algorithms. In addition, the r value of each MLR model is listed alongside, so as to further study whether the relationship between the audio-visual perceptual attributes is linear or non-linear. Table 7 shows the comparison of the modeling results for the interval consonance index predicted by the multi-color perceptual attributes.


Table 7. The comparison on the modeling results of the interval consonance index predicted by the multi-color perceptual attributes.

It can be seen that, except for the BP neural network algorithm, the other three algorithms are all feasible for predicting the piano-harmonic interval consonance index, indicating that the multi-color perceptual attributes quantified in this article are well correlated with the piano-harmonic interval consonance attribute. The SVR algorithm performs best at predicting the piano-harmonic interval consonance index, indicating a non-linear relationship, whereas the MLR algorithm performs best at predicting the piano-melodic interval consonance index, indicating a linear relationship. However, all four algorithms perform poorly at predicting the consonance index of the violin intervals, indicating that there is no correlation between the consonance attribute of the violin intervals and the multi-color perceptual attributes.

Visual perception prediction model

This section mainly introduces the construction of the visual perception prediction model (input: the interval consonance attribute; output: one multi-color perceptual attribute) using the non-linear algorithms.

Figure 7 shows the comparison on the modeling results of four algorithms for predicting the multi-color perceptual attributes by the interval consonance index. Among them, Figure 7A shows the prediction results with the piano-harmonic intervals as the input, Figure 7B shows the prediction results with the piano-melodic intervals as the input, Figure 7C shows the prediction results with the violin-harmonic intervals as the input, and Figure 7D shows the prediction results with the violin-melodic intervals as the input.


Figure 7. The comparison of the modeling results of the four algorithms for each multi-color perceptual attribute predicted by the interval consonance index. (A) The comparison of the four algorithms with piano-harmonic intervals as the input. (B) The comparison of the four algorithms with piano-melodic intervals as the input. (C) The comparison of the four algorithms with violin-harmonic intervals as the input. (D) The comparison of the four algorithms with violin-melodic intervals as the input.

The conclusions are as follows. (1) When the input was the consonance index of the piano-harmonic interval, the MLR prediction results (r = 0.5) for the multi-color perceptual attributes “cool/warm,” “transparent/turbid,” and “weak/strong” are better, and the RF algorithm performs well at predicting “far/near,” arousal, and dominance, indicating that these three multi-color perceptual attributes have a non-linear relationship with the interval consonance attribute. (2) When the input was the consonance index of the piano-melodic interval, the RF prediction results (r = 0.5) for the multi-color perceptual attributes “cool/warm,” “soft/hard,” and “weak/strong” are better, indicating a non-linear relationship. (3) However, similar to the MLR algorithm, the other three non-linear machine learning algorithms did not perform well in predicting the multi-color perceptual attributes from the consonance index of the violin intervals, indicating that there is little correlation between the violin-interval consonance attribute and the multi-color perceptual attributes.

Through the comparison of the modeling results of the four algorithms, the following conclusions are obtained. (1) It is further confirmed that there is a strong correlation between some multi-color perceptual attributes and the interval consonance attribute, especially the piano-harmonic interval consonance attribute. (2) It is feasible to use machine learning algorithms to construct the prediction model of the piano-harmonic interval consonance attribute from the multi-color perceptual attributes, which shows good prediction performance. (3) It is feasible to construct linear prediction models of some multi-color perceptual attributes (namely “cool/warm,” “soft/hard,” “transparent/turbid,” “far/near,” “weak/strong,” arousal, and dominance) from the consonance index of the piano-harmonic interval. (4) It is feasible to construct non-linear prediction models of some multi-color perceptual attributes (namely “cool/warm,” “soft/hard,” and “weak/strong”) from the consonance index of the piano-melodic interval. (5) However, the other prediction models do not perform well, especially when the input is a sustainable sound signal (e.g., violin).

Conclusion

This article focused on the association between the perceptual attributes of multiple colors and combined tones, analyzed the correlation between the audio-visual perceptual attributes, and finally constructed the audio-visual cross-modal matching models. The main contributions are as follows. (1) Construct the multi-color material set and the interval material set, and quantify the multi-color perceptual attributes and the interval consonance attribute. (2) Design and implement the audio-visual cross-modal matching subjective evaluation experiment. (3) Analyze the correlation between the perceptual attributes of multiple colors and intervals, and show that there is a correlation between the audio-visual perceptual attributes. (4) Construct the audio-visual cross-modal matching models between the audio-visual perceptual attributes, and further study whether the relationship is linear or non-linear.

The research results of this article have basically met expectations, but there is still room for improvement and further research. Specifically, future work can be carried out in the following directions. (1) Extract the low-level objective physical parameters that are suitable for perceptual description, so as to construct the matching model without human participation. (2) Further study the basic composition principles of music and paintings and construct the corresponding audio-visual cross-modal matching models, so as to provide more data support for practical applications.

Data availability statement

The original contributions presented in this study are included in the article/Supplementary material, further inquiries can be directed to the corresponding author.

Ethics statement

Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. The patients/participants provided their written informed consent to participate in this study.

Author contributions

SW and JL designed the whole research and the audio-visual cross-modal matching subjective evaluation experiment. JL and JJ constructed the material set. SW quantified the multi-color perceptual attributes. JL quantified the interval consonance attribute. XL and QH implemented the subjective matching experiment, carried out the correlation analysis, and constructed the audio-visual cross-modal matching model. XL and SW drafted the manuscript. JZ revised the manuscript and supervised the whole process of the research. All authors edited and approved the manuscript.

Funding

Funding was provided by the Open Project of Key Laboratory of Audio and Video Restoration and Evaluation, Ministry of Culture and Tourism (2021KFKT006) and the National Key R&D Program of China (2021YFF0307603).

Conflict of interest

Author JJ was employed by China Digital Culture Group Co., Ltd.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2022.970219/full#supplementary-material

References

Arturo, T., and James, P. A. (2016). Symmetric Approach Elucidates Multisensory Information Integration. Information 8, 4–18. doi: 10.3390/info8010004

Breiman, L. (2001). Random Forests. Mach. Learn. 45, 5–32. doi: 10.1023/A:1010933404324

Bushara, K. O., Grafman, J., and Hallett, M. (2001). Neural Correlates of Auditory–Visual Stimulus Onset Asynchrony Detection. J. Neurosci. 21:300. doi: 10.1523/JNEUROSCI.21-01-00300.2001

Chen, Q. X., and Lu, Z. H. (1994). Calculation of Music Consonance Index. J. Central Conserv. Music 4, 59–63.

Drucker, H., Burges, C. J. C., Kaufman, L., Smola, A., and Vapnik, V. (1996). “Support vector regression machines,” in Proceedings of the 9th International Conference on Neural Information Processing Systems (NIPS’96) (Cambridge, MA: MIT Press), 155–161.

Eimer, M., and Schröger, E. (2003). ERP effects of intermodal attention and cross-modal links in spatial attention. Cogn. Brain Res. 35, 313–327. doi: 10.1017/S004857729897086X

Gao, X., and Xin, J. H. (2006). Investigation of human’s emotional responses on colors. Color Res. Appl. 31, 411–417. doi: 10.1002/col.20246

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Back-Propagation and Other Differentiation Algorithms. Deep Learning. Cambridge, MA: MIT Press.

Griscom, W. S., and Palmer, S. E. (2012). The Color of Musical Sounds: Color Associates of Harmony and Timbre in Non-Synesthetes. J. Vis. 12, 74–74. doi: 10.1167/12.9.74

Grossenbacher, P. G., and Lovelace, C. T. (2001). Mechanisms of synesthesia: Cognitive and physiological constraints. Trends Cogn. Sci. 5, 36–41. doi: 10.1016/S1364-6613(00)01571-0

Hidalgo, B., and Goodman, M. (2013). Multivariate or multivariable regression? Am. J. Public Health 103, 9–40. doi: 10.2105/AJPH.2012.300897

Jia, L. X., Liu, X. F., and Shi, C. (2018). Foundation of psychology. Nanjing: Nanjing University Press.

Jiang, W., Liu, J., Zhang, X., Wang, S., and Jiang, Y. (2020). Analysis and modeling of timbre perception features in musical sounds. Appl. Sci. 10:789. doi: 10.3390/app10030789

Jiang, W., Wang, S., Jiang, Y. J., and Liu, J. Y. (2019). A review of the audio-visual fusion effect and its information fusion processing methods. J. Commun. Univ. China 26:7.

Kato, Y. (2010). Effective Factors for the Impression of Three-Color Design. J. Home Econ. Japan 46, 249–259.

Kim, Y., Lee, H., and Provost, E. M. (2013). “Deep learning for robust feature generation in audiovisual emotion recognition,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (Vancouver, BC: IEEE). doi: 10.1109/ICASSP.2013.6638346

Kobayashi, S. (2006). Color Image Scale. Beijing: People’s Fine Arts Publishing House.

Kobayashi, S. (2010). The aim and method of the color image scale. Color Res. Appl. 6, 93–107. doi: 10.1002/col.5080060210

Li, Q. Q. (2021). Research and application development of audiovisual synesthesia based on emotional semantic matching. Guiyang: Guizhou University.

Liu, J., Zhao, A., Wang, S., Li, Y., and Ren, H. (2021). Research on the Correlation between the Timbre Attributes of Musical Sound and Visual Color. IEEE Access 99, 1–1. doi: 10.1109/ACCESS.2021.3095197

Lu, H., and Meng, Z. H. (2016). Perception and harmonic analysis of musical auditory-image. Tech. Acoust. 35, 349–354.

Marey, H. M., Semary, N. A., and Mandour, S. S. (2015). Ishihara electronic color blindness test: An evaluation study. Ophthalmol. Res. 3, 67–75. doi: 10.9734/OR/2015/13618

McGurk, H., and MacDonald, J. (1976). Hearing lips and seeing voices. Nature 264, 746–748. doi: 10.1038/264746a0

Palmer, S. E., Schloss, K. B., Xu, Z., and Prado-Leon, L. R. (2013). Music-color associations are mediated by emotion. Proc. Natl. Acad. Sci. U. S. A. 110, 8836–8841. doi: 10.1073/pnas.1212562110

Rouw, R., and Scholte, H. S. (2007). Increased structural connectivity in grapheme-color synesthesia. Nat. Neurosci. 10, 792–797. doi: 10.1038/nn1906

Russell, J. A., and Mehrabian, A. (1977). Evidence for a three-factor theory of emotions. J. Res. Pers. 11, 273–294. doi: 10.1016/0092-6566(77)90037-X

Wang, S., Jiang, W., Su, Y. N., and Liu, J. Y. (2020). Human perceptual responses to multiple colors: A study of multicolor perceptual features modeling. Color Res. Appl. 45, 728–742. doi: 10.1002/col.22512

Wang, W., Yu, Y., and Jiang, S. (2006). “Image retrieval by emotional semantics: A study of emotional space and feature extraction,” in 2006 IEEE International Conference on Systems, Man and Cybernetics, (Taipei: IEEE) 3534–3539.

Wang, X., and Meng, Z. H. (2013). The consonance evaluation method of Chinese plucking instruments. Acta Acust. 38, 486–492.

Ward, J. (2013). Synesthesia. Annu. Rev. Psychol. 64:49. doi: 10.1146/annurev-psych-113011-143840

Xue, Y. X. (2012). The relationship between interval consonance and auditory tensity. Perform. Arts Sci. Technol. 9, 16–19.

Zhou, H. H. (2004). A Psychological and Aesthetic Research on the Relationship between Music and the World in which Music and Its Expression Objects. Beijing: Central Conservatory of Music Press.

Keywords: multiple colors, combined tones, multi-color perception, interval consonance, audio-visual cross-modal matching model, subjective evaluation experiment, correlation analysis, machine learning

Citation: Wang S, Liu J, Lan X, Hu Q, Jiang J and Zhang J (2022) Cross-modal association analysis and matching model construction of perceptual attributes of multiple colors and combined tones. Front. Psychol. 13:970219. doi: 10.3389/fpsyg.2022.970219

Received: 15 June 2022; Accepted: 15 November 2022;
Published: 06 December 2022.

Edited by:

Qiang Liu, Wuhan University, China

Reviewed by:

Gonzalo Daniel Sad, CONICET French-Argentine International Center for Information and Systems Sciences (CIFASIS), Argentina
Zheng Zhang, Harbin Institute of Technology, China
Ryohei Nakayama, The University of Tokyo, Japan

Copyright © 2022 Wang, Liu, Lan, Hu, Jiang and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Jingjing Zhang, zjj_cuc@cuc.edu.cn
