Is a High Tone Pointy? Speakers of Different Languages Match Mandarin Chinese Tones to Visual Shapes Differently

Shang, Nan; Styles, Suzy J.

doi:10.3389/fpsyg.2017.02139

ORIGINAL RESEARCH article

Front. Psychol., 07 December 2017

Sec. Perception Science

Volume 8 - 2017 | https://doi.org/10.3389/fpsyg.2017.02139

Nan Shang

Suzy J. Styles^*

Psychology, School of Social Sciences, Nanyang Technological University, Singapore, Singapore

Studies investigating cross-modal correspondences between auditory pitch and visual shapes have shown children and adults consistently match high pitch to pointy shapes and low pitch to curvy shapes, yet no studies have investigated linguistic-uses of pitch. In the present study, we used a bouba/kiki style task to investigate the sound/shape mappings for Tones of Mandarin Chinese, for three groups of participants with different language backgrounds. We recorded the vowels [i] and [u] articulated in each of the four tones of Mandarin Chinese. In Study 1 a single auditory stimulus was presented with two images (one curvy, one spiky). In Study 2 a single image was presented with two auditory stimuli differing only in tone. Participants were asked to select the best match in an online ‘Quiz.’ Across both studies, we replicated the previously observed ‘u-curvy, i-pointy’ sound/shape cross-modal correspondence in all groups. However, Tones were mapped differently by people with different language backgrounds: speakers of Mandarin Chinese classified as Chinese-dominant systematically matched Tone 1 (high, steady) to the curvy shape and Tone 4 (falling) to the pointy shape, while English speakers with no knowledge of Chinese preferred to match Tone 1 (high, steady) to the pointy shape and Tone 3 (low, dipping) to the curvy shape. These effects were observed most clearly in Study 2 where tone-pairs were contrasted explicitly. These findings are in line with the dominant patterns of linguistic pitch perception for speakers of these languages (pitch-change, and pitch height, respectively). Chinese English balanced bilinguals showed a bivalent pattern, swapping between the Chinese pitch-change pattern and the English pitch-height pattern depending on the task. 
These findings show that the supposedly universal pattern of mapping linguistic sounds to shape is modulated by the sensory properties of a speaker’s language system, and that people with high functioning in more than one language can dynamically shift between patterns.

Introduction

For almost 90 years, it has been recognized that people from a variety of backgrounds tend to make the same choices about which nonsense words ‘should’ have which meanings, a trait that seems to be more-or-less universal. For example, English speakers showed high levels of agreement in judging a word form like mal to be a better match for a larger object than mil (Sapir, 1929). Similarly, most people preferred to match curvy line drawings with the nonsense word baluba (maluma in the 1947 version) and angular line drawings with takete (Köhler, 1929/1947/1970). The same sound-shape mapping pattern has also been documented by Ramachandran and Hubbard (2001), who found that the majority of participants (95–98%) matched a curvy shape with bouba and an angular shape with kiki; this effect has since become known as the bouba-kiki effect. The bouba-kiki paradigm has been replicated cross-linguistically and cross-culturally, for example, with Swahili-speaking school children living on an isolated peninsula in Africa (Davis, 1961), Czech-speaking adults (Tarte, 1974), Tamil speakers in India (Ramachandran and Hubbard, 2001) and Otjiherero-speaking Himba living in Northern Namibia (Bremner et al., 2013). The bouba-kiki effect has also been found in pre-reading toddlers (Maurer et al., 2006), in pre-vocabulary-spurt 11-month-olds (Imai et al., 2008; Kantartzis et al., 2011; Imai and Kita, 2014), and in pre-linguistic 4-month-olds (Ozturk et al., 2013). These experiments suggest that the effect has its origins prior to the acquisition of language, and is therefore not dependent on language learning.

Some researchers have suggested that these effects are related to the generalized sensory confusion in newborn children, described as a kind of ‘neonatal synaesthesia’ preceding clear sensory differentiation (Maurer, 1993), which may give rise to a kind of ‘weak synesthesia’ in adulthood, from latent sensory ‘cross-wiring’ (Ramachandran and Hubbard, 2001) that remains after developmental changes in connectivity and function (Maurer and Mondloch, 2005). Others view linguistic sound symbolism as an offshoot of generalized cross-modal processing, acquired through experience with the structural regularities of sensory information, as derived from the physical environment (Spence, 2011; Spence and Deroy, 2013). To give an example of environmental regularities, small resonating bodies produce high-pitched sounds, whereas larger bodies are capable of lower-pitched sounds (e.g., trumpet/tuba; mosquito/elephant). These perspectives converge on the idea that the cross-modal perception underlying the bouba/kiki effect is universal (although notable exceptions include Rogers and Ross, 1975 and Styles and Gawne, 2017).

Outside the domain of language, crossmodal congruences between auditory pitch and other sensory modalities have been widely investigated (see Spence and Deroy, 2013 for a review). Early empirical evidence of pitch-related cross-modal correspondences comes from Marks’ laboratory experiments. For example, when given a set of colors varying in lightness and a set of notes varying in pitch, participants consistently paired the higher pitch with the lighter color (Marks, 1978). He also found cross-modal correspondences between pitch and direction as well as pitch and sharpness (Marks, 1974, 1987). In one experiment, Marks (1987) asked participants to match two auditory stimuli (one 220-Hz saw-tooth wave and one 360-Hz saw-tooth wave) to two visual forms (one U-shaped and one V-shaped) and found that the high pitched sound was matched to the angular shape and the low pitched sound was matched to the round shape. O’Boyle and Tarte (1980) also found that when asked to adjust pure tone frequencies to best fit visual stimuli, participants more often assigned lower frequencies to round figures than to angular figures. Auditory pitch is also matched with visual size: Gallace and Spence (2006) asked participants to judge relative visual sizes of objects accompanied by task-irrelevant sounds, and found faster responses to congruent trials (e.g., a big disk with a low pitch sound) than to incongruent ones (e.g., a big disk with a high pitch sound). Pitch-related cross-modal correspondences have also been confirmed in young children (e.g., pitch and size; pitch and brightness in Mondloch and Maurer, 2004) and infants (e.g., pitch and visuospatial height, and pitch and visual sharpness in Walker et al., 2010; pitch and size in Fernández-Prieto et al., 2015). Ludwig et al. (2011) even observed the high-high mapping pattern between pitch and luminance in chimpanzees, as has been observed in humans, suggesting such mappings were present in our common ancestors.
Taken together, previous studies have shown cross-modal correspondences between pitch and vision, including visual angularity. All of these experiments have been conducted using pitch stimuli with no linguistic content. Linguistic uses of pitch, however, have not been systematically investigated in the cross-modal correspondence literature.

Lexical tones refer to syllable-level variations in the temporal contour of the fundamental frequency (F0, or ‘pitch’) and serve to draw contrasts between word meanings (Gandour, 1978; Jongman et al., 2006; Singh and Fu, 2016). Lexical tones exist in about 70% of the world’s languages (Yip, 2002) and over half the world’s population speak a tone language (Fromkin, 1978). However, cross-modal correspondences for speakers of tone languages have rarely been investigated. Mandarin Chinese is a tone language with four lexical tones. In terms of pitch, Tone 1 is high and steady, Tone 2 mid-rising, Tone 3 falling-rising and Tone 4 high-falling (Chao, 1948), as can be seen in Figure 1. Tone information is vital for meaning discrimination in Mandarin Chinese. For example, when a syllable like ‘yan’ is produced in the four tones, each word has a distinct meaning: yan1 (‘smoke’), yan2 (‘salt’), yan3 (‘eye’), and yan4 (‘colorful’).

FIGURE 1. Example pitch contours of the four Mandarin tones, as measured in PRAAT (Boersma, 2001).

Only a few studies have investigated sound symbolism in Mandarin Chinese or with Mandarin-speaking participants. Sapir’s early work on vowel correspondences ([a]-big, [i]-small) included Chinese-speaking participants, who performed the same way as their Western peers (Sapir, 1929). Similarly, Huang et al. (1969) and Chan (1996) have reported the ‘[a]-big, [i]-small’ sound-size mapping pattern in lexical items of Chinese. These findings suggest that Chinese speakers share general sound symbolic mapping patterns, when it comes to sounds that are highly prevalent and contrastive, like the ‘corner vowels’ [i], [a], and [u], which are common to the majority of spoken languages (Styles and Gawne, 2017). A small number of additional studies have suggested that English speakers share sufficient sound symbolic mapping patterns with Chinese speakers that they can guess the correct meanings of Chinese word pairs (e.g., antonym pairs) when words are written phonetically (Brackbill and Little, 1957; Weiss, 1963; Klank et al., 1971) or spoken (Brown et al., 1955; Brown and Nuttall, 1959), at levels better than predicted by chance. However, using a slightly different method, LaPolla (1994) asked native English speakers with no knowledge of Chinese to pick the correct meaning from a pair of Chinese antonyms, in two versions of the same task: one which included the original words with tones articulated correctly, and a second version where the tones were swapped between antonyms in each pair. LaPolla’s curious finding was that the English speakers performed better when tones in the Chinese antonym pairs were swapped. In another experiment from the same paper, LaPolla (1994) found that Mandarin speakers showed chance performance for sound-size mapping when listening to Cantonese minimal pairs differing only in tones. It is important to note here that Mandarin Chinese and Cantonese have radically different tone systems.
Both of these findings undermine the supposed universality of sound symbolism when it comes to people who do/don’t speak a tone language, or who speak languages with different tone systems. To date, no satisfactory explanation has been proposed as to how or why these tone-based anomalies exist. Given the fact that so many people in the world speak tone languages and sound symbolism has not been investigated systematically using tone languages, the present study explores cross-modal correspondences between Mandarin tones and visual shapes in two highly systematic investigations of bouba/kiki-type sound-shape matching.

In previous research investigating which sounds match with which shapes, Nielsen and Rendall (2011, 2013) found that sonorant consonants (/m/, /n/, or /l/) and rounded vowels (“oo,” “oh” or “ah”) were matched to curved images while voiceless plosive consonants (/t/, /k/ or /p/) and non-rounded vowels (“ee,” “ay” or “uh”) were matched to jagged images. Similarly, D’Onofrio (2014) found that non-words containing voiced consonants (/b/, /d/ or /g/), labial consonants (/b/ or /p/) and back and/or rounded vowels (/u/ or /a/) were matched with round shapes more than their respective counterparts (/t/, /k/; /i/, /e/). Recently, Fort et al. (2015) also replicated the ‘[o], [u]-round, [i], [e]-spiky’ audiovisual mapping pattern. Taken together, these studies suggest that sonorants, voiced stops and back rounded vowels typically match with curvy shapes, whereas voiceless stops and high-front unrounded vowels typically match with pointy shapes (cf. Köhler, 1929/1947/1970; Davis, 1961; Ramachandran and Hubbard, 2001; Maurer et al., 2006; Nielsen and Rendall, 2011, 2013; Ozturk et al., 2013; D’Onofrio, 2014; Fort et al., 2015). Hence, all previous studies agree that the high front non-rounded vowel [i] (the ‘ee’ vowel in ‘feet’) and the high back rounded vowel [u] (the ‘oo’ vowel in ‘shoe’) represent a highly salient pointy-curvy contrast, because [i] and [u] represent two extremes of vowel space for the majority of documented languages (cf. Styles and Gawne, 2017). Notably, most of the stimuli used in earlier studies differ in multiple phonetic features, making it hard to tease apart the detailed source of the effects. For example, in Köhler’s earliest evidence, maluma differs from takete in vowel roundedness, vowel backness, sonority of consonants, continuity of consonants, voicing of consonants as well as place of articulation of consonants.
Most of these features are also contrasted in bouba and kiki (Ramachandran and Hubbard, 2001), as well as the nonsense word-pairs used in Maurer et al. (2006). Since the focus of the current study is the sound-symbolic congruence for tones, we elected to test the smallest acoustic element that can carry a tone – a vowel produced in isolation. This decision allows precise control of the non-tone elements of the speech, and removes possible confounds between vowels and consonants. Furthermore, tones are documented to be more easily identified when presented in isolation (i.e., monosyllables, Broselow et al., 1987). The current study therefore investigates sound symbolism for the single vowels [i] and [u] articulated in the four Mandarin tones.

The two authors had different predictions for what would happen. Given the extensive literature on cross-modal correspondences between pitch and visual shapes (O’Boyle and Tarte, 1980; Marks, 1987; Walker et al., 2010; Parise and Spence, 2012), the second author, a native English speaker who does not speak Chinese, predicted that the high-pitched Tone 1 would be matched with pointy shapes, while the low-pitched Tone 3 would be matched with smooth curvy shapes. By contrast, the first author, based on her experience as a native speaker of Mandarin Chinese, predicted a different pattern: the smooth, steady Tone 1 would be mapped with curvy shapes, while the dynamically changing Tone 4 would be mapped with visually dynamic pointy shapes. Because of our radically different expectations (Tone 1, pointy; Tone 1, curvy), we chose to compare speakers with different experience of the Mandarin tone system. Both authors expected to replicate the well documented [u]-curvy, [i]-pointy vowel-shape pattern (e.g., D’Onofrio, 2014). To our knowledge, the present study is the first to investigate lexical tones in a systematically controlled sound symbolic selection paradigm (a modified bouba/kiki task).

Study 1: Two Shapes with One Sound

Materials and Methods

Participants were invited to take part in an online quiz using social media. The quiz consisted of eight audiovisual questions, followed by demographic questions. The experimental procedure was approved by the IRB of Nanyang Technological University.

Participants

One hundred and fourteen volunteer participants (64 females), aged 18–57 years, took part in the present study, conducted using Qualtrics online survey software. Participants were over 18 years of age and completed the Information and Consent page online. We designed the study for three groups: a Chinese dominant bilingual group (C); a Chinese-English balanced bilingual group (C/E) and a group of English speakers with no Chinese (E). No fixed limits were set for group size in the online data collection. The study was closed on a predetermined day shortly after 100 participants were recorded, and before analysis was conducted. According to the predetermined grouping criteria, 14 participants did not fall into one of these groups, and were excluded. Therefore, there were 100 valid participants in the present study (C: 45; C/E: 30; E: 25).

Grouping Criteria

A single question asked participants whether they were English-and-Chinese bilinguals. Participants were also asked to rate their proficiency in each of their languages and dialects on a five-point scale where one represents highest competence and five represents lowest, and zero represents that the participant has no knowledge of that language. For further details of the language questions, see the Supplementary Materials for this article. Participants were also asked about where they live now and their residence history in different life stages. Participants were allocated to the C group (Chinese dominant) if they identified as English-and-Chinese bilinguals, their Chinese was self-reported at the highest level, their English was self-reported as lower, and they completed all schooling up to undergraduate level in Mainland China. Participants were allocated to the C/E group (Chinese–English balanced bilinguals) if they identified as English-and-Chinese bilinguals, their Chinese and English proficiency differed by no more than 1 point on the scale, and they completed all schooling up to undergraduate level in Singapore. Participants were allocated to the E group (English speakers with no tone language experience) if they had no knowledge of Chinese and their English was self-reported to be at native or near-native level. Group E reported their schooling in a variety of countries (e.g., Singapore, United Kingdom, United States, Australia, Germany, etc.).
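The allocation rules above can be read as a small decision function. The following is a minimal sketch for illustration only, not the scoring script used in the study: the field names are hypothetical, the numeric coding follows the scale as described (0 = no knowledge, 1 = highest competence, 5 = lowest), and treating an English rating of 1 or 2 as "native or near-native" is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Participant:
    # All field names are hypothetical; they mirror the questionnaire items described above.
    bilingual_en_zh: bool  # self-identifies as an English-and-Chinese bilingual
    chinese: int           # 0 = no knowledge, 1 = highest competence ... 5 = lowest
    english: int           # same coding as `chinese`
    schooling: str         # where all schooling up to undergraduate level took place


def assign_group(p: Participant) -> Optional[str]:
    """Return 'C', 'C/E', 'E', or None when no grouping criterion is met."""
    if p.chinese == 0:
        # Assumption: a rating of 1 or 2 counts as native or near-native English.
        return "E" if p.english in (1, 2) else None
    if not p.bilingual_en_zh or p.english == 0:
        return None
    # Chinese dominant: Chinese at the highest level, English rated lower (larger number).
    if p.schooling == "Mainland China" and p.chinese == 1 and p.english > 1:
        return "C"
    # Balanced bilingual: ratings differ by no more than one scale point.
    if p.schooling == "Singapore" and abs(p.chinese - p.english) <= 1:
        return "C/E"
    return None
```

Under these assumptions, a participant who fits none of the three branches returns `None`, which corresponds to the participants excluded from analysis.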

Stimuli

The visual stimuli were two ivory three-dimensional novel objects (one curvy; one pointy), photographed against a black background (see Figure 2). The hand-made objects captured salient bouba/kiki differences. The 3D forms were designed to be more visually interesting than more-familiar 2D line drawings, so that participants could maintain visual interest over multiple test trials.

FIGURE 2. Visual stimuli: Two hand-made novel three-dimensional objects capturing salient bouba/kiki contrast dimensions. For original photographs, see the Open Science Framework Repository for this Project (https://osf.io/364fm).

The auditory stimuli were the high front non-rounded vowel [i] (the ‘ee’ vowel in ‘feet’) and the high back rounded vowel [u] (the ‘oo’ vowel in ‘shoe’) articulated in the four tones of Mandarin Chinese by a female native speaker. For recording, the auditory stimuli were produced a minimum of three times each as isolated monosyllables, with the syllables arranged in a number of different orders to ensure variation across the recorded set of stimuli. The auditory stimuli were recorded in a sound-proof recording lab using a Shure SN81 microphone and Acoustica recording software, at a 44.1 kHz sampling rate with 16 bit encoding. Audio files were edited and trimmed using GoldWave software.

Auditory Stimulus Selection and Validation

To ensure that the audio tokens were sufficiently standard for use in the test, we asked seven bilingual speakers of Mandarin Chinese and English to evaluate the typicality of each recording as an exemplar of the tone produced on the vowel in question. We played each sound, and asked what vowel it was, what tone it was, and asked people to rate the typicality of each sound, on a scale from one to seven. Seven represented the most typical and one represented the least typical. People could also mark zero, if they thought it did not sound like the category at all. All seven raters agreed on the identity of the vowels and the tones.

The highest typicality token of each stimulus was selected for use in the studies reported here. All stimuli were rated as very typical, with median scores of 5 or 6. Each of the eight sound files was 500 ms long. Figure 3 shows the pitch tracks of the eight auditory stimuli as measured in PRAAT (Boersma, 2001), where it is clear that the [i] and [u] tracks within each tone are more similar to each other than are the pitch tracks between tones. That is to say, each tone was clearly differentiated from the others, and its contour was consistent across the two stimuli. As can be seen in Table 1 and Figure 3, mean pitch differs most between Tone 1 and Tone 3, and pitch variance differs most between Tone 1 and Tone 4.

FIGURE 3. Pitch tracks showing pitch (Hz) as it evolves over the 500 ms of the eight auditory stimuli: Vowels [i] (red lines) and [u] (black lines) articulated in each of the lexical tones of Mandarin Chinese (T1, T2, T3, T4). For original audio files, see the Open Science Framework Repository for this Project (https://osf.io/364fm).

TABLE 1. Pitch measurements for the eight auditory stimuli used in this study.

Experimental Platform

Following a pilot study reported elsewhere, the experiment was presented using Qualtrics online survey software. Participants were instructed to run the experiment using laptops or computers, since the platform did not guarantee stable audio on mobile devices at the time. Participants were also instructed to use headphones and to do the online quiz in a quiet environment.

Procedure

Participants were presented with a single audio file and a pair of pictures, and were asked “Which of these two shapes goes better with this sound.” In the online procedure, after obtaining consent and adjusting audio volume, participants were presented with eight experimental pages followed by the demographic and language background questions. In each experimental page, participants were presented with two 250 px × 250 px pictures side-by-side along with a button which triggered the audio file to play. In each question, participants could listen to the auditory stimulus as many times as they liked, and they were asked to decide which of the two visual stimuli was a better match for the sound. Each of the eight questions was presented on a separate page, and each page was unreturnable. The location of the two pictures (right, left) was randomized, as was the presentation sequence of test questions. The whole procedure took around 7 min. The materials and precise instructions for the task can be found in the Open Science Framework repository for this project¹.
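The randomization described above (order of the eight questions, and left/right placement of the two pictures) can be sketched as follows. This is only an illustration of the design logic: the study itself was run in Qualtrics, and the function name, dictionary keys, and sound labels here are hypothetical.

```python
import random
from itertools import product

VOWELS = ("i", "u")
TONES = ("T1", "T2", "T3", "T4")
SHAPES = ("curvy", "pointy")


def build_study1_trials(seed=None):
    """Return the eight Study 1 trials in a random order.

    Each trial pairs one sound (vowel x tone) with both shapes; which shape
    appears on the left is decided independently for every trial.
    """
    rng = random.Random(seed)
    trials = []
    for vowel, tone in product(VOWELS, TONES):
        left, right = rng.sample(SHAPES, k=2)  # random left/right placement
        trials.append({"sound": f"{vowel}_{tone}", "left": left, "right": right})
    rng.shuffle(trials)  # random presentation sequence
    return trials


if __name__ == "__main__":
    for trial in build_study1_trials(seed=2017):
        print(trial)
```

Passing a seed makes a given participant's sequence reproducible; omitting it draws a fresh order each time.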

Predictions

According to previous European-language sound-shape matching tasks (e.g., D’Onofrio, 2014), we expected all participants would show an [i]-pointy, [u]-curvy preference, a ‘vowel effect.’ If speakers with different language backgrounds differ in their perceptual mapping preferences (as did the two authors), then different groups would show different mapping patterns. In particular, if crossmodal perception of tones and shapes is guided by pitch change, since Chinese speakers are sensitive to pitch change, we expected that the Chinese speakers would show different responses to Tone 1 and Tone 4, as Tone 1 has the least pitch change while Tone 4 has the most pitch change. Hence, by paying attention to pitch change, Chinese speakers would match steady Tone 1 with the curvy shape and dynamic Tone 4 with the pointy shape. If on the other hand, pitch height is the major driver of this kind of crossmodal correspondence, as English speakers’ tone processing mainly focuses on pitch height (Gandour, 1983), the English speakers may show different responses to Tone 1 (high-pitched) and Tone 3 (low-pitched). Hence, by paying attention to pitch height, English speakers would match high Tone 1 with the pointy shape and low Tone 3 with the curvy shape.
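The two competing predictions can be made concrete by asking which pair of tones differs most in pitch height and which differs most in pitch change. The sketch below uses the schematic five-level tone numerals conventionally given for Mandarin citation tones (55, 35, 214, 51) as a stand-in; these are an idealisation, not the measured F0 values of the recorded stimuli, and the use of mean level for "height" and variance of levels for "change" is an assumption made for illustration.

```python
from itertools import combinations
from statistics import mean, pvariance

# Schematic five-level tone numerals (1 = lowest, 5 = highest).
# Idealised contours only; NOT the measured pitch values of the stimuli.
CHAO = {"T1": (5, 5), "T2": (3, 5), "T3": (2, 1, 4), "T4": (5, 1)}


def max_contrast_pair(measure):
    """Return the pair of tones that differ most on `measure` (a function of a contour)."""
    return max(
        combinations(CHAO, 2),
        key=lambda pair: abs(measure(CHAO[pair[0]]) - measure(CHAO[pair[1]])),
    )


height_pair = max_contrast_pair(mean)       # pitch height: average level
change_pair = max_contrast_pair(pvariance)  # pitch change: spread of levels
print(height_pair, change_pair)  # ('T1', 'T3') ('T1', 'T4')
```

On these schematic contours, the height-based contrast singles out Tone 1 versus Tone 3 and the change-based contrast singles out Tone 1 versus Tone 4, which are the tone pairs named in the two predictions.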

Analytical Approach

To investigate the influence of vowel ([i], [u]), tone (T1, T2, T3, T4), and language group (C, C/E, E) and the interactions among them on shape choice, we ran a fully factorial generalized linear mixed model (GLMM) test, with participant as the only random factor. Since the outcomes are dichotomous (curvy or pointy), we used Binary logistic regression that couples a binomial probability distribution with the logit link function (which is the canonical link function for the binomial distribution) (Heck et al., 2012). Between group effects and interactions were followed up with pairwise group comparisons in GLMM. Tone effects were followed up with Related-Samples McNemar tests to compare pairs of tones. The full statistical reports for each GLMM test can be found in the Supplementary Materials, along with the results of pair-wise comparisons, and Related-Samples McNemar tests, with summary data presented here in the Results section.
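For readers unfamiliar with the McNemar test used in the follow-up comparisons, the sketch below shows the logic for one pair of tones within one group: only participants whose shape choice differs between the two tones (the discordant pairs) carry information. This is a minimal exact (binomial) version written for illustration; the text does not state whether an exact or asymptotic variant was used, and the counts in the usage line are hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    b: participants who chose 'curvy' for tone A but 'pointy' for tone B
    c: participants who chose 'pointy' for tone A but 'curvy' for tone B
    Participants who made the same choice for both tones do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Probability of a split at least this uneven under a fair 50/50 null.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(mcnemar_exact(10, 0))
```

Because each participant answered every question, this paired test is appropriate where an unpaired comparison of proportions would not be.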

Results

Figure 4 shows the percentage of participants who selected the pointy or the curvy shape for each of the eight sounds, with language groups shown separately, following the graphical logic of Fort et al. (2015).

FIGURE 4. Shape choice for the three groups for Study 1, where a single vowel was presented with two visual shapes. Choices shown separately for each vowel ([i], [u]) articulated in each of the four Chinese Tones (T1, T2, T3, T4). Groups of participants were C: Chinese-dominant educated in China; C/E: Chinese–English balanced bilinguals educated in Singapore; E: English-speakers with no knowledge of Chinese. Black angled lines indicate significant within-group tone contrasts for tones produced on the same vowel.

According to the GLMM test, the main effect of vowel [F(1,776) = 140.11, p < 0.001] and the main effect of tone [F(3,776) = 4.54, p = 0.004] were significant. The log odds of choosing the curvy shape were higher for the [u] vowel (β = 4.123, p < 0.001) than for the [i] vowel, holding the other effects constant. Overall, all participants, regardless of language background, made significantly more curvy choices for [u] than for [i]. This preference for the ‘[i]-pointy, [u]-curvy’ matching pattern is in line with the existing literature.
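Because the coefficients are reported on the logit scale, it can help to convert them to odds ratios. The sketch below does this for the vowel coefficient reported above; the helper names and the reference probability in the second function are hypothetical, and in a model with interactions a coefficient describes the contrast at the reference levels of the other factors, so the conversion is indicative rather than a model-wide effect size.

```python
from math import exp


def odds_ratio(beta):
    """Convert a logit-scale coefficient (a log odds ratio) into an odds ratio."""
    return exp(beta)


def shifted_probability(p_reference, beta):
    """Probability obtained after shifting the reference log odds by `beta`."""
    odds = p_reference / (1 - p_reference) * exp(beta)
    return odds / (1 + odds)


if __name__ == "__main__":
    # beta = 4.123 is the [u] versus [i] coefficient reported in the text.
    print(round(odds_ratio(4.123), 1))
    # Hypothetical reference probability of 0.2, for illustration only.
    print(round(shifted_probability(0.2, 4.123), 3))
```

A coefficient of 4.123 corresponds to odds of a curvy choice roughly 60 times higher for [u] than for [i], which is why the vowel effect dominates the figures.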

In unpacking the tone effect, holding all other effects constant, the log odds of choosing the curvy shape were higher for Tone 1 than Tone 4 (β = 3.178, p = 0.002), Tone 2 than Tone 4 (β = 2.603, p = 0.02) and Tone 3 than Tone 4 (β = 2.234, p = 0.049). Overall, there was a T4 pointy effect with T1 the most curvy and T2 and T3 in between. The interaction between tone and language group did not achieve significance [F(3,776) = 1.867, p = 0.084]. However, since visual inspection of Figure 4 suggests somewhat different response patterns for the [i] and [u] vowels, further analyses were conducted on responses to [i] and [u] stimuli separately.

For the [i] stimuli, there was a significant main effect of tone [F(3,388) = 4.05, p = 0.007] that did not interact with language groups. The log odds of choosing the pointy shape for Tone 4 were higher than for other tones (T1: β = 1.962, p = 0.031; T2: β = 2.912, p = 0.001; T3: β = 2.476, p = 0.007), holding the other effects constant.

For the [u] stimuli, there was a significant main effect of tone [F(3,388) = 4.81, p = 0.003] and a significant interaction between tone and language group [F(6,388) = 2.82, p = 0.011]. The log odds of choosing the curvy shape for Tone 1 were higher in the Chinese-dominant group (β = 2.1, p = 0.006) and the Chinese–English balanced group (β = 2.436, p = 0.008) than in the English group, holding the other effects constant. Overall, the two Chinese groups showed a similar response pattern (i.e., T1 curvy, T4 pointy, T2 and T3 in between), whereas the English group did not (i.e., T1 and T4, in particular T1, were pointier than T2 and T3 for the E group). To unpack this interaction, we ran pair-wise comparisons between language groups.

In the comparison between the C group and the E group, the tone effect also differed significantly [F(3,272) = 3.78, p = 0.011]. Holding the other effects constant, the log odds of choosing the curvy shape for Tone 1 were higher in the Chinese-dominant group (β = 2.04, p = 0.01) than in the English group. The pairwise comparisons demonstrated that the C group made significantly more curvy choices for [u] stimuli articulated in Tone 1 than in Tone 4 (C: p = 0.003), with T2 and T3 falling between T1 and T4. The English group showed no significant results for any pairwise comparison between tones for the [u] vowel. However, visual inspection of the graph suggested that English speakers may be treating Tones 1 and 4 differently from Tones 2 and 3. These pairs are of interest as Tone 1 and Tone 4 share a high pitch onset, while Tone 2 and Tone 3 share a low pitch onset, and English-speakers’ perceptual sensitivity for high/low contrasts may be strongest at the onset of Mandarin Chinese tones, where the pitch information is typically loudest. In a follow-up exploratory analysis, the comparison between the pooled T1, T4 and the pooled T2, T3 was significant (p = 0.041, uncorrected). Hence, the English-speaking participants did not show the Chinese-tone mapping pattern, but tended to make somewhat more pointy choices for T1 and T4 than for T2 and T3.

In the comparison between the C/E group and the E group, the tone effect differed significantly [F(3,212) = 3.01, p = 0.031]. The log odds of choosing the curvy shape for Tone 1 were higher in the Chinese–English balanced group (β = 2.414, p = 0.01) than in the English group, holding the other effects constant. We also conducted pairwise comparisons (Related-Samples McNemar tests) between the tones for each group. These tests showed that, similar to the C group, the C/E group also showed the significant ‘T1-curvy, T4-pointy’ response pattern (C/E: p = 0.007), with T2 and T3 between them.

When comparing the two Chinese groups, there was a significant main effect of tone [F(3,292) = 12.28, p < 0.001], but this effect did not differ between the two Chinese groups. The log odds of choosing the curvy shape for Tone 1 were higher (β = 2.432, p < 0.001) than for Tone 4, holding the other effects constant. Hence, both Chinese-speaking groups show the same response pattern where T1 was the curviest, T4 the pointiest, and T2 and T3 between them.

Taking the pairwise comparisons together, it is clear that Chinese speakers in both groups showed a ‘T1-curvy, T4-pointy’ pattern, and both groups differed from English speakers who showed a ‘T1/T4 pointy, T2/T3 curvy’ pattern for [u].

Discussion

Consistent with the results of previous studies (Köhler, 1929/1947/1970; Ramachandran and Hubbard, 2001; D’Onofrio, 2014), the ‘u-curvy, i-pointy’ cross-modal correspondence was replicated in our online two-alternative forced-choice (2AFC) survey, across participants with different language backgrounds, further supporting the consistency of the cross-modal correspondences for these high-prevalence vowels.

In addition to the vowel effect, Chinese speakers (both C Group and C/E Group) made more curvy choices for [u] stimuli articulated in Tone 1 than in Tone 4, which is consistent with the previous literature on Chinese speakers’ sensitivity to pitch change (Jongman et al., 2006; Singh et al., 2014). This result demonstrates for the first time that this perceptual sensitivity is also evident in crossmodal perception, in two large samples of homogeneous Chinese speakers (one Chinese dominant with predominately Mandarin Chinese language experience, the other bilingually educated Singaporeans with self-reported balance in their language skills in Chinese and English). This finding also replicated the observations of the pilot study (reported elsewhere), suggesting a robust, replicable effect.

By contrast, English-speaking participants made fewer curvy choices for Tone 1 and Tone 4 than for Tone 2 and Tone 3. This pattern was only significant for the [u] vowel, when analyzed in isolation, and the interaction between vowel, tone and group was not significant, meaning that this exploratory finding should be treated with caution until further replicated (see Study 2). In unpacking the direction of this trend, it should be noted that Tone 1 and Tone 4 both have high pitch onsets compared to Tone 2 and Tone 3. In other words, instead of pitch change, the English speakers tended to pay more attention to pitch height. This finding is in line with the previous literature on Western high/low pitch perception in lexical tones (Gandour, 1983) and in non-linguistic crossmodal pitch processing (O’Boyle and Tarte, 1980; Marks, 1987; Walker et al., 2010). In Figure 5, we summarize these patterns graphically.

FIGURE 5. Schematic diagram illustrating the Tone Contours for the four tones of Mandarin Chinese; The Maximum Contrast Tone-pairs for each group of speakers (as established in the previous literature), and the Language-specific Mapping patterns observed in Study 1, for conceptual replication in Study 2.

In the present study, each stimulus carried vowel identity information (tongue height/backness and lip rounding) as well as tone identity information. As observed here, all groups of participants showed a strong ‘[i]-pointy, [u]-curvy’ matching pattern, with tone modulating responses less strongly. The tone effect was observed for [u] but was not observed for [i]. Since vowel identity is such a strong predictor of shape choice, it may have overshadowed observation of a subtler tone effect. Furthermore, since non-tone language speakers are known for their inability to hold representations of tone information in short term memory, hearing each tone in a separate trial may have ‘washed out’ perceptual effects that may be evident if the tone categories are contrasted more explicitly. For this reason, we developed a second study designed to make the perceptual differences between pitch contours more salient even for non-tone language participants, by presenting two syllables varying only in tone (a tone-minimal pair) along with a single shape.