Iconic Associations Between Vowel Acoustics and Musical Patterns, and the Musical Protolanguage Hypothesis

Fenk-Oczlon, Gertraud

doi:10.3389/fcomm.2022.887739

BRIEF RESEARCH REPORT article

Front. Commun., 05 July 2022

Sec. Psychology of Language

Volume 7 - 2022 | https://doi.org/10.3389/fcomm.2022.887739

This article is part of the Research Topic Relationship of Language and Music, Ten Years After: Neural Organization, Cross-domain Transfer and Evolutionary Origins

A correction has been applied to this article in:

Corrigendum: Iconic associations between vowel acoustics and musical patterns, and the musical protolanguage hypothesis

Gertraud Fenk-Oczlon*

University of Klagenfurt, Klagenfurt, Austria

Vowels are the most musical and sonic elements of speech. Previous studies found non-arbitrary associations between vowel intrinsic pitch and musical pitch in senseless syllables. In songs containing strings of senseless syllables, vowels are connected to melodic direction in close correspondence to their intrinsic pitch or the frequency of the second formant F2. This paper shows that vowel intrinsic duration is also related to musical patterns. It is generally assumed that low vowels like [a ɔ o] have a higher intrinsic duration than high vowels like [i y u] and that there is a positive correlation between the first formant F1 and duration. Analyzing 20 traditional Alpine yodels, I found that vowels with longer intrinsic duration tend to align with longer notes, whereas vowels with shorter intrinsic duration tend to align with shorter notes. This new result might shed some light on size-sound symbolism in general: Since there is a direct match between vowel intrinsic duration and the “size” of musical notes, there is no need to explain the “size” of musical notes via Ohala's “frequency code” hypothesis. Moreover, I will argue that the iconic associations found between vowel acoustics and musical patterns support the idea of a sound-symbolic musical protolanguage. Such a protolanguage may have started with vowel syllables conveying pitch, timbre, as well as emotional, indexical, and sound-symbolic information.

Introduction

Language and music share many commonalities, consistent with the view that both have a common evolutionary precursor. The hypothesized common ancestor is often referred to as “musilanguage” (Brown, 2000), “musical protolanguage” (Fitch, 2005), or “prosodic protolanguage” (Fitch, 2006). A growing number of researchers further emphasize the idea that affective/emotional and iconic vocalizations could have played a significant role in the joint evolution of speech and music (Rousseau, 1781; Darwin, 1871; Fonagy, 1981; Levman, 1992; Scherer, 1995; Thompson et al., 2012; Perlman and Cain, 2014; Brown, 2017; Filippi and Gingras, 2018; Reybrouck and Podlipniak, 2019; Filippi, 2020).

This paper focuses on the role of vowels in the hypothetical construct “musical protolanguage.” I will briefly review some literature that has demonstrated tight relationships between vowels and music, and that has revealed the essential role of vowels in speech intelligibility of sentences, in conveying emotional content and talker discrimination, as well as in size-sound symbolism. I then present new results showing an iconic relationship between vowel duration and musical notes in Alpine yodels. The implications for sound symbolism in general, as well as for the idea of a sound-symbolic musical protolanguage will be discussed.

The most obvious commonality between speech and music is sound, and it is the vowels that are the main carriers of sound and prosodic information in speech and singing (e.g., Fenk-Oczlon and Fenk, 2009b). Vowels are produced without obstructing the airflow from the lungs and are relatively continuous or steady-state sounds exhibiting a greater periodicity than consonants (Cutler and Mehler, 1993). According to Halle et al. (1957, p. 116), vowels can be matched easily in pitch to pure tones, whereas determinations of pitch of consonants “usually refer to the terminal stage of the second formant in the adjacent vowel.” Vowels are distinguished by their timbre, which depends on their harmonics or overtones, whereby the formants F1 and F2 are most relevant for their identification (Peterson and Barney, 1952). The main articulatory parameters responsible for vowel timbre are tongue height, front-to-back position of the tongue, and lip rounding. The changes in the vowels' resonances are audible in the case of whispering, when the vocal cords do not vibrate, or when speaking in a creaky voice (Ladefoged, 2001). Indeed, when whispering a series of words like heed, hid, head, had, hawed, one can hear the descending pitch of F2; and when speaking the series hawed, had, head, hid, heed in a creaky voice, the descending pitch of F1 is audible.

Timbre is clearly the primary parameter that allows for discriminating between different vowels, but vowels also differ in intrinsic pitch, intensity and duration. It has been known since Meyer (1896) that, all other things being equal, high vowels such as /i/ have a higher intrinsic fundamental frequency IF0 than low vowels such as /a/. Whalen et al. (1995) observed this effect in a sample of 31 languages and even in babbling. While the mechanism determining IF0 is still a subject of debate, there seems to be general agreement that vowel pitch depends primarily on the frequency of the second formant F2 (Marks, 1975; Traunmüller, 1986). Concerning vowel intrinsic duration, it is generally assumed that low vowels have a higher intrinsic duration than high vowels like [i u y] and that there is a positive correlation between the first formant F1 and duration, i.e., the lower the vowel, the higher F1, and the higher the intrinsic duration of the vowel (House and Fairbanks, 1953; Peterson and Lehiste, 1960; Lehiste, 1970; Sol and Ohala, 2010; Toivonen et al., 2015). According to House and Fairbanks (1953), intrinsic vowel duration differences show in various types of consonant environments (voiced and voiceless stops and fricatives, nasals); for instance, when pooled across all environments, the vowel /i/ has a mean duration of 0.199 s and the vowel /a/ of 0.244 s.

Evidently, vowels show all the core properties of music—timbre, intrinsic pitch, intensity and duration—and they are the most musical components of speech. Recent studies revealed tight relationships between vowels and music. For example, in Fenk-Oczlon (2017) I reported correspondences between the number of vowels and the number of pitches in musical scales across cultures: an upper limit of roughly 12 elements, a lower limit of 2, and a frequency peak at 5 to 7 elements. The match between vowels and musical pitches shows even in specific cultures: e.g., cultures with three vowels tend to have tritonic scales. Concerning relationships between vowel acoustics and musical pitch, Fürniss (1991) reported associations between low vowels and the “low yodel register” and closed vowels and the “high yodel register” in the yodeling of Aka Pygmies; Fenk-Oczlon and Fenk (2009a,b) showed non-arbitrary associations between vowel intrinsic pitch and musical pitch in Alpine yodeling and in Austrian songs containing meaningless syllables. The tight bond between vowels and music is supported by experimental findings demonstrating strong interactions in the processing of vowels and melody, but not between consonants and musical information: “Vowels sing but consonants speak” (Kolinsky et al., 2009, p. 1). Similarly, Lidji et al. (2010) revealed a close processing relationship between vowels and pitch even at a pre-attentive level. Moreover, experiments by Zhang et al. (2017) demonstrated that congenital amusics not only show deficits in the perception of pitch but also in the perception of formant frequency in vowels.

Vowels and their acoustic properties are essential in many further aspects of language and speech, such as in speech intelligibility of sentences, in talker identity discrimination and in conveying emotional state, or in sound symbolism. For example, experimental studies revealed that the intelligibility of sentences was significantly better when hearing vowel-only sentences than when hearing consonant-only sentences (Cole et al., 1996; Kewley-Port et al., 2007). Vowels, unlike consonants, also provide rich indexical information about speaker identity and characteristics such as age, biological sex, origin and emotional state (Owren and Cardillo, 2006). Concerning relationships between vowels and emotional state, Rummer et al. (2014) demonstrated that subjects in a positive mood tend to invent words with /i:/, whereas when in a negative mood they tend to invent more words with /o:/.

As to sound symbolism (the non-arbitrary relation between sound and meaning), vowels are the main drivers in “size-sound symbolism” or “magnitude sound symbolism,” i.e., the association between size (large/small) and sound. In a classic study, Sapir (1929) demonstrated that participants associate meaningless words containing low and back vowels like /a/ (e.g., as in mal) with large concepts and meaningless words containing high and front vowels like /i/ (e.g., as in mil) with small concepts. Numerous experimental studies have replicated Sapir's finding, showing the postulated association between vowel quality and size (Bentley and Varon, 1933; Peña et al., 2011; Parise and Spence, 2012; Shinohara and Kawahara, 2016; Knoeferle et al., 2017; Vainio, 2021). Likewise, statistical studies in typologically diverse languages found associations between the high front vowel /i/ and the concept of small (Ultan, 1978; Haynie et al., 2014; Blasi et al., 2016; Johansson et al., 2020). Most recently, Winter and Perlman (2021) demonstrated that, in English, size adjectives clearly feature iconicity, and that the high front vowels /i/ and /ɪ/ are associated with “small,” while the low back vowel /ɑ/ predicts “large.” The only consonant that predicts size symbolism in their English sample was /t/. In general, consonants seem to play a rather marginal role in sound-size associations, whereas their role in sound-shape associations as in the maluma/takete effect (Köhler, 1929) or the bouba–kiki effect (Ramachandran and Hubbard, 2001) is well-attested (but see Cuskley et al., 2017 on possible influences of orthography).

Further cross-modal correspondences between vowels and other sensory modalities have been demonstrated between “vowels and quickness” (Jespersen, 1933), “vowels and brightness” (Marks, 1975), “vowels and spatial deixis” (Traunmüller, 1986; Johansson and Zlatev, 2013; Rabaglia et al., 2016; Vainio, 2021), “vowels and color” (Moos et al., 2014; Cuskley et al., 2019), or “vowels and taste” (Simner et al., 2010; Patak and Calvert, 2021).

Here I investigate whether there are iconic associations between the acoustic vowel property “intrinsic duration” (see above) and the length of musical notes. More specifically, I hypothesized that in songs containing meaningless syllables, syllables with low vowels like [a ɔ o] should be favored for long notes and syllables with high vowels like [i u y] for short notes.

Materials and Methods

The singing of senseless syllables, where “the pressures of sense are relaxed to those of sound” (Butler, 2015, p. 106), provides an ideal material to study relationships between vowels and musical notes. Senseless syllables are used in numerous cultures as complete or partial song texts, for example in Native American songs (Nettl, 1954), in “lilting” or “diddling,” in the singing of Scottish or Irish dance melodies, in children's songs and jazz scat singing, or in yodeling. Here, I chose yodels for testing the hypothesized relationship between vowels and musical notes. The yodeling style, although on the whole not very frequent, can be found around the world (Grauer, 2006), for instance in Paleosiberian cultures, in the tropical forest of Africa (Pygmies), in the Kalahari Desert (Bushmen), and in the Alps (Austria, Switzerland). According to Grauer (2006), yodels are characterized across cultures by a continuous flow of sound, no embellishment, relaxed open voices, non-sense vocables, wide intervals and a polyphonic style. These characteristics also apply to traditional Alpine yodels, which are preferably polyphonic and mostly, but not necessarily, sung with frequent alternation between low and high registers (cf. Wey, 2019); they are yodeled straight without vibrato or portamento and with meaningless syllables. The yodel-syllables are predominantly codaless, with rather weak or sonorant consonants in the syllabic onset, such as [jɔ, ha, hɔ, ji, ri, ho, ha]. Vowel-only syllables and codaless syllables with a liquid in the syllabic nucleus, like “dl,” occur as well. The transcription into musical notation of the previously only orally transmitted Alpine yodels started at the beginning of the 19th century (Wey, 2019). The traditional yodels for the present study are taken from Pommer's (1906) collection of 20 yodels. Most of the yodels of this collection are still yodeled in Austria and are well-known, so that the grapheme-phoneme correspondence of these more than 100-year-old transcriptions can be checked. For instance, the grapheme “å” is still used in Bavarian writing to denote an open “o” /ɔ/.

All 20 yodels in the collection were analyzed. I determined all relative note values in the sample: half notes (the longest note values in the sample), quarter notes, eighth notes, sixteenth notes, and thirty-second notes (the shortest notes in the sample). The notes were assigned to the respective syllables containing either high close vowels like [i u y] or low back vowels like [a ɔ o]. Furthermore, all dotted notes (the dot increases the duration of the basic note by half of its original value) were identified and matched with the particular syllables.
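The note-value arithmetic described above can be illustrated with a minimal sketch. It is not part of the original analysis; the names `NOTE_VALUES` and `note_duration` are hypothetical, and durations are expressed as fractions of a whole note purely for illustration.

```python
from fractions import Fraction

# Relative note values occurring in the sample, as fractions of a whole note.
# (Hypothetical representation chosen for illustration only.)
NOTE_VALUES = {
    "half": Fraction(1, 2),
    "quarter": Fraction(1, 4),
    "eighth": Fraction(1, 8),
    "sixteenth": Fraction(1, 16),
    "thirty-second": Fraction(1, 32),
}


def note_duration(name, dotted=False):
    """Return the relative duration of a note.

    A dot increases the duration of the basic note by half of its
    original value, i.e. the base value is multiplied by 3/2.
    """
    base = NOTE_VALUES[name]
    return base * Fraction(3, 2) if dotted else base


print(note_duration("quarter"))               # 1/4
print(note_duration("quarter", dotted=True))  # 3/8
```

Under this representation, a dotted quarter note (3/8) is longer than a plain quarter note (1/4) but shorter than a half note (1/2), which is why dotted notes are treated as relatively long notes in the analysis.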

Results

The total number of notes/syllables in the sample amounts to 1,836. The most frequent note values are eighth notes (n = 845), followed by quarter notes (n = 672), half notes (n = 190), sixteenth notes (n = 95), and thirty-second notes (n = 34); the number of dotted notes amounts to 348. Syllables with high vowels (n = 1,203) are used more often in the yodel sample than syllables with low vowels (n = 633) (χ² = 176.961, p < 0.0001).

A detailed analysis: Eighth notes are more often aligned with high vowels (590x) than with low vowels (255x) (χ² = 132.811, p < 0.0001). Quarter notes are aligned 405 times with high vowels and 267 times with low vowels (χ² = 28.339, p < 0.0001). Sixteenth notes are associated with high vowels 45 times and with low vowels 50 times (χ² = 0.263, n.s.). Thirty-second notes are aligned 28 times with high vowels and 6 times with low vowels (χ² = 14.235, p < 0.001).

In contrast, half notes, the longest note values in the sample, are more often aligned with low vowels (135x) and less frequently associated with high vowels (55x) (χ² = 33.684, p < 0.0001). This also holds for dotted notes, which are associated 265 times with low vowels and only 83 times with high vowels (χ² = 95.184, p < 0.0001). Figure 1 shows an example.
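The per-note-value statistics reported above are consistent with a chi-square goodness-of-fit test against equal expected frequencies for high and low vowels (df = 1). The text does not name the test, so this is an assumption; the following sketch, with the hypothetical helper names `chi_square_equal` and `p_value_df1`, merely recomputes the statistics from the reported counts.

```python
import math

# Counts (high vowels, low vowels) per note category, as reported in the text.
COUNTS = {
    "eighth": (590, 255),
    "quarter": (405, 267),
    "sixteenth": (45, 50),
    "thirty-second": (28, 6),
    "half": (55, 135),
    "dotted": (83, 265),
}


def chi_square_equal(observed):
    """Chi-square goodness-of-fit statistic against equal expected counts.

    Assumption: the null hypothesis is that high and low vowels are
    equally likely for a given note value.
    """
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)


def p_value_df1(chi2):
    """Upper-tail p-value of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


for name, obs in COUNTS.items():
    chi2 = chi_square_equal(obs)
    print(f"{name}: chi2 = {chi2:.3f}, p = {p_value_df1(chi2):.4g}")
```

Under this assumption, the recomputed statistics agree with the reported values to three decimal places for all six categories.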

FIGURE 1

Figure 1. An example of a yodel from our sample shows that dotted and half notes tend to be linked with syllables containing the vowel å /ɔ/, which has a longer intrinsic duration.

Discussion

Our analysis of 20 Alpine yodels demonstrates that short musical notes such as eighth notes, quarter notes and thirty-second notes tend to align with vowels with shorter intrinsic duration, whereas relatively long notes such as half notes or dotted notes are associated with vowels with longer intrinsic duration. These results need to be confirmed in further studies that use an extended sample of songs containing meaningless syllables. It would also be interesting to investigate whether, in an artificial music composition game, people tend to align vowels with longer intrinsic duration to longer notes.

Vowel Intrinsic Duration and Size-Sound Symbolism

The iconic associations between vowel intrinsic duration and length of musical notes may shed some light on size-sound symbolism in general. Although “duration” of musical notes only metaphorically corresponds to “size” of notes, our data are in line with results by Knoeferle et al. (2017) suggesting that F1 and vowel duration are decisive factors in size-sound symbolism; neither F0 nor Ohala's (1984, 1994) “frequency code” hypothesis, according to which size-symbolism mirrors the size of the vocalizers producing either lower or higher frequencies, seems to play a role in their experiments on visual size judgements. Similarly, Vainio (2021) reports that F0 values did not prove to be relevant in his study on magnitude sound symbolism. Since our results demonstrate a direct match between vowel intrinsic duration and the “size” of musical notes, there is no need to explain the “size” of musical notes via Ohala's “frequency code” hypothesis. Therefore, a possible answer to the question What is, for example, so small about mil and large about mal? (Vainio, 2021, p. 2) might be: Small about mil is the small intrinsic duration of the vowel /i/, and large about mal is the large intrinsic duration of the vowel /a/.

Vowels and a Sound-Symbolic Musical Protolanguage

The non-arbitrary associations between vowel intrinsic duration and musical notes are consistent with the results of previous studies (Fenk-Oczlon and Fenk, 2009a,b) reporting non-arbitrary associations between vowel intrinsic pitch and musical pitch in meaningless syllables: In songs containing strings of meaningless syllables, vowels are connected to melodic direction in close correspondence to their intrinsic pitch or the frequency of the second formant F2. The tight relationships between vowel acoustics and musical intervals indicate that in the case of singing senseless syllables, where there is no pressure of text, vowels and melody seem to merge. This might strengthen the idea that both music and speech evolved from a common prosodic precursor.

In Fenk-Oczlon (2017) I speculated that the earliest human vocal communication may have started with vowels or vowel syllables strung together, which were connected by semivowels or glides such as [w], [h], [j] or the glottal stop [ʔ]. The vowel sequences exhibited pitch and timbre modulations which were used to express different social and pragmatic functions, and were probably propositionally meaningless. The main arguments for this speculation were based on findings from language ontogeny, ethnomusicology, and parallels between vowels and musical patterns. In the 2017 paper I did not consider the huge sound symbolic potential of vowels and their disproportionate role in talker identity discrimination, including characteristics such as age, biological sex, origin, or emotional state. Considering all these properties of vowels, it seems plausible that the sequences of vowel syllables were not bare phonology in the sense of Fitch (2010), but instead conveyed sound symbolic information about the environment, about emotional states, or speaker identity. The sequences of vowel syllables probably also contained interjections similar to present-day words such as ah, oh, eh, huh. In this context it is interesting to note that Dingemanse et al. (2013) reported that all variants of the interjection word huh in their cross-linguistic sample consisted either of a vowel-only syllable, a syllable with a glottal stop [ʔ], or a glottal fricative [h] in the onset.

The vowel sequences were likely very polysemous, because of the small number of vowels (present-day languages have on average 5–6 vowels; Maddieson, 2005) which does not allow much variation in a sequence. Only pitch, duration, intonational contour, rhythmic grouping and situational context could help to discriminate the different (sound symbolic) meanings.

Even in present-day languages, vowel-only sentences can be observed. Table 1 gives some examples from Japanese (Tsunoda, 1985), Carinthian (my own native knowledge) and vowel-only expletives from the Mbendjele Pygmies (Lewis, 2009). I am not able to analyze the Japanese examples, but the Carinthian example shows that the word “a” /a/ is quite polysemous: It can be a question particle, an interjection of astonishment, and can also denote auch “also.” The expletives from the Mbendjele Pygmies nicely demonstrate the potential of vowels to convey emotional content. Furthermore, Lewis (2009) reports that vowel-only sentences can also be observed in very intimate communication situations between two persons of the Mbendjele Pygmies, who “tend to omit consonants, leaving only tone and vowels” (Lewis, 2009, p. 241).

TABLE 1

Table 1. Examples of vowel-only sentences and vowel-only expletives in Japanese, Carinthian and in the language of the Mbendjele Pygmies.

One might speculate that the earliest stage of human vocal communication, where mere vowel syllables connected by semivowels were strung together, best represents the hypothesized common prosodic precursor of speech and music. The vowel syllables exhibited all core elements of music: pitch, timbre, duration, and intensity. They conveyed prosodic information such as intonation, rhythm, and tempo, but also (semantic) sound-symbolic or onomatopoetic information about the environment, inner mental states or speaker identity. In a later stage, consonants such as obstruents emerged and were combined with vowels into consonant-vowel syllables. This was likely the emergence of articulated speech (Jordania, 2006), and of utterances which could express propositional meaning.

Grauer (2006) speculated that yodeling might be a vestige of the earliest singing style of humanity. The Alpine yodel syllables investigated in this paper may not be too different from the vowel syllables in the hypothesized earliest stage of human vocal communication.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Author Contributions

The author confirms being the sole contributor of this work and has approved it for publication.

Conflict of Interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Acknowledgments

I thank the reviewers for their insightful comments and helpful suggestions.

References

Bannan, N. (2008). Language out of music: the four dimensions of vocal learning. Aust. J. Anthropol. 19, 272–293. doi: 10.1111/j.1835-9310.2008.tb00354.x