Motor constellation theory: A model of infants’ phonological development

Ekström, Axel G.

doi:10.3389/fpsyg.2022.996894

ORIGINAL RESEARCH article

Front. Psychol., 03 November 2022

Sec. Psychology of Language

Volume 13 - 2022 | https://doi.org/10.3389/fpsyg.2022.996894

Axel G. Ekström*

Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden

Every normally developing human infant solves the difficult problem of mapping their native-language phonology, but the neural mechanisms underpinning this behavior remain poorly understood. Here, motor constellation theory, an integrative neurophonological model, is presented, with the goal of explicating this issue. It is assumed that infants’ motor-auditory phonological mapping takes place through infants’ orosensory “reaching” for phonological elements observed in the language-specific ambient phonology, via reference to kinesthetic feedback from motor systems (e.g., articulators), and auditory feedback from resulting speech and speech-like sounds. Attempts are regulated by basal ganglion–cerebellar speech neural circuitry, and successful attempts at reproduction are reinforced through dopaminergic signaling. Early in life, the pace of anatomical development constrains mapping such that complete language-specific phonological mapping is prohibited by infants’ undeveloped supralaryngeal vocal tract and undescended larynx; constraints gradually dissolve with age, enabling adult phonology. Where appropriate, reference is made to findings from animal and clinical models. Some implications for future modeling and simulation efforts, as well as clinical settings, are also discussed.

Introduction

Human infants are born into complex phonological landscapes, composed of a near-infinite number of possible speech sounds (Maddieson, 1984). At birth, human infants possess a limited vocal repertoire, including crying and moaning (Eibl-Eibesfeldt, 1973; Ackermann and Ziegler, 2010). From such humble beginnings, they display predictable linguistic development across individuals, languages, and cultures, adapting to and acquiring almost flawlessly their native language and phonology (here operationalized as any language-specific set of permissible speech sounds). In under a year, every normally developing infant learns to reliably perceive the sounds of his or her native language (Werker and Tees, 1984; Kuhl et al., 1992; Cheour et al., 1998), and has begun consistently producing language-appropriate syllabic utterances in the form of babble and vocal play (Locke and Pearson, 1992; Guenther, 1994, 1995; Oller, 2000; Jang et al., 2019). The remarkable speed of this development has been the subject of decades of intense research efforts (Oller, 1980, 2000; Jusczyk, 1997; de Boysson-Bardies, 2001). Infant cries, once believed to be a possible precursor of speech (Lester and Boukydis, in press), are no longer considered as such (Nathani et al., 2006; Oller et al., 2013, 2021). Rather, protophones, infant speech-like utterances including vowel-like sounds and melodic non-cry vocalizations, appearing even before the onset of babble, represent a substantially greater proportion of infant utterances (Stark, 1980; Hsu et al., 2000; Jang et al., 2019; Oller et al., 2021; Wermke et al., 2021) and are considered likely precursors of phonemes proper (Oller, 1980; Koopmans-van Beinum and Stelt, 1986).

At around 6 months of age, infants begin producing canonical babble—repetitions of the same syllable, e.g., /ˈbɑːbɑː/—and around the age of 1 year, begin producing variegated babble—more complex mixed-syllable utterances, e.g., /ˈbɑdə/ (Oller, 2000). Crucially, adequate learning of phonological patterns may facilitate learning of other aspects of language (for a review, see Ruben, 1997). While the vocal milestones reached throughout infanthood have been alternately described by multiple researchers and using varied terminology (reviewed in Vihman, 2013), these general trends and tendencies are not controversial in the literature. Nevertheless, the mechanisms by which infants manage this mapping of language-appropriate sounds to their corresponding points of articulation are poorly understood.

Humans are vocal learners (Janik and Slater, 2000), capable of memorizing and repeating vocally that which has previously been heard. Indeed, human infants exhibit variable generalized imitative behavior with likely bearing on later-in-life speech behavior, including the imitation of facial expressions (Field et al., 1982, 1983), gestures such as tongue protrusion and head movements (Meltzoff and Moore, 1989), as well as goal-directed physical actions (for a review, see Elsner, 2007) and vocalization more broadly (Poulson et al., 1991; Kuhl and Meltzoff, 1996; Kugiumutzakis, 1999; Kokkinaki and Kugiumutzakis, 2000). Neural mechanisms underlying imitation are not yet well understood, but Marshall and Meltzoff (2014) have pointed to mirror neurons, cells triggered upon both the execution of an act and the observation of the same act (di Pellegrino et al., 1992), as a possible explanation. In terms of behavioral measures, Imafuku and colleagues found that infants’ tendency to vocally imitate vowel sounds was based both on infants’ attention to speakers’ faces, and on whether a speaker’s gaze was focused on the infant in return (as opposed to away from the infant; Imafuku et al., 2019).

Human neonates, seemingly based on prosodic and indexical cues, prefer the sound of their mother’s voice, heard in utero, as well as the sounds of their mother’s language (Jusczyk et al., 1993; see overview in Locke and Snow, 2010). Thus, systems of perception undergo a process of adapting to ambient phonological features, beginning even before birth. Phonetically, however, the tuning of systems of speech production to match a native-language phonology represents a monumental task (for a comparative perspective, see Bolhuis, 1991), and the history of the field has seen a range of theories with bearing on the phenomenon, from “innatist” theories assuming a hard-wired cognitive apparatus prepared for learning speech and language (Chomsky, 1986, 2002), to modern input-focused theories, assuming development scaffolding through infants’ interactions with caretakers (Fernald, 1991; Kuhl et al., 1997; Goldstein and Schwade, 2008) or, more generally, acquisition based on learning from the immediate environment (including parental speech; Kuhl, 2000; Perszyk and Waxman, 2019). Supporting evidence is also available from computational modeling and learning approaches (Vallabha et al., 2007).

Despite the range of theories, however, much remains unknown about the mechanisms that underlie infants’ language development. While innatist accounts have been criticized for evolutionary implausibility (Pinker and Bloom, 1990), interactionist theories have found significant support in relevant research (Poulson et al., 1991; Kuhl and Meltzoff, 1996; see review by Chapman, 2000). However, such accounts suffer on theoretical grounds, being heavily based on observation (see Chapman, 2000; Lindblom, 2000). In the words of Chapman (2000, 33), the field has “been productive in identifying developmental patterns and individual differences but slow to develop explanations that are more than a relabeling of the patterns observed.”

Some basic postulates for a theory of phonology as an emergent phenomenon have been presented by Lindblom (2000). Namely, a theory of infants’ phonological learning must—as opposed to “curve-fitting,” the tailoring of explanatory models based solely on observations—be predicated on basic principles of the natural world, while also accommodating empirical findings. The present account accepts this premise, and thus seeks to consider both the deeper biomechanical origins and necessarily pre-verbal development and subsequent employment of in-place motor activity in early speech-like behavior (Lindblom, 2000; MacNeilage and Davis, 2000); that is, principles of learning by which a system of phonology develops from non-systematic exploratory pre-speech; and the neurological changes that accompany these developments. A theory seeking to explicate such a complex and ultimately neuroscientific issue must couch its propositions in a more basic body of literature from the study of learning, phonetics, developmental psychology, and comparative cognition and neuroscience. Providing such a framework is the goal of the present text.

In the following sections, the basics of speech production, and the neural activity to which it corresponds, are reviewed. Drawing on comparative research, including clinical observations and findings from animal models, a theory of phonological development is presented. It is suggested that dopaminergic pathways in the infant brain instantiate learning of tutor (i.e., parent or other ingroup caretaker) phonology, by comparing auditory outputs resulting from a given motor constellation (i.e., simultaneous activation of muscle groups) to target goals, derived from ingroup ambient input. This process is presumed guided via reference to kinesthetic and auditory feedback. Key assumptions are summarized in a theoretical framework, with some tentative implications for modeling approaches and clinical work. Said framework is dubbed the motor constellation theory of infants’ phonological development.
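The comparison of auditory output against a target, with successful attempts retained, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the two-parameter "motor constellation", the linear `vocal_tract` forward model and its coefficients, and the accept-if-closer rule standing in for dopaminergic reinforcement. It is not a model proposed in the present text.

```python
import random

def vocal_tract(motor):
    """Hypothetical forward model: two motor parameters in [0, 1]
    (tongue height, tongue frontness) -> formant-like values in Hz."""
    height, frontness = motor
    f1 = 850.0 - 550.0 * height       # higher tongue body -> lower F1
    f2 = 800.0 + 1500.0 * frontness   # fronter tongue -> higher F2
    return (f1, f2)

def mismatch(output, target):
    """Euclidean distance (Hz) between produced and target acoustics."""
    return sum((o - t) ** 2 for o, t in zip(output, target)) ** 0.5

def learn_constellation(target, trials=2000, step=0.05, seed=0):
    """Exploratory 'reaching': keep a motor attempt only when its
    auditory result is closer to the target than the current best."""
    rng = random.Random(seed)
    motor = [0.5, 0.5]                          # arbitrary starting posture
    best = mismatch(vocal_tract(motor), target)
    for _ in range(trials):
        # Perturb the current constellation and clamp to the motor range.
        attempt = [min(1.0, max(0.0, m + rng.gauss(0.0, step))) for m in motor]
        error = mismatch(vocal_tract(attempt), target)
        if error < best:                        # stand-in for a reward signal
            motor, best = attempt, error
    return motor, best

if __name__ == "__main__":
    ambient_target = vocal_tract((0.8, 0.7))    # hypothetical tutor vowel
    learned, residual = learn_constellation(ambient_target)
    print(learned, residual)
```

The learner never sees the tutor's motor parameters, only the acoustic target, so the mapping has to be discovered through the learner's own feedback. Simple hill-climbing was chosen for brevity; any reinforcement-style update could take its place.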

Navigating phonetic output

Speech production and acoustics

Human speech is a behavioral composite of motor activity in the respiratory organs, larynx, and articulatory organs (the tongue, upper and lower lips, upper teeth, alveolar ridge, hard palate, velum, uvula, pharyngeal wall, and glottis), executed in combination (for overviews, see Denes and Pinson, 1963; Ladefoged, 1996; Stevens, 2000). Speech production results from air being expelled from the lungs at variable pressures, causing vibration in the vocal folds of the larynx (except in, e.g., whispering, where vocal folds do not vibrate), and air being forced through structures in the vocal tract that impose narrow constrictions on airflow (Denes and Pinson, 1963). The rate of vocal fold vibration is termed the fundamental frequency (f₀) and corresponds perceptually to pitch height, while the imposition of narrow constrictions results in variations (mainly) in the first and second formants (F₁ and F₂, respectively), that is, spectral frequency peaks resulting from resonances in the vocal tract. F₁ is predominantly determined by the height of the tongue body and is inversely related to vowel height, such that lower frequencies correspond to greater vowel heights; F₂ is largely determined by tongue front-to-back position, corresponding to the frontness/backness of a vowel. All spoken languages, thus, share a most basic property, that of being composed of culturally agreed-upon (though largely arbitrary) formalized constellations of motor activity, cognitively imbued with symbolism (i.e., word semantics).

The number of vowels, consonants, and phonemes in a given language is highly variable (Maddieson, 1984), but never exhausts the full potential rendered possible by human systems of speech production. The phonetic structure of vowel systems—that is, the qualities of vowels sustained as part of a language-specific phonology—is contingent on perceptual contrast between vowels (Lindblom and Sundberg, 1969; Liljencrants and Lindblom, 1972). Results of early modeling by Lindblom and Sundberg (1969) investigating the maximum distance between permissible vowels within a random set (while still allowing for intelligibility and sufficient distinctiveness) further point to a role for limitations of perception and memory in the construction and maintenance of language-specific phonologies. Similar principles also govern the structure and development of consonant systems (Lindblom and Maddieson, 1988). It need not be argued that a language–and its associated system of speech sounds–must be simple enough to be perceived and repeated by infants born into the society that speaks it; any language that did not abide by this principle would fail to survive beyond a single generation of speakers. Thus, systems of speech must be flexible enough to allow for the variant qualities, inherent both in the speech signal itself, and in the perceptual systems of listeners. What is built up by the infant in acquiring phonology, then, is a library of systematic knowledge of the relationship between auditory patterns, kinesthetic-orosensory patterns, and (for purposes of modeling) discrete target positions (Fry, 1966; Lindblom and Sundberg, 1969; Boysson-Bardies et al., 1992).
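The idea that vowel systems favor perceptual contrast can be made concrete with a small sketch. It scores a vowel set by an inverse-square "energy" over pairwise distances in a mel-scaled F₁/F₂ plane, so that lower values indicate a more dispersed system. This energy function, the Hz-to-mel formula, and the formant values below are illustrative assumptions. They are round, hypothetical numbers, not data or procedures taken from the studies cited above.

```python
import math
from itertools import combinations

def hz_to_mel(f_hz):
    """A commonly used Hz-to-mel approximation (assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def dispersion_energy(vowels):
    """Sum of inverse squared distances between all vowel pairs.
    Each vowel is an (F1, F2) tuple in Hz; lower energy = more contrast."""
    total = 0.0
    for (a1, a2), (b1, b2) in combinations(vowels, 2):
        d_squared = ((hz_to_mel(a1) - hz_to_mel(b1)) ** 2
                     + (hz_to_mel(a2) - hz_to_mel(b2)) ** 2)
        total += 1.0 / d_squared
    return total

# Hypothetical three-vowel systems (illustrative values only).
corner_like = [(300, 2300), (750, 1300), (320, 800)]   # roughly /i/, /a/, /u/
crowded = [(450, 1500), (500, 1400), (550, 1600)]      # all near mid-central

if __name__ == "__main__":
    print(dispersion_energy(corner_like), dispersion_energy(crowded))
```

Under these assumptions the corner-like system yields the lower energy, in line with the view that sufficiently distinct vowels are easier to perceive and to repeat.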

Developmental constraints on infants’ phonological production

Phonological mapping must necessarily be limited by constraints of the developing vocal apparatus (Green and Nip, 2010); for example, the anatomical prerequisites for the production of nasal bilabials such as /m/ or plosive bilabials such as /b/ are largely present at birth, leading to typically observed first words (roughly corresponding to, e.g., /ˈbɑːbɑː/, /ˈmɑːmɑː/; McCarthy, 1946). Meanwhile, fricative alveolars such as /s/ require significant lingual muscle dexterity (not to mention dentition) before their cognitive-orosensory coordinates can be appropriately mapped and accommodated. The same is also true of vowel sounds. For example, utterances such as schwa (in English, an unstressed, or neutral vowel) require comparably little effort or flexibility on behalf of a speaker, compared to, e.g., /i/, which requires significant labial and lingual stretching, as well as the development of necessary anatomical interstructural relationships. In adult humans, roughly half the tongue is positioned in the throat, such that the supralaryngeal airway acquires a roughly right-angle bend at its midpoint. The resulting near 1:1 relationship between horizontal and vertical sections of the supralaryngeal vocal tract (SVT) renders possible the production of quantal vowels /a/, /i/, and /u/ (Stevens, 1972, 1989). However, the same relationship is not found in infants.

Instead, at birth, the tongue is largely contained in the mouth, only descending into the throat with development, reaching completion by roughly 8 years of age (Lieberman, 2012). As the tongue descends, so does the larynx, which is also positioned higher in infants compared with adults (Lieberman et al., 2001; Nishimura, 2018). With SVTs more similar to those of nonhuman primates than of adult humans, human infant SVTs are incapable of producing quantal vowels (Lieberman et al., 1972; Stevens, 1972, 1989; Lieberman, 2012), and their corresponding mapping thus cannot be completed prior to this point of development. That is, the maturing SVT provides increased proprioceptive-auditory affordances (see Gibson, 1979), as exploration of its motor and acoustic-perceptual relationships becomes available. Accordingly, infants’ vowel space (Kent and Murray, 1982), utterance melodic complexity (Wermke et al., 2021), and (in infants acquiring a tonal language) accuracy of tonal suprasegmental features as well as the complexity of individual tones readily acquired (Wong and Strange, 2017)¹ all increase significantly throughout the first year of life with the development of increased lingual and muscle dexterity and flexibility. Such contingence on anatomy places significant constraints on the infants’ initial phonetic development.
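One acoustic consequence of a shorter vocal tract can be sketched with the textbook approximation of a neutral (schwa-like) tract as a uniform tube closed at the glottis and open at the lips, whose resonances fall at odd multiples of c / 4L. The tract lengths and the speed of sound used below are rough assumptions for illustration. The sketch captures only overall length, not the horizontal-to-vertical proportions of the SVT emphasized above.

```python
def tube_resonances(length_m, count=3, speed_of_sound=350.0):
    """Resonances (Hz) of a uniform tube closed at one end:
    F_n = (2n - 1) * c / (4 * L), for n = 1..count."""
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_m)
            for n in range(1, count + 1)]

if __name__ == "__main__":
    adult = tube_resonances(0.175)   # assumed adult tract length, ~17.5 cm
    infant = tube_resonances(0.08)   # assumed infant tract length, ~8 cm
    print("adult: ", [round(f) for f in adult])
    print("infant:", [round(f) for f in infant])
```

With the assumed adult length, the first three resonances fall near 500, 1500, and 2500 Hz. The shorter assumed infant tract shifts all resonances upward, so infant and adult productions of the "same" vowel cannot share absolute formant values.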

Articulation is position control

Even in the most mundane everyday activities such as reaching for an object or placing one foot in front of the other, human actors make use of sophisticated computation when acting upon the world. Neurologically, such instances of fine position control are continually adjusted by cerebellar-motor cortex networks (Drew, 1993; Armstrong and Marple-Horvat, 1996; Drew et al., 2008), via reference to both visual feedback from the immediate environment, and proprioceptive-kinesthetic feedback from relevant muscle groups. Necessary adjustments to fine-motor movements are readily accomplished with little or no premeditation; this phenomenon is termed motor equivalence—the use of variable motor sequences of muscle movements toward achieving some goal. However, the broad domain-general functionality of cerebellar networks for motor control extends beyond reaching, grabbing, and walking. Indeed, there is significant evidence of motor equivalence in speech articulation also. Findings presented by Gay and colleagues on compensation in vowel production in conditions of abnormal jaw openings (Lindblom et al., 1979) and bite blocks (Gay et al., 1981) suggest (1) that articulation is compensatory and (2) that tongue placement is executed appropriately via reference to tactile feedback.

The human tongue possesses four major extrinsic muscles: (1) the genioglossus, which extends, protrudes, and depresses the tongue; (2) the styloglossi, which retract the tongue; (3) the hyoglossus, which depresses and retracts the tongue; and (4) the palatoglossus, which elevates the posterior portion of the tongue. It also possesses four intrinsic (attaching only to other muscles in the tongue body) paired muscles: the (1) superior longitudinal, (2) inferior longitudinal, (3) transverse, and (4) vertical muscles, whose directions of travel are all indicated by their nomenclature. Each muscle or group of muscles is dominant over others in given contact patterns (see Figure 1). Further bridging the gap to motor equivalence in reaching, Moayedi et al. (2021, 3046) have recently suggested that “the organization of [tongue] somatosensory endings is reminiscent of fingertips, suggesting that the hard palate is equipped with a rich repertoire of sensory neurons for pressure sensing and spatial localization of mechanical inputs.” Thus, speech articulation may be defined as the “reaching” in laryngeal–orosensory space for discrete target positions, defined, in turn, as contact patterns.

Figure 1. Tongue contact patterns for consonantal sounds. Left to right: alveolar grooved /s/ /z/; alveolar stop /t/ /d/ /n/; velar stop /k/ /g/ /ng/.

However, muscles of the tongue are merely one example of sources of feedback necessary for appropriate articulation. Significant evidence now also points to the role of multimodal feedback in the control of speech articulatory and acoustic parameters, the first and most obvious being auditory feedback.

The role of feedback

Evidence for the necessity of auditory feedback in speech articulation is provided by a range of experiments wherein that feedback is perturbed, and production is adjusted to compensate. Effects of perturbing the auditory feedback channel can be examined by applying real-time frequency modulation of speaker voice (Elman, 1981; Kawahara, 1994). Such studies typically observe that subjects shift f₀ in the direction opposite that of the stimuli presented (Burnett et al., 1998; Jones and Munhall, 2005; Larson et al., 2008), but other perturbation experiments have also observed compensatory shifts in F₁ and F₂ (Houde and Jordan, 1998; Purcell and Munhall, 2006; Pile et al., 2007; Katseff et al., 2012). Compensation to perturbation takes place within 150 ms of perturbation onset, and mismatches are coded bilaterally in the superior temporal cortex of the speaker (Tourville et al., 2008). Beyond auditory feedback, the laryngeal mucosa, which senses vibrations in the laryngeal cavity (during vocal fold oscillation), also provides important somatosensory feedback. That is, vibrotactile feedback stemming from activity directly in the larynx may also serve as a clue to whether desired vocal production is in fact being executed (see also Shiba et al., 1997; Sapir et al., 2000). As noted by Hammer and Krueger (2014), who tested laryngeal mechanosensory detection thresholds using endoscopy, the sensorium of the larynx itself also appears to modulate afference, attenuating potentially distracting sensory input mid-vocalization.
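The logic of such perturbation experiments can be sketched as a simple feedback loop: the speaker hears a shifted version of their own f₀, registers the mismatch against the intended pitch, and adjusts production in the opposite direction. The proportional `gain`, the number of update steps, and the example frequencies are hypothetical choices and are not parameters reported in the studies cited above.

```python
def simulate_compensation(target_hz, shift_hz, gain=0.3, steps=20):
    """Return produced f0 (Hz) over successive corrective updates
    while auditory feedback is shifted by shift_hz."""
    produced = target_hz
    history = []
    for _ in range(steps):
        heard = produced + shift_hz      # perturbed auditory feedback
        error = heard - target_hz        # mismatch with intended pitch
        produced -= gain * error         # correct against the perceived error
        history.append(produced)
    return history

if __name__ == "__main__":
    upward = simulate_compensation(target_hz=200.0, shift_hz=20.0)
    print(upward[0], upward[-1])         # production drifts below 200 Hz
```

In this idealized loop the speaker eventually cancels the shift entirely. Reported compensation is often only partial, which is one reason to consider additional feedback channels alongside audition.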

Indeed, available evidence now suggests that control of articulation is supported by dual feedback channels of auditory and proprioceptive feedback. Work by Schroeder and colleagues examining recordings of macaque monkey (Macaca mulatta and M. fascicularis) auditory association cortices, when subjects were presented with auditory and somatosensory input, suggests a significant temporal overlap between the two, as well as integration at an early stage of auditory cortical processing (Schroeder et al., 2001). Wang and colleagues investigated the simultaneous influence of auditory and vibrotactile feedback disturbances in f₀ control in human subjects, finding stronger compensatory responses in participants in a combined vibrotactile-auditory stimuli condition than for either single modality on its own (Wang et al., 2015a,b; see also Larson et al., 2008).

Such findings are complemented by work by Katseff et al. (2012), who upon finding that subjects compensated more for small feedback shifts than for larger ones, suggested that auditory and somatosensory information was incorporated by a speech motor control system, apparently driven by differential weighting of both modality parameters: Where discrepancies are minor, a premium may be placed on auditory feedback, while for greater discrepancies, somatosensory feedback may outweigh auditory feedback (Katseff et al., 2012). Reflecting the role of both auditory and proprioceptive feedback, feedback parameters are included, as a means of articulatory correction, in speech motor control modeling efforts such as Frank Guenther’s DIVA model (Guenther, 1995; Guenther and Vladusich, 2012). Significantly for the present account, Locke (1993) has also stressed similar roles of feedback for facilitating development of speech capacities in the human child. Indeed, when learning a new motor skill (including the production of any phoneme or set of phonemes), sensory feedback provides crucial referent information; any physical action corresponds to a unique proprioceptive-kinesthetic perceptual experience, which in learning that skill helps facilitate its repetition (e.g., Ullman, 2001).
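The differential weighting described above can be illustrated with a toy calculation. The speaker is assumed to choose a compensation c that minimizes a weighted sum of squared auditory error (shift minus c) and squared somatosensory error (c, the departure from the habitual articulation), with the auditory weight shrinking as the shift grows. The particular weighting function and its `scale_hz` parameter are hypothetical. They are chosen only to reproduce the qualitative pattern and are not taken from Katseff et al. (2012).

```python
def auditory_weight(shift_hz, scale_hz=100.0):
    """Hypothetical weight on auditory feedback: near 1 for small
    discrepancies, falling toward 0 for large ones."""
    return 1.0 / (1.0 + abs(shift_hz) / scale_hz)

def compensation(shift_hz, scale_hz=100.0):
    """Compensation magnitude (Hz, opposing the shift) minimizing
    w_aud * (shift - c)**2 + w_som * c**2 with respect to c."""
    w_aud = auditory_weight(shift_hz, scale_hz)
    w_som = 1.0 - w_aud
    return w_aud * shift_hz / (w_aud + w_som)

if __name__ == "__main__":
    for shift in (50.0, 250.0):
        c = compensation(shift)
        print(shift, round(c, 1), round(c / shift, 2))
```

Under these assumptions a small shift is compensated proportionally more than a large one, even though the absolute adjustment still grows with the size of the shift.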

From perception to production

While intraspecies social vocalization represents an ancient evolutionary heritage (Bass et al., 2008), vocal learning is an ability shared with only a few disparate lineages, including pinnipeds (Schusterman, 2008; Reichmuth and Casey, 2014), bats (Vernes and Wilkinson, 2020), and cetaceans, such as whales (Noad et al., 2000) among mammals; and parrots (Pepperberg, 2010; Bradbury and Balsby, 2016), hummingbirds (Baptista and Schuchmann, 1990), and oscines (hereafter songbirds) among Aves. Among primates, only humans consistently exhibit sophisticated vocal learning (Egnor and Hauser, 2004; but see, e.g., Wich et al., 2009). Of all vocal learning capacities currently known to science, the human ability is rivaled in complexity only by that of songbirds. Further, outside of humans, songbirds represent by far the most well-studied vocal learning taxonomic group (Konishi, 1964, 1985, 2010; Nottebohm, 1970; Marler and Waser, 1977; Nottebohm et al., 1986; Kroodsma and Konishi, 1991; Bolhuis and Gahr, 2006; Bolhuis et al., 2010; Gale and Perkel, 2010; Bolhuis and Moorman, 2015; Prather et al., 2017).

Though features of songbird vocal anatomy and physiology (Greenwalt, 1968; Suthers, 1997) differ from those of humans (e.g., Ladefoged, 1996) and nonhuman mammals (Negus, 1949; Harrison, 1995)—and though such differences lead to obvious differences in acoustic output—the two systems can be usefully thought of as comparable. Systems of vocalization in both species are a priori free (there should be no objectively more beneficial system of vocalization) and subject to relatively well-defined constraints, including the limitations resulting from the progressive development of the speech apparatus of humans (Lieberman et al., 1972; Green and Nip, 2010; Lieberman, 2012), and song apparatus of songbirds (Greenwalt, 1968; Farries, 2004). There are also remarkable similarities between songbird and human brains, resulting from convergent evolution (Colquitt et al., 2021). Thus, over the course of the development of the field, multiple authors have drawn on the behavioral parallels between birdsong and human speech (Marler, 1970; Doupe and Kuhl, 1999; Goldstein et al., 2003; Kuhl, 2003; Bolhuis et al., 2010; Prather et al., 2017) and such parallels have at times guided the interpretation of experimental work on linguistic development (e.g., Goldstein et al., 2003).

In any species capable of vocal learning, developing individuals must solve a difficult adaptive problem in ontogeny, that is, adapting one’s repertoire of vocal output to ambient sounds as observed in mature conspecifics. In songbird species such as the zebra finch (Taeniopygia guttata), auditory feedback is necessary for matching explorative vocal output against intended sounds. This was most clearly made evident through the work of Masakazu Konishi in his studies of deafened songbirds, which failed to develop adequate song (Konishi, 1964, 1965b; see also Marler and Waser, 1977; Price, 1979; Brainard and Doupe, 2000). Similarly, deaf-born human infants exhibit impaired development of babbling behavior (Oller and Eilers, 1988) and later in life typically present with underarticulated (e.g., Hudgins and Numbers, 1942) and monotone (e.g., Smith, 1975) speech. Unlike songbirds, non-songbird species such as chickens (Gallus domesticus) produce species-typical vocalizations, even when deafened (Konishi, 1963a). In the case of species-typical learned vocalization behavior, thus, complex motor learning (underlying vocal learning) is contingent on sensory feedback, which guides the steering toward a target auditory output. Comparative findings in human infants have also been provided by Boysson-Bardies et al. (1992).

In his doctoral work, Konishi (1963b) posited “template theory,” according to which a juvenile songbird will memorize the song of a conspecific tutor individual, using that song as a point of reference in future own song development and elaboration. A young bird hears its own song and compares it to its sensory template; in the event of a mismatch between the two, the bird continually adjusts its song until it matches the template. Konishi (1963b, 1965a) suggested that, in the process of song learning, a songbird converts an “auditory template,” derived from the song of adult tutor individuals, into a “proprioceptive template,” such that sensory feedback helps guide motor activity toward positional coordinates necessary to produce desired auditory outputs (see also Nottebohm, 1970). Modern research has shed light on some of the neural circuitry that underlies this apparent phenomenon. Namely, in the songbird brain, the caudomedial nidopallium is believed to be the site of auditory tutor song memory storage (Bolhuis and Gahr, 2006; Hahnloser and Kotowicz, 2010; Bolhuis and Moorman, 2015; Yanagihara and Yazaki-Sugiyama, 2016). A basal ganglion dopamine (DA) pathway appears to drive auditory preference and response, forming a neurological basis for song memory (Gale and Perkel, 2010; Barr et al., 2021; Daou and Margoliash, 2021).

For mammals, comparable auditory experience-dependent neuronal plasticity has also been observed in rodents (Sanes and Bao, 2009; de Villers-Sidani and Merzenich, 2011) but direct equivalent evidence for the neurological underpinnings of human infants’ phonological development is, to the knowledge of the author, as of yet not available. However, some evidence exists with apparent bearing on this issue. Crucially, Kuhl and Meltzoff (1996) documented how infants of only a few months of age produced vocalization resembling heard recorded vowels. Echoing the template theory of Konishi (1963b), the authors suggested that infants derived perceptual representations of heard vocalizations, which are utilized as targets for subsequent speech production (Kuhl and Meltzoff, 1996). Indeed, research on cultural variations in infant crying and babbling strongly suggests that plasticity begins early in life. Newborns’ crying is influenced by ambient native-language prosodic cues (Mampe et al., 2009), which also influence later-in-life babble (de Boysson-Bardies et al., 1981; de Boysson-Bardies et al., 1984; de Boysson-Bardies et al., 1989; Levitt and Utman, 1992) and rhythmic-prosodic properties such as positionally appropriate syllabic lengthening (Levitt and Wang, 1991). Finally, reflecting the developing SVT, cultural variations in consonantal sounds may appear later in development, compared with vowels, which are comparatively easily produced and exhibit early cultural influence (Chen and Kent, 2010; Lee et al., 2010; but see de Boysson-Bardies et al., 1989).

Kuhl et al. (2006) have shown that auditory experience drives a progressive process of integration of language-specific phonemes in auditory memory, which may be indicative of analogous neural circuitry to that observed in songbirds and rodents. Following this work, a parallel to birdsong template theory (Konishi, 1963b) has been put forward and elaborated by Kuhl and colleagues (Kuhl, 1992; Kuhl and Meltzoff, 1996; Kuhl et al., 2006; see also Vihman, 2019).² Crucially, recent iterations of Frank Guenther’s DIVA model (Guenther and Vladusich, 2012; Guenther, 2016) present a coherent argument for how such conversion from auditory speech “chunk” component to motor vocal production behavior may take place; that is, two-way prediction of motor and sensory domains facilitates the establishment of a “speech sound map” (Guenther, 2016).

Physiological bases of speech learning

Neural representations

Investigations into somatosensory motor cortex representations of the speech organs and articulators go back to Wilder Penfield’s classic work on the cortical somatotopic mapping of, among others, the tongue, jaw, and lips (Penfield and Boldrey, 1937; Penfield, 1954). More recent work has localized the site of cortical control of the larynx, dubbed the laryngeal motor cortex (Brown et al., 2008, 2021; Simonyan and Horwitz, 2011; Dichter et al., 2018), as well as the site of overlap between larynx and jaw somatotopic representations (Brown et al., 2021; see also MacNeilage, 1998). The organization of the auditory cortical ventral and dorsal pathways of the brain also shows substantial interspecies similarity (Rauschecker and Scott, 2009; Rauschecker, 2012; Hage and Nieder, 2016). Notably, however, complex motor behaviors, including linguistic abilities, are contingent on distributed networks of circuitry, with various localized centers of activity (Mesulam, 1990; Lieberman et al., 1992). Syllabic articulation is thought to emerge from coordinated activity across a constellation of representations of articulatory organs (Browman and Goldstein, 1989; Levelt, 1993; Guenther, 2006; Bouchard et al., 2013). For example, a dorsal pathway in the premotor and temporal cortices supports speech repetition (Friederici and Gierhan, 2013), and the “dual neural network model” posited by Hage and Nieder (2016) assumes that voluntary speech emerges individually via the development of a prefrontal cortical volitional articulatory motor network that assumes control over a subcortical phylogenetically preserved primary vocal motor network.

While cortical representation of speech production is relatively well researched (Wildgruber et al., 1996; Gracco et al., 2005; Papoutsi et al., 2009), its subcortical underpinnings, now increasingly recognized as crucial to speech behavior, remain relatively poorly understood (Lieberman, 2000, 2012). Patients suffering damage to the basal ganglia (BG; a subcortical structure) often present with classic signs of Broca’s aphasia or Wernicke’s aphasia (i.e., impaired speech production and comprehension, respectively), even when Broca’s and Wernicke’s areas are left intact by stroke (Stuss et al., 1986; Alexander et al., 1987; overview in Lieberman and McCarthy, 2015). Further, Chrabaszcz et al. (2019) observed significant increases in high-gamma power activity in the subthalamic nucleus (as well as in the sensorimotor cortex) in Parkinsonian patients preparatory to speech production and persisting throughout articulation durations.

Intriguingly, basal ganglion circuitry so implicated also includes the ventromedial prefrontal cortex and Broca’s area—areas classically associated with the regulation of spoken language (Lieberman, 2000). Tellingly, Dronkers et al. (2007) have observed subcortical damage to the BG in Paul Broca’s classic case study, on the patient “Tan,” whose symptoms have traditionally been attributed to damage to Broca’s area (Brodmann areas 44,45; Broca, 1861). Patients presenting with damage to cortical but not subcortical areas may often recover from the injury (Alexander et al., 1987), whereas this is not true of patients presenting with damage to subcortical regions. Finally, various prefrontal cortical areas implicated in speech-centric behavior—including the medial and lateral premotor cortices—project to the BG (Alexander et al., 1987; Cummings, 1993; Guenther, 2006); various prefrontal regions have also been found to be sites of projection from the BG (Middleton and Strick, 2002), further cementing the importance of subcortical circuitry for speech-centric behavior. The related role of the cerebellum in human speech production, meanwhile, appears to be facilitation of temporal organization of speech into smooth rhythmic utterances, as well as prearticulatory organization; this has been outlined by Ackermann (2008).³

The rhythmic motor behavior underlying speech, finally, is supported by central pattern generators, clusters of neurons facilitating predictable rhythmic outputs (Grillner and Wallen, 1985; Grillner et al., 1995), coopted in development for speech from suckling and mastication (Lund and Kolta, 2006; Barlow et al., 2010). From comparative and evolutionary perspectives, activity of the basal ganglion motor loop observed in speech is believed to be analogous to similar circuitry underlying song behavior in songbirds (Jarvis, 2004; Ackermann, 2008). Thus, while a traditional neurolinguistics framework may consider Broca’s and Wernicke’s areas as brain regions central to speech, over the last few decades, a new model of the neurological control of speech has emerged, emphasizing the role of the BG in particular (Lieberman, 2000, 2012; Murdoch, 2001, 2009; Wildgruber et al., 2001; Ma and Suga, 2003; Radanovic and Scaff, 2003; Dronkers et al., 2007; Enard, 2011; Reimers-Kipping et al., 2011; Archakov et al., 2020; Chien et al., 2020; an extensive summary of research on the neural control of speech has been presented by Guenther, 2016).

Structure of the basal ganglia and dopaminergic pathways

Neural substrates of motor learning, and the mesencephalic DA system that underlies it, are highly conserved across the animal kingdom (Smeets et al., 2000; Person et al., 2008; Grillner and Robertson, 2016). While differing significantly in terms of anatomical structures,⁴ there is widespread continuity in the brains of songbirds and mammals as relating to organization at the level of circuitry (Reiner et al., 2004), including the BG and associated dopaminergic circuitry (Person et al., 2008; Goldberg et al., 2010), allowing for cross-species comparisons (Doupe et al., 2005; Gale and Perkel, 2010; Fee and Goldberg, 2011; Wood, 2021). Grillner and Robertson (2016, 1095) point out that in primates, “the size of the basal ganglia has expanded to a very large structure […] with the striatum being subdivided in several compartments linked to the control of different patterns of behavior.” The authors explain the expansion of the BG as having taken place in parallel with the more general expansion in complexity of the primate behavioral repertoire. In humans, the dorsal striatum can be subdivided into caudate nucleus and putamen, and again into striosomes, where spiny striatal projection neurons inhibit DA neuron activity (part of the basal ganglion value-based decision-making circuitry); and matrisomes, participating in movement control (Gerfen, 1992; Stephenson-Jones et al., 2013). The division between striosomes and matrisomes is found in both humans and birds (Holt et al., 1997; Garcia-Calero et al., 2013), again suggesting an ancient evolutionary adaptation, and a crucial function of the BG.

The BG is implicated in a range of behaviors, including selection of behavior, motor learning, and control of DA neuron activity and value-based decisions (Wise, 2004). The varied function of DA neurons (reviewed in Alm, 2021; see also Wood, 2021) includes the encoding of subjective goals, the initiation and preparation of movement, and instantiation of memory traces, including motor learning. In the midbrain, two nuclei—the substantia nigra pars compacta and ventral tegmental area (VTA)—are the primary producers of DA. A pathway from the VTA projects DA to the sensorimotor cortex, supplementary motor area, and dorsal premotor cortex—likely crucial for motor learning in the motor cortex (Molina-Luna et al., 2009). The primary nucleus of dopaminergic input to the BG is the striatum (Tepper et al., 2007), which also receives input from the cerebral cortex and projects to frontal lobe and brain stem nuclei (Coddington and Dudman, 2019; Klaus et al., 2019). Striatal DA release has been observed in both implicit and explicit motor performance and memory (Badgaiyan et al., 2008). Such DA neuron control is phasic, with increased activity in the presence of rewards (and decreased activity when an expected reward fails to be delivered; Howe et al., 2013), or when initiating locomotor activity (Jin and Costa, 2015). Brainstem-mediated plasticity also appears to be subject to cultural influence, with native speakers of Mandarin—a tonal language—exhibiting greater frequency-following ensemble responses to pitch contours of lexical tones, compared with native English speakers (Krishnan et al., 2005; see also Wong et al., 2009).

Fee and Goldberg (2011) proposed a common reinforcement learning mechanism underlying motor sequence learning in mammals and song learning in songbirds, based on a reward prediction biasing procedure, encompassing a BG-thalamocortical loop. Related BG circuits also contribute to the generation of variability in vocal exploration, necessary for normal mapping of song (Leblois et al., 2010). In juvenile songbirds, lesions to deep cerebellar nuclei impede song learning, with more substantial lesions resulting in greater worsening of tutor imitation (Pidoux et al., 2018). Crucially, increased DA neuron activity also facilitates long-term potentiation, the increase in synaptic strength following recent activity, including in the cerebral cortex, and including motor movement (Bailey et al., 2000; Malenka and Bear, 2004; Wise, 2004; Hosp and Luft, 2013). In addition, recent work in neurogenetics indicates that DA-genotypic individual differences are determinants of linguistic development (“the dopamine hypothesis”; Wong et al., 2012). Namely, earlier-in-life bilingual proficiency is modulated by subcortical dopamine (while later-in-life proficiency is modulated by cortical dopamine; Vaughn et al., 2016; Vaughn and Hernandez, 2018). Overall, then, basal ganglion involvement in speech, and the observed role of DA in the innervation of speech-relevant neural architectures, further suggest that DA may also help guide the acquisition of speech (see also Alm, 2021).

Finally, recent work by Archakov et al. (2020) provides an important evolutionary complement. In their study, macaque monkeys were trained to produce sound sequences via physical manipulation of a specially designed “monkey piano.” In subsequent fMRI scans, the authors observed cortical motor area activation when subjects heard learned melodies; simultaneous activity was also observed in the putamen of the BG (see Rauschecker, 2012, 2018). Genetic analyses of the “humanized” Forkhead Box P2 (FOXP2) gene also indicate its substantive involvement in the development of BG-cortical networks involved in speech (as well as language more broadly; Enard, 2011; Reimers-Kipping et al., 2011), suggesting that mutations on the gene unique to the Homo genus contributed to the evolution of speech in ancestral hominids, as well as its proper development in modern humans (Nudel and Newbury, 2013).

Speech and dopamine: Some clinical observations

The role of DA in speech has typically been studied in clinical contexts; namely, speech pathologies and deficits exhibit comorbidity with conditions characterized by dopaminergic dysregulation. Evidence to this effect is available from both animal models—where DA-depleted laboratory rats (Rattus norvegicus domestica) present with decreased call bandwidth, and maximum frequency and intensity (Ciucci et al., 2009)—and clinical research on humans, typically patients diagnosed with Parkinson’s disease (PD) or stuttering. PD is characterized by gradual brain cell death and low or falling levels of DA. Accordingly, most PD patients present with some speech pathology, most commonly hypophonic and/or monotonous speech, resulting in an articulatory undershoot (see, e.g., Ho et al., 1998). In marked contrast, stuttering—the involuntary repetition of words or segments of words—may sometimes be driven by elevated DA activity (the “dopamine hypothesis of stuttering”; Wu et al., 1997; Maguire et al., 2012; but see Alm, 2004, 2021 for nuanced accounts). The depletion of DA, characteristic of PD, degrades the local operations of the BG (Jellinger, 1990), and speech motor control is subsequently degraded also (Lieberman et al., 1992). For example, in a relevant case study, Pickett et al. (1998) observed degraded articulatory gesture sequencing in a Parkinsonian patient.

Finally, bearing on medical conditions such as PD that typically involve pathological speech, the cognitive mapping of speech-centric motor constellations remains intact; but a speaker’s ability to navigate them is disordered due to dopaminergic dysregulation, the underlying circuitry of which would otherwise maintain its reach-and-grasp-like function. Thus, while much remains unknown concerning its role in governing speech abilities, current research does indicate a role for DA in the maintenance of speech capacities across the lifespan. Less yet is known about the role of DA in phonological production learning. Nevertheless, evidence from comparative animal studies and results from simulation now suggest that dopaminergic circuitry plays a critical role in the ontogenetic development of speech motor behaviors (Gale et al., 2008; Chen and Goldberg, 2020; Kearney, 2020).

From motor chunks to speech constellations

Neurologically, motor learning is facilitated by activity in the BG, parsing successful from unsuccessful motor behavior through comparisons with desired outcomes (Graybiel, 2005); and the cerebellum, continually adjusting fine-motor behavior (Paulin, 1993; Doya, 2000). Neurotransmission of DA significantly affects the encoding and strength of encoding of memory traces (Williams and Goldman-Rakic, 1995; Wise, 2004). In the broader context of motor learning, DA is known to contribute toward a range of behaviors. DA is crucial for enforcing associations between stimulus and subsequent rewards (Wise, 2004), and reward prediction errors are, accordingly, believed to be coordinated by the BG (Wickens et al., 2003; Schultz, 2013; Gadagkar et al., 2016). Molina-Luna et al. (2009) found that lesioning dopaminergic inputs to the motor cortex in rats impaired learning of motor skills, but not execution of previously learned motor skills. Further, Gardner et al. (2018) have argued that DA be conceptualized as signaling error in both sensory and reward prediction.
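The notion of a reward prediction error can be illustrated with a minimal sketch, assuming a simple delta-rule learner in which an expectation is nudged toward each observed outcome. The function name, the learning rate, and the reward schedule are illustrative assumptions, and the sketch is not drawn from any of the cited studies.

```python
def run_delta_rule(rewards, learning_rate=0.2, initial_value=0.0):
    """Track an expected reward and the prediction error on each trial."""
    value = initial_value
    errors = []
    for reward in rewards:
        error = reward - value          # prediction error: outcome minus expectation
        value += learning_rate * error  # expectation moves toward the outcome
        errors.append(error)
    return value, errors


# Hypothetical schedule: ten rewarded trials followed by one omitted reward.
final_value, errors = run_delta_rule([1.0] * 10 + [0.0])
```

Under these assumptions, the error is large and positive while reward is still unexpected, shrinks as the reward becomes predicted, and turns negative when an expected reward is omitted, which loosely parallels the phasic pattern described for DA neurons.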

Complex motor learning, underlying vocal learning, is contingent on sensory feedback (Schultz, 2007, 2013). Thus, in phonological mapping, the BG, through being part of the neural dopaminergic circuitry, likely provides the necessary emphasis for mapping a speech sound, once achieved, to its corresponding place in orosensory space, facilitating repetition across continuous interaction (Gale et al., 2008; Hoffmann et al., 2016). Simonyan et al. (2012) have previously suggested that the laryngeal motor cortex may be modulated by DA via its being part of the vocal BG circuitry. Neurologically, internally guided vocal explorative behavior and imitation are likely indeed enabled by common VTA-BG circuitry (Hisey et al., 2018) and guided via cortical-basal ganglion circuitry (Warren et al., 2011; Ali et al., 2013).

Work by Hoffmann et al. (2016) on vocal learning in Bengalese finches has demonstrated the role of dopaminergic inputs to the BG, such that lesions on Area X result in deficits in subjects’ vocal learning when auditory stimuli were accompanied by white noise. For explorative vocalization behavior, aspects of production corresponding to measurable acoustic outcomes (e.g., pitch, amplitude) may be controlled by separate neuronal ensembles (Sober et al., 2008). Based on their observations, Hoffmann et al. (2016, p. 2176) argued that vocal plasticity is selectively reinforced via dopaminergic inputs to the BG, mirroring an equivalent process in perception learning (Gale and Perkel, 2010). Similarly, in humans, imitation is also presumed to guide children’s acquisition of speech (Messum, 2008). Production itself is likely regulated via inputs from the cerebellum (Ackermann, 2008), as indicated by work on the song production pathways of zebra finches by Pidoux et al. (2018).

The cerebral DA network thus appears to provide a mechanism for the automatization of motor movement sequence “chunks”—that is, sequences composed from otherwise isolated movements—to be coordinated and executed in tandem, or in sequence (Marsden and Obeso, 1994; Alm, 2021). Basal ganglion–cerebellar dopaminergic circuitry thus provides the necessary emphasis for mapping a song component or fragment, once achieved, to its corresponding motor activity constellation in syringeal–orosensory space, enabling replicated matching over repeated vocalizations across time (see Gale et al., 2008).⁵ Thus, it is here supposed that generalized mechanisms have evolved convergently for the mapping of constellations of motor activity in domains of mouth and larynx (in mammals) or syrinx (in songbirds), to the bounded auditory outputs to which their innervation corresponds.

Motor constellation theory

The purpose of the present text was to indicate the biological underpinnings of infants’ phonological mapping. To this goal, the motor constellation theory of phonological development (MC) was presented. The theory posits that human infants are born with the instinct to explore orosensory space through tactile sensory motor behavioral and auditory feedback. Babbling is the result of such successful exploration, giving rise to emergent pseudo-segmental phonetic properties. Continuous perceptual-motor mapping facilitates the acquisition of language-specific phonemic repertoires, and gives rise to phonemes proper, defined as discrete target positions in cognitive–orosensory space. Babble is thus gradually replaced by elective values in sound space, selected via interaction with ingroup members, enforced and reinforced via cerebellar–basal ganglion circuitry for dopaminergic signaling, which instantiates encoding of combinations of motor sensory and auditory perceptual features, and provides the necessary mechanism by which speech sounds are mapped onto corresponding laryngeal–orosensory motor activity constellations. Once achieved, any reinforced combinatory pattern becomes more easily repeatable through continuous reinstatement (see Figure 2). Continuous and ritualized reuse of a given constellation of motor coordinates leads to the formation and memorization of phonetic concepts in memory; motor constellations thus become the roadmaps by which a phonetic concept is explored, learned, mapped, and maintained across time in the individual speaker.
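The idea of orosensory “reaching” guided by auditory feedback can be loosely illustrated as follows. This is a minimal sketch under strong simplifying assumptions: the two motor parameters, the linear motor-to-sound map in `produce_sound`, the target values, and the trial-and-retain rule are all hypothetical stand-ins and make no claim about actual vocal tract acoustics or neural implementation.

```python
import random


def produce_sound(motor):
    """Hypothetical motor-to-auditory map; a stand-in for vocal tract acoustics."""
    jaw, tongue = motor
    return (2.0 * jaw + 0.5 * tongue, 0.5 * jaw + 1.5 * tongue)


def auditory_error(sound, target):
    """Squared distance between a produced sound and the ambient target."""
    return sum((s - t) ** 2 for s, t in zip(sound, target))


def explore(target, trials=500, step=0.1, seed=0):
    """Perturb motor settings at random; retain a perturbation only if it sounds closer."""
    rng = random.Random(seed)
    motor = [0.0, 0.0]
    best = auditory_error(produce_sound(motor), target)
    for _ in range(trials):
        candidate = [m + rng.gauss(0.0, step) for m in motor]
        error = auditory_error(produce_sound(candidate), target)
        if error < best:  # a successful attempt is retained (reinforced)
            motor, best = candidate, error
    return motor, best


target_sound = (1.0, 1.0)  # hypothetical auditory target
initial_error = auditory_error(produce_sound([0.0, 0.0]), target_sound)
motor, final_error = explore(target_sound)
```

In this toy setting, the retained motor settings play the role of a constellation: a point in motor space that has been reached through exploration and that reliably yields the targeted sound.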

Figure 2. Motor constellation theory: A sketch of the proposed model.

Some considerations for modeling

The dopaminergic innervation of speech behavior thus proposed, we next seek to model—and ultimately to simulate—phonological production development. Vocal learning is (at least in part) intrinsically motivated, as is evident from anthropological evidence that infants learn to speak normally even in cultures where they are rarely if ever addressed directly (Ochs and Schieffelin, 2009); observations of songbirds’ song learning (Marler, 1970); and simulation and modeling approaches (e.g., Chen and Goldberg, 2020). In his work on birdsong, Marler (1970, 670) speculated that “the process of vocal imitation may prove to be essentially self-reinforcing in the cases both of juvenile birds and infant humans and thus basically be independent of reward by the parent.”

Researchers investigating song learning have also previously hypothesized the importance of motor exploration. It was first noted by Metfessel (1935) that domestic canaries (Serinus canaria domestica) learn to sing through a process of improvisation, and that this process still occurs even in the absence of external referent sources. Later work showed how the same species can also learn by imitation (Poulsen, 1959; Marler and Waser, 1977; see also Nottebohm et al., 1986). Even in adulthood (some) songbirds are capable of adaptive fundamental frequency shift in vocalization, shifting the fundamental frequency of some targeted portion of a song to avoid disruption, consistent with some degree of flexibility across the lifespan (Tumer and Brainard, 2007). While DA has traditionally been studied in the context of reinforcement learning (trial-and-error based environmental sampling with the goal of attaining maximum value; see Wood, 2021), complex motor behaviors such as song (and therefore, possibly also speech) likely involve the utilization of multiple simultaneous learning strategies and mechanisms (Guenther, 2016; Krakauer et al., 2019; Wood, 2021).

Human infants’ imitative vocalizations are seemingly guided by memorized phonological patterns (Fry, 1966; Kuhl and Meltzoff, 1996), and phonological production learning likely represents such a case of simultaneous model-based and model-free reinforcement learning, where prior motor-sound equivalence experience helps guide increasingly sophisticated attempts at matching own-speech output with that observed prior; that is, learning by reference to sensory-prediction error. Constellations thus enforced become more easily reachable across future interactions via Hebbian learning, the strengthening of synaptic connections via repeated signaling activity (Hebb, 1949; Marsden and Obeso, 1994; Gale et al., 2008; Hoffmann et al., 2016; Wood, 2021). Indeed, even in adults, greater white matter content predicts faster phonetic learning (Golestani et al., 2002). Because of concerns both ethical and methodological, however, the hypothesis here presented is not available to direct investigation. Modern neuroscientific tools are not yet sophisticated enough to track dopaminergic flow non-invasively, a problem multiplied when subjects are non-verbal and unable to consent to experiment procedures.
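The Hebbian strengthening referred to above can be sketched in a few lines. The example assumes a plain outer-product rule with an optional multiplicative gate standing in for a reinforcement signal; the unit counts, the learning rate, and the `gate` parameter are illustrative assumptions rather than quantities taken from the cited work.

```python
def hebbian_update(weights, pre, post, learning_rate=0.1, gate=1.0):
    """Return weights after one step: dw[i][j] = learning_rate * gate * post[i] * pre[j]."""
    return [
        [w + learning_rate * gate * post_i * pre_j for w, pre_j in zip(row, pre)]
        for row, post_i in zip(weights, post)
    ]


# Hypothetical example: two motor units (pre) and two auditory units (post).
weights = [[0.0, 0.0], [0.0, 0.0]]
motor = [1.0, 0.0]     # only the first motor unit is active
auditory = [0.0, 1.0]  # only the second auditory unit is active
for _ in range(5):
    weights = hebbian_update(weights, motor, auditory)

# With the gate closed (no reinforcement), co-activity leaves the weights unchanged.
ungated = hebbian_update(weights, motor, auditory, gate=0.0)
```

Only the connection between the co-active pair grows with repetition. The rule as written is unbounded; a fuller treatment would add normalization or decay, which is omitted here for brevity.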

The implications discussed do, however, open up new avenues for computational and simulation modeling (Lindblom, 2000; Guenther and Vladusich, 2012). In particular, one promising novel avenue for future modeling work is that of actor-critic methods, where an actor is synonymous with policy—the appropriate action given a certain state—and critic corresponds to a value function—the estimated return from committing to a policy (see Konda and Tsitsiklis, 2003). Chen and Goldberg (2020) have recently presented an actor-critic reinforcement model of song learning in songbirds. The authors suggest that both note correctness and quality, unexpectedly achieved in improvised vocalization, trigger DA neuron activation. Additionally, Kearney (2020) has also presented results of actor-critic simulations of song learning, showing that (1) disruption of midbrain DA circuit input (“actors”) at the moment of auditory feedback impairs learning, as does (2) disruption of downstream premotor region activity at early preparatory stages of vocalization (see also Gale et al., 2008; Gale and Perkel, 2010). To the knowledge of the author, no actor-critic model yet presented has attempted to simulate infants’ phonological development. Nevertheless, these promising early results merit further exploration, and application to vocal learning in human infants also.
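For readers unfamiliar with actor-critic methods, the division of labor between actor and critic can be sketched as follows. This is a deliberately minimal, single-state example and is not the model of Chen and Goldberg (2020) or Kearney (2020): the four discrete “actions” stand in for hypothetical candidate motor configurations, the binary reward stands in for an auditory match, and all names and parameter values are assumptions made for illustration.

```python
import math
import random


def softmax(preferences):
    """Convert action preferences into selection probabilities."""
    peak = max(preferences)
    exps = [math.exp(p - peak) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]


def train_actor_critic(n_actions=4, target=2, trials=2000,
                       actor_rate=0.5, critic_rate=0.1, seed=0):
    """Learn which of n_actions (hypothetical motor configurations) hits the target."""
    rng = random.Random(seed)
    preferences = [0.0] * n_actions  # actor: one preference per candidate action
    value = 0.0                      # critic: expected reward under the current policy
    for _ in range(trials):
        probs = softmax(preferences)
        action = rng.choices(range(n_actions), weights=probs)[0]
        reward = 1.0 if action == target else 0.0  # stand-in for an auditory match
        td_error = reward - value                  # critic's evaluation of the outcome
        value += critic_rate * td_error
        for a in range(n_actions):
            indicator = 1.0 if a == action else 0.0
            preferences[a] += actor_rate * td_error * (indicator - probs[a])
    return softmax(preferences), value


policy, value = train_actor_critic()
```

The critic's error term both updates its own estimate and scales the actor's policy update, so actions that turn out better than expected become more probable. A model of phonological development would additionally require states, continuous articulatory actions, and an acoustic model, none of which are attempted here.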

Some considerations for clinical practice

Motor constellation theory also has important implications for understanding early-in-life speech pathologies, such as stuttering. DA functioning is indeed highly implicated in stuttering behavior (Wu et al., 1997; Alm, 2004; Maguire et al., 2012). While the exact nature of the relationship is not certain, results of various interventions have pointed to worsened stuttering following treatment with DA agonists (e.g., Levodopa; Anderson et al., 1999) and lessened stuttering following treatment with DA antagonists, often interpreted as evidence that an excess of DA drives stuttering (e.g., Rosenberger et al., 1976; for an overview, see Maguire et al., 2020; but see also Alm, 2021). The relationship is further complicated by a variety of individual variables. For example, genotypical makeup likely plays a determinant role in the development of the condition, as is evident from twin studies (Yairi and Ambrose, 2013) and genetics research (Montag et al., 2012). However, while children identified as carrying genotypic traits associated with greater levels of DA exhibit higher levels of linguistic proficiency (Wong et al., 2012; Vaughn and Hernandez, 2018), it is as yet not known whether children exhibiting stuttering (or other speech disorders) can be similarly characterized (though results of twin studies point to this being so). Future work should aim to address this issue.

Finally, Ashby and colleagues (Ashby et al., 2010; Hélie et al., 2015) have proposed that BG serve to ritualize motor sequences, such that once learned they can be executed without direct BG involvement (BG may still be central to execution during early developmental periods; the “Ashby model”). That is, the role of DA in speech mapping and maintenance is likely inconsistent, changing significantly across the lifespan, with DA release in the BG affecting vigor (but not motor sequence initiation) later in life. Stuttering disfluencies also vary significantly with situational variables, with more demanding speech situations causing greater stuttering (Craig, 1990; Perkins et al., 1991; Alm, 2014), again suggesting an effect of higher cognition. As a framework of phonological development, MC is consistent with these views. Assuming DA-innervated reuse of motor constellations in early life, childhood stuttering may result from dysregulated DA innervation of ritualized constellations.

Concluding comments

Motor constellation theory sidesteps common theoretical misgivings in the construction of theories of language acquisition postulated post hoc based on observed data (Chapman, 2000; Lindblom, 2000). It presents researchers with an account of phonological development that (1) assimilates observations of human early speech acquisition, (2) is rooted in principles of the natural sciences and neuroscience underlying motor learning, and (3) affords integration with phonetic, neuropsychological, and evolutionary sciences. Finally, while empirical testing in human infants—due to technological limitations of contemporary brain imaging techniques, as well as ethical considerations—may not be feasible, MC affords both computational modeling and simulation approaches, and has additional implications for clinical work. It is the hope of the author that the present text helps guide such efforts in the future.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.

Author contributions

The author confirms being the sole contributor of this work and has approved it for publication.

Acknowledgments

The author gratefully acknowledges Björn Lindblom (Stockholm University) and Per Alm (Uppsala University) for comments on an earlier version of the manuscript. This work is dedicated to the memory of Professor Philip Lieberman (1934–2022).

Conflict of interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Footnotes

1. ^Note that as tonal elements are delineated by changes in f₀, the trajectory of tone acquisition outlined by Wong and Strange (2017) involves laryngeal, as opposed to supralaryngeal development.

2. ^In this context it is worth noting that the degree to which the organization of the songbird brain parallels that of humans (and other mammals) is subject to extensive, as of yet unsettled debate (Reiner et al., 2004; Petkov and Jarvis, 2012; Olkowicz et al., 2016; Prather et al., 2017).

3. ^For language learning (as well as phonological learning), the "Procedural/Declarative" model of Ullman (2001) similarly argues for a role of the BG in ordering mental grammar.

4. ^Aves lack the mammalian prefrontal cortex, but seemingly possess a functionally comparable structure in the nidopallium caudolaterale (see Güntürkün, 2005).

5. ^It is not here suggested, then, that songbirds’ mapping of song fragments is in any way equivalent to human language grammar (though such arguments have been made elsewhere; e.g., Abe and Watanabe, 2011).

References

Abe, K., and Watanabe, D. (2011). Songbirds possess the spontaneous ability to discriminate syntactic rules. Nat. Neurosci. 14, 1067–1074. doi: 10.1038/nn.2869

Ackermann, H. (2008). Cerebellar contributions to speech production and speech perception: psycholinguistic and neurobiological perspectives. Trends Neurosci. 31, 265–272. doi: 10.1016/j.tins.2008.02.011

Ackermann, H., and Ziegler, W. (2010). Brain mechanisms underlying speech motor control. Handbook Phonet. Sci. 2, 202–250. doi: 10.1002/9781444317251.ch6