Multimodal cues in L2 lexical tone acquisition: current research and future directions

Farran, Bashar M.; Morett, Laura M.

doi:10.3389/feduc.2024.1410795

MINI REVIEW article

Front. Educ., 24 July 2024

Sec. Language, Culture and Diversity

Volume 9 - 2024 | https://doi.org/10.3389/feduc.2024.1410795

This article is part of the Research Topic: Tonal Language Processing and Acquisition in Native and Non-native Speakers

Bashar M. Farran^*

Laura M. Morett

Department of Speech, Language and Hearing Sciences, University of Missouri, Columbia, MO, United States

This review discusses the effectiveness of visual and haptic cues for second language (L2) lexical tone acquisition, with a special focus on observation and production of hand gestures. It explains how these cues can facilitate initial acquisition of L2 lexical tones via multimodal depictions of pitch. In doing so, it provides recommendations for incorporation of multimodal cues into L2 lexical tone pedagogy.

1 Introduction

Imagine a language where the meaning of a word hinges on its pitch. This is the reality in tonal languages, where pitches, not just phonemes, determine word meaning. Most world languages, including Mandarin Chinese, Vietnamese, Thai, Yorùbá, and various other African languages, are tonal (Maddieson, 2013). While mastery of tonal first languages (L1s) comes naturally, second language (L2) learning of tonal languages entails a unique challenge, particularly for learners whose L1 is atonal (Wang et al., 2006, 2020).

L2 acquisition of lexical tones encompasses both perception and production. Although perception often precedes production in L2 lexical tone acquisition (Wang et al., 1999), the relationship between them is not always straightforward: improvements in perception do not necessarily entail improvements in production, and vice versa (Leather, 2011). L2 lexical tone acquisition involves perception of not only auditory cues but also visual and haptic cues such as hand gestures (Gullberg, 2006). The importance of these multimodal cues in facilitating L2 lexical tone perception and production has gained increasing recognition (McCafferty, 2004; Hostetter, 2011; Lewis and Kirkhart, 2022; Zhang et al., 2023). Multisensory learning, which integrates multiple sensory modalities, is more effective than unisensory approaches because the brain is optimized for multisensory environments, suggesting that L2 lexical tone pedagogy could be enhanced by incorporating such approaches (Shams and Seitz, 2008). Macedonia and Kepler (2013) argue that incorporating pedagogical approaches informed by neuroscience findings into L2 instruction can significantly enhance learning via a three-pronged approach: (1) utilizing multisensory experiences for vocabulary acquisition, (2) incorporating imitation exercises to leverage mirror neurons for pronunciation training, and (3) tailoring instruction to brain development stages for optimal grammar and pronunciation outcomes. Moreover, multisensory cues enhance learning outcomes by supporting content comprehension (Dick et al., 2009). Understanding how nonverbal cues enhance auditory representations can shed light on how multimodal approaches can be leveraged to facilitate acquisition of an unfamiliar tonal L2 (Yip, 2002; Liu et al., 2022).

2 Auditory training methods

Cognitively, tonal languages require awareness of pitch, which permits discrimination, identification, and manipulation of lexical tones. In the intricate acoustic signal of speech, multiple cues such as formant frequencies, amplitude, and temporal information coexist with pitch contours. Thus, tonal language comprehension entails selective attention to pitch cues in conjunction with suppression of other acoustic information (Huang and Johnson, 2011). This selective attention to pitch cues is shaped by experience with lexical tone. Moreover, pitch perception in tonal languages goes beyond recognizing static pitch levels as it entails tracking rapid pitch movements and complex tonal contours over time (Gandour, 1983; Xie and Myers, 2015). Thus, processing of pitch within the speech stream is critical to L2 lexical tone acquisition (Jasmin et al., 2020).

Neurologically, the ability to selectively focus on pitch involves specialized mechanisms shaped by tonal language experience (Gandour et al., 2003; Xu et al., 2006). Lexical tone processing involves both subcortical and cortical structures (Gandour and Krishnan, 2016). Initially, L2 lexical tone processing is predominantly handled by the right hemisphere or bilaterally, but with increased exposure, it becomes more left lateralized and akin to L1 processing (Gandour et al., 2004; Wang et al., 2004; Gandour, 2006; Xi et al., 2010; Kaan et al., 2013).

Considering the cognitive and neurological complexities of lexical tone processing, auditory methods have been developed to facilitate L2 lexical tone learning. These methods include discrimination training, categorization training, and auditory corrective feedback.

Discrimination training involves exposure to contrasting pairs of tones, followed by testing in which learners judge whether two tones are the same or different. For example, má and mà could be presented consecutively in training, and discrimination between the rising and falling tones could then be tested by determining whether chó and chò are perceived as the same or different. Discrimination tasks are perceptual, involving the discernment of differences in pitch contours and other acoustic cues. Discrimination training leads to significant improvements in perception of differences between lexical tones (Wang et al., 1999; Wayland and Guion, 2004; Hao, 2012).
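As a rough illustration of how same/different trials of this kind might be assembled and scored, consider the following minimal sketch. The tone inventory (Mandarin tones 1 to 4), the function names, and the trial format are assumptions made for illustration; they are not drawn from any of the studies cited here.

```python
import itertools
import random

# Hypothetical tone inventory: Mandarin tone numbers 1-4 (an assumption for illustration).
TONES = (1, 2, 3, 4)


def make_ax_trials(syllables, tones=TONES, seed=0):
    """Build same/different trials, each pairing two tones on one syllable."""
    rng = random.Random(seed)
    trials = []
    for syllable in syllables:
        # "Different" trials: every ordered pair of distinct tones.
        for first, second in itertools.permutations(tones, 2):
            trials.append(
                {"syllable": syllable, "a": first, "b": second, "answer": "different"}
            )
        # "Same" trials: each tone paired with itself.
        for tone in tones:
            trials.append(
                {"syllable": syllable, "a": tone, "b": tone, "answer": "same"}
            )
    rng.shuffle(trials)
    return trials


def score(trials, responses):
    """Proportion of trials on which the response matches the expected answer."""
    correct = sum(t["answer"] == r for t, r in zip(trials, responses))
    return correct / len(trials)


# Train on one syllable and test on another, as in the má/mà and chó/chò example.
training_trials = make_ax_trials(["ma"])
test_trials = make_ax_trials(["cho"], seed=1)
```

In this sketch, "different" trials outnumber "same" trials, so an actual study would likely balance the two trial types before presentation.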

Categorization training involves exposure to labeled tones and subsequent testing via labeling of unlabeled tones. For example, the tones in má and mà could be labeled as rising and falling in training, and categorization could then be tested by asking learners to label unlabeled tokens of má and mà as rising or falling. Thus, categorization (also called identification) tasks draw on memory as well as perception because they require mapping acoustic features of lexical tones onto their representations. Categorization training improves L2 lexical tone identification, particularly in the early stages of acquisition, but may not be sufficient for accurate production (Leather, 1990; Wang et al., 2003; Duanmu, 2007; Ladefoged and Johnson, 2015).
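One simple way to summarize performance on a labeling test of this kind is to tabulate which tone labels are confused with which. The sketch below assumes a hypothetical data format (parallel lists of target and response labels) and hypothetical function names; it is meant only to illustrate the scoring logic.

```python
from collections import Counter


def confusion_counts(targets, responses):
    """Count how often each target tone label received each response label."""
    return Counter(zip(targets, responses))


def accuracy_by_tone(targets, responses):
    """Proportion of correct labels for each target tone."""
    totals = Counter(targets)
    hits = Counter(t for t, r in zip(targets, responses) if t == r)
    return {tone: hits[tone] / totals[tone] for tone in totals}


# Hypothetical responses from one learner on four test items.
targets = ["rising", "falling", "rising", "falling"]
responses = ["rising", "rising", "rising", "falling"]

per_tone = accuracy_by_tone(targets, responses)
confusions = confusion_counts(targets, responses)
```

Here the learner labels every rising tone correctly but labels one of the two falling tones as rising, a pattern that per-tone accuracy and confusion counts make visible where a single overall score would not.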

The distinction between discrimination and categorization is significant because discrimination can precede categorization in L2 lexical tone acquisition. However, discrimination and categorization are related; thus, they can support one another. Understanding the relationship between discrimination and categorization is essential for designing effective language learning materials, speech recognition systems, and other natural language processing applications for tonal languages.

Discrimination and categorization training based on a small set of stimuli in experimental tasks may not fully capture the natural variations of lexical tones in everyday speech. This limitation contributed to the emergence of High Variability Perception Training (HVPT) in lexical tone learning tasks. This training entails exposure to lexical tones within varying linguistic contexts or produced by multiple speakers in the interest of more closely approximating the natural variability encountered in real-life tonal language processing (Lively et al., 1994; Pisoni and Lively, 1995). HVPT improves both perception and production of L2 lexical tones as it enhances generalization across different contexts and speakers (Guion et al., 2000; Wang et al., 2003). This approach emphasizes the importance of exposure to diverse linguistic input to achieve more robust language learning outcomes.
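The logic of a high-variability design can be sketched as a full crossing of speakers, syllables, and tones, with one speaker withheld from training so that generalization to an unfamiliar voice can be assessed. The speaker labels, syllables, and function name below are hypothetical, and the held-out-speaker test is one common design choice rather than a requirement of HVPT.

```python
import itertools
import random


def build_hvpt_sets(speakers, syllables, tones, held_out_speaker, seed=0):
    """Cross speakers x syllables x tones; reserve one speaker for a generalization test."""
    if held_out_speaker not in speakers:
        raise ValueError("held_out_speaker must be one of speakers")
    rng = random.Random(seed)
    items = [
        {"speaker": speaker, "syllable": syllable, "tone": tone}
        for speaker, syllable, tone in itertools.product(speakers, syllables, tones)
    ]
    # Training uses every speaker except the held-out one.
    training = [item for item in items if item["speaker"] != held_out_speaker]
    # The generalization test uses only the unfamiliar speaker.
    generalization = [item for item in items if item["speaker"] == held_out_speaker]
    rng.shuffle(training)
    rng.shuffle(generalization)
    return training, generalization


# Hypothetical design: three speakers, two syllables, four tones.
training, generalization = build_hvpt_sets(
    speakers=["speaker_1", "speaker_2", "speaker_3"],
    syllables=["ma", "ba"],
    tones=[1, 2, 3, 4],
    held_out_speaker="speaker_3",
)
```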

Auditory corrective feedback may consist of recasts, in which the correct tone is heard in response to incorrect tone production; contrastive feedback, which highlights the difference between attempted and correct pronunciation; and explicit feedback, which provides verbal explanations of errors and correction techniques (Lee and Lyster, 2016; Saito, 2021). The effectiveness of auditory corrective feedback relies upon perception as well as memory because differences between incorrect and correct tones must be perceived and remembered to produce them correctly. Auditory corrective feedback improves L2 lexical tone production accuracy by highlighting errors and modeling correct pronunciation (Bryfonski and Ma, 2020).

While auditory methods have been a mainstay in L2 lexical tone acquisition, they have limitations stemming from challenges inherent in relying solely on auditory input and feedback. Furthermore, L1 background and the L2 tone system may limit the effectiveness of auditory methods.

3 Visual cues

Visual cues can be powerful tools for enhancing L2 lexical tone acquisition. One approach utilizes static visual depictions of lexical tone pitch contours (Figure 1). These depictions, which may consist of lines, graphs, or color-coded charts, visually represent fundamental frequency (F0) variations characterizing tones (Godfroid et al., 2017). Such visual depictions facilitate understanding of lexical tone contours (Zhou and Olson, 2023), as evidenced by enhanced perception of lexical tones cross-linguistically (Burnham et al., 2022). Moreover, visual depictions of pitch contours improve categorization of L2 lexical tones compared to auditory input alone (Chun et al., 2012).

Figure 1. Images of pitch contours of Mandarin lexical tones.
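Schematic contours of this kind can be approximated from Chao tone numerals, which describe pitch on a five-level scale (1 = lowest, 5 = highest). The sketch below uses the values conventionally given for the four Mandarin citation tones (55, 35, 214, 51) and draws each contour as rows of text. The function names and the text rendering are illustrative assumptions, not a reproduction of the figure.

```python
# Chao tone numerals conventionally given for the four Mandarin citation tones.
CHAO_CONTOURS = {
    "Tone 1 (high level)": [5, 5],
    "Tone 2 (rising)": [3, 5],
    "Tone 3 (dipping)": [2, 1, 4],
    "Tone 4 (falling)": [5, 1],
}


def interpolate(points, n_steps=9):
    """Linearly interpolate a contour's anchor points onto n_steps time steps."""
    if n_steps < 2 or len(points) < 2:
        raise ValueError("need at least two anchor points and two time steps")
    last = len(points) - 1
    contour = []
    for i in range(n_steps):
        position = i * last / (n_steps - 1)  # fractional index into points
        left = min(int(position), last - 1)
        fraction = position - left
        contour.append(points[left] + fraction * (points[left + 1] - points[left]))
    return contour


def render(contour, levels=5):
    """Draw a contour as text rows, with the highest pitch level on the top row."""
    rows = []
    for level in range(levels, 0, -1):
        rows.append("".join("*" if round(value) == level else " " for value in contour))
    return "\n".join(rows)


if __name__ == "__main__":
    for name, points in CHAO_CONTOURS.items():
        print(name)
        print(render(interpolate(points)))
        print()
```

Placing higher pitch levels on higher rows keeps the depiction consistent with the intuition that high pitch is "up" and low pitch is "down".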

Building upon the benefits of visual depictions of pitch contours, another approach leverages pitch gestures to enhance L2 lexical tone learning. Also known as tone gestures or tone-bearing gestures, pitch gestures are hand or body movements that visually convey the pitch (fundamental frequency) patterns of words or syllables (Morett and Chang, 2015; Figure 2). Pitch gestures spontaneously occur in conjunction with tonal languages (Krahmer and Swerts, 2007) and are often produced with the hands or head but may also include eyebrow movements or body posture changes corresponding with tones (Antoniou and Chin, 2018; Lacombe et al., 2022).

Figure 2. Pitch gestures for Mandarin lexical tones.

Observing pitch gestures enhances perception and production of L2 lexical tones. Observing eye movements, head movements, and hand gestures conveying pitch contours enhances understanding and pronunciation of L2 Mandarin tones (Chen and Massaro, 2008). Additionally, observing pitch gestures positively impacts discrimination between L2 Mandarin words differing in lexical tone (Morett and Chang, 2015; Morett, 2023).

Visual cues such as observed pitch gestures provide tangible depictions of lexical tones that strengthen their mental representations during encoding and retrieval, enhancing perception and memory of the tones. Visual cues also offer support when auditory processing is impaired or exposure to tonal languages is limited.

While observing pitch gestures supports L2 lexical tone perception and production, relying solely on visual input may entail limitations. Visual depictions alone may not fully capture the richness and complexity of tonal variation, leading to incomplete or oversimplified learning outcomes. Additionally, visual depictions of lexical tones may encourage dependence on visual cues, neglecting development of auditory perception skills necessary for real-world communication. For example, use of visual input alone for L2 Mandarin tone learning results in lower perception accuracy than use of both visual and auditory input (Jiang, 2017). Therefore, integrating visual cues with input from audition and other modalities may yield superior learning outcomes.

Theories providing explanations for the effects of visual cues on L2 lexical tone acquisition include dual coding theory and multimedia learning theory. Dual coding theory posits that information can be processed via both auditory (verbal) and visual (non-verbal) channels (Paivio, 1991, 2014a), each of which has strengths and weaknesses. Visual cues excel at conveying spatial information and relationships, while verbal cues are better suited for conveying linear sequences and abstract concepts. When visual and verbal cues occur together, tones can be processed via both the auditory and visual channels simultaneously. The resulting multimodal representations enhance encoding, storage, and retrieval of L2 lexical tones, improving their acquisition (Paivio, 2014b).

Multimedia learning theory emphasizes the importance of using multiple modes of representation to facilitate learning. According to this theory, combining different modalities (e.g., auditory, visual) optimizes learning outcomes and improves comprehension and retention of material (Mayer, 2005, 2009; Gullberg, 2022). The theory posits that learning is an active process that entails building connections between information presented in different modalities. Like dual coding theory, multimedia learning theory maintains that presenting corresponding verbal and visual information simultaneously can enhance learning. This process leads to deeper understanding, improved retention, and enhanced knowledge transfer and real-world application (Mayer and Moreno, 1998; Mayer, 2005, 2014). For L2 lexical tone acquisition, multimodal methods that combine auditory verbal input with visual representations of pitch contours are consistent with multimedia learning theory.

4 Haptic cues

Haptic approaches to L2 lexical tone learning involve the use of bodily movements to facilitate and reinforce production and perception of lexical tones. Haptic approaches posit that physical interaction with lexical tone can enhance its cognitive processing and memory retention. Examples of haptic approaches may include hand movements conveying tonal contours or tactile feedback corresponding to pitch changes. One promising haptic approach is gesture production, which entails enactment of specific hand or arm movements to convey lexical tones. This approach capitalizes on the close connection between speech production and bodily movements, as well as the benefit of haptic cues for language learning.

Pitch gesture production improves discrimination and production of L2 lexical tone (Hannah et al., 2017). More specifically, producing pitch gestures, rather than merely observing them, leads to better learning outcomes (Baills et al., 2019). Producing hand gestures in conjunction with lexical tone not only enhances production of lexical tone but also improves discernment of subtle tonal differences (Zheng et al., 2018; Li et al., 2020; Yu et al., 2024). This suggests that producing hand movements results in deeper understanding of tonal contrasts, enhancing L2 tone acquisition. From a neurological perspective, speech perception and production involve distributed neural networks that encompass not only auditory and motor cortices but also somatosensory and premotor areas (Guenther and Vladusich, 2012). This overlap suggests that haptic cues may recruit additional neural resources, resulting in enriched representations of lexical tones.

Despite their potential benefits, haptic approaches to L2 lexical tone acquisition may entail challenges. Firstly, the design and implementation of activities involving haptic cues require careful consideration. Appropriate gestures or movements must be selected and consistently mapped to lexical tones, ensuring that associations are intuitive and easy to remember. Secondly, explicit instruction and feedback may be necessary to ensure that lexical tones are conveyed accurately via haptic cues. Thirdly, cultural and contextual factors may influence the acceptability and effectiveness of learning approaches involving haptic cues.

Multimodal methods incorporating haptic cues align with the principles of embodied cognition, providing evidence that cognitive processes are grounded in sensorimotor experiences and interactions with the physical world (Lakoff and Johnson, 2017; Shapiro, 2019). Embodied cognition proposes that recruitment of multiple sensory modalities facilitates acquisition and representation of abstract concepts by activating relevant physical experiences via mental simulation. Mental simulation leads to a stronger connection between acoustic features of tone and embodied experience, fostering more accurate production and perception.

5 Integrated multimodal cues

Research has increasingly explored integration of multimodal cues in the auditory, visual, and haptic modalities to enhance perception and production of L2 lexical tone. This approach focuses on the synergistic effects of engaging multiple sensory channels via complementary sources of information and its reinforcement of the mapping between lexical tones and their depictions. Integration of multiple modalities engages a broad range of cognitive and sensory processes, resulting in effective learning. This enhances attention, memory, and engagement with content, leading to improved acquisition and retention of L2 lexical tone. Thus, integration of visual and haptic cues should enrich representations of lexical tone, enhancing categorization and differentiation of lexical tones. Visual and haptic cues should be consistent with the vertical conceptual metaphor of pitch, which posits that high pitch is associated with upward positions and motion and that low pitch is associated with downward positions and motion. Visual–auditory mappings aligned with this metaphor result in accurate and robust representations of L2 lexical tones (Morett et al., 2022).
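To make the vertical metaphor concrete, the following sketch maps fundamental frequency values onto a normalized vertical position, so that higher pitch always corresponds to a higher position in a display or a gesture trajectory. Converting Hz to semitones with 12 × log2(f / reference) is standard; the normalization to a speaker's pitch floor and ceiling, the function names, and the example values are assumptions made for illustration.

```python
import math


def hz_to_semitones(f0_hz, reference_hz):
    """Convert a frequency in Hz to semitones relative to a reference frequency."""
    return 12 * math.log2(f0_hz / reference_hz)


def vertical_positions(f0_track_hz, floor_hz, ceiling_hz):
    """Map each F0 sample to a height in [0, 1]: 0 = pitch floor, 1 = pitch ceiling."""
    span = hz_to_semitones(ceiling_hz, floor_hz)
    heights = []
    for f0 in f0_track_hz:
        height = hz_to_semitones(f0, floor_hz) / span
        # Clamp values that fall outside the assumed pitch range.
        heights.append(min(1.0, max(0.0, height)))
    return heights


# Hypothetical rising contour for a speaker whose range is assumed to be 100-200 Hz.
rising_track = [120.0, 135.0, 150.0, 170.0, 190.0]
heights = vertical_positions(rising_track, floor_hz=100.0, ceiling_hz=200.0)
```

Because the mapping is monotonic, a rising tone always yields an upward trajectory and a falling tone a downward one, which is the property the vertical metaphor requires. A semitone scale is used here because it approximates perceived pitch distance more closely than raw Hz.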

Multimodal approaches may help overcome the challenges associated with learning L2 Mandarin tones (Pelzl et al., 2022). Moreover, methods integrating visual and haptic cues are more effective than unimodal methods, highlighting the benefits of multimodality in facilitating L2 lexical tone acquisition (Godfroid et al., 2017). However, the effectiveness of multimodality may depend on several factors, such as the specific combination of modalities employed, the design and implementation of instructional materials, and prior tonal language experience. Although the factors discussed here provide explanations for the effectiveness of multimodal approaches, further research is needed to fully understand the underlying mechanisms and to optimize the design and implementation of multimodal instructional approaches to L2 lexical tone acquisition.

6 Discussion

Moving forward, insights from this review can inform development of strategies to enhance L2 tone acquisition. One strategy is to incorporate multimodal cues into existing curricula, leveraging techniques such as pitch gesture observation, pitch gesture production, and images of pitch contours to enhance L2 lexical tone acquisition. However, it is essential to critically evaluate existing instructional methods to determine their efficacy for both teachers and learners. To ensure maximum effectiveness, activities should convey lexical tone intuitively via the vertical conceptual metaphor of pitch.

Although existing research provides insight into how multimodal learning benefits L2 lexical tone acquisition, several topics warrant further investigation. Future research should determine the optimal combination of cues in different modalities by comparing their impacts on L2 lexical tone learning, as assessed via multiple measures. Additionally, research on the cognitive and neural correlates of lexical tone learning is needed to better understand the mechanisms enabling enrichment of representations via multimodal input. Furthermore, development and evaluation of technology-based tools presents opportunities to leverage digital technologies to enhance L2 tone instruction via multimodal learning. Addressing these research gaps will advance the understanding of multimodal learning and its implications for L2 lexical tone acquisition, informing development of practices that facilitate L2 lexical tone learning.

In summary, research illuminating the impact of multimodal cues on L2 lexical tone acquisition presents compelling evidence supporting their efficacy, particularly with respect to observation and production of hand gestures. Incorporating visual and haptic cues from gestures alongside auditory cues provides an enriched learning experience, enhancing perception and production of L2 lexical tone. The research reviewed here underscores the benefits of multimodal approaches, highlighting how visual depictions such as observed pitch gestures and haptic approaches such as gesture production can complement auditory input, resulting in enriched mental representations of L2 lexical tones. Taken together, this work demonstrates that multimodality enriches mental representations of L2 lexical tone, leading to improved learning outcomes.

Author contributions

BF: Writing – original draft. LM: Writing – review & editing.

Funding

The authors declare that financial support was received for the research, authorship, and/or publication of this article. LM was funded by US National Science Foundation CAREER award #2140073.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Antoniou, M., and Chin, J. L. L. (2018). What can lexical tone training studies in adults tell us about tone processing in children? Front. Psychol. 9:1. doi: 10.3389/fpsyg.2018.00001

Baills, F., Suárez-González, N., González-Fuente, S., and Prieto, P. (2019). Observing and producing pitch gestures facilitates the learning of Mandarin Chinese tones and words. Stud. Second. Lang. Acquis. 41, 33–58. doi: 10.1017/S0272263118000074
