Cross-Situational Word Learning in Two Foreign Languages: Effects of Native Language and Perceptual Difficulty

Tuninetti, Alba; Mulak, Karen E.; Escudero, Paola

doi:10.3389/fcomm.2020.602471

ORIGINAL RESEARCH article

Front. Commun., 18 December 2020

Sec. Psychology of Language

Volume 5 - 2020 | https://doi.org/10.3389/fcomm.2020.602471

Alba Tuninetti^1,2,3^*^†

Karen E. Mulak^1,2,4^†

Paola Escudero^1,2

¹The MARCS Institute for Brain, Behavior, and Development, Western Sydney University, Penrith, NSW, Australia
²Australian Research Council Centre of Excellence for the Dynamics of Language, Canberra, ACT, Australia
³Department of Psychology, Bilkent University, Ankara, Turkey
⁴Department of Hearing and Speech Sciences, University of Maryland, College Park, MD, United States

Cross-situational word learning (CSWL) paradigms have gained traction in recent years as a way to examine word learning in ambiguous scenarios in infancy, childhood, and adulthood. However, no study thus far has examined how CSWL paradigms may provide viable learning pathways for second language (L2) word learning. Here, we used a CSWL paradigm to examine how native Australian English (AusE) speakers learned novel Dutch (Experiment 1) and Brazilian Portuguese (Experiment 2) word-object pairings. During each learning phase trial, two words and objects were presented without indication as to which auditory word belonged to which visual referent. The two auditory words formed a non-minimal or vowel minimal pair. Minimal pairs were classified as “perceptually easy” or “perceptually difficult” based on the acoustic-phonetic relationship between AusE and each L2. At test, participants again saw two visual referents but heard one auditory label and were asked to select the corresponding referent. We predicted that accuracy would be highest for non-minimal pair trials (in which the auditory words associated with the target and distractor object formed a non-minimal pair), followed by perceptually easy minimal pairs, with lowest accuracy for perceptually difficult minimal pair trials. Our results support these hypotheses: While accuracy was above chance for all pair types, in both experiments accuracy was highest for non-minimal pair trials, followed by perceptually easy and then perceptually difficult minimal pair trials. These results are the first to demonstrate the effectiveness of CSWL in adult L2 word learning. Furthermore, the difference between perceptually easy and perceptually difficult minimal pairs in both language groups suggests that the acoustic-phonetic relationship between the L1-L2 is an important factor in novel L2 word learning in ambiguous learning scenarios. 
We discuss the implications of our findings for L2 acquisition, cross-situational learning and encoding of phonetic detail in a foreign language.

Introduction

For adults, learning a second language (L2) can be difficult and time-consuming. Compared to young children, adults need more time and exposure to achieve native-like proficiency in an L2 (e.g., Johnson and Newport, 1989; deKeyser, 2000), and despite these efforts, are rarely rated as sounding like native speakers of the language (see Piske et al., 2001). While adulthood appears to confer some advantage in learning aspects of a language, for example, syntax and morphology (Krashen et al., 1979; Hartshorne et al., 2018), compared to younger L2 learners, later learners have greater trouble with language skills in the L2 such as pronunciation and production (Seliger et al., 1975; Oyama, 1976; Tahta et al., 1981; Piske et al., 2001), grammar learning (Johnson and Newport, 1989; deKeyser, 2000), and lexical access (Jared and Kroll, 2001; Kroll and Sunderman, 2003).

One contributing factor to this difficulty is the influence of the L2 learner's first (or native) language (L1). L2 word learning is affected by the relation between the phonetics and phonology of the learner's L1 and the L2. Models of L2 speech perception focus on how individual vowels and consonants are perceived based on these relations, with the expectation that difficulty or ease in discriminating a phonetic contrast in the L2 extends to respective difficulty or ease in discriminating minimal word pairs that differ by that contrast (e.g., Perceptual Assimilation Model-L2 [PAM-L2]: Best and Tyler, 2007; Second Language Linguistic Perception [L2LP]: Escudero, 2005, 2009; Speech Learning Model [SLM]: Flege, 1995). For example, words such as rock and lock in English are difficult for native Japanese speakers to learn, even with specific training (e.g., McCandliss et al., 2002). This is because L1 Japanese speakers perceive the initial sounds in English rock and lock, [ɹ] and [l], as instances of a single Japanese phoneme, /r/ (e.g., Aoyama et al., 2004), making it difficult to perceive rock and lock as two separate words. Similarly, L1 Spanish speakers learning novel Dutch minimal pair words that differed in a single Dutch vowel (e.g., [piχ] and [pɪχ]) in an explicit word learning task, in which each word is explicitly paired with its corresponding referent, showed poorer learning of minimal pairs when the Dutch vowel contrast differentiating the word pair ([ɪ] and [i]) was predicted to be perceived as a single Spanish vowel (/i/; Escudero et al., 2013, 2014).

But while such dissimilarities between the L1 and L2 sound inventories impede L2 word learning, these obstacles are not present when L1 and L2 sound contrasts align. The same L1 Spanish participants mentioned above showed stronger learning of Dutch minimal pair contrasts when an analogous contrast existed in their native Spanish (e.g., Dutch [i] and [y] in [piχ] and [pyχ] were predicted to be perceived as Spanish /i/ and /u/, respectively). This was true both for L2 learners who were naïve to Dutch, as well as those who had been learning and using the language in an immersive environment, suggesting this perceptual influence of the L1 on L2 perception and explicit word learning is relatively stable and not readily altered by L2 experience or proficiency (Escudero et al., 2013, 2014; see also Antoniou et al., 2015).

This research has clear value for our understanding of the factors involved in L2 word learning, specifically how the perceptual biases shaped by the L1 that L2 learners bring with them affect this process. Nevertheless, these findings are limited in scope due to the exclusive use of explicit word learning paradigms. In an explicit auditory word learning paradigm, participants undergo a training phase in which each novel word is explicitly, unambiguously paired with its corresponding referent. Typically, in each trial participants are shown a picture of a novel object on a screen in tandem with the auditory label for the object. This is followed by a test phase in which participants typically hear an auditory word and are asked to select the corresponding referent from a set of two or more visual objects (Smith, 2000; Escudero et al., 2013, 2014). This type of learning may more closely mimic classroom learning, in which teaching of L2 words generally occurs explicitly, such that a person is presented with the unambiguous one-to-one association between an unfamiliar L2 word and an existing concept (e.g., Spada, 1997). Such learning can, for instance, take the form of an activity in which students are shown pictures of concepts and their associated L2 word and are asked to repeat words out loud, with explicit instructions and corrective feedback (see Spada, 1997).

While undoubtedly effective in increasing L2 vocabulary (e.g., Ellis, 2015), explicit teaching methods do not reflect all the ways a language can be learned both in the classroom and in more naturalistic and immersive environments. The process of associating new words with objects can happen when the connection between the two is not explicitly taught, and no explicit feedback is provided (see e.g., Kriengwatana et al., 2016, for evidence that providing explicit and corrective feedback in non-native learning environments enhances performance compared to no feedback). Instead, the referent belonging to an auditory word is derived across multiple exposures to the word, narrowed down from an infinite set of possible referents. Determining these novel word-object pairings is supported through statistical tracking of word-referent co-occurrences over time, forming associations between words and referents that co-occur with the greatest probability (e.g., Yu and Smith, 2007) as well as top-down, hypothesis-checking techniques whereby the learner tests a possible word-object association by seeing whether the word and object co-occur in subsequent exposures (e.g., Trueswell et al., 2013; Berens et al., 2018). Indeed, this type of learning, termed cross-situational word learning (CSWL), is likely a primary way in which we learn words in our native language (Yu and Smith, 2007) and subsequent languages in an immersive environment.

In the lab, a CSWL paradigm comprises a learning phase followed by a test phase. In the learning phase, participants are not informed that this is a word learning task, instructed instead to simply view and attend to the stimuli presented to them. Participants see multiple novel images (candidate referents) on a screen, while the spoken label for each object (or in some cases, only one label) is presented in random order so that there is no indication of which spoken label refers to which novel image, resulting in referential ambiguity. In this way, while it is not possible to derive word-object associations in a single trial, participants can draw inferences about the relation between pseudowords and candidate referents across multiple exposures by tracking pseudoword-object co-occurrences across trials. These associations are then tested in a forced-choice test phase, in which participants hear one word and are asked to select its referent from more than one presented on the screen. No feedback is given at any point throughout the learning or testing phases.
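The co-occurrence tracking described above can be illustrated with a toy associative learner. This is a minimal sketch under stated assumptions, not the model or procedure of any study cited here: the word and object labels (`w0`, `o0`, ...) are hypothetical, each pair of words is presented exactly once, and the learner simply counts how often each word is heard while each object is on screen.

```python
import random
from collections import defaultdict
from itertools import combinations


def simulate_cswl(n_words=12, seed=0):
    """Toy learner: count word-object co-occurrences across ambiguous trials."""
    rng = random.Random(seed)
    words = [f"w{i}" for i in range(n_words)]              # hypothetical labels
    referent = {w: f"o{i}" for i, w in enumerate(words)}   # hidden true mapping

    # Assumption: every pair of words forms exactly one learning trial.
    trials = list(combinations(words, 2))
    rng.shuffle(trials)

    counts = defaultdict(lambda: defaultdict(int))
    for w1, w2 in trials:
        objects_on_screen = [referent[w1], referent[w2]]
        # Within a trial there is no cue to the pairing, so each word
        # is credited with both objects on screen.
        for word in (w1, w2):
            for obj in objects_on_screen:
                counts[word][obj] += 1

    # After learning, pick the object each word co-occurred with most often.
    learned = {w: max(counts[w], key=counts[w].get) for w in words}
    return learned, referent


learned, truth = simulate_cswl()
accuracy = sum(learned[w] == truth[w] for w in truth) / len(truth)
print(f"Proportion of correct mappings: {accuracy:.2f}")
```

No single trial disambiguates a word, but across trials the correct object accumulates more co-occurrences with its word than any competitor does. The sketch also assumes that every word is perceived perfectly; a learner who confuses two minimal pair words would credit co-occurrences to the wrong word, which is the disruption examined in the present study.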

Research using this paradigm supports CSWL as a real-world word-learning strategy (e.g., Yu and Smith, 2007; Fitneva and Christiansen, 2011; Vlach and Sandhofer, 2014; Escudero et al., 2016). In their seminal study, Yu and Smith (2007) showed that university students could associate novel English pseudowords with novel objects with differing degrees of within-trial ambiguity, as defined by the number of pseudowords and novel objects presented in a single trial. Participants saw two to four pictures and heard two to four pseudowords in each trial, with no indication of picture-word mappings. Therefore, participants' degree of certainty about which pseudoword corresponded with which object varied, based on the number of pseudowords and objects presented within a trial (i.e., less ambiguity with two pseudowords and two objects compared with four pseudowords and four objects). At test, they selected the correct referent significantly above chance in all conditions, demonstrating that adults can use cross-situational learning to derive the correct word-object associations for novel words produced in their native language. Research since then has continued to show support for CSWL as a real-world mechanism, demonstrating that adults can retain these novel word-object pairings for at least a week (Vlach and Sandhofer, 2014), and can encode novel words learned via CSWL in fine phonological detail (Escudero et al., 2016; Mulak et al., 2019). Australian English (AusE) speakers learned and subsequently identified pseudoword-object pairings in a non-minimal pair (e.g., bon-deet), consonant minimal pair (e.g., bon-ton), and vowel minimal pair (e.g., deet-dit) context above chance in all conditions, though accuracy was lower in the vowel minimal pair context, suggesting weaker encoding of vowels compared to consonants (Escudero et al., 2016).

While the research on CSWL in adults supports CSWL as a viable word learning mechanism in the L1, adults' ability to learn words in an L2 via CSWL has not been investigated. This is because novel words across experiments to date have conformed to the phonology and phonotactic rules of the learner's L1. Of course, there is no reason to believe that adults cannot use CSWL to learn L2 words at all. While not previously tested in adults, children have been found to learn L2 words in this way. In a direct comparison of L2 CSWL and a more explicit paradigm, 8-year-old Mandarin-speaking students who were studying English were exposed to four real English words that were unknown to the students (clamp, wedge, snood, and dart) that were paired with novel images (i.e., not with the actual referent in English). Half of the participants were exposed to word-object pairings in a CSWL paradigm, in which a target word was presented with another target word in each trial. The other participants were taught words in a more unambiguous, mutual exclusivity paradigm, in which the novel image paired with the auditory word in a trial was presented alongside an image for which children knew the corresponding English word. In this way, the auditory label could be inferred as belonging to the novel referent by a process of elimination, since participants already know the label associated with the alternate referent. In both conditions, children learned all four words. While immediate testing revealed a disadvantage for words learned via CSWL, there were no differences across conditions when retention was examined 15 min after the task had ended (Hu, 2017).

Similarly, Junttila and Ylinen (2020) demonstrated that 5- to 8-year-old Finnish children could learn real English word-object pairs in a CSWL paradigm in which they were presented with two spoken words and two images in each trial. The authors found no evidence that CSWL differed in effectiveness compared to an intentional, explicit learning paradigm in which children were asked to memorize the word object pairs (with some children also being asked to produce the words) or an incidental learning paradigm in which children were not asked to memorize the words and were asked to produce the Finnish word for each visual referent, such that any learning of the English labels would have occurred incidentally.

While these studies demonstrate that children can learn L2 words via CSWL, the L2 words used by Hu (2017) and Junttila and Ylinen (2020) were all very phonologically distinct. As discussed above, certain sounds in an L2 can be particularly difficult for a learner to discriminate based on the relation between the sounds in the L2 and the listener's native L1 sound categories. Because CSWL involves tracking the co-occurrences between auditory words and candidate referents, if listeners are unable to reliably distinguish between certain minimal pair words between the L1 and the L2, that could greatly impact the efficacy of CSWL in the L2. A similar situation arises when learners are tasked with learning two words for one referent, as is common in the bilingual CSWL literature. Indeed, in a CSWL task in which participants were taught two auditory labels for each visual referent, participants with experience with more than one language (i.e., they knew English and had knowledge of at least one other language) were better at learning both labels compared to monolingual English participants, but only when the two auditory labels were very phonologically distinct (disyllabic words ending in a vowel vs. monosyllabic words ending in /k/; Benitez et al., 2016). These results highlight the possibility that if monolinguals are unable to distinguish between the auditory minimal pairs, their ability to track the word-object pairs across learning trials may be disrupted and may lead to competition between certain word pairs depending on their phonetic similarity.

To investigate whether adults can learn L2 words via CSWL and in particular whether perceptual difficulties in the L2 obstruct CSWL in the L2, the current study compared L2 learners' ability to learn phonologically distinct L2 pseudowords and vowel minimal pairs in which the vowel contrast is predicted to be perceptually easy or difficult for the listener to discriminate based on the phonological relation between the L1 and to-be-learned language. Specifically, we tested monolingual AusE speakers' ability to learn referents associated with novel words produced by native speakers of Dutch (Experiment 1) or Brazilian Portuguese (Experiment 2), which conformed to Dutch or Brazilian Portuguese phonology and phonotactics. Comparing learning in two languages allows us to test the idea that the acoustic-phonetic relationship between the L1 and L2 is a deciding factor in how well learners acquire new words.

Our predictions regarding the effects of the L1-L2 acoustic-phonetic relationship were based on models of L2 speech perception, such as the L2LP model (e.g., van Leussen and Escudero, 2015) and the PAM-L2 (Best and Tyler, 2007), which examine how this early or initial perception of L2 sounds can help or hinder discrimination based on acoustic (L2LP) or articulatory (PAM-L2) characteristics. Specifically, when two sounds in the L2 map onto one category in the L1, this can make discrimination of the contrast more difficult, since both L2 sounds are perceived as belonging to a single L1 sound (e.g., Spanish speakers confusing English [i] in bean and [ɪ] in bin because both map onto the sole Spanish /i/ category). This is known as a “new scenario” in the L2LP model. Another difficult scenario is a “subset scenario,” in which one non-native vowel can be categorized into two or more native categories. For example, in a categorization study, native AusE listeners categorized the non-native Dutch vowel /ʏ/ across three different native AusE vowels fairly equally: /ɛ/ (19%), /ʊ/ (19%), and /ʉ/ (14%) (Alispahic et al., 2017). In that same study, Dutch /ɪ/ and /i/ were mapped most frequently to AusE /ɪ/ (40 and 48%, respectively), in an example of a “new scenario.” These two scenarios highlight the difficulties in vowel perception and categorization across different languages, such that depending on the acoustic relationship between the L1 and L2, certain vowels may be perceived as belonging to one or more native categories, leading to difficulty in learning to discriminate them in lexical contexts.

Following the L2LP model, we used acoustic measurements of the Dutch, Brazilian Portuguese, and AusE vowels to classify the target Dutch and Brazilian Portuguese minimal pairs as perceptually easy or difficult for native AusE listeners to discriminate. We focused on vowel minimal pairs over consonant minimal pairs because previous work has demonstrated that while native AusE listeners learn both consonant and vowel minimal pairs via CSWL in their own language, performance is weakest for native vowel minimal pairs (Escudero et al., 2016; Mulak et al., 2019). Thus, if L2 perceptual difficulties do affect CSWL in the L2, the ability to learn L2 vowel minimal pairs may be particularly affected.

We predicted that overall, native AusE participants would be able to learn L2 pseudowords in both Dutch and Brazilian Portuguese, given evidence that adults can learn non-minimal and minimal pair L1 pseudowords via CSWL (Escudero et al., 2016) and evidence that children can learn phonologically distinct L2 words in a version of CSWL (Hu, 2017; Junttila and Ylinen, 2020). Given that when tested in the L1 adults show better learning of non-minimal than vowel minimal pairs (Escudero et al., 2016), we predicted that performance would be better for non-minimal pairs than for perceptually easy minimal pairs. We further predicted that accuracy would be lower for perceptually difficult minimal pairs than for perceptually easy minimal pairs, presumably because the difficulty in discriminating between words would disrupt cross-situational tracking of word-object co-occurrences.

Methods

Experiment 1: Dutch

Participants

Participants were 20 undergraduate students at Western Sydney University (15 females, 5 males, mean age = 22.87 years, SD = 3.72 years). All students were monolingual AusE speakers as revealed by a language background questionnaire administered at the beginning of the session (i.e., participants self-reported only using English in their daily lives, including schooling and work). Participants received course credit for their participation.

Stimuli

The 12 Dutch auditory words were the same as those used in Escudero et al. (2013). They were produced by a native female Dutch speaker and were recorded at the University of Amsterdam in a soundproof booth. All words adhered to Dutch phonology and phonotactics. Half of the words were in a /p-vowel-χ/ (/pVχ/) context with the Dutch vowels /ɪ, i, ɑ, a, y, ʏ/. Table 1 reports the first (F1) and second (F2) vowel formants, which approximately correspond to the position of the tongue during vowel production with regard to height and backness, respectively. Overlap between these values can help predict whether a contrast will be perceptually easy or difficult for a non-native listener to perceive. These measurements were originally reported by Elvin et al. (2016) for AusE (Western Sydney area) and Adank et al. (2004) for Dutch.

Table 1. Formant values (Hz) of vowels across languages used in present study.

Pairings of these six /pVχ/ words comprised the minimal pair (MP) set. The remaining six words were disyllabic words adapted from Shatzman and McQueen (2006). These words contained different consonants and vowels from the MP set, arranged in variable, phonologically distinct contexts (/ˈbeːptuː/, /ˈfoːmpəl/, /ˈjɔmtoː/, /ˈkɛstə/, /ˈsurkɛt/, /ˈtœykfɔm/). These formed non-minimal pairs (non-MPs) when paired with one another or with a /pVχ/ word. The 12 Dutch pseudowords were randomly paired with 12 black-and-white line drawings of nonsense objects from Shatzman and McQueen (2006).

As mentioned in the introduction, categorization of vowel contrasts as easy or difficult was based on patterns of expected acoustic-phonetic mapping of L2 vowels to the L1 phonological space, following the L2LP model (see e.g., van Leussen and Escudero, 2015; Alispahic et al., 2017). We used the categorization results from Alispahic et al. (2017) to predict how our participants would categorize the Dutch vowels tested here, as this study also tested native AusE speakers from the Western Sydney area. As can be seen in Table 2, difficult minimal pairs were those that contained vowel contrasts that could be classified as belonging to the same L1 vowel category or that could be categorized across more than one L1 category, whereas easy minimal pairs contained vowel contrasts that were expected to be categorized clearly to two separate L1 vowel categories.
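The classification rule described above can be sketched as a small function. This is an illustration under stated assumptions rather than the procedure used to build Table 2: the `clear_threshold` of 0.5 is a hypothetical cutoff for a "clear" categorization, the dictionaries hold only the percentages quoted in the Introduction (Alispahic et al., 2017), and `placeholder_x` and `placeholder_y` are invented vowels included solely to show the easy case.

```python
def top_category(mapping):
    """Return the L1 category chosen most often for one L2 vowel."""
    return max(mapping, key=mapping.get)


def classify_pair(map_a, map_b, clear_threshold=0.5):
    """Label an L2 vowel contrast as 'easy' or 'difficult' for L1 listeners.

    map_a and map_b give the proportion of responses per L1 category.
    clear_threshold is a hypothetical cutoff, not a value from the study.
    """
    top_a, top_b = top_category(map_a), top_category(map_b)
    if top_a == top_b:
        # Both L2 vowels fall mostly into the same L1 category.
        return "difficult"
    if map_a[top_a] < clear_threshold or map_b[top_b] < clear_threshold:
        # At least one L2 vowel is spread across several L1 categories.
        return "difficult"
    return "easy"


# Proportions quoted in the Introduction; other response categories are omitted.
dutch_lax_i = {"ɪ": 0.40}
dutch_tense_i = {"ɪ": 0.48}
dutch_lax_y = {"ɛ": 0.19, "ʊ": 0.19, "ʉ": 0.14}

# Invented placeholder vowels, used only to illustrate an easy contrast.
placeholder_x = {"A": 0.90}
placeholder_y = {"B": 0.85}

print(classify_pair(dutch_lax_i, dutch_tense_i))    # same L1 category
print(classify_pair(dutch_lax_y, placeholder_x))    # one vowel spread out
print(classify_pair(placeholder_x, placeholder_y))  # two clear categories
```

The first branch corresponds to two L2 vowels falling into one L1 category and the second to a single L2 vowel spread across several L1 categories, the two difficult cases named in the paragraph above.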

Table 2. Easy and difficult vowel mappings of Dutch vowels to AusE vowels.

Procedure

Participants completed a CSWL task, which consisted of a learning phase and a testing phase. They were seated in front of a 17-inch laptop computer and were told they would see images and hear words. Participants were not told that the words were associated with the images or that they would later be tested on the associations between the pictures and words.

Learning Phase

The learning phase consisted of 72 trials (12 words presented 6 times each) in which participants were presented auditorily with two pseudowords and visually with two black-and-white nonsense line drawings (from Shatzman and McQueen, 2006) on the left and right sides of the screen. All pseudowords were presented with every other pseudoword at least once, and trials were counterbalanced such that each novel line drawing was presented an equal number of times on the left or right and each pseudoword was presented an equal number of times as the first or second word in each trial. Trials had a 500 ms delay between picture onset and word onset and a 500 ms inter-stimulus interval.

Testing Phase

The testing phase consisted of 264 trials (12 words presented 22 times each as target, twice with every other word). This created 28 trials that were considered difficult MPs, 32 trials that were considered easy MPs, and the remaining 204 trials were non-minimal pairs. While pairing each word with every other word resulted in a greater number of non-minimal pair trials compared to minimal pair trials, this ensured minimal pairs did not stand out as a primary focus of our investigation (see also Escudero et al., 2014), and is also reflective of naturalistic language exposure in which minimal pairs appear at a much lower frequency.
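The trial counts above follow directly from the pairing scheme, as the short check below shows. It is a sketch of the arithmetic only: the integer word indices are arbitrary stand-ins, and the split of minimal pair trials into easy and difficult depends on the vowel mappings in Table 2, so it is not reproduced here.

```python
from itertools import permutations

N_WORDS = 12      # all Dutch pseudowords
N_MP_WORDS = 6    # the six /pVχ/ words that form minimal pairs
REPEATS = 2       # each target is tested twice with every other word

words = list(range(N_WORDS))          # arbitrary indices standing in for words
mp_words = set(range(N_MP_WORDS))     # indices 0-5 stand in for the /pVχ/ set

# permutations yields ordered (target, distractor) pairs, so each word
# serves as target against each of the 11 other words.
test_trials = [
    (target, distractor)
    for target, distractor in permutations(words, 2)
    for _ in range(REPEATS)
]

# A trial is a minimal pair trial only if both words come from the /pVχ/ set.
mp_trials = [(t, d) for t, d in test_trials if t in mp_words and d in mp_words]
n_non_mp = len(test_trials) - len(mp_trials)

trials_per_target = len(test_trials) // N_WORDS

print(len(test_trials), trials_per_target, len(mp_trials), n_non_mp)
# prints: 264 22 60 204
```

The 60 minimal pair trials correspond to the 28 difficult and 32 easy minimal pair trials reported above, and the remaining 204 trials are non-minimal pairs.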

Participants heard one pseudoword while two line drawings were presented on the screen, and were asked to indicate which line drawing corresponded to the word they heard. They pressed a key on the keyboard to make their response. While it is possible that participants could learn or strengthen pseudoword-object pairings during the testing phase, importantly, no feedback was given on test trials. Thus, the test trials were effectively additional cross-situational trials in which one auditory word co-occurred ambiguously with two candidate referents, and therefore still reflect CSWL. See Figure 1 for a representation of the learning and testing phase for Dutch stimuli.
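Because each test trial offers two candidate referents, a participant who has learned nothing is expected to choose correctly on half of the trials. The sketch below shows one simple way to ask whether a count of correct responses exceeds that chance level, using an exact one-sided binomial tail. It is an assumption for illustration and not necessarily the analysis reported in this study; the response count of 20 correct out of 28 is invented.

```python
from math import comb


def binomial_upper_tail(n_correct, n_trials, chance=0.5):
    """Probability of at least n_correct successes out of n_trials by guessing."""
    return sum(
        comb(n_trials, k) * chance**k * (1 - chance) ** (n_trials - k)
        for k in range(n_correct, n_trials + 1)
    )


# Invented example: 20 correct responses on the 28 difficult minimal pair trials.
p_value = binomial_upper_tail(20, 28)
above_chance = p_value < 0.05

print(f"one-sided p = {p_value:.4f}, above chance: {above_chance}")
```

A count close to half of the trials gives a large tail probability, whereas counts well above half give a small one. Group-level comparisons across pair types would need a model that accounts for repeated measures within participants, which this sketch does not attempt.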

Figure 1. Representation of Dutch learning phase and testing phase presented to participants. Arrows indicate linear order through trials; participants were first exposed to the learning phase and then the testing phase.