Multi-talker background and semantic priming effect

Dekerle, Marie; Boulenger, Véronique; Hoen, Michel; Meunier, Fanny

doi:10.3389/fnhum.2014.00878

ORIGINAL RESEARCH article

Front. Hum. Neurosci., 31 October 2014

Sec. Cognitive Neuroscience

Volume 8 - 2014 | https://doi.org/10.3389/fnhum.2014.00878

This article is part of the Research Topic The Cognitive and Neural Organisation of Speech Processing

Marie Dekerle^1,2*

Véronique Boulenger^2,3

Michel Hoen^2,4

Fanny Meunier^1,2

¹Laboratoire sur le Langage, le Cerveau et la Cognition, Centre National de la Recherche Scientifique, UMR 5304, Lyon, France
²University of Lyon, Lyon, France
³Laboratoire Dynamique Du Langage, Centre National de la Recherche Scientifique, UMR5596, Lyon, France
⁴Centre National de la Recherche Scientifique, UMR5292, Institut National de la Santé et de la Recherche Médicale, U1028, Lyon Neuroscience Research Center, Brain Dynamics and Cognition Team, Lyon, France

The reported studies have aimed to investigate whether informational masking in a multi-talker background relies on semantic interference between the background and target using an adapted semantic priming paradigm. In 3 experiments, participants were required to perform a lexical decision task on a target item embedded in backgrounds composed of 1–4 voices. These voices were Semantically Consistent (SC) voices (i.e., pronouncing words sharing semantic features with the target) or Semantically Inconsistent (SI) voices (i.e., pronouncing words semantically unrelated to each other and to the target). In the first experiment, backgrounds consisted of 1 or 2 SC voices. One and 2 SI voices were added in Experiments 2 and 3, respectively. The results showed a semantic priming effect only in the conditions where the number of SC voices was greater than the number of SI voices, suggesting that semantic priming depended on prime intelligibility and strategic processes. However, even if backgrounds were composed of 3 or 4 voices, reducing intelligibility, participants were able to recognize words from these backgrounds, although no semantic priming effect on the targets was observed. Overall this finding suggests that informational masking can occur at a semantic level if intelligibility is sufficient. Based on the Effortfulness Hypothesis, we also suggest that when there is an increased difficulty in extracting target signals (caused by a relatively high number of voices in the background), more cognitive resources were allocated to formal processes (i.e., acoustic and phonological), leading to a decrease in available resources for deeper semantic processing of background words, therefore preventing semantic priming from occurring.

Introduction

In daily life, speech is rarely perceived in silence, but with interference from wind, music or other people's conversation. Although often used to study psychoacoustic topics (Brungart, 2001; Brungart et al., 2001; McDermott, 2009), speech-in-noise and cocktail party situations (i.e., speech-in-speech, Cherry, 1953) also appear to be interesting paradigms to tackle linguistic processes and competition occurring between backgrounds and targets (Hoen et al., 2007; Boulenger et al., 2010). Our study aimed to investigate the extent to which multi-talker background is processed semantically when listening to speech-in-speech and therefore how the cocktail party situation can be used to study the automaticity of word semantic activation.

The cocktail party situation is described as involving two types of masking effects: energetic and informational masking (Brungart, 2001). Energetic masking relies on the spectro-temporal features of sounds and results from different sounds stimulating the same part of the cochlea at the same time so that one of them cannot be heard (i.e., as two signals increasingly share spectro-temporal characteristics, energetic masking becomes more efficient). In multi-talker background situations, the magnitude of energetic masking is proportional to the number of voices that comprise the background (Simpson and Cooke, 2005). Informational masking, however, usually refers to masking effects that cannot be attributed to energetic masking. Specifically, it is related to the overlap of information carried by the different signals at a higher level (e.g., lexical level and working memory; see Durlach, 2006; Cooke et al., 2008; Mattys et al., 2009; Mattys and Wiget, 2011). Whereas background noise mainly elicits energetic masking, a speech background produces both energetic and informational masking (Brungart et al., 2006). Despite the masking, it is still possible to detect and recognize a word or linguistic token embedded in a babble. Of course, as more voices are present in the babble, participants become less accurate (Freyman et al., 2004). However, it is interesting to note that Simpson and Cooke (2005) showed, using a −6 dB SNR, that intelligibility decreases as a monotonic function of the number of speakers in babbles of up to 8 voices. Specifically, participants' accuracy in detecting the target token decreases as the number of voices increases up to 8 voices. Further increasing the number of voices does not lead to a further decrease in accuracy. These results suggest that if energetic masking is too high, informational masking decreases because fewer linguistic cues remain available. For example, with more than 8 talkers, phonetic cues are less available, or not available at all, and therefore cannot be attributed incorrectly to the target.

The first aim of this paper is to test whether semantic features are involved in informational masking. It has been established that informational masking is not monolithic and occurs at many linguistic levels. Indeed, a multi-talker background will create less interference on a target word, if it is pronounced in a different language (Van Engen and Bradlow, 2007; Brouwer et al., 2012) and different languages will not have the same masking power (Gautreau et al., 2013). By manipulating the number of talkers in the background, Boulenger et al. (2010) revealed lexical competitions between a 2-talker background and target speech using a lexical decision task. Increasing the number of voices in the background, however, led to the disappearance of lexical interference because of increased energetic masking (i.e., words from the background became less intelligible and therefore competed less with target processing). However, using the same paradigm but with an intelligibility task, Hoen et al. (2007) showed that lexical processing of a background could be performed with up to 4 concurrent voices; beyond that threshold, masking was too high and seemed to prevent linguistic processes. Although it has been shown that phonological and lexical information contribute to the informational masking effect, our experiments tested the role of semantic information.

Processing of the background's semantic content has already been demonstrated with 2 talkers pronouncing either semantically correct sentences (e.g., “rice is often served in round bowls”) or incorrect sentences (e.g., “the great car met the milk”; Brouwer et al., 2012). Semantic incoherence in the background impacts the recognition of the target sentence. This result suggests that the background signal with 2 talkers is processed semantically. Our experiments aimed to identify, using words, how many talkers this semantic processing can tolerate and how semantic information from the backgrounds interferes with the identification of target words.

The ability to semantically process auditory words presented outside of the attentional focus is traditionally studied using dichotic listening. This paradigm allows pure informational masking to be studied, as no energetic masking occurs in dichotic listening. However, discrepant results have been reported (Cherry, 1953; Lewis, 1970; Eich, 1984; Wood et al., 1997; Dupoux et al., 2003). In 1984, Eich showed a semantic effect on the recognition of words presented in the unattended channel. However, this effect resulted, at least partially, from an attentional shift toward the to-be-ignored channel (as suggested by Wood et al., 1997). As the speech rate was very slow in Eich's experiment (85 words per minute), it allowed participants to listen to the supposedly unattended channel without disturbing the primary task (in the case of Eich's study, a shadowing task). In replicating Eich's study using the same speech rate, Wood et al. (1997) observed the same semantic effect; however, it disappeared if the speech rate was increased to 170 words per minute, corresponding to a more ecologically valid rate. The authors concluded that as this faster speech rate demanded more cognitive resources, participants could no longer shift attention to the unattended channel while performing the primary task, suggesting that at least in dichotic listening, informational masking does not involve semantic information. The issue raised by this paradigm is that the spatial separation of auditory signals creates a masking release compared to a binaural condition and therefore facilitates stream segregation, which could prevent competition between the to-be-ignored and target speech (Drullman and Bronkhorst, 2000; Hawley et al., 2004).

Concerning semantic activation and according to traditional theoretical models, semantic memory is organized into networks. The recognition of one word leads to its activation in semantic memory, and this activation is supposed to spread automatically to other related concepts (Collins and Quillian, 1969; Collins and Loftus, 1975). This supposition is derived from semantic priming paradigms, shown in auditory and visual modalities, in which the presentation of a prime word leads to faster recognition of a semantically related target word (Meyer and Schvaneveldt, 1971; Donnenwerth-Nolan et al., 1981; Radeau, 1983; Schacter and Church, 1992; Radeau et al., 1998; Spruyt et al., 2012). For example, the presentation of the prime “nurse” before the target “doctor” facilitates the recognition of the target word “doctor” compared to a condition in which the prime is unrelated to the target (Meyer and Schvaneveldt, 1971). Adapting this paradigm to the cocktail party situation will allow us to investigate whether the semantic content of the background is processed and interferes with the target word, despite decreased intelligibility. Some background words will therefore act as primes.

In the current study, we used the rationale of a priming paradigm by manipulating the association between words pronounced in the background and target words. Additionally, we varied the amount of masking to evaluate how it modulates semantic priming effects. Participants were required to perform a lexical decision task on a target item (i.e., decide whether the target item is a word or a pseudo-word) embedded in backgrounds composed of 1 to 4 voices depending on the experiment. These voices could pronounce words that were semantically related to each other and that were related or unrelated to the target. They acted as primes and were called Semantically Consistent (SC) voices. Additional voices pronounced words that were always unrelated to each other and unrelated to the target, acting as maskers. They were called Semantically Inconsistent (SI) voices.

Across experiments, we manipulated the ratio between SC and SI voices. The aim was to test the preservation of the semantic processing of SC voices despite increased masking (i.e., more SI voices). In Experiment 1, backgrounds were composed of 1 or 2 SC voices. In Experiments 2 and 3, respectively, 1 and 2 SI voices were added to each background to increase masking and therefore, decrease the intelligibility of the SC voices. Consequently, in Experiment 2, backgrounds in one condition consisted of 1 SC voice and 1 SI voice and in a second condition of 2 SC voices and 1 SI voice. In Experiment 3 they comprised 1 SC voice and 2 SI voices in one condition and 2 SC voices and 2 SI voices in the other condition.

Overall increasing the number of voices allowed us to examine if and how semantic priming can be impacted by the increase in the number of talkers in the background. Additionally, the variation in the number of SC voices compared to the number of SI voices allowed us to study the effect of prime saliency on semantic processing and therefore its participation in informational masking. Indeed, across experiments, backgrounds can consist of the same number of voices whereas the number of SC voices compared to the number of SI voices could differ (e.g., 3 voices in the background: either 2SC/1SI in Experiment 2 or 1SC/2SI in Experiment 3).
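The design described above can be laid out as a small table of (SC, SI) voice counts per condition. The sketch below is only an illustration of that layout; the names `DESIGN` and `by_total_voices` are hypothetical and not part of the original study materials. Grouping conditions by total number of voices makes visible which backgrounds share a voice count but differ in SC/SI composition.

```python
from collections import defaultdict

# (SC voices, SI voices) for each condition, as described in the text.
DESIGN = {
    "Experiment 1": [(1, 0), (2, 0)],
    "Experiment 2": [(1, 1), (2, 1)],
    "Experiment 3": [(1, 2), (2, 2)],
}

def by_total_voices(design):
    """Group conditions by the total number of voices in the background."""
    grouped = defaultdict(list)
    for experiment, conditions in design.items():
        for sc, si in conditions:
            grouped[sc + si].append((experiment, f"{sc}SC/{si}SI"))
    return dict(grouped)

# Print each voice count with the conditions that produce it.
for total, conditions in sorted(by_total_voices(DESIGN).items()):
    print(total, conditions)
```

Two-voice and three-voice backgrounds each arise in two different experiments, which is what allows prime saliency to be compared at a constant number of voices.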

If semantic processing can occur automatically, semantic priming should be observed at least as long as background words are intelligible and should not be disturbed by increased masking and decreased prime saliency. Indeed, automaticity is defined as strategy-free processing that occurs without using the resources of a limited-capacity central processor (Neely, 1977). Therefore, if semantic processing is strategy-free, it should occur even if participants are not aware that a given word is presented to them (as is done in the visual modality in classical masked priming paradigms, see Forster and Davis, 1984).

Experiment 1

The first aim of this experiment was to set up and test our paradigm and experimental materials. Backgrounds were composed of 1 or 2 SC voices that pronounced words sharing semantic features with each other. In the related condition, target words belonged to the same semantic field as the prime, but they did not in the unrelated condition. We therefore expected to observe a semantic priming effect: participants should more quickly and accurately identify target words in the related compared to the unrelated condition. The second aim of this first experiment was to test whether the presence of 2 voices in the background would affect participants' performance as suggested by the psychoacoustic literature (Brungart, 2001; Brungart et al., 2001). We therefore hypothesized that target words would be responded to more slowly and less accurately in the 2SC condition compared to the 1SC condition. Finally, we examined whether the semantic priming effect was modulated by the increased energetic and informational masking caused by the greater number of voices in the background.

Method

Participants

Twenty-seven participants (20 females) volunteered for this experiment. All were right-handed, French native speakers and reported no known hearing or language disorder. Subjects' ages ranged from 18 to 25 years old. All participants gave written informed consent and were not aware of the experiment's purpose. They were compensated for their participation. The protocol that was used in this experiment was approved by the local ethics committee (CPP Sud-Est IV, Lyon; ID RCB: 2008-A00708-47).

Stimuli

Forty-eight disyllabic target words (M_{lexical frequency} = 21.94 per million, SD = 18.75 according to the French database Lexique 3, New et al., 2001) were selected, and each word belonged to a specific semantic field (e.g., CAROTTE “carrot”; MÉTRO “subway”). Each target word was matched to 10 words belonging to the same semantic field (e.g., CAROTTE “carrot” was associated with légume, chou, céleri, salade, tomate “vegetable, cabbage, celery, lettuce, tomato”). As participants had to perform a lexical decision task, 48 pseudo-words respecting French phonotactic rules were created (e.g., PLARO, HUMEL). Ten words sharing semantic features with each other were arbitrarily associated with each pseudo-word target, resulting in a total of 96 groups of 10 words (see Supplementary Material) (M_{lexical frequency} = 21.86, SD = 18.20). As each background comprised 1 or 2 SC voices (related or not to the target), each group was divided into two subgroups of 5 words: one of the subgroups was spoken by a first speaker (S1), and the other by a second speaker (S2).

Target words were presented with a semantically related (related condition) or semantically unrelated background (unrelated condition). In the unrelated condition, SC voices pronounced words that were semantically related to each other but not to the target (see Figure 1). Backgrounds comprised 1 SC voice (1SC condition) or 2 SC voices (2SC condition). The 48 target words were divided into 4 groups of 12 words. The mean frequency did not differ significantly between the groups (F < 1), nor did the number of phonemes [M = 6.97, SD = 5.65; F(3, 44) = 1.1, n.s.] or the number of phonological neighbors [M = 4.75, SD = 0.81; F(3, 44) = 2.2, n.s.]. Each group of twelve target words was assigned to a condition (1SC related, 1SC unrelated, 2SC related, 2SC unrelated) depending on the experimental list. The same was true for pseudo-words. Four experimental lists of 96 stimuli (i.e., 48 target words and 48 target pseudo-words) were created so that each target word was presented in each condition across lists, but only once in a given list (each participant was presented with one list only).
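The counterbalancing described above amounts to a Latin-square rotation of item groups across conditions. The following is a minimal sketch under that assumption; the text does not state which rotation was actually used, and the function name `build_lists` and the placeholder item labels are hypothetical.

```python
CONDITIONS = ["1SC related", "1SC unrelated", "2SC related", "2SC unrelated"]

def build_lists(groups):
    """Rotate item groups across conditions (Latin square).

    groups: one list of item labels per condition (4 groups of 12 targets here).
    Returns one dict per experimental list, mapping each item to its condition.
    """
    n = len(CONDITIONS)
    if len(groups) != n:
        raise ValueError("need exactly one item group per condition")
    lists = []
    for shift in range(n):
        assignment = {}
        for g, items in enumerate(groups):
            # Assumed rotation: group g moves one condition further on each list.
            condition = CONDITIONS[(g + shift) % n]
            for item in items:
                assignment[item] = condition
        lists.append(assignment)
    return lists

# Placeholder labels; the real targets are listed in the Supplementary Material.
groups = [[f"word_{g}_{i}" for i in range(12)] for g in range(4)]
lists = build_lists(groups)
```

With this rotation every target appears exactly once per list and in a different condition on each of the four lists, which matches the constraint stated in the text. The same function could be applied to the pseudo-word groups.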

FIGURE 1

Figure 1. (A) Example of two backgrounds in the 1SC condition of Experiment 1, presented with a semantically related target word (left; related condition) or not (right; unrelated condition). S1, speaker 1, LDT, Lexical Decision Task. (B) Example of a background in the 2SC condition of Experiment 1, presented with a semantically related target word (left; related condition) or not (right; unrelated condition). S2, speaker 2.

Targets and SC voices were recorded by 3 different French native female speakers (age: 21–22) in a sound-proof room (22 kHz, mono, 16 bits). Auditory sequences of 5 words from Speakers 1 and 2 (S1 and S2) were segmented into 3 s periods. The periods were then normalized at an intensity of 60 dB-A and mixed together to create backgrounds. All audio files were synchronized at the beginning, so all voices started to speak at the same time. However, as all voices pronounced words of different lengths, they soon became desynchronized, and there was always one speaker talking in the background. Targets recorded by Target Speaker (TS; also normalized at an intensity of 60 dB-A) were inserted 2 s after the start of the backgrounds (so that each participant always had the same exposure to the background before the target speech was presented), with a 0 dB SNR (Signal/Noise Ratio; see Figure 1). Because the backgrounds, which comprised 1 or 2 voices, generated different amounts of energy, the intensity of all stimuli was varied over a ±3 dB range in 1 dB steps to prevent participants from predicting condition depending on individual stimuli intensity.
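The embedding of a target in a background at a given SNR, followed by level roving, can be sketched as follows. This is an illustration only: it assumes that SNR is defined as the ratio of target RMS to the RMS of the whole mixed background, that "22 kHz" means 22,050 Hz, and it uses synthetic tones in place of recordings. The function names are hypothetical, and the authors' actual processing chain (including dB-A weighting) is not reproduced here.

```python
import math

SAMPLE_RATE = 22050  # assumption: "22 kHz" read as 22,050 Hz
ROVE_STEPS_DB = [-3, -2, -1, 0, 1, 2, 3]  # the +/-3 dB range in 1 dB steps

def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def target_gain(target, background, snr_db):
    """Linear gain for the target so that 20*log10(rms_target/rms_background) = snr_db."""
    return rms(background) / rms(target) * 10 ** (snr_db / 20)

def embed_target(target, background, onset_s=2.0, snr_db=0.0, rove_db=0):
    """Add the scaled target to the background at onset_s, then rove the overall level."""
    onset = int(round(onset_s * SAMPLE_RATE))
    if onset + len(target) > len(background):
        raise ValueError("target does not fit inside the background")
    gain = target_gain(target, background, snr_db)
    mixed = list(background)
    for i, sample in enumerate(target):
        mixed[onset + i] += gain * sample
    rove = 10 ** (rove_db / 20)
    return [rove * sample for sample in mixed]

def tone(freq_hz, duration_s, amplitude):
    """Synthetic stand-in for a recording (pure tone)."""
    n = int(duration_s * SAMPLE_RATE)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n)]

background = tone(220.0, 3.0, 0.5)  # 3 s background period
target = tone(440.0, 0.5, 0.1)      # hypothetical 0.5 s target
stimulus = embed_target(target, background, onset_s=2.0, snr_db=0.0, rove_db=0)
```

In an actual stimulus set, `rove_db` would be drawn from `ROVE_STEPS_DB` for each stimulus so that overall level does not predict the number of background voices.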

Procedure

Participants sat in front of a computer screen and heard the stimuli binaurally through headphones at a comfortable level (mean level 65 dB-A, ranging from 62 dB-A to 68 dB-A, normalized using an artificial ear). A fixation cross was presented on the screen at the beginning of each trial and remained on the screen during stimulus presentation. Participants were asked to listen to the stimuli to decide as quickly and accurately as possible whether the target was a word or a pseudo-word, by pressing one of two pre-specified keys. After a response was given, a string of hash marks indicated that the trial was over; participants could then press a key to start the next trial. Half of the participants gave the response to “word” with their left hand and to “pseudo-word” with their right hand. As all participants were right-handed, they might answer faster with their right hand than their left hand. To avoid this confounding effect, the other half were given the opposite instruction. A training session composed of twelve trials (different from the experimental stimuli) preceded the test session so that participants could acclimate to the stimuli and the task.

Results

Two two-way repeated-measures analyses of variance (ANOVAs) by participants (F₁) and by items (F₂) were conducted, with Response Times (RTs, in ms) and Error Rates (ERs) for target word identification as dependent variables. We included Number of Voices in the background (1 Voice, 1SC or 2 Voices, 2SC) and Semantic Link between prime and target (related or unrelated) as within-subjects factors. Three participants were excluded from analyses because of very high ERs (more than 40%). Four target words with error rates greater than 50% were also excluded from the item analyses (POIGNET, RIDEAU, RATON, and RACINE “wrist, curtain, baby rat, root”). Trials with RTs more than 2.5 standard deviations below or above the individual means (4.5%) and trials in which participants made mistakes (19.5%) were not included in the RT analysis. Means and Standard Deviations (SDs) of RTs and ERs are summarized in Table 1.
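The RT trimming rule can be illustrated with a short sketch. It assumes that the mean and SD are computed per participant over correct trials and that the sample SD (n − 1 denominator) is used; the text does not specify either detail. The RT values below are invented for illustration and are not data from the study.

```python
from statistics import mean, stdev

def trim_rts(rts, criterion=2.5):
    """Keep RTs lying within `criterion` SDs of this participant's mean RT."""
    m = mean(rts)
    sd = stdev(rts)  # sample SD (n - 1 denominator); an assumption
    low, high = m - criterion * sd, m + criterion * sd
    return [rt for rt in rts if low <= rt <= high]

# Invented correct-trial RTs (ms) for one hypothetical participant.
participant_rts = [900, 950, 1000, 1050, 1100, 980, 1020, 990, 1010, 3000]
kept = trim_rts(participant_rts)
excluded = [rt for rt in participant_rts if rt not in kept]
```

Because a single extreme value inflates both the mean and the SD, this kind of criterion is conservative with small trial counts; only the very slow 3000 ms trial falls outside the bounds in this made-up example.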

TABLE 1

Table 1. Means and Standard Deviations (SDs) of Response Times (RTs) and Error Rates (ERs) depending on the number of voices in the background and the semantic link between prime and target in Experiment 1.

The ANOVA by participants first revealed a significant main effect of the Number of Voices: participants were faster [F₁(1, 23) = 4.25, p = 0.05] and more accurate [F₁(1, 23) = 9.12, p < 0.01] to identify targets in the 1SC condition (M_RT = 1008 ms, SD = 157; M_ER = 11.7%, SD = 10.9) than in the 2SC condition (M_RT = 1042 ms, SD = 161; M_ER = 20.1%, SD = 16.5). The Item analysis, however, did not highlight an effect of the Number of Voices on RT [F₂(1, 43) = 1.9, p = 0.1], although target words were better categorized as words in the 1SC condition [F₂(1, 43) = 13.43, p < 0.001].

The main effect of Semantic Link was also significant for RTs [F₁(1, 23) = 14.24, p < 0.001; F₂(1, 43) = 4.63, p < 0.05]: participants responded faster if targets shared semantic features with the prime (M = 997 ms, SD = 169) than if they did not (M = 1053 ms, SD = 145), resulting in a 56 ms priming effect. This effect was also significant for ERs in the participant analysis [F₁(1, 23) = 3.93, p = 0.05], but there was only a trend in the item analysis [F₂(1, 43) = 3.07, p < 0.10]. Participants tended to be more accurate in the related condition (M = 15.4%, SD = 15.4) than in the unrelated condition (M = 18.55%, SD = 13.2). There was no significant interaction between the two factors for RTs (F₁ < 1; F₂ < 1) or ERs [F₁(1, 23) = 2.34, n.s.; F₂(1, 43) = 1.97, n.s.], suggesting that the semantic priming effect was not modulated by the Number of Voices (one or two) in the background.

Discussion

These results first highlight that participants were slowed by the increase in the number of voices in the background. This effect is most likely attributable to enhanced target masking in the two-voice condition (Brungart, 2001; Brungart et al., 2001). Interestingly, participants' performance was improved by the semantic relationship between the prime and target, and this effect was independent of the number of voices, suggesting that the increase in masking from one to two background voices was not sufficient to prevent semantic processing. However, in this first experiment, the prime was salient in both conditions (1 SC voice and no SI voice, or 2 SC voices and no SI voice). To test whether participants could still take advantage of the semantic relationship between target and prime if the intelligibility of the SC voices was further decreased, we conducted a second experiment in which an SI voice was added to each background.

Experiment 2

This second experiment aimed to investigate whether the semantic priming effect would resist increased masking. An SI voice was therefore added to each background. This voice pronounced words sharing no semantic features with each other or with the target word, whatever the condition. The purpose was to use the same material and procedure as in Experiment 1 with the addition of a masker on the SC voices. In Experiment 2, backgrounds were composed of 2 voices (1 SC voice + 1 SI voice) or 3 voices (2 SC voices + 1 SI voice). A deleterious effect of the number of voices on participants' performance was predicted, and we expected that the presence of an SI voice would not affect the semantic priming effect if this latter effect is automatic.

Another change was made in Experiment 2 regarding target items. In Experiment 1, target items were pronounced by a female speaker and were, consequently, difficult to detect among the other female speakers (S1 and S2). These difficulties might partly explain the low accuracy and long response times to target words inserted in babbles that were composed of only one or two voices. To avoid stream segregation difficulties (Festen and Plomp, 1990; Brungart et al., 2001), target items were therefore pronounced by a male speaker (Target Speaker 2; TS2) in the two following experiments.