
ORIGINAL RESEARCH article

Front. Psychol., 18 February 2021
Sec. Auditory Cognitive Neuroscience

Do Musicians and Non-musicians Differ in Speech-on-Speech Processing?

  • 1Department of Otorhinolaryngology/Head and Neck Surgery, University Medical Center Groningen, University of Groningen, Groningen, Netherlands
  • 2Research School of Behavioral and Cognitive Neurosciences, Graduate School of Medical Sciences, University of Groningen, Groningen, Netherlands

Earlier studies have shown that musically trained individuals may have a benefit in adverse listening situations when compared to non-musicians, especially in speech-on-speech perception. However, the literature provides mostly conflicting results. In the current study, by employing different measures of spoken language processing, we aimed to test whether we could capture potential differences between musicians and non-musicians in speech-on-speech processing. We used an offline measure of speech perception (sentence recall task), which reveals a post-task response, and online measures of real time spoken language processing: gaze-tracking and pupillometry. We used stimuli of comparable complexity across both paradigms and tested the same groups of participants. In the sentence recall task, musicians recalled more words correctly than non-musicians. In the eye-tracking experiment, both groups showed reduced fixations to the target and competitor words’ images as the level of speech maskers increased. The time course of gaze fixations to the competitor did not differ between groups in the speech-in-quiet condition, while the time course dynamics did differ between groups as the two-talker masker was added to the target signal. As the level of the two-talker masker increased, musicians showed reduced lexical competition as indicated by the gaze fixations to the competitor. The pupil dilation data showed differences mainly in one target-to-masker ratio. This does not allow us to draw conclusions regarding potential differences in the use of cognitive resources between groups. Overall, the eye-tracking measure enabled us to observe that musicians may be using a different strategy than non-musicians to attain spoken word recognition as the noise level increased. However, further investigation with more fine-grained alignment between the processes captured by online and offline measures is necessary to establish whether musicians differ due to better cognitive control or sound processing.

Introduction

Musical training may grant normal-hearing listeners an advantage in auditory tasks, not only for tasks related to music, but also in encoding and processing speech in adverse listening conditions (e.g., Kraus and Chandrasekaran, 2010), such as in the presence of competing talkers (e.g., Başkent and Gaudrain, 2016). However, various studies that investigated whether long-term musical training improves speech-in-noise processing yielded conflicting results (for a review, see Coffey et al., 2017). Inconsistency in findings is fueled not only by the challenges to reproducibility across paradigms and labs but also by different levels of complexity in signal-noise properties (Swaminathan et al., 2015), as well as considerable individual variability that is inherent to speech processing in adverse listening conditions (e.g., Peelle, 2018). In the current study, we used three measures, one that gives a post-task measure (offline), and two online measures that capture real-time processing of speech perception (gaze-fixations) and cognitive resources employed (pupillometry). We used stimuli of comparable complexity across different measures of spoken language processing within the same group of participants. With this approach, we aimed to investigate whether we can find differences in speech-on-speech processing between musicians and non-musicians consistent within groups across all measures.

Music and language both engage complex cognitive processes that result from the interaction of attention, memory, motor, and auditory systems. Musical training requires extensive use of domain-specific, as well as domain-general auditory processes (Asaridou and McQueen, 2013), and has been investigated as a potential learning process that increases cross-domain plasticity (Pantev and Herholz, 2011). According to Patel (2014), musical training may improve cross-domain plasticity and strengthen the shared neural resources required for processing both music and speech, and this may enhance auditory attention and auditory working memory (Besson et al., 2011). For normal hearing listeners with typical language development, speech processing without background noise is considered an automatic process (Shiffrin and Schneider, 1977) that does not pose extra demands on cognitive resources (Marslen-Wilson and Tyler, 1981). Nonetheless, in adverse listening conditions, speech processing may become demanding (Mattys et al., 2012; McQueen and Huettig, 2012; Heald and Nusbaum, 2014). Accordingly, musical training has been suggested as a means to enrich speech-in-noise perception, through the recruitment of these strengthened cross-domain resources and mechanisms when processing becomes challenging (Strait and Kraus, 2011).

In the past decades, there has been an increase in the number of studies that investigate speech-on-speech processing in musicians and non-musicians (Parbery-Clark et al., 2009; Ruggles et al., 2014; Boebinger et al., 2015; Swaminathan et al., 2015; Başkent and Gaudrain, 2016; Clayton et al., 2016; Madsen et al., 2017, 2019; Yates et al., 2019; Bidelman and Yoo, 2020). However, this line of research has provided ambiguous results. Parbery-Clark et al. (2009) showed that musicians performed better than non-musicians on two often-used audiological measures: the Hearing in Noise Test (HINT; Nilsson et al., 1994), which measures sentence recognition in speech-shaped noise at a variety of adaptive signal-to-noise ratios, and the QuickSIN (Killion et al., 2004), which measures sentence recall in four-talker babble presented at fixed target-to-masker ratios (TMRs). In contrast, Ruggles et al. (2014) failed to replicate these results with a larger sample size. Particularly for speech-on-speech perception, Başkent and Gaudrain (2016) showed in a sentence recall task that musicians correctly recalled more words than non-musicians in a single-talker masker. Similarly, Deroche et al. (2017) observed that musicians had better speech reception thresholds in a two-talker masker, but not in non-speech maskers. On the other hand, Madsen et al. (2017, 2019) did not find any group differences in a variety of speech-in-noise and speech-on-speech conditions with a larger sample size.

The nature of the demands imposed by both the task and the stimuli may play a role in the variability across the different results reported in the literature. Sentence recall without any background noise involves encoding and retention of auditory information. Without a secondary task, recall performance is automatic and effortless, with no involvement of central executive functions (Baddeley et al., 2009). In the presence of a secondary task, attention plays a role in the retention and retrieval of information (Treisman, 1964). Inhibiting the interference from the speech of competing talkers or processing the acoustic information in the target signal may serve as such a secondary task, and thus speech-on-speech perception requires additional central executive involvement. The interference from the masker also depends on the type of the background masker itself, i.e., the lexical content of the speech masker (Helfer and Jesse, 2015), the amount of linguistic interference from the speech masker (Calandruccio et al., 2010), and the number of talkers in the speech masker, such as 1, 2, and 4 vs. 16 talkers (Rosen et al., 2013), and this may lead to variations in the chunking strategies used to inhibit the background talkers (Miller, 1947; Bronkhorst, 2015; Calandruccio et al., 2017). Thus, regardless of musical experience, the type of target and masker properties used across different studies and the nature of the task demands (i.e., recall vs. recognition) might play a role in the divergent findings in the literature.

In contrast to most studies that employed offline behavioral measures, studies that sought musician/non-musician differences using online measures have more consistently found a difference between groups. Offline measures, such as accuracy of responses, are obtained after the task and do not capture an individual’s real-time processing of the stimuli. Online measures provide real-time information while spoken language processing happens (Godfroid, 2019). Most online measures that have been employed in testing differences between musicians and non-musicians in speech-in-noise related tasks used neuroimaging methods, such as EEG (Parbery-Clark et al., 2009; Meha-Bettison et al., 2018) or MEG (Puschmann et al., 2018). These methods require precise control over stimuli, such as phonemes embedded within broadband noise (Parbery-Clark et al., 2012; Du and Zatorre, 2017) or within multi-talker babble noise (Strait and Kraus, 2011), and hence measure lower-level sound encoding. In addition, many of these studies used different methods (e.g., ABR, EEG, and MEG) and even different dependent variables within a single method (e.g., latencies vs. phase-locking vs. peak magnitude in ABR).

Eye-tracking is another online method, one that captures real-time, automatic, and anticipatory information processing (Allopenna et al., 1998). Cooper (1974) showed that when listeners are presented simultaneously with spoken language and a screen that depicts objects mentioned in the utterance, listeners perform a visual search on the screen and fixate their gaze upon the objects mentioned in the utterance. Eye movements and gaze fixations thus reveal the incremental processing of spoken language as the speech signal unfolds over time. In addition to capturing gaze fixations, it is also possible to record pupil responses with an eye-tracker. Pupil dilation is taken to reflect changes in the engagement of cognitive resources, alongside the quick ocular reflexes to changes in luminance (Beatty, 1982). Changes in pupil dilation have been used as a measure of attention and effort (Kahneman, 1973) and have also been applied to record mental effort in language processing (Kuchinsky et al., 2013; Schmidtke, 2014; Wagner et al., 2016; Nagels et al., 2020). Whereas pupil responses due to changes in luminance take about 150–400 ms (Bergamin et al., 2003), pupil responses that relate to cognitive processing are slower and can take about 1 s (Hoeks and Levelt, 1993; Wierda et al., 2012). An increase in pupil dilation is often considered to reflect increased cognitive effort and increased allocation of attentional resources.

In the current study, we used two online measures: (1) gaze-tracking, which provides insight into spoken word recognition in real time and (2) pupillometry, which provides insight into the employment of cognitive resources in spoken word recognition in the presence of speech maskers. We also implemented a sentence recall task that gives an offline measure, utilizing similar sets of stimuli in both online and offline measures, and the same groups of non-musician and musician participants. The purpose was to test whether the results from the online and offline measures would all reveal processing differences between groups.

The first experiment is a sentence recall task, in which participants listen to, recall, and repeat target Dutch sentences presented with two-talker Dutch sentence maskers at different TMRs. This offline task provides an estimate of intelligibility by measuring the percentage of correctly recalled words, similar to the study reported by Başkent and Gaudrain (2016). These authors showed that musicians overall had a larger number of correctly identified words than non-musicians when the target sentence was embedded in a single-talker masker. We aimed to further test whether the difference these authors observed would also be present when using a similar task with slightly different sets of stimuli and more effective masking conditions, with parameters adjusted so that performance did not reach ceiling across the different TMRs. According to Rosen et al. (2013), masking effects differ when the number of background talkers changes from 1 to 2 or 4, and Calandruccio et al. (2017) showed that a two-talker masker was the most effective masker. The similarity between the masker and target also plays a role in how strong the masking effect can be. The more dissimilar the two streams are in terms of the target and masker speakers (Brungart et al., 2001), the target and masker speakers’ voices (Darwin et al., 2003), the language of the target and the masker (Lecumberri and Cooke, 2006), and the semantic content of the target and the masker (Calandruccio et al., 2010), the easier it becomes to understand the target speech stream. Thus, in the current study, based on the literature and confirmed by an initial pilot study for sufficient masking effects, we decided to use a two-talker masker as background noise, with masker talkers of the same sex as the target speaker. We hypothesized that if musical training benefits speech-on-speech perception, as some of the literature has suggested, musicians would recall more words correctly than non-musicians in the two-talker masked sentence recall task.

The second experiment is an eye-tracking experiment that employed the visual-world paradigm (VWP; Eberhard et al., 1995; Tanenhaus et al., 1995; Allopenna et al., 1998; Salverda and Tanenhaus, 2017), in which we measured participants’ gaze fixations and pupil dilation (Wagner et al., 2016; Nagels et al., 2020). In the VWP, while listening to target sentences embedded in two-talker masker sentences, participants visually search for and choose the image of a target word uttered by the target speaker. Spoken word recognition involves ambiguity resolution among lexically related items (see, e.g., the TRACE model: McClelland and Elman, 1986; Salverda et al., 2003). As listeners hear the acoustic speech cues coming from the target speaker, they form and continually fine-tune hypotheses regarding the target word. The displayed images include the target word, a phonological competitor that shares an onset-overlapping segment with the target word (Figure 1), and two unrelated distractors. The linking hypothesis is that the shifts in visual attention among the objects displayed on the screen are a consequence of what is heard in the utterance and can capture real-time spoken language processing (Cooper, 1974; Allopenna et al., 1998; Salverda and Tanenhaus, 2017). Hence, the time course of gaze fixations to the images of the target and competitor words can capture the time course of the continual integration of acoustic information, while the signal is mapped to meaning. Additionally, changes in pupil dilation reflect how the cognitive resources allocated for spoken language processing are affected by the presence of the two-talker masker.

FIGURE 1

Figure 1. Example of the screen presented during the experiment, where ham (upper left) is the competitor word and hamster (lower right) is the target word. The illustrations were made by Jop Luberti for the purpose of this study. This image is published under the CC BY NC 4.0 license.

In the second experiment, we aimed to answer three main questions: (1) to what extent would speech maskers affect the processes underlying spoken word recognition, such as lexical decision making, as captured by the time course of gaze fixations, (2) how would the effect of masking on spoken word recognition differ between musicians and non-musicians, and (3) would musicians and non-musicians allocate their cognitive resources differently when processing speech in a two-talker masker vs. speech-in-quiet, as captured by the changes in pupil dilation. For the first question, we hypothesized that there would be an effect of masking on the time course of gaze fixations, in line with previous literature that utilized a similar paradigm to test the effect of signal degradation on lexical decision making (McQueen and Huettig, 2012; Wagner et al., 2016). For the second question, we assumed that musicians might be better able to focus their attention on the target or to suppress the background noise. If that is the case, we hypothesized that the time course of gaze fixations to the competitor would differ less between the speech-in-quiet and speech in two-talker masker conditions for musicians than for non-musicians. For the third question, concerning the effect of masking on the employment of cognitive resources as captured by pupil dilation, we expected to find differences between musicians and non-musicians in the two-talker masker condition but not in speech-in-quiet, as this would be the control condition, where spoken language processing is assumed to occur automatically and in a similar manner in both groups.

Experiment 1: Sentence Recall Experiment

In this experiment, participants repeated a grammatically correct and meaningful Dutch target sentence embedded within two-talker maskers that also consisted of meaningful Dutch sentences. Different utterances of the same target and masker speakers were utilized in both this experiment and the eye-tracking experiment in order for the stimuli to be consistent between the online and the offline measures. Participants completed the eye-tracking experiment (Experiment 2) first to become familiar with the voice of the target speaker.

Method

Participants

Sixteen musicians (10 females) and 17 non-musicians (11 females) from Groningen, the Netherlands, participated in the study and met the inclusion criteria. All participants were native Dutch speakers who were raised monolingually (i.e., did not have a second language spoken by their caregivers at home). The musicians were selected based on the following criteria (Parbery-Clark et al., 2009; Fuller et al., 2014; Başkent and Gaudrain, 2016): having started musical training at or before the age of seven, having had at least 10 years of musical training, and having actively practiced music for at least 3 years prior to the study (the years of musical training do not refer to active years of engagement, but rather indicate the amount of formal training). Accordingly, the non-musician criteria were the following: not meeting all of the musician criteria and, additionally, not having more than 3 years of musical training (see Table 1 below for the musical background information of participants).

TABLE 1

Table 1. Musical training background (in years).

To ensure that participants had normal hearing, we measured pure-tone audiometric thresholds and confirmed that they were <20 dB HL at audiometric frequencies from 250 to 4,000 Hz in both ears. We used criteria that differed from standard clinical audiometric measurements, which also include 8,000 Hz, since some musicians had unilaterally higher thresholds than 20 dB HL due to playing an instrument close to the ear (e.g., the violin). All participants reported having normal or corrected-to-normal vision, i.e., using contact lenses or glasses, and having no language disorders. The study was approved by the Medical Ethics Committee of the University Medical Center Groningen. All participants were given information about the study prior to participation and provided written consent; after that, their hearing was screened and the music and language background inclusion criteria were checked before the experiments started. At the end of the study, they were given financial compensation for participation according to department guidelines.

Apparatus

Participants were seated at a 50 cm distance in front of a 17-inch computer screen in a sound attenuated booth. Both experiments were set up and run in MATLAB (The MathWorks), using the PsychToolBox (Kleiner et al., 2007). The auditory stimuli were presented through an AudioFire4 sound card (Echo Digital Audio Corporation) and played on a Tannoy Precision 8D speaker (Tannoy Ltd) located behind the computer screen, in front of which the participants were seated. For the sentence recall task, participants followed instructions on the computer screen and used a keyboard to proceed through the experiment. Their verbal responses were recorded on another laptop using Audacity (version 2.1.3.0).

Materials and Design

Twenty-eight semantically neutral Dutch target sentences were embedded in two-talker maskers. We used the filler items recorded for Wagner et al. (2016) as the target sentences. Target sentences were uttered by a female Dutch speaker without any regional accent (f0 mean = 173.91 Hz, sd = 44.03 Hz). Each target sentence contained seven to nine words.

The masker sentence set consisted of meaningful Dutch sentences from the corpus of Versfeld et al. (2000), uttered by a different female speaker than the target speaker (f0 mean: 160.01 Hz, sd: 54.83 Hz). This female speaker’s utterances were used to generate two-talker maskers. All target and masker sentences were root-mean-square normalized in intensity.

The target sentences were embedded within the two-talker maskers, where the target sentence onset was 500 ms after the masker onset and the target offset was 500 ms before the masker offset, similar to Başkent and Gaudrain (2016). The 200 ms at the beginning and at the end of the maskers were ramped up and down, respectively, to avoid audible clicks. If the duration of a single masker sentence was not sufficient to cover the target duration plus 1,000 ms, another sentence was added to the masker sequence to reach the required total duration.

Four lists of sentences were generated corresponding to the four levels of TMRs. The TMR levels (−3, −5, −7, and −9 dB) were chosen based on Calandruccio et al. (2017) and our own pilot study. Each TMR condition contained seven sentences, with either 54 or 55 words in total. The lists were generated so that each sentence was presented in all TMR conditions across participants. The two-talker masker was fixed at 75 dB SPL presentation level, while the single-talker target’s presentation level was adjusted depending on the TMR condition.
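To make the level manipulation concrete, the following is a minimal R sketch of how an RMS-normalized target can be scaled to a given TMR against a masker held at the fixed reference level; the signal vectors, sampling rate, and function name are hypothetical placeholders (the actual stimuli were prepared for MATLAB playback).

```r
# Illustrative sketch (not the authors' code): scale the target so that its
# RMS level relative to the masker equals the desired TMR in dB, with the
# masker kept at the fixed reference presentation level.
set_tmr <- function(target, masker, tmr_db) {
  rms <- function(x) sqrt(mean(x^2))
  # after scaling: 20 * log10(rms(scaled_target) / rms(masker)) == tmr_db
  target * (rms(masker) / rms(target)) * 10^(tmr_db / 20)
}

# Example with white-noise placeholders standing in for the speech signals
fs <- 44100                         # hypothetical sampling rate
target <- rnorm(2 * fs)             # 2 s "target sentence"
masker <- rnorm(2 * fs)             # 2 s "two-talker masker" segment
mixture <- set_tmr(target, masker, tmr_db = -5) + masker
```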

Procedure

The experiment started with a practice phase, in which the participants completed four trials corresponding to the four TMR levels included in the experimental phase. The participants completed the eye-tracking experiment first, to become familiar with the voices of the target and masker speakers before moving on to the sentence recall task. To help participants identify which speech stream in the target-masker combination was the target speech, it was explained that the target voice was the same as the female speaker from the eye-tracking experiment and that the target speaker began to speak later than the masker speaker. During both the practice and experimental phases, the participants were instructed to verbally repeat the utterance of the target speaker immediately at the end of each trial. The participants’ verbal responses were recorded to be coded later for the correctly recalled words. The experimental phase contained 28 sentences in total, all presented in the presence of a two-talker masker (seven sentences per list × four TMR levels). The TMR conditions were presented in a random order across participants to prevent a potential effect of the order of presentation.

Data Analysis

Two independent native Dutch-speaking student assistants assessed the correctness of the recalled words from the recordings of the participant responses. The student assistants were blind to the hypotheses and rationale of the experiment but not to the group membership of the participants, since the group assignment was part of how participants were coded. Every word within the sentence was used to calculate the correct scores. Morphological changes (i.e., in tense and plural marking) were taken into account when giving a correct score. A response was still considered correct if minor mistakes were made, such as using the unstressed form of a pronoun (zij – ze [she]), a different form of a modal (can–could), or the diminutive form of a noun (addition of -je). The percentage of correctly recalled words per trial per participant was calculated by dividing the number of correctly recalled words by the total number of words in the sentence.

R (R Core Team, 2013) and lme4 (Bates et al., 2015) were used to perform a generalized linear mixed effects analysis. The optimal model was determined through iterative backward fitting, with model comparison via χ2 tests and evaluation of Akaike’s information criterion (Akaike, 1974; Baayen, 2008). The most complex model, including all fixed effects with interactions and a maximal random effects structure (Barr et al., 2013), was constructed first. Then, the interaction term was removed to check for the effect of the interaction. If the interaction was not significant, the main effect of each fixed effect was evaluated by removing it from the full model. We followed Barr et al. (2013) in simplifying the random effects structure until the model converged. The best model was determined by model comparison and evaluated with likelihood ratio (chi-square) tests via ANOVA.
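As an illustration of this backward-fitting procedure, the sketch below shows one plausible realization in R with lme4, assuming a hypothetical data frame recall with per-trial word counts; the exact model structure the authors retained is reported in the Results and Supplementary Materials.

```r
library(lme4)

# Hypothetical data frame 'recall' with columns: n_correct, n_words,
# TMR (factor), Group (factor), subject, sentence.
full <- glmer(cbind(n_correct, n_words - n_correct) ~ TMR * Group +
                (1 | subject) + (1 | sentence),
              data = recall, family = binomial)

# Test the interaction by comparing against a model without it
no_int <- update(full, . ~ . - TMR:Group)
anova(no_int, full)                   # likelihood ratio test and AIC comparison

# If the interaction is not supported, test each main effect in the same way
anova(update(no_int, . ~ . - TMR), no_int)
anova(update(no_int, . ~ . - Group), no_int)
```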

Results

Figure 2 illustrates the percentage of correctly recalled words averaged across participants and across the different TMR conditions. The percentage of correctly recalled words was calculated separately for each participant, target sentence, and TMR condition, and was used as the dependent measure in the generalized linear mixed effects model. TMR Condition (−3, −5, −7, and −9 dB) and Group (musician vs. non-musician) were entered as fixed effects. The final model that converged included two random intercepts: per subject and per sentence. Step-wise model comparison revealed that the interaction term between TMR Condition and Group did not improve the model significantly [χ2(3) = 2.58, p = 0.46, AIC difference = 3]. The main effects of both TMR Condition [χ2(3) = 181, p < 0.001, AIC difference = −175] and Group [χ2(1) = 4.3, p = 0.038, AIC difference = −2] were significant. Table 2 shows the fixed effect model estimates converted to predicted probabilities using the plogis function in R (R Core Team, 2013; see also the Supplementary Materials for the full model summary). Table 3 shows the 95% confidence intervals, which were determined via bootstrap resampling based on 1,000 simulations. The bootstrapped confidence intervals do not cross zero, indicating that TMR Condition and Group are significant predictors in our model. Overall, the two groups’ performances differed, with musicians recalling more words correctly across TMR conditions. Also, as the TMR value became lower, both groups’ recall performance became worse. The interaction between Group and TMR Condition was not significant, thus not supporting the claim that musicians’ performance would improve more than non-musicians’ performance as the task became more difficult.
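For reference, the back-transformation with plogis and the bootstrapped confidence intervals could be obtained as in the following sketch, where final_model and recall stand for the hypothetical objects from the model-selection sketch above.

```r
# Predicted recall probabilities per TMR x Group cell, back-transformed from
# the logit scale with plogis (fixed effects only, re.form = NA)
newdata <- expand.grid(TMR = levels(recall$TMR), Group = levels(recall$Group))
newdata$p_correct <- plogis(predict(final_model, newdata = newdata, re.form = NA))

# Bootstrapped 95% confidence intervals for the fixed effects (1,000 simulations)
confint(final_model, method = "boot", nsim = 1000, parm = "beta_")
```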

FIGURE 2

Figure 2. Boxplots represent the percentage scores of correctly recalled words for musicians (dark red) and non-musicians (light pink) across Target-to-Masker Ratios (from −9 dB most left to −3 dB most right). The horizontal lines in the boxes represent the median value and the dots represent data points for individual participants.

TABLE 2

Table 2. The percentage of correct recall as predicted by the final model across target-to-masker-ratio (TMR) conditions for both groups.

TABLE 3

Table 3. Bootstrapped estimates for the 95% confidence intervals, based on resampling of the sentence recall model.

Experiment 2: Eye-Tracking Experiment

In the eye-tracking experiment (online measure), implementing the visual world paradigm (Cooper, 1974; Salverda and Tanenhaus, 2017), participants identified a target word, uttered in a sentence by the target speaker, among four images displayed on the screen (Figure 1). This paradigm makes it possible to measure the time course of lexical decision making by capturing the gaze fixations to the target and competitor images. The target speaker’s utterance was either presented without background talkers (speech-in-quiet) or embedded within a two-talker masker. Gaze fixations enabled us to capture how the process of lexical decision making is affected by two-talker maskers. In addition, we recorded participants’ pupil dilation, which is taken to reflect cognitive effort and changes in the allocation of cognitive resources imposed by the two-talker maskers, and how these differ between groups.

Method

Participants

The same participants as in the sentence recall task also completed the visual world paradigm.

Apparatus

The experiment was conducted in the same sound attenuated booth as the sentence recall task, with the same setup and the additional use of an eye-tracker. An EyeLink II head-mounted eye-tracker (SR Research), with a sampling rate of 250 Hz, was used to collect gaze and pupil responses. The presentation of the stimuli was controlled with MATLAB (The MathWorks) and the PsychToolBox (Kleiner et al., 2007). The gaze fixations and pupil responses were recorded utilizing the Eyelink Toolbox for MATLAB (Cornelissen et al., 2002). The eye-tracker had two cameras that were placed in front of the participants’ eyes. Since the saccades of the two eyes are synchronized, recordings were collected from only one eye; this is sufficient to capture gaze fixations, and monocular recordings are more accurate than binocular recordings (Godfroid, 2019). The camera in front of the participants’ right eye was used to record the gaze movements and pupil dilation. The lighting in the room was kept constant throughout the experiment.

Materials and Design

Utterances of the same target and masker speakers as in the sentence recall task were used. The sentence list contained different sentences than the sentence recall task to avoid stimulus repetition; i.e., the participants were not familiar with the utterances, but they were with the voices of the target and masker speakers. The target speaker’s sentence set was taken from the same database as in the sentence recall task, which was also previously used in Wagner et al. (2016). The target sentences consisted of 36 grammatical, semantically neutral Dutch sentences. Each target sentence contained a target word that was either monosyllabic or polysyllabic and that shared an onset-overlapping syllable with a competitor word (Table 4).

TABLE 4

Table 4. Example of sentences containing both a polysyllabic and a monosyllabic target word.

Each sentence was played through a loudspeaker behind the screen while four pictures were displayed on the screen simultaneously. The images utilized in the visual world paradigm were taken from the same data set that was used and validated previously in Wagner et al. (2016). The images were displayed in four regions of interest at the four corners of the screen, defined by dividing the screen with three vertical and three horizontal lines, with the middle sections being narrower than the others. A fixation cross appeared in the middle of the screen in each trial. The images consisted of black-and-white static drawings, and the background of the screen was always kept white, so that the color of the screen did not change, to prevent pupil responses that could be caused by changes in luminance. In each trial, the image referring to the target word, its phonological competitor, and two distractors that were semantically and phonologically unrelated to the target and the competitor word were displayed simultaneously as the auditory stimulus was presented. For example, if the target word was hamster, the phonological competitor would be ham (Figure 1).

Target sentences were presented either in quiet (without any background talker) or embedded within two-talker maskers at either 0 or −5 dB TMR. These TMRs were within the range of TMRs tested in the sentence recall task but differed slightly in the exact values, as they were newly chosen based on a pilot study in order to ensure that participants achieved above 75% accuracy in all conditions of the visual world paradigm. This was necessary to elicit a sufficient number of correct trials that could be included in the analysis of the gaze fixation data. We used the same procedure as in the sentence recall task to generate target sentences embedded within two-talker maskers.

The experiment had two blocks: speech-in-quiet and speech in two-talker masker. The speech-in-quiet block was presented first, and the two masked conditions (0 and −5 dB TMR) were presented in random order within one block. In order to counterbalance the presentation of all sentences across conditions and participants, we generated six lists. Each list contained 12 sentences and was assigned to one of the three conditions (speech-in-quiet, 0 dB TMR, or −5 dB TMR).

Procedure

Prior to starting the experiment, and similar to procedures of Wagner et al. (2016), participants were shown the pictures utilized in the visual world paradigm and asked to name them, in order to make sure they would correctly identify the images during the experiment. If they named a picture differently, they were told by the experimenter how it would be referred to in the experiment. After the familiarization was completed, participants were asked to move to the booth. Before data collection, the eye-tracker was calibrated for each participant. Throughout the experiment, there was a drift check every five trials and if necessary the eye-tracker was recalibrated.

The experiment consisted of a practice phase and two experimental phases: speech-in-quiet and speech in two-talker masker. In the practice phase, participants completed four trials each in the quiet and masked conditions. In the experimental phase, participants heard 12 sentences in the speech-in-quiet block and 24 sentences in the speech in two-talker masker block. The sentences within a block were presented in a random order for each participant. Each trial began with a red cross appearing in the middle of the screen, followed by the simultaneous presentation of the auditory and visual stimuli. Participants were instructed to pay attention, throughout the experiment, to the voice of the speaker they heard in the speech-in-quiet condition, which was the target speaker. They used the mouse to choose the image of the target word from the four pictures displayed on the screen. Participants were also instructed to blink as little as possible during the trial. Between trials, they were given a written instruction to blink and were asked to press the space bar when ready to start the next trial. They were not given feedback on their response accuracy. Following the practice phase, the experimental phase always started with the speech-in-quiet block for the participants to become familiar with the voice of the target speaker. Upon completing the speech-in-quiet condition, participants could take a break, and the eye-tracker was recalibrated. For the speech in two-talker masker condition, the two TMR levels were presented in random order within the block.

Data Analysis

Pre-Processing

As aimed for by design and confirmed by a pilot study, all participants scored above 75% accuracy in target word picture detection in all conditions. Similar to Wagner et al. (2016), trials with inaccurate target detection (in total 5.3% of all trials – musicians 4.90%, non-musicians 5.67%, with no significant difference between groups) and trials that contained blinks longer than 300 ms (1.00% of all accurate trials) were excluded from the data analysis. Blinks shorter than 300 ms were linearly interpolated from the median value of the 25 samples preceding the blink to the median value of the 25 samples following the blink. Following the interpolation procedure, the data were binned into intervals of 20 ms by averaging five consecutive samples.
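A minimal sketch of these two pre-processing steps is given below, with hypothetical variable names (the authors’ pipeline was implemented around MATLAB and the Eyelink Toolbox).

```r
# Linearly interpolate a blink between the medians of the 25 samples
# preceding and following it (indices refer to samples of a 250 Hz pupil trace)
interpolate_blink <- function(pupil, blink_start, blink_end, n = 25) {
  pre  <- median(pupil[(blink_start - n):(blink_start - 1)], na.rm = TRUE)
  post <- median(pupil[(blink_end + 1):(blink_end + n)], na.rm = TRUE)
  pupil[blink_start:blink_end] <- seq(pre, post,
                                      length.out = blink_end - blink_start + 1)
  pupil
}

# Bin to 20 ms by averaging five consecutive samples (5 x 4 ms at 250 Hz)
bin_20ms <- function(x) as.numeric(tapply(x, (seq_along(x) - 1) %/% 5, mean))
```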

Gaze Fixations

Four regions of interest were defined, corresponding to where the target, competitor, and two distractor images appeared, by dividing the screen into x-y coordinates. A fifth region existed between the four regions, along the lines of the fixation cross. The gaze fixations were recorded in these x-y coordinates, along with the pupil size, for every sample. At any given time, the fixation was coded as 1 in the region where it was observed and as 0 in the remaining regions. These responses were then used to calculate the proportions of gaze fixations to the images displayed and to generate the time course of gaze fixations as the speech information unfolded in time. Since it takes about 150–200 ms for a saccade to be planned and executed upon receiving auditory information (Hallett, 1986), the interval from 200 to 1,800 ms after the onset of the target word was included in the analysis of gaze fixations. The end of the interval was set at 1,800 ms, since by that point the gaze fixations to the target image are stable and the process of lexical decision making has been captured.
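The sketch below illustrates, under hypothetical column and object names, how such binary region-of-interest codes can be aggregated into fixation proportions per 20 ms bin within the 200–1,800 ms window.

```r
# 'samples' holds one row per binned gaze sample: subject, condition, trial,
# time_bin (ms relative to target onset), x, y. 'rois' is a named list of
# rectangles, e.g., rois$competitor = list(x1 = ..., x2 = ..., y1 = ..., y2 = ...).
in_roi <- function(x, y, r) x >= r$x1 & x <= r$x2 & y >= r$y1 & y <= r$y2

samples$competitor <- as.integer(in_roi(samples$x, samples$y, rois$competitor))
samples$target     <- as.integer(in_roi(samples$x, samples$y, rois$target))

# Proportion of fixations per subject, condition, and time bin,
# restricted to the 200-1,800 ms analysis window
win <- subset(samples, time_bin >= 200 & time_bin <= 1800)
props <- aggregate(cbind(competitor, target) ~ subject + condition + time_bin,
                   data = win, FUN = mean)
```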

The fixations to all images signify how the spoken language processing occurs and how it changes due to increased uncertainty when the two-talker masker is added to the signal. In the present experiment, however, we operationalized changes in lexical competition across conditions through changes in gaze fixations to the competitor image along the time course of lexical decision making. Therefore, the time course of gaze fixations to the competitor was modeled as time series data in growth curve models (Mirman, 2014). The gaze fixations to the competitor reflected information regarding the timing of lexical decision making, and the effect of the two-talker masker on this process within both groups. This multilevel regression enables modeling of change in the proportion of fixations across time by using orthogonal polynomials. R (R Core Team, 2013) and lme4 (Bates et al., 2015) were used to model the time curves within the interval 200–1,800 ms after the onset of the target word. The model selection procedure was identical to that of the sentence recall task. The time course curves for target and competitors were compared across conditions (speech-in-quiet, 0 dB TMR, and −5 dB TMR) and groups (musician vs. non-musician).

Pupil Dilation

In line with Wagner et al. (2019), we quantified changes in pupil dilation by computing event-related pupil dilation (ERPD) and we also quantified the changes relative to the resting state pupil dilation according to the formula below:

%ERPD = (observation − baseline) / baseline × 100

In the ERPD formula above, processes attributed to the resolution of lexical ambiguity between the target and the competitor can be computed by replacing “observation” with the pupil dilation data recorded between 0 and 3,000 ms after the onset of the target word. A longer time window than for the gaze fixations is selected, as pupil changes due to cognitive processes take about 1 s to occur (Hoeks and Levelt, 1993; Wierda et al., 2012). “Baseline” is replaced with the average pupil dilation measured pre-target, between −200 and 0 ms before the onset of the target word. The percentage of change in ERPD was calculated for each trial and participant.

In addition, we calculated the resting state normalized pre-target baselines to control for how the pre-target baselines changed in relation to the initial state of the participants before each block. The resting state baseline consisted of the average of 4 s of pupil data recorded before each experimental block began. The same ERPD formula as above was used to calculate the relative change in the pre-target baselines in relation to the resting baseline: “observation” was replaced with the pre-target baselines and “baseline” with the resting state baseline.
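Both normalizations follow the same formula; a minimal sketch, assuming hypothetical vectors of binned pupil samples for a single trial, is shown below.

```r
# Percentage of event-related pupil dilation relative to a baseline
erpd <- function(observation, baseline) {
  100 * (observation - mean(baseline)) / mean(baseline)
}

# Task-related ERPD: samples from 0-3,000 ms after target onset, relative to
# the mean pupil size in the -200-0 ms pre-target window
task_erpd <- erpd(pupil_target_window, pupil_pre_target)

# Drift of the pre-target baseline relative to the 4 s resting-state recording
# taken before the block started
baseline_change <- erpd(mean(pupil_pre_target), pupil_resting_state)
```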

We modeled the time course changes of ERPD as time series data in growth curve models (Mirman, 2014). The time window used to model the time curves was between 0 and 3,000 ms after the onset of the target word. The window of analysis starts earlier than that of the gaze fixations so that it is aligned with the end point of the baseline, which is taken as a point at which the listener has not yet heard the target word. This interval also differs from the gaze-fixation interval because the pupil response reflecting spoken language processing takes longer than the gaze fixations. The model selection procedure was identical to that described for the sentence recall task.

Results

Gaze Fixations

Figure 3 shows the proportions of gaze fixations averaged across participants to the targets and competitors across conditions: speech-in-quiet (left panel), TMR = 0 dB (middle panel), and TMR = −5 dB (right panel), and across groups: musicians (top panels) and non-musicians (bottom panels). The upper lines in green represent the proportion of fixations to the target word, the lower lines in purple represent the proportion of fixations to the competitor, and the gray lines represent the proportion of fixations to the distractor images, each shown with 95% confidence intervals. The overall certainty in decision making, reflected by the proportion of gaze fixations to the target, decreased gradually for both groups as the masker was added and the level of masking increased.

FIGURE 3

Figure 3. The raw data of time course curves of gaze fixations averaged across participants and items to both the target (green), competitor (purple), and distractors (gray) shown for all conditions (from left to right: speech-in-quiet, 0 dB TMR, and −5 dB TMR) and both groups (top panels: musicians, and bottom panels: non-musicians).

The competitor time course curves were modeled in a logistic regression, with the probability of fixations to the phonological competitor described by a fourth-order polynomial with the following terms: a linear term (the overall slope of the curve), a quadratic term (the symmetry of the rise and fall around a central inflection point, i.e., the peak of the curve), and cubic and quartic terms (the steepness of the curvature around the inflection points; Mirman, 2014). The model intercept referred to the overall average fixation proportion. The dependent variable in the model was the proportion of fixations to the competitor. The fixed effects in the model included Condition (speech-in-quiet, 0 dB TMR, and −5 dB TMR), Group (musician vs. non-musician), and the four polynomial terms used to define the time-course curves. Upon conducting iterative backward model selection, the final model included the four terms of the polynomial, as well as interactions between Condition and Group and a three-way interaction between Condition, Group, and each term of the polynomial function (see Supplementary Material for a summary of model estimates). The maximal random-effects structure that converged for the present analysis included by-subject and by-item random intercepts and random slopes for all polynomial terms. The three-way interaction between Condition, Group, and the four terms that describe the time course curves indicates that the probability of gaze fixations to the competitor changed between conditions differently for musicians and non-musicians on all the terms that describe the polynomial. The model fit was significantly improved by adding the interaction between Condition, Group, and the linear term [χ2(2) = 34.48, p < 0.001, AIC difference = −30.48], quadratic term [χ2(2) = 21, p < 0.001, AIC difference = −17.01], cubic term [χ2(2) = 26.26, p < 0.001, AIC difference = −22.26], and quartic term [χ2(2) = 13.80, p = 0.001, AIC difference = −9.81].
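A sketch of such a growth curve model in lme4 is given below, with hypothetical column names; fix is assumed to hold one row per subject, item, condition, and 20 ms bin, with comp coded 1 when the gaze fell in the competitor region.

```r
library(lme4)

# Orthogonal polynomial time terms over the 200-1,800 ms analysis window
ot <- poly(fix$time_bin, 4)
fix$ot1 <- ot[, 1]; fix$ot2 <- ot[, 2]; fix$ot3 <- ot[, 3]; fix$ot4 <- ot[, 4]

# Logistic growth curve model: Condition x Group x time-term interactions,
# with by-subject and by-item random intercepts and polynomial slopes
gca <- glmer(comp ~ (ot1 + ot2 + ot3 + ot4) * Condition * Group +
               (ot1 + ot2 + ot3 + ot4 | subject) +
               (ot1 + ot2 + ot3 + ot4 | item),
             data = fix, family = binomial,
             control = glmerControl(optimizer = "bobyqa"))

summary(gca)   # inspect the three-way Condition x Group x time-term interactions
```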

Figure 4 shows both the averaged proportion of fixations to the competitor (solid lines) and the probability of fixations to the competitor (dashed lines) from the fitted model for musicians (left) and non-musicians (right) across the different conditions (speech-in-quiet: gray, 0 dB TMR: turquoise, and −5 dB TMR: orange). In the speech-in-quiet condition, the two groups did not differ in the model intercept (β = 0.01, SE = 0.24, z = 0.03, p = 0.97). This was in line with our prediction that both groups would exhibit similar fixations to the competitor in speech-in-quiet. The overall average fixation proportions differed between the two groups as the intensity of the two-talker maskers increased. Figure 4 shows that the curve’s peak was more widely spread and lower in height at −5 dB TMR for musicians than for non-musicians. This difference is supported by the significance of the three-way interaction in the cubic (β = −3.03, SE = 0.61, z = −4.95, p < 0.001) and quartic terms (β = −2.00, SE = 0.60, z = −3.57, p < 0.001). This could indicate that musicians exhibited less lexical competition and, in turn, resolved the ambiguity faster by fixating on the target image earlier (see Figure 3) when the two-talker masker became higher in intensity at −5 dB TMR. The aggregated time course curves for non-musicians were not as smooth and did not fit the model as well as the data for musicians did in this condition. This suggests that the non-musician group behaved less homogeneously than the musicians. This difference in the −5 dB TMR condition could also indicate that the two groups use different strategies when processing speech in a two-talker masker, which could have affected the incremental process of lexical competition.

FIGURE 4

Figure 4. Probability of gaze fixations to the competitor averaged across participants and items shown for musicians (left panel) and non-musicians (right panel) for the three conditions (speech-in-quiet: gray, 0 dB TMR: turquoise, and −5 dB TMR: red) as observed in the data (solid lines) and predicted by the model (dashed lines).

Pupil Dilation

The percentage of change in event-related pupil dilation was first calculated per participant, trial, and condition, using the pre-target baseline in the ERPD formula. The time series of ERPDs was analyzed by means of growth curve analysis, in the same way as for the gaze fixations, by fitting third-order orthogonal polynomials. The dependent variable in the model was the ERPD calculated based on the pre-target baseline (−200 to 0 ms preceding the onset of the target word). The fixed effects in the final model included Condition (speech-in-quiet, 0 dB TMR, and −5 dB TMR), Group (musician vs. non-musician), and the three orthogonal terms used to describe the curves. The maximal random-effects structure that converged for the present analysis included by-subject and by-item random intercepts and random slopes for all polynomial terms. Upon conducting iterative backward model selection, the final model included the three terms of the polynomial, as well as interactions between Condition and Group and a three-way interaction between Condition, Group, and each term of the polynomial function (see Supplementary Material for a summary of final model estimates). The three-way interaction between Condition, Group, and the terms that describe the time course curve showed that musicians’ and non-musicians’ task ERPDs were affected differently across conditions. The model fit was significantly improved by adding the interaction between Condition, Group, and the linear term [χ2(2) = 23.91, p < 0.001, AIC difference = −19.91], quadratic term [χ2(2) = 36.01, p < 0.001, AIC difference = −32.01], and cubic term [χ2(2) = 16.05, p < 0.001, AIC difference = −12.05].

Figure 5 shows the percentage of change in ERPDs (solid lines with transparent filling) and the fitted model output (dashed lines) across time, for musicians (left) and non-musicians (right) across three conditions (speech-in-quiet: gray, 0 dB TMR: turquoise, and −5 dB TMR: orange). The three-way interaction suggests that both groups’ task-related pupil changes, which are assumed to reflect lexical decision making, differed between the speech-in-quiet and speech in two-talker masker conditions. Overall, the model intercept, which reflects the area under the curve, was highest for the speech-in-quiet condition and did not differ significantly between groups (β = −0.43, SE = 0.74, p = 0.56). There was a gradual decrease in the overall area under the curve from speech-in-quiet to −5 dB TMR and lastly to 0 dB TMR for both groups. The greatest difference in the area under the curve between the two groups was at 0 dB TMR (β = −0.66, SE = 0.05, p < 0.001). The change in the ERPD at 0 dB TMR for non-musicians occurred faster initially, reaching the peak earlier and releasing from the increase in pupil dilation more slowly over time when compared to musicians, as indicated by the three-way interaction of 0 dB TMR with the quadratic (β = 3.44, SE = 0.64, p < 0.001) and cubic terms (β = 2.55, SE = 0.64, p < 0.001).

FIGURE 5

Figure 5. Percentage of change in the observed pre-target baseline normalized event-related pupil dilation (ERPDs; transparent ribbon line with 95% CI) and as predicted by the model (solid lines), averaged across participants and items, shown for musicians (left panel) and non-musicians (right panel) across conditions (speech-in-quiet: gray, 0 dB TMR: turquoise, and −5 dB TMR: red).

We also examined the pre-target baseline changes in relation to the resting baseline between groups and conditions to further investigate the effect of processing speech in a two-talker masker. To compare the percentage of change relative to the resting state baseline, we first checked whether the resting state baselines at the beginning of each experimental block differed between groups. The mean resting baseline, consisting of the average pupil size recorded in pixels at the beginning of each experimental block, was 3001.53 for musicians (sd = 316.41) and 2908.74 for non-musicians (sd = 389.38) preceding the speech-in-quiet block, and 2902.89 for musicians (sd = 266.33) and 2781.51 for non-musicians (sd = 364.27) preceding the speech in two-talker masker block. We performed an equivalence test using a Bayesian t-test from the BayesFactor package in R (Rouder et al., 2009). The Bayes factor of the resting baseline comparison between groups was 0.41 before the speech-in-quiet block and 0.53 before the speech in two-talker masker block. According to Jeffreys (1961), Bayes factors between 0 and 1.10 are “not worth more than a bare mention.” Therefore, the resting state pupil baselines are taken not to differ between groups at the beginning of each experimental block.
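This equivalence check can be sketched with the BayesFactor package as follows, assuming hypothetical vectors that each hold one mean resting-state pupil size per participant.

```r
library(BayesFactor)

# Independent-samples Bayesian t-tests on resting-state baselines per block
bf_quiet  <- ttestBF(x = rest_musicians_quiet,  y = rest_nonmusicians_quiet)
bf_masker <- ttestBF(x = rest_musicians_masker, y = rest_nonmusicians_masker)

extractBF(bf_quiet)$bf    # Bayes factor for the group difference (quiet block)
extractBF(bf_masker)$bf   # Bayes factor for the group difference (masker block)
```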

Figure 6 shows the mean percentage of change in the pre-target baselines in relation to resting state baselines for musicians (dark red) and non-musicians (light pink) across conditions (speech-in-quiet: left, 0 dB TMR: middle, and −5 dB TMR: right). The figure shows that in the speech-in-quiet block, the pupil dilation in the pre-target baseline decreased over the course of that experimental block (the average change in this block is below 0 for both groups). We do not see this pattern for the speech in two-talker masker block, where the average change relative to the resting state pupil size remains above 0 for both groups. In line with Wagner et al. (2019), we interpret this as a result of processing speech in the presence of a two-talker masker. The smaller decrease in pupil size within that block suggests an overall increase in sustained attention as participants were processing masked speech.

FIGURE 6

Figure 6. Boxplots represent the percentage of change in the pre-target baseline relative to the resting state baseline for musicians (dark red) and non-musicians (light pink) across conditions (speech-in-quiet: left, 0 dB TMR: middle, and −5 dB TMR: right). The horizontal line in the box represents the median values and the dots represent the individual participants’ relative change in pre-target baseline averaged across trials.

We ran a linear mixed model with the interaction of Condition (quiet, 0 dB TMR, and −5 dB TMR) and Group (musician vs. non-musician), with the percentage change in the pre-target baselines relative to the resting baseline as the dependent measure. The random structure included a by-subject intercept and a by-subject random slope of condition. The backward model comparison revealed an interaction between Condition and Group [χ2(2) = 1767.3, p < 0.001, AIC difference = −1,763]. The significant two-way interaction between 0 dB TMR and group indicates that the two groups mainly differed in the 0 dB TMR condition in terms of the relative change in pre-target baseline (β = 2.25, SE = 0.86, t = 2.60, p = 0.014).


General Discussion

The current study examined whether and in what ways speech-on-speech perception differed between musicians and non-musicians. Previous literature showed that there may be an effect of musical experience on speech perception in the presence of speech maskers, yet the findings have not been consistent across studies that used offline measures (reflecting the post-task performance after speech processing is completed; e.g., Başkent and Gaudrain, 2016; Madsen et al., 2017). Here, we aimed to test whether both the offline and the online measures would reveal differences in speech-on-speech processing between musicians and non-musicians. We used speech materials taken from the same set of recordings for both experiments, and participants took part in both experiments in one session. Results from both experiments showed that there is a difference between the musician and non-musician groups’ responses when processing speech in a two-talker masker. In the sentence recall task, musicians overall performed better than non-musicians, having recalled more words correctly, similar to Başkent and Gaudrain (2016). Moreover, the lack of a significant interaction between group and TMR condition did not support the suggestion of Strait and Kraus (2011) that musicians’ performance would benefit more than non-musicians’ performance as the task became more demanding. In contrast, in the eye-tracking experiment, the main difference in terms of the time course of gaze fixations was in the masking condition of −5 dB TMR, i.e., when the masker level was higher; at this TMR, musicians exhibited fewer looks to the lexical competitor and resolved the lexical ambiguity faster than non-musicians. This suggests that the musicians may be employing a different strategy than non-musicians when processing speech in the presence of speech maskers. The pupil dilation related to the task-relevant processes (lexical decision making) occurred in a similar manner in the two groups, whereas the effect of the two-talker masker was mainly observed in the 0 dB TMR condition.

In the sentence recall task, musicians overall performed better, having recalled more words correctly than non-musicians. In line with previous studies, our results also showed within-group variation (e.g., Swaminathan et al., 2015; Başkent and Gaudrain, 2016). In speech-in-noise perception, as the similarity between the target and noise properties increases, it becomes more difficult to segregate the target and noise streams (see Table 1 in Schubert and Schultz, 1962). This in turn may lead to more variable ways of disentangling the target and noise streams, especially when the background noise becomes more similar to the speech signal, as in speech-on-speech perception (Lutfi et al., 2003; Kidd and Colburn, 2017). Hence, speech-on-speech studies tend to yield more variable results than speech-in-noise tasks in general (Lutfi et al., 2003; Festen and Plomp, 2004). We also observed inter-individual variability in our results, but this variation was more pronounced in the non-musician group. The selection criteria may have played a role; while musicians were selected based on strict criteria, non-musicians’ domain-general auditory skills were not controlled for. In future studies, it could be useful to approach musicality as a spectrum or a continuous variable (e.g., Puschmann et al., 2018) or to use standardized musicality indices (e.g., the Goldsmiths Musical Sophistication Index; Müllensiefen et al., 2014). Since the participants of the present study had been selected based on the criteria of the previous literature (Parbery-Clark et al., 2009; Fuller et al., 2014), we could not perform such an analysis during or after the study.

Online measures of spoken language processing obtained with an eye-tracker enabled us to capture language processing in real time. The results from the VWP showed that the lexical decision making process did not differ between the two groups in the speech-in-quiet condition, in terms of the amount of lexical competition and the timing of ambiguity resolution. When the two-talker masker was added to the target speech signal, both groups showed fewer fixations to the competitor, but musicians showed this effect particularly in the −5 dB TMR condition. In that condition, musicians looked less at the competitor, and visual inspection of the time course curve for the target image also indicated that they started looking at the target word’s image earlier. We interpret this as musicians making the lexical decision faster in the −5 dB TMR condition when compared to both non-musicians and their own speech-in-quiet results. According to models of spoken word recognition, lexical decision making is incremental: multiple lexical candidates are activated, weighted by how well they match the unfolding acoustic input (e.g., cohort competitors vs. rhyme competitors). When the signal is degraded or masked by background noise, activation is spread across candidates, and none of them may reach the threshold needed to elicit measurable lexical competition (e.g., Wagner et al., 2016). Both groups exhibited fewer fixations to the competitor word, as well as to the target, as the level of the two-talker masker increased. This could also reflect participants looking at the fixation cross longer and waiting until enough information has accumulated, referred to as a “wait-and-see” strategy by McMurray et al. (2017), while trying to understand the target speaker in the two-talker masker. Musicians appear to disambiguate the target signal earlier; however, the incremental process of lexical competition that both groups showed in speech-in-quiet appears reduced, or absent, for musicians at −5 dB TMR. This could indicate that musicians use a different strategy to accommodate the background noise and attain the task goal; for example, they may respond to increased uncertainty by altering the criteria for activating nodes during lexical decision making (McQueen and Huettig, 2012). This hypothesis would be better tested with more controlled manipulations of the acoustic cues in the target speech stream, to infer whether musical training supports attending to the target stream, suppressing the background noise, or processing multiple auditory streams simultaneously.
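
For readers less familiar with the visual world paradigm, the snippet below sketches how a fixation time course of the kind described above can be computed: gaze samples coded per trial as fixating the target, the competitor, or elsewhere are binned over time and averaged into fixation proportions. This is a simplified, hypothetical sketch in Python/NumPy, not the analysis pipeline used in this study; the coding scheme, sampling rate, and bin size are illustrative assumptions.

    import numpy as np

    def fixation_proportions(labels, fs=250, bin_ms=50):
        """Compute proportions of fixations to each interest area over time.

        `labels` is a (n_trials, n_samples) array of strings coding each gaze
        sample as 'target', 'competitor', or 'other' (a simplification of real
        eye-tracker output). Samples are averaged within bins of `bin_ms` ms,
        assuming a sampling rate of `fs` Hz."""
        samples_per_bin = int(fs * bin_ms / 1000)
        n_bins = labels.shape[1] // samples_per_bin
        trimmed = labels[:, : n_bins * samples_per_bin]
        binned = trimmed.reshape(labels.shape[0], n_bins, samples_per_bin)
        proportions = {}
        for roi in ("target", "competitor"):
            # proportion of samples on this interest area, averaged over trials
            proportions[roi] = (binned == roi).mean(axis=2).mean(axis=0)
        time_ms = (np.arange(n_bins) + 0.5) * bin_ms   # bin centers
        return time_ms, proportions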

We measured listeners’ change in pupil dilation in order to gain more insight into how the process of lexical decision making, which occurs effortlessly in speech-in-quiet, is affected by the addition of a two-talker masker, as well as how the deployment of cognitive resources differs between groups. We examined the change in pupil dilation that captured task-relevant processes attributed to the resolution of lexical ambiguity and the decision between target and competitor. The results revealed that this process occurred in a similar way for both groups across conditions. This is reasonable, since only accurate responses were taken into consideration, i.e., trials in which participants performed the task as intended. The main difference between groups was in the 0 dB TMR condition, where non-musicians exhibited less dilation than musicians. The two-talker masker conditions did not generate larger changes in the time course of pupil dilation compared to speech-in-quiet, because pupils were already more dilated at the pre-target baseline: the speech preceding the target word was also processed in the two-talker masker. For both groups, the two-talker masker conditions thus elicited pupil dilation similar to that in the speech-in-quiet condition. In the masked conditions, processing the speech signal required additional resources (1) to extract the target signal from, or inhibit, the background talkers and (2) to resolve the lexical competition (Wagner et al., 2016; Peelle, 2018; Nagels et al., 2020). This difference in processing can also be seen in the relative changes in the pre-target baselines (Figure 6). The pre-target baselines were overall higher when processing speech in the two-talker masker than in speech-in-quiet, for both musicians and non-musicians, indicating that processing effort had already increased before the onset of the target word in the masked conditions. The largest difference between the two groups’ pre-target baselines was at 0 dB TMR, with non-musicians having higher baselines; this could account for the lower pupil response observed at 0 dB TMR (Figure 5). Since both 0 and −5 dB TMRs were presented within one block, it is not possible to disentangle the source of this difference peculiar to 0 dB TMR. Taken together, the higher pre-target baselines in the two-talker masker conditions indicate that speech maskers increased cognitive effort for both groups, whereas the difference was larger for non-musicians at 0 dB TMR, the easier noise level.
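
To make the role of the pre-target baseline concrete, the sketch below shows a common subtractive baseline correction for pupil traces: the mean pupil size in a window preceding target onset is subtracted from each trial. This is an illustrative Python/NumPy sketch under assumed parameters (sampling rate, window length, onset time), not the preprocessing used in this study; it merely shows how an elevated pre-target baseline can reduce the apparent task-evoked dilation after correction.

    import numpy as np

    def baseline_correct(pupil, fs=60, baseline_ms=500, target_onset_s=1.0):
        """Subtract the mean pupil size in a pre-target window from each trial.

        `pupil` is a (n_trials, n_samples) array of pupil sizes aligned to
        trial onset; `target_onset_s` marks the target word onset. All values
        are illustrative placeholders, not the study's actual parameters."""
        onset = int(target_onset_s * fs)
        n_base = int(baseline_ms / 1000 * fs)
        baseline = pupil[:, onset - n_base : onset].mean(axis=1, keepdims=True)
        return pupil - baseline, baseline.squeeze()

    # A trial processed in a masker may already show an elevated pre-target
    # baseline; subtractive correction then yields a smaller-looking
    # task-evoked response relative to a quiet trial.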

Overall, the combination of gaze fixations and pupil responses suggests that musicians employ a different strategy than non-musicians when resolving lexical ambiguity in masked speech, especially at the lower TMR, which presents a more difficult task. This is not in line with the sentence recall findings, where the increase in masker level did not reveal a larger difference in performance between the groups; musicians performed better than non-musicians, and to a seemingly similar degree, across all TMR levels tested. These results can have several explanations: musicians and non-musicians may differ in how they encode or process information in the speech streams, in how they suppress the background masking talkers, or in how quickly and efficiently they implement a strategy to deal with the added noise. Another possibility is that musicians encode and process both streams of information simultaneously, as suggested in a recent MEG study (Puschmann et al., 2018). In a selective listening task, in which a continuous target speech stream was presented along with a competing speaker that had to be disregarded, Puschmann et al. (2018) reported that the number of years of musical training was strongly correlated with the ability to track the masker stream and to make use of both streams to achieve the task goals; musicians’ neural responses were attuned to the distractor stream as well as to the target speaker. Our results indicate that there is a difference between musicians and non-musicians when processing speech-on-speech. However, further empirical work with more targeted manipulations of the acoustic properties of the stimuli is needed to infer whether the observed group difference is due to a better ability to take up the acoustic information in the target stream, to suppress the background noise, or to process both streams in parallel.

Conclusion

In the present study, we investigated whether and in what ways musicians differed from non-musicians in speech-on-speech perception and processing. Both our offline and online measures of speech perception indicated that musicians and non-musicians differed in processing speech presented within a two-talker masker. Musicians overall performed better in the sentence recall task. It should be noted that our results do not imply a causal link between musical training and speech-on-speech perception; such a claim would require a specially designed musical training program and a longer period of testing. In addition, further empirical work with different samples of participants and different sets of target/masker properties is required to clarify whether these results generalize to other listening situations.

Gaze-tracking results revealed that the two groups did not differ in the speech-in-quiet condition. Once background noise was added in the form of a two-talker masker, musicians and non-musicians differed in how the time course of gaze fixations was affected by masking, especially as the noise level increased. Pupil dilation increased for both groups in the two-talker masker conditions compared to speech-in-quiet. Taking all results together, we observed differences between musicians and non-musicians both in speech perception performance in a two-talker masker and in the time course of lexical decision making in noise. Although the online eye-tracking measure enabled us to capture that musicians may be using a different strategy as the noise level increased, further empirical work is required to determine whether the observed effects were due to musicians having better cognitive control over processing of both the target and the masker streams, or due to a more general difference in sound encoding and retrieval.

Data Availability Statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at: https://doi.org/10.34894/YFEHFJ.

Ethics Statement

The studies involving human participants were reviewed and approved by The Medical Ethical Committee (METc) of the UMCG. The patients/participants provided their written informed consent to participate in this study.

Author Contributions

EK, AW, and DB conceptualized and designed the study. EK collected the data. EK, PT, and AW conducted data analysis and interpretation. PT assisted with the programming of the experiment. EK wrote the original draft. All authors critically assessed and approved the final manuscript.

Funding

This project and the first author were supported primarily by funding from the EU’s H2020 research and innovation program under the MSCA GA 675324 (the ENRICH network: www.enrich-etn.eu), and the second and last authors were supported by a VICI Grant (no. 016.VICI.170.111) from the Netherlands Organization for Scientific Research (NWO) and the Netherlands Organization for Health Research and Development (ZonMw). Deniz Başkent’s Speech Perception Lab is in general supported by the Heinsius Houbolt Funds.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

We would like to thank Frans Cornelissen (University Medical Center Groningen) for providing the eye tracker for this study and the Center for Information Technology of the University of Groningen for their support and for providing access to the Peregrine high-performance computing cluster. We would like to thank our student assistants Britt Bosma and Aline Hoeve for assisting with the data coding for the sentence recall task. We would like to thank Jop Luberti for creating the experimental figures.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2021.623787/full#supplementary-material

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Autom. Control 19, 716–723. doi: 10.1109/TAC.1974.1100705

Allopenna, P. D., Magnuson, J. S., and Tanenhaus, M. K. (1998). Tracking the time course of spoken word recognition using eye movements: evidence for continuous mapping models. J. Mem. Lang. 38, 419–439. doi: 10.1006/jmla.1997.2558

Asaridou, S. S., and McQueen, J. M. (2013). Speech and music shape the listening brain: evidence for shared domain-general mechanisms. Front. Psychol. 4:321. doi: 10.3389/fpsyg.2013.00321

Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge: Cambridge University Press.

Baddeley, A. D., Hitch, G. J., and Allen, R. J. (2009). Working memory and binding in sentence recall. J. Mem. Lang. 61, 438–456. doi: 10.1016/j.jml.2009.05.004

Barr, D. J., Levy, R., Scheepers, C., and Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: keep it maximal. J. Mem. Lang. 68, 255–278. doi: 10.1016/j.jml.2012.11.001

Başkent, D., and Gaudrain, E. (2016). Musician advantage for speech-on-speech perception. J. Acoust. Soc. Am. 139, EL51–EL56. doi: 10.1121/1.4942628

Bates, D., Mächler, M., Bolker, B., and Walker, S. (2015). Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67, 1–48. doi: 10.18637/jss.v067.i01

Beatty, J. (1982). Task-evoked pupillary responses, processing load, and the structure of processing resources. Psychol. Bull. 91, 276–292. doi: 10.1037/0033-2909.91.2.276

Bergamin, O., Bridget Zimmerman, M., and Kardon, R. H. (2003). Pupil light reflex in normal and diseased eyes: diagnosis of visual dysfunction using waveform partitioning. Ophthalmology 110, 106–114. doi: 10.1016/S0161-6420(02)01445-8

Besson, M., Chobert, J., and Marie, C. (2011). Transfer of training between music and speech: common processing, attention, and memory. Front. Psychol. 2:94. doi: 10.3389/fpsyg.2011.00094

Bidelman, G. M., and Yoo, J. (2020). Musicians show improved speech segregation in competitive, multi-talker cocktail party scenarios. Front. Psychol. 11:1927. doi: 10.3389/fpsyg.2020.01927

Boebinger, D., Evans, S., Rosen, S., Lima, C. F., Manly, T., and Scott, S. K. (2015). Musicians and non-musicians are equally adept at perceiving masked speech. J. Acoust. Soc. Am. 137, 378–387. doi: 10.1121/1.4904537

Bronkhorst, A. W. (2015). The cocktail-party problem revisited: early processing and selection of multi-talker speech. Atten. Percept. Psychophys. 77, 1465–1487. doi: 10.3758/s13414-015-0882-9

Brungart, D. S., Simpson, B. D., Ericson, M. A., and Scott, K. R. (2001). Informational and energetic masking effects in the perception of multiple simultaneous talkers. J. Acoust. Soc. Am. 110, 2527–2538. doi: 10.1121/1.1408946

Calandruccio, L., Buss, E., and Bowdrie, K. (2017). Effectiveness of two-talker maskers that differ in talker congruity and perceptual similarity to the target speech. Trends Hear. 21:2331216517709385. doi: 10.1177/2331216517709385

Calandruccio, L., Dhar, S., and Bradlow, A. R. (2010). Speech-on-speech masking with variable access to the linguistic content of the masker speech. J. Acoust. Soc. Am. 128, 860–869. doi: 10.1121/1.3458857

Clayton, K. K., Swaminathan, J., Yazdanbakhsh, A., Zuk, J., Patel, A. D., and Kidd, G. (2016). Executive function, visual attention and the cocktail party problem in musicians and non-musicians. PLoS One 11:e0157638. doi: 10.1371/journal.pone.0157638

Coffey, E. B. J., Mogilever, N., and Zatorre, R. J. (2017). Speech-in-noise perception in musicians: a review. Hear. Res. 352, 49–69. doi: 10.1016/j.heares.2017.02.006

Cooper, R. M. (1974). The control of eye fixation by the meaning of spoken language. A new methodology for the real-time investigation of speech perception, memory, and language processing. Cogn. Psychol. 6, 84–107. doi: 10.1016/0010-0285(74)90005-X

Cornelissen, F. W., Peters, E. M., and Palmer, J. (2002). The Eyelink toolbox: eye tracking with MATLAB and the psychophysics toolbox. Behav. Res. Methods Instrum. Comput. 34, 613–617. doi: 10.3758/BF03195489

Darwin, C. J., Brungart, D. S., and Simpson, B. D. (2003). Effects of fundamental frequency and vocal-tract length changes on attention to one of two simultaneous talkers. J. Acoust. Soc. Am. 114, 2913–2922. doi: 10.1121/1.1616924

Deroche, M. L. D., Limb, C. J., Chatterjee, M., and Gracco, V. L. (2017). Similar abilities of musicians and non-musicians to segregate voices by fundamental frequency. J. Acoust. Soc. Am. 142, 1739–1755. doi: 10.1121/1.5005496

Du, Y., and Zatorre, R. J. (2017). Musical training sharpens and bonds ears and tongue to hear speech better. Proc. Natl. Acad. Sci. 114, 13579–13584. doi: 10.1073/pnas.1712223114

Eberhard, K. M., Spivey-Knowlton, M. J., Sedivy, J. C., and Tanenhaus, M. K. (1995). Eye movements as a window into real-time spoken language comprehension in natural contexts. J. Psycholinguist. Res. 24, 409–436. doi: 10.1007/BF02143160

Festen, J. M., and Plomp, R. (2004). Effects of fluctuating noise and interfering speech on the speech-reception threshold for impaired and normal hearing. J. Acoust. Soc. Am. 88, 1725–1736. doi: 10.1121/1.400247

Fuller, C. D., Galvin, J. J., Maat, B., Free, R. H., and Başkent, D. (2014). The musician effect: does it persist under degraded pitch conditions of cochlear implant simulations? Front. Neurosci. 8:179. doi: 10.3389/fnins.2014.00179

Godfroid, A. (2019). Eye tracking in second language acquisition and bilingualism. New York: Routledge.

Hallett, P. E. (1986). “Eye movements” in Handbook of perception and human performance. Vol. 1. eds. K. R. Boff, L. Kaufman, and J. P. Thomas (New York, NY: Wiley), 10.1–10.112.

Heald, S. L. M., and Nusbaum, H. C. (2014). Speech perception as an active cognitive process. Front. Syst. Neurosci. 8:35. doi: 10.3389/fnsys.2014.00035

Helfer, K. S., and Jesse, A. (2015). Lexical influences on competing speech perception in younger, middle-aged, and older adults. J. Acoust. Soc. Am. 138, 363–376. doi: 10.1121/1.4923155

Hoeks, B., and Levelt, W. J. M. (1993). Pupillary dilation as a measure of attention: a quantitative system analysis. Behav. Res. Methods Instrum. Comput. 25, 16–26. doi: 10.3758/BF03204445

Jeffreys, H. (1961). Theory of probability. 3rd Edn. Oxford, England: University Press.

Kahneman, D. (1973). Attention and effort. Englewood Cliffs, NJ: Prentice-Hall.

Kidd, G., and Colburn, H. S. (2017). “Informational masking in speech recognition” in The Auditory System at the Cocktail Party. eds. J. C. Middlebrooks, J. Z. Simon, A. N. Popper, and R. R. Fay (Cham: Springer International Publishing), 75–109.

Killion, M. C., Niquette, P. A., Gudmundsen, G. I., Revit, L. J., and Banerjee, S. (2004). Development of a quick speech-in-noise test for measuring signal-to-noise ratio loss in normal-hearing and hearing-impaired listeners. J. Acoust. Soc. Am. 116, 2395–2405. doi: 10.1121/1.1784440

Kleiner, M., Brainard, D., Pelli, D., Ingling, A., Murray, R., and Broussard, C. (2007). “What’s new in Psychtoolbox-3?”, Perception 36 ECVP Abstract Supplement. Available at: http://psychtoolbox.org/credits

Kraus, N., and Chandrasekaran, B. (2010). Music training for the development of auditory skills. Nat. Rev. Neurosci. 11, 599–605. doi: 10.1038/nrn2882

Kuchinsky, S. E., Ahlstrom, J. B., Vaden, K. I., Cute, S. L., Humes, L. E., Dubno, J. R., et al. (2013). Pupil size varies with word listening and response selection difficulty in older adults with hearing loss. Psychophysiology 50, 23–34. doi: 10.1111/j.1469-8986.2012.01477.x

Lecumberri, M. L. G., and Cooke, M. (2006). Effect of masker type on native and non-native consonant perception in noise. J. Acoust. Soc. Am. 119, 2445–2454. doi: 10.1121/1.2180210

Lutfi, R. A., Kistler, D. J., Oh, E. L., Wightman, F. L., and Callahan, M. R. (2003). One factor underlies individual differences in auditory informational masking within and across age groups. Percept. Psychophys. 65, 396–406. doi: 10.3758/BF03194571

Madsen, S. M. K., Marschall, M., Dau, T., and Oxenham, A. J. (2019). Speech perception is similar for musicians and non-musicians across a wide range of conditions. Sci. Rep. 9:10404. doi: 10.1038/s41598-019-46728-1

Madsen, S. M. K., Whiteford, K. L., and Oxenham, A. J. (2017). Musicians do not benefit from differences in fundamental frequency when listening to speech in competing speech backgrounds. Sci. Rep. 7, 1–9. doi: 10.1038/s41598-017-12937-9

Marslen-Wilson, W. D., and Tyler, L. K. (1981). Central processes in speech understanding. Philos. Trans. R. Soc. Lond. 295, 317–332.

Mattys, S. L., Davis, M. H., Bradlow, A. R., and Scott, S. K. (2012). Speech recognition in adverse conditions: a review. Lang. Cogn. Process. 27, 953–978. doi: 10.1080/01690965.2012.705006

McClelland, J. L., and Elman, J. L. (1986). The TRACE model of speech perception. Cogn. Psychol. 18, 1–86. doi: 10.1016/0010-0285(86)90015-0

McMurray, B., Farris-Trimble, A., and Rigler, H. (2017). Waiting for lexical access: Cochlear implants or severely degraded input lead listeners to process speech less incrementally. Cognition 169, 147–164. doi: 10.1016/j.cognition.2017.08.013

McQueen, J. M., and Huettig, F. (2012). Changing only the probability that spoken words will be distorted changes how they are recognized. J. Acoust. Soc. Am. 131, 509–517. doi: 10.1121/1.3664087

Meha-Bettison, K., Sharma, M., Ibrahim, R. K., and Mandikal Vasuki, P. R. (2018). Enhanced speech perception in noise and cortical auditory evoked potentials in professional musicians. Int. J. Audiol. 57, 40–52. doi: 10.1080/14992027.2017.1380850

Miller, G. A. (1947). The masking of speech. Psychol. Bull. 44, 105–129. doi: 10.1037/h0055960

Mirman, D. (2014). Growth Curve Analysis and Visualization Using R. Boca Raton, FL: Chapman and Hall/CRC.

Müllensiefen, D., Gingras, B., Musil, J., and Stewart, L. (2014). The musicality of non-musicians: an index for measuring musical sophistication in the general population. PLoS One 9:e89642. doi: 10.1371/journal.pone.0089642

Nagels, L., Bastiaanse, R., Başkent, D., and Wagner, A. (2020). Individual differences in lexical access among Cochlear implant users. J. Speech Lang. Hear. Res. 63, 286–304. doi: 10.1044/2019_JSLHR-19-00192

Nilsson, M., Soli, S. D., and Sullivan, J. A. (1994). Development of the hearing in noise test for the measurement of speech reception thresholds in quiet and in noise. J. Acoust. Soc. Am. 95, 1085–1099. doi: 10.1121/1.408469

Pantev, C., and Herholz, S. C. (2011). Plasticity of the human auditory cortex related to musical training. Neurosci. Biobehav. Rev. 35, 2140–2154. doi: 10.1016/j.neubiorev.2011.06.010

Parbery-Clark, A., Anderson, S., Hittner, E., and Kraus, N. (2012). Musical experience strengthens the neural representation of sounds important for communication in middle-aged adults. Front. Aging Neurosci. 4:30. doi: 10.3389/fnagi.2012.00030

Parbery-Clark, A., Skoe, E., Lam, C., and Kraus, N. (2009). Musician enhancement for speech-in-noise. Ear Hear. 30, 653–661. doi: 10.1097/AUD.0b013e3181b412e9

Patel, A. D. (2014). Can nonlinguistic musical training change the way the brain processes speech? The expanded OPERA hypothesis. Hear. Res. 308, 98–108. doi: 10.1016/j.heares.2013.08.011

Peelle, J. E. (2018). Listening Effort. Ear Hear. 39, 204–214. doi: 10.1097/AUD.0000000000000494

Puschmann, S., Baillet, S., and Zatorre, R. J. (2018). Musicians at the cocktail party: neural substrates of musical training during selective listening in multispeaker situations. Cereb. Cortex 29, 3253–3265. doi: 10.1093/cercor/bhy193

R Core Team (2013). R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing.

Rosen, S., Souza, P., Ekelund, C., and Majeed, A. A. (2013). Listening to speech in a background of other talkers: effects of talker number and noise vocoding. J. Acoust. Soc. Am. 133, 2431–2443. doi: 10.1121/1.4794379

Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., and Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychon. Bull. Rev. 16, 225–237. doi: 10.3758/PBR.16.2.225

Ruggles, D. R., Freyman, R. L., and Oxenham, A. J. (2014). Influence of musical training on understanding voiced and whispered speech in noise. PLoS One 9:e86980. doi: 10.1371/journal.pone.0086980

Salverda, A. P., Dahan, D., and McQueen, J. M. (2003). The role of prosodic boundaries in the resolution of lexical embedding in speech comprehension. Cognition 90, 51–89. doi: 10.1016/S0010-0277(03)00139-2

Salverda, A. P., and Tanenhaus, M. K. (2017). “The visual world paradigm” in Research methods in psycholinguistics: A practical guide. eds. A. M. B. De Groot and P. Hagoort (Hoboken, NJ: Wiley), 89–110.

Schmidtke, J. (2014). Second language experience modulates word retrieval effort in bilinguals: evidence from pupillometry. Front. Psychol. 5:137. doi: 10.3389/fpsyg.2014.00137

Schubert, E. D., and Schultz, M. C. (1962). Some aspects of binaural signal selection. J. Acoust. Soc. Am. 34, 844–849. doi: 10.1121/1.1918203

Shiffrin, R. M., and Schneider, W. (1977). Controlled and automatic human information processing: II. Perceptual learning, automatic attending, and a general theory. Psychol. Rev. 84, 127–190. doi: 10.1037/0033-295X.84.2.127

Strait, D. L., and Kraus, N. (2011). Can you hear me now? Musical training shapes functional brain networks for selective auditory attention and hearing speech in noise. Front. Psychol. 2:113. doi: 10.3389/fpsyg.2011.00113

Swaminathan, J., Mason, C. R., Streeter, T. M., Best, V., Kidd, G., and Patel, A. D. (2015). Musical training, individual differences and the cocktail party problem. Sci. Rep. 5, 1–10. doi: 10.1038/srep11628

Tanenhaus, M. K., Spivey-knowlton, M. J., Eberhard, K. M., and Sedivy, J. C. (1995). Integration of visual and linguistic information in spoken language comprehension. Science 268, 1632–1634. doi: 10.1126/science.7777863

Treisman, A. M. (1964). The effect of irrelevant material on the efficiency of selective listening. Am. J. Psychol. 77, 533–546. doi: 10.2307/1420765

Versfeld, N. J., Daalder, L., Festen, J. M., and Houtgast, T. (2000). Method for the selection of sentence materials for efficient measurement of the speech reception threshold. J. Acoust. Soc. Am. 107, 1671–1684. doi: 10.1121/1.428451

Wagner, A. E., Nagels, L., Toffanin, P., Opie, J. M., and Başkent, D. (2019). Individual variations in effort: assessing Pupillometry for the hearing impaired. Trends Hear. 23, 1–18. doi: 10.1177/2331216519845596

Wagner, A. E., Toffanin, P., and Başkent, D. (2016). The timing and effort of lexical access in natural and degraded speech. Front. Psychol. 7:398. doi: 10.3389/fpsyg.2016.00398

Wierda, S. M., Van Rijn, H., Taatgen, N. A., and Martens, S. (2012). Pupil dilation deconvolution reveals the dynamics of attention at high temporal resolution. Proc. Natl. Acad. Sci. U. S. A. 109, 8456–8460. doi: 10.1073/pnas.1201858109

Yates, K. M., Moore, D. R., Amitay, S., and Barry, J. G. (2019). Sensitivity to melody, rhythm, and beat in supporting speech-in-noise perception in young adults. Ear Hear. 40, 358–367. doi: 10.1097/AUD.0000000000000621

Keywords: speech-in-noise, musical training, visual world paradigm, pupillometry, eye-tracking, speech-on-speech

Citation: Kaplan EC, Wagner AE, Toffanin P and Başkent D (2021) Do Musicians and Non-musicians Differ in Speech-on-Speech Processing? Front. Psychol. 12:623787. doi: 10.3389/fpsyg.2021.623787

Received: 30 October 2020; Accepted: 21 January 2021;
Published: 18 February 2021.

Edited by:

Gavin M. Bidelman, University of Memphis, United States

Reviewed by:

Mireille Besson, UMR7291 Laboratoire de Neurosciences Cognitives (LNC), France
Adam Tierney, Birkbeck, University of London, United Kingdom

Copyright © 2021 Kaplan, Wagner, Toffanin and Başkent. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Elif Canseza Kaplan, elifjk@gmail.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.