- Auditory Prostheses and Perception Laboratory, Center for Hearing Research, Boys Town National Research Hospital, Omaha, NE, United States
Purpose: Cochlear implants (CIs) provide reasonable levels of speech recognition in quiet, but voice pitch perception is severely impaired in CI users. The central question addressed here is how access to acoustic input before implantation influences vocal emotion production by individuals with CIs. The objective of this study was to compare acoustic characteristics of vocal emotions produced by prelingually deaf school-aged children with cochlear implants (CCI), who were implanted by the age of 2 and had no usable hearing before implantation, with those produced by children with normal hearing (CNH), adults with normal hearing (ANH), and postlingually deaf adults with cochlear implants (ACI), who developed with good access to acoustic information before losing their hearing and receiving a CI.
Method: A set of 20 sentences without lexically based emotional information was recorded by 13 CCI, 9 CNH, 9 ANH, and 10 ACI, each with a happy emotion and a sad emotion, without training or guidance. The sentences were analyzed for primary acoustic characteristics of the productions.
Results: Significant effects of Emotion were observed in all acoustic features analyzed (mean voice pitch, standard deviation of voice pitch, intensity, duration, and spectral centroid). ACI and ANH did not differ in any of the analyses. Of the four groups, CCI produced the smallest acoustic contrasts between the emotions in mean voice pitch and in its standard deviation. Effects of developmental age (highly correlated with the duration of device experience) and age at implantation (moderately correlated with the duration of device experience) were observed, as were interactions with the children’s sex.
Conclusion: Although prelingually deaf CCI and postlingually deaf ACI are listening to similar degraded speech and show similar deficits in vocal emotion perception, these groups are distinct in their productions of contrastive vocal emotions. The results underscore the importance of access to acoustic hearing in early childhood for the production of speech prosody and also suggest the need for a greater role of speech therapy in this area.
Introduction
Emotional communication is a key element of social development, social cognition, and emotional well-being. Studies have shown that in children and adults with cochlear implants (CIs), performance in vocal emotion recognition tasks predicts their self-perceived quality of life, but their general speech recognition does not (Schorr et al., 2009; Luo et al., 2018). This indicates that speech emotion communication is a critical area of deficit in CI users that needs to be addressed. Acoustic cues signaling vocal emotions in speech include voice pitch, timbre, intensity, and speaking rate (e.g., Banse and Scherer, 1996). Among these, voice pitch is a dominant cue. CIs do not represent voice pitch to the listener with adequate fidelity, but other cues to vocal emotions, such as intensity and duration cues, are retained in the electric input. These deficits in vocal pitch perception have been implicated in CI users’ poorer performance in pitch-dominant areas of speech perception such as prosody or lexical tones (Peng et al., 2004, 2008, 2017; Green et al., 2005; Chatterjee and Peng, 2008; See et al., 2013; Deroche et al., 2016; Jiam et al., 2017). The importance of voice pitch for spoken emotions is thought to account for the deficits observed in cochlear implant users’ ability to identify emotional prosody (Luo et al., 2007; Hopyan-Misakyan et al., 2009; Chatterjee et al., 2015; Paquette et al., 2018). The perceptual deficit observed in CI users in emotion identification suggests that, on their own, these secondary cues are not sufficient to provide normal levels of accuracy in vocal emotion identification. Similar deficits have been observed in normally hearing listeners attending to CI-simulated speech (Luo et al., 2007; Chatterjee et al., 2015; Gilbers et al., 2015; Tinnemore et al., 2018).
Prelingually deaf children who received a CI (CCI) within the sensitive period (e.g., by 2 years of age) and are developing oral communication skills through the prosthesis provide a unique opportunity to investigate the impact of the perceptual deficits associated with electric hearing on the development of emotional prosody. This population also provides an important contrast to postlingually deaf adult CI users (ACI) who learned to hear and speak with good hearing in childhood before losing their hearing as teenagers or adults, in many cases in middle age or later years. ACI generally retain excellent speech production skills, despite listening through the distorted input of the CI. In a previous study comparing ACI and CCI in their vocal emotion perception, Chatterjee et al. (2015) noted that they were similar in both the mean and the range of performance. Notably, the stimuli used by Chatterjee et al. (2015) were highly recognizable by normally hearing listeners as they were produced in a child-directed manner, with exaggerated prosody. While a few studies have reported deficits in prelingually deaf pediatric CI users’ productions of vocal emotions (Nakata et al., 2012; Van De Velde et al., 2019), they have focused on younger children (<10 years of age) and used perceptual ratings of the productions as the outcome measure. Little is known about the factors predicting the acoustic features of these productions as children develop into teenagers, and no studies have reported on a comparison between pre- and postlingually deaf CI users. Here, we present acoustic analyses of emotional prosody [a set of 20 emotion-neutral sentences (i.e., without lexically based emotional information) read with “happy” and “sad” emotional prosody] produced by prelingually deaf school-aged children and postlingually deaf adults with CIs, alongside productions by typically developing normally hearing children and young normally hearing adults. We selected the happy and sad emotions because they are well contrasted acoustically (happy is spoken with a higher mean pitch, a more variable pitch, a higher intensity, and at a faster rate than sad). These two emotions are also uncontroversial and relatively easy for school-aged children as young as 6 years old to understand and produce without an exemplar. Previous studies have used different methodologies, e.g., Nakata et al. (2012) asked children to imitate the vocal productions of an exemplar, while Van De Velde et al. (2019) asked children to produce a word depicted in a picture with an emotion simultaneously depicted in a picture. Imitative production provides information about vocal capabilities but not about how the participants would normally produce emotions. Van De Velde et al.’s (2019) method avoided imitation but may have imposed additional task complexity in the requirement to generate the word associated with the picture and the emotion associated with the picture, combine them conceptually, and produce the word with the correct emotion. In our task, we avoided imitation and kept the cognitive load to a minimum by asking children to read the list of sentences in a happy way and in a sad way. There was still the remaining task burden of having to combine the emotion with the sentence before producing it, but the participants did not have to generate the words themselves or figure out the emotion required for the production.
Among acoustic cues, we focused on mean voice pitch, variance of voice pitch, mean intensity, mean spectral centroid, and mean duration of each utterance. These cues were found to be important acoustic features of vocal emotions in previous studies (Banse and Scherer, 1996; Scherer, 2003). These cues have also been found to be useful in artificial manipulations of speech designed to represent different human emotions (e.g., Přibilová and Přibil, 2009). Based on pitch and spectral degradations in CIs, we expected the CI users (particularly CCI) to show deficits in the pitch and spectral centroid domains of their productions. We expected to observe smaller acoustic contrasts between “happy” and “sad” emotions in the productions of the CCI than in those by children with normal hearing (CNH) and adults with normal hearing (ANH), but we were interested in the specific acoustic cues that might show such reduced contrasts. We expected CNH and ANH to produce the emotions similarly. A key question of interest was how CCI and ACI would compare in their productions. Specifically, we asked if CCI and/or ACI would emphasize intensity or duration differences between the emotions to compensate for any deficits in the pitch domain. Previous studies have shown that adult and child CI users can trade primary acoustic cues for secondary cues such as duration and intensity in speech recognition, intonation recognition, and lexical tone recognition tasks (Peng et al., 2009, 2017; Winn et al., 2012). Luo et al. (2007) showed that removing intensity cues from the stimuli resulted in much poorer emotion recognition scores in their adult CI listeners, indicating that intensity cues are emphasized in vocal emotion recognition by postlingually deaf CI users. The extent to which this would influence their vocal emotion productions is not known, nor is it known whether prelingually deaf CCI would emphasize intensity cues in their productions. Among the CCI, we asked if earlier age at implantation or longer duration of experience with the device would change the acoustic characteristics of their productions. These questions center around the role of neuroplasticity within the more sensitive, early years of brain development and during the developmental period of auditory and language systems, which extends into the teenage years.
Materials and Methods
Participants
Participants comprised four groups of talkers. All talkers provided informed consent to be recorded, and procedures were approved under Boys Town National Research Hospital’s IRB protocol #11-24-XP. The four groups of talkers are described below. Detailed information about the CI users who participated is shown in Table 1. The information in Table 1 was derived from a questionnaire filled out by participants or (in the case of child participants) by their parents/guardians. Written informed assent was obtained from all child participants, together with written informed parental consent to participate; written informed consent was obtained from all adult participants. Participants were compensated for travel time and for their listening time. In addition, children were offered a toy or a book of their choice after they completed their sessions.
Children With Normal Hearing
Nine children with normal hearing participated. Their ages ranged between 6 and 18 years [mean age 12.5 years, standard deviation (SD) 4.4 years]. Five of the children were females, and four were males. All had normal hearing based on audiometric screening at a criterion level of 20 dB HL or better between 250 and 8,000 Hz.
Children With Cochlear Implants
Thirteen children with cochlear implants participated. Their ages ranged between 7 and 18 years (mean age 12.93 years, SD 4.27 years). Four of the children were males, and nine were females. All of the CCI were prelingually deaf, were implanted by the age of 2, and had no usable hearing at birth. Their mean age at implantation was 1.36 years (SD 0.35 years), and their mean duration of device use was 11.57 years (SD 4.04 years).
Adults With Normal Hearing
Nine adults with normal hearing participated. Their ages ranged between 21 and 45 years. Six of the ANH were females; three were males. As with the CNH, normal hearing was confirmed based on audiometric screening at a criterion level of 20 dB HL or better between 250 and 8,000 Hz.
Adults With Cochlear Implants
Ten postlingually deaf adults with cochlear implants participated. Their ages ranged between 27 and 75 years. Six of the ACI were females; four were males.
Procedure
The materials used for this study comprised 20 simple sentences that had no overt semantic cues about emotion. These sentences are provided in Table 2 (identical to Table 2 in Damm et al., 2019, JSLHR). The sentences were simple enough that the youngest participants (as young as 6 years of age) could read them aloud easily. The protocol for the recordings was as follows: the participant was invited to sit in a soundproof booth at a distance of 12 inches from a recording microphone (AKG C 2000 B) and asked to read the 20 sentences in sequence, first in a happy way (three times) and then in a sad way (three times). They were provided with some initial practice runs, and recordings were initiated when they felt ready. No targeted training or feedback was provided; all feedback was encouraging and laudatory in nature. The signal from the microphone was routed through an external A/D converter (Edirol UA-25X) and recorded using Adobe Audition v. 3.0 or v. 6.0. Recordings were made at a sampling rate of 44,100 Hz and with 16-bit resolution. The recordings were high-pass filtered using a 75-Hz cut-off frequency. Of the three sets of recordings in each emotion provided by individual talkers, the second set was typically used for acoustic analyses. For instances in which the second recording of a particular sentence was noisy or included non-speech sounds (such as coughing or throat-clearing), the best sample from the other two recordings was selected. An order effect may be present in the data, as happy emotions were recorded prior to sad. The recordings took very little time overall, so it is unlikely that fatigue played a role. Based on experience, we noted that it was easier for the participants (particularly, the younger children) to begin the session with the happy productions and to continue recording in a particular emotion, rather than to switch from happy to sad during the recordings. Any order effect in the data would be expected to be present for all participants. CI users who were bilaterally implanted were recorded with only their earlier-implanted device activated.
Acoustic Analyses
Acoustic analyses were performed on the recordings using the Praat software package (Boersma, 2001; Boersma and Weenink, 2019). For the 40 recordings (20 sentences, 2 emotions each) provided by each participant, a Praat script was run to compute the mean pitch (F0, Hz), the F0 variation (standard deviation of F0), the mean intensity (dB), and the duration (sec) of each utterance. The default autocorrelation method in the Praat software program was used to estimate F0. The primary challenges in such analyses are determining the onset and offset of the utterances in a consistent way and setting the pitch-estimation parameters appropriately for each utterance. The onset and offset times of each waveform were estimated using similar criteria by at least two of the co-authors so as to obtain consistent measures of duration. The pitch settings were established using the following steps: for each talker and emotion, a set of 4–5 recordings (from the total of 20) was pseudorandomly selected, and the pitch range, silence threshold, voicing threshold, octave cost, octave-jump cost, and voiced/unvoiced cost were set to appropriate levels, ensuring that the pitch contour was properly represented (e.g., avoiding octave jumps, discontinuities in the estimated pitch, or silences in regions of voiced speech). This was done more than once to ensure that the settings were indeed appropriate. Next, an automated Praat script was run on all 20 recordings for that talker and emotion. The output was then analyzed for consistency (e.g., mean F0 values were compared across the recordings, and the ratio of maximum to minimum F0 for individual recordings was inspected). If these values appeared suspect for any of the recordings (e.g., if the ratio of maximum to minimum F0 values exceeded 3.0 or if the estimated values were obviously different from other recordings by the same talker in the same emotion), those recordings were individually checked again, modifications were made as needed to the settings, and the values were manually computed in Praat for those individual recordings. Two of the authors (RS and MC) were always involved in the final analyses. Some of the analyses of productions by the children had been previously conducted (by authors MC and JS) using a similar but not identical approach. Care was taken to compare these older analyses with the newer ones. When correlations between the two sets of data fell below 0.85, the analyses were checked again to ensure accuracy and modified as needed.
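These consistency checks can be automated once the Praat script has exported its per-utterance F0 statistics. The R sketch below is purely illustrative and is not the script used in the study: the file name and column names are hypothetical, and the 2-SD outlier rule is a stand-in for the informal "obviously different from other recordings" criterion described above.

```r
# Illustrative QC pass over hypothetical Praat output (column names are
# assumptions): talker, emotion, sentence, mean_f0, min_f0, max_f0
praat_out <- read.csv("praat_f0_estimates.csv")

# Flag utterances whose max/min F0 ratio exceeds 3.0
praat_out$f0_ratio <- praat_out$max_f0 / praat_out$min_f0
ratio_flag <- praat_out$f0_ratio > 3.0

# Flag utterances whose mean F0 deviates markedly (here, > 2 SD) from the
# talker's other recordings in the same emotion
praat_out$z_mean_f0 <- ave(praat_out$mean_f0, praat_out$talker, praat_out$emotion,
                           FUN = function(x) (x - mean(x)) / sd(x))
outlier_flag <- abs(praat_out$z_mean_f0) > 2

# Flagged recordings would then be re-inspected manually in Praat
suspect <- praat_out[ratio_flag | outlier_flag, ]
suspect[, c("talker", "emotion", "sentence", "f0_ratio", "z_mean_f0")]
```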
Spectral centroid analyses were conducted in R using the seewave package (Sueur et al., 2008; Sueur, 2018). The first 10% and the last 10% of each waveform were first discarded, and a bandpass filter with cut-off frequencies of 50 and 4,000 Hz was applied to restrict the calculated centroid to speech content. Next, using the meanspec() function, the short-term discrete Fourier transform (STDFT) of 50-ms long successive time segments (Hann-windowed with 50% overlap) of the waveform was computed and averaged across all segments to obtain the mean spectrum. Finally, using the specprop() function, the spectral centroid of each waveform was computed from its mean spectrum, based on the formula \(C = \sum_{i=1}^{N} f_i a_i\), where N is the number of frequency bins (STDFT columns), \(f_i\) is the center frequency of the ith bin, and \(a_i\) is the relative amplitude of that bin. Both frequency and amplitude are linearly scaled in the centroid calculations.
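A minimal R sketch of this pipeline is given below, under stated assumptions: the file name is hypothetical, and the exact trimming and filtering helpers may differ from those used in the original scripts, but the meanspec()/specprop() calls follow the parameters described above (50-ms Hann windows, 50% overlap, 50–4,000 Hz band).

```r
library(tuneR)    # readWave()
library(seewave)  # cutw(), ffilter(), meanspec(), specprop()

wav <- readWave("talker01_happy_sentence05.wav")  # hypothetical file name
sr  <- wav@samp.rate
dur <- length(wav@left) / sr

# Discard the first and last 10% of the waveform
trimmed <- cutw(wav, f = sr, from = 0.1 * dur, to = 0.9 * dur, output = "Wave")

# Band-pass filter (50-4,000 Hz) to restrict the centroid to speech content
filtered <- ffilter(trimmed, f = sr, from = 50, to = 4000, output = "Wave")

# Mean spectrum from ~50-ms Hann-windowed segments with 50% overlap
wl <- 2 * round(0.050 * sr / 2)   # window length in samples, forced to be even
mean_spec <- meanspec(filtered, f = sr, wl = wl, wn = "hanning", ovlp = 50,
                      plot = FALSE)

# Spectral centroid (in Hz) of the mean spectrum
specprop(mean_spec)$cent
```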
Statistical Analyses
Statistical analyses and graphical renderings were conducted using R v. 3.3.2 (R Core Team, 2016). Plots were created using the ggplot2 package within R (Wickham, 2016). Linear mixed effects models were constructed using the package lme4 (Bates et al., 2015). A hierarchical approach was used to determine the best-fitting model, and nested models were compared using the anova() function; the car package in R (Fox and Weisberg, 2011) was also used in the analyses. Model residuals were visually inspected (using plots and histograms of residuals) to ensure normality. The lmerTest package (Kuznetsova et al., 2017) was used to obtain estimated model results and t-statistic-based significance levels for each parameter of interest. The optimx package (Nash and Varadhan, 2011) was used to promote model convergence in one instance.
Results
Group Differences
Figure 1 shows boxplots of the acoustic characteristics of happy and sad emotions produced by each of the four groups of participants. From top to bottom, the different rows show the mean F0, F0 variation [standard deviation of F0 (F0 s.d.)], mean intensity, duration, and spectral centroid of the sentences produced with the two emotions (red and blue). The boxplots in the left-hand panels show the distribution of values computed for each sentence (abscissa) across the participants. The boxplots in the right panel of each row show the mean values computed across the sentences recorded in each emotion.
Figure 1. Group differences in acoustic features of emotional productions. (Top to bottom – left panels) These figures show boxplots of mean F0 (Hz), F0 s.d. (Hz), Intensity (dB), Duration (s), and Spectral Centroid (Hz) values estimated for each sentence (abscissa) recorded by the participants in each emotion (happy: red; sad: blue). Data from the four groups of participants are represented in the four panels (left to right: ACI, ANH, CCI, and CNH). (Top to bottom – right panels) These figures show boxplots of the mean values of these acoustic features computed across the 20 sentences recorded in each emotion by individual participants. The abscissa shows the four groups (ACI, ANH, CCI, and CNH). Happy and sad emotions are again shown in red and blue colors.
LME analyses were conducted on these data to investigate effects of Group, Sentence, and Emotion and their interactions. In all cases, the LME model was constructed including Group, Sentence, and Emotion as fixed effects, subject-based random intercepts, and random slopes for the effect of sentence. The dependent variable in each case was the particular acoustic measure under consideration (mean F0, F0 s.d., Intensity, Duration, or Spectral Centroid). The effect of Sentence was included as a fixed effect because systematic differences for individual sentences were expected, based on differences between them in their phonetic and linguistic characteristics.
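For concreteness, the sketch below illustrates this model structure in lme4/lmerTest for one acoustic measure. It is an illustration rather than the original analysis script: the data frame and column names are hypothetical, and the coding of Sentence is an assumption.

```r
library(lme4)
library(lmerTest)   # t-statistic-based significance for fixed effects
library(optimx)     # alternative optimizers, only needed if convergence fails

# d: hypothetical data frame with one row per utterance and columns
# meanF0, Group, Emotion, Sentence, Subject

# Fixed effects of Group, Emotion, and Sentence (with their interactions),
# by-subject random intercepts, and by-subject random slopes for Sentence
m_full <- lmer(meanF0 ~ Group * Emotion * Sentence + (1 + Sentence | Subject),
               data = d)

# Hierarchical comparison against a simpler random-effects structure;
# anova() refits the models with ML for the likelihood-ratio comparison
m_int <- lmer(meanF0 ~ Group * Emotion * Sentence + (1 | Subject), data = d)
anova(m_int, m_full)

# Residual diagnostics and fixed-effect estimates
hist(resid(m_full))
qqnorm(resid(m_full)); qqline(resid(m_full))
summary(m_full)     # Satterthwaite-based t tests via lmerTest

# If a model fails to converge, an optimx optimizer can be supplied, e.g.:
# lmer(..., control = lmerControl(optimizer = "optimx",
#                                 optCtrl = list(method = "nlminb")))
```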
Mean F0
Results showed a significant interaction between Group and Emotion [β = −17.656 (SE = 3.879), t(1599.10) = −4.552, p < 0.0001] and a significant main effect of Emotion (i.e., higher mean F0 for happy than for sad productions) [β = −45.014 (SE = 6.907), t(1599.1) = −6.517, p < 0.0001]. No other effects or interactions were observed. To follow up on the interaction, we investigated the effect of Group for the happy and the sad productions separately. LME analyses on the happy productions with a fixed effect of Group, by-subject random intercepts, and by-subject random slopes for the effect of individual sentences showed no effect of Group. A similar analysis on the sad productions did show a significant effect of Group [β = −25.96 (SE = 8.76), t(41) = −2.96, p = 0.005], explaining the interaction between Group and Emotion. A pairwise t test (Bonferroni correction) to investigate the effect of Group in the sad productions showed no significant differences between the ANH and ACI groups’ mean F0 values (p = 0.32), but all other comparisons showed significant differences (p < 0.001 in all cases). Of note, the CCIs’ sad productions had the highest mean F0 of the four groups.
F0 Variation
The mean F0 and F0 s.d. values were significantly correlated in all groups. A linear multiple regression analysis confirmed that F0 s.d. was significantly predicted by mean F0 and also showed that there was an interaction with Group (i.e., different correlation coefficients for the different groups). Individual linear regression analyses within the four groups confirmed this observation: estimated coefficients for the ANH, ACI, and CCI groups were 0.266 (SE 0.01), 0.263 (SE 0.012), and 0.259 (SE 0.011), respectively, whereas the coefficient for the CNH group was only 0.162 (SE 0.009).
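As an illustrative sketch of this regression (with a hypothetical data frame and column names, not the original analysis code), the interaction model and the per-group fits below yield the overall test for different slopes and the group-specific coefficients compared in the text.

```r
# d: hypothetical per-utterance data frame with columns F0sd, meanF0, Group

# F0 s.d. predicted by mean F0, with a Group interaction; the interaction
# terms test whether the slope differs across groups
summary(lm(F0sd ~ meanF0 * Group, data = d))

# Group-specific slopes (the coefficients of meanF0 within each group)
by(d, d$Group, function(g) coef(summary(lm(F0sd ~ meanF0, data = g))))
```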
The LME analysis showed significant effects of Group [β = 9.307 (SE = 2.661), t(49.4) = 3.498, p = 0.001], as well as a significant interaction between Group and Emotion [β = −10.44 (SE = 1.569), t(1599) = −6.651, p < 0.0001]. No other effects or interactions were observed. Follow-up analyses showed that the effect of Group was significant for the happy emotion [β = 8.759 (SE = 2.774), p = 0.003], but no significant effect of Group was observed for the sad emotion. Post hoc pairwise t tests (Bonferroni corrections applied) comparing the F0 s.d. values obtained by the different groups for the happy emotion productions showed significant differences between the CCI group and ACI, ANH, and CNH groups (p < 0.0001 in all cases), but no significant differences between the ACI, ANH, and CNH groups. Thus, the CCI group’s productions for happy were more monotonous (smaller F0 s.d.) than all other groups.
Mean Intensity
Results showed a significant interaction between Group and Emotion [β = −0.757 (SE = 0.276), t(1558) = −2.738, p = 0.00625], a main effect of Emotion [β = −6.041 (SE = 0.492), t(1558) = −12.27, p < 0.0001], and a main effect of Sentence [β = −0.0773 (SE = 0.031), t(133.9) = 2.526, p = 0.0127].
The interaction between Group and Emotion was not clearly supported by follow-up analyses. When the data were separated out into happy and sad emotions, separate LME analyses with Group as a fixed effect, random subject-based intercepts, and random subject-based slopes for the effect of Sentence showed no significant effects of Group for either emotion. However, the estimated effect of Group was larger for the sad productions [β = −1.1224 (SE = 0.686), t(41) = −1.635, p = 0.11] than for the happy productions [β = −0.3044 (SE = 0.538), t(41) = −0.566, p = 0.574]. This is likely explained by the somewhat lower intensity levels observed in CNH relative to other groups.
Duration
Results showed significant effects of Emotion [β = 0.2233 (SE = 0.0336), t(1599) = 6.8, p < 0.0001] and Sentence [β = 0.01244 (SE = 0.00194), t(1443) = 6.4, p < 0.0001] but no effects of Group and no two-way or three-way interactions.
Spectral Centroid
Results showed a significant effect of Emotion [β = −272.591 (SE = 30.093), t(1599) = −9.058, p < 0.0001] but no effect of Group or Sentence and no interactions.
Acoustic Contrasts Between Happy and Sad Productions
The acoustic contrast between happy and sad productions was specifically investigated for each acoustic cue. For the mean F0, the contrast was defined as the ratio between the mean F0s for happy and sad productions. For the F0 s.d., the contrast was defined as the ratio between the standard deviations of F0 for happy and sad productions. For Intensity, the contrast was defined as the difference in dB between the mean intensities of happy and sad productions. For Duration, the contrast was defined as the ratio between the durations of happy and sad productions. For Spectral Centroid, the contrast was defined as the ratio between the spectral centroids of happy and sad productions. Ratios between the values for happy and sad productions were chosen over other measures (e.g., simple differences) for consistency with findings in the literature on auditory perception, which indicate that perceptual sensitivity to differences between sounds along specific acoustic dimensions is well modeled by a system that encodes the sensory input using a power-law and/or logarithmic representation. LME analyses were conducted with Group and Sentence as fixed effects, by-subject random intercepts, and by-subject random slopes for the effect of Sentence.
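As a concrete illustration of these contrast definitions (not the original analysis code), the sketch below computes one contrast value per subject and sentence from a hypothetical per-utterance data frame; the data frame and column names are assumptions, and dplyr/tidyr are used here only for brevity.

```r
library(dplyr)
library(tidyr)

# d: hypothetical data frame with one row per utterance and columns
# Subject, Group, Sentence, Emotion ("happy"/"sad"),
# meanF0, F0sd, Intensity, Duration, Centroid
contrasts_df <- d %>%
  pivot_wider(id_cols = c(Subject, Group, Sentence),
              names_from = Emotion,
              values_from = c(meanF0, F0sd, Intensity, Duration, Centroid)) %>%
  mutate(meanF0_contrast    = meanF0_happy    / meanF0_sad,      # ratio
         F0sd_contrast      = F0sd_happy      / F0sd_sad,        # ratio
         intensity_contrast = Intensity_happy - Intensity_sad,   # dB difference
         duration_contrast  = Duration_happy  / Duration_sad,    # ratio
         centroid_contrast  = Centroid_happy  / Centroid_sad)    # ratio

# Each *_contrast column can then be analyzed with Group and Sentence as fixed
# effects and by-subject random intercepts/slopes, as described in the text.
```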
Mean F0 Contrasts
Results of the LME analysis showed a significant effect of Group [β = 0.115 (SE = 0.05), t(39.00) = 2.237, p = 0.031]. A pairwise t test with Bonferroni corrections showed significant differences between all Groups (p < 0.0001 in all cases). Figure 2 (upper) shows boxplots of the mean F0 contrast for the four groups and for each of the 20 sentences. The CCI group (blue) shows the smallest contrast of all four groups.
Figure 2. Boxplots of acoustic contrasts between happy and sad emotions for mean F0 (upper) and F0 s.d. (lower) for each sentence (abscissa) and for the four groups (see legend).
F0 Standard Deviation Contrasts
Results of the LME analysis showed a significant effect of Group [β = 0.35 (SE = 0.160), t(39.00) = 2.19, p = 0.0345]. A pairwise t test with Bonferroni correction showed significant differences between all groups (p < 0.0001 in all cases). Figure 2 (lower) shows boxplots of the F0 s.d. contrast for the four groups and for each of the 20 sentences. The CCI group shows the smallest contrast of all four groups.
Intensity Contrast
The results of the LME analysis showed no significant effects of Sentence or Group and no interactions.
Duration Contrast
Consistent with previous analyses, results of the LME analysis showed no effects of Group or Sentence and no interactions.
Spectral Centroid Contrast
Consistent with previous analyses, the LME analysis showed no effects of Group or Sentence and no interactions.
Analyses of Results Obtained in Child Participants With Normal Hearing and Cochlear Implants
Initial analyses indicated different patterns for CNH and CCI and for female versus male children. Results obtained in NH and CI child participants were therefore analyzed separately for effects of Age and Sex on mean F0, F0 variation, Intensity, Duration, and Spectral Centroid. The data are plotted in Figure 3, which shows each acoustic cue as a function of Age, separated out by Sex and Group.
Figure 3. (A–E) Values of acoustic features (A: mean F0; B: F0 s.d.; C: Intensity; D: Duration; E: Spectral Centroid) of the happy (red) and sad (blue) emotions recorded by CNH and CCI, plotted against their age (abscissa). For each acoustic feature, left- and right-hand panels show results in CCI and CNH, respectively, and upper and lower plots show results in female and male participants, respectively. The differently shaped symbols and lines in each color represent individual sentences recorded in each emotion.
Acoustic Analyses of Productions by Children With Normal Hearing
Mean F0
An LME with fixed effects including Age, Sex, and Emotion, and with by-subject random intercepts and random slopes for Sentence, showed a significant effect of Sex [β = 407.03 (SE = 172.935), t(11.4) = 2.354, p = 0.0375], a significant interaction between Age and Sex [β = −37.29 (SE = 12.72), t(11.4) = −2.932, p = 0.0132], a significant interaction between Sex and Emotion [β = −242.569 (SE = 112.977), t(351) = −2.147, p = 0.0325], and a significant three-way interaction between Age, Sex, and Emotion [β = 19.348 (SE = 8.31), t(351) = 2.328, p = 0.0205]. The data are plotted in Figure 3A (right-hand panels). Consistent with expected differences in vocal development, the male children showed a larger decrease in F0 with age than the female children did. The male children in this sample also showed a decreasing effect of Emotion with age compared to the female children (hence, the three-way interaction).
F0 Variation
A parallel analysis to that described above with F0 s.d. as the dependent variable showed significant interactions between Age and Sex [β = −10.77 (SE = 3.33), t(13.6) = −3.022, p = 0.0094] and Sex and Emotion [β = −80.765 (SE = 39.08), t(342) = −2.067, p = 0.0395]. The pattern of results (Figure 3B, right-hand panels) is generally similar to that in Figure 3A for CNH, consistent with the correlation between the two variables. The separation between the emotions is somewhat smaller for male than for female participants in this sample, and there are age-related declines in the F0 s.d. in the male participants’ productions that are not observed in the female participants’ voices.
Intensity
An LME model with random slopes showed a significant effect of Emotion [β = 6.393 (SE = 2.364), t(351) = −2.704, p = 0.0072], a marginally significant two-way interaction between Age and Emotion [β = 0.3916 (SE = 0.209), t(351) = −1.873, p = 0.062], a significant two-way interaction between Sex and Sentence [β = 0.8233 (SE = 0.404), t(351) = 2.037, p = 0.0425], a three-way interaction among Sex, Emotion, and Sentence [β = −1.447 (SE = 0.572), t(351) = −2.531, p = 0.0118], a marginally significant three-way interaction among Age, Sex, and Sentence [β = −0.0559 (SE = 0.0297), t(347.827) = −1.879, p = 0.0611], and a four-way interaction among Age, Sex, Emotion, and Sentence [β = 0.099 (SE = 0.042), t(351) = 2.358, p = 0.0189]. The results are plotted in Figure 3C (right-hand panels). The separation between the emotions is clear, but for the male participants, the separation decreases somewhat with age, more so than for the female participants. The interaction with Sentence indicates that the pattern depends on the individual sentence.
Duration
An LME model with random slopes showed significant effects of Age [β = −0.051 (SE = 0.024), t(12.30) = −2.459, p = 0.0296] and Emotion [β = 0.420 (SE = 0.171), t(351)=2.456, p = 0.0145] with no other effects and no interactions. This is clearly apparent in Figure 3D (right-hand panels). The separation between the emotions remains consistent with age, across sentences, and for both sexes.
Spectral Centroid
An LME model with random slopes showed a significant effect of sex [β = 980.364 (SE = 359.645), t(14.30) = 2.726, p = 0.0162] and significant two-way interactions between Age and Sex [β = −84.471 (SE = 26.456), t(14.30) = −3.193, p = 0.0064], between Age and Emotion [β = −19.712 (SE = 9.686), t(352) = −2.035, p = 0.0426], and between Emotion and Sentence [β = 21.47 (SE = 9.145), t(352) = 2.348, p = 0.0195]. A three-way significant interaction among Sex, Emotion, and Sentence [β = −84.10 (SE = 26.493), t(352) = −3.174, p = 0.0016] and a four-way significant interaction among Age, Sex, Emotion, and Sentence [β = 5.471 (SE = 1.949), t(352) = 2.808, p = 0.0053] were also observed. The results are shown in Figure 3E (right-hand panels). It is apparent that the separation between the emotions decreases with age for the male participants, more so than for the female participants, and that the pattern varies across sentences.
Acoustic Analyses of Productions by Children With Cochlear Implants
Mean F0
Results obtained in child participants with CIs are plotted in the left-hand panels of Figure 3A. It is apparent that the separation between the emotions is smaller in the CI population than in the NH children (right-hand panels). A parallel analysis to that conducted with the CNH showed a significant interaction between Age and Sex [β = −12.971 (SE = 5.619), t(15.2) = −2.308, p = 0.0355]. This is clear in the steeper slope for the male children with CIs than for the female children with CIs in Figure 3A and parallels the findings with the CNH. A significant two-way interaction was observed between Sex and Emotion [β = 285.99 (SE = 42.159), t(507) = 6.784, p < 0.0001], and a three-way interaction among Age, Sex, and Emotion showed a further effect of Age on the Sex by Emotion interaction [β = −15.814 (SE = 3.018), t(507) = −5.24, p < 0.0001]. These interactions are likely explained by the female participants’ productions showing a consistent separation between the happy and sad emotions across Age, while the male participants’ productions show little to no separation, with the direction of the difference changing with Age. A marginally significant three-way interaction among Sex, Emotion, and Sentence was also observed [β = −7.03 (SE = 3.519), t(507) = −1.998, p = 0.046], likely due to the greater dependence of the mean F0 on Sentence for sad emotions produced by the male children relative to their happy emotions and also relative to their female counterparts (Figure 3A, left-hand panels).
F0 Variation
Results obtained in the child participants with CIs are plotted in the left-hand panels of Figure 3B. It is evident that the separations between the two emotions are smaller in the CCI than in their CNH counterparts (Figure 3B, right-hand panels). A similar analysis as described above with F0 s.d. as the dependent variable showed a significant two-way interaction between Sex and Emotion [β = 165.813 (SE = 18.646), t(507) = 8.893, p < 0.0001], three-way interactions among Age, Sex, and Emotion [β = −9.91 (SE = 1.334), t(507) = −7.425, p < 0.0001] and among Sex, Emotion, and Sentence [β = −3.991 (SE = 1.557), t(507) = −2.564, p = 0.0106], and a four-way interaction among Age, Sex, Emotion, and Sentence [β = 0.265 (SE = 0.111), t(507) = 2.378, p = 0.0178]. The pattern of results is generally similar to that obtained with mean F0 (Figure 3B, left-hand panels).
Intensity
Results (plotted in the left-hand panels of Figure 3C) showed a marginally significant effect of Age [β = 0.781 (SE = 0.386), t(13.70) = 2.022, p = 0.063] and a significant effect of Emotion [β = −4.198 (SE = 1.692), t(494) = −2.481, p = 0.0134] and no other effects or interactions. It is apparent that the separation between the emotions is smaller in the children with CIs than in their counterparts with NH (Figure 3C, right-hand panels).
Duration
Results showed a significant negative effect of Age [β = −0.0368 (SE = 0.0137), t(18.1) = −2.662, p = 0.0158], indicating an overall faster speaking rate in older children, but no other effects or interactions. The effect of Age is similar to that observed in the children with NH.
Spectral Centroid
Results showed a significant effect of Emotion [β = −253.695 (SE = 121.996), t(508.2) = −2.08, p = 0.0381] but no other effects and no interactions. No obvious differences are apparent between the children with CIs and their NH counterparts.
Children With Cochlear Implants: Effects of Age at Implantation and Duration of Device Experience
The results obtained with CCI were separately analyzed for effects of Age at Implantation and Duration of Device Experience on individual acoustic cues to emotion. Age at implantation was significantly correlated with Duration of Device Experience (r = 0.63, p < 0.0001), so these variables were considered separately in the statistical analyses. Consistent with the Duration of Device Experience being highly correlated with Age (r = 0.99), the statistical analyses with Duration of Device Experience as a fixed effect produced almost identical results to those previously described with Age as the fixed effect and are not reported here in the interest of space. Results with Age at Implantation as the fixed effect of interest are described below.
LME analyses were conducted with Age at Implantation, Emotion, and Sentence as fixed effects, random intercepts by subject, and random slopes for the effect of Sentence.
Mean F0
An LME analysis as described above with mean F0 as the dependent variable showed a significant effect of Emotion [β = −73.119 (SE = 30.5273), t(507) = −2.395, p = 0.017], a significant interaction between Emotion and Sex [β = 251.6315 (SE = 50.006), t(507) = 5.032, p < 0.0001], and a significant three-way interaction between Age at Implantation, Emotion, and Sex [β = −131.525 (SE = 35.046), t(507) = −3.753, p = 0.0002]. These interactions can be observed in Figure 4 (top panel), which plots the ratio of mean F0 values for happy and sad emotions against Age at Implantation. Left- and right-hand panels show data obtained in female and male children. The acoustic contrast for mean pitch is relatively unchanging for female children but increases for male children with increasing Age at Implantation. This likely simply reflects the developmental effects in the male children observed in Figure 3 (recall that Age at Implantation is correlated with age at testing).
Figure 4. (Top to bottom) Mean F0, F0 s.d., and Intensity of productions by CCI, plotted against their age at implantation. Left- and right-hand panels show results in female and male participants, respectively. Red and blue symbols represent happy and sad emotions, respectively, and the differently shaped symbols and the lines represent individual sentences recorded in each emotion.
F0 Variation
An LME analysis as described above with F0 s.d. as the dependent variable showed a significant interaction between Emotion and Sex [β = 94.7914 (SE = 24.711), t(507) = 3.836, p = 0.0001] and a three-way interaction between Age at Implantation, Emotion, and Sex [β = −46.195 (SE = 17.318), t(507) = −2.667, p = 0.0079]. Figure 4 (middle panel) shows the F0 s.d. ratio between happy and sad emotions plotted against Age at Implantation. The patterns are similar to those observed with mean F0 and also consistent with the effects of Age in Figure 3.
Intensity
An LME analysis as described above with Intensity as the dependent variable showed a significant effect of Emotion [β = −10.712 (SE = 2.169), t(494) = −4.938, p < 0.0001] and a significant interaction between Age at Implantation and Emotion [β = 4.715 (SE = 1.567), t(494) = 3.009, p = 0.0028]. A marginally significant three-way interaction between Age at Implantation, Emotion, and Sex was also observed [β = −4.71 (SE = 2.49), t(494) = −1.892, p = 0.0591]. Figure 4 (bottom panel) shows the intensity difference between happy and sad emotions plotted against Age at Implantation. The interaction between Age at Implantation and Emotion appears to be driven by the female children, who produced larger intensity differences (happy > sad) at earlier ages at implantation. This pattern of results is distinct from that observed in Figure 3 with Age. Note that a nonlinear fit may have better captured the trends in this dataset, specifically the elevated intensities observed at some ages at implantation; however, given the small sample size, we refrained from attempting such a fit to avoid problems with overfitting.
Duration
An LME analysis as described above with Duration as the dependent variable showed a marginally significant effect of Emotion [β = 0.3311 (SE = 0.18), t(507) = 1.842, p = 0.066], but no other effects or interactions reached significance.
Spectral Centroid
An LME analysis as described above with Spectral Centroid as the dependent variable showed a significant effect of Emotion [β = −380.227 (SE = 162.677), t(507.8) = −2.337, p = 0.0198] but no other effects and no interactions.
Discussion
Summary of the Results
Analysis of the mean F0, F0 variation (F0 s.d.), Intensity, Duration, and Spectral Centroid for the happy and sad emotions showed significant effects of Emotion on each of the cues measured. The mean F0, F0 s.d., mean Intensity, and Spectral Centroid were each higher for happy than for sad emotion productions, whereas Duration was shorter for happy than for sad. These basic findings are consistent with acoustic analyses reported in the literature in typical adult populations (e.g., Banse and Scherer, 1996). We were particularly interested in differences between the groups in the effect of the individual emotions. An interaction between Emotion and Group was observed for mean F0, F0 s.d., and Intensity but not for Duration or Spectral Centroid. The Group by Emotion interaction for Intensity was not well supported in post hoc analyses and was not reflected in the analysis of Intensity contrasts between the two emotions, which showed no significant effect of Group. Thus, the only reliably strong Emotion by Group interactions were those observed in F0 and F0 s.d. measures. The Emotion by Group interaction for mean F0 was explained by post hoc analyses indicating that mean F0 values were not significantly different for happy productions across the groups, but the mean F0 of the sad productions differed significantly across groups: while the adult NH and CI groups did not differ significantly, the CCI group produced a higher mean F0 for sad emotions than all other groups. Post hoc analyses on the F0 s.d. measures showed that the primary factor driving the Group by Emotion interaction was that the CCI group’s happy productions were the most monotonous of the four groups. Analyses of the acoustic contrasts between the two emotions further confirmed these findings.
The spectral centroid of an utterance provides information about the overall shape of the spectrum and is expected to be reflective of the phonetic content of the utterance, but it also provides information about emotion. Specifically, the relative energy in the lower and higher portions of the spectrum changes with emotion. As an example, Banse and Scherer (1996) showed that the decrease in energy at frequencies higher than 1,000 Hz is one of the important acoustic cues for emotion. These differences are reasonably well captured in the spectral centroid measure. For instance, positive emotions tend to be associated with more energy in the higher frequencies (higher spectral centroid), while negative or unpleasant emotions are associated with more energy at lower frequencies (lower spectral centroid). The present results suggest that all four groups showed similar changes in spectral centroid between happy and sad productions.
Consistent with the fact that the duration cue is well represented in CI processing, all four groups showed similar changes in duration with emotion, reflecting the expected faster speaking rate for happy emotion and a slower speaking rate for sad emotion. The Intensity cue is also represented in CIs, although the limited dynamic range and the effects of the automatic gain control do distort intensity-domain information, and this is consistent with the results showing that, similar to the other groups, the CCI also produced louder speech for happy than for sad emotions.
Taken together, the analyses of the group data indicate that the CCI produce happy and sad emotions with normal-range distinctions in duration, intensity, and spectral shape. The deficit appears to be focused on F0 (voice pitch)-related parameters in this dataset. Specifically, CCI produce smaller contrasts in mean F0 and in F0 variation than other groups. The reduced production of F0 contrasts is consistent with a degraded perception of voice pitch through CIs. The reduced F0 s.d. for happy emotions in CCI suggests a more monotonous speaking style overall, which may create difficulties for social communication in this population. These data also suggest that CCI do not exaggerate contrasts in the cues to which they are more perceptually sensitive (e.g., duration, intensity) in order to distinguish emotions in their speech. However, it is possible that differences do exist between CCI and other groups in these parameters and that a study with a larger sample size might reveal such differences. Based on the present dataset, it appears that F0-related cues are more strongly and more consistently impacted in CCIs’ productions than other cues.
The analyses of the CNHs’ and CCIs’ productions were conducted separately to investigate developmental effects and effects of sex. Results in the CNH group showed interactions among Age, Sex, and Emotion, with the male children’s mean F0 decreasing more than the female children’s as they reached their upper teenage years. Visual inspection of the data further suggested that, along with their deepening voices, the older male children also produced smaller contrasts between happy and sad emotions than did their female peers.
The CCIs’ productions showed similar effects of Age and Sex, although the acoustic contrasts were clearly smaller for the CCIs’ productions than for the CNHs’ productions. Male CCI showed a deepening voice pitch with increasing age, while female CCI showed relatively small changes in voice pitch with age. The two younger male children with CIs showed a strong dispersion of mean F0 across sentences, particularly for sad productions, and a higher mean F0 for sad than for happy productions for some of the sentences. The trend reversed in the older male children, who showed the expected lower mean F0 for sad than for happy productions, but the separation remained small (Figure 3A). Note, however, that the limited sample size precludes the drawing of firm conclusions. Measures of F0 s.d. showed similar patterns. Intensity, Duration, and Spectral Centroid did not show any interactions between Age and Emotion in the CCI.
Analyses of the effects of Age at Implantation and Duration of Device Experience were conducted separately because these two variables were correlated with one another. Duration of Device Experience was highly correlated with Age, and the patterns of findings were virtually identical. Mean F0 and F0 s.d. showed similar patterns with increasing Age at Implantation to those observed with Age and with Duration of Device Experience. The correlations between these variables preclude clear inferences regarding the underlying mechanisms. It is likely that the deepening mean F0 with Age at Implantation in male CCI is simply a reflection of developmental changes with Age.
The analysis of Intensity showed a different effect of Age at Implantation than did Age, and therefore, this effect is more likely to be unique to Age at Implantation. There was a significant two-way interaction between Age at Implantation and Emotion modified by Sex in a further three-way interaction. Visual inspection of Figure 4 (lower panels) suggests that the interaction was due to a greater separation of the emotions in earlier-implanted children than in later-implanted children, an effect that is stronger in female than in the male children in the present sample.
Comparison Between Children With Cochlear Implants’ and Adults With Cochlear Implants’ Production of Emotions
Although both CCI and ACI hear speech through the degradation of CI processing combined with electric stimulation, the present results indicate that the two groups produce vocal emotions very differently. While the ACIs’ productions showed clear separations between the emotions in all measures considered, the CCI showed significantly smaller acoustic contrasts in F0 and in F0 variation than all other groups. On the other hand, ACIs’ perception of vocal emotions has been shown to be comparable to CCIs’ perception, even with the exaggerated prosody of child-directed speech (Chatterjee et al., 2015). This suggests that perception and production of vocal emotions may be linked in CCI, who learned to speak through electric hearing, but not in ACI, who learned to speak through acoustic hearing. We conclude that access to acoustic information in the early developmental years is crucial for the development of vocal motor patterns. In the ACI, these patterns seem to have been retained despite years of listening to a highly degraded, abnormal speech input. The CCI, on the other hand, had no access to usable hearing prior to implantation, and this is reflected in their atypical patterns of emotional prosody. We note here that the CCI produced the words in the sentences with high accuracy (this was separately verified by asking normally hearing listeners to listen to the recordings and repeat back the words in the recorded productions without regard to emotion). The ACI also produced the words with high accuracy. It is possible that speech therapy in CCI focuses more on the speech phonetics of words than on speech prosody and that a greater focus on prosody in general may be beneficial to CCI. We note that ACI in the United States typically do not receive more than minimal speech therapy after implantation.
Links to Related Studies in the Literature
Similar to the present study, other studies of vocal emotion production by children with CIs have also focused on primary emotions such as happy and sad, largely because they are highly contrastive in multiple acoustic dimensions as well as in their conceptual meaning. The present study focused on acoustic analyses, while other studies have investigated the intelligibility of the emotional productions. Additionally, in the majority of other studies, the child participants were tasked to imitate the emotional productions of an exemplar, while in the present study, participants were not provided with any examples, training, or targeted feedback. A recent study (Van De Velde et al., 2019) did not use imitative productions, but their methodology was quite different, and as discussed in the section Introduction, the task was more complex. These differences notwithstanding, the present findings of reduced acoustic contrasts between the emotions in CCI are consistent with the findings of previous studies showing impaired or less recognizable emotions produced by CCI. These findings are also consistent with previous findings of impaired production of question/statement contrasts and lexical tones by children with CIs. Studies of singing by children with CIs also show impairments, although music requires a far greater sense of pitch, and therefore, singing may be considered a far more difficult task than producing speech intonations. Our finding that Age at Implantation had modest effects on the productions whereas Age at Testing (highly correlated with Duration of Device Experience) had a stronger impact is consistent with the findings of Van De Velde et al. (2019), who also found improvements with increased hearing age in their cohort of children with CIs.
In a recent investigation (Damm et al., 2019), the identifiability of these same recordings, made by the same participants, was measured by asking normally hearing child and adult listeners to indicate whether each recording sounded happy or sad. In contrast to the normally hearing talkers and the postlingually deaf adult talkers, the CCI group’s recordings showed deficits in how well their recorded emotions were identified. In that study, Age at Implantation was found to be a significant predictor, with the earlier-implanted CCIs’ emotions being significantly better identified than the later-implanted CCIs’ productions. The group results are consistent between the two studies (i.e., the CCI in the present study produced smaller acoustic contrasts than the other groups, and their emotions were also more poorly identified than those of the other groups in the Damm et al. study). However, a larger dataset would be needed to establish direct relationships between the acoustic features of individual talkers’ emotion productions and how well they can be identified by listeners.
Limitations, Strengths, and Clinical Implications of the Present Study
The present study suffers from several limitations. First, the limited sample size leads us to treat these findings with caution. It is possible that a larger sample size would reveal group differences in acoustic features such as intensity, duration, or spectral centroid that could not be captured with the present small dataset. Second, the information about Age at Implantation was obtained from parents or guardians of the child participants and could not be verified independently. Third, perceptual data on this cohort of CCIs’ emotion recognition abilities were not obtained, nor were data on their general or social cognition or other linguistic abilities. Fourth, the correlations between specific variables of interest (such as age at implantation and duration of device experience) precluded investigations of their combined effects. This, however, is a problem that is inherent to CI studies and not easily remedied in experimental design. Further, information about access to and use of speech therapy by the CCI was not obtained. Finally, the method used to elicit the emotions had some limitations in that spontaneous expression of emotions was not achieved. There may well be differences between the emotions recorded using brief sentences in the laboratory and natural emotions communicated by the participants in their everyday life. Differences in the prosody of read or scripted speech as opposed to spontaneous speech have been reported in the literature (Laan, 1997). Although Damm et al. (2019) found that these methods evoked highly identifiable emotions in the CNH, ANH, and ACI groups, the differences between laboratory-recorded and naturally spoken emotional speech may further modify the group differences observed here. These limitations should be addressed in future studies.
Despite these limitations, the present results represent the first attempt to compare emotional productions by prelingually deaf children with CIs with those of postlingually deaf adult CI users, alongside normally hearing peers. One strength of the design was the careful selection of CCI who (with the caveat that the information was based on self- and parent-report and could not be independently verified) had no usable hearing at birth, to more clearly separate them from postlingually deaf ACI who had good hearing in their early years. The findings suggest a key role of access to acoustic information during development for the production of prosodic cues. They also shed new light on specific sources of the impairment in emotional productions, which could help develop improved speech therapy tools for children with CIs. For instance, the data suggest that the CCIs’ small acoustic contrast between mean F0 for happy and sad emotions in the present study was driven by an insufficiently low mean F0 for sad emotions compared to other populations. Additionally, the CCIs’ small acoustic contrast between F0 variations for happy and sad emotions was driven by an overly monotonous production of happy emotions compared to other populations. These impairments may be addressed in targeted speech therapy. Finally, although the sample size was small, the findings suggest the possibility of differences between male and female children in their productions of vocal emotions and speech intonations. Specifically, the results suggest that male children with CIs may encounter difficulties adjusting to their changing vocal pitch with increasing age. This is an aspect that needs further investigation with larger sample sizes.
The emotions selected for the present study were chosen for their high acoustic and conceptual contrast. We speak with a higher pitch, with more pitch modulation, and more loudly and quickly when we communicate in a happy way. By contrast, we speak with a lower pitch, more monotonously, and more softly and slowly when we communicate in a sad way. The vocal tract changes in a contrastive way between these emotions as well. The deficits observed in the CCI in the present study with these highly contrastive emotions may underestimate the true nature of the deficit when more subtle emotions are to be communicated through prosodic cues in speech. A study investigating how well these participants’ recordings were heard as happy or sad by normally hearing listeners (Damm et al., 2019) showed strong variability among the CCI talkers. Although some were very well understood, others’ emotions were mislabeled more frequently. Overall, the CCIs’ productions were less correctly identified than the ACIs’, the CNHs’, and the ANHs’ productions. On the other hand, the CCIs’ productions of the words in the sentences were highly recognizable. It is worth noting that present-day clinical protocols are designed with a focus on word and sentence recognition, with little to no emphasis on speech prosody. These findings, and others in the current literature, underscore a crucial need to address vocal pitch and emotion communication in the pediatric CI population in both the realms of scientific research and clinical intervention. The positive findings with ACI indicate that the presence of acoustic hearing (particularly at low frequencies) at birth and during development plays a supportive role in vocal emotion production that is retained long after that hearing is lost. This result suggests a benefit to retaining any residual acoustic hearing in CCI alongside cochlear implantation, at least in the area of the production of emotional (and likely other forms of) speech prosody.
Data Availability Statement
The datasets generated for this study are available on request to the corresponding author.
Ethics Statement
The studies involving human participants were reviewed and approved by Boys Town Institutional Review Board. Written informed consent to participate in this study was provided by the participants’ legal guardian/next of kin.
Author Contributions
MC conceptualized the study, led the design and set-up, led the acoustic analyses, analyzed the data, and wrote the manuscript. AK contributed to the study design and set-up and led the data collection and processing. RS conducted acoustic analyses and helped with manuscript writing. JC contributed to study design and data collection and conducted preliminary acoustic analyses. MH contributed to study design and acoustic analyses. JS contributed to study design, data collection, and acoustic analyses. SD contributed to study design, data collection, and acoustic analyses.
Funding
This work was supported by NIH NIDCD R01 DC014233 and the Clinical Management Core of NIH NIGMS P20 GM109023.
Conflict of Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
References
Banse, R., and Scherer, K. R. (1996). Acoustic profiles in vocal emotion expression. J. Pers. Soc. Psychol. 70, 614–636. doi: 10.1037/0022-3514.70.3.614
Bates, D., Maechler, M., Bolker, B., and Walker, S. (2015). Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67, 1–48. doi: 10.18637/jss.v067.i01
Boersma, P., and Weenink, D. (2019). Praat: doing phonetics by computer [Computer program]. Version 6.0.52. Available at: http://www.praat.org/ (Accessed May 2, 2019).
Chatterjee, M., and Peng, S. C. (2008). Processing F0 with cochlear implants: modulation frequency discrimination and speech intonation recognition. Hear. Res. 235, 143–156. doi: 10.1016/j.heares.2007.11.004
Chatterjee, M., Zion, D. J., Deroche, M. L., Burianek, B. A., Limb, C. J., Goren, A. P., et al. (2015). Voice emotion recognition by cochlear-implanted children and their normally-hearing peers. Hear. Res. 322, 151–162. doi: 10.1016/j.heares.2014.10.003
Damm, S. A., Sis, J. L., Kulkarni, A. M., and Chatterjee, M. (2019). How vocal emotions produced by children with cochlear implants are perceived by their hearing peers. J. Speech Lang. Hear. Res. (in press).
Deroche, M. L., Kulkarni, A. M., Christensen, J. A., Limb, C. J., and Chatterjee, M. (2016). Deficits in the sensitivity to pitch sweeps by school-aged children wearing cochlear implants. Front. Neurosci. 10:73. doi: 10.3389/fnins.2016.00073
Fox, J., and Weisberg, S. (2011). An R companion to applied regression. 2nd Edn. Thousand Oaks, CA: Sage. Available at: http://socserv.socsci.mcmaster.ca/jfox/Books/Companion (Accessed July 1, 2019).
Gilbers, S., Fuller, C., Gilbers, D., Broersma, M., Goudbeek, M., Free, R., et al. (2015). Normal-hearing listeners’ and cochlear implant users’ perception of pitch cues in emotional speech. i-Perception 6:0301006615599139.
Green, T., Faulkner, A., Rosen, S., and Macherey, O. (2005). Enhancement of temporal periodicity cues in cochlear implants: effects on prosodic perception and vowel identification. J. Acoust. Soc. Am. 118, 375–385. doi: 10.1121/1.1925827
Hopyan-Misakyan, T. M., Gordon, K. A., Dennis, M., and Papsin, B. C. (2009). Recognition of affective speech prosody and facial affect in deaf children with unilateral right cochlear implants. Child Neuropsychol. 15, 136–146. doi: 10.1080/09297040802403682
Jiam, N. T., Caldwell, M., Deroche, M. L., Chatterjee, M., and Limb, C. J. (2017). Voice emotion perception and production in cochlear implant users. Hear. Res. 352, 30–39.
Kuznetsova, A., Brockhoff, P. B., and Christensen, R. H. B. (2017). lmerTest package: tests in linear mixed effects models. J. Stat. Softw. 82, 1–26. doi: 10.18637/jss.v082.i13
Laan, G. P. M. (1997). The contribution of intonation, segmental durations, and spectral features to the perception of a spontaneous and a read speaking style. Speech Comm. 22, 43–65. doi: 10.1016/S0167-6393(97)00012-5
Luo, X., Fu, Q.-J., and Galvin, J. J. 3rd. (2007). Vocal emotion recognition by normal-hearing listeners and cochlear implant users. Trends Amplif. 11, 301–315. doi: 10.1177/1084713807305301
Luo, X., Kern, A., and Pulling, K. R. (2018). Vocal emotion recognition performance predicts the quality of life in adult cochlear implant users. J. Acoust. Soc. Am. 144, EL429–EL435. doi: 10.1121/1.5079575
Nakata, T., Trehub, S. E., and Kanda, Y. (2012). Effect of cochlear implants on children’s perception and production of speech prosody. J. Acoust. Soc. Am. 131, 1307–1314. doi: 10.1121/1.3672697
Nash, J. C., and Varadhan, R. (2011). Unifying optimization algorithms to aid software system users: optimx for R. J. Stat. Softw. 43, 1–14. doi: 10.18637/jss.v043.i09
Paquette, S., Ahmed, G. D., Goffi-Gomez, M. V., Hoshino, A. C. H., Peretz, I., and Lehmann, A. (2018). Musical and vocal emotion perception for cochlear implant users. Hear. Res. 370, 272–282. doi: 10.1016/j.heares.2018.08.009
Peng, S. C., Lu, N., and Chatterjee, M. (2009). Effects of cooperating and conflicting cues on speech intonation recognition by cochlear implant users and normal hearing listeners. Audiol. Neurotol. 14, 327–337. doi: 10.1159/000212112
Peng, S.-C., Lu, H.-P., Lu, N., Lin, Y.-S., Deroche, M. L. D., and Chatterjee, M. (2017). Processing of acoustic cues in lexical-tone identification by pediatric cochlear-implant recipients. J. Speech Lang. Hear. Res. 60, 1223–1235. doi: 10.1044/2016_JSLHR-S-16-0048
Peng, S.-C., Tomblin, J. B., Cheung, H., Lin, Y.-S., and Wang, L.-S. (2004). Perception and production of mandarin tones in prelingually deaf children with cochlear implants. Ear Hear. 25, 251–264. doi: 10.1097/01.AUD.0000130797.73809.40
Peng, S.-C., Tomblin, J. B., and Turner, C. W. (2008). Production and perception of speech intonation in pediatric cochlear implant recipients and individuals with normal hearing. Ear Hear. 29, 336–351. doi: 10.1097/AUD.0b013e318168d94d
Přibilová, A., and Přibil, J. (2009). “Spectrum modification for emotional speech synthesis” in Multimodal signals: Cognitive and algorithmic issues. Berlin, Heidelberg: Springer, 232–241.
R Core Team (2016). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Available at: https://www.R-project.org/ (Accessed January 5, 2019).
Scherer, K. R. (2003). Vocal communication of emotion: a review of research paradigms. Speech Comm. 40, 227–256. doi: 10.1016/S0167-6393(02)00084-5
Schorr, E. A., Roth, F. P., and Fox, N. A. (2009). Quality of life for children with cochlear implants: perceived benefits and problems and the perception of single words and emotional sounds. J. Speech Lang. Hear. Res. 52, 141–152. doi: 10.1044/1092-4388(2008/07-0213)
See, R. L., Driscoll, V. D., Gfeller, K., Kliethermes, S., and Oleson, J. (2013). Speech intonation and melodic contour recognition in children with cochlear implants and with normal hearing. Otol. Neurotol. 34, 490–498. doi: 10.1097/MAO.0b013e318287c985
Sueur, J. (2018). Sound analysis and synthesis with R. Culemborg, Netherlands: Springer, Use R! Series.
Sueur, J., Aubin, T., and Simonis, C. (2008). Seewave: a free modular tool for sound analysis and synthesis. Bioacoustics 18, 213–226. doi: 10.1080/09524622.2008.9753600
Tinnemore, A. R., Zion, D. J., Kulkarni, A. M., and Chatterjee, M. (2018). Children’s recognition of emotional prosody in spectrally-degraded speech is predicted by their age and cognitive status. Ear Hear. 39, 874–880. doi: 10.1097/AUD.0000000000000546
Van De Velde, D. J., Schiller, N. O., Levelt, C. C., Van Heuven, V. J., Beers, M., Briaire, J. J., et al. (2019). Prosody perception and production by children with cochlear implants. J. Child Lang. 46, 111–141. doi: 10.1017/S0305000918000387
Keywords: acoustics, emotion, vocal, production, speech, cochlear implants, children
Citation: Chatterjee M, Kulkarni AM, Siddiqui RM, Christensen JA, Hozan M, Sis JL and Damm SA (2019) Acoustics of Emotional Prosody Produced by Prelingually Deaf Children With Cochlear Implants. Front. Psychol. 10:2190. doi: 10.3389/fpsyg.2019.02190
Edited by:
Mary Rudner, Linköping University, Sweden
Reviewed by:
Frank A. Russo, Ryerson University, Canada
Timothy Beechey, University of Minnesota Twin Cities, United States
Copyright © 2019 Chatterjee, Kulkarni, Siddiqui, Christensen, Hozan, Sis and Damm. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Monita Chatterjee, monita.chatterjee@boystown.org
†Present address: Julie A. Christensen, Otolaryngology Branch, National Institute on Deafness and Other Communicative Disorders, National Institutes of Health, Bethesda, MD, United States
Mohsen Hozan, Department of Special Education and Communication Disorders, Barkley Memorial Center, University of Nebraska-Lincoln, Lincoln, NE, United States; Biological Systems Engineering Department, University of Nebraska-Lincoln, Lincoln, NE, United States
Jenni L. Sis, Department of Special Education and Communication Disorders, Barkley Memorial Center, University of Nebraska-Lincoln, Lincoln, NE, United States; Biological Systems Engineering Department, University of Nebraska-Lincoln, Lincoln, NE, United States
Sara A. Damm, Omaha Public Schools, Omaha, NE, United States