- 1Cognitive Ethology Laboratory, German Primate Center, Göttingen, Germany
- 2Department of Affective Neuroscience and Psychophysiology, Institute of Psychology, University of Göttingen, Göttingen, Germany
- 3Leibniz ScienceCampus Primate Cognition, Göttingen, Germany
Emotional expressions provide strong signals in social interactions and can function as emotion inducers in a perceiver. Although speech provides one of the most important channels for human communication, its physiological correlates, such as activations of the autonomic nervous system (ANS) while listening to spoken utterances, have received far less attention than other domains of emotion processing. Our study aimed at filling this gap by investigating autonomic activation in response to spoken utterances that were embedded in larger semantic contexts. Emotional salience was manipulated by providing information on alleged speaker similarity. We compared these autonomic responses to activations triggered by affective sounds, such as exploding bombs and applause. These sounds had been rated and validated as being either positive, negative, or neutral. As physiological markers of ANS activity, we recorded skin conductance responses (SCRs) and changes in pupil size while participants classified both prosodic and sound stimuli according to their hedonic valence. As expected, affective sounds elicited increased arousal in the receiver, as reflected in increased SCRs and pupil size. In contrast, SCRs to angry and joyful prosodic expressions did not differ from responses to neutral ones. Pupil size, however, was modulated by affective prosodic utterances, with increased dilations for angry and joyful compared to neutral prosody, although the similarity manipulation had no effect. These results indicate that the cues provided by emotional prosody in spoken, semantically neutral utterances might be too subtle to trigger SCRs, although variation in pupil size indicated the salience of the stimulus variation. Our findings further demonstrate a functional dissociation between pupil dilation and skin conductance that presumably originates from their differential innervation.
Introduction
Emotional expressions conveyed by the face, the voice, and body gestures are strong social signals and might serve as emotion elicitors in an observer or listener. Situations that are relevant for someone’s wellbeing or future prospects, such as meeting an aggressor on the street, possess an emotional meaning that has the power to trigger emotions in the beholder. Bodily reactions, one of the key components of emotion (Moors et al., 2013), are regulated by the autonomic nervous system (ANS) and include changes in the cardiovascular system, respiration, and perspiration (Kreibig, 2010). While autonomic responses to affective pictures and sounds have been reliably demonstrated (e.g., Bradley et al., 2001a), little is known about ANS responses to emotional expressions, in particular with regard to spoken language. Emotional expressions in the voice, however, are of special relevance considering that speech might be the most important communication channel in humans. Our study therefore had two main aims: first, we investigated autonomic activation in response to spoken utterances of neutral semantic content but varying emotional prosody, and second, we compared these responses to those triggered by another auditory domain, namely affective sounds.
There are various physiological indicators of autonomic responses during emotion processing. Skin conductance responses (SCRs) are one of the most frequently used peripheral physiological markers, presumably because they are exclusively driven by the sympathetic nervous system and because they are robust against voluntary modulation. They can thus be assumed to provide an excellent measure of the elicitation of emotional arousal (Dawson et al., 2007). Another promising indicator of even unconscious and subtle changes in emotional arousal is the change in pupil size during stimulus processing (Laeng et al., 2012). The pupil diameter is controlled by two muscles, innervated by the sympathetic and parasympathetic branches of the ANS, which receive input from parts of the central nervous system involved in cognitive and affective processing (e.g., Hoeks and Ellenbroel, 1993). A vast body of research has suggested that pupillary responses serve as a potent measure of top-down and bottom-up attention (e.g., Laeng et al., 2012; Riese et al., 2014), both with regard to emotional and motivational processing (e.g., Bayer et al., 2010, 2017a,b; Bradley et al., 2008; Partala and Surakka, 2003; Võ et al., 2008) and cognitive load (e.g., Stanners et al., 1979; Verney et al., 2001; Nuthmann and Van der Meer, 2005; Van der Meer et al., 2010). Increased attention or mental effort is accompanied by larger pupil dilation: the more attention, the larger the pupil. During emotion perception and recognition, pupil dilation can be influenced by both emotion-based and cognitive factors. The simultaneous consideration of SCRs and changes in pupil size might therefore help to separate the emotion-related from the cognitive sub-processes during the processing of emotional information.
Affective pictures or sounds, mainly depicting violence and erotica, have been shown to robustly increase SCRs and pupil dilation in the perceiver (Partala and Surakka, 2003; Bradley et al., 2008; Lithari et al., 2010). While the processing of emotional expressions has been shown to evoke emotion-related changes in pupil size (see Kuchinke et al., 2011, for prosodic stimuli; Laeng et al., 2013, for faces), evidence for increased SCRs to emotional expressions is less clear (Alpers et al., 2011; Aue et al., 2011; Wangelin et al., 2012). Alpers et al. (2011) and Wangelin et al. (2012) directly compared SCRs to emotional faces and affective scenes. Both studies found increased SCRs to arousing compared to neutral scenes, but not in response to facial expressions of emotion. In contrast, Merckelbach et al. (1989) reported stronger SCRs to angry compared to happy faces, while Dimberg (1982) did not find any differences between the two conditions. SCRs to emotional prosody have been investigated even less: Aue et al. (2011) studied the influence of attention and laterality during the processing of angry prosody. Compared to neutrally spoken nonsense words, the angry speech tokens caused higher SCRs. In line with this finding, Ramachandra et al. (2009) demonstrated that nasals pronounced in an angry or fearful tone of voice elicit larger SCRs in the listener than neutrally pronounced ones, although their stimulus set consisted of only a very limited number of stimuli. A direct comparison between ANS responses to prosodic utterances and affective sounds, both conveying emotional information in the same modality, has not been conducted so far.
The inconsistencies in the studies mentioned above might be explained by the absence of a context in which the stimuli were presented to the participants. Entirely context-free presentations of emotional expressions that are both unfamiliar and personally irrelevant to the participants may simply reduce the overall social relevance of these stimuli and therefore fail to trigger robust emotion-related bodily reactions. In a recent study, Bayer et al. (2017b) demonstrated the importance of context: the authors observed increased pupil dilation to sentence-embedded written emotion words presented in semantic contexts of high individual relevance. Similarly, perceived similarity to a person in distress increases emotional arousal in a bystander (Cwir et al., 2011). In general, sharing attitudes, interests, and personal characteristics with another person has been shown to immediately create a social link to that person (Vandenbergh, 1972; Miller et al., 1998; Jones et al., 2004; Walton et al., 2012). We therefore intended to vary the relevance of the speech stimuli by embedding them into context and manipulating the idiosyncratic similarity between the fictitious speakers and the participants.
The first aim of the present study was to test whether spoken utterances of varying emotional prosody trigger arousal-related autonomic responses, measured by pupil dilation and skin conductance in an explicit emotion categorization task. We increased the social relevance of our speech samples by providing context information that manipulated the personal similarity, in terms of biographical data, between the participant and a fictitious speaker. Second, we examined participants’ physiological responses to affective sounds, such as exploding bombs or applause, in comparison to the prosodic utterances. Based on previous findings on emotional stimuli in the visual modality, we predicted stronger arousal-related effects for the affective sounds than for the prosodic stimuli. Finally, we implemented a speeded reaction time task on the prosodic and sound stimuli in order to disentangle cognitive and emotion-based modulations of the two physiological markers by examining the cognitive difficulties during explicit recognition of the prosodic utterances and affective sounds.
Materials and Methods
Ethics Statement
The present study was approved by the local ethics committee of the Institute of Psychology at the Georg-August-Universität Göttingen. All participants were fully informed about the procedure and gave written informed consent prior to the experiment.
Participants
Twenty-eight female German native speakers, ranging in age between 18 and 29 years (M = 22.8), participated in the main study. The majority of participants (23 out of 28) were undergraduates at the University of Göttingen, three had just finished their studies, and two worked in non-academic professions. Due to technical problems during the recordings, two participants had to be excluded from the analyses of pupil data. We restricted the sample to female participants in order to avoid sex-related variability in emotion reactivity (Bradley et al., 2001b; Kret and De Gelder, 2012).
Stimuli
Spoken Utterances With Emotional Prosody
The emotional voice samples were selected from the Berlin Database of Emotional Speech (EmoDB, Burkhardt et al., 2005). The database consists of 500 acted emotional speech tokens of 10 different sentences. These sentences are of neutral meaning, such as “The cloth is lying on the fridge” [German original: “Der Lappen liegt auf dem Eisschrank”] or “Tonight I could tell him” [“Heute abend könnte ich es ihm sagen”]. From this database we selected 30 angry, 30 joyful, and 30 neutral utterances, spoken by five female actors. Each speaker contributed 18 stimuli to the final set (6 per emotion category). The stimuli had a mean duration of 2.48 ± 0.71 s (anger = 2.61 ± 0.7, joy = 2.51 ± 0.71, and neutral = 2.32 ± 0.71), with no differences between the emotion categories (Kruskal–Wallis chi-squared = 2.893, df = 2, p = 0.24). Information about the recognition of the intended emotion and the perceived naturalness was provided by Burkhardt et al. (2005). We only chose stimuli that were recognized well above chance and perceived as convincing and natural (Burkhardt et al., 2005). Recognition rates did not differ between emotion categories (see Table 1 for descriptive statistics; Kruskal–Wallis chi-squared = 5.0771, df = 2, p = 0.079). Anger stimuli were, however, perceived as more convincing than joyful stimuli (Kruskal–Wallis chi-squared = 11.1963, df = 2, p = 0.004; post hoc test with Bonferroni adjustment for anger vs. joy, p = 0.003). During the experiment, prosodic stimuli were preceded by short context sentences presented in written form on the computer screen. With this manipulation we aimed at providing context information in order to increase the plausibility of the speech tokens. These context sentences were semantically related to the prosodic target sentence and neutral in their wording, such as “She points into the kitchen and says” [German original: “Sie deutet in die Küche und sagt”] followed by the speech token “The cloth is lying on the fridge” [“Der Lappen liegt auf dem Eisschrank”], or “She looks at her watch and says” [German original: “Sie blickt auf die Uhr und sagt”] followed by the speech token “It will happen in seven hours” [“In sieben Stunden wird es soweit sein”].
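The stimulus-level checks described above can be reproduced with standard nonparametric tests in R. The following is a minimal sketch, not the authors' script; the data frame `stimuli` and its columns `duration`, `naturalness`, and `emotion` are illustrative assumptions, and since the exact post hoc procedure is not specified, pairwise Wilcoxon tests with Bonferroni adjustment are used here as one common choice.

```r
# Hedged sketch: comparing stimulus properties across the three emotion
# categories (anger, joy, neutral) of the selected EmoDB utterances.
stimuli$emotion <- factor(stimuli$emotion)

# Kruskal-Wallis test on stimulus duration (expected to be non-significant)
kruskal.test(duration ~ emotion, data = stimuli)

# Kruskal-Wallis test on perceived naturalness, followed by pairwise
# post hoc comparisons with Bonferroni adjustment
kruskal.test(naturalness ~ emotion, data = stimuli)
pairwise.wilcox.test(stimuli$naturalness, stimuli$emotion,
                     p.adjust.method = "bonferroni")
```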
Affective Sounds
Forty-five affective sounds (15 arousing positive, 15 arousing negative, 15 neutral; see Footnote 1) were selected from the IADS database (International Affective Digitized Sounds, Bradley and Lang, 1999). All of them had a duration of 6 s. Erotica were not used in our study, as they have been shown to be processed differently from other positive arousing stimuli (Partala and Surakka, 2003; van Lankveld and Smulders, 2008). The selected positive and negative stimuli did not differ in arousal (see Table 1 for descriptive statistics; t(27) = -0.743, p = 0.463) and were significantly more arousing than the neutral stimuli (t(25) = 12.84, p < 0.001). In terms of emotional valence, positive and negative stimuli differed both from each other (t(24) = 21.08, p < 0.001) and from the neutral condition (positive vs. neutral: t(19) = 11.99, p < 0.001; negative vs. neutral: t(25) = 15.15, p < 0.001), according to the ratings provided in the IADS database. Positive and negative sounds did not differ in their absolute valence distance from the neutral condition (t(24) = 0.159, p = 0.875). Note that this stimulus selection was based on ratings by female participants only, as provided by Bradley and Lang (2007).
As the emotional sounds were rather diverse in their content, we controlled for differences in specific acoustic parameters that might trigger startle reactions or aversion and thus influence the physiological indicators used in the present study in an unintended way. These parameters included intensity, intensity onset (comprising only the first 200 ms), intensity variability (intensity standard deviation), noisiness, harmonic-to-noise ratio (HNR), and energy distribution (the frequency at which 50% of the spectral energy was reached). Intensity parameters were calculated using Praat (Boersma and Weenink, 2009), while noisiness, energy distribution, and HNR were obtained using LMA (Lautmusteranalyse, developed by K. Hammerschmidt; Schrader and Hammerschmidt, 1997; Hammerschmidt and Jürgens, 2007; Fischer et al., 2013). We calculated linear mixed models in R to compare these parameters across the three emotion categories (see Table 2) and conducted post hoc analyses even when the overall analysis was significant only at trend level. We found trend-level differences for intensity and intensity variability, and significant effects for energy distribution across the emotion categories. The differences were marginal and unsystematically spread across the categories, meaning that no emotion category accumulated all aversion-related characteristics (see Table 2); they rather reflect the normal variation found in complex sounds. The probability that acoustic structure confounded the physiological measures is thus low.
Similarity Manipulation
On the basis of participants’ demographic data, such as first name, date and place of birth, field of study, place of residence, living situation, and hobbies, obtained prior to the main experiment, we constructed personal profiles of the fictitious speakers. These profiles either resembled or differed from the participant’s own profile. Similarity was created by using the same gender, the same first name (or a similar equivalent, e.g., Anna and Anne), the same or similar date and place of birth, the same or a similar study program, and the same hobbies. Dissimilar characters were characterized by not being a student, being around 10 years older, not sharing the birth month and date, living in a different federal state of Germany, having a dissimilar first name, and being interested in different hobbies. The manipulation followed the same scheme for every participant and resulted in four personal profiles of (fictitious) speaker characters that resembled the respective participant and four profiles that differed from her profile. To distract participants from the aim of the study, we included trait memory tasks between acquiring the biographical information and the main experiment. Additionally, we instructed the participants to carefully read every profile presented during the experiment, as they would later have to answer questions regarding this biographical information.
Procedure
First, participants filled out questionnaires on demographic data and handedness (Oldfield, 1971). After completing the questionnaires, participants were asked to wash their hands and to remove eye make-up. Participants were then seated 72 cm in front of a computer screen, with their head placed in a chin rest. Peripheral physiological measures were recorded from their non-dominant hand, while their dominant hand was free to use a button box for responding. Stimuli were presented via headphones (Sennheiser HD 449) at a volume of around 55 dB. During and shortly after auditory presentation, participants were instructed to fixate a green circle displayed at the center of the screen in order to prevent excessive eye movements. The circle spanned a visual angle of 2.4° × 2.7° and was displayed on an equiluminant gray background. Additionally, participants were asked not to move and to avoid blinks during the presentation of the target sentences.
The experiment consisted of two parts; Figure 1 gives an overview of the stimulus presentation procedure. In the first part, the prosodic stimuli were presented. Each stimulus was presented twice (once in the similar and once in the dissimilar condition), resulting in a total of 180 trials. The stimulus set was divided into 20 blocks of 9 stimuli (three stimuli per emotion category, i.e., anger, neutral, joy). All stimuli within one block were spoken by the same speaker and were presented in random order within that block. Prior to every prosodic stimulus, a context sentence was presented for 3 s. The personal profile, which carried the similarity manipulation, was shown prior to each block for 6 s. Every second block was followed by a break. Participants had to indicate the valence of each stimulus (positive, negative, or neutral) by pressing one of three buttons. In order to avoid early movements and thus ensure reliable SCR measures, the rating options appeared only 6 s after stimulus onset, and the valence-button assignment changed randomly for every trial. Participants were instructed to carefully read the personal profiles and to empathize with the speaker and the situation. This part lasted about 40 min. At its end, participants answered seven questions regarding the biographical information of the fictitious speakers.
FIGURE 1. Overview of stimulus presentation procedure. (A) One of the 20 presentation blocks created for the prosodic stimuli. All nine stimuli of one block were spoken by the same speaker, and included in randomized order three neutral, three anger and three joy sentences. (B) Stimulus presentation of sounds.
After a short break, the second part started, in which the 45 emotional sounds were presented. Every trial started with a fixation cross in the middle of the screen for 1 s. Each sound was then played for 6 s, while a circle was displayed on screen. When the sound had finished, the response labels (positive, negative, and neutral) appeared in a horizontal row below the circle. The spatial arrangement of the response options changed randomly on every trial; thus, the button order was not predictable. The 45 emotional sounds were presented twice, in two independent cycles, each time in randomized order. In analogy to the prosodic part, participants were instructed to listen carefully and to indicate the valence they intuitively associated most with the sound, without elaborate analysis of the sound’s specific meaning. Short breaks were included after every 15th trial. This part of the experiment lasted about 20 min; the experiment took approximately 60 min in total.
Psychophysiological Data Recording, Pre-processing, and Analysis
Pupil Diameter
Pupil diameter was recorded from the dominant eye using an EyeLink 1000 (SR Research Ltd.) at a sampling rate of 250 Hz. The head position was stabilized via a chin and forehead rest secured to the table. Prior to the experiment, the eyetracker was calibrated with a 5-point calibration to ensure correct tracking of the participant’s pupil. Offline, blinks and artifacts were corrected using spline interpolation. The data were then segmented around stimulus onset (time window: -1000 ms to 7000 ms) and baseline-corrected relative to the 500 ms prior to stimulus onset. Data were analyzed in consecutive time segments of 1 s duration each. We started the analysis 500 ms after stimulus onset, to allow for a short orientation phase, and ended it 5500 ms after onset.
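As an illustration of the segmentation and baseline correction described above, the following R sketch processes a single, already blink-corrected trial; the vector name `pupil` and the trial layout (1 s of pre-stimulus recording) are assumptions for the example, not the authors' code.

```r
# Minimal sketch: baseline correction and 1-s segment averaging for one trial
# of blink-corrected pupil data sampled at 250 Hz.
fs        <- 250                  # sampling rate (Hz)
onset_idx <- fs + 1               # stimulus onset after 1 s of pre-stimulus data

# Baseline: mean pupil size in the 500 ms preceding stimulus onset
baseline <- mean(pupil[(onset_idx - 0.5 * fs):(onset_idx - 1)])
pupil_bc <- pupil - baseline

# Mean baseline-corrected pupil size in consecutive 1-s windows,
# covering 500 ms to 5500 ms after stimulus onset
starts <- seq(0.5, 4.5, by = 1)   # window start times in seconds
segment_means <- sapply(starts, function(t0) {
  idx <- onset_idx + (t0 * fs):((t0 + 1) * fs - 1)
  mean(pupil_bc[idx])
})
```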
Skin Conductance
Skin conductance was recorded at a sampling rate of 128 Hz using ActiView and the BioSemi AD-Box Two (BioSemi B.V.). Two Ag/AgCl electrodes were filled with skin conductance electrode paste (TD-246, MedCaT supplies) and placed on the palm of the non-dominant hand approximately 2 cm apart, while two additional electrodes on the back of the hand served as reference. Offline, the data were analyzed using the MATLAB-based software Ledalab V3.4.5 (Benedek and Kaernbach, 2010a). Data were down-sampled to 16 Hz and analyzed via Continuous Decomposition Analysis (Benedek and Kaernbach, 2010a). Skin conductance (SC) is a slowly reacting measure based on alterations of the electrical properties of the skin after sweat secretion. SC has long recovery times, leading to overlapping peaks in the SC signal when SCRs are elicited in quick succession. Standard peak amplitude measures are thus problematic, as peaks are difficult to differentiate and subsequent peaks are often underestimated. Benedek and Kaernbach (2010a) developed a method that uses standard deconvolution to separate the underlying driver information, reflecting the sudomotor nerve activity (and thus the actual sympathetic activity), from the physical response behavior (sweat secretion causing slow changes in skin conductivity). Additionally, tonic and phasic SC components are separated, allowing a focus on the phasic, event-related activity only. After subtraction of the tonic driver, the phasic driver is characterized by a baseline of zero. Event-related activation was exported for a response window of 1–6 s after stimulus onset, taking into account the slowness of the signal (Benedek and Kaernbach, 2010b). Only activation stronger than 0.01 μS was regarded as an event-related response (Bach et al., 2009; Benedek and Kaernbach, 2010a). We used the averaged phasic driver within the respective time window as the measure of SCR. The inter-stimulus interval was 2 s for sounds, as the rating typically took about 1 s, and 7 s for prosodic stimuli (cf. Recio et al., 2009).
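The export step can be illustrated as follows. This is a hedged sketch of how the mean phasic driver might be reduced to a single SCR value per trial; the vector name `driver` and the exact handling of sub-threshold activation are assumptions, not the authors' Ledalab pipeline.

```r
# Minimal sketch: SCR measure for one trial from the phasic driver exported by
# Continuous Decomposition Analysis, down-sampled to 16 Hz and time-locked to
# stimulus onset.
fs  <- 16
win <- driver[(1 * fs + 1):(6 * fs)]   # response window 1-6 s after onset

# Mean phasic driver activity within the window; following the 0.01 microS
# criterion, weaker activation is treated here as no event-related response
scr <- mean(win)
if (scr < 0.01) scr <- 0
```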
Reaction Time Task
A subset of participants (20 out of 28, aged 21–30 years, M = 24.45) took part in an additional reaction time task designed to collect behavioral speed and confidence measures of emotion recognition and thus to estimate potential cognitive difficulties in recognizing the emotional content of the stimuli. These measures could not be obtained during the main experiment, because the physiological recordings were taken from the non-dominant hand and the pupillary recordings prohibited blinks during the critical time window. This part of the study was conducted 6 months after the main experiment to ensure that participants did not remember their previous classifications of the stimulus material. Participants sat in front of a computer screen and listened to the acoustic stimuli via headphones. They were first presented with the emotional sounds (first part) in randomized order and were instructed to stop each stimulus as soon as they had recognized the emotion, within a critical time window of 6 s. This time window matched the one used in the main experiment and corresponded to the duration of the sounds. After pressing the button, which provided the time needed for successful emotion recognition, participants had to indicate which emotion they perceived (positive, negative, or neutral) and how confident they were in their recognition (Likert scale, 1–10), both with paper and pencil. The next trial started after a button press. In the second part, they listened to the prosodic stimuli, which had to be classified as expressing joy, anger, or neutral, respectively, following the same procedure as in the first part. The critical time window was again 6 s after stimulus onset.
Statistical Analysis
Statistical analyses were done in R (R Development Core Team, 2012). The similarity manipulation was included in the statistics to account for potential effects of this manipulation; additionally, this served as a manipulation check. To test the effects of emotion category and similarity on recognition accuracy, we built a generalized linear mixed model with binomial error structure (GLMM, lmer function, R package lme4, Bates et al., 2011). Effects on SCRs and pupil size were analyzed using linear mixed models (LMM, lmer function). The models included emotion category, similarity, and their interaction as fixed factors, and participant ID as a random factor to control for individual differences. All models were compared to the respective null model, including the random effects only, by likelihood ratio tests (function anova). Additionally, we tested the interaction between emotion category and similarity by comparing the full model including the interaction with the reduced model excluding it, and used the model without the interaction when appropriate. Models for the emotional sounds included only emotion category as fixed factor and participant ID as random effect and were compared to the respective null models by likelihood ratio tests. Normality and homogeneity of variance were checked for all models by inspecting quantile–quantile plots (Q–Q plots) and residual plots. SCR data deviated from a normal distribution and were log-transformed. Pairwise post hoc tests were conducted using the glht function of the multcomp package (Hothorn et al., 2008) with Bonferroni correction.
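In lme4/multcomp syntax, the model structure described above can be sketched as follows. The data frame `prosody` and the column names `correct`, `log_scr`, `emotion`, `similarity`, and `participant` are illustrative assumptions, not the authors' variable names; note also that current lme4 versions fit binomial GLMMs with glmer, whereas the older version cited in the text accepted a family argument to lmer.

```r
library(lme4)
library(multcomp)

# 'emotion' and 'similarity' are assumed to be factors in the data frame.

# Binomial GLMM for recognition accuracy (sketch)
acc_full <- glmer(correct ~ emotion * similarity + (1 | participant),
                  data = prosody, family = binomial)

# LMM for log-transformed SCR, compared to the null model by a likelihood ratio test
scr_full <- lmer(log_scr ~ emotion * similarity + (1 | participant),
                 data = prosody, REML = FALSE)
scr_null <- lmer(log_scr ~ 1 + (1 | participant),
                 data = prosody, REML = FALSE)
anova(scr_null, scr_full)

# Test of the interaction: full model vs. reduced model without the interaction
scr_red <- lmer(log_scr ~ emotion + similarity + (1 | participant),
                data = prosody, REML = FALSE)
anova(scr_red, scr_full)

# Pairwise post hoc comparisons between emotion categories, Bonferroni-corrected
summary(glht(scr_red, linfct = mcp(emotion = "Tukey")),
        test = adjusted("bonferroni"))
```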
In the reaction time task, we did not compare prosody and sounds statistically, given the known differences in stimulus length, number of stimuli and, from a broader perspective, the overall stimulus structure (Bayer and Schacht, 2014). Reaction time data were not normally distributed and were thus log-transformed prior to the analysis. Recognition accuracy and reaction time data were calculated only for those stimuli that were responded to within the time window of 6 s, whereas certainty ratings were analyzed for all stimuli in order not to overestimate the ratings. We tested the effect of emotion category on recognition accuracy (using a GLMM), reaction time (using an LMM), and certainty ratings (using a cumulative link mixed model for ordinal data, package ordinal, Christensen, 2012) for both prosodic stimuli and emotional sounds. The models included emotion category as fixed factor and participant ID as random effect and were compared to the respective null models by likelihood ratio tests. Pairwise post hoc tests for recognition accuracy and reaction time were conducted using the glht function with Bonferroni correction. As cumulative link models cannot be used in glht post hoc tests, we used the single comparisons of the model summary and applied the Bonferroni correction separately.
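For the certainty ratings, the cumulative link mixed model could look as follows; this is a sketch assuming the clmm function of the ordinal package, with an illustrative data frame (`rt_task`) and column names, and an assumed column label for the p-values in the summary table.

```r
library(ordinal)

# Certainty ratings (1-10) modeled as an ordered factor
rt_task$certainty <- factor(rt_task$certainty, ordered = TRUE)

cert_full <- clmm(certainty ~ emotion + (1 | participant), data = rt_task)
cert_null <- clmm(certainty ~ 1 + (1 | participant), data = rt_task)
anova(cert_null, cert_full)     # likelihood ratio test

# glht does not handle clmm objects, so the single emotion contrasts from the
# model summary are Bonferroni-corrected separately
ct <- summary(cert_full)$coefficients
emotion_rows <- grep("^emotion", rownames(ct))
p.adjust(ct[emotion_rows, "Pr(>|z|)"], method = "bonferroni")
```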
In addition to analyzing the emotion recognition rates in the main experiment and the reaction time task, we also calculated the unbiased hit rates (Hu scores, Wagner, 1993). Recognition rates mirror the listener’s behavior in the actual task, but might be affected by the participant’s bias to preferentially choose one response category. Unbiased hit rates account for the ability of a listener to distinguish the categories by correcting for a potential bias (Wagner, 1993; cf. Rigoulot et al., 2013; Jürgens et al., 2015). We descriptively report the Hu scores in order to provide a complete description of the recognition data, but focused the further analyses on recognition rates only.
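Wagner's (1993) unbiased hit rate for a given category is the squared number of correct classifications divided by the product of the number of stimuli of that category and the number of times that category was chosen as a response. A minimal R sketch, with hypothetical factors `targets` (true categories) and `responses` (one participant's answers), could look like this:

```r
# Unbiased hit rates (Hu, Wagner, 1993) from one participant's confusion matrix.
# 'targets' and 'responses' must be factors with identical levels so that the
# confusion matrix is square and its diagonal holds the correct classifications.
hu_scores <- function(targets, responses) {
  cm <- table(targets, responses)
  diag(cm)^2 / (rowSums(cm) * colSums(cm))
}

# Example call (hypothetical data):
# hu_scores(factor(truth,  levels = c("anger", "joy", "neutral")),
#           factor(answer, levels = c("anger", "joy", "neutral")))
```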
Results
Emotion Recognition Main Experiment
Spoken Utterances With Emotional Prosody
Overall, emotional prosody was recognized relatively well, at around 92% (Figure 2). The comparison to the null model established an overall effect of the predictors on emotion recognition (χ2 = 138.44, df = 5, p < 0.001), while the interaction between similarity and emotion category was not significant (χ2 = 4.53, df = 2, p = 0.104). Similarity influenced the emotion recognition only at trend level (χ2 = 3.21, df = 1, p = 0.073, Figure 3), while emotion category had a strong influence on recognition (χ2 = 130.81, df = 2, p < 0.001). Post hoc tests revealed differences in every pairwise comparison (anger vs. joy: z = 6.117, p < 0.001; anger vs. neutral: z = 10.176, p < 0.001; joy vs. neutral z = 5.088, p < 0.001). As can be seen in Figure 2A, angry prosody was recognized best, followed by joyful and neutral prosody.
FIGURE 2. Emotion recognition for prosody (A) and sounds (B). Given are the mean values ± 95% CI. Asterisks mark the significance level: ∗p < 0.05, ∗∗p < 0.01, ∗∗∗p < 0.001.
FIGURE 3. Skin conductance response for the prosodic stimuli (A) and the sounds (B). Given is the mean ± 95% CI phasic driver activity within the response window of 1–6 s after stimulus onset. Asterisks mark the significance levels of the post hoc tests ∗p < 0.05, ∗∗p < 0.01, ∗∗∗p < 0.001.
The unbiased hit rates demonstrated that listeners had a generally high recognition ability: Huanger: 0.872 ± 0.122; Huneutral: 0.810 ± 0.190; Hujoy: 0.896 ± 0.099 (mean ± SD). Interestingly, anger did not stand out here, indicating that the high recognition rates for anger might reflect a slight bias toward choosing anger as a response, independent of the true emotion category.
Affective Sounds
The emotional content of sounds was recognized less accurately than the emotional prosody of spoken utterances, with an overall recognition accuracy of around 65% (see Figure 2B). Emotion had a significant influence on recognition, as indicated by the comparison of the full model and the null model: χ2 = 167.52, df = 2, p < 0.001. With about 52% correct, neutral sounds were recognized worst (negative vs. neutral: z = 12.972, p < 0.001; negative vs. positive: z = 8.397, p < 0.001; positive vs. neutral: z = 4.575, p < 0.001). The Hu scores revealed a low ability of the participants to distinguish the emotion categories: Hunegative: 0.554 ± 0.147, Huneutral: 0.323 ± 0.146, Hupositive: 0.453 ± 0.166 (mean ± SD).
Skin Conductance
Spoken Utterances With Emotional Prosody
Skin conductance responses (Figure 3), represented by the phasic driver activity, were not affected by any of the predictors (comparison to the null model: χ2 = 1.605, df = 5, p = 0.9). Because this part of the experiment took 40 min, we checked whether participants habituated to the emotional stimuli over the long presentation time by analyzing only the first half of the experiment, which led to similar results (comparison to the null model: χ2 = 4.910, df = 5, p = 0.43).
Affective Sounds
We did find an effect of the emotional sounds on SCRs (Figure 3B; comparison to the null model: χ2 = 15.828, df = 2, p < 0.001). Consistent with the prediction, the more arousing sounds elicited stronger SCRs than neutral sounds (post hoc tests, negative vs. neutral: estimate on log-transformed data = 0.336 ± 0.081, z = 4.129, p < 0.001; positive vs. neutral: estimate on log data = 0.222 ± 0.081, z = 2.723, p = 0.019). Negative and positive sounds elicited SCRs of similar size (negative vs. positive: estimate on log data = 0.114 ± 0.0814, z = 1.406, p = 0.479).
FIGURE 4. Pupil dilation during presentation of prosodic stimuli (A,B) and sounds (C,D). (B,D) are based on mean values ± 95% CI for the analyzed time steps. Stimulus onset was at time point 0. Asterisks mark the significance levels of the post hoc tests: .p < 0.1, ∗p < 0.05, ∗∗p < 0.01, ∗∗∗p < 0.001.
Pupil Dilation
Spoken Utterances With Emotional Prosody
We found an effect of the predictors on pupil size in the time windows 2.5–3.5 and 3.5–4.5 s after stimulus onset (comparisons to null models, see Table 3). There was no interaction between emotion category and similarity on pupil size (Table 3). Pupil size was affected by the emotion category of the speech samples (Figure 4 and Table 3). Interestingly, the increase in pupil size differed dynamically between the prosodic conditions: pupil size increased rapidly in response to angry stimuli, while responses to joyful stimuli were delayed by about one second (see Figure 4 and Table 4). Neutral stimuli triggered the weakest pupil response compared to anger and joy (Figure 4 and Table 4). The similarity condition had no effect on pupil size in the respective time windows (model comparisons: χ2 < 1.16, df = 1, p > 0.28).
Affective Sounds
Pupil size was affected by the emotional content of the sounds in three time windows (Table 3 and Figure 4). Post hoc tests revealed that negative sounds elicited a stronger pupil response than positive sounds (Table 4). Differences between negative and neutral sounds almost reached significance. These results indicate that pupil dilation does not purely reflect arousal differences.
Emotion Recognition During Reaction Time Task
Spoken Utterances With Emotional Prosody
Participants responded within the specified time window in 90% of all cases (anger: 91%, neutral: 90%, joy: 89%). We calculated recognition accuracy and reaction times only for these trials. Emotion category had an influence on emotion recognition accuracy (comparison to the null model: χ2 = 26.39, df = 2, p < 0.001), reaction time (χ2 = 42.29, df = 2, p < 0.001), and the certainty ratings (LR.stat = 21.50, df = 2, p < 0.001). Joy was recognized significantly less accurately (91%) and more slowly (M = 2022 ms) than anger and neutral prosody (see Figure 5 and Table 5). In the certainty ratings, however, judgments for joy did not differ from those for anger expressions. The unbiased hit rates also demonstrated that the listeners had a high recognition ability, indicating that the prosodic utterances could be distinguished easily: Huanger: 0.903 ± 0.112; Huneutral: 0.924 ± 0.120; Hujoy: 0.903 ± 0.106 (mean ± SD). Differences in the recognition rates and Hu scores between this task and the main experiment might be due to the fact that only stimuli responded to within the specified time window entered this analysis.
FIGURE 5. Emotion recognition during the reaction time task. The first column (A,C,E) depicts the results for emotional prosody with the categories “anger,” “neutral,” and “joy”; the second column (B,D,F) represents the sounds with the categories “negative,” “neutral,” and “positive.” (A,B) Correct emotion recognition (mean ± 95% CI), calculated for stimuli that were responded to within the time window of 6 s. (C,D) Reaction times (mean ± 95% CI) for stimuli that were responded to within the critical time window. (E,F) Certainty ratings (mean ± 95% CI), obtained from the 10-point Likert scale, were calculated for every stimulus. Asterisks mark the significance level: .p < 0.1, ∗p < 0.05, ∗∗p < 0.01, ∗∗∗p < 0.001.
TABLE 5. Post hoc comparisons on emotion recognition, reaction time and confidence ratings for emotional prosody and affective sounds.
Affective Sounds
Participants responded within the specified time window in 81% of all cases (negative: 85%, neutral: 74%, positive: 83%). Recognition varied between the emotion categories in terms of accuracy (comparison to the null model: χ2 = 41.41, df = 2, p < 0.001), reaction time (χ2 = 19.97, df = 2, p < 0.001), and certainty ratings (LR.stat = 26.39, df = 2, p < 0.001). These results indicate difficulties in the categorization of neutral sounds, as reflected in lower accuracy (57% correct), prolonged reaction times (M = 2988 ms), and lower certainty ratings (see Figure 5 and Table 5). The Hu scores again revealed a low ability to clearly distinguish between the emotion categories: Hunegative: 0.606 ± 0.181, Huneutral: 0.359 ± 0.211, Hupositive: 0.551 ± 0.203 (mean ± SD).
Discussion
The present study aimed at investigating the elicitation of arousal-related autonomic responses to the emotional prosody of spoken utterances in comparison to affective sounds during explicit emotion decisions. As predicted, affective sounds elicited arousal in the perceiver, indicated by increased SCRs to negative and positive sounds as well as enlarged pupil dilations to negative stimuli. Listening to angry and joyful prosodic utterances led to increased pupil dilations but not to amplified SCRs. Biographical similarity between the fictitious speaker and the listener, employed to increase the social relevance of the spoken stimulus material, failed to boost the arousal responses of the listeners.
First of all, our findings indicate that the cues conveying emotional prosody in spoken, semantically neutral utterances might be too subtle to trigger physiological arousal strong enough to be reflected in changes of electrodermal activity (cf. Levenson, 2014) (see Figure 3). These results are in accordance with previous studies on facial expressions presented without social context (Alpers et al., 2011; Wangelin et al., 2012). The finding of arousal-related SCRs to affective sounds demonstrated that our participants were generally able to show sympathetic responses to auditory stimuli in a lab environment and, importantly, confirmed previous results (Bradley and Lang, 2000).
Emotional prosody differentially affected pupil size, as reflected in larger dilations for utterances spoken with angry or joyful prosody, which is in line with a study reported by Kuchinke et al. (2011) (see Figure 4). Since pupil responses have been demonstrated to reflect the dynamic interplay of emotion and cognition and thus cannot be related to arousal alone (cf. Bayer et al., 2011), our finding of different effects on pupil size and SCRs is not surprising (see also Urry et al., 2009). Instead, it provides additional evidence that SCRs and pupil responses reflect functionally different emotion- and cognition-related ANS activity. Another previous finding supports the idea that pupil dilation does not merely reflect arousal but is also sensitive to cognitive demands during emotional processing. In a study by Partala and Surakka (2003), emotional sounds taken from the same database as in our study triggered stronger pupil dilations for both negative and positive sounds compared to neutral ones. In our study, however, emotionally negative sounds elicited larger dilations than positive sounds, with neutral sounds in between (see Figure 4). Procedural differences might explain the inconsistencies between the present and the previous study, as Partala and Surakka did not employ an explicit emotion task. Similar arguments have been made by Stanners et al. (1979), suggesting that changes in pupil size reflect arousal differences only under conditions of minimal cognitive effort. In our study, for both domains, emotional sounds and prosodic utterances, participants had to explicitly categorize the emotional content or prosody of each stimulus. Since accuracy rates provide rather unspecific estimates of cognitive effort, the additional speeded decision task employed in our study allowed us to analyze the difficulties in recognizing the emotional content of both prosody and affective sounds in more detail (see Figure 5). Enhanced difficulties in recognizing neutral sounds might explain the unexpected pattern of findings, in which neutral sounds elicited larger pupil dilations than positive sounds. The detailed analysis of participants’ recognition ability suggests that the recognition of vocally expressed emotions does not generally require large cognitive resources, as recognition was quick and accurate. In this case, the impact of emotion on pupil size might mainly be caused by arousal (Stanners et al., 1979), even though the arousal level might not have been sufficient to elicit SCRs. While cognitive task effects on SCRs have been demonstrated before (e.g., Recio et al., 2009), we find it unlikely that the SCR modulations in our study reflect cognitive task effects: in our data, the neutral sounds were recognized worst, so if task difficulty had affected the SCRs, responses to neutral sounds should have been increased compared to affective sounds. The temporal recognition pattern, with neutral classified most quickly, followed by anger, and joy classified with the longest delay, fits the reaction time data found in studies using a gating paradigm (Pell and Kotz, 2011; Rigoulot et al., 2013). The different recognition times might also explain the delay in pupil dilation to joyful prosody.
Our results raise the question of why the processing and classification of affective sounds triggered stronger physiological responses than emotional prosody did (cf. Bradley and Lang, 2000), especially since emotional expressions are presumed to possess high biological relevance (Okon-Singer et al., 2013). The variation in affective processing of sounds and prosodic utterances might be explained by overall differences between the two stimulus domains. For visual emotional stimuli, Bayer and Schacht (2014) described two levels of fundamental differences between domains that render a direct comparison almost impossible, namely physical and emotion-specific features. Similar considerations apply to the stimuli used in the present study. First, at the physical level, emotional sounds are more variable in their acoustic content than spoken utterances: the sounds were more diverse, while prosodic emotional expressions vary in only a few acoustic parameters (Hammerschmidt and Jürgens, 2007; see also Jürgens et al., 2011, 2015). Second, there are strong differences regarding their emotion-specific features. While pictures and sounds have a rather direct emotional meaning, an emotional expression primarily depicts the expresser’s emotional appraisal of a given situation rather than the situation itself; emotional expressions thus carry a rather indirect meaning (cf. Walla and Panksepp, 2013). Additionally, our prosodic utterances consisted of semantically neutral sentences. There is evidence that although emotional prosody can be recognized irrespective of the actual semantic information of the utterance (Pell et al., 2011; Jürgens et al., 2013), semantics seem to outweigh emotional prosodic information when both are presented simultaneously (Wambacq and Jerger, 2004; Kotz and Paulmann, 2007). Vocal expressions in daily life are rarely produced without the appropriate linguistic content. Regenbogen et al. (2012), for example, demonstrated that empathic concern is reduced when speech content is neutralized. Prosody is an important channel in emotion communication, but semantics and context might be even more important than the expression alone. Findings might thus have been different if the prosodic information and the wording had been fully consistent. So far, it seems that attending to emotional stimuli such as pictures or sounds evokes emotional responses in the perceiver, whereas attending to emotion expressions in faces or voices elicits recognition efforts rather than autonomic responses (see Britton et al., 2006, for a similar conclusion).
In our study, we aimed at increasing the social relevance of the speech tokens, and thereby the affective reactions of the participants toward these stimuli, by embedding them into context and by providing biographical information about the fictitious speakers. The lack of an effect in our study might indicate that biographical similarity has no influence on emotion processing. It might also be the case that our manipulation was not effective and that similarity unfolds its beneficial effect only in more realistic settings, in which an actual link between the interaction partners can develop (see Burger et al., 2004; Cwir et al., 2011; Walton et al., 2012). Future research is needed to investigate whether social relevance in more realistic situations, such as avatars looking directly at the participants while the speech tokens are presented, or utterances spoken by individually familiar people, would increase physiological responsiveness to emotional prosody.
Taken together, we show that autonomic responses to emotional prosodic utterances are rather weak, while affective sounds robustly elicit arousal in the listener. Furthermore, our study adds to the existing evidence that pupil size and SCRs reflect functionally different emotion-related ANS activity.
Author Contributions
RJ, JF, and AS designed the study and wrote the manuscript. RJ conducted the experiments. RJ and AS conducted the data analysis.
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Acknowledgments
The authors thank Mareike Bayer for helpful comments on psychophysiological data recording and pre-processing, Lena Riese, Sibylla Brouer, and Anna Grimm for assisting in data collection, and the reviewers for helpful comments. This study has been part of a Ph.D. dissertation (Jürgens, 2014).
Footnote
- ^Stimulus selection (taken from Bradley and Lang, 1999): Positive sounds: 110, 311, 352, 353, 360, 363, 365, 367, 378, 415, 704, 717, 813, 815, 817; Negative sounds: 134, 261, 281, 282, 289, 380, 423, 501, 626, 699, 709, 711, 712, 719, 910; Neutral sounds: 130, 170, 246, 262, 322, 358, 376, 698, 700, 701, 720, 722, 723, 724, 728.
References
Alpers, G. W., Adolph, D., and Pauli, P. (2011). Emotional scenes and facial expressions elicit different psychophysiological responses. Int. J. Psychophysiol. 80, 173–181. doi: 10.1016/j.ijpsycho.2011.01.010
Aue, T., Cuny, C., Sander, D., and Grandjean, D. (2011). Peripheral responses to attended and unattended angry prosody: a dichotic listening paradigm. Psychophysiology 48, 385–392. doi: 10.1111/j.1469-8986.2010.01064.x
Bach, D. R., Flandin, G., Friston, K. J., and Dolan, R. J. (2009). Time-series analysis for rapid event-related skin conductance responses. J. Neurosci. Methods 184, 224–234. doi: 10.1016/j.jneumeth.2009.08.005
Bates, D., Maechler, M., and Bolker, B. (2011). lme4: Linear Mixed-Effects Models Using S4 Classes (R Package Version 0.999375-42). Available at: http://CRAN.R-project.org/package=lme4
Bayer, M., Rossi, V., Vanlessen, N., Grass, A., Schacht, A., and Pourtois, G. (2017a). Independent effects of motivation and spatial attention in the human visual cortex. Soc. Cogn. Affect. Neurosci. 12, 146–156. doi: 10.1093/scan/nsw162
Bayer, M., Ruthmann, K., and Schacht, A. (2017b). The impact of personal relevance on emotion processing: evidence from event-related potentials and pupillary responses. Soc. Cogn. Affect. Neurosci. 12, 1470–1479. doi: 10.1093/scan/nsx075
Bayer, M., and Schacht, A. (2014). Event-related brain responses to emotional words, pictures, and faces – A cross-domain comparison. Front. Psychol. 5:1106. doi: 10.3389/fpsyg.2014.01106
Bayer, M., Sommer, W., and Schacht, A. (2010). Reading emotional words within sentences: the impact of arousal and valence on event-related potentials. Int. J. Psychophysiol. 78, 299–307. doi: 10.1016/j.ijpsycho.2010.09.004
Bayer, M., Sommer, W., and Schacht, A. (2011). Emotional words impact the mind but not the body: evidence from pupillary responses. Psychophysiology 48, 1554–1562. doi: 10.1111/j.1469-8986.2011.01219.x
Benedek, M., and Kaernbach, C. (2010a). A continuous measure of phasic electrodermal activity. J. Neurosci. Methods 190, 80–91. doi: 10.1016/j.jneumeth.2010.04.028
Benedek, M., and Kaernbach, C. (2010b). Decomposition of skin conductance data by means of nonnegative deconvolution. Psychophysiology 47, 647–658. doi: 10.1111/j.1469-8986.2009.00972.x
Boersma, P., and Weenink, D. (2009). Praat: Doing Phonetics by Computer (Version 5.1.11) [Computer Program]. Available at: http://www.praat.org/ (accessed August 4, 2009).
Bradley, M. M., Codispoti, M., Cuthbert, B. N., and Lang, P. J. (2001a). Emotion and motivation I: defensive and appetitive reactions in picture processing. Emotion 1, 276–298. doi: 10.1037//1528-3542.1.3.276
Bradley, M. M., Codispoti, M., Sabatinelli, D., and Lang, P. J. (2001b). Emotion and motivation II: sex differences in picture processing. Emotion 1, 300–319.
Bradley, M. M., and Lang, P. J. (1999). International Affective Digitized Sounds (IADS): Stimuli, Instruction Manual and Affective Ratings (Technical Report No. B-2). Gainesville, FL: University of Florida.
Bradley, M. M., and Lang, P. J. (2000). Affective reactions to acoustic stimuli. Psychophysiology 37, 204–215. doi: 10.1111/1469-8986.3720204
Bradley, M. M., and Lang, P. J. (2007). The International Affective Digitized Sounds (2nd Edition; IADS-2): Affective Ratings of Sounds and Instruction Manual. Gainesville, FL: University of Florida.
Bradley, M. M., Miccoli, L., Escrig, M. A., and Lang, P. J. (2008). The pupil as a measure of emotional arousal and autonomic activation. Psychophysiology 45, 602–607. doi: 10.1111/j.1469-8986.2008.00654.x
Britton, J. C., Taylor, S. F., Sudheimer, K. D., and Liberzon, I. (2006). Facial expressions and complex IAPS pictures: common and differential networks. Neuroimage 31, 906–919. doi: 10.1016/j.neuroimage.2005.12.050
Burger, J. M., Messian, N., Patel, S., del Prado, A., and Anderson, C. (2004). What a coincidence! The effects of incidental similarity on compliance. Pers. Soc. Psychol. Bull. 30, 35–43. doi: 10.1177/0146167203258838
Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., and Weiss, B. (2005). A database of German emotional speech. Paper presented at Interspeech, Lisbon.
Christensen, R. H. B. (2012). ordinal: Regression Models for Ordinal Data (R Package Version 2012.09-11). Available at: https://cran.r-project.org/package=ordinal
Cwir, D., Carr, P. B., Walton, G. M., and Spencer, S. J. (2011). Your heart makes my heart move: Cues of social connectedness cause shared emotions and physiological states among strangers. J. Exp. Soc. Psychol. 47, 661–664. doi: 10.1016/j.jesp.2011.01.009
Dawson, M. E., Schell, A. M., and Filion, D. L. (2007). “The electrodermal system,” in Handbook of Psychophysiology, eds J. T. Cacioppo, L. G. Tassinary, and G. G. Berntson (Cambridge: Cambridge University Press), 159–181.
Dimberg, U. (1982). Facial reactions to facial expressions. Psychophysiology 19, 643–647. doi: 10.1111/j.1469-8986.1982.tb02516.x
Fischer, J., Noser, R., and Hammerschmidt, K. (2013). Bioacoustic field research: a primer to acoustic analyses and playback experiments with primates. Am. J. Primatol. 75, 643–663. doi: 10.1002/ajp.22153
Hammerschmidt, K., and Jürgens, U. (2007). Acoustical correlates of affective prosody. J. Voice 21, 531–540. doi: 10.1016/j.jvoice.2006.03.002
Hoeks, B., and Ellenbroel, B. A. (1993). A neural basis for a quantitative pupillary model. J. Psychophysiol. 7, 315–324.
Hothorn, T., Bretz, F., and Westfall, P. (2008). Simultaneous inference in general parametric models. Biom. J. 50, 346–363. doi: 10.1002/bimj.200810425
Jones, J. T., Pelham, B. W., Carvallo, M., and Mirenberg, M. C. (2004). How do I love thee? Let me count the Js: implicit egotism and interpersonal attraction. J. Pers. Soc. Psychol. 87, 665–683. doi: 10.1037/0022-3514.87.5.665
Jürgens, R. (2014). The Impact of Vocal Expressions on the Understanding of Affective States in Others. Ph.D. thesis, University of Göttingen, Göttingen.
Jürgens, R., Drolet, M., Pirow, R., Scheiner, E., and Fischer, J. (2013). Encoding conditions affect recognition of vocally expressed emotions across cultures. Front. Psychol. 4:111. doi: 10.3389/fpsyg.2013.00111
Jürgens, R., Grass, A., Drolet, M., and Fischer, J. (2015). Effect of acting experience on emotion expression and recognition in voice: non-actors provide better stimuli than expected. J. Nonverbal Behav. 39, 195–214. doi: 10.1007/s10919-015-0209-5
Jürgens, R., Hammerschmidt, K., and Fischer, J. (2011). Authentic and play-acted vocal emotion expressions reveal acoustic differences. Front. Psychol. 2:180. doi: 10.3389/fpsyg.2011.00180
Kotz, S. A., and Paulmann, S. (2007). When emotional prosody and semantics dance cheek to cheek: ERP evidence. Brain Res. 1151, 107–118. doi: 10.1016/j.brainres.2007.03.015
Kreibig, S. D. (2010). Autonomic nervous system activity in emotion: a review. Biol. Psychol. 84, 394–421. doi: 10.1016/j.biopsycho.2010.03.010
Kret, M. E., and De Gelder, B. (2012). A review on sex differences in processing emotional signals. Neuropsychologia 50, 1211–1221. doi: 10.1016/j.neuropsychologia.2011.12.022
Kuchinke, L., Schneider, D., Kotz, S. A., and Jacobs, A. M. (2011). Spontaneous but not explicit processing of positive sentences impaired in Asperger’s syndrome: pupillometric evidence. Neuropsychologia 49, 331–338. doi: 10.1016/j.neuropsychologia.2010.12.026
Laeng, B., Sirois, S., and Gredeback, G. (2012). Pupillometry: a window to the preconscious? Perspect. Psychol. Sci. 7, 18–27. doi: 10.1177/1745691611427305
Laeng, B., Sæther, L., Holmlund, T., Wang, C. E. A., Waterloo, K., Eisemann, M., et al. (2013). Invisible emotional expressions influence social judgments and pupillary responses of both depressed and non-depressed individuals. Front. Psychol. 4:291. doi: 10.3389/fpsyg.2013.00291
Levenson, R. W. (2014). The autonomic nervous system and emotion. Emot. Rev. 6, 100–112. doi: 10.1177/1754073913512003
Lithari, C., Frantzidis, C., Papadelis, C., Vivas, A. B., Klados, M., Kourtidou-Papadeli, C., et al. (2010). Are females more responsive to emotional stimuli? A neurophysiological study across arousal and valence dimensions. Brain Topogr. 23, 27–40. doi: 10.1007/s10548-009-0130-5
Merckelbach, H., van Hout, W., van den Hout, M., and Mersch, P. P. (1989). Psychophysiological and subjective reactions of social phobics and normals to facial stimuli. Behav. Res. Ther. 3, 289–294. doi: 10.1016/0005-7967(89)90048-X
Miller, D. T., Downs, J. S., and Prentice, D. A. (1998). Minimal conditions for the creation of a unit relationship: the social bond between birthdaymates. Eur. J. Soc. Psychol. 28, 475–481. doi: 10.1002/(SICI)1099-0992(199805/06)28:3<475::AID-EJSP881>3.0.CO;2-M
Moors, A., Ellsworth, P. C., Scherer, K. R., and Frijda, N. H. (2013). Appraisal theories of emotion: State of the art and future development. Emot. Rev. 5, 119–124. doi: 10.1177/1754073912468165
Nuthmann, A., and Van der Meer, E. (2005). Time’s arrow and pupillary response. Psychophysiology 42, 306–317. doi: 10.1111/j.1469-8986.2005.00291.x
Okon-Singer, H., Lichtenstein-Vidne, L., and Cohen, N. (2013). Dynamic modulation of emotional processing. Biol. Psychol. 92, 480–491. doi: 10.1016/j.biopsycho.2012.05.010
Oldfield, R. C. (1971). The assessment and analysis of handedness: the Edinburgh inventory. Neuropsychologia 9, 97–113. doi: 10.1016/0028-3932(71)90067-4
Partala, T., and Surakka, V. (2003). Pupil size variation as an indication of affective processing. Int. J. Hum. Comput. Stud. 59, 185–198. doi: 10.1016/s1071-5819(03)00017-x
Pell, M. D., Jaywant, A., Monetta, L., and Kotz, S. A. (2011). Emotional speech processing: disentangling the effects of prosody and semantic cues. Cogn. Emot. 25, 834–853. doi: 10.1080/02699931.2010.516915
Pell, M. D., and Kotz, S. A. (2011). On the time course of vocal emotion recognition. PLoS One 6:e27256. doi: 10.1371/journal.pone.0027256
R Development Core Team (2012). R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing.
Ramachandra, V., Depalma, N., and Lisiewski, S. (2009). The role of mirror neurons in processing vocal emotions: evidence from psychophysiological data. Int. J. Neurosci. 119, 681–690. doi: 10.1080/00207450802572188
Recio, G., Schacht, A., and Sommer, W. (2009). Effects of inter-stimulus interval on skin conductance responses and event-related potentials in a Go/NoGo task. Biol. Psychol. 80, 246–250. doi: 10.1016/j.biopsycho.2008.10.007
Regenbogen, C., Schneider, D. A., Finkelmeyer, A., Kohn, N., Derntl, B., Kellermann, T., et al. (2012). The differential contribution of facial expressions, prosody, and speech content to empathy. Cogn. Emot. 26, 1–20. doi: 10.1080/02699931.2011.631296
Riese, K., Bayer, M., Lauer, G., and Schacht, A. (2014). In the eye of the recipient - Pupillary responses to suspense in literary classics. Sci. Study Lit. 4, 211–232. doi: 10.1075/ssol.4.2.05rie
Rigoulot, S., Wassiliwizky, E., and Pell, M. D. (2013). Feeling backwards? How temporal order in speech affects the time course of vocal emotion recognition. Front. Psychol. 4:367. doi: 10.3389/fpsyg.2013.00367
Schrader, L., and Hammerschmidt, K. (1997). Computer-aided analysis of acoustic parameters in animal vocalisations: a multi-parametric approach. Bioacoustics 7, 247–265. doi: 10.1080/09524622.1997.9753338
Stanners, R. F., Coulter, M., Sweet, A. W., and Murphy, P. (1979). The pupillary response as an indicator of arousal and cognition. Motiv. Emot. 3, 319–340. doi: 10.1007/BF00994048
Urry, H. L., van Reekum, C. M., Johnstone, T., and Davidson, R. J. (2009). Individual differences in some (but not all) medial prefrontal regions reflect cognitive demand while regulating unpleasant emotion. Neuroimage 47, 852–863. doi: 10.1016/j.neuroimage.2009.05.069
Van der Meer, E., Beyer, R., Horn, J., Foth, M., Bornemann, B., Ries, J., et al. (2010). Resource allocation and fluid intelligence: Insights from pupillometry. Psychophysiology 47, 158–169. doi: 10.1111/j.1469-8986.2009.00884.x
van Lankveld, J. J., and Smulders, F. T. (2008). The effect of visual sexual content on the event-related potential. Biol. Psychol. 79, 200–208. doi: 10.1016/j.biopsycho.2008.04.016
Vandenbergh, S. G. (1972). Assortative mating, or who marries whom. Behav. Genet. 2, 127–157. doi: 10.1007/BF01065686
Verney, S. P., Granholm, E., and Dionisio, D. P. (2001). Pupillary response and processing resources on the visual backward masking task. Psychophysiology 38, 76–83. doi: 10.1111/1469-8986.3810076
Võ, M. L., Jacobs, A. M., Kuchinke, L., Hofmann, M., Conrad, M., Schacht, A., et al. (2008). The coupling of emotion and cognition in the eye: introducing the pupil old/new effect. Psychophysiology 45, 130–140.
Wagner, H. L. (1993). On measuring performance in category judgment studies of nonverbal behavior. J. Nonverbal Behav. 17, 3–28. doi: 10.1007/BF00987006
Walla, P., and Panksepp, J. (2013). “Neuroimaging helps to clarify brain affective processing without necessarily clarifying emotions,” in Novel Frontiers of Advanced Neuroimaging, ed. K. N. Fountas (Rijeka: InTech). doi: 10.5772/51761
Walton, G. M., Cohen, G. L., Cwir, D., and Spencer, S. J. (2012). Mere belonging: The power of social connections. J. Pers. Soc. Pychol. 102, 512–532. doi: 10.1037/a0025731
Wambacq, I. J., and Jerger, J. F. (2004). Processing of affective prosody and lexical-semantics in spoken utterances as differentiated by event-related potentials. Cogn. Brain Res. 20, 427–437. doi: 10.1016/j.cogbrainres.2004.03.015
Keywords: vocal emotion expressions, autonomic responses, skin conductance, pupillometry, emotion, arousal
Citation: Jürgens R, Fischer J and Schacht A (2018) Hot Speech and Exploding Bombs: Autonomic Arousal During Emotion Classification of Prosodic Utterances and Affective Sounds. Front. Psychol. 9:228. doi: 10.3389/fpsyg.2018.00228
Received: 24 November 2017; Accepted: 12 February 2018;
Published: 28 February 2018.
Edited by:
Petri Laukka, Stockholm University, Sweden
Reviewed by:
Eugen Wassiliwizky, Max Planck Institute for Empirical Aesthetics (MPG), Germany
Steven Grant Greening, Louisiana State University, United States
Copyright © 2018 Jürgens, Fischer and Schacht. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Rebecca Jürgens, rjuergens@dpz.eu