ORIGINAL RESEARCH article

Front. Psychol., 12 June 2020
Sec. Perception Science

Continuous, Lateralized Auditory Stimulation Biases Visual Spatial Processing

  • 1Department of Cognition, Emotion, and Methods in Psychology, Faculty of Psychology, University of Vienna, Vienna, Austria
  • 2Cognitive Science Hub, University of Vienna, Vienna, Austria

Sounds in our environment can easily capture human visual attention. Previous studies have investigated the impact of spatially localized, brief sounds on concurrent visuospatial attention. However, little is known about how the presence of a continuous, lateralized auditory stimulus (e.g., a person talking next to you while driving a car) impacts visual spatial attention (e.g., detection of critical events in traffic). In two experiments, we investigated whether a continuous auditory stream presented from one side biases visual spatial attention toward that side. Participants had to either passively or actively listen to sounds of various semantic complexities (tone pips, spoken digits, and a spoken story) while performing a visual target discrimination task. During both passive and active listening, we observed faster response times to visual targets presented spatially close to the relevant auditory stream. Additionally, we found that higher levels of semantic complexity of the presented sounds led to reduced visual discrimination sensitivity, but only during active listening to the sounds. We provide important novel results by showing that the presence of a continuous, ongoing auditory stimulus can impact visual processing, even when the sounds are not endogenously attended to. Together, our findings demonstrate the implications of ongoing sounds for visual processing in everyday scenarios such as moving about in traffic.

Introduction

In natural environments, critical visual events regularly occur independently of, or at different spatial locations than, ongoing auditory stimuli. In traffic, for example, we have to monitor our visual environment while simultaneously hearing noise from nearby construction work, music playing on the car radio, or a person talking next to us (cf. Cohen and Graham, 2003; Levy et al., 2006). Furthermore, as in these examples, sounds often are not transient events but continuous, ongoing streams. Thus far, little is known about whether such continuous auditory stimuli can attract human visual spatial attention. Moreover, it is unknown whether sounds need to be endogenously attended to, or whether the mere presence of a sound source might be sufficient for eliciting such a cross-modal spatial bias.

A multitude of previous studies have investigated the phenomenon of cross-modal spatial attention, in which orienting attention to a location in one modality (most commonly vision or audition) leads to a simultaneous shift of spatial attention in another modality (most commonly audition or vision, respectively; Spence and Driver, 1997; Driver and Spence, 1998; Eimer and Schröger, 1998; McDonald et al., 2000; Spence and Read, 2003). To a large degree, these studies have employed exogenous or bottom-up-driven shifts of attention. That is, spatial attention in one modality is automatically but only transiently shifted across modalities by a brief, salient cue from another modality. For instance, McDonald et al. (2000) had participants discriminate visual targets presented randomly in either the left or right periphery. Between 100 and 300 ms prior to the visual targets, a salient, spatially non-predictive auditory cue was briefly presented via a speaker from either the left or right side. The authors found that visual targets presented at the same spatial location as the preceding auditory cues were discriminated faster and more accurately than those presented on the opposite side, suggesting an automatic cross-modal shift of attention. Several other studies have demonstrated similar effects, both regarding behavioral reductions in response times (RTs) (McDonald and Ward, 2000; Kean and Crawford, 2008; Lee and Spence, 2015) and increases in performance accuracy (Dufour, 1999), as well as modulation of event-related potentials measured via electroencephalography (EEG; cf. Eimer and Schröger, 1998; Teder-Sälejärvi et al., 1999; Zhang et al., 2011).

In addition to such exogenous, bottom-up-driven effects, previous studies have also demonstrated instances of endogenous, top-down-driven cross-modal shifts of attention (Spence and Driver, 1996; Eimer and Driver, 2000; Green and McDonald, 2006). For example, Spence and Driver (1996) found links between endogenous auditory and visuospatial attention in an experiment, in which a central arrow cue indicated the likely location of a target stimulus in one modality. Occasional unexpected targets in the other modality were discriminated faster when appearing on the cued side rather than on the uncued side. This suggests that when participants deliberately direct their spatial attention to a location in one sensory modality, attention in a second modality shifts toward the same location.

However, although attention in the above example is not bottom-up driven, it still is shifted based on a discrete, brief onset of an external stimulus shortly before each visual target. This is clearly different from the everyday example described in the beginning, encompassing an ongoing auditory stimulus. A few studies (Santangelo et al., 2011, 2007) have demonstrated that endogenously attending to a visual or auditory rapid serial presentation (RSP) task can reduce or even eliminate exogenous spatial attention effects in the same or a different sensory modality. In the experiment conducted by Santangelo et al. (2007), participants were presented with a peripheral spatial cueing task in either the auditory or visual modality. This task was presented either alone or simultaneously with a centrally presented visual or auditory RSP task. The authors observed classical spatial cueing effects when the cueing task was presented in isolation, but no cueing effects when participants had to additionally perform the central RSP task. Interestingly, this was the case for both unimodal conditions (both cueing and RSP stimuli in the same modality) and cross-modal conditions (auditory cueing and visual RSP task, and vice versa). This experiment demonstrates that endogenously directing attention in one modality to a certain task and location affects exogenous attention in a different task at a different location. However, it still remains unclear whether this is due to a cross-modal shift in spatial attention or due to the endogenous attention task using up most of the limited attentional resources (cf. Kahneman, 1973; Lavie, 2005; Levy et al., 2006).

Using an experimental design closer to everyday life, Driver and Spence (1994) demonstrated a cross-modal attentional bias from vision to audition. Here, participants were instructed to shadow one of two streams of speech, presented from equidistant locations in their right and left hemifields. Simultaneously, they had to monitor a quickly changing stream of unrelated visual stimuli for a target presented either from the same side or from the opposite side of the auditory target stream. Participants performed significantly worse in the speech-shadowing task when visual and auditory sources were presented from a different spatial location rather than the same spatial location, suggesting a cross-modal spatial bias from vision to audition. Interestingly, neither this (Driver and Spence, 1994) nor a later follow-up study investigating speech shadowing during a simulated driving task (Spence and Read, 2003) found evidence for cross-modal spatial bias in the opposite direction, from the auditory to visual domain. However, a further finding by Driver and Spence (1994) was that the auditory shadowing task was only affected when participants endogenously attended to the concurrent visual stream, but not when they merely viewed it passively.

Given evidence from dual-task studies, it is likely that not only active vs passive listening but also the level of semantic complexity, task difficulty, or load associated with the auditory task (Pomplun et al., 2001; Alais and Burr, 2004; Iordanescu et al., 2008, 2010; Mastroberardino et al., 2015) has an impact on the amount of cross-modal attentional bias. For example, Iordanescu et al. (2008) observed faster RTs in a visual search task when a sound associated with the target object was played during the search, even though it contained no spatial information (see also Van der Burg et al., 2008).

Taken together, previous work has shown that brief, salient auditory cues can attract exogenous visual attention. Endogenous cross-modal shifts of spatial attention using ongoing stimuli have so far only been demonstrated from the visual to auditory domain. In this case, only endogenously attending to the visual stream, not passively viewing it, led to a spread of attention across modalities. To investigate the possibility of cross-modal shifts of attention from the auditory to visual domain, we presently asked the following three main questions: (1) To what extent can continuous, lateralized auditory stimuli bias visual spatial attention? (2) Is the mere presence of auditory stimuli sufficient to bias visual spatial attention, or do auditory stimuli need to be actively attended to in order to do so? (3) Does the semantic complexity of auditory stimuli impact their bias on visual attention?

In two experiments, participants discriminated visual targets in their left and right hemifields. At the same time, they had to either passively or actively listen to continuous lateralized auditory stimuli of varying semantic complexity. We measured the degree of attention directed to the visual targets through average correct RTs and accuracy of target discrimination at the same position as the auditory input (congruent condition) vs at a different position than the auditory input (incongruent condition). We show that even the mere presence of continuous auditory stimulation biases visual spatial processing and that, during active listening, overall visual task performance is dependent on the semantic complexity of the sounds.

Experiment 1

Methods

Participants

Twenty healthy university students participated in the experiment, in exchange for either course credits or monetary compensation. Our sample size was based on previous reports of cross-modal biases of attention, which commonly incorporated between 15 and 24 participants (e.g., Spence and Driver, 1996; Dufour, 1999; McDonald et al., 2000). Three datasets were excluded, as participants did not perform above chance level in the visual task. The remaining 17 participants (12 female; Mage = 25.6 years, range = 19 to 34) had normal or corrected-to-normal vision and were naive to the purpose of the experiment. All gave written informed consent, and the study was conducted in accordance with the standards of the Declaration of Helsinki. We further followed the Austrian Universities Act, 2002 (UG2002, Article 30 §1), which states that only medical universities or studies conducting applied medical research are required to obtain an additional approval by an ethics committee. Therefore, no additional ethical approval was required for our study.

Apparatus

Visual stimuli were presented on a 19-in. CRT monitor with a resolution of 1,024 by 768 pixels and a refresh rate of 85 Hz. Auditory stimuli were presented via two loudspeakers (Logitech Z150), placed directly left and right next to the monitor at the height of the visual targets and fixation cross. The distance between the centers of the two speaker membranes was 29.78° visual angle. Sound output was routed through an external USB sound card (Behringer U-Control UCA222). Sound levels were individually adjusted for each participant prior to the experiment, to be at a comfortable listening level [60–70 dB of sound pressure level (SPL); e.g., Andreou et al., 2015; Auksztulewicz et al., 2017; Barascud et al., 2016; Sohoglu and Chait, 2016]. Participants sat inside a dimly lit room 64 cm away from the screen, with their heads supported by a chin and forehead rest. The experiment was controlled by MATLAB (2014b v. 8.4.0, The MathWorks, Natick, MA, United States) using the Psychophysics Toolbox (Brainard, 1997) with the Eyelink extension (Cornelissen et al., 2002) on a PC running Windows 7.
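For reference, the reported speaker separation follows from the standard visual-angle relation. The following minimal Python sketch (the experiment itself was implemented in MATLAB) illustrates the conversion; the ~34 cm physical separation used in the example is back-calculated from the reported 29.78° and the 64 cm viewing distance, not a value stated in the text.

```python
import math

def visual_angle_deg(size_cm: float, distance_cm: float) -> float:
    """Full visual angle (in degrees) subtended by an object of width size_cm
    viewed from distance_cm: angle = 2 * atan(size / (2 * distance))."""
    return math.degrees(2 * math.atan((size_cm / 2) / distance_cm))

# Assumed (back-calculated) separation of ~34 cm between the speaker membrane
# centers at the 64 cm viewing distance gives roughly the reported 29.78 deg.
print(round(visual_angle_deg(34.0, 64.0), 2))  # ~29.75
```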

Stimuli

Visual stimuli were presented against a black background (luminance: 8.2 cd/m2). Visual targets consisted of white triangles (17.5 cd/m2) presented at an eccentricity of 6.1° either left or right of a central fixation cross (0.4 × 0.4°). The triangles had an initial width of 0.7° and height of 0.4°. To titrate the average performance accuracy to around 75%, size and brightness of the triangles were increased or decreased by a factor of 0.05 after every fourth trial.
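To illustrate the titration procedure, the sketch below adjusts target size and luminance after every fourth trial. The decision rule (step the target down when recent accuracy exceeds the 75% goal, up otherwise) and the reading of the 0.05 factor as a 5% multiplicative change are assumptions; the text does not spell them out.

```python
def titrate(size_deg: float, luminance_cd: float, recent_accuracy: float,
            target_acc: float = 0.75, step: float = 0.05) -> tuple:
    """Adaptive adjustment applied after every fourth trial (assumed rule):
    shrink/dim the triangle when accuracy exceeds the 75% target,
    enlarge/brighten it otherwise."""
    factor = 1.0 - step if recent_accuracy > target_acc else 1.0 + step
    return size_deg * factor, luminance_cd * factor

# Example: all of the last four trials were correct, so the target becomes harder.
size, lum = titrate(0.7, 17.5, recent_accuracy=1.0)
```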

In the low-semantic complexity (LSC) condition, auditory stimuli consisted of a unilaterally presented stream of tone pips at a rate of 5 Hz. The individual tone pips had varying pitches within a frequency range of 310–1,000 Hz. The initial tone pip always had a pitch of 440 Hz, and the pitch for subsequent tone pips was randomly increased or decreased by 5%. In one third of the blocks, auditory stimuli were presented from the left speaker and in one third of blocks from the right speaker. As a baseline condition, no auditory stimulation was presented in the remaining third of the blocks. In the high-semantic complexity (HSC) condition, auditory stimuli consisted of a short story from Greek mythology (in German: Inachos und Eris; Köhlmeier, 2011). The story was taken from a publicly available online source. Again, the auditory stimuli were presented from the left and right speakers in one third of the blocks each. As a further baseline condition, auditory stimulation was provided bilaterally in the remaining third of the blocks.
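The tone-pip pitch sequence described above can be sketched as a simple 5% random walk. How the 310–1,000 Hz bounds were enforced (clipping, reflecting, or resampling) is not stated, so the clipping used here is an assumption.

```python
import random

def tone_pip_pitches(n_pips: int, start_hz: float = 440.0, lo_hz: float = 310.0,
                     hi_hz: float = 1000.0, step: float = 0.05) -> list:
    """Start at 440 Hz; raise or lower each subsequent pip's pitch by 5%,
    keeping all pitches within 310-1,000 Hz (clipping at the bounds is assumed)."""
    pitches = [start_hz]
    for _ in range(n_pips - 1):
        factor = 1.0 + step if random.random() < 0.5 else 1.0 - step
        pitches.append(min(max(pitches[-1] * factor, lo_hz), hi_hz))
    return pitches

# At the 5 Hz presentation rate, a 10 s stream contains 50 pips.
stream = tone_pip_pitches(50)
```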

Task and Procedures

The task was to discriminate the orientation of the target triangles (up or down) presented in either the left or right hemifield of the screen (Figure 1). A trial consisted of a triangle presented for 250 ms, between 2 and 4 s after trial onset. For a triangle pointing upward, participants had to press Key eight with the right index finger on the number pad of the keyboard as fast as possible. Conversely, if the triangle was pointing downward, participants were instructed to press Key two. If no response was given within 1 s after onset of the visual target, the trial was counted as a Miss. During the trials, the index finger remained on Key five to ensure that the distance to both response keys was equal.

Figure 1. Experimental setup. (A) Experiment 1 [passive listening, for both low-semantic complexity (LSC) and high-semantic complexity (HSC) conditions]. Participants’ task was to discriminate the orientation of triangles (up or down) presented in either the left or right hemifield of the screen. On two thirds of trials, task-irrelevant auditory stimuli were presented from either a left or right speaker throughout each trial. In the LSC condition, the auditory stimuli consisted of a regular stream of tone pips with varying pitches. In the HSC condition, the auditory stimulus consisted of a spoken story. (B) Experiment 2 [active listening, for LSC, medium-semantic complexity (MSC), and HSC conditions]. The visual stimuli and task were identical to those of Experiment 1. Additionally, participants had to perform a parallel task in the auditory modality, which differed between the LSC, MSC, and HSC conditions (see section “Methods” for details).

In the LSC condition, participants completed six blocks with 42 trials each. Two thirds of trials contained unilateral auditory stimulation consisting of a continuous stream of tone pips, presented from the beginning of each trial. Blocks alternated between left, right, and no auditory stimulation. In the HSC condition, participants completed six blocks of 34 trials each. Blocks alternated between left, right, and bilateral auditory stimulation. Here, auditory stimulation consisted of an ongoing narrated story. Thus, in blocks with unilateral auditory stimulation, visual targets could be either spatially congruent (presented on the same side) or spatially incongruent (presented on the opposite side) with the auditory input. Importantly, the onset of visual targets was completely independent of (i.e., not time-locked to) the events in the auditory modality. In both conditions, participants were instructed to fixate the central cross throughout, focus on the visual task, and disregard the auditory input. Sound intensity was individually adjusted for each participant. Prior to the start of both complexity conditions, participants performed a short practice run to familiarize themselves with the task.

Eye Tracking

To ensure correct fixation throughout the experiment, gaze position was continuously monitored. Data were recorded monocularly using an EyeLink 1000 Desktop Mount (SR Research, Mississauga, Ontario, Canada) video-based eye tracker sampling at 1,000 Hz. Prior to the beginning of both conditions, the signal was calibrated on participants’ right eye using a nine-point calibration and validation sequence. Additionally, the eye tracker was recalibrated whenever participants left the chinrest during breaks. An analysis and discussion of fixation behavior can be found in the Supplementary Information Figure S1.

Analysis of Behavioral Data

Prior to the statistical analysis, outlier trials with RTs deviating more than two SDs from the mean were excluded per participant and condition. Mean sensitivity (d′) and RT measures (for correct trials only) were computed separately for each condition. For the calculation of d′, we utilized the fact that our visual task required a discrimination between triangles pointing upward vs downward. We computed d′ separately for targets pointing upward and targets pointing downward and then averaged across them, according to the following equation:

$$d' = \frac{z(p\mathrm{Hit}_{\mathrm{up}}) - z(p\mathrm{FA}_{\mathrm{up}}) + z(p\mathrm{Hit}_{\mathrm{down}}) - z(p\mathrm{FA}_{\mathrm{down}})}{2}$$

Here, z denotes the inverse of the standard normal cumulative distribution function (i.e., the z transform of a proportion). Trials in which participants gave an “upward” response by pressing Key eight were evaluated as hits in the case of triangles pointing upward (pHit_up) and as false alarms in the case of triangles pointing downward (pFA_up). Conversely, trials in which participants gave a “downward” response by pressing Key two were evaluated as hits in the case of triangles pointing downward (pHit_down) and as false alarms in the case of triangles pointing upward (pFA_down). For statistical analysis, sensitivity and RT data were each subjected to a repeated-measures analysis of variance (ANOVA), including the within-subjects variables Congruency (congruent vs incongruent) and Semantic Complexity (low vs high). To investigate potential differences between passive unilateral auditory stimulation (left or right) and no or bilateral auditory stimulation, we performed two additional repeated-measures ANOVAs. For the LSC condition only, we compared congruent vs incongruent vs no-sound trials. For the HSC condition only, we compared congruent vs incongruent vs bilateral-sound trials.
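For concreteness, the sensitivity measure defined above can be computed as in the following Python sketch (the original analysis was run in MATLAB). The correction for proportions of exactly 0 or 1 is a common convention and an assumption here; the text does not state how such cases were handled.

```python
from scipy.stats import norm

def dprime(p_hit_up, p_fa_up, p_hit_down, p_fa_down, n_trials=None):
    """d' averaged over upward and downward targets, following the equation above.
    z() is the inverse of the standard normal CDF; if n_trials is given, proportions
    of exactly 0 or 1 are nudged by 1/(2*n_trials) to keep z() finite (assumed)."""
    def clip(p):
        if n_trials is not None:
            p = min(max(p, 1.0 / (2 * n_trials)), 1.0 - 1.0 / (2 * n_trials))
        return p
    z = norm.ppf
    return (z(clip(p_hit_up)) - z(clip(p_fa_up))
            + z(clip(p_hit_down)) - z(clip(p_fa_down))) / 2

# Example: 85% hits and 20% false alarms for both target orientations.
print(round(dprime(0.85, 0.20, 0.85, 0.20), 2))  # ~1.88
```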

Results

Figure 2 illustrates the behavioral results. For both correct RTs and sensitivity, we performed a two-way ANOVA, using the variables Congruency (congruent vs incongruent) and Semantic Complexity (low vs high). For RTs, in line with our hypotheses, we found a significant main effect of Congruency, F(1,16) = 5.99, p = 0.026, ηp2 = 0.27, owing to faster RTs in congruent compared with incongruent trials. Neither Semantic Complexity nor the interaction term reached significance (both ps > 0.740). Further, we observed no differences in sensitivity between conditions (all ps > 0.120).

Figure 2. Behavioral results from Experiments 1 and 2, for the low-semantic complexity (Tone pips), medium-semantic complexity (Digits, only in Experiment 2), and high-semantic complexity (Story) conditions separately for congruent (blue) and incongruent (red) trials. (A) Visual response times (RTs). (B) Visual task sensitivity. Overall, congruency between visual and auditory spatial attention produced faster response times to visual targets, during both passive and active spatial listening conditions. Semantic complexity of the auditory stimuli modulated visual discrimination sensitivity, with higher semantic complexity leading to reduced sensitivity. Error bars indicate standard error of the mean.

We performed an additional ANOVA comparing the congruent, incongruent, and no-sound trials in the LSC condition (note that the no-sound condition was not included in the first ANOVA, as it was not present in the HSC condition). Here, we observed significant differences in RTs, F(1,15) = 5.0, p = 0.022, ηp2 = 0.40, but not in sensitivity (p > 0.75). Follow-up t-tests (not corrected for multiple comparisons) yielded significantly faster RTs in the congruent compared with the no-sound condition, t(16) = −3.06, p = 0.007, d = 0.24, and in the incongruent compared with no-sound condition, t(16) = −2.51, p = 0.023, d = 0.13, but not in the congruent compared with incongruent condition (p = 0.115). Further, when comparing the congruent, incongruent, and bilateral-sound trials within the HSC condition (note that the bilateral-sound condition was not included in the above ANOVA, as it was not present in the LSC condition), we again observed significant differences in RTs, F(1, 15) = 4.43, p = 0.019, ηp2 = 0.19, but not in sensitivity (p > 0.260). Follow-up t-tests (not corrected for multiple comparisons) yielded significantly faster RTs in the congruent compared with incongruent condition, t(16) = −2.67, p = 0.017, d = 0.13, but not in the congruent compared with bilateral-sound condition, or the incongruent compared with bilateral-sound condition (both p > 0.240).

Discussion of Experiment 1

As hypothesized, we observed overall faster RTs to visual targets in congruent compared with incongruent trials. This is in line with previous studies showing both exogenous and endogenous cross-modal attentional biases from the auditory to the visual domain (McDonald and Ward, 2000; Kean and Crawford, 2008; Lee and Spence, 2015). Importantly, however, participants in our experiment were instructed to ignore the auditory inputs and solely focus on the visual task. Thus, we demonstrate that the mere presence of an ongoing, task-irrelevant sound can affect concurrent visual spatial processing.

Further, we found neither a main effect nor an interaction including the variable Semantic Complexity. Presumably, when auditory input is not attended to, the nature or content of this auditory input is of little relevance for the overall visual task performance or the amount of cross-modal attentional spread. Importantly, however, whether or not an auditory stimulus was presented did affect task performance: Our participants were overall faster in trials featuring auditory stimulation (both congruent and incongruent) compared with the no-sound baseline trials (only present in the LSC condition). This suggests that ongoing sounds might increase the level of vigilance or overall, non-spatial attention, at least when listened to passively. Previously, such an effect has been shown for brief, task-irrelevant sounds, which can transiently facilitate visual target detection (Kusnir et al., 2011). At the same time, further increasing the auditory input from one unilateral to two bilateral streams (only present in the HSC condition) did not result in any further significant reductions in RTs. We, thus, assume that the effect on vigilance due to the input of an additional sensory modality was already close to ceiling.

Overall, in Experiment 1, we demonstrate that the mere presence of a localized sound can spatially selectively bias performance in a visual attention task. In Experiment 2, we slightly modified our experimental paradigm to investigate how actively attending to a continuous lateralized auditory input would affect visual spatial attention. To induce the orienting of endogenous selective attention to one spatial location, we now presented two simultaneous continuous auditory streams, one from the left loudspeaker and one from the right loudspeaker. Here, participants had to selectively shift their auditory attention to one prespecified auditory stream while ignoring the other. Further, to investigate the potential impact of task difficulty more thoroughly, we added a third auditory stimulus condition of medium semantic complexity (MSC). Finally, owing to time limitations, we did not include a unimodal visual condition as in the LSC condition of Experiment 1.

Experiment 2

Methods

Participants

Twenty-two healthy university students participated in the experiment, in exchange for either course credits or monetary compensation. Three datasets were excluded, as participants did not perform above chance level in one of the auditory tasks. The remaining 19 participants (14 female, Mage = 23.0 years, range = 19 to 28 years) had normal or corrected-to-normal vision and were naive to the purpose of the experiment. All gave written informed consent, and the study was conducted in accordance with the standards of the Declaration of Helsinki.

Apparatus

The apparatus was identical to that of Experiment 1.

Stimuli

Visual stimuli were identical to those in Experiment 1. In the LSC condition, auditory stimuli consisted of two bilaterally presented streams of tone pips, one at a rate of 5 Hz and the other at 6 Hz. In half of the blocks, the 5-Hz stream was presented from the right side and the 6-Hz stream from the left side; in the other half, the assignment was reversed. The individual tone pips had varying pitches within a frequency range of 310–1,000 Hz. The initial tone pip always had a pitch of 440 Hz, and the pitch for subsequent tone pips was randomly increased or decreased by 5%. In the MSC condition, auditory stimuli consisted of bilaterally presented, individual streams of digits (0 to 9, spoken in German). Digits were presented simultaneously from both sides at a rate of 1.5 Hz (mean digit duration = 500 ms; inter-stimulus interval, ISI = 166.6 ms). In the HSC condition, auditory stimulation consisted of two bilaterally presented short stories from Greek mythology. Both stories were told by the same narrator and taken from a publicly available online source. The target story was Inachos und Eris (Köhlmeier, 2011), and the distractor story was Aigisthos (Köhlmeier, 2013), both in German. The side from which the target story was presented alternated blockwise between the left and right speakers.

Task and Procedures

The visual task was identical to that of Experiment 1 (Figure 1B). Participants completed eight blocks with 12 visual targets each in the LSC and MSC conditions and six blocks with 34 trials each in the HSC condition. In addition to the visual task, participants had to perform a simultaneous auditory task. Prior to each block in all three conditions, they were instructed to direct their auditory attention to either the left or right speaker and to only listen to the respective auditory stream. The side to which they had to direct their auditory attention alternated between blocks.

In the LSC and MSC conditions, between zero and three auditory targets were presented in both the attended and non-attended streams. Auditory targets in the LSC condition consisted of the omission of one tone pip, that is, a gap in the stream. Auditory targets in the MSC condition consisted of a repetition of one digit, essentially constituting a one-back task. Two auditory targets were separated by at least 10 tone pips or four digits, regardless of the side from which they were presented, and visual and auditory targets were separated by a minimum interval of 1,500 ms. The laterality of auditory targets had no predictive power for subsequent visual targets. The participants’ auditory task was to count the targets only from the attended-to side. After each block, participants were asked to enter the number of detected targets via the keyboard. In both the LSC and MSC conditions, participants were explicitly allowed to keep track of the counted targets using their left hand, as we intended this to be an attention task, not a working memory task.
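One straightforward way to realize the spacing constraint on auditory targets is rejection sampling, as in the Python sketch below; this implementation is an illustrative assumption, not the authors' code.

```python
import random

def place_auditory_targets(n_targets: int, stream_length: int, min_gap: int) -> list:
    """Draw target positions within a stream of stream_length events such that any
    two targets are at least min_gap events apart (10 tone pips in the LSC condition,
    four digits in the MSC condition). Rejection sampling is an assumed approach."""
    while True:
        positions = sorted(random.sample(range(stream_length), n_targets))
        if all(b - a >= min_gap for a, b in zip(positions, positions[1:])):
            return positions

# Example: place three gap targets in a 300-pip stream, at least 10 pips apart.
print(place_auditory_targets(3, 300, 10))
```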

In the HSC condition, participants were instructed to attentively listen to one of the two simultaneously presented stories. The side from which the target story was presented, and to which participants had to listen, alternated between blocks. Between the blocks, participants had to complete a comprehension questionnaire of three to five questions about the content of the story in the previous block. This was to ensure that participants kept their auditory attention directed to the instructed loudspeaker throughout each block. Thus, similar to Experiment 1, visual targets were either spatially congruent or incongruent with the current locus of auditory attention. Sound intensity was individually adjusted for each participant as in Experiment 1. Prior to the start of each condition, participants performed a short practice run to familiarize themselves with the task.

Eye Tracking

Eye tracking was performed in the same way as in Experiment 1. An analysis and discussion of fixation behavior can be found in the Supplementary Information Figure S2.

Analysis of Behavioral Data

Prior to the statistical analysis, outlier trials with RTs deviating more than 2 SDs from the mean were excluded per participant and condition. Mean sensitivity (d′) and RT measures (for correct trials only) were computed separately for each condition. For statistical analysis, sensitivity and RT data were each subjected to a repeated-measures ANOVA, including the within-subjects variables Congruency (congruent vs incongruent) and Semantic Complexity (low vs medium vs high).
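The trial-exclusion step can be written compactly with pandas, as sketched below; the column names ('participant', 'condition', 'rt') are assumptions, and the original analysis was carried out in MATLAB.

```python
import pandas as pd

def drop_rt_outliers(trials: pd.DataFrame, sd_cutoff: float = 2.0) -> pd.DataFrame:
    """Remove trials whose RT deviates by more than sd_cutoff SDs from the mean,
    computed separately per participant and condition (column names are assumed)."""
    def keep(group: pd.DataFrame) -> pd.DataFrame:
        m, s = group["rt"].mean(), group["rt"].std()
        return group[(group["rt"] - m).abs() <= sd_cutoff * s]
    return trials.groupby(["participant", "condition"], group_keys=False).apply(keep)
```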

To investigate potential differences in RTs and sensitivity between passive and active listening conditions, we performed an additional mixed-model ANOVA, using Listening Condition (Experiment 1—passive vs Experiment 2—active) as a between-subjects variable and Congruency (congruent vs incongruent) and Semantic Complexity (low vs high) as within-subjects variables. Where appropriate, the reported results are corrected for violations of sphericity.
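The between-experiment comparison can be approximated with, for example, the pingouin package. Note that pingouin's mixed_anova handles one within-subject and one between-subject factor, so the sketch below covers only the Congruency × Listening Condition part of the reported model; the data frame and its column names are synthetic assumptions, included only to make the call runnable.

```python
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)

# Synthetic per-participant mean RTs (in seconds), mimicking the design:
# 17 passive-listening (Exp. 1) and 19 active-listening (Exp. 2) participants,
# each with one mean RT per congruency level.
rows = []
for exp, n in [("passive", 17), ("active", 19)]:
    for p in range(n):
        base = rng.normal(0.55, 0.05)
        for cong, offset in [("congruent", -0.01), ("incongruent", 0.01)]:
            rows.append({"participant": f"{exp}_{p}", "experiment": exp,
                         "congruency": cong, "rt": base + offset})
df = pd.DataFrame(rows)

# Mixed ANOVA: Congruency (within) x Listening Condition (between).
aov = pg.mixed_anova(data=df, dv="rt", within="congruency",
                     between="experiment", subject="participant")
print(aov[["Source", "F", "p-unc", "np2"]])
```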

Analysis of Auditory Task Performance

We additionally analyzed auditory task performance to ensure that participants complied with the instructions and shifted their audio-spatial attention accordingly. For each block of the LSC and MSC conditions, we compared the reported number with the actual number of auditory targets. For the HSC condition, we simply counted the number of incorrect responses. Because the auditory tasks were challenging and participants on average gave incorrect responses to 28.7% (SEM: 4.1%) of auditory targets, excluding every block with an incorrect response, or every participant who made a mistake in more than half of the blocks, would have resulted in an excessive loss of data. Thus, for each condition separately, we excluded subjects who made more than one error in more than half of the experimental blocks. Using this criterion, we excluded two participants on the basis of their performance in the LSC condition and one participant on the basis of their performance in the HSC condition.

Results

Figure 2 illustrates the behavioral results. For both RTs and sensitivity, we performed a two-way ANOVA, using the variables Congruency (congruent vs incongruent) and Semantic Complexity (low vs medium vs high). For RTs, in line with our hypotheses, we found a significant main effect of Congruency, F(1,18) = 18.17, p < 0.001, ηp2 = 0.50, owing to faster RTs in congruent compared with incongruent trials. Neither Semantic Complexity nor the interaction term reached significance (both ps > 0.160). For sensitivity, we observed a significant effect of Semantic Complexity, F(1,36) = 14.89, p < 0.001, ηp2 = 0.45. Neither Congruency nor the interaction term reached significance (both ps > 0.640). Follow-up t-tests revealed significant differences in sensitivity between each of the conditions, owing to higher sensitivity in the condition with lower semantic complexity [tone pip compared with digits condition: t(18) = 3.51, p < 0.003, d = 0.54; tone pip compared with story condition: t(18) = 4.29, p < 0.001, d = 0.68; digits compared with story condition: t(18) = 2.88, p = 0.009, d = 0.41].

Using a mixed-model ANOVA, we additionally compared the performance between Experiment 1 (passive listening) and Experiment 2 (active listening). Again, in line with our hypotheses, we found an overall main effect of Congruency for RTs, F(1,34) = 13.30, p < 0.001, ηp2 = 0.28, owing to faster RTs in the congruent compared with incongruent condition. We observed no additional effects on RTs (all other ps > 0.261). For sensitivity, we found an effect of Semantic Complexity, F(1,34) = 16.09, p < 0.001, ηp2 = 0.32, owing to higher sensitivity in the LSC compared with HSC condition. Furthermore, we observed a trend toward significance in the variable Listening Condition, F(1,34) = 3.58, p = 0.067, ηp2 = 0.10, owing to higher sensitivity in the passive compared with active condition. No other effects were found (all other ps > 0.12).

Discussion of Experiment 2

In line with our hypothesis and Experiment 1 results, we observed faster RTs to visual targets in congruent compared with incongruent trials. As participants now had to endogenously and selectively direct their auditory attention to one side while ignoring the other, this setup is more similar to previous reports of endogenous cross-modal attention featuring continuous auditory stimulation (Driver and Spence, 1994; Spence and Read, 2003). The important difference is, however, that we here demonstrate an attentional bias from the auditory to the visual modality, which has not been shown before.

We expected that the strength of the cross-modal bias would increase compared with the passive listening in Experiment 1, as participants now had to actively listen to one auditory input. However, we observed no interaction between Experiments and Congruency effects. A possible conclusion is that, at least given our present setup, RTs in the visual task are affected independently of whether or not concurrent sounds are endogenously attended to. Alternatively, our mixed-model analysis might have been statistically underpowered to reveal a potential effect of active vs passive listening on the cross-modal attentional bias. This is supported by the fact that, here, unlike in the passive listening condition, semantic complexity of the sounds influenced visual target discrimination sensitivity. To help interpret this finding, we additionally calculated Bayes factors (BFs; according to Rouder et al., 2012) for the interaction between Experiments and Congruency effects. We found a small BF for both RTs (BF = 0.24) and sensitivity (BF = 0.28), thus supporting the general lack of an effect.

We also found a trend toward significance in the comparison of sensitivity between the passive and active conditions. These observations speak against the conclusion that visual target discrimination performance was unaffected by whether the auditory input required active or passive listening. Overall, visual sensitivity parametrically decreased with increasing auditory semantic complexity, suggesting that the latter was associated with task difficulty or level of general, non-spatial distraction. Using non-spatial tasks, earlier studies have shown analogous effects of auditory task difficulty on the performance of parallel visual tasks (Pomplun et al., 2001). Thus, we here extended our findings from Experiment 1, in demonstrating that not only passive but also active spatial listening leads to a cross-modal shift in visual attention. At the same time, the overall performance in the parallel visual task is modulated by the semantic complexity of the attended auditory input.

General Discussion

In the present study, we investigated exogenous and endogenous auditory-to-visual cross-modal shifts of attention using continuous streams of sound. With regard to our three main questions outlined in the beginning, we found that (1) continuous auditory inputs can facilitate the processing of spatially close visual stimuli; (2) not only actively listening to but even the mere presence of such auditory stimuli is sufficient to bias visual spatial processing; and (3) the semantic complexity of such auditory stimuli does impact the overall visual task performance but not in a spatially specific manner.

To the best of our knowledge, our study is the first to present evidence that attending to a continuous auditory stimulus biases visual spatial processing such that processing of visual stimuli close to the locus of auditory attention is facilitated. Previous work investigating this matter has demonstrated the reverse case of cross-modal attentional shifts from vision to audition (Driver and Spence, 1994; Spence and Read, 2003). For instance, the participants in the study by Spence and Read (2003) had to perform a simulated visual driving task. At the same time, they were instructed to shadow speech sounds presented either from the front (spatially congruent with the visual task) or from their side (incongruent with the visual task). The authors found facilitated shadowing performance when sounds were presented spatially congruent with the visual task but found no effect of sound location on visual task performance. As a potential reason for this, they suggested that participants likely prioritized the driving over the shadowing task, as the consequences of making a mistake in the former in real life are far more severe. The fact that we found an auditory-to-visual bias in the present study could be a consequence of the simpler visual discrimination task, which might be more susceptible to a cross-modal bias of attention. In this regard, our results are also in line with those of Santangelo et al. (2007), who demonstrated impaired peripheral visual cueing effects when participants had to perform a simultaneous centrally presented auditory RSP task. Our study adds to these data by showing that the cross-modal effect is spatially selective: responses to visual targets are facilitated if they appear spatially close to a continuous auditory input.

Crucially, we observed a cross-modal spatial facilitation not only in the active listening condition (Experiment 2) but also when participants were merely passively presented with a continuous unilateral auditory input, which they were instructed to ignore (Experiment 1). Thus, continuous auditory stimuli seem to inevitably attract attention and thereby influence the processing of spatially close visual stimuli. Similar results were previously found in experiments using salient, sudden-onset sounds, which also cause an automatic cross-modal shift in visual attention, even when sounds are task-irrelevant and participants try to ignore them (Dufour, 1999; McDonald and Ward, 2000; Spence and Read, 2003; Kean and Crawford, 2008; Lee and Spence, 2015). Yet the present result is somewhat different from what Driver and Spence (1994) reported in an endogenous, visual-to-auditory cross-modal attention paradigm. In their study, an auditory shadowing task was only affected when participants actively attended to a concurrent visual stream, but not when they viewed it passively. Although the results of their study and ours are difficult to compare owing to the largely different stimuli and tasks, this discrepancy might hint at an interesting asymmetry between the auditory and visual modality in their general ability to attract attention. Along these lines, audition has been suggested to serve as an “early-warning” system, partly owing to its ability to receive input from all spatial directions at all times (Dalton and Lavie, 2004; Murphy et al., 2013; Zhao et al., 2020) as well as its faster processing speed (Rolfs et al., 2005).

Although we expected that the strength of cross-modal attention effects would depend on whether sounds are task-irrelevant or actively attended to, we only found a statistical trend toward that result. Using non-spatial dual-task paradigms, earlier work has demonstrated a reduction in visual task performance during active compared with passive parallel auditory listening conditions (Kunar et al., 2008; Gherri and Eimer, 2011). It is possible that the additional demands of bilateral stimulation in our paradigm reduced the observable cross-modal biasing effects, thus requiring a larger sample size to reveal performance differences between the active and passive stimulation conditions. However, our BF analysis speaks against this possibility.

We did, however, find a modulation of visual task performance by semantic complexity in the active listening condition. Visual target discrimination sensitivity was largest when the auditory stream consisted of tone pips (LSC), intermediate when it consisted of spoken digits (MSC), and smallest when it consisted of the narrated story (HSC). This finding fits well with typical previous results from dual-task studies, which demonstrated a reduction of performance in one task with increasing load or difficulty in the second task (Gherri and Eimer, 2011; Pizzighello and Bressan, 2008; Pomplun et al., 2001), as well as with studies showing the benefit of presenting semantically meaningful and target-related auditory cues along with a visual search task (Iordanescu et al., 2010, 2008; Mastroberardino et al., 2015). Interestingly, semantic complexity had only an overall effect on visual performance, and not on the amount of cross-modal attentional bias in our study. It is possible that semantic complexity has a stronger impact on the perceptual processing stage of visual input, causing the effects in discrimination sensitivity. At the same time, the general presence of a lateralized sound might impact the peri-perceptual stages (arousal, decision, and response speed) and, thus, affect visual RTs in a more cross-modal, spatially specific manner, similar to a phasic alerting signal (Coull et al., 2001; Sturm and Willmes, 2001; Fan et al., 2005; Haupt et al., 2018). The latter conclusion is further supported by our finding that RTs were slower in the no-sound condition compared with either the congruent or incongruent condition, potentially driven by a sound-induced increase in arousal leading to faster visual processing (Qi et al., 2018).

A crucial aspect of our current design was the use of continuous auditory streams, which we believe to be fundamentally different from the brief, highly salient cues used in previous demonstrations of cross-modal shifts of attention (Spence and Driver, 1997; McDonald et al., 2000; Green and McDonald, 2006; Kean and Crawford, 2008). Still, one might argue that the auditory stimuli used here are essentially just a series of individual, highly salient, abrupt-onset events drawing exogenous attention. However, we suggest this to be unlikely for two reasons. First, owing to the regularity and temporal predictability of the auditory input, at least for the conditions containing tone pips and digits, the individual sounds are arguably less salient and easier to ignore than are unpredictable, random-onset, brief auditory cues (Noyce and Sekuler, 2014; Havlíček et al., 2019). Second, we observed a similar, spatially specific visual processing bias in both the passive (Experiment 1) and active listening (Experiment 2) conditions. During the active condition, two separate auditory streams were presented continuously from the left- and right-sided speakers. Thus, auditory stimuli were presented from both sides also during visual target presentation. It follows that exogenous orienting should have been drawn toward both sides similarly. However, we found a difference between the attended and unattended sides, making it highly unlikely that the facilitated processing of spatially congruent visual targets found in the active condition can be explained by automatic, exogenous orienting of attention to any appearing sound.

Interestingly, it is possible that in addition to a cross-modal spread of attention, mechanisms related to multisensory integration were also engaged. That is, faster responses on spatially congruent trials could be partly explained by a stronger integration of visual and auditory stimuli. Attention and multisensory integration are thought to be closely linked (Talsma et al., 2010), and a large number of studies have demonstrated stronger shifts of attention toward congruent, integrated compared with incongruent, separated multisensory input (Diederich and Colonius, 2004; Gondan et al., 2005). Even if the cross-modal spatial processing bias of a continuous sound could indeed be explained by a series of individual exogenous shifts of attention or by mechanisms related to multisensory integration, the same would hold for real-world sounds such as speech, music, or traffic noise. In either case, our present results allow valid conclusions about the corresponding behavioral effect.

Given the involvement of both attention and cross-modal processing in our task, it is further interesting to consider the present results within the broader theoretical framework of Talsma et al. (2010). In their framework, the nature and direction of impact between attention and cross-modal processing are suggested to be largely determined by the complexity and salience of the stimuli, as well as the overall cognitive load. When stimuli in one modality are rare and highly salient (such as in most previous studies on cross-modal attention, e.g., Spence and Driver, 1997; McDonald et al., 2000; Green and McDonald, 2006; Kean and Crawford, 2008), they easily capture bottom-up attention, leading to cross-modal effects on spatially and/or temporally close stimuli in other modalities. If, however, stimuli are of lower saliency (e.g., because they consist of a continuous stream, as in our present experiment), bottom-up effects are reduced, and top-down attention is required for cross-modal processing. Although our data demonstrate a cross-modal impact even during the passive listening condition, effect sizes for this impact were indeed almost twice as large in the active listening condition (ηp2 of 0.27 vs 0.50, respectively). This fits well with the theory by Talsma et al. (2010) but also shows that even under circumstances of low stimulus complexity (such as auditory tone pips) and minimal top-down attention (during the passive listening condition), smaller cross-modal effects can still be observed. In keeping with the framework of bidirectional interplay between attention and cross-modal processing, the underlying neural correlate of our effect might be an increase or a decrease in sensory gain of the rare, task-relevant visual input, depending on whether the (attended) auditory stimulus is spatially congruent or not (Talsma et al., 2010). Several previous studies have demonstrated such attention-mediated, cross-modal neural effects, showing a modification of event-related EEG potentials as early as 80 ms after stimulus presentation (Eimer and Schröger, 1998; Teder-Sälejärvi et al., 1999; Zhang et al., 2011).

We nevertheless take note of several limitations in our study. Ideally, for an optimal comparison between the passive and active listening conditions, the physical stimulation in both conditions should be identical. However, owing to its very nature, the passive lateralized listening condition required unilateral stimulation, whereas the active selective spatial listening condition required bilateral auditory streams (one to selectively attend and one to ignore). Relatedly, it is possible that participants in Experiment 1 partially directed their endogenous attention to the sounds, even if they were task-irrelevant, because their auditory attention was not engaged in any other specific task. However, regardless of whether and to what degree this might have been the case, it does not alter our conclusion that continuous, task-irrelevant auditory input can bias the processing of task-relevant, spatially close visual stimuli. Finally, it is admittedly difficult to compare the present levels of semantic complexity or task difficulty in the three auditory conditions in Experiment 2. We are also not aware of a straightforward way to control for or equate the physical spectrotemporal properties of such diverse, dynamic sounds as sinewave tone pips and natural speech. Importantly, we observed behavioral differences between semantic complexity conditions only during the active and not the passive listening condition. We, thus, argue that it is highly unlikely that these differences are due to low-level physical stimulus properties (e.g., loudness or frequency content), which should have affected performance in a bottom-up way, for instance, by varying individual vigilance levels also during passive listening (Kusnir et al., 2011). Rather, we argue that our observed effects are due to high-level differences in semantic complexity, which require attentive processing and, thus, were only present in the active listening condition. Relatedly, the scoring of the auditory performance in the HSC condition (comprehension questionnaire about the narrated story) is somewhat more subjective than that in the LSC and MSC conditions (target count). Future studies might employ multiple-choice items rather than open questions to avoid this issue.

Conclusion

Our study presents several important novel results. We provide the first evidence that continuously attending to an ongoing auditory input facilitates the processing of spatially close visual stimuli. Further, we demonstrate that this bias is present not only during active listening but also when listeners are instructed to ignore the task-irrelevant auditory input. Our results also support previous reports that the semantic complexity of an auditory input impacts the performance in a parallel visual task. Together, our findings demonstrate the implications of ongoing sounds in everyday scenarios such as moving about in traffic, as well as in any profession requiring sustained visual-spatial attention.

Data Availability Statement

The data are available upon request from the first author.

Ethics Statement

Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. The patients/participants provided their written informed consent to participate in this study.

Author Contributions

UP designed the study, conducted the research, analyzed the data, and wrote the manuscript. RS conducted the research and wrote the manuscript. UA designed the study and wrote the manuscript.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2020.01183/full#supplementary-material

References

Alais, D., and Burr, D. (2004). Ventriloquist effect results from near-optimal bimodal integration. Curr. Biol. 14, 257–262. doi: 10.1016/S0960-9822(04)00043-0

Andreou, L. V., Griffiths, T. D., and Chait, M. (2015). Sensitivity to the temporal structure of rapid sound sequences - An MEG study. Neuroimage 110, 194–204. doi: 10.1016/j.neuroimage.2015.01.052

Auksztulewicz, R., Barascud, N., Cooray, G., Nobre, A. C., Chait, M., and Friston, K. (2017). The cumulative effects of predictability on synaptic gain in the auditory processing stream. J. Neurosci. 37, 6751–6760. doi: 10.1523/JNEUROSCI.0291-17.2017

Barascud, N., Pearce, M. T., Griffiths, T. D., Friston, K. J., and Chait, M. (2016). Brain responses in humans reveal ideal observer-like sensitivity to complex acoustic patterns. Proc. Natl. Acad. Sci. U.S.A. 113, E616–E625.

Brainard, D. H. (1997). The psychophysics toolbox. Spatial Vis. 10, 433–436. doi: 10.1163/156856897X00357

Cohen, J. T., and Graham, J. D. (2003). A revised economic analysis of restrictions on the use of cell phones while driving. Risk Anal. 23, 5–17. doi: 10.1111/1539-6924.00286

Cornelissen, F. W., Peters, E. M., and Palmer, J. (2002). The eyelink toolbox: eye tracking with MATLAB and the psychophysics toolbox. Behav. Res. Methods Instruments Comput. 34, 613–617. doi: 10.3758/BF03195489

Coull, J. T., Nobre, A. C., and Frith, C. D. (2001). The noradrenergic α2 agonist clonidine modulates behavioural and neuroanatomical correlates of human attentional orienting and alerting. Cereb. Cortex 11, 73–84. doi: 10.1093/cercor/11.1.73

Dalton, P., and Lavie, N. (2004). Auditory attentional capture: effects of singleton distractor sounds. J. Exp. Psychol. Hum. Percept. Perform. 30, 180–193. doi: 10.1037/0096-1523.30.1.180

Diederich, A., and Colonius, H. (2004). Bimodal and trimodal multisensory enhancement: effects of stimulus onset and intensity on reaction time. Percept. Psychophys. 66, 1388–1404. doi: 10.3758/BF03195006

Driver, J., and Spence, C. (1994). “Spatial synergies between auditory and visual attention,” in Attention and Performance XV: Conscious and Nonconscious Information Processing, eds C. Umiltá and M. Moscovitch (Cambridge, MA: MIT Press), 311–322.

Driver, J., and Spence, C. (1998). Attention and the crossmodal construction of space. Trends Cogn. Sci. 2, 254–262. doi: 10.1016/s1364-6613(98)01188-7

Dufour, A. (1999). Importance of attentional mechanisms in audiovisual links. Exp. Brain Res. 126, 215–222. doi: 10.1007/s002210050731

Eimer, M., and Driver, J. (2000). An event-related brain potential study of cross-modal links in spatial attention between vision and touch. Psychophysiology 37, 697–705. doi: 10.1111/1469-8986.3750697

Eimer, M., and Schröger, E. (1998). ERP effects of intermodal attention and cross-modal links in spatial attention. Psychophysiology 35, 313–327. doi: 10.1017/S004857729897086X

Fan, J., McCandliss, B. D., Fossella, J., Flombaum, J. I., and Posner, M. I. (2005). The activation of attentional networks. Neuroimage 26, 471–479. doi: 10.1016/j.neuroimage.2005.02.004

Gherri, E., and Eimer, M. (2011). Active listening impairs visual perception and selectivity: an ERP study of auditory dual-task costs on visual attention. J. Cogn. Neurosci. 23, 832–844. doi: 10.1162/jocn.2010.21468

Gondan, M., Niederhaus, B., Rösler, F., and Röder, B. (2005). Multisensory processing in the redundant-target effect: a behavioral and event-related potential study. Percept. Psychophys. 67, 713–726. doi: 10.3758/BF03193527

Green, J. J., and McDonald, J. J. (2006). An event-related potential study of supramodal attentional control and crossmodal attention effects. Psychophysiology 43, 161–171. doi: 10.1111/j.1469-8986.2006.00394.x

Haupt, M., Sorg, C., Napiórkowski, N., and Finke, K. (2018). Phasic alertness cues modulate visual processing speed in healthy aging. Neurobiol. Aging 70, 30–39. doi: 10.1016/j.neurobiolaging.2018.05.034

Havlíček, O., Müller, H. J., and Wykowska, A. (2019). Distract yourself: prediction of salient distractors by own actions and external cues. Psychol. Res. 83, 159–174. doi: 10.1007/s00426-018-1129-x

Iordanescu, L., Grabowecky, M., Franconeri, S., Theeuwes, J., and Suzuki, S. (2010). Characteristic sounds make you look at target objects more quickly. Attent. Percept. Psychophys. 72, 1736–1741. doi: 10.3758/APP.72.7.1736

Iordanescu, L., Guzman-Martinez, E., Grabowecky, M., and Suzuki, S. (2008). Characteristic sounds facilitate visual search. Psychon. Bull. Rev. 15, 548–554. doi: 10.3758/PBR.15.3.548

Kahneman, D. (1973). Attention and effort. Englewood Cliffs, NJ: Prentice-Hall.

Kean, M., and Crawford, T. J. (2008). Cueing visual attention to spatial locations with auditory cues. J. Eye Mov. Res. 2, 1–13. doi: 10.16910/jemr.2.3.4

Köhlmeier, M. (2011). Folge 73: Inachos und Eris [Episode 73: Inachos and Eris] [Audio Podcast]. Available online at: https://www.br.de/mediathek/podcast/mythen-michael-koehlmeier-erzaehlt-sagen-des-klassischen-altertums/folge-73-inachos-und-eris/41527 (accessed August 8, 2018).

Köhlmeier, M. (2013). Folge 20: Aigisthos [Episode 20: Aigisthos] [Audio Podcast]. Available online at: https://www.br.de/mediathek/podcast/mythen-michael-koehlmeier-erzaehlt-sagen-des-klassischen-altertums/folge-20-aigisthos/41500 (accessed August 8, 2018).

Kunar, M. A., Carter, R., Cohen, M., and Horowitz, T. S. (2008). Telephone conversation impairs sustained visual attention via a central bottleneck. Psychon. Bull. Rev. 15, 1135–1140. doi: 10.3758/PBR.15.6.1135

Kusnir, F., Chica, A. B., Mitsumasu, M. A., and Bartolomeo, P. (2011). Phasic auditory alerting improves visual conscious perception. Conscious. Cogn. 20, 1201–1210. doi: 10.1016/j.concog.2011.01.012

Lavie, N. (2005). Distracted and confused?: selective attention under load. Trends Cogn. Sci. 9, 75–82. doi: 10.1016/j.tics.2004.12.004

Lee, J., and Spence, C. (2015). Audiovisual crossmodal cuing effects in front and rear space. Front. Psychol. 6:1086. doi: 10.3389/fpsyg.2015.01086

Levy, J., Pashler, H., and Boer, E. (2006). Central interference in driving: is there any stopping the psychological refractory period? Psychol. Sci. 17, 228–235. doi: 10.1111/j.1467-9280.2006.01690.x

Mastroberardino, S., Santangelo, V., and Macaluso, E. (2015). Crossmodal semantic congruence can affect visuo-spatial processing and activity of the fronto-parietal attention networks. Front. Integr. Neurosci. 9:45. doi: 10.3389/fnint.2015.00045

McDonald, J. J., Teder-Sälejärvi, W. A., and Hillyard, S. A. (2000). Involuntary orienting to sound improves visual perception. Nature 407, 906–908. doi: 10.1038/35038085

McDonald, J. J., and Ward, L. M. (2000). Involuntary listening aids seeing: evidence from human electrophysiology. Psychol. Sci. 11, 167–171. doi: 10.1111/1467-9280.00233

Murphy, S., Fraenkel, N., and Dalton, P. (2013). Perceptual load does not modulate auditory distractor processing. Cognition 129, 345–355. doi: 10.1016/j.cognition.2013.07.014

Noyce, A., and Sekuler, R. (2014). Oddball distractors demand attention: neural and behavioral responses to predictability in the flanker task. Neuropsychologia 65, 18–24. doi: 10.1016/j.neuropsychologia.2014.10.002

Pizzighello, S., and Bressan, P. (2008). Auditory attention causes visual inattentional blindness. Perception 37, 859–866. doi: 10.1068/p5723

Pomplun, M., Reingold, E. M., and Shen, J. (2001). Investigating the visual span in comparative search: the effects of task difficulty and divided attention. Cognition 81, 57–67. doi: 10.1016/S0010-0277(01)00123-8

Qi, M., Gao, H., and Liu, G. (2018). The effect of mild acute psychological stress on attention processing: an ERP study. Exp. Brain Res. 236, 2061–2071. doi: 10.1007/s00221-018-5283-6

Rolfs, M., Engbert, R., and Kliegl, R. (2005). Crossmodal coupling of oculomotor control and spatial attention in vision and audition. Exp. Brain Res. 172, 427–439. doi: 10.1007/s00221-005-2382-y

Rouder, J. N., Morey, R. D., Speckman, P. L., and Province, J. M. (2012). Default Bayes factors for ANOVA designs. J. Math. Psychol. 56, 356–374. doi: 10.1016/j.jmp.2012.08.001

Santangelo, V., Botta, F., Lupiáñez, J., and Spence, C. (2011). The time course of attentional capture under dual-task conditions. Attent. Percept. Psychophys. 73, 15–23. doi: 10.3758/s13414-010-0017-2

Santangelo, V., Olivetti Belardinelli, M., and Spence, C. (2007). The suppression of reflexive visual and auditory orienting when attention is otherwise engaged. J. Exp. Psychol. Hum. Percept. Perform. 33, 137–148. doi: 10.1037/0096-1523.33.1.137

Sohoglu, E., and Chait, M. (2016). Detecting and representing predictable structure during auditory scene analysis. eLife 5:e19113. doi: 10.7554/eLife.19113.001

Spence, C., and Driver, J. (1996). Audiovisual links in endogenous covert spatial attention. J. Exp. Psychol. Hum. Percept. Perform. 22, 1005–1030. doi: 10.1037/0096-1523.22.4.1005

Spence, C., and Driver, J. (1997). Audiovisual links in exogenous covert spatial orienting. Percept. Psychophys. 59, 1–22. doi: 10.3758/BF03206843

Spence, C., and Read, L. (2003). Speech shadowing while driving: on the difficulty of splitting attention between eye and ear. Psychol. Sci. 14, 251–256. doi: 10.1111/1467-9280.02439

Sturm, W., and Willmes, K. (2001). On the functional neuroanatomy of intrinsic and phasic alertness. Neuroimage 14, S76–S84. doi: 10.1006/nimg.2001.0839

Talsma, D., Senkowski, D., Soto-Faraco, S., and Woldorff, M. G. (2010). The multifaceted interplay between attention and multisensory integration. Trends Cogn. Sci. 14, 400–410. doi: 10.1016/j.tics.2010.06.008

Teder-Sälejärvi, W. A., Münte, T. F., Sperlich, F., and Hillyard, S. A. (1999). Intra-modal and cross-modal spatial attention to auditory and visual stimuli. An event-related brain potential study. Cogn. Brain Res. 8, 327–343. doi: 10.1016/s0926-6410(99)00037-3

Van der Burg, E., Olivers, C. N., Bronkhorst, A. W., and Theeuwes, J. (2008). Pip and pop: nonspatial auditory signals improve spatial visual search. J. Exp. Psychol. Hum. Percept. Perform. 34, 1053–1065. doi: 10.1037/0096-1523.34.5.1053

Zhang, D., Hong, B., Gao, X., Gao, S., and Röder, B. (2011). Exploring steady-state visual evoked potentials as an index for intermodal and crossmodal spatial attention. Psychophysiology 48, 665–675. doi: 10.1111/j.1469-8986.2010.01132.x

Zhao, S., Yum, N. W., Benjamin, L., Benhamou, E., and Furukawa, S. (2019). Rapid ocular responses are a robust marker for bottom-up driven auditory salience. J. Neurosci. 39, 7703–7714. doi: 10.1523/JNEUROSCI.0776-19.2019

Keywords: multisensory processing, dual-task, attention, cross-modal, response time

Citation: Pomper U, Schmid R and Ansorge U (2020) Continuous, Lateralized Auditory Stimulation Biases Visual Spatial Processing. Front. Psychol. 11:1183. doi: 10.3389/fpsyg.2020.01183

Received: 15 January 2020; Accepted: 07 May 2020;
Published: 12 June 2020.

Edited by: Andrey R. Nikolaev, Lund University, Sweden

Reviewed by: Malte Wöstmann, University of Lübeck, Germany; Hulusi Kafaligonul, Bilkent University, Turkey

Copyright © 2020 Pomper, Schmid and Ansorge. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Ulrich Pomper, ulrich.pomper@univie.ac.at

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.