Visual Presentation Effects on Identification of Multiple Environmental Sounds

Masakura, Yuko; Ichikawa, Makoto; Shimono, Koichi; Nakatsuka, Reio

doi:10.3389/fnint.2016.00011

ORIGINAL RESEARCH article

Front. Integr. Neurosci., 04 March 2016

Volume 10 - 2016 | https://doi.org/10.3389/fnint.2016.00011

This article is part of the Research TopicIntegration and segregation in sensorimotor processingView all 13 articles

Visual Presentation Effects on Identification of Multiple Environmental Sounds

Yuko Masakura^1,2*

Makoto Ichikawa^2,3

Koichi Shimono⁴

Reio Nakatsuka²

¹Faculty of Media Theories and Production, Aichi Shukutoku University, Aichi, Japan
²Faculty of Engineering, Yamaguchi University, Yamaguchi, Japan
³Department of Psychology, Chiba University, Chiba, Japan
⁴Faculty of Marine Technology, Tokyo University of Marine Science and Technology, Tokyo, Japan

This study examined how the contents and timing of a visual stimulus affect the identification of mixed sounds recorded in a daily life environment. For experiments, we presented four environment sounds as auditory stimuli for 5 s along with a picture or a written word as a visual stimulus that might or might not denote the source of one of the four sounds. Three conditions of temporal relations between the visual stimuli and sounds were used. The visual stimulus was presented either: (a) for 5 s simultaneously with the sound; (b) for 5 s, 1 s before the sound (SOA between the audio and visual stimuli was 6 s); or (c) for 33 ms, 1 s before the sound (SOA was 1033 ms). Participants reported all identifiable sounds for those audio–visual stimuli. To characterize the effects of visual stimuli on sound identification, the following were used: the identification rates of sounds for which the visual stimulus denoted its sound source, the rates of other sounds for which the visual stimulus did not denote the sound source, and the frequency of false hearing of a sound that was not presented for each sound set. Results of the four experiments demonstrated that a picture or a written word promoted identification of the sound when it was related to the sound, particularly when the visual stimulus was presented for 5 s simultaneously with the sounds. However, a visual stimulus preceding the sounds had a benefit only for the picture, not for the written word. Furthermore, presentation with a picture denoting a sound simultaneously with the sound reduced the frequency of false hearing. These results suggest three ways that presenting a visual stimulus affects identification of the auditory stimulus. First, activation of the visual representation extracted directly from the picture promotes identification of the denoted sound and suppresses the processing of sounds for which the visual stimulus did not denote the sound source. Second, effects based on processing of the conceptual information promote identification of the denoted sound and suppress the processing of sounds for which the visual stimulus did not denote the sound source. Third, processing of the concurrent visual representation suppresses false hearing.

Introduction

Many studies have been conducted to ascertain how auditory and visual modalities mutually interact. Many interactions have been examined only under conditions in which a participant is exposed to a single visual stimulus and a single auditory stimulus. In daily life, however, we are enveloped in and bombarded by multiple auditory and visual stimuli. Consequently, it is important to elucidate how the two modalities interact under the multiple-stimuli condition to reveal the processes used in daily life. This study specifically examines the interaction under the multiple-stimulus condition and particularly investigates how visual processing facilitates or interferes with auditory processing when the visual stimulus is relevant or irrelevant to the auditory stimuli.

Numerous previous reports have described that auditory processing interacts with visual processing in several ways. For instance, some studies have demonstrated that vision can dominate audition in determining perception of spatial aspects related to a stimulus. For example, the perceived location of a sound source tends to be shifted to the location of a visual stimulus (Stratton, 1897a,b; Young, 1928; Ewert, 1930; Willey et al., 1937; Thomas, 1941). This effect is known as visual capture (Jackson, 1953; Hay et al., 1965) or the ventriloquism effect (Howard and Templeton, 1966; Jack and Thurlow, 1973; Bertelson and Radeau, 1981). Different studies have demonstrated that audition can modify vision through duration perception (Walker and Scott, 1981), frequency perception (Welch et al., 1986; Shams et al., 2000, 2002; Wada et al., 2003; McCormick and Mammasian, 2008), and apparent motion (Kamitani and Shimojo, 2001; Wada et al., 2003; Ichikawa and Masakura, 2006). These results of studies suggest that the dominant modality in interaction between auditory and visual processing depends upon whether a participant judges the spatial or temporal aspect of the stimulus (Shimojo and Shams, 2001). Recent Bayesian models (e.g., Battaglia et al., 2003; Ernst, 2006) show that dominance of a modality in audio–visual interaction can be expected to result from experience related to the reliability of each modality in our daily life.

Previous studies examined audio–visual interaction using congruent and incongruent visual and auditory stimuli. For instance, studies of visual search have found that presenting characteristic sounds might enhance visual searching of corresponding objects, although presenting the name of the object as written word has no effect (Iordanescu et al., 2008). Studies of object recognition have found that presenting a semantically congruent sound might improve sound source identification (Chen and Spence, 2010). Presenting picture and sound elements that are mutually congruent, might hasten the recognition of objects that are denoted by the picture and sound, but presenting an incongruent picture and sound has no interference effect (Molholm et al., 2004). Studies of speech perception have revealed that congruent visual information might enhance target voice detection in a noisy environment (Sumby and Pollack, 1954; Campbell and Dodd, 1980; MacLeod and Summerfield, 1990; Thompson, 1995). A motion picture of a face speaking incongruent vowels is expected to modify pronunciation perception (McGurk and MacDonald, 1976; Sekiyama and Tohkura, 1993). In addition, a motion picture of a speaking face is expected to facilitate listening to the target sounds in terms of grouping of congruent motion pictures and sounds (Driver, 1996).

Interactions between vision and auditory processing can also be found in sound identification. Crossmodal priming studies, for example, have revealed that prior presentation of a picture hastened and improved identification of a sound when a picture was relevant to a sound, compared to when it was not (e.g., Greene et al., 2001; Noppeney et al., 2008; Schneider et al., 2008; Ozcan and van Egmond, 2009). A similar priming effect was found in auditory word recognition when a spoken word was presented after presentation of a written word corresponding to the spoken word (e.g., Holcomb and Anderson, 1993; Noppeney et al., 2008). These results of studies suggest that visual prime, irrespective of a picture or a written word, facilitates identification of the following relevant sound. However, from those earlier studies, it remains unclear how visual processing affects identification of the auditory stimulus when multiple auditory stimuli, relevant or irrelevant, are presented simultaneously as they are in daily life.

To elucidate how visual processing affects the identification of auditory stimuli, this study examined the effect of a visual stimulus on auditory processing with visual stimuli of different types and with various temporal relations between the visual and auditory stimuli in four experiments. To elucidate the effects of information obtained from the visual system on the identification of sounds, in the four experiments, we presented a picture or a written word which might or might not denote the source of one of four sounds recorded in a daily life environment. Then we examined how the visual representation of an object affects the auditory stimulus identification. Furthermore, we investigated the mechanism by which the time course of the visual–audio stimuli affects the visual–audio interaction. Manipulating the temporal relation between the audio and visual stimuli is expected to yield important information about what information processing is involved in crossmodal processing. For instance, we might infer that crossmodal effects, which are found only if the audio stimulus is presented concurrently with the visual stimulus, would be based on real-time perceptual processing. Furthermore, we might infer that the crossmodal effects which are found with a visual stimulus preceding an auditory stimulus, would be based upon the cognitive processing of the visual representation formed in a past observation of the visual stimulus. In addition, crossmodal effects, which are found with quite a short-term presentation of visual stimulus, would be based on the processing of the visual information obtained within a short period, such as low spatial frequency components (Schyns and Oliva, 1994). In fact, previous studies of the effects of presenting audio stimulus on the identification of visual stimulus (Chen and Spence, 2010, 2011) revealed that presenting a semantically congruent audio stimulus might facilitate the identification of the visual stimulus with various stimulus onset asynchronies (SOAs) between the audio and visual stimuli (less than 500 ms). This result indicates that facilitation of visual stimulus identification depends on real-time perceptual processing. However, few studies have examined the degree to which the temporal relation between the audio and visual stimuli affects the visual stimulus’ influence upon the audio stimulus. To elucidate the time course of visual–audio interaction, we manipulated the inter-stimulus interval between visual stimulus (a picture) and auditory stimuli in Experiments 1–3. In Experiment 1, the picture was presented simultaneously with the auditory stimulus. In Experiment 2, the picture was presented before the auditory stimulus. In Experiment 3, the picture was presented for an extremely short period before the auditory stimulus. In Experiment 4, we presented a written word as a visual stimulus, instead of a picture, under the same inter-stimulus conditions used in Experiments 1 and 2. The participant was instructed to identify all sounds involved in the auditory stimuli. We used the correct identification rate of the sounds that were presented, and the frequency of hearing the sound that was not presented (frequency of “false hearing”) as indexes of the effect of the visual stimulus. Comparison of the effects of a picture and written word on the identification of the auditory stimulus might reveal how either the visual representation or conceptual information derived from the visual stimulus affects the identification of the auditory stimulus. We also compare the effects of a temporal relation between the visual and auditory stimuli and discuss how information derived from a visual system is processed to identify the sounds in hearing multi-auditory stimuli.

Experiment 1

In the first experiment, we presented a single visual stimulus and multiple auditory stimuli simultaneously to assess the effects of a visual stimulus on the identification of auditory stimuli. As auditory stimuli, we used natural sounds that are audible in our daily life and which are readily identified when presented separately. As the visual stimulus, we used a picture (static photograph), which might or might not correspond to what one of the auditory stimuli represented: the picture was relevant or irrelevant to the auditory stimulus. The durations of the visual and auditory stimuli were sufficiently long for participants to judge what were depicted and represented. Referring to the results of previous studies (Holcomb and Anderson, 1993; Greene et al., 2001; Noppeney et al., 2008; Schneider et al., 2008; Ozcan and van Egmond, 2009), we expected that presenting visual information for the auditory stimulus facilitates identification of the auditory stimulus.

Methods

Participants

The experiment included 20 university students (4 females, 16 males) with ages of 20–25 years. All had normal hearing and normal or corrected-to-normal visual acuity. All were naïve to the purpose of this study. The experiments in this study were approved by the local ethical committee of the department of perceptual sciences and design engineering in Yamaguchi University.

Apparatus and Stimuli

A 21-inch display (21C-S11; Mitsubishi Electric Corp.), controlled using a personal computer (Macintosh G3; Apple Computer Inc.), and two loudspeakers (MU-S7; Sony Corp.) were used to present the visual stimulus and the auditory stimuli. The display was placed on a table at the height of the participant’s eye level and in the frontoparallel plane arranged at a viewing distance of 180 cm. The horizontal center of the display was placed at the intersection between the midsagittal plane and the frontoparallel plane. The right and left speakers on the floor were placed, respectively, 107.5 cm right and left of the intersection. The visual stimulus presented in the display was 10.2 by 13.5 degree. The sound pressure for the auditory stimuli was 35 dB (in L_Aeq).

Stimulus sets of two types were used: experimental and additional sets. For the experimental stimulus sets, we used 13 sounds as auditory stimuli, which were recorded in a daily life environment, and 16 pictures of objects or creatures as visual stimuli, which were also taken in daily life (Table 1). We created nine sound combinations, each of which included four sounds selected from the 13 sounds. The number of the combined sounds in a sound combination was chosen based on results from a preliminary experiment in which we examined the difficulty of identifying multiple sounds. The relevant visual stimulus means that the picture denotes a sound source that 8 of 10 participants reported in the preliminary experiment in which we presented one sound and asked participants to report its sound source. In contrast to the relevant visual stimulus, the irrelevant visual stimulus means that the picture denotes no sound source. In Experiment 1, the sound sets were presented with a visual stimulus relevant to one of the four sounds or irrelevant to all of them, or they were presented without the visual stimulus.

TABLE 1

Table 1. Auditory and visual stimulus conditions.

We prepared 12 additional stimulus sets that presented one, two, three, or four sounds that had been selected from five newly recorded sounds. We were concerned that participants would adhere to an answer to give the same numbers of sounds for all stimuli if the stimuli always presented the same number of sounds. To avoid such participant adherence, we changed the number of the sounds presented in the additional stimuli. The responses for the additional stimulus sets were not recorded. Therefore they were not analyzed in the following sections of this manuscript.

Procedures

The experimental stimulus sets have three conditions in which four sounds were presented simultaneously as auditory stimuli with or without a visual stimulus. In those stimuli, one sound was selected randomly as a target in each sound set and was designated as the target sound. Participants did not know which sound was the target in each sound set. In the relevant visual stimulus condition, a picture that denoted the target sound source was presented with the sounds. In the irrelevant visual stimulus condition, a picture that denoted no sound source of the four sounds was presented with the sounds. In the no visual stimulus condition, a blank screen was presented with the sounds. Table 1 presents a list of the target sounds and pictures in each of the relevant and irrelevant visual stimulus conditions.

Each participant conducted 39 trials in which nine experimental sets for each of the three stimulus conditions and 12 additional stimulus sets were presented in random order. In each trial, we presented the visual (or no-visual) and auditory stimuli for 5.0 s simultaneously. After the stimulus presentation, the participants listed sound sources of all the sounds that they identified in each sound combination on a paper. No time restriction or number restriction of answers was used for the participants to describe the sounds. They took at most 30 s in each trial.

Results and Discussion

To analyze the reported sounds, first, three naïve participants, who were newly recruited, judged whether or not the reported sounds corresponded to the presented sounds in each trial. Second, we classified the judgments into three categories: identification of a target sound, identification of non-target sound, and false hearing. False hearing is used to describe a report of a sound that was not presented in the sound combinations. Third, we calculated the rate of the identification of target sounds and that of identification of non-target sounds for all judgments (i.e., all reported sounds), and counted the frequency of false hearing for each sound set. No number restriction of answers was used for the participants to describe the sounds: the number decrement (or increment) of false hearing does not mean the number increment (or decrement) of correct hearing.

Figures 1A,B respectively present the average rates of the identification of target sounds and of non-target sounds based on data from 20 participants for each visual stimulus condition. We conducted one-way repeated measures analyses of variance (ANOVA) with the visual stimulus condition, separately for the identification rate of target sound and that of non-target sound. The main effect of the visual stimulus condition was found to be significant [F_(2,38) = 14.60, p < 0.001, partial η² = 0.43] for the target sound. Tukey’s post hoc HSD tests showed that the identification rate in the relevant visual stimulus condition was significantly higher than those in the other two visual stimulus conditions (p < 0.01). We also found a significant main effect [F_(2,38) = 10.99, p < 0.001, partial η² = 0.37] for the non-target sound. Tukey’s HSD tests showed that the identification rate in the relevant visual stimulus condition was significantly lower than those in the other two visual stimulus conditions (p < 0.05). These results suggest that the visual image of the sound source promotes the identification of that sound, and that the visual image suppresses the identification of sounds for which the sound source was not presented. Details of the bases of these promotive and suppressive processes will be discussed in section “General Discussion”.

FIGURE 1

Figure 1. Experiment 1 results. Mean and SEM. (A) Target sound identification rate, (B) Non-target sound identification rate, and (C) False hearing frequency. The target sound identification rate and that of non-target sounds to all judgments (i.e., all reported sounds) were calculated for each visual stimulus condition. The false hearing frequency was counted for each sound. *p < 0.05, **p < 0.01.

Figure 1C presents the mean false hearing frequency for each of the visual stimulus conditions. A one-way repeated measures ANOVA revealed a significant main effect of the visual stimulus condition for the frequency of false hearing [F_(2,38) = 3.42, p < 0.05, partial η² = 0.15]. Tukey’s HSD tests showed that the frequency in the no visual stimulus condition was significantly higher than those in the other two visual stimulus conditions (p < 0.05). This result indicates that presentation of the irrelevant visual stimulus did not increase the false hearing, and that presenting a visual image of any object, irrespective of its relevance to the sounds, suppresses false hearing.

In Experiment 1, we found the facilitative effect of visual presentation on sounds identification, as we expected from the results of previous studies (Holcomb and Anderson, 1993; Greene et al., 2001; Noppeney et al., 2008; Schneider et al., 2008; Ozcan and van Egmond, 2009), and also found other effects. The results of Experiment 1 revealed that presentation of the visual image simultaneously with auditory stimulus has three effects on identification of the presented sounds: promotion of identification for the target sound, suppression of identification for the non-target sound, and false hearing. Two possible explanations exist for these effects. One is that these effects are based on perceptual processing in terms of the visual image presented simultaneously with sounds. Another is that these effects are based on cognitive processing in which visual representation of the past visual image affects the identification of the present sounds. In this case, the basis of the effects of visual image on identification of sounds is the priming effect of the visual representation, which is obtained shortly after the beginning of the visual stimulus presentation. This issue is examined further in Experiment 2.

Experiment 2

Experiment 1 revealed the effects of presenting a visual image on identification of sounds presented simultaneously with the visual image. In the second experiment, we examine how cognitive processing, not the real-time perceptual processing, is involved in the effects of presenting visual image on identification of sounds. We presented the auditory stimuli after the presentation of the visual stimulus with ISI which was much longer than storage duration of visual iconic memory (Sperling, 1960).