1 Center for Cognitive Neuroscience, Duke University, Durham, NC, USA
2 Department of Neurobiology, Duke University Medical Center, Durham, NC, USA
3 Department of Psychology and Neuroscience, Duke University, Durham, NC, USA
4 Department of Psychiatry, Duke University Medical Center, Durham, NC, USA
Human perception of faces is widely believed to rely on automatic processing by a domain-specific, modular component of the visual system. Scalp-recorded event-related potential (ERP) recordings indicate that faces receive special stimulus processing at around 170 ms poststimulus onset, in that faces evoke an enhanced occipital negative wave, known as the N170, relative to the activity elicited by other visual objects. As predicted by modular accounts of face processing, this early face-specific N170 enhancement has been reported to be largely immune to the influence of endogenous processes such as task strategy or attention. However, most studies examining the influence of attention on face processing have focused on non-spatial attention, such as object-based attention, which tends to have longer-latency effects. In contrast, numerous studies have demonstrated that visual spatial attention can modulate the processing of visual stimuli as early as 80 ms poststimulus – substantially earlier than the N170. These temporal characteristics raise the question of whether this initial face-specific processing is immune to the influence of spatial attention. This question was addressed in a dual-visual-stream ERP study in which the influence of spatial attention on the face-specific N170 could be directly examined. As expected, early visual sensory responses to all stimuli presented in an attended location were larger than responses evoked by those same stimuli when presented in an unattended location. More importantly, a significant face-specific N170 effect was elicited by faces that appeared in an attended location, but not in an unattended one. In summary, early face-specific processing is not automatic, but rather, like the processing of other objects, strongly depends on endogenous factors such as the allocation of spatial attention. Moreover, these findings underscore the extensive influence that top-down attention exercises over the processing of visual stimuli, including those of high natural salience.
Faces are of undeniable ecological significance and commonly evoke behavioral and physiological responses that differ from responses to many other kinds of stimuli. Neuropsychological and behavioral evidence of the special character of face processing, available for many years, includes the existence of patients with selective deficits in face perception (Bodamer, 1947), the ontogenetically early appearance of preference for face-like stimuli (Goren et al., 1976) and the finding that image inversion impairs perception of faces more than perception of other objects (Yin, 1969).
Correspondingly, numerous studies with both human and non-human primates have reported that visually-evoked neurophysiological responses to faces often differ from responses evoked by other kinds of objects. Over thirty years ago, neurophysiological studies with macaque monkeys revealed that a population of neurons in the inferotemporal cortex respond preferentially to images of faces (Gross et al., 1972). Subsequently, functional imaging studies have identified certain regions in the human brain, including the fusiform gyrus and the superior temporal sulcus, that respond more strongly to faces than other objects (Clark et al., 1996; Kanwisher et al., 1997; McCarthy et al., 1997; Puce et al., 1995; Sergent and Signoret, 1992). Distinctive patterns of neural activity associated with face processing have also been observed in human electrophysiological recordings (Allison et al., 1999; Bentin et al., 1996; George et al., 1996; Lu et al., 1991; Sams et al., 1997). In particular, relative to other visual objects, faces elicit an enhanced negative-polarity event-related potential (ERP) component over lateral occipital scalp peaking about 170 ms after stimulus presentation (Bentin et al., 1996). Furthermore, in agreement with functional imaging studies, ERP source analysis has indicated that the activity underlying the face-specificity of the N170 responses likely arises from a combination of activity in the fusiform gyrus and the superior temporal sulcus (Itier and Taylor, 2004).
Based on the unique behavioral and physiological responses elicited by faces, many investigators have concluded that the processing of faces is qualitatively distinct from the processing given to other types of objects. According to this view, faces are processed by an anatomically well-localized modular system that is highly specialized for analyzing images of faces (Farah et al., 1998; Tovee, 1998). Proponents of the modular view have argued that certain regions of the brain, most especially the fusiform gyrus (Allison et al., 1999; Kanwisher et al., 1997; McCarthy et al., 1997), are highly specialized for face processing. In this context, it is interesting to note that a number of studies have reported that the processing of faces appeared to be relatively impervious to the influence of the allocation of attention (e.g., Cauquil et al., 2000; Lueschow et al., 2004). If true, the processing of faces would be a striking exception to the robust attention-dependence broadly exhibited throughout the visual system and would strongly underscore the unique character of face processing.
The idea that face processing is immune to top-down influence, however, does not accord with our increasing appreciation of the prevalence of attentional modulation in the visual system. If face stimuli obligatorily receive special processing, the early neurophysiological responses selectively elicited by images of faces, such as the N170 ERP effect, should be about the same whether the faces are attended or not. It is now clear, however, that even the very early stages of cortical visual processing can be influenced by attention (e.g., Heinze et al., 1994; Mangun, 1995; Moran and Desimone, 1985; Motter, 1993; Poghosyan et al., 2005; Posner and Gilbert, 1999; Smith et al., 2006; Woldorff et al., 1997). Thus, the early analysis of complex objects like faces seems likely to be sensitive to the allocation of attention as well. Therefore, in order to determine whether the early cortical discrimination of faces depends on the allocation of attention, we have compared the neurophysiological responses of human observers to images of faces and other objects when they were presented in an attended location with the responses evoked by those same images when they were presented in an unattended location.
Participants
Nineteen right-handed adults (seven females and twelve males) ranging in age from 18 to 37 years (mean 22.4 years) participated in the study. Four subjects were excluded due to poor performance and/or excessive physiological artifacts such as eye blinks, eye movements, or muscular activity. The study protocol was approved by the Duke University Health System Institutional Review Board, and written informed consent was obtained from all participants.
Stimuli and Paradigm
Subjects were seated comfortably in an electrically shielded, sound-attenuated, dimly illuminated chamber facing a computer monitor. Stimulus presentation was controlled by a personal computer running the “Presentation” software package (Neurobehavioral Systems, Inc., Albany, CA, USA). All stimuli were displayed on a 15" CRT screen refreshed at 60 Hz.
Throughout each recording block, subjects were required to fixate on a small cross in the center of the screen while streams of visual stimuli were presented above and below the fixation cross (Figure 1). The upper stream consisted of a rapid serial visual presentation (RSVP) of alphanumeric characters, approximately two degrees square, centered about two degrees above the fixation cross. The characters in this alphanumeric stimulus stream were replaced every 150 ms. Simultaneously, a series of images of faces (obtained from the Psychological Image Collection at Stirling; http://pics.psych.stir.ac.uk) and houses, each approximately five degrees wide and six degrees high, were presented in randomized order at a point centered nine degrees below the fixation point. These images were presented for 100 ms each and the intervals between image onsets were randomly varied between 600 and 900 ms (in increments of one frame duration).
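The timing constraints described above can be sketched in a few lines. This is only an illustration, not the Presentation script used in the study: the helper names are hypothetical, and the uniform draw over whole-frame intervals is an assumption, since the text states only that the onset intervals were randomly varied.

```python
import random

REFRESH_HZ = 60                  # CRT refresh rate stated in the text
FRAME_MS = 1000 / REFRESH_HZ     # about 16.67 ms per frame

def ms_to_frames(duration_ms):
    """Convert a duration in ms to the nearest whole number of frames."""
    return round(duration_ms / FRAME_MS)

RSVP_FRAMES = ms_to_frames(150)  # frames per alphanumeric character
IMAGE_FRAMES = ms_to_frames(100) # frames per face/house image
SOA_MIN = ms_to_frames(600)      # shortest onset-to-onset interval, in frames
SOA_MAX = ms_to_frames(900)      # longest onset-to-onset interval, in frames

def draw_image_onsets(n_images, rng):
    """Hypothetical helper: image onset times (ms) with whole-frame jitter.

    Assumption: intervals are drawn uniformly over SOA_MIN..SOA_MAX frames.
    """
    onsets, t_frames = [], 0
    for _ in range(n_images):
        onsets.append(t_frames * FRAME_MS)
        t_frames += rng.randint(SOA_MIN, SOA_MAX)
    return onsets

# One character every 9 frames gives the 6.67-Hz RSVP rate.
rsvp_rate_hz = REFRESH_HZ / RSVP_FRAMES
onsets = draw_image_onsets(5, random.Random(0))
```

At 60 Hz, the 150-ms character duration corresponds to 9 frames, the 100-ms image duration to 6 frames, and the 600–900 ms onset intervals to 36–54 frames.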
Figure 1. Stimuli. (A) Spatial layout of stimuli. Throughout each block, subjects were presented with two streams of stimuli. Above fixation, an RSVP stream of alphanumeric characters was presented. Below fixation, a series of face and house images (of equal probability) were presented. (B) Prior to each recording block, subjects were instructed to attend to one or the other of the two locations and detect occasional target stimuli in the stream at that location. In the attend-RSVP condition, participants attended to the stream of alphanumeric characters to detect an occasional digit (target) amongst mostly letters (non-targets, or “standards”). In the attend-images condition, participants attended to the stream of face and house images, most of which were in focus (standards), to detect the occasional occurrence of a blurred image (targets). Note that in this condition all the stimuli in the lower image stream (i.e., faces and houses) were attended, but the image content itself (i.e., whether it was a face versus a house) was completely orthogonal to the task of detecting blurred images of either type.
Each subject performed 16 blocks of trials with each block lasting approximately 2.5 minutes. Prior to the start of each block, subjects were instructed to attend either the upper stream (the alphanumeric character stimuli) or lower stream (the face and house images) and to indicate the appearance of an occasional target in the designated stream by pressing a button on a key pad. When attending the alphanumeric stream, subjects attempted to detect the appearance of infrequently presented numerals (approximately 2% of the characters in the stream) amongst mostly uppercase alphabetical characters. Targets in the face and house stream were blurred images of these objects and comprised 20% of the number of images presented. In order to avoid biasing the attention of the subjects toward one type of image in the lower stream, the blurry target images were equally likely to be either a face or a house. Thus, whether the images contained a face or a house was irrelevant to the performance of the task. The order of experimental conditions (i.e., attend to alphanumeric stream or attend to face/house image stream) was randomized across blocks.
Electrophysiological Recording and Analysis
The EEG (electroencephalogram) was recorded from 64 electrodes in a customized elastic cap (Electro-Cap International, Inc.) and referenced to the right mastoid during recording. Electrode impedances were maintained at less than 2 kΩ for the mastoids and the ground electrode, less than 10 kΩ for the vertical and horizontal eye electrodes, and less than 5 kΩ for the remaining electrodes. The 64 channels of EEG/EOG were continuously recorded with a band pass filter of 0.01–100 Hz and a gain of 1000 (SynAmps, Neuroscan Inc.). The raw signal was continuously digitized with a sampling rate of 500 Hz.
Eye blinks and eye movements were monitored by horizontal and vertical electrooculogram (EOG) electrodes for later rejection of trials with such artifacts. Vertical eye movements and eye blinks were detected by two electrodes placed below the orbital ridge of each eye, each referenced to the electrodes above the eye. Horizontal eye movements were monitored by two electrodes located at the outer canthi of the eyes. During recording, subjects were also monitored using a closed circuit video monitoring system to detect gross eye and/or head movements. Subjects who displayed an excessive degree of eye movement or blinking were excluded from further participation in the study, and any data collected from such subjects were discarded. Artifact rejection was performed off-line by discarding trials in which the EEG/EOG were contaminated by eye movements, eye blinks, excessive muscle activity, drifts or amplifier blocking. ERP averages to the various trial types were extracted by time-locked averaging from 500 ms before to 1000 ms after stimulus presentation and then digitally low-pass filtered with a nine-point moving average (which heavily filters out activity at and above 56 Hz at our 500-Hz digitization rate) and re-referenced to the algebraic average of the two mastoid electrodes. The analyses focused on the ERPs on the non-target trials, for both the RSVP and image streams, thereby avoiding the presence of any significant target-detection or motor-related activity in the ERPs.
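The relationship between the nine-point moving average and the quoted 56-Hz figure can be checked with a short sketch. The function names are illustrative, and the edge handling (shorter windows at the ends of the epoch) is an assumption, since the text does not describe it.

```python
import math

FS_HZ = 500      # digitization rate stated in the text
N_POINTS = 9     # length of the moving-average window

def moving_average(signal, n=N_POINTS):
    """Centered n-point moving average (zero phase shift).

    Assumption: near the edges, only the available samples are averaged.
    """
    half = n // 2
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

def gain(freq_hz, n=N_POINTS, fs=FS_HZ):
    """Magnitude response of an n-point moving average at freq_hz."""
    if freq_hz == 0:
        return 1.0
    x = math.pi * freq_hz / fs
    return abs(math.sin(n * x) / (n * math.sin(x)))

# First spectral null of the filter: fs / n, about 55.6 Hz.
first_null_hz = FS_HZ / N_POINTS
```

The first null falls at 500/9, or about 55.6 Hz, which matches the 56 Hz quoted above, while activity near the 6.67-Hz RSVP rate passes with a gain close to 1. Frequencies above the first null are attenuated rather than removed entirely; for example, the gain near 83 Hz is about 0.22.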
To evaluate the effect of attention on the steady-state modulation induced by the central letter stream, average EEG traces were computed for each subject on the channels of interest and the envelope of the steady-state visual evoked potential (SSVEP) signal was extracted by complex demodulation (Draganova and Popivanov, 1999; Makeig et al., 1996; Muller et al., 1998). More specifically, the averaged EEG epoch was multiplied by a complex sinusoid at the frequency of the RSVP stream (6.67 Hz), and the resultant waveform was then low-pass filtered with a zero phase-shift filter and a cutoff of 2 Hz. For each condition, mean amplitude was subsequently calculated from the complex demodulated waveforms between 0 and 500 ms after stimulus onset. The difference in oscillation amplitude between the attended condition and the unattended condition was tested using within-subject repeated-measures analyses of variance (ANOVAs). To evaluate the significance of the effect of attention on the P1 component elicited by the images, the mean amplitudes of the image ERP waves between 60 and 140 ms were measured for each subject and condition, and ANOVAs were performed on these amplitude measures.
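The complex demodulation step can be illustrated with a minimal sketch. It is not the analysis code used in the study: the low-pass stage here is a symmetric 75-sample boxcar (one 150-ms RSVP cycle at 500 Hz), an assumption standing in for the 2-Hz zero phase-shift filter described above, and the function name is hypothetical.

```python
import cmath
import math

FS_HZ = 500              # sampling rate from the text
F_RSVP_HZ = 1000 / 150   # 6.67 Hz: one character every 150 ms

def complex_demodulate(signal, freq_hz=F_RSVP_HZ, fs=FS_HZ, win=75):
    """Amplitude envelope of `signal` at freq_hz.

    Step 1: multiply by a complex sinusoid, shifting freq_hz down to 0 Hz.
    Step 2: low-pass with a symmetric (zero-phase) boxcar of `win` samples.
            Assumption: the boxcar replaces the 2-Hz filter in the text.
    Step 3: take twice the magnitude to recover the oscillation amplitude.
    Only samples with a full window are returned (len(signal) - win + 1).
    """
    shifted = [s * cmath.exp(-2j * math.pi * freq_hz * i / fs)
               for i, s in enumerate(signal)]
    half = win // 2
    envelope = []
    for i in range(half, len(shifted) - half):
        mean = sum(shifted[i - half: i + half + 1]) / win
        envelope.append(2 * abs(mean))
    return envelope

# Synthetic check signal (not study data): a 3-unit oscillation at 6.67 Hz.
test_signal = [3.0 * math.cos(2 * math.pi * F_RSVP_HZ * i / FS_HZ + 0.4)
               for i in range(500)]
test_envelope = complex_demodulate(test_signal)
mean_amplitude = sum(test_envelope) / len(test_envelope)
```

Averaging the envelope over the samples that fall between 0 and 500 ms would then give the per-condition mean amplitude that enters the ANOVA.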
Image-type difference waves were calculated for each attention condition by subtracting the average ERP evoked by the houses from the average ERP evoked by the faces. The ERPs and ERP difference waves for the individual subjects were grand averaged across subjects. Repeated-measures ANOVAs were performed on mean amplitudes of the ERP waveforms and difference waves in specific latency windows across subjects, relative to a 200 ms prestimulus baseline. In particular, activity at several occipital sites in a window around 160–170 ms (the hallmark N170 component) was analyzed for significant differences as a function of the factors of Attention (attended vs. unattended) and Object Type (face vs. house).
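As an illustration of these measures, the sketch below computes a baseline-corrected face-minus-house difference wave and its mean amplitude in a latency window. The function names and the index convention (epochs starting at -500 ms, sampled at 500 Hz, with window bounds floored to the nearest sample) are assumptions made for the example.

```python
FS_HZ = 500             # sampling rate from the text
EPOCH_START_MS = -500   # epochs run from -500 to +1000 ms

def ms_to_index(t_ms):
    """Sample index for a latency in ms (floored to the sampling grid)."""
    return (t_ms - EPOCH_START_MS) * FS_HZ // 1000

def mean_amplitude(erp, start_ms, end_ms):
    """Mean of the waveform over an inclusive latency window."""
    segment = erp[ms_to_index(start_ms): ms_to_index(end_ms) + 1]
    return sum(segment) / len(segment)

def baseline_correct(erp, baseline_ms=200):
    """Subtract the mean of the prestimulus baseline (-200 to 0 ms)."""
    base = mean_amplitude(erp, -baseline_ms, 0)
    return [v - base for v in erp]

def face_minus_house(face_erp, house_erp):
    """Image-type difference wave: face ERP minus house ERP."""
    face = baseline_correct(face_erp)
    house = baseline_correct(house_erp)
    return [f - h for f, h in zip(face, house)]
```

For each subject and attention condition, `mean_amplitude(face_minus_house(face, house), 135, 185)` would give the N170-window value of the kind entered into the ANOVA.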
Subjects performed both the RSVP digit-detection task and the blurry-image detection task well (detecting an average of 93.5 ± 8.2% of the targets in the image stream and 85.7 ± 11.6% of the targets in the RSVP stream) and showed similar reaction times on both tasks (an average of 465.3 ± 40.7 ms for targets in the image stream and 477.3 ± 40.8 ms for targets in the RSVP stream).
The presentation of the constant-rate RSVP stream of characters above fixation induced a steady-state oscillation in the EEG traces over bilateral occipital scalp (Figure 2A, blue trace). As expected (Muller and Hillyard, 2000), when attention was directed toward this stream, the amplitude of this oscillation was much larger (F(1,14) = 25.3, p < 0.0005; Figure 2A, red trace), reflecting the enhanced sensory processing of stimuli in an attended region of space. The images of the faces and houses in the other stream evoked the occipital P1 component 100 ms poststimulus that is characteristic of ERPs to visual stimuli (Figure 2B, blue trace). Also as expected (reviewed in Mangun, 1995), when subjects were attending to these images, the amplitude of the P1 to all the stimuli in the stream was greatly magnified (F(1,14) = 7.11, p < 0.02; Figure 2B, red trace), demonstrating the strong influence of spatial attention on processing in the early visual sensory pathways.
Figure 2. Effect of Spatial Attention on Visually Evoked ERP Responses. Grand average waveforms (n = 15) over occipital (visual) cortex demonstrating that the processing of stimuli at the attended location was enhanced. (A) Stimuli in the letter/digit stream, which were presented at a regular rate (6.67 Hz), produced a steady-state oscillation in the EEG trace. The amplitude of this oscillation was strongly enhanced when the letter stream was attended. (B) ERPs to non-target stimuli in the face/house image stream. When attention was directed to this stream, all the images in the stream evoked larger sensory responses, including a strongly enhanced sensory P1 component at 100 ms poststimulus.
More importantly, spatial attention had a profound influence on the face-specific processing reflected in the difference between the ERP responses to faces and the ERP responses to houses in the N170 latency range (135–185 ms; Figure 3). When subjects were attending to the image stream, faces evoked a substantially larger negative wave over lateral occipital cortex in the N170 latency range than houses (Figure 3A). This early face-specific activity was spatially focal, relatively right lateralized, and peaked at approximately 160 ms (Figure 3B), highly consistent with previously reported characteristics of the N170 component (e.g., Bentin et al., 1996). In contrast, when subjects were attending away from the image stream (i.e., attending to the RSVP stream), the difference between the N170-latency activity evoked by faces and houses was essentially eliminated (Figure 3C–D).
Figure 3. Effect of spatial attention on early face processing. Spatial attention enhances the processing of faces. (A) Grand average waveforms (n = 15) evoked by the non-target images during blocks in which the location of the image stream was attended. (B) Distribution of the difference potential (face ERP minus house ERP) from 135 to 185 ms poststimulus during blocks in which the image stream was attended. (C) Grand average waveforms evoked by the face and house images when they were ignored (i.e., during blocks in which attention was directed to the location of the RSVP letter stream). (D) Corresponding distribution of the face-minus-house difference potential from 135 to 185 ms poststimulus during blocks in which these images were unattended.
The observed influence of attention on this early face-selective processing was reflected statistically in several ways. First, there was a statistically significant interaction between Attention and Object Type (F(1,14) = 4.92, p < 0.05), revealed by a two-way repeated-measures analysis of variance (ANOVA) of the mean amplitude of the activity in the latency window around the N170. In addition, specific comparisons within the two attention conditions confirmed that the face-house difference in the N170 latency range in the attended condition was highly significant (F(1,14) = 8.82, p = 0.01), whereas there was no significant difference in the unattended condition (F(1,14) = 1.37, p = 0.26).
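For readers less familiar with this design, the interaction term in a 2 × 2 within-subject ANOVA can be computed as the squared one-sample t statistic on each subject's difference of differences, with 1 and n − 1 degrees of freedom (hence F(1,14) for 15 subjects). The sketch below illustrates that equivalence; the function name is hypothetical and the example values are made-up placeholders, not data from this study.

```python
import math
import statistics

def interaction_f(face_att, house_att, face_unatt, house_unatt):
    """F statistic for the Attention x Object Type interaction.

    Each argument is a list of per-subject mean amplitudes. The interaction
    F(1, n-1) equals t**2, where t is the one-sample t statistic of the
    per-subject (face - house | attended) - (face - house | unattended).
    """
    dd = [(fa - ha) - (fu - hu)
          for fa, ha, fu, hu in zip(face_att, house_att,
                                    face_unatt, house_unatt)]
    n = len(dd)
    t = statistics.mean(dd) / (statistics.stdev(dd) / math.sqrt(n))
    return t * t, (1, n - 1)

# Made-up illustrative amplitudes for four hypothetical subjects.
f_value, df = interaction_f(
    face_att=[-3.0, -2.0, -4.0, -3.0],
    house_att=[-1.0, -1.0, -1.0, -1.0],
    face_unatt=[-1.0, -1.0, -1.0, -1.0],
    house_unatt=[-1.0, -1.0, -1.0, -1.0],
)
```

The within-condition comparisons reported above follow the same logic, applied to the face-minus-house difference within a single attention condition.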
Our analyses focused on the ERP responses derived relative to the standard averaged-mastoid reference. Because of the lateral occipital focus of the N170 activity, however, this reference may have been somewhat less sensitive to the N170 effects than might be optimal. Accordingly, to ensure that our choice of reference did not bias or otherwise limit our results, we also derived ERP averages for all subjects and conditions with respect both to a fully averaged reference (i.e., referenced to the average of all the channels) and to a frontal reference (forehead sites). Although the N170 effect in the attended channel appeared to be slightly larger with these derivations, the analyses of these data were completely consistent with those using the mastoid-referenced data: a robust face-specific N170 effect when the images were attended, no such significant effect when attention was directed toward the letters, and a significant two-way interaction between Attention and Object Type.
In the present study, we investigated the influence of spatial attention on the face-specific N170 effect, believed to reflect the earliest stage at which face processing clearly and consistently diverges from the processing of other types of objects. To examine such face-specific processing, we compared the ERPs elicited by faces with the ERPs evoked by other objects (i.e., houses), under different spatial attention conditions. In agreement with previous reports, when subjects attended to the images, we found that the occipital N170-latency negative-wave response to faces was much larger than the response to houses. However, when attention was focused on a demanding task in another location, there was no significant difference between the ERPs to faces and houses in the N170 latency range. Thus, in contrast to various prior reports, our results indicate that face-specific processing is not automatic but requires the allocation of spatial attention.
Prior reports in the ERP literature have generally shown, in contrast to the results reported here, little or no effect of attention on the N170 elicited by faces (Carmel and Bentin, 2002; Cauquil et al., 2000; Holmes et al., 2003). MEG studies have yielded similar findings (Downing et al., 2001; Furey et al., 2006). These findings had been interpreted as indicating that face processing is relatively immune to the effects of endogenous processes such as the allocation of attention, thereby reinforcing modular accounts of face processing. On the other hand, hemodynamically based neuroimaging studies have suggested that face-specific processing is modulated by attention. For example, several fMRI studies found that the activity evoked in the fusiform gyrus was enhanced when subjects selectively attended to the faces when watching a display containing images of faces and houses (Wojciulik et al., 1998) or watching a display containing superimposed transparent faces and houses (Carmel and Bentin, 2002; Cauquil et al., 2000; Holmes et al., 2003; O’Craven et al., 1999). This evidence therefore argues against the fully automatic nature of face-specific processing. The discrepancy between findings from electrophysiological studies using ERP and MEG and neuroimaging studies using fMRI may have arisen from the differing temporal resolution of these methods. Hemodynamically based studies cannot resolve the time course of such attentional modulation and thus leave open the possibility that the influence of attentional allocation or task is limited to the later stages of face processing while the early processing of faces is fully automatic.
Nevertheless, there exists a discrepancy between our findings and those of most previous ERP studies, which may result from differences in the types of attentional manipulation employed. Prior studies (Carmel and Bentin, 2002; Cauquil et al., 2000; Downing et al., 2001; Furey et al., 2006; Lueschow et al., 2004) mainly focused on manipulating object-based attention rather than spatial attention. In other words, the stimuli used in those studies were all presented in attended locations, while the task relevance of faces was manipulated. Thus, the potentially highly robust influence of spatial attention on early face processing was not examined. Given the findings of numerous ERP reports that early sensory processing components, including the P1 at 100 ms, are strongly modulated by spatial attention, it seems quite reasonable that the N170 component, with its later onset, would also be affected by spatial attention. Therefore, the null effects of attentional modulation on N170 or M170 in previous studies mainly demonstrate that “object-based” attention has relatively little influence on early-latency face-specific processing.
A couple of prior ERP studies have more specifically investigated the effect of spatial attention on face processing. In a recent ERP study focusing on the effects of attention on emotional facial expression, a small enhancement of the N170 component was observed when subjects attended to a pair of face images in a display containing other objects relative to attending to a pair of house images in that display (Holmes et al., 2003). While this finding suggests that attending to faces can induce some modulation of the early responses evoked by images of faces, it does not address the question of whether the specific processing that faces receive requires attention. More specifically, the design of that experiment did not allow face-selective activity (e.g., an N170 effect) to be assessed in an unattended location, or to be compared with face-selective activity evoked by attended images. Whether face processing is largely automatic, therefore, was not resolved in that study.
A more recent study (Jacques and Rossion, 2007) showed larger effects of spatial attention on the amplitude of a negative component peaking at around 170 ms poststimulus; however, the ability to attribute the effect in this study to an attentional modulation of face-specific processing was rather limited. More specifically, these authors manipulated the difficulty of a centrally presented discrimination task and showed that a negative wave in the N170 latency range evoked by peripherally presented faces was strongly reduced when the central task was very demanding. However, it is well-known that spatial attention can enhance not just the P1 wave at 100 ms poststimulus, but also the occipital N1 component at 180 ms; this enhancement occurs for all visual stimuli (including faces). Accordingly, in order to isolate the influence of attention on “face-specific” processing, it is necessary to first extract face-specific processing by comparing ERP responses to faces with ERP responses to non-face objects. Face-specific activity evoked by stimuli presented in an attended location can then be directly compared to the face-specific activity evoked by the same stimuli when they are presented in a spatially unattended location. We have performed this analysis in the present study, using a paradigm in which spatial attention was manipulated while object-based attention was controlled (i.e., the content of the images – whether they contained a face or a house – was irrelevant to the performance of the blurry-image detection task). Furthermore, by extracting the face-specific component in terms of the differential processing effects between faces and houses, we were able to demonstrate a clear and robust modulation of this early face-specific activity due to spatial attention.
We note further that the essential elimination of a significant face-specific N170 effect in the unattended channel in the present study occurred under circumstances where attention was strongly directed toward a very demanding task in the attended channel (the RSVP task). It may be that lower-load conditions in the task-relevant channel would allow additional processing in an unattended channel (Lavie, 2006), such that significant levels of early face-house discrimination activity (such as that reflected in the N170) could be elicited. Future studies will be important for delineating the relationship between the degree of attentional load and the ability of the brain to rapidly discriminate faces from other visual objects.
In our study, the effects of spatial attention on the basic early sensory processing of the stimuli were clear, as evidenced by the strong enhancement of the P1 response at 100 ms for attended images (as well as by the large attentional enhancement of the steady-state responses to the letter stream). The P1 effect was followed in the image ERPs by a large attentional modulation of the face-specific N170 effect. This pattern fits the hypothesis that the attentional modulation of the early visual sensory responses ramifies forward to substantially gate the differential processing of faces shortly afterward in visual cortical processing. Such a result also argues against there being much input to the N170 face-specific brain activity from any highly automatic, alternate pathway specific for face processing information (e.g., Morris et al., 2001; Ohman, 2002) that circumvents the feedforward early sensory cortical pathways in extrastriate visual cortex and thereby in turn circumvents the pervasive influence that spatial attention has been shown to exercise on these pathways (e.g., Heinze et al., 1994; Mangun, 1995; Moran and Desimone, 1985; Motter, 1993; Poghosyan et al., 2005; Posner and Gilbert, 1999; Smith et al., 2006; Woldorff et al., 1997).
In summary, our results clearly rule out any account of early face discrimination mechanisms that stipulates independence from the allocation of spatial attention. When faces appeared in an unattended spatial location, even the initial face-specific processing indexed by the differential N170 response was essentially absent. The early processing that differentiates faces from non-face objects thus strongly depends on endogenous factors such as the distribution of spatial attention. These results further suggest that the processing of faces may in fact be more similar to that applied to other highly significant stimuli than the current prevailing view indicates. Moreover, these findings underscore the extensive reach of visual attention in influencing the sensory processing of all stimuli in our environment, including at early stages of that processing.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We thank Tineke Grent-’t-Jong for her helpful technical assistance and input. Supported by NIH grants (2P01-NS41328-Proj.1, R01-MH60415, and R01-NS51048) to M.G.W.