The Other-Race-Effect on Audiovisual Speech Integration in Infants: A NIRS Study

Ujiie, Yuta; Kanazawa, So; Yamaguchi, Masami K.

doi:10.3389/fpsyg.2020.00971

ORIGINAL RESEARCH article

Front. Psychol., 15 May 2020

Sec. Cognition

Volume 11 - 2020 | https://doi.org/10.3389/fpsyg.2020.00971

Yuta Ujiie¹,²,³*

So Kanazawa⁴

Masami K. Yamaguchi⁵

¹Graduate School of Psychology, Chukyo University, Aichi, Japan
²Research and Development Initiative, Chuo University, Tokyo, Japan
³Japan Society for the Promotion of Science, Tokyo, Japan
⁴Department of Psychology, Japan Women’s University, Kawasaki, Japan
⁵Department of Psychology, Chuo University, Tokyo, Japan

Previous studies have revealed perceptual narrowing for own-race faces in face discrimination, but this phenomenon is poorly understood in face and voice integration. We focused on infants’ brain responses to the McGurk effect to examine whether the other-race effect occurs in the activation patterns. In Experiment 1, we conducted fNIRS measurements to identify brain activation associated with the McGurk effect in Japanese 8- to 9-month-old infants and to examine the difference between the activation patterns in response to own-race-face and other-race-face stimuli. We used two race-face conditions, own-race-face (East Asian) and other-race-face (Caucasian), each of which contained audiovisual-matched and McGurk-type stimuli. While the infants (N = 34) were observing each speech stimulus for each race, we measured cerebral hemoglobin concentrations in bilateral temporal brain regions. The results showed that in the own-race-face condition, audiovisual-matched stimuli induced activation of the left temporal region, and the McGurk stimuli induced activation of the bilateral temporal regions. No significant activations were found in the other-race-face condition. These results indicate that the McGurk effect occurred only in the own-race-face condition. In Experiment 2, we used a familiarization/novelty preference procedure to confirm that the infants (N = 28) could perceive the McGurk effect in the own-race-face condition but not in the other-race-face condition. The behavioral data supported the results of the fNIRS data, implying the presence of narrowing for the own-race face in the McGurk effect. These results suggest that narrowing of the McGurk effect may be involved in the development of relatively high-order processing, such as face-to-face communication with people surrounding the infant. We discuss the hypothesis that perceptual narrowing is a modality-general, pan-sensory process.

Introduction

Humans’ perceptual systems develop to adapt to the surrounding environment. During infant development, exposure to specific faces and languages influences sensitivity to faces and speech, a process called perceptual narrowing (Werker and Tees, 1983, 2002; Pascalis et al., 2002; Lewkowicz and Ghazanfar, 2006). For example, 6-month-old infants can discriminate individual human and monkey faces, but 9-month-old infants can discriminate only individual human faces (Pascalis et al., 2002). Even within human faces, perceptual narrowing occurs: 3-month-old infants can recognize both own- and other-race faces, but the ability to recognize other-race faces is diminished in infants older than 6 months, which is known as the other-race effect (Kelly et al., 2007, 2009). In speech perception, English-learning infants aged 6–8 months can discriminate phonetic contrasts in their native language (English) as well as in a non-native language (Hindi), but infants older than 10 months cannot discriminate non-native phonetic contrasts that do not exist in their native language (Werker and Tees, 1983, 2002). Furthermore, a couple of studies have reported narrowing in the perception of musical rhythms (Hannon and Trehub, 2005a, b). These studies demonstrated that 12-month-old infants show an adult-like, culture-specific response pattern to musical rhythms (Hannon and Trehub, 2005b), in contrast to the culture-general response that is evident at 6 months of age (Hannon and Trehub, 2005b). Thus, infants’ perceptual sensitivity to faces, spoken languages, and even musical rhythms is broad in the early months of development and narrows gradually by the end of the first year.

The timing of the emergence of narrowing is shared between face perception and speech perception, although how the speed of perceptual narrowing in the two domains is related remains under discussion. Recent studies have investigated the correlations between perceptual narrowing in the face and speech domains (e.g., Krasotkina et al., 2018; Xiao et al., 2018). These studies have suggested that, in infants older than 8 months, the speed of the developmental trajectory of perceptual narrowing in the speech domain is not necessarily correlated with that in the face domain. Whether the narrowing process is driven by modality-general mechanisms (e.g., Pascalis et al., 2002) or by modality-specific mechanisms (e.g., Krasotkina et al., 2018) remains unclear.

Some studies have suggested that experience plays a role in the development of multisensory perception, especially audiovisual speech perception (Lewkowicz and Ghazanfar, 2006; Pascalis et al., 2014). This implies that perceptual narrowing is a modality-general, pan-sensory process. That is, basic and broadly tuned abilities of audiovisual speech perception are present in the early months and are gradually tuned to match the infant’s environment during the first year of life (Lewkowicz and Ghazanfar, 2009; Pascalis et al., 2014). Indeed, with increased exposure to the native language, an infant’s ability for audiovisual speech matching (Kuhl and Meltzoff, 1982; Patterson and Werker, 1999) becomes limited, by 11 months of age, to the specific phonemes that are present in the native language (Pons et al., 2009). In addition to language experience, Lewkowicz and Ghazanfar (2006) and Lewkowicz et al. (2008) demonstrated one aspect of the role of visual experience by measuring infants’ sensitivity to audiovisual associations for rhesus monkey vocalizations. They presented 4- to 10-month-old infants with two side-by-side rhesus monkey faces producing a coo call and a grunt call in the presence of one of the corresponding auditory calls. The 4- to 6-month-old infants preferred the face corresponding to the auditory call, but the 8- and 10-month-old infants did not show this preference (Lewkowicz and Ghazanfar, 2006).

However, no prior study has provided evidence for the role of experience with own-race faces in the development of audiovisual speech perception. Here, we tested this issue in the context of the McGurk effect (McGurk and MacDonald, 1976). The McGurk effect is a well-known illusion that demonstrates the influence of visual speech on voice perception (McGurk and MacDonald, 1976). For example, when a movie of a mouth articulating the phoneme /ka/ is dubbed with a voice uttering a different phoneme, /pa/, observers tend to perceive an intermediate phoneme (/ta/). The McGurk effect is widely used as an index of the robustness of the influence of visual speech in adults (Sekiyama and Tohkura, 1991; Ujiie et al., 2018a) and children (Massaro et al., 1986; Sekiyama and Burnham, 2008). The McGurk effect has been observed from the preverbal stage of infant development (Rosenblum et al., 1997; Desjardins and Werker, 2004). By 4 months of age, infants can discriminate auditory syllables (e.g., Eimas et al., 1971; Jusczyk et al., 1978) and match an auditory voice with facial speech (e.g., Kuhl and Meltzoff, 1982, 1984; Patterson and Werker, 1999, 2002). At around 5 months of age, infants can integrate a voice with incongruent facial speech and perceive the McGurk effect, regardless of syllable combination (Rosenblum et al., 1997; Desjardins and Werker, 2004). Rosenblum et al. (1997) habituated 5-month-old infants to auditory /va/ paired with visual /va/ and then presented them with two test stimuli: auditory /ba/ with visual /va/, which causes the McGurk effect (/va/), and auditory /da/ with visual /va/, which is perceived as /da/. The infants showed dishabituation to the stimulus of auditory /da/ with visual /va/. Thus, the infants could integrate auditory /ba/ with visual /va/ and perceive the McGurk effect (/va/) as adults do. Desjardins and Werker (2004) also demonstrated the McGurk effect in infancy by using the stimulus of auditory /bi/ with visual /vi/, which causes the McGurk effect (/vi/) in adults.

This study sheds light on differences in brain responses to own-race and other-race faces in the McGurk effect. Previous studies have reported differences in brain responses during face processing between own-race and other-race conditions (e.g., Balas et al., 2011; Timeo et al., 2019) and during speech processing between native and non-native speech (e.g., Kuhl et al., 2014). However, such differences have not yet been reported for the McGurk effect. The neural basis of the McGurk effect has been investigated in populations ranging from infants (Kushnerenko et al., 2008) to adults (e.g., Beauchamp et al., 2010; Nath and Beauchamp, 2012). Several functional magnetic resonance imaging (fMRI) studies showed that the left superior temporal sulcus (STS), an area critical for the integration of auditory and visual speech information (Calvert et al., 2000), is responsible for the occurrence of the McGurk effect as well as the processing of audiovisual congruent syllables (in children, Nath et al., 2011; in adults, Nath and Beauchamp, 2012). In infants, Kushnerenko et al. (2008) examined the neural basis of the McGurk effect by using event-related brain potentials (ERPs). Their results showed that the ERP responses to the McGurk-type stimulus (audio /ba/ with visual /ga/) were similar to those to the audiovisual-matched stimulus (audio /ba/ with visual /ba/) rather than to those to the audiovisual-mismatched stimulus (audio /ga/ with visual /ba/).

In this study, we used a functional brain imaging technique, functional near-infrared spectroscopy (fNIRS), to measure infant brain activity. This technique is reliable and valid for measuring brain activity in infants and is easier to use with infants than fMRI. Previous studies from our research group have revealed increased hemodynamic responses in the temporal regions of the infant brain during the processing of faces (Otsuka et al., 2007; Kobayashi et al., 2018), color (Yang et al., 2016), and audiovisual matching of material information (Ujiie et al., 2018b). In particular, it has been shown that the cerebral hemoglobin concentrations measured in the bilateral temporal regions include brain activity in the STS area (e.g., Otsuka et al., 2007; Ujiie et al., 2018b). Based on these studies, we considered fNIRS to be informative for investigating how experience with faces of different races affects the development of audiovisual speech integration in infants.

In summary, the present study focused on infants’ brain responses to the McGurk effect to examine whether the other-race effect occurs in the activation patterns. In Experiment 1, we conducted fNIRS measurements to identify brain activation associated with the McGurk effect in Japanese 8- to 9-month-old infants and to examine the difference between the activation patterns for own-race-face and other-race-face stimuli. We hypothesized that the left temporal region would be selectively activated in response to McGurk speech and audiovisual-matched speech from the own-race face but not from the other-race face. To support the fNIRS data, we tested whether the infants could perceive the McGurk effect only in the own-race face and not in the other-race face by using a familiarization/novelty preference procedure (Experiment 2).

Experiment 1

Methods

Participants

All infants were full term at birth (37+ weeks) and were healthy at the time of the experiments. The participants were 34 healthy Japanese infants (17 in the own-race-face condition and 17 in the other-race-face condition) aged 8–9 months (25 girls and 9 boys; mean age = 246.5 days, range = 226–283 days), all of whom grew up in Japan. An additional 12 infants were excluded because of an insufficient number of successful trials (fewer than three trials for each condition) due to fussiness or motion artifacts. Ethical approval for this study was obtained from the local Ethical Committee. Written informed consent was obtained from the parents of the participants.

Stimuli

We assigned 17 infants to the own-race condition and 17 infants to the other-race condition. We then measured brain activity in the infants using the ETG-4000 system (Hitachi Medical Systems, Tokyo, Japan), the reliability, benefit, and variability of which were validated in our previous studies (e.g., Yang et al., 2016; Ujiie et al., 2018b).

For the own-race-face and other-race-face conditions, we used audiovisual speech stimuli that were created from recordings of two women’s utterances of three syllables (/pa/, /ta/, and /ka/). In order to reduce possible differences in accent between the English and Japanese speakers, we used infant-directed speech (IDS), which has been shown to be relatively similar regardless of the language (e.g., Piazza et al., 2017). The speakers were two women, a Japanese East-Asian (22 years old) and an English Caucasian (23 years old), both of whom were monolingual speakers. The visual stimuli (800 pixels × 450 pixels) were recordings of the speakers’ faces, made using a digital video camera (GZ-EX370; JVC Kenwood, Yokohama, Japan). The voices (digitized at 48 kHz with a 16-bit quantization resolution) were recorded using a dynamic microphone (MD42; Sennheiser, Wedemark, Germany). The visual and auditory stimuli were combined to create two matched and two McGurk stimuli using Adobe Premiere Pro CS6 (Adobe Systems, San Jose, CA, United States). For the McGurk stimuli, we combined the /pa/ voice with the facial movement for /ka/ by aligning the onset of the voice (/pa/) with the onset of the original utterance (/ka/). The congruency of the stimuli was based on the speech sound. The McGurk stimuli consisted of a voiced /pa/ with an incongruent articulation /ka/. Pink noise was added to the voices (the signal-to-noise ratio was 0 dB) to induce perception of the McGurk effect (e.g., Sekiyama and Tohkura, 1991; Ujiie et al., 2015). Finally, we created matched stimuli (auditory /pa/ and visual /pa/) and McGurk stimuli (auditory /pa/ and visual /ka/) for two speakers of different races (East Asian and Caucasian).

Apparatus

A 21-inch color cathode ray tube display with a resolution of 1,024 pixels × 768 pixels was used to present the visual stimuli. The display was placed in front of the infant at a distance of 40 cm. A pinhole camera was set below the display to monitor the infant’s looking behavior. The audio stimuli were presented at a sound pressure level of approximately 60 dB through two loudspeakers placed on the left and right sides of the display.

The Hitachi ETG-4000 system (Hitachi Medical, Japan) was used to record the hemodynamic response simultaneously from 24 channels, with 12 channels each over the right and left temporal areas. The instrument generated light at two wavelengths (695 and 830 nm) and measured the time course of changes in oxy-Hb, deoxy-Hb, and total-Hb with a 0.1-s time resolution. We used a pair of probes, each containing nine optical fibers (3 × 3 arrays) with five light emitters and four detectors. The optical fibers of each probe were kept in place with a soft silicon holder, and the inter-fiber distance was set at 2 cm. According to the International 10–20 EEG system, the centers of the probes were placed at the T3 and T4 positions for the measurement of the bilateral temporal regions (Figure 1). After positioning the probes, the experimenter checked whether the signals of the channels were adequate for measuring the hemodynamic responses via the ETG-4000 system, which automatically detects whether the probes are in correct contact with the infant’s scalp. Channels were rejected from the analysis if adequate contact between the fibers and scalp could not be achieved because of interference from hair.
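The channel count follows from the probe geometry. As a minimal sketch, assuming that emitters and detectors alternate in a checkerboard pattern and that each channel is one pair of directly neighbouring emitter and detector (the layout is an assumption; the text gives only the fiber counts), a 3 × 3 array yields 12 channels per probe:

```python
def probe_layout(rows=3, cols=3):
    """Return (emitters, detectors, channels) for a checkerboard optode grid.

    Assumption: emitters sit where (row + col) is even, and a channel is
    any pair of directly neighbouring optodes of different type.
    """
    def is_emitter(r, c):
        return (r + c) % 2 == 0

    emitters = sum(is_emitter(r, c) for r in range(rows) for c in range(cols))
    detectors = rows * cols - emitters

    channels = 0
    for r in range(rows):
        for c in range(cols):
            # Look only right and down so each pair is counted once.
            for rr, cc in ((r, c + 1), (r + 1, c)):
                if rr < rows and cc < cols and is_emitter(r, c) != is_emitter(rr, cc):
                    channels += 1
    return emitters, detectors, channels


emitters, detectors, channels_per_probe = probe_layout()
total_channels = 2 * channels_per_probe  # one probe over each temporal region
```

Under these assumptions the function gives five emitters, four detectors, and 12 channels per probe, which agrees with the 24 channels reported for the two probes.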

Figure 1. Location of the measurement channels in the current study.

Procedure

Each infant was seated on his or her parent’s or an experimenter’s lap. The viewing distance was approximately 40 cm. The sequence of the stimulus presentation consisted of a baseline trial and two test trials (Figure 2). One test trial consisted of three presentations of the matched stimuli, and the other consisted of three presentations of the McGurk stimuli. The duration of each test trial was 9.6 s. The test trials alternated with the baseline trials. During the baseline trial, dynamic random dot patterns (800 pixels × 450 pixels) were displayed together with auditory white noise once every 3.2 s. The duration of the baseline trial was controlled by the experimenter and was at least 9.6 s. The presentation order of the two test trials was randomly counterbalanced across infants. Each test trial was shown to the infants a maximum of eight times.

Figure 2. An example of the order of stimulus presentation.

The infants looked at the stimuli passively while their brain activity was recorded. They were allowed to look at the stimuli as long as they were willing to. Their behavior was recorded digitally throughout the experiment.

Data Analysis

According to the exclusion criteria of previous studies (e.g., Yang et al., 2016; Kobayashi et al., 2018), we removed trials from analysis if (1) the infants’ looking time in the test period was less than 60% of the total duration of the test period or if they became fussy, (2) the infant looked back to the experimenter’s or parent’s face during the preceding baseline period, or (3) motion artifacts were detected by the analysis of sharp changes in the time courses of the raw oxy-Hb data.
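The trial-level criteria above, together with the participant-level rule of at least three valid trials per condition, can be sketched as follows. This is an illustrative reading only: the function names and the trial record layout are hypothetical, and the artifact flag is assumed to be computed elsewhere.

```python
TEST_DURATION_S = 9.6        # duration of one test trial (from the Procedure)
MIN_LOOKING_FRACTION = 0.6   # criterion (1)
MIN_VALID_TRIALS = 3         # per condition (from the Participants section)


def trial_is_valid(looking_time_s, fussy, looked_back_in_baseline, motion_artifact):
    """Apply exclusion criteria (1)-(3) to a single test trial."""
    looked_enough = looking_time_s >= MIN_LOOKING_FRACTION * TEST_DURATION_S
    return looked_enough and not (fussy or looked_back_in_baseline or motion_artifact)


def infant_is_included(trials_by_condition):
    """trials_by_condition maps a condition name to a list of trial dicts.

    Each dict uses the keyword names of trial_is_valid (hypothetical layout).
    """
    return all(
        sum(trial_is_valid(**trial) for trial in trials) >= MIN_VALID_TRIALS
        for trials in trials_by_condition.values()
    )
```

With a 9.6-s test trial, the 60% looking criterion corresponds to 5.76 s of looking.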

We used a Hitachi ETG system to convert the light intensity data of the two wavelengths into Hb concentrations. The values of oxy-Hb, deoxy-Hb, and total-Hb in each channel were calculated from the difference in intensity between the two wavelengths of light (695 and 830 nm) based on the modified Beer-Lambert law. After the conversion, we checked for motion artifacts using the formula and criteria of previous NIRS studies (e.g., Yang et al., 2016; Kobayashi et al., 2018). We first calculated a value by dividing the average raw data at four time points (mM × mm) by the average raw data at the four time points thereafter. If the value was larger than 0.8, we judged that the trial included a body movement artifact and removed it from the analysis.
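To illustrate the conversion step, the following is a minimal sketch of the modified Beer-Lambert law for two wavelengths. It is not the ETG system's internal routine, which the text does not describe. The extinction coefficients are illustrative placeholders rather than tabulated values, and the function names are hypothetical. The outputs are products of concentration and path length, consistent with the mM × mm unit mentioned above.

```python
import math

# Illustrative placeholder extinction coefficients (NOT tabulated values):
# wavelength in nm -> (epsilon_oxy, epsilon_deoxy). Deoxy-Hb absorbs more
# strongly at 695 nm and oxy-Hb at 830 nm.
EPSILON = {695: (0.4, 2.0), 830: (1.0, 0.8)}


def delta_absorbance(i_baseline, i_test):
    """Change in optical density computed from detected light intensities."""
    return -math.log10(i_test / i_baseline)


def mbll(d_a_695, d_a_830, eps=EPSILON):
    """Solve the 2 x 2 modified Beer-Lambert system by Cramer's rule.

    Model: dA(wavelength) = eps_oxy * oxy + eps_deoxy * deoxy, where oxy and
    deoxy are concentration x path-length products.
    Returns (oxy, deoxy, total).
    """
    (a, b), (c, d) = eps[695], eps[830]
    det = a * d - b * c
    oxy = (d_a_695 * d - b * d_a_830) / det
    deoxy = (a * d_a_830 - c * d_a_695) / det
    return oxy, deoxy, oxy + deoxy
```

Because two wavelengths give two equations in two unknowns, the system has a unique solution as long as the two chromophores have different absorption ratios at the two wavelengths.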

The raw Hb concentration changes from the individual channels were digitally band-pass-filtered at 0.02–1.0 Hz to remove longitudinal signal drift and noise from the instrument. We averaged the raw data of each channel across trials within each participant in a time series from 3 s before the test trial onset to 10 s after the test trial offset. From the time series of raw data of oxy-, deoxy-, and total-Hb, we calculated the Z-scores at each time point separately for the Match and McGurk conditions. The Z-scores, which express the difference between the test condition and the baseline mean, were calculated using the following formula:

$$Z = \frac{\mathrm{Test} - M_{\mathrm{baseline}}}{S}$$

where Test represents the raw data values at each time point during test trials. For the value of M_baseline, we used the mean of the raw data during the 3 s immediately before the beginning of each test trial. S indicates the standard deviation of the raw data during the same time period as M_baseline.
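The normalization can be sketched as follows for one trial-averaged channel series. This is a minimal illustration, assuming that the series starts 3 s before test onset, that it is sampled every 0.1 s (the time resolution reported for the instrument), and that S is the sample standard deviation (the text does not specify which estimator was used). The band-pass filtering step is not included, and the function name is hypothetical.

```python
from statistics import mean, stdev

SAMPLE_RATE_HZ = 10   # 0.1-s time resolution reported for the ETG-4000
BASELINE_S = 3        # baseline window immediately before test onset


def z_score_series(series):
    """Convert one trial-averaged channel series to Z-scores.

    `series` is assumed to start 3 s before test onset, so its first
    30 samples form the baseline window.
    """
    n_base = SAMPLE_RATE_HZ * BASELINE_S
    baseline = series[:n_base]
    m_baseline = mean(baseline)
    s = stdev(baseline)  # sample SD; an assumption, see the lead-in
    return [(x - m_baseline) / s for x in series]
```

By construction, the baseline samples have a mean of zero and a standard deviation of one after the transformation, so values from different channels and infants share a common scale.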

Results

Hemodynamic data were obtained from 34 infants, each of whom contributed at least three valid trials for each test trial. On average, we obtained approximately five valid trials for each test trial in the two conditions. In the own-race-face condition, there were, on average, five valid trials (SD = 1.50, range: 3–7) for Match and five valid trials (SD = 1.00, range: 3–6) for McGurk. In the other-race-face condition, there were, on average, 4.6 valid trials (SD = 1.17, range: 3–7) for Match and 4.7 valid trials (SD = 1.10, range: 3–6) for McGurk. We normalized the raw data of the hemodynamic responses using the mean and standard deviation (SD) of the baseline period for each channel and each participant before applying statistical analyses, because the raw data could not be averaged directly between participants and channels. Subsequently, we averaged the Z-scores of the oxygenated hemoglobin (oxy-Hb) across the 12 channels in each hemisphere and compared them to the baseline. Figure 3 shows the time course of the average changes in concentration for oxy-Hb and deoxy-Hb during the presentation of the Match and McGurk trials for each race-face condition (results for total-Hb change are provided in the Supplementary Information). In the own-race-face condition, the oxy-Hb concentration in the left temporal region increased during both Match and McGurk trials. This activation peaked and started to return toward the baseline level between 12 and 16 s after stimulus onset. Such activation was not observed in the other-race-face condition.

Figure 3. Time course of the changes in the oxygenated hemoglobin (Oxy-Hb) and deoxygenated hemoglobin (Deoxy-Hb) concentrations. Oxy-Hb and Deoxy-Hb concentrations were separately averaged in all groups during each condition in the left and right temporal regions. Panel (A) shows the results for the own-race-face condition, and (B) shows the results for the other-race-face condition. Solid lines represent the change in Oxy-Hb, and dotted lines represent the change in Deoxy-Hb. Blue lines and red lines represent the mean Z-score during the Match and McGurk trials, respectively. The vertical dashed lines at 0 and 9.6 s indicate the onset and offset of the test stimulus presentation, respectively.

In order to examine whether each temporal region was activated in response to audiovisual speech integration for each race-face, we conducted two-tailed one-sample t-tests against a zero response (baseline). As is common in infant fNIRS studies (e.g., Issard and Gervain, 2017), we focused on the concentration of oxy-Hb. First, to select the time window for averaging the data, we compared the oxy-Hb concentrations in each hemisphere in each condition against baseline (z = 0) with cluster-based permutation tests. Such tests, which have been used successfully in previous studies, take temporal adjacency into account by clustering together samples that show a significant effect if they are adjacent in time (e.g., Maris and Oostenveld, 2007; Benavides-Varela and Gervain, 2017; Issard and Gervain, 2017). We first performed t-tests against baseline at each data point and then temporally grouped adjacent data points whose t-value exceeded a standard threshold (t = 2), following previous studies (e.g., Maris and Oostenveld, 2007; Benavides-Varela and Gervain, 2017). These analyses revealed a data-driven time window from 12 to 16 s after stimulus onset.
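The clustering procedure can be sketched as follows. This is an illustrative implementation under stated assumptions, not the analysis code used in the study: the null distribution is built by randomly flipping the sign of each infant's time series (a common choice for one-sample designs), the cluster statistic is the sum of t-values, only positive clusters are considered, and all function names and the synthetic data are hypothetical.

```python
import math
import random
from statistics import mean, stdev


def t_values(data):
    """One-sample t against zero at each time point; data[subject][time]."""
    n = len(data)
    tvals = []
    for samples in zip(*data):
        sd = stdev(samples)
        tvals.append(mean(samples) / (sd / math.sqrt(n)) if sd > 0 else 0.0)
    return tvals


def find_clusters(tvals, threshold=2.0):
    """Return (first_index, last_index, mass) for runs with t > threshold."""
    clusters, start = [], None
    for i, t in enumerate(tvals + [float("-inf")]):  # sentinel ends a final run
        if t > threshold and start is None:
            start = i
        elif t <= threshold and start is not None:
            clusters.append((start, i - 1, sum(tvals[start:i])))
            start = None
    return clusters


def cluster_permutation_test(data, threshold=2.0, n_perm=500, seed=0):
    """Return observed clusters as (first, last, mass, p_value)."""
    rng = random.Random(seed)
    observed = find_clusters(t_values(data), threshold)
    null_max = []
    for _ in range(n_perm):
        # Under the null of no response, each infant's sign is exchangeable.
        signs = [rng.choice((-1, 1)) for _ in data]
        flipped = [[s * x for x in row] for s, row in zip(signs, data)]
        masses = [m for _, _, m in find_clusters(t_values(flipped), threshold)]
        null_max.append(max(masses, default=0.0))
    return [
        (first, last, mass,
         (1 + sum(m >= mass for m in null_max)) / (n_perm + 1))
        for first, last, mass in observed
    ]


def synthetic_data(n_subjects=17, n_samples=20, seed=1):
    """Synthetic Z-scores with a response in samples 10-14 (not real data)."""
    rng = random.Random(seed)
    return [
        [rng.gauss(1.0 if 10 <= t <= 14 else 0.0, 0.5) for t in range(n_samples)]
        for _ in range(n_subjects)
    ]
```

Calling `cluster_permutation_test(synthetic_data())` returns one tuple per observed cluster. Comparing each observed cluster mass with the distribution of the largest permuted cluster mass controls for the many time points tested, which is why the procedure can be used to select a time window without a separate correction for each sample.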

We then performed statistical analyses with mean Z-scores during the 12–16 s after stimulus onset in the left and right temporal regions (Figure 4). A planned two-tailed one-sample t-test with a zero response as the baseline was conducted for each region, with reference to previous fNIRS studies in infants (e.g., Ujiie et al., 2018b). In the own-race-face condition, the concentration of oxy-Hb in the left temporal region increased significantly during both the Match [t(16) = 2.77, p = 0.028, false discovery rate (FDR) corrected, d = 0.92] and McGurk trials [t(16) = 2.55, p = 0.028, FDR corrected, d = 0.86]. In the right temporal region, the concentration of oxy-Hb increased significantly during the McGurk trials [t(16) = 4.55, p = 0.001, FDR corrected, d = 1.43] but not during the Match trials [t(16) = 1.41, p = 0.116, FDR corrected, d = 0.57]. In contrast, changes in the concentration of deoxy-Hb did not reach significance in either temporal region, neither in the right temporal region during the Match [t(16) = 1.56, p = 0.22, FDR corrected, d = 0.50] and McGurk trials [t(16) = 0.94, p = 0.46, FDR corrected, d = 0.57] nor in the left temporal region during the Match [t(16) = 1.46, p = 0.28, FDR corrected, d = 0.54] and McGurk trials [t(16) = 1.67, p = 0.36, FDR corrected, d = 0.33]. In the other-race-face condition, no significant activation for the concentration of oxy-Hb was found in the left temporal region during the Match [t(16) = 0.63, p = 0.99, FDR corrected, d = 0.21] and McGurk trials [t(16) = 0.65, p = 0.70, FDR corrected, d = 0.22] or in the right temporal region during the Match [t(16) = 0.11, p = 0.91, FDR corrected, d = 0.04] and McGurk trials [t(16) = 0.79, p = 0.99, FDR corrected, d = 0.27]. For the concentration of deoxy-Hb, no significant activation was found in the left temporal region during the Match [t(16) = 1.83, p = 0.35, FDR corrected, d = 0.61] and McGurk trials [t(16) = 0.26, p = 0.99, FDR corrected, d = 0.09] or in the right temporal region during the Match [t(16) = 1.23, p = 0.47, FDR corrected, d = 0.41] and McGurk trials [t(16) = 0.25, p = 0.99, FDR corrected, d = 0.09].
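The two building blocks of this analysis, a one-sample t statistic against zero and an FDR adjustment across tests, can be sketched as follows. The text does not name the FDR procedure, so the Benjamini-Hochberg step-up adjustment is assumed here. Function names are hypothetical, and the conversion from t to p is omitted because it requires the t distribution, which is not in the Python standard library.

```python
import math
from statistics import mean, stdev


def one_sample_t(values, mu=0.0):
    """Return (t, degrees_of_freedom) for a one-sample t-test against mu."""
    n = len(values)
    t = (mean(values) - mu) / (stdev(values) / math.sqrt(n))
    return t, n - 1


def benjamini_hochberg(p_values):
    """Return FDR-adjusted p-values in the original order (assumed procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With 17 infants per condition, `one_sample_t` returns 16 degrees of freedom, matching the t(16) values reported above. The adjustment scales each raw p-value by the number of tests divided by its rank, then takes a running minimum so that the adjusted values never decrease as the raw values increase.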

Figure 4. Mean Z-scores for oxygenated hemoglobin (Oxy-Hb) and deoxygenated hemoglobin (Deoxy-Hb) in all groups for the left temporal (left) and right temporal (right) regions. Panel (A) shows the results for Oxy-Hb (in the left panel) and Deoxy-Hb (in the right panel) for the own-race-face condition and (B) shows the results for the other-race-face condition. Each bar represents the mean Z-score for Oxy-Hb (or Deoxy-Hb) averaged across 12–16 s after stimulus onset. Blue bars and red bars represent the results for the Match and McGurk conditions, respectively. The error bars represent the 95% confidence interval of the mean. Asterisks indicate the significance level of the statistical differences against baseline (0): *p < 0.05, **p < 0.01.

A further analysis was conducted to examine the cortical areas that potentially exhibit brain activity related to audiovisual speech integration. Based on the locations of the 10–20 cortical projection points, individual channels in the fNIRS measurement can be estimated to represent anatomical areas in the infant brain (Lloyd-Fox et al., 2014). We then conducted one-sample t-tests against the zero response (baseline) on the Z-scores of oxy-Hb for each channel separately. The significantly activated channels are summarized in Table 1. In our channel arrangement (Figure 1), the left superior temporal channel (ch 4) can be considered to reflect the activation of the left superior temporal region, according to the anatomical regions for the projection of each fNIRS channel in Lloyd-Fox et al. (2014). In this channel, oxy-Hb increased significantly during both the Match [ch 4; t(16) = 3.38, p = 0.046, FDR corrected, d = 1.11] and McGurk trials [ch 4; t(16) = 3.52, p = 0.034, FDR corrected, d = 1.13] in the own-race-face condition. In contrast, no significant activation was found in the other-race-face condition during the Match [ch 4; t(16) = 1.36, p = 0.29, FDR corrected, d = 0.27] or McGurk trials [ch 4; t(16) = 0.88, p = 0.52, FDR corrected, d = 0.22]. The responses from channel 4 could be assumed to be associated with the activation of the left superior temporal area, which is related to the processing of audiovisual speech (e.g., Calvert et al., 2000; Nath and Beauchamp, 2012). To summarize, the individual channel analysis indicated significant differences in oxy-Hb responses in the left temporal regions between the two race-face conditions, which suggests that the left superior temporal area may be selectively activated in response to the audiovisual stimulus of the own-race face.

Table 1. Summary of significant channels from the individual channel analysis.

Discussion

In Experiment 1, we conducted fNIRS measurements to identify brain activation associated with the McGurk effect in Japanese 8- to 9-month-old infants and to examine the difference between the activation patterns for own-race-face and other-race-face stimuli. We analyzed both oxy-Hb and deoxy-Hb but obtained significant results only for oxy-Hb, as is common in infant studies (e.g., Issard and Gervain, 2017). Our results indicate that (1) the McGurk stimuli induced activation in the bilateral temporal regions, while the audiovisual-matched stimuli induced activation in the left temporal region; and (2) this activation pattern was found in the own-race-face condition but not in the other-race-face condition. These results support our assumption that the infant brain is activated in response to the McGurk effect when an own-race face is presented but not when an other-race face is presented.

We found a difference in the activation patterns between the audiovisual-matched and the McGurk stimuli in the own-race-face condition. The matched stimuli induced activation of the left temporal region, while the McGurk stimuli induced activation of the bilateral temporal regions. Our results suggest that the mapping of the McGurk effect in the infant brain was different from that found in adult studies (Beauchamp et al., 2010; Nath and Beauchamp, 2012). In adults, the left STS, which is important for processing audiovisual speech (e.g., Calvert et al., 2000; Beauchamp et al., 2004), is responsible for the McGurk effect (Beauchamp et al., 2010; Nath and Beauchamp, 2012). In infants, in addition to the left temporal region, the McGurk effect induces the activation of the right temporal region, which is important for processing faces (e.g., Otsuka et al., 2007; Kobayashi et al., 2018). The activation in the right temporal region may come from an infant’s need to process the speaking face of the McGurk stimuli, because the development of the McGurk effect is immature (e.g., McGurk and MacDonald, 1976; Sekiyama and Burnham, 2008).

Experiment 2

To support our fNIRS data, we used a familiarization/novelty preference method (e.g., Yang et al., 2016; Sato et al., 2017) to test whether infants could perceive the McGurk effect in the own-race-face condition but not in the other-race-face condition. As in the fNIRS experiment, we used two race-face conditions, each of which consisted of two phases: a familiarization phase and a test phase. In the familiarization phase, we presented infants with six familiarization trials, each of which repeated the McGurk stimulus (auditory /pa/ and visual /ka/) six times. In the test phase, we presented the infants with a familiarized trial and a novel trial. The familiarized trial consisted of six repeated presentations of a voiced “/ta/” syllable with vegetable images. The novel trial consisted of six repeated presentations of a voiced “/pa/” syllable with vegetable images. We expected that if infants perceived the McGurk effect, they would become familiarized with the “/ta/” sound in the familiarization phase and would therefore look longer at the novel trial (/pa/) in the test phase. Under our hypothesis, a significant preference for the novel trial in the test phase would reflect audiovisual speech integration in the familiarization phase.