Face Masks Impact Auditory and Audiovisual Consonant Recognition in Children With and Without Hearing Loss

Lalonde, Kaylah; Buss, Emily; Miller, Margaret K.; Leibold, Lori J.

doi:10.3389/fpsyg.2022.874345

ORIGINAL RESEARCH article

Front. Psychol., 13 May 2022

Sec. Perception Science

Volume 13 - 2022 | https://doi.org/10.3389/fpsyg.2022.874345

This article is part of the Research TopicLanguage Development Behind the MaskView all 8 articles

Face Masks Impact Auditory and Audiovisual Consonant Recognition in Children With and Without Hearing Loss

¹Audiovisual Speech Processing Laboratory, Boys Town National Research Hospital, Center for Hearing Research, Omaha, NE, United States
²Speech Perception and Auditory Research at Carolina Laboratory, Department of Otolaryngology Head and Neck Surgery, University of North Carolina School of Medicine, Chapel Hill, NC, United States
³Human Auditory Development Laboratory, Boys Town National Research Hospital, Center for Hearing Research, Omaha, NE, United States

Teachers and students are wearing face masks in many classrooms to limit the spread of the coronavirus. Face masks disrupt speech understanding by concealing lip-reading cues and reducing transmission of high-frequency acoustic speech content. Transparent masks provide greater access to visual speech cues than opaque masks but tend to cause greater acoustic attenuation. This study examined the effects of four types of face masks on auditory-only and audiovisual speech recognition in 18 children with bilateral hearing loss, 16 children with normal hearing, and 38 adults with normal hearing tested in their homes, as well as 15 adults with normal hearing tested in the laboratory. Stimuli simulated the acoustic attenuation and visual obstruction caused by four different face masks: hospital, fabric, and two transparent masks. Participants tested in their homes completed auditory-only and audiovisual consonant recognition tests with speech-spectrum noise at 0 dB SNR. Adults tested in the lab completed the same tests at 0 and/or −10 dB SNR. A subset of participants from each group completed a visual-only consonant recognition test with no mask. Consonant recognition accuracy and transmission of three phonetic features (place of articulation, manner of articulation, and voicing) were analyzed using linear mixed-effects models. Children with hearing loss identified consonants less accurately than children with normal hearing and adults with normal hearing tested at 0 dB SNR. However, all the groups were similarly impacted by face masks. Under auditory-only conditions, results were consistent with the pattern of high-frequency acoustic attenuation; hospital masks had the least impact on performance. Under audiovisual conditions, transparent masks had less impact on performance than opaque masks. High-frequency attenuation and visual obstruction had the greatest impact on place perception. The latter finding was consistent with the visual-only feature transmission data. These results suggest that the combination of noise and face masks negatively impacts speech understanding in children. The best mask for promoting speech understanding in noisy environments depend on whether visual cues will be accessible: hospital masks are best under auditory-only conditions, but well-fit transparent masks are best when listeners have a clear, consistent view of the talker’s face.

Introduction

Face masks are important for limiting the spread of the coronavirus disease 2019 (COVID-19), because they decrease aerosol transmission of the virus during speaking and coughing (Centers for Disease Control and Prevention, 2021a). To limit the spread of the coronavirus, the Centers for Disease Control and Prevention (CDC) stated in the summer of 2020 that individuals aged 2 years and older should wear masks in public and when around people who do not live in their household (Centers for Disease Control and Prevention, 2021b). Many United States schools returned to in-person or hybrid learning for the 2020–2021 school year, with teachers and students wearing face masks in most classrooms. Even after resolution of the current pandemic, face masks may be required to control outbreaks of COVID-19 or other contagious diseases.

While face masks are critically important for preventing the spread of COVID-19, behavioral studies have demonstrated a detrimental effect of face masks on adults’ speech recognition and recall (Wittum et al., 2013; Atcherson et al., 2017; Bottalico et al., 2020; Brown et al., 2021; Magee et al., 2021; Muzzi et al., 2021; Smiljanic et al., 2021; Toscano and Toscano, 2021; Truong et al., 2021; Vos et al., 2021; Yi et al., 2021). Specifically, the use of face masks can lead to difficulties with speech understanding by concealing lip-reading cues and reducing the transmission of high-frequency speech content (Palmiero et al., 2016; Atcherson et al., 2020; Corey et al., 2020; Goldin et al., 2020; Jeong et al., 2020; Pörschmann et al., 2020; Magee et al., 2021; Muzzi et al., 2021; Smiljanic et al., 2021; Toscano and Toscano, 2021; Truong et al., 2021; Vos et al., 2021; Yi et al., 2021). Adults with and without hearing loss report that both factors degrade speech recognition (Naylor et al., 2020; Saunders et al., 2020). When a conversation partner wears a mask, it affects the listener’s hearing and feeling of engagement with the talker (Saunders et al., 2020); among adults, those with hearing loss are more greatly impacted (Saunders et al., 2020).

Most prior studies investigating the effects of face masks on speech understanding are limited to adult subjects. Therefore, effects of face masks on children’s auditory and audiovisual speech understanding in environments, such as classrooms, are not yet understood. Additionally, there are many types of face masks in use, with different effects on acoustic and visual speech cues (e.g., Corey et al., 2020; Yi et al., 2021). The purpose of this study is to examine the effects of various types of face masks on auditory and audiovisual speech recognition in children with and without hearing loss, in hopes of providing recommendations regarding types of face masks that best support children’s speech understanding.

For the general public, including teachers, the CDC recommended non-medical disposable masks, breathable cloth masks made of multiple layers of tightly woven fabrics (such as cotton and cotton blends), or respirators without vents (Centers for Disease Control and Prevention, 2022a). However, teachers and others who interact with individuals who are deaf and hard of hearing were encouraged to wear transparent masks (Centers for Disease Control and Prevention, 2022b). In earlier guidelines, teachers were also encouraged to wear a transparent mask if they interact with students with special education or healthcare needs, teach young students who are learning to read, teach English as a second language, or teach students with disabilities, including hearing loss (Centers for Disease Control and Prevention, 2020). Acoustic attenuation is related to material attributes, such as thickness, weight, weave density, and porosity (Llamas et al., 2008). Among the common alternatives, hospital masks cause the least transmission loss (Atcherson et al., 2020; Bottalico et al., 2020; Corey et al., 2020; Goldin et al., 2020; Vos et al., 2021). Cloth masks vary considerably depending on fabric attributes and number of layers, with one study showing between 0.4 to 9 dB greater attenuation between 2 and 16 kHz for cloth masks than for a disposable hospital mask (Corey et al., 2020).

Opaque face masks conceal a substantial portion of visual cues used for lip-reading and audiovisual speech enhancement. In one study, occluding the lower half of the face decreased adults’ lip-reading of consonant-vowel (CV) syllables from an 80 to 20% accuracy, and their audiovisual enhancement from a ∼21 to 4% benefit (Jordan and Thomas, 2011). Transparent masks (made of plastic or vinyl) provide access to more visual speech cues compared to opaque masks. Adults with moderate to profound hearing loss benefit from visual cues available when the talker is wearing a mask with a transparent window (Atcherson et al., 2017). When listening to speech in noisy backgrounds, adults with normal hearing (ANH) also benefit from visual cues available when the talker is wearing a transparent mask (Yi et al., 2021). However, transparent masks also cause 7–16 dB greater high-frequency attenuation than opaque disposable masks (Atcherson et al., 2020; Corey et al., 2020).

Different types of transparent masks vary with respect to visual cues they provide access to. Some transparent masks, such as ClearMask™ (Clear Mask, LLC), conceal very little of the face (Figure 1E). Others, such as The Communicator™ (Safe’N’Clear, Inc.), are primarily opaque with a transparent window in the region of the mouth (Figure 1D). Although not previously examined, differences in visual cues available with different transparent masks might affect audiovisual speech perception. Some non-oral regions of the face contain subtle visual cues that could contribute to audiovisual enhancement (Scheinberg, 1980; Preminger et al., 1998). However, the impact of these non-oral visual cues may be negligible when oral cues are available, because the shape and movement of the oral area are highly correlated with movements of the jaw and cheeks (Munhall and Vatikiotis-Bateson, 1998).

FIGURE 1

Figure 1. Illustration of the five mask conditions. The top row contains photographs of the five conditions: (A) no mask, (B) hospital mask, (C) fabric mask, (D) Communicator™, and (E) ClearMask™. The bottom row shows the associated visual face mask simulations: (F) no mask, (G) hospital mask, (H) fabric mask, (I) Communicator™, and (J) ClearMask™.

Several recent studies have presented data on the effects of face masks on speech recognition and recall in adults. Among ANH, researchers have reported 12–23% poorer auditory-only (Wittum et al., 2013; Bottalico et al., 2020; Muzzi et al., 2021; Yi et al., 2021) and audiovisual (Brown et al., 2021; Smiljanic et al., 2021; Yi et al., 2021) word and sentence recognition in noise with face masks than without face masks. ANH are worse at recalling sentences presented audiovisually with a cloth or a hospital face mask than without a face mask (Smiljanic et al., 2021; Truong et al., 2021), and they report greater listening effort for speech produced with a face mask than without one (Bottalico et al., 2020; Brown et al., 2021). Among adult hearing aid users with moderate hearing loss, Atcherson et al. (2017) observed 7–8% worse auditory sentence recognition in multi-talker babble with the talker wearing a hospital mask than without one. Among adult cochlear implant users, Vos et al. (2021) observed 26% lower auditory sentence recognition accuracy in quiet with an N95 respirator in combination with a face shield but no effect of the N95 respirator alone. Deficits depend on the amount of background noise in the listening environment (Brown et al., 2021; Smiljanic et al., 2021), with limited effects among ANH tested in quiet or at high SNRs (+5 to +13 dB) (Llamas et al., 2008; Mendel et al., 2008; Atcherson et al., 2017; Brown et al., 2021; Smiljanic et al., 2021; Toscano and Toscano, 2021). Additionally, auditory-only performance may not be affected by face masks in individuals with severe to profound hearing loss (Atcherson et al., 2017; Vos et al., 2021), potentially because their hearing loss restricts the speech bandwidth with and without a face mask.

Recent studies have demonstrated that acoustic effects of face masks significantly impact auditory-only speech recognition in children with normal hearing (CNH; Flaherty, 2021; Sfakianaki et al., 2021), with a similar impact on children and adults. Flaherty, 2021 tested auditory-only speech recognition in a two-talker speech masker among 24 adults and 30 children (8–12 years of age), comparing masked speech recognition thresholds at baseline to a primarily transparent mask (ClearMask™), a face shield, an N95, and a hospital mask. Subjects required a more favorable target-to-masker ratio to understand speech produced with the ClearMask™ and face shield than with no mask, with similar effect sizes in children and adults. Sfakianaki et al. (2021) tested 10 adults and 10 children (6–8 years of age) in quiet, classroom noise, and a two-talker masker, comparing accuracy with no mask to accuracy with a hospital mask. Subjects accurately repeated back fewer words when listening to speech with the hospital mask than without it. There was a non-significant trend for greater impact of the mask in children than in adults.

Although the CDC recommended using a transparent mask when communicating with individuals who are deaf or hard of hearing (Centers for Disease Control and Prevention, 2022b), only one study has examined the effects of face masks on speech perception in children with hearing loss (CHL). Lipps et al. (2021) examined audiovisual word identification in quiet in 13 CHL (3–7 years of age), comparing accuracy at baseline to accuracy with a hospital mask, ClearMask™, and a transparent apron mask. The hospital mask and the transparent apron mask negatively impacted children’s audiovisual word identification accuracy, but the ClearMask™ did not. This finding indicates that a transparent mask is likely the best choice for young CHL listening to speech under quiet audiovisual conditions. We have only begun to scratch the surface of understanding how face masks affect communication in children with and without hearing loss, which limits our ability to make evidence-based recommendations about what type of face mask best supports communication in children with and without hearing loss.

Measurements of the acoustic transmission loss associated with the use of opaque and transparent face masks indicate a trade-off between concealment of visual cues and the degree of high-frequency acoustic attenuation (Atcherson et al., 2020; Corey et al., 2020; Vos et al., 2021). When listening to speech in moderate levels of background noise (−5 dB SNR), the availability of visual cues outweighs the additional high-frequency acoustic attenuation caused by the transparent mask in ANH (Yi et al., 2021). We expect that age- and hearing-related differences in susceptibility to both high-frequency acoustic attenuation and reliance on visual speech cues will affect where children fall in this trade-off. For instance, children may be particularly affected by the acoustic attenuation caused by face masks, because they require greater bandwidths than adults for masked speech understanding (Stelmachowicz et al., 2000, 2001; Mlot et al., 2010; McCreery and Stelmachowicz, 2011). However, the acoustic attenuation of face masks may have a smaller effect on CHL who have poor high-frequency aided audibility, because the hearing loss restricts the speech bandwidth both with and without a face mask. The loss of visual cues resulting from the use of opaque face masks is likely to affect both CNH and CHL, as both groups benefit significantly from visual speech cues (see reviews by Lalonde and McCreery, 2020; Lalonde and Werner, 2021). However, CNH are likely to be less affected than adults by the loss of these visual cues, because they benefit less from visual speech (Wightman et al., 2006; Ross et al., 2011; Lalonde and Holt, 2016). Finally, CHL (and particularly those with more severe hearing loss) may be more impacted by the loss of visual cues, because they benefit more from visual speech than CNH and CHL with less severe hearing loss (Lalonde and McCreery, 2020).

Previous studies examining the impact of face masks on speech understanding have measured effects at the word or sentence level. Face masks likely have a greater detrimental impact on some speech features than others. Reduced transmission of high-frequency speech content is likely to degrade the perception of speech features based primarily on high-frequency acoustic cues, such as consonants and specifically their place of articulation (Stevens, 2000). Similarly, reduced transmission of visual speech cues is likely to degrade the perception of speech features that are easy to speech-read, such as consonant place of articulation (Binnie et al., 1974; Owens and Blazek, 1985).

The purpose of this study was twofold. First, we aimed to determine the impact of face masks on auditory and audiovisual speech recognition in noise among CNH and CHL. Second, we aimed to determine what type of face mask may best support communication in these groups. We used four types of face masks: a hospital mask, a fabric mask, a primarily transparent mask (ClearMask™), and a primarily opaque mask with a transparent window (The Communicator™). We compared performance with these face masks to baseline conditions with no mask. Audiovisual conditions were included to investigate the impact of face masks on communication under ideal conditions. Auditory-only conditions were included to evaluate performance under less-than-ideal conditions, based on limited evidence that young children do not consistently orient to the target talker the way adults do (Ricketts and Galster, 2008).

Based on previous studies and children’s susceptibility to the detrimental effects of decreased auditory bandwidth, we expected all the masks to affect auditory-only performance in children with good high-frequency aided audibility, including CNH and some CHL. More specifically, we expected that reduced transmission of high-frequency speech content would affect the discrimination of consonants, especially consonant place of articulation (Stevens, 2000). Based on acoustic transmission data in previous studies, we expected hospital masks to affect auditory-only performance the least and transparent masks to affect auditory-only performance the most. In children with poor high-frequency aided audibility, we expected a smaller difference between scores in the auditory-only conditions. In audiovisual conditions, we expected opaque masks to reduce the transmission of visual speech cues, which would affect the discrimination of consonants, especially consonant place of articulation (Binnie et al., 1974; Owens and Blazek, 1985). As such, under audiovisual conditions, we expected children with poor high-frequency aided audibility to perform best with the transparent masks, because they benefit from visual speech cues (Lalonde and McCreery, 2020) and may be less impacted by the high-frequency attenuation caused by the masks. It is uncertain what to predict for CNH and CHL with good high-frequency aided audibility with respect to the trade-off between loss of visual cues and high-frequency attenuation. The results from this study will provide an evidence base for advising educators and others who work with children as to the best face masks for promoting speech understanding.

Overview, General Materials, and General Methods

This study was designed to test the effects of four types of face masks on auditory and audiovisual speech recognition in noise by ANH, CNH, and CHL. The four types of masks (Figures 1B–E) include a disposable hospital mask, a homemade pleated cloth mask with two layers of cotton blend, the Communicator™, and the ClearMask™. The Communicator™ is an FDA-registered single-use, disposable device that meets ASTM Level 1 hospital mask standards and includes a fog-resistant transparent window. The ClearMask™ is an FDA-approved, class II single-use transparent face mask that meets ASTM Level 3 standards. It serves the same function as traditional masks and provides a full, anti-fog plastic barrier.

We simulated the effects of face masks on audiovisual stimuli by filtering the acoustic speech from an unmasked talker to match the long-term average spectrum of each mask and by overlaying mask shapes onto videos of the unmasked talker. This method does not capture differences in production that talkers may adopt when wearing a face mask, but it has the advantage of ensuring that idiosyncrasies in production across conditions do not affect results.

Experiment 1 was conducted by delivering research equipment to the homes of CHL. CHL and members of their household completed auditory-only and audiovisual speech recognition tests in noise at 0 dB SNR. In Experiment 2, young ANH completed the same test in a laboratory sound booth. The effects of face masks vary depending on baseline difficulty; mask effects differ in subjects as a function of SNR (Toscano and Toscano, 2021). Therefore, the ANH in Experiment 2 were also tested at −10 dB SNR to better match the performance level of the CHL tested in Experiment 1. Both experiments were reviewed and approved by the Boys Town National Research Hospital Institutional Review Board. Adult participants provided their written informed consent to participate in the study; child participants and their caregivers provided written assent and permission, respectively.

Stimuli

Target stimuli included 36 audiovisual recordings of a 32-year-old female native speaker of mainstream American English (author KL) repeating CV words with the vowel /i/ and the format, “Choose /CV/”. The 12 CVs (/bi/, /si/, /di/, /hi/, /ki/, /mi/, /ni/, /pi/, /∫i/, /ti/, /vi/, and /zi/) were the same CVs used in a previous study evaluating auditory-only speech perception in children (Leibold and Buss, 2013), based on the Audiovisual Feature Test for Young Children (Tyler et al., 1991). Three tokens of each CV were used so that idiosyncratic differences between the videos (e.g., blinking) could not be used to discriminate the syllables. Including the carrier, these recordings had a mean duration of 856 ms (728–941 ms) and a mean F0 of 238 Hz (212–276 Hz).

The original unmasked target stimuli were professionally recorded in a large sound-attenuating booth with a Sennheiser EW 112-P G3-G lapel microphone and a JVC GY-GM710U video camera. Final Cut Pro video editing software was used to splice the recordings into individual videos, which began 333 ms (10 frames) before the onset of acoustic speech and ended 333 ms after the end of the acoustic speech, with the exception that additional frames may have been added to avoid beginning or ending the video mid-blink. Using Adobe Audition, the intensity of each sound file was adjusted so that the total RMS of the speech portion of all the files matched.

Acoustic Mask Simulation

Target stimuli in the four mask conditions were generated by filtering the original (no mask) target recordings. Filters were constructed based on recordings of the rainbow passage (Fairbanks, 1969), produced by the target talker, each lasting 1.7–1.9 min. Two recordings were made using a Shure-KSM42 (condenser) microphone in each of five conditions: no mask, hospital mask, fabric mask, ClearMask™, and Communicator™. The two recordings for each condition were concatenated, and the result was transformed into the frequency domain using the pwelch function in MATLAB (Mathworks), with a 512-point window and a 256-point overlap between sequential windows. Attenuation as a function of frequency for each mask was determined as the difference in amplitude spectrum relative to the no-mask condition. These attenuation functions were used to generate 128-point FIR filters using the fir2 function in MATLAB. An all-pass filter was generated for use with the no-mask stimuli; this was done so the stimulus preparation steps were identical across stimuli. The amplitude spectrum of the no-mask recordings was also used to construct a filter for generating speech-shaped noise, following the same procedures. Stimuli were filtered using the filter function in MATLAB and saved to a disk. Long-term average magnitude spectra for targets in each condition are shown in Figure 2A. The speech-shaped noise was 30 s in duration, and its long-term average magnitude spectrum (not shown) was similar to that of the no-mask targets.

FIGURE 2

Figure 2. (A) Long-term magnitude spectra for stimuli in each of the five conditions. (B) Example target stimulus waveform after filtering for the no mask (blue) and fabric mask (green) conditions. Boxes indicate consonant regions.

As illustrated in Figure 2A, power spectra in the four mask conditions were similar to the power spectrum of the no-mask to within approximately ±6 dB up to 2 kHz. Mean attenuation between 2 and 16 kHz was 2.4 dB for the hospital mask, 5.5 dB for the Communicator™, 5.9 dB for the ClearMask™, and 8.2 dB for the fabric mask. Peak attenuation within a 1/3-octave-wide band by mask type was 5.5 dB at 8 kHz for the hospital mask, 14.9 dB at 3.6 kHz for the Communicator™, 11.9 dB at 16 kHz for the ClearMask™, and 12.7 dB at 3.2 kHz for the fabric mask. Figure 2B shows an example stimulus waveform after filtering for no mask (blue) and fabric mask (green) conditions. The mask causes greater attenuation of the consonants than the vowels. We observed similar acoustic differences between the mask and no mask conditions in recordings of the talker wearing a mask.

Visual Mask Simulation

To visually simulate face masks, video files were processed using specialized software that automatically determines the position of 66 points on the face in each frame of the videos (Saragih et al., 2011; available at osf.io/g9jtr/). These included 17 points on the perimeter of the lower half of the face, 18 points at the inner and outer rim of the lips, 6 points for each eye, 5 points for each eyebrow, 4 points on the bridge of the noise, and 5 points at the bottom of the nostrils.

The x and y coordinates of these points were imported into MATLAB. The Computer Vision Toolbox was used to superimpose filled white polygons onto each video frame. For opaque face masks (hospital, cloth), the size and shape of the polygon were created based on the location of the 17 points at the perimeter of the lower half of the face and the marker at the upper middle of the bridge of the nose. An example of the simulated opaque face mask is shown in Figure 1G. The simulated Communicator™ mask was created in a similar fashion, except that two filled polygons were used to create the effect of a cutout in the middle of the simulated opaque mask shape. Adjustments to the simulated masks were made to correct for problems noted by independent observers (see Supplementary Material). An example of the simulated Communicator™ mask is shown in Figure 1I. For the ClearMask™, no mask was superimposed. The stimuli are available at https://osf.io/5wapg.

Experiment 1

The first experiment was conducted remotely in participants’ homes. The goal was to evaluate auditory-only and audiovisual speech recognition in the five conditions. Remote data collection made it possible to collect data without bringing the subjects to the laboratory. Collecting data from members of the CHL’s household increased the efficiency of the protocol and reduced the variance of test conditions across participant groups.

Materials and Methods

Participants

Eighteen children with bilateral hearing loss (8 males) and 39 members of their households completed a remote study. CHL varied in age, between 7.4 and 18.9 years (mean = 12.7 years, SD = 3 years). Sixteen had sensorineural hearing loss, and two had mixed hearing loss. Audiograms and aided Speech Intelligibility Index (SII) scores at 65 and 75 dB SPL were available from 17 children’s previous research visits, 3–24 months before data were collected for this study (median 7.5 months). An audiogram (but not aided SII scores) was acquired from the other child’s previous clinical visit 7.5 months before data were collected for this study. Mean better-ear aided SII was 72.1 at 55 dB SPL, 80.8 at 65 dB SPL, and 81.2 at 75 dB SPL. The distribution of better-ear aided SII scores and audiograms for each of the CHL are shown in Figure 3. Medical records indicate that 11 children had congenital hearing loss. Five passed their newborn hearing screening and had hearing loss identified between age 4 and 7 years. Two others had hearing loss identified at age 2 or 3, but no newborn hearing screening results were reported. Family members of the CHL included 16 children with parent-reported normal hearing (9 males) between 7.5 and 19.8 years of age (mean = 12.1 years, SD = 3.5 years) and 23 adults with self-reported normal hearing (8 male) between 33.6 and 50.7 years of age (mean = 43 years, SD = 4.3 years). No audiometric data were available for family members of the CHL. Sample sizes are consistent with previous studies involving CHL (e.g., Leibold et al., 2013, 2019; Lalonde and McCreery, 2020).

FIGURE 3

Figure 3. Values of (A) SII and (B) pure-tone thresholds for children with hearing loss (CHL). (A) The distribution of SII is plotted as a function of stimulus level, with results shown separately for the better and worse ears, as indicated at the top of the panel. Horizontal lines indicate the median, boxes span the 25th to 75th percentiles, and vertical lines span the 10th to 90th percentiles. (B) Pure-tone thresholds for the better and worse ears are shown. Better and worse ear were identified for each subject based on SII55. Better-ear thresholds are shown in the left panel, and worse-ear thresholds are shown in the right panel. Individual data are shown in gray, and means are shown with thick black lines.

Apparatus

The experiments were conducted using custom software on a 13″ MacBook Air laptop (MacOS Catalina 10.15.6). Auditory stimuli were routed directly from the laptop to two loudspeakers (QSC CP8 compact powered 90°) via a breakout cable with dual XLR inserts. The equipment was placed on a mat that was marked to show the intended location of the laptop and speakers (see Supplementary Figure 3). Speakers were placed 23 cm away from the edge of the laptop and oriented toward the participants’ ears. A manual explaining how to set up the equipment and run the experiment was also provided.

Task

The participants completed a closed-set consonant identification task in speech-shaped noise using a picture selection response. On each trial, the noise was presented at 70 dB SPL, and the target was presented at 0 dB SNR. The noise had 20-ms ramps at onset and offset. The videos began 186 ms after the initiation of noise onset and ≥333 ms before the onset of speech-related movements, ensuring that each video began with a neutral face. On average, acoustic speech began 756 ms (SD = 11 ms) after the offset of the noise ramp. The participants identified the CV token from a matrix of 12 alphabetically ordered illustrations that appeared immediately after the stimulus (Supplementary Figure 4). The same illustrations and similar methods were used in a previous experiment with children as young as 5 years old (Leibold and Buss, 2013). In each trial, the stimulus was presented once; there was no option to repeat the stimulus. Participants were instructed to take their best guess if they were unsure.

The participants completed testing in 11 randomly ordered conditions (four mask conditions conducted in auditory-only and audiovisual modalities and the no mask condition in auditory-only, audiovisual, and visual-only modalities). The five conditions included a no-mask baseline (Figures 1A,F) and simulations of speech produced with a hospital mask (Figures 1B,G), a fabric mask (Figures 1C,H), and two transparent masks [ClearMask™ (Figures 1E,J) and Communicator™ (Figures 1D,I)]. In audiovisual conditions, the participants saw a synchronous and congruent video of the speaker, with or without a simulated face mask. In auditory-only conditions, the participants saw a blank gray screen during auditory stimulus presentation. The 36 consonant stimuli (12 CVs, 3 samples each) were presented in random order in a block of trials.

Procedure

Informed consent was obtained by a research audiologist by videoconferencing via the HIPAA-protected WebEx Internet-based application and electronic consent forms developed in the Internet-based software called Research Electronic Data Capture (REDCap). All the participating family members were consented together, and each individually signed their forms electronically. Following completion of all consent paperwork, the audiologist provided an overview of the instructions. The participants were instructed to complete the testing in a quiet space free from distraction and to listen carefully to the woman who appeared on the screen. The audiologist arranged a time to deliver test equipment to the participants’ homes and provided her contact information so that she could be available for troubleshooting.

The participants received a manual with the test equipment. In the manual, parents were instructed to choose the most technologically savvy and most available participant in the home to participate first, to ensure the equipment was set up properly, and that the person could assist all other participants at home if needed. Furthermore, the participants were instructed to set up the equipment in a quiet space, away from all other participants, to avoid premature exposure to the stimuli. The recommended equipment configuration was to place the mat on a long table (at least 1.52 m × 0.76 m). There were cases in which a family member did not have a table large enough on which to set the equipment, so they sat with the equipment on the floor. Once the equipment was in place, parent participants were asked to test each speaker with a feature on the user interface. To ensure appropriate calibration, the participants were instructed to keep the laptop set at full volume. The speakers were also fitted with a custom 3D printed plastic barrier that prevented the participants from adjusting the speaker volume. It was recommended that the equipment stay in place once it was set up.

When they were ready for the test, the participants were instructed to sit directly in front of the laptop. Instructions specified that the CHL should wear their hearing aids in their typical configuration while completing the study. With the help of the manual, the participants were instructed to test themselves in the randomized order indicated on their individualized data collection sheet. Despite these instructions, 15 participants tested conditions in alphabetical order [some participants ran both sets of each condition in alphabetical order (AA, BB, CC, …), while others ran through each condition in alphabetical order twice (ABC…, ABC…)]. This resulted in the following order: no mask, ClearMask™, Communicator™, fabric mask, and hospital mask, with each auditory condition preceding the corresponding audiovisual condition; the visual-only condition was last. There were also instances in which participants began testing and then stopped in the middle of a condition once they realized they were not following the order on their individualized datasheet. These incomplete conditions were excluded from the analysis. The parents were asked to assist their children during testing. Total test time was 1.25 to 2 h. Breaks were suggested for all the participants, but strongly encouraged for younger participants. Most of the participants completed the testing in one sitting; others split up the testing into two sessions. The participants were paid with an electronic gift card for their participation. A research audiologist was on stand-by remotely to answer questions or help with the protocol.

Analyses

Analyses were performed in RStudio (version 1.2.1335). Linear mixed-effects models were fitted using the lmer and anova functions in the lmerTest package (Kuznetsova et al., 2017). The anova function provided F-statistics for the model generated using the lmer function. Non-significant interactions were systematically eliminated to arrive at the final model. Reference conditions were systematically varied as needed for post hoc comparisons. Auditory-only and audiovisual proportion correct data were transformed into rationalized arcsine units (RAUs) for statistical analyses.

Results

Most subjects in Experiment 1 completed two runs of data collection in each condition, but there were exceptions. Because of a programming error, two subjects in each group heard the no-mask audiovisual condition at 20 dB SNR rather than 0 dB SNR. These data were omitted. The visual-only condition was added after the data collection commenced; as a result, data in the visual-only condition were only obtained for 16/23 ANH, 11/16 CNH, and 12/18 CHL. Of the remaining 603 cases (subjects × conditions, minus missing data), 27% included data from a single run, 72% from two runs, and only 1% from more than two runs. Figure 4 shows the distribution of mean performance plotted by stimulus condition for the three groups of subjects tested in Experiment 1.

FIGURE 4

Figure 4. Distribution of scores for each of the auditory-only and audiovisual stimulus conditions, plotted separately for each of the three groups of subjects tested in Experiment 1, as indicated at the top of each panel. Mask type is indicated on the horizontal axis, and cue condition is indicated with box fill, as defined in the legend. Vertical lines indicate the median, boxes span the 25th to 75th percentiles, and vertical lines span the 10th to 90th percentiles.

Overall, auditory-only and audiovisual accuracies were poorer and more variable for CHL than for the two groups with normal hearing. These group differences were similar for auditory-only and audiovisual conditions. Audiovisual accuracy was higher than auditory-only accuracy, especially in the no mask, ClearMask™, and Communicator™ conditions. The masks degrade performance relative to the no mask condition, but the impact varied across groups, modalities, and mask conditions.

The initial linear mixed-effects model examining RAU-transformed accuracy included fixed effects and interactions of mask condition, modality, and group, as well as a random intercept per subject. The baseline no mask condition, auditory-only modality, and CNH group served as the reference condition for the model. The final model included significant main effects of modality (F₁,₉₁₆ = 981.9, p < 0.0001), mask condition (F₄,₉₁₆ = 176.6, p < 0.0001), and group (F₂,₅₅ = 37.4, p < 0.0001), as well as interactions of modality and mask condition (F₄,₉₁₆ = 72.1, p < 0.0001) and modality and group (F₂,₉₁₆ = 18.9, p < 0.0001). Model estimates and post hoc comparisons are shown in Supplementary Table 1. The CHL had lower speech recognition accuracy than the groups with normal hearing, in both the auditory-only and audiovisual modalities. The ANH and the CNH performed similarly in the auditory-only modality, but the ANH were more accurate than the CNH in the audiovisual modality. All the groups demonstrated significantly better overall performance in the audiovisual modality than in the auditory-only modality. The effect of modality differed across all the groups. The ANH exhibited greater audiovisual benefit (greater difference between the auditory-only and audiovisual modalities) than the CNH and the CHL. The CHL exhibited greater audiovisual benefit than the CNH.

The effect of mask type differed across modalities but not across groups. In the auditory-only modality, the ClearMask™, Communicator™, and fabric masks significantly degraded performance relative to the no-mask condition (B ≤ −8.15, t₉₁₆ ≤ −6.337, p < 0.0001), but the hospital mask did not (B = −2.17, t₉₁₆ = −1.7, p = 0.0895). The masks differed significantly in the severity of degradation. The fabric mask had the greatest detrimental impact on performance, and the hospital mask had the least impact; the Communicator™ mask had a greater detrimental impact than the ClearMask™. This pattern of results is consistent with the degree of high-frequency acoustic attenuation shown in Figure 2. In the audiovisual modality, all four of the masks degraded performance relative to the no-mask condition (B ≤ −7.23, t₉₁₆ ≤ −5.391, p < 0.0001). The opaque masks (hospital and fabric) had a greater detrimental impact than the transparent masks (ClearMask™, Communicator™) due to loss of visual cues. The fabric mask resulted in the greatest degradation in audiovisual conditions. Significant audiovisual benefit was observed for the no mask, Communicator™, and ClearMask™ conditions, and did not differ among these conditions. No significant audiovisual benefit was observed for the hospital and fabric mask conditions with the CNH as the reference group. Although there was no significant three-way interaction, audiovisual benefit was observed for the hospital and fabric mask conditions with the CHL or the ANH as the reference group.

Although this study was not designed to examine individual differences, we explored the impact of child age on the results reported above. Overall, older children recognized speech in noise with greater accuracy than younger children. An initial linear mixed-effects model examining RAU-transformed accuracy included fixed effects of mask condition, modality, group (CNH and CHL), and age; three-way interactions of modality, age, and group and modality, age, and mask condition; and a random intercept per subject. In addition to the effects and interactions of condition, modality, and group noted above, the model indicated a significant effect of age (F_1,32 = 13.5, p = 0.0009), an interaction of age and modality (F_1,517 = 5.5, p = 0.0193), and a marginal three-way interaction of age, modality, and mask condition (F_4,517 = 2.1, p = 0.0802). Age effects did not significantly differ between the CNH and CHL groups. The post hoc comparisons demonstrated that age effects were present in every condition (B ≥ 1.6, t₈₆ ≥ 2.648, p ≤ 0.0097) except for the auditory-only no mask and auditory-only hospital mask conditions (B ≤ 1, t₈₆ ≤ 1.795, p ≤ 0.0762). In auditory-only conditions, younger children were more negatively impacted by the ClearMask™, Communicator™, and fabric mask than older children (B ≥ 1.1, t₅₁₇ ≤ 2.065, p ≤ 0.0395). This could indicate greater bandwidth requirements in younger children, but ceiling effects could produce this pattern of results. These ceiling effects, along with limited representation of older children in our sample (n = 6 age 15 and higher), prohibit strong conclusions about how the negative impact of masks varies from 7 to 19 years of age.

Experiment 2

The second experiment was conducted in a sound booth in the laboratory. The goal of Experiment 2 was twofold. One goal was to replicate the results obtained remotely from the ANH under more controlled conditions. Another goal was to obtain data from adults at a more challenging SNR (−10 dB) to approximately match the overall performance level of CHL tested at 0 dB SNR.