- 1. Acoustics Lab, School of Physics and Optoelectronics, South China University of Technology, Guangzhou, China
- 2. Nurotron Biotechnology Inc., Hangzhou, China
- 3. The Guangdong Key Laboratory of Intelligent Information Processing, College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
Abstract
The cochlea “translates” the in-air vibrational acoustic “language” into the spikes of neural “language” that are then transmitted to the brain for auditory understanding and/or perception. During this intracochlear “translation” process, high resolution in the time–frequency–intensity domains guarantees high-quality neural input to the brain, which is vital for our outstanding hearing abilities. Cochlear implants (CIs), however, have coarse artificial coding and interfaces, and CI users experience more challenges in common acoustic environments than their normal-hearing (NH) peers. Noise from sound sources of no interest to a listener may be ignored by NH listeners, but it can distract a CI user. We discuss CI noise-suppression techniques and introduce noise management for a new implant system. The monaural signal-to-noise ratio (SNR) estimation-based noise suppression algorithm “eVoice,” which is incorporated in the processors of Nurotron® Enduro™, was evaluated in two speech perception experiments. The results show that speech intelligibility in stationary speech-shaped noise can be significantly improved with eVoice; specifically, the mean speech reception threshold decrease in the present study was 2.2 dB. Similar results have been observed in other CI devices with single-channel noise reduction techniques. The Nurotron user community already numbers more than 10,000, and eVoice is a starting point for noise management in the new system. Future steps on non-stationary-noise suppression, spatial-source separation, bilateral hearing, microphone configuration, and environment specification are warranted. The existing evidence, including our research, suggests that noise-suppression techniques should be applied in CI systems. The artificial hearing of CI listeners requires more advanced signal processing techniques to reduce listening effort and increase intelligibility in noisy settings.
Introduction
The cochlear implant (CI) is one of the most successful prostheses ever developed and aims to rehabilitate hearing by transmitting acoustic information into the brains of people with severe to profound hearing impairment by electrically stimulating auditory nerve fibers (Shannon, 2014). The artificial electric hearing provided by current CIs is useful for speech communication but is still far from satisfactory compared with normal hearing (NH), especially in the aspect of speech-in-noise recognition.
The noise issue is a common complaint of CI users (e.g., Ren et al., 2018). Because of variability associated with implant surgery time, hearing history, rehabilitation and training, surgical conditions, devices and signal processing, and so on, large differences in hearing abilities have always been reported within any group of CI users. These reasons behind the CI-NH gap and intersubject CI variance may be classified into “top-down” and “bottom-up” types (Moberly and Reed, 2019; Tamati et al., 2019).
From a practical standpoint, knowledge about “top-down” memory and cognition is useful for rehabilitation and making surgical decisions (Kral et al., 2019), whereas the relationship between speech performance and the “bottom-up” signal processing functions—especially those on the electrode interface—determines the engineering approaches used in current CI systems (Wilson et al., 1991; Loizou, 1999, 2006; Rubinstein, 2004; Zeng, 2004; Zeng et al., 2008; Wouters et al., 2015; Nogueira et al., 2018). Although the “top-down” approach has been suggested to be incorporated into CI systems to form an adaptive closed-loop neural prosthesis (Mc Laughlin et al., 2012), we only introduce “bottom-up”–related techniques that might be useful for CI users to tackle the problem of noise masking, as discussed below.
How can more useful information be sent upward? Sound pressure waveforms are decomposed by healthy cochleae into fine temporal-spectral “auditory images.” CIs attempt to capture and deliver the same images but, unfortunately, do so coarsely. Theories in grouping, scene analysis, unmasking, and attention have demonstrated the significance of precise coding of acoustic cues including pitch or resolved harmonics, common onset, and spatial cues. For most CI systems, only temporal envelopes from a limited number of channels can be transferred to the nerve, and current interactions between channels are a key limitation of the multichannel CI framework.
Several research directions have been explored to improve the CI recognition performance of speech in noise by updating the technology of contemporary multichannel devices: (1) stimulating auditory nerves in novel physical ways such as optical stimulation (Jeschke and Moser, 2015) and penetrating nerve stimulation (Middlebrooks and Snyder, 2007); (2) developing intracochlear electrode arrays with different lengths, electrode shapes, and mechanical characteristics (Dhanasingh and Jolly, 2017; Rebscher et al., 2018; Xu et al., 2018); (3) steering and focusing the current spread by simultaneously activating multiple electrodes (Berenstein et al., 2008; Bonham and Litvak, 2008); (4) refining the strategies in the temporal domain by introducing harmonics (Li et al., 2012), timing of zero crossings (Zierhofer, 2003) or peaks (Van Hoesel, 2007), and slowly varying temporal fine structures (Nie et al., 2005; Meng et al., 2016); and (5) enhancing speech or suppressing noise before or within the core signal processing strategies. The first and second directions are developed from the perspective of neurophysiology; the third is mainly based on psychophysical tests; the fourth uses a combination of signal processing and psychophysics, and the fifth mainly concentrates on signal processing. All of these aspects are worth further investigation.
In the last two decades, the fifth approach, enhancing speech or suppressing noise before or within the core signal processing strategies, has become a hot topic in academic and industrial research. Noise reduction and speech enhancement are two sides of the same coin: the goal is to improve the intelligibility or quality of speech in noise, in most cases with a signal-to-noise ratio (SNR) enhancement signal processing system. For feasibility verification, noise reduction techniques from telecommunications and hearing aids, such as classic single-channel spectral subtraction, have been used to process noisy speech signals that were then presented through loudspeakers to CI users (Yang and Fu, 2005). Since then, more sophisticated single-channel noise-reduction algorithms (NRAs) (Chen et al., 2015), directional microphones and multimicrophone-based beamformers from hearing aids (Chung et al., 2004; Buechner et al., 2014), and, more recently, deep neural network–based algorithms (Lai et al., 2018; Goehring et al., 2019) have been tried with CI listeners. Another line of research specifically optimizes algorithm parameters with a consideration of the differences between CI and NH listeners; these parameters generally relate to the noise estimation or the gain function for noise reduction (Hu et al., 2007; Kasturi and Loizou, 2007; Mauger et al., 2012a,b; Wang and Hansen, 2018). All these studies demonstrated significant improvements, which can be explained by the higher SNR yielded by the techniques before or within the CI core strategies.
In the newest versions of CI processors from current commercial companies such as Cochlear® (Hersbach et al., 2012), Advanced Bionics® (Buechner et al., 2010), and MED-EL® (Hagen et al., 2019), one or multiple algorithms of SNR-based monaural noise reduction and spatial cue-based directional microphone or multimicrophone beamformers have been implemented and evaluated. Multimicrophone beamformers significantly improve speech intelligibility for CI recipients in noise, but they rely on the assumption that the target speech and noise sources are spatially separated. Thus, single-microphone NRAs in CI systems remain worthy of attention for improving speech perception in noise, especially in scenarios where the target speech and noise sources are not spatially separated.
Some single-microphone NRAs that are already implemented in commercial CI products have been reported in the literature. ClearVoice is a monaural NRA implemented with the HiRes 120 speech processing strategy (Buechner et al., 2010; Holden et al., 2013). It first estimates noise by assuming that speech energy amplitude changes frequently and background noise energy is less modulated. Then, gain is reduced for channels identified as having mainly noise energy. The noise estimation works at a time window of 1.3 s, which is the activation time of this algorithm. Experiments showed that ClearVoice can improve speech intelligibility in stationary noise (Buechner et al., 2010; Kam et al., 2012). Another monaural NRA is implemented with the ACE (advanced combination encoder) strategy in Nucleus devices. It uses a minimum statistics algorithm with an optimal smoothing method for noise estimation (Martin, 2001) and an a priori SNR estimate (McAulay and Malpass, 1980) in conjunction with a modified Wiener gain function (Loizou, 2007). It was reported to significantly improve hearing in stationary noise (Dawson et al., 2011).
We introduce a recently developed single-channel estimated-SNR–based NRA, termed “eVoice,” which has been implemented in the second-generation research processor Enduro™ of Nurotron. Nurotron, a young company based in Irvine, CA, United States, and Hangzhou, Zhejiang, China, currently has more than 10,000 implanted patients. The Nurotron system has 24 electrode channels, and its users’ speech performance in quiet and postsurgery development status are comparable with previous data from other brands (Zeng et al., 2015; Gao et al., 2016). The noise estimation in eVoice is processed on a frame-by-frame basis, that is, using a relatively short time window. It is based on classical signal processing algorithms, and Nurotron is not the first CI manufacturer to use this kind of approach. The aims of this study are to report the intelligibility experiment results for eVoice and to rethink noise management for a new CI system, in this case the Nurotron system.
eVoice of Nurotron: A Single-Channel NRA
The default core strategy of Nurotron is the advanced peak selection (APS) strategy, which is similar to an “n-of-m” strategy (Zeng et al., 2015). The APS strategy is based on a short-time Fourier transform (STFT) and typically selects eight maxima (an automatic process defined in the coding strategy) for stimulation in each frame (Ping et al., 2017). A block diagram of the APS strategy and eVoice is shown in Figure 1. In APS, the acoustic input signal is first preamplified, followed by bandpass filtering (the band number m typically equals the active electrode number, i.e., m = 24 in Nurotron devices) and envelope calculation. Then, in peak selection, the n bands with the largest amplitudes are selected for further non-linear compression and electrical stimulation (typically, n = 8 in Nurotron devices; see the sketch below). eVoice is an envelope-based noise reduction method implemented between envelope calculation and peak selection. It consists of two steps: noise estimation and gain calculation (Wang et al., 2017).
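To make the n-of-m selection concrete, here is a minimal Python sketch of one frame of peak selection. The function and variable names, and the random stand-in envelopes, are illustrative assumptions rather than Nurotron code.

```python
import numpy as np

def select_peaks(envelopes, n=8):
    """n-of-m peak selection: keep the n largest of the m channel envelopes
    in a frame and zero the rest (a sketch, not Nurotron code)."""
    out = np.zeros_like(envelopes)
    idx = np.argsort(envelopes)[-n:]     # indices of the n maxima
    out[idx] = envelopes[idx]
    return out

frame = np.abs(np.random.randn(24))      # stand-in for m = 24 channel envelopes
stimulation = select_peaks(frame, n=8)   # only the 8 selected channels stimulate
```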
Noise Estimation
The noise estimation algorithm is based on an improved minima-controlled recursive averaging (MCRA-2) algorithm (Rangachari and Loizou, 2006). Noise power in each channel is estimated on a frame-by-frame basis, rather than over a time window spanning several frames, to reduce response time.
Suppose that the noise is additive; then, in the time domain, the input signal y(n) can be denoted as

$$y(n) = x(n) + d(n),$$
where x(n) is the clean speech signal and d(n) is the additive noise signal. We use Y(λ, k), derived from the STFT of y(n), to represent the summed magnitude of channel k in frame λ in the frequency domain. The power spectrum of the noisy signal can be smoothed and updated on a frame-by-frame basis using the recursion below:

$$P(\lambda,k) = \eta\, P(\lambda-1,k) + (1-\eta)\,|Y(\lambda,k)|^{2},$$
where η is a smoothing factor. Then, the local minimum of the power spectrum in each channel can be tracked as follows:

$$P_{\min}(\lambda,k)=\begin{cases}\gamma\,P_{\min}(\lambda-1,k)+\dfrac{1-\gamma}{1-\beta}\bigl(P(\lambda,k)-\beta\,P(\lambda-1,k)\bigr), & P_{\min}(\lambda-1,k)<P(\lambda,k)\\[4pt] P(\lambda,k), & \text{otherwise,}\end{cases}$$
where Pmin(λ,k) is the local minimum of the noisy speech power spectrum, and β and γ are constant parameters. The ratio of the noisy speech power spectrum to its local minimum can be calculated as follows:

$$S_{r}(\lambda,k)=\frac{P(\lambda,k)}{P_{\min}(\lambda,k)}.$$
This ratio is compared against a threshold T(λ,k) to determine the speech-presence probability I(λ,k) using the criterion below:

$$I(\lambda,k)=\begin{cases}1, & S_{r}(\lambda,k)>T(\lambda,k)\ \text{(speech present)}\\[2pt] 0, & \text{otherwise (speech absent),}\end{cases}$$
where T(λ,k) is a threshold that is dynamically updated according to the estimated SNR of the previous frame. It is worth mentioning that this threshold is set at a constant level in the literature; our pilot data showed that dynamic thresholds outperformed constant ones, so we adopted dynamic thresholds.
This speech-presence probability I(λ,k) can be smoothed as follows:

$$K(\lambda,k)=\alpha\,K(\lambda-1,k)+(1-\alpha)\,I(\lambda,k),$$
where K(λ,k) is the smoothed speech-presence probability, and α is a smoothing constant. The smoothing factor to be used for noise estimation can then be updated using this speech-presence probability:

$$\alpha_{s}(\lambda,k)=\alpha_{d}+(1-\alpha_{d})\,K(\lambda,k),$$
where αs is the smoothing factor used for noise estimation, and αd is a constant. Finally, the noise power D(λ,k) of each channel is estimated as follows:

$$D(\lambda,k)=\alpha_{s}(\lambda,k)\,D(\lambda-1,k)+\bigl(1-\alpha_{s}(\lambda,k)\bigr)\,P(\lambda,k).$$
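The recursions above can be collected into a small frame-by-frame tracker. The Python sketch below is an assumption-laden illustration: all parameter values are placeholders, and the dynamic threshold T(λ,k) used in eVoice is simplified to a constant here.

```python
import numpy as np

class NoiseEstimator:
    """Frame-by-frame noise power tracking in the spirit of the MCRA-2
    estimator described above. Parameter values are illustrative only."""

    def __init__(self, n_ch=24, eta=0.7, beta=0.8, gamma=0.998,
                 alpha=0.6, alpha_d=0.85, thresh=5.0):
        self.eta, self.beta, self.gamma = eta, beta, gamma
        self.alpha, self.alpha_d, self.thresh = alpha, alpha_d, thresh
        self.P = np.zeros(n_ch)           # smoothed noisy power P(lambda, k)
        self.P_min = np.full(n_ch, 1e-8)  # tracked local minimum
        self.K = np.zeros(n_ch)           # smoothed speech-presence probability
        self.D = np.full(n_ch, 1e-8)      # estimated noise power D(lambda, k)

    def update(self, Y2):
        """Y2: noisy power |Y(lambda, k)|^2 of the current frame (length n_ch)."""
        P_prev = self.P.copy()
        self.P = self.eta * P_prev + (1.0 - self.eta) * Y2
        # Continuous local-minimum tracking.
        rising = self.P_min < self.P
        self.P_min = np.where(
            rising,
            self.gamma * self.P_min
            + (1.0 - self.gamma) / (1.0 - self.beta)
            * (self.P - self.beta * P_prev),
            self.P,
        )
        ratio = self.P / np.maximum(self.P_min, 1e-12)
        # Fixed threshold here; eVoice updates it dynamically per frame.
        I = (ratio > self.thresh).astype(float)
        self.K = self.alpha * self.K + (1.0 - self.alpha) * I
        alpha_s = self.alpha_d + (1.0 - self.alpha_d) * self.K
        self.D = alpha_s * self.D + (1.0 - alpha_s) * Y2
        return self.D
```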
Gain Function for Noise Reduction
Using the estimated noise power, the SNR can be estimated according to

$$\widehat{\mathrm{SNR}}(\lambda,k)=\frac{\max\bigl(P(\lambda,k)-D(\lambda,k),\,0\bigr)}{D(\lambda,k)}.$$
Then, we use a gain function like

$$G(\lambda,k)=\frac{\widehat{\mathrm{SNR}}(\lambda,k)}{1+\widehat{\mathrm{SNR}}(\lambda,k)}.$$
To suppress the noise to the maximum extent, the gain can be further adjusted:

$$G'(\lambda,k)=\begin{cases}G(\lambda,k), & \widehat{\mathrm{SNR}}(\lambda,k)\ge T_{g}\\[2pt] g, & \text{otherwise,}\end{cases}$$
where g is a small constant value, and Tg is a dynamic threshold determined by the SNR. Tg is also one of the key factors that determine the algorithm’s sensitivity.
Finally, the signal power after noise reduction is as follows:

$$\tilde{P}(\lambda,k)=G'(\lambda,k)\,P(\lambda,k).$$
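Assuming the Wiener-type gain written above, the gain stage might be sketched as follows; the values of g and Tg are placeholders, and in eVoice Tg is dynamic rather than fixed.

```python
import numpy as np

def evoice_like_gain(P, D, g=0.1, T_g=0.5):
    """SNR-estimation-based gain: a sketch assuming a Wiener-type form.
    g and T_g are placeholder values (in eVoice, T_g is SNR-dependent)."""
    snr = np.maximum(P - D, 0.0) / np.maximum(D, 1e-12)  # estimated SNR
    G = snr / (1.0 + snr)                                # assumed Wiener-type gain
    return np.where(snr < T_g, g, G)                     # floor low-SNR channels

# The channel power passed on to peak selection would then be G * P.
```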
Example
An example of eVoice working in a speech-shaped noise (SSN) at +5 dB SNR is shown in Figure 2. eVoice was implemented with the APS coding strategy with a channel selection of 8-of-24 at a sampling rate of 16,000 Hz. Figure 2 shows the power comparison in the eighth channel, including the signals for clean speech, noisy speech, processed speech, and estimated noise plotted in different colors.
Experiment 1: Subjective Preference and Speech Recognition in Noise
This experiment was designed to evaluate speech intelligibility with eVoice (denoted by “NR1”) compared with another NRA (denoted by “NR2”) that used binary masking for noise reduction, as well as with the APS strategy with no NRA (denoted by “APS”). NR2 uses the same noise estimation method as NR1, described in Noise Estimation. After noise estimation, NR2 calculates an SNR that is used to set the gain: the gain is set to 1 if the estimated SNR exceeds a threshold (speech dominant) and to a small constant otherwise (noise dominant); a sketch of this binary gain appears below. NR2 was selected for comparison because it is as computationally efficient as eVoice and the method of ideal binary masking had been studied in other CI systems (Mauger et al., 2012b). Speech intelligibility was measured with a speech-in-noise recognition test and a subjective rating questionnaire.
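For contrast with eVoice’s smooth gain, here is a minimal sketch of NR2’s binary decision; the threshold and floor values are assumed for illustration.

```python
import numpy as np

def binary_mask_gain(snr, threshold=1.0, floor=0.1):
    """NR2-style binary decision: gain 1 where the estimated SNR marks a
    speech-dominant channel, a small constant otherwise (values assumed)."""
    return np.where(snr > threshold, 1.0, floor)
```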
Methods
Participants
This experiment involved 11 experienced CI users (six females and five males), aged from 20 to 59 years (mean age = 41.2 years). All were postlingually deafened adults unilaterally implanted with a CS-10A implant and using a Venus™ sound processor (i.e., the first generation) programed with the APS strategy. For this experiment, the Enduro™ sound processor was fitted in place of the Venus™, with a remote control option to select which NRA to use (NR1, eVoice; NR2, binary masking). Demographics for individual participants are presented in Table 1. All participants’ native language was Mandarin Chinese, and participants were paid for their time and traveling expenses. Written informed consent was obtained before the experiment, and all procedures were approved by the local institution’s ethical review board.
Procedures and Materials
In this experiment, NR1 and NR2 performances were assessed first in a subjective evaluation, followed by a speech-in-noise recognition test.
The subjective evaluation lasted for 2 weeks. At the beginning of week 1, participants were fitted with an Enduro™ processor incorporating NR1 and were asked to use it in a take-home trial for 1 week. During that week, participants were free to turn NR1 on and off and use it in various everyday listening scenarios. At the end of week 1, subjective ratings were collected using the questionnaire shown in Table 2. The same procedure was followed for NR2 in week 2. The questionnaire consists of eight questions covering various everyday listening scenarios. A 5-point scale was used to collect participants’ subjective ratings of NR1 or NR2 in each listening scenario after each 1-week take-home use: 2, strongly agree; 1, agree; 0, neutral; -1, disagree; -2, strongly disagree.
In the test of speech recognition in noise, we used two noise types (an SSN and a babble noise) at three SNRs (5, 10, and 15 dB) to compare the three algorithms (APS, NR1, and NR2). This yielded a total of 21 test blocks (two noise types × three SNRs × three algorithms + baselines of the three algorithms in quiet). The three baseline blocks (three algorithms in quiet) were conducted first in a random order, followed by the remaining 18 blocks in a random order. We used sentence materials from two published Mandarin speech databases: the PLA General Hospital sentence recognition corpus (Xi et al., 2012) and the House Research Institute sentence recognition corpus (Fu et al., 2011). The PLA General Hospital corpus consists of 12 lists of 11 sentences each, with six to eight key words per sentence. The House Research Institute corpus comprises 10 lists of 10 phonetically balanced sentences each, with seven words per sentence. All sentences were read by female speakers. Eleven of the 12 lists in the PLA General Hospital corpus and all lists in the House Research Institute corpus were used.
Because of the limited number of material lists, lists from the PLA General Hospital and House Research Institute corpora were randomly assigned to blocks for each participant, with one list per block. Special care was taken to ensure that the blocks of each algorithm used lists from the same corpus. In each block, sentences were presented in a random order, and a percentage word correctness score was calculated. Stimuli were presented in a soundproof room by a loudspeaker located 1 m in front of the participant at a comfortable level (approximately 65 dBA). The tests were administered using QuickSTAR4TR software developed by Qianjie Fu (Emily Fu Foundation, 2019).
Statistical Analysis
Repeated-measures one-way analysis of variance (ANOVA) was used to analyze speech recognition in quiet. Repeated-measures three-way ANOVAs were performed to assess speech recognition in noise. Bonferroni adjustments were used for multiple comparisons.
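For readers who wish to reproduce this analysis pipeline, a three-way repeated-measures ANOVA of this design can be run in Python with statsmodels; the file name and column names below are assumptions about how the scores would be organized.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Assumed layout: one row per subject x condition, with columns
# subject, noise_type, snr, algorithm, score (percent correct).
df = pd.read_csv("speech_in_noise_scores.csv")  # hypothetical file

# Three-way repeated-measures ANOVA: noise type x SNR x algorithm.
result = AnovaRM(df, depvar="score", subject="subject",
                 within=["noise_type", "snr", "algorithm"]).fit()
print(result)
```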
Results
Subjective Evaluation Test
Figure 3 shows the results of the subjective ratings for NR1 and NR2.
Figure 3. Results of subjective evaluations of NR1 (left panel) and NR2 (right panel). The abscissa lists all eight questions used for the subjective evaluation, and the ordinate is the rating given by the participants. Along the ordinate, “-2” represents strong disagreement on the question, and “2” represents strong agreement. The larger the number, the more positive the subjective evaluation is that the NR could help in different noisy scenarios and did not impact listening in quiet settings. The size of the circles represents the number of participants who gave the corresponding ratings, with larger circles indicating more participants.
For NR1 (i.e., eVoice), there were many positive ratings and few negative ones. Most participants gave positive ratings to Q2, Q4, Q5, and Q6, indicating a better listening experience with NR1 on than off in scenarios such as multitalker communication, at an intersection, and in a vehicle. For Q3, Q7, and Q8, most participants gave neutral ratings, corresponding to scenarios such as in a restaurant or supermarket and near an air conditioner or fan; this suggests comparable performance between NR1 on and off in these settings. A few participants gave positive ratings to Q3, Q7, and Q8 (a better experience with NR1 turned on in listening scenarios such as a one-on-one conversation in a restaurant, near an air conditioner or fan, or in a busy supermarket). For listening in quiet, most participants gave positive ratings to Q1, reporting that NR1 had no effect on a one-on-one conversation in a quiet room.
For NR2 (i.e., binary masking), the feedback was more variable. In general, ratings were almost evenly distributed between negative and positive for all eight questions except Q2 and Q8: some participants thought NR2 was helpful in most listening scenarios, but comparable numbers thought it was not helpful or were neutral. For Q2 and Q8, most participants gave neutral ratings, indicating that most thought NR2 had no effect on multitalker communication in quiet or a one-on-one conversation in a supermarket.
Speech Intelligibility Test
Results of speech recognition in quiet are shown in Figure 4. A repeated-measures one-way ANOVA revealed no significant difference among the mean results (∼90%) of the three algorithms (p = 0.452).
Figure 4. Results of speech recognition score in quiet settings. The left panel shows the individual percent correct scores, and the right panel shows the group means, with error bars indicating the standard error of group means.
Figure 5 shows the results of speech recognition in the SSN and babble noise. Statistical significance was determined using a repeated-measures ANOVA with the percent correct scores as the dependent variable and noise type (SSN or babble), SNR (5, 10, or 15 dB), and algorithm (APS, NR1, or NR2) as within-subject factors. Tests of within-subject effects indicated significant main effects of noise type (p = 0.022), SNR (p < 0.001), and algorithm (p = 0.002), as well as a significant interaction between noise type and SNR (p < 0.001). Pairwise comparisons revealed that the overall performance of NR1 was significantly better than APS (p = 0.001) and NR2 (p = 0.016), with no significant difference between APS and NR2 (p = 0.612). When noise type and SNR were fixed to examine the effect of algorithm at specific SNRs in a particular noise type, NR1 performed significantly better than NR2 at the 5-dB SNR in the SSN (p = 0.010) and significantly better than APS (p = 0.027) at the 5-dB SNR in the babble noise. At the 10- and 15-dB SNRs, in both the SSN and babble noise, there were no significant differences among the three algorithms, although NR1 showed higher mean scores: nearly eight percentage points above APS and NR2 at the 10-dB SNR in the SSN, eight percentage points above APS at the 10-dB SNR in the babble noise, and about five percentage points above APS and NR2 at the 15-dB SNR in the babble noise. These improvements, however, were not statistically significant.
Figure 5. Results of speech recognition score in SSN (left panel) and babble noise (right panel). Results of each individual participant are plotted, and the bars show the mean values, with error bars indicating the standard deviations.
Short Summary
In this experiment, we tested two NRAs: eVoice (NR1) and another that used binary masking (NR2). Both use the same noise estimation process but differ in the noise cancelation process. NR1 uses a smoothing gain function, whereas NR2 uses a binary masking. The subjective evaluation ratings show that NR1 was positively reviewed, whereas ratings of NR2 were almost evenly distributed from negative to positive, with a slight dominance of neutral responses. The speech recognition test results indicate overall better performance of NR1 compared to NR2 and APS. However, a significant benefit was only found at 5-dB SNR. The above results demonstrate that NR1 had better performance than NR2 for both speech recognition tests and subjective evaluations.
Experiment 2: Speech Reception Threshold Test
Rationale
The hypothesized significant benefit of eVoice was not always supported by the results of the first experiment. One reason may be the fixed-SNR procedure combined with the large performance variance in the cohort. From the results of Experiment 1 (left panel in Figure 5), ceiling effects could be observed for some participants at the 15-dB SNR, and floor effects at the 5-dB SNR. Speech perception in noise varied dramatically among participants, even at the same SNR in the same noise. This reveals a limitation of testing percent correct scores at fixed SNRs: such a test cannot exclude potential ceiling and floor effects. To overcome this limitation, we designed Experiment 2, which used an adaptive speech reception threshold (SRT) test to measure the potential benefits of eVoice.
In the first experiment, NR1 (i.e., eVoice) clearly outperformed NR2 (i.e., the binary-masking one) in the subjective test, although little improvement was observed in the speech-in-noise recognition test. To further explore the potential of eVoice and to save experiment time, only NR1 was evaluated in the second experiment.
Methods
Participants
Eight experienced CI users were recruited for this experiment (five females and three males, aged from 23 to 62 years with a mean of 43.6 years). All spoke Mandarin Chinese as their native language. They were all postlingually deafened adults unilaterally implanted with a CS-10A implant who used Enduro™ devices as their clinical processors, programed with the APS coding strategy and a remote control option to switch eVoice on or off. Demographic data for individual participants are presented in Table 3. Participants were compensated for their time and traveling expenses. All provided informed consent before the experiment, and all procedures were approved by the local institution’s ethical review board.
Materials and Procedures
An adaptive staircase SRT-in-noise test was administered to further evaluate the performance of eVoice. The SRT measurement method was adopted from our previous studies (Meng et al., 2016, 2019) with two minor changes: (1) the maximum number of presentations of each stimulus was reduced from three to two, and (2) the correctness criterion was changed from 50% to 80% of the words in a sentence. The first change reduced experiment time; the second tracked a higher threshold, which is more indicative of true understanding. We were therefore tracking the SNR at which subjects have a 50% chance of achieving 80% word correctness.
The Mandarin Hearing in Noise Test (MHINT) corpus (Wong et al., 2007) recorded by a single male speaker was used. There are 12 lists for formal tests and 2 lists for practice, with 20 sentences in each list, and 10 words in each sentence. In this experiment, 10 of 12 formal test lists were used as target speech in the formal tests, and both practice lists were used in the training stage to familiarize participants with the test procedures.
SRTs were measured with eVoice on and off. For each condition, two types of background noise were used: SSN and babble noise, generated using the method described in section “Speech Stimuli and Tasks” of Experiment 2 in Meng et al. (2019). The SRT for each condition–background combination was tested twice using two different MHINT lists, and the two results were averaged as the final SRT. Speech intelligibility in quiet was also measured for each condition using one MHINT list. Therefore, a total of 10 lists were used for testing (two backgrounds × two conditions × two lists per combination + two lists for speech intelligibility in quiet). The order of lists and conditions was randomized across participants. Before the formal test, the two practice lists were used to familiarize participants with the procedures of the SRT test and the speech intelligibility in quiet test. During the test, each sentence was presented at most twice at the request of the participant; participants were instructed to repeat the words that formed a meaningful sentence, and no feedback was given.
The SNR in each trial was adapted by changing the level of the target speech against fixed background noise. Participants were instructed to repeat as many words as they could; the target level was decreased if at least 8 of the 10 words were repeated correctly and increased otherwise. The step size was 8 dB before the second reversal, 4 dB before the fourth reversal, and 2 dB for the remaining reversals. The arithmetic mean of the SNRs of the last eight sentences was recorded as the final SRT (a sketch of this procedure follows below).
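The adaptive track can be summarized in a few lines of Python; the present_sentence callable and its interface are assumptions standing in for the actual presentation software.

```python
import numpy as np

def run_srt_staircase(present_sentence, n_sentences=20, start_snr=10.0):
    """Sketch of the adaptive SRT track described above. present_sentence
    is an assumed callable that plays one sentence at the given SNR and
    returns the number of words (out of 10) repeated correctly."""
    snr, prev_dir, n_reversals, track = start_snr, None, 0, []
    for _ in range(n_sentences):
        track.append(snr)
        n_correct = present_sentence(snr)
        direction = -1 if n_correct >= 8 else +1   # >= 8/10 words: harder next trial
        if prev_dir is not None and direction != prev_dir:
            n_reversals += 1
        prev_dir = direction
        # 8 dB before the 2nd reversal, 4 dB before the 4th, then 2 dB.
        step = 8 if n_reversals < 2 else (4 if n_reversals < 4 else 2)
        snr += direction * step
    return float(np.mean(track[-8:]))              # mean SNR of the last 8 sentences
```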
It is worth mentioning that the babble noise used in this study consisted of voices of the same talker as the target speech (Meng et al., 2019), which is extremely challenging for any NRA. Additional information about the procedures and materials can be obtained from Meng et al. (2016, 2019).
Results
The eight CI users listed in Table 3 participated in this experiment, but N17 was found to have auditory neuropathy; therefore, N17’s data were excluded from the analyses.
Results of speech recognition in quiet are shown in Figure 6. The group mean scores were 93.1 and 93.3% for eVoice-off and eVoice-on, respectively. A two-tailed paired-samples t-test showed no significant difference between the two conditions (t(6) = −0.162, p = 0.877).
Figure 6. The speech recognition scores in quiet with eVoice-off and eVoice-on. The left panel shows the individual scores, and the right panel shows the group means, with error bars showing the standard errors of group means.
Figure 7 shows the results of the SRTs in the SSN (left panel) and babble noise (right panel). In the SSN, every participant had lower SRTs with eVoice-on than with eVoice-off. The group mean SRTs were 7.9 and 5.7 dB for eVoice-off and eVoice-on, respectively. This 2.2-dB difference was a statistically significant improvement (t(6) = 6.892, p < 0.001).
Figure 7. Results of SRT in the SSN (left panel) and babble noise (right panel). Individual SRTs are shown on the left, and the group mean SRTs are shown on the right. Error bars show the standard error of group means. The significant difference is illustrated by the asterisk (p < 0.05).
In the babble noise, group mean SRTs of 10.9 and 10.7 dB were observed for eVoice-off and eVoice-on, respectively. A two-tailed paired-samples t test revealed no significant difference between the two conditions (t(6) = 0.249, p = 0.812).
Short Summary
The aim of this experiment was to quantify the benefit introduced by eVoice for speech intelligibility and exclude potential ceiling and floor effects. Speech intelligibility was measured using an adaptive SRT test with two different backgrounds: SSN and babble noise. There was no significant difference in speech recognition rates in quiet settings. This result indicates that eVoice would not affect speech perception in quiet. eVoice yielded an SRT decrease of 2.2 dB in SSN, whereas no significant effect was found in SRTs in babble noise.
Discussion
In this study, we examined eVoice, the first noise-suppression technique in Nurotron® CIs. eVoice is a single-channel NRA implemented within the APS strategy in the Enduro processor. Two experiments were conducted to evaluate this algorithm. In Experiment 1 (N = 11), the performance of eVoice was compared with a binary-masking method in a speech recognition test and a subjective evaluation; eVoice performed slightly better than the binary-masking NRA. In Experiment 2 (N = 7), the more indicative adaptive SRT test was conducted to quantify the effect of eVoice on speech intelligibility. Comparing eVoice on and off, there was a 2.2-dB SRT benefit in stationary noise and no difference in quiet or in non-stationary noise.
eVoice has performance comparable with other single-channel NRAs implemented in CI strategies reported in the literature. For example, a single-channel NRA implemented in the ACE strategy was found to give an SRT benefit of up to 2.14 dB in stationary noise (Dawson et al., 2011). ClearVoice, implemented in the HiRes 120 strategy, used a time window of 1.3 s for noise estimation and yielded a percent correct score increase of up to 24 percentage points (Buechner et al., 2010). Given that, for typical speech materials, a 1-dB SRT decrease corresponds to a 7- to 19-percentage-point increase in percent correct (Moore, 2007), this may translate to a 1.3- to 3.4-dB SRT decrease. However, significant benefits in non-stationary noise are seldom reported in the literature, which may indicate a limit of traditional single-channel NRAs. More advanced techniques should be developed to improve speech perception in non-stationary noise for CI users.
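To make the SRT conversion above explicit, dividing the reported 24-point gain by the two ends of the 7- to 19-point-per-dB range gives the bounds:

$$\frac{24\ \text{points}}{19\ \text{points/dB}} \approx 1.3\ \text{dB}, \qquad \frac{24\ \text{points}}{7\ \text{points/dB}} \approx 3.4\ \text{dB}.$$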
This article is significant from the implantees’ and the audiologists’ perspectives. For a new system with a quickly growing number of users, this report on eVoice is useful for understanding the system and the new noise reduction method. For a new NRA in CIs, two questions are of great concern to users and audiologists: (1) whether the NRA really works in various types of noise and (2) to what extent users can benefit from it. Our results demonstrate that eVoice can improve speech intelligibility in stationary noise and does not affect speech perception in quiet or in non-stationary noise. This is because eVoice is a monaural SNR estimation–based algorithm that assumes the noise is relatively stationary compared with speech. Some users of the Enduro processor may not have noticed the existence of this NRA; their audiologists can advise or remind them to turn eVoice on to improve their speech perception in noise.
Another significant contribution of this article is to prompt a rethinking of noise management for CI systems. Researchers should reconsider assumptions such as the spatial separation of sources required by directional processing and exploit complex non-linear patterns that can be modeled computationally by signal processing or machine learning (e.g., Bianco et al., 2019; Gong et al., 2019). Previous studies and the present work provide considerable support for optimizing and updating noise-suppression techniques to improve speech-in-noise recognition for CI users.
Data Availability Statement
The datasets generated for this study are available on request to the corresponding author.
Ethics Statement
The studies involving human participants were reviewed and approved by Medical Ethics Committee of Shenzhen University. The patients/participants provided their written informed consent to participate in this study.
Author Contributions
All authors conceived and designed the analysis and wrote the manuscript. HZ and NW collected the data.
Funding
This work was supported by the National Natural Science Foundation of China (11704129, 11574090, and 61771320), Natural Science Foundation of Guangdong, China (2020A1515010386 and 2018B030311025), and Shenzhen Science and Innovation Funds (JCYJ 20170302145906843).
Conflict of Interest
Nurotron provided some compensation to the subjects and accommodation to HZ during Experiment 2. NW was employed by the company Nurotron Biotechnology Inc.
The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Acknowledgments
We are grateful to all the CI participants for their patience and cooperation during this study. We would like to thank Wenhe Tu, Sui Huang, Peiyao Wang, Carol Peng, Shanxian Gao, and Jiaqi Zhang from Nurotron for their help during data collection. We also thank the two reviewers for helpful comments.
Footnotes
- 1. Portions of this work were presented in “Implementation and evaluation of a single-channel noise reduction method in cochlear implants” at the 2017 Conference on Implantable Auditory Prostheses, Lake Tahoe, CA, United States, July 2017; in “Neural Interface: Frontiers and Applications,” Advances in Experimental Medicine and Biology, Volume 1101; and in “Speech intelligibility test of ‘eVoice’, a new noise-reduction algorithm in Nurotron Enduro systems” at the 2019 Asia Pacific Symposium on Cochlear Implants and Related Sciences, Tokyo, Japan, November 2019.
References
Berenstein, C. K., Mens, L. H., Mulder, J. J., and Vanpoucke, F. J. (2008). Current steering and current focusing in cochlear implants: comparison of monopolar, tripolar, and virtual channel electrode configurations. Ear. Hear. 29, 250–260.
Bianco, M. J., Gerstoft, P., Traer, J., Ozanich, E., Roch, M. A., and Gannot, S. (2019). Machine learning in acoustics: theory and applications. J. Acoust. Soc. Am. 146, 3590–3628.
Bonham, B. H., and Litvak, L. M. (2008). Current focusing and steering: modeling, physiology, and psychophysics. Hear. Res. 242, 141–153. doi: 10.1016/j.heares.2008.03.006
Buechner, A., Brendel, M., Saalfeld, H., Litvak, L., Frohne-Buechner, C., and Lenarz, T. (2010). Results of a pilot study with a signal enhancement algorithm for HiRes 120 cochlear implant users. Otol. Neurotol. 31, 1386–1390.
Buechner, A., Dyballa, K. H., Hehrmann, P., Fredelake, S., and Lenarz, T. (2014). Advanced beamformers for cochlear implant users: acute measurement of speech perception in challenging listening conditions. PLoS One 9:e95542. doi: 10.1371/journal.pone.0095542
Chen, F., Hu, Y., and Yuan, M. (2015). Evaluation of noise reduction methods for sentence recognition by mandarin-speaking cochlear implant listeners. Ear Hear. 36, 61–71.
Chung, K., Zeng, F. G., and Waltzman, S. (2004). Using hearing aid directional microphones and noise reduction algorithms to enhance cochlear implant performance. Acoust. Res. Lett. Onl. 5, 56–61.
Dawson, P. W., Mauger, S. J., and Hersbach, A. A. (2011). Clinical evaluation of signal-to-noise ratio–based noise reduction in nucleus cochlear implant recipients. Ear. Hear. 32, 382–390.
Dhanasingh, A., and Jolly, C. (2017). An overview of cochlear implant electrode array designs. Hear. Res. 356, 93–103. doi: 10.1016/j.heares.2017.10.005
Emily Fu Foundation. (2019). MSP. http://msp.emilyfufoundation.org/msp_about.html (accessed August 15, 2019).
Fu, Q. J., Zhu, M., and Wang, X. (2011). Development and validation of the mandarin speech perception test. J. Acoust. Soc. Am. 129, EL267–EL273. doi: 10.1121/1.3590739
Gao, N., Xu, X. D., Chi, F. L., Zeng, F. G., Fu, Q. J., Jia, X. H., et al. (2016). Objective and subjective evaluations of the nurotron venus cochlear implant system via animal experiments and clinical trials. Acta Otolaryngol. 136, 68–77.
Goehring, T., Keshavarzi, M., Carlyon, R. P., and Moore, B. (2019). Using recurrent neural networks to improve the perception of speech in non-stationary noise by people with cochlear implants. J. Acoust. Soc. Am. 146:705. doi: 10.1121/1.5119226
Gong, S., Wang, Z., Sun, T., Zhang, Y., Smith, C. D., Xu, L., et al. (2019). “Dilated FCN: Listening Longer to Hear Better,” in 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Piscataway, NJ: IEEE, 254–258.
Hagen, R., Radeloff, A., Stark, T., Anderson, I., Nopp, P., Aschbacher, E., et al. (2019). Microphone directionality and wind noise reduction enhance speech perception in users of the MED-EL SONNET audio processor. Cochlear Implants Int. 21, 53–65. doi: 10.1080/14670100.2019.1664529
Hersbach, A. A., Arora, K., Mauger, S. J., and Dawson, P. W. (2012). Combining directional microphone and single-channel noise reduction algorithms: a clinical evaluation in difficult listening conditions with cochlear implant users. Ear. Hear. 33, e13–e23. doi: 10.1097/aud.0b013e31824b9e21
Holden, L. K., Brenner, C., Reeder, R. M., and Firszt, J. B. (2013). Postlingual adult performance in noise with HiRes 120 and ClearVoice Low, Medium, and High. Cochlear Implants Int. 14, 276–286. doi: 10.1179/1754762813Y.0000000034
Hu, Y., Loizou, P. C., Li, N., and Kasturi, K. (2007). Use of a sigmoidal-shaped function for noise attenuation in cochlear implants. J. Acoust. Soc. Am. 122, EL128–EL134.
Jeschke, M., and Moser, T. (2015). Considering optogenetic stimulation for cochlear implants. Hear. Res. 322, 224–234.
Kam, A. C. S., Ng, I. H. Y., Cheng, M. M. Y., Wong, T. K. C., and Tong, M. C. F. (2012). Evaluation of the ClearVoice strategy in adults using HiResolution Fidelity 120 sound processing. Clin. Exp. Otorhinolaryngol. 5(Suppl. 1), S89–S92.
Kasturi, K., and Loizou, P. C. (2007). Use of S-shaped input-output functions for noise suppression in cochlear implants. Ear. Hear. 28, 402–411.
Kral, A., Dorman, M. F., and Wilson, B. S. (2019). Neuronal development of hearing and language: cochlear implants and critical periods. Annu. Rev. Neurosci. 42, 47–65.
Lai, Y. H., Tsao, Y., Lu, X., Chen, F., Su, Y. T., Chen, K. C., et al. (2018). Deep learning-based noise reduction approach to improve speech intelligibility for cochlear implant recipients. Ear. Hear. 39, 795–809.
Li, X., Nie, K., Imennov, N. S., Won, J. H., Drennan, W. R., Rubinstein, J. T., et al. (2012). Improved perception of speech in noise and Mandarin tones with acoustic simulations of harmonic coding for cochlear implants. J. Acoust. Soc. Am. 132, 3387–3398.
Loizou, P. C. (1999). Signal-processing techniques for cochlear implants. IEEE Eng. Med. Biol. Mag. 18, 34–46. doi: 10.1109/51.765187
Loizou, P. C. (2006). Speech processing in vocoder-centric cochlear implants. Adv. Otorhinolaryngol. 64, 109–143.
Loizou, P. C. (2007). Speech Enhancement, Theory and Practice (Illustrated ed.). Boca Raton, FL: CRC Press.
Martin, R. (2001). Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE T. Audio Speech. 9, 504–512.
Mauger, S. J., Arora, K., and Dawson, P. W. (2012a). Cochlear implant optimized noise reduction. J. Neural Eng. 9:065007. doi: 10.1088/1741-2560/9/6/065007
Mauger, S. J., Dawson, P. W., and Hersbach, A. A. (2012b). Perceptually optimized gain function for cochlear implant signal-to-noise ratio based noise reduction. J. Acoust. Soc. Am. 131, 327–336.
Mc Laughlin, M., Lu, T., Dimitrijevic, A., and Zeng, F. G. (2012). Towards a closed-loop cochlear implant system: application of embedded monitoring of peripheral and central neural activity. IEEE Trans. Neural Syst. Rehabil. Eng. 20, 443–454.
McAulay, R., and Malpass, M. (1980). Speech enhancement using a soft-decision noise suppression filter. IEEE Trans. signal Process. 28, 137–145.
Meng, Q., Wang, X., Cai, Y., Kong, F., Buck, A. N., Yu, G., et al. (2019). Time-compression thresholds for mandarin sentences in normal-hearing and cochlear implant listeners. Hear. Res. 374, 58–68.
Meng, Q., Zheng, N., and Li, X. (2016). Mandarin speech-in-noise and tone recognition using vocoder simulations of the temporal limits encoder for cochlear implants. J. Acoust. Soc. Am. 139, 301–310.
Middlebrooks, J. C., and Snyder, R. L. (2007). Auditory prosthesis with a penetrating nerve array. J. Assoc. Res. Otolaryngol. 8, 258–279.
Moberly, A. C., and Reed, J. (2019). Making sense of sentences: top-down processing of speech by adult cochlear implant users. J. Speech Lang. Hear. Res. 62, 2895–2905.
Nie, K., Stickney, G., and Zeng, F. (2005). Encoding frequency modulation to improve cochlear implant performance in noise. IEEE Trans. Biomed. Eng 52, 64–73. doi: 10.1109/tbme.2004.839799
Nogueira, W., Nagathil, A., and Martin, R. (2018). Making music more accessible for cochlear implant listeners: recent developments. IEEE Signal Proc. Mag. 36, 115–127.
Ping, L., Wang, N., Tang, G., Lu, T., Yin, L., Tu, W., et al. (2017). Implementation and preliminary evaluation of ‘C-tone’: a novel algorithm to improve lexical tone recognition in Mandarin-speaking cochlear implant users. Cochlear Implants Int. 18, 240–249.
Rangachari, S., and Loizou, P. C. (2006). A noise-estimation algorithm for highly non-stationary environments. Speech Commun. 48, 220–231.
Rebscher, S., Zhou, D. D., and Zeng, F. G. (2018). Development and clinical introduction of the nurotron cochlear implant electrode array. J. Int. Adv. Otol. 14, 392–400.
Ren, C., Yang, J., Zha, D., Lin, Y., Liu, H., Kong, Y., et al. (2018). Spoken word recognition in noise in Mandarin-speaking pediatric cochlear implant users. Int. J. Pediatr. Otorhinolaryngol. 113, 124–130. doi: 10.1016/j.ijporl.2018.07.039
Rubinstein, J. T. (2004). How cochlear implants encode speech. Curr. Opin. Otolaryngol. Head Neck Surg. 12, 444–448.
Shannon, R. V. (2014). “Adventures in Bionic Hearing,” in Perspectives on Auditory Research, Springer Handbook of Auditory Research, Vol. 50, eds A. Popper and R. Fay (New York, NY: Springer).
Tamati, T. N., Ray, C., Vasil, K. J., Pisoni, D. B., and Moberly, A. C. (2019). High- and low-performing adult cochlear implant users on high-variability sentence recognition: differences in auditory spectral resolution and neurocognitive functioning. J. Am. Acad. Audiol. doi: 10.3766/jaaa18106
Van Hoesel, R. (2007). Peak-derived timing stimulation strategy for a multi-channel cochlear implant. J. Acoust. Soc. Am. 123:4036. doi: 10.1121/1.2942439
Wang, D., and Hansen, J. (2018). Speech enhancement for cochlear implant recipients. J. Acoust. Soc. Am. 143:2244. doi: 10.1121/1.5031112
Wang, N., Tang, G., and Fu, Q. J. (2017). “Implementation and evaluation of a single-cannel noise reduction method in cochlear implants,” in 2017 Conference on Implantable Auditory Prostheses, Lake Tahoe, CA.
Wilson, B. S., Finley, C. C., Lawson, D. T., Wolford, R. D., Eddington, D. K., and Rabinowitz, W. M. (1991). Better speech recognition with cochlear implants. Nature 352, 236–238.
Wong, L. L., Soli, S. D., Liu, S., Han, N., and Huang, M. W. (2007). Development of the mandarin hearing in noise test (MHINT). Ear. Hear. 28, 70S–74S.
Wouters, J., McDermott, H. J., and Francart, T. (2015). Sound coding in cochlear implants: From electric pulses to hearing. IEEE Signal Proc. Mag. 32, 67–80.
Xi, X., Ching, T. Y., Ji, F., Zhao, Y., Li, J. N., Seymour, J., et al. (2012). Development of a corpus of mandarin sentences in babble with homogeneity optimized via psychometric evaluation. Int. J. Audiol. 51, 399–404. doi: 10.3109/14992027.2011.642011
Xu, Y., Luo, C., Zeng, F. G., Middlebrooks, J. C., Lin, H. W., and You, Z. (2018). Design, fabrication, and evaluation of a parylene thin-film electrode array for cochlear implants. IEEE Trans. Biomed. Eng. 66, 573–583. doi: 10.1109/tbme.2018.2850753
Yang, L. P., and Fu, Q. J. (2005). Spectral subtraction-based speech enhancement for cochlear implant patients in background noise. J. Acoust. Soc. Am. 117, 1001–1004.
Zeng, F. G., Rebscher, S., Harrison, W., Sun, X., and Feng, H. (2008). Cochlear implants: system design, integration, and evaluation. IEEE Rev. Biomed. Eng. 1, 115–142. doi: 10.1109/rbme.2008.2008250
Zeng, F. G., Rebscher, S. J., Fu, Q. J., Chen, H., Sun, X., Yin, L., et al. (2015). Development and evaluation of the Nurotron 26-electrode cochlear implant system. Hear. Res. 322, 188–199.
Keywords: cochlear implant, noise reduction, cocktail party problem, monaural, speech in noise, intelligibility, Nurotron, eVoice
Citation: Zhou H, Wang N, Zheng N, Yu G and Meng Q (2020) A New Approach for Noise Suppression in Cochlear Implants: A Single-Channel Noise Reduction Algorithm. Front. Neurosci. 14:301. doi: 10.3389/fnins.2020.00301
Received: 30 November 2019; Accepted: 16 March 2020;
Published: 21 April 2020.
Edited by: Yuki Hayashida, Osaka University, Japan
Reviewed by: Sebastián Ausili, University of Miami, United States; Li Xu, Ohio University, United States
Copyright © 2020 Zhou, Wang, Zheng, Yu and Meng. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Guangzheng Yu, scgzyu@scut.edu.cn; Qinglin Meng, mengqinglin@scut.edu.cn