AUTHOR=Kumalija Elhard, Nakamoto Yukikazu TITLE=Performance evaluation of automatic speech recognition systems on integrated noise-network distorted speech JOURNAL=Frontiers in Signal Processing VOLUME=2 YEAR=2022 URL=https://www.frontiersin.org/journals/signal-processing/articles/10.3389/frsip.2022.999457 DOI=10.3389/frsip.2022.999457 ISSN=2673-8198 ABSTRACT=

In VoIP applications, such as Interactive Voice Response and VoIP-phone conversation transcription, speech signals are degraded not only by environmental noise but also by transmission network quality and by distortions induced by encoding and decoding algorithms. Therefore, automatic speech recognition (ASR) systems need to handle integrated noise-network distorted speech. In this study, we present a comparative analysis of a speech-to-text system trained on clean speech against one trained on integrated noise-network distorted speech. Training an ASR model on a noise-network distorted speech dataset improves its robustness. Although the performance of an ASR model trained on clean speech depends on noise type, this is not the case when noise is further distorted by network transmission. The model trained on noise-network distorted speech exhibited a 60% improvement in word error rate (WER), match error rate (MER), and word information lost (WIL) over the model trained on clean speech. Furthermore, the ASR model trained on noise-network distorted speech could tolerate a jitter of less than 20% and a packet loss of less than 15% without a decrease in performance. However, WER, MER, and WIL increased in proportion to jitter and packet loss once they exceeded 20% and 15%, respectively. Additionally, the model trained on noise-network distorted speech exhibited higher robustness than that trained on clean speech. The ASR model trained on noise-network distorted speech can also tolerate signal-to-noise ratio (SNR) values of 5 dB and above without loss of performance, independent of noise type.
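The abstract reports results in terms of WER, MER, and WIL. These metrics are not defined in the record itself; the sketch below shows their conventional definitions (Morris et al., 2004), computed from a word-level Levenshtein alignment that counts hits (H), substitutions (S), deletions (D), and insertions (I). The function names are illustrative, not from the paper.

```python
def edit_ops(ref_words, hyp_words):
    """Word-level Levenshtein alignment; returns (hits, subs, dels, ins)."""
    n, m = len(ref_words), len(hyp_words)
    # dp[i][j] = minimum edits turning ref_words[:i] into hyp_words[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrace to count the four operation types
    i, j = n, m
    hits = subs = dels = ins = 0
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])):
            if ref_words[i - 1] == hyp_words[j - 1]:
                hits += 1
            else:
                subs += 1
            i -= 1
            j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return hits, subs, dels, ins


def wer_mer_wil(reference, hypothesis):
    """Standard definitions:
    WER = (S + D + I) / (H + S + D)          -- normalized by reference length
    MER = (S + D + I) / (H + S + D + I)      -- match error rate, bounded by 1
    WIL = 1 - H^2 / ((H + S + D)(H + S + I)) -- word information lost
    """
    h, s, d, i = edit_ops(reference.split(), hypothesis.split())
    n_ref = h + s + d
    wer = (s + d + i) / n_ref
    mer = (s + d + i) / (h + s + d + i)
    wil = 1.0 - (h * h) / (n_ref * (h + s + i)) if h else 1.0
    return wer, mer, wil
```

Note that MER and WIL stay in [0, 1] even with many insertions, whereas WER can exceed 1; this is why robustness studies often report all three together.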