AUTHOR=Loakes Debbie

TITLE=Automatic speech recognition and the transcription of indistinct forensic audio: how do the new generation of systems fare?

JOURNAL=Frontiers in Communication

VOLUME=Volume 9 - 2024

YEAR=2024

URL=https://www.frontiersin.org/journals/communication/articles/10.3389/fcomm.2024.1281407

DOI=10.3389/fcomm.2024.1281407

ISSN=2297-900X

ABSTRACT=This paper provides an update on Loakes (2022), which aimed to demonstrate how automatic speech recognition (ASR) systems work with indistinct forensic-like audio, in comparison with good-quality audio. Since that time, there has been rapid technological advancement, with newer systems having access to extremely large language models and having their performance proclaimed as being humanlike in accuracy. This study compares various ASR systems, including OpenAI's Whisper, to continue to test how well automatic speech recognition works with forensic-like audio. Results show that transcription of a good-quality audio file is at ceiling for some systems, with no errors. For the poor-quality (forensic-like) audio, Whisper was the best performing system, but had only 50% of the entire speech material correct. Results for the poor-quality audio were also generally variable across the systems, with differences depending on whether a .wav or .mp3 file was used, and differences between earlier and later versions of the same system. Additionally, and against expectations, Whisper showed a drop in performance over a two-month period. While more material was transcribed in the later attempt, more was also incorrect. This study concludes that forensic-like audio is not suitable for automatic analysis. … speech that has been captured, typically in a covert (secret) recording obtained as part of a criminal investigation, and is later used as evidence in a trial. Such recordings provide powerful evidence, allowing the court to hear speakers making admissions they would not make openly. One problem, however, is that the audio is often extremely indistinct, to the extent of being unintelligible without the assistance of a transcript.