AUTHOR=Loakes Debbie TITLE=Automatic speech recognition and the transcription of indistinct forensic audio: how do the new generation of systems fare? JOURNAL=Frontiers in Communication VOLUME=9 YEAR=2024 URL=https://www.frontiersin.org/journals/communication/articles/10.3389/fcomm.2024.1281407 DOI=10.3389/fcomm.2024.1281407 ISSN=2297-900X ABSTRACT=
This study provides an update on an earlier study in the “Capturing Talk” research topic, which aimed to demonstrate how automatic speech recognition (ASR) systems work with indistinct forensic-like audio, in comparison with good-quality audio. Since that time, there has been rapid technological advancement, with newer systems having access to extremely large language models and their performance being proclaimed as human-like in accuracy. This study compares various ASR systems, including OpenAI’s Whisper, to continue to test how well automatic speech recognition works with forensic-like audio. The results show that the transcription of a good-quality audio file is at ceiling for some systems, with no errors. For the poor-quality (forensic-like) audio, Whisper was the best-performing system but transcribed only 50% of the entire speech material correctly. The results for the poor-quality audio were also generally variable across the systems, with differences depending on whether a .wav or .mp3 file was used and differences between earlier and later versions of the same system. Additionally, and against expectations, Whisper showed a drop in performance over a 2-month period: while more material was transcribed in the later attempt, more was also incorrect. This study concludes that forensic-like audio is not suitable for automatic analysis.