
ORIGINAL RESEARCH article

Front. Signal Process.
Sec. Audio and Acoustic Signal Processing
Volume 4 - 2024 | doi: 10.3389/frsip.2024.1440401
This article is part of the Research Topic "Informed Acoustic Source Separation and Extraction".

New Insights on the Role of Auxiliary Information in Target Speaker Extraction

Provisionally accepted
  • International Audio Laboratories Erlangen (AudioLabs), Erlangen, Germany

The final, formatted version of the article will be published soon.

    Speaker extraction (SE) aims to isolate the speech of a target speaker from a mixture of interfering speakers with the help of auxiliary information. Several forms of auxiliary information have been employed in single-channel SE, such as an enrollment speech snippet from the target speaker or visual information corresponding to the spoken utterance. The effectiveness of the auxiliary information in SE is typically evaluated by comparing the extraction performance of SE systems with that of uninformed speech separation (SS) methods. Following this evaluation procedure, many SE studies have reported improved performance compared to SS and attributed the improvement to the auxiliary information. However, recent advances in deep neural network architectures, which have shown remarkable performance for SS, suggest an opportunity to revisit this conclusion. In this paper, we examine the role of auxiliary information in SE across multiple datasets and various input conditions. Specifically, we compare the performance of two SE systems (audio-based and video-based) with SS using a unified framework based on the widely used dual-path recurrent neural network (DPRNN) architecture. Experimental evaluation on these datasets demonstrates that the use of auxiliary information in the considered SE systems does not always lead to better extraction performance compared to the uninformed SS system. Furthermore, we offer new insights into how SE systems select the target speaker by analyzing their behavior when provided with different or distorted auxiliary information for the same mixture input.
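    To make the contrast between informed SE and uninformed SS concrete, the following is a minimal, hypothetical PyTorch sketch of the general conditioning idea the abstract describes: a target-speaker enrollment embedding modulates the mixture features before mask estimation. It is not the authors' DPRNN-based framework; the layer sizes, the BLSTM backbone, and the multiplicative conditioning are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn


class ConditionedMaskEstimator(nn.Module):
    """Toy single-channel target-speaker extractor.

    A BLSTM mask estimator whose input features are scaled by a projection of
    the target-speaker enrollment embedding (multiplicative conditioning).
    Removing the conditioning path would turn it into an uninformed separator.
    This is a sketch, not the DPRNN framework used in the paper.
    """

    def __init__(self, n_freq=257, emb_dim=128, hidden=256):
        super().__init__()
        self.spk_proj = nn.Linear(emb_dim, n_freq)  # embedding -> per-bin scale
        self.blstm = nn.LSTM(n_freq, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.mask = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_mag, spk_emb):
        # mix_mag: (batch, time, n_freq) magnitude spectrogram of the mixture
        # spk_emb: (batch, emb_dim) enrollment embedding of the target speaker
        scale = torch.sigmoid(self.spk_proj(spk_emb)).unsqueeze(1)  # (B, 1, F)
        conditioned = mix_mag * scale      # inform the network about the target
        h, _ = self.blstm(conditioned)
        return mix_mag * self.mask(h)      # estimated target magnitude


if __name__ == "__main__":
    # Example usage with random tensors (hypothetical shapes).
    model = ConditionedMaskEstimator()
    mix = torch.rand(2, 100, 257)   # two mixtures, 100 frames, 257 bins
    emb = torch.randn(2, 128)       # two enrollment embeddings
    print(model(mix, emb).shape)    # torch.Size([2, 100, 257])
```

    Under this reading, the paper's comparison amounts to asking whether such a conditioning path actually helps once the separation backbone itself is strong enough.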

    Keywords: speaker extraction, speaker separation, deep learning, auxiliary information, single-channel

    Received: 29 May 2024; Accepted: 21 Oct 2024.

    Copyright: © 2024 Elminshawi, Mack, Chetupalli, Chakrabarty and Habets. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence: Mohamed Elminshawi, International Audio Laboratories Erlangen (AudioLabs), Erlangen, Germany

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.