ORIGINAL RESEARCH article

Front. Robot. AI
Sec. Robot Vision and Artificial Perception
Volume 11 - 2024 | doi: 10.3389/frobt.2024.1490718
This article is part of the Research Topic Emerging Technologies in Surveillance: Novel Approaches to Video Anomaly Detection

Embedding-based Pair Generation for Contrastive Representation Learning in Audio-visual Surveillance Data

Provisionally accepted
  • Internet Technology and Data Science Laboratory, Faculty of Engineering and Architecture, Ghent University, Ghent, Belgium

The final, formatted version of the article will be published soon.

    Smart cities deploy various sensors, such as microphones and RGB cameras, to collect data to improve the safety and comfort of citizens. As data annotation is expensive, self-supervised methods such as contrastive learning are used to learn audio-visual representations for downstream tasks. Focusing on surveillance data, we investigate two common limitations of audio-visual contrastive learning: false negatives and the minimal sufficient information bottleneck. Irregular yet frequently recurring events can lead to a considerable number of false-negative pairs and disrupt the model's training. To tackle this challenge, we propose a novel method for generating contrastive pairs based on the distance between embeddings of different modalities, rather than relying solely on temporal cues. The semantically synchronized pairs can then be used to ease the minimal sufficient information bottleneck, together with a new loss function for multiple positives. We experimentally validate our approach on real-world data and show how the learnt representations can be used for different downstream tasks, including audio-visual event localization, anomaly detection, and event search. Our approach reaches performance similar to that of state-of-the-art modality- and task-specific approaches.
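The core idea described above can be illustrated with a minimal sketch: instead of pairing an audio clip only with its temporally co-occurring video frame, positives are selected by cosine distance between cross-modal embeddings, and an InfoNCE-style loss is averaged over each anchor's multiple positives. All function names, the choice of cosine similarity, and the loss form below are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize embeddings so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def embedding_based_positives(audio_emb, visual_emb, k=2):
    """For each audio embedding, pick the k most similar visual
    embeddings as positives (embedding distance, not temporal cues).
    Hypothetical helper; k and the similarity measure are assumptions."""
    sim = l2_normalize(audio_emb) @ l2_normalize(visual_emb).T  # (N_a, N_v)
    return np.argsort(-sim, axis=1)[:, :k]  # indices of k nearest per anchor

def multi_positive_nce(audio_emb, visual_emb, positives, temperature=0.1):
    """InfoNCE-style contrastive loss averaged over each anchor's
    positive set, a common way to handle multiple positives."""
    logits = (l2_normalize(audio_emb) @ l2_normalize(visual_emb).T) / temperature
    # Row-wise log-softmax over all visual candidates.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    losses = [-log_prob[i, pos].mean() for i, pos in enumerate(positives)]
    return float(np.mean(losses))

# Usage with random stand-in embeddings (8 clips, 16-dim features):
rng = np.random.default_rng(0)
audio, visual = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
pairs = embedding_based_positives(audio, visual, k=2)
loss = multi_positive_nce(audio, visual, pairs)
```

With identical modalities, the nearest positive for each anchor is itself, which is a quick sanity check that the pairing is semantic rather than index-based.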

    Keywords: self-supervised learning, surveillance, audio-visual representation learning, contrastive learning, audio-visual event localization, anomaly detection, event search

    Received: 03 Sep 2024; Accepted: 09 Dec 2024.

    Copyright: © 2024 Wang, De Coninck, Leroux and Simoens. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence: Wei-Cheng Wang, Internet Technology and Data Science Laboratory, Faculty of Engineering and Architecture, Ghent University, Ghent, Belgium

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.