AUTHOR=Jakubowski Kelly, Eerola Tuomas, Alborno Paolo, Volpe Gualtiero, Camurri Antonio, Clayton Martin TITLE=Extracting Coarse Body Movements from Video in Music Performance: A Comparison of Automated Computer Vision Techniques with Motion Capture Data JOURNAL=Frontiers in Digital Humanities VOLUME=4 YEAR=2017 URL=https://www.frontiersin.org/journals/digital-humanities/articles/10.3389/fdigh.2017.00009 DOI=10.3389/fdigh.2017.00009 ISSN=2297-2668 ABSTRACT=

The measurement and tracking of body movement within musical performances can provide valuable sources of data for studying interpersonal interaction and coordination between musicians. The continued development of tools to extract such data from video recordings will offer new opportunities to research musical movement across a diverse range of settings, including field research and other ecological contexts in which the implementation of complex motion capture (MoCap) systems is not feasible or affordable. Such work might also make use of the multitude of video recordings of musical performances that are already available to researchers. This study made use of such existing data, specifically three video datasets of ensemble performances from different genres, settings, and instrumentation (a pop piano duo, three jazz duos, and a string quartet). Three computer vision techniques (frame differencing, optical flow, and kernelized correlation filters [KCF]) were applied to these video datasets with the aim of quantifying and tracking movements of the individual performers. All three computer vision techniques exhibited high correlations with MoCap data collected from the same musical performances, with median correlation (Pearson’s r) values of 0.75–0.94. The techniques that track movement in two dimensions (optical flow and KCF) provided more accurate measures of movement than a technique that provides a single estimate of overall movement change by frame for each performer (frame differencing). Measurements of performers’ movements were also more accurate when the computer vision techniques were applied to more narrowly defined regions of interest (head) than when the same techniques were applied to larger regions (entire upper body, above the chest, or waist).
Some differences in movement tracking accuracy emerged between the three video datasets, which may have been due to instrument-specific motions that resulted in occlusions of the body part of interest (e.g., a violinist’s right hand occluding the head while tracking head movement). These results indicate that computer vision techniques can be effective in quantifying body movement from videos of musical performances, while also highlighting constraints that must be dealt with when applying such techniques in ensemble coordination research.
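
As a rough illustration of the simplest of the three techniques, the sketch below computes a frame-differencing movement series inside a rectangular region of interest and correlates it with a reference series using Pearson's r. It is a minimal sketch under stated assumptions, not the authors' implementation: frames are assumed to be grayscale images stored as nested lists, the region of interest is assumed to be a fixed rectangle, and the function names (`frame_difference`, `movement_series`, `pearson_r`) are hypothetical.

```python
from math import sqrt


def frame_difference(prev, curr, roi):
    """Sum of absolute pixel changes between two frames inside a region.

    roi is an assumed (top, left, bottom, right) rectangle, with the
    bottom and right edges exclusive.
    """
    top, left, bottom, right = roi
    return sum(
        abs(curr[r][c] - prev[r][c])
        for r in range(top, bottom)
        for c in range(left, right)
    )


def movement_series(frames, roi):
    """One movement estimate per consecutive pair of frames."""
    return [frame_difference(a, b, roi) for a, b in zip(frames, frames[1:])]


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)
```

Narrowing `roi` to a smaller rectangle (for example, around the head) restricts the estimate to changes in that region, which mirrors the region-of-interest comparison described in the abstract. Note that frame differencing yields a single magnitude per frame pair, whereas optical flow and KCF track position in two dimensions.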