- Department of Art, Music, and Theatre Sciences, IPEM, Ghent University, Ghent, Belgium
Virtual reality (VR) brings radical new possibilities to the empirical study of social music cognition and interaction. In the present article, we consider the role of VR as a research tool, based on its potential to create a sense of “social presence”: the illusory feeling of being, and socially interacting, inside a virtual environment. This makes VR promising for bridging ecological validity (“research in the wild”) and experimental control (“research in the lab”) in empirical music research. A critical assumption, however, is the actual ability of VR to simulate real-life social interactions, either via human-embodied avatars or computer-controlled agents. Mediating social musical interactions via VR is particularly challenging due to their embodied, complex, and emotionally delicate nature. In this article, we introduce a methodological framework that operationalizes social presence through a combination of factors across interrelated layers relating to performance output, embodied co-regulation, and subjective experience. This framework provides the basis for a pragmatic approach to determining the level of social presence in virtual musical interactions: comparing the outcomes across the multiple layers with the outcomes of corresponding real-life musical interactions. We applied and tested this pragmatic approach in a case-study of piano duet performances of the piece Piano Phase, composed by Steve Reich. The case-study indicated that a piano duet performed in VR, in which the real-time interaction between pianists is mediated by embodied avatars, may lead to a strong feeling of social presence, as reflected in the measures of performance output, embodied co-regulation, and subjective experience. In contrast, although a piano duet in VR between an actual pianist and a computer-controlled agent led to a relatively successful performance output, it fell short in terms of both embodied co-regulation and subjective experience.
Introduction
Virtual reality (VR) encompasses a plethora of technologies to create new environments, or simulate existing ones, via computer-generated multisensory displays (Sutherland, 1968; Taylor, 1997; Scarfe and Glennerster, 2019). Complementary to multisensory displays are technologies for capturing physical body movement to facilitate embodied control and interactions with(in) computer-generated (virtual) environments (Yang et al., 2019). VR technologies hence provide a technological mediation between performed actions and multisensory perceptions, extending the natural sensorimotor capacities of humans into the digital world (Kornelsen, 1991; Biocca and Delaney, 1995; Ijsselsteijn and Riva, 2003). Crucially, however, VR typically aims at making its mediation invisible, creating for users the illusory feeling of nonmediation; a feeling coined with the concept of “presence” (Lombard and Ditton, 1997; Riva, 2006). This concept of presence may encompass multiple categories, related to the physical environment, the user’s own body, and the user’s social environment. The first category – physical presence or telepresence – pertains to the illusory feeling for users of actually being present in an environment other than the one they are physically in (Minsky, 1980; Sheridan, 1992; Slater and Sanchez-Vives, 2016). A second category, self-presence, is rooted in the capacity of VR to map the physical body movements of a user onto the moving body of a virtual avatar. Embodying a virtual avatar can create the illusory feeling for users of owning, controlling, and being inside a body other than their physical one (Kilteni et al., 2012; De Oliveira et al., 2016; Braun et al., 2018; Matamala-Gomez et al., 2020). In addition, (bodily) acting within a virtual social context may create a sense of being together (co-presence), or of interacting with others (social presence), while actually being physically remote (Short et al., 1976; Garau et al., 2005; Parsons et al., 2017; Oh et al., 2018).
In the current paper, we advocate for establishing VR as a research tool for studying social music interaction and sense-making. We see the relevance of VR precisely in its capacity to create a sense of presence across the different categories described above. In the first part of the article, we discuss in more detail how a VR-based approach has its roots in earlier, human-centered research across a broad range of scientific disciplines, and how it holds potential for empirical research on social music cognition and interaction.
In the remainder of the article, we focus on a fundamental prerequisite for establishing the advocated VR-based research method; namely, the idea that VR can actually create a sense of social presence. This is particularly challenging given that music provides a highly particular context of human social interaction. It involves the body as a source of expressive and intentional communication between co-performers, carried out through a fine-tuned and skillful co-regulation of bodily articulations (Leman, 2008; Leman and Maes, 2015). This co-regulation of bodily articulations is a complex process, involving many body parts, and taking place across multiple, hierarchically organized spatial and temporal scales (Eerola et al., 2018; Hilt et al., 2020). Successful co-regulation hence requires that action-relevant information at multiple scales is properly exchanged through the different sensory modalities. In particular, the auditory and visual senses are important in signaling communicative cues related to music-structural aspects and emotional expression (Williamon and Davidson, 2002; Goebl and Palmer, 2009; Keller and Appel, 2010; Coorevits et al., 2020). This complex, embodied nature of music interaction places considerable demands on communication technologies that aim at mediating social music interactions digitally. VR, however, is in principle highly promising, as it makes it possible to animate full-body, three-dimensional virtual humans based on real-time mapping of body movements of actual people captured by motion capture systems (avatars) or based on computer-modeling and simulation of human behavior (agents) (Cipresso, 2015). These animated, three-dimensional avatars and agents can be observed by others from a freely chosen and dynamic first-person perspective, providing a foundation for the complex information exchanges required for successful music interactions. This turns a VR environment into a potential digital meeting space where people located in physically distinct places, together with computer-generated virtual humans, can interact musically with one another. However, it is crucial to further assess the quality of social interactions with virtual avatars and agents (Kothgassner et al., 2017) and to ensure the required levels of realism and social presence.
For that purpose, a main objective of this article is to introduce a methodological framework to assess social presence in virtual music interactions. We thereby consider social presence as a multi-layered concept rooted in, and emerging from, the behavioral and experiential dynamics of music interactions. We assess these dynamics by integrating direct data measurement related to performance output, body movement, and (neuro)physiological activity with subjective self-report measures. As such, the framework facilitates the design of interactions, avatars, and agents to obtain empirical data and investigate aspects of the subjective experience such as empathy, intimacy, and togetherness. In the second part of the paper, we apply the framework to a case-study of a social music interaction in VR. The case-study presents real and virtual interactions between two expert pianists, between a pianist and an avatar, and between a pianist and an agent, and demonstrates similarities and differences revealed across the framework’s layers. Finally, we conclude the paper with a discussion on the relevance of our framework, using insights from the case-study’s analysis, and present directions for future work.
VR: A Research Tool for Studying Social Interactions
Around the turn of the 21st century, VR started to develop as a valuable methodological tool in human-centered research, including the social and cognitive (neuro)sciences (Biocca, 1992; Biocca and Delaney, 1995; Blascovich et al., 2002; Fox et al., 2009; Parsons, 2015; Slater and Sanchez-Vives, 2016; Parsons et al., 2017; Pan and de Hamilton, 2018), philosophy (Metzinger, 2018), the humanities (Cruz-Neira, 2003), product design (Berg and Vance, 2017), marketing (Alcañiz et al., 2019), medicine (Riva et al., 2019), and healthcare (Teo et al., 2016). Although VR-based research spans a rich variety of approaches and disciplines, the relevance of VR in human-centered research can, in general terms, be captured by two specific traits; namely, the ability of VR to simulate existing, “real-life” contexts (simulation trait) and its ability to extend human functions or to create new environments and contexts (extension trait).
The simulation trait of VR relates to an inherent paradox in traditional approaches to empirical research. To obtain valid results and insights, the researcher is motivated to observe phenomena “in the wild” without interventions. However, this approach allows little control over stimuli, often has to cope with confounding variables, and makes reliable measurement challenging. At the other extreme, the researcher performs experiments in a controlled lab setting to obtain generalizable results. This approach, however, is often overly reductionistic and not ecologically valid. VR technology makes it possible to bridge these extremes by simulating real-life settings in a controlled environment. In that sense, VR holds the potential of bringing the external validity (“research in the wild”) and internal validity (“research in the lab”) of social (music) cognition research closer together (Parsons, 2015; Kothgassner and Felnhofer, 2020). It can be understood as an alternative empirical research paradigm (Blascovich et al., 2002), offering substantial additional benefits over traditional research practices in laboratory or field conditions. The use of VR allows precise control over multimodal, dynamic, and context-rich stimuli (Parsons et al., 2017) while retaining the level of realism required for realistic responses. Despite the need for technological expertise, research practices using VR are becoming more accessible and standardized and can thus provide representative sampling and better replicability (Blascovich et al., 2002). Given the digital nature of VR contexts and the sensorimotor sensors they require, VR technology also offers flexibility in how data are recorded.
A second trait can be related to McLuhan’s (1964) understanding of technology as an extension of the human body, mind, and biological functions. This view resonated in early accounts of VR pointing to its ability to create sensorimotor and social experiences not possible or desirable in the actual physical world. Accordingly, VR was defined in terms of a “medium for the extension of body and mind” (Biocca and Delaney, 1995), creating “realities within realities” (Heim, 1994) or “shared/consensual hallucinations” (Gibson, 1984; Lanier, 1988) “bounded […] only by desire and imagination” (Benedikt, 1991). Important to note is that, in most current human-centered research, this ability of VR is seldom employed as a form of mere escapism from the physical world. In contrast, VR is mostly used to “make us intensely aware of what it is to be human in the physical world, which we take for granted now because we are so immersed in it” (Lanier, 1988). Accordingly, the use of “impossible stimuli” and illusions generated in VR has contributed substantially to a better understanding of profound aspects of human embodied cognition and social interaction (Parsons et al., 2017; Metzinger, 2018). For instance, VR technology is capable of selectively modulating our perception of space (Glennerster et al., 2006), time (Friedman et al., 2014), (social) cognition (Tarr et al., 2018), and the body (Petkova and Ehrsson, 2008). It has the potential to influence different representational layers of the human self-model (Metzinger, 2018), leading to phenomena such as virtual embodiment, (virtual) body swapping (Petkova and Ehrsson, 2008; De Oliveira et al., 2016), and increasingly frequent and complex “social hallucinations” (Metzinger, 2018).
Given these traits and their potential, the use of VR in music research has increased over the past decade (Çamci and Hamilton, 2020). A first category of studies primarily leveraged the simulation trait of VR. They created real-life virtual settings in which to investigate various topics, such as music therapy (Orman, 2004; Bissonnette et al., 2016), music education (Orman et al., 2017; Serafin et al., 2017), music performance (Williamon et al., 2014; Glowinski et al., 2015), and the relation between sound and presence (Västfjäll, 2003; Kern and Ellermeier, 2020; Kobayashi et al., 2020). A good example of simulating a real-life setting is given by Glowinski et al. (2015), who investigated the influence of social context on performance. Specifically, Glowinski and colleagues asked participants to perform a musical task in a virtual concert hall while controlling for audience gaze. Other studies focused more on extending real-life contexts. They range from the search for new virtual instruments (Honigman et al., 2013; Berthaut et al., 2014; Serafin et al., 2016; Hamilton, 2019) to new interactions and the development of interaction design principles (Deacon et al., 2016; Atherton and Wang, 2020). While we made a clear conceptual distinction between the simulation and extension traits, this distinction is less clear-cut in practice. An effective research paradigm has been to simulate a musical scenario in VR and subsequently extend some human function, for example by modulating the feeling of body ownership through virtual embodiment, to investigate behavioral changes (Kilteni et al., 2013).
A critical requirement, however, for using VR as a research tool in the study of social music interaction is the ability to establish social presence: the illusory feeling of actually being together and interacting meaningfully with human-embodied avatars or computer-controlled agents in VR. Research on social presence may contribute to social music cognition and interaction in two important ways. First, referring back to the quote by Lanier (1988), social music interaction in VR forces researchers to think about, and develop knowledge on, the general nature of human social cognition and sense-making, “which we take for granted now because we are so immersed in it.” Second, under the condition that social presence can be reliably established, it becomes possible to accurately control and manipulate the many variables that characterize a music interaction, including the context in which the interaction occurs. For instance, it becomes possible to control the perspective that people have on one another, the distance at which they are positioned, the sensory coupling between people, the appearance of people (for example, facial expression, age, and gender), environmental properties, and the actual musical behavior and bodily performance of VR agents (for example, timing and quantity of motion), among other variables. This offers almost limitless possibilities to extend the empirical investigation of the principles of social music interaction and sense-making within (simulated) ecologically valid music environments. In the following section, we describe the methodological framework that we propose to define, measure, and test social presence in VR music interaction contexts.
A Methodological Framework to Assess Social Presence in VR
Most research so far has relied on self-report questionnaires to assess the subjective feeling of social presence (Cui, 2013; Oh et al., 2018). The mere use of subjective ratings, however, poses important limitations, as these provide only indirect and post hoc measures of presence, lack subtlety, and are often unstable and biased (Cui, 2013). In the current article, we propose an alternative, pragmatic approach, considering social presence as emerging from the performative, behavioral, and experiential dynamics inherent to the social interaction. This allows the assessment of social presence using a combination of qualitative, performer-informed methods and quantitative measures of the performance, behavior, and (neuro)physiological responses of users, by operationalizing them into concrete, direct, and measurable variables. Crucially, in this pragmatic approach, we define the level of social presence as the extent to which social behavior and responses in simulated VR contexts resemble behavior and responses in corresponding real-life musical contexts (Minsky, 1980; Slater et al., 2009; Johanson, 2015; Scarfe and Glennerster, 2019).
To allow comparison between virtual and real-life scenarios, we rooted our framework for social presence in “interaction theory,” currently the dominant theory in the social sciences for understanding social cognition and sense-making (Gallagher, 2001; De Jaegher and Di Paolo, 2007; Kiverstein, 2011; Froese and Fuchs, 2012; Gallotti and Frith, 2013; Schilbach et al., 2013; Fogel, 2017). Proponents of interaction theory consider social cognition essentially as an embodied and participatory practice, emerging in real-time co-regulated interaction and not reducible to individual processes. In line with this account, we consider successful co-regulation as a foundational criterion for establishing social presence in VR. Importantly, in our framework, we consider social co-regulation both from the viewpoint of the quantifiable bodily and multisensory patterns of interpersonal interaction and from the viewpoint of the intersubjective experience and participatory sense-making (De Jaegher and Di Paolo, 2007). Together with the actual musical outcome, these two interrelated aspects of social co-regulation form the three main layers of our framework to determine the degree of social presence in VR music contexts. The layers of the framework are shown in Figure 1.
Figure 1. Overview of the methodological framework to operationalize social presence in virtual reality (VR) music contexts. The core of the framework consists of a comparative analysis of a simulated virtual context, with the corresponding real-life music context (which functions as “ground truth”) across three interrelated layers; performance output, embodied co-regulation, and subjective experience. (RQA = Recurrence quantification analysis)
Layer 1: Performance Output
The performance output layer relates to the (un/successful) realization of musical ideas or goals, which may be strictly prescribed in musical scores, loosely agreed upon, or emerge in the performance act itself, depending on the performance type and context. Music performance analysis has been advanced by research and development in the domain of music information retrieval, providing ample techniques and methods for assessing music performance properties (Lerch et al., 2020). These are typically extracted from audio recordings, although other multimodal signals such as body movement are increasingly being used. Further, we advocate for taking into account time-varying features related to timing, synchronization, (joint) multiscale recurrence patterns, and complexity measures, as these may signal the quality of the performance output. These quantitative measures should ideally be complemented with qualitative, performer-informed methods to reliably interpret the outcome measures. They include subjective evaluations in the form of aesthetic judgments of the performance output by the performers themselves.
Layer 2: Embodied Co-regulation
A successful musical output relies on a skillful, joint coordination of co-performers’ actions and sounds. In line with the interaction theory on social cognition described above, we consider social music interaction as a dynamic and continuously unfolding process of co-regulation, in which performers mutually adjust to one another in a complex interplay of action and multimodal perception. This process of co-regulation integrates various levels and mechanisms of control, ranging from low-level spontaneous coordination based on dynamical principles (Kelso, 1995; Tognoli et al., 2020), to higher-level learning, predictive processing, and active inference (Sebanz and Knoblich, 2009; Gallagher and Allen, 2018; Koban et al., 2019). In our proposed methodological framework, we specifically aim at capturing patterns, relationships, and recurrences in the process of co-regulation at the level of the interacting system as a whole. We thereby advocate for the integration of time-series analyses from the domain of dynamical systems theory, as these are ideally suited to unveil dynamic patterns of interpersonal coordination across multiple body parts and temporal and spatial scales (Eerola et al., 2018; Hilt et al., 2020). Patterns can then be found at multiple levels: in the attention dynamics derived from a participant’s gaze direction, in expressive gestures carrying communicative cues such as head nods, and in the structure of full-body movements resulting from body sway synchronization. The ability to quantify bodily and multisensory patterns of co-regulated interaction between music performers in VR-mediated music contexts is foundational in our approach, as in our view, successful co-regulation is a decisive factor in performers’ feelings of social presence.
Layer 3: Subjective Experience
This layer deals with the subjectively experienced interaction qualities and sense-making processes of individuals. It combines quantitative and qualitative methods to link mental states, (expressive) intentions, and meaning attributions to observations in the other layers. The quantitative methods include the analysis of (neuro)physiological signals, as these can give access to low-dimensional aspects of the conscious experience. For instance, electromyography (Ekman, 1992) and pupillometry (Laeng et al., 2012), among others, have proven to provide valid markers of cognitive and affective user states in virtual performance contexts, such as attention and workload measures, vigilance, affect, and flow (Schmidt et al., 2019). Heart rate and skin conductance (Meehan et al., 2002), as well as electroencephalography (Baumgartner et al., 2006), are good candidates as they are capable of directly assessing the feeling of presence in virtual performance contexts. As a complement, from a more qualitative and performer-oriented point of view, one can ask participants for time-varying ratings of their intentional (joint-)actions and expressions through audio-video stimulated recall (Caruso et al., 2016). In addition, via self-report questionnaires, one can probe for mental states such as social presence, flow, and feelings of togetherness (Witmer and Singer, 1998; Lessiter et al., 2001; Martin and Jackson, 2008). Finally, because of the multi-layered nature of social presence, open questions and semi-structured interviews focused on the individual experience of the virtual other can help to fill analysis gaps and interpret quantitative findings across the different layers.
Operationalization Within a Performance Setting
Layered frameworks have been helpful in earlier research for structuring the investigation of music interactions (Camurri et al., 2001; Leman, 2008). The layered framework presented here extends these approaches with a focus on the complementary nature of a mixed-method approach and on time-varying aspects, viewing the musical interaction as consisting of multiple interdependent parts. Layers are functionally coupled by non-linear relations, allowing patterns to emerge in the time-varying dynamics of each layer. They frame and couple the dynamics of quantitative bodily and multisensory coordination patterns with the (inter)subjectively felt qualities of the music interaction. The framework aims to serve as a template to map this landscape of time-varying dynamics and to aid in uncovering insightful states and transitions in a broad range of social interactions. It allows the investigation of social presence in different performance settings and distinguishes interactions using a specific operationalization of qualitative and quantitative methods in each layer. These operationalizations can vary from simple setups, with for example audio and video recordings and annotations in the performance, co-regulation, or subjective layer, to the more complex setups that will be presented in the case-study below. One performance setting might have time-delay or phase as the variable of interest in the performance layer, while another might focus on frequency. Some settings will require the observation of neurophysiological signals, while others might focus on movement data or self-reporting. Direct assessment is preferred over post-experimental reports to avoid the influence of self-referential cognitive processes or interference in the interaction. Applying the framework then makes it possible to identify interactions with, for example, close coordination and intense subjective experiences but nevertheless inferior performance, such as when two tennis players struggle to sustain long rallies but nevertheless experience heightened attention and synchronized movements. Other interactions can have successful performance outcomes, but fail to create fertile metastable dynamics (Kelso, 1995; Tognoli et al., 2020) in other layers. Examples can be found in the interactions between human users and current state-of-the-art artificial intelligence in video games, humanoid robots, or virtual assistants that lack successful (embodied) co-regulation and dynamic intentional relations.
A Case-Study: Piano Phase
“Piano Phase” (1967) is a composition for two pianos, written by minimalist composer Steve Reich, which applies his phase-shifting technique as the structuring principle of the composition. The piece was chosen for the case-study because it provides an excellent musical case for assessing social presence in VR music performance across the different layers of our proposed methodological framework. First, the performance output, the instructed phase shifts throughout time, can be objectively assessed and compared across different performances. Second, the piece requires skilled co-regulation between pianists for a successful performance. And third, as Reich himself acknowledges, the performance of the piece has profound psychological aspects, related to sensuous-intellectual engagement and strong (inter)subjective experiences of heightened attention, absorption, and even ecstasy.
Research Question
The case-study was meant to empirically evaluate and test our pragmatic approach to the definition and measurement of social presence in real-time VR music performances, based on the proposed methodological framework. For that purpose, we designed different performance contexts that enabled us to compare performances of Piano Phase in VR with a corresponding (ground truth) performance of Piano Phase under normal, “real-life” conditions (see Design section). In all conditions, we captured an elaborate set of quantitative data related to the experience, behavior, and performance of the pianist duo. In addition, we complemented these data with qualitative methods to integrate experiences and intentions from a performer point of view. Based on these quantitative and qualitative data, and guided by the layered analysis model inherent to the proposed methodological framework, we could then conduct comparative analyses across the performance contexts to evaluate the social presence the test subject experienced in VR.
Participants
The case-study involved three expert pianists: one test subject and two research confederates. The experimental protocol was reviewed and approved by the ethical commission of Ghent University. All pianists had over 10 years of professional music experience. The test subject (female, 38 years, hereafter “Test pianist”) had not performed the piece Piano Phase before and had no prior VR experience. A second pianist (male, 32 years, hereafter “Confederate pianist 1”) functioned as research confederate in the first two conditions and had concert experience performing Piano Phase. Finally, the third pianist (female, 44 years, hereafter “Confederate pianist 2”) was another research confederate in the third condition and a co-author of this study. She had no experience performing the piece but did have experience with VR.
Design
The experiment consisted of three conditions, as presented in the schematic overview in Figure 2. Conditions are placed along a virtuality continuum, represented by the arrow in the figure (Milgram et al., 1995), and correspond from left to right to an unmodeled, partially modeled, and fully modeled world. In each condition, the Test pianist performed Piano Phase together with a research confederate while wearing a head-mounted display (HMD) (Confederate pianist 1 in the first two conditions and Confederate pianist 2 in the third condition). The fundamental distinction between the three conditions was the level of behavioral realism of the confederate partner as perceived by the Test pianist:
1. Human condition (ground truth): the Test pianist and Confederate pianist 1 performed Piano Phase under normal, “real-life” concert conditions. The Test pianist visually perceived Confederate pianist 1 in a natural, physical manner. To match the two other performance conditions, we asked the Test pianist to wear an HMD with the pass-through camera activated. This was done to prevent the constraints of the HMD from acting as a confounding factor, while maintaining a normal, physical exchange of auditory and visual information between the Test pianist and Confederate pianist.
2. Avatar condition: the Test pianist and Confederate pianist 1 performed together in real time, but the Test pianist visually perceived Confederate pianist 1 as a human-embodied virtual avatar. The Test pianist was disconnected from visual information coming from the physical environment, and all visual information related to the virtual room, piano keyboard, hands, and the co-performing Confederate pianist 1 was provided to the Test pianist via the HMD. Full body movements and musical instrument digital interface (MIDI) piano performance of the virtual avatar were streamed in real time from the performance of Confederate pianist 1.
3. Agent condition: the Test pianist performed together with a computer-controlled virtual agent. The Test pianist was disconnected from all direct visual information, as in the Avatar condition. Full body movements and MIDI piano performance of the virtual agent were rooted in pre-recorded time-series data of an actual pianist (Confederate pianist 2), who was asked to perform the same role as that of the virtual agent (see below, Task). For the virtual agent animation, we used the Kuramoto model to automatically phase-align these pre-recorded time-series data, and the accompanying audiovisual VR animation, to the real-time performance of the Test pianist (a minimal sketch of this phase-alignment idea is given below, after Figure 2). This made it possible to accurately control the phase of the musical part of the virtual agent with respect to the Test pianist and hence to perform the piece dynamically as prescribed by composer Steve Reich. Apart from the virtual agent, all other display factors were the same as in the Avatar condition.
Figure 2. Conditions in the case-study (Confederate pianist 1 plays in the Human and Avatar condition, a recording of Confederate pianist 2 plays in the Agent condition).
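To make the Kuramoto-based phase alignment concrete, the following sketch shows one way such a follower could be implemented. It is an illustrative reconstruction under our own assumptions (the class, parameter names, and control loop are hypothetical; the study’s actual implementation may differ), showing the core idea: a single oscillator whose phase indexes the agent’s pre-recorded data and is continuously pulled toward the live pianist’s beat phase plus the phase offset prescribed by the score.

```python
import numpy as np

class KuramotoFollower:
    """Minimal, hypothetical sketch of a Kuramoto-style phase follower.

    One oscillator's phase is used to index the agent's pre-recorded
    movement and MIDI time-series, while being continuously attracted
    toward the live pianist's beat phase plus a prescribed offset.
    """

    def __init__(self, bpm=72.0, coupling=0.8):
        self.omega = 2.0 * np.pi * bpm / 60.0  # natural frequency (rad/s)
        self.K = coupling                      # coupling strength (1/s)
        self.theta = 0.0                       # agent phase (rad)

    def step(self, live_phase, target_offset, dt):
        """Advance the agent phase by one control step of length dt.

        live_phase: beat phase estimated from the Test pianist's onsets.
        target_offset: prescribed phase lead of the agent; one 16th note
        within the 12-note cycle corresponds to 2*pi/12.
        """
        # Kuramoto update: dtheta/dt = omega + K * sin(phase error)
        error = (live_phase + target_offset) - self.theta
        self.theta += dt * (self.omega + self.K * np.sin(error))
        return self.theta  # used to index the pre-recorded time-series
```

With a zero `target_offset`, such a follower locks to the live pianist in unison; stepping `target_offset` through multiples of 2*pi/12 would reproduce the score’s successive phase relations.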
For more information on the display methods, see the Materials and Apparatus section below. In the Analysis and Results sections, we will use the abbreviations HC, AvC, and AgC to refer to results from the Human, Avatar, and Agent conditions, respectively.
Materials and Apparatus
Piano Keyboards
The piano keyboards used for the performances were digital MIDI controllers. The Test pianist and Confederate pianist 1 played on a Yamaha P60, and Confederate pianist 2 played on a Roland RD700SX. MIDI information was processed in Ableton Live 10 to generate piano sounds using a Native Instruments Kontakt 6 plugin. Speakers were placed underneath each piano keyboard to ensure coherent sound source localization throughout the different performances.
Performance Setting and Virtual Simulation Displays
The full experiment took place in the Art and Science Interaction Lab of IPEM, a 10 m-by-10 m-by-7 m (height) space surrounded by black curtains that resembles a realistic performance space. The two piano keyboards were placed facing each other at an angle of about 60° so the pianists could turn toward each other (see Figure 3). The avatar and agent were based on full body movement captures using the Qualisys system described below. Motion capture data were processed in Unity. An important consideration in the study was to also simulate the hands of the Test pianist, as earlier research indicated that this may substantially increase the feeling of (self-)presence (Banakou and Slater, 2014). We used the Leap Motion system for that purpose, which made it possible to track and display fine finger movements.
Figure 3. The Test pianist’s view in each condition, as seen through her head-mounted display.
Data Measurement Setup
Multiple technologies were used to capture and measure bodily behavior and performance aspects of the pianists as quantitative time-series data. Concerning the performance, we recorded MIDI data from the piano keyboards, including MIDI note numbers, note-on/off times, and note velocities. Delays from piano keypresses to audio equaled 16 ± 5 ms. We captured full body movements of all pianists (3D position, 120 Hz) using a multi-camera Qualisys optical motion capture system (OQUS 7+ cameras). Real-time streaming of full-body movement data of Confederate pianist 1 to the visualization in the HMD had a latency of 54 ± 11 ms. In addition, video was recorded using a four-camera Qualisys Miqus system. Further, we captured how the Test pianist distributed her body weight on the chair using four pressure sensors mounted underneath the four legs of the piano chair. Finally, we tracked the eye movements of the Test pianist using the Tobii eye-tracking technology built into the HTC Vive Pro Eye HMD. An overview of the technical set-up is shown in Figure 4.
Task and Procedure
The task for the pianist duos in each condition was to perform Piano Phase as prescribed by the composer Steve Reich. The compositional idea of Piano Phase is to start in unison, after which intermittent, gradual tempo changes cause increasing phase shifts between the melodic patterns played by each pianist. These shifts, in their dynamically varying interlocking, lead to the emergence of a variety of harmonies over the course of the performance until the pianists are back in unison. The first bars of the piece are shown in Figure 5, and a didactic sketch of the resulting phase schedule is given below the figure. Reich’s compositional instruction is as follows: “The first pianist starts at 1 and the second joins him in unison at 2. The second pianist increases his tempo very slightly and begins to move ahead of the first until (say 30–60 s) he is one sixteenth ahead, as shown at 3. The dotted lines indicate this gradual movement of the second pianist and the consequent shift of phase relation between himself and the first pianist. This process is continued with the second pianist gradually becoming an eighth (4), a dotted eighth (5), a quarter (6), etc., ahead of the first until he finally passes through all 12 relations and comes back into unison at 14 again” (Reich, 2002). In the performance, the pianist assigned the top part of the score keeps a constant tempo, while the pianist assigned the bottom part performs the gradual phase-shift by gradually increasing his/her tempo. In our study, the Test pianist was always assigned the top part of the score, while Confederate pianists 1 and 2 were assigned the bottom part in the Avatar and Agent condition, respectively. We fixed the number of repetitions for each bar at eight for better experimental control and kept the tempo at 72 BPM (one beat per six 16th notes, i.e., a dotted quarter note).
Figure 5. Annotated first 11 bars of Piano Phase (1967) by Steve Reich (blue = melodic pattern, green = phase difference, and red = tempo instruction; reprinted with kind permission by Universal Edition AG, Vienna).
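The phase logic of the score can be made concrete with a short, purely didactic sketch. It generates an idealized onset schedule for the accelerating (bottom) part under simplifying assumptions of our own (eight repetitions per bar, a linear tempo squeeze within each shifting bar rather than Reich’s gradual acceleration, and the familiar 12-note pattern); it models the score’s instruction, not the actual performances.

```python
import numpy as np

PATTERN = ['E4', 'F#4', 'B4', 'C#5', 'D5', 'F#4',
           'E4', 'C#5', 'B4', 'F#4', 'D5', 'C#5']  # the 12-note pattern

def idealized_bottom_onsets(bpm=72.0, reps_per_bar=8, n_shifts=12):
    """Idealized onset times for the accelerating part: each 'shifting'
    bar is compressed so the part ends one 16th note ahead of where it
    started, passing through all 12 phase relations back to unison."""
    sixteenth = 60.0 / bpm / 6.0        # dotted-quarter beat = six 16ths
    onsets, t = [], 0.0
    for shift in range(n_shifts + 1):
        # Stable bar: reps_per_bar repetitions at the base tempo.
        for _ in range(reps_per_bar * len(PATTERN)):
            onsets.append(t)
            t += sixteenth
        # Shifting bar: the same notes squeezed to gain one 16th note.
        if shift < n_shifts:
            n_notes = reps_per_bar * len(PATTERN)
            squeezed = (n_notes * sixteenth - sixteenth) / n_notes
            for _ in range(n_notes):
                onsets.append(t)
                t += squeezed
    return np.asarray(onsets)
```

With these settings, the schedule yields 13 stable and 12 shifting bars, matching the 25 measures of (de)synchronization referred to in the Results section.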
A month before the experiment, the Test pianist was asked to prepare for a performance of the musical composition. The Test pianist received an audio recording of her part with isochronous notes, uniform velocities, and linear accelerations to help with practicing. Upon arrival, the Test pianist was told that the experiment consisted of a preparation phase followed by three repetitions of the full piece. She was then given a questionnaire and afterwards asked to change into a motion capture suit. A motion-capture skeleton was built using recordings of her walking and playing the piano, after which she was asked to calibrate the HMD’s eye-tracking.
After the explanations, the Test pianist practiced the task with Confederate pianist 1 for about 15 min without wearing the HMD. When both pianists indicated they were ready for the performance, the Test pianist was given the HMD to get accustomed to the virtual environment, after which they performed the three conditions. A questionnaire and a break were given after each condition. The Test pianist was not told that the agent in the Agent condition was computer-controlled. The experiment concluded with a semi-structured interview about the experience. Five days after the experiment, the Test pianist was asked to listen to and evaluate randomized audio recordings of each condition, as well as to fill in a final questionnaire.
Analysis
This subsection describes how the methodological framework was applied to the case-study data. It presents the quantitative and qualitative methods used in each layer to obtain the insights discussed in the Results section below.
Layer 1: Performance Output
In this layer, we investigated whether the pianists succeeded in executing the compositional instruction. In Piano Phase, both pianists repeat a 12-note pattern, with one pianist accelerating at specific moments for a specific period. This should result in an alternation of stable periods, characterized by a consistent relative phase relationship between the note patterns of both pianists, and intermittent periods, characterized by gradual shifts toward an increased relative phase of the note patterns.
First, we used the note onsets of both pianists to determine tempo, inter-onset-intervals (IOIs), and the relative phase between pianists. One full phase cycle was defined as 12 notes. Using the relative phase, we calculated the synchronization index (SI) as a measure of stable periods characterized by a consistent relative phase between pianists (Mardia and Jupp, 2009). An SI of 1 represents perfect synchronization and 0 no synchronization.
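The following sketch illustrates how these quantities can be computed from onset times. It is a minimal reconstruction under our own assumptions (function names and the window length are illustrative; the study’s exact implementation is not shown here): the relative phase is the position of one pianist’s onsets within the other’s 12-note cycle, and the SI is the length of the windowed mean resultant vector of that phase.

```python
import numpy as np

def relative_phase(onsets_a, onsets_b, cycle=12):
    """Relative phase (radians) of pianist B's note onsets within
    pianist A's 12-note pattern cycle. onsets_* are monotonically
    increasing onset times (s) extracted from the MIDI recordings."""
    # Fractional note index of each B onset between consecutive A onsets,
    idx = np.interp(onsets_b, onsets_a, np.arange(len(onsets_a)))
    # expressed as an angle over one 12-note cycle.
    return 2.0 * np.pi * (idx % cycle) / cycle

def synchronization_index(phases, window=48):
    """Windowed SI: length of the mean resultant vector of the relative
    phase (cf. Mardia and Jupp, 2009); 1 = perfectly consistent relative
    phase (synchronization), 0 = no consistency."""
    z = np.exp(1j * np.asarray(phases))
    kernel = np.ones(window) / window
    return np.abs(np.convolve(z, kernel, mode='same'))
```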
Next, we looked at musical structure using time-dependent joint Recurrence Quantification Analysis (RQA) of the relative phase. RQA is a non-linear technique for the assessment of dynamical systems; it makes it possible to identify transitions and behavior of a system by analyzing patterns of recurrences in a low-dimensional phase space reconstructed from potentially higher-dimensional time series (Marwan et al., 2007). Since the relative phase between pianists represents the driver of the musical composition, RQA on a phase space of relative phase allows the assessment of transitions and dynamics in the musical performance. RQA metrics, namely the recurrence rate (RR), determinism (DET), and trapping time (TT), were calculated to measure, respectively, the percentage of recurrences, the percentage of recurrences that form stable structures, and the average length of stable recurrences. We used an embedding dimension of 4 and a time-delay of 0.3 s for the joint RQA. The minimal diagonal length for calculating DET was set to 0.3 s. Joint RQA parameters were obtained by looking at extrema of mutual information and false-nearest-neighbor metrics (Marwan et al., 2007). The radius was set to 0.55 to produce around 10% of recurrence across conditions.
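As a reference for these metrics, the sketch below shows basic RQA on a single series in plain NumPy. It is a simplified illustration, not the toolbox actually used: it keeps the line of identity (which full implementations typically exclude), and a joint RQA, as applied here, would additionally multiply the recurrence matrices of the two series element-wise before extracting the measures. The default delay and minimal line length assume 100 Hz sampling, an assumption for illustration only.

```python
import numpy as np

def embed(x, dim, delay):
    """Time-delay embedding of a 1-D series into a dim-D phase space."""
    n = len(x) - (dim - 1) * delay
    return np.column_stack([x[i * delay:i * delay + n] for i in range(dim)])

def line_lengths(b):
    """Lengths of consecutive runs of True in a boolean vector."""
    lengths, run = [], 0
    for v in b:
        if v:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths

def rqa_measures(x, dim=4, delay=30, radius=0.55, lmin=30):
    """Recurrence rate (RR), determinism (DET), and trapping time (TT)."""
    X = embed(np.asarray(x, dtype=float), dim, delay)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    R = dist < radius                       # thresholded recurrence matrix
    rr = R.mean()                           # RR: percentage of recurrences

    diag = []                               # diagonal structures -> DET
    for k in range(-(len(R) - 1), len(R)):
        diag += line_lengths(np.diagonal(R, offset=k))
    diag = np.asarray(diag)
    det = diag[diag >= lmin].sum() / diag.sum() if diag.size else 0.0

    vert = []                               # vertical structures -> TT
    for col in R.T:
        vert += line_lengths(col)
    vert = np.asarray(vert)
    tt = vert[vert >= lmin].mean() if (vert >= lmin).any() else 0.0
    return rr, det, tt
```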
Finally, we complemented this quantitative data analysis with subjective evaluations. Five days after the experiment, the Test pianist received six randomized 15-s audio recordings from each condition. She was asked to score each recording on a scale from 0 to 10 on expressiveness (dynamics, accents), timing (rhythm, tempo), interaction quality (collaboration), and appreciation (engaging, positive). In addition, she was asked to leave general remarks for each recording.
Layer 2: Embodied Co-regulation
In this layer, we assessed movement, behavior, and (expressive) intentions of the pianists and the means through which pianists actually achieved a successful execution of the musical score. When recording the stimuli for the experiment, pianists used head nods to communicate successful transitions and divided the tasks of counting repetitions and measures among themselves. Communication and co-regulation between pianists played an essential role for a successful performance and realization of the compositional idea behind Piano Phase.
For that purpose, we recorded head movements as 3D spatiotemporal series to obtain communicative cues and signals of mutual understanding in the piano performance (Castellano et al., 2008). To detect correlated frequencies and their phase angles, we performed a wavelet coherence analysis on the series’ main principal component.
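A simplified version of this analysis chain (first principal component, then wavelet coherence) can be sketched as follows. This is an illustrative approximation under stated assumptions, not the study’s exact pipeline: the cross- and auto-spectra of complex Morlet transforms are smoothed with a 1-s moving average along time only, whereas full wavelet-coherence implementations also smooth across scales.

```python
import numpy as np
import pywt
from scipy.ndimage import uniform_filter1d

def principal_component(positions):
    """First principal-component score of an (n_frames, 3) head-position
    series, obtained via SVD of the centered data."""
    Xc = positions - positions.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[0]

def wavelet_coherence(x, y, fs, freqs, wavelet='cmor1.5-1.0'):
    """Simplified magnitude-squared wavelet coherence and relative phase
    between two 1-D series, evaluated at the frequencies in `freqs`."""
    dt = 1.0 / fs
    scales = pywt.central_frequency(wavelet) / (freqs * dt)
    Wx, _ = pywt.cwt(x, scales, wavelet, sampling_period=dt)
    Wy, _ = pywt.cwt(y, scales, wavelet, sampling_period=dt)

    def smooth(a):  # 1-s moving average along time (real/imag separately)
        if np.iscomplexobj(a):
            return smooth(a.real) + 1j * smooth(a.imag)
        return uniform_filter1d(a, size=int(fs), axis=-1, mode='nearest')

    Sxy = smooth(Wx * np.conj(Wy))
    coh = np.abs(Sxy) ** 2 / (smooth(np.abs(Wx) ** 2) * smooth(np.abs(Wy) ** 2))
    return coh, np.angle(Sxy)  # coherence and relative phase angles
```

For the head-movement analysis, `x` and `y` would be the first principal components of the two pianists’ head trajectories, with `freqs` spanning, for example, the (0.9, 1.5) Hz band around the beat rate.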
We complemented this analysis of dynamics with annotations of specific, expressive gestures in the performance. Concretely, we identified head nods between pianists using the ELAN software (Sloetjes and Wittenburg, 2008) to see whether communicative cues at transitional moments remained consistent across the different performances.
As a measure of coupling and attention toward the other, we recorded the Test pianist’s gaze direction and calculated the angle between it and the direction toward the confederate’s head position.
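This angle reduces to straightforward vector geometry, sketched below with illustrative array names (the actual coordinate conventions of the eye tracker and motion capture system are not reproduced here): 0° means the Test pianist is looking straight at the partner’s head.

```python
import numpy as np

def gaze_angle_deg(gaze_origin, gaze_dir, partner_head):
    """Frame-wise angle (degrees) between the gaze direction and the
    direction from the eyes to the partner's head.

    All arrays have shape (n_frames, 3) in a shared world frame.
    """
    to_partner = partner_head - gaze_origin
    to_partner /= np.linalg.norm(to_partner, axis=1, keepdims=True)
    g = gaze_dir / np.linalg.norm(gaze_dir, axis=1, keepdims=True)
    # Clip guards against rounding just outside [-1, 1] before arccos.
    cos = np.clip(np.einsum('ij,ij->i', g, to_partner), -1.0, 1.0)
    return np.degrees(np.arccos(cos))
```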
We measured postural sway time series using the pressure sensors in the piano chair. We summed these series and used a normalized, unthresholded recurrence plot to look at stable periods and identify transitional moments in the performance (Marwan et al., 2007); a sketch of this computation is given below. The recurrence plot had an embedding dimension of 5 and a time-delay of 0.35 s, which were defined using mutual information and false-nearest-neighbor metrics. Radii were set to produce around 10% recurrences for each condition [(radius, RR%) = (0.065, 10.66) for HC, (0.185, 10.63) for AvC, and (0.104, 10.55) for AgC].
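An unthresholded recurrence plot is simply the (normalized) phase-space distance matrix; the sketch below reuses the `embed` helper from the RQA sketch above, with the delay expressed in samples under an assumed sampling rate.

```python
import numpy as np

def unthresholded_rp(sway, dim=5, delay=42):
    """Normalized, unthresholded recurrence plot of the summed pressure-
    sensor series; a delay of 42 samples corresponds to 0.35 s assuming
    120 Hz sampling. The result is plotted as an image (0 = identical
    states, 1 = maximally distant states)."""
    X = embed(np.asarray(sway, dtype=float), dim, delay)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return dist / dist.max()
```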
Layer 3: Subjective Experience
In this layer, we looked at physiological signals and self-reported scores as windows into the Test pianist’s experience during and after the interaction. The immersive tendencies questionnaire (Witmer and Singer, 1998) was administered before the experiment. Self-reported scores were obtained using the flow short scale (Martin and Jackson, 2008), the presence questionnaire (Witmer et al., 2005), and three custom questions about the overall interaction (“Did you enjoy the interaction,” “How close did you feel to your musical partner,” and “How natural did you experience the interaction with your partner”). Additional presence items as proposed by Slater and Lessiter were included in the presence questionnaire as well (Witmer et al., 2005). A semi-structured interview about the overall experience was conducted at the conclusion of the experiment. Together with the randomized audio recordings from each condition, we sent a general question asking her to describe the experience in each condition.
We recorded pupil dilatation using the built-in eye-tracking functionality in the HMD as an estimate of the intensity of mental activity and of changes in attention or arousal (Laeng et al., 2012).
Results
This section presents the results of analyzing the case-study data. We evaluated each layer of the methodological framework in each condition, with results shown in Figures 6–8. Given that only one dyad was observed in all conditions, the results are descriptive and meant to provoke the reflections presented in the Discussion section below.
Figure 6. Analysis of the human condition across the three interrelated layers from the proposed methodological framework.
Figure 7. Analysis of the avatar condition across the three interrelated layers from the proposed methodological framework.
Figure 8. Analysis of the agent condition across the three interrelated layers from the proposed methodological framework.
Layer 1: Performance Output
The score had a tempo indication of 72 BPM and the participants performed it slightly faster [mean (SD) tempo: 74.73 (4.44) for HC, 75.90 (4.62) for AvC, and 74.10 (5.67) for AgC, in BPM]. IOI variability was comparable across conditions, with a higher variability for the Agent condition [mean (SD) IOI variability: 7.47 (3.63) for HC, 7.26 (3.41) for AvC, and 12.55 (2.73) for AgC, in ms]. As the agent was programmed to be attracted toward 72 BPM, the slightly higher tempo of the performance made its tempo corrections larger when accelerating and synchronizing. Across conditions, however, the Agent condition’s tempo was closest to the instructed 72 BPM.
Bars in the composition represented stable or accelerating tempos with, respectively, constant or shifting relative-phase periods in the performance. These periods and their associated transitions were determined by thresholding the synchronization index, as indicated by the gray areas in Figures 6–8. We set the synchronization index threshold at 0.99, which allowed the discrimination of 25 measures in the Human condition. This choice was motivated by the fact that the Human condition was taken as the ground truth and the composition prescribed 25 measures of (de)synchronization. The Agent condition had a more variable bar-length distribution, with extremes of a long period of synchronization at 240 s and a turbulent moment of successive (de)synchronization around 280 s. Interestingly, a longer period of synchronization is found at the same relative phase in the Human condition as well. This relative phase, and the resulting harmony, might have represented an attractor for the pianists that was easy to accelerate into but difficult to accelerate out of.
Overall, increasing relative phase and a fluctuating synchronization index were present across conditions, as shown in Figures 6B–8B. The Agent condition did have turbulent moments at the beginning and in the middle of the performance, as well as a sudden transition at the end. These moments resulted from the delay in tempo tracking for the virtual agent, an erroneous note in the stimuli, and the Test pianist jumping to synchronization with the agent toward the end.
Joint RQA showed alternating periods of (de)synchronization (Figures 6C–8C) and comparable average values for RR, DET, and TT across conditions. The Agent condition did contain more variability, especially during the earlier identified turbulent moments in the performance. The Human condition had a slightly higher average DET value indicating more stable synchronization. Fluctuations in DET values from (de)synchronizing were slightly larger in the Avatar condition compared to the Human condition.
The Test pianist’s scores on expressiveness, technical content, and interaction quality of each audio excerpt are indicated in Figures 6A–8A (mean scores: HC = 7.33; AvC = 7.17; AgC = 5.5). The Agent condition received lower scores, although the differences with the other conditions seem less pronounced than those in the self-reported presence scores presented in layer 3 below. Three excerpts from the Agent condition did receive the remark “I do not understand the intention of the pianist.”
In conclusion, performances in each condition were executed relatively well, with stable tempos and fluctuating synchronization and RQA measures as instructed by the score. Increased DET fluctuations in the Avatar condition pointed to a good performance, as they reflect clearer alternation between moments of (de)synchronization. Subjective scores by the Test pianist were good for the Human and Avatar conditions and just above average for the Agent condition. The analysis of the performance layer thus showed a good execution of the composition in the Human and Avatar conditions, and more trouble performing successfully in the Agent condition. The underlying reasons might reside in the embodied dynamics of coordination and communication discussed in the next layer.
Layer 2: Embodied Co-regulation
The normalized, principal component of the head-movement time series of both pianists is shown in Figures 6E–8E. Wavelet coherence on these time series is shown in Figures 6G–8G. These plots show maxima at multiples of half the average tempo of the performance (72 BPM, or 1.2 Hz), with a less outspoken pattern for the Agent condition. As the score had two beats per measure, this shows that the pianists synchronized their head movements coherently with the musical structure. Head movement in the (0.9, 1.5) Hz band had a flat distribution of relative phase angles between pianists across conditions [(mean, resultant vector length) = (73°, 0.105) for HC, (−140°, 0.127) for AvC, and (82°, 0.070) for AgC]. Phase shifts in the music might have transferred to head movements, suggesting that the pianists were mainly keeping time for themselves.
Annotations of head nodding between pianists are shown in Figures 6F–8F. The Human condition contained cues from Confederate pianist 1 toward the end; the Avatar condition contained several synchronized head nods between pianists; and the Agent condition showed an absence of communication cues from the Test pianist toward the virtual agent. Synchronized head nodding in the Avatar condition also went together with a closely coupled gaze and regular RQA measures. This close coordination was felt by the Test pianist, who commented after the experiment that in the Agent condition she missed the “posture mirroring” of the other pianist that was present in the Avatar condition.
Next, Figures 6D–8D show differences between conditions in the gaze angle between pianists over time. In the Human condition, the Test pianist mainly looked forward, as the other pianist sat at an angle of 60°. Figure 6D shows a marked decrease of the gaze angle at 300 s in the Human condition, just before a longer musical synchronization period. This transition is followed by a head nod of Confederate pianist 1, showing how attention shifted toward the Test pianist. The Avatar condition also had an important transition early in the performance, after which the Test pianist kept her gaze directed toward Confederate pianist 1. This attention shift resulted in closer coordination, as illustrated by synchronized head nods. In the Agent condition, the pianists demonstrated less co-regulation, with a gaze angle between pianists that never reached 0°. The Test pianist had five moments of looking forward (around 70° at 5, 65, 220, 265, and 315 s), one moment of looking away from the virtual agent (100° at 170 s), and looked slightly inclined toward the virtual agent for the majority of the performance (around 40°).
Figures 6H–8H show recurrence plots of the postural sway of the Test pianist in each condition. At first sight, one can see a phase transition in the Human condition at 300 s. This moment showed a decrease in gaze angle and the start of a musically synchronized period. The postural sway shows that the Test pianist adapted her posture at that time for the remainder of the performance. In addition, one can see clusters of rectangular recurrence regions that, given a certain delay, correspond relatively well to the different bars. Recurrence values for the Agent condition are regular but smaller on average compared to the other conditions. This finding indicates an even spread of recurrences within each point’s radius, or a uniform noise component in the phase-space trajectories.
Layer 3: Subjective Experience
For the qualitative aspects, we focused on the subjective experience of the Test pianist after the interaction. The analysis goals were to evaluate how the Test pianist experienced each performance, both globally and at specific moments.
Global scores on the interaction from the custom questions showed satisfactory enjoyment for the Human and Avatar conditions [(HC, AvC, AgC) = (7, 7, 4)], the most natural interaction in the Human condition [(HC, AvC, AgC) = (4, 1, 1)], and, interestingly, higher scores for experienced closeness to the musical partner in the Avatar condition [(HC, AvC, AgC) = (4, 6, 1)]. The Test pianist commented about the Avatar condition how “the VR environment added solely an interesting, fun element that was almost discarded when the actual playing took place. Because I had a good interaction with my partner, the feeling was very close to the one of a performance that happens in real conditions. The fact that the other pianist was responsive to me was enough to convince me that the situation was real and made me enjoy it thoroughly.” The Agent condition had the lowest scores on enjoyment, closeness, and naturalness. Comments of the Test pianist revealed frustration caused by a non-intentional virtual agent: “It took me about one to two minutes to realize that my virtual partner was, in fact, not present. … My main focus was on trying to understand the intentions of something that was quite obviously not going to follow mine. … as opposed to the second condition, where the fact that I felt the presence of a real person made me connect immediately to an image that was obviously not real, in the third condition I felt almost repulsed by the visual element.”
The immersive tendencies questionnaire did not detect anomalies in the Test pianist’s profile (involvement = 6.38, focus = 6.57, games = 5.00). Presence and flow scores are shown in Figure 9. Flow scores were relatively close across conditions (means: HC = 6.00, AvC = 5.89, AgC = 5.67), with slightly less challenge-skill matching and autotelic experience in the Agent condition. Presence scores were comparable across conditions as well, with the Agent condition having higher scores on interface quality and involvement and lower scores on immersion. Additional questions in the presence questionnaire (Witmer et al., 2005) had comparable scores across conditions for the “sense of being-there” (HC = 6, AvC = 7, AgC = 7), low scores in “spatial presence” for the Avatar condition (HC = 6, AvC = 1, AgC = 7), and low scores in “similarity to a real place” for the Agent condition (HC = 3, AvC = 5, AgC = 1).
Figure 9. Presence and flow scores of the Test pianist in all conditions (scores ranging from 1 = Strongly disagree to 7 = Strongly agree for the flow questionnaire and from 1 = Not at all to 7 = Completely for the presence questionnaire).
Pupil dilatations of the Test pianist are shown in Figures 6I–8I. Extracting meaningful insights or performing comparisons across conditions remained too challenging, due to a technical issue in the Human condition (missing data for 2.5 min) and differing light conditions (the pass-through camera in the Human condition vs. the rendered virtual environment in the Avatar and Agent conditions). We do report the normalized data in the figures as an example of a quantitative, physiological method for measuring mental activity in our proposed methodological framework.
Discussion
Creating a shared context is a delicate process that emerges out of the on-going process of participatory sense-making between closely coupled and coordinated individuals (De Jaegher and Di Paolo, 2007). Once established, second-person information in the interaction will have characteristics such as self-directedness, contingency, reciprocity, affective engagement, and shared intentional relations (Moore and Barresi, 2017). This might have taken place in the Avatar condition, as it shows an excellent musical performance with close coordination between pianists. The pianists mirror each other’s posture, move, and de-phase together in time, with joint actions like synchronized head nods at specific moments in the score. It seems as if both pianists have the interpersonal coordination necessary to co-create the musical structure and perform music as a group, anticipating and adapting to each other effectively (D’Ausilio et al., 2015; Walton et al., 2018b). Head-movement analyses showed synchronized musicians moving along with the metrical time of the score across conditions. Phase angles between these movements drifted during the performance, reflecting the changing relative phase resulting from the instructed accelerations. The musicians embodied musical time, keeping time for their own stable or accelerating part in the composition and, given the absence of external timekeeping, created time together (Walton et al., 2018a).
A shared context implies that the participants are involved in participatory sense-making (De Jaegher and Di Paolo, 2007), with joint actions and intentional relations in subjectivity or consciousness, that is, a form of intersubjectivity (Zahavi, 2001). Taking the second-person approach to social understanding requires an understanding of these intentional relations (Moore and Barresi, 2017). The relevance of virtual scenarios, avatars, and agents for this approach is exemplified by the application of our methodological framework to the Avatar and Agent conditions in our case-study. Self-reports from the Test pianist directly referred to “the presence of a real person” behind the avatar in the Avatar condition and, in the Agent condition, to the realization “that my virtual partner was, in fact, not present” while she was “trying to understand the intentions of something [the agent].” The comments also stressed the difference between interacting with another (social presence) and merely being with another (co-presence) (Garau et al., 2005; Parsons et al., 2017). The other layers of our framework further demonstrate the diffuse intentional relations in the Agent condition, showing turbulent coordination and reduced movement synchronization resulting from a lack of communication between the pianists. Given the Test pianist’s comment about a perceived “non-intentional agent,” it may have been unclear to her when the performance was in a (de)synchronizing measure of the score.
While the agent in the Agent condition had natural movements recorded from real performances, the Test pianist did not synchronize as well musically and behaviorally as in the other conditions. Nevertheless, musical structure was still reflected in the Test pianist’s body sway, corresponding to earlier findings on the transfer of musical structure into musicians’ movements (Demos et al., 2018) and possibly indicating individual timekeeping. The virtual agent and musical structure of the Agent condition could have been too rigid for the Test pianist to coordinate with effectively, resulting in informationally and behaviorally decoupled musicians. The agent might have lacked adaptive flexibility (Walton et al., 2018b) and the ability to actively distort or co-determine the musical structure (Doffman, 2009; Walton et al., 2018a).
Flow questionnaire scores were comparable across conditions. The presence questionnaire showed a low sense of spatial presence in the Avatar condition, possibly resulting from the closer coordination with the partner demonstrated by the synchronized head nodding, the gaze angle, and the self-reports, which may have drawn the pianist’s attention away from the spatial surroundings themselves. The higher scores for involvement and interface quality in the Agent condition might have resulted from the frustration of interacting with the non-intentional agent: it may have made the virtuality of the agent more obvious, causing the Test pianist to become more (negatively) emotionally involved and more aware of playing with the piano interface. The musical performance in the Agent condition was unsatisfying for the pianist despite a relatively successful relative phase progression and joint RQA metrics. In contrast, the scores indicated a successful execution and enjoyment of the performance in the Human and Avatar conditions.
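The joint RQA metrics mentioned above derive from cross-recurrence analysis, which quantifies how often two movement series visit similar regions of their (delay-embedded) state spaces. As a minimal illustration, assuming already-embedded state vectors and a chosen distance radius (the case-study’s actual embedding parameters are not reproduced here), the cross-recurrence rate can be computed as follows.

```python
import numpy as np

def cross_recurrence_rate(x, y, radius):
    """Cross-recurrence between two (delay-embedded) movement series.

    x, y:   arrays of shape (n, d) holding the embedded state vectors
            of the two players (e.g., head position with delayed copies).
    radius: Euclidean distance threshold defining a recurrence.
    Returns the binary cross-recurrence matrix and the recurrence rate
    (%REC), a coarse index of how often the players occupy similar
    movement states. Line-based metrics such as determinism are derived
    from the diagonal structures of this same matrix.
    """
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    rp = dist < radius
    return rp, 100.0 * rp.mean()
```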
We argue that the combination of a successful musical performance, close coupling, and shared intentional relations across the different layers between the interacting individuals could constitute necessary and sufficient conditions for the feeling of social presence. Our methodological framework allowed us to frame and couple patterns in these dynamics while adhering to the proposed multi-layered notion of presence (Riva et al., 2004) that is “rooted in activity” (Slater et al., 2009). VR has been the core enabler of this framework, through its unparalleled flexibility in controlling stimuli and its approximation of the ideal, nonmediated interactions we have in real life.
General Discussion and Future Work
The design and analysis of the case-study based on our proposed methodological framework allowed us to describe performative, behavioral, and experiential interaction dynamics across real and virtual conditions. Comparing the dynamics of the virtual interactions with those of the real-life setting provided the means to evaluate the similarities and differences that, we argue, are needed to confirm the level of social presence required for ecologically valid social cognition research. While music interactions represent a particular setting for the study of social cognition, they create shared contexts with object-centered interactions that involve emotional engagement, joint attention, and joint goal-directed actions (Moore and Barresi, 2017). As such, their analysis could support the move toward a second-person approach to social understanding and help close the gap between first-person experiential and third-person observational approaches (Schilbach et al., 2013).
A first extension of this study would be to move from a case-study to a full experiment involving a larger number of participants. One could then progress from the descriptive analyses above to statistically substantiated, quantifiable claims about social presence within specific (music) interaction contexts, using the methodological framework introduced in this paper. Recording (neuro)physiological signals in the subjective layer using biosensors or a hyperscanning setup (Dumas et al., 2011) would further help to support such claims.
One could vary aspects of the virtual environment as presence modulators to influence the perceived level of realism. An interesting direct modulator would be to blend virtual and real worlds using augmented reality technology. To gain deeper insights into the constitutive aspects of the feeling of social presence, one could experiment with different performance settings by varying the environment’s acoustics or by including an audience. For example, one could evaluate changes in the pianists’ coordination dynamics and experience by processing the audio to simulate a dry practice room or a reverberant concert hall (see the sketch below). With an audience included, one could additionally vary its engagement (Glowinski et al., 2015).
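Acoustic manipulation of this kind is typically realized by convolving the dry instrument signal with a room impulse response. The sketch below illustrates the basic operation, assuming a mono recording and a measured impulse response at the same sampling rate; the function name and the wet/dry mixing scheme are illustrative rather than a description of an actual implementation.

```python
import numpy as np
from scipy.signal import fftconvolve

def apply_room(dry, ir, wet_gain=0.7):
    """Convolve a dry piano recording with a room impulse response.

    dry: 1-D array of audio samples (the close-miked piano signal).
    ir:  1-D impulse response of the simulated space (e.g., a measured
         practice room or concert hall IR at the same sampling rate).
    Returns a normalized mix of the dry and reverberant signals.
    """
    wet = fftconvolve(dry, ir, mode="full")[: len(dry)]
    out = (1.0 - wet_gain) * dry + wet_gain * wet
    return out / np.max(np.abs(out))  # avoid clipping
```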
Besides controlling the (perceived) realism of the environment, one could also modulate the (perceived) realism of the virtual avatars and agents. One could add emotional content by varying facial expressions, introduce eye-blinking and gaze, have the agent mirror participants’ posture, or provide behavioral cues, such as head nods, at transitional moments of (de)synchronization. Interaction with the virtual agent might be improved by incorporating elements of surprise and controlled variability: one could script specific actions, blend multiple animations into richer gesture sets, or leverage machine learning techniques to learn new interaction dynamics that balance the exploration and exploitation of possible behavior states. Specifically, the Kuramoto model used in our case-study (illustrated below) could be extended with richer dynamics and sudden transitions, as in the models developed in other research (Mörtl et al., 2014; Shahal et al., 2020; Tognoli et al., 2020). While the main differentiator between conditions has been the behavioral realism of the virtual humans, future studies could also vary their appearance realism and investigate possible influences on interaction dynamics and social presence (Bailenson et al., 2006; Roth et al., 2016).
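To make this extension point concrete, the following sketch simulates the basic form of such a model: two mutually coupled phase oscillators, where the coupling strength k determines how strongly the agent adapts to the pianist’s timing. This is a minimal illustration; the function, its parameters, and the example tempi are our own and do not reproduce the agent implementation used in the case-study.

```python
import numpy as np

def kuramoto_duet(steps, dt, w_agent, w_pianist, k, noise=0.0):
    """Euler simulation of two coupled Kuramoto phase oscillators.

    theta_i' = w_i + k * sin(theta_j - theta_i) (+ optional phase noise),
    a toy model of an agent and a pianist keeping, and adapting, musical
    time. w_agent, w_pianist: intrinsic tempi in rad/s (2*pi*beats/s).
    k = 0 yields a rigid, non-adaptive agent. Returns phases (steps, 2).
    """
    rng = np.random.default_rng(0)
    theta = np.zeros((steps, 2))
    w = np.array([w_agent, w_pianist])
    for t in range(1, steps):
        prev = theta[t - 1]
        coupling = k * np.sin(prev[::-1] - prev)  # mutual sine coupling
        theta[t] = prev + dt * (w + coupling) + noise * rng.standard_normal(2)
    return theta

# Example: agent at 72 BPM, pianist drifting toward ~76 BPM, weak coupling.
phases = kuramoto_duet(6000, 0.01, 2 * np.pi * 1.2, 2 * np.pi * 1.27, k=0.5)
```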
Another avenue of investigation would be a multi-user VR scenario. Instead of having one test subject interact with a real human, avatar, or agent, one could place both pianists in the virtual environment and analyze the dynamics of the participants individually and together. This would require combining detailed finger tracking and full-body movement capture with low-latency, synchronized audiovisual and tactile content. Multi-user VR scenarios can also be designed as networked performances, enabling investigations into aspects of spatial presence. These are technically challenging but feasible using the technology presented in this paper.
Finally, the methodology presented here is readily applicable to deepen understanding within existing dynamical theories of action and perception (Warren, 2006) and social psychology (Tarr et al., 2018), to open new research pathways such as VR-based music cognition research, and to support the investigation of subjective qualities prevalent in musical interactions, such as presence, flow, agency, and togetherness (Herrera et al., 2006; Nijs et al., 2012; Shirzadian et al., 2017). The set-up could also be used to transfer existing paradigms in joint action and amnestic re-embodiment (Suzuki et al., 2012; Metzinger, 2018) to musical interaction scenarios, as well as for applications in education and creative works. The latter was demonstrated by a public performance in our Art and Science Interaction lab using a modified version of the virtual environment described in this paper.1
Conclusion
We introduced a multi-layered methodological framework incorporating qualitative and quantitative methods to assess the feeling of social presence in social music interactions in virtual reality. We then applied this framework to a case-study of duet piano performances in which an expert pianist played a musical composition with, in turn, another expert pianist; a human-embodied avatar controlled by an expert pianist; and a computer-controlled agent. The case-study showed excellent performances with close interpersonal coordination in the behavioral and experiential layers for the interactions between the real pianist and the virtual avatar, and a good performance without interpersonal coordination for the interaction with the virtual agent. These analyses demonstrated the potential of our proposed framework for assessing social presence, as well as for highlighting opportunities and challenges in developing better virtual interactions with, and models of, virtual humans.
Data Availability Statement
The datasets presented in this study can be found in an online repository. The repository can be found at: https://github.com/ArtScienceLab/Piano-Phase.
Ethics Statement
The studies involving human participants were reviewed and approved by Ghent University Faculty of Arts and Philosophy Ethics Committee. The participants provided their written informed consent to participate in this study.
Author Contributions
BK co-designed the study, developed the experimental set-up, conducted experiments, performed data analysis, and co-drafted the manuscript. GC co-designed the study and provided the performance data for creation of the avatar. P-JM co-designed the study and co-drafted, revised, and approved the manuscript. All authors contributed to the article and approved the submitted version.
Funding
This research was funded by a starting grant of the special research fund (BOF19/STA/035) to P-JM.
Conflict of Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Footnotes
1. ^A video is available at: https://youtu.be/GlVaMPCotzM.
References
Alcañiz, M., Guixeres, J., and Bigne, E. (2019). Virtual reality in marketing: a framework, review and research agenda. Front. Psychol. 10:1530. doi: 10.3389/fpsyg.2019.01530
Atherton, J., and Wang, G. (2020). Doing vs. being: a philosophy of design for artful VR. J. New Music Res. 49, 35–59. doi: 10.1080/09298215.2019.1705862
Bailenson, J. N., Yee, N., Merget, D., and Schroeder, R. (2006). The effect of behavioral realism and form realism of real-time avatar faces on verbal disclosure, nonverbal disclosure, emotion recognition, and copresence in dyadic interaction. Presence: Teleoperators Virtual Environ. 15, 359–372. doi: 10.1162/pres.15.4.359
Banakou, D., and Slater, M. (2014). Body ownership causes illusory self-attribution of speaking and influences subsequent real speaking. Proc. Natl. Acad. Sci. U.S.A. 111, 17678–17683. doi: 10.1073/pnas.1414936111
Baumgartner, T., Valko, L., Esslen, M., and Jäncke, L. (2006). Neural correlate of spatial presence in an arousing and noninteractive virtual reality: an EEG and psychophysiology study. CyberPsychol. Behav. 9, 30–45. doi: 10.1089/cpb.2006.9.30
Berg, L. P., and Vance, J. M. (2017). Industry use of virtual reality in product design and manufacturing: a survey. Virtual Reality 21, 1–17. doi: 10.1007/s10055-016-0293-9
Berthaut, F., Zappi, V., and Mazzanti, D. (2014). “Scenography of immersive virtual musical instruments.” in 2014 IEEE VR Workshop: Sonic Interaction in Virtual Environments (SIVE). 19–24.
Biocca, F. (1992). Communication within virtual reality: creating a space for research. J. Commun. 42, 5–22. doi: 10.1111/j.1460-2466.1992.tb00810.x
Biocca, F., and Delaney, B. (1995). “Immersive virtual reality technology.” in Communication in the Age of Virtual Reality. eds. F. Biocca and M. R. Levy (Hillsdale, NJ: Lawrence Erlbaum Associates Publishers), 57–124.
Bissonnette, J., Dubé, F., Provencher, M. D., and Moreno Sala, M. T. (2016). Evolution of music performance anxiety and quality of performance during virtual reality exposure training. Virtual Reality 20, 71–81. doi: 10.1007/s10055-016-0283-y
Blascovich, J., Loomis, J., Beall, A. C., Swinth, K. R., Hoyt, C. L., and Bailenson, J. N. (2002). Immersive virtual environment technology as a methodological tool for social psychology. Psychol. Inq. 13, 103–124. doi: 10.1207/S15327965PLI1302_01
Braun, N., Debener, S., Spychala, N., Bongartz, E., Sörös, P., Müller, H. H. O., et al. (2018). The senses of agency and ownership: a review. Front. Psychol. 9:535. doi: 10.3389/fpsyg.2018.00535
Çamci, A., and Hamilton, R. (2020). Audio-first VR: new perspectives on musical experiences in virtual environments. J. New Music Res. 49, 1–7. doi: 10.1080/09298215.2019.1707234
Camurri, A., De Poli, G., Leman, M., and Volpe, G. (2001). “A multi-layered conceptual framework for expressive gesture applications.” in Proc. Intl MOSART Workshop, Barcelona.
Caruso, G., Coorevits, E., Nijs, L., and Leman, M. (2016). Gestures in contemporary music performance: a method to assist the performer’s artistic process. Contemp. Music. Rev. 35, 402–422. doi: 10.1080/07494467.2016.1257292
Castellano, G., Mortillaro, M., Camurri, A., Volpe, G., and Scherer, K. (2008). Automated analysis of body movement in emotionally expressive piano performances. Music. Percept. 26, 103–119. doi: 10.1525/mp.2008.26.2.103
Cipresso, P. (2015). Modeling behavior dynamics using computational psychometrics within virtual worlds. Front. Psychol. 6:1725. doi: 10.3389/fpsyg.2015.01725
Coorevits, E., Maes, P.-J., Six, J., and Leman, M. (2020). The influence of performing gesture type on interpersonal musical timing, and the role of visual contact and tempo. Acta Psychol. 210:103166. doi: 10.1016/j.actpsy.2020.103166
Cruz-Neira, C. (2003). Computational humanities: the new challenge for VR. IEEE Comput. Graph. Appl. 23, 10–13. doi: 10.1109/MCG.2003.1198257
Cui, G. (2013). Evaluating online social presence: an overview of social presence assessment. J. Educ. Techno. Develop. Exch. 6, 13–30. doi: 10.18785/jetde.0601.02
D’Ausilio, A., Novembre, G., Fadiga, L., and Keller, P. E. (2015). What can music tell us about social interaction? Trends Cogn. Sci. 19, 111–114. doi: 10.1016/j.tics.2015.01.005
Deacon, T., Stockman, T., and Barthet, M. (2016). “User experience in an interactive music virtual reality system: an exploratory study.” in International Symposium on Computer Music Multidisciplinary Research. 192–216.
De Jaegher, H., and Di Paolo, E. (2007). Participatory sense-making. Phenomenol. Cogn. Sci. 6, 485–507. doi: 10.1007/s11097-007-9076-9
Demos, A. P., Chaffin, R., and Logan, T. (2018). Musicians body sway embodies musical structure and expression: a recurrence-based approach. Music. Sci. 22, 244–263. doi: 10.1177/1029864916685928
De Oliveira, E. C., Bertrand, P., Lesur, M. E. R., Palomo, P., Demarzo, M., Cebolla, A., et al. (2016). “Virtual body swap: a new feasible tool to be explored in health and education.” in Proceedings - 18th Symposium on Virtual and Augmented Reality, SVR 2016. Gramado, 81–89.
Doffman, M. (2009). Making it groove! Entrainment, participation and discrepancy in the “conversation” of a jazz trio. Lang. Hist. 52, 130–147. doi: 10.1179/175975309x452012
Dumas, G., Lachat, F., Martinerie, J., Nadel, J., and George, N. (2011). From social behaviour to brain synchronization: review and perspectives in hyperscanning. IRBM 32, 48–53. doi: 10.1016/j.irbm.2011.01.002
Eerola, T., Jakubowski, K., Moran, N., Keller, P. E., and Clayton, M. (2018). Shared periodic performer movements coordinate interactions in duo improvisations. R. Soc. Open Sci. 5:171520. doi: 10.1098/rsos.171520
Fogel, A. (2017). “Two principles of communication: co-regulation and framing,” in New Perspectives in Early Communicative Development. eds. J. Nadel and L. Camaioni (Routledge), 9–22.
Fox, J., Arena, D., and Bailenson, J. N. (2009). Virtual reality: a survival guide for the social scientist. J. Media Psychol. 21, 95–113. doi: 10.1027/1864-1105.21.3.95
Friedman, D., Pizarro, R., Or-Berkers, K., Neyret, S., Pan, X., and Slater, M. (2014). A method for generating an illusion of backwards time travel using immersive virtual reality—an exploratory study. Front. Psychol. 5:943. doi: 10.3389/fpsyg.2014.00943
Froese, T., and Fuchs, T. (2012). The extended body: a case study in the neurophenomenology of social interaction. Phenomenol. Cogn. Sci. 11, 205–235. doi: 10.1007/s11097-012-9254-2
Gallagher, S. (2001). The practice of mind. Theory, simulation or primary interaction? J. Conscious. Stud. 8, 83–108.
Gallagher, S., and Allen, M. (2018). Active inference, enactivism and the hermeneutics of social cognition. Synthese 195, 2627–2648. doi: 10.1007/s11229-016-1269-8
Gallotti, M., and Frith, C. D. (2013). Social cognition in the we-mode. Trends Cogn. Sci. 17, 160–165. doi: 10.1016/j.tics.2013.02.002
Garau, M., Slater, M., Pertaub, D.-P., and Razzaque, S. (2005). The responses of people to virtual humans in an immersive virtual environment. Presence: Teleoperators Virtual Environ. 14, 104–116. doi: 10.1162/1054746053890242
Glennerster, A., Tcheang, L., Gilson, S. J., Fitzgibbon, A. W., and Parker, A. J. (2006). Humans ignore motion and stereo cues in favor of a fictional stable world. Curr. Biol. 16, 428–432. doi: 10.1016/j.cub.2006.01.019
Glowinski, D., Baron, N., Shirole, K., Coll, S. Y., Chaabi, L., Ott, T., et al. (2015). Evaluating music performance and context-sensitivity with immersive virtual environments. EAI Endors. Trans. Creat. Technol. 2:e3. doi: 10.4108/ct.2.2.e3
Goebl, W., and Palmer, C. (2009). Synchronization of timing and motion among performing musicians. Music. Percept. 26, 427–438. doi: 10.1525/mp.2009.26.5.427
Hamilton, R. (2019). “Mediated musical interactions in virtual environments,” in New Directions in Music and Human-Computer Interaction. eds. S. Holland, T. Mudd, K. Wilkie-McKenna, A. McPherson and M. M. Wanderley (Springer), 243–257.
Herrera, G., Jordan, R., and Vera, L. (2006). Agency and presence: a common dependence on subjectivity? Presence Teleop. Virt. 15, 539–552. doi: 10.1162/pres.15.5.539
Hilt, P. M., Badino, L., D’Ausilio, A., Volpe, G., Tokay, S., Fadiga, L., et al. (2020). Author correction: multi-layer adaptation of group coordination in musical ensembles. Sci. Rep. 10:597. doi: 10.1038/s41598-019-55965-3
Honigman, C., Walton, A., and Kapur, A. (2013). “The third room: a 3d virtual music paradigm.” in Proceedings of the International Conference on New Interfaces for Musical Expression. 29–34.
Ijsselsteijn, W., and Riva, G. (2003). “Being there: the experience of presence in mediated environments,” in Being There: Concepts, Effects and Measurement of User Presence in Synthetic Environments. eds. G. Riva, F. Davide, and W. A. Ijsselsteijn (Amsterdam, Netherlands: IOS Press), 4–14.
Johanson, M. (2015). The turing test for telepresence. Int. J. Multimed. Appl. 7, 1–20. doi: 10.5121/ijma.2015.7501
Keller, P. E., and Appel, M. (2010). Individual differences, auditory imagery, and the coordination of body movements and sounds in musical ensembles. Music. Percept. 28, 27–46. doi: 10.1525/mp.2010.28.1.27
Kelso, J. A. S. (1995). Dynamic Patterns: The Self-Organization of Brain and Behavior. Cambridge, MA: MIT press.
Kern, A. C., and Ellermeier, W. (2020). Audio in VR: effects of a soundscape and movement-triggered step sounds on presence. Front. Robot. AI 7:20. doi: 10.3389/frobt.2020.00020
Kilteni, K., Bergstrom, I., and Slater, M. (2013). Drumming in immersive virtual reality: the body shapes the way we play. IEEE Trans. Vis. Comput. Graph. 19, 597–605. doi: 10.1109/TVCG.2013.29
Kilteni, K., Groten, R., and Slater, M. (2012). The sense of embodiment in virtual reality. Presence Teleop. Virt. 21, 373–387. doi: 10.1162/pres_a_00124
Kiverstein, J. (2011). Social understanding without mentalizing. Philos. Top. 39, 41–65. doi: 10.5840/philtopics201139113
Koban, L., Ramamoorthy, A., and Konvalinka, I. (2019). Why do we fall into sync with others? Interpersonal synchronization and the brain’s optimization principle. Soc. Neurosci. 14, 1–9. doi: 10.1080/17470919.2017.1400463
Kobayashi, M., Ueno, K., and Ise, S. (2020). The effects of spatialized sounds on the sense of presence in auditory virtual environments: a psychological and physiological study. Presence Teleop. Virt. 24, 163–174. doi: 10.1162/pres_a_00226
Kornelsen, J. J. A. (1991). Virtual Reality?: Marshall McLuhan and a Phenomenological Investigation of the Construction of Virtual Worlds. Master’s thesis. Canada: Simon Fraser University.
Kothgassner, O. D., and Felnhofer, A. (2020). Does virtual reality help to cut the Gordian knot between ecological validity and experimental control? Ann. Int. Commun. Assoc. 44, 210–218. doi: 10.1080/23808985.2020.1792790
Kothgassner, O. D., Griesinger, M., Kettner, K., Wayan, K., Völkl-Kernstock, S., Hlavacs, H., et al. (2017). Real-life prosocial behavior decreases after being socially excluded by avatars, not agents. Comput. Hum. Behav. 70, 261–269. doi: 10.1016/j.chb.2016.12.059
Laeng, B., Sirois, S., and Gredebäck, G. (2012). Pupillometry: a window to the preconscious? Perspect. Psychol. Sci. 7, 18–27. doi: 10.1177/1745691611427305
Lanier, J. (1988). A Vintage Virtual Reality Interview. Available at: http://www.jaronlanier.com/vrint.html (Accessed January 11, 2021).
Leman, M., and Maes, P.-J. (2015). The role of embodiment in the perception of music. Empir. Musicol. Rev. 9, 236–246. doi: 10.18061/emr.v9i3-4.4498
Lerch, A., Arthur, C., Pati, A., and Gururani, S. (2020). An interdisciplinary review of music performance analysis. Trans. Int. Soc. Music. Inf. Retr. 3, 221–245. doi: 10.5334/tismir.53
Lessiter, J., Freeman, J., Keogh, E., and Davidoff, J. (2001). A cross-media presence questionnaire: the ITC-sense of presence inventory. Presence: Teleoperators Virtual Environ. 10, 282–297. doi: 10.1162/105474601300343612
Lombard, M., and Ditton, T. (1997). At the heart of it all: the concept of presence. J. Comput.-Mediat. Commun. 3:JCMC321.
Martin, A. J., and Jackson, S. A. (2008). Brief approaches to assessing task absorption and enhanced subjective experience: examining ‘short’ and ‘core’ flow in diverse performance domains. Motiv. Emot. 32, 141–157. doi: 10.1007/s11031-008-9094-0
Marwan, N., Romano, M. C., Thiel, M., and Kurths, J. (2007). Recurrence plots for the analysis of complex systems. Phys. Rep. 438, 237–329. doi: 10.1016/j.physrep.2006.11.001
Matamala-Gomez, M., Maselli, A., Malighetti, C., Realdon, O., Mantovani, F., and Riva, G. (2020). Body illusions for mental health: a systematic review. PsyArXiv [Preprint].
McLuhan, M. (1964). Understanding Media: The Extensions of Man. New York, NY: McGraw-Hill Education.
Meehan, M., Insko, B., Whitton, M., and Brooks, F. B. (2002). Physiological measures of presence in stressful virtual environments. ACM Trans. Graph. 21, 645–652. doi: 10.1145/566654.566630
Metzinger, T. K. (2018). Why is virtual reality interesting for philosophers? Front. Robot. AI 5:101. doi: 10.3389/frobt.2018.00101
Milgram, P., Takemura, H., Utsumi, A., and Kishino, F. (1995). “Augmented reality: a class of displays on the reality-virtuality continuum,” in Telemanipulator and Telepresence Technologies. ed. H. Das (Bellingham, WA: SPIE), 282–292.
Moore, C., and Barresi, J. (2017). The role of second-person information in the development of social understanding. Front. Psychol. 8:1667. doi: 10.3389/fpsyg.2017.01667
Mörtl, A., Lorenz, T., and Hirche, S. (2014). Rhythm patterns interaction-synchronization behavior for human-robot joint action. PLoS One 9:e95195. doi: 10.1371/journal.pone.0095195
Nijs, L., Coussement, P., Moens, B., Amelinck, D., Lesaffre, M., and Leman, M. (2012). Interacting with the music paint machine: relating the constructs of flow experience and presence. Interact. Comput. 24, 237–250. doi: 10.1016/j.intcom.2012.05.002
Oh, C. S., Bailenson, J. N., and Welch, G. F. (2018). A systematic review of social presence: definition, antecedents, and implications. Front. Robot. AI 5:114. doi: 10.3389/frobt.2018.00114
Orman, E. K. (2004). Effect of virtual reality graded exposure on anxiety levels of performing musicians: a case study. J. Music. Ther. 41, 70–78. doi: 10.1093/jmt/41.1.70
Orman, E. K., Price, H. E., and Russell, C. R. (2017). Feasibility of using an augmented immersive virtual reality learning environment to enhance music conducting skills. J. Music. Teach. Educ. 27, 24–35. doi: 10.1177/1057083717697962
Pan, X., and de Hamilton, A. F. C. (2018). Why and how to use virtual reality to study human social interaction: the challenges of exploring a new research landscape. Br. J. Psychol. 109, 395–417. doi: 10.1111/bjop.12290
Parsons, T. D. (2015). Virtual reality for enhanced ecological validity and experimental control in the clinical, affective and social neurosciences. Front. Hum. Neurosci. 9:660. doi: 10.3389/fnhum.2015.00660
Parsons, T. D., Gaggioli, A., and Riva, G. (2017). Virtual reality for research in social neuroscience. Brain Sci. 7:42. doi: 10.3390/brainsci7040042
Petkova, V. I., and Ehrsson, H. H. (2008). If I were you: perceptual illusion of body swapping. PLoS One 3:e3832. doi: 10.1371/journal.pone.0003832
Riva, G. (2006). “Being-in-the-world-with: presence meets social and cognitive neuroscience.” in From Communication to Presence: Cognition, Emotions and Culture Towards the Ultimate Communicative Experience. eds. G. Riva, M. T. Anguera, B. K. Wiederhold, and F. Mantovani (Amsterdam, Netherlands: IOS Press), 47–80.
Riva, G., Waterworth, J. A., and Waterworth, E. L. (2004). The layers of presence: a bio-cultural approach to understanding presence in natural and mediated environments. CyberPsychol. Behav. 7, 402–416. doi: 10.1089/cpb.2004.7.402
Riva, G., Wiederhold, B. K., and Mantovani, F. (2019). Neuroscience of virtual reality: from virtual exposure to embodied medicine. Cyberpsychol. Behav. Soc. Netw. 22, 82–96. doi: 10.1089/cyber.2017.29099.gri
Roth, D., Lugrin, J. -L., Galakhov, D., Hofmann, A., Bente, G., Latoschik, M. E., et al. (2016). “Avatar realism and social interaction quality in virtual reality.” in IEEE Virtual Reality (VR). 277–278.
Scarfe, P., and Glennerster, A. (2019). The science behind virtual reality displays. Annu. Rev. Vis. Sci. 5, 1–20. doi: 10.1146/annurev-vision-091718-014942
Schilbach, L., Timmermans, B., Reddy, V., Costall, A., Bente, G., Schlicht, T., et al. (2013). Toward a second-person neuroscience. Behav. Brain Sci. 36, 393–414. doi: 10.1017/S0140525X12000660
Schmidt, P., Reiss, A., Dürichen, R., and Van Laerhoven, K. (2019). Wearable-based affect recognition—a review. Sensors 19:4079. doi: 10.3390/s19194079
Sebanz, N., and Knoblich, G. (2009). Prediction in joint action: what, when, and where. Top. Cogn. Sci. 1, 353–367. doi: 10.1111/j.1756-8765.2009.01024.x
Serafin, S., Adjorlu, A., Nilsson, N., Thomsen, L., and Nordahl, R. (2017). “Considerations on the use of virtual and augmented reality technologies in music education.” in 2017 IEEE Virtual Reality Workshop on K-12 Embodied Learning through Virtual & Augmented Reality (KELVAR). 1–4.
Serafin, S., Erkut, C., Kojs, J., Nilsson, N. C., and Nordahl, R. (2016). Virtual reality musical instruments: state of the art, design principles, and future directions. Comput. Music. J. 40, 22–40. doi: 10.1162/COMJ_a_00372
Shahal, S., Wurzberg, A., Sibony, I., Duadi, H., Shniderman, E., Weymouth, D., et al. (2020). Synchronization of complex human networks. Nat. Commun. 11, 1–10. doi: 10.1038/s41467-020-17540-7
Sheridan, T. B. (1992). Musings on telepresence and virtual presence. Presence Teleop. Virt. 1, 120–126. doi: 10.1162/pres.1992.1.1.120
Shirzadian, N., Redi, J. A., Röggla, T., Panza, A., Nack, F., and Cesar, P. (2017). “Immersion and togetherness: how live visualization of audience engagement can enhance music events.” in International Conference on Advances in Computer Entertainment. 488–507.
Short, J., Williams, E., and Christie, B. (1976). The Social Psychology of Telecommunications. Hoboken, NJ: John Wiley & Sons.
Slater, M., Lotto, B., Arnold, M. M., and Sanchez-Vives, M. V. (2009). How we experience immersive virtual environments: the concept of presence and its measurement. Anuario de Psicología 40, 193–210.
Slater, M., and Sanchez-Vives, M. (2016). Enhancing our lives with immersive virtual reality. Front. Robot. AI 3:74. doi: 10.3389/frobt.2016.00074
Sloetjes, H., and Wittenburg, P. (2008). “Annotation by category-ELAN and ISO DCR.” in 6th International Conference on Language Resources and Evaluation (LREC 2008).
Sutherland, I. E. (1968). “A head-mounted three dimensional display.” in AFIPS ’68 (Fall, Part I): Proceedings of the Fall Joint Computer Conference. 757–764.
Suzuki, K., Wakisaka, S., and Fujii, N. (2012). Substitutional reality system: a novel experimental platform for experiencing alternative reality. Sci. Rep. 2:459. doi: 10.1038/srep00459
Tarr, B., Slater, M., and Cohen, E. (2018). Synchrony and social connection in immersive virtual reality. Sci. Rep. 8, 1–8. doi: 10.1038/s41598-018-21765-4
Taylor, J. (1997). The emerging geographies of virtual worlds. Geogr. Rev. 87, 172–192. doi: 10.2307/216004
Teo, W. P., Muthalib, M., Yamin, S., Hendy, A. M., Bramstedt, K., Kotsopoulos, E., et al. (2016). Does a combination of virtual reality, neuromodulation and neuroimaging provide a comprehensive platform for neurorehabilitation? A narrative review of the literature. Front. Hum. Neurosci. 10:284. doi: 10.3389/fnhum.2016.00284
Tognoli, E., Zhang, M., Fuchs, A., Beetle, C., and Kelso, J. A. S. (2020). Coordination dynamics: a foundation for understanding social behavior. Front. Hum. Neurosci. 14:317. doi: 10.3389/fnhum.2020.00317
Västfjäll, D. (2003). The subjective sense of presence, emotion recognition, and experienced emotions in auditory virtual environments. CyberPsychol. Behav. 6, 181–188. doi: 10.1089/109493103321640374
Walton, A. E., Washburn, A., Langland-Hassan, P., Chemero, A., Kloos, H., and Richardson, M. J. (2018a). Creating time: social collaboration in music improvisation. Top. Cogn. Sci. 10, 95–119. doi: 10.1111/tops.12306
Walton, A. E., Washburn, A., Richardson, M. J., and Chemero, A. (2018b). “Empathy and groove in musical movement” in Proceedings of A Body of Knowledge - Embodied Cognition and the Arts conference 2016; December 8–10, 2016.
Warren, W. H. (2006). The dynamics of perception and action. Psychol. Rev. 113:358. doi: 10.1037/0033-295X.113.2.358
Williamon, A., Aufegger, L., and Eiholzer, H. (2014). Simulating and stimulating performance: introducing distributed simulation to enhance musical learning and performance. Front. Psychol. 5:25. doi: 10.3389/fpsyg.2014.00025
Williamon, A., and Davidson, J. W. (2002). Exploring co-performer communication. Music. Sci. 6, 53–72. doi: 10.1177/102986490200600103
Witmer, B. G., Jerome, C. J., and Singer, M. J. (2005). The factor structure of the presence questionnaire. Presence: Teleoperators Virtual Environ. 14, 298–312. doi: 10.1162/105474605323384654
Witmer, B. G., and Singer, M. J. (1998). Measuring presence in virtual environments: a presence questionnaire. Presence 7, 225–240. doi: 10.1162/105474698565686
Yang, L. I., Jin, H., Feng, T., Hong-An, W., and Guo-Zhong, D. (2019). Gesture interaction in virtual reality. Virtual Real. Intell. Hardw. 1, 84–112. doi: 10.3724/sp.j.2096-5796.2018.0006
Keywords: presence, virtual reality, music, embodiment, social interaction, interpersonal coordination, ecologically valid research
Citation: Van Kerrebroeck B, Caruso G and Maes P-J (2021) A Methodological Framework for Assessing Social Presence in Music Interactions in Virtual Reality. Front. Psychol. 12:663725. doi: 10.3389/fpsyg.2021.663725
Edited by:
Steve Richard DiPaola, Simon Fraser University, Canada
Reviewed by:
Francisco Rebelo, University of Lisbon, Portugal
Eriko Aiba, The University of Electro-Communications, Japan
Copyright © 2021 Van Kerrebroeck, Caruso and Maes. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Bavo Van Kerrebroeck, bavo.vankerrebroeck@ugent.be