- VTT Technical Research Centre of Finland, Espoo, Finland
Biosensing techniques are progressing rapidly, promising the emergence of sophisticated virtual reality (VR) headsets with versatile biosensing that enables an objective yet unobtrusive way to monitor the user’s physiology. Additionally, modern artificial intelligence (AI) methods provide interpretations of multimodal data to obtain personalised estimations of the user’s oculomotor behaviour, visual perception, and cognitive state, and they can even control, adapt, and create the virtual audiovisual content in real-time. This article proposes a visionary approach for personalised virtual content adaptation via novel and precise oculomotor feature extraction from a freely moving user and sophisticated AI algorithms for cognitive state estimation. The approach is presented with an example use-case of a VR flight simulation session, explaining in detail how cognitive workload, decreases in alertness level, and cybersickness symptoms could be estimated and modulated in real-time by using the proposed methods and embedded stimuli. We believe the envisioned approach will lead to significant cost savings and societal impact and will thus be a necessity in future VR setups. For instance, it will increase the efficiency of a VR training session by optimizing the task difficulty based on the user’s cognitive load, and it will decrease the probability of human errors by guiding visual perception via content adaptation.
Introduction
Head-mounted displays (HMDs) offer an unobtrusive platform for head-area sensing. In this article, the term head-area sensing denotes a virtual reality (VR) headset enabling unobtrusive monitoring of a variety of biosignals, such as eye and head movements, pupil size, heart rate, skin conductivity, and brain activity, providing versatile information on human physiology and psychophysiology. Existing VR headsets include built-in eye-trackers, and an increasing number of versatile biosensing capabilities are emerging. Further, by enabling VR content adaptation based on seamless real-time estimation of human visual perception and cognitive state, we could, for instance, analyse and modify human perception and attention allocation in a specific task (e.g., safety-critical monitoring tasks), optimize the learning and rehabilitation effect during a session, or even guide individual experience paths (e.g., in entertainment and tourism related use-cases).
In this article we envision why biosignal-based personalised virtual content adaptation should be a necessity in future VR systems. The functionality of such a system is presented with an example use-case of a VR flight simulation training session. We explain how signal and feature processing, adaptive classification, and decision making with virtual content parametrisations are composed to operate together as a sophisticated AI system. We illustrate three distinct occasions that would be most likely to affect the performance of the trainee: an increase in cognitive workload, a decrease in alertness level, and the onset of cybersickness symptoms. The primary focus here is on eye and oculomotor parameters, as well as head movements, given their crucial role in visual perception.
Since visual perception is tightly linked to the visual content, the discussion is limited to VR environments (in which the content is fully controllable). However, some of the protocols could be implemented in augmented reality (AR), especially in fully rendered mixed reality (MR), where the visual scene is known (Rauschnabel et al., 2022).
Perception is an active cognitive process we use to form our understanding of the complex and dynamic world around us. It involves receiving and processing sensory information selectively filtered by attention. Cognitive state, such as alertness, directly influences attention allocation and visual perception (Lim and Dinges, 2010). Therefore, knowledge on the person’s cognitive state is essential for understanding their perception and attentional processes, strategies, behaviour, and performance in specific tasks.
Currently, the user’s cognitive state, especially stress and cognitive workload, can be detected from biosignals with machine learning methods in constrained environments as a binary indicator (Giannakakis et al., 2022). VR experiments have achieved a similar goal by limiting the data to an integrated eye-tracker (Shadiev and Li, 2023) and by including wearable measurement devices (Weibel et al., 2023), (Tao et al., 2022), (Miltiadous et al., 2022) or custom hardware attached to the headset (Luong et al., 2020). However, external devices are more error-prone and introduce synchronization challenges compared to integrated sensors. Moreover, we have found that by combining multiple biosignals (such as electro-oculography, EOG; electroencephalogram, EEG; electrocardiogram, ECG; electrodermal activity, EDA) it is possible to achieve better performance in multiclass classification of stress and cognitive workload (Pettersson et al., 2020), (Tervonen et al., 2023). These points suggest that novel VR headsets with integrated biosensing will make it possible to deliver versatile and engaging stimuli in less constrained environments and to estimate cognitive states at a more fine-grained level.
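The multimodal, multiclass classification mentioned above can be illustrated with a minimal sketch. The snippet below is not the pipeline of (Pettersson et al., 2020) or (Tervonen et al., 2023); it only shows the general idea of early fusion, i.e., concatenating per-segment features from several modalities (EOG, EEG, ECG, EDA) and training one multiclass classifier. All feature counts and the synthetic data are illustrative placeholders.

```python
# Minimal sketch (not the authors' pipeline): early fusion of biosignal features
# and a multiclass cognitive state classifier on synthetic placeholder data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_segments = 300

# Hypothetical per-segment features from each modality.
eog = rng.normal(size=(n_segments, 4))   # e.g., blink rate, saccade rate, peak velocity
eeg = rng.normal(size=(n_segments, 6))   # e.g., band powers
ecg = rng.normal(size=(n_segments, 3))   # e.g., mean HR, HRV indices
eda = rng.normal(size=(n_segments, 2))   # e.g., tonic level, SCR count

X = np.hstack([eog, eeg, ecg, eda])      # early fusion: concatenate modalities
y = rng.integers(0, 3, size=n_segments)  # 3 classes, e.g., low/medium/high workload

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```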
In their extensive 2021 review, Halbig and Latoschik noted that although the integration of biosignals with VR applications is promising, there were no VR headsets with biosignal capabilities available at that time (Halbig and Latoschik, 2021). Now, 3 years later, many commercial manufacturers (e.g., OpenBCI, HP, Emteqlabs, LooxidLabs, Wearable Sensing) are increasingly integrating biosensing such as EOG, electromyogram, ECG, EEG, EDA, photoplethysmogram, and facial cameras, in addition to eye-tracking (video-oculography, VOG), into their headsets.
Existing knowledge of human perception and active vision has mainly been derived from experiments conducted in static 2D setups with a limited field-of-view (FoV), e.g., (Ugwitz et al., 2022), and the corresponding oculometric algorithms are also designed for such setups. While advancements in headset-integrated sensing and VR have opened new opportunities for studying the complexities of real-life perceptual experiences, e.g., (Agtzidis et al., 2019), (Haskins et al., 2020), (Merzon et al., 2022), they also enable us to update the oculomotor feature extraction algorithms to operate in more detail in new, more dynamic settings (large FoV, 360° VR, even 3D) with a freely moving person (Startsev and Zemblys, 2022).
In the near future, the user’s experience can be modified with personalised content adaptation in VR. This is made possible by combining headsets with time-synchronized VR content and biosignals, robust real-time oculomotor feature estimation in a dynamic 3D environment, and sophisticated real-time AI algorithms for cognitive state estimation. Examples of bio-feedback and content adaptation approaches already exist in the literature, such as (Miltiadous et al., 2022), (Halbig and Latoschik, 2021), (Qu et al., 2022). However, most of these examples present offline solutions or are related to use-cases involving meditation or relaxation.
We envision that personalised content adaptation will improve immersion and user experience, as well as yield more fine-grained knowledge of the user’s visual perception, cognitive state, and performance. Ultimately, the approach could enable behavioural changes, enhance and accelerate learning or rehabilitation, and even decrease the probability of human error. These, in turn, will lead to cost savings and a significant societal impact.
Personalised content adaptation
The functionality of the personalised content adaptation is presented with an example use-case of a VR flight simulation training session involving three cognitive state estimations: cognitive workload, alertness level, and cybersickness symptoms. The cognitive states are estimated by using all the biosignals obtained with biosensors integrated into a VR headset. Next, we describe how the eye and oculomotor features are estimated from a freely moving person in VR, what an embedded stimulus is and how it could be used in content adaptation, and the AI aspects of realising the real-time adaptation.
Oculomotor parameter estimation in VR
Figure 1 demonstrates the detection of eye and oculomotor features during a VR flight training session. The headset tracks the user’s head and eye movements, and concurrent feature extraction segments the time series into different eye and head movement events based on the signal changes.
Figure 1. Schematic overview of an illustrative VR flight simulation session where the headset tracks the user’s head movements (the light cyan signal) and eye movements with both VOG (the blue signal) and EOG (the black signal). Feature extraction classifies the head and eye movement events into different classes based on their types and concurrence: (A) a detailed illustration of a horizontal saccade, (B) an example of a simultaneous saccade and head movement followed by a vestibulo-ocular reflex (VOR), (C) a detailed illustration of a blink.
Both the head and the eyes are stable during fixations (Figure 1: light purple). While fixating, an object is held in foveal vision to gather information, and a longer fixation time may indicate, for example, deeper cognitive processing (Rayner, 1998). Fixation patterns, such as scan paths and fixation dispersion, contain information about the content and about how a person scans the visual field, which can change due to fatigue or neurological dysfunction (Shiferaw et al., 2018), (Cox and Aimola Davies, 2020). The eyes can maintain fixation on a moving object with smooth pursuit (Figure 1: cyan). Since smooth pursuit is difficult to perform without a moving target, it is discussed in more detail later with embedded stimuli.
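As one concrete (and simplified) example of how such fixation events could be extracted, the sketch below implements a dispersion-threshold style detector: gaze samples are grouped into a fixation while their spread stays under a threshold for at least a minimum duration. It is an illustrative assumption, not the feature extraction used in Figure 1, and the threshold values are placeholders.

```python
# Hedged sketch of dispersion-based fixation detection (I-DT style).
import numpy as np

def detect_fixations(x, y, fs, min_dur=0.1, max_disp=1.0):
    """Return (start, end) sample indices of fixations.
    x, y: gaze angles in degrees; fs: sampling rate in Hz;
    min_dur: minimum fixation duration (s); max_disp: dispersion threshold (deg)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)

    def disp(a, b):  # horizontal + vertical gaze spread within the window
        return (x[a:b].max() - x[a:b].min()) + (y[a:b].max() - y[a:b].min())

    min_len = int(min_dur * fs)
    fixations, i, n = [], 0, len(x)
    while i + min_len <= n:
        j = i + min_len
        if disp(i, j) <= max_disp:
            while j < n and disp(i, j + 1) <= max_disp:  # grow the window while compact
                j += 1
            fixations.append((i, j))
            i = j
        else:
            i += 1
    return fixations

# Example: two noisy fixations separated by an 8-degree gaze jump.
fs = 120.0
t = np.arange(0, 2, 1 / fs)
gaze_x = np.where(t < 1.0, 0.0, 8.0) + 0.05 * np.random.default_rng(0).normal(size=t.size)
gaze_y = np.zeros_like(gaze_x)
print(detect_fixations(gaze_x, gaze_y, fs))
```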
Gaze is shifted with fast eye movements, saccades (Figure 1: grey; Figure 1A), with or without a head movement. Smaller gaze shifts (<15°) are usually made without a head movement (Bahill et al., 1975a). In studies with a fixed head position, saccade parameters (e.g., rate, duration, peak velocity, main sequence (Bahill et al., 1975b)) have been shown to be sensitive to changes in cognitive state such as alertness level (Hirvonen et al., 2010), acute stress (Startsev and Zemblys, 2022), and cognitive load (Qu et al., 2022). We believe these parameters can be reliably estimated during the VR session and used to estimate the cognitive workload of the trainee during the flight simulation.
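For illustration, saccade parameters of this kind could be derived with a simple velocity-threshold detector, as sketched below: samples whose angular velocity exceeds a threshold are grouped into saccades, and each saccade yields an amplitude, duration, and peak velocity (the ingredients of the main sequence). The threshold and the synthetic gaze trace are illustrative assumptions, not part of the original work.

```python
# Hedged sketch of velocity-threshold saccade detection and main-sequence features.
import numpy as np

def saccade_features(gaze_deg, fs, vel_thresh=30.0):
    """gaze_deg: 1-D horizontal gaze angle (deg); fs: sampling rate (Hz).
    Returns a list of (amplitude_deg, duration_s, peak_velocity_deg_s)."""
    gaze_deg = np.asarray(gaze_deg, dtype=float)
    vel = np.gradient(gaze_deg) * fs          # angular velocity in deg/s
    above = np.abs(vel) > vel_thresh          # samples exceeding the threshold
    feats, i = [], 0
    while i < len(above):
        if above[i]:
            j = i
            while j < len(above) and above[j]:
                j += 1
            amp = abs(gaze_deg[j - 1] - gaze_deg[i])
            feats.append((amp, (j - i) / fs, np.abs(vel[i:j]).max()))
            i = j
        else:
            i += 1
    return feats

# Example: a smooth ~10-degree gaze shift around t = 0.5 s.
fs = 500.0
t = np.arange(0, 1, 1 / fs)
gaze = 10.0 / (1.0 + np.exp(-(t - 0.5) * 80))
for amp, dur, pv in saccade_features(gaze, fs):
    print(f"amplitude {amp:.1f} deg, duration {dur*1e3:.0f} ms, peak velocity {pv:.0f} deg/s")
```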
When the target object is far away, the head tends to continue moving after gaze fixates on the target, causing the eyes to make a compensatory movement via the vestibulo-ocular reflex (VOR) to keep the target in foveal vision (Figure 1: purple; Figure 1B). Given the coordinated nature of head and eye movements during attention allocation, the latency between eye and head movement, the amplitude ratio, and the direction of the movements can serve as indicators of the user’s performance, strategy, and cognitive state in VR. Moreover, dysfunction of the VOR (e.g., jerky eye movements instead of smooth ones) can indicate motion sickness and dizziness (Wallace and Lifshitz, 2016), (Clément and Reschke, 2018), (Biswas et al., 2024) and could be used to estimate cybersickness episodes during the flight training session.
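A very simplified way to quantify VOR performance is the gain between eye-in-head velocity and head velocity during head movement, where a well-functioning compensatory reflex gives a gain near 1. The sketch below is our illustrative assumption of such a computation, not a clinical VOR test; the velocity threshold is a placeholder.

```python
# Hedged sketch: VOR gain as the ratio of eye-in-head velocity to head velocity.
import numpy as np

def vor_gain(eye_deg, head_deg, fs, vel_thresh=20.0):
    """eye_deg: eye-in-head angle (deg); head_deg: head angle (deg); fs: Hz."""
    eye_vel = np.gradient(np.asarray(eye_deg, dtype=float)) * fs
    head_vel = np.gradient(np.asarray(head_deg, dtype=float)) * fs
    moving = np.abs(head_vel) > vel_thresh    # consider only clear head movement
    if not moving.any():
        return float("nan")
    # A compensatory eye movement mirrors the head; the minus sign flips it so
    # that a perfect reflex yields a gain of about 1.
    return float(np.mean(-eye_vel[moving] / head_vel[moving]))

# Example with a perfectly compensatory eye movement (gain ~ 1).
fs = 200.0
t = np.arange(0, 1, 1 / fs)
head = 20.0 * np.sin(2 * np.pi * 1.0 * t)     # 1 Hz head oscillation, 20 deg amplitude
print(vor_gain(-head, head, fs))
```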
Blink and pupil size parameters are robust to head movements and can thus be monitored throughout the VR session. Spontaneous eye blink rate (EBR), blink duration, and blink waveform parameters, as well as eyelid velocity and acceleration (see Figure 1: yellow; Figure 1C), are sensitive to changes in cognitive state, e.g., vigilance (Schleicher et al., 2008). EBR is also mediated by central dopaminergic activity and indicates cognitive performance (e.g., reinforcement learning and motivation, and cognitive flexibility) (Jongkees and Colzato, 2016). The luminance of the stimulus influences pupil size. Nevertheless, variations in pupil size can serve as an informative marker of cognitive state, especially excitement and engagement (Bradley et al., 2008), if the illumination remains constant or is known.
Eye and oculomotor reactions induced by embedded protocols
The visual scene guides the eyes and thus the oculomotor parameters. To obtain more versatile information on the user’s oculomotor behaviour and cognitive state, embedded stimuli could be included in the content at suitable moments. For instance, if the content includes a salient target, the user will most likely make a reflexive saccade towards the target, enabling the estimation of the saccade latency and of occasions when the target has been missed (Figure 1A). In addition, smooth pursuit can be induced by adding, e.g., a flying object to the flight simulation content. Dysfunction of the pursuit system could be an indicator of, e.g., fatigue (Stone et al., 2019) or cognitive impairment caused by alcohol (Tyson et al., 2021).
Simultaneous eye and head movement can be induced by implementing a large and rapid target movement across the FoV, prompting the user to execute a simultaneous head movement and saccade, potentially followed by a VOR (Figure 1B). With the help of such embedded stimuli, e.g., cybersickness symptoms can be monitored during the VR session.
Most of the presented eye parameters are under voluntary control and are thereby closely associated with the user’s motivation and engagement; for example, the user can voluntarily ignore the embedded target. However, in certain setups VR may incorporate stimuli that elicit startle responses. Startle responses (e.g., blink and pupil) are predominantly unconscious defensive reactions triggered by sudden or threatening stimuli, such as a loud noise or light flashes (Koch, 1999). The latency of the startle response reflects the functioning of the startle reflex controlled by the brainstem (Koch, 1999), providing insight into both affective and cognitive processes (Bradley and Sabatinelli, 2003).
AI aspects and cognitive state estimation
Head-area sensing in VR may benefit from AI in several ways, ranging from adaptive feature extraction, user cognitive state detection, and content adaptation to synthetic virtual content creation based on user preferences and assistance in reaching the VR session objectives by guiding the overall session management. Figure 2 presents a schematic overview of the data flow from sensing to personalised content adaptation.
Figure 2. Flow of head-area and biosignal data from sensing to calibration, feature extraction, modelling, and personalised content adaptation.
We illustrate the proposed approach (Figure 2) by using the example of estimating and tuning the trainee’s 1) cognitive load, 2) alertness level, and 3) cybersickness occasions during a VR flight simulator session. In this imaginary use-case, a user carries out a flight simulator session. The system operates in the background with the capabilities to run and personalise the VR content, perform automated feature extraction, and run state detectors to reach a certain overall objective set for a particular user.
The example: In the first phase of the session, the basic VR flight content is rendered, and separate parallel personalised models of cognitive workload, alertness, and cybersickness provide corresponding estimations based on the session objectives (e.g., steadily increasing cognitive load, maintaining the alertness level, and eliminating cybersickness). The content parameters of the first phase of the embedded protocol simulation are tuned for evaluating the individual model performance measures within each of the three cognitive state test cases. The cases can be run sequentially, or some of them can be combined. After the first phase, each of the models provides an individual state estimation (e.g., cognitive workload medium, alertness low, cybersickness high) for suggesting corresponding embedded protocol content parameter updates (derived from the measured physiology and oculometry together with model explanations). The content parameter updates are converted in the background into new simulation sequences (embedded protocol adaptation), which are then played (sequentially or combined) with fine-tuned estimation models in the second phase of the simulation. These steps can be repeated until the objectives of the training session are fulfilled.
More concretely, a flight simulation session has the objective of training a student to fly in various weather conditions and manage different malfunctions while flying from one location (A) to another (B). Here, personalised content adaptation optimises the task difficulty with embedded stimuli to maximize the learning effect. During a normal takeoff and climb to cruising altitude, the machine learning (ML) models for workload, alertness, and cybersickness are personalised. At the same time, cybersickness symptoms and vigilance level are checked. If, for instance, the system notices that the student’s low alertness level is not optimal for learning, stimulating elements are automatically added to the content (e.g., turbulence, a flock of birds). When the student’s alertness level has increased, the actual task can begin. The task difficulty is increased by adding embedded stimuli, for instance, increasing the crosswind. Afterwards, the student’s performance (e.g., correct altitude and heading) and cognitive workload are checked, and the task difficulty is increased automatically with various embedded stimuli (e.g., a malfunction or another weather condition) in order to keep the workload at an optimal level.
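The closed loop described above could be sketched as a simple rule-based policy mapping the three state estimates to content updates. Everything in the snippet (the StateEstimate fields, the target ranges, and stimulus names such as add_turbulence) is a hypothetical placeholder used only to make the control flow concrete; a deployed system would use learned, personalised models rather than fixed thresholds.

```python
# Hedged sketch of the adaptation loop: cognitive state estimates in,
# task difficulty and embedded stimuli out. All names and thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class StateEstimate:
    workload: float       # 0 (low) .. 1 (high)
    alertness: float      # 0 (low) .. 1 (high)
    cybersickness: float  # 0 (none) .. 1 (severe)

def adapt_content(est: StateEstimate, difficulty: float) -> tuple[float, list[str]]:
    """Return updated task difficulty and a list of embedded stimuli to insert."""
    stimuli = []
    if est.cybersickness > 0.6:
        # Safety first: calm the scene and reduce optical flow before anything else.
        return max(0.0, difficulty - 0.2), ["reduce_motion", "stabilise_horizon"]
    if est.alertness < 0.4:
        stimuli.append("add_turbulence")          # alerting event, e.g., flock of birds
    if est.workload < 0.3:
        difficulty = min(1.0, difficulty + 0.1)   # e.g., increase the crosswind
    elif est.workload > 0.7:
        difficulty = max(0.0, difficulty - 0.1)   # ease off to stay in the optimal band
    return difficulty, stimuli

# Example: a moderately loaded but drowsy trainee.
print(adapt_content(StateEstimate(workload=0.5, alertness=0.3, cybersickness=0.1), 0.4))
```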
As perceptual, attentional, and physiological processes are specific to each individual, modelling and content adaptation should account for individual differences. Personalisation requires some prior information, which is not available for a new user, i.e., when a cold start occurs. Since new users will likely require a short period of time to get used to the setup, an AI-assisted calibration process can be run while collecting the required baseline information to calibrate the setup and personalise the cognitive state detection.
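One simple ingredient of such personalisation, shown below as an assumption rather than the authors' calibration method, is normalising each extracted feature against the user's own baseline collected during the familiarisation period, so that downstream state detectors operate on user-relative deviations instead of absolute values.

```python
# Hedged sketch: baseline (z-score) calibration of features against a user's own data.
import numpy as np

class BaselineCalibrator:
    def fit(self, baseline_features: np.ndarray):
        """baseline_features: (n_segments, n_features) from the calibration period."""
        self.mean_ = baseline_features.mean(axis=0)
        self.std_ = baseline_features.std(axis=0) + 1e-9  # avoid division by zero
        return self

    def transform(self, features: np.ndarray) -> np.ndarray:
        """Return user-relative (z-scored) feature values."""
        return (features - self.mean_) / self.std_

# Example with a synthetic heart-rate-like feature baseline (three features).
rng = np.random.default_rng(1)
calib = BaselineCalibrator().fit(rng.normal(loc=60, scale=5, size=(30, 3)))
print(calib.transform(np.array([[75.0, 62.0, 58.0]])))  # deviation from this user's baseline
```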
Biosignal events and the corresponding extracted features are computed based on the visual scene, the task at hand, the presence of head movements, and the temporality of the physiological phenomenon behind the biosignal feature. For instance, blinks occur every 3.5 s on average, making the calculation of blink rate over any shorter segments pointless. Moreover, cognitive responses and oculomotor behaviour emerge with varying velocity and duration, making some of them useful for fast-paced content adaptations and others more relevant for longer-term domain-specific applications. Essentially, adaptive feature sampling and segmentation involve three distinct time frames.
i) Super-fast reactions (a few seconds at most), such as panicking, which should be caught for fast safety-critical interventions.
ii) Short reactions (about a minute), such as cognitive load and acute stress, detectable from, e.g., ECG, EDA, and eye parameters.
iii) Slow reactions (3–10 min or more), such as flow and engagement, which require slower interventions.
These interventions largely consist of the addition of embedded stimuli, such as modifying the visual scene to show only essential information, although switching to an automated operating mode might also be needed in some cases and applications. Reactions to the interventions are monitored and fed back as input in the feedback loop.
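To make the three time frames concrete, the sketch below summarises a single feature stream over sliding windows of a few seconds, about a minute, and several minutes in parallel; the window lengths and the 4 Hz example stream are illustrative assumptions only.

```python
# Hedged sketch of multi-timescale feature segmentation over one feature stream.
import numpy as np

WINDOWS_S = {"super_fast": 3, "short": 60, "slow": 300}  # illustrative window lengths (s)

def windowed_means(signal: np.ndarray, fs: float) -> dict[str, float]:
    """Return the mean of the most recent window for each time frame."""
    out = {}
    for name, seconds in WINDOWS_S.items():
        n = int(seconds * fs)
        out[name] = float(np.mean(signal[-n:])) if len(signal) >= n else float("nan")
    return out

# Example: 10 minutes of a 4 Hz feature stream (e.g., a tonic EDA level).
fs = 4.0
stream = np.random.default_rng(2).normal(size=int(fs * 600))
print(windowed_means(stream, fs))
```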
In this scenario, the AI models responsible for user calibration, cognitive state estimation, and adaptations with embedded protocols work in conjunction with another AI model that generates synthetic data. Augmenting the calibration data with a synthetic counterpart could help to improve the performance of the state detection, especially in a cold start. The true potential of generative AI, however, comes from counterfactual prediction of the adaptations needed to steer user states in a desired direction, and then creating the required virtual world. Such an AI could, for example, shift interactive storytelling from active, explicit decisions made by the user to implicit deduction of one’s wishes and the creation of a corresponding multimodal narrative, with potential applications ranging from training and entertainment to rehabilitation. These processes relate directly to the creation and optimization of the virtual space for each user and situation. We see these topics as the most challenging technical advancements required in the near future to achieve the vision.
Discussion
The use of personalised content adaptation will lead to significant societal impact and cost savings by improving the efficiency of VR sessions (e.g., in education and rehabilitation) as well as by decreasing the probability of human errors through guiding attention allocation (e.g., in safety-critical tasks). For these reasons, we claim that our approach will be a necessity in future VR setups.
We have illustrated the details and potential of our approach with the flight simulator training use-case. However, real-time content and embedded stimuli adaptation opens the possibility of making desired interventions in a variety of use-cases. The application domains range from entertainment and the workplace to education, with examples including:
• Entertainment, in VR games and movies, to enhance the experience by, e.g., modifying the level of engagement.
• Education, in training contexts such as flight simulators, to optimize the content for maximizing learning, e.g., real-time optimization of the task difficulty.
• Safety-critical work, such as control rooms, where there is a demand to estimate the cognitive state of the human in the loop.
• Wellbeing, wellness, and rehabilitation (e.g., stroke, post-traumatic stress disorder).
A comfortable user experience in VR requires synchrony between the audiovisual content and the immediate controlling events, such as head turns moving the FoV and eye tracking sharpening the image once movement halts. However, the envisioned AI processes for visual perception and cognitive state modelling, adaptations with embedded stimuli, and especially the creation of virtual worlds all have potentially significant computational costs. The orchestration of these AI processes may require architecturally complex solutions and integration with edge or cloud processing to ensure that the models work accurately and in timely conjunction.
The potential applications require processing sensitive personal data, and some of the states that can be detected are private to the user. Besides computational challenges, developers need to consider the ethical aspects of their applications, including user privacy and anonymity, information security, and fairness. Data processing and AI based systems are also increasingly regulated, with the General Data Protection Regulation and the recent AI Act in the EU, and the non-binding AI Bill of Rights in the US. Such regulations and guidelines help developers ensure responsible data processing and use of AI and should therefore be closely followed.
The envisioned approach will revolutionise our understanding of human visual perception, cognitive state, and behaviour in VR. With the integration of context detection, it can be further extended to AR with an increasing number of real-world components, and even implemented in real-life settings by using smart eyewear. Such a setting would allow for an extremely diverse analysis of human perception, cognition, and behaviour in everyday life.
Data availability statement
The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.
Author contributions
KP: Conceptualization, Funding acquisition, Methodology, Supervision, Visualization, Writing–original draft, Writing–review and editing. JT: Conceptualization, Methodology, Visualization, Writing–original draft, Writing–review and editing. JH: Visualization, Writing–original draft, Writing–review and editing. JM: Conceptualization, Funding acquisition, Methodology, Project administration, Supervision, Visualization, Writing–original draft, Writing–review and editing.
Funding
The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. The work was funded by the Academy of Finland (grant 334092) and VTT.
Acknowledgments
The image of a human figure used in Figures 1, 2 is designed by Freepik (AI image generator beta)1.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Footnotes
1https://www.freepik.com/ai/image-generator
References
Agtzidis, I., Startsev, M., and Dorr, M. (2019). “360-degree video gaze behaviour: a ground-truth data set and a classification algorithm for eye movements,” in Proceedings of the 27th ACM international conference on multimedia (Nice France: ACM), 1007–1015. doi:10.1145/3343031.3350947
Bahill, A. T., Abler, D. L., and Stark, L. (1975a). Most naturally occurring human saccades have magnitudes of 15 degrees or less. Investig. Ophthalmol. 14 (6), 468–469.
Bahill, A. T., Clark, M. R., and Stark, L. (1975b). The main sequence, a tool for studying human eye movements. Math. Biosci. 24 (3–4), 191–204. doi:10.1016/0025-5564(75)90075-9
Biswas, N., Mukherjee, A., and Bhattacharya, S. (2024). ‘Are you feeling sick?’ A systematic literature review of cybersickness in virtual reality. ACM Comput. Surv. 56, 1–38. doi:10.1145/3670008
Bradley, M. M., Miccoli, L., Escrig, M. A., and Lang, P. J. (2008). The pupil as a measure of emotional arousal and autonomic activation. Psychophysiology 45 (4), 602–607. doi:10.1111/j.1469-8986.2008.00654.x
Bradley, M. M., and Sabatinelli, D. (2003). “Startle reflex modulation: perception, attention, and emotion,” in Experimental methods in neuropsychology. Neuropsychology and cognition, vol. 21. Editor K. Hugdahl (Boston, MA: Springer US), 65–87. doi:10.1007/978-1-4615-1163-2_4
Clément, G., and Reschke, M. F. (2018). Relationship between motion sickness susceptibility and vestibulo-ocular reflex gain and phase. J. Vestib. Res. 28 (3–4), 295–304. doi:10.3233/VES-180632
Cox, J. A., and Aimola Davies, A. M. (2020). Keeping an eye on visual search patterns in visuospatial neglect: a systematic review. Neuropsychologia 146, 107547. doi:10.1016/j.neuropsychologia.2020.107547
Di Stasi, L. L., Antolí, A., and Cañas, J. J. (2011). Main sequence: an index for detecting mental workload variation in complex tasks. Appl. Ergon. 42 (6), 807–813. doi:10.1016/j.apergo.2011.01.003
Giannakakis, G., Grigoriadis, D., Giannakaki, K., Simantiraki, O., Roniotis, A., and Tsiknakis, M. (2022). Review on psychological stress detection using biosignals. IEEE Trans. Affect. Comput. 13 (1), 440–460. doi:10.1109/TAFFC.2019.2927337
Halbig, A., and Latoschik, M. E. (2021). A systematic review of physiological measurements, factors, methods, and applications in virtual reality. Front. Virtual Real. 2, 694567. doi:10.3389/frvir.2021.694567
Haskins, A. J., Mentch, J., Botch, T. L., and Robertson, C. E. (2020). Active vision in immersive, 360° real-world environments. Sci. Rep. 10 (1), 14304. doi:10.1038/s41598-020-71125-4
Hirvonen, K., Puttonen, S., Gould, K., Korpela, J., Koefoed, V. F., and Müller, K. (2010). Improving the saccade peak velocity measurement for detecting fatigue. J. Neurosci. Methods 187 (2), 199–206. doi:10.1016/j.jneumeth.2010.01.010
Jongkees, B. J., and Colzato, L. S. (2016). Spontaneous eye blink rate as predictor of dopamine-related cognitive function—a review. Neurosci. and Biobehav. Rev. 71, 58–82. doi:10.1016/j.neubiorev.2016.08.020
Koch, M. (1999). The neurobiology of startle. Prog. Neurobiol. 59 (2), 107–128. doi:10.1016/S0301-0082(98)00098-7
Lim, J., and Dinges, D. F. (2010). A meta-analysis of the impact of short-term sleep deprivation on cognitive variables. Psychol. Bull. 136 (3), 375–389. doi:10.1037/a0018883
Luong, T., Martin, N., Raison, A., Argelaguet, F., Diverrez, J.-M., and Lecuyer, A. (2020). “Towards real-time recognition of users mental workload using integrated physiological sensors into a VR HMD,” in 2020 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Porto de Galinhas (Brazil: IEEE), 425–437. doi:10.1109/ISMAR50242.2020.00068
Merzon, L., Pettersson, K., Aronen, E. T., Huhdanpää, H., Seesjärvi, E., Henriksson, L., et al. (2022). Eye movement behavior in a real-world virtual reality task reveals ADHD in children. Sci. Rep. 12 (1), 20308. doi:10.1038/s41598-022-24552-4
Miltiadous, A., Aspiotis, V., Sakkas, K., Giannakeas, N., Glavas, E., and Tzallas, A. T. (2022). “An experimental protocol for exploration of stress in an immersive VR scenario with EEG,” in 2022 7th South-East Europe Design Automation, Computer Engineering, Computer Networks and Social Media Conference (SEEDA-CECNSM), Ioannina, Greece (IEEE), 1–5. doi:10.1109/SEEDA-CECNSM57760.2022.9932987
Pettersson, K., Tervonen, J., Närväinen, J., Henttonen, P., Määttänen, I., and Mäntyjärvi, J. (2020) “Selecting feature sets and comparing classification methods for cognitive state estimation,” in Presented at the 2020 IEEE 20th international conference on bioinformatics and bioengineering (BIBE). IEEE, 683–690. doi:10.1109/BIBE50027.2020.00115
Qu, C., Che, X., Ma, S., and Zhu, S. (2022). Bio-physiological-signals-based VR cybersickness detection. CCF Trans. Pervasive Comp. Interact. 4 (3), 268–284. doi:10.1007/s42486-022-00103-8
Rauschnabel, P. A., Felix, R., Hinsch, C., Shahab, H., and Alt, F. (2022). What is XR? Towards a framework for augmented and virtual reality. Comput. Hum. Behav. 133, 107289. doi:10.1016/j.chb.2022.107289
Rayner, K. (1998). Eye movements in reading and information processing: 20 years of research. Psychol. Bull. 124 (3), 372–422. doi:10.1037/0033-2909.124.3.372
Schleicher, R., Galley, N., Briest, S., and Galley, L. (2008). Blinks and saccades as indicators of fatigue in sleepiness warnings: looking tired? Ergonomics 51 (7), 982–1010. doi:10.1080/00140130701817062
Shadiev, R., and Li, D. (2023). A review study on eye-tracking technology usage in immersive virtual reality learning environments. Comput. and Educ. 196, 104681. doi:10.1016/j.compedu.2022.104681
Shiferaw, B. A., Downey, L. A., Westlake, J., Stevens, B., Rajaratnam, S. M. W., Berlowitz, D. J., et al. (2018). Stationary gaze entropy predicts lane departure events in sleep-deprived drivers. Sci. Rep. 8 (1), 2220. doi:10.1038/s41598-018-20588-7
Startsev, M., and Zemblys, R. (2022). Evaluating eye movement event detection: a review of the state of the art. Behav. Res. 55 (4), 1653–1714. doi:10.3758/s13428-021-01763-7
Stone, L. S., Tyson, T. L., Cravalho, P. F., Feick, N. H., and Flynn-Evans, E. E. (2019). Distinct pattern of oculomotor impairment associated with acute sleep loss and circadian misalignment. J. Physiol. 597 (17), 4643–4660. doi:10.1113/JP277779
Tao, K., Huang, Y., Shen, Y., and Sun, L. (2022). Automated stress recognition using supervised learning classifiers by interactive virtual reality scenes. IEEE Trans. Neural Syst. Rehabil. Eng. 30, 2060–2066. doi:10.1109/TNSRE.2022.3192571
Tervonen, J., Närväinen, J., Mäntyjärvi, J., and Pettersson, K. (2023). Explainable stress type classification captures physiologically relevant responses in the Maastricht Acute Stress Test. Front. Neuroergonomics 4, 1294286. doi:10.3389/fnrgo.2023.1294286
Tyson, T. L., Feick, N. H., Cravalho, P. F., Flynn-Evans, E. E., and Stone, L. S. (2021). Dose-dependent sensorimotor impairment in human ocular tracking after acute low-dose alcohol administration. J. Physiol. 599 (4), 1225–1242. doi:10.1113/JP280395
Ugwitz, P., Kvarda, O., Juříková, Z., Šašinka, Č., and Tamm, S. (2022). Eye-tracking in interactive virtual environments: implementation and evaluation. Appl. Sci. 12 (3), 1027. doi:10.3390/app12031027
Wallace, B., and Lifshitz, J. (2016). Traumatic brain injury and vestibulo-ocular function: current challenges and future prospects. Eye Brain 8, 153–164. doi:10.2147/EB.S82670
Keywords: visual perception, oculomotor behavior, cognitive state estimation, virtual reality, adaptive sampling, artificial intelligence
Citation: Pettersson K, Tervonen J, Heininen J and Mäntyjärvi J (2024) Head-area sensing in virtual reality: future visions for visual perception and cognitive state estimation. Front. Virtual Real. 5:1423756. doi: 10.3389/frvir.2024.1423756
Received: 26 April 2024; Accepted: 29 August 2024;
Published: 20 September 2024.
Edited by:
Jesús Gutiérrez, Universidad Politécnica de Madrid, Spain
Reviewed by:
Aaron L. Gardony, U. S. Army Combat Capabilities Development Command Soldier Center, United States
Copyright © 2024 Pettersson, Tervonen, Heininen and Mäntyjärvi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: K. Pettersson, kati.pettersson@vtt.fi