- 1TAUCHI Research Center, Tampere University, Tampere, Finland
- 2Nokia Bell Labs, Tampere, Finland
Head mounted displays provide a good platform for viewing immersive 360° or hemispheric images. A person can observe the image all around simply by turning his/her head and looking in different directions. The device also provides a highly useful tool for studying the observer’s gaze directions and head turns. We aimed to explore the interplay between participants’ head and gaze directions and collected head and gaze orientation data while participants were asked to view and study hemispheric images. In this exploratory paper we show combined visualizations of both the head and gaze orientations and present two preliminary models of the relation between gaze and head orientation. We also show results of an analysis of gaze and head behavior in relation to the given task/question.
1 Introduction
Recent development of low-cost eye trackers has meant a significant expansion in research on the practical utilization of gaze in various kinds of situations. Some practical use cases for eye tracking are, for example, 1) usability studies where the user’s gaze behavior gives valuable information about which features the user is and is not paying attention to (Jacob and Karn, 2003; Poole and Ball, 2006), 2) market research where gaze behavior is studied to learn which features in a product are noticed (Wedel and Pieters, 2008), and 3) as an input method for human-computer interfaces (Kangas et al., 2016; Morimoto and Mimica, 2005; Rantala et al., 2020). A special application area for gaze tracking has been providing human-computer input methods for people with disabilities who are unable to use other input technologies (Bates et al., 2007).
In interaction between humans, gaze directions are often important cues of interest and of the flow of the interaction. Technologies have been developed to share gaze information in cooperative work, especially in remote work, as there is clear evidence that sharing the gaze improves common situation understanding (Hanna and Brennan, 2007; Brennan et al., 2008). It has been shown that there are differences in gaze use between experts and novices (Law et al., 2004; Gegenfurtner et al., 2011), and gaze tracking may be beneficial in learning (Wilson et al., 2011). An interesting application of gaze tracking is to indirectly observe cognitive processes (Liversedge and Findlay, 2000; Just and Carpenter, 1976), such as the order in which a person studies a complicated environment or navigates social situations. Eye tracking can also be used for studying possible driver fatigue in automotive applications (Bretzner and Krantz, 2005), cognitive load (Biswas et al., 2016) or cognitive dysfunction, such as autism (Falck-Ytter et al., 2013). Overall, there are plenty of examples of useful applications of gaze tracking.
However, gaze trackers are still special equipment and not always available. New low-power trackers may change that in the future (Angelopoulos et al., 2021; Li et al., 2020). One proposal has been to use the head orientation as a proxy measure of the gaze orientation. There have been some studies (for example, Ohn-Bar et al., 2014) where the observer’s “gaze is inferred using head-pose and eye-state.” In our experiments we wanted to collect data on gaze and head behaviour to study potential new methods for predicting the gaze orientation.
2 Prior Research on Head and Gaze Dynamics
The coordination of head orientation in relation to gaze behavior has been extensively studied, and a strong interplay between gaze and head control has often been demonstrated. See, for example, Guitton (1992) for a discussion of the basis of the coordination. In his review article, Freedman (2008) described the diversity of the coordination and listed several hypotheses about its neural control.
Many of the early studies used a simple stimulus that the participant was expected to look at, often only a single object appearing in an otherwise empty space. Such an arrangement may lead to somewhat one-sided visual behaviour that might be reflected in results about gaze and head coordination. Biguer et al. (1982) studied a setup where a participant was to reach for an object and showed that while the movements are sequentially ordered (eye, then head and finally arm), the onset of muscle activity for the head and arm movements is synchronous with the eye movement. Goossens and Van Opstal (1997) studied how different stimulus modalities might affect gaze and head coordination, and their main results also supported the strong linkage of gaze and head movements.
In natural tasks the gaze and head coordination is affected by the predictability of the target. Kowler et al. (1992) demonstrated that in reading tasks the head may start returning to the beginning of the next line before the gaze has reached the end of the previous line. Similarly, Pelz et al. (2001) demonstrated that (in a well-known visual environment) head movements showed considerable flexibility and were not tied to the gaze behavior, but were influenced by predictions of future needs. In their words (Pelz and Canosa, 2001, p. 276): “Thus, the eye-head system is very flexible when the timing and goals of the movements are under the subject’s control.”
Land (1992) examined oculomotor behaviour in driving situations, paying attention to situations where large gaze movements were needed, such as at road junctions, and noticed that “The results show that the pattern of eye and head movements is highly predictable, given only the sequence of gaze targets.” Doshi and Trivedi (2012) also studied driving behaviour and found that there are different coordination strategies depending on the situation. For example, they found a pattern of preparatory head motions before the gaze movement in task-oriented attention shifts. Zangemeister and Stark (1982a) and Zangemeister and Stark (1982b) noted that head and gaze coordination is modified by conditions such as “instructions to the subject, frequency and predictability of the target, amplitude of the movement, and development of fatigue.” Ron and Berthoz (1991), Guitton and Volle (1987) and Oommen et al. (2004) also list situations where the order of gaze and head movements varies. Foulsham et al. (2011) and Hart et al. (2009) demonstrated that gaze behavior is highly context dependent by comparing gaze behavior while moving in the world with gaze behavior while viewing the same or similar video views in a laboratory with the head kept still. Gaze behavior seems to be more concentrated in the real-world context than in the laboratory. They did not, however, study the dynamics between head and gaze behavior.
Sitzmann et al. (2018) collected gaze and head orientation data from participants exploring stereoscopic omni-directional panoramas in a virtual reality (VR) setting. The target of the study was mainly to analyze the performance of visual saliency predictors, but the data also provided some insights into head and gaze dynamics. We define visual saliency as the distinct subjective perceptual quality which makes some items in the world stand out from their neighbors and immediately grab our attention (Itti, 2019). Rai et al. (2017) and David et al. (2018) collected two databases of gaze and head orientations where a large number of participants were viewing either 360° images (Rai et al., 2017) or 360° videos (David et al., 2018) using head mounted displays (HMD). Their target was to provide test data and analysis tools to enable the study of visual saliency in VR environments. Hu et al. studied natural head movements (Hu et al., 2017) and then the correlations between head and gaze movements (Hu et al., 2018) while participants studied complex scenes in an HMD. They did not, however, record the head and gaze data during the same session; the correlation was computed between cumulative data from separate sessions.
Sidenmark and Gellersen (2019a, 2019b) have studied the coordination of head and gaze orientations for enabling fast and convenient gaze interaction mechanisms. In Sidenmark and Gellersen (2019b) the gaze and head were both used to select a location in a view. The dynamics between the head and gaze were then intentional, as the user was trying to activate a target. A related method was one option in Kytö et al. (2018), where gaze was used for selection and gaze point error could be corrected by a deliberate head move. The natural behavior of head, gaze and torso directions and their timing differences were studied in Sidenmark and Gellersen (2019a), based on measurements of when the respective movements started and how the fixation directions differed.
Hu et al. (2019) collected gaze and head orientation data from a number of participants viewing 360° virtual worlds using an HMD and a gaze tracker. The participants were able to freely navigate in the scenes. The purpose of the study was to develop and evaluate a real-time gaze prediction model based on the head data. The final model by Hu et al. was based on head angular velocities and not on head orientation.
An important limitation in using HMD devices for studying gaze behavior is the fact that the HMD field of view is naturally limited by the display size. It has been shown that this has a natural effect on gaze and head behavior (Kollenberg et al., 2010; Pfeil et al., 2018) when studying natural scenes which are larger than the display size. In our study we targeted HMD use only. Studies on HMD use are important as near-eye displays will improve and become more common (Koulieris et al., 2019; Konrad et al., 2020; Padmanaban et al., 2017; Kim et al., 2019).
3 Head and Gaze Coordination in Image Viewing
While it has been shown that the coordination of gaze and head is highly dependent on external factors, such as the subject’s expectations regarding future directions of gaze, we were interested in understanding the gaze-head behaviour in our special case, the viewing of images on head mounted displays. There have been studies where gaze and head data have been collected using head mounted displays (e.g., Sitzmann et al., 2018; Rai et al., 2017; David et al., 2018), but the main interest has been on analyzing visual saliency, while we were more interested in the general gaze and head dynamics. Hu et al. (2019) also studied gaze-head behaviour when using head mounted displays, but in virtual world scenarios where the participants could freely navigate.
Our main target was to study the possibility of constructing a gaze predictor based on head behaviour. We expected that there would be some similarities to the results by Land (1992) (i.e., more restricted, “simple” gaze-head coordination). If the linkage between gaze and head direction were indeed found to be rather strong, we could hope for a successful gaze estimation based on measurements of head behaviour.
We decided to collect a database of gaze and head orientations where the participants were looking at a set of hemispherical images (the front half of a spherical image around the viewer, i.e., a 180° view). The preliminary reason for collecting the data was to be able to visualize and analyze the interplay between the head and gaze orientations, i.e., to see if we could create some interesting models of the connection between gaze and head orientations. We expected to find simple relations between these two parameters.
We were also interested in knowing whether the context of the image viewing has an effect on the head and gaze interplay. Such an effect has been studied for gaze only (Rayner, 2009; Torralba et al., 2006; Bulling et al., 2009), but to our knowledge nobody has studied context recognition using the combination of head and gaze directions in image viewing. Separately, we were interested in experimenting with some measures to see if different tasks/contexts given to the image viewers would noticeably affect their viewing behavior.
Ishimaru et al. (2014) have studied the combination of head motion and eye blinks for activity recognition. It has also been shown earlier that the gaze path can be used to infer task-based information (see, e.g., Coutrot et al., 2018), and that the distributions of fixations on salient points can be used to infer the task given to observers (Koehler et al., 2014). Therefore, we designed the data collection so that different participants were given different reasons to look at any specific image. For every image viewing we asked the participant a question about the image, or in some cases just asked him/her to look at the image. By analyzing possible differences in viewer behaviors we expected to notice some context effect.
Overall, the main purpose of our experiment was to conduct an exploratory study and to identify promising directions for more extensive future studies.
3.1 Contribution Statement
The studies in this paper are exploratory in nature and we were looking for potential effects and interplay of head and gaze orientations in different viewing tasks. The contribution of this paper is fourfold: 1) We report a data collection effort of both head and gaze orientations of a number of participants viewing hemispheric images in a fixed virtual reality environment. 2) We show two related new visualizations of the interplay between head and gaze orientations. 3) We report studies of two models of coordination between the head orientation and gaze orientation when viewing hemispheric images. 4) We report studies of two measures of gaze and head behaviour to separate between different viewing tasks given to the participants.
4 Materials and Methods
4.1 Participants
We recruited 31 volunteer participants (12 females, 19 males) from university students and personnel. The age distribution was as follows: 3 participants up to 24 years, 15 in the range 25–34, 8 in the range 35–44, 3 in the range 45–54, and 2 older participants. Of the 31 participants, 16 had normal vision and 15 had corrected-to-normal vision; 9 of the latter wore glasses while using the virtual reality head mounted display.
4.2 Apparatus
In the experiment we used a Windows PC, an HTC Vive VR headset with a built-in Tobii gaze tracker, and a custom VR experience player application (Järvenpää et al., 2017) to render the images and to collect the head and gaze data. The application was developed using the Unity platform. It has wide cross-platform VR and gaze tracker support and has been used in various VR experience evaluations. The HTC Vive VR headset had a 110° field of view and 1440 × 1600 pixels per eye. The Tobii gaze tracker had a reported accuracy of 0.5–1.1°. The original 180° images had been captured with a Canon 6D digital camera at 3600 × 3600 pixels resolution and upscaled to 4096 × 4096 pixels for the player application.
The participant had no control devices other than the “head-gaze” pointer (head pointer) in the headset. The facilitator used a separate mouse to control the progress of the experiment software and to confirm the answers selected by the participant. The gaze and head orientation data were collected at a 75 Hz sampling rate. For the head orientation information we captured the direction of the head pointer commonly available in head mounted display applications for controlling the user interface, even though it was not visible to the participant at the time of viewing (see Section 5.2 for a discussion of the head orientation data).
4.3 Experimental Design
In the experiment all the participants sequentially viewed six hemispherical images using a head mounted display device, one image at a time. An example of the images, in equirectangular projection, can be seen in the background of Figure 1. Before each image viewing the participants were either asked a question or just asked to look at the image. After 40 s of viewing (a predefined fixed length of time) the participants answered the question in a multiple choice format, or were asked to continue to the next image if there was no question. All the questions and choices were shown on the head mounted display.
FIGURE 1. Gaze points (red) and head orientations (blue) collected from one participant while looking at an image for 40 s. The path of the head orientation consists of separate dots (like the more clustered gaze path), but because the head turns relatively slowly, the adjacent dots are rather close to each other. For this image the participant was not given any specific task.
In the experiment we gave the participants different motivations for viewing the images. The task of the viewer was either to find an answer to a given question or just to look at the image freely. For analysis purposes we defined four categories, consisting of one free viewing category plus three different question categories. All the questions used in the experiment are listed in Table 1.
TABLE 1. All the different questions in the three question categories and the single statement used in the experiment.
To clarify the research work, we formulated the following research questions (RQ1–RQ3) to guide our work:
1. What is the relation between head and gaze orientations in image viewing?
2. Is it possible to estimate the gaze orientation from the head orientation?
3. How does the free viewing category differ in head and gaze orientation data from other categories?
For testing purposes, we also had the following hypotheses (H1–H2) that we could try out with the final data concerning our use case:
1. The participants would turn their gaze before their head when looking around.
2. The measurements of head movement can be used to improve the estimate of gaze direction.
4.4 Procedure
Upon arrival, the participant was introduced to the study and the equipment. The participant was seated on a fixed (non-revolving) chair that guided the participant to keep facing the same direction throughout the experiment. The images were oriented so that the exact center of the image always stayed in front of a viewer sitting straight on the chair.
The participant then put on the head mounted display and the facilitator started the data collection program. First the participant was shown one example image that was mainly used to demonstrate what the hemispheric images look like and what the extent of the image is to the right and left, up and down. During that time the participant and the facilitator were also able to check how well the display fit the participant. After observing the example image, the integrated gaze tracker was calibrated using the routines provided by the tracker manufacturer. After tracker calibration the participant was asked for some demographic information (age group, gender and vision), all of which was answered inside the head mounted display. To select an answer, the participant moved the head pointer onto the desired choice visible at that stage. After all of that had been completed, the head pointer was hidden and the actual data collection started.
The data collection consisted of six question/image viewing tasks. For each task the participant was shown one question/statement (see Table 1 for a full list), and after the participant confirmed that s/he had understood the question, the hemispherical image was shown. The image was displayed for 40 s, irrespective of whether the participant was ready with an answer or not. In deciding the length of viewing we wanted a fixed duration long enough for all the participants. After the 40 s had passed the image disappeared and the participant was shown the possible answers (multiple choice format) to the question, among which s/he then made the selection. For the statement (“Just look … ”) the participant did not answer but proceeded to the next task. All the participants were shown the same six images, but there were several alternative questions/statements for each image and only one was shown to each participant. Each question/statement was used equally often during the experiment.
After going through all six images the experiment was over and the participant was instructed to remove the display. The facilitator was available to answer any questions about the experiment, but no other data was collected. The whole experiment took less than 30 min per participant. Simulator sickness, which might affect this kind of experiment, was not measured, but none of the participants mentioned any such symptoms.
5 Data Analysis and Results
For visual analysis of the collected data we used two different visualization methods. In the first method all the gaze points and head directions were mapped as separate dots on top of the equirectangular image that the participant had been viewing. One example of such a visualization is shown in Figure 1. In the example the participant has mainly viewed the horizontal street-level area, left and right, with some glances up and down.
In the second visualization method the dots of the head orientations were connected to their respective gaze points by a thin line to emphasize the timing relation of the head and gaze behavior. One example of this type of visualization is shown in Figure 2. In the image one can observe, for example, how the head occasionally keeps moving while the gaze remains fixed.
FIGURE 2. Same data as in Figure 1 but here the head points (blue) are connected to their respective gaze points by (green) lines to emphasize the timing of the data points in the two sequences. One can notice that the gaze points on the image sides are generally farther away from the center of the image compared to the respective head points, especially when the head had to be turned more to the sides.
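As a rough illustration of how such overlays can be produced, the Python sketch below plots gaze and head samples, given as (yaw, pitch) angles, on top of an equirectangular hemispheric image. This is not the plotting code used for Figures 1 and 2; the linear ±90° angle-to-pixel mapping, the N × 2 array layout and all names are our assumptions.

```python
import matplotlib.pyplot as plt

def plot_gaze_and_head(image, gaze, head, connect=False):
    """Overlay gaze (red) and head (blue) samples on a hemispheric image.

    image -- equirectangular image array assumed to span +/-90 deg both ways
    gaze, head -- N x 2 NumPy arrays of (yaw, pitch) angles in degrees, time-aligned
    connect -- if True, draw a thin green line between each time-matched pair
               (cf. Figure 2); otherwise show only the dots (cf. Figure 1)
    """
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.imshow(image, extent=[-90, 90, -90, 90])  # map image pixels to degrees
    if connect:
        for (gx, gy), (hx, hy) in zip(gaze, head):
            ax.plot([gx, hx], [gy, hy], color="green", linewidth=0.3)
    ax.scatter(gaze[:, 0], gaze[:, 1], s=2, color="red", label="gaze")
    ax.scatter(head[:, 0], head[:, 1], s=2, color="blue", label="head")
    ax.legend(loc="upper right")
    return fig
```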
5.1 Statistical Significance Measures
We used a Monte Carlo permutation test (Dugard, 2014; Edgington and Onghena, 2007; Howell, 2007; Nichols and Holmes, 2001) to analyze possible statistically significant differences between different parameter sets. The permutation test does not depend on as many assumptions about the sample distribution as some other tests such as ANOVA (Dugard, 2014); in particular, the test sample need not be normally distributed. Also, we used median values as the test statistic, while some other methods can only use the mean. Compared to the mean, the median is more tolerant of outliers in the data.
In all tests an observed value of a measurement is compared against a distribution of measurements produced by resampling a large number of sample permutations assuming no difference between the sample sets (the null hypothesis). The relevant p-value is then given by the proportion of the distribution values that are more extreme than or equal to the observed value. To get the distribution of measurements assuming no difference between the conditions, we pooled the sample values from both conditions and resampled from the pooled set, generating 10,000 permutations.
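For concreteness, a minimal sketch of such a test is given below. This is our illustration rather than the analysis code used in the study; the two-sided handling of “more extreme” values and all names are assumptions.

```python
import numpy as np

def permutation_test_median(a, b, n_perm=10_000, rng=None):
    """Monte Carlo permutation test with the difference of medians as the
    test statistic, following the procedure described in Section 5.1."""
    rng = np.random.default_rng() if rng is None else rng
    a, b = np.asarray(a, float), np.asarray(b, float)
    observed = np.median(a) - np.median(b)

    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                    # relabel samples under the null hypothesis
        stat = np.median(pooled[:len(a)]) - np.median(pooled[len(a):])
        if abs(stat) >= abs(observed):         # as extreme as or more extreme than observed
            count += 1
    return count / n_perm                      # Monte Carlo p-value
```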
5.2 Head Orientation Correction
While the gaze data directly indicates the direction of interest in the image, the head orientation is not as easy to interpret. First, the head pointing direction is defined relative to the head mounted display itself, and while the display fits rather naturally on the wearer’s face, there is no guarantee of its orientation. For example, the display might be slightly tilted because of head geometry, which is difficult to correct. Second, some participants habitually tilt their head slightly forward or backward, which means that their head pointing data is on average slightly below or above a “neutral” level defined by collecting and averaging similar data from a large number of participants.
To study the differences between the head and gaze direction distributions in the vertical direction we analyzed the collected data of all participants viewing all the images, i.e., 186 viewings (31 participants each viewing 6 images). For each viewing separately we calculated the median values in the vertical direction both for gaze directions and for head directions, and then collected a histogram of their difference values. The results are shown in Figure 3, where positive values would mean the head point being above the gaze point. The median value of the difference was −6.2°, i.e., in most cases the head pointer stayed slightly lower than the gaze point, on average. In Figure 1 one can observe that, at least on the main horizontal area, the head points (blue) are slightly below the gaze points (red), following the general trend. The vertical difference values for one image (shown in Figure 1) have been collected into Figure 4.
FIGURE 3. A histogram of the differences between the median values in vertical direction for gaze directions and head directions, for all six images and all participants. The median difference value was −6.2°.
FIGURE 4. A histogram of the difference values between the centers of gaze and head orientations in vertical direction for one image. The displayed image was the same as in Figure 1. For the participant in Figure 1 the difference was −8.0°, while the median value for all participants was −6.7°.
We separately computed the median vertical difference values for all six images over all participants. The median values varied from −5.2° to −6.8° depending on the image; these values are rather tightly clustered around the median value of all the data shown in Figure 3.
For the data analysis we decided to compensate for the difference between the head and gaze direction centers in the vertical direction. We assume that there is a value for the difference that we can compute from extensive enough material and then use as compensation for aligning (vertically) the head orientation values and the gaze orientation values, avoiding that consistent bias. In the following studies we corrected the difference in our data using an image-specific median difference value collected over all participants. An alternative might be to use a person-dependent correction measured with a calibration-type procedure in the beginning, or a global value measured once over a large number of images and participants. For example, for the data in Figure 1 we slightly raised the head orientation data (by 6.7°, the median value for all participants for that image) to align it better with the gaze orientations.
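The correction itself is a simple vertical shift. The following sketch shows one way it could be computed and applied; it is our own illustration, and the pitch-array representation and function names are assumptions.

```python
import numpy as np

def vertical_bias(head_pitch, gaze_pitch):
    """Median vertical offset (degrees) between head and gaze directions for
    one viewing; a negative value means the head pointer sits below the gaze."""
    return np.median(head_pitch) - np.median(gaze_pitch)

def correct_head_pitch(head_pitch, per_viewing_biases):
    """Shift the head pitch by the image-specific median bias pooled over all
    participants (e.g., about -6.7 degrees for the image in Figure 1, so the
    head data is raised by 6.7 degrees)."""
    bias = np.median(per_viewing_biases)
    return np.asarray(head_pitch) - bias   # subtracting a negative bias raises the data
```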
5.3 Timing Difference Between Head and Gaze Data
Looking at the head and gaze data more closely, one notices that the gaze direction often changes first and the head direction follows a bit later (hinting towards hypothesis H1). This makes intuitive sense, as the gaze is easy to re-orient in different directions while head movements take more time.
Given the clear tendency of the gaze movement to happen slightly before the head movement, we used the average distance between the gaze direction and the (normalized as described in Section 5.2) head direction as a measure to optimize, trying out different delay values. The basic method of computing the average distance with small delays is shown in Eq. 1, where G is the sequence of gaze points, H is the sequence of head points, N is the number of samples in the whole sequence, and d ∈ [0, … , M] is the delay value counted in samples.
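Since Eq. 1 itself is not reproduced here, the sketch below shows one plausible reading of the computation: the head sequence is shifted by d samples so that each gaze sample is paired with the head sample d samples later, and the pairwise distances are averaged. The pairing direction, the N × 2 (yaw, pitch) array layout in degrees and the plain Euclidean norm as the distance measure are our assumptions.

```python
import numpy as np

SAMPLE_RATE_HZ = 75.0   # sampling rate from Section 4.2

def mean_distance_with_delay(gaze, head, d):
    """Average distance between gaze samples and head samples d samples later
    (the head lagging the gaze), in the spirit of Eq. 1."""
    g, h = gaze[: len(gaze) - d], head[d:]
    return float(np.mean(np.linalg.norm(g - h, axis=1)))

def best_delay(gaze, head, max_delay=40):
    """Grid search over candidate delays (40 samples is roughly 533 ms);
    returns the best delay in milliseconds and the distance at that delay."""
    dists = [mean_distance_with_delay(gaze, head, d) for d in range(max_delay + 1)]
    d_opt = int(np.argmin(dists))
    return 1000.0 * d_opt / SAMPLE_RATE_HZ, dists[d_opt]
```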
An example of the result values given by the distance computation for different delays is shown in Figure 5. The average distance first decreases as the delay grows, reaches a minimum and then starts increasing again as the delay continues to grow. The optimal delay is participant and image dependent, but in these experiments most of the measured values were between 100 and 300 ms. We collected all the measured optimal delays into Figure 6, with an overall median value of 211 ms. In Sitzmann et al. (2018) the average delay was measured to be 58 ms but, as noted in Section 2, there is large variability between the measures depending on the viewing task among many other factors. Based on these results we can claim that hypothesis H1 (the participants would turn their gaze before their head when looking around) is plausible.
FIGURE 5. The average distance between gaze point and head point using different delay values for one example data. The minimum value of the distance can be found to be around 280 ms delay in this specific case.
FIGURE 6. The histogram of the best delay values for each participant and all images. The median value, 211.0 ms, is marked by a red vertical line.
5.4 Gaze Point Estimation From Head Data
If there is no gaze tracker available in a head mounted display to provide the gazed point, for certain purposes it is possible to substitute the head direction for the gaze direction. We introduced a very simple model to get a slightly better estimate of the true gaze direction from the head direction than by just using the head direction as such. The model (see Eq. 2) was based on the observation that as the participant turned his/her head, the gaze direction usually moved farther away from the center than the head direction. That is, assuming that the very center of the viewing area is the center of our coordinate system, the absolute values of both the x and y coordinates grow when the head is turned, and the (absolute values of the) gaze coordinates are, on average, slightly larger than the head coordinates. The center of the viewing area can be defined for certain very common viewing situations in virtual reality, similar to the one used in our studies. For example, in many driving and control room simulations the user continually faces one fixed direction while being able to look around. The estimate of the gaze position EG = (xG, yG) is computed by applying a fixed multiplier m to both x and y coordinates of the head point H:
We selected a simple linear model for simplicity and do not assume that it would be the best choice for all viewing areas.
To measure the effect of the model parameter m on the gaze point estimation, we calculated the total distance between the collected gaze points and the estimated gaze points using an equation similar to Eq. 1, replacing the head orientation H by the gaze estimate EG, fixing the delay parameter d to one given value and varying the multiplier m:
where G is the sequence of gaze points, EG is the sequence of gaze estimates, N is the number of samples in the whole sequence, and d ∈ [0, … , M] is the delay value counted in samples. In the following calculations we used the median delay value d_o = 211.0 ms (as defined in the previous calculations, see Figure 6).
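A sketch of the multiplicative estimate and of the search for the multiplier m is given below. It is our illustration, not the study’s analysis code; the candidate grid, the 16-sample default delay (roughly 211 ms at 75 Hz) and all names are assumptions.

```python
import numpy as np

def estimate_gaze(head, m=1.175):
    """Simple multiplicative gaze estimate E_G = m * H (Eq. 2), with head
    coordinates given relative to the center of the viewing area. The default
    m is the overall median reported in Section 5.4."""
    return m * np.asarray(head)

def best_multiplier(gaze, head, d=16, candidates=np.arange(1.0, 1.5, 0.005)):
    """Grid search for the multiplier m at a fixed delay d (in samples),
    minimizing the average distance between gaze points and gaze estimates."""
    g, h = gaze[: len(gaze) - d], head[d:]
    dists = [np.mean(np.linalg.norm(g - m * h, axis=1)) for m in candidates]
    return float(candidates[int(np.argmin(dists))])
```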
One example of the distance value computation for different multiplier values m is shown in Figure 7. The optimal multiplier value m in this example is a little less than 1.1. We collected all the measured optimal multipliers into Figure 8 with the overall median multiplier value of 1.175.
FIGURE 7. The average distance between the measured gaze point and the estimated gaze point using the simple multiplicative model using different parameter values (see Eq. 2) for the same example data as in Figure 5. The average distance between gaze and head orientations calculated without any corrections (no delay correction, no position estimation) is shown as a green horizontal line for comparison.
FIGURE 8. The histogram of the best fitting multiplier values for each participant and all images. The median value 1.175 is marked by the red vertical line.
As seen in Figure 7, the average distance between the head points and gaze points is clearly shorter (around 2.2° shorter) using the optimal parameter values than the average distance computed without any corrections (marked with the green line). We can answer research question RQ2 (Is it possible to estimate the gaze orientation from the head orientation?): while the head orientation does not give the gaze orientation as such, we can improve the estimate using our models. In Figure 9 we show the histogram of the improvement values using the model for all images and all participants. As we used common parameter values based on an analysis of all the data (not tailored for each sample separately), there were cases where the average distance grew, sometimes quite a lot. However, in the vast majority of cases the average distance decreased, and the median improvement was 2.3°. Therefore, based on these results we claim that hypothesis H2 (the measurements of head movement can be used to improve the estimate of gaze direction) is plausible.
FIGURE 9. The histogram of the changes in the median distances between the gaze point and the estimated gaze point based on the simple multiplier model, for each participant and all images. The median value of the changes, −2.3°, is marked by the red vertical line. The median distance between the gaze point and head direction before the model use was 12.6°.
5.5 Measures to Separate Different Viewing Tasks
We also performed a data analysis to automatically separate the samples by task category (see Table 1). The method was to compute a measure from the gaze and/or head orientation data and to observe whether there were statistically significant differences between the measured values for the different categories. In the following we go through the results generated using two measures developed for this experiment.
5.5.1 Time Delay Parameter
To see if the question category would affect how the head follows the gaze, we used the time delay parameter introduced in Section 5.3. We computed the best fitting time delay parameter for each sample separately and divided the values into groups by task category. To find any statistically significant differences we ran a permutation test (as explained in Section 5.1) for each category pair. As there were six different category pairs, we used the Bonferroni-corrected p-value limit of 0.008. We found only one category pair (Look vs. Analyze, p = 0.003) for which the distributions were statistically significantly different; the delay value was longer for the Look category than for the Analyze category.
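The pairwise testing just described can be sketched as follows, reusing the permutation_test_median sketch from Section 5.1 above. The grouping of per-sample delay values by category is left to the caller; this is an illustration, not the study’s analysis code.

```python
from itertools import combinations

def pairwise_category_tests(samples_by_category, alpha=0.05, n_perm=10_000):
    """Run the permutation test for every pair of task categories and compare
    the p-values against a Bonferroni-corrected limit (with four categories
    there are six pairs, so 0.05 / 6 is approximately 0.008)."""
    pairs = list(combinations(sorted(samples_by_category), 2))
    limit = alpha / len(pairs)
    results = {}
    for a, b in pairs:
        # permutation_test_median is the sketch from Section 5.1
        p = permutation_test_median(samples_by_category[a], samples_by_category[b], n_perm)
        results[(a, b)] = (p, p < limit)
    return results
```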
5.5.2 Distance Multiplier Parameter
To see if the question category would affect the relative amounts of gaze and head movement to the sides, we tested the distance multiplier introduced in Section 5.4. We computed the best fitting multiplier for each sample separately and divided the values into categories. Running the permutation tests, we observed no statistically significant differences between the categories.
We can now answer research question RQ3 (How does the free viewing category differ in head and gaze orientation data from other categories?): apart from the single delay-parameter case above, we did not find clear differences in the data.
6 Discussion
The results of these experiments using head mounted displays and virtual reality material again demonstrate (see, e.g., Freedman, 2008) that there is an interesting interplay between the head and gaze directions. The presented visualizations (especially Figures 1, 2) and the developed models of head and gaze pointing show that there are structures in the data that can be utilized in making predictions of gaze use and in analyzing context effects.
The correction term for the head orientation vectors (Section 5.2) was based on collecting the difference values from all the available participants viewing one specific image. The correction term should preferably be determined for each participant separately, but in this study we lacked the data for that. In future studies we need to collect more data from each individual participant, viewing many more images.
The delay term (Section 5.3) was calculated using the whole sample material (40 s of viewing). It is to be expected that if we took shorter segments of the data at a time, the optimal delay values would differ depending on, for example, whether the viewer had been studying some object in detail or glancing around for a general view. In earlier studies the delay has varied and even the order of head and gaze movements has changed because of context effects (e.g., Zangemeister and Stark, 1982a; Zangemeister and Stark, 1982b). Hu et al. (2019) found that the delay might be different for vertical and horizontal movements, and computed from their data a 140 ms head movement delay for horizontal and a 70 ms delay for vertical directions. In future studies we need to study the delay in more detail, to see, for example, whether it could be used for context recognition.
Our model for calculating the gaze point estimate from the head point (Section 5.4) was extremely simple and did not consider, for example, the time development of the head point. Taking into account a short history of the head movements would be a natural way of extending the model. Hu et al. (2019) based their model on head velocities only; a combination of head orientation and head velocity would be a natural extension of the model. Also, the current model is the same in all image areas. It is possible that different areas might be better modeled with slightly different functional forms. For example, the extreme sides of the viewing area probably require somewhat different models than the very center. Nevertheless, we can answer research question RQ2 (Is it possible to estimate the gaze orientation from the head orientation?): we can improve the estimate of the gaze point using our models.
The measures that we developed for context recognition (Section 5.5) separated the question categories in only one case, which means that we answer research question RQ3 (How does the free viewing category differ in head and gaze orientation data from other categories?) largely by not finding differences. One reason for this could be that, as the data was collected over a rather long time, some participants may have spent only a small part of the time answering the question and then moved to a “just look at the image” mode similar to the free viewing category. The measures would then look similar between the categories.
The work has been exploratory in nature, and the reported results point towards promising directions for future studies.
7 Conclusion
In this paper we have demonstrated an interesting interplay between gaze movements and head movements. The head movements, on average, happen slightly later than the respective gaze movements; we measured a median delay of 211 ms for our samples. The gaze directions are, on average, slightly farther away from the center of an image (the “forward” direction) than the head directions, i.e., the head turns slightly less than the gaze. We measured an optimal multiplier value of 1.175 for our material, which means that the gaze is, on average, turned 17.5% farther from the center of the view than the head. Using measures calculated from the gaze and head data, we showed that there was a significant difference between some question categories.
For research question RQ1 (Section 4.3) our answer is described above. For research questions RQ2 and RQ3 we found that we can improve the estimate of gaze orientation using our models and head orientation data, and that the free viewing category does not differ from the other categories in our data. Both of our hypotheses (H1 and H2) are plausible.
Data Availability Statement
The raw data supporting the conclusion of this article will be made available by the authors, without undue reservation.
Author Contributions
Conceptualization, JK, OŠ, OK, TJ, MS, and RR. Methodology, JK, OK and TJ. Software, OK and TJ. Investigation, JK and OK. Resources, RR. Data curation, JK. Writing—original draft preparation, JK. Writing—review and editing, JK, OŠ, OK, TJ, MS, and RR. Visualization, JK. Supervision, RR. Project administration, RR. Funding acquisition, RR. All authors have read and agreed to the published version of the manuscript.
Funding
This work has received funding from the Academy of Finland, Augmented Eating Experiences project (Grant 316804).
Conflict of Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher’s Note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Acknowledgments
We would like to thank the members of TAUCHI research unit for their help in preparing the experiment and the paper.
References
Angelopoulos, A. N., Martel, J. N. P., Kohli, A. P., Conradt, J., and Wetzstein, G. (2021). Event-Based Near-Eye Gaze Tracking beyond 10,000 Hz. IEEE Trans. Vis. Comput. Graphics 27, 2577–2586. arXiv preprint arXiv:2004.03577. doi:10.1109/TVCG.2021.3067784
Bates, R., Donegan, M., Istance, H. O., Hansen, J. P., and Räihä, K.-J. (2007). Introducing Cogain: Communication by Gaze Interaction. Univ. Access Inf. Soc. 6, 159–166. doi:10.1007/s10209-007-0077-9
Biguer, B., Jeannerod, M., and Prablanc, C. (1982). The Coordination of Eye, Head, and Arm Movements during Reaching at a Single Visual Target. Exp. Brain Res. 46, 301–304. doi:10.1007/bf00237188
Biswas, P., Dutt, V., and Langdon, P. (2016). Comparing Ocular Parameters for Cognitive Load Measurement in Eye-Gaze-Controlled Interfaces for Automotive and Desktop Computing Environments. Int. J. Human-Computer Interaction 32, 23–38. doi:10.1080/10447318.2015.1084112
Brennan, S. E., Chen, X., Dickinson, C. A., Neider, M. B., and Zelinsky, G. J. (2008). Coordinating Cognition: The Costs and Benefits of Shared Gaze during Collaborative Search. Cognition 106, 1465–1477. doi:10.1016/j.cognition.2007.05.012
Bretzner, L., and Krantz, M. (2005). “Towards Low-Cost Systems for Measuring Visual Cues of Driver Fatigue and Inattention in Automotive Applications,” in IEEE International Conference on Vehicular Electronics and Safety, Shaanxi, China, October 2005, 161–164. doi:10.1109/ICVES.2005.1563634
Bulling, A., Ward, J. A., Gellersen, H., and Tröster, G. (2009). “Eye Movement Analysis for Activity Recognition,” in Proceedings of the 11th International Conference on Ubiquitous Computing, Orlando Florida USA, 30 September-3 October 2009 (New York, NY, USA: ACM), 41–50. UbiComp ’09. doi:10.1145/1620545.1620552
Coutrot, A., Hsiao, J. H., and Chan, A. B. (2018). Scanpath Modeling and Classification with Hidden Markov Models. Behav. Res. 50, 362–379. doi:10.3758/s13428-017-0876-8
David, E. J., Gutiérrez, J., Coutrot, A., Da Silva, M. P., and Callet, P. L. (2018). “A Dataset of Head and Eye Movements for 360° Videos,” in Proceedings of the 9th ACM Multimedia Systems Conference, Amsterdam Netherlands, June 2018 (New York, NY, USA: ACM), 432–437.
Doshi, A., and Trivedi, M. M. (2012). Head and Eye Gaze Dynamics during Visual Attention Shifts in Complex Environments. J. Vis. 12, 9. doi:10.1167/12.2.9
Dugard, P. (2014). Randomization Tests: a New Gold Standard? J. Contextual Behav. Sci. 3, 65–68. doi:10.1016/j.jcbs.2013.10.001
Falck-Ytter, T., Bölte, S., and Gredebäck, G. (2013). Eye Tracking in Early Autism Research. J. Neurodevelop Disord. 5, 28. doi:10.1186/1866-1955-5-28
Foulsham, T., Walker, E., and Kingstone, A. (2011). The where, what and when of Gaze Allocation in the Lab and the Natural Environment. Vis. Res. 51, 1920–1931. doi:10.1016/j.visres.2011.07.002
Freedman, E. G. (2008). Coordination of the Eyes and Head during Visual Orienting. Exp. Brain Res. 190, 369–387. doi:10.1007/s00221-008-1504-8
Gegenfurtner, A., Lehtinen, E., and Säljö, R. (2011). Expertise Differences in the Comprehension of Visualizations: A Meta-Analysis of Eye-Tracking Research in Professional Domains. Educ. Psychol. Rev. 23, 523–552. doi:10.1007/s10648-011-9174-7
Goossens, H. H. L. M., and Opstal, A. J. V. (1997). Human Eye-Head Coordination in Two Dimensions under Different Sensorimotor Conditions. Exp. Brain Res. 114, 542–560. doi:10.1007/pl00005663
Guitton, D. (1992). Control of Eye-Head Coordination during Orienting Gaze Shifts. Trends Neurosciences 15, 174–179. doi:10.1016/0166-2236(92)90169-9
Guitton, D., and Volle, M. (1987). Gaze Control in Humans: Eye-Head Coordination during Orienting Movements to Targets within and beyond the Oculomotor Range. J. Neurophysiol. 58, 427–459. doi:10.1152/jn.1987.58.3.427
Hanna, J. E., and Brennan, S. E. (2007). Speakers' Eye Gaze Disambiguates Referring Expressions Early during Face-To-Face Conversation. J. Mem. Lang. 57, 596–615. doi:10.1016/j.jml.2007.01.008
Hart, B. M., Vockeroth, J., Schumann, F., Bartl, K., Schneider, E., König, P., et al. (2009). Gaze Allocation in Natural Stimuli: Comparing Free Exploration to Head-Fixed Viewing Conditions. Vis. Cogn. 17, 1132–1158. doi:10.1080/13506280902812304
Hu, B., Johnson-Bey, I., Sharma, M., and Niebur, E. (2018). “Head Movements Are Correlated with Other Measures of Visual Attention at Smaller Spatial Scales,” in Proceedings of the 2018 52nd Annual Conference on Information Sciences and Systems (CISS), Princeton, NJ, USA, May 2018, 1–6. doi:10.1109/CISS.2018.8362264
Hu, B., Johnson-Bey, I., Sharma, M., and Niebur, E. (2017). “Head Movements during Visual Exploration of Natural Images in Virtual Reality,” in Proceedings of the 2017 51st Annual Conference on Information Sciences and Systems (CISS), Baltimore, MD, USA, March 2017, 1–6. doi:10.1109/ciss.2017.7926138
Hu, Z., Zhang, C., Li, S., Wang, G., and Manocha, D. (2019). Sgaze: A Data-Driven Eye-Head Coordination Model for Realtime Gaze Prediction. IEEE Trans. Vis. Comput. Graphics 25, 2002–2010. doi:10.1109/TVCG.2019.2899187
Ishimaru, S., Kunze, K., Kise, K., Weppner, J., Dengel, A., Lukowicz, P., et al. (2014). “In the Blink of an Eye: Combining Head Motion and Eye Blink Frequency for Activity Recognition with Google Glass,” in Proceedings of the 5th augmented human international conference, Kobe Japan, March, 2014 (New York, NY, United States: ACM), 15.
Jacob, R. J. K., and Karn, K. S. (2003). “Eye Tracking in Human-Computer Interaction and Usability Research,” in The Mind’s Eye. Editor O. Sacks (Amsterdam, Netherlands: Elsevier), 573–605. doi:10.1016/b978-044451020-4/50031-1
Järvenpää, T., Eskolin, P., and Salmimaa, M. (2017). “Vr Experience Player for Subjective Evaluations of Visual Vr Content,” in Proceedings of the IDW International Display Workshop, Sendai, Japan, December 2017, 1060–1063.
Just, M. A., and Carpenter, P. A. (1976). Eye Fixations and Cognitive Processes. Cogn. Psychol. 8, 441–480. doi:10.1016/0010-0285(76)90015-3
Kangas, J., Špakov, O., Isokoski, P., Akkil, D., Rantala, J., and Raisamo, R. (2016). “Feedback for Smooth Pursuit Gaze Tracking Based Control,” in Proceedings of the 7th Augmented Human International Conference 2016, Geneva Switzerland, February, 2016, 1–8. doi:10.1145/2875194.2875209
Kim, J., Jeong, Y., Stengel, M., Akşit, K., Albert, R., Boudaoud, B., et al. (2019). Foveated Ar: Dynamically-Foveated Augmented Reality Display. ACM Trans. Graphics (Tog) 38, 1–15. doi:10.1145/3306346.3322987
Koehler, K., Guo, F., Zhang, S., and Eckstein, M. P. (2014). What Do Saliency Models Predict? J. Vis. 14, 14. doi:10.1167/14.3.14
Kollenberg, T., Neumann, A., Schneider, D., Tews, T.-K., Hermann, T., Ritter, H., et al. (2010). “Visual Search in the (Un)real World: How Head-Mounted Displays Affect Eye Movements, Head Movements and Target Detection,” in Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications, Austin Texas, March, 2010 (New York, NY, USA: Association for Computing Machinery), 121–124. ETRA ’10. doi:10.1145/1743666.1743696
Konrad, R., Angelopoulos, A., and Wetzstein, G. (2020). Gaze-contingent Ocular Parallax Rendering for Virtual Reality. ACM Trans. Graphics (Tog) 39, 1–12. doi:10.1145/3361330
Koulieris, G. A., Akşit, K., Stengel, M., Mantiuk, R. K., Mania, K., and Richardt, C. (2019). “Near-eye Display and Tracking Technologies for Virtual and Augmented Reality,” in Computer Graphics Forum. Editors H. Hauser, and P. Alliez (Hoboken, New Jersey, United States: Wiley Online Library), 38, 493–519. doi:10.1111/cgf.13654
Kowler, E., Pizlo, Z., Zhu, G.-L., Erkelens, C. J., Steinman, R. M., and Collewijn, H. (1992). “Coordination of Head and Eyes during the Performance of Natural (And Unnatural) Visual Tasks,” in The Head-Neck Sensory Motor System. Editors A. Berthoz, W. Graf, and P. P. Vidal (Oxford, United Kingdom: Oxford University Press), 419–426. doi:10.1093/acprof:oso/9780195068207.003.0065
Kytö, M., Ens, B., Piumsomboon, T., Lee, G. A., and Billinghurst, M. (2018). “Pinpointing: Precise Head- and Eye-Based Target Selection for Augmented Reality,” in Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, Montreal, QC, Canada, April, 2018 (New York, NY, USA: Association for Computing Machinery), 1–14. CHI ’18. doi:10.1145/3173574.3173655
Land, M. F. (1992). Predictable Eye-Head Coordination during Driving. Nature 359, 318. doi:10.1038/359318a0
Law, B., Atkins, M. S., Kirkpatrick, A. E., and Lomax, A. J. (2004). “Eye Gaze Patterns Differentiate Novice and Experts in a Virtual Laparoscopic Surgery Training Environment,” in Proceedings of the 2004 Symposium on Eye Tracking Research & Applications, San Antonio Texas, March, 2004 (New York, NY, USA: ACM), 41–48. ETRA ’04. doi:10.1145/968363.968370
Li, R., Whitmire, E., Stengel, M., Boudaoud, B., Kautz, J., Luebke, D., et al. (2020). “Optical Gaze Tracking with Spatially-Sparse Single-Pixel Detectors,” in Proceedings of the 2020 IEEE International Symposium on Mixed and Augmented Reality (ISMAR) (IEEE), Porto de Galinhas, Brazil, November, 2020, 117–126. doi:10.1109/ismar50242.2020.00033
Liversedge, S. P., and Findlay, J. M. (2000). Saccadic Eye Movements and Cognition. Trends Cognitive Sciences 4, 6–14. doi:10.1016/s1364-6613(99)01418-7
Morimoto, C. H., and Mimica, M. R. (2005). Eye Gaze Tracking Techniques for Interactive Applications. Computer Vis. image understanding 98, 4–24. doi:10.1016/j.cviu.2004.07.010
Nichols, T. E., and Holmes, A. P. (2001). Nonparametric Permutation Tests for Functional Neuroimaging: a Primer with Examples. Hum. Brain Mapp. 15, 1–25. doi:10.1002/hbm.1058
Ohn-Bar, E., Martin, S., Tawari, A., and Trivedi, M. M. (2014). “Head, Eye, and Hand Patterns for Driver Activity Recognition,” in Proceedings of the Pattern Recognition (ICPR), 2014 22nd International Conference on (IEEE), Stockholm, Sweden, August 2014, 660–665. doi:10.1109/icpr.2014.124
Oommen, B. S., Smith, R. M., and Stahl, J. S. (2004). The Influence of Future Gaze Orientation upon Eye-Head Coupling during Saccades. Exp. Brain Res. 155, 9–18. doi:10.1007/s00221-003-1694-z
Padmanaban, N., Konrad, R., Stramer, T., Cooper, E. A., and Wetzstein, G. (2017). Optimizing Virtual Reality for All Users through Gaze-Contingent and Adaptive Focus Displays. Proc. Natl. Acad. Sci. 114, 2183–2188. doi:10.1073/pnas.1617251114
Pelz, J. B., and Canosa, R. (2001). Oculomotor Behavior and Perceptual Strategies in Complex Tasks. Vis. Res. 41, 3587–3596. doi:10.1016/S0042-6989(01)00245-0
Pelz, J., Hayhoe, M., and Loeber, R. (2001). The Coordination of Eye, Head, and Hand Movements in a Natural Task. Exp. Brain Res. 139, 266–277. doi:10.1007/s002210100745
Pfeil, K., Taranta, E. M., Kulshreshth, A., Wisniewski, P., and LaViola, J. J. (2018). “A Comparison of Eye-Head Coordination between Virtual and Physical Realities,” in Proceedings of the 15th ACM Symposium on Applied Perception, Vancouver British Columbia Canada, August, 2018 (New York, NY, USA: Association for Computing Machinery). SAP ’18. doi:10.1145/3225153.3225157
Poole, A., and Ball, L. J. (2006). “Eye Tracking in Hci and Usability Research,” in Encyclopedia of Human Computer Interaction. Editor C. Ghaoui (Hershey, Pennsylvania: IGI Global), 211–219. doi:10.4018/978-1-59140-562-7.ch034
Rai, Y., Gutiérrez, J., and Le Callet, P. (2017). “A Dataset of Head and Eye Movements for 360 Degree Images,” in Proceedings of the 8th ACM on Multimedia Systems Conference, Taipei Taiwan, June, 2017 (New York, NY, USA: ACM), 205–210. doi:10.1145/3083187.3083218
Rantala, J., Majaranta, P., Kangas, J., Isokoski, P., Akkil, D., Špakov, O., et al. (2020). Gaze Interaction with Vibrotactile Feedback: Review and Design Guidelines. Human-Computer Interaction 35, 1–39. doi:10.1080/07370024.2017.1306444
Rayner, K. (2009). Eye Movements and Attention in reading, Scene Perception, and Visual Search. Q. J. Exp. Psychol. 62, 1457–1506. doi:10.1080/17470210902816461
Ron, S., and Berthoz, A. (1991). Eye and Head Coupled and Dissociated Movements during Orientation to a Double Step Visual Target Displacement. Exp. Brain Res. 85, 196–207. doi:10.1007/BF00230001
Sidenmark, L., and Gellersen, H. (2019b). “Eye&head: Synergetic Eye and Head Movement for Gaze Pointing and Selection,” in Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology, New Orleans LA USA, October, 2019 (New York, NY, USA: Association for Computing Machinery), 1161–1174. UIST ’19. doi:10.1145/3332165.3347921
Sidenmark, L., and Gellersen, H. (2019a). Eye, Head and Torso Coordination during Gaze Shifts in Virtual Reality. ACM Trans. Comput.-Hum. Interact. 27, 1–40. doi:10.1145/3361218
Sitzmann, V., Serrano, A., Pavel, A., Agrawala, M., Gutierrez, D., Masia, B., et al. (2018). Saliency in Vr: How Do People Explore Virtual Environments? IEEE Trans. visualization Comput. graphics 24, 1633–1642. doi:10.1109/tvcg.2018.2793599
Torralba, A., Oliva, A., Castelhano, M. S., and Henderson, J. M. (2006). Contextual Guidance of Eye Movements and Attention in Real-World Scenes: the Role of Global Features in Object Search. Psychol. Rev. 113, 766. doi:10.1037/0033-295x.113.4.766
Wedel, M., and Pieters, R. (2008). “A Review of Eye-Tracking Research in Marketing,” in Review of Marketing Research. Editor N. K. Malhotra (Bingley, United Kingdom: Emerald Group Publishing Limited), 123–147. doi:10.1108/s1548-6435(2008)0000004009
Wilson, M. R., Vine, S. J., Bright, E., Masters, R. S., Defriend, D., and McGrath, J. S. (2011). Gaze Training Enhances Laparoscopic Technical Skill Acquisition and Multi-Tasking Performance: a Randomized, Controlled Study. Surg. Endosc. 25, 3731–3739. doi:10.1007/s00464-011-1802-2
Zangemeister, W. H., and Stark, L. (1982b). Gaze Latency: Variable Interactions of Head and Eye Latency. Exp. Neurol. 75, 389–406. doi:10.1016/0014-4886(82)90169-8
Keywords: virtual reality, head pointing, gaze orientation, head orientation, image viewing in VR
Citation: Kangas J, Špakov O, Raisamo R, Koskinen O, Järvenpää T and Salmimaa M (2022) Head and Gaze Orientation in Hemispheric Image Viewing. Front. Virtual Real. 3:822189. doi: 10.3389/frvir.2022.822189
Received: 25 November 2021; Accepted: 03 February 2022;
Published: 24 February 2022.
Edited by:
Vsevolod Peysakhovich, Institut Supérieur de l’Aéronautique et de l’Espace (ISAE-SUPAERO), France
Reviewed by:
Mark Billinghurst, University of South Australia, Australia
Kaan Akşit, University College London, United Kingdom
Copyright © 2022 Kangas, Špakov, Raisamo, Koskinen, Järvenpää and Salmimaa. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Jari Kangas, jari.a.kangas@tuni.fi