Understanding Immersivity: Image Generation and Transformation Processes in 3D Immersive Environments

Kozhevnikov, Maria; Dhond, Rupali P.

doi:10.3389/fpsyg.2012.00284

ORIGINAL RESEARCH article

Front. Psychol., 10 August 2012

Sec. Perception Science

Volume 3 - 2012 | https://doi.org/10.3389/fpsyg.2012.00284

This article is part of the Research Topic Mental Imagery

Maria Kozhevnikov¹,²* and Rupali P. Dhond²

¹ Psychology Department, National University of Singapore, Singapore
² Department of Radiology, Martinos Center for Biomedical Imaging, Harvard Medical School, Charlestown, MA, USA

Most research on three-dimensional (3D) visual-spatial processing has been conducted using traditional non-immersive 2D displays. Here we investigated how individuals generate and transform mental images within 3D immersive (3DI) virtual environments, in which the viewers perceive themselves as being surrounded by a 3D world. In Experiment 1, we compared participants’ performance on the Shepard and Metzler (1971) mental rotation (MR) task across the following three types of visual presentation environments: traditional 2D non-immersive (2DNI), 3D non-immersive (3DNI – anaglyphic glasses), and 3DI (head mounted display with position and head orientation tracking). In Experiment 2, we examined how the use of different backgrounds affected MR processes within the 3DI environment. In Experiment 3, we compared electroencephalogram data recorded while participants were mentally rotating visual-spatial images presented in 3DI vs. 2DNI environments. Overall, the findings of the three experiments suggest that visual-spatial processing is different in immersive and non-immersive environments, and that immersive environments may require different image encoding and transformation strategies than the two non-immersive environments. Specifically, in a non-immersive environment, participants may utilize a scene-based frame of reference and allocentric encoding, whereas immersive environments may encourage the use of a viewer-centered frame of reference and egocentric encoding. These findings also suggest that MR performed in laboratory conditions using a traditional 2D computer screen may not reflect spatial processing as it would occur in the real world.

Introduction

Our ability to generate and transform three-dimensional (3D) visual-spatial images is important not only for our every-day activities (locomotion, navigation) but also for a variety of professional activities, such as architecture, air traffic control, and telerobotics. Difficulties of studying visual-spatial cognition within real world environments, where controlling the experimental stimuli and recording participants’ behavior is often impossible, have led researchers to increasingly employ 3D immersive (3DI) virtual environments (Chance et al., 1998; Klatzky et al., 1998; Loomis et al., 1999; Tarr and Warren, 2002; Macuga et al., 2007; Kozhevnikov and Garcia, 2011). Specifically, 3DI technology allows one to create a complex immersive environment of high ecological validity, in which participants are presented with and manipulate a variety of 3D stimuli under controlled conditions.

An immersive virtual environment involves computer simulation of a 3D space and human-computer interaction within that space (Cockayne and Darken, 2004). Two major characteristics distinguish 3DI environments from 2D non-immersive (2DNI) and 3D non-immersive (3DNI) environments. First, 3DI involves egocentric navigation (the user is surrounded by the environment) rather than exocentric navigation, where the user is outside the environment, looking in. Second, unlike non-immersive environments where a scene is fixed on a 2D computer screen, 3DI involves image updating achieved by position and head orientation tracking. Although little is known about cognitive processes and neural dynamics underlying image encoding and transformation in 3DI environments, researchers have speculated that immersivity would differentially affect selection of a spatial frame of reference (i.e., spatial coordinate system) during object encoding processes (Kozhevnikov and Garcia, 2011).

Two different spatial frames of reference, environmental and viewer-centered, can be used for encoding and transforming visual-spatial images. An environmental frame may involve the “permanent environment” which is bound by standard orthogonal planes, i.e., the floor, walls, ceiling, and perceived direction of gravity or the local “scene-based” spatial environment where the target object’s components are encoded allocentrically in relation to another object, i.e., table-top, blackboard, computer screen, etc. In contrast to environmental frames of reference, the viewer-centered frame is egocentric, that is, it defines object configurations and orientations relative to the viewer’s gaze and it includes an embedded retinal coordinate system. In the case of imagined spatial transformations such as mental rotation (MR), the prevailing hypothesis is that individuals rely more upon an environmental, scene-based, rather than a viewer-centered frame of reference (Corballis et al., 1976, 1978; Rock, 1986; Hinton and Parsons, 1988; Palmer, 1989; Pani and Dupree, 1994). For example, Corballis et al. tested normal-mirror discriminations of rotated alphanumeric characters when participants’ heads or bodies were either aligned with the gravitational vertical or misaligned by up to 60°. The results showed that the participants made their judgments by rotating the characters to the gravitational vertical (Y axis) rather than using a viewer-centered (head-centered or retina-centered) reference frame. Furthermore, Hinton and Parsons (1988) reported that while mentally rotating two shapes positioned on a table into congruence, participants often rotated one shape until it had the same relationship to the table-top (and room) as the other shape (thus achieving scene-based alignment), even though this produced quite different retinal images. 
Thus, it appears that the orientation of the viewer is defined relative to the scene, rather than the orientation of the scene being defined relative to the viewer. This lends support to theories suggesting that the representation of spatial relationships is established primarily in terms of scene-based reference systems.

Additional evidence for the primacy of scene-based reference frames comes from experiments (e.g., Parsons, 1987, 1995) comparing the speed of MR of classical Shepard and Metzler’s (1971) 3D forms around different axes (see Figure 1A). MR around different axes places different demands on the transformation processes, and results in different brain activity (Gauthier et al., 2002). Rotation in the picture plane preserves the visibility of all the features of a shape, but perturbs the top-bottom relations between features. Rotation in depth around the vertical axis alters side-to-side relationships between features and the visibility of features, some coming into view and others becoming occluded. Rotation in depth around a horizontal axis is the most demanding rotation; it alters top-bottom relations between features and feature visibility. Interestingly, it has been consistently found that participants mentally rotate shapes in the depth plane just as fast as or even faster than in the picture plane (Shepard and Metzler, 1971; Parsons, 1987, 1995). If participants were in fact rotating viewer-centered 2D retina-based visual representations, the depth rotation would take longer than rotation in the picture plane, since rotation in depth would require additional foreshortening and hidden line removal operations that are not required during picture plane rotation.

Figure 1. Example of the MR test trial stimulus with (A) showing the subject’s view of the stimulus and (B) showing the three principal axes of rotation, X, Y, and Z, used in the current study. Note that this coordinate frame differs from the one normally used in computer graphics in which the positive Z-direction is perpendicular to the plane of the display, pointing toward the viewer.
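The axis convention in Figure 1B can be made concrete with a minimal sketch. It assumes a right-handed frame with X along the line of sight (depth), Y vertical, and Z horizontal; the handedness, the function name `rotate`, and the example vertex are illustrative assumptions, not taken from the study. The sketch shows that a rotation about X (picture plane) leaves the depth coordinate of a point unchanged, whereas rotations about Y or Z change it.

```python
import math

def rotate(point, axis, degrees):
    """Rotate a 3D point (x, y, z) about one principal axis.

    Assumed convention (after Figure 1B): X = line of sight (depth),
    Y = vertical, Z = horizontal. Handedness is an assumption.
    """
    x, y, z = point
    a = math.radians(degrees)
    c, s = math.cos(a), math.sin(a)
    if axis == "X":  # picture-plane rotation: depth (x) is untouched
        return (x, c * y - s * z, s * y + c * z)
    if axis == "Y":  # rotation in depth about the vertical axis
        return (c * x + s * z, y, -s * x + c * z)
    if axis == "Z":  # rotation in depth about the horizontal axis
        return (c * x - s * y, s * x + c * y, z)
    raise ValueError(f"unknown axis: {axis}")

# Hypothetical vertex of a shape: depth 0, height 1, lateral offset 2.
vertex = (0.0, 1.0, 2.0)
for axis in ("X", "Y", "Z"):
    depth_after = rotate(vertex, axis, 90)[0]
    print(axis, round(depth_after, 6))
```

Only the depth rotations move the vertex toward or away from the viewer, which is why a retina-based representation would need extra foreshortening and occlusion handling for them.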

Shepard and Metzler (1971) were the first to interpret similar slopes for rotation in depth and in the picture plane to indicate that latency was a function of the angle of rotation in three dimensions, not two, as in a retinal projection (for additional discussion see Pinker, 1988). In order to investigate this further, Parsons (1987) conducted an extensive experimental study examining the rates of imagined rotation not only around three principal axes of the observer’s reference frame, but also around diagonal axes lying within one of the principal planes (frontal, midsagittal, or horizontal) and around “skew” axes not lying in any of the principal planes. The findings indicated that rotations around different axes, including rotation in depth around a horizontal axis perpendicular to the line of sight (Z axis, see Figure 1B), were as fast as or even faster than rotations in the picture plane (rotations around the axis defined by the line of sight, X-axis in this study). Parsons concluded that this equal ease of rotating images around different axes supports scene-based encoding, during which the observers rely largely on representations containing more “structural” information (e.g., information about spatial relations among the elements of the object and their orientations with respect to the scene in which the objects lie) rather than on retina-based 2D representations of visual-spatial images.

One limitation of previous studies on MR is that they have been conducted using traditional non-immersive environments, where the stimuli were presented on a 2D computer screen or another flat surface (e.g., a table-top), which defines a fixed local frame of reference. This limited and fixed field of view (FOV) may encourage the use of a more structural scene-based encoding, during which the parts of the 3D image are encoded in relation to the sides of the computer screen or another salient object in the environment. However, because 3DI environments enclose an individual within the scene and allow images to be updated with respect to the observer’s head orientation, egocentric, viewer-centered encoding may predominate.

The primary goal of the current research was to examine how individuals process visual-spatial information (specifically encode and rotate 3D images) and what spatial frames of reference they rely upon in 3DI virtual environments vs. conventional non-immersive displays. In our first experiment, in order to control the effect of “three-dimensionality” vs. “immersivity,” we compared participants’ performance on the Shepard and Metzler (1971) MR task across the following three types of environments: traditional 2DNI, 3DNI (anaglyphic glasses), and 3DI [head mounted display (HMD) with position and head orientation tracking]. In the second experiment, we compared how participants encode and transform visual-spatial images in 3DI environments with different backgrounds, where shapes were embedded in a realistic scene vs. in a rectangular frame. Furthermore, if the neurocognitive correlates of visual-spatial imagery are affected by immersivity of the visual presentation environment, this should be evidenced in the underlying temporal dynamics and/or spatial distribution of the electroencephalogram (EEG) response. Thus, in the third experiment, EEG was recorded while participants performed the MR task in 3DI and 2DNI environments.

Experiment 1

Materials and Methods

Participants

Fourteen volunteers (eight males and eight females, average age = 21.5) participated in the study for monetary compensation. The study was approved by George Mason University (Fairfax, VA, USA) as well as by The Partners Human Research Committee (PHRC, MA, USA) and informed consent was obtained from all subjects. Participants were asked about their ability to perceive stereoscopic images prior to the start of the experiment, and only those who did not report difficulty with stereopsis were included.

Materials and design

Each participant completed the MR task – a computerized adaptation of Shepard and Metzler’s (1971) task – in three different viewing environments: 3DI, 3DNI, and 2DNI. For each trial, participants viewed two spatial figures, one of which was rotated relative to the position of the other (Figure 1A). Participants were to imagine rotating one figure to determine whether or not it matched the other figure and to indicate whether they thought the figures were the same or different by pressing a left (same) or right (different) button on a remote control device. Participants were asked to respond as quickly and as accurately as possible. Twelve rotation angles were used: 20, 30, 40, 60, 80, 90, 100, 120, 140, 150, 160, and 180°. The figures were rotated around three spatial axes: line of sight (X), vertical (Y), and horizontal (Z) corresponding to rotations parallel with the frontal (YZ), horizontal (XZ), and midsagittal (XY) anatomical planes, respectively (Figure 1B). The test included: 12 trial groups for the 12 rotation angles, 3 trial pairs for the 3 axes, and each pair had 1 trial with matching figures and 1 trial with different figures; thus, there were 72 (12 × 3 × 2) trials in total.
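The factorial structure of the trial list described above can be sketched as follows. The variable names and the dictionary layout are illustrative assumptions; only the angles, axes, and the same/different factor come from the text, and presentation order is not modeled.

```python
from itertools import product

# Factors taken from the task description.
ANGLES_DEG = [20, 30, 40, 60, 80, 90, 100, 120, 140, 150, 160, 180]
AXES = ["X", "Y", "Z"]   # line of sight, vertical, horizontal
SAME = [True, False]     # matching vs. different figure pairs

# One trial per combination of angle, axis, and same/different status.
trials = [
    {"angle": angle, "axis": axis, "same": same}
    for angle, axis, same in product(ANGLES_DEG, AXES, SAME)
]

print(len(trials))  # 12 angles x 3 axes x 2 pair types
```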

In the 3DI virtual environment, the shapes were presented to the participant through an nVisor SX60 (by Nvis, Inc.) HMD (Figure 2A). The HMD has a 44° horizontal by 34° vertical FOV with a display resolution of 1280 × 1024 and under 15% geometric distortion. During the experiment, participants sat on a chair in the center of the room, wearing the HMD to view “virtual” Shepard and Metzler images in front of them. Sensors on the HMD enabled real-time simulation in which any movement of the subject’s head immediately caused a corresponding change to the image rendered in the HMD. The participant’s head position was tracked by four cameras, one in each corner of the experimental room, sensitive to an infrared light mounted on the top of the HMD. The rotation of the user’s head was captured by a digital compass mounted on the back of the HMD.

Figure 2. Three different viewing environments (A) 3DI, which includes HMD with position tracking, (B) 3DNI with anaglyphic glasses to present a stereo picture of three-dimensional spatial forms, and (C) 2D monocular viewing environment.

In the 3DNI environment, the shapes were presented to the participant on a computer screen. Stereoscopic depth was provided by means of anaglyphic glasses (Figure 2B). In the 2DNI environment, the shapes were presented on a standard computer screen (Figure 2C).

The retinal image size of the stimuli was kept constant across all the environments (computed as ratio of image size over the participant’s distance to the screen). The Vizard Virtual Reality Toolkit v. 3.0 (WorldViz, 2007) was used to create the scenes and to record the dependent variables (latency and accuracy).
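As a rough illustration of how retinal image size can be held constant across displays, the sketch below scales the stimulus so that the size-to-distance ratio stays fixed, and checks the resulting visual angle. The function names and the numeric values are hypothetical; the study's actual stimulus sizes and viewing distances are not given here.

```python
import math

def visual_angle_deg(size, distance):
    """Visual angle (degrees) subtended by an object of a given size
    viewed frontally at a given distance (same length units)."""
    return math.degrees(2 * math.atan(size / (2 * distance)))

def matched_size(reference_size, reference_distance, new_distance):
    """Stimulus size at new_distance that keeps the size/distance
    ratio (and hence the visual angle) equal to the reference."""
    return reference_size * new_distance / reference_distance

# Hypothetical values: a 0.20 m stimulus viewed from 0.60 m on a monitor,
# and a virtual stimulus placed 1.20 m away in the HMD.
monitor_angle = visual_angle_deg(0.20, 0.60)
hmd_size = matched_size(0.20, 0.60, 1.20)
hmd_angle = visual_angle_deg(hmd_size, 1.20)
print(round(hmd_size, 3), round(monitor_angle, 2), round(hmd_angle, 2))
```

Because the visual angle depends only on the size-to-distance ratio, matching that ratio is sufficient to equate retinal size across environments.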

Before beginning the MR trials, participants listened to verbal instructions while viewing example trials in each environment. Eight practice trials were given to ensure that participants understood the instructions and were using an MR strategy (as opposed to a verbal or analytical strategy). If a response to a practice trial was incorrect, the participants were asked to explain how they solved the task in order to confirm the use of a rotation strategy. In 3DI, to familiarize the participants with immersive virtual reality, there was also an exploratory phase prior to the practice trials in which the participants were given general instructions about virtual reality and the use of the remote control device (about 7–10 min). During the practice and test phases, the participants remained seated in the chair, but were allowed to move and rotate their head to view 3D Shepard and Metzler shapes. The participants were also given similar time to familiarize themselves with the shapes in the 3DNI and 2DNI environments, and were also allowed to move and rotate their head to view Shepard and Metzler shapes.

Results

Descriptive statistics for performance in the three environments are given in Table 1. Outlier response times (RTs), defined as RTs more than 2.5 SD from a participant’s mean, were deleted (a total of 2.59% of all trials). All simple main effects were examined using the Bonferroni correction procedure. Two participants who performed below chance level were excluded, so the final analysis was performed on 12 participants.
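The outlier criterion can be sketched as a per-participant trimming step. This is a minimal illustration, assuming the sample standard deviation and a single pooled mean per participant; whether the original analysis computed the criterion per condition or used a different SD estimator is not stated. The RT values are made up for the example.

```python
from statistics import mean, stdev

def trim_outliers(rts, criterion=2.5):
    """Drop RTs lying more than `criterion` SDs from this
    participant's own mean (sample SD is an assumption)."""
    m, sd = mean(rts), stdev(rts)
    return [rt for rt in rts if abs(rt - m) <= criterion * sd]

# Hypothetical RTs (seconds) for one participant, with one slow trial.
participant_rts = [1.8, 2.1, 2.0, 1.9, 2.2, 2.0, 2.1, 1.9, 2.0, 2.1, 1.9, 9.0]
kept = trim_outliers(participant_rts)
print(len(participant_rts) - len(kept), "trial(s) removed")
```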

Table 1. Descriptive statistics for three versions of the MR test in 2DNI, 3DNI, and 3DI.

Response accuracy (proportion correct) and RT for correct responses were assessed as a function of the rotation axis (X, Y, and Z) and environment (3DI, 3DNI, and 2DNI). Data were analyzed using a 3 (axis) × 3 (environment) repeated measures ANOVA with a General Linear Model (GLM). For accuracy, the effect of environment was marginally significant [F(2,22) = 2.9, p = 0.08] and, as pairwise comparisons showed, accuracy in the 3DNI and 3DI environments was slightly lower than in 2DNI (p = 0.08). There was a significant main effect of axis [F(2,22) = 19.83, p < 0.001], with Y axis rotations more accurate than X and Z axis rotations (ps < 0.01). The interaction was not significant (F < 1). Overall, the accuracy level was relatively high for all the environments and all axes, with the proportion correct ranging from 0.84 to 0.97. Given the high rate of accuracy, indicating that ceiling performance was reached for some rotations, we focused our remaining analyses on the RTs.
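The degrees of freedom reported for these analyses, F(2,22) for main effects and F(4,44) for the interaction, follow directly from a fully within-subject 3 × 3 design with 12 participants. The sketch below reproduces that arithmetic; it assumes uncorrected degrees of freedom (no sphericity correction), and the function name is illustrative.

```python
def rm_anova_df(levels_a, levels_b, n_subjects):
    """Effect and error degrees of freedom for a two-way fully
    repeated measures ANOVA, without sphericity correction."""
    df_a = levels_a - 1
    df_b = levels_b - 1
    df_ab = df_a * df_b
    subj = n_subjects - 1  # each error term is effect df x (n - 1)
    return {
        "A": (df_a, df_a * subj),
        "B": (df_b, df_b * subj),
        "AxB": (df_ab, df_ab * subj),
    }

# 3 axes x 3 environments, 12 participants in the final sample.
dfs = rm_anova_df(3, 3, 12)
print(dfs)
```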

With respect to RT, there was a significant effect of axis [F(2,22) = 15.40, p < 0.001], with Y axis rotations being the fastest (ps < 0.05), see Figure 3. There was no significant effect of environment (F < 1); however, there was a significant interaction between axis and test environment [F(4,44) = 6.45, p < 0.001]. Analysis of simple main effects revealed that RT for rotation around the Y axis was significantly faster than around either X (all ps < 0.05) or Z (all ps < 0.05) for the 3DNI and 2DNI environments, consistent with previous studies (Shepard and Metzler, 1971; Parsons, 1987, 1995). However, rotations around the X and Z axes were similar (p = 0.98 and 0.79 for 2DNI and 3DNI, respectively). Interestingly, a different pattern emerged for MR in the 3DI environment. In 3DI, rotation around Z took significantly longer than rotation around X (p = 0.001) or Y (p = 0.01), while rotations around X and Y were similar (p = 0.97).

Figure 3. Response time as a function of axis of rotation and viewing environment (2DNI, 3DNI, and 3DI). Error bars represent standard errors of the mean.

Thus, our central finding is that in 3DI, the RT of rotation differed between X and Z axes (Z was slower) and that rotation around the Y axis was faster than Z but not faster than X rotations. In contrast, RT patterns for 2DNI and 3DNI environments were similar to those found in previous MR studies (i.e., Y rotations are faster than X and Z and X and Z are similar).

Rate of rotation as a function of axis and environment. RT as a function of rotation angle (i.e., orientation difference between two Shepard and Metzler shapes) around the X, Y, and Z axes for the 3DI, 3DNI, and 2DNI environments is shown in Figure 4. The range of rotation angles was from 20 to 160°; the 180° angle was omitted because participants reported that, for this particular angle, they did not rotate the shapes mentally but only scanned the two images for mirror-reversed symmetry.

Figure 4. Response time as a function of angle and axis of rotation in (A) 2DNI, (B) 3DNI, and (C) 3DI environments.

The slopes of the best-fit linear RT-Rotation Angle functions for each axis and each environment (representing rates of rotation around different axes in different environments) were computed and are presented in Table 2.

Table 2. Mean regression slopes of RT-Rotation angle function (s/°).
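The rate-of-rotation measure is the slope of an ordinary least-squares line of RT on rotation angle, in seconds per degree. A minimal sketch of that computation is given below; the RT values are synthetic and chosen only to illustrate the calculation, and the function name is an assumption.

```python
def ols_slope(x, y):
    """Ordinary least-squares slope and intercept of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Rotation angles analyzed in the study (180 degrees omitted).
angles = [20, 30, 40, 60, 80, 90, 100, 120, 140, 150, 160]
# Synthetic RTs (s): 1 s baseline plus 0.02 s per degree of rotation.
rts = [1.0 + 0.02 * a for a in angles]

slope, intercept = ols_slope(angles, rts)
print(round(slope, 4), round(intercept, 4))  # slope in s/deg, intercept in s
```

A larger slope means more time per degree, that is, a slower rate of mental rotation about that axis.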

A repeated measures ANOVA of the slopes of the best-fit linear regression equations of RT on Rotation Angle showed a significant effect of axis on the slope [F(2,22) = 51.34, p < 0.001] and a significant interaction between environment and axis of rotation [F(4,44) = 3.38, p < 0.05], while the effect of environment was not significant. For 3DI, the rate of rotation around Z was more than 1.5 times slower than around X (p < 0.05). In both 3DNI and 2DNI, the rates of rotation around X and Z did not differ. Across the environments, the rate of rotation around X was significantly faster for 3DI than for 2DNI (p < 0.05), whereas the rate of rotation around Z was significantly slower for 3DI than for either 3DNI or 2DNI (ps < 0.05). There were no significant differences in the rate of rotation around the Y axis across environments, and rotation around Y appears to be among the fastest rotations. This is consistent with the findings of previous investigators (Rock and Leaman, 1963; Attneave and Olson, 1967; Parsons, 1987; Corballis, 1988) who argued that rotation around Y, a “gravitational vertical” axis, is the most common of all rotations in our ecology, so that the fast rate of rotation around it may result from our extraexperimental familiarity.

In summary, the results of Experiment 1 show that the rate of MR about the horizontal axis (Z axis) in 3DI (and only 3DI) was significantly slower than the rate of rotation about the line of sight (X-axis). This finding suggests that in the 3DI environment the participants were encoding and rotating 2D retina-based visual representations in relation to a viewer-centered frame of reference, since only then would depth rotation take longer than rotation in the picture plane, due to the involvement of additional foreshortening and hidden line removal transformations. In contrast, in 2DNI and 3DNI environments, the rates of MR around the X and Z axes were not different, consistent with previous findings for MR using 2D traditional computer displays (Shepard and Metzler, 1971; Parsons, 1987). Thus, in non-immersive environments, participants seem to generate visual representations containing more allocentric information, such as information about spatial relations among the elements of the object and their orientations with respect to the scene (i.e., the computer screen) in which the object is presented. The fact that there was equivalent performance in 2DNI and 3DNI environments suggests that depth information per se, which is provided in a 3DNI environment, is insufficient to encourage the use of a viewer-centered frame of reference.

Experiment 2

One possible limitation of Experiment 1 is that, in the 3DI environment, the Shepard and Metzler shapes were presented to participants on a non-realistic “empty” background lacking any points of reference (such as ceilings, walls, other objects), which would usually be present in a real scene. Thus, viewer-centered encoding observed in 3DI could be due not to the immersivity of the environment, but rather to the lack of any other objects – except the observers themselves – in relation to which Shepard and Metzler shapes could have been encoded. In Experiment 2, the participants in a 3DI condition were presented with Shepard and Metzler forms embedded in a realistic scene (city). In addition, we added a second condition in which the participants viewed Shepard and Metzler forms embedded in a virtual rectangular-shaped frame within the 3DI environment. This was done to examine whether a fixed frame around objects in a 3DI environment induces scene-based encoding similar to that induced by a computer screen in the real world.