Toward immersive communications in 6G

Shen, Xuemin (Sherman); Gao, Jie; Li, Mushu; Zhou, Conghao; Hu, Shisheng; He, Mingcheng; Zhuang, Weihua

doi:10.3389/fcomp.2022.1068478

REVIEW article

Front. Comput. Sci., 11 January 2023

Sec. Networks and Communications

Volume 4 - 2022 | https://doi.org/10.3389/fcomp.2022.1068478

This article is part of the Research Topic Horizons in Computer Science 2022

Xuemin (Sherman) Shen¹

Jie Gao²

Mushu Li³*

Conghao Zhou¹

Shisheng Hu¹

Mingcheng He¹

Weihua Zhuang¹

¹Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada
²School of Information Technology, Carleton University, Ottawa, ON, Canada
³Department of Electrical, Computer, and Biomedical Engineering, Toronto Metropolitan University, Toronto, ON, Canada

The sixth generation (6G) networks are expected to enable immersive communications and bridge the physical and the virtual worlds. Integrating extended reality, holography, and haptics, immersive communications will revolutionize how people work, entertain, and communicate by enabling lifelike interactions. However, the unprecedented demand for data transmission rate and the stringent requirements on latency and reliability create challenges for 6G networks to support immersive communications. In this survey article, we present the prospect of immersive communications and investigate emerging solutions to the corresponding challenges for 6G. First, we introduce use cases of immersive communications, in the fields of entertainment, education, and healthcare. Second, we present the concepts of immersive communications, including extended reality, haptic communication, and holographic communication, their basic implementation procedures, and their requirements on networks in terms of transmission rate, latency, and reliability. Third, we summarize the potential solutions to addressing the challenges from the aspects of communication, computing, and networking. Finally, we discuss future research directions and conclude this study.

1. Introduction

Ever since its birth, communication technology has been a symbol of the modernization of human society, and the evolution of communication technology has accompanied the advance of civilization. The commercialization of electrical telegraph and telephone during the second industrial revolution boosted globalization by facilitating finance and trade overseas (Wenzlhuemer, 2013). The debut of vehicle-mounted mobile radio systems (“car phones”) and the analog first generation (1G) mobile telecommunication systems from the 1950s to 1980s enabled voice calls on the go (del Peral-Rosado et al., 2018). The second generation (2G) mobile communication systems, which introduced roaming and preliminary data services in the form of text messages, emerged amidst and as a part of the third industrial revolution (i.e., the digital revolution) (Billström et al., 2006). Then, the next two decades witnessed the proliferation of mobile Internet and mobile multimedia services brought by the third and fourth generation (3G and 4G) mobile communication technology, which revolutionized how people communicate and changed the world. Nowadays, the fifth generation (5G) mobile communication systems are reshaping industries by facilitating the fourth industrial revolution (i.e., Industry 4.0) toward smart inter-connectivity and automation (Chen K.-C. et al., 2021).

Accustomed to the convenience brought by the latest communication technologies, many people may not realize that ordinary daily activities such as video calls or Zoom meetings were nothing more than science fiction merely three decades ago. Indeed, from the so-called “telephot” in the pioneering novel “Ralph 124C 41+” to the video call scene in the classic movie “Back to the Future,” the simultaneous transmission of live image and sound was considered a “technology of the future” for most of the twentieth century (Fowler et al., 1986; Gooday, 2005). Now that the fantasy of the past has become a reality, a question naturally arises: what will be the next revolutionary form of communications, potentially in the era of the sixth generation (6G)? Fortunately, we may again find clues in science fiction, with examples ranging from the famous scene of Princess Leia's three-dimensional (3D) holographic message in “Star Wars” (Conti, 2008) to the virtual world “OASIS” in the metaverse presented in the recent film “Ready Player One” (Sparkes, 2021). The fact that such scenes have had a long-lasting influence on a vast audience reflects people's desire for more lifelike, immersive, and interactive communications (Xu et al., 2022).

Whether or not it unfolds exactly as depicted in science fiction, immersive communications will become a reality and shift the current communication paradigm in three aspects. First, rather than two-dimensional (2D) images displayed on a flat screen, immersive communications will deliver 3D images with parallax information. Second, in addition to audiovisual information, immersive communications will involve haptic information. Third, the pursuit of immersive experiences will further blur the boundary between the physical and the virtual worlds, allowing new forms of interactions across the two worlds. These paradigm shifts can significantly enrich the communication experiences of users and enable a plethora of new use cases such as 3D telepresence (Yu et al., 2021), ultra-realistic online interactive sports (Next G Alliance, 2022), and immersive learning in education (Pellas et al., 2020), to name a few. In particular, immersive communications can also enable human-machine collaboration in industrial environments and propel the next industrial revolution, i.e., Industry 5.0 (Leng et al., 2022; Maddikunta et al., 2022). As a result, immersive communications are expected to have a profound influence on the landscape of communication industries and impact how people study, work, and entertain in the years to come.

Motivated by the potential of immersive communications, scientists and engineers around the world have been working on the development of related technologies, products, and platforms. Significant progress has been made in recent years, including but not limited to advancements in sensor systems and data capture techniques (Dahiya et al., 2019; Meyer et al., 2022), data processing and computing frameworks (Petkov et al., 2022; Qian et al., 2022; Song et al., 2022), and rendering and display devices (Hirayama et al., 2019; Schmitz et al., 2020; Xiong et al., 2021). The development of some components of immersive communications is progressing faster than that of others, leading to the establishment of testbeds, prototypes, or even commercial products. Virtual reality (VR), as an example, has gained popularity, especially in the gaming industry (Jung et al., 2020). Devices such as VR headsets and haptic glove development kits are available in the market (Kugler, 2021; Chen et al., 2022), while researchers are building testbeds for extended reality (XR) (Huzaifa et al., 2022) and human-machine interaction with haptic feedback (Gokhale et al., 2020). An example of recent development in immersive communications is the VirtualCube system, a 3D video conference system capable of synthesizing remote and local participants so that they appear in the same environment (Zhang Y. et al., 2022). In addition, a research team in Germany is exploring VR-based full-body avatars for training police forces while evaluating their stress level and response to threats (Caserman et al., 2022).

As the aforementioned progress and efforts are paving the way for realizing immersive communications, advancements in communication and networking technologies will be indispensable. Despite the advent of 5G systems and the accompanying advancements in network capabilities, there are still many challenges to achieving immersive communications in various aspects of communications, networking, and computing. The data rate required to transmit live 3D images can be so high, e.g., on the level of terabits per second (Tbps), that even 5G cannot support it, especially for high-resolution and 360° videos. The required end-to-end delay for delivering haptic information can be as low as a few milliseconds for a satisfactory user experience (Maier and Ebrahimzadeh, 2019; Sim et al., 2021). The synchronization of data streams from multiple cameras or sensors and that of audiovisual and haptic information in data transmission also create new challenges. The storing and processing of massive data for immersive communications demand new architectures and techniques for caching and computing (Glushakov et al., 2020; Liu et al., 2021; Taleb et al., 2021). Moreover, artificial intelligence (AI) is necessary both for supporting applications such as human-machine collaboration and user viewpoint/gesture prediction, and for orchestrating network resources to satisfy the demanding requirements of immersive communications (Maier et al., 2018; Tataria et al., 2021; Zawish et al., 2022). Since the realization of immersive communications can require integrated support for enhanced mobile broadband (eMBB) and ultra-reliable low-latency communications (URLLC) (Pang et al., 2022), which is beyond the capability of 5G, researchers look forward to breakthroughs in immersive communications in the era of 6G. 
Targeting 2030 for large scale 6G deployments, the 3rd Generation Partnership Project (3GPP) plans to start 6G studies in 2024 and complete its first 6G standard in 2028 (Ericsson, 2022), while the International Telecommunication Union (ITU)'s “IMT for 2030 and beyond” timeline aims at completing IMT-2030 specifications in 2029–2030 (Yrjölä et al., 2022).

Recognizing the importance of immersive communications, the research community in communications, networking, and computer science is expanding its effort in this field. Several recent review and survey articles can be found in the literature, among which some present the state of the art in immersive communications, while others envision the next steps. Most of these articles focus on a specific aspect, such as supporting 360°/holographic video streaming (Yaqoob et al., 2020; Huang et al., 2022), evaluating the immersive experience of users (Gao et al., 2022a), analyzing the effects of user motions on network performance in XR (Chukhno et al., 2022), enabling the use case of Metaverse (Tang et al., 2022; Wang et al., 2022b; Xu et al., 2022), or facilitating distributed implementation of VR (Morín et al., 2022). Different from the above works, we present a comprehensive survey of immersive communications in this article. With a focus on the communication, networking, and computing perspectives, we review a large number of publications, especially the latest works in communications, networking, and computer science, to present the representative use cases, the recent developments, the technical challenges, and the potential solutions related to immersive communications in the era of 6G communications. Specifically, we examine immersive communications through its three main forms, i.e., XR, haptic communication, and holographic communication, in the remainder of this article. Section 2 introduces representative use cases of immersive communications to illustrate its promising prospect. Section 3 presents the concepts, basic implementation procedures, and requirements of XR, haptic communication, and holographic communication to paint an overall picture of immersive communications. Section 4 focuses on the challenges and the state-of-the-art solutions toward realizing each of the three forms of immersive communications.
Section 5 discusses some open issues regarding immersive communications in 6G, and Section 6 concludes this article. A list of the main acronyms used in this article is given in Table 1.

TABLE 1

Table 1. List of main acronyms.

2. Use cases

There are many potential use cases for immersive communications, relating to both commercial and enterprise scenarios and ranging from gaming to industrial control. In this section, we detail five representative use cases to illustrate the promising prospect of immersive communications. A list of representative use cases is given in Table 2.

TABLE 2

Table 2. Representative use cases of immersive communications.

2.1. Immersive gaming and entertainment

XR provides the ultimate gaming and entertainment experience by presenting convincing gaming environments through XR devices such as VR headsets or smartphones. Players can interact with each other without feeling a barrier between the virtual and the physical worlds (Bastug et al., 2017). XR devices display the virtual world of the game to players and capture their actions such as eye movements to allow them to interact with the virtual world (Elbamby et al., 2018b). With the success of advanced XR gaming consoles and headsets, e.g., Oculus and PlayStation VR, as well as games and platforms, e.g., Pokemon Go and Roblox, game developers are striving to offer more flexible XR experiences with wireless XR devices (Maimone and Wang, 2020). Through wireless XR devices, players can interact freely with other players or virtual objects, e.g., in XR sporting (Kim et al., 2018). Furthermore, haptic communication devices can be combined with XR to significantly enhance the immersive gaming experience. Transducer arrays, which can be attached to XR devices, can capture haptic data from players. As a result, XR devices can fuse haptic information into the virtual world and provide haptic feedback to players by mapping motions in the game to players' sensations. Players can use haptic devices, such as gloves, to control objects in the game (Hashimoto and Ishibashi, 2006) or synchronize their sensations with other players (Mauve, 2000).

2.2. Telesurgery

In telesurgery, surgeons remotely manipulate robotic arms to operate on patients by utilizing control panels and low-latency display of the surgical scenes. Telesurgery is beneficial in removing the barrier of distance between surgeons and patients, tackling the scarcity of surgeons in remote or difficult-to-reach areas such as the countryside, battlefields, and spacecraft, and facilitating the collaboration of surgeons at different locations (Choi et al., 2018; Mohan et al., 2021). The assistance of robotic arms can enhance the performance of surgeries by detecting and canceling out the physiological tremors of surgeons' hand motions (Kumar et al., 2020), performing delicate surgical operations, and minimizing the surgical incision areas to reduce blood loss and incision-related complications (Diana and Marescaux, 2015). To guarantee the performance of surgeons, the display of surgical scenes to them should be highly precise and informative. To this end, 3D video of the surgical scenes with depth information can be displayed to the surgeons, e.g., by using passive polarized glasses, and an eye-tracking mechanism can be used to quickly center the visual display on the area that the surgeon is viewing (Stark et al., 2015). In addition, augmented reality (AR) can be leveraged to overlay medical images such as ultrasound images and computed tomography (CT) images onto the video of surgical scenes (Liu X. et al., 2016). Besides visual information, haptic information in the surgeries, such as the texture of tissues and the tension in tying surgical sutures, can be captured by the haptic devices on the robotic arms and then transmitted to and reproduced by the haptic devices at the surgeons' side (El Rassi and El Rassi, 2020; Patel et al., 2022).

2.3. Immersive learning

Immersive learning integrates emerging technologies, including XR and haptic technologies, into teaching to provide students or trainees with an interactive and engaging learning experience (Laamarti et al., 2014; Affan et al., 2021). During the recent COVID-19 pandemic, conventional methods of remote teaching, e.g., online courses, had difficulty engaging students in the learning process (Jumreornvong et al., 2020; Fitzek et al., 2021). Accordingly, immersive learning, as a potential solution to boost student engagement, is receiving increasing attention, especially from primary and secondary schools. With immersive learning, avatars of students and teachers can be created in the virtual world (Gupta et al., 2019), and each student is allowed to interact with the avatars of teachers and other students via the senses of sight, hearing, and touch. Such interactions can keep students' attention on the learning process. Immersive learning is categorized as either asynchronous or synchronous. Training some skills, such as sports skills and cooperative tele-operation skills for industrial robots, requires real-time interactions, which can encourage active participation in the learning process (Kaluschke et al., 2021; Lee et al., 2021). Utilizing XR, haptic communication, and holographic communication technologies, teachers can check whether the moves and actions of their students are correct and provide immediate corrections if not, regardless of their physical distance from each other. For the skills that do not need real-time interactions, information regarding teachers' positions, velocities, and applied forces can be recorded and displayed to students via XR and haptic devices asynchronously (Tan et al., 2020). Such a “record-and-replay” strategy can allow a much larger number of students to learn at their own pace, despite the absence of real-time interactions (Yokokohji et al., 1996a,b; Steinbach et al., 2018).

2.4. Holographic teleconference

Teleconferencing is a convenient way for users to collaborate remotely. In current video teleconferencing, remote participants can only be displayed on flat screens, which makes a virtual conference feel very different from an on-site conference. To provide an immersive experience, holographic teleconferences create a realistic 3D presence for people by projecting 3D images of remote participants as holograms (Jiang et al., 2021; Siemonsma and Bell, 2022; Zhang Y. et al., 2022; Zhou F. et al., 2022). Specifically, when a remote participant joins the holographic teleconference, 3D visual information and the corresponding audio information of the participant can be captured by multiple sensors, transmitted, and then reconstructed as a hologram on the side of other participants to provide 3D audiovisual information for interactions among participants (Strinati et al., 2019). In this way, holographic teleconferencing can reduce the impact of the separation between the virtual and the physical worlds on participants. In addition to the audio and video information, participants in a holographic teleconference are able to obtain haptic information from others to achieve an immersive experience with the sense of physical contact (Tataria et al., 2021). For example, a participant with haptic sensors can sense a handshake with others, thereby enabling an immersive experience similar to in-person interactions.

2.5. Metaverse

A metaverse provides fully immersive and self-sustaining virtual spaces that merge the physical and digital worlds (Wang et al., 2022b). In the metaverse, users can have avatars as digital representations in simulated or imaginary environments, such as games and virtual cities. Through XR devices including phones or laptops, users can interact with digital avatars, other digital objects, and virtual environments. Metaverses require the synchronization between the physical and the digital worlds through two main information flows. One of them is from the physical world to digital worlds, in which sensors and actuators capture user activity so that the behaviors of a user in the physical world are reflected via their avatars in a digital world. The other is from digital worlds to the physical world, including the interactions among avatars, other digital objects, and metaverse services in the virtual environments. As a result of advanced networking technologies, big data analysis, blockchain, and AI, metaverses are expected to provide human-centric content for users to enable immersive social experiences (Heath, 2021), online collaborations (Suzuki et al., 2020), etc.

3. Immersive communications: Concepts and requirements

The use cases for immersive communications and their potential importance in 6G are intuitive. Understanding immersive communications beyond the use cases, however, requires answers to the question “what are immersive communications?”. Since the research of immersive communications is in an early stage, there is no commonly-agreed definition yet.

We consider immersive communications as a communication paradigm, along with the supporting technologies, that allows users to have lifelike experiences in the physical world, the virtual world, or both, with interactions via 3D audiovisual and/or haptic information exchange. In this section, we focus on the three main forms of immersive communications as illustrated in Figure 1, i.e., XR, haptic communication, and holographic communication.¹ By introducing the concept, basic implementation procedure, and network requirements for each of the three forms, we aim to sketch an overall picture of immersive communications. The requirements of representative immersive communications use cases are illustrated in Figure 2 and also summarized in Table 3.

FIGURE 1

Figure 1. Main forms and exemplary use cases of immersive communications.

FIGURE 2

Figure 2. Requirements of representative immersive communications use cases: an illustration.

TABLE 3

Table 3. Requirements of use cases in immersive communications.

3.1. Extended reality

In this subsection, we introduce the concept of XR and investigate two representative XR technologies: VR and AR. Then, we examine their implementation procedure and service requirements for 6G.

3.1.1. Concept

XR covers a range of technologies, including VR, AR, mixed reality (MR), and everything in between (Hu et al., 2020). In general, XR combines the physical and virtual worlds through extensive video processing and data fusion. Using XR devices, users can interact with virtual avatars and access XR content. Under the umbrella of XR, a variety of technologies are defined depending on the level of virtuality. Two representative technologies in XR are AR and VR. With the lowest level of virtuality, AR focuses on constructing artificial objects according to the objects (e.g., buildings, faces, or vehicles) residing in the physical world and enabling users to interact with them. Conversely, with the highest level of virtuality, VR creates an entirely artificial scenery and allows users to interact with the objects in a completely artificial environment generated by the headsets. In MR, the concepts of VR and AR can be combined to create different levels of virtuality. In spite of the variety of XR technologies, the methods to provide immersive experiences to users are similar, which combine sensory data with virtual environments to produce artificial sceneries, from either the physical or virtual worlds, using headsets or portable display devices.

The first VR flight simulator was developed in the 1970s to train pilots without exposing them to the risks of flying (Earnshaw, 1993). In the early stage, VR headsets were cumbersome, and processing VR content required large supercomputers. Nowadays, VR technologies have gained momentum due to recent advances in computing and display technologies. The headsets, such as Oculus head-mounted displays and HTC Vive, are affordable and can support ultra-high resolutions (3,840 × 2,160 in Pimax 8K) and refresh rates (up to 120 Hz) (Hu et al., 2020). Most VR content is processed and rendered by user devices. Rendering content with a high level of virtuality requires extensive computing power. A VR headset therefore typically relies on a console to supply additional computing power, and the wired connection between the two restricts the user to a workstation. As a result, wireless VR is the primary focus of VR research now (Elbamby et al., 2018a). In addition, multi-sensory XR, as another future vision of XR, integrates human senses and perception, including visual, auditory, olfactory, and tactile, into XR content, enabling a truly immersive experience. This requires the confluence of multiple disciplines, including AI, computer vision, biology, ultra-low-latency networking, etc., while linking the real and virtual worlds (Hu et al., 2021; Wang and Li, 2022).

3.1.2. Basic implementation procedure

While XR comprises several technologies with different levels of virtuality, its implementation procedure can be summarized into three steps: content transmission, rendering, and feedback collection. For each of the above three steps, communication networks can play an important role.

In the step of content transmission, VR content generated by VR content providers is transmitted from content servers to VR devices. VR devices play 360° spherical videos, which can be mapped to equirectangular videos. During playback, these equirectangular videos are mapped onto a sphere, with the user situated at the center, to provide a 3D stereoscopic experience. The key feature of VR video is the ultra-high spatial resolution. A VR video has a resolution of up to 12K (11,520 × 6,480), while the conventional video normally has a resolution of 4K or less. Transmitting full equirectangular videos from content servers requires an ultra-high data rate. Thus, tile-based transmission is usually adopted in VR video delivery. As shown in Figure 3, a content server can divide equirectangular videos spatio-temporally into video chunks, i.e., tiled videos, and only the tiled videos within a user's field-of-view (FoV) are delivered (Son et al., 2018; Yadav and Ooi, 2020). In this way, VR content can be delivered in a significantly reduced data size. However, the tile-based solution requires VR headsets to detect and estimate user viewpoints to determine the region of FoV. Content servers should select which tiled videos to deliver to users based on both the user's current viewpoint and network conditions (Zare et al., 2016). In terms of AR, AR devices generate raw content using sensors at the local devices, such as cameras in smartphones (Ren et al., 2020a). In contrast to VR devices, which download content from a content server, AR devices can upload raw content to the server for further processing. Specifically, raw videos captured by AR devices are clipped into frames with a specific image format, and those frames can be offloaded to the server. The processed content is then delivered to and played on the AR devices.

FIGURE 3

Figure 3. VR video projection and partition.
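The tile selection described above can be illustrated with a minimal sketch. It assumes a uniform grid of tiles over the equirectangular frame and approximates the FoV as a rectangle in yaw/pitch space, which ignores the distortion of the projection near the poles. The function name `tiles_in_fov`, the grid size, and the FoV angles are hypothetical choices for illustration and are not taken from the cited works.

```python
import math

def tiles_in_fov(yaw_deg, pitch_deg, h_fov_deg=100.0, v_fov_deg=90.0,
                 cols=8, rows=4):
    """Return the (row, col) tiles of an equirectangular frame that overlap the FoV.

    Assumptions (illustrative only): yaw 0 is the left edge of column 0,
    pitch +90 is the top edge of row 0, and the FoV is treated as a
    rectangle in yaw/pitch space.
    """
    tile_w = 360.0 / cols   # degrees of yaw covered by one tile
    tile_h = 180.0 / rows   # degrees of pitch covered by one tile

    # Vertical extent of the FoV, clamped to the poles.
    top = min(90.0, pitch_deg + v_fov_deg / 2.0)
    bottom = max(-90.0, pitch_deg - v_fov_deg / 2.0)
    first_row = math.floor((90.0 - top) / tile_h)
    last_row = math.ceil((90.0 - bottom) / tile_h) - 1

    # Horizontal extent of the FoV; column indices wrap around at 360 degrees.
    left = yaw_deg - h_fov_deg / 2.0
    right = yaw_deg + h_fov_deg / 2.0
    first_col = math.floor(left / tile_w)
    last_col = math.ceil(right / tile_w) - 1

    return {(r, c % cols)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)}

# A viewer looking at yaw 0, pitch 0 with a 90 x 90 degree FoV on an 8 x 4 grid.
selected = tiles_in_fov(yaw_deg=0.0, pitch_deg=0.0, h_fov_deg=90.0, v_fov_deg=90.0)
print(sorted(selected), f"{len(selected)} of 32 tiles")
```

In this example only 4 of the 32 tiles overlap the FoV, which is the source of the data-size reduction. A practical server would also weigh network conditions and typically fetch a margin of tiles around the FoV.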

In the step of content rendering, tiled VR videos transmitted to VR devices are stitched together, and computing resources are required to project 2D stereoscopic videos to 3D stereoscopic videos, i.e., generating two different videos for the left and right eyes, respectively. This content rendering step can be performed on VR devices once all the required content has been received. Alternatively, due to the limited computing capability of VR devices, the workload of content rendering can be offloaded to adjacent edge servers enabled by mobile edge computing (MEC) (Sukhmani et al., 2018; Dang and Peng, 2019; Dai et al., 2020). Content processing and rendering are more complex in AR than in VR; the AR processing procedure is shown in Figure 4. Once the raw AR content, i.e., video frames, is captured by an AR device, a location tracking step determines the device's location and position according to the captured frames. Then, a mapping step establishes a virtual coordinate system for the environment based on the result of the tracker, and an object recognition step detects the objects to process in the video frames (Qiao et al., 2018; Ren et al., 2019). Based on the identified objects, the augmented data is retrieved from the local cache or network servers and attached to the frames accordingly. Specifically, a template matching step attaches the augmented data to the frames, and an annotation rendering step renders the processed frames at AR devices. The computing workload for conducting the above functions can be fully or partially offloaded from AR devices to network servers to minimize computing latency or improve energy efficiency at AR devices.

FIGURE 4

Figure 4. Content processing and rendering for AR applications.
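The choice between local processing and offloading can be illustrated with a simple latency comparison. The sketch below uses a common textbook model, not a method from the cited works: local latency is the workload divided by the device's processing speed, and offloading latency is the frame upload time plus edge computing time plus the download time of the result. The stage names follow the text, while all cycle counts, data sizes, and rates are made-up placeholders, and only full offloading is considered.

```python
# Hypothetical per-frame workloads (CPU cycles) for the AR steps named in the text.
AR_STAGE_CYCLES = {
    "location_tracking": 2e8,
    "mapping": 3e8,
    "object_recognition": 1e9,
    "template_matching": 2e8,
    "annotation_rendering": 1e8,
}

def local_latency_s(cycles, device_hz):
    """Time to run the whole workload on the AR device."""
    return cycles / device_hz

def offload_latency_s(cycles, edge_hz, frame_bits, uplink_bps,
                      result_bits, downlink_bps):
    """Upload the frame, compute at the edge server, download the result."""
    return frame_bits / uplink_bps + cycles / edge_hz + result_bits / downlink_bps

def choose_execution(cycles, device_hz, edge_hz, frame_bits, uplink_bps,
                     result_bits, downlink_bps):
    """Return ('edge' or 'local', latency in seconds) for the faster option."""
    local = local_latency_s(cycles, device_hz)
    edge = offload_latency_s(cycles, edge_hz, frame_bits, uplink_bps,
                             result_bits, downlink_bps)
    return ("edge", edge) if edge < local else ("local", local)

total_cycles = sum(AR_STAGE_CYCLES.values())

# Placeholder setting: 2 GHz device, 20 GHz-equivalent edge server,
# 8 Mbit frame, 1 Mbit result, 100 Mbps links in both directions.
print(choose_execution(total_cycles, 2e9, 2e10, 8e6, 1e8, 1e6, 1e8))

# Same workload over a 1 Mbps uplink: uploading dominates the latency.
print(choose_execution(total_cycles, 2e9, 2e10, 8e6, 1e6, 1e6, 1e8))
```

With these placeholder numbers, offloading wins over a fast link and local processing wins over a slow uplink. Partial offloading and energy consumption, both mentioned above, would need a richer model.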

After receiving and playing XR content, XR devices collect user feedback to select the content to deliver next. VR and AR devices have similar methods for feedback collection, with sensors or cameras attached to the devices to capture users' actions and motions. Moreover, VR requires additional feedback regarding the user's viewpoint. A user's viewpoint determines which tiled videos to deliver to render the FoV of the user. The viewpoint can be captured by motion tracking modules on a VR device. Additionally, motion emulation can be used to simulate a user's viewpoint movement on VR devices. VR devices can request the content proactively based on the emulation results to avoid performance degradation, such as rebuffering (Yao et al., 2017). In addition, for interactive applications such as XR gaming, the sensors connected to XR devices, such as inertial measurement units (IMUs), haptic gloves, etc., gather inputs from the users. Depending on the inputs, the XR devices can either process the inputs locally or upload the inputs to content servers for computing and updating.
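As a rough illustration of proactive content requests, the sketch below predicts a user's future yaw angle by linear extrapolation from the two most recent motion-tracking samples. Linear extrapolation is a deliberately simple stand-in chosen here for illustration; it is not the motion emulation approach of the cited work, and the function name `predict_yaw` and the sample values are hypothetical.

```python
def predict_yaw(t0_s, yaw0_deg, t1_s, yaw1_deg, horizon_s):
    """Linearly extrapolate the yaw angle horizon_s seconds after the latest sample.

    Angles are in degrees in [0, 360); the shortest signed rotation between
    the two samples is used so that crossing 0/360 is handled correctly.
    """
    if t1_s <= t0_s:
        raise ValueError("samples must be in increasing time order")
    # Shortest signed angular difference, in [-180, 180).
    delta = (yaw1_deg - yaw0_deg + 180.0) % 360.0 - 180.0
    rate_deg_per_s = delta / (t1_s - t0_s)
    return (yaw1_deg + rate_deg_per_s * horizon_s) % 360.0

# The head turned from 350 to 10 degrees in 100 ms; predict 50 ms ahead.
print(predict_yaw(0.0, 350.0, 0.1, 10.0, 0.05))
```

The predicted viewpoint could then be used to request the tiles expected to fall within the FoV before the user actually turns, at the cost of wasted bandwidth when the prediction is wrong.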

3.1.3. Requirements

In general, XR has stringent latency requirements for accurate and smooth content playback based on user motions. In terms of VR, motion-to-photon (MTP) delay is the most important delay metric, which measures the time difference between the user's viewpoint movement and corresponding reflections at the output of the VR headset. If the MTP delay is larger than 20 ms, VR users may feel spatially disoriented and dizzy, referred to as VR sickness (Yao et al., 2017). Current VR industries target lower MTP delay (below 15 ms) for ideal user experience (Mangiante et al., 2017). In addition, for VR applications requiring extensive interactions, the requirement of response time for rendering the interactions into VR content can be longer than the MTP delay requirement. For example, in VR gaming, a latency of up to 50 ms for responding to player actions can be noticeable yet currently acceptable (Zhang et al., 2017). In terms of AR, the content is mainly captured by local devices. The MTP delay in AR can be minimized by playing the raw content captured by AR devices before the content is processed. However, users' immersive experiences can be adversely affected by delayed processing for rendering the user's motions into AR content. The delay requirements for reproducing user interactions in AR content are 75 ms for online gaming and 250 ms for telemetry based on the sensitivity of the human vestibular system (Mohan et al., 2020).
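The MTP thresholds above can be turned into a simple budget check. In the sketch below, the 20 ms and 15 ms limits come from the text, whereas the breakdown of the MTP delay into components and the per-component values are hypothetical and serve only to show how an end-to-end budget might be tallied.

```python
MTP_SICKNESS_LIMIT_MS = 20.0  # above this, users may experience VR sickness
MTP_TARGET_MS = 15.0          # industry target for an ideal experience

def mtp_delay_ms(components_ms):
    """Total MTP delay as the sum of per-component delays in milliseconds."""
    return sum(components_ms.values())

def classify_mtp(total_ms):
    """Compare a total MTP delay with the two thresholds quoted in the text."""
    if total_ms < MTP_TARGET_MS:
        return "meets_target"
    if total_ms <= MTP_SICKNESS_LIMIT_MS:
        return "within_limit"
    return "risk_of_vr_sickness"

# Hypothetical delay breakdown for one viewpoint update.
budget_ms = {
    "motion_sensing": 2.0,
    "uplink": 3.0,
    "rendering": 6.0,
    "downlink": 3.0,
    "display": 4.0,
}
total_ms = mtp_delay_ms(budget_ms)
print(total_ms, classify_mtp(total_ms))
```

A tally of this kind makes it clear how little of the budget is left for the network once sensing, rendering, and display delays are accounted for.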

Furthermore, in order to achieve low content delivery latency, an ultra-high data transmission rate is required for delivering XR content. Specifically, users view VR videos on headsets placed a few centimeters from their faces. Therefore, high-resolution videos are required for VR applications to improve user experience. Although tile-based content transmission can reduce the data size in VR content delivery, data rate requirements can still be 2.35 gigabits per second (Gbps) or above for VR video delivery, which is more than 100 times higher than the data rate for current high-definition video streaming (Mangiante et al., 2017). For interactive XR applications, such as VR gaming and AR, extensive video processing is required. The computing capability of both network servers and user devices dominates the performance of interactive XR applications, and limited computing capability in the network can be another bottleneck for XR content delivery (Elbamby et al., 2018b).
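To give a feel for why video resolution drives the data rate, the uncompressed bit rate of a video can be estimated as width × height × bits per pixel × frame rate. The sketch below applies this to the 12K and 4K resolutions mentioned earlier in this section. The 24 bits per pixel and 60 frames per second are assumptions made here for illustration, and the results are raw rates before compression, so they are not comparable to the delivered rates cited above.

```python
def raw_bitrate_bps(width, height, bits_per_pixel=24, fps=60):
    """Uncompressed video bit rate in bits per second."""
    return width * height * bits_per_pixel * fps

rate_12k = raw_bitrate_bps(11_520, 6_480)  # 12K VR video
rate_4k = raw_bitrate_bps(3_840, 2_160)    # conventional 4K video

print(f"12K raw: {rate_12k / 1e9:.1f} Gbps")
print(f"4K raw:  {rate_4k / 1e9:.1f} Gbps")
print(f"ratio:   {rate_12k // rate_4k}x")
```

Under these assumptions, a 12K frame sequence carries nine times the raw data of a 4K one, which is why compression and tile-based delivery matter for VR.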

3.2. Haptic communication

In this subsection, we first provide the concepts of haptics and haptic communication. Then, we detail the implementation procedure and service requirements of haptic communication in the 6G era.

3.2.1. Concept

The term haptics initially referred to interactions between humans and objects in the physical world that involve the sense of touch, e.g., swiping a phone screen (Steinbach et al., 2012). The development of tele-operation technologies over the past few decades has expanded the definition of haptics to all forms of interactions involving the sense of touch, including interactions between humans and virtual objects in the virtual world or the tele-operated machines in the physical world (O'malley and Gupta, 2008; Tan et al., 2020). The information conveying the sense of touch in such interactions is referred to as haptic information. The sense of touch relates to different types of mechanoreceptors in human skin and muscles, and the haptic information can be broadly classified into tactile and kinesthetic information (Abiri et al., 2019). Specifically, tactile information is related to the sense of surface texture, friction, and temperature felt by the human skin when in contact with objects, and kinesthetic information is related to the sense of position and motion of limbs along with the associated forces (Srinivasan and Basdogan, 1997; Steinbach et al., 2012). A device that supports haptic interactions and the transmission of haptic information is referred to as a haptic interface (HI) or haptic device. Existing HIs can be broadly categorized into graspable, wearable, and touchable HIs. Generally, graspable HIs are mainly used for capturing and displaying kinesthetic information; wearable HIs are mainly used for capturing and displaying tactile information; and touchable HIs can be used in both kinesthetic and tactile information capture and display (Culbertson et al., 2018). An HI comprises haptic sensors and haptic actuators responsible for capturing and displaying haptic information, respectively (Antonakoglou et al., 2018).
An HI can capture and display a variety of haptic information, and the number of independent coordinates used by the HI to specify the haptic information is referred to as the degrees of freedom (DoF) of the HI (Promwongsa et al., 2020).

Haptic communication refers to the process in which humans communicate and interact through the sense of touch over a communication network (Steinbach et al., 2012). The communication network supporting haptic communication is referred to as the Tactile Internet in some existing works (Ali-Yahiya and Monnet, 2022).² With the use of HIs and the transmission of haptic information over communication networks, users can interact with virtual objects in the virtual world or remotely operate machines in the physical world (Steinbach et al., 2012). The transmission of haptic information can be unilateral, bilateral, or multilateral, depending on the number of users participating in the haptic communication. In the cases of one user manipulating a remote machine or two users interacting with each other, the haptic communication can be unilateral (i.e., an HI either sends or receives haptic information) or bilateral (i.e., an HI both sends and receives haptic information). In other cases, haptic information can be transmitted multilaterally, e.g., in cooperative tele-operations involving multiple users. This is because the behavior of each user may have an effect on other users, resulting in interconnections and couplings in the exchanges of haptic information (Feth et al., 2009; Shahbazi et al., 2018). Since haptic communication centers on humans, some studies examine the human-in-the-loop nature of haptic communication and predict a paradigm shift from content delivery to skillset delivery, as a result of the emergence of haptic communication (Simsek et al., 2016; Ali-Yahiya and Monnet, 2022).

3.2.2. Basic implementation procedure

The implementation procedure of haptic communication depends on how the haptic information is transmitted. For bilateral haptic communication, the implementation procedure mainly consists of four steps: haptic information acquisition, data reduction, data transmission, and haptic display, as shown in Figure 5.³ In the first step, haptic information, including tactile and kinesthetic information, can be acquired by haptic sensors in HIs. In terms of tactile information, force sensors, thermistors, and laser scanners are mainly used in the measurement or evaluation of friction and hardness, warmth, and macroscopic roughness, respectively (Lederman and Klatzky, 2009; Fishel and Loeb, 2012; Okamoto et al., 2012; Liu et al., 2017). Haptic sensors such as IMUs are responsible for the acquisition of kinesthetic information, e.g., tracking the position, velocity, and angular velocity of sensors positioned at different parts of a human (Steinbach et al., 2018). The haptic sensors of interest can be dynamically selected, and only the haptic information captured by the selected haptic sensors needs to be collected for efficient haptic information acquisition (Van Den Berg et al., 2017). Due to the potentially high DoF of an HI, data reduction is adopted in the second step to reduce the amount of haptic data without noticeably degrading the users' immersive experience. Specifically, waveform-based representation and feature extraction algorithms can be used in the compression of tactile information, and perceptual coding techniques based on the perceptual masking phenomenon can be applied for compressing kinesthetic information (Steinbach et al., 2010; Jayasankar et al., 2021). In addition, predictive methods (also called predictive coding techniques) can be leveraged to reduce the amount of transmitted haptic data by inferring upcoming haptic information (Steinbach et al., 2018).
Haptic data reduction can be carried out at either HIs or network servers (Steinbach et al., 2012; Fitzek et al., 2021). Existing methods of haptic data reduction are detailed in Section 4.2. In the third step, the haptic data can be transmitted over a communication network, resulting in a haptic data stream between two HIs. The haptic data stream can consist of multiple haptic data substreams, each of which corresponds to a type of haptic information. Data traffic patterns and QoS requirements can vary across different haptic data substreams due to the differences in the sensitivity of human perception, such as reaction time and the range of perception (Fitzek et al., 2021). The respective QoS requirements of haptic data substreams should be satisfied, and the haptic data substreams should be synchronized in transmission. Moreover, a haptic data stream should be synchronized with audiovisual data streams in the case of immersive communications involving multiple modalities (Cizmeci et al., 2017). In the last step, i.e., haptic display, haptic actuators in an HI stimulate human mechanoreceptors to create realistic haptic sensations when the HI receives haptic data (Wang et al., 2019). In general, haptic display includes tactile display, e.g., adjusting the temperature, and kinesthetic display, e.g., creating motion and changing muscle tension (Pacchierotti et al., 2017; Steinbach et al., 2018; Ozioko et al., 2020). When haptic data transmission is unreliable or delayed, predictive methods can be leveraged at the receiver side to estimate the haptic data that is not received in time, so that haptic display remains smooth.
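As an illustration of the predictive data reduction and receiver-side estimation described above, the following is a minimal sketch rather than a method taken from the cited works. It assumes a zero-order-hold predictor (the receiver keeps displaying the last received value) and a relative error threshold; the function names and the default threshold of 10% are hypothetical.

```python
def reduce_samples(samples, rel_threshold=0.1):
    """Return the (index, value) pairs that would actually be transmitted.

    Assumption: zero-order-hold prediction, i.e., the receiver keeps
    displaying the last transmitted value until a new one arrives.
    """
    sent = []
    last = None
    for i, value in enumerate(samples):
        # Always send the first sample; afterwards send only when the
        # prediction error exceeds a fraction of the last sent magnitude.
        if last is None or abs(value - last) > rel_threshold * abs(last):
            sent.append((i, value))
            last = value
    return sent


def reconstruct(sent, length):
    """Receiver side: hold the last received value for skipped samples."""
    received = dict(sent)
    current = None
    output = []
    for i in range(length):
        current = received.get(i, current)
        output.append(current)
    return output


force = [1.0, 1.05, 1.08, 1.2, 1.21, 0.9]  # illustrative 1-DoF force samples
packets = reduce_samples(force)
print(len(packets), "of", len(force), "samples transmitted")
print(reconstruct(packets, len(force)))
```

In this toy trace only three of the six samples are transmitted, and the receiver fills the gaps with the held value. A more elaborate predictor could replace the hold step without changing the structure.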

Figure 5. Implementation procedure of bilateral haptic communication.

In the case of multilateral haptic communication, three additional steps take place besides the aforementioned four steps, especially for cooperative tele-operation applications (Feth et al., 2009). The implementation procedure of multilateral haptic communication is shown in Figure 6, and the three additional steps are highlighted with green rectangles. First, even if there is no direct haptic interaction between two users, they can still share haptic information (Takagi et al., 2017). For example, the information on tensile strength, texture, and depth of the tissue can be shared among surgeons to facilitate their collaboration in telesurgery. The data format and content of the transmitted haptic information in such haptic information sharing may differ from those of the transmitted haptic information in direct haptic interactions (Shahbazi et al., 2018). Second, it is necessary to properly fuse the haptic information from multiple users, e.g., through a weighted sum, when their behaviors affect other users (Fujimoto et al., 2008; Thanh et al., 2012). Third, when one user's behavior affects multiple users at the same time, distributing haptic information to multiple users according to their different behaviors is required to achieve precise haptic display for individual users, e.g., different reaction forces are applied to tele-operators (Chen et al., 2016).
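The weighted-sum fusion mentioned in the second step can be sketched as follows. This is only an illustrative example: the function name, the use of 3D force vectors, and the equal weights are assumptions, and the cited works may choose weights and data representations differently.

```python
def fuse_forces(forces, weights):
    """Fuse per-user force vectors into one vector by a weighted sum."""
    if len(forces) != len(weights):
        raise ValueError("one weight per user is required")
    dims = len(forces[0])
    fused = [0.0] * dims
    for force, weight in zip(forces, weights):
        if len(force) != dims:
            raise ValueError("all force vectors must have the same length")
        for k in range(dims):
            fused[k] += weight * force[k]
    return fused


# Two tele-operators acting on the same object, weighted equally (hypothetical).
print(fuse_forces([[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]], [0.5, 0.5]))
```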

Figure 6. Implementation procedure of multilateral haptic communication.

3.2.3. Requirements

The data transmission rate requirement of haptic communication is determined by the packet rate and size of haptic data. The packet rate is the number of packets transmitted by an HI per second, which depends on the information update rate. For the smoothness and fidelity of haptic perception, haptic information typically needs to be updated at a rate above 1,000 times per second (Choi and Tan, 2004). If each update of haptic information is packetized and transmitted, the corresponding packet rate of haptic data is above 1,000 packets per second (Xu et al., 2015). The packet size of haptic data largely depends on the DoF of the haptic data (Holland et al., 2019). For kinesthetic data, controlling one movable component (e.g., a joint) on a tele-operator (e.g., a robotic arm) needs six coordinates to be specified to achieve 6 DoF, with three coordinates specifying the translational motion in the 3D space and the other three specifying the rotational motion, i.e., roll, pitch, and yaw (Promwongsa et al., 2020). Since a human hand consists of multiple movable components (e.g., finger joints and wrist joints), its kinesthetic data can be described by a 24-DoF model (Cobos et al., 2008). In addition, for reproducing tactile information with high fidelity, a dense array of haptic sensors/actuators needs to be deployed on a user (Hoggan et al., 2007). For example, for reproducing vibrotactile data, four actuators are deployed around one fingertip (Baik et al., 2020). As a result, tactile data can involve even higher DoF than kinesthetic data (Holland et al., 2019). The packet size of 1-DoF, 10-DoF, and 100-DoF haptic data is about 8, 80, and 800 bytes, respectively, and the specific data transmission rate requirement can be derived accordingly (Holland et al., 2019).
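The derivation of the data rate from the packet rate and packet size can be made concrete with a short calculation. The sketch below assumes exactly 1,000 packets per second and counts payload bits only; protocol headers are ignored, so actual rates on the wire would be higher. The function name is hypothetical.

```python
def haptic_payload_rate_bps(packet_rate_pps, packet_size_bytes):
    """Payload bit rate of a haptic stream (headers not included)."""
    return packet_rate_pps * packet_size_bytes * 8


PACKET_RATE_PPS = 1000  # one packet per haptic update, as assumed in the text

for dof, size_bytes in [(1, 8), (10, 80), (100, 800)]:
    rate = haptic_payload_rate_bps(PACKET_RATE_PPS, size_bytes)
    print(f"{dof:>3}-DoF: {rate / 1e3:.0f} kbps")
```

Under these assumptions, the payload rates are 64 kbps, 640 kbps, and 6.4 Mbps for 1-DoF, 10-DoF, and 100-DoF haptic data, respectively.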

The delay tolerance of haptic communication can be as low as 1 ms since the packet rate of haptic data can be above 1,000 packets per second (Fettweis et al., 2014). In practice, the delay requirement of haptic communication is determined by factors including the perceptual sensitivity of receivers, the dynamics of haptic interaction, and the specific operation or interaction. First, higher perceptual sensitivity for haptic information generally indicates the need for a higher packet rate and thus a stricter delay requirement (Chaudhuri and Bhardwaj, 2018). For example, while touring a virtual museum of natural history, archaeologists can have a stricter delay requirement than the majority of visitors due to their higher perceptual sensitivity to artifacts and specimens. Second, similarly, higher dynamics of haptic interaction generally call for a higher packet rate and a lower delay. Specifically, the delay requirement when such dynamics are high (e.g., in tele-soccer), medium (e.g., in telerehabilitation), and low (e.g., in tele-maintenance) is 1–10 ms, 10–100 ms, and 100–1,000 ms, respectively (Holland et al., 2019). Moreover, for the same use case, the delay requirement can vary with the dynamics of the interaction. For example, in tele-training, the delay requirement when the trainee is being assessed and corrected by the trainer is 1–10 ms; the delay requirement when the trainee is observing the illustration of the trainer is 1–100 ms (Holland et al., 2019). Third, a delay below 2 ms is required for remote machine manipulation, while a delay below 50 ms is required for remote machine monitoring (Aijaz and Sooriyabandara, 2018).

The reliability of haptic communication can be evaluated in terms of bit error rate, packet loss rate, delay-bound violation probability, or prediction error when haptic data prediction is adopted (Promwongsa et al., 2020). The reliability requirement depends on factors such as the specific communication scenario and whether or not haptic data reduction is used. First, in terms of delay-bound violation probability, the reliability of haptic communication in immersive gaming is required to be above 99.9% (Holland et al., 2019). In contrast, when critical operation tasks are performed based on haptic information, higher reliability of haptic communication is required. For example, a reliability above 99.999% is required for haptic communication in telesurgery and remote machine manipulation (Aijaz and Sooriyabandara, 2018; Gupta et al., 2019). Second, when haptic data reduction is adopted, the same packet loss or bit error rate can cause more degradation in the haptic information (Steinbach et al., 2010). As a result, the use of haptic data reduction can result in a stricter requirement for the reliability of haptic communication. For example, a reliability above 99.999% is required in immersive gaming when haptic data reduction is adopted (Holland et al., 2019).
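To give a feel for what these reliability levels mean in practice, the sketch below converts a delay-bound violation probability into an expected number of violations per second. It assumes a packet rate of 1,000 packets per second and independent violations across packets, which is a simplification; the function name is hypothetical.

```python
def expected_violations_per_second(packet_rate_pps, violation_probability):
    """Expected number of delay-bound violations per second."""
    return packet_rate_pps * violation_probability


PACKET_RATE_PPS = 1000

# Reliability of 99.9% and 99.999% correspond to these violation probabilities.
for label, prob in [("99.9%", 1e-3), ("99.999%", 1e-5)]:
    per_second = expected_violations_per_second(PACKET_RATE_PPS, prob)
    print(f"{label}: {per_second:g} violations/s, "
          f"one every {1 / per_second:g} s on average")
```

Under these assumptions, 99.9% reliability corresponds to about one late packet per second, whereas 99.999% corresponds to about one late packet every 100 s.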

3.3. Holography and holographic communication

In this subsection, we introduce holography and holographic communication, beginning with the concept and different types of holography, followed by the basic implementation procedure of holographic communication, and ending with the data transmission rate and delay requirements.

3.3.1. Concept

As the name suggests, holographic communication depends on holography technology, which has made significant progress in the past decade. There are different stages in the development of holography technology. Optical holography generates holograms via recording and recreating optical wavefronts, and the corresponding holograms are recorded interference patterns (e.g., on photographic emulsions) of an “object wave” and a “reference wave.” When the recorded interference pattern is illuminated by the reference wave, a 3D light field can be recreated using diffraction. The original idea of the hologram was developed in the 1940s, and the real breakthrough came in the 1960s thanks to the development of the laser (Gabor, 1972). Later, with advances in electronic devices, digital holography emerged, which uses image sensors to capture interference patterns. In digital holography, recording is done optically, while a 3D image is reproduced via numerical calculation of light wave diffraction using methods such as Fourier transform (Tahara et al., 2018). The latest development of holography is computer-generated holography, in which both the interference pattern and the 3D image in display are generated digitally using a computer (Sahin et al., 2021). With computer-generated holography, the object to be displayed does not have to be physically present, which yields great flexibility at the cost of high computational complexity (Shimobaba et al., 2022). Despite the advances in recent years, generating dynamic 3D holograms in real time remains challenging. As a result, alternative approaches to displaying 3D images have emerged, which are sometimes referred to as “false holography.” Such approaches use glass panes or other “tricks” to create illusions of 3D images (Jones et al., 2007; Kerrigan, 2018). Among the false holography techniques, volumetric display has attracted significant interest in the fields of computer-aided design and medical imaging (Favalora, 2005).
Volumetric display, an umbrella term for many different techniques, renders volume-filling 3D images via the generation, absorption, and scattering of illumination in a confined space, e.g., a cube or cone (Yang et al., 2016). The study of volumetric display is active with exciting experiments (Smalley et al., 2018), and commercial products are also available (Gibney, 2019). Other approaches to imitate 3D display include the use of multiple projectors and a human-size retroreflective cylinder (Gotsch et al., 2018). For example, a circular multi-projector array can be implemented for a light field cylindrical display to differentiate perceived images from different viewing angles.

Based on either true holography or “false holography,” holographic communication is about transferring data representing dynamic 3D images of physical objects over a network and displaying the objects in 3D at the receiver.⁴ Integrating 3D data capturing, processing, transmission, and rendering, holographic communication is expected to enable exciting new services in 6G (Strinati et al., 2019; Clemm et al., 2020). At the moment, there is no consensus on the scope of holographic communication in the literature, and some researchers consider the transferring and rendering of 3D data in AR/VR as a type of holographic communication (Essaili et al., 2022). In this review, holographic communication refers to data transfer for autostereoscopic 3D display, i.e., 3D images that can be viewed with the naked eye without the aid of eyewear or headsets and, ideally, are different when viewed from different positions, angles, or tilts. The 3D display at the receiver can be rendered via real holography, false holography such as volumetric display, or other techniques as long as the objective of autostereoscopic 3D display is achieved. Similar to existing multimedia communications, the content of holographic communication can be either generated in real time or recorded, and the communication mode can be unicast, multicast, or broadcast.

3.3.2. Basic implementation procedure

Although various approaches for holographic communication differ in the implementation procedure, the general process includes the steps of data capture, processing, transmission, and rendering. This is illustrated in Figure 7.

Figure 7. Implementation procedure of holographic communication.

Except for computer-generated holography, a capture system is required to record 3D images of a physical object. An ideal capture system for holographic communication would capture the light field, i.e., all the information of each light ray, in the target scene (Apostolopoulos et al., 2012). In practice, capture is conducted with visual sensors such as a camera array (Nakamura et al., 2019) or light detection and ranging (LIDAR) sensors (Fratz et al., 2021). The depth information of the object of interest is either directly captured (e.g., in the case of a capture system with LIDAR sensors) or computed in the subsequent data processing step (e.g., in the case of a capture system with a camera array). The performance of the visual capture system depends on factors such as the number of sensors and the camera sampling rate (Apostolopoulos et al., 2012).

In the data processing step, the depth information of target objects in the scene is computed (if not directly captured), and the output from capture sensors is fused to form a composite 3D representation of the captured scene (Javidi et al., 2005). For example, in digital holography, a computer can process 2D images taken from different angles and tilts by a camera array to form a single 3D representation of the captured scene (Essaili et al., 2022). The fusion of images may help achieve visualization enhancement in the rendered 3D images such as improvement in the resolution and contrast (Javidi et al., 2005), and it can be conducted either solely at the transmitter side or with the help of an edge server. In addition, the data processing step is responsible for the compression of the fused data to speed up the transmission and reconstruction, and reduce the required data transmission rate and storage in holographic communication (Kurbatova et al., 2015; Cheremkhin and Kurbatova, 2019). The compressed data for the 3D representation is then encoded and transmitted over a network.

At the receiver side, the received data is decoded using one or multiple chosen codecs and decompressed. The captured scene is then reconstructed, possibly with the help of an edge server, and rendered on a display device. An ideal display device for holographic communication would regenerate the light field in the captured scene to create an illusion that the user is placed in the scene. In practice, creating such an illusion is difficult as it requires each point (e.g., each pixel) of the display device to emanate different light rays in different directions. However, given the limitations of human perception, the feeling of visual immersion can be created by using equipment such as a cylindrical light field display (Gotsch et al., 2018), a persistence of vision (PoV) display (Gately et al., 2011), or a static volumetric display device (Kumagai et al., 2021). Such devices render 3D images by using a large curved display to fill the user's FoV, exploiting the phenomenon of a lingering afterimage on the retina, and dynamic turning on/off of voxels in a confined 3D space, among other methods for creating illusions of 3D images.

It is worth noting that holographic communication may also involve audio data capture, processing, and rendering. In such a case, capturing the sound field in the target scene and ensuring audio and video synchronization are important for users to enjoy an immersive holographic communication experience (Apostolopoulos et al., 2012).

3.3.3. Requirements

Holograms mainly come in two types, namely volumetric-based holograms and image-based holograms. The transmission of the two types of holograms requires different data rates, ranging from hundreds of Mbps up to Tbps (Clemm et al., 2020). For volumetric-based holograms, a physical object is represented as a set of 3D pixels or voxels, such as a point cloud. Transmitting a point cloud targeting an object requires a data rate on the level of hundreds of Mbps to several Gbps, depending on the resolution of the 3D content (FG-NET2030, 2020). For example, to fully represent a human, the point cloud in each frame typically consists of 10⁵–10⁶ points, while each point needs 15 bytes of data to represent the color and 3D coordinates of the point. In the case of 30 frames per second, the data rate requirement is between 300 Mbps and 3 Gbps (Selinis et al., 2020; Essaili et al., 2022). For image-based holograms, such as light-field video (LFV), an object is represented by an array of images captured at different angles, tilts, and/or positions. An LFV-based hologram can be more precise as compared with a volumetric-based hologram, especially in high resolution when a large number of images from different tilts, angles, and positions are used per frame (Jiang et al., 2021). For example, if the 3D representation of an object requires a separate image every 0.3°, a hologram with an FoV angle range of 30° and a tilt range of 10° needs 3,300 separate 2D images. In order to transmit an LFV-based hologram for a human-sized object, the required data rate should be between 100 Gbps and 2 Tbps (Clemm et al., 2020).
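The two numerical examples above can be reproduced with a few lines of arithmetic. The sketch below assumes uncompressed data and no protocol overhead, and the function names are hypothetical. With these assumptions, the point-cloud rate evaluates to 360 Mbps and 3.6 Gbps for 10⁵ and 10⁶ points per frame, which is on the same order as the 300 Mbps to 3 Gbps range quoted above, and the image count evaluates to 100 × 33 = 3,300.

```python
import math


def point_cloud_rate_bps(points_per_frame, bytes_per_point=15, fps=30):
    """Uncompressed bit rate of a point-cloud stream."""
    return points_per_frame * bytes_per_point * 8 * fps


def lfv_images_per_frame(fov_deg, tilt_deg, step_deg):
    """Number of 2D views when one image is needed every step_deg degrees."""
    # A small epsilon guards against floating-point error in the division.
    views_across = math.floor(fov_deg / step_deg + 1e-9)
    views_tilt = math.floor(tilt_deg / step_deg + 1e-9)
    return views_across * views_tilt


print(point_cloud_rate_bps(10**5) / 1e6, "Mbps")
print(point_cloud_rate_bps(10**6) / 1e9, "Gbps")
print(lfv_images_per_frame(30, 10, 0.3), "images per frame")
```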

To support real-time holographic communication, the overall delay, including data capturing, processing, transmission, and rendering delay, should be below 100 ms (He et al., 2023). In addition to low delay, synchronization is important to holographic communication. Generally, the hologram of objects or humans may be sampled by multiple sensors from different angles and different distances. In this case, data from different sensors should be synchronized in transmission (Strinati et al., 2019). Taking holographic teleconference as an example, as multiple participants can join the teleconference from different locations, multi-source synchronization is necessary for them to have good quality of experience (QoE) in holographic communication. Otherwise, a part of the rendered hologram can be slightly ahead or behind relative to the rest of the hologram for some users, resulting in poor QoE (Lesniak and Tucker, 2018). Moreover, holographic communication can involve multi-sensory information, e.g., the haptic, audio, and video information (Taleb et al., 2021). In this case, the synchronization of different sensory information in transmission is also important for users to see the hologram, hear the voice, as well as receive touch-sensory feedback from others without a degradation of the immersive experience due to out-of-sync issues. For holographic communication involving the transmission of audiovisual and haptic data, the tolerable difference in the delay of different types of data should be lower than 80 ms for satisfactory QoE (Montagud et al., 2018).
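The inter-stream synchronization requirement can be expressed as a simple check on the spread of per-stream delays. The following sketch is illustrative only: the function names, the stream labels, and the delay values are hypothetical, and the 80 ms tolerance is the figure quoted above.

```python
def max_skew_ms(stream_delays_ms):
    """Largest difference in delay between any two streams."""
    delays = list(stream_delays_ms.values())
    return max(delays) - min(delays)


def is_synchronized(stream_delays_ms, tolerance_ms=80.0):
    """True if all streams arrive within the tolerable delay difference."""
    return max_skew_ms(stream_delays_ms) < tolerance_ms


delays = {"audio": 40.0, "video": 90.0, "haptic": 20.0}  # hypothetical values
print(max_skew_ms(delays), is_synchronized(delays))
```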

4. Immersive communications: Challenges and solutions

After introducing the concepts, implementation procedures, and requirements of immersive communications, we now discuss challenges in XR, haptic communication, and holographic communication, as well as the state-of-the-art solutions, with the most important ones summarized in Figure 8. Note that our review here focuses on the challenges and solutions related to the communication, computing, and networking aspects of immersive communications.

Figure 8. Potential solutions to immersive communications.

4.1. Extended reality

The main challenge of XR is delivering the required content to users on time, given the limited transmission resources and computing capability in a network. A variety of network functions and resources contribute to the performance of content delivery. Systematic solutions involving data processing, rendering, transmission, etc., have been developed to address these challenges. We summarize the solutions for implementing XR in three aspects: content selection, transmission improvement, and computing optimization.

4.1.1. Content selection

The fundamental step in supporting XR applications is to identify which content needs to be processed and transmitted. This step focuses on minimizing the overall data size of the content to deliver at the cost of tolerable performance degradation, thus reducing the delivery time.

In VR services, proactive content delivery is commonly used to meet MTP delay requirements. Thus, in tile-based content transmission, the primary research challenge is how to predict user viewpoints accurately so as to determine which tiled videos to deliver to users. The prediction of user viewpoints can be achieved by sequential learning and data analysis methods based on the user's viewpoint trajectory, such as linear regression (He et al., 2018; Nasrabadi et al., 2020) and long short-term memory (LSTM) (Hou et al., 2018). A lightweight viewpoint prediction function can be deployed at the VR headset for local viewpoint prediction. Alternatively, the viewpoint trajectory can be uploaded to a network server (e.g., edge server), in which a more advanced machine learning model can be applied for accurate prediction (Hou et al., 2021). If the viewpoints are predicted by the network server, the prediction can be conducted based on not only current viewpoint trajectories for a group of users (Sun et al., 2020) but also the historical viewpoint trajectory data to further improve the prediction accuracy (Xu Y. et al., 2018; Feng et al., 2019). Although viewpoint prediction enables proactive tile-based content delivery, perfect prediction cannot be achieved due to the dynamics of user viewpoint movement. Even if viewpoints are known in advance, dynamic network environments such as data traffic load and processing time require adaptive resource management to ensure playback performance. With stochastic decision-making methods, such as reinforcement learning, it is possible to identify the dynamics of user viewpoint movement and determine which tiled videos to deliver to the corresponding VR device (Hu F. et al., 2022). In addition, the portion of tiled videos with different video qualities transmitted in a given time interval can be adjusted according to the viewpoint movement of a user.
Increasing the portion of low-quality videos can improve the robustness against viewpoint prediction errors, while increasing the portion of high-quality videos can improve the QoE of the user. The optimal tradeoff between the robustness and the QoE is evaluated for VR video delivery in Hu M. et al. (2022).
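As a minimal sketch of trajectory-based viewpoint prediction with linear regression, the example below fits a least-squares line to recent yaw samples and extrapolates it to a future time. It is not the predictor used in the cited works: the function names are hypothetical, only one angle is predicted, and the wrap-around of yaw at 360° is ignored for simplicity.

```python
def fit_line(times, values):
    """Ordinary least-squares fit of values = slope * time + intercept."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two (time, value) samples")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("time stamps must not all be identical")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def predict_yaw(times, yaws_deg, future_time):
    """Extrapolate the fitted line to future_time (same unit as times)."""
    slope, intercept = fit_line(times, yaws_deg)
    return slope * future_time + intercept


# Yaw samples over the last four time steps (hypothetical trace).
print(predict_yaw([0, 1, 2, 3], [10.0, 12.0, 14.0, 16.0], future_time=5))
```

The predicted yaw, together with a margin for prediction error, could then be used to decide which tiles fall inside the expected FoV.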

AR devices capture raw content, i.e., video frames, which can be offloaded to network servers for prompt content processing. Once the raw content is offloaded, the server detects and processes the objects within video frames captured by users' cameras, then returns the processed content to the AR devices. Though it is easier to satisfy the MTP delay requirement in AR than VR, enabling accurate and rapid content processing (e.g., object detection) by network servers requires sufficient bandwidth to provide low-latency two-way transmission for satisfactory QoE. To balance transmission bandwidth usage for computing offloading and content processing performance, current solutions mainly focus on using machine learning techniques to adjust the number of frames offloaded by an AR device per unit time, based on the network environment and AR device movement. Specifically, offloading more video frames to a network server can improve object detection accuracy, especially when the AR device moves quickly and generates new content frequently. However, the bandwidth usage increases accordingly due to a large number of frames to offload (Liu Q. et al., 2018). Taking AR device mobility and network dynamics into account, adaptive frame rate adjustment is investigated in Chen N. et al. (2021). A deep reinforcement learning approach is used to study how mobility dynamics affect AR service performance and to determine the optimal uploading frame rate for maximal object detection accuracy and playback fluency.

XR content is expected to be further enriched in the era of 6G. Digital twins can incorporate AI to collect environmental information, characterize physical objects, and construct digital models of the physical objects accordingly. Digital models from digital twins can be used for XR applications as a new type of XR content that can be accessed by XR devices (Zhang Z. et al., 2022). For example, in an industrial Internet-of-Things scenario, designers and workers can use XR devices to interact with the digital models of machines and products in a simulated virtual environment. In addition, XR devices can collect the interactions from designers and workers. Based on the interactions, digital twins can adaptively configure their settings, such as data collection frequency (Aheleroff et al., 2021). The combination of XR and digital twins can support emerging applications such as the metaverse. However, synchronization among the physical world, digital twins, and XR content requires considerable network resources. Game theoretic methods are adopted in Han et al. (2022) to adjust the synchronization rate between the physical world and digital twins based on the demand of virtual service providers that provide content to XR devices. A network slicing-based solution is proposed for providing metaverse services (Liu et al., 2022), which allocates multi-dimensional resources for content synchronization to improve the fidelity of digital twins and the QoE of XR users.

4.1.2. Transmission improvement

As discussed in Section 3.1.3, the main bottleneck for VR video delivery is a limited data rate. Therefore, a straightforward solution to overcome the bottleneck is to increase the data rate with advanced communication techniques. As a key technology in 5G, millimeter wave (mmWave) communications can facilitate VR content delivery due to their high data rate and ultra-low propagation latency (Abari et al., 2016). In 6G, the transmission rate can be further improved by the physical layer technologies of terahertz (THz) transmission and intelligent reflecting surface (IRS), which can be applied in VR video (Chaccour et al., 2020; Du et al., 2020). However, communication links using ultra-high frequency bands, such as mmWave and THz, are prone to outage as they require line-of-sight (LoS) channels. Physical obstacles in the environment, including the user's body, may break the communication links and severely degrade the communication quality. To address this issue, a sub-6 GHz frequency band can be used as a backup if the mmWave or THz bands do not provide satisfactory channel quality. However, dynamic frequency band switching can result in a time-varying data transmission rate, thereby degrading the content delivery performance. The work (Liu et al., 2019b) models communication link state transitions corresponding to switching different frequency bands (e.g., mmWave and sub-6 GHz bands) in VR content delivery as a Markov chain. Content processing policies are adjusted to compensate for transmission delays when channel state transitions occur. In addition to adapting to channel dynamics, the reliability of mmWave or THz communication links can be improved by establishing multiple communication links between a device and several edge servers for VR content delivery (Gu et al., 2022; Yang P. et al., 2022).
In addition, on the link layer, the IEEE 802.11 working group has introduced a new amendment, IEEE 802.11be Extremely High Throughput (EHT), i.e., Wi-Fi 7, to support high-throughput and low-latency video applications, including XR, through aggregating multiple transmission bands, exploiting MIMO enhancements, and enabling multi-AP coordination (Deng et al., 2020).
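
The band-switching model mentioned above, in which link state transitions between frequency bands form a Markov chain, can be illustrated with a minimal two-state sketch. This is not the model of Liu et al. (2019b); the state definitions, the per-slot transition probabilities `p_block` and `p_recover`, and the two rates are illustrative assumptions. The sketch computes the long-run fraction of time spent on each band and the resulting average data rate.

```python
# Two-state Markov chain for the serving band (illustrative assumption):
#   state 0 = mmWave link usable, state 1 = fallback to sub-6 GHz.

def stationary_two_state(p_block, p_recover):
    """Stationary distribution of a two-state Markov chain.

    p_block:   per-slot probability of moving from mmWave to sub-6 GHz
    p_recover: per-slot probability of moving from sub-6 GHz back to mmWave
    Returns (share of time on mmWave, share of time on sub-6 GHz).
    """
    total = p_block + p_recover
    if total <= 0:
        raise ValueError("at least one transition probability must be positive")
    pi_mmwave = p_recover / total
    return pi_mmwave, 1.0 - pi_mmwave


def average_rate(p_block, p_recover, rate_mmwave, rate_sub6):
    """Long-run average data rate, in the same unit as the two input rates."""
    pi_mmwave, pi_sub6 = stationary_two_state(p_block, p_recover)
    return pi_mmwave * rate_mmwave + pi_sub6 * rate_sub6
```

For instance, with hypothetical values `p_block = 0.1` and `p_recover = 0.4`, the link spends about 80% of the time on mmWave, so the average rate is dominated by the mmWave rate while the instantaneous rate still fluctuates between the two bands.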

On the network layer, a network virtualization-based solution is proposed for VR content delivery, in which network controllers can create private logic networks for VR applications to satisfy their service requirements and dynamically adapt the routing schemes according to the mode of content delivery (i.e., uni-cast or multi-cast) (Huawei Technologies Co., Ltd). The transmission protocols are designed according to the features of VR content delivery. The transmission protocol based on quick UDP Internet connections (QUIC) is proposed in Yen et al. (2019) to prioritize important tiled videos, such as the videos in the center of the user's FoV or the videos to be played soon, in transmission over a QUIC connection, in order to minimize the ratio of missing tiles when playing VR videos.
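
The tile prioritization idea described above can be sketched as a simple ordering rule. The following is a hypothetical illustration rather than the protocol of Yen et al. (2019): the tile fields (`id`, `pos`, `deadline`), the Manhattan distance to the FoV center, and the choice to rank by playback slack first are all assumptions, and no actual QUIC API is used.

```python
import heapq


def transmission_order(tiles, fov_center, now):
    """Order tiles for transmission (hypothetical priority rule).

    tiles:      list of dicts with keys 'id', 'pos' (col, row in the tile grid)
                and 'deadline' (playback time in seconds)
    fov_center: (col, row) of the tile at the center of the user's FoV
    now:        current time in seconds
    Tiles that play sooner come first; ties are broken by distance to the
    FoV center, so central tiles are sent before peripheral ones.
    """
    heap = []
    for tile in tiles:
        col, row = tile["pos"]
        distance = abs(col - fov_center[0]) + abs(row - fov_center[1])
        slack = tile["deadline"] - now
        heapq.heappush(heap, (slack, distance, tile["id"]))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

In a real deployment, the resulting order would be mapped onto stream priorities of the transport connection; that mapping is omitted here.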

4.1.3. Computing optimization

Supporting wireless XR requires networks to have sufficient computing capability for processing and rendering the content, especially for interactive applications such as VR gaming. Processing the content locally at the XR devices can be time-consuming and energy-inefficient due to their limited computing capability. Instead, the computing workload can be fully or partially offloaded to network servers, and multi-tier computing can be a potential solution to reduce computing time and bandwidth consumption when providing computing services to XR devices. Accordingly, computing strategies should be based on the features of diverse network servers to improve resource utilization and service performance.

In MEC, edge servers can provide additional computing capability for resource-limited devices to reduce content processing latency for mobile XR content delivery. Specifically, in VR, edge servers can project monoscopic videos to stereoscopic videos when content is transmitted from the content provider's cloud server to VR devices. Such MEC-assisted content delivery can reduce bandwidth consumption compared to delivering stereoscopic videos from the cloud server directly, and computing time can be reduced compared to projecting the videos at the local devices (Mangiante et al., 2017). In AR, devices can offload captured content to an edge server to minimize processing latency (Siriwardhana et al., 2021). In addition, edge servers can cache the processed XR content to further reduce the content delivery and processing time (Sukhmani et al., 2018). Joint computing, caching, and communication resource management for VR video delivery is investigated in Dang and Peng (2019) and Sun et al. (2019), which study the tradeoffs between computing and caching resource allocation for minimizing content delivery delay, given stochastic content processing time and popularity. Deep reinforcement learning methods are adopted to allocate computing resources at an edge server for individual content delivery requests in Liu and Deng (2021) and Liu et al. (2019b), aiming to minimize content delivery delay while adapting to dynamic network environments and user viewpoint movement. The work (Liao et al., 2021) further investigates trusted caching collaboration for multiple edge servers in supporting VR/AR content delivery. A distributed caching scheme is proposed to optimize the cache space and policy for edge servers while incentivizing edge servers to participate in edge caching through verification schemes in the blockchain.

Nonetheless, the computing capability at edge servers may not always be sufficient for processing XR content. Compared to cloud servers, edge servers usually have limited storage resources for caching XR content. Targeting 6G, a multi-tier computing architecture provides a potential solution for further accelerating XR content delivery by coordinating computing and storage resources among cloud servers, fog servers (e.g., servers at the gateway), and edge servers across the network. By integrating computing resources across the entire network, content processing workloads can be optimally distributed among multiple servers, and storage capacity among servers can be utilized to satisfy offloaded computing demands. However, optimizing XR performance by multi-tier computing can be complicated when there are a multitude of computing offloading and caching options to choose from. The computing and caching resource coordination between the cloud server and edge servers is studied in Al-Abbasi et al. (2019) and Mehrabi et al. (2021). Based on the information of a static network environment, e.g., transmission rate and XR computing demand, mixed-integer nonlinear programming formulations are investigated. Considering dynamic network environments and user mobility, the work (Zhou C. et al., 2022) utilizes digital twins of end users to characterize network dynamics and statuses. The meta-learning method is adopted to jointly allocate computing and caching resources at servers on different tiers of a network for context-based applications, including XR, based on the captured network statuses from digital twins. The attention of users on the virtual objects in XR content is predicted in Du et al. (2022) by an alternating least squares method, and a computing resource allocation scheme is proposed to prioritize processing of the virtual objects that attract more user attention.

In addition to jointly allocating computing and caching resources at network servers, computing performance can be further enhanced by scheduling computing tasks at edge servers. Edge servers can provide location-based content to users, which can contribute to computing optimization for XR applications. Specifically, in AR, users at close locations may offload and require similar content, and therefore, raw content offloaded from the nearby users can be processed together for improving computing efficiency (Jia and Liang, 2018). Furthermore, rendering pipelines can be optimized based on real-time communication and computing performance of network servers and local devices when part of the workloads for content rendering are offloaded. A collaborative rendering pipeline is investigated in Xie et al. (2021), which dynamically arranges the execution order of sub-tasks in content rendering on both the edge server and XR devices, based on network characteristics, to facilitate parallel computing and improve content rendering efficiency. In addition, joint computing and communication resource management for efficiently supporting multiple users in a virtual world is investigated in Ren et al. (2020b). Device-to-device links are enabled to allow each AR device to leverage the computing resources of nearby AR devices for lightweight pre-processing of the captured frames to further improve computing resource utilization in the network.

4.2. Haptic communication

The main challenge in haptic communication is to satisfy the stringent delay and reliability requirements in the delivery of haptic data, especially when the data packet rate is high. To tackle this challenge, solutions have been developed in three aspects, including haptic data reduction to reduce the packet size or the packet rate, advanced communication and networking techniques to reduce delay and improve reliability, and haptic data prediction to compensate for excessive delay and packet loss over communication networks.

4.2.1. Haptic data reduction

To improve the fidelity of haptic perception, the number of haptic sensors/actuators deployed on an HI has been increasing (Steinbach et al., 2018). For example, electronic skin (e-skin) can be attached to prosthetic limbs for sensing haptic information, or to human skin for virtual social interaction (Dahiya, 2019; Yu et al., 2019). To reproduce the function of human skin, sensors/actuators need to be densely deployed on e-skin, for example, 25 sensors/actuators per 1 cm² (Liu et al., 2020). In addition, the required packet rate for haptic data can be higher than 1,000 packets per second (Orlosky et al., 2017). As a result, with a large number of devices and a high packet rate, the required data transmission rate of haptic communication can be high. To tackle this challenge, one solution is haptic data reduction, which is to reduce the packet size or rate of haptic data.

For reducing the packet size of haptic data, floating-point compression in the time domain or quantization of haptic data in the frequency domain can be exploited. In floating-point compression, one degree of freedom in the haptic information (e.g., the direction of the translational movement in an axis) can be represented by a 32-bit floating-point number, and only the bits different from those in the previous haptic data are transmitted (You and Sung, 2008). Using time-frequency transformation algorithms such as discrete cosine transform, a sequence of haptic data packets in the time domain can be transformed into the data in the frequency domain, which are then quantized and transmitted (Tanaka and Ohnishi, 2009; Zeng et al., 2020). For reducing the packet rate of haptic data, the perceptual masking phenomenon is widely exploited, which suggests that a human cannot perceive the difference of haptic information below the just-noticeable difference (JND). According to Weber's law, the JND of haptic information is proportional to the currently perceived value of the information, and the proportion is referred to as the Weber fraction (Steinbach et al., 2018). In this regard, the perceptual haptic reduction method is to transmit an updated haptic data packet only when the difference is larger than a threshold (e.g., JND) (Steinbach et al., 2010). In addition, the perceptual masking phenomenon in both time and frequency domains can be jointly exploited to achieve a higher data reduction ratio and lower data deviation (Wei et al., 2022). Moreover, by jointly evaluating the difference of the haptic information in terms of all the DoF among consecutive data packets, the perceptual haptic reduction can be further improved (Steinbach et al., 2012).
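
The perceptual (JND-based) packet rate reduction described above can be sketched for a single degree of freedom as follows. This is a minimal illustration under stated assumptions: the function name `deadband_filter`, the single-DoF scalar samples, and the Weber fraction passed by the caller are hypothetical, and the cited schemes include refinements that are not reproduced here.

```python
def deadband_filter(samples, weber_fraction):
    """Return the indices of the samples that would be transmitted.

    A sample is sent only if it deviates from the last transmitted value by
    more than the JND, taken here as weber_fraction * |last transmitted value|
    (Weber's law). The first sample is always sent.
    Note: when the last transmitted value is 0, the threshold is 0 and any
    change triggers a transmission; practical schemes add a small absolute
    floor, which is omitted in this sketch.
    """
    if not samples:
        return []
    sent = [0]
    last_sent = samples[0]
    for index, value in enumerate(samples[1:], start=1):
        if abs(value - last_sent) > weber_fraction * abs(last_sent):
            sent.append(index)
            last_sent = value
    return sent
```

With a Weber fraction of 0.1 (an illustrative value), the force samples `[1.0, 1.05, 1.08, 1.2, 1.25, 1.0]` lead to transmissions at indices 0, 3, and 5 only, i.e., half of the packets are suppressed while every suppressed sample stays within the JND of the value last seen by the receiver.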

The use of haptic data reduction should adapt to the type of haptic data, the delay requirement, and the reliability requirement for haptic communication. First, haptic data can exhibit different Weber fractions in the JND, e.g., 7–15% for force data and 13–28% for stiffness data, which results in different thresholds in perceptual haptic reduction (Chaudhuri and Bhardwaj, 2018). Second, data reduction in the frequency domain results in high processing delay since it is based on a sequence of data packets in the time domain. It is suitable for use cases with high delay tolerance, such as the passive perception and exploration of remote/virtual objects (Sachs et al., 2018). In contrast, data reduction in the time domain, implemented in real time, is suitable for use cases with low delay tolerance such as immersive gaming, which involves extensive interactions between the players (Holland et al., 2019). Third, haptic data reduction may not be suitable for use cases requiring high reliability. As discussed in Section 3.2.3, with the use of haptic data reduction, the required reliability of haptic communication increases. In this regard, for use cases with a high-reliability requirement (e.g., 99.999% for telesurgery), the reliability requirement can be difficult to satisfy if haptic data reduction is used.

4.2.2. Communication and networking solutions

To satisfy the ideal communication delay of below 1 ms for haptic communication, physical-layer delay of <0.1 ms is desired (Aijaz et al., 2016). For reducing queuing delay, haptic data may be allowed to preempt the data of other types in the downlink transmission (Ji et al., 2018). For uplink transmissions, non-orthogonal multiple access (NOMA) can improve spectrum efficiency and reduce channel access delay of haptic devices (Budhiraja et al., 2019). In addition, a grant-based user scheduling mechanism can take 0.3–0.4 ms for exchanging the scheduling request and transmission grant (Ji et al., 2018). Besides such delay, the signaling overhead, resulting from network control or grant-based scheduling, reduces the efficiency of data transmission (Ding et al., 2021). Therefore, grant-free user scheduling, which periodically pre-reserves transmission resources, has been exploited to avoid the time-consuming scheduling exchange, and the same resources can be pre-reserved for multiple haptic devices to improve resource utilization (Ali et al., 2021; Gao J. et al., 2021). To reduce the delay due to packet retransmissions, interference in multiple access should be properly managed. In grant-free NOMA, the interference can be managed by device activity detection (Ye et al., 2019) and successive interference cancellation (SIC) (Abbas et al., 2019). Rate-splitting multiple access (RSMA) encodes message streams intended for multiple devices into common streams and private streams based on available channel state information (CSI), and a device jointly decodes the common streams and the private stream intended for it, which can achieve flexible interference management and high robustness to imperfect CSI (Dizdar et al., 2020).

For improving the communication reliability of haptic communication, several approaches have been adopted in the literature. First, considering the small size of a haptic data packet, short block-length channel codes with strong error correction capabilities, such as low-density parity-check (LDPC) codes and short polar codes, have been investigated for haptic communication (Miloslavskaya and Vucetic, 2020; Yuan et al., 2022). Second, spatial diversity can be exploited by massive multiple-input and multiple-output (MIMO), IRS, and multi-connectivity techniques (Tarneberg et al., 2017; Tang et al., 2020; Anwar et al., 2021). Third, time diversity can be exploited by retransmission schemes such as K-repetition, in which a haptic device can automatically transmit K repetitions of a packet over consecutive slots, thereby avoiding the delay caused by waiting for a retransmission request from the receiver (Yang et al., 2021). NOMA can improve retransmission efficiency where the transmit power of a device can be optimized to retransmit the required minimum redundant bits for satisfying the reliability requirement (Kotaba et al., 2019, 2021).
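
To give a feel for how K-repetition trades radio resources for reliability, the following sketch computes the smallest K that meets a target reliability. It rests on a strong simplifying assumption that is not taken from the cited works: each repetition is lost independently with the same probability `per_slot_loss`, so a packet is lost only if all K copies are lost. Consecutive slots usually see correlated channels, so the result should be read as an optimistic bound.

```python
def min_repetitions(per_slot_loss, target_reliability):
    """Smallest K such that 1 - per_slot_loss**K >= target_reliability.

    Assumes independent and identically distributed losses across the K
    repetitions (a simplification; correlated fading would require more
    repetitions than returned here).
    """
    if not 0.0 < per_slot_loss < 1.0:
        raise ValueError("per_slot_loss must be in (0, 1)")
    if not 0.0 < target_reliability < 1.0:
        raise ValueError("target_reliability must be in (0, 1)")
    k = 1
    while 1.0 - per_slot_loss ** k < target_reliability:
        k += 1
    return k
```

For example, with a hypothetical per-slot loss probability of 0.01, two repetitions give a reliability of 99.99%, so a 99.999% target requires three repetitions under this model.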

To guarantee low delay and high reliability for haptic communication, network slicing, which allows multiple isolated virtual networks to be constructed over a shared physical network infrastructure, has been exploited (Polachan et al., 2020). The perceptual masking phenomenon of haptic information, as introduced in Section 4.2.1, can be exploited to accurately capture the maximum tolerable delay of haptic communication requests, which facilitates resource reservation in the network slice for haptic communication (Ge et al., 2019). For multiple tele-operation slices, diverse stability control capabilities of tele-operators in the presence of delay should be considered for customized transmission resource reservation (Liu S. et al., 2018). Moreover, by exploiting AI-based learning methods, traffic patterns of haptic devices can be accurately captured, and efficient resource reservation can then be facilitated (Shen et al., 2020).

4.2.3. Haptic data prediction

The delay requirement of haptic communication can impose a constraint on the distance between two users. For example, to satisfy a delay requirement of 10 ms, the distance between a transmitter HI and a receiver HI must be smaller than 3,000 km since the propagation speed is upper-bounded by the speed of light. This can create an issue for applications such as VR gaming with haptic interactions of players across continents. In addition, it is impossible to eliminate the loss of data packets or the violation of delay requirement in haptic communication (Aijaz and Sooriyabandara, 2018). To improve user experience considering the above facts, haptic data prediction can be exploited.
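
The 3,000 km figure follows from a one-line calculation. The symbols $d_{\max}$ (maximum separation) and $T_{\max}$ (delay budget) are introduced here for illustration, and the calculation assumes that the entire 10 ms budget is spent on one-way propagation at the free-space speed of light $c \approx 3 \times 10^{8}$ m/s:

```latex
d_{\max} = c \, T_{\max}
         = \left(3 \times 10^{8}\ \mathrm{m/s}\right) \times \left(10 \times 10^{-3}\ \mathrm{s}\right)
         = 3 \times 10^{6}\ \mathrm{m}
         = 3{,}000\ \mathrm{km}
```

This is an upper bound. Signals in optical fiber propagate at roughly two-thirds of $c$, and processing, queuing, and transmission delays consume part of the budget, so the achievable distance in practice is shorter.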

For haptic data prediction, model-based or model-free prediction algorithms can take historical haptic data and other correlated data as the input. In tele-operation, the force feedback from the tele-operator is predicted by evaluating the previous force feedback through an auto-regressive model (Sakr et al., 2007). In the tele-operated needle insertion, the force/torque feedback from the patient is predicted by inputting the force/torque commands of the surgeon to the hidden Markov model (HMM) (Boabang et al., 2020). Audiovisual data collected in the interaction with a surface material are input to a neural network-based semantic learning algorithm to predict the texture of the surface material (Wei et al., 2021).
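
As a minimal illustration of model-based prediction from historical haptic data, the sketch below fits a first-order auto-regressive model, x[t] ≈ a · x[t-1], by least squares and extrapolates future samples. It is not the predictor of Sakr et al. (2007); the model order, the absence of an intercept term, and the function names are assumptions made to keep the example short.

```python
def fit_ar1(history):
    """Least-squares estimate of a in x[t] ~ a * x[t-1] (no intercept)."""
    if len(history) < 2:
        raise ValueError("need at least two samples to fit the model")
    numerator = sum(prev * cur for prev, cur in zip(history, history[1:]))
    denominator = sum(prev * prev for prev in history[:-1])
    if denominator == 0:
        raise ValueError("history is all zeros; the coefficient is undefined")
    return numerator / denominator


def predict_next(history, steps=1):
    """Predict the next `steps` samples by iterating the fitted AR(1) model."""
    coefficient = fit_ar1(history)
    value = history[-1]
    predictions = []
    for _ in range(steps):
        value = coefficient * value
        predictions.append(value)
    return predictions
```

For a decaying force signal such as `[8.0, 4.0, 2.0, 1.0]`, the fitted coefficient is 0.5 and the next two predicted samples are 0.5 and 0.25. Higher-order models or learned predictors follow the same pattern of fitting on past samples and extrapolating over the prediction horizon.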

Haptic data can be predicted either at the receiver side or at the transmitter side to compensate for an excessive delay or packet loss. The receiver can predict the haptic data from the transmitter when an excessive delay occurs (Maier and Ebrahimzadeh, 2019). For example, digital twin-based prediction can be used by the receiver for low-latency interactions (El Saddik, 2018). Alternatively, the transmitter can predict its future haptic data and transmit the predicted data to compensate for the transmission delay (Hou et al., 2019). In this case, predicting whether haptic interaction is about to occur can help determine whether the haptic data prediction and the subsequent transmission are necessary (Mondal et al., 2020).

Haptic data prediction algorithms, such as AI-based ones, can be computing-intensive. To this end, they can be implemented using computing resources in the network to satisfy the stringent delay requirements (Simsek et al., 2016; Sukhmani et al., 2018). In a tele-operation scenario, each of the two interacting haptic devices is associated with one edge server which caches the haptic interaction data, trains and implements the LSTM network-based prediction algorithm, and delivers the predicted haptic data to its associated haptic device (Li X. et al., 2021). Furthermore, with close proximity, auxiliary robots can be deployed around haptic devices to implement haptic data prediction and deliver the results to the devices using device-to-device (D2D) communications (Yu et al., 2022).

In addition to compensating for the delay or packet loss, haptic data prediction can be used to reduce the packet rate of haptic data (Antonakoglou et al., 2018). Specifically, the haptic transmitter can implement the haptic data prediction and evaluate the prediction deviation, and only transmit the data when the prediction deviation is higher than the JND of the receiver. If the haptic data has not been transmitted, the receiver can predict it based on the prediction algorithm shared with the transmitter.
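
The prediction-based packet rate reduction described above can be sketched as follows. The transmitter and the receiver run the same predictor on the same reconstructed history, so both always hold identical state; a sample is transmitted only when the prediction misses it by more than the JND. The linear-extrapolation predictor, the use of the Weber fraction times the magnitude of the true sample as the JND, and all function names are assumptions for illustration.

```python
def predict(state):
    """Shared predictor: linear extrapolation from the last two reconstructed values."""
    previous, last = state
    return last + (last - previous)


def run_link(samples, weber_fraction):
    """Simulate the transmitter and the receiver over a lossless link.

    Returns (sent_indices, reconstructed), where `reconstructed` is the
    sequence available at the receiver. The first sample is always sent.
    """
    if not samples:
        return [], []
    first = samples[0]
    state = (first, first)  # identical at transmitter and receiver
    sent, reconstructed = [0], [first]
    for index, value in enumerate(samples[1:], start=1):
        guess = predict(state)
        if abs(value - guess) > weber_fraction * abs(value):
            output = value      # deviation is perceivable: transmit the sample
            sent.append(index)
        else:
            output = guess      # receiver substitutes its own prediction
        reconstructed.append(output)
        state = (state[1], output)
    return sent, reconstructed
```

For a linearly increasing signal such as `[1.0, 2.0, 3.0, 4.0, 5.0]`, only the first two samples are transmitted, after which the shared predictor tracks the signal exactly. Packet loss would desynchronize the two states, which is one reason why data reduction raises the reliability requirement.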

4.3. Holographic communication

In holographic communication, users are able to view 3D holograms from different angles, tilts, and positions. As a result, a hologram synthesized with information from more viewpoints can produce more detailed and continuous visual information for users, thereby creating a more realistic immersive experience (Liu et al., 2019a). This however requires the transmission of a large amount of data. The main challenge in holographic communication is its stringent data rate and delay requirements. In this subsection, we focus on potential solutions for tackling this challenge in the aspects of data processing, communication, and networking.

4.3.1. Content selection, compression, and prediction

A high data rate is essential for holographic communication, and the demand for data rate can vary from hundreds of Mbps to several Tbps depending on the type of transmitted data, e.g., volumetric-based or image-based holograms. One way to relax the data rate requirement is to reduce the data size, for example, by transmitting only the most essential parts of a hologram through viewpoint-based content selection in holographic communication (Clemm et al., 2020). Since some parts of the hologram may not be observed depending on the user's viewpoint and position, as well as the presence of obstacles, those parts may not need to be transmitted. However, two issues remain even with the selective transmission. First, for an immersive experience in holographic communication, 6 DoF (yaw, pitch, roll, up/down, left/right, forward/backward) need to be considered when a user views a hologram, which makes content selection based on the user's viewpoint a complex problem. In addition, without head-mounted devices such as VR headsets, tracking the position and viewpoint of the user is challenging and requires mechanisms such as full-body tracking (Xu W. et al., 2018) or eye tracking (Zhang X. et al., 2019).
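
A highly simplified form of viewpoint-based content selection is sketched below: given the viewer position and viewing direction, only the points of a point cloud that fall inside a viewing cone are kept for transmission. The cone-shaped field of view, the parameter `half_fov_deg`, and the function name are assumptions. Occlusion by obstacles and roll of the viewer, both of which matter in the 6-DoF setting discussed above, are ignored.

```python
import math


def visible_points(points, viewer_pos, view_dir, half_fov_deg):
    """Keep the 3D points that lie within a cone around the viewing direction.

    points:       iterable of (x, y, z) tuples
    viewer_pos:   (x, y, z) position of the viewer
    view_dir:     (x, y, z) viewing direction (need not be normalized)
    half_fov_deg: half-angle of the viewing cone in degrees
    """
    norm = math.sqrt(sum(c * c for c in view_dir))
    if norm == 0:
        raise ValueError("view_dir must be a non-zero vector")
    direction = [c / norm for c in view_dir]
    cos_limit = math.cos(math.radians(half_fov_deg))
    selected = []
    for point in points:
        offset = [p - v for p, v in zip(point, viewer_pos)]
        distance = math.sqrt(sum(c * c for c in offset))
        if distance == 0:
            selected.append(point)  # point coincides with the viewer: keep it
            continue
        cos_angle = sum(o * d for o, d in zip(offset, direction)) / distance
        if cos_angle >= cos_limit:
            selected.append(point)
    return selected
```

Points behind the viewer or far off-axis are dropped, which reduces the amount of data to transmit at the cost of requiring accurate and timely viewpoint tracking.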

Another solution for reducing the required data rate is to apply data compression. For a 2D real-time video, current media codecs can achieve compression ratios from 250:1 to 1,000:1 (Selinis et al., 2020; Essaili et al., 2022). Similarly, format conversion and data compression can be applied to reduce the data size in holographic communication. The authors in Mekuria et al. (2017) propose a lossy real-time color-encoding method by exploiting the inter-frame redundancy of point clouds. Moreover, considering the strong correlation among different views in a hologram, multi-view coding (MVC) for LFV-based streaming is proposed in Xiang et al. (2016), which improves the compression rate by analyzing both horizontal and vertical correlations of images in adjacent angles and tilts. Meanwhile, many efforts have been made by standardization groups for the compression of holograms. For example, the Moving Picture Experts Group (MPEG) defined the video point cloud compression (V-PCC) by converting point clouds into two separate video sequences that capture the geometry and texture information, respectively (Schwarz et al., 2019). The Joint Photographic Experts Group (JPEG) intended to provide a standard representation framework to facilitate the compression of LFV-based or point cloud-based content for holographic communication (Schelkens et al., 2019). Different codecs for hologram compression are evaluated in Amirpour et al. (2021), in which the authors study the compression and restructure of holograms.

Retransmissions due to data packet loss result in additional delay. To avoid the retransmission delay, the lost data packets can be recovered based on predicted data according to historical information of an object such as its trajectory. For example, packets can be recovered from an LSTM-based prediction of human actions and movements in 3D (Liu J. et al., 2016) or a short-term prediction by analyzing the actions, movements, or gestures of users (Manolova et al., 2021). By predicting content, data packets can be generated at the receiver side in the event of packet loss to reduce the delay in holographic communication (Strinati et al., 2019).

4.3.2. Communication and networking solutions

In addition to data processing, some communication and networking solutions have been investigated for satisfying delay and data rate requirements of holographic communication, including computing architecture, transport protocols, and physical layer technologies.

In holographic communication, data captured from different sensors needs to be processed to form a 3D representation of the object, which is then rendered and reconstructed at the receiver side (Javidi et al., 2005). However, the limited computing capability of local devices may lead to a long processing delay due to the high workload of data fusion and rendering (Hu et al., 2017). Cloud computing is introduced to support high computing workloads for data processing in holographic communication. However, transmitting massive data to the cloud may result in a high communication delay (Wang K. et al., 2022), which is not suitable for real-time holographic communication. One promising solution is to offload computing tasks to MEC servers for data processing, since MEC servers possess considerable computing capability and are placed close to users (Gupta et al., 2021). Thanks to network function virtualization (NFV), functions such as data fusion, data compression, and data rendering can be virtualized and flexibly deployed at MEC servers. In this case, captured data from different sources can be aggregated, fused, and synchronized at an MEC server before rendering (Qian et al., 2022). Moreover, a multi-tier computing scheme is proposed for 6G networks, which can be utilized for holographic communication by integrating computing resources at cloud servers, MEC servers, and local devices, to achieve a low delay for data transmission and high computing capacity for data processing with collaboration among different servers (Yang et al., 2018; Wang K. et al., 2022). By integrating computing resources on different tiers, content can be processed at different servers to effectively utilize computing resources, and flexible computing resource management should be developed to facilitate multi-tier computing for holographic communication. 
For example, split rendering is introduced for an MEC server and a local device to cooperatively decode and render holograms according to the content (Essaili et al., 2022).

To satisfy the stringent delay and high reliability requirements of holographic communication, transport layer optimizations are also crucial. Current transport protocols, such as transmission control protocol (TCP) and user datagram protocol (UDP), can hardly satisfy the requirements of holographic communication. To improve the reliability and delay performance in real-time communication, new protocols based on UDP are introduced, such as QUIC, the transport underlying HTTP/3 (Seufert et al., 2019). Currently, the research on QUIC mainly focuses on traditional 2D video streaming services, while QUIC can serve as a potential solution for holographic communication, providing a quality-managed low-delay streaming option (Clemm et al., 2020). Moreover, the transmission of a hologram may consist of multiple substreams corresponding to different viewpoints, while the QoS requirement and the priority of each substream may be different. In this case, the transmission of the most essential substreams needs to be prioritized. To achieve this target, a new transport protocol is designed in Rozen-Schiff et al. (2021) for holographic communication to satisfy different QoS requirements of different flows by providing flow-level granular control. In addition, an adaptive retransmission mechanism based on TCP is designed to reduce retransmissions by analyzing and differentiating packets (Clemm et al., 2020). For example, only important data, such as the data used for rendering the part of the hologram in the center of the user's FoV, will be retransmitted if the related packets are lost, to reduce retransmissions.

Finally, physical layer technologies are important to supporting a high data rate for holographic communication. In order to transmit high-resolution LFV-based holograms, holographic communication requires a data rate of several Tbps, while current 5G networks cannot support it (David and Berndt, 2018; Shahraki et al., 2021). Featuring higher frequency and larger bandwidth compared with mmWave in 5G, THz communications have the potential to support holographic communication with Tbps-level data rate (Chen et al., 2019; Elayan et al., 2019). To overcome the severe propagation loss of THz communication, dense deployment of access points and extremely narrow beams can be adopted to improve connection density and communication reliability (Zhang Z. et al., 2019). Considering the absorption and reflection properties in the THz regime (Aazhang et al., 2019), the deployment of the THz base stations and the prediction of user motion require further investigation to provide sustainable LoS links for holographic communication (Chaccour et al., 2022). In addition to THz communications, visible light communication (VLC) can provide an alternative solution for holographic communication by providing large available bandwidth (Beysens et al., 2021). Featuring a high transmission data rate (Strinati et al., 2019) and accurate positioning (Li et al., 2015), VLC can potentially support holographic communication as well as user tracking in an indoor environment. The coordination of THz communication and VLC is studied in Wang et al. (2022a) for providing a reliable service with a high data rate.

Table 4 provides a summary of the solutions discussed in this section as well as their limitations or costs.

Table 4. Potential solutions to immersive communications and the costs.

5. Immersive communication: Open issues and future directions

Despite an increasing number of studies and solutions for supporting XR, haptic communication, and holographic communication, many open issues must be addressed before immersive communications can become widespread. To name a few, synchronization of multi-modal communications, user QoE modeling and enhancement, and intelligent network management for immersive communications remain challenging problems. In this section, we present some major open issues in immersive communications and potential future directions to address these issues.

5.1. Multi-modal communications

While immersive communications have the potential to enhance user engagement and facilitate immersive interactions, effective network resource management for ensuring synchronized multi-modal perception in highly dynamic network environments is an open issue. The synchronization of multi-modal perception consists of two aspects: inter-stream (cross-modal) and intra-stream. First, the transmission of auditory, visual, and haptic data results in multiple data streams that should be synchronized in order to prevent motion sickness. For example, the time interval between perceived visual and tactile movement should not exceed 1 ms (Van Den Berg et al., 2017). Second, to enhance the immersive experience, a data stream can include multiple data substreams corresponding to different sensations, e.g., temperature and pressure, which also need synchronization. Data substreams corresponding to different DoF of an HI should be synchronized to maintain the stable perception of simultaneity, and data substreams transmitted from LIDAR sensors placed at different locations should be synchronized to render a 3D hologram precisely. There are many works that enable either intra-stream or inter-stream synchronization from the perspective of a single network layer (Cizmeci et al., 2017; Zhang et al., 2018). However, in order to synchronize multi-modal perception, both network-related and application-related information is necessary. This is because network resource management for multi-modal communications is affected by not only different data packet formats, data traffic patterns, and QoS requirements, but also different sensitivities of human perception. The cross-layer design of network protocols for multi-modal communications, which can support information sharing among different layers for efficient use of network resources, is a potential solution (Kumar and Muhammad, 2018; She et al., 2020). 
A higher-layer approach to synchronizing multi-modal information can benefit from information on network conditions at lower layers, e.g., adaptively changing the priority of modalities in transport-layer multiplexing according to real-time physical-layer data rates. In addition, lower-layer approaches can take into account application-related information for efficient network resource management, e.g., timely adjusting the amount of radio resources allocated to a user in response to the dynamic sensitivity of the user's perception. Since multi-modal perception data in immersive communications can include personal biometric information of individual users, privacy challenges can arise in the transmission and processing of such data, such as biometric data leakage or profiling (Shen et al., 2021b).

5.2. AI-native immersive communications

AI techniques have demonstrated outstanding performance in identifying data correlations and analyzing device dynamics. As a result, some application functions using AI techniques, i.e., AI-enabled functions, have been developed for exploring unknown device states in immersive communications, such as viewpoint predictions in VR devices and haptic data prediction (Wu et al., 2022). To support increased service demands on immersive communications in 6G, AI-enabled functions will be deployed at network servers, i.e., cloud and edge servers (Li M. et al., 2021). Accordingly, the network should support the entire lifecycle of AI for the functions, including data collection, data pre-processing, AI model training, inference, and AI model evaluation. By taking AI-enabled functions as the built-in component for supporting immersive communications, several potential future research directions should be investigated. First, AI-enabled functions can be configured according to network management policies for supporting immersive communications. For example, in haptic communication, the prediction horizon, i.e., the time window for the predicted information, of tactile and kinesthetic information can be adjusted to adapt to real-time network transmission and computing delay, AI-based prediction accuracy, and service reliability requirements. Second, efficient data management schemes can be developed, in which low-signaling-overhead and grant-free network management can be achieved by sharing the data obtained from AI-enabled functions. For example, in VR video delivery, network controllers can use a viewpoint prediction model or results from a viewpoint prediction function and allocate sufficient downlink communication resources to users with highly dynamic viewpoint movements. 
Additionally, effective resource management solutions should be developed to support AI model training in real-time, so that AI-enabled functions can be updated according to user behavior dynamics, where sufficient network resources should be allocated for supporting data collection and processing at edge and cloud servers. When supporting AI-native immersive communications, essential security issues should be addressed. For example, data and model poisoning attacks can lead to biased or incorrect results by injecting false samples into the training datasets and updating crafted local AI models in federated learning, respectively (Khisamova et al., 2019).
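The prediction-horizon adjustment mentioned above can be sketched as follows, under simplifying assumptions. The function picks the shortest candidate horizon that still covers the measured transmission and computing delay while the predictor meets an accuracy target. The linear accuracy model `toy_accuracy`, the candidate horizons, and the thresholds are all hypothetical placeholders; in practice the accuracy curve would be measured for the deployed AI model.

```python
def choose_horizon_ms(delay_ms, candidates_ms, accuracy_of, min_accuracy):
    """Return the shortest horizon that covers the delay and meets the accuracy target.

    Returns None when no candidate satisfies both conditions.
    """
    for h in sorted(candidates_ms):
        if h >= delay_ms and accuracy_of(h) >= min_accuracy:
            return h
    return None


def toy_accuracy(horizon_ms):
    # Hypothetical model: accuracy falls by 0.5 percentage points per ms of horizon.
    return max(0.0, 1.0 - 0.005 * horizon_ms)


candidates = [5, 10, 20, 40]
h_ok = choose_horizon_ms(12, candidates, toy_accuracy, min_accuracy=0.85)
h_none = choose_horizon_ms(25, candidates, toy_accuracy, min_accuracy=0.85)
```

With a 12 ms delay, the 20 ms horizon is selected. With a 25 ms delay, only the 40 ms horizon covers the delay, but its assumed accuracy is below the target, so the function returns `None` and the network would have to reduce delay instead.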

5.3. Time-sensitive and deterministic networking

The existing solutions mentioned in Section 4 can help reduce transmission delay in immersive communications. However, satisfying the stringent delay and reliability requirements of XR, haptic communication, and holographic communication, especially ms-level end-to-end delay, remains a challenge. Fortunately, the ongoing efforts of 3GPP, IEEE, and IETF in supporting time-sensitive networking (TSN) and deterministic networking (DetNet) (Messenger, 2018; Nasrallah et al., 2019) provide solutions to meet the requirements of immersive communications (Rost and Kolding, 2022). The current efforts largely focus on the link and network layers (i.e., layers 2 and 3) and mostly target industrial networks (Rost and Kolding, 2022). Therefore, the corresponding solutions may not be readily applicable to all use cases of immersive communications. Potential future directions of TSN and DetNet for immersive communications include the following. First, a comprehensive solution integrating existing TSN and DetNet designs for delay minimization can be important to immersive communications. For example, the joint design of coordinated sensing/capturing and communication (on the physical layer), traffic shaping and scheduling (on the link layer), flow identification and packet treatment (on the network layer), and viewpoint/haptic data prediction (on the application layer) can help reduce the end-to-end delay in immersive communications. Second, instead of treating different data streams in a multi-modal communication separately, joint prioritization and resource orchestration for different types of data given their respective delay and jitter requirements is another promising direction. Third, integrating environment-aware and service-oriented network management paradigms can potentially enable TSN and DetNet for immersive communications. 
An example is to incorporate adaptive radio access network (RAN) function splitting, network slicing, and AI-driven network management to minimize delay and jitter by customizing for a specific service and adapting to the network environment.
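One simple way to picture joint prioritization across modalities is earliest-deadline-first (EDF) scheduling on a shared link. The sketch below is an illustrative simplification, not a TSN or DetNet mechanism: all packets are assumed to be queued at time zero, and the packet sizes, deadlines, and link rate are hypothetical.

```python
def edf_schedule(packets, rate_bits_per_ms):
    """Serve queued packets in order of earliest deadline.

    packets: list of (stream_name, size_bits, deadline_ms), all queued at t = 0.
    Returns a list of (stream_name, finish_time_ms, deadline_met).
    """
    t = 0.0
    report = []
    for name, size_bits, deadline_ms in sorted(packets, key=lambda p: p[2]):
        t += size_bits / rate_bits_per_ms  # transmission time on the shared link
        report.append((name, t, t <= deadline_ms))
    return report


packets = [
    ("video", 80_000, 20.0),
    ("haptic", 1_000, 1.0),
    ("audio", 4_000, 10.0),
]
report = edf_schedule(packets, rate_bits_per_ms=10_000)  # a 10 Mbit/s link
```

In this example the haptic packet is sent first and meets its 1 ms deadline, whereas serving the packets in the listed order would delay it behind 8 ms of video transmission.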

5.4. QoE-oriented networking

While QoS provisioning from a network perspective benefits the transmission of XR content, haptic information, and holograms, as detailed in Section 4, evaluating and guaranteeing individual users' QoE is crucial in providing them with an immersive experience. This is because many factors, besides communication network conditions, can affect user experience in immersive communications, including coding, compression, and human perception. Therefore, QoE-oriented networking from users' perspective is a promising network management paradigm to support immersive communications in the 6G era, including two potential aspects: personalized QoE modeling and QoE-oriented network resource management. First, existing works on immersive communications are limited in personalizing QoE models for individual users. Conventional QoE modeling is based on either subjective tests or objective quality assessments (Tasaka, 2022). The former, conducted in relatively static laboratory environments, is costly and inapplicable in dynamic network environments, whereas the latter, evaluated by empirical human perception models, does not differentiate individual users (Barakabitze et al., 2019; Ruan and Xie, 2021). Finding a way to model personalized QoE while adapting to dynamic network environments remains an open issue. Second, managing network resources to guarantee the QoE of individual users in immersive communications necessitates user-level information. Even if several users request the same service, they may have different resource demands for improving their QoE (Kougioumtzidis et al., 2022). For example, due to the difference in the sensitivity of haptic perception, e.g., reaction time, the haptic sensors of interest and the scan time for each haptic sensor may differ in supporting different users, yielding different communication and computing resource demands (Coutinho and Boukerche, 2022). 
In the 6G era, the paradigm of digital twins can be a potential solution for QoE-oriented networking. Specifically, individual users can be characterized by creating user digital twins, including user data profiles that contain extensive well-organized user data, and a variety of digital twin functions that support flexible and customized data collection and analysis (Shen et al., 2021a). Both personalized QoE modeling and QoE-oriented network resource management for immersive communications can benefit from extensive, timely, and fine-grained user-level information (Wang et al., 2021). Although QoE-oriented networking can provide users with immersive experiences based on the preferences and features of individual users, privacy issues, such as unconscionable behavioral profiling and improper uses of the profiles, should be addressed when collecting and processing data with user preference information (Nguyen et al., 2021).
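To make the link between personalized QoE models and resource management concrete, the sketch below allocates a fixed number of resource units greedily, always giving the next unit to the user with the largest marginal QoE gain. The saturating-exponential QoE model and the per-user `sensitivity` parameter are assumptions chosen for illustration; they stand in for whatever personalized model a user digital twin would supply.

```python
import math


def qoe(units, sensitivity):
    # Hypothetical concave QoE model: saturates toward 1 as resources grow.
    return 1.0 - math.exp(-sensitivity * units)


def allocate(total_units, sensitivities):
    """Greedily assign resource units by marginal QoE gain.

    sensitivities: dict mapping user id to the (assumed) model parameter.
    Returns a dict mapping user id to the number of units allocated.
    """
    alloc = {user: 0 for user in sensitivities}
    for _ in range(total_units):
        best = max(
            alloc,
            key=lambda u: qoe(alloc[u] + 1, sensitivities[u]) - qoe(alloc[u], sensitivities[u]),
        )
        alloc[best] += 1
    return alloc


alloc = allocate(10, {"a": 0.2, "b": 1.0})
```

Under this model, user "b" reaches a high QoE with few units, so most of the remaining units go to user "a", which reflects the observation that users requesting the same service can have different resource demands.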

5.5. Collaborative multi-tier computing

Research on multi-tier computing is still at a nascent stage (Yang, 2019). In the 6G era, collaborative multi-tier computing can be a promising computing paradigm by leveraging the various characteristics of computing servers, such as service coverage and resource capacity. There are two research directions to facilitate immersive communications. First, computing tasks corresponding to different steps of immersive communications can be executed on different computing servers. Different steps of immersive communications may have different network resource demands, e.g., I/O-intensive data fusion tasks and CPU-intensive data encoding tasks require different communication and computing resources (Gao H. et al., 2021). Selecting proper computing servers for each step based on the features of computing servers and the resource demands of the step is beneficial for satisfying stringent QoS requirements of immersive communications. Second, context data management across computing servers at different tiers plays an important role in supporting immersive communications. A significant percentage of computing tasks in immersive communications will be stateful, meaning that context data are required during task execution, e.g., volumetric media objects or holograms in rendering (Gao et al., 2022b; Zhou C. et al., 2022). When stateful computing tasks are executed on a computing server, the required context data should be either stored locally on the computing server or downloaded remotely from other computing servers. As a result, managing context data, e.g., selecting proper computing servers from different tiers to proactively store context data based on the computing task arrival and mobility patterns of individual users, will have a significant impact on the performance of immersive communications. 
While collaborative multi-tier computing provides more options for context data management than conventional MEC, the coordination of computing servers at different tiers can significantly complicate the problem of context data management. In addition, establishing reliable trust relationships between computing servers and among computing servers and users, as well as measuring the credibility of users, is an open and important research direction in collaborative multi-tier computing for immersive communications (Shen et al., 2022).
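The effect of context data placement on server selection can be sketched with a simple delay model. In the following example, the delay of a stateful task on a server is the sum of the round-trip time, the context download time when the context is not stored locally, and the processing time. The `Server` fields, the two-tier setup, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    rtt_ms: float              # round-trip time between user and server
    cpu_gcycles_per_s: float   # computing capacity
    cached_context: frozenset  # context data stored locally
    fetch_ms: float            # time to download missing context data


def task_delay_ms(server, work_gcycles, context_id):
    """Delay of a stateful task: network + optional context fetch + computing."""
    fetch = 0.0 if context_id in server.cached_context else server.fetch_ms
    compute = 1000.0 * work_gcycles / server.cpu_gcycles_per_s
    return server.rtt_ms + fetch + compute


def pick_server(servers, work_gcycles, context_id):
    """Choose the server with the smallest estimated task delay."""
    return min(servers, key=lambda s: task_delay_ms(s, work_gcycles, context_id))


servers = [
    Server("edge", 2.0, 10.0, frozenset({"hologram_A"}), 30.0),
    Server("cloud", 40.0, 100.0, frozenset({"hologram_A", "hologram_B"}), 0.0),
]
choice_a = pick_server(servers, 0.1, "hologram_A").name  # context cached at the edge
choice_b = pick_server(servers, 0.1, "hologram_B").name  # context missing at the edge
```

The same task is best served at the edge when its context is already stored there, and at the cloud otherwise, which is why proactive context placement matters.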

5.6. New network architecture

Network architecture innovation is indispensable for a widespread realization of immersive communications, and innovations building on recent developments for 6G architecture are promising future directions. The need for new architectures manifests in several aspects. First, the computing-intensive nature of immersive communications, rooted in processing and compressing 3D data, predicting viewpoints and haptics data, and reconstructing 3D objects, demands a network architecture with extensive computing resources and reliable computing service provisioning. As a result, a heterogeneous network with multi-tier computing architecture (Yang, 2019; Zhou C. et al., 2022), featuring on-demand and collaborative computing task offloading and scheduling across the network, is important to immersive communications yet open to investigation at the moment. Second, as networks become increasingly complex and the requirements of immersive communications become exceedingly stringent, supporting immersive communications in 6G requires a network architecture with unprecedented scalability, flexibility, and adaptivity. A 6G architecture integrating digital twins, network slicing, and pervasive AI (Shen et al., 2021a) can be a foundation for immersive communications. Third, considering the diverse delay requirements of different XR, haptic communication, and holographic communication use cases, the Open-RAN (O-RAN) architecture featuring realtime, near-realtime, and non-realtime layers can benefit service differentiation in RAN management for immersive communications (Abdalla et al., 2022). Last, considering different user preferences and diverse user devices, a new architecture enabling user-centric networking, such as the everyone-centric architecture in Yang Y. et al. (2022), has the potential to empower immersive communications. 
However, as none of the above architectures is developed specifically for immersive communications, new designs and customizations based on them for supporting immersive communications are open for investigation.
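As a small illustration of service differentiation across the O-RAN control layers, the sketch below maps the timescale of a management decision to a control loop. The thresholds (below 10 ms, 10 ms to 1 s, and 1 s or more) are the timescales commonly quoted for the O-RAN control loops and are stated here as an assumption rather than taken from this article; the example decisions and their periods are hypothetical.

```python
def control_loop(decision_period_ms):
    """Map a decision timescale to an O-RAN control loop (assumed thresholds)."""
    if decision_period_ms < 10:
        return "real-time"
    if decision_period_ms < 1000:
        return "near-real-time"
    return "non-real-time"


# Hypothetical management decisions and how often each would be updated.
decisions = {
    "haptic packet scheduling": 1,
    "XR slice resource adjustment": 100,
    "viewpoint-prediction model retraining": 60_000,
}
placement = {name: control_loop(period) for name, period in decisions.items()}
```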

6. Conclusion

In this article, we have delved into immersive communications toward 6G and presented a comprehensive review of the related concepts, representative use cases, technical challenges and potential solutions, and future directions. Focusing on XR, haptic communication, and holographic communication, we have illustrated their general procedures, network requirements, and recent developments in the context of a vision for 6G. Despite abundant emerging use cases and exciting recent advancements, we have shown that many challenges are yet to be conquered before the envisioned prosperity of immersive communications can occur. In particular, the demanding transmission rate, delay, and reliability requirements, further complicated by the multi-modal and computing-intensive features of immersive communications, indicate the necessity of an unprecedented amount of communication and computing resources as well as novel paradigms such as multi-tier computing and user-centric networking.

To respond to the challenges posed by supporting immersive communications and promote further research, we have presented various solutions and future directions in this survey. From physical-layer technologies such as Terahertz communications to application-layer solutions such as user behavior prediction, advances in each layer will contribute to the realization of immersive communications. Meanwhile, new paradigms envisioned for 6G, such as QoE-oriented networking and AI-native communications, represent promising future directions for researchers in the field to explore.

The paradigm shift to immersive communications is truly exciting and inspiring, especially when viewed in the context of the evolution toward 6G. Many opportunities exist, and more will emerge for researchers and engineers in the fields of communications, networking, and computer science to realize immersive communications. We hope this review inspires further interest among fellow researchers and provides fundamental knowledge on related research, thereby contributing to this much-anticipated paradigm shift and making immersive communications the next reality.

Author contributions

All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.

Funding

This work was financially supported by research grants from the Natural Sciences and Engineering Research Council of Canada.

Acknowledgments

The authors would like to thank Dr. Dongxiao Liu for his helpful comments related to the security and privacy issues in immersive communications.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Footnotes

1. ^Note that the three forms may co-exist since a use case may involve more than one form, and additional forms of immersive communications may exist or emerge.

2. ^Haptic communication and the Tactile Internet are related as a service and a medium as in the case of voice over IP (VoIP) services and the Internet (Aijaz et al., 2016).

3. ^In unilateral haptic communication, either the step of haptic information acquisition or the step of haptic display is skipped depending on whether an HI is sending or receiving haptic information.

4. ^Note that the term “holographic communication” is also used in the literature of massive MIMO and IRS but with a different and unrelated meaning (Dardari and Decarli, 2021).

References

Aazhang, B., Ahokangas, P., Alves, H., Alouini, M.-S., Beek, J., Benn, H., et al. (2019). Key Drivers and Research Challenges for 6G Ubiquitous Wireless Intelligence (White Paper). Oulu: 6G Flagship; University of Oulu.

Abari, O., Bharadia, D., Duffield, A., and Katabi, D. (2016). “Cutting the cord in virtual reality,” in Proceedings of the 15th ACM Workshop on Hot Topics in Networks (Atlanta, GA: ACM), 162–168.

Abbas, R., Shirvanimoghaddam, M., Li, Y., and Vucetic, B. (2019). A novel analytical framework for massive grant-free NOMA. IEEE Trans. Commun. 67, 2436–2449. doi: 10.1109/TCOMM.2018.2881120
