Distributed Immersive Performance (DIP)



DIP version 1: the first proposal
Several technical challenges are common to the Distributed Immersive Performance scenario. The details below describe the approach our team takes in each of these areas.

DIP Server, Network and Client Components
The following figure shows the components of a fully populated Distributed Immersive Performance (DIP) server, network and client system for a single unidirectional connection.



The shaded items at the left are audio-video acquisition hardware for real-time interface to a general network, shown at the bottom center. Items shaded at the top center are recording, storage and retrieval hardware that may be part of the local data source or placed elsewhere on a network. The client (rendering) site shown at the right may be a long distance away from the other sites. Dotted lines connect optional equipment needed for stereo video. Two sets of similar hardware are needed for a full bidirectional connection between two sites.

Low Latency Real-Time Continuous Media (CM) Stream Transmission and Network Protocols

DIP poses many unique networking challenges. In addition to the high bandwidth consumed by HD video and multiple channels of audio, DIP requires very low latency, precise synchronization and smooth, uninterrupted data flow among many media streams in order to achieve a realistic immersive experience [4]-[7]. The single greatest limiting factor for human interaction in this immersive environment is the effective transmission latency (delay). Typical latencies for the participants are shown in Figure 3. Traditional video and audio compression has been used to overcome bandwidth limitations of network transmission, at the expense of greatly increased delay. In DIP and other interactive applications, the delay due to compression may be intolerable, requiring the use of high bandwidth networks to transmit uncompressed (or minimally compressed) immersidata [4]-[7]. Initial experiments have shown that maximum allowable latencies range from tens of milliseconds to one hundred milliseconds at most, depending on the experimental conditions and content.

The latency in sending acquired data through a network comprises packetization, unavoidable propagation delay (set by the physical speed of data transmission), queuing at each hop and processing at each hop. Our focus is on approaches to reducing queuing delays as well as on reducing the number of hops taken; this is a difficult problem because we work with best-effort networks having highly varying and unpredictable traffic patterns. The protocol stack already has a relatively low delay, and processing delays are difficult to control because we do not own the routers in the shared Internet2 environment.
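To make these components concrete, the short Python sketch below adds up a rough one-way latency budget. Every number in it is an illustrative assumption rather than a measurement from our testbed; it simply shows how quickly packetization, propagation and per-hop queuing consume a budget of a few tens of milliseconds.

    # Rough one-way latency budget; all figures are illustrative assumptions.
    packetization_ms = 2.5      # time to fill one audio packet (120 samples at 48 kHz)
    propagation_ms = 20.0       # ~4000 km of fiber at roughly 2/3 the speed of light
    hops = 10                   # router hops on the path
    per_hop_ms = 0.5            # queuing + processing per hop on a lightly loaded path

    total_ms = packetization_ms + propagation_ms + hops * per_hop_ms
    print(f"one-way latency ~ {total_ms:.1f} ms")   # ~27.5 ms, already near the budget

Even under these optimistic assumptions, propagation alone dominates on continental distances, which is why queuing delay and hop count are the main levers left to us.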

We have assembled a test system with Linux PCs running UDP/IP over a commodity 100BaseT network, sound cards, amplifiers, microphones and speakers. We recently performed several experiments to determine latency effects on performances with two distributed musicians. The latency in the channel is precisely controlled from 3 ms to 500 ms, corresponding to the maximum round-trip time (RTT) in the Internet. The communication was full duplex and the streaming software was custom-written. We found that the latency tolerance is highly dependent on the piece of music and the instruments involved. For a piece by Piazzolla (a fast-paced tango), the tolerable latency using a synthesized accordion was about 25 ms, which increased to 100 ms with a synthesized piano. We observed that musicians can adapt to some level of varying delay depending on the piece, and that spatial immersive audio reproduction made a large difference in musician comfort by re-creating the acoustics of a large performance hall.
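The delay-controlled channel in these experiments can be approximated by a small relay that holds each packet for a fixed artificial delay before forwarding it. The sketch below is written for illustration only; the port numbers and peer address are hypothetical, and the actual experimental software was custom-written and is not reproduced here.

    # Minimal sketch of a UDP relay that inserts a controlled one-way delay
    # (here 25 ms, one of the tested operating points in the 3-500 ms range).
    import socket
    import time
    from collections import deque

    LISTEN = ("0.0.0.0", 9000)                   # hypothetical local port
    FORWARD = ("peer.example.org", 9001)         # hypothetical peer address
    ADDED_DELAY = 0.025                          # seconds of artificial delay

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(LISTEN)
    sock.setblocking(False)

    pending = deque()                            # (release_time, payload)
    while True:
        try:
            data, _ = sock.recvfrom(2048)
            pending.append((time.monotonic() + ADDED_DELAY, data))
        except BlockingIOError:
            pass
        # forward every packet whose artificial delay has elapsed
        while pending and pending[0][0] <= time.monotonic():
            _, payload = pending.popleft()
            sock.sendto(payload, FORWARD)

A busy-polling loop like this trades CPU for timing precision; the real experiments require far tighter jitter control than a sketch of this kind provides.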

Precise Timing: Synchronization Using GPS or CDMA Clocks

Precise timing and synchronization of the many heterogeneous interactive streams of audio and video is required as they are acquired, processed and sent through a shared network to their destinations. This implies that the latency between players must be kept within some bounds despite the variability present in the network. In addition, musicians cannot rely on their local clocks to maintain synchronization over an entire performance, which may last hours, due to clock drift. We are developing several new transmission protocols and low-latency buffering schemes to meet these challenges. Our solution is to use timing signals from the Global Positioning Satellite (GPS) system, which can maintain synchronization among distributed clocks with an accuracy of 10 microseconds or better. CDMA cell phone transmitter sites broadcast time signals with similar accuracy that are synchronized to GPS. Either GPS or CDMA signals may be used, depending on the available signal level at an acquisition (server) or rendering (client) site. In general-purpose PCs, operating system delays can reach tens of milliseconds, which strongly affects the capture side, where data acquisition and timestamping occur, as well as the playback side. Our approach is to use real-time extensions to the operating system or dedicated real-time operating systems. We are developing techniques to realign streams accurately to maintain synchronization in the presence of clock drift. During realignment, streams may be truncated or padded; this must be done by interpolation or by repeating previous frames, without introducing artifacts in playback.
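As one illustration of the realignment step, the sketch below compares a packet's GPS-derived capture timestamp with the receiver's GPS-derived clock and pads or truncates the next audio block accordingly. The sample rate and the pad-by-repetition strategy are assumptions made for brevity; in practice the correction would be spread out by interpolation to avoid audible artifacts.

    # Sketch: realign an audio block against a GPS/CDMA-disciplined clock.
    SAMPLE_RATE = 48_000   # assumed sample rate

    def realign(block, sender_ts, receiver_now, expected_latency):
        """block: list of samples; timestamps and latency in seconds."""
        drift = (receiver_now - sender_ts) - expected_latency  # positive = stream is late
        n = int(round(drift * SAMPLE_RATE))
        if n > 0:                                   # late: drop samples to catch up
            return block[n:]
        if n < 0:                                   # early: repeat the last sample to pad
            return block + [block[-1]] * (-n)
        return block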

Error Concealment, Forward Error Correction (FEC), Retransmission

High fidelity reproduction and accurate synchronization require a high fidelity signal, free of loss and jitter across all participants, regardless of network conditions. Packet loss is inevitable in the Internet, and strict latency requirements severely limit flexibility in error recovery. Error concealment may be used in some cases (e.g., video) but not all. Forward error correction offers low latency, but is susceptible to burst loss. To mitigate this problem, we pursue a multi-path streaming technique, which is a promising approach to reducing the length of a burst loss [7]. This will also require investigation of the effects of multi-path streaming on latency characteristics. Retransmission may incur unacceptable latency, especially where large distances are involved. Hybrid approaches and various blends may be possible, but add complexity. Packet loss can also contribute to delay (in addition to reducing fidelity), for example while lost information is being reconstructed. Hence, characterizing loss patterns and developing a methodology for dealing with such losses are important aspects of this work.
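As a concrete (and deliberately simplified) illustration of the forward-error-correction option, the sketch below protects a small group of equal-length packets with a single XOR parity packet, so that any one lost packet in the group can be rebuilt without retransmission. The group framing is an assumption; practical FEC schemes for burst loss are considerably more elaborate.

    # Sketch: one XOR parity packet per group of equal-length packets.
    def xor_parity(packets):
        parity = bytearray(len(packets[0]))
        for p in packets:
            for i, b in enumerate(p):
                parity[i] ^= b
        return bytes(parity)

    def recover_lost(received, parity):
        """received: the packets that arrived (exactly one is missing)."""
        return xor_parity(list(received) + [parity])

The weakness noted above is visible here: if a burst takes out two packets of the same group, the single parity packet cannot recover either of them, which is what motivates spreading a group across multiple network paths.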

Low Latency, Real-Time Video Acquisition and Rendering

Standard compression techniques such as MPEG are designed to transmit video at reasonable resolution over limited bandwidth networks. This is done at the expense of long latencies needed to fill a buffer with many frames for inter-frame processing. This delay may be intolerable for real-time interaction as in the DIP scenario. With the increased bandwidth available in shared local-area and wide-area networks (such as Internet2), we are exploring different parts of the compression, quality and bandwidth space to find effective techniques for DIP video transmission. For real-time, interactive applications we are investigating new low-latency compression algorithms and new types of video cameras with high-speed analog-to-digital (A/D) conversion and a network, SDI (serial digital interface) or FireWire interface that output video compressed within a single frame (JPEG) at real-time rates (33 ms per frame). These cameras produce output at greater than 30 frames per second at QSIF (320x240) and NTSC (720x480) resolutions, and they are close to achieving 30 frames per second at high-definition resolution (1920x1080). Current standard desktop operating systems (e.g., Windows) are not designed for such short acquisition and rendering delays, and we are developing new Linux operating system modifications with reduced latency.
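A back-of-the-envelope comparison shows why single-frame (intra-only) coding matters for DIP. The group-of-pictures length below is a typical MPEG-2 value chosen for illustration, not a measured property of any particular encoder.

    # Buffering delay of inter-frame coding vs. frame-by-frame JPEG at 30 fps.
    FRAME_TIME_MS = 1000 / 30       # ~33 ms per frame
    GOP_FRAMES = 15                 # assumed group-of-pictures length

    inter_frame_delay = GOP_FRAMES * FRAME_TIME_MS   # ~500 ms before output begins
    intra_only_delay = 1 * FRAME_TIME_MS             # ~33 ms with per-frame JPEG
    print(f"inter-frame buffering: {inter_frame_delay:.0f} ms, intra-only: {intra_only_delay:.0f} ms")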

Low Latency, High-Quality, Real-Time Immersive Audio Acquisition and Rendering

For accurate reproduction of audio with full fidelity, dynamic range and directionality, immersive audio requires the transmission and recording of 16 or more channels of audio information, followed by processing and rendering of 12 channels of audio at the client site [8]-[11]. Accurate spatial reproduction of sound relative to visual images is essential for DIP, e.g., in rendering musical instruments being played or a singer's voice. Even a slight mismatch between the aurally perceived and visually observed positions of a sound causes a cognitive dissonance that can destroy the carefully planned suspension of disbelief [8],[10]. To minimize latency, we are developing new audio acquisition methods that place microphones very close to the participants (within a few meters) and reduce the analog-to-digital conversion, packetization and transmission time to less than 10 ms. Current standard recording techniques place microphones farther from the performers so that the ambient acoustics of, for example, a concert hall are captured in addition to the direct sound from the instruments. We are working on new virtual microphone techniques to recreate these acoustics at the client with low latency.
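One way to picture the virtual microphone idea is as a convolution at the client: the dry, close-miked signal is convolved with a room impulse response for each loudspeaker channel, recreating what a distant microphone in the hall would have captured. The sketch below assumes the 12 rendering channels mentioned above and impulse responses supplied by some venue model; both the function name and its inputs are illustrative, not our actual rendering pipeline.

    # Sketch: render hall acoustics at the client from a close-miked (dry) signal.
    import numpy as np

    def render_virtual_mics(dry, impulse_responses):
        """dry: 1-D array of samples; impulse_responses: one 1-D array per channel."""
        return [np.convolve(dry, ir)[: len(dry)] for ir in impulse_responses]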

Even with minimum delays in video and audio data acquisition, existing standard desktop operating systems (e.g., Windows) are not designed for short acquisition and rendering delays. As described previously, we are developing new Linux operating system modifications that reduce kernel latency. One expected result of our work is to demonstrate through subjective evaluation that the realism of immersion increases with video fidelity and the number of audio channels.
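On Linux, part of the kernel-latency problem can be sidestepped by giving the acquisition and rendering threads a real-time scheduling class, one of the operating-system measures alluded to above. The sketch below uses the standard SCHED_FIFO policy; the priority value is an arbitrary illustrative choice, and the call requires root or CAP_SYS_NICE.

    # Sketch: request real-time scheduling for the current process on Linux.
    import os

    def promote_to_realtime(priority=80):
        try:
            os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        except PermissionError:
            print("need root or CAP_SYS_NICE for SCHED_FIFO; staying best-effort")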

Real-time Continuous Media (CM) Stream Recording, Storage, Playback

The recording, archiving and playback of performances is essential. This requires a multi-channel, multi-modal recording system that can store a distributed performance event in real time, as it occurs. Ideally, such a system would allow us to play back the event with a delay that can range from very short to very long. For example, an audience member who tunes into a performance a few minutes late should be able to play it while the recording is still ongoing. Hence, the system should provide at least the following functionality: time-shifting of an event; live viewing with flashbacks; querying of streams while recording; skipping of breaks; and so on. The challenge is to provide real-time digital storage and playback of the many synchronized streams of video and audio data from scalable, distributed servers. The servers require that resources are allocated and maintained such that: (a) other streams are not affected (recording or playback); (b) resources such as disk bandwidth and memory are used efficiently; (c) recording is seamless with no hiccups or lost data; and (d) synchronization between multiple, related streams is maintained [12]-[17].

Our recent research in scalable real-time streaming architectures has resulted in the design, implementation and evaluation of Yima (see Figure 7) [12]-[17]. Yima is a second generation continuous media server that incorporates lessons learned from first generation research prototypes and also complies with industry standards in content format (e.g., MPEG-2, MPEG-4) and communication protocols (RTP/RTSP).

The Yima server is based on a scalable cluster design. Each cluster node is an off-the-shelf personal computer with attached storage devices and, for example, a Fast Ethernet or Gigabit Ethernet connection. The server software manages the storage and network resources for DIP to provide real-time service to the various clients that request media streams [17]. It provides storage and retrieval services for both HD video (MPEG-2 at 19.4 Mbps) and multi-channel immersive audio of up to 16 channels of uncompressed, accurately synchronized PCM audio samples (a total of 10.8 Mbps).
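A quick capacity check using the figures above illustrates how these rates bound a cluster node: each DIP client consumes the HD video stream plus the 16-channel audio stream, so the node's network link limits how many clients it can feed. The Fast Ethernet figure is the nominal link rate, not a measured throughput.

    # Per-client bandwidth and clients per node, using the rates quoted above.
    VIDEO_MBPS = 19.4           # MPEG-2 HD stream
    AUDIO_MBPS = 10.8           # 16-channel uncompressed PCM
    FAST_ETHERNET_MBPS = 100    # nominal node link capacity

    per_client = VIDEO_MBPS + AUDIO_MBPS                   # 30.2 Mbps
    clients_per_node = int(FAST_ETHERNET_MBPS // per_client)
    print(f"{per_client} Mbps per client, {clients_per_node} clients per Fast Ethernet node")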

We are expanding the capabilities of the Yima system to include real-time stream recording. For applications that require live streaming (i.e., the latency between the acquisition of the data streams and their rendering in a remote location is below a fraction of a second), the data needs to be stored on the server in real time. Such a capability would enable digital recording, time-shifted playback, editing, pause-and-resume, advertisement insertion and more. These functions currently exist to some degree in single-user, consumer personal video recorder (PVR) systems such as TiVo, ReplayTV and UltimateTV. However, these systems support only a single stream, a single media type and a single user at a time.

We plan to generalize this functionality in Yima as illustrated in Figure 7. Specifically, we will provide support for many users and many streams concurrently [14]. Assuming a Yima system supports a total of N = n + m streams, any combination of n concurrent retrievals and m writes should be possible. The challenges include the design of a fine-grained locking mechanism such that the same stream can be read back with minimal delay after it is written. In addition, the scheduling module of the Yima server must be modified to support both retrieval and write threads, and write threads may be assigned different priorities than read threads. Yima supports the retrieval of many different media types (e.g., MPEG-1, MPEG-2, MPEG-4, multi-channel audio, panoramic video) with various bandwidth requirements; the writing capabilities should be just as flexible and support a variety of different streams. For stored streams, flow control mechanisms can be used to regulate the optimal data flow between the source and the destination. With live stream acquisition, the source must be in complete control and the destination (recorder) must absorb any variation in the data rate [15],[16]. With these extensions in place, Yima will have the functionality necessary to support the proposed DIP experiments.
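The following sketch illustrates the flavor of fine-grained locking we have in mind for reading a stream back while it is still being recorded: the writer appends fixed-size blocks and publishes how many are committed, and a reader blocks only until the specific block it needs exists. The in-memory block list and the class name are simplifications for illustration; the real server works against disk-resident, striped storage.

    # Sketch: read-while-write access to a live stream at block granularity.
    import threading

    class LiveStream:
        def __init__(self):
            self.blocks = []                         # committed blocks, in order
            self.cond = threading.Condition()

        def append(self, block):                     # called by the single writer
            with self.cond:
                self.blocks.append(block)
                self.cond.notify_all()

        def read(self, index):                       # called by any number of readers
            with self.cond:
                self.cond.wait_for(lambda: index < len(self.blocks))
                return self.blocks[index]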

Human Factors Studies: Psychophysical, Perceptual, Artistic, Performance Evaluation

We are using the two-way interactive audio and video functions of our experimental system described in Figures 4 and 5 as a test platform to measure the effects of variable latency, feedback and echo cancellation algorithms, and various compression and error control techniques on the quantitative and perceived audio and video quality experienced by human participants. We are making psychophysical measurements of these factors as a function of network bandwidth, compression level, noise, packet loss, latency, etc. Algorithms for recognizing and tracking musical structures will also be used to quantify synchronization among players. These methods will be used to define a metric and measure the effects of latency and presence on the musical performance. These measurements are being done for various active and passive interaction scenarios, including distributed interactive musical performances with two, three or more participants, and for other types of personal interaction (meetings, lectures, etc.). The goal of these measurements is to supplement existing knowledge in these fields and to measure the comfort levels of participants and musicians in two-way interaction. We are also exploring the minimum level of interaction needed between several musicians for effective distributed collaboration, and developing engineering metrics and parameters applicable to two-way interaction scenarios in general. We expect that some types of musical interaction among the participants may have different minimum fidelity and latency needs. For example, the interaction between the musicians at the top of Figure 2 may have lower fidelity requirements but more stringent latency requirements than the interaction between the conductor and the musicians.
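One simple synchronization metric of the kind described above pairs each detected note onset from one player with the nearest onset from the other and reports the mean absolute offset. Onset detection itself is assumed to happen elsewhere; the function below is only an illustrative scoring step, not the metric the project will ultimately adopt.

    # Sketch: mean absolute onset asynchrony between two players (onset times in seconds).
    def mean_onset_asynchrony(onsets_a, onsets_b):
        offsets = [min(abs(a - b) for b in onsets_b) for a in onsets_a]
        return sum(offsets) / len(offsets)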

Robust Integration Into a Seamless Presentation to the Participants

This challenge is the integration of research in the areas described above into an experimental testbed and demonstration system. Our integrated approach differs from previous approaches in that it considers the entire end-to-end process of acquisition, transmission and rendering as a complete system to be jointly optimized, rather than as a set of individual pieces of technology. We are integrating these technical developments into an experimental three-site DIP testbed and demonstration system for fully interconnected audio and video communication and interaction over networks. The system will test three-way interactive communication as a function of latency, bandwidth, compression level, noise, packet loss, etc. Not all sites may use the same quality of audio and video acquisition and rendering hardware in a given experiment. One possibility is the use of new network protocols that transmit side information among active participants with extremely low latency but low data rates for musician synchronization, phrasing and coordination. In this way we can test the usability of the system for participants who may have lower data rate connections. We will test minimum-latency synchronized streaming of audio and video from two or more servers and make psychophysical measurements in mixed active and passive interaction scenarios. We will conduct distributed musical performance experiments in two scenarios: with three active participants, and with two active participants and a passive audience. These experiments will further improve our understanding of the effects of fundamental communication limits on distributed human interaction. We are undertaking a complete set of psychophysical and user-centered science tests and measurements for a variety of entertainment, gaming, simulation, tele-conferencing, social gathering and performance scenarios.
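As an illustration of how compact such a side-information channel could be, the sketch below packs a GPS-referenced timestamp, a beat index and a tempo value into 14 bytes per message. The field layout is an assumption made for this example, not a protocol defined by the project.

    # Sketch: a 14-byte side-information message for musician synchronization.
    import struct

    SIDE_FMT = "!dIH"   # 8-byte timestamp (s), 4-byte beat index, 2-byte tempo (bpm)

    def pack_side_info(timestamp, beat, tempo_bpm):
        return struct.pack(SIDE_FMT, timestamp, beat, tempo_bpm)

    def unpack_side_info(payload):
        return struct.unpack(SIDE_FMT, payload)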