DIP Server, Network and Client Components
Low Latency Real-Time Continuous Media (CM) Stream Transmission and Network Protocols
DIP poses many unique networking challenges. In addition to the high-bandwidth
consumed by HD video and multiple channels of audio, DIP requires very
low latency, precise synchronization, and smooth, uninterrupted data flow
among many media streams in order to achieve a realistic immersive
experience [4]-[7]. The single greatest limiting factor for human interaction
in this immersive environment is the effective transmission latency (delay).
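To make the scale of this constraint concrete, the following back-of-the-envelope sketch (our own illustrative numbers, not measurements from the testbed) estimates one-way propagation delay alone, before any packetization, queuing or processing is added:

    # One-way propagation delay in optical fiber (illustrative sketch only).
    FIBER_SIGNAL_SPEED = 2.0e8   # meters per second, roughly 2/3 the speed of light

    def propagation_delay_ms(distance_km: float) -> float:
        """Return one-way propagation delay in milliseconds."""
        return (distance_km * 1000.0) / FIBER_SIGNAL_SPEED * 1000.0

    # Hypothetical fiber path lengths between participant sites.
    for km in (500, 2000, 4000):
        print(f"{km:5d} km -> {propagation_delay_ms(km):5.1f} ms one way")
    # 500 km -> 2.5 ms, 2000 km -> 10.0 ms, 4000 km -> 20.0 ms: a coast-to-coast
    # path alone consumes most of a ~25 ms interaction budget.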
Typical latencies for the participants are shown in Figure 3. Traditional
video and audio compression has been used to overcome bandwidth limitations
of network transmission, at the expense of greatly increased delay. In
DIP and other interactive applications, the delay due to compression may
be intolerable, requiring the use of high bandwidth networks to transmit
uncompressed (or minimally compressed) immersidata [4]-[7]. Initial experiments
have shown that maximum allowable latencies range from tens of milliseconds
to one hundred milliseconds at most, depending on the experimental conditions
and content. The latency in sending acquired data through a network involves
packetization, unavoidable propagation delays (due to the physical speed of
data transmission), queuing at each hop, and processing at each hop. Our focus is on approaches
to reducing queuing delays as well as on reducing the number of hops taken;
this is a difficult problem as we work with best-effort networks having
highly varying and unpredictable traffic patterns. The protocol stack
has a relatively low delay already, and processing delays are difficult
to control, as we do not own the routers in the shared Internet2 environment. We have assembled a test system with Linux PCs running UDP/IP over
a commodity 100BaseT network, sound cards, amplifiers, microphones and
speakers. We recently performed several experiments to determine the effects
of latency on performances with two distributed musicians. The latency in
the channel is precisely controlled from 3 ms to 500 ms, corresponding
to the maximum round-trip-time (RTT) in the Internet. The communication
was full duplex and the streaming software was custom-written. We found
that the latency tolerance is highly dependent on the piece of music and
the instruments involved. For a piece by Piazzolla (a fast paced tango),
the tolerable latency using a synthesized accordion was about 25 ms, which
increased to 100 ms with a synthesized piano. We observed that musicians
can adapt to some level of varying delays depending on the piece, and
that spatial immersive audio reproduction made a huge difference in musician
comfort by re-creating the acoustics of a large performance hall.
Precise Timing: Synchronization Using GPS or CDMA Clocks
Precise timing and synchronization of the many heterogeneous interactive
streams of audio and video are required as the streams are acquired,
processed and sent through a shared network to their destinations. This
implies that the latency between players must be maintained within some
bounds despite the variability present in the network. In addition, musicians
cannot rely on their local clocks to maintain synchronization over the entire
performance, which may last hours, due to clock drift. We are developing
several new transmission protocols and low-latency buffering schemes to meet
these challenges. Our solution is the use of timing signals from the Global
Positioning System (GPS), which is capable of maintaining synchronization
among distributed clocks with an accuracy of 10 microseconds or better. CDMA
cell phone transmitter sites broadcast time signals with similar accuracy
that are synchronized to GPS. Either GPS or CDMA signals may be used,
depending on the available signal level at an acquisition (server) or
rendering (client) site. In general-purpose PCs, operating system delays can
reach up to tens of milliseconds, which has a strong impact on the capture
side, where data acquisition and timestamping occur, and on the playback
side. Our approach is to use real-time extensions to the operating system or
dedicated real-time operating systems. We are developing techniques to
realign streams accurately to maintain synchronization in the presence of
clock drift. During realignment, streams may be truncated or padded, and this
must be done by interpolating or repeating previous frames without
introducing artifacts in playback.
Error Concealment, Forward Error Correction (FEC), Retransmission
High fidelity reproduction and accurate synchronization require a high
fidelity signal, free of loss and jitter, across all participants regardless
of network conditions. Packet loss is inevitable in the Internet, and strict
latency requirements severely limit flexibility in error recovery. Error
concealment may be used in some cases (e.g., video) but not all. Forward
error correction offers low latency but is susceptible to burst loss. To
mitigate this problem, we pursue a multi-path streaming technique, which is a
promising approach to reducing the length of a burst loss [7]. This will also
require investigation of the effects of multi-path streaming on latency
characteristics. Retransmission may incur unacceptable latency, especially
where large distances are involved. Hybrid approaches and various blends may
be possible, but they add complexity. Packet loss is a further barrier
because it can contribute to delays (in addition to reducing fidelity), for
example during the process of reconstructing lost information. Hence,
characterizing losses and developing methods for dealing with them is an
important aspect of this work.
Low Latency, Real-Time Video Acquisition and Rendering
Standard compression techniques such as MPEG are designed to transmit
video at reasonable resolution over limited bandwidth networks. This is
done at the expense of long latencies needed to fill a buffer with many
frames for inter-frame processing. This delay may be intolerable for real-time
interaction as in the DIP scenario. With the increased bandwidth available
in shared local-area and wide-area networks (such as Internet2), we are
exploring different parts of the compression, quality and bandwidth space
to find effective techniques for DIP video transmission. For real-time,
interactive applications we are investigating new low-latency compression
algorithms and new types of video cameras with high speed analog-to-digital
(A/D) conversion and a network, SDI (serial digital interface), or FireWire
interface that outputs video compressed within a single frame (JPEG) at
real-time rates (33 ms per frame). These cameras produce output at greater
than 30 frames per second at QSIF (320x240) resolution and NTSC (720x480)
resolution. They are close to achieving 30 frames per second at high-definition
resolution (1920x1080). Current standard desktop operating systems (e.g.
Windows) are not designed for such short acquisition and rendering delays,
and we are developing new Linux operating system modifications with reduced
latency.
Low Latency, High-Quality, Real-Time Immersive Audio Acquisition and Rendering
For accurate reproduction of audio with full fidelity, dynamic range and
directionality, immersive audio requires the transmission and recording of 16
or more channels of audio information, followed by the processing and
rendering of 12 channels of audio at the client site [8]-[11]. Accurate
spatial reproduction of sound relative to visual images is essential for DIP,
e.g., in rendering musical instruments being played or a singer's voice. Even
a slight mismatch between the aurally-perceived and visually-observed
positions of a sound causes a cognitive dissonance that can destroy the
carefully-planned suspension of disbelief [8],[10]. To minimize latency, we
are developing new audio acquisition methods that place microphones very
close to the participants (within a few meters) and reduce the
analog-to-digital conversion, packetization and transmission time to less
than 10 ms. Current standard recording techniques place microphones at a
greater distance from the performers so that the ambient acoustics of, for
example, a concert hall are captured in addition to the direct sound from the
instruments. We are working on new virtual microphone techniques to recreate
the acoustics at the client with low latency. Even with minimal delays in
video and audio data acquisition, existing standard desktop operating systems
(e.g., Windows) are not designed for short acquisition and rendering delays.
As described previously, we are developing new Linux operating system
modifications that reduce kernel latency. One expected result of our work is
to demonstrate through subjective evaluation that the realism of immersion
increases with video fidelity and the number of audio channels.
Real-time Continuous Media (CM) Stream Recording, Storage, Playback
The recording, archiving and playback of performances is essential. This
requires a multi-channel, multi-modal recording system that can store
a distributed performance event in real-time, as it occurs. Ideally, such
a system would allow us to play back the event with a delay that can range
from very short to very long. For example, an audience member who tunes
into a performance a few minutes late should be able to play it while
the recording is still ongoing. Hence, the system should provide at least
the following functionalities: time-shifting of an event; live viewing
with flashbacks; querying of streams while recording; and skipping of
breaks, etc. The challenge is to provide real-time digital storage and
playback of the many synchronized streams of video and audio data from
scalable, distributed servers. The servers require that resources are
allocated and maintained such that: (a) other streams are not affected
(recording or playback); (b) resources such as disk bandwidth and memory
are used efficiently; (c) recording is seamless with no hiccups or lost
data; and (d) synchronization between multiple, related streams is maintained
[12]-[17]. Our recent research in scalable real-time streaming architectures has
resulted in the design, implementation and evaluation of Yima (see Figure
7) [12]-[17]. Yima is a second generation continuous media server that
incorporates lessons learned from first generation research prototypes
and also complies with industry standards in content format (e.g., MPEG-2,
MPEG-4) and communication protocols (RTP/RTSP). The Yima server is based on a scalable cluster design. Each cluster node
is an off-the-shelf personal computer with attached storage devices and,
for example, a Fast Ethernet or Gigabit Ethernet connection. The server
software manages the storage and network resources for DIP to provide
real-time service to the various clients that are requesting media streams
[17]. It provides storage and retrieval services for both HD video (MPEG-2
at 19.4 Mbps) and multi-channel immersive audio of up to 16 channels of
uncompressed PCM audio samples with accurate synchronization (a total
of 10.8 Mbps). We are expanding the capabilities of the Yima system to include real-time
stream recording. For applications that require live streaming (i.e.,
the latency between the acquisition of the data streams and their rendering
in a remote location is a fraction of a second or less), the data needs to
be stored on the server in real-time. Such a capability would enable digital
recording, time-shifted playback, editing, pause-and-resume, advertisement
insertions and more. These functionalities currently exist to some degree
in single-user, consumer personal video recorder (PVR) systems such as
TiVo, ReplayTV, and UltimateTV. However, these systems support only a
single stream, a single media type and a single user at a time. We plan
to generalize this functionality in Yima as illustrated in Figure 7. Specifically,
we will provide support for many users and many streams concurrently [14].
Assuming a Yima system supports a total of N = n + m streams, any combination
of n concurrent retrievals and m writes should be possible. The challenges
include the design of a fine-grained locking mechanism such that the same
stream can be read back after writing with minimal delay. In addition,
the scheduling module of the Yima server should be modified to support both
retrieval and write threads. Write threads may be assigned different priorities
than read threads. Yima supports the retrieval of many different media
types (e.g., MPEG-1, MPEG-2, MPEG-4, multi-channel audio, panoramic video)
with various bandwidth requirements. The writing capabilities should be
just as flexible and support a variety of different streams. For stored
streams, flow control mechanisms can be used to regulate the data flow
between the source and the destination. With live stream acquisition, the
source must be in complete control and the destination (recorder) must
absorb any variation in the data rate [15],[16].
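As a rough illustration of the read-after-write behavior described above, the sketch below shows one way per-block locking could let a stream that is still being recorded be played back with minimal delay. The class and method names are our own hypothetical choices; this is not the actual Yima scheduler or locking code.

    import threading
    from collections import defaultdict

    class RecordedStream:
        """Illustrative per-block locking for concurrent record and playback
        (hypothetical sketch, not the Yima implementation)."""

        def __init__(self):
            self._meta = threading.Condition()             # guards bookkeeping below
            self._block_locks = defaultdict(threading.Lock)
            self._blocks = {}                              # block index -> data
            self._last_written = -1                        # highest block stored so far

        def write_block(self, index, data):
            # Writer (recording) thread: hold only this block's lock while storing.
            with self._meta:
                lock = self._block_locks[index]
            with lock:
                self._blocks[index] = data
            with self._meta:
                self._last_written = max(self._last_written, index)
                self._meta.notify_all()                    # wake readers waiting on this block

        def read_block(self, index, timeout=1.0):
            # Reader (playback) thread: wait briefly until the block is written,
            # then fetch it under the per-block lock; other blocks remain available.
            with self._meta:
                self._meta.wait_for(lambda: self._last_written >= index, timeout)
                lock = self._block_locks[index]
            with lock:
                return self._blocks.get(index)

A real server would additionally enforce disk-bandwidth admission control across the n retrieval and m recording threads; the point here is only that reads and writes of the same stream need not serialize on a single stream-wide lock.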
With these extensions in place, Yima will have the functionality necessary
to support the proposed DIP experiments.
Human Factors Studies: Psychophysical, Perceptual, Artistic, Performance Evaluation
We are using the two-way interactive audio and video functions of our
experimental system described in Figures 4 and 5 as a test platform to
measure the effects of variable latency, feedback and echo cancellation
algorithms, and various compression and error control techniques on the
quantitative and perceived audio and video quality experienced by human
participants. We are making psychophysical measurements of these factors as a
function of network bandwidth, compression level, noise, packet loss,
latency, etc. Algorithms for recognizing and tracking musical structures will
also be used to quantify synchronization among players. These methods will be
used to define a metric and measure the effects of latency and presence on
the musical performance. These measurements are being done for various active
and passive interaction scenarios, including distributed interactive musical
performances with two, three or more participants, and for other types of
personal interaction (meetings, lectures, etc.). The goal of these
measurements is to supplement existing knowledge in these fields and to
measure the comfort levels of participants and musicians in two-way
interaction. We also explore the minimum level of interaction needed between
several musicians for effective distributed collaboration, and develop
engineering metrics and parameters applicable to two-way interaction
scenarios in general. We expect that some types of musical interaction among
the participants may have different minimum fidelity and latency needs. For
example, the interaction between the musicians at the top of Figure 2 may
have lower fidelity requirements but more stringent latency requirements than
the interaction between the conductor and the musicians.
Robust Integration Into a Seamless Presentation to the Participants
The final challenge is the integration of the research in the areas described
above into an experimental testbed and demonstration system. Our integrated
approach is different from previous approaches and uniquely considers the
entire end-to-end process of acquisition, transmission and rendering as a
complete system to be jointly optimized, rather than as a set of individual
pieces of technology. We are integrating these technical developments into an
experimental three-site DIP testbed and demonstration system for fully
interconnected audio and video communication and interaction over networks.
The system will test three-way interactive communications as a function of
latency, bandwidth, compression level, noise, packet loss, etc. Not all sites
may use the same quality of audio and video acquisition and rendering
hardware in a given experiment. One possibility is the use of new network
protocols that transmit side information among active participants with
extremely low latency but at low data rates for musician synchronization,
phrasing and coordination. In this way we test the usability of the system
for participants who may have lower data rate connections. We will test
minimum latency synchronized streaming of audio and video from two or more
servers and make psychophysical measurements in mixed active and passive
interaction scenarios. We will conduct distributed musical performance
experiments in two scenarios: with three active participants; and with two
active participants and a passive audience.
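As a small illustration of the low-rate side-information channel mentioned above, the sketch below sends a compact timing/phrasing datagram over UDP. The message layout, field names and example address are our own assumptions for illustration, not a protocol defined by this project.

    import socket
    import struct
    import time

    # Hypothetical side-information datagram: timestamp (microseconds, ideally
    # from a GPS/CDMA-disciplined clock), measure and beat counters, and tempo.
    SYNC_FORMAT = "!QIHf"          # 8 + 4 + 2 + 4 = 18 bytes per message

    def send_sync(sock, addr, measure, beat, tempo_bpm):
        timestamp_us = int(time.time() * 1_000_000)   # stand-in for a GPS timestamp
        payload = struct.pack(SYNC_FORMAT, timestamp_us, measure, beat, tempo_bpm)
        sock.sendto(payload, addr)

    def recv_sync(sock):
        data, _ = sock.recvfrom(64)
        return struct.unpack(SYNC_FORMAT, data)        # (timestamp_us, measure, beat, tempo_bpm)

    # Example (hypothetical peer address): one 18-byte datagram per beat at
    # 120 beats per minute is well under 1 kbps, so this channel adds negligible
    # load next to the audio and video streams.
    if __name__ == "__main__":
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        send_sync(sock, ("127.0.0.1", 9000), measure=12, beat=3, tempo_bpm=120.0)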
These experiments will further improve our understanding of the effects of fundamental communication limits on distributed human interaction. We are undertaking a complete set of psychophysical and user-centered science tests and measurements for a variety of entertainment, gaming, simulation, tele-conferencing, social gathering and performance scenarios.