`
`Scalable Internet video using MPEG-4
`
`Hayder Radha*, Yingwei Chen, Kavitha Parthasarathy, Robert Cohen
`
Philips Research, 345 Scarborough Rd, Briarcliff Manor, New York, 10510, USA
`
`Abstract
`
Real-time streaming of audio-visual content over Internet Protocol (IP) based networks has enabled a wide range of multimedia applications. An Internet streaming solution has to provide real-time delivery and presentation of continuous media content while compensating for the lack of Quality-of-Service (QoS) guarantees over the Internet. Due to the variation and unpredictability of bandwidth and other performance parameters (e.g. packet loss rate) over IP networks, in general, most of the proposed streaming solutions are based on some type of a data loss handling method and a layered video coding scheme. In this paper, we describe a real-time streaming solution suitable for non-delay-sensitive video applications such as video-on-demand and live TV viewing.
The main aspects of our proposed streaming solution are:
1. An MPEG-4 based scalable video coding method using both a prediction-based base layer and a fine-granular enhancement layer;
2. An integrated transport-decoder buffer model with priority re-transmission for the recovery of lost packets, and continuous decoding and presentation of video.
In addition to describing the above two aspects of our system, we also give an overview of a recent activity within MPEG-4 video on the development of a fine-granular-scalability coding tool for streaming applications. Results for the performance of our scalable video coding scheme and the re-transmission mechanism are also presented. The latter results are based on actual testing conducted over Internet sessions used for streaming MPEG-4 video in real-time. © 1999 Published by Elsevier Science B.V. All rights reserved.
`
`1. Introduction
`
`Real-time streaming of multimedia content over
`Internet Protocol (IP) networks has evolved as one
`of the major technology areas in recent years.
`A wide range of interactive and non-interactive
`multimedia Internet applications, such as news on-
`demand, live TV viewing, and video conferencing
`rely on end-to-end streaming solutions. In general,
`
`* Corresponding author.
`E-mail address: hmr@philabs.research.philips.com (H. Radha)
`
`streaming solutions are required to maintain real-
`time delivery and presentation of the multimedia
`content while compensating for the lack of Quality-
`of-Service (QoS) guarantees over IP networks.
`Therefore, any Internet streaming system has to
`take into consideration key network performance
`parameters such as bandwidth, end-to-end delay,
`delay variation, and packet loss rate.
`To compensate for the unpredictability and
`variability in bandwidth between the sender and
`receiver(s) over the Internet and Intranet net-
`works, many streaming solutions have resorted
`to variations of layered (or scalable) video cod-
`ing methods (see for example [22,24,25]). These
`
0923-5965/99/$ - see front matter © 1999 Published by Elsevier Science B.V. All rights reserved.
PII: S0923-5965(99)00026-0
`
`
H. Radha et al. / Signal Processing: Image Communication 15 (1999) 95–126
`
`solutions are typically complemented by packet
`loss recovery [22] and/or error resilience mecha-
`nisms [25] to compensate for the relatively high
`packet-loss rate usually encountered over the Inter-
`net [2,30,32,33,35,47].
`Most of the references cited above and the ma-
`jority of related modeling and analytical research
`studies published in the literature have focused on
`delay-sensitive (point-to-multipoint or multipoint-
`to-multipoint) applications such as video con-
`ferencing over the Internet Multicast Backbone
– MBone. When compared with other types of applications (e.g. entertainment over the Web), these delay-sensitive applications impose different kinds of constraints, such as low encoder complexity and very low end-to-end delay. Meanwhile, entertainment-oriented Internet applications such as news and sports on-demand, movie previews and even 'live' TV viewing represent a major (and growing) element of the real-time multimedia experience over the global Internet [9].
Moreover, many of the proposed streaming solutions are based on either proprietary video coding methods or video coding standards that were developed prior to the phenomenal growth of the Internet. However, under the audio, visual, and system activities of the ISO MPEG-4 work, many aspects of the Internet have been taken into consideration when developing the different parts of the standard. In particular, a recent activity in MPEG-4 video has focused on the development of a scalable compression tool for streaming over IP networks [4,5].
`In this paper, we describe a real-time streaming
`system suitable for non-delay-sensitive1 video ap-
`plications (e.g. video-on-demand and live TV view-
`ing) based on the MPEG-4 video-coding standard.
`The main aspects of our real-time streaming system
`are:
1. A layered video coding method using both a prediction-based base layer and a fine-granular enhancement layer: This solution follows the
`
1 Delay sensitive applications are normally constrained by an end-to-end delay of about 300–500 ms. Real-time, non-delay-sensitive applications can typically tolerate a delay on the order of a few seconds.
`
`recent development in the MPEG-4 video group
`for the standardization of a scalable video com-
`pression tool for Internet streaming applications
`[3,4,6].
2. An integrated transport-decoder buffer model
`with a re-transmission based scheme for the re-
`covery of lost packets, and continuous decoding
`and presentation of video.
`The remainder of this paper is organized as follows.
`In Section 2 we provide an overview of key design
`issues one needs to consider for real-time, non-
`delay-sensitive IP streaming solutions. We will also
`highlight how our proposed approach addresses
`these issues. Section 3 describes our real-time
`streaming system and its high level architecture.
Section 4 details the MPEG-4 based scalable video coding scheme used by the system, and provides an overview of the MPEG-4 activity on fine-granular-scalability. Simulation results for our scalable video compression solution are also presented in Section 4. In Section 5, we introduce the integrated transport layer-video decoder buffer model with re-transmission. We also evaluate the effectiveness of the re-transmission scheme based on actual tests conducted over the Internet involving real-time streaming of MPEG-4 video.
`
`2. Design considerations for real-time streaming
`
`The following are some high-level issues that
`should be considered when designing a real-time
streaming system for entertainment-oriented applications.
`
`2.1. System scalability
`
The wide range of variation in effective bandwidth and other network performance characteristics over the Internet [33,47] makes it necessary to pursue a scalable streaming solution. The variation in QoS measures (e.g. effective bandwidth) is not only present across the different access technologies to the Internet (e.g. analog modem, ISDN, cable modem, LAN, etc.), but it can even be observed over relatively short periods of time over a particular session [8,33]. For example, a recent study shows that the effective bandwidth
`
`
of a cable modem access link to the Internet may vary between 100 kbps and 1 Mbps [8]. Therefore, any video-coding method and associated streaming solution has to take into consideration this wide range of performance characteristics over IP networks.
`
2.2. Video compression complexity, scalability, and coding efficiency
`
The video content used for on-demand applications is typically compressed off-line and stored for later viewing through unicast IP sessions. This observation has two implications. First, the complexity of the video encoder is not as major an issue as it is for interactive multipoint-to-multipoint or even point-to-point applications (e.g. video conferencing and video telephony), where compression has to be supported by every terminal. Second, since the content is not being compressed in real-time, the encoder cannot employ a variable-bit-rate (VBR) method to adapt to the available bandwidth. This emphasizes the need for coding the material using a scalable approach. In addition, for multicast or unicast applications involving a large number of point-to-multipoint sessions, only one encoder (or possibly very few encoders for simulcast) is (are) usually used. This observation also leads to a relaxed constraint on the complexity of the encoder, and highlights the need for video scalability. As a consequence of the relaxed video-complexity constraint for entertainment-oriented IP streaming, there is no need to totally avoid such techniques as motion estimation, which can provide a great deal of coding efficiency when compared with replenishment-based solutions [24].
Although it is desirable to generate a scalable video stream for a wide range of bit-rates (e.g. 15 kbps for analog-modem Internet access to around 1 Mbps for cable-modem/ADSL access), it is virtually impossible to achieve a good coding-efficiency/video-quality tradeoff over such a wide range of rates. Meanwhile, it is equally important to emphasize the impracticality of coding the video content using simulcast compression at multiple bit-rates to cover the same wide range. First, simulcast compression requires the creation of many streams (e.g. at 20, 40, 100, 200, 400, 600, 800 and 1000 kbps). Second, once a particular simulcast bitstream (coded at a given bit-rate, say R) is selected to be streamed over a given Internet session (which initially can accommodate a bit-rate of R or higher), then due to possible wide variation of the available bandwidth over time, the Internet session bandwidth may fall below the bit-rate R. Consequently, this decrease in bandwidth could significantly degrade the video quality. One way of dealing with this issue is to switch, in real-time, among different simulcast streams. This, however, increases complexities on both the server and the client sides, and introduces synchronization issues.
A good practical alternative is to use video scalability over a few ranges of bit-rates. For example, one can create a scalable video stream for the analog/ISDN access bit-rates (e.g. to cover 20–100 kbps bandwidth), and another scalable stream for a higher bit-rate range (e.g. 200 kbps–1 Mbps). This approach leads to another important requirement. Since each scalable stream will be built on top of a video base layer, this approach implies that multiple base layers will be needed (e.g. one at 20 kbps, another at 200 kbps, and possibly another at 1 Mbps). Therefore, it is quite desirable to deploy a video compression standard that provides good coding efficiency over a rather wide range of possible bit-rates (in the above example 20 kbps, 200 kbps and 1 Mbps). In this regard, due to the many video-compression tools provided by MPEG-4 for achieving high coding efficiency, in particular at low bit-rates, MPEG-4 becomes a very attractive choice for compression.
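The two-range approach above amounts to a simple stream-selection rule at the server. The following sketch (a hypothetical helper, using the example bit-rates from the text) picks the scalable stream whose base layer the session can sustain:

```python
# Sketch: choose which scalable stream (base rate + fine-granular range)
# to serve a client. Hypothetical helper, not part of the paper's system;
# the bit-rate ranges are the example values from the text.

# Each stream covers [base_kbps, max_kbps].
STREAMS = [
    {"name": "analog/ISDN", "base_kbps": 20, "max_kbps": 100},
    {"name": "cable/ADSL", "base_kbps": 200, "max_kbps": 1000},
]

def select_stream(available_kbps):
    """Choose the stream with the highest base rate the session can sustain.

    The base layer must fit within the available bandwidth; the
    fine-granular enhancement layer then fills the remainder.
    """
    feasible = [s for s in STREAMS if s["base_kbps"] <= available_kbps]
    if not feasible:
        return None  # session cannot sustain even the lowest base layer
    return max(feasible, key=lambda s: s["base_kbps"])

assert select_stream(80)["name"] == "analog/ISDN"
assert select_stream(500)["name"] == "cable/ADSL"
assert select_stream(10) is None
```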
`
`2.3. Streaming server complexity
`
`Typically, a unicast server has to output tens,
`hundreds, or possibly thousands of video streams
`simultaneously. This greatly limits the type of pro-
`cessing the server can perform on these streams in
`real-time. For example, although the separation of
`an MPEG-2 video stream into three temporal
`layers (I, P and B) is a feasible approach for a
`scalable multicast (as proposed in [22]), it will be
quite difficult to apply the same method to a large
`
`
`number of unicast streams. This is the case since the
`proposed layering requires some parsing of the
`compressed video bitstream. Therefore, it is desir-
`able to use a very simple scalable video stream that
`can be easily processed and streamed for unicast
`sessions. Meanwhile, the scalable stream should be
`easily divisible into multiple streams for multicast
`IP similar to the receiver-driven paradigm used in
`[22,24].
Consequently, we adopt a single, fine-granular enhancement layer that satisfies these requirements. This simple scalability approach has two other advantages. First, it requires only a single enhancement layer decoder at the receiver (even if the original fine-granular stream is divided into multiple sub-streams). Second, the impact of packet losses is localized to the particular enhancement-layer picture(s) experiencing the losses. These and other advantages of the proposed scalability approach will become clearer later in the paper.
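Because the enhancement layer is fine-granular, dividing it into sub-streams and recombining them at the receiver reduces to simple byte-range operations. The sketch below is a hypothetical illustration (the actual partitioning offsets would come from the server's rate control), assuming each picture's enhancement-layer bits form a progressive prefix code:

```python
def split_el(el_bits: bytes, cut_points):
    """Partition one fine-granular EL bitstream into sub-streams at the
    given byte offsets (sketch; offsets would come from rate control)."""
    bounds = [0] + list(cut_points) + [len(el_bits)]
    return [el_bits[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]

def recombine(substreams_received):
    """Receiver side: concatenate the contiguous prefix of sub-streams
    that actually arrived; a single EL decoder consumes the result."""
    prefix = []
    for s in substreams_received:
        if s is None:  # a layer the client did not subscribe to (or lost)
            break
        prefix.append(s)
    return b"".join(prefix)

el = bytes(range(10))
subs = split_el(el, [4, 7])             # three sub-streams: 4, 3 and 3 bytes
assert recombine(subs) == el            # all layers -> full quality
assert recombine([subs[0], None, subs[2]]) == subs[0]  # decode what arrived
```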
`
`2.4. Client complexity and client-server
`communication issues
`
There is a wide range of clients that can access the Internet and experience a multimedia streaming application. Therefore, a streaming solution should take into consideration a scalable decoding approach that meets different client-complexity requirements. In addition, one has to incorporate robustness into the client for error recovery and handling, keeping in mind key client-server complexity issues. For example, the deployment of an elaborate feedback scheme between the receivers and the sender (e.g. for flow control and error handling) is not desirable due to the potential implosion of messages at the sender [2,34,35]. However, simple re-transmission techniques have been proven effective for many unicast and multicast multimedia applications [2,10,22,34]. Consequently, we employ a re-transmission method for the recovery of lost packets. This method is combined with a client-driven flow control model that ensures the continuous decoding and presentation of video while minimizing the server complexity.
`
`In summary, a real-time streaming system
`tailored for entertainment IP applications should
`
provide a good balance among these requirements: (a) scalability of the compressed video content, (b) coding efficiency across a wide range of bit-rates, (c) low complexity at the streaming server, and (d) handling of lost packets and end-to-end flow control using a primarily client-driven approach to minimize server complexity and meet overall system scalability requirements. These elements are addressed in our streaming system as explained in the following sections.
`
`3. An overview of the scalable video streaming
`system
`
`The overall architecture of our scalable video
`streaming system is shown in Fig. 1.2 The system
`consists of three main components: an MPEG-4
`based scalable video encoder, a real-time streaming
`server, and a corresponding real-time streaming
`client which includes the video decoder.
`MPEG-4 is an international standard being de-
`veloped by the ISO Moving Picture Experts Group
`for the coding and representation of multimedia
`content.3 In addition to providing standardized
`methods for decoding compressed audio and video,
`MPEG-4 provides standards for the representa-
`tion, delivery, synchronization, and interactivity of
`audiovisual material. The powerful MPEG-4 tools
`yield good levels of performance at low bit-rates,
`while at the same time they present a wealth of new
`functionality [20].
`The video encoder generates two bitstreams:
`a base-layer and an enhancement-layer compressed
`video. An MPEG-4 compliant stream is coded
`based on an MPEG-4 video Veri"cation Model
`(VM).4 This stream, which represents the base
`
`2 The "gure illustrates the architecture for a single, unicast
`server-client session. Extending the architecture shown in the
`"gure to multiple unicast sessions, or to a multicast scenario is
`straightforward.
`3 http://drogo.cselt.stet.it/mpeg/
`4 The VM is a common set of tools that contain detailed
`encoding and decoding algorithms used as reference for testing
`new functionalities. The video encoding was based on the
`MPEG-4 video group, MoMuSys software Version VCD-06-
`980625.
`
`
`Fig. 1. The end-to-end architecture of an MPEG-4 based scalable video streaming system.
`
layer of the scalable video encoder output, is coded at a low bit-rate. The particular rate selected
`depends on the overall range of bit-rates targeted
`by the system and the complexity of the source
material. For example, to serve clients with analog/ISDN modems' Internet access, the base-layer video is coded at around 15–20 kbps. The video enhancement layer is coded using a single fine-granular-scalable bitstream. The method used for coding the enhancement layer follows the recent development in the MPEG-4 video fine-granular-scalability (FGS) activity for Internet streaming applications [4,5]. For the above analog/ISDN-modem access example, the enhancement layer stream is over-coded to a bit-rate around 80–100 kbps. Due to the fine granularity of the enhancement layer, the server can easily select and adapt to the desired bit-rate based on the conditions of the network. The scalable video coding aspects of the system are covered in Section 4.
The server outputs the MPEG-4 base-layer video at a rate that follows very closely the bit-rate at which the stream was originally coded. This aspect of the server is crucial for minimizing underflow and overflow events at the client. Jitter is introduced at the server output due, in part, to the packetization of the compressed video streams. Real-time Transport Protocol (RTP) packetization [15,39] is used to multiplex and synchronize the
`
`
base and enhancement layer video. This is accomplished through the time-stamp fields supported in the RTP header. In addition to the base and enhancement streams, the server re-transmits lost packets in response to requests from the client. The three streams (base, enhancement and re-transmission) are sent using the User Datagram Protocol (UDP) over IP. The re-transmission requests between the client and the server are carried in an end-to-end, reliable control session using Transmission Control Protocol (TCP). The server rate-control aspects of the system are covered in Section 5.
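For reference, the synchronization described above uses the 32-bit timestamp field of the fixed RTP header (RFC 1889). The sketch below builds such a header; the payload-type values are illustrative, not those used by the authors:

```python
import struct

def rtp_header(payload_type, seq, timestamp, ssrc, marker=0):
    """Build the 12-byte fixed RTP header (version 2, no padding,
    no extension, no CSRCs)."""
    byte0 = 2 << 6                                   # V=2, P=0, X=0, CC=0
    byte1 = (marker << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1, seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)

def timestamp_of(header):
    """Extract the 32-bit timestamp from a fixed RTP header."""
    return struct.unpack("!BBHII", header)[3]

# Base- and enhancement-layer packets carrying the same picture share
# a timestamp, which is how the client re-synchronizes the two layers.
ts = 90000  # e.g. 1 s at a 90 kHz media clock
base_pkt = rtp_header(payload_type=96, seq=1, timestamp=ts, ssrc=0x1234)
enh_pkt = rtp_header(payload_type=97, seq=7, timestamp=ts, ssrc=0x5678)
assert timestamp_of(base_pkt) == timestamp_of(enh_pkt) == 90000
```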
In addition to a real-time MPEG-4 based, scalable video decoder, the client includes buffers and a control module to regulate the flow of data and ensure continuous and synchronized decoding of the video content. This is accomplished by deploying an Integrated Transport Decoder (ITD) buffer model which supports packet-loss recovery through re-transmission requests. The ITD buffer model and the corresponding re-transmission method are explained in Section 5.
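The client-driven recovery loop can be sketched as follows: the client tracks RTP sequence numbers on the UDP media streams and, on detecting a gap, sends a re-transmission request over the TCP control session. The message format here is purely hypothetical; the paper does not specify one.

```python
def detect_gaps(received_seqs):
    """Return the sequence numbers missing between the smallest and
    largest observed (16-bit wrap-around ignored for brevity)."""
    received = set(received_seqs)
    lo, hi = min(received), max(received)
    return [s for s in range(lo, hi + 1) if s not in received]

def build_retx_request(missing):
    """Encode a hypothetical NACK message for the TCP control
    session: 'RETX seq1,seq2,...'."""
    return ("RETX " + ",".join(str(s) for s in missing)).encode()

seqs = [100, 101, 104, 105]        # packets 102 and 103 were lost
missing = detect_gaps(seqs)
assert missing == [102, 103]
assert build_retx_request(missing) == b"RETX 102,103"
```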
`
`4. MPEG-4 based scalable video coding for
`streaming
`
`4.1. Overview of video scalability
`
Many scalable video-coding approaches have been proposed recently for real-time Internet applications. In [22] a temporal layering scheme is applied to MPEG-2 video coded streams where different picture types (I, P and B) are separated into corresponding layers (I, P and B video layers). These layers are multicasted into separate streams allowing receivers with different session-bandwidth characteristics to subscribe to one or more of these layers. In conjunction with this temporal layering scheme, a re-transmission method is used to recover lost packets. In [25] a spatio-temporal layering scheme is used where temporal compression is based on hierarchical conditional replenishment and spatial compression is based on a hybrid DCT/subband transform coding.
In the scalable video coding system developed in [45], a 3-D subband transform with camera-pan compensation is used to avoid motion compensation drift due to partial reference pictures. Each subband is encoded with progressively decreasing quantization step sizes. The system can support, with a single bitstream, a range of bit-rates from kilobits to megabits and various picture resolutions and frame rates. However, the coding efficiency of the system depends heavily on the type of motion in the video being encoded. If the motion is other than camera panning, then the effectiveness of the temporal redundancy exploitation is limited. In addition, the granularity of the supported bit-rates is fairly coarse.
Several video scalability approaches have been adopted by video compression standards such as MPEG-2, MPEG-4 and H.263. Temporal, spatial and quality (SNR) scalability types have been defined in these standards. All of these types of scalable video consist of a Base Layer (BL) and one or multiple Enhancement Layers (ELs). The BL part of the scalable video stream represents, in general, the minimum amount of data needed for decoding that stream. The EL part of the stream represents additional information, and therefore it enhances the video signal representation when decoded by the receiver.
For each type of video scalability, a certain scalability structure is used. The scalability structure defines the relationship among the pictures of the BL and the pictures of the enhancement layer. Fig. 2 illustrates examples of video scalability structures. MPEG-4 also supports object-based scalability structures for arbitrarily shaped video objects [17,18].
Another type of scalability, which has been primarily used for coding still images, is fine-granular scalability. Images coded with this type of scalability can be decoded progressively. In other words, the decoder can start decoding and displaying the image after receiving a very small amount of data. As more data is received, the quality of the decoded image is progressively enhanced until the complete information is received, decoded, and displayed. Among leading international standards, progressive image coding is one of the modes supported in JPEG [16] and the still-image, texture coding tool in MPEG-4 video [17].
`
`
`Fig. 2. Examples of video scalability structures.
`
When compared with non-scalable methods, a disadvantage of scalable video compression is its inferior coding efficiency. In order to increase coding efficiency, scalable video methods normally rely on relatively complex structures (such as the spatial and temporal scalability examples shown in Fig. 2). By using information from as many pictures as possible from both the BL and EL, coding efficiency can be improved when compressing an enhancement-layer picture. However, using prediction among pictures within the enhancement layer either eliminates or significantly reduces the fine-granular scalability feature, which is desirable for environments with a wide range of available bandwidth (e.g. the Internet). On the other hand, using a fine-granular scalable approach (e.g. progressive JPEG or the MPEG-4 still-image coding tool) to compress each picture of a video sequence prevents the employment of prediction among the pictures, and consequently degrades coding efficiency.
`
4.2. MPEG-4 video based fine-granular-scalability (FGS)
`
In order to strike a balance between coding-efficiency and fine-granularity requirements, a recent activity in MPEG-4 adopted a hybrid scalability structure characterized by a DCT motion compensated base layer and a fine-granular scalable enhancement layer [4,5]. This scalability structure is illustrated in Fig. 3. The video coding scheme used by our system is based on this scalability structure [5]. Under this structure, the server can transmit part or all of the over-coded enhancement layer to the receiver. Therefore, unlike the scalability solutions shown in Fig. 2, the FGS structure enables the streaming system to adapt to varying network conditions. As explained in Section 2, the FGS feature is especially needed when the video is pre-compressed and the condition of the particular session (over which the
`
Fig. 3. Video scalability structure with fine-granularity.
`
Fig. 4. A streaming system employing the MPEG-4 based fine-granular video scalability.
`
`bitstream will be delivered) is not known at the time
`when the video is coded.
`Fig. 4 shows the internal architecture of the
`MPEG-4 based FGS video encoder used in our
`streaming system. The base layer carries a min-
`imally acceptable quality of video to be reliably
`delivered using a re-transmission, packet-loss re-
`covery method. The enhancement layer improves
upon the base layer video, fully utilizing the estimated available bandwidth (Section 5.5). By employing a motion compensated base layer, coding efficiency from temporal redundancy exploitation is partially retained. The base and single-enhancement layer streams can be either stored for later transmission, or can be directly streamed
`by the server in real-time. The encoder interfaces
`with a system module that performs estimates of
the range of bandwidth [R_min, R_max] that can be
`
`
`
`
`
`
supported over the desired network. Based on this information, the module conveys to the encoder the bit-rate R_BL <= R_min that must be used to compress the base-layer video.5 The enhancement layer is over-coded using a bit-rate (R_max - R_BL). It is important to note that the range [R_min, R_max] can be determined off-line for a particular set of Internet access technologies. For example, R_min = 20 kbps and R_max = 100 kbps can be used for analogue-modem/ISDN access technologies. More sophisticated techniques can also be employed in real-time to estimate the range [R_min, R_max]. For unicast streaming, an estimate for the available bandwidth R can be generated in real-time for a particular session. Based on this estimate, the server transmits the enhancement layer using a bit-rate R_EL:

R_EL = min(R_max - R_BL, R - R_BL).

Due to the fine granularity of the enhancement layer, its real-time rate control aspect can be implemented with minimal processing (Section 5.5). For multicast streaming, a set of intermediate bit-rates R_1, R_2, ..., R_N can be used to partition the enhancement layer into substreams. In this case, N fine-granular streams are multicasted using the bit-rates:

R_E1 = R_1 - R_BL, R_E2 = R_2 - R_1, ..., R_EN = R_N - R_(N-1),

where R_BL < R_1 < R_2 < ... < R_(N-1) < R_N <= R_max.

Using a receiver-driven paradigm [24], the client can subscribe to the base layer and one or more of the enhancement layers' streams. As explained earlier, one of the advantages of the FGS approach is that the EL sub-streams can be combined at the receiver into a single stream and decoded using a single EL decoder.
`
`
`
5 Typically, the base layer encoder will compress the signal using the minimum bit-rate R_min. This is especially the case when the BL encoding takes place off-line prior to the time of transmitting the video signal.
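The rate arithmetic above is simple enough to sketch directly (rates in kbps; this restates the formulas from the text, not the authors' implementation):

```python
def el_rate_unicast(r_bl, r_max, r_estimate):
    """Enhancement-layer rate for a unicast session:
    R_EL = min(R_max - R_BL, R - R_BL)."""
    return min(r_max - r_bl, r_estimate - r_bl)

def el_rates_multicast(r_bl, cut_rates):
    """Sub-stream rates for multicast partitioning:
    R_E1 = R_1 - R_BL, and R_Ei = R_i - R_(i-1) for i > 1."""
    rates = [cut_rates[0] - r_bl]
    rates += [cut_rates[i] - cut_rates[i - 1] for i in range(1, len(cut_rates))]
    return rates

# Example: base layer at 20 kbps, EL over-coded up to 100 kbps.
assert el_rate_unicast(20, 100, 60) == 40    # limited by the session estimate
assert el_rate_unicast(20, 100, 150) == 80   # limited by the over-coded EL
assert el_rates_multicast(20, [40, 70, 100]) == [20, 30, 30]
```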
`
There are many alternative compression methods one can choose from when coding the BL and EL layers of the FGS structure shown in Fig. 3. MPEG-4 is highly anticipated to be the next widely-deployed audio-visual standard for interactive multimedia applications. In particular, MPEG-4 video provides superior low-bit-rate coding performance when compared with other MPEG standards (i.e. MPEG-1 and MPEG-2), and provides object-based functionality. In addition, MPEG-4 video has demonstrated its coding efficiency even for medium-to-high bit-rates. Therefore, we use the DCT-based MPEG-4 video tools for coding the base layer. There are many excellent documents and papers that describe the MPEG-4 video coding tools [17,18,43,44].
For the EL encoder shown in Fig. 4, any embedded or fine-granular compression scheme can be used. Wavelet-based solutions have shown excellent coding-efficiency and fine-granularity performance for image compression [41,37]. In the following sub-section, we will discuss our wavelet solution for coding the EL of the MPEG-4 based scalable video encoder. Simulation results of our MPEG-4 based FGS coding method will be presented in Section 4.3.2.
`
4.3. The FGS enhancement layer encoder using wavelets

In addition to achieving a good balance between coding efficiency and fine granularity, there are other criteria that need to be considered when selecting the enhancement layer coding scheme. These criteria include complexity, maturity and acceptability of that scheme by the technical and industrial communities for broad adoption. The complexity of such a scheme should be sufficiently low, in particular, for the decoder. The technique should be reasonably mature and stable. Moreover, it is desirable that the selected technique has some roots in MPEG or other standardization bodies to facilitate its broad acceptability.
Embedded wavelet coding satisfies all of the above criteria. It has proven very efficient in coding still images [38,41] and is also efficient in coding video signals [46]. It naturally provides fine-granular scalability, which has always been one of its
`
`
strengths when compared to other transform-based coding schemes. Because wavelet-based image compression has been studied for many years now, and because its relationship with sub-band coding is well established, there exist fast algorithms and implementations to reduce its complexity. Moreover, MPEG-4 video includes a still-image compression tool based on the wavelet transform [17]. This still-image coding tool supports three compression modes, one of which is fine granular. In addition, the image-compression methods currently competing under the JPEG-2000 standardization activities are based on the wavelet transform. All of the above factors make wavelet-based coding for the FGS enhancement layer a very attractive choice.
Ever since the introduction of EZW (Embedded Zerotrees of Wavelet coefficients) by Shapiro [41], much research has been directed toward efficient progressive encoding of images and video using wavelets. Progress in this area has culminated recently with the SPIHT (Set Partitioning In Hierarchical Trees) algorithm developed by Said and Pearlman [38]. The still-image, texture coding tool in MPEG-4 also represents a variation of EZW and gives comparable performance to that of SPIHT.
Compression results and proposals for using different variations of the EZW algorithm have been recently submitted to the MPEG-4 activity on FGS video [6,17,19,40]. These EZW-based proposals include the scalable video coding solution used in our streaming system. Below, we give a brief overview of the original EZW method and highlight how the recent wavelet-based MPEG-4 proposals (for coding the FGS EL video) differ from the original EZW algorithm. Simulation results are shown at the end of the section.
`
`4.3.1. EZW-based coding of the enhancement-layer
`video
The different variations of the EZW approach [6,17,19,37,38,40,41] are based on: (a) computing a wavelet-transform of the image, and (b) coding the resulting transform by partitioning the wavelet coefficients into sets of hierarchical, spatial-orientation trees. An example of a spatial-orientation tree is shown in Fig. 5. In the original EZW algorithm
`
`Fig. 5. Examples of the hierarchical, spatial-orientation trees of
`the zero-tree algorithm.
`
[41], each tree is rooted at the highest level (most coarse sub-band) of the multi-layer wavelet transform. If there are m layers of sub-bands in the hierarchical wavelet transform representation of the image, then the roots of the trees are in the LL_m sub-band of the hierarchy as shown in Fig. 5. If the number of coefficients in sub-band LL_m is N_m, then there are N_m spatial-orientation trees representing the wavelet transform of the image.
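The spatial-orientation trees use the usual zero-tree parent-child map: a coefficient at position (i, j) on one level has four children at (2i, 2j), (2i, 2j+1), (2i+1, 2j) and (2i+1, 2j+1) on the next finer level. A simplified sketch (the indexing convention here is the standard one, assumed rather than taken from the paper; the special-case mapping at the LL_m roots is omitted):

```python
def children(i, j, level):
    """Children of the wavelet coefficient at (i, j) on the given
    decomposition level, under the standard zero-tree parent-child
    map. Level 1 is the finest; its coefficients are leaves."""
    if level == 1:
        return []  # finest sub-band: no descendants
    return [(2 * i, 2 * j), (2 * i, 2 * j + 1),
            (2 * i + 1, 2 * j), (2 * i + 1, 2 * j + 1)]

def descendants(i, j, level):
    """All descendants D(n) of node n = (i, j) down to the finest level."""
    out = []
    for ci, cj in children(i, j, level):
        out.append((ci, cj))
        out.extend(descendants(ci, cj, level - 1))
    return out

# A root in the coarse band of a 3-level hierarchy has 4 children
# and 4 + 16 = 20 descendants in total.
assert len(children(0, 0, 3)) == 4
assert len(descendants(0, 0, 3)) == 20
```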
In EZW, coding efficiency is achieved based on the hypothesis of a 'decaying spectrum': the energies of the wavelet coefficients are expected to decay in the direction from the root of a spatial-orientation tree toward its descendants. Consequently, if the wavelet coefficient c_n of a node n is found insignificant (relative to some threshold T_k = 2^k), then it is highly probable that all descendants D(n) of the node n are also insignificant (relative to the same threshold T_k). If the root of a tree and all of its descendants are insignificant, then this tree is referred to as a Zero-Tree Root (ZTR). If a node n is insignificant (i.e. |c_n| < T_k) but one (or more) of its descendants is (are) significant, then this scenario represents a violation of the 'decaying spectrum' hypothesis. Such a node is referred to as an Isolated Zero-Tree (IZT). In the original EZW
`