`
`3456 (U-2)
`
Kyoya Tsutsui, Hiroshi Suzuki, Osamu Shimoyoshi,
Mito Sonohara, Kenzo Akagiri, and Robert M. Heddle
Sony Corporate Research Laboratories
Shinagawa-ku, Tokyo, Japan
`
`Presented at
`
`the 93rd Convention
`1992 October 1-4
`San Francisco
`
`
This preprint has been reproduced from the author's advance
manuscript, without editing, corrections or consideration by the
Review Board. The AES takes no responsibility for the
contents.

Additional preprints may be obtained by sending request and
remittance to the Audio Engineering Society, 60 East 42nd St.,
New York, New York 10165-2520, USA.

All rights reserved. Reproduction of this preprint, or any portion
thereof, is not permitted without direct permission from the
Journal of the Audio Engineering Society.

AN AUDIO ENGINEERING SOCIETY PREPRINT
`
`BOSE 2021
`SDI TECHNOLOGIES, INC. v BOSE CORPORATION
IPR2014-00465
`
`
`
`ATRAC: Adaptive Transform Acoustic Coding for MiniDisc
`
Kyoya Tsutsui
Hiroshi Suzuki
Osamu Shimoyoshi
Mito Sonohara
Kenzo Akagiri
Robert M. Heddle

Sony Corporate Research Laboratories
6-7-35 Kitashinagawa, Shinagawa-ku, Tokyo 141 Japan
`
`Abstract
`
ATRAC is an audio coding system based on psychoacoustic principles. The input
signal is divided into three subbands which are then transformed into the frequency
domain using a variable block length. Transform coefficients are grouped into
nonuniform bands to reflect the human auditory system, and then quantized on the
basis of dynamic sensitivity and masking characteristics. ATRAC compresses compact
disc audio to approximately 1/5 of the original data rate with virtually no loss
in sound quality.
`
`1 Introduction
`
Recently, there has been increasing consumer demand for a portable, recordable,
high-quality digital audio medium. The MiniDisc system was developed to meet this
demand. The MiniDisc is based on a 64 mm optical or magneto-optical disc which
has approximately 1/5 of the data storage capacity of a standard compact disc.
Despite the reduced storage capacity, it was necessary that the MiniDisc maintain
high sound quality and a playing time of 74 minutes. The ATRAC (Adaptive
Transform Acoustic Coding) data compression system was therefore designed to meet
the following criteria:
`
- Compression of 16-bit 44.1 kHz stereo audio into less than 1/5 of the original
data rate with minimal reduction in sound quality.
- Simple and inexpensive hardware implementation suitable for portable players
and recorders.
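
The first criterion can be made concrete with a little arithmetic. The sketch below uses only the figures quoted above (16 bits, 44.1 kHz, stereo, 74 minutes, a factor of 1/5); it counts audio payload only and ignores error correction and disc formatting overhead, which the text does not describe. The variable names are illustrative.

```python
# Figures taken from the text above.
SAMPLE_RATE_HZ = 44_100
BITS_PER_SAMPLE = 16
CHANNELS = 2
PLAYING_TIME_MIN = 74

# Uncompressed compact disc audio rate in bits per second.
cd_rate_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE * CHANNELS

# Upper bound on the coded rate implied by "less than 1/5 of the original".
target_rate_bps = cd_rate_bps / 5

# Audio payload for the full playing time, in bytes (no disc overhead included).
seconds = PLAYING_TIME_MIN * 60
uncompressed_bytes = cd_rate_bps * seconds / 8
compressed_bytes_max = target_rate_bps * seconds / 8

print(f"CD audio rate:       {cd_rate_bps / 1000:.1f} kbit/s")
print(f"1/5 of that rate:    {target_rate_bps / 1000:.2f} kbit/s")
print(f"74 min uncompressed: {uncompressed_bytes / 1e6:.1f} MB")
print(f"74 min at 1/5 rate:  {compressed_bytes_max / 1e6:.1f} MB")
```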
`
When digital audio data is compressed, a certain amount of quantization noise is
normally introduced into the signal. The goal of many audio coding systems [1-6]
is to control the time-frequency distribution of this noise in such a way as to
render it inaudible to the human ear. If this is completely successful, the
reconstructed signal will be indistinguishable from the original.
`
In general, audio coders operate by decomposing the signal into a set of units, each
corresponding to a certain range in time and frequency. Using this time-frequency
distribution, the signal is analyzed according to psychoacoustic principles. This
analysis indicates which units are critical and must be coded with high precision,
and which units are less sensitive and can tolerate some quantization noise without
degrading the perceived sound quality. Based on this information, the available bits
are allocated to the time-frequency units. The spectral coefficients in each unit are
then quantized using the allocated bits. In the decoder, the quantized spectra are
reconstructed according to the bit allocation and then synthesized into an audio
signal.
`
The ATRAC system operates as above, with several enhancements. ATRAC uses
psychoacoustics not only in the bit allocation algorithm, but also in the
time-frequency splitting. Using a combination of subband coding and transform coding
techniques, the input signal is analyzed in nonuniform frequency divisions which
emphasize the important low-frequency regions. In addition, ATRAC uses a
transform block length which adapts to the input signal. This ensures efficient
coding of stationary passages without sacrificing time resolution during transient
passages.
`
`This paper begins with a review of the relevant psychoacoustic principles. The
`ATRAC encoder is then described in terms of time-frequency splitting, quantization
`of spectral coefficients, and bit allocation. Finally, the ATRAC decoder is described.
`
`2 Psychoacoustics
`
`2.1 Equi-loudness Curves
`
The sensitivity of the ear varies with frequency. The ear is most sensitive to
frequencies in the neighborhood of 4 kHz; sound pressure levels which are just
detectable at 4 kHz are not detectable at other frequencies. In general, two tones of
equal power but different frequency will not sound equally loud. The perceived
loudness of a sound may be expressed in sones, where 1 sone is defined as the
loudness of a 40 dB tone at 1 kHz. Equi-loudness curves at several loudness levels
are shown in Figure 1. The curve labeled “hearing threshold in quiet” indicates the
minimum level (by definition, 0 sone) at which the ear can detect a tone at a given
frequency.

These curves indicate that the ear is more sensitive at some frequencies than it is at
others. Distortion at insensitive frequencies will be less audible than at sensitive
frequencies.
`
`2.2 Masking
`
Masking [7] occurs when one sound is rendered inaudible by another. Simultaneous
masking occurs when the two sounds occur at the same time, such as when a
conversation (the masked signal) is rendered inaudible by a passing train (the
masker). Backward masking occurs when the masked signal ends before the masker
begins; forward masking occurs when the masked signal begins after the masker has
ended.
`
Masking becomes stronger as the two sounds get closer together in both time and
frequency. For example, simultaneous masking is stronger than either forward or
backward masking because the sounds occur at the same time. Masking experiments are
generally performed by using a narrow band of white noise as the masking signal, and
measuring the just-audible level of a pure tone at various times and frequencies.
Examples of simultaneous masking and temporal masking are shown in Figure 2 and
Figure 3 respectively.
`
Important conclusions may be drawn from these graphs. First, simultaneous masking
is more effective when the frequency of the masked signal is equal to or higher than
that of the masker. Second, while forward masking is effective for a considerable
time after the masker has stopped, backward masking may be effective only for less
than 2 or 3 ms before the onset of the masker.
`
`2.3 Critical Bands
`
Critical bands [7] arose from the idea that the ear analyzes the audible frequency
range using a set of subbands. The frequencies within a critical band are similar in
terms of the ear’s perception, and are processed separately from other critical bands.
Critical bands arose naturally from experiments in human hearing and can also be
derived from the distribution of sensory cells in the inner ear. Critical bands can be
thought of as the frequency scale used by the ear.

The critical band scale is shown in Table 1. It is clear that critical bands are much
narrower at low frequencies than at high frequencies; in fact, three quarters of the
critical bands are located below 5 kHz. This indicates that the ear receives more
information from the low frequencies and less from higher frequencies.
`
Table 1: Discrete critical bands

[Only the upper half of the table survives in this copy. The rows for the lower
critical bands, and the band numbers of the rows below, are not recoverable.]

            Frequency (Hz)
     Low        High       Width
    2000        2320         320
    2320        2700         380
    2700        3150         450
    3150        3700         550
    3700        4400         700
    4400        5300         900
    5300        6400        1100
    6400        7700        1300
    7700        9500        1800
    9500       12000        2500
   12000       15500        3500
   15500       22050        6550
`
`
`
`3 The ATRAC Encoder
`
`A block diagram of the encoder structure is shown in Figure 4. The encoder has
`three components. The analysis block decomposes the signal into spectral coefficients
`grouped into Block Floating units (BFU’s). The bit allocation block divides the
`available bits between the BFU’s, allocating fewer bits to insensitive units. The
`quantization block quantizes each spectral coefficient to the specified wordlength.
`
`3.1 Time-Frequency Analysis
`
This block (Figure 6) generates the BFU’s in three steps, combining techniques from
subband coding and transform coding. First, the signal is broken down into three
subbands: 0-5.5 kHz, 5.5-11 kHz, and 11-22 kHz. Each of these subbands is then
transformed into the frequency domain, producing a set of spectral coefficients.
Finally, these spectral coefficients are grouped nonuniformly into BFU’s.
`
The subband decomposition is performed using Quadrature Mirror Filters (QMF’s)
[9-10]. The input signal is divided into upper and lower frequency bands by the first
QMF, and the lower frequency band is divided again by a second QMF. Use of
QMF’s ensures that time-domain aliasing caused by the subband decomposition will
be cancelled during reconstruction.
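
The two-stage split described above can be sketched with the shortest possible QMF pair, the 2-tap Haar filters, written here in polyphase form. This is only an illustration of the tree structure and of alias cancellation on reconstruction; the text does not give ATRAC's actual filter coefficients, and a practical coder would use much longer filters for sharper band separation. All function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)

def qmf_split(x):
    """Split an even-length signal into low and high bands at half the sample rate."""
    low, high = [], []
    for a, b in zip(x[0::2], x[1::2]):
        low.append((a + b) / SQRT2)   # lowpass branch, decimated by 2
        high.append((a - b) / SQRT2)  # highpass (mirror) branch, decimated by 2
    return low, high

def qmf_merge(low, high):
    """Inverse of qmf_split: the aliasing from decimation cancels exactly."""
    x = []
    for l, h in zip(low, high):
        x.append((l + h) / SQRT2)
        x.append((l - h) / SQRT2)
    return x

def analyze_three_bands(x):
    lower, high = qmf_split(x)   # first QMF: 0-11 kHz and 11-22 kHz
    low, mid = qmf_split(lower)  # second QMF: 0-5.5 kHz and 5.5-11 kHz
    return low, mid, high

def synthesize_three_bands(low, mid, high):
    return qmf_merge(qmf_merge(low, mid), high)

# One 512-sample frame of a two-tone test signal.
frame = [math.sin(0.05 * n) + 0.3 * math.sin(2.5 * n) for n in range(512)]
low, mid, high = analyze_three_bands(frame)
rebuilt = synthesize_three_bands(low, mid, high)
max_error = max(abs(a - b) for a, b in zip(frame, rebuilt))
print(len(low), len(mid), len(high), max_error < 1e-12)
```

The three outputs together contain exactly as many samples as the input frame, so the split itself adds no data.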
`
Each of the three subbands is then transformed into the frequency domain using the
Modified Discrete Cosine Transform (MDCT) [11-12]. The MDCT allows up to 50%
overlap between time-domain windows, leading to improved frequency resolution
while maintaining critical sampling. Instead of a fixed transform block length,
however, ATRAC chooses the block length adaptively based on the signal
characteristics in each band. There are two modes: long mode (11.6 ms) and short
mode (1.45 ms in the high frequency band, 2.9 ms in the others). Normally long mode
is used to provide good frequency resolution. However, problems may occur during
attack portions of the signal. Specifically, the quantization noise is spread over the
entire signal block, and the initial quantization noise is not masked (Figure 8a);
this problem is called pre-echo. In order to prevent pre-echo, ATRAC switches to
short mode (Figure 8b) when it detects an attack signal. In this case, because there
is only a short segment of noise before the attack, the noise will be masked by
backward masking (section 2.2). In long mode the noise preceding the attack lasts
too long to be hidden, because backward masking is effective only for a very short
time. Thus, ATRAC achieves efficient coding in stationary regions while
responding quickly to transient passages.
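
The overlap and critical-sampling properties of the MDCT can be checked numerically. The sketch below is a direct (slow) textbook MDCT with 50% overlap and a sine window, which is one window satisfying the time-domain aliasing cancellation condition of [11-12]. The window choice, the block size N = 8, and all function names are assumptions made for illustration; the text does not specify ATRAC's window.

```python
import math

def sine_window(N):
    # Symmetric window with w[n]^2 + w[n+N]^2 = 1.
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]

def mdct(block, N):
    """N coefficients from 2N windowed samples."""
    return [sum(block[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

def imdct(coeffs, N):
    """2N time-aliased samples from N coefficients (2/N scaling for windowed use)."""
    return [(2.0 / N) * sum(coeffs[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                            for k in range(N))
            for n in range(2 * N)]

def analyze(x, N):
    # Blocks of 2N samples advance by N samples: 50% overlap, N new coefficients per hop.
    w = sine_window(N)
    return [mdct([w[n] * x[s + n] for n in range(2 * N)], N)
            for s in range(0, len(x) - 2 * N + 1, N)]

def synthesize(spectra, N):
    # Window again and overlap-add; the aliasing terms of adjacent blocks cancel.
    w = sine_window(N)
    out = [0.0] * (N * (len(spectra) + 1))
    for i, coeffs in enumerate(spectra):
        y = imdct(coeffs, N)
        for n in range(2 * N):
            out[i * N + n] += w[n] * y[n]
    return out

N = 8
x = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(10 * N)]
spectra = analyze(x, N)
y = synthesize(spectra, N)

# The first and last N samples are covered by only one block, so compare the interior.
interior_error = max(abs(a - b) for a, b in zip(x[N:-N], y[N:-N]))
print(len(spectra), len(spectra[0]), interior_error < 1e-9)
```

Each hop of N input samples yields N coefficients, which is what critical sampling means here, while every coefficient is computed from a window of 2N samples.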
`
Note that short mode is not necessary for signal decay, because the quantization
noise will be masked by forward masking which lasts much longer than backward
masking. For maximum flexibility, the block size mode can be selected independently
for each band.
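
The block durations quoted above can be related to sample counts in each band. The following sketch assumes that each QMF stage halves the sample rate, so the two lower bands run at 11.025 kHz and the high band at 22.05 kHz; the resulting sample counts are inferred from the quoted durations, not stated in the text.

```python
SAMPLE_RATE_HZ = 44_100
LONG_MS = 11.6
SHORT_MS = {"low": 2.9, "mid": 2.9, "high": 1.45}

# Assumed subband sample rates after the two-stage QMF tree.
BAND_RATE_HZ = {
    "low": SAMPLE_RATE_HZ // 4,
    "mid": SAMPLE_RATE_HZ // 4,
    "high": SAMPLE_RATE_HZ // 2,
}

def samples(duration_ms, rate_hz):
    """Nearest whole number of samples for a duration at a given rate."""
    return round(duration_ms * rate_hz / 1000)

layout = {}
for band, rate in BAND_RATE_HZ.items():
    long_n = samples(LONG_MS, rate)
    short_n = samples(SHORT_MS[band], rate)
    # (samples per long block, samples per short block, short blocks per long block)
    layout[band] = (long_n, short_n, long_n // short_n)

for band, (long_n, short_n, count) in layout.items():
    print(f"{band:>4}: long = {long_n} samples, short = {short_n} samples, "
          f"{count} short blocks per long block")
```

Under these assumptions one long block in every band spans the same 11.6 ms, so a band can switch modes without disturbing the timing of the other two.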
`
The MDCT spectral coefficients are then grouped into BFU’s. Each unit contains a
fixed number of coefficients. In the case of long mode, the units reflect 11.6 ms of a
narrow frequency band; in the case of short mode, each unit reflects a shorter time
but a wider frequency band (Figure 9). Note that the concentration of BFU’s is
greater at low frequencies than at high frequencies; this reflects the psychoacoustic
characteristics of the human ear.
`
`3.2 Spectral Quantization
`
The spectral values are quantized using two parameters: wordlength and scale factor.
The scale factor defines the full-scale range of the quantization, and the wordlength
defines the precision within that scale. All coefficients within a BFU share the same
wordlength and scale factor, reflecting the psychoacoustic similarity of the grouped
frequencies.

The scale factor is chosen from a fixed list of possibilities, and reflects the magnitude
of the spectral coefficients in each BFU. The wordlength is determined by the bit
allocation algorithm (section 3.3).
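
As a rough illustration of how a scale factor and a wordlength jointly quantize one BFU, consider the sketch below. The scale factor table (steps of a third of an octave), the symmetric rounding rule, and the treatment of wordlengths below 2 as "no bits" are all assumptions made for the example; the text only states that the scale factor comes from a fixed list and that the wordlength sets the precision.

```python
import bisect

# Hypothetical fixed list of scale factors, ascending.
SCALE_FACTORS = [2.0 ** (j / 3.0) for j in range(-20, 12)]

def quantize_bfu(coeffs, wordlength):
    """Return (scale factor index, integer codes) for one Block Floating Unit."""
    peak = max(abs(c) for c in coeffs)
    # Smallest listed scale factor that covers the largest coefficient.
    index = min(bisect.bisect_left(SCALE_FACTORS, peak), len(SCALE_FACTORS) - 1)
    if wordlength < 2:
        return index, [0] * len(coeffs)  # assumption: too few bits to code anything
    max_code = 2 ** (wordlength - 1) - 1
    scale = SCALE_FACTORS[index]
    codes = [max(-max_code, min(max_code, round(c / scale * max_code))) for c in coeffs]
    return index, codes

def dequantize_bfu(index, codes, wordlength):
    """Decoder side: rebuild coefficients from the stored parameters."""
    if wordlength < 2:
        return [0.0] * len(codes)
    max_code = 2 ** (wordlength - 1) - 1
    return [code / max_code * SCALE_FACTORS[index] for code in codes]

bfu = [0.8, -0.3, 0.05, -0.6]
for wordlength in (4, 8):
    index, codes = quantize_bfu(bfu, wordlength)
    rebuilt = dequantize_bfu(index, codes, wordlength)
    worst = max(abs(a - b) for a, b in zip(bfu, rebuilt))
    print(wordlength, index, codes, f"max error {worst:.4f}")
```

A longer wordlength shrinks the quantization step within the same full-scale range, which is why the bit allocation directly controls the noise level in each unit.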
`
`For each sound frame (corresponding to 512 input points), the following information
`is stored on disc:
`
`- MDCT block size mode (long or short).
`
`- Wordlength data for each Block Floating unit.
`
`- Scale factor code for each Block Floating unit.
`
`- Quantized spectral coefficients.
`
In order to guarantee accurate reconstruction of the input signal, critical data such
as the block size mode, wordlength and scale factor data may be stored redundantly.
Information about quantities of redundant data is also stored on the disc.
`
`3.3 Bit Allocation
`
`The bit allocation algorithm divides the available data bits between the various
`BFU’s. Units with a large number of bits will have little quantization noise; units
`with few or no bits will have significant quantities of noise. For good sound quality,
`the bit allocation algorithm must ensure that critical units have sufficient bits, and
`that the noise in non-critical units is not perceptually significant.
`
ATRAC does not specify a bit allocation algorithm; any appropriate algorithm may
be used. The wordlength of each BFU is stored on the MiniDisc along with the
quantized spectra, so the decoder is completely independent of the allocation
algorithm. This provides for the evolutionary improvement of the encoder without
changing the MiniDisc format or the decoder.
`
There are many possible algorithms, ranging from very simple to extraordinarily
complex. For portable MiniDisc recorders, however, the possibilities are limited
somewhat by the fact that they must be implemented on low-cost low-power compact
hardware. Nevertheless, ATRAC is capable of good sound quality using even a simple
bit allocation algorithm, provided it is soundly based on psychoacoustic principles.
ATRAC’s nonuniform adaptive time-frequency structure is already based on
psychoacoustics, relieving the pressure on the bit allocation algorithm.
`
One suggested algorithm uses a combination of fixed and variable bits. The fixed
bits emphasize the important low-frequency regions, allocating fewer bits to the
BFU’s in higher frequencies. The variable bits are allocated according to the
logarithm of the spectral coefficients within each BFU. The total bit allocation
$b_{tot}(k)$ is the weighted sum of the fixed bits $b_{fix}(k)$ and the variable bits
$b_{var}(k)$. Thus, for each BFU $k$:

$$b_{tot}(k) = T \, b_{var}(k) + (1 - T) \, b_{fix}(k)$$

The weight $T$ is a measure of the tonality of the signal, taking a value close to 1 for
pure tones, and close to 0 for white noise. This means that the proportion of fixed
and variable bits is itself variable. Thus, for pure tones, the available bits will be
concentrated in a small number of BFU’s. For more noise-like signals, the algorithm
will emphasize the fixed bits in order to reduce the number of bits allocated to the
insensitive high frequencies.
`
The above equation is not concerned with overall bit rate, and will in general allocate
more bits than are available. In order to ensure a fixed data rate, an offset $b_{off}$
(the same for all BFU’s) is calculated. This value is subtracted from $b_{tot}(k)$ for
each unit, giving the final bit allocation $b(k)$:

$$b(k) = \mathrm{integer}\{\, b_{tot}(k) - b_{off} \,\}$$

If the subtraction generates a negative wordlength, that BFU is allocated 0 bits.
This algorithm is illustrated in Figure 10.
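
The suggested algorithm can be sketched as follows. The text does not say how the offset is computed, so this sketch uses a bisection search for the smallest offset that fits the budget; it also reads integer{} as rounding down and ignores the bits needed for side information (wordlengths, scale factors, block size mode). The function name, the per-unit coefficient counts, and the example numbers are all hypothetical.

```python
import math

def allocate_bits(b_fix, b_var, tonality, coeffs_per_bfu, bit_budget):
    """Return (wordlength per BFU, offset) for one frame."""
    # Weighted sum of variable and fixed bits for each unit.
    b_tot = [tonality * v + (1.0 - tonality) * f for f, v in zip(b_fix, b_var)]

    def wordlengths(b_off):
        # Negative results are clamped to 0 bits.
        return [max(0, math.floor(b - b_off)) for b in b_tot]

    def bits_used(b_off):
        return sum(w * n for w, n in zip(wordlengths(b_off), coeffs_per_bfu))

    if bits_used(0.0) <= bit_budget:
        return wordlengths(0.0), 0.0

    # bits_used never increases as the offset grows, and it is 0 at max(b_tot),
    # so bisection keeps 'hi' at an offset that always fits the budget.
    lo, hi = 0.0, max(b_tot)
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if bits_used(mid) > bit_budget:
            lo = mid
        else:
            hi = mid
    return wordlengths(hi), hi

# Hypothetical frame with five units; fixed bits fall off toward high frequencies.
b_fix = [6.0, 5.0, 4.0, 3.0, 2.0]
b_var = [2.0, 8.0, 3.0, 1.0, 0.5]      # e.g. derived from log magnitudes per unit
coeffs_per_bfu = [8, 8, 8, 16, 16]
budget = 100

words, offset = allocate_bits(b_fix, b_var, 0.7, coeffs_per_bfu, budget)
used = sum(w * n for w, n in zip(words, coeffs_per_bfu))
print(words, round(offset, 3), used)
```

With a tonality near 1 the allocation follows the spectral peaks, while a tonality near 0 falls back on the fixed low-frequency emphasis, as described above.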
`
`4 The ATRAC Decoder
`
`A block diagram of the decoder structure is shown in Figure 5. The decoder first
`reconstructs the MDCT spectral coefficients from the quantized values, using the
`wordlength and scale factor parameters. These spectral coefficients are then used to
`reconstruct the original audio signal (Figure 7). The coefficients are first transformed
`back into the time domain by the inverse MDCT (IMDCT) using either long mode or
`short mode as specified in the parameters. Finally, the three time-domain signals are
`synthesized into the output signal by QMF synthesis filters.
`
`5 Conclusions
`
Through a combination of various techniques including psychoacoustics, subband
coding and transform coding, ATRAC succeeds in coding digital audio with virtually
no perceptual degradation in sound quality. Listening tests indicate that the
difference between ATRAC sound and the original source is not perceptually annoying
and does not reduce the sound quality. Furthermore, the system is sufficiently compact
to be installed in portable consumer products. Using ATRAC, the MiniDisc provides a
practical solution for portable digital audio.
`
`
`6 References
`
[1] MPEG/AUDIO CA11172-3, 1992.

[2] “ASPEC,” Doc. No. 89/205, ISO-IEC JTC1/SC2/WG8 MPEG-AUDIO, Oct. 18, 1989.
(Source: AT&T Bell Labs et al.)

[3] R. Veldhuis, M. Breeuwer and R. van der Waal, “Subband coding of digital
audio signals without loss of quality,” Proc. 1989 International Conference on
Acoustics, Speech and Signal Processing, Glasgow, pp. 2009-2012.

[4] A. Sugiyama, F. Hazu, M. Iwadare and T. Nishitani, “Adaptive transform coding
with an adaptive block size (ATCABS),” Proc. 1990 International Conference on
Acoustics, Speech and Signal Processing, Albuquerque, pp. 1093-1096.

[5] G. Davidson, L. Fielder and M. Antill, “High-quality audio transform coding at
128 kbits/s,” Proc. 1990 International Conference on Acoustics, Speech and Signal
Processing, Albuquerque, pp. 1117-1120.

[6] G. Davidson, L. Fielder and M. Antill, “Low-complexity transform coder for
satellite link applications,” Audio Engineering Society 89th Convention preprint
2966, Sept. 1990.

[7] J. S. Tobias, Ed., Foundations of Modern Auditory Theory, Vol. 1, Academic
Press, New York, 1970.

[8] E. Zwicker and U. T. Zwicker, “Audio engineering and psychoacoustics: Matching
signals to the final receiver, the human auditory system,” J. Audio
Engineering Society, Vol. 39 No. 3, pp. 115-126, March 1991.

[9] D. Esteban and C. Galand, “Application of quadrature mirror filters to split
band voice coding schemes,” Proc. 1977 IEEE International Conference on
Acoustics, Speech and Signal Processing, Hartford CT, pp. 191-195.

[10] P. P. Vaidyanathan, “Quadrature mirror filter banks, M-band extensions and
perfect-reconstruction techniques,” IEEE ASSP Magazine, Vol. 4, pp. 4-20, July
1987.

[11] J. Princen and A. Bradley, “Analysis/synthesis filter bank design based on
time-domain aliasing cancellation,” IEEE Trans. Acoustics, Speech and Signal
Processing, Vol. 34, pp. 1153-1161, 1986.

[12] J. Princen, A. Johnson and A. Bradley, “Subband/transform coding using filter
bank designs based on time domain aliasing cancellation,” Proc. 1987 IEEE
International Conference on Acoustics, Speech and Signal Processing, Dallas, pp.
2161-2164.
`
`
`
Figure 1: Equi-loudness curves (adapted from Robinson-Dadson curves).
Horizontal axis: frequency (kHz). The dotted line is the hearing threshold in quiet.
`
Figure 2: Simultaneous masking curve, including narrow-band noise and masked tone
(adapted from [8]). Horizontal axis: frequency (kHz).
`
Figure 3: Example of temporal masking.
Axes: level (dB) against time (ms). Regions: backward masking, simultaneous masking,
forward masking.
`
Figure 4: Block diagram of ATRAC encoder.
Blocks: Time-Frequency Analysis, Bit Allocation, Spectral Quantization. Input:
straight PCM. Outputs: parameters and quantized spectra.
`
Figure 5: Block diagram of ATRAC decoder.
Blocks: Spectral Reconstruction, Time-Frequency Synthesis. Inputs: parameters and
quantized spectra. Output: straight PCM.
`
Figure 6: Time-frequency analysis structure.
`
Figure 7: Time-frequency synthesis structure.
Blocks: IMDCT-H, IMDCT-M, IMDCT-L, QMF synthesis filters.
`
Figure 8: Illustration of long and short MDCT block size modes; in this situation,
short mode is preferable. Panels: (a) long mode, (b) short mode.
`
Figure 9: Example of nonuniform time-frequency divisions.
Horizontal axis: frequency (kHz).
`
Figure 10: Example bit allocation algorithm showing final bit assignment b(k).
Horizontal axis: Block Floating Unit k. Labeled components: variable bits T b_var,
fixed bits (1-T) b_fix.