`
`PROCEEDINGS OF THE IEEE, VOL. 67, NO. 12, DECEMBER 1979
`
`Enhancement and Bandwidth Compression
`of Noisy Speech
`
Abstract-Over the past several years there has been considerable
attention focused on the problem of enhancement and bandwidth
compression of speech degraded by additive background noise. This
interest is motivated by several factors including a broad set of
important applications, the apparent lack of robustness in current
speech-compression systems, and the development of several potentially
promising and practical solutions. One objective of this paper is to
provide an overview of the variety of techniques that have been
proposed for enhancement and bandwidth compression of speech degraded
by additive background noise. A second objective is to suggest a
unifying framework in terms of which the relationships between these
systems are more visible and which hopefully provides a structure which
will suggest fruitful directions for further research.
`
`I. INTRODUCTION
THERE ARE a wide variety of contexts in which it is desired to enhance
speech. The objective of enhancement may perhaps be to improve the
overall quality, to increase intelligibility, to reduce listener
fatigue, etc. Depending on the specific application, the enhancement
system may be directed at only one of these objectives or several. For
example, a speech communication system may introduce a low-amplitude
long-time delay echo or a narrow-band additive disturbance. While these
degradations may not by themselves reduce intelligibility for the
purposes for which the channel is used, they are generally
objectionable and an improvement in quality, perhaps even at the
expense of some intelligibility, may be desirable. Another example is
the communication between a pilot and an air traffic control tower. In
this environment, the speech is typically degraded by background noise.
Of central importance is the intelligibility of the speech and it would
generally be acceptable to sacrifice quality if the intelligibility
could be improved. Even with normal undegraded speech, it is sometimes
useful or desirable to provide enhancement. As a simple example,
high-pass filtering of normal speech is often used to introduce a
“crispness” which is generally perceived as an improvement in quality.
The speech-enhancement problem covers a broad spectrum of constraints,
applications, and issues. Environments in which an additive background
signal has been introduced are common. The background may be noise-like
such as in aircraft, street noise, etc. or may be speech-like such as
an environment with competing speakers. Other examples in which the
need for speech enhancement arises include correcting for
reverberation, correcting for the distortion of the speech of
underwater divers breathing a helium-oxygen mixture, and correcting the
distortion of speech due to pathological difficulties of the speaker or
introduced due to an attempt to speak too rapidly. Even for these
examples, the problem and techniques vary, depending on the
availability of other signals or information. For example, for
enhancement of speech in an aircraft a separate microphone can be used
to monitor the background noise so that the characteristics of the
noise can be used to adjust or adapt the enhancement system. At the
air-traffic control tower, however, the only signal available for
enhancement is the degraded speech.

Manuscript received June 22, 1979; revised August 28, 1979. This
work was supported in part by the Defense Advanced Research Projects
Agency monitored by the Office of Naval Research under Contract
N00014-75-C-0951-NR049-328 at M.I.T. Research Laboratory of
Electronics and in part by the Department of the Air Force under
Contract F19628-78-C-0002 at M.I.T. Lincoln Laboratory.
The authors are with M.I.T. Research Laboratory of Electronics and
M.I.T. Lincoln Laboratory, Cambridge, MA 02139.

Another very important application for speech enhancement is in
conjunction with speech bandwidth compression systems. Because of the
increasing role of digital communication channels coupled with the need
for encrypting of speech and increased emphasis on integrated
voice-data networks, speech-bandwidth-compression systems are destined
to play an increasingly important role in speech-communication systems.
The conceptual basis for narrow-band speech-compression systems stems
from a model for the speech signal based on what is known about the
physics and physiology of speech production. Because of this reliance
on a model for the signal it is not unreasonable to expect that as the
signal deviates from the model due to distortion such as additive
noise, the performance of the speech compression system with regard to
factors such as quality, intelligibility, etc., will degrade. It is
generally agreed that the performance of current speech-compression
systems degrades rapidly in the presence of additive noise and other
distortions and there is currently considerable interest and attention
being directed at the development of more robust speech compression
systems. There are two basic approaches which are typically considered,
either of which may be preferable in a given situation. One approach is
to base the bandwidth compression on the assumption of undistorted
speech and develop a preprocessor to enhance the degraded speech in
preparation for further processing by the bandwidth compression system.
It is important to recognize that in enhancing speech in preparation
for bandwidth compression the effectiveness of the preprocessor is
judged on the basis of the output of the bandwidth-compression system
in comparison with the output if no preprocessor is used. Thus, for
example, it is possible that the output of the preprocessor would be
judged by a listener to be inferior (by some measure) to the input but
that the output of the bandwidth-compression system with the
preprocessor is preferred to the output without it. In this case, the
preprocessor would clearly be considered to be effective
0018-9219/79/1200-1586$00.75 © 1979 IEEE
`
`Exhibit 1021
`Page 01 of 19
`
`
`
`
`
`
`
`LIM AND OPPENHEIM: ENHANCEMENT AND BANDWIDTH COMPRESSION
`
`
in enhancing the speech in preparation for bandwidth compression.
Another approach to bandwidth compression of degraded speech is to
incorporate into the model for the signal information about the
degradation. A number of systems based on such an approach have
recently been proposed and will be discussed in detail in this paper.
As is evident from the above discussion, the general problem of
enhancing speech is broad and the constraints, information, and
objectives are heavily dependent on the specific context and
applications. In this paper, we consider only a small subset of
possible topics, specifically the enhancement and bandwidth compression
of speech degraded by additive noise. Furthermore, we assume that the
only signal available is the degraded speech and that the noise does
not depend on the original speech. Many practical problems, some of
which have already been discussed, fall into this framework and some
problems that do not can be transformed so that they do. For example,
multiplicative noise or convolutional noise degradation can be
converted to an additive noise degradation by a homomorphic
transformation [1], [2]. As another example, signal-dependent
quantization noise in pulse-code modulation (PCM) signal coding can be
converted to a signal-independent additive noise by a pseudo-noise
technique [3]-[5].
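The multiplicative case can be illustrated with a logarithm, the
simplest homomorphic transformation: if the degraded signal is a
product y(n) = s(n) d(n) of positive quantities, then
log y(n) = log s(n) + log d(n), which is an additive degradation. The
sketch below is only an illustration under that positivity assumption;
the sample values and the function name `to_additive` are hypothetical
and are not taken from [1], [2].

```python
import math

def to_additive(samples):
    """Map positive samples to the log domain, where products become sums."""
    return [math.log(v) for v in samples]

# Hypothetical positive signal and multiplicative noise (illustrative values).
s = [1.0, 2.0, 4.0, 3.0]
d = [0.9, 1.1, 1.05, 0.95]
y = [si * di for si, di in zip(s, d)]   # multiplicative degradation

log_s, log_d, log_y = to_additive(s), to_additive(d), to_additive(y)

# In the log domain the degradation is additive: log y = log s + log d.
residual = max(abs(ly - (ls + ld)) for ly, ls, ld in zip(log_y, log_s, log_d))
```

Once the problem is in additive form, any of the additive-noise
techniques can in principle be applied in the transformed domain,
followed by the inverse transformation.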
Even within the limited framework outlined above, there is a diversity
of approaches and systems. One objective of this paper is to provide an
overview of the variety of techniques that have been proposed for
enhancement of speech degraded by additive background noise both for
direct listening and as a preprocessor for subsequent bandwidth
compression. Many of these systems were developed independently of each
other and on the surface often appear to be unrelated. Thus another
objective of the paper is to provide a unifying framework in terms of
which the relationship between these systems is more visible, and which
hopefully will provide a structure which will suggest further fruitful
directions for research.
In Section II, we present an overview of the general topic. In this
overview we classify the various enhancement systems based on the
information assumed about the speech and the noise. Some systems based
on time-invariant Wiener filtering, for example, rely only on an
assumed noise power spectrum and on long-time average characteristics
of speech, such as the fact that the average speech spectrum decays
with frequency at approximately 6 dB/octave. Other systems rely on
aspects of speech perception or speech production in general or on a
detailed model of speech.
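As a rough illustration of how little a time-invariant Wiener filter
needs to know, the sketch below evaluates the standard noncausal Wiener
gain H(f) = Ps(f) / (Ps(f) + Pd(f)) for a speech power spectrum that
falls at about 6 dB/octave and a flat noise spectrum. The corner
frequency, the noise level, and the function names are assumptions
chosen for illustration; they are not values from the paper.

```python
def speech_psd(f_hz, corner_hz=500.0):
    """Assumed long-time speech power spectrum: flat below the corner,
    then falling roughly 6 dB/octave (power proportional to 1/f^2)."""
    return 1.0 / (1.0 + (f_hz / corner_hz) ** 2)

def wiener_gain(f_hz, noise_psd=0.05):
    """Noncausal Wiener gain Ps / (Ps + Pd) for flat noise of level noise_psd."""
    ps = speech_psd(f_hz)
    return ps / (ps + noise_psd)

freqs = [125.0, 250.0, 500.0, 1000.0, 2000.0, 4000.0]
gains = [wiener_gain(f) for f in freqs]
```

Under these assumptions the gain stays near one where the speech
spectrum dominates and falls toward zero at high frequencies, where the
flat noise dominates. The gain is fixed in time, which is exactly the
limitation discussed later for nonstationary speech.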
Sections III-V present a more detailed discussion of several of these
categories of speech-enhancement systems. In particular, Section III is
concerned with the general principle of speech enhancement based on
estimation of the short-time spectral amplitude of the speech. This
basic principle encompasses a variety of techniques and systems
including the specific methods of spectral subtraction, parametric
Wiener filtering, etc. In Section IV, speech enhancement techniques
which rely principally on the concept of the short-time periodicity of
voiced speech are reviewed, including comb-filtering and related
systems. Section V discusses a variety of systems that rely on more
specific modeling of the speech waveform. As we will discuss in detail,
in some cases, parameters of the model are obtained from an analysis of
the degraded speech and used to synthesize the enhanced speech. In
other cases, the results of an analysis based on a model for speech are
used to control an enhancement filter, perhaps with the procedure being
iterative so that the output of an enhancement filter is then subjected
to further analysis, etc. Many of these systems also incorporate a
number of the techniques introduced in Section III, including Wiener
filtering and spectral subtraction.
In Sections III-V, the focus is entirely on systems for enhancement
with the evaluation of the systems being based on listening without
further processing. In Section VI, we consider the related but separate
problem of bandwidth compression of speech degraded by additive noise.
In Section VII, we discuss in some detail the evaluation of the
performance of the various systems presented in the earlier sections.
In general, the performance evaluation of a speech-enhancement system
is extremely difficult, in large measure because the appropriate
criteria for evaluation are heavily dependent on the specific
application of the system. Relative importance of such factors as
quality, intelligibility, listener fatigue, etc., may vary considerably
with the application. In Section VII, we summarize the performance
evaluations that have been reported for the various systems presented
in this paper. Since the evaluation of different systems has generally
been based on different procedures, environments, etc., no attempt is
made in the section to compare individual systems. In general, however,
we will see that while many of the enhancement systems reduce the
apparent background noise and thus perhaps increase quality, many of
them, to varying degrees, reduce intelligibility. In the context of
bandwidth compression, however, various systems provide an increase in
intelligibility over that obtained without the incorporation of speech
enhancement.
`II. OVERVIEW OF SYSTEMS FOR ENHANCEMENT AND
`BANDWIDTH COMPRESSION OF NOISY SPEECH
As indicated in the previous section, our focus in this paper is on
degradation due to the presence of additive noise. Even within this
limited context there are a wide variety of approaches which have been
proposed and explored. Conceptually any approach should attempt to
capitalize on available information about the signal, i.e., the speech,
and the background noise. Speech is a special subclass of audio signals
and there are reasonable models in terms of which the speech waveform
can be described and categorized. The more specifically we attempt to
model the speech signal, the more potential for separating it from the
background noise. On the other hand, the more we assume about the
speech the more sensitive the enhancement system will be to
inaccuracies or deviations from these assumptions. Thus incorporating
assumptions and information about the speech signal represents
tradeoffs which are reflected in the various systems. In a similar
manner systems can attempt to incorporate detailed information about
the background noise. For example, the type of processing suggested if
the background noise is a competing speaker is different than if it is
wide-band random noise. Thus enhancement systems also tend to differ in
terms of the assumptions made regarding the background noise. As with
assumptions related to the signal, the more an enhancement system
attempts to capitalize on assumed characteristics of the noise the more
susceptible it is likely to be to deviations from these assumptions.
Another important consideration in speech enhancement stems from the
fact that the criteria for enhancement ultimately relate to an
evaluation by a human listener. In different contexts the criteria for
evaluation may differ depending on whether quality, intelligibility, or
some other attribute is the most important. Thus speech enhancement
must inevitably take into account aspects of human perception. As we
will indicate shortly, some systems are heavily motivated by perceptual
considerations; others rely more on mathematical criteria. In such
cases, of course, the mathematical criteria must in some way be
consistent with human perception, and, while an optimum mathematical
criterion is not known, some mathematical error criteria are understood
to be a better match than others to aspects of human perception.

Fig. 1. A speech production model. (Block diagram; recoverable labels:
pitch period, random noise, amplitude, digital filter coefficients.)

Fig. 2. An example of resonant frequencies of an acoustic cavity.
(a) Vocal-tract transfer function. (b) Magnitude spectrum of a speech
sound with the resonant frequencies shown in (a).

In the following discussion we briefly describe some aspects of speech
production and speech perception that in varying degrees play a role in
speech-enhancement systems. Following that we present a brief overview
of a representative collection of speech-enhancement systems, with the
intent of categorizing these systems in terms of the various aspects of
speech production and perception on which they attempt to capitalize.
Speech is generated by exciting an acoustic cavity, the vocal tract, by
pulses of air released through the vocal cords for voiced sounds, or by
turbulence for unvoiced sounds. Thus a simple but useful model for
speech production consists of a linear system, representing the vocal
tract, driven by an excitation function which is a periodic pulse train
for voiced sounds and wide-band noise for unvoiced sounds, as
illustrated in Fig. 1. Furthermore, since the linear system represents
an acoustic cavity, its response is of a resonant nature, so that its
transfer function is characterized by a set of resonant frequencies,
referred to as formants, as illustrated in Fig. 2(a). Thus, if the
excitation and vocal-tract parameters are fixed, then as indicated in
Fig. 2(b), the speech spectrum has an envelope representing the
vocal-tract transfer function of Fig. 2(a) and a fine structure
representing the excitation.
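The production model just described can be sketched in a few lines. The
example below is a minimal illustration, not the model of Fig. 1 in
full: it uses a single two-pole resonator to stand in for one formant,
and the sampling rate, pitch period, formant frequencies, and
bandwidths are all assumed values chosen for illustration.

```python
import math
import random

def resonator(excitation, freq_hz, bandwidth_hz, fs_hz):
    """Two-pole resonator standing in for one formant of the vocal tract."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)      # pole radius
    theta = 2.0 * math.pi * freq_hz / fs_hz            # pole angle
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in excitation:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

def pulse_train(num_samples, period):
    """Voiced excitation: one unit pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(num_samples)]

fs = 8000                      # assumed sampling rate in Hz
period = 80                    # assumed pitch period (100 Hz pitch at 8 kHz)

# Voiced sound: periodic pulse train through the resonant system.
voiced = resonator(pulse_train(1600, period), 500.0, 200.0, fs)

# Unvoiced sound: wide-band noise through the resonant system.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]
unvoiced = resonator(noise, 2500.0, 400.0, fs)
```

With fixed parameters the voiced output settles into a waveform that
repeats every pitch period, so its spectrum has harmonics (the fine
structure) shaped by the resonance (the envelope).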
Many of the techniques for speech enhancement, particularly those in
Sections III and V, are conceptually based on the representation of the
speech signal as a stochastic process. This characterization of speech
is clearly more appropriate in the case of unvoiced sounds for which
the vocal tract is driven by wide-band noise. The vocal tract of course
changes shape as different sounds are generated and this is reflected
in a time-varying transfer function for the linear system in Fig. 1.
However, because of the mechanical and physiological constraints on the
motion of the vocal tract and articulators such as the tongue and lips,
it is reasonable to represent the linear system in Fig. 1 as a slowly
varying linear system so that on a short-time basis it is approximated
as stationary. Thus some specific attributes of the speech signal which
can be capitalized on in an enhancement system are that it is the
response of a slowly varying linear system, that on a short-time basis
its spectral envelope is characterized by a set of resonances, and that
for voiced sounds, on a short-time basis it has a harmonic structure.
This simplified model for speech production has generally been very
successful in a variety of engineering contexts including speech
enhancement, synthesis, and bandwidth compression. A more detailed
discussion of models for speech production can be found in [6]-[8].
The perceptual aspects of speech are considerably more complicated and
less well understood. However, there are a number of commonly accepted
aspects of speech perception which play an important role in
speech-enhancement systems. For example, consonants are known to be
important in the intelligibility of speech even though they represent a
relatively small fraction of the signal energy. Furthermore, it is
generally understood that the short-time spectrum is of central
importance in the perception of speech and that, specifically, the
formants in the short-time spectrum are more important than other
details of the spectral envelope. It appears also that the first
formant, typically in the range of 250 to 800 Hz, is less important
perceptually than the second formant [9], [10]. Thus it is possible to
apply a certain degree of high-pass filtering [11], [12] to speech
which may perhaps affect the first formant without introducing serious
degradation in intelligibility. Similarly low-pass filtering with a
cutoff frequency above 4 kHz, while perhaps affecting crispness and
quality, will in general not seriously affect intelligibility. A good
representation of the magnitude of the short-time spectrum is also
generally considered to be important whereas the phase is relatively
unimportant. Another perceptual aspect of the auditory system that
plays a role in speech enhancement is the ability to mask one signal
with another. Thus, for example, narrow-band noise and many forms of
artificial noise or degradation such as might be produced by a vocoder
are more unpleasant to listen to than broad-band noise and a
speech-enhancement system might include the introduction of broad-band
noise to mask the narrow-band or artificial noise.
All speech-enhancement systems rely to varying degrees on the aspects
of speech production and perception outlined above. One of the simplest
approaches to enhancement is the use of low-pass or bandpass filtering
to attenuate the noise outside the band of perceptual importance for
speech. More generally, when the power spectrum of the noise is known,
one can consider the use of Wiener filtering, based on the long-time
power spectrum of speech. While in some cases, such as the presence of
narrow-band background noise, this is reasonably successful, Wiener
filtering based on the long-time power spectrum of the speech and noise
is limited because speech is not stationary. Even if speech were truly
stationary, mean-square error, which is the error criterion on which
Wiener filtering is based, is not strongly correlated with perception
and thus is not a particularly effective error criterion to apply to
speech processing systems. This is evidenced, for example, in the use
of masking for enhancement. By adding broad-band noise to mask other
degradation, we are, in effect, increasing the mean-square error.
Another example that suggests that mean-square error is not well
matched to the perceptually important attributes in speech is the fact
that distortion of the speech waveform by processing with an all-pass
filter results in essentially no audible difference if the impulse
response of the all-pass filter is reasonably short but can result in a
substantial mean-square error between the original and filtered speech.
In other words, mean-square error is sensitive to phase of the spectrum
whereas perception tends not to be.
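The all-pass observation is easy to check numerically. The sketch below
passes a sinusoid through a first-order all-pass filter and compares
the filter's magnitude response with the waveform mean-square error.
The coefficient value, the test frequency, and the function names are
assumptions made for this illustration only.

```python
import cmath
import math

A = 0.5  # all-pass coefficient (assumed value; |A| < 1 for stability)

def allpass(x, a=A):
    """First-order all-pass: y[n] = -a*x[n] + x[n-1] + a*y[n-1]."""
    out, x1, y1 = [], 0.0, 0.0
    for xn in x:
        yn = -a * xn + x1 + a * y1
        out.append(yn)
        x1, y1 = xn, yn
    return out

def allpass_response(omega, a=A):
    """H(e^{j omega}) = (-a + e^{-j omega}) / (1 - a e^{-j omega})."""
    z_inv = cmath.exp(-1j * omega)
    return (-a + z_inv) / (1.0 - a * z_inv)

omega0 = math.pi / 2
x = [math.sin(omega0 * n) for n in range(1000)]
y = allpass(x)

gain = abs(allpass_response(omega0))                 # magnitude response
mse = sum((yi - xi) ** 2 for xi, yi in zip(x, y)) / len(x)
power = sum(xi ** 2 for xi in x) / len(x)
relative_mse = mse / power
```

The magnitude response is unity, so the spectral amplitude is
untouched, yet the mean-square error is several times the signal power
because the filter shifts the phase of the sinusoid.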
Masking and bandpass filtering represent two simple ways in which
perceptual aspects of the auditory system can be exploited in speech
enhancement. Another system whose motivation depends heavily on aspects
of speech perception was proposed by Thomas and Niederjohn [12] as a
preprocessor prior to the introduction of noise in those applications
where noise-free speech is available for processing. In essence, their
system applies high-pass filtering to reduce or remove the first
formant followed by infinite clipping. The motivation for the system
lies in the observation that at a given signal-to-noise ratio infinite
clipping will increase, relative to the vowels, the amplitude of the
perceptually important low-amplitude events such as consonants, thus
making them less susceptible to masking by noise. In addition, for
vowels the filtering will increase the amplitude of higher formants
relative to the first formant, thus making the perceptually more
important higher formants less susceptible to degradation. In the
speech enhancement problem considered in this paper, noise-free speech
is not available for processing as required in the above system. Thomas
and Ravindran [13], however, applied high-pass filtering followed by
infinite clipping to noisy speech as an experiment. While quality may
be degraded by the process of filtering and clipping, they claim a
noticeable improvement in intelligibility when applied to enhance
speech degraded by wide-band random noise. One possible explanation may
be that the high-pass filtering operation reduces the masking of
perceptually important higher formants by the relatively unimportant
low-frequency components.
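The two stages of such a system, high-pass filtering followed by
infinite clipping, can be sketched as follows. This is only a toy
illustration: the first difference is used as a stand-in high-pass
filter (the filters used in [12], [13] are not described here), and the
sample values and function names are hypothetical.

```python
def first_difference(x):
    """Crude high-pass filter y[n] = x[n] - x[n-1], used only as a
    stand-in for the high-pass stage."""
    return [xn - xp for xp, xn in zip([0.0] + x[:-1], x)]

def infinite_clip(x):
    """Keep only the sign of each sample, so every sample has unit magnitude."""
    return [1.0 if v >= 0.0 else -1.0 for v in x]

def highpass_then_clip(x):
    return infinite_clip(first_difference(x))

# A loud "vowel-like" stretch followed by a quiet "consonant-like" stretch.
loud = [10.0, -8.0, 9.0, -10.0]
quiet = [0.1, -0.2, 0.15, -0.1]
processed = highpass_then_clip(loud + quiet)
```

After clipping, the quiet stretch has the same amplitude as the loud
one, which is the sense in which low-amplitude events are raised
relative to the vowels.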
Another system which relies heavily on human perception of speech was
proposed by Drucker [14]. Based on some perceptual tests, Drucker
concluded that one primary cause for the intelligibility loss in speech
degraded by wide-band random noise is the confusion among the fricative
and plosive sounds which is partly due to the loss of short pauses
immediately before the plosive sounds. By high-pass filtering one of
the fricative sounds, the /s/ sound, and inserting short pauses before
the plosive sounds (assuming that their locations can be accurately
determined), Drucker claims a significant improvement in
intelligibility.
In discussing perceptual attributes we indicated that the short-time
spectral magnitude is generally considered to be important whereas the
phase is relatively unimportant. This forms the basis for a class of
speech enhancement systems which attempt in various ways to estimate
the short-time spectral magnitude of the speech without particular
regard to the phase and to use this to recover or reconstruct the
speech. This class of systems includes spectral subtraction techniques,
originally due to Weiss et al. [15], [16] and which have recently
received a great deal of attention [17]-[22], and optimum filtering
techniques such as Wiener filtering and power spectrum filtering. These
systems will be discussed in considerable detail in Section III. As we
will see, many of these systems which appear on the surface to be
different are in fact identical or very closely related.
In addition to directly or indirectly utilizing perceptual attributes,
most enhancement systems rely to varying degrees on aspects of speech
production. For example, in Section IV, we describe in detail a variety
of systems that attempt, in some way, to capitalize on short-time
periodicity of speech during voiced sounds. As a consequence of this
periodicity, during voiced intervals the speech spectrum has a harmonic
structure which suggests the possibility of applying comb filtering or,
as proposed by Parsons [23], attempting to extract in other ways the
components of the speech spectrum only at the harmonic frequencies. In
essence, knowledge of the harmonic structure of voiced sounds allows us
in principle to remove the noise in the spectral bands between the
harmonics. As discussed in Section IV, speech enhancement by comb
filtering can also be viewed in terms of averaging successive periods
of the noisy speech to partially cancel the noise. Another system which
attempts to take advantage of the quasi-periodic nature of the speech
was proposed by Sambur [24]. As developed in more detail in Section IV,
his system is based on the principles of adaptive noise cancelling.
Unlike the classical procedure, Sambur's method is designed to cancel
out the clean speech signal, taking advantage of the quasi-periodic
nature of the speech to form an estimate of the speech at each time
instant from the value of the signal one period earlier.
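The view of comb filtering as averaging successive periods can be
sketched directly. The example below assumes an exactly periodic
"voiced" signal with a known, constant pitch period, which real speech
only approximates; the signal, noise level, number of averaged periods,
and function name are all illustrative assumptions.

```python
import math
import random

def comb_average(x, period, num_periods=5):
    """Average each sample with the samples one, two, ... periods away on
    either side (symmetric comb with equal weights; num_periods is odd)."""
    half = num_periods // 2
    out = []
    for n in range(half * period, len(x) - half * period):
        total = sum(x[n + k * period] for k in range(-half, half + 1))
        out.append(total / num_periods)
    return out

period = 80                                   # assumed known, constant pitch period
rng = random.Random(0)
clean = [math.sin(2 * math.pi * n / period) + 0.5 * math.sin(6 * math.pi * n / period)
         for n in range(1600)]
noisy = [c + rng.gauss(0.0, 0.5) for c in clean]

enhanced = comb_average(noisy, period)
offset = 2 * period                           # comb_average trims two periods per side
mse_before = sum((noisy[n + offset] - clean[n + offset]) ** 2
                 for n in range(len(enhanced))) / len(enhanced)
mse_after = sum((enhanced[n] - clean[n + offset]) ** 2
                for n in range(len(enhanced))) / len(enhanced)
```

Because the periodic component adds coherently while the noise does
not, the averaged output keeps the harmonics and reduces the noise
power roughly in proportion to the number of periods averaged.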
In the model of speech production, we represented the speech signal as
generated by exciting a quasi-stationary linear system with a pulse
train for voiced speech and noise for unvoiced speech. Based on this
model, an approach to speech enhancement is to attempt to estimate
parameters of the model rather than the speech itself and to then use
this to synthesize the speech, i.e., to enhance speech through the use
of an analysis-synthesis system. A particularly novel application of
this concept was used by Miller [25] to remove the orchestral
accompaniment from early recordings of Enrico Caruso. In this system
homomorphic deconvolution was used to estimate the impulse response of
the model in Fig. 1. A similar approach to noise reduction was proposed
by Suzuki [26], [27] whereby the short-time correlation function of the
degraded speech is used as an estimate of the impulse response of the
linear system. This system is referred to as splicing of
autocorrelation function (SPAC). A modification of SPAC is referred to
as splicing of cross-correlation function (SPOC). A number of systems
also attempt to model the vocal-tract impulse response in more detail.
As we discussed previously, the vocal-tract transfer function is
characterized by a set of resonances or formants that are perceptually
important. This suggests the possibility of representing the
vocal-tract impulse response in terms of a pole-zero model with the
analysis procedure directed at estimating the associated parameters.
The poles in particular would provide a reasonable representation of
the formants.
All-pole modeling of speech has had notable success in
analysis-synthesis systems for clean speech. A number of recent efforts
have been directed toward estimating the parameters in an all-pole
model from noisy observations of the speech such as the systems by
Magill and Un [28], Lim and Oppenheim [29], Lim [18], and Done and
Rushforth [30]. Extensions to pole-zero modeling have also been
proposed by Musicus and Lim [31] and Musicus [32]. These various
approaches are described and compared in detail in Section V.
The above discussion was intended as a brief overview of the general
approaches to speech enhancement. In the next three sections we explore
in more detail many of the systems mentioned above. In particular, in
Section III, we focus on speech-enhancement techniques based on
short-time spectral amplitude estimation. In Section IV our focus is on
speech enhancement based on periodicity of voiced speech and in
Section V on speech-enhancement techniques using an analysis-synthesis
procedure.
`III. SPEECH ENHANCEMENT TECHNIQUES BASED ON
`SHORT-TIME SPECTRAL AMPLITUDE ESTIMATION
In general, in enhancement of a signal degraded by additive noise, it
is significantly easier to estimate the spectral amplitude associated
with the original signal than it is to estimate both amplitude and
phase. As we discussed in Section II, it is principally the short-time
spectral amplitude rather than phase that is important for speech
intelligibility and quality. As we discuss in this section, there are a
variety of speech-enhancement techniques that capitalize on this aspect
of speech perception by focusing on enhancing only the short-time
spectral amplitude. The techniques to be discussed can be broadly
classified into two groups. In the first, presented in Section III-A,
the short-time spectral amplitude is estimated in the frequency domain,
using the spectrum of the degraded speech. Each short-time segment of
the enhanced speech waveform in the time domain is then obtained by
inverse transforming this spectral amplitude estimate combined with the
phase of the degraded speech. In the second class, discussed in
Section III-B, the degraded speech is first used to obtain a filter
which is then applied to the degraded speech. Since these procedures
lead to zero-phase filters, it is again only the spectral amplitude
that is enhanced, with the phase of the filtered speech being identical
to that of the degraded speech.
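The first class of systems can be sketched for a single short-time
segment as follows. This is a minimal illustration under several
assumptions: a simple power-subtraction rule floored at zero stands in
for the spectral amplitude estimator, the per-bin noise power
`noise_power` is taken as known and flat, a direct DFT is used for
self-containment, and windowing and overlap between segments are
omitted. The function names are hypothetical.

```python
import cmath
import math

def dft(x):
    """Direct DFT; an FFT routine would be used in practice."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Direct inverse DFT."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def enhance_frame(frame, noise_power):
    """Estimate the spectral amplitude by power subtraction (floored at zero),
    reattach the phase of the degraded frame, and inverse transform."""
    out_spec = []
    for y_k in dft(frame):
        amplitude = math.sqrt(max(abs(y_k) ** 2 - noise_power, 0.0))
        out_spec.append(cmath.rect(amplitude, cmath.phase(y_k)))
    return [v.real for v in idft(out_spec)]

# Illustrative frame: a sinusoid plus a small alternating disturbance.
frame = [math.sin(2 * math.pi * 4 * t / 32) + 0.1 * (-1) ** t for t in range(32)]
unchanged = enhance_frame(frame, noise_power=0.0)
enhanced = enhance_frame(frame, noise_power=5.0)
```

With zero assumed noise power the frame is returned unchanged, which
confirms that only the amplitude is modified; the phase always comes
from the degraded speech.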
In both classes of systems discussed below no conceptual distinction is
made between voiced and unvoiced speech and in particular, in contrast
to the techniques to be discussed in Section IV, the periodicity of
voiced speech is not exploited. Both classes of systems in this section
are most easily interpreted in terms of a stochastic characterization
of the speech signal. While this characterization is more justifiable
for unvoiced speech, it has been shown empirically to also lead to
successful procedures for voiced speech.
`
A. Speech Enhancement Based on Direct Estimation
of Short-Time Spectral Amplitude
When a stationary random signal $s(n)$ has been degraded by
uncorrelated additive noise $d(n)$ with a known power density spectrum,
the power density spectrum or spectral amplitude of the signal is
easily estimated through a process of spectral subtraction.
Specifically, if

$$y(n) = s(n) + d(n) \tag{1}$$

and $P_y(\omega)$, $P_s(\omega)$, and $P_d(\omega)$ represent the power
density spectra of $y(n)$, $s(n)$, and $d(n)$, respectively, then

$$P_y(\omega) = P_s(\omega) + P_d(\omega). \tag{2}$$

Consequently, a reasonable estimate for $P_s(\omega)$ is obtained by
subtracting the known spectrum $P_d(\omega)$ from an estimate of
`P,,