A Microphone Array System for Speech Source Localization,
Denoising, and Dereverberation
`
`A thesis presented
`
`by
`
`Scott Matthew Griebel
`
`to
`
`The Division of Engineering and Applied Sciences
`in partial fulfillment of the requirements
`for the degree of
`
`Doctor of Philosophy
`
`in the subject of
`
`Engineering Sciences
`
`Harvard University
`Cambridge, Massachusetts
`
`April 2002
`
`
`
`
© 2002 Scott Matthew Griebel
`All rights reserved.
`
`
`
`
Thesis advisor: Michael S. Brandstein

Author: Scott M. Griebel
`
`A Microphone Array System for Speech Source Localization,
`Denoising, and Dereverberation
`
`Abstract
`
`There is a great deal of potential for advancement in distant-talker speech acquisition
`
`research, and a wealth of current and future technology depends upon these advances. The
`
`goal of this work is to allow users the opportunity to roam unfettered in diverse environ-
`
`ments while still providing a high quality speech signal and a robustness to background noise
`
`and reverberation effects. In this thesis, a microphone array speech enhancement system is
`
`presented which has three main components: source localization, background noise reduc-
`
`tion, and dereverberation. The localization algorithm is effective in the presence of both
`
`background noise and reverberations and simultaneously produces relative time delay esti-
`
`mates and a source location estimate. It provides a procedure applicable to all time delay
`
`estimators which either maximize or minimize an appropriate objective function, improv-
`
`ing the estimators’ robustness to environmental degradations. The denoising algorithm is a
`
`multi-microphone extension to the Minimum Statistics denoising technique [Martin (2001)].
`
`This algorithm also has an additional and optional SNR-dependent beamforming stage that
`
`is shown to be very useful in certain environments. The final component is a multi-channel
`
`dereverberation algorithm which models the speech source and room reverberations inde-
`
`pendently. A weighting function is estimated and applied in the Wavelet Transform domain
`
`to de-emphasize portions which are less coherent across microphone signals, an indication
`
`of reverberation effects. Results for the various components are provided as proof of the
`
`effectiveness of the proposed multi-microphone speech enhancement system.
`
`
`
`Acknowledgements
`
`I would like to express my gratitude for the advising and friendship of Prof. Michael
`
`Brandstein whose guidance in matters both personal and professional was always helpful.
`
`I also thank my thesis committee members Prof. Roger Brockett and Prof. Irvin Schick
`
`for their input over the years.
`
`I would like to acknowledge my partners in crime at the Harvard Multi-Media Lab: Daniel
`
`Freedman (the smartest man I know), Ce Wang (who would die for his dream), and Daniel
`
`Kirsanov (who was always with basketball, but never played basketball). Those outside
`
`the lab that were equally appreciated include Soundouss Bouhia, Anthony Volpe, Winston
`
`Yu, and Brendan Zinn. They all sparked numerous pointless conversations that were often
`
`needed to break up a day’s work.
`
`I also thank Homer J. Simpson for his weekly words of wisdom. His knowledge regarding
`
`nuclear engineering proved helpful even in my unrelated field of speech signal processing.
`
`I would also like to express my appreciation for my parents for their caring and support
`
`throughout my graduate career.
`
`Finally, I thank my wonderful wife Kristen (the smartest woman I know). She waited
`
`patiently for me to begin providing some sort of real salary, but always understood that
`
`my quest for “higher knowledge” was, to me at least, well worth the trip.
`
`
`
`
If there’s a way to do it better... find it.

Thomas A. Edison

Every time I learn something new, it pushes out something old.

Homer J. Simpson
`
`
`
`Contents
`
1 Problem Overview and Motivation . . . . . . . . . . . . . . . . . . . . . . . 1
  1.1 Source Localization and Time Delay Estimation . . . . . . . . . . . . . . 2
  1.2 Background Noise Reduction . . . . . . . . . . . . . . . . . . . . . . . . 4
  1.3 Speech Dereverberation . . . . . . . . . . . . . . . . . . . . . . . . . . 5
  1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Speech Production and Processing Background . . . . . . . . . . . . . . . . . 7
  2.1 The Speech Production Model . . . . . . . . . . . . . . . . . . . . . . . 7
  2.2 Speech Signal Processing Tools . . . . . . . . . . . . . . . . . . . . . . 12
    2.2.1 The Fourier Transform . . . . . . . . . . . . . . . . . . . . . . . 12
    2.2.2 The Short-Time Fourier Transform . . . . . . . . . . . . . . . . . . 13
    2.2.3 The Wavelet Transform . . . . . . . . . . . . . . . . . . . . . . . 14
  2.3 Reverberant Channel Modeling . . . . . . . . . . . . . . . . . . . . . . . 29
  2.4 Quantitative Speech Quality Measures . . . . . . . . . . . . . . . . . . . 31
    2.4.1 Segmental Signal-to-Noise Ratio . . . . . . . . . . . . . . . . . . 32
    2.4.2 Bark Spectral Distortion . . . . . . . . . . . . . . . . . . . . . . 33

3 Robust Multi-Channel Source Localization . . . . . . . . . . . . . . . . . . 40
  3.1 Traditional TDE-based Localization . . . . . . . . . . . . . . . . . . . . 43
    3.1.1 The Mathematics of Time Delay Estimation . . . . . . . . . . . . . . 43
    3.1.2 Generalized Cross-Correlation Time Delay Estimation . . . . . . . . 46
  3.2 Localization Using Consistent Delay Vectors . . . . . . . . . . . . . . . 50
  3.3 Consistent Multi-Channel Maximum Likelihood Time Delay Estimation . . . . 55
  3.4 Practical Time Delay Estimation Systems . . . . . . . . . . . . . . . . . 59
    3.4.1 Discrete Time Delay Estimation . . . . . . . . . . . . . . . . . . . 59
    3.4.2 Voice Activity Detection . . . . . . . . . . . . . . . . . . . . . . 60
    3.4.3 Numerical Techniques . . . . . . . . . . . . . . . . . . . . . . . . 60
  3.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
    3.5.1 Simulations Results . . . . . . . . . . . . . . . . . . . . . . . . 61
    3.5.2 Real Room Results . . . . . . . . . . . . . . . . . . . . . . . . . 69
  3.6 Source Localization Summary . . . . . . . . . . . . . . . . . . . . . . . 71

4 Background Noise Reduction . . . . . . . . . . . . . . . . . . . . . . . . . 75
  4.1 Single Channel Approaches . . . . . . . . . . . . . . . . . . . . . . . . 75
    4.1.1 Spectral Subtraction . . . . . . . . . . . . . . . . . . . . . . . . 76
    4.1.2 Wiener Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . 77
    4.1.3 Other Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 78
  4.2 Multi-Channel Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 79
  4.3 Background Noise Estimation and Voice Activity Detection . . . . . . . . . 80
  4.4 Multi-Channel Minimum Statistics Denoising (MC-MSD) . . . . . . . . . . . 81
  4.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
    4.5.1 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
    4.5.2 Real Room Array . . . . . . . . . . . . . . . . . . . . . . . . . . 89
    4.5.3 Real Room Circular Array . . . . . . . . . . . . . . . . . . . . . . 91
  4.6 Denoising Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

5 Multi-Microphone Model-Based Dereverberation . . . . . . . . . . . . . . . . 97
  5.1 Single-Channel Approaches . . . . . . . . . . . . . . . . . . . . . . . . 98
  5.2 Multi-Channel Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 99
  5.3 Multi-Channel Model-Based Dereverberation Algorithm (MC-MBD) . . . . . . . 104
    5.3.1 Joint Inverse LPC Filter . . . . . . . . . . . . . . . . . . . . . . 106
    5.3.2 Wavelet Transform of LPC Residual Signals . . . . . . . . . . . . . 107
    5.3.3 Wavelet Transform Coefficient Extrema Clustering . . . . . . . . . . 108
    5.3.4 Short-Term Coherence Scaling . . . . . . . . . . . . . . . . . . . . 109
    5.3.5 Wavelet Transform Coefficient Extrema Reconstruction . . . . . . . . 112
    5.3.6 Forward LPC Filter . . . . . . . . . . . . . . . . . . . . . . . . . 113
  5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
  5.5 Dereverberation Summary . . . . . . . . . . . . . . . . . . . . . . . . . 118

6 System Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
  6.1 Simulations of Denoising and Dereverberation . . . . . . . . . . . . . . . 121
  6.2 Speech Processing for Automated Videoconferencing . . . . . . . . . . . . 124
    6.2.1 Audio Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 127
    6.2.2 Video Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 130
  6.3 System Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
  7.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
  7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

A The Multi-Channel Maximum Likelihood Time Delay Estimator . . . . . . . . . . 137
`
`
`
`List of Figures
`
1.1 Block diagram of speech enhancement system . . . 3

2.1 Physiological components of speech production . . . 8
2.2 Schematic diagram of speech production system . . . 8
2.3 Discrete-time components of speech production . . . 9
2.4 Glottal waveform example . . . 11
2.5 Two stages of the Wavelet Transform analysis filter bank for the Algorithme
    à Trous where h̄_k[n] = h_k[−n] . . . 15
2.6 Two stages of the Wavelet Transform synthesis filter bank for the Algorithme
    à Trous . . . 16
2.7 The wavelet and scaling functions . . . 18
2.8 A Wavelet Transform example using a derivative mother wavelet. The top
    plot is the original input signal and the remaining plots are the Wavelet
    Transform coefficients at scales 2¹ through 2⁴ . . . 20
2.9 An example of the Wavelet Transform of a speech signal. The top plot is
    the original input signal and the remaining plots are the Wavelet Transform
    coefficients at scales 2¹ through 2⁴ . . . 21
2.10 The Wavelet Transform of speech plus additive white noise. The top plot is
    the original input signal and the remaining plots are the Wavelet Transform
    coefficients at scales 2¹ through 2⁴ . . . 22
2.11 Clean (red) and reverberant (blue) speech signals and corresponding Wavelet
    Transforms . . . 23
2.12 Alternating projection algorithm [Mallat and Zhong (1992)] . . . 24
2.13 Wavelet extrema reconstruction example . . . 26
2.14 SNR of the first six scales and the reconstructed signal for Wavelet Transform
    extrema reconstruction example . . . 27
2.15 Wavelet extrema reconstruction speech example . . . 27
2.16 SNR of the first six scales and reconstructed signal for Wavelet Transform
    extrema reconstruction example using a speech signal as input . . . 28
2.17 Block diagram of Bark Spectrum calculation . . . 33
2.18 Hertz-to-Bark scale transformation for the approximation and experimental
    values . . . 36
2.19 Critical band filter . . . 36
2.20 Equal loudness curves. MAF = Minimum Audible Field . . . 37
2.21 Conversion from phon scale to sone scale . . . 38

3.1 Two-channel time delay estimation problem . . . 44
3.2 Bearing angle estimates derived from independent PHAT GCC time delay
    estimates. The true source location is indicated by □ and the 8 microphones
    are indicated by ×. . . . 52
3.3 Plot of the PHAT GCC for the rightmost microphone pair in Figure 3.2.
    The true delay is −3.01 samples, while the estimated delay is −11.2 samples. . . . 53
3.4 Bearing angle estimates for consistent PHAT GCC time delay estimation.
    The true source location is indicated by □ and the 8 microphones are indicated by ×. . . . 54
3.5 Qualitative plots of the −E_LS and E_CON criteria typical of a source near the
    array for the four TDE methods considered. The source location is indicated
    by □, the source location estimate by +, and the 8 microphones (two rows of
    4) by ×. . . . 62
3.6 Response pattern for a cardioid microphone . . . 63
3.7 Qualitative plots of the −E_LS and E_CON criteria typical of a source far
    from the array for the four TDE methods considered. The source location is
    indicated by □, the source location estimate by +, and the 8 microphones
    (two rows of 4) by ×. . . . 65
3.8 Qualitative plots of the −E_LS and E_CON criteria for 2 simultaneous sources
    for the four TDE methods considered. The source location is indicated by □
    and the 8 microphones (two rows of 4) by ×. . . . 66
3.9 10 random source locations, indicated by □, used for TDE simulations. The
    microphones are positioned in two rows of 4, indicated by ×. . . . 67
3.10 Analysis of the LS (XC-LS, BEN-LS, PHAT-LS, MLE-LS) and consistency-
    based (XC-CON, BEN-CON, PHAT-CON, MLE-CON) locators for various
    reverberation levels and 15 dB SNR additive white noise . . . 68
3.11 Comparison of the consistency-based (XC-CON, BEN-CON, PHAT-CON,
    MLE-CON) locators for various reverberation levels and 15 dB SNR additive
    white noise . . . 69
3.12 Comparison of PHAT-CON accuracy results for different angle criteria . . . 70
3.13 Comparison of PHAT-CON, PHAT-CON-ALL, and MLE-CON locators for
    various reverberation levels and 15 dB SNR additive white noise . . . 70
3.14 Real room plots of the −E_LS and E_CON criteria. The source location is
    indicated by □, the source location estimate by ◦, and the 8 microphones
    (two rows of 4) by × . . . 73
3.15 Real room qualitative plots of the −E_LS and E_CON criteria. The source
    location is indicated by □, the source location estimate by ◦, and the 8 mi-
    crophones (two rows of 4) by × . . . 74

4.1 Outline of the proposed multi-channel denoising algorithm . . . 82
4.2 Speech signal with additive noise and the time-varying DFT energy of the
    586 Hz frequency bin and resulting Minimum Statistics noise estimate . . . 84
4.3 True and estimated noise variances before and after scaling to correct for
    downward bias of Minimum Statistics noise estimate . . . 85
4.4 Oversubtraction factor used for Spectral Subtraction . . . 86
4.5 Multi-channel denoising results for various SNR levels . . . 90
4.6 Real room denoising results . . . 91
4.7 Location of talker, fan noise source, and 16 microphones in real room de-
    noising experiment . . . 92
4.8 Circular array of 16 microphones used in real room denoising experiment . . . 93
4.9 Circular array experiment room setup . . . 94
4.10 Optimal SNR microphone subset selection where subset 1 is centered around
    the topmost microphone in Figure 4.9 . . . 95
4.11 Circular array denoising results . . . 96

5.1 Array response for a horizontal slice of a 4 m × 4 m room for θ = 0°, 15°,
    30°, 45°, 60°, and 75° from vertical at a frequency of 2000 Hz where the 8
    microphones are represented by × and the point of focus is indicated by ◦ . . . 102
5.2 Outline of the proposed dereverberation algorithm . . . 106
5.3 Plot of 8 reverberant channels at scale 2⁵ and clustered peaks . . . 109
5.4 Clustered reverberant peaks and scale 2⁵ of clean speech . . . 110
5.5 Short-term coherence function for scale 2³ . . . 111
5.6 Weighted peaks and scale 2⁵ of clean speech . . . 112
5.7 LPC residual of clean speech, reverberant speech, and after wavelet clustering
    technique . . . 114
5.8 Comparison frame 1: clean, reverberant, beamformed, and MC-MBD result . . . 115
5.9 Comparison frame 2: clean, reverberant, beamformed, and MC-MBD result . . . 116
5.10 Comparison frame 3: clean, reverberant, beamformed, and MC-MBD result . . . 117
5.11 Comparison of clean, reverberant, beamformed, and the proposed MC-MBD
    algorithm (Reverberation-Only Case) . . . 119
5.12 Multi-channel dereverberation results for various reverberation times . . . 120

6.1 Comparison of clean, reverberant, beamformed, and the proposed MC-MSD-
    MBD algorithm (reverberation plus noise case) . . . 123
6.2 Results for varying levels of reverberation for an additive noise level of 10
    dB SNR . . . 124
6.3 Results for varying levels of SNR for a reverberation time of 200 ms . . . 125
6.4 Outline of the overall automated videoconferencing system . . . 127
6.5 Overhead view of the room showing microphone and camera positions . . . 129
6.6 Speech enhancement results . . . 129
6.7 Face tracking example [Wang et al. (2001)] . . . 131
6.8 Examples of face orientation estimation where the detected hair region is
    yellow on the bottom right of each figure and the clock face indicates the
    estimated pose angle [Wang et al. (2001)] . . . 132
`
`
`
`List of Tables
`
2.1 Coefficients of analysis and synthesis filters . . . 18
2.2 Mean Opinion Score scoring system . . . 31
2.3 Relationship between critical band rate z, critical band limits, and critical
    band center frequencies . . . 35

6.1 Segmental SNR scores for varying values of SNR and reverberation levels for
    beamforming, independent denoising and dereverberation, and joint denois-
    ing and dereverberation . . . 125
6.2 BSD scores for varying values of SNR and reverberation levels for beam-
    forming, independent denoising and dereverberation, and joint denoising and
    dereverberation . . . 126
`
`
`
`Chapter 1
`
`Problem Overview and Motivation
`
`Speech signals have become increasingly vital components in modern human-machine inter-
`
faces. Background noise and reverberation produce aesthetically undesirable artifacts
`
`and diminish a system’s ability to convey information. Signal quality also plays a key
`
`role in the performance of speech analysis systems. For instance, speech recognition and
`
`speaker identification techniques exhibit dramatic performance degradations when low qual-
`
`ity speech signals are examined. Therefore, speech acquisition and the ensuing enhancement
`
`of that speech are both essential for the reliable performance of these interfaces.
`
`Currently, most speech acquisition systems rely on the user to be physically close to the
`
`microphone to achieve reasonable sound quality. This close-talker condition significantly
`
`simplifies the acquisition problem by emphasizing the desired signal relative to background
`
`noise and other sources, and by reducing the effects of the physical channel on the signal
`
`of interest. However, there are many situations in which it is not possible or desirable to
`
`have a talker or talkers physically linked to the acquisition device.
`
`The ultimate goal of this work is to allow users the opportunity to roam freely in
`
`diverse environments while still providing a high quality speech signal and a robustness
`
`to background noise and reverberation effects. Besides enhancing current human-machine
`
`
`
`
`
`interfaces, advances in this technology are providing for a variety of new avenues for human-
`
`machine interaction. In recent years, the use of arrays of multiple microphones has received
`
`considerable attention as a means for dramatically improving the performance of tradi-
`
`tional single-microphone systems. This multi-sensor approach has opened up a range of
`
`additional topics of study [Brandstein and Ward (2001), Flanagan and Silverman (1994),
`
`Flanagan and Silverman (1992)].
`
`For single channel systems, recent research has focused on appropriate modeling of
`
`the speech signal, while multi-channel system research has emphasized improved spatial
`
`filtering.
`
In this thesis, a microphone array speech enhancement system is presented for signals
acquired in challenging environments degraded by background noise and room reverbera-
tions. These novel speech source localization, denoising, and model-based dereverberation
algorithms perform well in the presence of such degradations. The improved performance
of the proposed system relative to classical methods is demonstrated using quantitative
speech quality measures.
`
`The three main components of the enhancement system are shown in Figure 1.1. The
`
`remainder of this chapter presents a brief introduction to each of the three components of
`
`the proposed system.
`
`1.1 Source Localization and Time Delay Estimation
`
`Nearly all multi-channel (i.e., multi-microphone) speech acquisition problems require the
`
`reliable estimation of active talker locations and the resulting time delays relative to the
`
`set of microphones employed. This is the goal of the first stage of the proposed speech en-
`
`
`
`
[Figure 1.1 shows the processing chain: multi-channel speech data → Source Localization → Background Noise Reduction → Model-Based Dereverberation → single-channel enhanced speech.]

Figure 1.1: Block diagram of speech enhancement system
`
For speech enhancement applications, accurate knowledge of the locations
`
`of the desired talker and the interference sources is necessary to effectively steer the array
`
`and enhance a desired source, while simultaneously attenuating those deemed undesirable.
`
`Location data may be used as a guide for discerning individual speakers in a multi-source
`
`scenario. With this information available, it would be possible to automatically focus on
`
`and track a given source for an extended time period. Of particular interest lately is the
`
`application of the speaker location estimates for aiming a camera or series of cameras in
`
`a video-conferencing system [Wang et al. (2001), Wang and Chu (1997)]. The source loca-
`
`tion estimate and its implied vector of relative time-delay estimates are integral parts of the
`
`microphone array system. Such information is a necessary precursor to time-alignment of
`
`microphone signals, rejection of noise sources, and the solution of the general beamforming
`
`problem.
`
`The localization algorithm presented in this thesis is effective in the presence of both
`
`background noise and reverberations, and it simultaneously produces the relative time
`
`delay estimates and a source location estimate. This approach modifies the traditional
`
`two-step localization procedure of time delay estimation followed by source localization,
`
`and simultaneously solves for each element in the estimated time delay vector. The result
`
`is a procedure that adds a level of robustness to any time delay estimation technique
`
`which attempts to maximize or minimize an appropriate objective function independently
`
`
`
`
`for the different microphone pairs in the array. Simulations are performed across a range
`
`of reverberation conditions to illustrate the efficacy of the proposed method relative to
`
`standard approaches.
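To make this concrete, the following Python sketch (a minimal illustration, not the thesis implementation; the names `gcc_phat` and `consistent_localize` and the grid-search structure are assumptions) scores each candidate source location by summing the GCC-PHAT cross-correlation values at the delays that location implies for every microphone pair, rather than trusting each pair's independently chosen peak:

```python
import numpy as np

def gcc_phat(x1, x2):
    """PHAT-weighted generalized cross-correlation, lag 0 at the center."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12                    # PHAT weighting
    cc = np.fft.irfft(cross, n)
    return np.concatenate((cc[-n // 2:], cc[:n // 2]))

def consistent_localize(signals, mic_pos, grid, fs, c=343.0):
    """Pick the location whose implied delay vector best explains ALL
    pairwise cross-correlations jointly (a consistency-based search)."""
    m = len(signals)
    pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
    ccs = {p: gcc_phat(signals[p[0]], signals[p[1]]) for p in pairs}
    n = len(next(iter(ccs.values())))
    best_loc, best_score = None, -np.inf
    for loc in grid:                                  # candidate 3-D points
        dists = np.linalg.norm(mic_pos - loc, axis=1)
        score = 0.0
        for i, j in pairs:
            tdoa = (dists[i] - dists[j]) / c          # implied delay (s)
            idx = int(round(tdoa * fs)) + n // 2
            score += ccs[(i, j)][np.clip(idx, 0, n - 1)]
        if score > best_score:
            best_loc, best_score = loc, score
    return best_loc
```

Because the cross-correlation functions are computed once and merely indexed during the search, restricting the delay estimates to physically consistent vectors adds little cost over the traditional two-step procedure.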
`
`1.2 Background Noise Reduction
`
`The second and third stages of the proposed speech enhancement system deal with the
`
`reduction of background noise and reverberations, respectively. Background noise (e.g.,
`
`fan noise) and reverberations are typically handled with different approaches, as the former
`
`degradation is additive in nature while the latter is convolutional. Both types of degradation
`
`are common and problematic in any real environment. The enhancement of speech that
`
`has been degraded by background noise is an often-researched problem, and there is a
`
`rich history of work addressing the use of single channel methods for speech enhancement.
`
`Summaries of these techniques may be found in [Ephraim (1992), Furui and Sondhi (1992),
`
`Deller et al. (1987), Lim (1983)]. While capable of improving perceived quality in certain
`
`environments (i.e., additive noise, high signal-to-noise ratios, and no reverberations), these
`
`approaches do not perform well in the face of reverberant distortions and severe noise
`
`conditions.
`
`Single channel techniques for background noise reduction typically involve some form of
`
`spectral modification (e.g., Spectral Subtraction [Boll (1979)]) or postfiltering (e.g., Wiener
`
`filtering [Deller et al. (1987), Lim (1983)]). However, many of these methods require an es-
`
`timate of the background noise spectrum. This noise spectrum is often estimated from
`
`non-speech intervals, though the detection of non-speech intervals is a difficult problem
`
`itself. A practical method for background noise estimation is the Minimum Statistics pro-
`
`cedure developed in [Martin (2001)]. This approach hypothesizes that the minimum of a
`
`given power spectral frequency bin over an extended period of time, including both speech
`
`
`
`
and non-speech intervals, provides a reasonable estimate of the noise floor, an intuitive
`
`argument. In this thesis, this technique is used in a multi-microphone setting as a precursor
`
`to a Delay-and-Sum Beamforming operation. This technique is both effective at removing
`
`background noise and capable of running in real time. Results will be presented for varying
`
`levels of signal-to-noise ratios to show how much of an improvement is offered relative to
`
`the use of beamforming alone.
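The core of the Minimum Statistics idea fits in a few lines. The sketch below is a simplified illustration, not Martin's full algorithm: the smoothing constant `alpha`, window length `win`, and bias factor `bias` are placeholder values, whereas [Martin (2001)] derives a time-varying optimal smoothing parameter and a principled bias compensation.

```python
import numpy as np

def min_statistics_noise(power_spec, win=100, alpha=0.85, bias=1.5):
    """Estimate the noise floor of each frequency bin as the minimum of
    its recursively smoothed power over a sliding window of frames.
    power_spec: (frames, bins) array of |STFT|^2 values."""
    n_frames, _ = power_spec.shape
    smoothed = np.empty_like(power_spec)
    noise = np.empty_like(power_spec)
    p = power_spec[0].copy()
    for t in range(n_frames):
        p = alpha * p + (1 - alpha) * power_spec[t]   # recursive smoothing
        smoothed[t] = p
        lo = max(0, t - win + 1)
        # the window spans both speech and pauses, so its minimum tracks
        # the noise floor; `bias` corrects the estimate's downward bias
        noise[t] = bias * smoothed[lo:t + 1].min(axis=0)
    return noise
```

In the multi-channel setting described above, such an estimate would drive a spectral gain on each microphone signal individually before the time-aligned channels are summed by the Delay-and-Sum Beamformer.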
`
`1.3 Speech Dereverberation
`
`The final stage of the enhancement system is multi-channel dereverberation. The proposed
`
`system incorporates nonlinear, model-based processing to reduce room reverberations. The
`
`goal is to combine the advantages of spatial filtering achieved through beamforming with
`
`knowledge of the desired time-series attributes and intuitive nonlinear processing. A multi-
`
`channel wavelet algorithm which incorporates these principles is presented for which large
`
`Wavelet Transform coefficients correspond to events - either speech production events or
`
`arrivals of multipath signals (i.e., reverberations). This allows for the discrimination of por-
`
`tions of the signal produced by the original speech utterance from those due to reverberation
`
`effects. Because reverberations depend largely on the geometry of the environment as well
`
`as the location of the speech source and microphones, the signal acquired at each micro-
`
`phone will be degraded to varying degrees, as evidenced by variations in Wavelet Transform
`
`coefficient extrema from channel to channel. By deciding which time delay compensated
`
`signal portions are coherent over all microphones, it is possible to recover the structure of
`
`the original speech from the noisy, received data.
`
`The algorithm is shown to be capable of identifying and attenuating reverberant portions
`
`of the speech signal. This is achieved without explicitly requiring estimation of the channel
`
`characteristics, a feature which is essential for practical applications, but which is absent
`
`
`
`
`from the majority of existing dereverberation approaches.
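The coherence-weighting step at the heart of this approach can be sketched independently of the wavelet machinery. Assuming time-aligned coefficients at a single scale from several microphones are already in hand, a hypothetical routine (illustrative only; the thesis operates on Wavelet Transform coefficient extrema, while this sketch scales whole frames by a cross-channel correlation score) might look like:

```python
import numpy as np

def coherence_weight(coeffs, frame=64, floor=0.1):
    """coeffs: (channels, samples) time-aligned transform coefficients at
    one scale. Segments that agree across channels (direct-path speech
    events) pass; segments that disagree (reverberant arrivals, which
    differ with each microphone's position) are attenuated."""
    n_ch, n = coeffs.shape
    out = np.zeros(n)
    for start in range(0, n - frame + 1, frame):
        seg = coeffs[:, start:start + frame]
        norms = np.linalg.norm(seg, axis=1) + 1e-12
        corr = (seg @ seg.T) / np.outer(norms, norms)
        iu = np.triu_indices(n_ch, k=1)               # pairs above diagonal
        coherence = np.clip(corr[iu].mean(), 0.0, 1.0)
        weight = floor + (1.0 - floor) * coherence
        out[start:start + frame] = weight * seg.mean(axis=0)
    return out           # trailing partial frame left at zero for brevity
```

The weighted coefficients would then feed the extrema-reconstruction stage, which is why no estimate of the room impulse responses is ever required.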
`
`1.4 Thesis Organization
`
`This chapter provided a brief introduction to the proposed enhancement algorithm for
`
`speech degraded by background noise and room reverberations. Chapter 2 presents some
`
`fundamentals of speech production and some speech processing tools necessary for the re-
`
`mainder of the thesis. Chapters 3 through 5 detail the three components of the enhancement
`
`system shown in Figure 1.1. Chapter 3 deals with robust time delay estimation. Chapter 4
`
`presents a multi-channel extension of Martin’s Minimum Statistics approach to background
`
`noise reduction. Chapter 5 discusses the model-based dereverberation algorithm. Chap-
`
`ter 6 presents some results of the overall system, while Chapter 7 offers some concluding
`
`remarks and avenues for future research.
`
`
`
`Chapter 2
`
`Speech Production and Processing
`
`Background
`
`This chapter discusses the speech production model as well as the mathematics of speech
`
`processing. These concepts and equations will be used throughout the remainder of the
`
`thesis and are presented here as background information for the reader new to these topics.
`
`2.1 The Speech Production Model
`
`The human speech production system consists primarily of the vocal cords, the vocal tract,
`
the nasal cavity, and the lips [Rabiner and Schafer (1978), Deller et al. (1987)]. Figure 2.1,
`
`taken from [Deller et al. (1987)], shows the different physiological elements of this system.
`
`Air is forced from the lungs through the vocal cords and into the vocal tract which consists
`
`of the pharyngeal and oral cavities. Between these two cavities hangs a fold known as the
`
`velum which opens to allow for the coupling of the vocal and nasal tracts necessary to
`
`produce nasal sounds.
`
Figure 2.2 shows a schematic diagram encompassing the main components of the speech
production system [Deller et al. (1987)].

Figure 2.1: Physiological components of speech production

Figure 2.2: Schematic diagram of speech production system

Figure 2.3: Discrete-time components of speech production

For voiced sounds such as vowels, the lungs force
`
`air through an opening in the vocal cords, known as the glottis. Adjusting the tension in
`
`the vocal cords causes them to open and close in a quasi-periodic fashion. This results in
`
`a palpable vibration of the vocal tract area. For unvoiced sounds such as /s/ and /p/,
`
`air is forced through a constriction at some point in the vocal tract and the periodicity
`
`characteristic of voiced sounds is not present. Other sounds such as /v/ and /z/ are the
`
`results of a mixed excitation signal containing both voiced and unvoiced elements. The
`
`excitation function, whether voiced, unvoiced, or mixed, drives the speech system and the
`
`resulting speech waveform is radiated from the lips (and perhaps the nostrils, should the
`
`nasal cavity be coupled to the vocal tract).
`
`The discrete-time model of speech production based on the above discussion is shown in
`
Figure 2.3 [Rabiner and Schafer (1978), Deller et al. (1987)]. Mathematically, voiced speech
`
`sounds are modeled by:
`
s[n] = e_p[n] ∗ g[n] ∗ h_v[n] ∗ r[n]        (2.1)
`
`
`
`
where e_p[n] is a quasi-periodic impulse train excitation signal, g[n] is a glottal shaping
pulse, h_v[n] is the impulse response of the vocal tract for the particular configuration of the
physiological elements, and r[n] models the radiation effects at the lips. Unvoiced speech is
`
`modeled by:
`
s[n] = e_w[n] ∗ h_v[n] ∗ r[n]        (2.2)
`
where e_w[n] is a white noise excitation signal, with h_v[n] and r[n] defined as above. In the
`
Fourier domain, Equations 2.1 and 2.2 become:

S(ω) = E_p(ω) G(ω) H_v(ω) R(ω)        (2.3)

and

S(ω) = E_w(ω) H_v(ω) R(ω)        (2.4)

respectively.
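Equation 2.1 can be made concrete with a short synthesis example. In the Python sketch below, every numerical choice is an assumption made for illustration (the pitch, the exponential stand-in for the glottal pulse, and the /a/-like formant frequencies and bandwidths), not a value taken from this thesis:

```python
import numpy as np
from scipy.signal import lfilter

fs, f0, n = 8000, 100, 8000          # sample rate, pitch (Hz), 1 s of samples

e_p = np.zeros(n)                    # e_p[n]: quasi-periodic impulse train
e_p[::fs // f0] = 1.0

g = np.exp(-np.arange(64) / 12.0)    # g[n]: crude glottal shaping pulse

a = np.array([1.0])                  # h_v[n]: all-pole vocal tract built from
for freq, bw in [(730, 60), (1090, 70), (2440, 110)]:   # /a/-like formants
    r = np.exp(-np.pi * bw / fs)
    a = np.convolve(a, [1.0, -2 * r * np.cos(2 * np.pi * freq / fs), r * r])

excitation = np.convolve(e_p, g)[:n]           # e_p[n] * g[n]
voiced = lfilter([1.0], a, excitation)         # ... * h_v[n]
speech = lfilter([1.0, -1.0], [1.0], voiced)   # ... * r[n], lip radiation as
                                               # a first difference (Eq. 2.1)
# For unvoiced sounds (Eq. 2.2), replace the excitation with white noise:
# speech = lfilter([1.0, -1.0], [1.0], lfilter([1.0], a, np.random.randn(n)))
```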
`
`Returning to voiced sounds, as pressure beneath the glottis increases, the vocal cords
`
`are forced open. However, as the glottis opens and the air volume velocity increases,
`
`the accompanying decrease in pressure forces the glottis closed again, a process that is
`
`repeated throughout the voiced sound. The time between successive glottal closures (or
`
`openings) can be defined as the fundamental period, or pitch, of the speech signal. The
`
`fundam