A Microphone Array System for Speech Source Localization, Denoising, and Dereverberation

A thesis presented

by

Scott Matthew Griebel

to

The Division of Engineering and Applied Sciences
in partial fulfillment of the requirements
for the degree of

Doctor of Philosophy

in the subject of

Engineering Sciences

Harvard University
Cambridge, Massachusetts

April 2002
© 2002 Scott Matthew Griebel
All rights reserved.
Thesis advisor: Michael S. Brandstein

Author: Scott M. Griebel

A Microphone Array System for Speech Source Localization, Denoising, and Dereverberation

Abstract
There is a great deal of potential for advancement in distant-talker speech acquisition research, and a wealth of current and future technology depends upon these advances. The goal of this work is to allow users the opportunity to roam unfettered in diverse environments while still providing a high quality speech signal and robustness to background noise and reverberation effects. In this thesis, a microphone array speech enhancement system is presented which has three main components: source localization, background noise reduction, and dereverberation. The localization algorithm is effective in the presence of both background noise and reverberations and simultaneously produces relative time delay estimates and a source location estimate. It provides a procedure applicable to all time delay estimators which either maximize or minimize an appropriate objective function, improving the estimators' robustness to environmental degradations. The denoising algorithm is a multi-microphone extension of the Minimum Statistics denoising technique [Martin (2001)]. This algorithm also has an additional, optional SNR-dependent beamforming stage that is shown to be very useful in certain environments. The final component is a multi-channel dereverberation algorithm which models the speech source and room reverberations independently. A weighting function is estimated and applied in the Wavelet Transform domain to de-emphasize portions which are less coherent across microphone signals, an indication of reverberation effects. Results for the various components are provided as proof of the effectiveness of the proposed multi-microphone speech enhancement system.
Acknowledgements

I would like to express my gratitude for the advising and friendship of Prof. Michael Brandstein, whose guidance in matters both personal and professional was always helpful. I also thank my thesis committee members Prof. Roger Brockett and Prof. Irvin Schick for their input over the years.

I would like to acknowledge my partners in crime at the Harvard Multi-Media Lab: Daniel Freedman (the smartest man I know), Ce Wang (who would die for his dream), and Daniel Kirsanov (who was always with basketball, but never played basketball). Those outside the lab who were equally appreciated include Soundouss Bouhia, Anthony Volpe, Winston Yu, and Brendan Zinn. They all sparked numerous pointless conversations that were often needed to break up a day's work.

I also thank Homer J. Simpson for his weekly words of wisdom. His knowledge regarding nuclear engineering proved helpful even in my unrelated field of speech signal processing.

I would also like to express my appreciation to my parents for their caring and support throughout my graduate career.

Finally, I thank my wonderful wife Kristen (the smartest woman I know). She waited patiently for me to begin providing some sort of real salary, but always understood that my quest for "higher knowledge" was, to me at least, well worth the trip.
If there's a way to do it better... find it.

Thomas A. Edison

Every time I learn something new, it pushes out something old.

Homer J. Simpson
Contents

1 Problem Overview and Motivation
  1.1 Source Localization and Time Delay Estimation
  1.2 Background Noise Reduction
  1.3 Speech Dereverberation
  1.4 Thesis Organization

2 Speech Production and Processing Background
  2.1 The Speech Production Model
  2.2 Speech Signal Processing Tools
    2.2.1 The Fourier Transform
    2.2.2 The Short-Time Fourier Transform
    2.2.3 The Wavelet Transform
  2.3 Reverberant Channel Modeling
  2.4 Quantitative Speech Quality Measures
    2.4.1 Segmental Signal-to-Noise Ratio
    2.4.2 Bark Spectral Distortion

3 Robust Multi-Channel Source Localization
  3.1 Traditional TDE-based Localization
    3.1.1 The Mathematics of Time Delay Estimation
    3.1.2 Generalized Cross-Correlation Time Delay Estimation
  3.2 Localization Using Consistent Delay Vectors
  3.3 Consistent Multi-Channel Maximum Likelihood Time Delay Estimation
  3.4 Practical Time Delay Estimation Systems
    3.4.1 Discrete Time Delay Estimation
    3.4.2 Voice Activity Detection
    3.4.3 Numerical Techniques
  3.5 Experimental Results
    3.5.1 Simulation Results
    3.5.2 Real Room Results
  3.6 Source Localization Summary
4 Background Noise Reduction
  4.1 Single Channel Approaches
    4.1.1 Spectral Subtraction
    4.1.2 Wiener Filtering
    4.1.3 Other Techniques
  4.2 Multi-Channel Approaches
  4.3 Background Noise Estimation and Voice Activity Detection
  4.4 Multi-Channel Minimum Statistics Denoising (MC-MSD)
  4.5 Experimental Results
    4.5.1 Simulations
    4.5.2 Real Room Array
    4.5.3 Real Room Circular Array
  4.6 Denoising Summary
5 Multi-Microphone Model-Based Dereverberation
  5.1 Single-Channel Approaches
  5.2 Multi-Channel Approaches
  5.3 Multi-Channel Model-Based Dereverberation Algorithm (MC-MBD)
    5.3.1 Joint Inverse LPC Filter
    5.3.2 Wavelet Transform of LPC Residual Signals
    5.3.3 Wavelet Transform Coefficient Extrema Clustering
    5.3.4 Short-Term Coherence Scaling
    5.3.5 Wavelet Transform Coefficient Extrema Reconstruction
    5.3.6 Forward LPC Filter
  5.4 Experimental Results
  5.5 Dereverberation Summary

6 System Performance
  6.1 Simulations of Denoising and Dereverberation
  6.2 Speech Processing for Automated Videoconferencing
    6.2.1 Audio Processing
    6.2.2 Video Processing
  6.3 System Summary

7 Conclusion
  7.1 Discussion
  7.2 Future Work

A The Multi-Channel Maximum Likelihood Time Delay Estimator
List of Figures

1.1 Block diagram of speech enhancement system

2.1 Physiological components of speech production
2.2 Schematic diagram of speech production system
2.3 Discrete-time components of speech production
2.4 Glottal waveform example
2.5 Two stages of the Wavelet Transform analysis filter bank for the Algorithme à Trous, where $\bar{h}_k[n] = h_k[-n]$
2.6 Two stages of the Wavelet Transform synthesis filter bank for the Algorithme à Trous
2.7 The wavelet and scaling functions
2.8 A Wavelet Transform example using a derivative mother wavelet. The top plot is the original input signal and the remaining plots are the Wavelet Transform coefficients at scales $2^1$ through $2^4$
2.9 An example of the Wavelet Transform of a speech signal. The top plot is the original input signal and the remaining plots are the Wavelet Transform coefficients at scales $2^1$ through $2^4$
2.10 The Wavelet Transform of speech plus additive white noise. The top plot is the original input signal and the remaining plots are the Wavelet Transform coefficients at scales $2^1$ through $2^4$
2.11 Clean (red) and reverberant (blue) speech signals and corresponding Wavelet Transforms
2.12 Alternating projection algorithm [Mallat and Zhong (1992)]
2.13 Wavelet extrema reconstruction example
2.14 SNR of the first six scales and the reconstructed signal for Wavelet Transform extrema reconstruction example
2.15 Wavelet extrema reconstruction speech example
2.16 SNR of the first six scales and reconstructed signal for Wavelet Transform extrema reconstruction example using a speech signal as input
2.17 Block diagram of Bark Spectrum calculation
2.18 Hertz-to-Bark scale transformation for the approximation and experimental values
2.19 Critical band filter
2.20 Equal loudness curves. MAF = Minimum Audible Field
2.21 Conversion from phon scale to sone scale
3.1 Two-channel time delay estimation problem
3.2 Bearing angle estimates derived from independent PHAT GCC time delay estimates. The true source location is indicated by □ and the 8 microphones are indicated by ×.
3.3 Plot of the PHAT GCC for the rightmost microphone pair in Figure 3.2. The true delay is −3.01 samples, while the estimated delay is −11.2 samples.
3.4 Bearing angle estimates for consistent PHAT GCC time delay estimation. The true source location is indicated by □ and the 8 microphones are indicated by ×.
3.5 Qualitative plots of the $-E_{LS}$ and $E_{CON}$ criteria typical of a source near the array for the four TDE methods considered. The source location is indicated by □, the source location estimate by +, and the 8 microphones (two rows of 4) by ×.
3.6 Response pattern for a cardioid microphone
3.7 Qualitative plots of the $-E_{LS}$ and $E_{CON}$ criteria typical of a source far from the array for the four TDE methods considered. The source location is indicated by □, the source location estimate by +, and the 8 microphones (two rows of 4) by ×.
3.8 Qualitative plots of the $-E_{LS}$ and $E_{CON}$ criteria for 2 simultaneous sources for the four TDE methods considered. The source location is indicated by □ and the 8 microphones (two rows of 4) by ×.
3.9 10 random source locations, indicated by □, used for TDE simulations. The microphones are positioned in two rows of 4, indicated by ×.
3.10 Analysis of the LS (XC-LS, BEN-LS, PHAT-LS, MLE-LS) and consistency-based (XC-CON, BEN-CON, PHAT-CON, MLE-CON) locators for various reverberation levels and 15 dB SNR additive white noise
3.11 Comparison of the consistency-based (XC-CON, BEN-CON, PHAT-CON, MLE-CON) locators for various reverberation levels and 15 dB SNR additive white noise
3.12 Comparison of PHAT-CON accuracy results for different angle criteria
3.13 Comparison of PHAT-CON, PHAT-CON-ALL, and MLE-CON locators for various reverberation levels and 15 dB SNR additive white noise
3.14 Real room plots of the $-E_{LS}$ and $E_{CON}$ criteria. The source location is indicated by □, the source location estimate by ◦, and the 8 microphones (two rows of 4) by ×
3.15 Real room qualitative plots of the $-E_{LS}$ and $E_{CON}$ criteria. The source location is indicated by □, the source location estimate by ◦, and the 8 microphones (two rows of 4) by ×

4.1 Outline of the proposed multi-channel denoising algorithm
4.2 Speech signal with additive noise and the time-varying DFT energy of the 586 Hz frequency bin and resulting Minimum Statistics noise estimate
4.3 True and estimated noise variances before and after scaling to correct for downward bias of Minimum Statistics noise estimate
4.4 Oversubtraction factor used for Spectral Subtraction
4.5 Multi-channel denoising results for various SNR levels
4.6 Real room denoising results
4.7 Location of talker, fan noise source, and 16 microphones in real room denoising experiment
4.8 Circular array of 16 microphones used in real room denoising experiment
4.9 Circular array experiment room setup
4.10 Optimal SNR microphone subset selection where subset 1 is centered around the topmost microphone in Figure 4.9
4.11 Circular array denoising results

5.1 Array response for a horizontal slice of a 4 m × 4 m room for θ = 0°, 15°, 30°, 45°, 60°, and 75° from vertical at a frequency of 2000 Hz where the 8 microphones are represented by × and the point of focus is indicated by ◦
5.2 Outline of the proposed dereverberation algorithm
5.3 Plot of 8 reverberant channels at scale $2^5$ and clustered peaks
5.4 Clustered reverberant peaks and scale $2^5$ of clean speech
5.5 Short-term coherence function for scale $2^3$
5.6 Weighted peaks and scale $2^5$ of clean speech
5.7 LPC residual of clean speech, reverberant speech, and after wavelet clustering technique
5.8 Comparison frame 1: clean, reverberant, beamformed, and MC-MBD result
5.9 Comparison frame 2: clean, reverberant, beamformed, and MC-MBD result
5.10 Comparison frame 3: clean, reverberant, beamformed, and MC-MBD result
5.11 Comparison of clean, reverberant, beamformed, and the proposed MC-MBD algorithm (reverberation-only case)
5.12 Multi-channel dereverberation results for various reverberation times

6.1 Comparison of clean, reverberant, beamformed, and the proposed MC-MSD-MBD algorithm (reverberation plus noise case)
6.2 Results for varying levels of reverberation for an additive noise level of 10 dB SNR
6.3 Results for varying levels of SNR for a reverberation time of 200 ms
6.4 Outline of the overall automated videoconferencing system
6.5 Overhead view of the room showing microphone and camera positions
6.6 Speech enhancement results
6.7 Face tracking example [Wang et al. (2001)]
6.8 Examples of face orientation estimation where the detected hair region is yellow on the bottom right of each figure and the clock face indicates the estimated pose angle [Wang et al. (2001)]
List of Tables

2.1 Coefficients of analysis and synthesis filters
2.2 Mean Opinion Score scoring system
2.3 Relationship between critical band rate z, critical band limits, and critical band center frequencies

6.1 Segmental SNR scores for varying values of SNR and reverberation levels for beamforming, independent denoising and dereverberation, and joint denoising and dereverberation
6.2 BSD scores for varying values of SNR and reverberation levels for beamforming, independent denoising and dereverberation, and joint denoising and dereverberation
Chapter 1

Problem Overview and Motivation

Speech signals have become increasingly vital components in modern human-machine interfaces. Background noise and reverberation produce aesthetically undesirable artifacts and diminish a system's ability to convey information. Signal quality also plays a key role in the performance of speech analysis systems. For instance, speech recognition and speaker identification techniques exhibit dramatic performance degradations when low quality speech signals are examined. Therefore, speech acquisition and the ensuing enhancement of that speech are both essential for the reliable performance of these interfaces.

Currently, most speech acquisition systems rely on the user to be physically close to the microphone to achieve reasonable sound quality. This close-talker condition significantly simplifies the acquisition problem by emphasizing the desired signal relative to background noise and other sources, and by reducing the effects of the physical channel on the signal of interest. However, there are many situations in which it is not possible or desirable to have a talker or talkers physically linked to the acquisition device.

The ultimate goal of this work is to allow users the opportunity to roam freely in diverse environments while still providing a high quality speech signal and robustness to background noise and reverberation effects. Besides enhancing current human-machine interfaces, advances in this technology are providing a variety of new avenues for human-machine interaction. In recent years, the use of arrays of multiple microphones has received considerable attention as a means for dramatically improving the performance of traditional single-microphone systems. This multi-sensor approach has opened up a range of additional topics of study [Brandstein and Ward (2001), Flanagan and Silverman (1994), Flanagan and Silverman (1992)].

For single channel systems, recent research has focused on appropriate modeling of the speech signal, while multi-channel system research has emphasized improved spatial filtering.

In this thesis, a microphone array speech enhancement system is presented for signals acquired in challenging environments degraded by background noise and room reverberations. These novel speech source localization, denoising, and model-based dereverberation algorithms perform well in the presence of such degradations. The improved performance of the proposed system relative to classical methods is demonstrated using quantitative speech quality measures.

The three main components of the enhancement system are shown in Figure 1.1. The remainder of this chapter presents a brief introduction to each of the three components of the proposed system.
1.1 Source Localization and Time Delay Estimation

Nearly all multi-channel (i.e., multi-microphone) speech acquisition problems require the reliable estimation of active talker locations and the resulting time delays relative to the set of microphones employed. This is the goal of the first stage of the proposed speech enhancement system.

Figure 1.1: Block diagram of speech enhancement system (multi-channel speech data feeds, in sequence, the source localization, background noise reduction, and model-based dereverberation stages to produce a single-channel enhanced speech signal)

For speech enhancement applications, accurate knowledge of the locations of the desired talker and the interference sources is necessary to effectively steer the array and enhance a desired source, while simultaneously attenuating those deemed undesirable. Location data may be used as a guide for discerning individual speakers in a multi-source scenario. With this information available, it would be possible to automatically focus on and track a given source for an extended time period. Of particular interest lately is the application of the speaker location estimates for aiming a camera or series of cameras in a video-conferencing system [Wang et al. (2001), Wang and Chu (1997)]. The source location estimate and its implied vector of relative time-delay estimates are integral parts of the microphone array system. Such information is a necessary precursor to time-alignment of microphone signals, rejection of noise sources, and the solution of the general beamforming problem.
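For reference, the relative time delay implied by a source at location $\mathbf{x}_s$ for a microphone pair at $\mathbf{m}_i$ and $\mathbf{m}_j$ can be written as follows (the notation here is illustrative, not the thesis's own):

$$\tau_{ij}(\mathbf{x}_s) = \frac{\|\mathbf{x}_s - \mathbf{m}_i\| - \|\mathbf{x}_s - \mathbf{m}_j\|}{c},$$

where $c$ is the speed of sound. A delay vector is geometrically consistent precisely when every pairwise delay takes this form for a single common $\mathbf{x}_s$; this is the constraint exploited by the consistent delay vector formulation of Chapter 3.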
The localization algorithm presented in this thesis is effective in the presence of both background noise and reverberations, and it simultaneously produces the relative time delay estimates and a source location estimate. This approach modifies the traditional two-step localization procedure of time delay estimation followed by source localization, and simultaneously solves for each element in the estimated time delay vector. The result is a procedure that adds a level of robustness to any time delay estimation technique which attempts to maximize or minimize an appropriate objective function independently for the different microphone pairs in the array. Simulations are performed across a range of reverberation conditions to illustrate the efficacy of the proposed method relative to standard approaches.
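Chapter 3 builds its consistency-based localizer on top of standard pairwise time delay estimators such as the PHAT-weighted Generalized Cross-Correlation (GCC). As a point of reference, the following is a minimal NumPy sketch of a two-microphone GCC-PHAT delay estimate; the function name, the regularization constant, and the max_tau search bound are illustrative choices, not details taken from the thesis.

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """Estimate the delay of x2 relative to x1 (in seconds) with the
    PHAT-weighted Generalized Cross-Correlation."""
    n = len(x1) + len(x2)                  # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    # PHAT weighting: whiten the cross-spectrum, keeping only its phase
    cross = X2 * np.conj(X1)
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:                # limit the search to physical delays
        max_shift = min(int(round(fs * max_tau)), max_shift)
    # re-center so that index 0 of cc corresponds to a lag of -max_shift samples
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs
```

A conventional two-step localizer would run such an estimator independently on each microphone pair and then fit a source location to the resulting delays; the method of Chapter 3 instead restricts the search to delay vectors that are geometrically consistent with some source location.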
1.2 Background Noise Reduction

The second and third stages of the proposed speech enhancement system deal with the reduction of background noise and reverberations, respectively. Background noise (e.g., fan noise) and reverberations are typically handled with different approaches, as the former degradation is additive in nature while the latter is convolutional. Both types of degradation are common and problematic in any real environment. The enhancement of speech that has been degraded by background noise is an often-researched problem, and there is a rich history of work addressing the use of single channel methods for speech enhancement. Summaries of these techniques may be found in [Ephraim (1992), Furui and Sondhi (1992), Deller et al. (1987), Lim (1983)]. While capable of improving perceived quality in certain environments (i.e., additive noise, high signal-to-noise ratios, and no reverberations), these approaches do not perform well in the face of reverberant distortions and severe noise conditions.

Single channel techniques for background noise reduction typically involve some form of spectral modification (e.g., Spectral Subtraction [Boll (1979)]) or postfiltering (e.g., Wiener filtering [Deller et al. (1987), Lim (1983)]). However, many of these methods require an estimate of the background noise spectrum. This noise spectrum is often estimated from non-speech intervals, though the detection of non-speech intervals is a difficult problem in itself. A practical method for background noise estimation is the Minimum Statistics procedure developed in [Martin (2001)]. This approach hypothesizes, intuitively, that the minimum of a given power spectral frequency bin over an extended period of time, including both speech and non-speech intervals, provides a reasonable estimate of the noise floor. In this thesis, this technique is used in a multi-microphone setting as a precursor to a Delay-and-Sum Beamforming operation. The technique is both effective at removing background noise and capable of running in real time. Results will be presented for varying levels of signal-to-noise ratios to show how much of an improvement is offered relative to the use of beamforming alone.
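To make the Minimum Statistics idea concrete, the following is a rough NumPy sketch of a per-bin noise floor estimate in the spirit of [Martin (2001)]: the minimum of the recursively smoothed short-time power spectrum over a sliding window of frames, scaled up to offset the estimate's downward bias. The smoothing constant, window length, and bias factor below are illustrative placeholders; Martin's method derives an optimal time-varying smoothing parameter and bias compensation.

```python
import numpy as np

def noise_floor_estimate(power_spec, search_frames=100,
                         alpha=0.85, bias_comp=1.5):
    """power_spec: array of shape (frames, bins) holding |STFT|^2 values.
    Returns a per-frame, per-bin noise power estimate."""
    frames, _ = power_spec.shape
    # first-order recursive smoothing of the periodogram along time
    smoothed = np.empty_like(power_spec)
    smoothed[0] = power_spec[0]
    for t in range(1, frames):
        smoothed[t] = alpha * smoothed[t - 1] + (1 - alpha) * power_spec[t]
    # per-bin running minimum over the most recent `search_frames` frames
    noise = np.empty_like(power_spec)
    for t in range(frames):
        lo = max(0, t - search_frames + 1)
        noise[t] = smoothed[lo:t + 1].min(axis=0)
    return bias_comp * noise   # crude stand-in for the derived bias correction
```

The resulting estimate requires no explicit voice activity detection, since speech pauses occur often enough that the running minimum tracks the noise floor even through long stretches of speech.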
1.3 Speech Dereverberation

The final stage of the enhancement system is multi-channel dereverberation. The proposed system incorporates nonlinear, model-based processing to reduce room reverberations. The goal is to combine the advantages of spatial filtering achieved through beamforming with knowledge of the desired time-series attributes and intuitive nonlinear processing. A multi-channel wavelet algorithm which incorporates these principles is presented, in which large Wavelet Transform coefficients correspond to events: either speech production events or arrivals of multipath signals (i.e., reverberations). This allows portions of the signal produced by the original speech utterance to be discriminated from those due to reverberation effects. Because reverberations depend largely on the geometry of the environment as well as the locations of the speech source and microphones, the signal acquired at each microphone will be degraded to varying degrees, as evidenced by variations in Wavelet Transform coefficient extrema from channel to channel. By deciding which time delay compensated signal portions are coherent over all microphones, it is possible to recover the structure of the original speech from the noisy, received data.

The algorithm is shown to be capable of identifying and attenuating reverberant portions of the speech signal. This is achieved without explicitly requiring estimation of the channel characteristics, a feature which is essential for practical applications but which is absent from the majority of existing dereverberation approaches.
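The coherence intuition can be illustrated with a toy time-domain sketch: after delay compensation, the channels are averaged (a delay-and-sum beamformer), and each short window of the output is scaled by how well the channels agree in that window. This is only the underlying idea; the actual algorithm of Chapter 5 applies such weighting to Wavelet Transform coefficient extrema of the LPC residual signals rather than to the raw waveform, and the window length and weighting rule below are illustrative assumptions.

```python
import numpy as np

def coherence_weighted_sum(channels, win=256):
    """channels: array of shape (num_mics, num_samples), delay-compensated.
    Returns a beamformed signal with low-coherence windows attenuated."""
    m, n = channels.shape
    out = channels.mean(axis=0).copy()       # delay-and-sum output
    for start in range(0, n - win + 1, win):
        seg = channels[:, start:start + win]
        sum_power = np.sum(seg.mean(axis=0) ** 2)              # power of aligned sum
        avg_power = np.mean(np.sum(seg ** 2, axis=1)) + 1e-12  # mean per-channel power
        weight = sum_power / avg_power       # near 1 if coherent, near 1/m if not
        out[start:start + win] *= weight
    return out
```

For perfectly coherent channels the weight approaches one and the window passes unchanged, while for incoherent (reverberation-dominated) windows it falls toward 1/m, attenuating the portions least likely to stem from the direct speech signal.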
1.4 Thesis Organization

This chapter provided a brief introduction to the proposed enhancement algorithm for speech degraded by background noise and room reverberations. Chapter 2 presents some fundamentals of speech production and some speech processing tools necessary for the remainder of the thesis. Chapters 3 through 5 detail the three components of the enhancement system shown in Figure 1.1. Chapter 3 deals with robust time delay estimation. Chapter 4 presents a multi-channel extension of Martin's Minimum Statistics approach to background noise reduction. Chapter 5 discusses the model-based dereverberation algorithm. Chapter 6 presents some results of the overall system, while Chapter 7 offers some concluding remarks and avenues for future research.
Chapter 2

Speech Production and Processing Background

This chapter discusses the speech production model as well as the mathematics of speech processing. These concepts and equations will be used throughout the remainder of the thesis and are presented here as background information for the reader new to these topics.
2.1 The Speech Production Model

The human speech production system consists primarily of the vocal cords, the vocal tract, the nasal cavity, and the lips [Rabiner and Schafer (1978), Deller et al. (1987)]. Figure 2.1, taken from [Deller et al. (1987)], shows the different physiological elements of this system. Air is forced from the lungs through the vocal cords and into the vocal tract, which consists of the pharyngeal and oral cavities. Between these two cavities hangs a fold known as the velum, which opens to allow for the coupling of the vocal and nasal tracts necessary to produce nasal sounds.

Figure 2.1: Physiological components of speech production

Figure 2.2: Schematic diagram of speech production system

Figure 2.2 shows a schematic diagram encompassing the main components of the speech production system [Deller et al. (1987)]. For voiced sounds such as vowels, the lungs force air through an opening in the vocal cords, known as the glottis. Adjusting the tension in the vocal cords causes them to open and close in a quasi-periodic fashion. This results in a palpable vibration of the vocal tract area. For unvoiced sounds such as /s/ and /p/, air is forced through a constriction at some point in the vocal tract, and the periodicity characteristic of voiced sounds is not present. Other sounds such as /v/ and /z/ are the results of a mixed excitation signal containing both voiced and unvoiced elements. The excitation function, whether voiced, unvoiced, or mixed, drives the speech system, and the resulting speech waveform is radiated from the lips (and perhaps the nostrils, should the nasal cavity be coupled to the vocal tract).

Figure 2.3: Discrete-time components of speech production

The discrete-time model of speech production based on the above discussion is shown in Figure 2.3 [Rabiner and Schafer (1978), Deller et al. (1987)]. Mathematically, voiced speech sounds are modeled by:

$$s[n] = e_p[n] \ast g[n] \ast h_v[n] \ast r[n] \tag{2.1}$$
where $e_p[n]$ is a quasi-periodic impulse train excitation signal, $g[n]$ is a glottal shaping pulse, $h_v[n]$ is the impulse response of the vocal tract for the particular configuration of the physiological elements, and $r[n]$ models the radiation effects at the lips. Unvoiced speech is modeled by:

$$s[n] = e_w[n] \ast h_v[n] \ast r[n] \tag{2.2}$$

where $e_w[n]$ is a white noise excitation signal, with $h_v[n]$ and $r[n]$ defined as above. In the Fourier domain, Equation 2.1 and Equation 2.2 become:

$$S(\omega) = E_p(\omega)\,G(\omega)\,H_v(\omega)\,R(\omega) \tag{2.3}$$

and

$$S(\omega) = E_w(\omega)\,H_v(\omega)\,R(\omega) \tag{2.4}$$

respectively.
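As a concrete illustration of Equation 2.1, the short NumPy/SciPy sketch below synthesizes a crude voiced sound by convolving an impulse train with a glottal pulse, an all-pole vocal tract filter, and a first-difference radiation model. The pitch, glottal pulse shape, and formant frequencies are illustrative placeholders, not values from the thesis.

```python
import numpy as np
from scipy.signal import lfilter

fs, f0, n = 8000, 120, 4000        # sample rate, pitch, length (all illustrative)

e_p = np.zeros(n)                  # e_p[n]: quasi-periodic impulse train
e_p[::fs // f0] = 1.0

t = np.arange(40)                  # g[n]: crude smooth, decaying glottal pulse
g = np.hanning(80)[:40] * np.exp(-t / 20.0)

# h_v[n]: vocal tract as an all-pole filter with placeholder formants
poles = []
for f, r in [(500, 0.97), (1500, 0.96), (2500, 0.95)]:
    w = 2 * np.pi * f / fs
    poles += [r * np.exp(1j * w), r * np.exp(-1j * w)]
a = np.real(np.poly(poles))        # denominator coefficients of H_v(z)

excitation = np.convolve(e_p, g)[:n]       # e_p[n] * g[n]
s = lfilter([1.0], a, excitation)          # ... * h_v[n]
s = np.convolve(s, [1.0, -1.0])[:n]        # r[n] as first difference: s[n] - s[n-1]
```

Replacing the impulse train with white noise and dropping the glottal pulse yields the unvoiced model of Equation 2.2.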
Returning to voiced sounds, as pressure beneath the glottis increases, the vocal cords are forced open. However, as the glottis opens and the air volume velocity increases, the accompanying decrease in pressure forces the glottis closed again, a process that is repeated throughout the voiced sound. The time between successive glottal closures (or openings) can be defined as the fundamental period, or pitch, of the speech signal. The fundam