`
`Reverberant Environments Using Microphone Arrays
`
`by
`
`Joseph Hector DiBiase
`
`B.S., Trinity College, 1991
`
`Sc.M., Brown University, 1993
`
`Thesis
`
`Submitted in partial fulfillment of the requirements for
`
`the Degree of Doctor of Philosophy
`
`in the Division of Engineering at Brown University
`
`Providence, Rhode Island
`
`May 2000
`
`Page 1 of 122
`
`SONOS EXHIBIT 1021
`
`
`
`
`
`© Copyright
`
`by
`
`Joseph Hector DiBiase
`
`2000
`
`
`
`
`
`
`
`
`
`
`The Vita of Joseph Hector DiBiase
`
`Joseph was born December 26, 1969 in Providence, Rhode Island. He grew up in Cranston, Rhode Island
`
`until he and his family moved to Jamestown, Rhode Island in 1979. He attended Trinity College in
`
`Hartford, Connecticut from 1987 until 1991, graduating with a Bachelor of Science degree in Electrical
`
`Engineering with departmental honors. He went on to Brown University in Providence, Rhode Island to
`
`study signal processing and began research on microphone arrays. He received a Master of Science degree
`
`in Electrical Engineering in 1993 and continued to pursue his work towards a Doctor of Philosophy degree.
`
`While a student at Brown, he held several appointments as a research assistant. He also held several
`
`appointments as a teaching assistant for various electrical engineering courses.
`
`
`
`
`
`
`
`
`Acknowledgements
`
I wish to thank my advisor, Professor Harvey Silverman, for his constant support and feedback. Thanks to
`
`the other members of the microphone array group for their efforts to keep the project going: John Adcock,
`
`Michael Brandstein, Stu Kirtman and Paul Meuse. I thank all my fellow graduate students, my friends, for
`
`all those special lunches, inside and outside: Michael Blane, Aaron Smith, Mike Wazlowski and the array
`
`group. Thanks to my readers, Professor Michael Brandstein and Professor David Cooper for their
`
`corrections to my thesis and their attentiveness during my defense. Thanks to Professor Bill Patterson for
`
`answering a wide range of questions for me. Thanks to the Department of Engineering for the research and
`
`teaching assistant appointments that were granted to me as I worked towards my degree. Thanks to LEMS
`
`for providing the laboratory facilities for my research and to Arpie Kaloustian for all his technical
`
`assistance on the array systems and other equipment in the lab. I would also like to thank my parents, Peter
`
`and Linda, for their investment in my education and their assurance that this was a worthwhile goal. And
`
`to my wife, Furhana, for providing support when I needed it the most and for helping me laugh during this
`
`long and challenging process.
`
`
`
`
`
`
`
`
`
`Contents
`
1 Introduction
1.1 Methods for Pairwise Time-Delay Estimation
1.2 Methods for Steered-Beamformer Localization
1.3 This Thesis

2 Sound Wave Propagation
2.1 Simple Acoustic Conditions
2.2 Direct Path Propagation
2.3 Multi-Path Propagation and the Room Impulse Response
2.4 A Hybrid Multi-Path Model
2.5 Microphone Signal Model
2.6 Direction of Propagation and Arrival
2.6.1 Direction of Propagation
2.6.2 Near Field versus Far Field
2.6.3 Direction of Arrival (DOA)

3 Microphone Array Data: Acquisition and Processing
3.1 The Brown Megamike II
3.2 The Conference-Room data set
3.3 Signal-to-Noise Power
3.4 Processing the Microphone Signals in Blocks
3.5 Speech/Silence Detection: Block SNR and the SNR Mask
3.6 Measuring Room Impulse Responses
3.6.1 Least-Squares Fit to Input-Output Data
3.6.2 Application to the Conference-Room Data Set
3.6.3 The Conference Room Reverberation Time

4 Generalized Cross Correlation (GCC)
4.1 GCC Defined
4.1.1 Maximum Likelihood (ML) Weighting Function
4.1.2 The Phase Transform (PHAT) Weighting Function
4.1.3 Bandpass Weighting Function
4.2 Implementation of GCC
4.3 RMS TDOA Error for an Array
4.4 Source Localization by Minimization of the RMS TDOA Error

5 Experimental Performance Evaluations of GCC
5.1 GCC Experiment #1: TDOA Estimation with a Single Pair of Microphones
5.1.1 TDOA Estimation
5.1.2 Experimental Results and Discussion
5.2 GCC Experiment #2: RMS TDOA Error with a Triad Array
5.2.1 RMS TDOA Errors
5.2.2 RMS TDOA Error Rates with Gaussian Sources
5.2.3 RMS TDOA Error Rates with Speech Sources
5.3 GCC Experiment #3: DOA Estimation with an 8-Element Array
5.3.1 DOA Estimation by Minimization of the RMS TDOA Errors
5.3.2 Visualizing the RMS TDOA Error
5.3.3 GCC Time-averaging

6 The Steered Response Power (SRP)
6.1 Beamforming
6.2 The Steered Response
6.3 SRP in Terms of GCC
6.4 Combining the Phase Transform and Steered Response Power: SRP-PHAT
6.5 Implementation of SRP
6.6 Time Averaging versus Spatial Averaging

7 Experimental Performance Comparisons of SRP, SRP-PHAT and GCC-PHAT
7.1 Experiment #1: DOA Estimation with an 8-Element Array
7.1.1 Performance Comparison
7.1.2 Visualizing the Steered Response Power
7.2 Experiment #2: DOA Estimation with a 15-Element Array
7.2.1 Performance Comparison
7.3 Experiment #3: 3D Source Localization using the Huge Microphone Array (HMA)
7.3.1 Data and Setup
7.3.2 Location Estimation
7.3.3 Experimental Results
7.3.4 Multi-talker Resolution

8 Summary, Conclusions and Future Work
8.1 Summary
8.2 Computational Complexity
8.3 Future Work

Bibliography
`
`
`
`
`
`
`
`
`
`List of Tables
`
3.1 Locations and DOAs for the Conference Room Sources
`
`
`
`
`
`
`
`
`
`List of Figures
`
2.1 A Source and Microphone Located in a Cartesian Coordinate System
2.2 Propagation Vectors
2.3 Definition of DOA in Terms of Azimuth, Elevation, and Propagation Vector
3.1 A Picture of the Brown Megamike II
3.2 The Megamike Recorder’s Application Window
3.3 The Megamike’s Channel Meters
3.4 A Picture of the Conference Room
3.5 Conference Room Layout
3.6 The Planar, 15-Element Conference-Room Microphone Array
3.7 Estimated SNRs of all 15 Microphone Channels
3.8 Block Powers of the Speech Signal and Background Noise
3.9 Block Power Averaged over Microphone
3.10 Room Impulse Response of Microphone 1 and Source 1
3.11 A Close-Up of a 10-Millisecond Segment of the Room Impulse Response
3.12 The Smoothed Powers of the Impulse Responses
4.1 An Example of how TDOAs Parameterize Source Location
5.1 Microphones 2 and 9 comprise a Pair with a 36-cm Separation Distance
5.2 Plane Wave DOA-TDOA Relationship
5.3 Histograms of the TDOA Estimates of Source 2 and Source 3
5.4 Histogram of TDOA Estimates from Source 1
5.5 Histogram of TDOA Estimates with Cross-Correlation
5.6 Normalized GCC Responses over Time for Gaussian Source 1
5.7 Microphones 2, 9 and 13 form a Triad Array
5.8 RMS TDOA Error Histograms for Three Gaussian Sources and the Triad Array
5.9 Histogram and Error Rate of RMS TDOA Error for Gaussian Source 1
5.10 RMS TDOA Error Rates for Gaussian Source 1, 2 and 3
5.11 Error Rates for the Three Source Locations and Two Source Signals
5.12 An 8-Element, 33 by 36 Centimeter Array
5.13 RMS Error Rates
5.14 Speech Segment with Nine Frames of TDOA Error Surfaces
5.15 DOA Error Rates for Various Cross-Spectrum Accumulation Times
6.1 The Structure of a Filter-and-Sum Beamformer
7.1 The Highlighted Microphones Form an 8-Element Array
7.2 DOA Error Rates for Three Different Sources
7.3 The Speech Segment Used to Compute the Steered Responses of Figures 7.4 and 7.5
7.4 Steered Responses of the Delay-and-Sum Beamformer
7.5 Steered Responses of SRP-PHAT
7.6 DOA Error Rates for the 15-Element Planar Array
7.7 The HMA Layout with 128 (of 256) Microphones
7.8 Plot of Block Power Averaged over Microphone
7.9 Location Error Rates for SRP, GCC-PHAT and SRP-PHAT using 128 Microphones
7.10 Location Error Rates for Different Cross-Spectra Accumulation Times
7.11 Steered Responses Using the 128-Element Array and Three Simultaneous Talkers
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
1 Introduction
`
`A combination of microphone arrays and sophisticated signal processing has been applied to the remote
`
`acquisition of high-quality speech audio. These applications all exploit the spatial filtering ability of an
`
`array, which allows the speech signal from one talker to be enhanced as the signals from other talkers and
`
unwanted sources are suppressed. This process is generally referred to as beamforming. While some
`
`array-systems are designed to focus on sounds emanating from a preset location or direction, most employ
`
`adaptive algorithms that track the positions of one or more talkers and adjust the array’s focus accordingly.
`
`This “electronically steerable” feature eliminates the need for manually operated equipment, such as
`
`shotgun or boom-mounted microphones. Furthermore, an array-system has the potential to replace the use
`
`of hand-held or head-mounted microphones in some applications.
`
`
`
`Microphone arrays have been implemented in many applications, including teleconferencing
`
`[25][35][60][61][96], speech recognition [2][21][22][40][55][56][79], talker characterization [91] and
`
`voice capture in reverberant environments [34][39][57][98]. Some novel and interesting array designs have
`
`been studied, including a small spherical array [31] and one employing superdirectivity [24]. Both
`
`theoretical and practical aspects of array-systems are being actively researched, as reported by the
`
`participants of three special microphone array workshops [36][37][38]. Some of this work has been based
`
`on simulations using mathematical models (such as [3]) of the acoustic environment [57][82], and other
`
`work relies on pre-recorded array data of actual talkers. Still other work focuses on the design and
`
`construction of hardware [63][86], as well as the implementation of real-time software [29][76].
`
`With the emergence of powerful and inexpensive DSP microprocessors, microphone array-
`
`systems have been introduced as commercially viable products. Examples of this are the teleconferencing
`
products by PictureTel and Polycom. Both companies have applied microphone-array technology to
`
`quality voice-capture products designed for use in small-room environments. There are also products by
`
`these companies that automatically steer a robotic camera and frame active talkers. The camera-steering
`
`array-system by PictureTel uses the location estimates produced by a 4-element array [96].
`
`Most of these applications require accurate passive localization techniques that produce estimates
`
`at a high rate with minimal latency. When tracking multiple, moving talkers [92], there must be many
`
`
`
`
`
`
`
`reliable location estimates produced per second. If a beamformer is to be used to focus on these talkers,
`
`then their motion must be negligible for the duration of each data segment used to compute an estimate.
`
`Furthermore, the update rate must be high enough to avoid the undesirable effects of misaiming. These
`
`effects include high-frequency rolloff in the beamformer output [5], and a general attenuation of the target
`
source signal. In addition, the latency due to the accumulation of long data segments for processing
`
`before beamforming may result in unacceptable delays between the production of the speech by the talker
`
`and the output of that speech through the beamformer. For real-time applications, such long delays can be
`
`quite disruptive. These factors place tight constraints on the microphone data requirements. While the
`
`computation time required by the algorithm largely determines the latency of the locator, it is the data
`
`requirements that define theoretical limits. Hence, this thesis focuses on reducing the size of the data
`
`segments necessary for accurate source localization in realistic room environments.
`
`The performance of voice-capture techniques generally improves with the number of microphones
`
`in the array, and this has spawned the research and construction of medium [29] and large array systems
`
`[86]. When acoustic conditions are favorable, source localization can be performed using a modest number
`
`of microphones. For example, the automatic voice-steering camera by PictureTel includes only four
`
`microphones. Hence, in this regard, large arrays composed of tens or hundreds of microphones are
`
redundant. However, by integrating the data from a multitude of microphones, the redundancy of a large array can be
`
`exploited to improve localization in the presence of adverse acoustic effects such as reverberation and
`
`background noise.
`
`For various reasons, including the reduction of computational costs, many source-localization
`
algorithms break the array into pairs of microphones (see Section 1.1 for references to work in this area).
`
`Pairwise time-delay estimation (TDE) is used to determine the time difference of arrival (TDOA) of speech
`
`sounds between the microphones comprising each pair. The redundancy of a multitude of TDOA estimates
`
`has been exploited by statistically averaging in some way to give an estimate of the talker’s location [12].
`
`However, pairwise techniques suffer considerably from acoustic reverberation. The performance of
`
`pairwise techniques improves with the amount of data used, but the desire for high update rates and low
`
`latency places strict limits on this.
`
`
`
`
`
`
`
`
When a system has a multitude of microphones, far more than a sufficient number for source localization,

they should be used in a manner that will make the algorithm robust to reverberation. The
`
`application of error-prone pairwise TDE does not seem to be the best way to achieve this. An alternative
`
`approach is one where a beamformer is used to search over a predefined spatial region looking for a peak
`
`(or peaks) in the power of its output signal [59]. While this is computationally more intensive than
`
`pairwise methods, it inherently combines the signals from multiple microphones rather than reducing the
`
`data from each pair to a single time-delay parameter. This approach is able to compensate for the short
`
`duration of each data segment used for localization by integrating the data from many, or all, of the
`
`microphones prior to parameter estimation. An additional advantage that beam-steering techniques have
`
`over TDE-based techniques is the ability to localize multiple simultaneous talkers. In such a scenario, the
`
`power of a steered beamformer should peak multiple times, and each peak should correspond to the
`
`location of an active talker. Although these techniques have not been a popular choice for speech-array
`
`applications, a new steered-beamformer method is proposed in this thesis, which combines the best features
`
`of the beamformer with those of a popular pairwise technique. It will be demonstrated that this new
`
`steered-beamformer produces highly reliable location estimates, in rooms with reverberation times of 200
`
`and 400 milliseconds, using 25-millisecond data segments.
`
`1.1 Methods for Pairwise Time-Delay Estimation
`
`Many passive talker localization techniques rely on pairwise time delay estimation (TDE) [11][13][64].
`
`These techniques use the time difference of arrival (TDOA) of speech sounds between two spatially
`
`separated microphones to parameterize the source location [7][8][12][20][93]. The best results are
`
`obtained when the microphone pairs are strategically positioned to give optimal spatial accuracy [9][10].
`
`Pairwise TDE has been applied to automatic camera steering for videoconferencing [96]. For this
`
application of TDE, update intervals of 200-300ms are acceptable. With such long data segments, reliable
`
`estimates are produced, even in adverse acoustic conditions. However, applications such as adaptive
`
`beamforming and the tracking of multiple talkers [92] require a much higher estimate rate; positional
`
estimates must be updated as often as every 20-30ms. When the data segments become this short, acoustic
`
`
`
`
`
`
`
`reverberation has a severe impact on the performance of pairwise delay estimators. Many techniques have
`
`been proposed to improve their performance in reverberant environments.
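
As a minimal illustration of the TDOA parameterization that these pairwise methods rely on, the following Python sketch computes the TDOA of a point source for a single microphone pair. The speed of sound (343 m/s), the coordinates and the function name `tdoa` are assumptions chosen for illustration only; the 36-cm spacing matches the pair separation listed for Figure 5.1.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for room-temperature air


def tdoa(source, mic_a, mic_b, c=SPEED_OF_SOUND):
    """Arrival time at mic_b minus arrival time at mic_a, in seconds."""
    return (math.dist(source, mic_b) - math.dist(source, mic_a)) / c


# Hypothetical pair on the x-axis, 36 cm apart (coordinates in meters).
mic_a = (-0.18, 0.0, 0.0)
mic_b = (0.18, 0.0, 0.0)

broadside = tdoa((0.0, 2.0, 0.0), mic_a, mic_b)  # source equidistant from both
endfire = tdoa((3.0, 0.0, 0.0), mic_a, mic_b)    # source on the array axis
max_tdoa = math.dist(mic_a, mic_b) / SPEED_OF_SOUND  # geometric bound

print(f"broadside: {broadside * 1e3:.3f} ms")
print(f"endfire:   {endfire * 1e3:.3f} ms")
print(f"bound:     {max_tdoa * 1e3:.3f} ms")
```

For any source position, the magnitude of the TDOA cannot exceed the microphone spacing divided by the speed of sound, which is about 1 ms for this pair. The delays of interest are therefore much shorter than even a 20-30ms data segment.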
`
`The most common pairwise TDE method is generalized cross-correlation (GCC) [64]. The type of
`
`filtering, or weighting, used with GCC is crucial to performance. Maximum likelihood (ML) weightings
`
`are theoretically optimal when there is single-path propagation in the presence of uncorrelated noise, but
`
`their performance degrades significantly with increasing reverberation [19]. The phase transform (PHAT)
`
`weighting is more robust against reverberation than ML, even though it is sub-optimal under ideal
`
`conditions. Also known as the cross-power spectrum phase (CSP), GCC-PHAT has been shown to
`
`perform well in a realistic environment [72].
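
The idea behind the PHAT weighting can be sketched in a few lines: the cross-spectrum of the two microphone signals is normalized to unit magnitude so that only its phase contributes, and the delay estimate is the lag at which the inverse transform peaks. The Python below is an illustrative sketch under simplifying assumptions, not the implementation evaluated in this thesis. The function names, the naive standard-library DFT (an FFT routine would be used in practice), the circularly delayed test signal and the small constant `eps` are all assumptions.

```python
import cmath
import math
import random


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform using only the stdlib."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat(xa, xb, eps=1e-12):
    """PHAT-weighted cross-correlation of two equal-length blocks.

    Returns a list indexed by circular lag. eps guards against division
    by zero in empty frequency bins (an assumed safeguard).
    """
    spec_a, spec_b = dft(xa), dft(xb)
    cross = [a * b.conjugate() for a, b in zip(spec_a, spec_b)]
    whitened = [c / (abs(c) + eps) for c in cross]  # keep phase only
    return [v.real for v in dft(whitened, inverse=True)]


def estimate_delay(x1, x2):
    """Delay of x2 relative to x1 in samples (positive when x2 lags x1)."""
    r = gcc_phat(x2, x1)
    n = len(r)
    peak = max(range(n), key=lambda i: r[i])
    return peak if peak <= n // 2 else peak - n  # map to a signed lag


# Hypothetical test signal: white Gaussian noise and a circularly
# delayed copy (5 samples), with no reverberation or additive noise.
random.seed(0)
n = 64
x1 = [random.gauss(0.0, 1.0) for _ in range(n)]
x2 = x1[-5:] + x1[:-5]

print(estimate_delay(x1, x2))
```

In this idealized case the whitened cross-spectrum is a pure linear-phase term, so the correlation collapses to a single sharp peak at the 5-sample delay. With reverberation, additional peaks appear at the delays of the reflected paths, which is the failure mode discussed above for short data segments.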
`
`Other approaches, such as cepstral prefiltering [88], attempt to deconvolve the effects of
`
`reverberation prior to applying GCC. However, deconvolution requires long data segments since the
`
`duration of a typical small-room impulse response is 200-400ms. It is also very sensitive to the high
`
`variability and non-stationarity of speech signals. In fact, the experiments performed in [88] avoided the
`
`use of speech as input altogether. Instead, colored Gaussian noise was used as the source signal.
`
`While identification of room impulse responses is impossible when the source signal is unknown,
`
`the method proposed in [54], which is based on eigenvalue decomposition, efficiently detects the direct
`
`paths of two impulse responses. This method is effective with speech as input, but requires 250ms of
`
`microphone data to converge.
`
`Reverberation effects can also be overcome to some degree by classifying TDEs acquired over
`
`time and associating them with the direction of arrival (DOA) of the sound waves [90]. This approach,
`
`however, is not suitable for short-time TDE. Under extreme acoustic conditions, a large percentage of the
`
`TDEs are anomalous, and it takes a considerable period (1-2 seconds in [90]) to acquire enough estimates
`
`for a statistically meaningful classification.
`
`A short-time TDE method, which is more complex than GCC, is presented in [14]. It involves the
`
`minimization of a weighted least-squares function of the phase data. It was shown to outperform both
`
`GCC-ML and GCC-PHAT in reverberant conditions. This improvement comes at a cost; since the phase
`
`data is discontinuous, a complicated searching algorithm must be applied to the minimization. The
`
`marginal improvement over GCC-PHAT may not justify this added cost in computational complexity.
`
`
`
`
`
`
`
`
`Among the methods described here, those that rely on long data segments generally outperform
`
those that do not. The method in [81] is another example of a GCC method that performs adequately only when the data
`
`segments are sufficiently long in duration. Cross-correlation techniques are known to improve with
`
`increasing data lengths. Hence, it is not surprising that GCC-based TDE methods also improve with more
`
`data. Those that are not GCC-based generally require larger amounts of data to be effective as well.
`
`However, the dynamic environments of many speech array applications require high update rates, which
`
`limits the duration of the data segments.
`
`1.2 Methods for Steered-Beamformer Localization
`
`Methods that rely on the array’s ability to focus on signals originating from a particular location or
`
`direction in space are generally referred to as beamformers [59]. When the location of the source is not
`
`known, a beamformer can be used to scan, or steer, over a predefined spatial region by adjusting its
`
`steering parameters. The output of a beamformer, when used in this way, is known as the steered response.
`
`The steered response power (SRP) may peak under a variety of circumstances, but with favorable
`
`conditions, it is maximized when the point (or direction) of focus matches the location of the source.
`
Beamforming has been used extensively in speech-array applications for voice capture

[37][38][23][41][97]. However, due to the efficiency and satisfactory performance of pairwise correlation
`
`methods, it has rarely been applied to the talker localization problem. Furthermore, the steered response of
`
`a conventional beamformer is highly dependent on the spectral content of the source signal. Many optimal
`
`derivations are based on a priori knowledge of the spectral content of the background noise, as well as the
`
`source signal [17][45]. In the presence of significant reverberation, the noise and source signals are highly
`
`correlated, and this makes accurate estimation of the noise nearly impossible. Furthermore, in nearly all
`
`array-applications, little or nothing is known about the source signal. Hence, such optimal estimators are
`
not very practical in realistic speech-array environments.
`
`
`
`The simplest type of steered response is obtained using the output of a delay-and-sum
`
`beamformer. This is what is most often referred to as a conventional beamformer [59]. Delay-and-sum
`
`beamformers apply time shifts to the array signals to compensate for the propagation delays in the arrival of
`
`the source signal at each microphone. Once these signals are time-aligned, they are summed together to
`
`
`
`
`
`
`
`form a single output signal. More sophisticated beamformers apply filters to the array signals as well as
`
`this time alignment. The derivation of the filters in these filter-and-sum beamformers is what distinguishes
`
`one method from the other.
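
The delay-and-sum operation and the steered-response search described in this section can be sketched as follows. This is a simplified illustration, not the method developed in this thesis: it assumes integer-sample delays, circular shifts in place of proper block-edge handling, a single white-noise source without reverberation, and a hypothetical grid of three candidate directions. All function and variable names are chosen for illustration.

```python
import random


def delay_and_sum(signals, delays):
    """Advance channel m by delays[m] samples and sum across channels.

    Circular indexing keeps the sketch short; it is an assumption, not
    how block edges would be handled in a real system.
    """
    n = len(signals[0])
    return [sum(sig[(t + d) % n] for sig, d in zip(signals, delays))
            for t in range(n)]


def steered_response_power(signals, delays):
    """Mean power of the beamformer output for one set of steering delays."""
    y = delay_and_sum(signals, delays)
    return sum(v * v for v in y) / len(y)


def locate(signals, candidates):
    """Return the candidate label whose steering delays maximize the SRP."""
    return max(candidates,
               key=lambda name: steered_response_power(signals, candidates[name]))


# Hypothetical test case: one white-noise source reaching 4 microphones
# with propagation delays of 0, 3, 6 and 9 samples.
random.seed(1)
n = 256
source = [random.gauss(0.0, 1.0) for _ in range(n)]
true_delays = [0, 3, 6, 9]
signals = [[source[(t - d) % n] for t in range(n)] for d in true_delays]

# Hypothetical search grid: each candidate lists the per-microphone delays
# that a source in that direction would produce.
candidates = {
    "left": [9, 6, 3, 0],
    "broadside": [0, 0, 0, 0],
    "right": [0, 3, 6, 9],
}

best = locate(signals, candidates)
print(best)
```

When the steering delays match the true propagation delays, the four channels add coherently and the output power is 16 times that of a single channel. For mismatched candidates the channels add incoherently and the power is roughly 4 times that of a single channel, so the search selects the matching candidate.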
`
`Many optimal steered-beamformer techniques have been derived for stationary, narrow-band
`
signals. These include minimum variance beamforming [58][26][66], linear prediction [58] and generalized
`
`sidelobe cancellers [44][16]. These methods can be extended to the wideband case and are appropriate for
`
`speech signals when applied over short, stationary segments. However, the beamformer filters for all of
`
`these methods are defined in terms of the spatial correlation matrix. When this matrix is unknown, it must
`
`be estimated using the observed data. Such estimation, especially in adverse acoustic conditions, may
`
`require long segments of stationary data. For the dynamic conditions of speech-array applications, long
`
intervals for which the source is both spatially and temporally stationary are rarely encountered. Hence, such
`
`methods are difficult to apply to the localization of speech sources.
`
`In this thesis, filters for a steered-beamformer are derived, which incorporate the features of a
`
`popular pairwise technique known as the phase transform (PHAT). The phase transform is a sub-optimal
`
`method, although it has been shown to perform well in reverberant environments. In