IPR PETITION
US RE48,371
Amazon Ex. 1021

DISTANT SPEECH RECOGNITION

Matthias Wölfel
Universität Karlsruhe (TH), Germany

and

John McDonough
Universität des Saarlandes, Germany

A John Wiley and Sons, Ltd., Publication
This edition first published 2009
© 2009 John Wiley & Sons Ltd

Registered office

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.
Library of Congress Cataloging-in-Publication Data

Wölfel, Matthias.
Distant speech recognition / Matthias Wölfel, John McDonough.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-51704-8 (cloth)
1. Automatic speech recognition. I. McDonough, John (John W.) II. Title.
TK7882.S65W64 2009
006.4/54 – dc22
2008052791

A catalogue record for this book is available from the British Library

ISBN 978-0-470-51704-8 (H/B)

Typeset in 10/12 Times by Laserwords Private Limited, Chennai, India
Printed and bound in Great Britain by Antony Rowe Ltd, Chippenham, Wiltshire
Hence, it is apparent that the optimal MMSE estimator is equivalent to the conditional mean,

    E_{p(x_k|y_{1:k})}{ x_k | y_{1:k} } = ∫ x_k p(x_k|y_{1:k}) dx_k.        (4.8)

Similarly, it follows that knowledge of the filtering density p(x_k|y_{1:k−1}) enables all other less general estimates of x_k to be readily calculated.
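As a quick numerical check of this statement – an added illustration, not from the text, using a hypothetical zero-mean jointly Gaussian pair (x, y) – the conditional mean can be compared against other estimates by Monte Carlo simulation:

```python
# Hypothetical illustration (not from the book): for jointly Gaussian (x, y),
# the conditional mean E{x|y} attains the smallest mean square error.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(0.0, 1.0, n)            # hidden quantity to be estimated
y = x + rng.normal(0.0, 0.5, n)        # noisy observation

# For this model E{x|y} = (sigma_x^2 / (sigma_x^2 + sigma_d^2)) * y.
cond_mean = (1.0 / (1.0 + 0.25)) * y

mse_cond = np.mean((x - cond_mean) ** 2)
mse_raw  = np.mean((x - y) ** 2)          # use y itself as the estimate
mse_half = np.mean((x - 0.5 * y) ** 2)    # some other linear estimate

print(mse_cond, mse_raw, mse_half)        # the conditional mean gives the smallest MSE
```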
4.2 Wiener Filter

Stochastic filter theory was established by the pioneering work of Norbert Wiener (1949), Wiener and Hopf (1931), and Andrey Kolmogorov (1941a, b). A Wiener filter provides the optimal static, linear, MMSE solution, where the mean square error is calculated between the output of the filter and some desired signal. We discuss the Wiener filter in this section, because such a filter is equivalent to the Kalman filter described in Section 4.3 without any process noise. Hence, the Wiener filter is in fact a Bayesian estimator (Simon 2006, sect. 8.5.2). We will derive both the time and frequency domain solutions for the finite impulse response (FIR) filter.
4.2.1 Time Domain Solution

Let x[n] denote the desired signal and let d[n] represent some additive distortion. The primary assumptions inherent in the Wiener filter are that the second-order statistics of both x[n] and d[n] are stationary. The corrupted signal is then defined as

    y[n] ≜ x[n] + d[n].

The time domain output of the FIR Wiener filter, which is the estimate x̂[n] of the desired signal x[n], is by definition obtained from the convolution

    x̂[n] ≜ Σ_{l=0}^{L−1} h[l] y[n − l],        (4.9)

where h[n] is the filter impulse response of length L. Upon defining

    h ≜ [ h[0]  h[1]  · · ·  h[L − 1] ]^T,
    y[n] ≜ [ y[n]  y[n − 1]  · · ·  y[n − L + 1] ]^T,

the output of the filter can be expressed as

    x̂[n] = h^T y[n].

The estimation error is ε[n] ≜ x[n] − x̂[n], and the squared-estimation error is given by

    ζ ≜ E{ε^T[n] ε[n]} = E{(x[n] − h^T y[n])^T (x[n] − h^T y[n])},        (4.10)
which must be minimized. Equation (4.10) can be rewritten as

    ζ = E{x^T[n] x[n]} − 2 h^T r_{xy} + h^T R_y h,

where

    R_y ≜ E{y[n] y^T[n]},
    r_{xy} ≜ E{y[n] x[n]}.

The Wiener filter is based on the assumption that the components R_y and r_{xy} are stationary. In order to solve for the optimal filter coefficients, we set

    ∂ζ/∂h = −2 r_{xy} + 2 h^T R_y = 0,

which leads immediately to the famous Wiener–Hopf equation

    R_y h = r_{xy}.        (4.11)

The solution for the optimal coefficients is then

    h_o = R_y^{−1} r_{xy}.        (4.12)

Note that the optimal solution can also be found through the well-known orthogonality principle (Stark and Woods 1994), which can be stated as

    E{y[n − i] ε[n]} = 0   ∀ i = 0, . . . , L − 1.        (4.13)

In other words, the orthogonality principle requires that the estimation error ε[n] is orthogonal to all of the inputs y[n − i] for i = 0, . . . , L − 1 used to form the estimate x̂[n].
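The solution (4.12) is simple to realize numerically once R_y and r_{xy} are available. The following sketch – an added illustration with assumed white signals, not code from the text – estimates both quantities from sample averages and solves the Wiener–Hopf equation (4.11):

```python
# Minimal sketch (assumed signals): estimate R_y and r_xy from sample statistics
# and solve the Wiener-Hopf equation R_y h = r_xy for the FIR Wiener filter.
import numpy as np

rng = np.random.default_rng(1)
N, L = 50_000, 16
x = rng.normal(size=N)                      # desired signal (white, for simplicity)
d = rng.normal(scale=0.7, size=N)           # additive distortion
y = x + d                                   # corrupted observation

# Delay vectors y[n] = [y[n], y[n-1], ..., y[n-L+1]]^T for n >= L-1.
Y = np.stack([y[L - 1 - l : N - l] for l in range(L)], axis=0)   # shape (L, N-L+1)
x_al = x[L - 1 :]                                                # aligned desired samples

R_y  = Y @ Y.T / Y.shape[1]          # sample estimate of E{y[n] y^T[n]}
r_xy = Y @ x_al / Y.shape[1]         # sample estimate of E{y[n] x[n]}

h_o = np.linalg.solve(R_y, r_xy)     # optimal coefficients h_o = R_y^{-1} r_xy
x_hat = h_o @ Y                      # filter output x_hat[n] = h^T y[n]

print("MSE before:", np.mean((x_al - y[L - 1 :]) ** 2))
print("MSE after: ", np.mean((x_al - x_hat) ** 2))
```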
4.2.2 Frequency Domain Solution

In order to derive the Wiener filter in the frequency domain, let us express (4.13) as

    E{ y[n − i] ( x[n] − Σ_{l=0}^{L−1} h_opt[l] y[n − l] ) } = 0   ∀ i = 0, . . . , L − 1.

Equivalently, we can write

    r_{xy}[n] − h_opt[n] ∗ r_y[n] = 0,        (4.14)
where the cross-correlation sequence of x[n] and y[n] as well as the autocorrelation sequence of y[n] are, respectively,

    r_{xy}[l] ≜ E{y[n − l] x[n]}   for l = 0, . . . , L − 1, and 0 otherwise,
    r_y[l] ≜ E{y[n − l] y[n]}   for l = −L + 1, . . . , L − 1, and 0 otherwise.

Taking the Fourier transform of (4.14) provides

    Φ_{XY}(ω) − H_opt(ω) Φ_Y(ω) = 0,

where¹ r_{xy}[n] ↔ Φ_{XY}(ω), h[n] ↔ H_opt(ω), and r_y[n] ↔ Φ_Y(ω). This leads immediately to the solution

    H_opt(ω) = Φ_{XY}(ω) / Φ_Y(ω).        (4.15)

¹ The notation r_y[n] ↔ Φ_Y(ω) indicates that r_y[n] and Φ_Y(ω) comprise a Fourier transform pair; see Section 3.1.2 for details.

Given that X(ω) and D(ω) are statistically independent by assumption, it follows that

    Φ_Y(ω) = Φ_X(ω) + Φ_D(ω),
    Φ_{XY}(ω) = Φ_X(ω).

Hence, we can rewrite (4.15) as

    H_opt(ω) = Φ_X(ω) / ( Φ_X(ω) + Φ_D(ω) ),        (4.16)

the form in which the Wiener filter is most often seen. Alternatively, the frequency response of the filter can be expressed as

    H_opt(ω) = 1 / ( 1 + Φ_D(ω)/Φ_X(ω) ),

from which it is apparent that when the spectral power of the disturbance comes to dominate that of the signal, the gain of the filter is reduced. When the signal dominates the disturbance, on the other hand, the gain increases. In all cases it holds that

    0 ≤ |H_opt(ω)| ≤ 1.

As presented here, the classical Wiener filter presents something of a paradox in that it requires that the desired signal x[n] or its power spectrum Φ_X(ω) is known before the
filter coefficients can be designed. Were this information available, there would be no need of a Wiener filter. The art of practical Wiener filter design consists of nothing more than the robust estimation of the desired signal Φ_X(ω) and noise Φ_D(ω) components appearing in (4.15). References indicating how this can be achieved are presented at the ends of Sections 6.3.1 and 13.3.5.
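To make (4.16) concrete, the short sketch below – an added illustration that assumes the signal and noise spectra are known (thereby sidestepping the estimation problem just described) – applies the Wiener gain to each frequency bin of a noisy frame:

```python
# Minimal sketch (oracle spectra, for illustration only): apply the Wiener gain
# H_opt = Phi_X / (Phi_X + Phi_D) per frequency bin of a noisy signal frame.
import numpy as np

rng = np.random.default_rng(2)
n_fft = 512
t = np.arange(n_fft)
x = np.sin(2 * np.pi * 0.05 * t)                    # "desired" signal frame
d = rng.normal(scale=0.5, size=n_fft)               # noise frame
y = x + d

X, D, Y = (np.fft.rfft(s) for s in (x, d, y))

phi_x = np.abs(X) ** 2                              # oracle power spectra; in practice these
phi_d = np.abs(D) ** 2                              # must be estimated robustly
H_opt = phi_x / (phi_x + phi_d + 1e-12)             # satisfies 0 <= |H_opt| <= 1

x_hat = np.fft.irfft(H_opt * Y, n=n_fft)            # filtered frame
print("noisy MSE:   ", np.mean((y - x) ** 2))
print("filtered MSE:", np.mean((x_hat - x) ** 2))
```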
4.3 Kalman Filter and Variations

In this section, we present the best known set of solutions for estimating the filtering density, namely the Kalman filter (KF) (Kalman 1960) and its several variations.

4.3.1 Kalman Filter

The Kalman filter provides a closed form means of sequentially updating p(x_k|y_{1:k}) under two critical assumptions:

• The transition and observation models f_k and h_k are linear.
• The process and observation noises u_k and v_k are Gaussian.

As the linear combination of Gaussian r.v.s is also Gaussian, these assumptions taken together imply that both x_k and y_k will remain Gaussian for all time k. Note that the combination of Gaussians in the nonlinear domain, such as the logarithmic domain, results in a non-Gaussian distribution, as described in Section 9.3.1. As mentioned previously, under these conditions, the KF is the optimal MMSE estimator.

In keeping with the aforementioned linearity assumption, the state model (4.1–4.2) can be expressed as

    x_k = F_{k|k−1} x_{k−1} + u_{k−1},        (4.17)
    y_k = H_k x_k + v_k,        (4.18)

where F_{k|k−1} and H_k are the known transition and observation matrices. The noise terms u_k and v_k in (4.17–4.18) are by assumption zero mean, white Gaussian random vector processes with covariance matrices

    U_k = E{u_k u_k^T},   V_k = E{v_k v_k^T},

respectively. Moreover, by assumption u_k and v_k are statistically independent.
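A minimal sketch of the resulting predict–correct recursion may be helpful here; it is an added illustration, assuming the linear–Gaussian model (4.17)–(4.18) with known F_{k|k−1}, H_k, U_k and V_k, and is not code from the text:

```python
# Minimal Kalman filter sketch for the linear-Gaussian model (4.17)-(4.18).
# F, H, U, V are assumed known; x_est and P are the filtered mean and covariance.
import numpy as np

def kalman_step(x_est, P, y, F, H, U, V):
    # Prediction: propagate the state estimate through the transition model.
    x_pred = F @ x_est
    P_pred = F @ P @ F.T + U
    # Correction: incorporate the new observation y.
    S = H @ P_pred @ H.T + V                     # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)          # Kalman gain
    x_new = x_pred + K @ (y - H @ x_pred)
    P_new = (np.eye(len(x_est)) - K @ H) @ P_pred
    return x_new, P_new

# Toy example: scalar random walk observed in noise.
F = np.array([[1.0]]); H = np.array([[1.0]])
U = np.array([[0.01]]); V = np.array([[0.25]])
x_est, P = np.zeros(1), np.eye(1)
rng = np.random.default_rng(3)
x_true = np.cumsum(rng.normal(scale=0.1, size=100))
for y in x_true + rng.normal(scale=0.5, size=100):
    x_est, P = kalman_step(x_est, P, np.atleast_1d(y), F, H, U, V)
print(x_est, x_true[-1])    # filtered estimate tracks the final true state
```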
By definition, the transition matrix F_{k|k−1} has two important properties:

• product rule

    F_{k|m} F_{m|n} = F_{k|n},        (4.19)
6 Speech Feature Enhancement
In automatic speech recognition (ASR) the distortion of the acoustic features can be compensated for either in the model domain or in the feature domain. The former techniques adapt the model to the distorted test data as if the model had been trained on distorted data. Feature domain techniques, on the other hand, attempt to remove or suppress the distortion itself. It has been shown in various publications, such as Deng et al. (2000) and Sehr and Kellermann (2007), that feature domain techniques provide better system performance than simply matching the training and testing conditions. The problem is especially severe for speech corrupted with reverberation. In particular, for reverberation times above 500 ms, ASR performance with respect to a model trained on clean speech does not improve significantly even when the acoustic model of the recognizer has been trained on data from the same acoustic environment (Baba et al. 2002).

The term enhancement indicates an improvement in speech quality. For speech observations, enhancement can be expressed either in terms of intelligibility, which is an indicator of how well the speech can be understood by a human, or signal quality, which is an indicator of how badly the speech is corrupted, or it can include both of these measures. For the purpose of automatic classification, features must be manipulated to provide higher class separability. It is possible to perform speech feature enhancement in an independent preprocessing step, or within the front-end of the ASR system during feature extraction. In both cases it is not necessary to modify the decoding stage, and it might not be necessary to change the acoustic models of the ASR system, except for methods that change the means or variances of the features, such as cepstral mean and variance normalization. If the training data itself is distorted, however, it might be helpful to enhance the training features as well.

In general the speech enhancement problem can be formulated as the estimation of cleaned speech coefficients by maximizing or minimizing certain objective criteria using additional knowledge, which could represent prior knowledge about the characteristics of the desired speech signal or unwanted distortion, for example. A common and widely accepted distortion measure was introduced in Chapter 4, namely, the squared error distortion,

    d(x̂, x) = |f(x̂) − f(x)|²,
where the function f(x) – which could be any one of x, |x|, x², or log x – determines the fidelity criterion of the estimator.

As the term speech enhancement is very broad and can potentially cover a wide variety of techniques, including:

• additive noise reduction,
• dereverberation,
• blind source separation,
• beamforming,
• reconstruction of lost speech packets in digital networks, or
• bandwidth extension of narrowband speech,

it is useful to provide some more specificity. An obvious classification criterion is provided by the number and type of sensors used. Single-channel methods, as described in this section, obtain the input from just a single microphone, while multi-channel methods rely on observations from an array of sensors. These methods can be further categorized by the type of sensors. An example of the fusion of audio and visual features in order to improve recognition performance is given by Almajai et al. (2007). As discussed in Chapters 12 and 13, respectively, blind source separation and beamforming combine acoustic signals captured only with microphones. These techniques differ inasmuch as beamforming assumes more prior information – namely, the geometry of the sensor array and the position of the speaker – is available. Single and multi-channel approaches can be combined to further improve the signal or feature in terms of the objective function used, such as signal-to-noise ratio (SNR), class separability, or word error rate.

In this book we use the term speech feature enhancement exclusively to describe algorithms or devices whose purpose is to improve the speech features, where a single corrupted waveform or a single corrupted feature stream is available. The goal is improved classification accuracy, which may not necessarily result in an improved or pleasing sound quality if reconstruction is at all possible. As seen in previous sections, additive noise and reverberation are the most frequently encountered problems in distant speech recognition (DSR), and our investigations are limited to methods of removing the effects of these distortions.

Work on speech enhancement addressing noise reduction has been a research topic since the early 1960s, when Manfred Schröder at Bell Labs began working in the field. Schröder's analog implementation of spectral subtraction, however, is not well known inasmuch as it was only published in patents (Schröder 1965, 1968). In 1974 Weiss et al. (1974) proposed an algorithm in the autocorrelation domain. Five years later Boll (1979) proposed a similar algorithm which, however, worked in the spectral domain. Boll's algorithm became one of the earliest and most popular approaches to speech enhancement. A broad variety of variations on Boll's basic spectral subtraction approach followed. Cepstral mean normalization (CMN), another popular approach, which in contrast to the aforementioned methods is designed to compensate for channel distortion, was proposed by Atal as early as 1974 (Atal 1974). CMN came into wide use, however, only in the early 1990s. The effects of additive noise on cepstral coefficients as well as various remedies were investigated in the PhD dissertations of Acero (1990a), Gales (1995), and Moreno (1996).
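Since the basic CMN operation mentioned above is very simple, a brief sketch may be helpful; this is an added illustration of the idea (subtracting the per-utterance cepstral mean, which removes a stationary convolutional channel to first order), not code from the book:

```python
# Minimal cepstral mean (and variance) normalization sketch, added for illustration.
# 'cepstra' is assumed to be a (num_frames, num_coeffs) array of cepstral features.
import numpy as np

def cmn(cepstra, normalize_variance=False):
    mean = cepstra.mean(axis=0)            # per-utterance cepstral mean
    out = cepstra - mean                   # removes a stationary channel offset
    if normalize_variance:
        out = out / (cepstra.std(axis=0) + 1e-8)
    return out

# Example: a constant channel offset in the cepstral domain is removed exactly.
rng = np.random.default_rng(4)
clean = rng.normal(size=(300, 13))
channel = rng.normal(size=13)              # convolutional distortion ~ additive in cepstra
observed = clean + channel
print(np.allclose(cmn(observed), clean - clean.mean(axis=0)))   # True
```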
Considering speech feature enhancement as a Bayesian filtering problem leads to the application of a series of statistical algorithms intended to estimate the state of a dynamical system. Such Bayesian filters are described in Chapter 4. Pioneering work in that direction was presented by Lim and Oppenheim (1978), where an autoregressive model was used for a speech signal distorted by additive white Gaussian noise. Lim's algorithm estimates the autoregressive parameters by solving the Yule–Walker equation with the current estimate of the speech signal and obtains an improved speech signal by applying a Wiener filter to the observed signal. Paliwal and Basu (1987) extended this idea by replacing the Wiener filter with a Kalman filter (KF). That work was likely the first application of the KF to speech feature enhancement. In the years following, different sequential speech enhancement methods were proposed and the single Gaussian model was replaced by a Gaussian mixture (Lee et al. 1997). Several extensions intended to overcome the strict assumptions of the KF have appeared in the literature. The interacting multiple model, wherein several KFs in different stages interact with each other, was proposed by Kim (1998). More recently, very powerful methods based on particle filters have been proposed to enhance the speech features in the logarithmic spectral domain (Singh and Raj 2003; Yao and Nakamura 2002). This idea has been adopted and augmented by Wölfel (2008a) to jointly track, estimate and compensate for additive and reverberant distortions.
`
`6.1 Noise and Reverberation in Various Domains
`We begin our exposition by defining a signal model. Let x = [x1, x2,··· , xM] denote the
`original speech sequence, let h = [h1, h2,··· , hM] denote convolutional distortions such
`as the room impulse response, and let n = [n1, n2,··· , nM] denote the additive noise
`sequence. The signal model can then be expressed as
`y(t) = h(t) ∗ x(t) + n(t),
`
`(6.1)
`
`in the discrete-time domain, which we indicate with the superscript (t). Next we develop
`equivalent representations of the signal model in alternative domains, which will be
`indicated with suitable superscripts. The relationship, however, between additive and
`convolution distortion as well as the clean signal might become nontrivial after the trans-
`formation into different domains. In particular, ignoring the phase will lead to approximate
`solutions, which are frequently used due to their relative simplicity. An overview of the
`relationship between the original and clean signal is presented in Table 6.1.
`The advantage of
`time domain techniques is that
`they can be applied on a
`sample-by-sample basis, while all alternative domains presented here require windowing
`the signals and processing an entire block of data at once.
6.1.1 Frequency Domain

Representing the waveform as a sum of sinusoids by the application of the Fourier transform leads to the spectral domain representation,

    y^(f) = h^(f) x^(f) + n^(f).        (6.2)
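The following sketch – an added illustration with synthetic signals – generates data according to (6.1) and checks that, when the discrete Fourier transform covers the full (zero-padded) convolution and windowing effects are ignored, the model does indeed become the product form of (6.2):

```python
# Added illustration of the signal model: y = h * x + n in the time domain (6.1)
# becomes Y = H X + N in the frequency domain (6.2); the relation is exact here
# only because the DFT length covers the full linear convolution (no windowing).
import numpy as np

rng = np.random.default_rng(5)
M, K = 256, 32
x = rng.normal(size=M)                                  # "clean" speech sequence
h = rng.normal(size=K) * np.exp(-np.arange(K) / 8.0)    # decaying impulse response
n = rng.normal(scale=0.1, size=M + K - 1)               # additive noise

y = np.convolve(h, x) + n                               # time-domain model (6.1)

L = M + K - 1                                           # DFT length of the full convolution
Y = np.fft.rfft(y, L)
H = np.fft.rfft(h, L)
X = np.fft.rfft(x, L)
N = np.fft.rfft(n, L)

print(np.allclose(Y, H * X + N))                        # True: multiplicative/additive model (6.2)
```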
13 Beamforming
In this chapter, we investigate a class of techniques – known collectively as beamforming – by which signals from several sensors can be combined to emphasize a desired source and suppress interference from other directions. Beamforming begins with the assumption that the positions of all sensors are known, and that the position of the desired source is known or can be estimated. The simplest of beamforming algorithms, the delay-and-sum beamformer, uses only this geometrical knowledge to combine the signals from several sensors. More sophisticated adaptive beamformers attempt to minimize the total output power of the array under the constraint that the desired source must be unattenuated. The conventional adaptive beamforming algorithms attempt to minimize a quadratic optimization criterion related to signal-to-noise ratio under a distortionless constraint in the look direction. Recent research has revealed, however, that such quadratic criteria are not optimal for acoustic beamforming of human speech. Hence, we also present beamformers based on non-conventional optimization criteria that have appeared more recently in the literature.

Any reader well acquainted with the conventional array processing literature will certainly have already seen the material in Sections 13.1 through 13.4. The interaction of propagating waves with the sensors of a beamformer is described in Section 13.1.1, as are the effects of sensor spacing and beam steering on the spatial sensitivity of the array. The beam pattern, which is a plot of array sensitivity versus direction of arrival of a propagating wave, is defined and described in Section 13.1.2. The simplest beamformer, namely the delay-and-sum beamformer, is presented in Section 13.1.3, and the effects of beam steering are discussed in Section 13.1.4. Quantitative measures of beamforming performance are presented in Section 13.2, the most important of which are directivity, as presented in Section 13.2.1, and array gain, as presented in Section 13.2.2. These measures will be used to evaluate the conventional beamforming algorithms described later in the chapter.

In Section 13.3, we take up the discussion of the conventional beamforming algorithms. The minimum variance distortionless response (MVDR) beamformer is presented in Section 13.3.1, and its performance is analyzed in Sections 13.3.2 and 13.3.3. The beamforming algorithms based on the MVDR design, including the minimum mean square error and maximum signal-to-noise ratio beamformers, have the advantage of being tractable to analyze in
simple acoustic environments. As discussed in Section 13.3.4, the superdirective beamformer, which is based on particular assumptions about the ambient noise field, has proven useful in real acoustic environments. The minimum mean-square error (MMSE) beamformer is presented in Section 13.3.5 and its relation to the MVDR beamformer is discussed. The maximum signal-to-noise ratio design is then presented in Section 13.3.6. The generalized sidelobe canceller (GSC), which is to play a decisive role in the latter sections of this chapter, is presented in Section 13.3.7. As discussed in Section 13.3.8, diagonal loading is a very simple technique for adding robustness to adaptive beamforming designs.

Section 13.4, the last on the conventional beamforming algorithms, discusses implementations of adaptive beamforming algorithms that are suitable for online operation. First, a convergence analysis of designs based on stochastic gradient descent is presented in Section 13.4.1; thereafter the various least mean-square (LMS) error designs are presented in Section 13.4.2. These designs provide a complexity that is linear in the number N of sensors in the array, but can be slow to converge under unfavorable acoustic conditions. The recursive least square (RLS) error design, whose complexity increases as N², is discussed in Section 13.4.3. In return for this greater complexity, the RLS designs can provide better convergence characteristics. The RLS algorithms are known to be susceptible to numerical instabilities. A way to remedy this problem, namely the square-root implementation, is discussed in Section 13.4.4.

Recent research has revealed that the optimization criteria used in conventional array processing are not optimal for acoustic beamforming applications. In Section 13.5 of this chapter we discuss nonconventional optimization criteria for beamforming. A beamformer that maximizes the likelihood of the output signal with respect to a hidden Markov model (HMM), such as those discussed in Chapters 7 and 8, is discussed in Section 13.5.1. Section 13.5.2 presents a nonconventional beamforming algorithm based on the optimization of a negentropy criterion subject to a distortionless constraint. The negentropy criterion provides an indication of how non-Gaussian a random variable is. Human speech is a highly non-Gaussian signal, but becomes more nearly Gaussian when corrupted with noise or reverberation. Hence, in adjusting the active weight vectors of a GSC so as to provide a maximally non-Gaussian output subject to a distortionless constraint, the harmful effects of noise and reverberation on the output of the array can be minimized. A refinement of the maximum negentropy beamformer (MNB) is presented in Section 13.5.3, whereby an HMM is used to capture the nonstationarity of the desired speaker's speech.

It happens quite often when two or more people speak together that they will speak simultaneously, thereby creating regions of overlapping or simultaneous speech. Thus, the recognition of such simultaneous speech is an area of active research. In Section 13.5.4, we present a relatively new algorithm for separating overlapping speech into different output streams. This algorithm is based on the construction of two beamformers in GSC configuration, one pointing at each active speaker. To provide optimal separation performance, the active weight vectors of both GSCs are optimized jointly to provide two output streams with minimum mutual information (MinMI). This approach is also motivated in large part by research within the ICA field. The geometric source separation algorithm is presented in Section 13.5.5, which under the proper assumptions can be shown to be related to the MinMI beamformer.
Section 13.6 discusses a technique for automatically inferring the geometry of a microphone array based on a diffuse noise assumption.

In the final section of the chapter, we present our conclusions and recommendations for further reading.

13.1 Beamforming Fundamentals

Here we consider the fundamental concepts required to describe the interaction of propagating sound waves with sensor arrays. In this regard, the discussion here is an extension of that in Section 2.1. The exposition in this section is based largely on Van Trees (2002, sect. 2.2), and will make extensive use of the basic signal processing concepts developed in Chapter 3.
13.1.1 Sound Propagation and Array Geometry

To begin, consider an arbitrary array of N sensors. We will assume for the moment that the locations m_n, for n = 0, 1, . . . , N − 1, of the sensors are known. These sensors produce a set of signals denoted by the vector

    f(t, m) = [ f(t, m_0)  f(t, m_1)  · · ·  f(t, m_{N−1}) ]^T.

For the present, we will also work in the continuous-time domain t. This is done only to avoid the granularity introduced by a discrete-time index. But this will cease to be an issue when we move to the subband domain, as the phase shifts and scaling factors to be applied in the subband domain are continuous-valued, regardless of whether or not this is so for the signals with which we begin. The output of each sensor is processed with a linear time-invariant (LTI) filter with impulse response h_n(τ), and the filter outputs are then summed to obtain the final output of the beamformer:

    y(t) = Σ_{n=0}^{N−1} ∫_{−∞}^{∞} h_n(t − τ) f_n(τ, m_n) dτ.

In matrix notation, the output of the beamformer can be expressed as

    y(t) = ∫_{−∞}^{∞} h^T(t − τ) f(τ, m) dτ,        (13.1)

where

    h(t) = [ h_0(t)  h_1(t)  · · ·  h_{N−1}(t) ]^T.
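As a simple discrete-time analogue of (13.1) – an added sketch, not code from the text – a delay-and-sum beamformer chooses each h_n to be a pure delay that time-aligns the sensor signals before averaging; integer-sample delays are assumed here for simplicity:

```python
# Added sketch: discrete-time delay-and-sum beamforming, the simplest instance of
# the filter-and-sum structure in (13.1). Integer-sample delays are assumed here;
# fractional delays are normally applied as phase shifts in the subband domain.
import numpy as np

def delay_and_sum(signals, delays_samples):
    """signals: (N, T) array of sensor signals; delays_samples: per-sensor integer delays."""
    N, T = signals.shape
    out = np.zeros(T)
    for n in range(N):
        # Advance each channel by its delay to align the arrivals (wrap-around ignored).
        out += np.roll(signals[n], -delays_samples[n])
    return out / N

# Toy example: the same pulse arrives at 4 sensors with different delays plus noise.
rng = np.random.default_rng(6)
T, delays = 1000, np.array([0, 3, 6, 9])
pulse = np.zeros(T); pulse[100:110] = 1.0
sensors = np.stack([np.roll(pulse, d) for d in delays]) + rng.normal(scale=0.3, size=(4, T))
enhanced = delay_and_sum(sensors, delays)

print("per-sensor noise power:", np.var(sensors[0] - pulse))
print("beamformer noise power:", np.var(enhanced - pulse))   # reduced by roughly N
```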
Moving to the frequency domain by applying the continuous-time Fourier transform (3.48) enables (13.1) to be rewritten as

    Y(ω) = ∫_{−∞}^{∞} y(t) e^{−jωt} dt = H^T(ω) F(ω, m),        (13.2)

where

    H(ω) = ∫_{−∞}^{∞} h(t) e^{−jωt} dt,        (13.3)
    F(ω, m) = ∫_{−∞}^{∞} f(t, m) e^{−jωt} dt,        (13.4)

are, respectively, the vectors of frequency responses of the filters and spectra of the signals produced by the sensors.
In building an actual beamforming system, we will not, of course, work with continuous-time Fourier transforms as implied by (13.2). Rather, the output of each microphone will be sampled and then processed with an analysis filter bank, such as was described in Chapter 11, to yield a set of subband samples. The N samples for each center frequency ω_m = 2πm/M, where M is the number of subband samples, will then be gathered together and the inner product (13.2) will be calculated, whereupon all M beamformer outputs can then be transformed back into the time domain by a synthesis bank. We are justified in taking this approach by the reasoning presented in Section 11.1, where it was explained that the output of the analysis bank can be interpreted as a short-time Fourier transform of the sampled signals, subject only to the condition that the signals are sampled often enough in time to satisfy the Nyquist criterion. Beamforming in the subband domain has the considerable advantage that the active sensor weights can be optimized for each subband independently, which provides a tremendous computational savings with respect to a time-domain filter-and-sum beamformer with filters of the same length on the output of each sensor.
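The processing chain just described can be sketched as follows; this is an added, highly simplified illustration in which a plain framewise FFT stands in for the analysis and synthesis filter banks of Chapter 11, and the per-subband weight vectors are assumed to be already known:

```python
# Added, simplified sketch of subband-domain beamforming: an analysis transform per
# channel, an inner product with a weight vector in each subband, then synthesis.
# A plain FFT over non-overlapping frames stands in for the filter banks of Chapter 11.
import numpy as np

def subband_beamform(signals, weights, frame_len=256):
    """signals: (N, T); weights: (n_bins, N) complex sensor weights per subband."""
    N, T = signals.shape
    n_frames = T // frame_len
    out = np.zeros(n_frames * frame_len)
    for f in range(n_frames):
        frame = signals[:, f * frame_len : (f + 1) * frame_len]
        F = np.fft.rfft(frame, axis=1)              # subband "snapshots", shape (N, n_bins)
        Y = np.einsum("bn,nb->b", weights, F)       # inner product H^T F in each subband (13.2)
        out[f * frame_len : (f + 1) * frame_len] = np.fft.irfft(Y, n=frame_len)
    return out

# Example usage with equal weights (delay-and-sum with zero steering delays) for 4 channels.
rng = np.random.default_rng(7)
sig = rng.normal(size=(4, 4096))
n_bins = 256 // 2 + 1
w = np.ones((n_bins, 4), dtype=complex) / 4.0
y = subband_beamform(sig, w)
print(y.shape)
```

The point of the structure is the one made in the text: because each subband has its own weight vector, the weights can be optimized independently per frequency.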
Although the filter frequency responses are represented as constant with time in (13.2–13.4), in subsequent sections we will relax this assumption and allow H(ω) to be adapted in order to maximize or minimize an optimization criterion. We will in this case, however, make the assumption that is standard in adaptive filtering theory, namely, that H(ω) changes sufficiently slowly such that (13.2) is valid for the duration of a single subband snapshot (Haykin 2002). This implies, however, that the system is no longer actually linear.
We will typically use spherical coordinates (r, θ, φ) to describe the propagation of sound waves through space. The relation between these spherical coordinates and the Cartesian coordinates (x, y, z) is illustrated in Figure 13.1. So defined, r > 0 is the radius or range, the polar angle θ assumes values on the range 0 ≤ θ ≤ π, and the azimuth assumes values on the range 0 ≤ φ ≤ 2π. Letting φ vary over its entire range is normal for circular arrays, but with the linear arrays considered in Section 13.1.3, it is typical for the sensors to be shielded acoustically from the rear so that, effectively, no sound propagates in the range π ≤ φ ≤ 2π.

In the classical array-processing literature, it is quite common to make a plane wave assumption, which implies that the source of the wave is so distant that the locus of points
with the same phase or wavefront is a plane. Such an assumption is seldom justified in acoustic beamforming through air, as the aperture of the array is typically of the same order of magnitude as the distance from the source to the sensors. Nonetheless, such an assumption is useful in introducing the conventional array-processing theory, our chief concern in this section, because it simplifies many important concepts. It is often useful in practice as well, in that it is not always possible to reliably estimate the distance from the source to the array, in which case the plane wave assumption is the only possible choice.

Figure 13.1  Relation between the spherical coordinates (r, θ, φ) and Cartesian coordinates (x, y, z)

Consider then a plane wave shown in Figure 13.1 propagating in the direction

    a = [ a_x  a_y  a_z ]^T = [ −sin θ cos φ   −sin θ sin φ   −cos θ ]^T.

The first simplification this produces is that the same signal f(t) arrives at each sensor, but not at the same time. Hence, we can write

    f(t, m) = [ f(t − τ_0)  f(t − τ_1)  · · ·  f(t − τ_{N−1}) ]^T,        (13.5)

where the time delay of arrival (TDOA) τ_n appearing in (13.5) can be calculated through the inner product

    τ_n = (a^T m_n)/c = −(1/c) [ m_{n,x} sin θ cos φ + m_{n,y} sin θ sin φ + m_{n,z} cos θ ],        (13.6)
where c is the velocity of sound, and m_n = [ m_{n,x}  m_{n,y}  m_{n,z} ]. Each τ_n represents the difference in arrival time of the wavefront at the nth sensor with respect to the origin.

If we now define the direction cosines

    u ≜ −a,        (13.7)

then τ_n can be expressed as

    τ_n = −(1/c) [ u_x m_{n,x} + u_y m_{n,y} + u_z m_{n,z} ] = −(u^T m_n)/c.        (13.8)
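As an added illustration of (13.6)–(13.8), the following sketch computes the plane-wave TDOAs for an assumed array geometry and direction of arrival; the speed of sound and sensor positions are illustrative values only:

```python
# Added sketch: plane-wave TDOAs tau_n = -(u^T m_n)/c from (13.8) for an assumed
# linear array and direction (theta, phi); c is the speed of sound in m/s (assumed).
import numpy as np

c = 343.0
theta, phi = np.deg2rad(90.0), np.deg2rad(30.0)       # polar angle and azimuth

# Direction of propagation a and direction cosines u = -a, as in (13.7).
a = -np.array([np.sin(theta) * np.cos(phi),
               np.sin(theta) * np.sin(phi),
               np.cos(theta)])
u = -a

# Assumed geometry: 4 sensors on the x-axis with 5 cm spacing.
m = np.array([[0.00, 0.0, 0.0],
              [0.05, 0.0, 0.0],
              [0.10, 0.0, 0.0],
              [0.15, 0.0, 0.0]])

tau = -(m @ u) / c                                    # per-sensor TDOAs (13.8)
print(tau * 1e6)                                      # in microseconds
```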
The time-delay property (3.50) of the continuous-time Fourier transform implies that under the signal model (13.5), the nth component of F(ω) defined in (13.4) can be expressed as

    F_n(ω) = ∫_{−∞}^{∞} f(t − τ_n) e^{−jωt} dt = e^{−jωτ_n} F(ω),        (13.9)

where F(ω) is the Fourier transform of the original source. From (13.7) and (13.8) we infer

    ωτ_n = (ω/c) a^T m_n = −(ω/c) u^T m_n.        (13.10)

For plane waves propagating in a locally homogeneous medium, the wave number is defined as

    k = (ω/c) a = (2π/λ) a,        (13.11)

where λ is the wavelength corresponding to the angular frequency ω. Based on (13.7), we can now express the wavenumber as

    k = −(2π/λ) [ sin θ cos φ   sin θ sin φ   cos θ ]^T = −(2π/λ) u.

Assuming that the speed of sound is constant implies that

    |k| = ω/c = 2π/λ.        (13.12)

Physically, the wavenumber represents both the direction of propagation and frequency of the plane wave. As indicated by (13.11), the vector k specifies the direction of propagation of the plane wave. Equation (13.12) implies that the magnitude of k determines the frequency of the plane wave.

Together (13.10) and (13.11) imply that

    ωτ_n = k^T m_n.        (13.13)
Hence, the Fourier transform of the propagating wave whose nth component is (13.9) can be expressed in vector form as

    F(ω) = F(ω) v_k(k),        (13.14)

where the array manifold vector, defined as

    v_k(k) ≜ [ e^{−jk^T m_0}  e^{−jk^T m_1}  · · ·  e^{−jk^T m_{N−1}} ]^T,        (13.15)

represents a complete "summary" of the interaction of the array geometry with a propagating wave.
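Pulling the geometric quantities together, the array manifold vector of (13.15) can be computed directly, as in this added sketch for an assumed geometry, frequency and direction of arrival:

```python
# Added sketch: the array manifold vector v_k(k) of (13.15) for an assumed geometry,
# frequency and direction of arrival; k = -(2*pi/lambda) u follows from (13.11).
import numpy as np

def array_manifold(m, f_hz, theta, phi, c=343.0):
    """m: (N, 3) sensor positions; returns the N-dimensional vector v_k(k)."""
    u = np.array([np.sin(theta) * np.cos(phi),
                  np.sin(theta) * np.sin(phi),
                  np.cos(theta)])                 # direction cosines (13.7)
    lam = c / f_hz                                # wavelength
    k = -(2.0 * np.pi / lam) * u                  # wavenumber vector (13.11)
    return np.exp(-1j * (m @ k))                  # elements e^{-j k^T m_n}

# Example: 4-element linear array on the x-axis, 1 kHz plane wave from broadside.
m = np.stack([np.array([0.05 * n, 0.0, 0.0]) for n in range(4)])
v = array_manifold(m, 1000.0, np.deg2rad(90.0), np.deg2rad(90.0))
print(np.round(v, 3))    # unit-modulus phase factors, one per sensor (all 1 at broadside)
```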
As mentioned previously, beamforming is typically performed in the discrete-time Fourier transform domain, through the use of
