`
`
`
`
`
`
`DISTANT SPEECH
`RECOGNITION
`
Matthias Wölfel
Universität Karlsruhe (TH), Germany

and
John McDonough
Universität des Saarlandes, Germany
`
`A John Wiley and Sons, Ltd., Publication
`
`
`
`This edition first published 2009
`© 2009 John Wiley & Sons Ltd
`
`Registered office
`
`John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
`
`For details of our global editorial offices, for customer services and for information about how to apply for
`permission to reuse the copyright material in this book please see our website at www.wiley.com.
`
`The right of the author to be identified as the author of this work has been asserted in accordance with the
`Copyright, Designs and Patents Act 1988.
`
`All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
`any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by
`the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
`
`Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be
`available in electronic books.
`
`Designations used by companies to distinguish their products are often claimed as trademarks. All brand names
`and product names used in this book are trade names, service marks, trademarks or registered trademarks of their
`respective owners. The publisher is not associated with any product or vendor mentioned in this book. This
`publication is designed to provide accurate and authoritative information in regard to the subject matter covered.
`It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional
`advice or other expert assistance is required, the services of a competent professional should be sought.
`
`
`Library of Congress Cataloging-in-Publication Data
`
Wölfel, Matthias.
Distant speech recognition / Matthias Wölfel, John McDonough.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-51704-8 (cloth)
1. Automatic speech recognition. I. McDonough, John (John W.) II. Title.
TK7882.S65W64 2009
006.4′54 – dc22
`
`2008052791
`
`A catalogue record for this book is available from the British Library
`
`ISBN 978-0-470-51704-8 (H/B)
`
`Typeset in 10/12 Times by Laserwords Private Limited, Chennai, India
`Printed and bound in Great Britain by Antony Rowe Ltd, Chippenham, Wiltshire
`
`
`
`98
`
`Distant Speech Recognition
`
Hence, it is apparent that the optimal MMSE estimator is equivalent to the conditional mean,

$$E_{p(x_k \mid y_{1:k})}\{x_k \mid y_{1:k}\} = \int x_k\, p(x_k \mid y_{1:k})\, dx_k. \tag{4.8}$$

Similarly, it follows that knowledge of the filtering density p(x_k|y_{1:k−1}) enables all other less general estimates of x_k to be readily calculated.
`
`4.2 Wiener Filter
`Stochastic filter theory was established by the pioneering work of Norbert Wiener (1949),
`Wiener and Hopf (1931), and Andrey Kolmogorov (1941a, b). A Wiener filter provides
`the optimal static, linear, MMSE solution, where the mean square error is calculated
`between the output of the filter and some desired signal. We discuss the Wiener filter in
`this section, because such a filter is equivalent to the Kalman filter described in Section 4.3
`without any process noise. Hence, the Wiener filter is in fact a Bayesian estimator (Simon
`2006, sect. 8.5.2). We will derive both the time and frequency domain solutions for the
`finite impulse response (FIR) filter.
`
`4.2.1 Time Domain Solution
`Let x[n] denote the desired signal and let d[n] represent some additive distortion. The
`primary assumptions inherent in the Wiener filter are that the second-order statistics of
`both x[n] and d[n] are stationary. The corrupted signal is then defined as
$$y[n] \triangleq x[n] + d[n].$$
`
The time domain output of the FIR Wiener filter, which is the estimate $\hat{x}[n]$ of the desired signal x[n], is by definition obtained from the convolution

$$\hat{x}[n] \triangleq \sum_{l=0}^{L-1} h[l]\, y[n-l], \tag{4.9}$$
`
where h[n] is the filter impulse response of length L. Upon defining

$$\mathbf{h} \triangleq \begin{bmatrix} h[0] & h[1] & \cdots & h[L-1] \end{bmatrix}^T,$$
$$\mathbf{y}[n] \triangleq \begin{bmatrix} y[n] & y[n-1] & \cdots & y[n-L+1] \end{bmatrix}^T,$$

the output of the filter can be expressed as

$$\hat{x}[n] = \mathbf{h}^T \mathbf{y}[n].$$

The estimation error is $\epsilon[n] \triangleq x[n] - \hat{x}[n]$, and the squared-estimation error is given by

$$\zeta \triangleq E\{\epsilon^T[n]\, \epsilon[n]\} = E\{(x[n] - \mathbf{h}^T \mathbf{y}[n])^T (x[n] - \mathbf{h}^T \mathbf{y}[n])\}, \tag{4.10}$$
`
`
`
`
`which must be minimized. Equation (4.10) can be rewritten as
$$\zeta = E\{x^T[n]\, x[n]\} - 2\,\mathbf{h}^T \mathbf{r}_{xy} + \mathbf{h}^T \mathbf{R}_y \mathbf{h},$$

where

$$\mathbf{R}_y \triangleq E\{\mathbf{y}[n]\, \mathbf{y}^T[n]\}, \qquad \mathbf{r}_{xy} \triangleq E\{\mathbf{y}[n]\, x[n]\}.$$
`
`The Wiener filter is based on the assumption that the components Ry and rxy are stationary.
In order to solve for the optimal filter coefficients, we set

$$\frac{\partial \zeta}{\partial \mathbf{h}} = -2\,\mathbf{r}_{xy} + 2\,\mathbf{R}_y \mathbf{h} = \mathbf{0},$$

which leads immediately to the famous Wiener–Hopf equation

$$\mathbf{R}_y \mathbf{h} = \mathbf{r}_{xy}. \tag{4.11}$$

The solution for the optimal coefficients is then

$$\mathbf{h}_o = \mathbf{R}_y^{-1}\, \mathbf{r}_{xy}. \tag{4.12}$$
`
`Note that the optimal solution can also be found through the well-known orthogonality
`principle (Stark and Woods 1994), which can be stated as
$$E\{y[n-i]\, \epsilon[n]\} = 0 \quad \forall\; i = 0, \ldots, L-1. \tag{4.13}$$

In other words, the orthogonality principle requires that the estimation error $\epsilon[n]$ is orthogonal to all of the inputs y[n − i] for i = 0, . . . , L − 1 used to form the estimate $\hat{x}[n]$.
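To make the time domain solution concrete, the following sketch – not from the original text, and using an arbitrary toy signal and filter length – estimates R_y and r_xy from sample statistics and solves (4.12) for the optimal FIR coefficients.

```python
import numpy as np

def wiener_fir(x, y, L=16):
    """Estimate the length-L FIR Wiener filter h_o = R_y^{-1} r_xy, cf. (4.12),
    from sample statistics of the desired signal x[n] and corrupted signal y[n]."""
    N = len(y)
    # Stacked observation vectors y[n] = [y[n], y[n-1], ..., y[n-L+1]]^T.
    Y = np.stack([y[n - np.arange(L)] for n in range(L - 1, N)])
    x_trim = x[L - 1:N]
    R_y = Y.T @ Y / len(Y)        # sample estimate of E{y[n] y^T[n]}
    r_xy = Y.T @ x_trim / len(Y)  # sample estimate of E{y[n] x[n]}
    return np.linalg.solve(R_y, r_xy)

# Toy example: white desired signal plus independent additive noise.
rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
y = x + 0.5 * rng.standard_normal(10_000)
h_o = wiener_fir(x, y)
x_hat = np.convolve(y, h_o)[:len(y)]  # filter output, cf. (4.9)
```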
`
`4.2.2 Frequency Domain Solution
In order to derive the Wiener filter in the frequency domain, let us express (4.13) as

$$E\left\{ y[n-i] \left( x[n] - \sum_{l=0}^{L-1} h_{\text{opt}}[l]\, y[n-l] \right) \right\} = 0 \quad \forall\; i = 0, \ldots, L-1.$$

Equivalently, we can write

$$r_{xy}[n] - h_{\text{opt}}[n] * r_y[n] = 0, \tag{4.14}$$
`
`
`
`100
`
`Distant Speech Recognition
`
where the cross-correlation sequence of x[n] and y[n] as well as the autocorrelation sequence of y[n] are, respectively,

$$r_{xy}[l] \triangleq \begin{cases} E\{y[n-l]\, x[n]\}, & \forall\; l = 0, \ldots, L-1, \\ 0, & \text{otherwise}, \end{cases}$$

$$r_y[l] \triangleq \begin{cases} E\{y[n-l]\, y[n]\}, & \forall\; l = -L+1, \ldots, L-1, \\ 0, & \text{otherwise}. \end{cases}$$
`
Taking the Fourier transform of (4.14) provides

$$\Phi_{XY}(\omega) - H_{\text{opt}}(\omega)\, \Phi_Y(\omega) = 0,$$

where¹ r_xy[n] ↔ Φ_XY(ω), h[n] ↔ H_opt(ω), and r_y[n] ↔ Φ_Y(ω). This leads immediately to the solution

$$H_{\text{opt}}(\omega) = \frac{\Phi_{XY}(\omega)}{\Phi_Y(\omega)}. \tag{4.15}$$
`
Given that X(ω) and D(ω) are statistically independent by assumption, it follows that

$$\Phi_Y(\omega) = \Phi_X(\omega) + \Phi_D(\omega),$$
$$\Phi_{XY}(\omega) = \Phi_X(\omega).$$
`
Hence, we can rewrite (4.15) as

$$H_{\text{opt}}(\omega) = \frac{\Phi_X(\omega)}{\Phi_X(\omega) + \Phi_D(\omega)}, \tag{4.16}$$

the form in which the Wiener filter is most often seen. Alternatively, the frequency response of the filter can be expressed as

$$H_{\text{opt}}(\omega) = \frac{1}{1 + \Phi_D(\omega)/\Phi_X(\omega)},$$
`
`from which it is apparent that when the spectral power of the disturbance comes to
`dominate that of the signal, the gain of the filter is reduced. When the signal dominates
`the disturbance, on the other hand, the gain increases. In all cases it holds that
`0 ≤ |Hopt(ω)| ≤ 1.
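As an illustration of how (4.16) might be applied in practice – this sketch is not from the original text, and the crude spectral-subtraction estimate of Φ_X(ω) is purely an assumption for the example – the Wiener gain can be applied frame by frame to the short-time spectrum of the corrupted signal, given an estimate of the noise power spectrum Φ_D(ω):

```python
import numpy as np

def wiener_gain(Phi_X, Phi_D):
    """Wiener gain H_opt(w) = Phi_X / (Phi_X + Phi_D), cf. (4.16)."""
    return Phi_X / (Phi_X + Phi_D)

def enhance_frame(y_frame, Phi_D):
    """Enhance one windowed frame of the corrupted signal y[n], given an
    estimate Phi_D of the noise power spectrum (e.g. from speech-free frames)."""
    Y = np.fft.rfft(y_frame)
    Phi_Y = np.abs(Y) ** 2
    # Crude estimate of the clean power spectrum by spectral subtraction;
    # any of the estimators referenced in Sections 6.3.1 and 13.3.5 could be used instead.
    Phi_X = np.maximum(Phi_Y - Phi_D, 1e-10)
    H = wiener_gain(Phi_X, Phi_D)  # satisfies 0 <= H <= 1 by construction
    return np.fft.irfft(H * Y, n=len(y_frame))
```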
`
As presented here, the classical Wiener filter presents something of a paradox in that it requires that the desired signal x[n] or its power spectrum Φ_X(ω) is known before the filter coefficients can be designed. Were this information available, there would be no need of a Wiener filter. The art of practical Wiener filter design consists of nothing more than the robust estimation of the desired signal Φ_X(ω) and noise Φ_D(ω) components appearing in (4.15). References indicating how this can be achieved are presented at the ends of Sections 6.3.1 and 13.3.5.

¹ The notation r_y[n] ↔ Φ_Y(ω) indicates that r_y[n] and Φ_Y(ω) comprise a Fourier transform pair; see Section 3.1.2 for details.
`
`4.3 Kalman Filter and Variations
`In this section, we present the best known set of solutions for estimating the filtering
`density, namely the Kalman filter (KF) (Kalman 1960) and its several variations.
`
`4.3.1 Kalman Filter
`The Kalman filter provides a closed form means of sequentially updating p(xk|y1:k) under
`two critical assumptions:
`• The transition and observation models fk and hk are linear.
`• The process and observation noises uk and vk are Gaussian.
`
`As the linear combination of Gaussian r.v.s is also Gaussian, these assumptions taken
`together imply that both xk and yk will remain Gaussian for all time k. Note that the
`combination of Gaussians in the nonlinear domain, such as the logarithmic domain, results
`in a non-Gaussian distribution, as described in Section 9.3.1. As mentioned previously,
`under these conditions, the KF is the optimal MMSE estimator.
`In keeping with the aforementioned linearity assumption, the state model (4.1–4.2) can
`be expressed as
`
$$x_k = F_{k|k-1}\, x_{k-1} + u_{k-1}, \tag{4.17}$$
$$y_k = H_k\, x_k + v_k, \tag{4.18}$$
`
`where Fk|k−1 and Hk are the known transition and observation matrices. The noise terms
`uk and vk in (4.17–4.18) are by assumption zero mean, white Gaussian random vector
`processes with covariance matrices
$$U_k = E\{u_k u_k^T\}, \qquad V_k = E\{v_k v_k^T\},$$
`
`respectively. Moreover, by assumption uk and vk are statistically independent.
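The prediction and correction formulas that follow from these assumptions are developed in the remainder of the chapter; as a preview, a minimal sketch of the standard Kalman recursion for the linear model (4.17–4.18) is given below. The function and variable names are illustrative assumptions, not the book's notation.

```python
import numpy as np

def kalman_step(x, P, y, F, H, U, V):
    """One predict/correct step of the standard Kalman filter for the linear
    Gaussian model x_k = F x_{k-1} + u_{k-1}, y_k = H x_k + v_k, cf. (4.17-4.18).
    x, P : previous state estimate and its error covariance
    U, V : process and observation noise covariances"""
    # Prediction based on the transition model (4.17).
    x_pred = F @ x
    P_pred = F @ P @ F.T + U
    # Correction based on the observation model (4.18).
    S = H @ P_pred @ H.T + V             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x_pred + K @ (y - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```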
`By definition, the transition matrix Fk|k−1 has two important properties:
`• product rule
`
$$F_{k|m}\, F_{m|n} = F_{k|n}, \tag{4.19}$$
`
`
`
6 Speech Feature Enhancement
`
In automatic speech recognition (ASR) the distortion of the acoustic features can be compensated for either in the model domain or in the feature domain. The former techniques adapt the model to the distorted test data, as if the model had been trained on distorted data. Feature domain techniques, on the other hand, attempt to remove or suppress the distortion itself. It has been shown in various publications, such as Deng et al. (2000) and Sehr and Kellermann (2007), that feature domain techniques provide better system performance than simply matching the training and testing conditions. The problem is
`especially severe for speech corrupted with reverberation. In particular, for reverberation
`times above 500 ms, ASR performance with respect to a model trained on clean speech
`does not improve significantly even when the acoustic model of the recognizer has been
`trained on data from the same acoustic environment (Baba et al. 2002).
`The term enhancement indicates an improvement in speech quality. For speech observa-
`tions, enhancement can be expressed either in terms of intelligibility, which is an indicator
`of how well the speech can be understood by a human, or signal quality, which is an indi-
`cator of how badly the speech is corrupted, or it can include both of these measures. For
`the purpose of automatic classification, features must be manipulated to provide a higher
`class separability. It is possible to perform speech feature enhancement in an independent
`preprocessing step, or within the front-end of the ASR system during feature extraction.
In both cases it is not necessary to modify the decoding stage, and changes to the acoustic models of the ASR system might not be required, except for methods that change the means or variances of the features, such as cepstral mean and variance normalization.
`If the training data, however, is distorted itself, it might be helpful to enhance the training
`features as well.
`In general the speech enhancement problem can be formulated as the estimation of
`cleaned speech coefficients by maximizing or minimizing certain objective criteria using
`additional knowledge, which could represent prior knowledge about the characteristics of
`the desired speech signal or unwanted distortion, for example. A common and widely
`accepted distortion measure was introduced in Chapter 4, namely, the squared error dis-
`tortion,
`
$$d(\hat{x}, x) = |f(\hat{x}) - f(x)|^2,$$
`
`
`
`
`182
`
`Distant Speech Recognition
`
where the function f(x) – which could be any one of x, |x|, x², or log x – determines the fidelity criterion of the estimator.
`As the term speech enhancement is very broad and can potentially cover a wide variety
`of techniques, including:
`• additive noise reduction,
`• dereverberation,
`• blind source separation,
`• beamforming,
`• reconstruction of lost speech packets in digital networks, or
`• bandwidth extension of narrowband speech,
`
it is useful to provide some more specificity. An obvious classification criterion is provided
`by the number and type of sensors used. Single-channel methods, as described in this
`section, obtain the input from just a single microphone while multi-channel methods rely
`on observations from an array of sensors. These methods can be further categorized by the
`type of sensors. An example of the fusion of audio and visual features in order to improve
`recognition performance is given by Almajai et al. (2007). As discussed in Chapters 12
`and 13, respectively, blind source separation and beamforming combine acoustic sig-
nals captured only with microphones. These techniques differ inasmuch as beamforming
`assumes more prior information – namely, the geometry of the sensor array and position
`of the speaker – is available. Single and multi-channel approaches can be combined to
`further improve the signal or feature in terms of the objective function used, such as
`signal-to-noise ratio (SNR), class separability, or word error rate.
`In this book we want to use the term speech feature enhancement exclusively to describe
`algorithms or devices whose purpose is to improve the speech features, where a single cor-
rupted waveform or single corrupted feature stream is available. The goal is improved classification accuracy, which may not necessarily result in improved or pleasing sound quality, if reconstruction of a waveform is possible at all. As seen in previous sections, additive noise and
`reverberation are the most frequently encountered problems in distant speech recognition
`(DSR) and our investigations are limited to methods of removing the effects of these
`distortions.
`Work on speech enhancement addressing noise reduction has been a research topic
since the early 1960s when Manfred Schröder at Bell Labs began working in the field. Schröder's analog implementation of spectral subtraction, however, is not well known inasmuch as it was only published in patents (Schröder 1965, 1968). Weiss et al. (1974) proposed an algorithm in the autocorrelation domain. Five years later Boll (1979) proposed a similar algorithm which, however, worked in the spectral domain. Boll's algorithm became one of the earliest and most popular approaches to speech enhancement. A broad variety of variations to Boll's basic spectral subtraction approach followed. Cepstral mean normalization (CMN), another popular approach, which in contrast to the aforementioned methods is designed to compensate for channel distortion, was proposed as early as 1974 by Atal (1974). CMN came into wide use, however, only in the early
`1990s. The effects of additive noise on cepstral coefficients as well as various remedies
`were investigated in the PhD dissertations by Acero (1990a), Gales (1995), and Moreno
`(1996).
`
`
`
`
`Considering speech feature enhancement as a Bayesian filtering problem leads to the
`application of a series of statistical algorithms intended to estimate the state of a dynamical
`system. Such Bayesian filters are described in Chapter 4. Pioneering work in that direction
`was presented by Lim and Oppenheim (1978) where an autoregressive model was used for
`a speech signal distorted by additive white Gaussian noise. Lim’s algorithm estimates the
`autoregressive parameters by solving the Yule–Walker equation with the current estimate
`of the speech signal and obtains an improved speech signal by applying a Wiener filter
`to the observed signal. Paliwal and Basu (1987) extended this idea by replacing the
`Wiener filter with a Kalman filter (KF). That work was likely the first application of
`the KF to speech feature enhancement. In the years following different sequential speech
`enhancement methods were proposed and the single Gaussian model was replaced by a
`Gaussian mixture (Lee et al. 1997). Several extensions intended to overcome the strict
`assumptions of the KF have appeared in the literature. The interacting multiple model,
`wherein several KFs in different stages interact with each other, was proposed by Kim
`(1998). Just recently very powerful methods based on partice filters have been proposed
`to enhance the speech features in the logarithmic spectral domain (Singh and Raj 2003;
`Yao and Nakamura 2002). This idea has been adopted and augmented by W¨olfel (2008a)
`to jointly track, estimate and compensate for additive and reverberant distortions.
`
`6.1 Noise and Reverberation in Various Domains
`We begin our exposition by defining a signal model. Let x = [x1, x2,··· , xM] denote the
`original speech sequence, let h = [h1, h2,··· , hM] denote convolutional distortions such
`as the room impulse response, and let n = [n1, n2,··· , nM] denote the additive noise
`sequence. The signal model can then be expressed as
$$y^{(t)} = h^{(t)} * x^{(t)} + n^{(t)}, \tag{6.1}$$
`
`in the discrete-time domain, which we indicate with the superscript (t). Next we develop
`equivalent representations of the signal model in alternative domains, which will be
`indicated with suitable superscripts. The relationship, however, between additive and
`convolution distortion as well as the clean signal might become nontrivial after the trans-
`formation into different domains. In particular, ignoring the phase will lead to approximate
`solutions, which are frequently used due to their relative simplicity. An overview of the
`relationship between the original and clean signal is presented in Table 6.1.
The advantage of time domain techniques is that they can be applied on a sample-by-sample basis, while all alternative domains presented here require windowing the signals and processing an entire block of data at once.
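A minimal simulation of the time-domain signal model (6.1) – purely illustrative, with a synthetic exponentially decaying impulse response and an arbitrary noise level standing in for a real room and recording – looks as follows.

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.standard_normal(16_000)          # stand-in for 1 s of clean speech at 16 kHz
h = rng.standard_normal(2_000) * np.exp(-np.arange(2_000) / 400.0)  # toy room impulse response
n = 0.05 * rng.standard_normal(len(x) + len(h) - 1)                 # additive noise

# y^(t) = h^(t) * x^(t) + n^(t), cf. (6.1): convolutional plus additive distortion.
y = np.convolve(h, x) + n
```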
`
`6.1.1 Frequency Domain
`Representing the waveform as a sum of sinusoids by the application of the Fourier trans-
`form leads to the spectral domain representation,
$$y^{(f)} = h^{(f)}\, x^{(f)} + n^{(f)}, \tag{6.2}$$
`
`
`
`13
`
`Beamforming
`
`In this chapter, we investigate a class of techniques – known collectively as beamform-
`ing – by which signals from several sensors can be combined to emphasize a desired
`source and suppress interference from other directions. Beamforming begins with the
`assumption that the positions of all sensors are known, and that the position of the
`desired source is known or can be estimated. The simplest of beamforming algorithms,
`the delay-and-sum beamformer, uses only this geometrical knowledge to combine the sig-
`nals from several sensors. More sophisticated adaptive beamformers attempt to minimize
`the total output power of the array under the constraint that the desired source must be
`unattenuated. The conventional adaptive beamforming algorithms attempt to minimize a
`quadratic optimization criterion related to signal-to-noise ratio under a distortionless con-
`straint in the look direction. Recent research has revealed, however, that such quadratic
`criteria are not optimal for acoustic beamforming of human speech. Hence, we also
`present beamformers based on non-conventional optimization criteria that have appeared
`more recently in the literature.
`Any reader well acquainted with the conventional array processing literature will cer-
`tainly have already seen the material in Sections 13.1 through 13.4. The interaction of
`propagating waves with the sensors of a beamformer are described in Section 13.1.1,
`as are the effects of sensor spacing and beam steering on the spatial sensitivity of the
`array. The beam pattern, which is a plot of array sensitivity versus direction of arrival of
`propagating wave, is defined and described in Section 13.1.2. The simplest beamformer,
`namely the delay-and-sum beamformer, is presented in Section 13.1.3, and the effects
`of beam steering are discussed in Section 13.1.4. Quantitative measures of beamforming
`performance are presented in Section 13.2, the most important of which are directivity,
`as presented in Section 13.2.1, and array gain, as presented in Section 13.2.2. These mea-
`sures will be used to evaluate the conventional beamforming algorithms described later
`in the chapter.
`In Section 13.3, we take up the discussion of the conventional beamforming algorithms.
`The minimum variance distortionless response (MVDR) is presented in Section 13.3.1, and
`its performance is analyzed in Sections 13.3.2 and 13.3.3. The beamforming algorithms
`based on the MVDR design, including the minimum mean square error and maximum
`signal-to-noise ratio beamformers, have the advantage of being tractable to analyze in
`
`
`
`
`410
`
`Distant Speech Recognition
`
`simple acoustic environments. As discussed in Section 13.3.4, the superdirective beam-
`former, which is based on particular assumptions about the ambient noise field, has
`proven useful in real acoustic environments. The minimum mean-square error (MMSE)
`beamformer is presented in Section 13.3.5 and its relation to the MVDR beamformer is
`discussed. The maximum signal-to-noise ratio design is then presented in Section 13.3.6.
`The generalized sidelobe canceller (GSC), which is to play a decisive role in the latter
`sections of this chapter, is presented in Section 13.3.7. As discussed in Section 13.3.8,
`diagonal loading is a very simple technique for adding robustness into adaptive beam-
`forming designs.
Section 13.4, the last on conventional beamforming algorithms, discusses implementations of adaptive beamforming algorithms that are suitable for online operation. First, a convergence analysis of designs based on stochastic gradient descent is presented in Section 13.4.1; thereafter the various least mean-square (LMS) error designs are presented in Section 13.4.2. These designs provide a complexity that is linear with the
`number N of sensors in the array, but can be slow to converge under unfavorable acoustic
`conditions. The recursive least square (RLS) error design, whose complexity increases as
`N 2, is discussed in Section 13.4.3. In return for this greater complexity, the RLS designs
`can provide better convergence characteristics. The RLS algorithms are known to be sus-
`ceptible to numerical instabilities. A way to remedy this problem, namely the square-root
`implementation, is discussed in Section 13.4.4.
`Recent research has revealed that the optimization criteria used in conventional array
`processing are not optimal for acoustic beamforming applications. In Section 13.5 of this
chapter we discuss nonconventional optimization criteria for beamforming. A beamformer
`that maximizes the likelihood of the output signal with respect to a hidden Markov model
`(HMM) such as those discussed in Chapters 7 and 8 is discussed in Section 13.5.1.
`Section 13.5.2 presents a nonconventional beamforming algorithm based on the opti-
`mization of a negentropy criterion subject to a distortionless constraint. The negentropy
`criterion provides an indication of how non-Gaussian a random variable is. Human speech
`is a highly non-Gaussian signal, but becomes more nearly Gaussian when corrupted with
`noise or reverberation. Hence, in adjusting the active weight vectors of a GSC so as
`to provide a maximally non-Gaussian output subject to a distortionless constraint, the
`harmful effects of noise and reverberation on the output of the array can be minimized. A
`refinement of the maximum negentropy beamformer (MNB) is presented in Section 13.5.3,
whereby an HMM is used to capture the nonstationarity of the desired speaker’s speech.
It happens quite often, when two or more people converse, that they will speak simultaneously, thereby creating regions of overlapping or simultaneous speech. Thus, the
`recognition of such simultaneous speech is an area of active research. In Section 13.5.4,
`we present a relatively new algorithm for separating overlapping speech into different
`output streams. This algorithm is based on the construction of two beamformers in GSC
`configuration, one pointing at each active speaker. To provide optimal separation perfor-
`mance, the active weight vectors of both GSCs are optimized jointly to provide two output
`streams with minimum mutual information (MinMI). This approach is also motivated in
`large part by research within the ICA field. The geometric source separation algorithm
`is presented in Section 13.5.5, which under the proper assumptions can be shown to be
`related to the MinMI beamformer.
`
`
`
`
`Section 13.6 discusses a technique for automatically inferring the geometry of a micro-
`phone array based on a diffuse noise assumption.
`In the final section of the chapter, we present our conclusions and recommendations
`for further reading.
`
`13.1 Beamforming Fundamentals
`Here we consider the fundamental concepts required to describe the interaction of propa-
`gating sound waves with sensor arrays. In this regard, the discussion here is an extension
`of that in Section 2.1. The exposition in this section is based largely on Van Trees (2002,
`sect. 2.2), and will make extensive use of the basic signal processing concepts developed
`in Chapter 3.
`
`13.1.1 Sound Propagation and Array Geometry
`To begin, consider an arbitrary array of N sensors. We will assume for the moment that the
`locations mn, for n = 0, 1, . . . , N − 1 of the sensors are known. These sensors produce
`a set of signals denoted by the vector
`
`⎤⎥⎥⎥⎦
`
`.
`
`⎡⎢⎢⎢⎣
`
`f (t, m0)
`f (t, m1)
`...
`f (t, mN−1)
`
`f(t, m) =
`
`For the present, we will also work in the continuous-time domain t. This is done only
`to avoid the granularity introduced by a discrete-time index. But this will cease to be an
`issue when we move to the subband domain, as the phase shifts and scaling factors to be
`applied in the subband domain are continuous-valued, regardless of whether or not this
`is so for the signals with which we begin. The output of each sensor is processed with a
`linear time-invariant (LTI) filter with impulse response hn(τ ) and filter outputs are then
`summed to obtain the final output of the beamformer:
`
$$y(t) = \sum_{n=0}^{N-1} \int_{-\infty}^{\infty} h_n(t-\tau)\, f_n(\tau, \mathbf{m}_n)\, d\tau.$$

In matrix notation, the output of the beamformer can be expressed as

$$y(t) = \int_{-\infty}^{\infty} \mathbf{h}^T(t-\tau)\, \mathbf{f}(\tau, \mathbf{m})\, d\tau, \tag{13.1}$$

where

$$\mathbf{h}(t) = \begin{bmatrix} h_0(t) \\ h_1(t) \\ \vdots \\ h_{N-1}(t) \end{bmatrix}.$$
`
`
`
`412
`
`Distant Speech Recognition
`
`Moving to the frequency domain by applying the continuous-time Fourier transform
`(3.48) enables (13.1) to be rewritten as
$$Y(\omega) = \int_{-\infty}^{\infty} y(t)\, e^{-j\omega t}\, dt = \mathbf{H}^T(\omega)\, \mathbf{F}(\omega, \mathbf{m}), \tag{13.2}$$

where

$$\mathbf{H}(\omega) = \int_{-\infty}^{\infty} \mathbf{h}(t)\, e^{-j\omega t}\, dt, \tag{13.3}$$

$$\mathbf{F}(\omega, \mathbf{m}) = \int_{-\infty}^{\infty} \mathbf{f}(t, \mathbf{m})\, e^{-j\omega t}\, dt, \tag{13.4}$$
`
`are, respectively, the vectors of frequency responses of the filters and spectra of the signals
`produced by the sensors.
`In building an actual beamforming system, we will not, of course, work with
`continuous-time Fourier transforms as implied by (13.2). Rather, the output of each
`microphone will be sampled then processed with an analysis filter bank such as was
`described in Chapter 11 to yield a set of subband samples. The N samples for each
center frequency ωm = 2πm/M, where M is the number of subbands, will then
`be gathered together and the inner product (13.2) will be calculated, whereupon all M
`beamformer outputs can then be transformed back into the time domain by a synthesis
`bank. We are justified in taking this approach by the reasoning presented in Section
`11.1, where it was explained that the output of the analysis bank can be interpreted as a
`short-time Fourier transform of the sampled signals subject only to the condition that the
`signals are sampled often enough in time to satisfy the Nyquist criterion. Beamforming in
`the subband domain has the considerable advantage that the active sensor weights can be
`optimized for each subband independently, which provides a tremendous computational
`savings with respect to a time-domain filter-and-sum beamformer with filters of the same
`length on the output of each sensor.
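To make this processing pipeline concrete, the following sketch – not from the text – substitutes a simple short-time Fourier transform for the analysis and synthesis filter banks of Chapter 11 and applies a fixed set of complex sensor weights independently in each subband; the array geometry, weights, frame length and overlap are arbitrary assumptions.

```python
import numpy as np

def subband_beamform(signals, weights, frame_len=512, hop=256):
    """signals : (N, T) array of sampled sensor signals
    weights : (M, N) array of complex sensor weights, one row per one-sided
              subband (M = frame_len // 2 + 1)
    Returns the time-domain beamformer output; in each subband the inner
    product Y(w) = H^T(w) F(w, m) of (13.2) is formed."""
    N, T = signals.shape
    window = np.hanning(frame_len)
    out = np.zeros(T)
    norm = np.zeros(T)
    for start in range(0, T - frame_len + 1, hop):
        frames = signals[:, start:start + frame_len] * window  # (N, frame_len)
        F = np.fft.rfft(frames, axis=1)                         # (N, M) subband snapshots
        Y = np.einsum('mn,nm->m', weights, F)                   # inner product per subband
        y = np.fft.irfft(Y, n=frame_len)                        # synthesis of this frame
        out[start:start + frame_len] += y * window
        norm[start:start + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-12)
```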
`Although the filter frequency responses are represented as constant with time in
`(13.2–13.4), in subsequent sections we will relax this assumption and allow H(ω) to be
`adapted in order to maximize or minimize an optimization criterion. We will in this case,
`however, make the assumption that is standard in adaptive filtering theory, namely, that
`H(ω) changes sufficiently slowly such that (13.2) is valid for the duration of a single
`subband snapshot (Haykin 2002). This implies, however, that the system is no longer
`actually linear.
`We will typically use spherical coordinates (r, θ, φ) to describe the propagation of sound
`waves through space. The relation between these spherical coordinates and the Cartesian
`coordinates (x, y, z) is illustrated in Figure 13.1. So defined, r > 0 is the radius or range,
`the polar angle θ assumes values on the range 0 ≤ θ ≤ π, and the azimuth assumes values
`on the range 0 ≤ φ ≤ 2π. Letting φ vary over its entire range is normal for circular arrays,
`but with the linear arrays considered in Section 13.1.3, it is typical for the sensors to be
`shielded acoustically from the rear so that, effectively, no sound propagates in the range
`π ≤ φ ≤ 2π.
`In the classical array-processing literature, it is quite common to make a plane wave
assumption, which implies that the source of the wave is so distant that the locus of points with the same phase or wavefront is a plane. Such an assumption is seldom justified in acoustic beamforming through air, as the aperture of the array is typically of the same order of magnitude as the distance from the source to the sensors. Nonetheless, such an assumption is useful in introducing the conventional array-processing theory, our chief concern in this section, because it simplifies many important concepts. It is often useful in practice as well, in that it is not always possible to reliably estimate the distance from the source to the array, in which case the plane wave assumption is the only possible choice.

Figure 13.1 Relation between the spherical coordinates (r, θ, φ) and Cartesian coordinates (x, y, z)
`
Consider then a plane wave, shown in Figure 13.1, propagating in the direction

$$\mathbf{a} = \begin{bmatrix} a_x \\ a_y \\ a_z \end{bmatrix} = \begin{bmatrix} -\sin\theta\cos\phi \\ -\sin\theta\sin\phi \\ -\cos\theta \end{bmatrix}.$$

The first simplification this produces is that the same signal f(t) arrives at each sensor, but not at the same time. Hence, we can write

$$\mathbf{f}(t, \mathbf{m}) = \begin{bmatrix} f(t-\tau_0) \\ f(t-\tau_1) \\ \vdots \\ f(t-\tau_{N-1}) \end{bmatrix}, \tag{13.5}$$

where the time delay of arrival (TDOA) τ_n appearing in (13.5) can be calculated through the inner product

$$\tau_n = \frac{\mathbf{a}^T \mathbf{m}_n}{c} = -\frac{1}{c}\left[ m_{n,x}\sin\theta\cos\phi + m_{n,y}\sin\theta\sin\phi + m_{n,z}\cos\theta \right], \tag{13.6}$$

where c is the velocity of sound and m_n = [m_{n,x} m_{n,y} m_{n,z}]^T. Each τ_n represents the difference in arrival time of the wavefront at the nth sensor with respect to the origin.
If we now define the direction cosines

$$\mathbf{u} \triangleq -\mathbf{a}, \tag{13.7}$$

then τ_n can be expressed as

$$\tau_n = -\frac{1}{c}\left[ u_x m_{n,x} + u_y m_{n,y} + u_z m_{n,z} \right] = -\frac{\mathbf{u}^T \mathbf{m}_n}{c}. \tag{13.8}$$
`
`The time-delay property (3.50) of the continuous-time Fourier transform implies that
`under the signal model (13.5), the nth component of F(ω) defined in (13.4) can be
`expressed as
`
$$F_n(\omega) = \int_{-\infty}^{\infty} f(t-\tau_n)\, e^{-j\omega t}\, dt = e^{-j\omega\tau_n}\, F(\omega), \tag{13.9}$$
`
`where F (ω) is the Fourier transform of the original source. From (13.7) and (13.8) we
`infer
`
$$\omega\tau_n = \frac{\omega}{c}\, \mathbf{a}^T \mathbf{m}_n = -\frac{\omega}{c}\, \mathbf{u}^T \mathbf{m}_n. \tag{13.10}$$
`
`For plane waves propagating in a locally homogeneous medium, the wave number is
`defined as
`
$$\mathbf{k} = \frac{\omega}{c}\, \mathbf{a} = \frac{2\pi}{\lambda}\, \mathbf{a}, \tag{13.11}$$
`
`where λ is the wavelength corresponding to the angular frequency ω. Based on (13.7),
`we can now express the wavenumber as
`
$$\mathbf{k} = -\frac{2\pi}{\lambda} \begin{bmatrix} \sin\theta\cos\phi \\ \sin\theta\sin\phi \\ \cos\theta \end{bmatrix} = -\frac{2\pi}{\lambda}\, \mathbf{u}.$$
`
`Assuming that the speed of sound is constant implies that
$$|\mathbf{k}| = \frac{\omega}{c} = \frac{2\pi}{\lambda}. \tag{13.12}$$
`
`Physically, the wavenumber represents both the direction of propagation and frequency of
`the plane wave. As indicated by (13.11), the vector k specifies the direction of propagation
`of the plane wave. Equation (13.12) implies that the magnitude of k determines the
`frequency of the plane wave.
`Together (13.10) and (13.11) imply that
$$\omega\tau_n = \mathbf{k}^T \mathbf{m}_n. \tag{13.13}$$
`
`
`
`
Hence, the Fourier transform of the propagating wave whose nth component is (13.9) can be expressed in vector form as

$$\mathbf{F}(\omega) = F(\omega)\, \mathbf{v}_{\mathbf{k}}(\mathbf{k}), \tag{13.14}$$

where the array manifold vector, defined as

$$\mathbf{v}_{\mathbf{k}}(\mathbf{k}) \triangleq \begin{bmatrix} e^{-j\mathbf{k}^T\mathbf{m}_0} \\ e^{-j\mathbf{k}^T\mathbf{m}_1} \\ \vdots \\ e^{-j\mathbf{k}^T\mathbf{m}_{N-1}} \end{bmatrix}, \tag{13.15}$$

represents a complete “summary” of the interaction of the array geometry with a
`propagating wave. As mentioned previously, beamforming is typically performed in
`the discrete-time Fourier transform domain,
`through the use of