`
DISTANT SPEECH RECOGNITION

Amazon Ex. 1017
IPR Petition - US RE47,049
`
Matthias Wölfel
Universität Karlsruhe (TH), Germany

and

John McDonough
Universität des Saarlandes, Germany
`
`A John Wiley and Sons, Ltd., Publication
`
`
`
`
`This edition first published 2009
`© 2009 John Wiley & Sons Ltd
`
`Registered office
`
`John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
`
`For details of our global editorial offices, for customer services and for information about how to apply for
`permission to reuse the copyright material in this book please see our website at www.wiley.com.
`
`The right of the author to be identified as the author of this work has been asserted in accordance with the
`Copyright, Designs and Patents Act 1988.
`
`All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
`any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by
`the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
`
`Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be
`available in electronic books.
`
`Designations used by companies to distinguish their products are often claimed as trademarks. All brand names
`and product names used in this book are trade names, service marks, trademarks or registered trademarks of their
`respective owners. The publisher is not associated with any product or vendor mentioned in this book. This
`publication is designed to provide accurate and authoritative information in regard to the subject matter covered.
`It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional
`advice or other expert assistance is required, the services of a competent professional should be sought.
`
`
`Library of Congress Cataloging-in-Publication Data
`
Wölfel, Matthias.
Distant speech recognition / Matthias Wölfel, John McDonough.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-51704-8 (cloth)
1. Automatic speech recognition. I. McDonough, John (John W.) II. Title.
TK7882.S65W64 2009
006.4′54 – dc22

2008052791
`
`A catalogue record for this book is available from the British Library
`
`ISBN 978-0-470-51704-8 (H/B)
`
`Typeset in 10/12 Times by Laserwords Private Limited, Chennai, India
`Printed and bound in Great Britain by Antony Rowe Ltd, Chippenham, Wiltshire
`
`
`
`
`13
`
`Beamforming
`
`In this chapter, we investigate a class of techniques – known collectively as beamform-
`ing – by which signals from several sensors can be combined to emphasize a desired
`source and suppress interference from other directions. Beamforming begins with the
`assumption that the positions of all sensors are known, and that the position of the
`desired source is known or can be estimated. The simplest of beamforming algorithms,
`the delay-and-sum beamformer, uses only this geometrical knowledge to combine the sig-
`nals from several sensors. More sophisticated adaptive beamformers attempt to minimize
`the total output power of the array under the constraint that the desired source must be
`unattenuated. The conventional adaptive beamforming algorithms attempt to minimize a
`quadratic optimization criterion related to signal-to-noise ratio under a distortionless con-
`straint in the look direction. Recent research has revealed, however, that such quadratic
`criteria are not optimal for acoustic beamforming of human speech. Hence, we also
`present beamformers based on non-conventional optimization criteria that have appeared
`more recently in the literature.
`Any reader well acquainted with the conventional array processing literature will cer-
`tainly have already seen the material in Sections 13.1 through 13.4. The interaction of
propagating waves with the sensors of a beamformer is described in Section 13.1.1,
`as are the effects of sensor spacing and beam steering on the spatial sensitivity of the
array. The beam pattern, which is a plot of array sensitivity versus the direction of arrival of a
propagating wave, is defined and described in Section 13.1.2. The simplest beamformer,
`namely the delay-and-sum beamformer, is presented in Section 13.1.3, and the effects
`of beam steering are discussed in Section 13.1.4. Quantitative measures of beamforming
`performance are presented in Section 13.2, the most important of which are directivity,
`as presented in Section 13.2.1, and array gain, as presented in Section 13.2.2. These mea-
`sures will be used to evaluate the conventional beamforming algorithms described later
`in the chapter.
`In Section 13.3, we take up the discussion of the conventional beamforming algorithms.
`The minimum variance distortionless response (MVDR) is presented in Section 13.3.1, and
`its performance is analyzed in Sections 13.3.2 and 13.3.3. The beamforming algorithms
`based on the MVDR design, including the minimum mean square error and maximum
`signal-to-noise ratio beamformers, have the advantage of being tractable to analyze in
`
Distant Speech Recognition Matthias Wölfel and John McDonough
`© 2009 John Wiley & Sons, Ltd.
`
`
`
`
`410
`
`Distant Speech Recognition
`
`simple acoustic environments. As discussed in Section 13.3.4, the superdirective beam-
`former, which is based on particular assumptions about the ambient noise field, has
`proven useful in real acoustic environments. The minimum mean-square error (MMSE)
`beamformer is presented in Section 13.3.5 and its relation to the MVDR beamformer is
`discussed. The maximum signal-to-noise ratio design is then presented in Section 13.3.6.
`The generalized sidelobe canceller (GSC), which is to play a decisive role in the latter
`sections of this chapter, is presented in Section 13.3.7. As discussed in Section 13.3.8,
diagonal loading is a very simple technique for adding robustness to adaptive beamforming designs.
Section 13.4, the last devoted to the conventional beamforming algorithms, discusses implementations of adaptive beamforming algorithms that are suitable for online operation.
Firstly, a convergence analysis of designs based on stochastic gradient descent is presented in Section 13.4.1; thereafter, the various least mean-square (LMS) error designs are presented in Section 13.4.2. These designs have a complexity that is linear with the
`number N of sensors in the array, but can be slow to converge under unfavorable acoustic
`conditions. The recursive least square (RLS) error design, whose complexity increases as
`N 2, is discussed in Section 13.4.3. In return for this greater complexity, the RLS designs
`can provide better convergence characteristics. The RLS algorithms are known to be sus-
`ceptible to numerical instabilities. A way to remedy this problem, namely the square-root
`implementation, is discussed in Section 13.4.4.
`Recent research has revealed that the optimization criteria used in conventional array
`processing are not optimal for acoustic beamforming applications. In Section 13.5 of this
chapter we discuss nonconventional optimization criteria for beamforming. A beamformer
`that maximizes the likelihood of the output signal with respect to a hidden Markov model
`(HMM) such as those discussed in Chapters 7 and 8 is discussed in Section 13.5.1.
`Section 13.5.2 presents a nonconventional beamforming algorithm based on the opti-
`mization of a negentropy criterion subject to a distortionless constraint. The negentropy
`criterion provides an indication of how non-Gaussian a random variable is. Human speech
`is a highly non-Gaussian signal, but becomes more nearly Gaussian when corrupted with
`noise or reverberation. Hence, in adjusting the active weight vectors of a GSC so as
`to provide a maximally non-Gaussian output subject to a distortionless constraint, the
`harmful effects of noise and reverberation on the output of the array can be minimized. A
`refinement of the maximum negentropy beamformer (MNB) is presented in Section 13.5.3,
whereby an HMM is used to capture the nonstationarity of the desired speaker’s speech.
When two or more people speak together, it happens quite often that they will speak
simultaneously, thereby creating regions of overlapping or simultaneous speech. Thus, the
`recognition of such simultaneous speech is an area of active research. In Section 13.5.4,
`we present a relatively new algorithm for separating overlapping speech into different
`output streams. This algorithm is based on the construction of two beamformers in GSC
`configuration, one pointing at each active speaker. To provide optimal separation perfor-
`mance, the active weight vectors of both GSCs are optimized jointly to provide two output
`streams with minimum mutual information (MinMI). This approach is also motivated in
large part by research within the field of independent component analysis (ICA). The geometric source separation algorithm
`is presented in Section 13.5.5, which under the proper assumptions can be shown to be
`related to the MinMI beamformer.
`
`
`
`
`
`Section 13.6 discusses a technique for automatically inferring the geometry of a micro-
`phone array based on a diffuse noise assumption.
`In the final section of the chapter, we present our conclusions and recommendations
`for further reading.
`
`13.1 Beamforming Fundamentals
`Here we consider the fundamental concepts required to describe the interaction of propa-
`gating sound waves with sensor arrays. In this regard, the discussion here is an extension
`of that in Section 2.1. The exposition in this section is based largely on Van Trees (2002,
`sect. 2.2), and will make extensive use of the basic signal processing concepts developed
`in Chapter 3.
`
`13.1.1 Sound Propagation and Array Geometry
`To begin, consider an arbitrary array of N sensors. We will assume for the moment that the
locations mn, for n = 0, 1, . . . , N − 1, of the sensors are known. These sensors produce
`a set of signals denoted by the vector
`
`⎡⎢⎢⎢⎣
`
`f (t, m0)
`f (t, m1)
`...
`f (t, mN−1)
`
`⎤⎥⎥⎥⎦
`
`.
`
`f(t, m) =
`
`For the present, we will also work in the continuous-time domain t. This is done only
`to avoid the granularity introduced by a discrete-time index. But this will cease to be an
`issue when we move to the subband domain, as the phase shifts and scaling factors to be
`applied in the subband domain are continuous-valued, regardless of whether or not this
`is so for the signals with which we begin. The output of each sensor is processed with a
`linear time-invariant (LTI) filter with impulse response hn(τ ) and filter outputs are then
`summed to obtain the final output of the beamformer:
`
$$ y(t) = \sum_{n=0}^{N-1} \int_{-\infty}^{\infty} h_n(t-\tau)\, f_n(\tau, \mathbf{m}_n)\, d\tau. $$
In matrix notation, this can be expressed as
$$ y(t) = \int_{-\infty}^{\infty} \mathbf{h}^T(t-\tau)\, \mathbf{f}(\tau, \mathbf{m})\, d\tau, \tag{13.1} $$
where
$$ \mathbf{h}(t) = \begin{bmatrix} h_0(t) \\ h_1(t) \\ \vdots \\ h_{N-1}(t) \end{bmatrix}. $$
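The filter-and-sum operation of (13.1) can be illustrated in discrete time. The following sketch (NumPy assumed; the function name is ours, not the book's) convolves each sensor signal with its own FIR filter and sums the results:

```python
import numpy as np

def filter_and_sum(x, h):
    """Discrete-time filter-and-sum beamformer, cf. (13.1).

    x : (N, T) array of sensor signals f_n[t]
    h : (N, L) array of FIR filter taps, one filter per sensor
    returns the beamformer output y[t] of length T + L - 1
    """
    N, T = x.shape
    y = np.zeros(T + h.shape[1] - 1)
    for n in range(N):
        # each sensor signal is filtered, then all filter outputs are summed
        y += np.convolve(h[n], x[n])
    return y
```

With each h_n chosen as a shifted, scaled impulse, this reduces to the delay-and-sum beamformer of Section 13.1.3.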
`
`
`
`
`412
`
`Distant Speech Recognition
`
`Moving to the frequency domain by applying the continuous-time Fourier transform
`(3.48) enables (13.1) to be rewritten as
$$ Y(\omega) = \int_{-\infty}^{\infty} y(t)\, e^{-j\omega t}\, dt = \mathbf{H}^T(\omega)\, \mathbf{F}(\omega, \mathbf{m}), \tag{13.2} $$
where
$$ \mathbf{H}(\omega) = \int_{-\infty}^{\infty} \mathbf{h}(t)\, e^{-j\omega t}\, dt, \tag{13.3} $$
$$ \mathbf{F}(\omega, \mathbf{m}) = \int_{-\infty}^{\infty} \mathbf{f}(t, \mathbf{m})\, e^{-j\omega t}\, dt, \tag{13.4} $$
`are, respectively, the vectors of frequency responses of the filters and spectra of the signals
`produced by the sensors.
`In building an actual beamforming system, we will not, of course, work with
`continuous-time Fourier transforms as implied by (13.2). Rather, the output of each
`microphone will be sampled then processed with an analysis filter bank such as was
described in Chapter 11 to yield a set of subband samples. The N samples for each
center frequency ωm = 2π m/M, where M is the number of subbands, will then
`be gathered together and the inner product (13.2) will be calculated, whereupon all M
`beamformer outputs can then be transformed back into the time domain by a synthesis
`bank. We are justified in taking this approach by the reasoning presented in Section
`11.1, where it was explained that the output of the analysis bank can be interpreted as a
`short-time Fourier transform of the sampled signals subject only to the condition that the
`signals are sampled often enough in time to satisfy the Nyquist criterion. Beamforming in
`the subband domain has the considerable advantage that the active sensor weights can be
`optimized for each subband independently, which provides a tremendous computational
`savings with respect to a time-domain filter-and-sum beamformer with filters of the same
`length on the output of each sensor.
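Per-subband beamforming as described above amounts to one complex inner product per subband and frame; a minimal sketch (NumPy assumed; the helper name is ours) applies a separate weight vector to each subband:

```python
import numpy as np

def subband_beamform(X, w):
    """Apply per-subband beamforming weights, cf. (13.2).

    X : (N, M, K) array of subband samples -- N sensors, M subbands, K frames,
        as produced by an analysis filter bank (Chapter 11)
    w : (N, M) complex weights, one weight vector per subband
    returns (M, K) outputs Y[m, k] = w[:, m]^H X[:, m, k]
    """
    # conjugate the weights and contract over the sensor axis n
    return np.einsum('nm,nmk->mk', w.conj(), X)
```

Because the weights are indexed by subband, each column of `w` can be optimized independently, which is the computational advantage noted in the text.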
`Although the filter frequency responses are represented as constant with time in
`(13.2–13.4), in subsequent sections we will relax this assumption and allow H(ω) to be
`adapted in order to maximize or minimize an optimization criterion. We will in this case,
`however, make the assumption that is standard in adaptive filtering theory, namely, that
`H(ω) changes sufficiently slowly such that (13.2) is valid for the duration of a single
`subband snapshot (Haykin 2002). This implies, however, that the system is no longer
`actually linear.
`We will typically use spherical coordinates (r, θ , φ) to describe the propagation of sound
`waves through space. The relation between these spherical coordinates and the Cartesian
`coordinates (x, y, z) is illustrated in Figure 13.1. So defined, r > 0 is the radius or range,
the polar angle θ assumes values on the range 0 ≤ θ ≤ π, and the azimuth φ assumes values
`on the range 0 ≤ φ ≤ 2π. Letting φ vary over its entire range is normal for circular arrays,
`but with the linear arrays considered in Section 13.1.3, it is typical for the sensors to be
`shielded acoustically from the rear so that, effectively, no sound propagates in the range
`π ≤ φ ≤ 2π.
`In the classical array-processing literature, it is quite common to make a plane wave
`assumption, which implies that the source of the wave is so distant that the locus of points
`
`
`
`
Figure 13.1 Relation between the spherical coordinates (r, θ, φ) and Cartesian coordinates (x, y, z): x = r sin θ cos φ, y = r sin θ sin φ, z = r cos θ

with the same phase, or wavefront, is a plane. Such an assumption is seldom justified in
acoustic beamforming through air, as the aperture of the array is typically of the same
order of magnitude as the distance from the source to the sensors. Nonetheless, such an
assumption is useful in introducing the conventional array-processing theory, our chief
concern in this section, because it simplifies many important concepts. It is often useful
in practice as well, in that it is not always possible to reliably estimate the distance from
the source to the array, in which case the plane wave assumption is the only possible
choice.

Consider then a plane wave, shown in Figure 13.1, propagating in the direction
$$ \mathbf{a} = \begin{bmatrix} a_x \\ a_y \\ a_z \end{bmatrix} = \begin{bmatrix} -\sin\theta\cos\phi \\ -\sin\theta\sin\phi \\ -\cos\theta \end{bmatrix}. $$
The first simplification this produces is that the same signal f(t) arrives at each sensor,
but not at the same time. Hence, we can write
$$ \mathbf{f}(t, \mathbf{m}) = \begin{bmatrix} f(t-\tau_0) \\ f(t-\tau_1) \\ \vdots \\ f(t-\tau_{N-1}) \end{bmatrix}, \tag{13.5} $$
where the time delay of arrival (TDOA) $\tau_n$ appearing in (13.5) can be calculated through
the inner product
$$ \tau_n = \frac{\mathbf{a}^T \mathbf{m}_n}{c} = -\frac{1}{c}\left[ m_{n,x}\sin\theta\cos\phi + m_{n,y}\sin\theta\sin\phi + m_{n,z}\cos\theta \right], \tag{13.6} $$
`
`
`
`
`414
`
`Distant Speech Recognition
`
where c is the velocity of sound, and $\mathbf{m}_n = [\, m_{n,x} \;\; m_{n,y} \;\; m_{n,z} \,]^T$. Each $\tau_n$ represents the difference in arrival time of the wavefront at the nth sensor with respect to the origin.
If we now define the direction cosines
$$ \mathbf{u} \triangleq -\mathbf{a}, \tag{13.7} $$
then $\tau_n$ can be expressed as
$$ \tau_n = -\frac{1}{c}\left[ u_x m_{n,x} + u_y m_{n,y} + u_z m_{n,z} \right] = -\frac{\mathbf{u}^T \mathbf{m}_n}{c}. \tag{13.8} $$
`
The time-delay property (3.50) of the continuous-time Fourier transform implies that
under the signal model (13.5), the nth component of F(ω) defined in (13.4) can be
expressed as
$$ F_n(\omega) = \int_{-\infty}^{\infty} f(t-\tau_n)\, e^{-j\omega t}\, dt = e^{-j\omega\tau_n}\, F(\omega), \tag{13.9} $$
where F(ω) is the Fourier transform of the original source. From (13.7) and (13.8) we
infer
$$ \omega\tau_n = \frac{\omega}{c}\, \mathbf{a}^T \mathbf{m}_n = -\frac{\omega}{c}\, \mathbf{u}^T \mathbf{m}_n. \tag{13.10} $$
`
For plane waves propagating in a locally homogeneous medium, the wave number is
defined as
$$ \mathbf{k} = \frac{\omega}{c}\, \mathbf{a} = \frac{2\pi}{\lambda}\, \mathbf{a}, \tag{13.11} $$
where λ is the wavelength corresponding to the angular frequency ω. Based on (13.7),
we can now express the wavenumber as
$$ \mathbf{k} = -\frac{2\pi}{\lambda} \begin{bmatrix} \sin\theta\cos\phi \\ \sin\theta\sin\phi \\ \cos\theta \end{bmatrix} = -\frac{2\pi}{\lambda}\, \mathbf{u}. $$
Assuming that the speed of sound is constant implies that
$$ |\mathbf{k}| = \frac{\omega}{c} = \frac{2\pi}{\lambda}. \tag{13.12} $$
`
`Physically, the wavenumber represents both the direction of propagation and frequency of
`the plane wave. As indicated by (13.11), the vector k specifies the direction of propagation
`of the plane wave. Equation (13.12) implies that the magnitude of k determines the
`frequency of the plane wave.
Together (13.10) and (13.11) imply that
$$ \omega\tau_n = \mathbf{k}^T \mathbf{m}_n. \tag{13.13} $$
`
`
`
`
`
Hence, the Fourier transform of the propagating wave whose nth component is (13.9) can
be expressed in vector form as
$$ \mathbf{F}(\omega) = F(\omega)\, \mathbf{v}_k(\mathbf{k}), \tag{13.14} $$
where the array manifold vector, defined as
$$ \mathbf{v}_k(\mathbf{k}) \triangleq \begin{bmatrix} e^{-j\mathbf{k}^T\mathbf{m}_0} \\ e^{-j\mathbf{k}^T\mathbf{m}_1} \\ \vdots \\ e^{-j\mathbf{k}^T\mathbf{m}_{N-1}} \end{bmatrix}, \tag{13.15} $$
represents a complete “summary” of the interaction of the array geometry with a
propagating wave. As mentioned previously, beamforming is typically performed in
the discrete-time Fourier transform domain, through the use of digital filter banks.
This implies that the time-shifts must be specified in samples, in which case the array
manifold vector must be represented as
$$ \mathbf{v}_{DT}(\mathbf{x}, \omega_m) \triangleq \begin{bmatrix} e^{-j\omega_m \tau_0 / T_s} \\ e^{-j\omega_m \tau_1 / T_s} \\ \vdots \\ e^{-j\omega_m \tau_{N-1} / T_s} \end{bmatrix}, \tag{13.16} $$
where the subband center frequencies are {ωm}, the propagation delays {τn} are calculated
according to (13.8), and Ts is the sampling interval defined in Section 3.1.4.

13.1.2 Beam Patterns
In Section 3.1.1 we demonstrated that the complex exponential sequence f [n] = ej ωn is
an eigensequence for any digital LTI system. It can be similarly shown that
$$ f(t) = e^{j\omega t} \tag{13.17} $$
is an eigenfunction for any analog LTI system. This implies that if the complex exponential
(13.17) is taken as the input to a single-input, single-output LTI system, the output of the
system always has the form
$$ y(t) = G(\omega)\, e^{j\omega t}, $$
where, as discussed in Section 3.1, G(ω) is the frequency response of the system. For the
analysis of multiple-input, single-output systems used in array processing, we consider
eigenfunctions of the form
$$ f_n(t, \mathbf{m}_n) = \exp\left[ j(\omega t - \mathbf{k}^T \mathbf{m}_n) \right], \tag{13.18} $$
`
`
`
`
`416
`
`Distant Speech Recognition
`
which is in fact the definition of a plane wave. For the entire array, we can write
$$ \mathbf{f}(t, \mathbf{m}) = e^{j\omega t}\, \mathbf{v}_k(\mathbf{k}). \tag{13.19} $$
The response of the array to a plane wave input can be expressed as
$$ y(t, \mathbf{k}) = \Upsilon(\omega, \mathbf{k})\, e^{j\omega t}, $$
where the frequency–wavenumber response function (Van Trees 2002, sect. 2.2) is defined
as
$$ \Upsilon(\omega, \mathbf{k}) \triangleq \mathbf{H}^T(\omega)\, \mathbf{v}_k(\mathbf{k}), $$
and H(ω) is the Fourier transform of h(t) defined in (13.3). Just as the frequency response
H(ω) defined in (3.13) specifies the response of a conventional LTI system to a sinusoidal
input, the frequency–wavenumber response function specifies the response of an array
to a plane wave input with wavenumber k and angular frequency ω. Observe that the
notation Υ(ω, k) is redundant in that the angular frequency ω is uniquely specified by
the wavenumber k through (13.12). We retain the argument ω, however, to stress the
frequency-dependent nature of the frequency–wavenumber response function.
The beam pattern indicates the sensitivity of the array to a plane wave with wavenumber
k = (2π/λ) a(θ, φ), and is defined as
$$ B(\omega : \theta, \phi) \triangleq \Upsilon(\omega, \mathbf{k})\big|_{\mathbf{k} = \frac{2\pi}{\lambda}\mathbf{a}(\theta,\phi)}, $$
`
`where a(θ , φ) is a unit vector with spherical coordinate angles θ and φ. The primary
`difference between the frequency–wavenumber response function and the beam pattern
`is that the arguments in the beam pattern must correspond to the physical angles θ and φ.
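The manifold vector (13.15) and the frequency–wavenumber response above reduce to a few lines of code; a minimal sketch (NumPy assumed; the helper names are ours):

```python
import numpy as np

def manifold(m, k):
    """Array manifold vector v_k(k) of (13.15): entries exp(-j k^T m_n)."""
    return np.exp(-1j * (m @ k))

def response(H, m, k):
    """Frequency-wavenumber response Upsilon(omega, k) = H^T v_k(k)."""
    return H @ manifold(m, k)
```

As a sanity check, choosing the sensor weights as the conjugated, normalized manifold vector for some wavenumber k0 yields a response of exactly unity at k = k0, i.e., the distortionless property of the delay-and-sum design.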
`
13.1.3 Delay-and-Sum Beamformer
In a delay-and-sum beamformer¹ (DSB), the impulse response of the filter on each sensor
is a shifted impulse:
$$ h_n(t) = \frac{1}{N}\, \delta(t + \tau_n), $$
where δ(t) is the Dirac delta function. The time shifts {τn} are calculated according to
(13.13), such that the signals from each sensor in the array upon which a plane wave
with wavenumber k and angular frequency ω impinges are added coherently. As we will
shortly see, this has the effect of enhancing the desired plane wave with respect to plane
waves propagating in other directions, provided certain conditions are met. If the signal is

¹ Many authors (Van Trees 2002) refer to the delay-and-sum beamformer as the conventional beamformer. In
this volume, however, we will reserve the term “conventional” to refer to the conventional adaptive beamformer
algorithms – namely, the minimum variance distortionless response, MMSE, and maximum signal-to-noise ratio
beamformers – discussed in Section 13.3.
`
`
`
`
Figure 13.2 Time and subband domain implementations of the delay-and-sum beamformer
`
narrowband with a center frequency of ωc, then, as indicated by (3.50), a time delay of
τn corresponds to a linear phase shift, such that the complex weight applied to the output
of the nth sensor can be expressed as
$$ w_n^* = H_n(\omega_c) = \frac{1}{N}\, e^{j\omega_c \tau_n}. $$
In matrix form, this becomes
$$ \mathbf{w}^H(\omega_c) = \mathbf{H}^T(\omega_c) = \frac{1}{N}\, \mathbf{v}_k^H(\mathbf{k}), \tag{13.20} $$
`
`where the array manifold vector vk(k) is defined in (13.15) and (13.16) for the continuous-
`and discrete-time cases, respectively. The narrowband assumption is justified in that,
`as mentioned previously, we will apply an analysis filter bank to the output of each
sensor to divide it into M narrowband signals. As discussed in Chapter 11, the filter bank
`prototype is designed to minimize aliasing distortion, which implies it will have good
`suppression in the stopband. This assertion is readily verified through an examination
`of the frequency response plots in Figures 11.10 through 11.12. Both time and subband
`domain implementations of the DSB are shown in Figure 13.2.
`A simple discrete Fourier transform (DFT) can also be used for the subband analysis
`and resynthesis. This approach, however, is suboptimal in that it corresponds to a uniform
`DFT filter bank with a prototype impulse response whose values are constant. This implies
`that there will be large sidelobes in the stopband, as shown in Figure 11.2, and that
`the complex samples at the output of the different subbands will be neither statistically
`independent nor uncorrelated.
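In the subband domain, the DSB of (13.20) is just a phase-and-sum per subband; a minimal sketch (NumPy assumed; the function name is ours):

```python
import numpy as np

def dsb_output(X, tau, omega_c):
    """Narrowband delay-and-sum: align sensor phases, then average.

    X       : (N,) complex subband snapshot, one sample per sensor
    tau     : (N,) propagation delays tau_n of the desired plane wave
    omega_c : subband center frequency in rad/s
    returns the scalar output y = sum_n w_n^* X_n, with w_n^* = e^{j w_c tau_n}/N
    """
    N = len(X)
    return np.exp(1j * omega_c * tau) @ X / N
```

For a snapshot produced by the desired plane wave itself, X_n = e^{-jωc τn} F(ωc) by (13.9), so the phase factors cancel and the output equals F(ωc) exactly: the look direction is passed undistorted.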
`In order to gain an appreciation of the behavior of a sensor array, we now introduce
`several simplifying assumptions. Firstly, we will consider the case of a uniform linear
`array with equal intersensor spacing as shown in Figure 13.1. The nth sensor is located
at
$$ m_{n,x} = \left( n - \frac{N-1}{2} \right) d, \qquad m_{n,y} = m_{n,z} = 0 \quad \forall\; n = 0, \ldots, N-1, $$
`
`where d is the intersensor spacing. As a further simplification, assume that plane waves
`propagate only parallel to the x – y plane, so that the array manifold vector (13.15) can
`
`
`
`
`418
`
`Distant Speech Recognition
`
be expressed as
$$ \mathbf{v}_k(k_x) = \begin{bmatrix} e^{\,j\frac{N-1}{2}k_x d} & e^{\,j\left(\frac{N-1}{2}-1\right)k_x d} & \cdots & e^{-j\frac{N-1}{2}k_x d} \end{bmatrix}^T, $$
where the x-component of k is by definition
$$ k_x \triangleq -\frac{2\pi}{\lambda}\cos\phi = -k_0 \cos\phi, $$
and
$$ k_0 \triangleq |\mathbf{k}| = \frac{2\pi}{\lambda}. $$
Let ux = cos φ denote the direction cosine with respect to the x-axis, and let us define
$$ \psi \triangleq -k_x d = \frac{2\pi}{\lambda}\cos\phi \cdot d = \frac{2\pi}{\lambda}\, u_x\, d. \tag{13.21} $$
`
`The variable ψ contains the all-important ratio d/λ as well as the direction of arrival
`(DOA) in u = ux = cos φ. Hence ψ is a succinct summary of all information needed to
`calculate the sensitivity of the array. The wavenumber response as a function of kx can
then be expressed as
$$ \Upsilon(\omega, k_x) = \mathbf{w}^H \mathbf{v}_k(k_x) = \sum_{n=0}^{N-1} w_n^*\, e^{-j\left(n - \frac{N-1}{2}\right)k_x d}. \tag{13.22} $$
The array manifold vector can be represented in the other spaces according to
$$ [\mathbf{v}_\phi(\phi)]_n = e^{\,j\left(n - \frac{N-1}{2}\right)\frac{2\pi d}{\lambda}\cos\phi}, $$
$$ [\mathbf{v}_u(u)]_n = e^{\,j\left(n - \frac{N-1}{2}\right)\frac{2\pi d}{\lambda}u}, $$
$$ [\mathbf{v}_\psi(\psi)]_n = e^{\,j\left(n - \frac{N-1}{2}\right)\psi}, $$
`where [·]n denotes the nth component of the relevant array manifold vector. The
`representations of the beam pattern given above are useful for several reasons. Firstly,
`the φ –space is that in which the physical wave actually propagates, hence it is inherently
`useful. As we will learn in Section 13.1.4, the representation in u–space is useful
`inasmuch as, due to the definition u (cid:2) cos φ, steering the beam in this space is equivalent
`to simply shifting the beam pattern. Finally, the ψ –space is useful because the definition
`(13.21) directly incorporates the all-important ratio d/λ, whose significance will be
`discussed in Section 13.1.4.
Based on (13.21), the beam pattern can also be expressed as a function of φ, u, or ψ:
$$ B_\phi(\phi) = \mathbf{w}^H \mathbf{v}_\phi(\phi) = e^{-j\left(\frac{N-1}{2}\right)\frac{2\pi d}{\lambda}\cos\phi} \sum_{n=0}^{N-1} w_n^*\, e^{\,jn\frac{2\pi d}{\lambda}\cos\phi}, $$
$$ B_u(u) = \mathbf{w}^H \mathbf{v}_u(u) = e^{-j\left(\frac{N-1}{2}\right)\frac{2\pi d}{\lambda}u} \sum_{n=0}^{N-1} w_n^*\, e^{\,jn\frac{2\pi d}{\lambda}u}, \tag{13.23} $$
$$ B_\psi(\psi) = \mathbf{w}^H \mathbf{v}_\psi(\psi) = e^{-j\left(\frac{N-1}{2}\right)\psi} \sum_{n=0}^{N-1} w_n^*\, e^{\,jn\psi}. $$
`
Now we introduce a further simplifying assumption, namely, that all sensors are uniformly weighted, such that
$$ w_n = \frac{1}{N} \quad \forall\; n = 0, 1, \ldots, N-1. $$
In this case, the beam pattern in ψ-space can be expressed as
$$ B_\psi(\psi) = \frac{1}{N}\, e^{-j\left(\frac{N-1}{2}\right)\psi} \sum_{n=0}^{N-1} e^{\,jn\psi}. \tag{13.24} $$
Using the identity
$$ \sum_{n=0}^{N-1} x^n = \frac{1-x^N}{1-x}, $$
it is possible to rewrite (13.24) as
$$ B_\psi(\psi) = \frac{1}{N}\, e^{-j\left(\frac{N-1}{2}\right)\psi}\, \frac{1-e^{\,jN\psi}}{1-e^{\,j\psi}} = \frac{1}{N}\, e^{-j\left(\frac{N-1}{2}\right)\psi} \cdot \frac{e^{\,jN\psi/2}}{e^{\,j\psi/2}} \cdot \frac{e^{-jN\psi/2} - e^{\,jN\psi/2}}{e^{-j\psi/2} - e^{\,j\psi/2}} = \mathrm{sinc}_N\!\left(\frac{\psi}{2}\right) \quad \forall\; -\frac{2\pi d}{\lambda} \le \psi \le \frac{2\pi d}{\lambda}, \tag{13.25} $$
where
$$ \mathrm{sinc}_N(x) \triangleq \frac{1}{N}\, \frac{\sin(Nx)}{\sin x}. \tag{13.26} $$
From the final equality in (13.25), which is plotted against both linear and decibel axes
in Figure 13.3, it is clear that Bψ(ψ) is periodic with period 2π for odd N. Moreover,
Bψ(ψ) assumes its maximum values when both numerator and denominator of (13.26)
are zero, in which case it can be shown to assume a value of unity through the application
of L’Hospital’s rule.
`
`
`
`
`420
`
`Distant Speech Recognition
`
`Visible
`Region
`
`−2
`
`−1
`
`0
`
`+1
`
`+2
`
`+3
`
`0
`−10
`−20
`−30
`−40
`−50
`−3
`
`Response Function [dB]
`
`Visible
`Region
`
`1
`
`0.5
`
`esponse Function
`
`0
`−0.5
`−1R
`
`−3
`
`−2
`
`−1
`
`0
`
`+1
`
`+2
`
`+3
`
`Figure 13.3 Comparison between a beam pattern on a linear and logarithmic scale, ψ =
`λ d cos φ, N = 20
`
`2π
`
Substituting the relevant equality from (13.21), the beam pattern can be expressed in
φ-space as
$$ B_\phi(\phi) = \mathrm{sinc}_N\!\left( \frac{\pi d}{\lambda}\cos\phi \right) \quad \forall\; 0 \le \phi \le \pi. \tag{13.27} $$
In u-space this becomes
$$ B_u(u) = \mathrm{sinc}_N\!\left( \frac{\pi d}{\lambda}\, u \right) \quad \forall\; -1 \le u \le 1. \tag{13.28} $$
`
`A comparison of the beam pattern in different spaces is provided in Figure 13.4. Note
`that in each of (13.25), (13.27) and (13.28), we have indicated the allowable range on
`the argument of the beam pattern. As shown in Figure 13.4, this range is known as the
`visible region, because this is the region in which waves may actually propagate. It is
often useful, however, to assume that ψ, φ, and u can vary over the entire real axis. In
this case, every point outside of the visible region is said to lie in the virtual region.
Clearly, the beam patterns as plotted in the kx-, ψ- and ux-spaces are simply scaled
replicas of one another, as we would expect given the linear relationships between these
variables manifest in (13.21). The beam pattern plotted in φ-space, on the other hand, has
a noticeably narrower main lobe and significantly broader sidelobes due to the term cos φ
appearing in (13.21).
`The portion of the visible region where the array provides maximal sensitivity is known
`as the main lobe. A grating lobe is a sidelobe with the same height as the main lobe. As
`mentioned previously, such lobes appear when the numerator and denominator of (13.26)
`are both zero, which for sincN (ψ/2) occurs at intervals of
$$ \psi = 2\pi m \quad \forall\; m = 1, 2, \ldots $$
for odd N. In direction cosine or u-space, the beam pattern (13.28) is specified by Bu(u) =
sincN(π d u/λ) and the grating lobes appear at intervals of
$$ u = \frac{\lambda}{d}\, m \quad \forall\; m = 1, 2, \ldots. \tag{13.29} $$
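The grating-lobe condition (13.29) is easy to verify numerically: with d = λ the first grating lobe sits at u = 1, exactly at the edge of the visible region, and has the full main-lobe height. A self-contained check (NumPy assumed; the function name is ours):

```python
import numpy as np

def pattern_u(u, N, d_over_lambda):
    """|B_u(u)| for a uniformly weighted ULA, by direct summation of (13.23)."""
    n = np.arange(N)
    return abs(np.exp(1j * 2 * np.pi * d_over_lambda * n * u).sum()) / N

N, r = 10, 1.0                       # element spacing d = lambda
assert np.isclose(pattern_u(0.0, N, r), 1.0)   # main lobe
assert np.isclose(pattern_u(1.0, N, r), 1.0)   # grating lobe at u = lambda/d
```

Halving the spacing to d = λ/2 pushes that lobe out to u = 2, deep in the virtual region, which is why half-wavelength spacing is the usual design choice.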
`
`
`
`
Figure 13.4 Beam pattern plots in kx-, ψ-, u- and φ-spaces for a linear array with d = λ/2 and N = 20
`
`The grating lobes are harmless as long as they remain in the virtual region. If the spacing
`between the sensors of the array is chosen to be too large, however, the grating lobes
`can move into the visible region. The effect is illustrated in Figure 13.5. The quantity
`that determines whether a grating lobe enters the visible region is the ratio d/λ. For a
`uniformly-weighted, uniform linear array, we must require d/λ < 1 in order to ensure that
`no grating lobe enters the visible region. We will shortly find, however, that steering can
`cause grating lobes to move into the visible region even when this condition is satisfied.
`
`13.1.4 Beam Steering
`Steering of the beam pattern is typically accomplished at the digital rather than physical
`level so that the array “listens” to a source emanating from a known or estimated position.
`For a plane wave, recall that the sensor inputs are given by (13.19). We would like the
`output to be time-aligned to the “target” wavenumber k = kT , which is known as the
`
`
`
`
`422
`
`Distant Speech Recognition
`
`Visible Region
`
`−3
`
`−2
`
`−1
`0
`u-space, d = λ/4
`
`+1
`
`+2
`
`+3
`
`−3
`
`−2
`
`−1
`0
`u-space, d = λ/2
`
`+1
`
`+2
`
`+3
`
`0
`−10
`−20
`−30
`−40
`−50
`
`0
`−10
`−20
`−30
`−40
`−50
`
`Response Function [dB]
`
`Response Function [dB]
`
`−3
`
`−2
`
`−1
`
`+1
`
`+2
`
`+3
`
`0
`−10
`−20
`−30
`−40
`−50
`
`Response Function [dB]
`
`−½
`
`−¼
`
`−¾
`
`−¼
`
`−½
`
`−¾
`
`−¼
`
`−¾
`
`−½
`
`0
`
`+¼
`
`+½
`
`+¾
`
`+¼
`
`+½
`
`+¾
`
`+¼
`
`+½
`
`+¾
`
`1.0
`
`0
`
`1.0
`
`0
`
`1.0
`
`0
`u-space, d = λ
`Figure 13.5 Effect of element spacing on beam patterns in linear and polar coordinates for N = 10
`
main response axis or look direction. As noted before, steering can be accomplished with
time delays, or phase shifts. We will, however, universally prefer the latter based on our
use of filter banks to carve up the sensor outputs into narrowband signals. The steered
sensor inputs can then be expressed as
$$ \mathbf{f}_s(t, \mathbf{m}) = e^{j\omega t}\, \mathbf{v}_k(\mathbf{k} - \mathbf{k}_T), $$
and the steered frequency wavenumber response as
$$ \Upsilon(\omega, \mathbf{k}\,|\,\mathbf{k}_T) = \Upsilon(\omega, \mathbf{k} - \mathbf{k}_T). $$
Hence, in wavenumber space, steering is equivalent to a simple shift, which is the principal
advantage of plotting beam patterns in this space.
When the DSB is steered to k = kT, the sensor weights become
$$ \mathbf{w} = \frac{1}{N}\, \mathbf{v}_k(\mathbf{k}_T). \tag{13.30} $$
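That steering is a pure shift in u-space can be confirmed numerically; a minimal sketch (NumPy assumed; the function name is ours):

```python
import numpy as np

def dsb_pattern_u(u, u_T, N, d_over_lambda):
    """Delay-and-sum beam pattern in u-space for a ULA steered to u_T."""
    n = np.arange(N) - (N - 1) / 2        # symmetric sensor indices
    v = np.exp(1j * 2 * np.pi * d_over_lambda * n * u)      # v_u(u)
    v_T = np.exp(1j * 2 * np.pi * d_over_lambda * n * u_T)  # v_u(u_T)
    return (v_T.conj() @ v) / N
```

The pattern depends on u and u_T only through the difference u − u_T, so the steered pattern is the unsteered pattern translated by u_T, and the response in the look direction u = u_T is exactly unity.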
`
The delay-and-sum beam pattern, which by definition is
$$ B_{dsb}(\mathbf{k} : \mathbf{k}_T) \triangleq \frac{1}{N}\, \mathbf{v}_k^H(\mathbf{k}_T)\, \mathbf{v}_k(\mathbf{k}) \Big|_{\mathbf{k} = \mathbf{a}(\theta,\phi)}, \tag{13.31} $$
`
`
`
`
`
is that beam pattern obtained when a DSB is steered to wavenumber kT and evaluated
at wavenumber k = a(θ, φ). For a linear array, the delay-and-sum beam pattern can be
expressed as
$$ B_{dsb}(\psi : \psi_T) = \frac{1}{N}\, \mathbf{v}_\psi^H(\psi_T)\, \mathbf{v}_\psi(\psi) = \frac{1}{N}\, \frac{\sin\!\left( N\, \frac{\psi - \psi_T}{2} \right)}{\sin\!\left( \frac{\psi - \psi_T}{2} \right)}, $$
or alternatively in u-space as
$$ B_{dsb}(u : u_T) = \frac{1}{N}\, \mathbf{v}_u^H(u_T)\, \mathbf{v}_u(u) = \frac{1}{N}\, \frac{\sin\!\left( \frac{\pi N d}{\lambda}(u - u_T) \right)}{\sin\!\left( \frac{\pi d}{\lambda}(u - u_T) \right)}. $$
The broadside angle $\bar\phi \triangleq \phi - \pi/2$ is, by definition, measured with respect to the y-axis
`and has the same sense as φ. The effect of array steering with respect to φ is illustrated
`in Figure 13.6. Based on the fact that steering corresponds to a simple shift in u-space,
`we can readily develop a requirement for exclud