MATTHIAS WÖLFEL AND JOHN McDONOUGH

DISTANT
SPEECH
RECOGNITION

Amazon Ex. 1017
IPR Petition - US RE47,049

DISTANT SPEECH
RECOGNITION

Matthias Wölfel
Universität Karlsruhe (TH), Germany

and

John McDonough
Universität des Saarlandes, Germany

A John Wiley and Sons, Ltd., Publication

This edition first published 2009
© 2009 John Wiley & Sons Ltd

Registered office

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

Library of Congress Cataloging-in-Publication Data

Wölfel, Matthias.
Distant speech recognition / Matthias Wölfel, John McDonough.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-51704-8 (cloth)
1. Automatic speech recognition. I. McDonough, John (John W.) II. Title.
TK7882.S65W64 2009
006.4/54 – dc22
2008052791

A catalogue record for this book is available from the British Library

ISBN 978-0-470-51704-8 (H/B)

Typeset in 10/12 Times by Laserwords Private Limited, Chennai, India
Printed and bound in Great Britain by Antony Rowe Ltd, Chippenham, Wiltshire
13

Beamforming
In this chapter, we investigate a class of techniques – known collectively as beamforming – by which signals from several sensors can be combined to emphasize a desired source and suppress interference from other directions. Beamforming begins with the assumption that the positions of all sensors are known, and that the position of the desired source is known or can be estimated. The simplest of beamforming algorithms, the delay-and-sum beamformer, uses only this geometrical knowledge to combine the signals from several sensors. More sophisticated adaptive beamformers attempt to minimize the total output power of the array under the constraint that the desired source must be unattenuated. The conventional adaptive beamforming algorithms attempt to minimize a quadratic optimization criterion related to signal-to-noise ratio under a distortionless constraint in the look direction. Recent research has revealed, however, that such quadratic criteria are not optimal for acoustic beamforming of human speech. Hence, we also present beamformers based on non-conventional optimization criteria that have appeared more recently in the literature.

Any reader well acquainted with the conventional array processing literature will certainly have already seen the material in Sections 13.1 through 13.4. The interaction of propagating waves with the sensors of a beamformer is described in Section 13.1.1, as are the effects of sensor spacing and beam steering on the spatial sensitivity of the array. The beam pattern, which is a plot of array sensitivity versus direction of arrival of a propagating wave, is defined and described in Section 13.1.2. The simplest beamformer, namely the delay-and-sum beamformer, is presented in Section 13.1.3, and the effects of beam steering are discussed in Section 13.1.4. Quantitative measures of beamforming performance are presented in Section 13.2, the most important of which are directivity, as presented in Section 13.2.1, and array gain, as presented in Section 13.2.2. These measures will be used to evaluate the conventional beamforming algorithms described later in the chapter.

In Section 13.3, we take up the discussion of the conventional beamforming algorithms. The minimum variance distortionless response (MVDR) beamformer is presented in Section 13.3.1, and its performance is analyzed in Sections 13.3.2 and 13.3.3. The beamforming algorithms based on the MVDR design, including the minimum mean square error and maximum signal-to-noise ratio beamformers, have the advantage of being tractable to analyze in
simple acoustic environments. As discussed in Section 13.3.4, the superdirective beamformer, which is based on particular assumptions about the ambient noise field, has proven useful in real acoustic environments. The minimum mean-square error (MMSE) beamformer is presented in Section 13.3.5 and its relation to the MVDR beamformer is discussed. The maximum signal-to-noise ratio design is then presented in Section 13.3.6. The generalized sidelobe canceller (GSC), which is to play a decisive role in the latter sections of this chapter, is presented in Section 13.3.7. As discussed in Section 13.3.8, diagonal loading is a very simple technique for adding robustness to adaptive beamforming designs.

Section 13.4, the last about the conventional beamforming algorithms, discusses implementations of adaptive beamforming algorithms that are suitable for online operation. Firstly, a convergence analysis of designs based on stochastic gradient descent is presented in Section 13.4.1; thereafter the various least mean-square (LMS) error designs are presented in Section 13.4.2. These designs provide a complexity that is linear in the number N of sensors in the array, but can be slow to converge under unfavorable acoustic conditions. The recursive least square (RLS) error design, whose complexity increases as N², is discussed in Section 13.4.3. In return for this greater complexity, the RLS designs can provide better convergence characteristics. The RLS algorithms are known to be susceptible to numerical instabilities. A way to remedy this problem, namely the square-root implementation, is discussed in Section 13.4.4.

Recent research has revealed that the optimization criteria used in conventional array processing are not optimal for acoustic beamforming applications. In Section 13.5 of this chapter we discuss nonconventional optimization criteria for beamforming. A beamformer that maximizes the likelihood of the output signal with respect to a hidden Markov model (HMM), such as those discussed in Chapters 7 and 8, is discussed in Section 13.5.1. Section 13.5.2 presents a nonconventional beamforming algorithm based on the optimization of a negentropy criterion subject to a distortionless constraint. The negentropy criterion provides an indication of how non-Gaussian a random variable is. Human speech is a highly non-Gaussian signal, but becomes more nearly Gaussian when corrupted with noise or reverberation. Hence, in adjusting the active weight vectors of a GSC so as to provide a maximally non-Gaussian output subject to a distortionless constraint, the harmful effects of noise and reverberation on the output of the array can be minimized. A refinement of the maximum negentropy beamformer (MNB) is presented in Section 13.5.3, whereby an HMM is used to capture the nonstationarity of the desired speaker's speech.

It happens quite often when two or more people speak together that they will speak simultaneously, thereby creating regions of overlapping or simultaneous speech. Thus, the recognition of such simultaneous speech is an area of active research. In Section 13.5.4, we present a relatively new algorithm for separating overlapping speech into different output streams. This algorithm is based on the construction of two beamformers in GSC configuration, one pointing at each active speaker. To provide optimal separation performance, the active weight vectors of both GSCs are optimized jointly to provide two output streams with minimum mutual information (MinMI). This approach is also motivated in large part by research within the ICA field. The geometric source separation algorithm is presented in Section 13.5.5, which under the proper assumptions can be shown to be related to the MinMI beamformer.
Section 13.6 discusses a technique for automatically inferring the geometry of a microphone array based on a diffuse noise assumption.

In the final section of the chapter, we present our conclusions and recommendations for further reading.

13.1 Beamforming Fundamentals

Here we consider the fundamental concepts required to describe the interaction of propagating sound waves with sensor arrays. In this regard, the discussion here is an extension of that in Section 2.1. The exposition in this section is based largely on Van Trees (2002, sect. 2.2), and will make extensive use of the basic signal processing concepts developed in Chapter 3.
13.1.1 Sound Propagation and Array Geometry

To begin, consider an arbitrary array of N sensors. We will assume for the moment that the locations mn, for n = 0, 1, . . . , N − 1, of the sensors are known. These sensors produce a set of signals denoted by the vector

f(t, m) = [ f(t, m0)  f(t, m1)  · · ·  f(t, mN−1) ]T.
For the present, we will also work in the continuous-time domain t. This is done only to avoid the granularity introduced by a discrete-time index. But this will cease to be an issue when we move to the subband domain, as the phase shifts and scaling factors to be applied in the subband domain are continuous-valued, regardless of whether or not this is so for the signals with which we begin. The output of each sensor is processed with a linear time-invariant (LTI) filter with impulse response hn(τ), and the filter outputs are then summed to obtain the final output of the beamformer:

y(t) = Σ_{n=0}^{N−1} ∫_{−∞}^{∞} hn(t − τ) fn(τ, mn) dτ.

In matrix notation, this becomes

y(t) = ∫_{−∞}^{∞} hT(t − τ) f(τ, m) dτ,    (13.1)

where

h(t) = [ h0(t)  h1(t)  · · ·  hN−1(t) ]T.
Moving to the frequency domain by applying the continuous-time Fourier transform (3.48) enables (13.1) to be rewritten as

Y(ω) = ∫_{−∞}^{∞} y(t) e^{−jωt} dt = HT(ω) F(ω, m),    (13.2)

where

H(ω) = ∫_{−∞}^{∞} h(t) e^{−jωt} dt,    (13.3)

F(ω, m) = ∫_{−∞}^{∞} f(t, m) e^{−jωt} dt,    (13.4)

are, respectively, the vectors of frequency responses of the filters and spectra of the signals produced by the sensors.
In building an actual beamforming system, we will not, of course, work with continuous-time Fourier transforms as implied by (13.2). Rather, the output of each microphone will be sampled then processed with an analysis filter bank such as was described in Chapter 11 to yield a set of subband samples. The N samples for each center frequency ωm = 2πm/M, where M is the number of subband samples, will then be gathered together and the inner product (13.2) will be calculated, whereupon all M beamformer outputs can then be transformed back into the time domain by a synthesis bank. We are justified in taking this approach by the reasoning presented in Section 11.1, where it was explained that the output of the analysis bank can be interpreted as a short-time Fourier transform of the sampled signals, subject only to the condition that the signals are sampled often enough in time to satisfy the Nyquist criterion. Beamforming in the subband domain has the considerable advantage that the active sensor weights can be optimized for each subband independently, which provides a tremendous computational savings with respect to a time-domain filter-and-sum beamformer with filters of the same length on the output of each sensor.

Although the filter frequency responses are represented as constant with time in (13.2)–(13.4), in subsequent sections we will relax this assumption and allow H(ω) to be adapted in order to maximize or minimize an optimization criterion. We will in this case, however, make the assumption that is standard in adaptive filtering theory, namely, that H(ω) changes sufficiently slowly such that (13.2) is valid for the duration of a single subband snapshot (Haykin 2002). This implies, however, that the system is no longer actually linear.

We will typically use spherical coordinates (r, θ, φ) to describe the propagation of sound waves through space. The relation between these spherical coordinates and the Cartesian coordinates (x, y, z) is illustrated in Figure 13.1. So defined, r > 0 is the radius or range, the polar angle θ assumes values on the range 0 ≤ θ ≤ π, and the azimuth φ assumes values on the range 0 ≤ φ ≤ 2π. Letting φ vary over its entire range is normal for circular arrays, but with the linear arrays considered in Section 13.1.3, it is typical for the sensors to be shielded acoustically from the rear so that, effectively, no sound propagates in the range π ≤ φ ≤ 2π.
In the classical array-processing literature, it is quite common to make a plane wave assumption, which implies that the source of the wave is so distant that the locus of points with the same phase, or wavefront, is a plane. Such an assumption is seldom justified in acoustic beamforming through air, as the aperture of the array is typically of the same order of magnitude as the distance from the source to the sensors. Nonetheless, such an assumption is useful in introducing the conventional array-processing theory, our chief concern in this section, because it simplifies many important concepts. It is often useful in practice as well, in that it is not always possible to reliably estimate the distance from the source to the array, in which case the plane wave assumption is the only possible choice.

Figure 13.1 Relation between the spherical coordinates (r, θ, φ) and Cartesian coordinates (x, y, z), where x = r sin θ cos φ, y = r sin θ sin φ, z = r cos θ

Consider then a plane wave, shown in Figure 13.1, propagating in the direction

a = [ ax  ay  az ]T = [ −sin θ cos φ  −sin θ sin φ  −cos θ ]T.

The first simplification this produces is that the same signal f(t) arrives at each sensor, but not at the same time. Hence, we can write

f(t, m) = [ f(t − τ0)  f(t − τ1)  · · ·  f(t − τN−1) ]T,    (13.5)

where the time delay of arrival (TDOA) τn appearing in (13.5) can be calculated through the inner product

τn = aT mn / c = −(1/c) [ mn,x sin θ cos φ + mn,y sin θ sin φ + mn,z cos θ ].    (13.6)
c is the velocity of sound, and mn = [ mn,x  mn,y  mn,z ]T. Each τn represents the difference in arrival time of the wavefront at the nth sensor with respect to the origin. If we now define the direction cosines

u ≜ −a,    (13.7)

then τn can be expressed as

τn = −(1/c) [ ux mn,x + uy mn,y + uz mn,z ] = −uT mn / c.    (13.8)
The time-delay property (3.50) of the continuous-time Fourier transform implies that under the signal model (13.5), the nth component of F(ω) defined in (13.4) can be expressed as

Fn(ω) = ∫_{−∞}^{∞} f(t − τn) e^{−jωt} dt = e^{−jωτn} F(ω),    (13.9)

where F(ω) is the Fourier transform of the original source. From (13.7) and (13.8) we infer

ωτn = (ω/c) aT mn = −(ω/c) uT mn.    (13.10)
For plane waves propagating in a locally homogeneous medium, the wavenumber is defined as

k = (ω/c) a = (2π/λ) a,    (13.11)

where λ is the wavelength corresponding to the angular frequency ω. Based on (13.7), we can now express the wavenumber as

k = −(2π/λ) [ sin θ cos φ  sin θ sin φ  cos θ ]T = −(2π/λ) u.

Assuming that the speed of sound is constant implies that

|k| = ω/c = 2π/λ.    (13.12)

Physically, the wavenumber represents both the direction of propagation and the frequency of the plane wave. As indicated by (13.11), the vector k specifies the direction of propagation of the plane wave. Equation (13.12) implies that the magnitude of k determines the frequency of the plane wave.

Together (13.10) and (13.11) imply that

ωτn = kT mn.    (13.13)
Hence, the Fourier transform of the propagating wave whose nth component is (13.9) can be expressed in vector form as

F(ω) = F(ω) vk(k),    (13.14)

where the array manifold vector, defined as

vk(k) ≜ [ e^{−j kT m0}  e^{−j kT m1}  · · ·  e^{−j kT mN−1} ]T,    (13.15)

represents a complete “summary” of the interaction of the array geometry with a propagating wave. As mentioned previously, beamforming is typically performed in the discrete-time Fourier transform domain, through the use of digital filter banks. This implies that the time-shifts must be specified in samples, in which case the array manifold vector must be represented as

vDT(x, ωm) ≜ [ e^{−jωm τ0/Ts}  e^{−jωm τ1/Ts}  · · ·  e^{−jωm τN−1/Ts} ]T,    (13.16)

where the subband center frequencies are {ωm}, the propagation delays {τn} are calculated according to (13.8), and Ts is the sampling interval defined in Section 3.1.4.

13.1.2 Beam Patterns

In Section 3.1.1 we demonstrated that the complex exponential sequence f[n] = e^{jωn} is an eigensequence for any digital LTI system. It can be similarly shown that

f(t) = e^{jωt}    (13.17)

is an eigenfunction for any analog LTI system. This implies that if the complex exponential (13.17) is taken as the input to a single-input, single-output LTI system, the output of the system always has the form

y(t) = G(ω) e^{jωt},

where, as discussed in Section 3.1, G(ω) is the frequency response of the system. For the analysis of multiple-input, single-output systems used in array processing, we consider eigenfunctions of the form

fn(t, mn) = exp[ j(ωt − kT mn) ],    (13.18)
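To illustrate (13.15), this sketch builds the array manifold vector for an assumed geometry: four sensors on the x-axis at half-wavelength spacing. For a wave arriving from broadside, kT mn vanishes for every sensor and all the phase factors reduce to unity; for an endfire wave, successive sensors see a phase ramp.

```python
import cmath
import math

def manifold_vector(mics, k):
    """Array manifold vector v_k(k): one phase factor exp(-j k^T m_n)
    per sensor position m_n, cf. (13.15)."""
    return [cmath.exp(-1j * sum(ki * mi for ki, mi in zip(k, mn)))
            for mn in mics]

lam = 0.1                      # wavelength (assumed value)
d = lam / 2                    # half-wavelength spacing
mics = [(n * d, 0.0, 0.0) for n in range(4)]

# Wave propagating along -y (broadside to the array): a = (0, -1, 0),
# so k = (2*pi/lam) * a has no x-component and k^T m_n = 0
k_broadside = (0.0, -2 * math.pi / lam, 0.0)
v_broadside = manifold_vector(mics, k_broadside)

# Wave propagating along -x (endfire): a = (-1, 0, 0); with d = lam/2
# the phase advances by pi per sensor, giving alternating signs
k_endfire = (-2 * math.pi / lam, 0.0, 0.0)
v_endfire = manifold_vector(mics, k_endfire)
```

Every component has unit magnitude: the manifold vector encodes phase only, as (13.15) implies.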
which is in fact the definition of a plane wave. For the entire array, we can write

f(t, m) = e^{jωt} vk(k).    (13.19)

The response of the array to a plane wave input can be expressed as

y(t, k) = ϒ(ω, k) e^{jωt},

where the frequency–wavenumber response function (Van Trees 2002, sect. 2.2) is defined as

ϒ(ω, k) ≜ HT(ω) vk(k),

and H(ω) is the Fourier transform of h(t) defined in (13.3). Just as the frequency response H(ω) defined in (3.13) specifies the response of a conventional LTI system to a sinusoidal input, the frequency–wavenumber response function specifies the response of an array to a plane wave input with wavenumber k and angular frequency ω. Observe that the notation ϒ(ω, k) is redundant in that the angular frequency ω is uniquely specified by the wavenumber k through (13.12). We retain the argument ω, however, to stress the frequency-dependent nature of the frequency–wavenumber response function.

The beam pattern indicates the sensitivity of the array to a plane wave with wavenumber k = (2π/λ) a(θ, φ), and is defined as

B(ω : θ, φ) ≜ ϒ(ω, k)|_{k = (2π/λ) a(θ, φ)},

where a(θ, φ) is a unit vector with spherical coordinate angles θ and φ. The primary difference between the frequency–wavenumber response function and the beam pattern is that the arguments in the beam pattern must correspond to the physical angles θ and φ.
13.1.3 Delay-and-Sum Beamformer

In a delay-and-sum beamformer¹ (DSB), the impulse response of the filter on each sensor is a shifted impulse:

hn(t) = (1/N) δ(t + τn),

where δ(t) is the Dirac delta function. The time shifts {τn} are calculated according to (13.13), such that the signals from each sensor in the array upon which a plane wave with wavenumber k and angular frequency ω impinges are added coherently. As we will shortly see, this has the effect of enhancing the desired plane wave with respect to plane waves propagating in other directions, provided certain conditions are met. If the signal is

¹ Many authors (Van Trees 2002) refer to the delay-and-sum beamformer as the conventional beamformer. In this volume, however, we will reserve the term “conventional” to refer to the conventional adaptive beamformer algorithms – namely, the minimum variance distortionless response, MMSE, and maximum signal-to-noise ratio beamformers – discussed in Section 13.3.
Figure 13.2 Time and subband domain implementations of the delay-and-sum beamformer
narrowband with a center frequency of ωc, then, as indicated by (3.50), a time delay of τn corresponds to a linear phase shift, such that the complex weight applied to the output of the nth sensor can be expressed as

wn* = Hn(ωc) = (1/N) e^{jωc τn}.

In matrix form, this becomes

wH(ωc) = HT(ωc) = (1/N) vkH(k),    (13.20)

where the array manifold vector vk(k) is defined in (13.15) and (13.16) for the continuous- and discrete-time cases, respectively. The narrowband assumption is justified in that, as mentioned previously, we will apply an analysis filter bank to the output of each sensor to divide it into M narrowband signals. As discussed in Chapter 11, the filter bank prototype is designed to minimize aliasing distortion, which implies it will have good suppression in the stopband. This assertion is readily verified through an examination of the frequency response plots in Figures 11.10 through 11.12. Both time and subband domain implementations of the DSB are shown in Figure 13.2.

A simple discrete Fourier transform (DFT) can also be used for the subband analysis and resynthesis. This approach, however, is suboptimal in that it corresponds to a uniform DFT filter bank with a prototype impulse response whose values are constant. This implies that there will be large sidelobes in the stopband, as shown in Figure 11.2, and that the complex samples at the output of the different subbands will be neither statistically independent nor uncorrelated.
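A minimal numerical sketch of the narrowband DSB weights (13.20): with w = vk(kT)/N, a subband snapshot produced by a unit-amplitude plane wave from the look direction is passed undistorted, while a wave from another direction is attenuated. The array geometry, wavelength, and look direction are assumed values for the example.

```python
import cmath
import math

def dsb_output(snapshot, v_target):
    """Delay-and-sum output Y = w^H F with w = v_k(k_T) / N, cf. (13.20)."""
    N = len(v_target)
    return sum((vn / N).conjugate() * fn
               for vn, fn in zip(v_target, snapshot))

lam, d, N = 0.1, 0.05, 4           # wavelength and spacing (assumed)
kx = -2 * math.pi / lam            # endfire look direction
v_target = [cmath.exp(-1j * kx * n * d) for n in range(N)]

# Snapshot produced by a unit-amplitude plane wave from the look
# direction: F(omega, m) = F(omega) * v_k(k) with F(omega) = 1
snapshot = list(v_target)
Y = dsb_output(snapshot, v_target)     # distortionless response

# A wave arriving from a different direction is attenuated
snapshot_off = [cmath.exp(-1j * (0.6 * kx) * n * d) for n in range(N)]
Y_off = dsb_output(snapshot_off, v_target)
```

Because the weights conjugate-match the target's phase factors, the in-look-direction output is exactly the source spectrum, which is the distortionless property the text describes.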
In order to gain an appreciation of the behavior of a sensor array, we now introduce several simplifying assumptions. Firstly, we will consider the case of a uniform linear array with equal intersensor spacing as shown in Figure 13.1. The nth sensor is located at

mn,x = ( n − (N − 1)/2 ) d,  mn,y = mn,z = 0  ∀ n = 0, . . . , N − 1,

where d is the intersensor spacing. As a further simplification, assume that plane waves propagate only parallel to the x–y plane, so that the array manifold vector (13.15) can
be expressed as

vk(kx) = [ e^{j((N−1)/2) kx d}  · · ·  e^{j((N−1)/2 − n) kx d}  · · ·  e^{−j((N−1)/2) kx d} ]T,

where the x-component of k is by definition

kx ≜ −(2π/λ) cos φ = −k0 cos φ,

and

k0 ≜ |k| = 2π/λ.

Let ux = cos φ denote the direction cosine with respect to the x-axis, and let us define

ψ ≜ −kx d = (2π/λ) cos φ · d = (2π/λ) ux d.    (13.21)
The variable ψ contains the all-important ratio d/λ as well as the direction of arrival (DOA) in u = ux = cos φ. Hence ψ is a succinct summary of all information needed to calculate the sensitivity of the array. The wavenumber response as a function of kx can then be expressed as

ϒ(ω, kx) = wH vk(kx) = Σ_{n=0}^{N−1} wn* e^{−j(n − (N−1)/2) kx d}.    (13.22)

The array manifold vector can be represented in the other spaces according to

[vφ(φ)]n = e^{j(n − (N−1)/2)(2πd/λ) cos φ},
[vu(u)]n = e^{j(n − (N−1)/2)(2πd/λ) u},
[vψ(ψ)]n = e^{j(n − (N−1)/2) ψ},
where [·]n denotes the nth component of the relevant array manifold vector. The representations of the beam pattern given above are useful for several reasons. Firstly, the φ-space is that in which the physical wave actually propagates, hence it is inherently useful. As we will learn in Section 13.1.4, the representation in u-space is useful inasmuch as, due to the definition u ≜ cos φ, steering the beam in this space is equivalent to simply shifting the beam pattern. Finally, the ψ-space is useful because the definition (13.21) directly incorporates the all-important ratio d/λ, whose significance will be discussed in Section 13.1.4.
Based on (13.21), the beam pattern can also be expressed as a function of φ, u, or ψ:

Bφ(φ) = wH vφ(φ) = e^{−j((N−1)/2)(2πd/λ) cos φ} Σ_{n=0}^{N−1} wn* e^{jn(2πd/λ) cos φ},

Bu(u) = wH vu(u) = e^{−j((N−1)/2)(2πd/λ) u} Σ_{n=0}^{N−1} wn* e^{jn(2πd/λ) u},    (13.23)

Bψ(ψ) = wH vψ(ψ) = e^{−j((N−1)/2) ψ} Σ_{n=0}^{N−1} wn* e^{jnψ}.
Now we introduce a further simplifying assumption, namely, that all sensors are uniformly weighted, such that

wn = 1/N  ∀ n = 0, 1, . . . , N − 1.

In this case, the beam pattern in ψ-space can be expressed as

Bψ(ψ) = (1/N) e^{−j((N−1)/2) ψ} Σ_{n=0}^{N−1} e^{jnψ}.    (13.24)
Using the identity

Σ_{n=0}^{N−1} x^n = (1 − x^N) / (1 − x),

it is possible to rewrite (13.24) as

Bψ(ψ) = (1/N) e^{−j((N−1)/2) ψ} (1 − e^{jNψ}) / (1 − e^{jψ})
      = (1/N) e^{−j((N−1)/2) ψ} · e^{j((N−1)/2) ψ} · (e^{−jNψ/2} − e^{jNψ/2}) / (e^{−jψ/2} − e^{jψ/2})
      = sincN(ψ/2)  ∀ −2πd/λ ≤ ψ ≤ 2πd/λ,    (13.25)

where

sincN(x) ≜ (1/N) sin(Nx) / sin(x).    (13.26)

From the final equality in (13.25), which is plotted against both linear and decibel axes in Figure 13.3, it is clear that Bψ(ψ) is periodic with period 2π for odd N. Moreover, Bψ(ψ) assumes its maximum values when both numerator and denominator of (13.26) are zero, in which case it can be shown to assume a value of unity through the application of L’Hospital’s rule.
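The closed form (13.25) can be checked numerically against the direct sum (13.24). The sketch below evaluates both for a few values of ψ, with N = 20 as in Figure 13.3, and applies the L'Hospital limit at the main lobe.

```python
import cmath
import math

def b_psi_sum(psi, N):
    """Direct evaluation of the uniformly weighted pattern (13.24)."""
    prefactor = cmath.exp(-1j * (N - 1) / 2 * psi)
    return prefactor / N * sum(cmath.exp(1j * n * psi) for n in range(N))

def sinc_n(x, N):
    """sinc_N(x) = sin(N x) / (N sin x), cf. (13.26)."""
    if abs(math.sin(x)) < 1e-15:
        return 1.0          # L'Hospital limit at the main (or grating) lobe
    return math.sin(N * x) / (N * math.sin(x))

N = 20
errs = [abs(b_psi_sum(psi, N) - sinc_n(psi / 2, N))
        for psi in (0.0, 0.3, 1.0, 2.5)]
```

The two expressions agree to floating-point precision, confirming the geometric-series manipulation leading to (13.25).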
Figure 13.3 Comparison between a beam pattern on a linear and logarithmic scale, ψ = (2πd/λ) cos φ, N = 20
Substituting the relevant equality from (13.21), the beam pattern can be expressed in φ-space as

Bφ(φ) = sincN( (πd/λ) cos φ )  ∀ 0 ≤ φ ≤ π.    (13.27)

In u-space this becomes

Bu(u) = sincN( (πd/λ) u )  ∀ −1 ≤ u ≤ 1.    (13.28)
A comparison of the beam pattern in different spaces is provided in Figure 13.4. Note that in each of (13.25), (13.27) and (13.28), we have indicated the allowable range of the argument of the beam pattern. As shown in Figure 13.4, this range is known as the visible region, because this is the region in which waves may actually propagate. It is often useful, however, to assume that ψ, φ, and u can vary over the entire real axis. In this case, every point outside of the visible region is said to lie in the virtual region. Clearly, the beam patterns as plotted in the kx-, ψ- and ux-spaces are just scaled replicas, just as we would expect given the linear relationships between these variables manifest in (13.21). The beam pattern plotted in φ-space, on the other hand, has a noticeably narrower main lobe and significantly longer sidelobes due to the term cos φ appearing in (13.21).
The portion of the visible region where the array provides maximal sensitivity is known as the main lobe. A grating lobe is a sidelobe with the same height as the main lobe. As mentioned previously, such lobes appear when the numerator and denominator of (13.26) are both zero, which for sincN(ψ/2) occurs at intervals of

ψ = 2πm,

for odd N. In direction cosine or u-space, the beam pattern (13.23) is specified by Bu(u) = sincN(πdu/λ) and the grating lobes appear at intervals of

u = (λ/d) m  ∀ m = 1, 2, . . . .    (13.29)
Figure 13.4 Beam pattern plots in kx-, ψ-, u- and φ-spaces for a linear array with d = λ/2 and N = 20
The grating lobes are harmless as long as they remain in the virtual region. If the spacing between the sensors of the array is chosen to be too large, however, the grating lobes can move into the visible region. The effect is illustrated in Figure 13.5. The quantity that determines whether a grating lobe enters the visible region is the ratio d/λ. For a uniformly-weighted, uniform linear array, we must require d/λ < 1 in order to ensure that no grating lobe enters the visible region. We will shortly find, however, that steering can cause grating lobes to move into the visible region even when this condition is satisfied.
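The grating-lobe condition (13.29) can be checked numerically with the sincN pattern of (13.28). Taking N = 10 as an assumed value, the pattern returns to unit magnitude at u = λ/d: with d = λ the first grating lobe sits at u = 1, on the edge of the visible region, while d = λ/2 pushes it out to u = 2, safely in the virtual region.

```python
import math

def b_u(u, d_over_lam, N):
    """Uniformly weighted ULA pattern B_u(u) = sinc_N(pi * (d/lam) * u)."""
    x = math.pi * d_over_lam * u
    if abs(math.sin(x)) < 1e-15:
        return 1.0              # main lobe or grating lobe
    return math.sin(N * x) / (N * math.sin(x))

N = 10

# First grating lobe predicted at u = lam/d by (13.29)
peak_half = abs(b_u(2.0, 0.5, N))   # d = lam/2: grating lobe at u = 2 (virtual)
peak_full = abs(b_u(1.0, 1.0, N))   # d = lam: grating lobe at u = 1 (visible edge)

# Away from main and grating lobes the response is attenuated
side = abs(b_u(0.5, 0.5, N))
```

Both predicted grating-lobe locations give unit response, while intermediate directions are suppressed, matching the discussion of the d/λ condition above.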
13.1.4 Beam Steering

Steering of the beam pattern is typically accomplished at the digital rather than the physical level so that the array “listens” to a source emanating from a known or estimated position. For a plane wave, recall that the sensor inputs are given by (13.19). We would like the output to be time-aligned to the “target” wavenumber k = kT, which is known as the
Figure 13.5 Effect of element spacing on beam patterns in linear and polar coordinates for N = 10, shown in u-space for d = λ/4, d = λ/2, and d = λ
`main response axis or look direction. As noted before, steering can be accomplished with
time delays, or phase shifts. We will, however, universally prefer the latter based on our use of filter banks to carve up the sensor outputs into narrowband signals. The steered sensor inputs can then be expressed as

    f_s(t, m) = e^{jωt} v_k(k − k_T),

and the steered frequency wavenumber response as

    Υ(ω, k | k_T) = Υ(ω, k − k_T).

Hence, in wavenumber space, steering is equivalent to a simple shift, which is the principal advantage of plotting beam patterns in this space.

When the DSB is steered to k = k_T, the sensor weights become

    w = (1/N) v_k(k_T).    (13.30)

The delay-and-sum beam pattern, which by definition is

    B_dsb(k : k_T) ≜ (1/N) v_k^H(k_T) v_k(k) |_{k = a(θ,φ)},    (13.31)
is that beam pattern obtained when a DSB is steered to wavenumber k_T and evaluated at wavenumber k = a(θ, φ). For a linear array, the delay-and-sum beam pattern can be expressed as

    B_dsb(ψ : ψ_T) = (1/N) v_ψ^H(ψ_T) v_ψ(ψ) = (1/N) · sin(N(ψ − ψ_T)/2) / sin((ψ − ψ_T)/2),

or alternatively in u-space as

    B_dsb(u : u_T) = (1/N) v_u^H(u_T) v_u(u) = (1/N) · sin(πNd(u − u_T)/λ) / sin(πd(u − u_T)/λ).
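The shift property of the u-space pattern can be illustrated numerically. In this sketch (NumPy assumed; `dsb_beam_pattern` is a hypothetical helper name), steering to endfire (u_T = 1) with d = λ/2 moves a grating lobe from u = 2 down to u = −1, the edge of the visible region, even though d/λ < 1.

```python
import numpy as np

def dsb_beam_pattern(u, u_T, N, d_over_lambda):
    """Delay-and-sum beam pattern in u-space:
    B_dsb(u : u_T) = sin(pi*N*d*(u - u_T)/lambda) / (N*sin(pi*d*(u - u_T)/lambda)).
    Steering to u_T simply shifts the unsteered pattern."""
    x = np.pi * d_over_lambda * (np.asarray(u, dtype=float) - u_T)
    num = np.sin(N * x)
    den = N * np.sin(x)
    singular = np.abs(den) < 1e-12   # main lobe and grating lobes
    den_safe = np.where(singular, 1.0, den)
    return np.where(singular, np.cos(N * x) / np.cos(x), num / den_safe)

N, d_over_lambda = 10, 0.5
# Unsteered, the grating lobes sit at u = +/-2, safely in the virtual region.
# Steered to endfire (u_T = 1), a grating lobe lands at u = 1 - 2 = -1,
# the edge of the visible region.
for u in (-1.0, 0.0, 1.0):
    print(u, float(dsb_beam_pattern(u, 1.0, N, d_over_lambda)))
```

The unit-magnitude response at u = −1 shows why the unsteered condition d/λ < 1 is not sufficient once the array is steered.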
The broadside angle φ̄ = φ − π/2 is, by definition, measured with respect to the y-axis and has the same sense as φ. The effect of array steering with respect to φ is illustrated in Figure 13.6. Based on the fact that steering corresponds to a simple shift in u-space, we can readily develop a requirement for exclud
