(19) World Intellectual Property Organization
International Bureau

(10) International Publication Number: WO 2017/105998 A1

(43) International Publication Date: 22 June 2017 (22.06.2017)

(51) International Patent Classification:
G10L 21/0208 (2013.01), G10L 21/0216 (2013.01)

(21) International Application Number: PCT/US2016/065563

(22) International Filing Date: 8 December 2016 (08.12.2016)

(25) Filing Language: English

(26) Publication Language: English

(30) Priority Data: 14/973,274, 17 December 2015 (17.12.2015), US

(71) Applicant: AMAZON TECHNOLOGIES, INC. [US/US]; PO Box 81226, Seattle, Washington 98108-1226 (US).

(72) Inventors: AYRAPETIAN, Robert; 410 Terry Avenue North, Seattle, Washington 98109-5210 (US). HILMES, Philip Ryan; 410 Terry Avenue North, Seattle, Washington 98109-5210 (US).

(74) Agent: BARZILAY, Dan; 2 Seaport Lane, Suite 300, Boston, Massachusetts 02210-2028 (US).

(81) Designated States (unless otherwise indicated, for every kind of national protection available): AE, AG, AL, AM, AO, AT, AU, AZ, BA, BB, BG, BH, BN, BR, BW, BY, BZ, CA, CH, CL, CN, CO, CR, CU, CZ, DE, DJ, DK, DM, DO, DZ, EC, EE, EG, ES, FI, GB, GD, GE, GH, GM, GT, HN, HR, HU, ID, IL, IN, IR, IS, JP, KE, KG, KH, KN, KP, KR, KW, KZ, LA, LC, LK, LR, LS, LU, LY, MA, MD, ME, MG, MK, MN, MW, MX, MY, MZ, NA, NG, NI, NO, NZ, OM, PA, PE, PG, PH, PL, PT, QA, RO, RS, RU, RW, SA, SC, SD, SE, SG, SK, SL, SM, ST, SV, SY, TH, TJ, TM, TN, TR, TT, TZ, UA, UG, US, UZ, VC, VN, ZA, ZM, ZW.

(84) Designated States (unless otherwise indicated, for every kind of regional protection available): ARIPO (BW, GH, GM, KE, LR, LS, MW, MZ, NA, RW, SD, SL, ST, SZ, TZ, UG, ZM, ZW), Eurasian (AM, AZ, BY, KG, KZ, RU, TJ, TM), European (AL, AT, BE, BG, CH, CY, CZ, DE, DK, EE, ES, FI, FR, GB, GR, HR, HU, IE, IS, IT, LT, LU, LV, MC, MK, MT, NL, NO, PL, PT, RO, RS, SE, SI, SK, SM, TR), OAPI (BF, BJ, CF, CG, CI, CM, GA, GN, GQ, GW, KM, ML, MR, NE, SN, TD, TG).

Published:
- with international search report (Art. 21(3))

(54) Title: ADAPTIVE BEAMFORMING TO CREATE REFERENCE CHANNELS
[FIG. 1: block diagram of the system 100 showing the device 102, wireless RF links 113 to loudspeakers, a microphone array, the adaptive beamformer 104 with fixed beamformer (FBF) 105 and multiple input canceler (MC) 106, target signal and reference signal paths into acoustic echo cancellation (AEC) 108, the audio output, and flowchart steps: receive (130) audio input, perform (132) audio beamforming, output audio data.]
(57) Abstract: An echo cancellation system that performs audio beamforming to separate audio input into multiple directions and determines a target signal and a reference signal from the multiple directions. For example, the system may detect a strong signal associated with a speaker and select the strong signal as a reference signal, selecting another direction as a target signal. The system may determine a speech position and may select the speech position as a target signal and an opposite direction as a reference signal. The system may create pairwise combinations of opposite directions, with an individual direction being selected as a target signal and a reference signal. The system may select a fixed beamformer output for the target signal and an adaptive beamformer output for the reference signal, or vice versa. The system may remove the reference signal (e.g., audio output by the loudspeaker) to isolate speech included in the target signal.
ADAPTIVE BEAMFORMING TO CREATE REFERENCE CHANNELS

CROSS-REFERENCE TO RELATED APPLICATION DATA

This application claims priority to U.S. Patent Application No. 14/973,274, filed on December 17, 2015, which is incorporated herein by reference in its entirety.

BACKGROUND

In audio systems, automatic echo cancellation (AEC) refers to techniques that are used to recognize when a system has recaptured sound via a microphone after some delay that the system previously output via a speaker. Systems that provide AEC subtract a delayed version of the original audio signal from the captured audio, producing a version of the captured audio that ideally eliminates the “echo” of the original audio signal, leaving only new audio information. For example, if someone were singing karaoke into a microphone while prerecorded music is output by a loudspeaker, AEC can be used to remove any of the recorded music from the audio captured by the microphone, allowing the singer’s voice to be amplified and output without also reproducing a delayed “echo” of the original music. As another example, a media player that accepts voice commands via a microphone can use AEC to remove reproduced sounds corresponding to output media that are captured by the microphone, making it easier to process input voice commands.
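As a rough sketch of the subtraction described above (not the disclosure's implementation; the filter length and step size are illustrative, and a normalized least-mean-squares filter stands in for a simple fixed delay so the estimate can absorb both the delay and gain of the echo path):

    import numpy as np

    def cancel_echo(mic, ref, filt_len=256, mu=0.1, eps=1e-8):
        # Estimate the echo path from the reference (sent) signal to the
        # microphone with an NLMS adaptive filter, then subtract the
        # predicted echo, leaving only the new audio information.
        w = np.zeros(filt_len)
        out = np.zeros_like(mic)
        padded = np.concatenate([np.zeros(filt_len - 1), ref])
        for n in range(len(mic)):
            x = padded[n:n + filt_len][::-1]   # latest reference samples
            e = mic[n] - w @ x                 # mic minus predicted echo
            w += mu * e * x / (x @ x + eps)    # adapt toward the echo path
            out[n] = e
        return out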
BRIEF DESCRIPTION OF DRAWINGS

For a more complete understanding of the present disclosure, reference is now made to the following description taken in conjunction with the accompanying drawings.

FIG. 1 illustrates an echo cancellation system that performs adaptive beamforming according to embodiments of the present disclosure.

FIG. 2 is an illustration of beamforming according to embodiments of the present disclosure.

FIGS. 3A-3B illustrate examples of beamforming configurations according to embodiments of the present disclosure.

FIG. 4 illustrates an example of different techniques of adaptive beamforming according to embodiments of the present disclosure.

FIGS. 5A-5B illustrate examples of a first signal mapping using a first technique
according to embodiments of the present disclosure.

FIGS. 6A-6C illustrate examples of signal mappings using the first technique according to embodiments of the present disclosure.

FIGS. 7A-7C illustrate examples of signal mappings using a second technique according to embodiments of the present disclosure.

FIGS. 8A-8B illustrate examples of signal mappings using a third technique according to embodiments of the present disclosure.

FIG. 9 is a flowchart conceptually illustrating an example method for determining a signal mapping according to embodiments of the present disclosure.

FIGS. 10A-10B illustrate an example of a signal mapping using a fourth technique according to embodiments of the present disclosure.

FIG. 11 is a flowchart conceptually illustrating an example method for determining a signal mapping according to embodiments of the present disclosure.

FIG. 12 is a block diagram conceptually illustrating example components of a system for echo cancellation according to embodiments of the present disclosure.

DETAILED DESCRIPTION
Typically, a conventional Acoustic Echo Cancellation (AEC) system may remove audio output by a loudspeaker from audio captured by the system’s microphone(s) by subtracting a delayed version of the originally transmitted audio. However, in stereo and multi-channel audio systems that include wireless or network-connected loudspeakers and/or microphones, a major cause of problems is when there are differences between the signal sent to a loudspeaker and a signal played at the loudspeaker. As the signal sent to the loudspeaker is not the same as the signal played at the loudspeaker, the signal sent to the loudspeaker is not a true reference signal for the AEC system. For example, when the AEC system attempts to remove the audio output by the loudspeaker from audio captured by the system’s microphone(s) by subtracting a delayed version of the originally transmitted audio, the audio captured by the microphone is subtly different than the audio that had been sent to the loudspeaker.

There may be a difference between the signal sent to the loudspeaker and the signal played at the loudspeaker for one or more reasons. A first cause is a difference in clock synchronization (e.g., clock offset) between loudspeakers and microphones. For
example, in a wireless “surround sound” 5.1 system comprising six wireless loudspeakers that each receive an audio signal from a surround-sound receiver, the receiver and each loudspeaker has its own crystal oscillator which provides the respective component with an independent “clock” signal. Among other things that the clock signals are used for is converting analog audio signals into digital audio signals (“A/D conversion”) and converting digital audio signals into analog audio signals (“D/A conversion”). Such conversions are commonplace in audio systems, such as when a surround-sound receiver performs A/D conversion prior to transmitting audio to a wireless loudspeaker, and when the loudspeaker performs D/A conversion on the received signal to recreate an analog signal. The loudspeaker produces audible sound by driving a “voice coil” with an amplified version of the analog signal.
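To make the clock-offset problem concrete, a back-of-the-envelope sketch; the 48 kHz rate and 50 ppm offset below are illustrative assumptions, not values from the disclosure:

    # Two nominally identical clocks that differ by 50 parts per million.
    fs_receiver = 48_000.0                  # receiver A/D clock (Hz)
    fs_speaker = 48_000.0 * (1 + 50e-6)     # loudspeaker D/A clock (Hz)

    drift = fs_speaker - fs_receiver        # 2.4 samples of slip per second
    print(60 * drift / fs_receiver * 1e3)   # ~3 ms of misalignment per minute

Even a few milliseconds of accumulated drift is enough that a fixed delayed copy of the sent signal no longer lines up with the echo it is meant to cancel.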
A second cause is that the signal sent to the loudspeaker may be modified based on compression/decompression during wireless communication, resulting in a different signal being received by the loudspeaker than was sent to the loudspeaker. A third cause is non-linear post-processing performed on the received signal by the loudspeaker prior to playing the received signal. A fourth cause is buffering performed by the loudspeaker, which could create unknown latency, additional samples, fewer samples or the like that subtly change the signal played by the loudspeaker.
To perform Acoustic Echo Cancellation (AEC) without knowing the signal played by the loudspeaker, devices, systems and methods may perform audio beamforming on a signal received by the microphones and may determine a reference signal and a target signal based on the audio beamforming. For example, the system may receive audio input and separate the audio input into multiple directions. The system may detect a strong signal associated with a speaker and may set the strong signal as a reference signal, selecting another direction as a target signal. In some examples, the system may determine a speech position (e.g., near end talk position) and may set the direction associated with the speech position as a target signal and an opposite direction as a reference signal. If the system cannot detect a strong signal or determine a speech position, the system may create pairwise combinations of opposite directions, with an individual direction being used as a target signal and a reference signal. The system may remove the reference signal (e.g., audio output by the loudspeaker) to isolate speech included in the target signal.
FIG. 1 illustrates a high-level conceptual block diagram of echo-cancellation
aspects of an AEC system 100. As illustrated, an audio input 110 provides stereo audio “reference” signals x1(n) 112a and x2(n) 112b. The reference signal x1(n) 112a is transmitted via a radio frequency (RF) link 113 to a wireless loudspeaker 114a, and the reference signal x2(n) 112b is transmitted via an RF link 113 to a wireless loudspeaker 114b. Each speaker outputs the received audio, and portions of the output sounds are captured by a pair of microphones 118a and 118b as “echo” signals y1(n) 120a and y2(n) 120b, which contain some of the reproduced sounds from the reference signals x1(n) 112a and x2(n) 112b, in addition to any additional sounds (e.g., speech) picked up by the microphones 118.
To isolate the additional sounds from the reproduced sounds, the device 102 may include an adaptive beamformer 104 that may perform audio beamforming on the echo signals 120 to determine a target signal 122 and a reference signal 124. For example, the adaptive beamformer 104 may include a fixed beamformer (FBF) 105, a multiple input canceler (MC) 106 and/or a blocking matrix (BM) 107. The FBF 105 may be configured to form a beam in a specific direction so that a target signal is passed and all other signals are attenuated, enabling the adaptive beamformer 104 to select a particular direction. In contrast, the BM 107 may be configured to form a null in a specific direction so that the target signal is attenuated and all other signals are passed. The adaptive beamformer 104 may generate fixed beamforms (e.g., outputs of the FBF 105) or may generate adaptive beamforms using a Linearly Constrained Minimum Variance (LCMV) beamformer, a Minimum Variance Distortionless Response (MVDR) beamformer or other beamforming techniques. For example, the adaptive beamformer 104 may receive audio input, determine six beamforming directions and output six fixed beamform outputs and six adaptive beamform outputs. In some examples, the adaptive beamformer 104 may generate six fixed beamform outputs, six LCMV beamform outputs and six MVDR beamform outputs, although the disclosure is not limited thereto. Using the adaptive beamformer 104 and techniques discussed below, the device 102 may determine the target signal 122 and the reference signal 124 to pass to an acoustic echo cancellation (AEC) 108. The AEC 108 may remove the reference signal (e.g., reproduced sounds) from the target signal (e.g., reproduced sounds and additional sounds) to remove the reproduced sounds and isolate the additional sounds (e.g., speech) as audio output 126.
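As a toy illustration of the complementary FBF/BM roles, a two-microphone sketch under an integer-sample steering delay; this is my simplification, not the actual design of the FBF 105 or BM 107:

    import numpy as np

    def fbf_and_bm(mic1, mic2, steer_delay):
        # Align mic2 with mic1 for a wave arriving from the look direction
        # (np.roll is a wrap-around simplification of a true delay line).
        aligned = np.roll(mic2, -steer_delay)
        fbf_out = 0.5 * (mic1 + aligned)  # beam: target passed, rest attenuated
        bm_out = 0.5 * (mic1 - aligned)   # null: target attenuated, rest passed
        return fbf_out, bm_out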
To illustrate, in some examples the device 102 may use outputs of the FBF 105 as the target signal 122. For example, the outputs of the FBF 105 may be shown in equation (1):
    Target = s + z + noise        (1)

where s is speech (e.g., the additional sounds), z is an echo from the signal sent to the loudspeaker (e.g., the reproduced sounds) and noise is additional noise that is not associated with the speech or the echo. In order to attenuate the echo (z), the device 102 may use outputs of the BM 107 as the reference signal 124, which may be shown in equation (2):

    Reference = z + noise        (2)
By removing the reference signal 124 from the target signal 122, the device 102 may remove the echo and generate the audio output 126 including only the speech and some noise. The device 102 may use the audio output 126 to perform speech recognition processing on the speech to determine a command and may execute the command. For example, the device 102 may determine that the speech corresponds to a command to play music and the device 102 may play music in response to receiving the speech.
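A toy numeric illustration of equations (1) and (2); the sinusoidal stand-ins for speech and echo are assumptions, and the single subtraction below ignores the adaptation a real AEC 108 would perform:

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.arange(16_000) / 16_000
    s = np.sin(2 * np.pi * 220 * t)       # stand-in for speech
    z = np.sin(2 * np.pi * 1_000 * t)     # stand-in for the loudspeaker echo

    target = s + z + 0.01 * rng.standard_normal(t.size)    # equation (1)
    reference = z + 0.01 * rng.standard_normal(t.size)     # equation (2)
    audio_output = target - reference     # echo cancels; speech + some noise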
In some examples, the device 102 may associate specific directions with the reproduced sounds and/or speech based on features of the signal sent to the loudspeaker. Examples of features include power spectrum density, peak levels, pause intervals or the like that may be used to identify the signal sent to the loudspeaker and/or propagation delay between different signals. For example, the adaptive beamformer 104 may compare the signal sent to the loudspeaker with a signal associated with a first direction to determine if the signal associated with the first direction includes reproduced sounds from the loudspeaker. When the signal associated with the first direction matches the signal sent to the loudspeaker, the device 102 may associate the first direction with a wireless speaker. When the signal associated with the first direction does not match the signal sent to the loudspeaker, the device 102 may associate the first direction with speech, a speech position, a person or the like.
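One plausible way to score such a match is a normalized cross-correlation that tolerates an unknown propagation delay; the lag window and threshold below are my illustrative choices (the disclosure itself names features such as power spectrum density, peak levels and pause intervals):

    import numpy as np

    def matches_playback(sent, beam, max_lag=4_000, threshold=0.6):
        # Normalize both signals so the correlation peak is comparable
        # across directions, then search a window of possible delays.
        sent = (sent - sent.mean()) / (sent.std() + 1e-12)
        beam = (beam - beam.mean()) / (beam.std() + 1e-12)
        corr = np.correlate(beam, sent, mode="full") / len(sent)
        mid = len(corr) // 2   # zero-lag index for equal-length inputs
        peak = np.abs(corr[mid - max_lag:mid + max_lag + 1]).max()
        return peak >= threshold  # True -> direction likely holds a speaker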
As illustrated in FIG. 1, the device 102 may receive (130) an audio input and may perform (132) audio beamforming. For example, the device 102 may receive the audio input from the microphones 118 and may perform audio beamforming to separate the audio input into separate directions. The device 102 may determine (134) a speech position (e.g., near
end talk position) associated with speech and/or a person speaking. For example, the device 102 may identify the speech, a person and/or a position associated with the speech/person using audio data (e.g., audio beamforming when speech is recognized), video data (e.g., facial recognition) and/or other inputs known to one of skill in the art. The device 102 may determine (136) a target signal and may determine (138) a reference signal based on the speech position and the audio beamforming. For example, the device 102 may associate the speech position with the target signal and may select an opposite direction as the reference signal.
The device 102 may determine the target signal and the reference signal using multiple techniques, which are discussed in greater detail below. For example, the device 102 may use a first technique when the device 102 detects a clearly defined speaker signal, a second technique when the device 102 doesn’t detect a clearly defined speaker signal but does identify a speech position and/or a third technique when the device 102 doesn’t detect a clearly defined speaker signal or a speech position. Using the first technique, the device 102 may associate the clearly defined speaker signal with the reference signal and may select any or all of the other directions as the target signal. For example, the device 102 may generate a single target signal using all of the remaining directions for a single loudspeaker or may generate multiple target signals using portions of remaining directions for multiple loudspeakers. Using the second technique, the device 102 may associate the speech position with the target signal and may select an opposite direction as the reference signal. Using the third technique, the device 102 may select multiple combinations of opposing directions to generate multiple target signals and multiple reference signals.
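A compact sketch of this decision logic; the function and variable names are mine, and directions are indexed so that opposite directions sit half the count apart:

    def map_signals(num_dirs, speaker_dir=None, speech_dir=None):
        # Returns a list of (target_direction, reference_direction) pairs.
        opposite = lambda d: (d + num_dirs // 2) % num_dirs
        if speaker_dir is not None:    # first technique: clear speaker signal
            return [(d, speaker_dir)
                    for d in range(num_dirs) if d != speaker_dir]
        if speech_dir is not None:     # second technique: known speech position
            return [(speech_dir, opposite(speech_dir))]
        # third technique: pairwise combinations of opposite directions
        return [(d, opposite(d)) for d in range(num_dirs)]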
The device 102 may remove (140) an echo from the target signal by removing the reference signal to isolate speech or additional sounds and may output (142) audio data including the speech or additional sounds. For example, the device 102 may remove music (e.g., reproduced sounds) played over the loudspeakers 114 to isolate a voice command input to the microphones 118.
The device 102 may include a microphone array having multiple microphones 118 that are laterally spaced from each other so that they can be used by audio beamforming components to produce directional audio signals. The microphones 118 may, in some instances, be dispersed around a perimeter of the device 102 in order to apply beampatterns to audio signals based on sound captured by the microphone(s) 118. For example, the
microphones 118 may be positioned at spaced intervals along a perimeter of the device 102, although the present disclosure is not limited thereto. In some examples, the microphone(s) 118 may be spaced on a substantially vertical surface of the device 102 and/or a top surface of the device 102. Each of the microphones 118 is omnidirectional, and beamforming technology is used to produce directional audio signals based on signals from the microphones 118. In other embodiments, the microphones may have directional audio reception, which may remove the need for subsequent beamforming.
In various embodiments, the microphone array may include greater or less than the number of microphones 118 shown. Speaker(s) (not illustrated) may be located at the bottom of the device 102, and may be configured to emit sound omnidirectionally, in a 360 degree pattern around the device 102. For example, the speaker(s) may comprise a round speaker element directed downwardly in the lower part of the device 102.
Using the plurality of microphones 118 the device 102 may employ beamforming techniques to isolate desired sounds for purposes of converting those sounds into audio signals for speech processing by the system. Beamforming is the process of applying a set of beamformer coefficients to audio signal data to create beampatterns, or effective directions of gain or attenuation. In some implementations, these volumes may be considered to result from constructive and destructive interference between signals from individual microphones in a microphone array.
The device 102 may include an adaptive beamformer 104 that may include one or more audio beamformers or beamforming components that are configured to generate an audio signal that is focused in a direction from which user speech has been detected. More specifically, the beamforming components may be responsive to spatially separated microphone elements of the microphone array to produce directional audio signals that emphasize sounds originating from different directions relative to the device 102, and to select and output one of the audio signals that is most likely to contain user speech.
Audio beamforming, also referred to as audio array processing, uses a microphone array having multiple microphones that are spaced from each other at known distances. Sound originating from a source is received by each of the microphones. However, because each microphone is potentially at a different distance from the sound source, a propagating sound wave arrives at each of the microphones at slightly different times. This difference in arrival time results in phase differences between audio signals produced by the microphones.
The phase differences can be exploited to enhance sounds originating from chosen directions relative to the microphone array.
Beamforming uses signal processing techniques to combine signals from the different microphones so that sound signals originating from a particular direction are emphasized while sound signals from other directions are deemphasized. More specifically, signals from the different microphones are combined in such a way that signals from a particular direction experience constructive interference, while signals from other directions experience destructive interference. The parameters used in beamforming may be varied to dynamically select different directions, even when using a fixed-configuration microphone array.
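A minimal delay-and-sum sketch of this idea for a linear microphone array; the geometry, sample rate and speed of sound are illustrative assumptions, not parameters from the disclosure:

    import numpy as np

    def delay_and_sum(mics, mic_x, angle_deg, fs=16_000, c=343.0):
        # mics:  (n_mics, n_samples) microphone signals
        # mic_x: microphone positions along the array axis, in meters
        delays = mic_x * np.cos(np.deg2rad(angle_deg)) / c  # per-channel delay
        n = mics.shape[1]
        freqs = np.fft.rfftfreq(n, d=1 / fs)
        out = np.zeros(n)
        for sig, tau in zip(mics, delays):
            # Apply a fractional-sample delay as a phase shift so signals
            # arriving from the chosen direction add constructively.
            shifted = np.fft.rfft(sig) * np.exp(-2j * np.pi * freqs * tau)
            out += np.fft.irfft(shifted, n)
        return out / len(mics)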
A given beampattern may be used to selectively gather signals from a particular spatial location where a signal source is present. The selected beampattern may be configured to provide gain or attenuation for the signal source. For example, the beampattern may be focused on a particular user’s head allowing for the recovery of the user’s speech while attenuating noise from an operating air conditioner that is across the room and in a different direction than the user relative to a device that captures the audio signals.

Such spatial selectivity by using beamforming allows for the rejection or attenuation of undesired signals outside of the beampattern. The increased selectivity of the beampattern improves signal-to-noise ratio for the audio signal. By improving the signal-to-noise ratio, the accuracy of speaker recognition performed on the audio signal is improved.

The processed data from the beamformer module may then undergo additional filtering or be used directly by other modules. For example, a filter may be applied to processed data which is acquiring speech from a user to remove residual audio noise from a machine running in the environment.
FIG. 2 is an illustration of beamforming according to embodiments of the present disclosure. FIG. 2 illustrates a schematic of a beampattern 202 formed by applying beamforming coefficients to signal data acquired from a microphone array of the device 102. As mentioned above, the beampattern 202 results from the application of a set of beamformer coefficients to the signal data. The beampattern generates directions of effective gain or attenuation. In this illustration, the dashed line indicates isometric lines of gain provided by the beamforming coefficients. For example, the gain at the dashed line here may be +12 decibels (dB) relative to an isotropic microphone.

The beampattern 202 may exhibit a plurality of lobes, or regions of gain, with gain
predominating in a particular direction designated the beampattern direction 204. A main lobe 206 is shown here extending along the beampattern direction 204. A main lobe beamwidth 208 is shown, indicating a maximum width of the main lobe 206. In this example, the beampattern 202 also includes side lobes 210, 212, 214, and 216. Opposite the main lobe 206 along the beampattern direction 204 is the back lobe 218. Disposed around the beampattern 202 are null regions 220. These null regions are areas of attenuation to signals. In the example, the person 10 resides within the main lobe 206 and benefits from the gain provided by the beampattern 202 and exhibits an improved SNR compared to a signal acquired with non-beamforming. In contrast, if the person 10 were to speak from a null region, the resulting audio signal may be significantly reduced. As shown in this illustration, the use of the beampattern provides for gain in signal acquisition compared to non-beamforming. Beamforming also allows for spatial selectivity, effectively allowing the system to “turn a deaf ear” on a signal which is not of interest. Beamforming may result in directional audio signal(s) that may then be processed by other components of the device 102 and/or system 100.
While beamforming alone may increase a signal-to-noise ratio (SNR) of an audio signal, combining known acoustic characteristics of an environment (e.g., a room impulse response (RIR)) and heuristic knowledge of previous beampattern lobe selection may provide an even better indication of a speaking user’s likely location within the environment. In some instances, a device includes multiple microphones that capture audio signals that include user speech. As is known and as used herein, “capturing” an audio signal includes a microphone transducing audio waves of captured sound to an electrical signal and a codec digitizing the signal. The device may also include functionality for applying different beampatterns to the captured audio signals, with each beampattern having multiple lobes. By identifying lobes most likely to contain user speech using the combination discussed above, the techniques enable devotion of additional processing resources to the portion of an audio signal most likely to contain user speech to provide better echo canceling and thus a cleaner SNR in the resulting processed audio signal.
To determine a value of an acoustic characteristic of an environment (e.g., an RIR of the environment), the device 102 may emit sounds at known frequencies (e.g., chirps, text-to-speech audio, music or spoken word content playback, etc.) to measure a reverberant signature of the environment to generate an RIR of the environment. Measured over time in
an ongoing fashion, the device may be able to generate a consistent picture of the RIR and the reverberant qualities of the environment, thus better enabling the device to determine or approximate where it is located in relation to walls or corners of the environment (assuming the device is stationary). Further, if the device is moved, the device may be able to determine this change by noticing a change in the RIR pattern. In conjunction with this information, by tracking which lobe of a beampattern the device most often selects as having the strongest spoken signal path over time, the device may begin to notice patterns in which lobes are selected. If a certain set of lobes (or microphones) is selected, the device can heuristically determine the user’s typical speaking location in the environment. The device may devote more CPU resources to digital signal processing (DSP) techniques for that lobe or set of lobes. For example, the device may run acoustic echo cancelation (AEC) at full strength across the three most commonly targeted lobes, instead of picking a single lobe to run AEC at full strength. The techniques may thus improve subsequent automatic speech recognition (ASR) and/or speaker recognition results as long as the device is not rotated or moved. And, if the device is moved, the techniques may help the device to determine this change by comparing current RIR results to historical ones to recognize differences that are significant enough to cause the device to begin processing the signal coming from all lobes approximately equally, rather than focusing only on the most commonly targeted lobes.
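A rough sketch of estimating an RIR from a known emitted sound by frequency-domain deconvolution; the chirp parameters and regularization term are illustrative assumptions, not necessarily how the device 102 measures its RIR:

    import numpy as np

    fs = 16_000
    t = np.arange(0, 2.0, 1 / fs)
    # Known test signal: a 100 Hz -> 8 kHz linear chirp played by the device.
    played = np.sin(2 * np.pi * (100 * t + (8_000 - 100) / (2 * 2.0) * t**2))

    def estimate_rir(recorded, played, eps=1e-6):
        # Deconvolve the microphone recording against the known played
        # signal to recover the room's impulse response.
        n = len(recorded) + len(played) - 1
        H = np.fft.rfft(recorded, n) / (np.fft.rfft(played, n) + eps)
        return np.fft.irfft(H, n)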
By focusing processing resources on a portion of an audio signal most likely to include user speech, the SNR of that portion may be increased as compared to the SNR if processing resources were spread out equally to the entire audio signal. This higher SNR for the most pertinent portion of the audio signal may increase the efficacy of the device 102 when performing speaker recognition on the resulting audio signal.
Using the beamforming and directional based techniques above, the system may determine a direction of detected audio relative to the audio capture components. Such direction information may be used to link speech / a recognized speaker identity to video data as described below.
FIGS. 3A-3B illustrate examples of beamforming configurations according to embodiments of the present disclosure. As illustrated in FIG. 3A, the device 102 may perform beamforming to determine a plurality of portions or sections of audio received from a microphone array. FIG. 3A illustrates a beamforming configuration 310 including six portions or sections (e.g., Sections 1-6). For example, the device 102 may include six
different microphones, may divide an area around the device 102 into six sections or the like. However, the present disclosure is not limited thereto and the number of microphones in the microphone array and/or the number of portions/sections in the beamforming may vary. As illustrated in FIG. 3B, the device 102 may generate a beamforming configuration 312 including eight portions/sections (e.g., Sections 1-8) without departing from the disclosure. For example, the device 102 may include eight different microphones, may divide the area around the device 102 into eight portions/sections or the like. Thus, the following examples may perform beamforming and separate an audio signal into eight different portions/sections, but these examples are intended as illustrative examples and the disclosure is not limited thereto.
The number of portions/sections generated using beamforming does not depend on the number of microphones in the microphone array. For example, the device 102 may include twelve microphones in the microphone array but may determine three portions, six portions or twelve portions of the audio data without departing from the disclosure. As discussed above, the adaptive beamformer 104 may generate fixed beamforms (e.g., outputs of the FBF 105) or may generate adaptive beamforms using a Linearly Constrained Minimum Variance (LCMV) beamformer, a Minimum Variance Distortionless Response (MVDR) beamformer or other beamforming techniques. For example, the adaptive beamformer 104 may receive the audio input, may determine six beamforming directions and output six fixed beamform outputs and six adaptive beamform outputs corresponding to the six beamforming directions. In some examples, the adaptive beamformer 104 may generate six fixed beamform outputs, six LCMV beamform outputs and six MVDR beamform outputs, although the disclosure is not limited thereto.
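For the adaptive outputs, a narrowband MVDR sketch of the standard weight formula w = R^{-1} d / (d^H R^{-1} d); the linear-array geometry and diagonal loading term are illustrative assumptions:

    import numpy as np

    def mvdr_weights(R, d):
        # R: microphone covariance matrix for one frequency bin
        # d: steering (look-direction) vector for the same bin
        Rinv_d = np.linalg.solve(R + 1e-6 * np.eye(len(R)), d)  # loading
        return Rinv_d / (d.conj() @ Rinv_d)

    def steering_vector(mic_x, angle_deg, freq, c=343.0):
        # Plane-wave steering vector for a linear array along the x axis.
        tau = mic_x * np.cos(np.deg2rad(angle_deg)) / c
        return np.exp(-2j * np.pi * freq * tau)

Evaluating the weights for six equally spaced angles, and applying each as w.conj() @ snapshot per frequency bin, would yield one adaptive output per beamforming direction, analogous to the six MVDR outputs described above.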
The device 102 may determine a number of wireless loudspeakers and/or directions associated with the wireless loudspeakers using the fixed beamform outputs. For example, the device 102 may localize energy in the frequency domain and clearly identify much higher energy in two directions associated with two wireless loudspeakers (e.g., a first direction associated with a first speaker and a second direction associated with a second speaker). In some examples, the device 102 may determine an existence and/or location associated with the wireless loudspeakers using a frequency range (e.g., 1 kHz to 3 kHz), although the discl