Kristjansson et al.

(10) Patent No.: US 10,147,439 B1
(45) Date of Patent: Dec. 4, 2018

US010147439B1

(54) VOLUME ADJUSTMENT FOR LISTENING ENVIRONMENT

(71) Applicant: Amazon Technologies, Inc., Seattle, WA (US)

(72) Inventors: Trausti Thor Kristjansson, San Jose, CA (US); Mohamed Mansour, Cupertino, CA (US); Amit Singh Chhetri, Santa Clara, CA (US); Ludger Solbach, San Jose, CA (US)

(73) Assignee: Amazon Technologies, Inc., Seattle, WA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 15/474,197

(22) Filed: Mar. 30, 2017

(51) Int. Cl.:
    G06F 3/00 (2006.01)
    G10L 21/0364 (2013.01)
    G10L 15/22 (2006.01)
    G10L 21/0232 (2013.01)
    G10L 13/00 (2006.01)
    G10L 21/0216 (2013.01)

(52) U.S. Cl.:
    CPC ...... G10L 21/0364 (2013.01); G10L 13/00 (2013.01); G10L 15/22 (2013.01); G10L 21/0232 (2013.01); G10L 2015/223 (2013.01); G10L 2021/02166 (2013.01)

(58) Field of Classification Search
    CPC ...... G06F 3/165; G06F 3/167; G10L 15/22; G10L 15/265
    USPC ...... 704/423
    See application file for complete search history.

(56) References Cited

    U.S. PATENT DOCUMENTS

    8,862,387 B2 *      10/2014  Kandangath ...... G01C 21/3629 (701/419)
    2014/0135076 A1 *    5/2014  Lee .............. H04M 1/6041 (455/569.1)
    2017/0242650 A1 *    8/2017  Jarvis ........... H04S 7/301
    2017/0242651 A1 *    8/2017  Lang ............. G06F 3/165

    * cited by examiner

Primary Examiner - Daniel Abebe
(74) Attorney, Agent, or Firm - Pierce Atwood LLP

(57) ABSTRACT

A speech-capturing device that can modulate its output audio data volume based on environmental sound conditions at the location of a user speaking to the device. The device detects the sound pressure of a spoken utterance at the device location and determines the distance of the user from the device. The device also detects the sound pressure of noise at the device and uses information about the location of the noise source and user to determine the sound pressure of noise at the location of the talker. The device can then adjust the gain for output audio (such as a spoken response to the utterance) to ensure that the output audio is at a certain desired sound pressure when it reaches the location of the user.

18 Claims, 13 Drawing Sheets
`
`
[Representative drawing (FIG. 1): system 100 with speech-controlled device 110, server(s) 120 connected over network(s) 199, a user, input audio 11, output audio 15, and noise sources 190a and 190b. Flow steps: detect input audio including an utterance and noise (130); send audio data to server for speech processing (132); determine user location (134); determine location(s) of noise source(s) (136); determine noise level at user's location (138); calculate gain for output audio data to ensure desired volume of output audio at user's location in view of noise level at user's location (140); receive output audio data from server (142); output audio using calculated gain (144).]
`
`
`
`
[FIG. 1 (Sheet 1 of 13): the system 100 of the representative drawing above, showing speech-controlled device 110, server(s) 120, network(s) 199, the user, noise sources 190a and 190b, and process steps 130-144.]
`
`
`
[FIG. 2 (Sheet 2 of 13): device 110 with a microphone array 210 of individual microphones 202 and a speaker 220.]
`
`
`
[FIG. 3 (Sheet 3 of 13): directional audio processing, with microphones 202 of the array associated with directions 1 through 8 around the device.]
`
`
`
[FIG. 4 (Sheet 4 of 13): speech processing components. Device 110 (with wakeword detection module 420 and volume correction component 440) sends input audio data 111 to server(s) 120. The server side includes automatic speech recognition 450 with an acoustic front end (AFE) 456, speech recognition engine 458, and ASR models 452 (acoustic models 453a-453n, language models 454a-454n); natural language understanding (NLU) 460 with recognizer 463, NER 462, and IC module 464, supported by NLU storage 473, domains 474, intents 478, grammars 476, entity library 482, gazetteers 484a-484n, and domain lexicons 486; and a command processor 490.]
`
`
`
[FIG. 5 (Sheet 5 of 13): text-to-speech (TTS) components. Command processor(s) 490 feed a TTS component 514 with a TTS front end (TTSFE) 516, a speech synthesis engine (including unit selection engine 530 and parametric synthesis engine 532), TTS storage 520, and voice unit storage 572 holding voice inventories 578a-578n; output audio data 511 is sent to device 110, which plays output audio 15 to user 5.]
`
`
`
[FIG. 6 (Sheet 6 of 13): relative positions in the environment, showing the distances between device 110, user 5, noise source 1 190a, and noise source 2 190b (user-to-device distance 602, noise-source-to-device distances 604 and 606, and noise-source-to-user distances 608 and 610).]
`
`
`
[FIG. 7 (Sheet 7 of 13): single-microphone device 110, where the noise level at the user 5 can be assumed to be the same as the noise level at the device; noise sources 190a and 190b and the user-to-device distance 602 are shown.]
`
`
`
[FIG. 8 (Sheet 8 of 13): relative positions of device 110, user 5, noise source 1 190a, and noise source 2 190b used to determine the noise experienced by the user.]
`
`
`
[FIG. 9 (Sheet 9 of 13): estimating the noise experienced by user 5 at the user's location from device 110, noise source 1 190a, and noise source 2 190b.]
`
`
`
[FIG. 10 (Sheet 10 of 13): user profile storage 1002, including a table 1004 that associates each device ID and IP address with a device name (e.g., Echo, Dot, TV) and a device location (e.g., kitchen, upstairs office, living room), accessed over network(s) 199.]
`
`
`
[FIG. 11 (Sheet 11 of 13): block diagram of device 110: I/O device interfaces 1102, controller(s)/processor(s) 1104, memory 1106, and storage 1108 on bus 1124; microphone(s) 202, speaker, antenna 1114, and display 1120; a volume correction component 440, ASR module 250, NLU module 260, command processor 290, and beamformer component 1160; connected to network(s) 199.]
`
`
`
[FIG. 12 (Sheet 12 of 13): block diagram of server 120: I/O device interfaces 1202, controller(s)/processor(s) 1204, memory 1206, and storage 1208 on bus 1224; ASR module 250, NLU module 260, command processor 290, and a synchronization component 1250; connected to network 199.]
`
`
`
[FIG. 13 (Sheet 13 of 13): example computer network, with network 199 connecting server 120, application server 125, camera(s) 1302, and devices such as speech-controlled device 110, a smart phone, smart watch 110c, tablet computer 110d, and vehicle 110e.]
`
`
`
VOLUME ADJUSTMENT FOR LISTENING ENVIRONMENT

BACKGROUND

Speech recognition systems have progressed to the point where humans can interact with computing devices using their voices. Such systems employ techniques to identify the words spoken by a human user based on the various qualities of a received audio input. Speech recognition combined with natural language understanding processing techniques enable speech-based user control of a computing device to perform tasks based on the user's spoken commands. The combination of speech recognition and natural language understanding processing techniques is referred to herein as speech processing. Speech processing may also involve converting a user's speech into text data which may then be provided to various text-based software applications.

Speech processing may be used by computers, hand-held devices, telephone computer systems, kiosks, and a wide variety of other devices to improve human-computer interactions. Output of speech processing systems may include synthesized speech.

BRIEF DESCRIPTION OF DRAWINGS

For a more complete understanding of the present disclosure, reference is now made to the following description taken in conjunction with the accompanying drawings.
FIG. 1 illustrates a speech processing system configured to adjust output volume according to embodiments of the present disclosure.
FIG. 2 illustrates certain components of a device configured to input and output audio according to embodiments of the present disclosure.
FIG. 3 illustrates directional based audio processing and beamforming according to embodiments of the present disclosure.
FIG. 4 is a diagram of components of a speech processing system according to embodiments of the present disclosure.
FIG. 5 is a diagram of components for text-to-speech (TTS) processing according to embodiments of the present disclosure.
FIG. 6 illustrates determining relative positions of audio emitting and capturing elements in an environment according to embodiments of the present disclosure.
FIG. 7 illustrates estimating noise experienced by a user according to embodiments of the present disclosure.
FIG. 8 illustrates determining relative positions of audio emitting and capturing elements in an environment according to embodiments of the present disclosure.
FIG. 9 illustrates estimating noise experienced by a user according to embodiments of the present disclosure.
FIG. 10 illustrates data stored and associated with user profiles according to embodiments of the present disclosure.
FIG. 11 is a block diagram conceptually illustrating example components of a device according to embodiments of the present disclosure.
FIG. 12 is a block diagram conceptually illustrating example components of a server according to embodiments of the present disclosure.
FIG. 13 illustrates an example of a computer network for use with the system.

DETAILED DESCRIPTION

Automatic speech recognition (ASR) is a field of computer science, artificial intelligence, and linguistics concerned with transforming audio data associated with speech into text data representative of that speech. Similarly, natural language understanding (NLU) is a field of computer science, artificial intelligence, and linguistics concerned with enabling computers to derive meaning from text input containing natural language. ASR and NLU are often used together as part of a speech processing system. Text-to-speech (TTS) is a field concerning transforming textual data into audio data that is synthesized to resemble human speech.

A speech processing system may be configured as a relatively self-contained system where one device captures audio, performs speech processing, and executes a command corresponding to the input speech. Alternatively, a speech processing system may be configured as a distributed system where a number of different devices combine to capture audio of a spoken utterance, perform speech processing, and execute a command corresponding to the utterance. In a distributed system, one device may capture audio while other device(s) perform speech processing, etc. Output audio data resulting from execution of the command, for example synthesized speech responding to the input utterance, may then be sent back to the same device that captured the original utterance, or may be sent to one or more other devices.

Listening environments for users may vary in terms of the amount of noise experienced by the user. Various noise sources may combine to make it difficult for a user to hear and/or understand output audio from a device. To improve the ability of a user to detect and understand output audio from a device, a system may determine a location of a user and noise sources relative to a device, may determine the noise level at the user's location, and may determine a gain for output audio data to ensure the output audio is at a sufficient volume when it reaches the user's location. The system may also incorporate user feedback regarding the output volume for purposes of determining the output volume at a future time under similar noise conditions. While the present techniques are illustrated for a speech processing system, they are applicable to other systems as well including a voice over internet protocol (VoIP) system, different communication systems, or other such systems that involve the exchange of audio data.

FIG. 1 shows a speech processing system 100. Although the figures and discussion illustrate certain operational steps of the system 100 in a particular order, the steps described may be performed in a different order (as well as certain steps removed or added) without departing from the intent of the disclosure. As shown in FIG. 1, the system 100 may include one or more speech-controlled devices 110 local to user 5, as well as one or more networks 199 and one or more servers 120 connected to speech-controlled device(s) 110 across network(s) 199. The server(s) 120 (which may be one or more different physical devices) may be capable of performing traditional speech processing (e.g., ASR, NLU, command processing, etc.) or other operations as described herein. A single server 120 may perform all speech processing or multiple servers 120 may combine to perform all speech processing. Further, the server(s) 120 may execute certain commands, such as answering spoken utterances of the users 5, operating other devices (light switches, appliances, etc.). In addition, certain speech detection or command execution functions may be performed by the speech-controlled device 110. Further, the system may be in communication with external data sources, such as a knowledge base, external service provider devices, or the like.
`
`
`
`
In one example, as shown in FIG. 1, a user 5 may be in an environment with device 110 as well as noise sources 190a and 190b. The user 5 may speak an utterance which is represented in input audio 11. The device 110 may detect (130) the input audio which may include both the utterance as well as noise from the noise sources. The device 110 may send (132) audio data to the server 120 for speech processing. The device 110 may determine (134) the user's location relative to the device 110, either through range finding, image processing, estimation or other techniques. The device 110 may also determine (136) the location(s) of noise sources in the environment, also through range finding, image processing, estimation or other techniques. Using the location information as well as sound pressure information of the noise as detected in the input audio, the device may determine (138) the noise level at the user's location. The device may then calculate (140) a gain for output audio data to ensure a desired volume of the output audio when it reaches the user's location. The desired volume may be an absolute volume and/or a volume relative to the noise at the user's location. The device 110 may receive (142) output audio data from the server corresponding to the utterance and may output (144) audio corresponding to the output audio data using the calculated gain to determine the initial output audio volume at the device. The gain should ensure that the output audio is at a sufficient volume by the time it reaches the user 5.
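As a rough illustration of what steps 138 and 140 can look like, the sketch below estimates the noise level at the user's location from the level measured at the device and then computes a gain so the output audio sits above that noise level when it arrives. It is a minimal sketch only, assuming free-field propagation (sound pressure falling off roughly as 1/distance, about 6 dB per doubling of distance); the function names, distances, nominal speaker level, and target signal-to-noise ratio are illustrative assumptions, not values from the patent.

```python
import math

def spl_at_distance(spl_ref_db: float, d_ref: float, d_target: float) -> float:
    """Estimate SPL at d_target given SPL measured at d_ref,
    assuming free-field 1/r attenuation (~6 dB per doubling of distance)."""
    return spl_ref_db - 20.0 * math.log10(d_target / d_ref)

def output_gain_db(noise_spl_at_device: float,
                   d_noise_to_device: float,
                   d_noise_to_user: float,
                   d_device_to_user: float,
                   nominal_output_spl_at_1m: float,
                   desired_snr_db: float = 10.0) -> float:
    """Gain (dB) to apply to output audio so that, at the user's location,
    it sits desired_snr_db above the estimated noise level there."""
    # Project the noise level measured at the device onto the user's location (step 138).
    noise_at_user = spl_at_distance(noise_spl_at_device,
                                    d_noise_to_device, d_noise_to_user)
    # Level the unamplified output audio would have once it reaches the user.
    output_at_user = spl_at_distance(nominal_output_spl_at_1m,
                                     1.0, d_device_to_user)
    # Extra gain needed so the output audio reaches the desired level at the user (step 140).
    return (noise_at_user + desired_snr_db) - output_at_user

# Example: noise measured at 65 dB SPL at the device, noise source 2 m from the
# device and 3 m from the user, user 2.5 m from the device, speaker nominally
# 70 dB SPL at 1 m.
print(round(output_gain_db(65.0, 2.0, 3.0, 2.5, 70.0), 1), "dB")
```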
As illustrated in FIG. 2, a device 110 may include, among other components, a microphone array 210, an output audio speaker 116, a beamformer component 1160 (as illustrated in FIG. 11), or other components. While illustrated in two dimensions, the microphone array 210 may be angled or otherwise configured in a three-dimensional configuration, with different elevations of the individual microphones or the like. The microphone array 210 may include a number of different individual microphones. As illustrated in FIG. 2, the array 210 includes eight (8) microphones 202. The individual microphones may capture sound and pass the resulting audio signal created by the sound to a downstream component, such as an analysis filterbank. Each individual piece of audio data captured by a microphone may be in a time domain. To isolate audio from a particular direction, the system may compare the audio data (or audio signals related to the audio data, such as audio signals in a sub-band domain) to determine a time difference of detection of a particular segment of audio data. If the audio data for a first microphone includes the segment of audio data earlier in time than the audio data for a second microphone, then the system may determine that the source of the audio that resulted in the segment of audio data may be located closer to the first microphone than to the second microphone (which resulted in the audio being detected by the first microphone before being detected by the second microphone).
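One common way to measure such a time difference of detection between two microphones is the peak of their cross-correlation. The sketch below is a generic illustration of that idea only and is not taken from the patent; the sample rate and the synthetic signals are assumptions.

```python
import numpy as np

def tdoa_seconds(mic_a: np.ndarray, mic_b: np.ndarray,
                 sample_rate: int = 16000) -> float:
    """Time difference of arrival: positive when the sound reaches mic_a
    before mic_b (i.e., the source is closer to mic_a)."""
    corr = np.correlate(mic_b, mic_a, mode="full")
    lag = int(np.argmax(corr)) - (len(mic_a) - 1)  # samples by which mic_b lags mic_a
    return lag / sample_rate

# Example: the same burst arrives 5 samples later at microphone B,
# so the source is closer to microphone A.
rng = np.random.default_rng(1)
burst = rng.standard_normal(64)
mic_a = np.concatenate([np.zeros(10), burst, np.zeros(30)])
mic_b = np.concatenate([np.zeros(15), burst, np.zeros(25)])
print(tdoa_seconds(mic_a, mic_b))  # ~0.0003 s: mic A heard the burst first
```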
Using such direction isolation/beamforming techniques, a device 110 may isolate directionality of audio sources. As shown in FIG. 3, a particular direction may be associated with a particular microphone of a microphone array. For example, direction 1 is associated with one microphone, direction 2 is associated with another microphone 202, and so on. If audio is detected first by a particular microphone the device 110 may determine that the source of the audio is associated with the direction of the microphone. Further, the device 110 may use other techniques for determining a source direction of particular audio including determining what microphone detected the audio with a largest amplitude (which in turn may result in a highest strength of the audio signal portion corresponding to the audio). Other techniques (either in the time domain or in the sub-band domain) may also be used.

For example, if audio data corresponding to a user's speech is first detected and/or is most strongly detected by a microphone associated with direction 7, the device may determine that the user is located in a location in direction 7. The device may isolate audio coming from direction 7 using techniques known to the art. Thus, the device 110 may boost audio coming from direction 7, thus increasing the amplitude of audio data corresponding to speech from user 5 relative to other audio captured from other directions. In this manner, noise from noise sources that is coming from all the other directions will be dampened relative to the desired audio (e.g., speech from user 5) coming from direction 7. Further, the device may engage in other filtering/beamforming operations to determine a direction associated with incoming audio where the direction may not necessarily be associated with a unique microphone. Thus a device may be capable of having a number of detectable directions that may not be exactly the same as the number of microphones. Using beamforming and other audio isolation techniques, the device 110 may isolate the desired audio (e.g., the utterance) from other audio (e.g., noise) for purposes of the operations described herein.
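A delay-and-sum beamformer is one simple way to boost audio from a chosen direction relative to the other directions, as described above. The sketch below is a generic illustration assuming a linear array and integer-sample delays; it is not the beamformer component 1160 itself, and the array geometry, sample rate, and steering convention are assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def delay_and_sum(mic_signals: np.ndarray, mic_positions_m: np.ndarray,
                  look_direction_rad: float, sample_rate: int = 16000) -> np.ndarray:
    """Steer a linear array toward look_direction_rad by shifting each
    microphone's signal so sound from that direction adds coherently.
    mic_signals: (num_mics, num_samples); mic_positions_m: (num_mics,)."""
    num_mics, num_samples = mic_signals.shape
    # Relative arrival delays (in samples) of a plane wave from the look direction.
    delays = mic_positions_m * np.cos(look_direction_rad) / SPEED_OF_SOUND * sample_rate
    delays -= delays.min()  # keep all shifts non-negative
    out = np.zeros(num_samples)
    for m in range(num_mics):
        shift = int(round(delays[m]))
        # Advance each channel so the look-direction wavefront lines up, then average.
        out[:num_samples - shift] += mic_signals[m, shift:]
    return out / num_mics

# Example: 4-microphone linear array with 5 cm spacing, steered broadside (90 degrees).
signals = np.zeros((4, 1600))
positions = np.array([0.0, 0.05, 0.10, 0.15])
beam = delay_and_sum(signals, positions, np.deg2rad(90.0))
print(beam.shape)
```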
Once input audio 11 is captured, input audio data 111 corresponding to the input audio may be sent to a remote device, such as server 120, for further processing. FIG. 4 is a conceptual diagram of how a spoken utterance is processed. The various components illustrated may be located on a same or different physical devices. Communication between various components illustrated in FIG. 4 may occur directly or across a network(s) 199. An audio capture component, such as the microphone 202 of the speech-controlled device 110 (or other device), captures input audio 11 corresponding to a spoken utterance. The device 110, using a wakeword detection component 420, then processes the audio, or audio data corresponding to the audio, to determine if a keyword (such as a wakeword) is detected in the audio. Following detection of a wakeword, the speech-controlled device 110 sends audio data 111, corresponding to the input audio 11, to a server(s) 120 that includes an ASR component 450. The audio data 111 may be output from an acoustic front end (AFE) 456 located on the speech-controlled device 110 prior to transmission. Alternatively, the audio data 111 may be in a different form for processing by a remote AFE 456, such as the AFE 456 located with the ASR component 450.

The wakeword detection component 420 works in conjunction with other components of the speech-controlled device 110, for example the microphone 202, to detect keywords in audio 11. For example, the speech-controlled device 110 may convert audio 11 into audio data, and process the audio data with the wakeword detection component 420 to determine whether speech is detected, and if so, if the audio data comprising speech matches an audio signature and/or model corresponding to a particular keyword.
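Putting the device-side half of FIG. 4 together, the flow is roughly: capture a frame, check for the wakeword, and only then start streaming audio data 111 to the server. The sketch below is a schematic of that gating only; capture behavior, detect_wakeword, and send_to_server are hypothetical stand-ins for real components (such as the wakeword detection component 420) and are not defined by the patent.

```python
from typing import Callable, Iterable

def stream_after_wakeword(frames: Iterable[bytes],
                          detect_wakeword: Callable[[bytes], bool],
                          send_to_server: Callable[[bytes], None]) -> None:
    """Gate transmission on wakeword detection: frames are only sent to the
    server (as audio data 111) once the wakeword has been detected."""
    triggered = False
    for frame in frames:
        if not triggered:
            triggered = detect_wakeword(frame)  # device-local check
            continue  # the wakeword frames themselves may be dropped before sending
        send_to_server(frame)  # remainder of the utterance goes to the server for ASR

# Example with stand-in callables: the frame containing "wakeword" triggers streaming.
frames = [b"noise", b"wakeword", b"turn", b"up", b"the", b"volume"]
sent = []
stream_after_wakeword(frames, lambda f: f == b"wakeword", sent.append)
print(sent)  # [b'turn', b'up', b'the', b'volume']
```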
`
`
`
`
The speech-controlled device 110 may use various techniques to determine whether audio data includes speech. Some embodiments may apply voice activity detection (VAD) techniques. Such techniques may determine whether speech is present in an audio input based on various quantitative aspects of the audio input, such as the spectral slope between one or more frames of the audio input; the energy levels of the audio input in one or more spectral bands; the signal-to-noise ratios of the audio input in one or more spectral bands; or other quantitative aspects. In other embodiments, the speech-controlled device 110 may implement a limited classifier configured to distinguish speech from background noise. The classifier may be implemented by techniques such as linear classifiers, support vector machines, and decision trees. In still other embodiments, Hidden Markov Model (HMM) or Gaussian Mixture Model (GMM) techniques may be applied to compare the audio input to one or more acoustic models in speech storage, which acoustic models may include models corresponding to speech, noise (such as environmental noise or background noise), or silence. Still other techniques may be used to determine whether speech is present in the audio input.
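As a minimal sketch of a frame-level VAD decision built from two of the quantitative aspects mentioned above (frame energy and spectral slope), consider the following. It is an illustration only, not the patent's method; the thresholds, frame size, and sample rate are assumptions.

```python
import numpy as np

def simple_vad(frame: np.ndarray, sample_rate: int = 16000,
               energy_thresh_db: float = -35.0, slope_thresh: float = 0.0) -> bool:
    """Rough frame-level voice activity decision from energy and spectral slope."""
    # Frame energy relative to full scale (assumes float samples in [-1, 1]).
    energy_db = 10.0 * np.log10(np.mean(frame ** 2) + 1e-12)

    # Spectral slope: fit a line to log-magnitude spectrum versus frequency.
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    slope = np.polyfit(freqs, np.log10(spectrum + 1e-12), 1)[0]

    # Speech frames tend to be both energetic and tilted toward low frequencies.
    return energy_db > energy_thresh_db and slope < slope_thresh

# Example: a 25 ms frame of low-level noise is classified as non-speech.
rng = np.random.default_rng(0)
noise_frame = 0.001 * rng.standard_normal(400)
print(simple_vad(noise_frame))  # False
```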
Once speech is detected in the audio captured by the speech-controlled device 110, the speech-controlled device 110 may use the wakeword detection component 420 to perform wakeword detection to determine when a user intends to speak a query to the speech-controlled device 110. This process may also be referred to as keyword detection, with the wakeword being a specific example of a keyword. Specifically, keyword detection is typically performed without performing linguistic analysis, textual analysis, or semantic analysis. Instead, incoming audio (or audio data) is analyzed to determine if specific characteristics of the audio match preconfigured acoustic waveforms, audio signatures, or other data to determine if the incoming audio "matches" stored audio data corresponding to a keyword.

Thus, the wakeword detection component 420 may compare audio data to stored models or data to detect a wakeword. One approach for wakeword detection applies general large vocabulary continuous speech recognition (LVCSR) systems to decode the audio signals, with wakeword searching conducted in the resulting lattices or confusion networks. LVCSR decoding may require relatively high computational resources. Another approach for wakeword spotting builds HMMs for each key wakeword word and non-wakeword speech signals, respectively. The non-wakeword speech includes other spoken words, background noise, etc. There can be one or more HMMs built to model the non-wakeword speech characteristics, which are named filler models. Viterbi decoding is used to search the best path in the decoding graph, and the decoding output is further processed to make the decision on keyword presence. This approach can be extended to include discriminative information by incorporating a hybrid DNN-HMM decoding framework. In another embodiment, the wakeword spotting system may be built on deep neural network (DNN)/recursive neural network (RNN) structures directly, without HMM involved. Such a system may estimate the posteriors of wakewords with context information, either by stacking frames within a context window for DNN, or using RNN. Following-on posterior threshold tuning or smoothing is applied for decision making. Other techniques for wakeword detection, such as those known in the art, may also be used.
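The DNN/RNN approach above produces per-frame wakeword posteriors that are then smoothed and thresholded. The sketch below illustrates that post-processing step only; the posterior values would come from a trained network (not shown), and the window length and threshold are illustrative assumptions rather than values from the patent.

```python
import numpy as np

def wakeword_decision(posteriors: np.ndarray,
                      window: int = 30,
                      threshold: float = 0.8) -> bool:
    """Smooth per-frame wakeword posteriors with a moving average and
    trigger when the smoothed score crosses a threshold."""
    kernel = np.ones(window) / window
    smoothed = np.convolve(posteriors, kernel, mode="valid")
    return bool(np.max(smoothed) >= threshold)

# Example: a run of high-posterior frames (as a trained DNN/RNN might emit
# while the wakeword is being spoken) surrounded by low-posterior frames.
posteriors = np.concatenate([np.full(50, 0.05), np.full(40, 0.95), np.full(50, 0.05)])
print(wakeword_decision(posteriors))  # True
```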
Once the wakeword is detected, the speech-controlled device 110 may "wake" and begin transmitting audio data 111 corresponding to input audio 11 to the server(s) 120 for speech processing. The audio data 111 may be sent to the server(s) 120 for routing to a recipient device or may be sent to the server(s) 120 for speech processing for interpretation of the included speech (either for purposes of enabling voice-communications and/or for purposes of executing a command in the speech). The audio data 111 may include data corresponding to the wakeword, or the portion of the audio data 111 corresponding to the wakeword may be removed by the speech-controlled device 110 prior to sending.

Upon receipt by the server(s) 120, an ASR module 450 may convert the audio data 111 into text. The ASR module 450 transcribes the audio data 111 into text data representing words of speech contained in the audio data 111. The text data may then be used by other components for various purposes, such as executing system commands, inputting data, etc. A spoken utterance in the audio data 111 is input to a processor configured to perform ASR, which then interprets the spoken utterance based on a similarity between the spoken utterance and pre-established language models 454 stored in an ASR model knowledge base (i.e., ASR model storage 452). For example, the ASR module 450 may compare the audio data 111 with models for sounds (e.g., subword units or phonemes) and sequences of sounds to identify words that match the sequence of sounds spoken in the spoken utterance of the audio data 111.

The different ways a spoken utterance may be interpreted (i.e., the different hypotheses) may each be assigned a probability or a confidence score representing a likelihood that a particular set of words matches those spoken in the spoken utterance. The confidence score may be based on a number of factors including, for example, a similarity of the sound in the spoken utterance to models for language sounds (e.g., an acoustic model 453 stored in the ASR model storage 452), and a likelihood that a particular word that matches the sound would be included in the sentence at the specific location (e.g., using a language model 454 stored in the ASR model storage 452). Thus, each potential textual interpretation of the spoken utterance (i.e., hypothesis) is associated with a confidence score. Based on the considered factors and the assigned confidence score, the ASR module 450 outputs the most likely text recognized in the audio data 111. The ASR module 450 may also output multiple hypotheses in the form of a lattice or an N-best list with each hypothesis corresponding to a confidence score or other score (e.g., such as probability scores, etc.).
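The scoring described above combines an acoustic-model score and a language-model score per hypothesis and returns the best candidates. A minimal sketch of that ranking step follows; the log-probability values, weight, and data structures are illustrative assumptions and do not represent the ASR module 450 itself.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    acoustic_logprob: float   # from an acoustic model (e.g., 453)
    language_logprob: float   # from a language model (e.g., 454)

def n_best(hypotheses, n=3, lm_weight=0.8):
    """Rank hypotheses by a weighted combination of acoustic and
    language-model scores and return the top n with their scores."""
    scored = [(h.acoustic_logprob + lm_weight * h.language_logprob, h.text)
              for h in hypotheses]
    scored.sort(reverse=True)
    return scored[:n]

hyps = [
    Hypothesis("turn up the volume", -12.1, -4.0),
    Hypothesis("turn up the column", -12.0, -9.5),
    Hypothesis("turnip the volume", -14.3, -11.2),
]
for score, text in n_best(hyps):
    print(round(score, 2), text)
```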
The device or devices including the ASR module 450 may include an AFE 456 and a speech recognition engine 458. The AFE 456 may transform raw audio data captured by the microphone 202 into data for processing by the speech recognition engine 458. The speech recognition engine 458 compares the speech recognition data with acoustic models 453, language models 454, and other data models and information for recognizing the speech conveyed in the audio data 111.

The speech recognition engine 458 may process data output from the AFE 456 with reference to information stored in the ASR model storage 452. Alternatively, post-front-end processed data (e.g., feature vectors) may be received by the device executing ASR processing from another source besides the internal AFE 456. For example, the speech-controlled device 110 may process audio data 111 into feature vectors (e.g., using an on-device AFE 456) and transmit that information to the server 120 across the network 199 for ASR processing. Feature vectors may arrive at the server 120 encoded, in which case they may be decoded prior to processing by the processor executing the speech recognition engine 458.
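The AFE 456 is described as turning raw microphone audio into feature vectors. One common style of front end (framing, windowing, and log band energies) is sketched below purely for illustration; the patent does not specify these particular features, and the frame, hop, and band parameters are assumptions.

```python
import numpy as np

def log_band_energies(audio: np.ndarray, sample_rate: int = 16000,
                      frame_len: int = 400, hop: int = 160,
                      n_bands: int = 20) -> np.ndarray:
    """Convert raw audio into a sequence of feature vectors: one vector of
    log band energies per 25 ms frame (10 ms hop at 16 kHz)."""
    window = np.hanning(frame_len)
    features = []
    for start in range(0, len(audio) - frame_len + 1, hop):
        frame = audio[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        # Pool FFT bins into n_bands equal-width bands and take log energy.
        bands = np.array_split(power, n_bands)
        features.append(np.log([b.sum() + 1e-10 for b in bands]))
    return np.array(features)

# Example: 1 second of audio becomes a (frames, n_bands) feature matrix.
rng = np.random.default_rng(0)
feats = log_band_energies(0.1 * rng.standard_normal(16000))
print(feats.shape)  # (98, 20)
```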
The speech recognition engine 458 attempts to match received feature vectors to language phonemes and words as known in the stored acoustic models 453 and language models 454. The speech recognition engine 458 computes recognition scores for the feature vectors based on acoustic information and language information. The acoustic information is used to calculate an acoustic score representing a likelihood that the intended sound represented by a group of feature vectors matches a language phoneme. The language
`