(12) United States Patent
Kristjansson et al.

(10) Patent No.: US 10,147,439 B1
(45) Date of Patent: Dec. 4, 2018

(54) VOLUME ADJUSTMENT FOR LISTENING ENVIRONMENT

(71) Applicant: Amazon Technologies, Inc., Seattle, WA (US)

(72) Inventors: Trausti Thor Kristjansson, San Jose, CA (US); Mohamed Mansour, Cupertino, CA (US); Amit Singh Chhetri, Santa Clara, CA (US); Ludger Solbach, San Jose, CA (US)

(73) Assignee: Amazon Technologies, Inc., Seattle, WA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 15/474,197

(22) Filed: Mar. 30, 2017

(51) Int. Cl.
    G06F 3/00      (2006.01)
    G10L 21/0364   (2013.01)
    G10L 15/22     (2006.01)
    G10L 21/0232   (2013.01)
    G10L 13/00     (2006.01)
    G10L 21/0216   (2013.01)

(52) U.S. Cl.
    CPC .......... G10L 21/0364 (2013.01); G10L 13/00 (2013.01); G10L 15/22 (2013.01); G10L 21/0232 (2013.01); G10L 2015/223 (2013.01); G10L 2021/02166 (2013.01)

(58) Field of Classification Search
    CPC .......... G06F 3/165; G06F 3/167; G10L 15/22; G10L 15/265
    USPC .......... 704/423
    See application file for complete search history.

(56) References Cited

    U.S. PATENT DOCUMENTS

      8,862,387 B2 *  10/2014  Kandangath ....... G01C 21/3629
                                                         701/419
    2014/0135076 A1 *  5/2014  Lee ............... H04M 1/6041
                                                         455/569.1
    2017/0242650 A1 *  8/2017  Jarvis ............ H04S 7/301
    2017/0242651 A1 *  8/2017  Lang .............. G06F 3/165

    * cited by examiner

Primary Examiner - Daniel Abebe
(74) Attorney, Agent, or Firm — Pierce Atwood LLP

(57) ABSTRACT

A speech-capturing device that can modulate its output audio data volume based on environmental sound conditions at the location of a user speaking to the device. The device detects the sound pressure of a spoken utterance at the device location and determines the distance of the user from the device. The device also detects the sound pressure of noise at the device and uses information about the location of the noise source and user to determine the sound pressure of noise at the location of the talker. The device can then adjust the gain for output audio (such as a spoken response to the utterance) to ensure that the output audio is at a certain desired sound pressure when it reaches the location of the user.

18 Claims, 13 Drawing Sheets
[FIG. 1 (front-page drawing): speech processing system 100 with speech-controlled device 110, server(s) 120, and network(s) 199. The device captures input audio 11 from a user and produces output audio 15 in the presence of noise source 1 190a and noise source 2 190b. Illustrated operations: (130) detect input audio including an utterance and noise; (132) send audio data to server for speech processing; (134) determine user location; (136) determine location(s) of noise source(s); (138) determine noise level at user's location; (140) calculate gain for output audio data to ensure desired volume of output audio at user's location in view of noise level at user's location; (142) receive output audio data from server; (144) output audio using calculated gain.]
[Sheet 1 of 13, FIG. 1: the system 100 described above, showing device 110, the user, noise sources 190a and 190b, input audio 11, output audio 15, network(s) 199, server(s) 120, and operations 130-144.]
[Sheet 2 of 13, FIG. 2: device 110 with a microphone array 210 of individual microphones 202 and a speaker 220.]
[Sheet 3 of 13, FIG. 3: the microphone array, with microphones 202 associated with look directions 1 through 8.]
[Sheet 4 of 13, FIG. 4: speech processing components. Device 110, with a wakeword detection module 420 and a volume correction component 440, sends input audio data 111 to server(s) 120 hosting automatic speech recognition 450 (acoustic front end (AFE) 456, speech recognition engine 458, ASR models 452 with acoustic models 453a-453n and language models 454a-454n), natural language understanding (NLU) 460 (recognizer 463, NER 462, IC module 464, NLU storage 473 with per-domain intents 478 and grammars 476), an entity library 482 with gazetteers 484a-484n and domain lexicons 486, and a command processor 490.]
[Sheet 5 of 13, FIG. 5: text-to-speech components. Command processor(s) 490 feed a TTS component 514 containing a TTS front end (TTSFE) and a speech synthesis engine 516 with a unit selection engine 530 and a parametric synthesis engine 532, backed by TTS storage 520 and TTS voice unit storage 572 holding voice inventories 578a-578n; output audio data 511 is sent to device 110, which plays output audio 15 to user 5.]
[Sheet 6 of 13, FIG. 6: relative positions of device 110, user 5, noise source 1 190a, and noise source 2 190b, with labeled distances 602-610 between the device, the user, and each noise source.]
[Sheet 7 of 13, FIG. 7: a single-microphone device 110 at distance 602 from user 5, with noise sources 190a and 190b; for such a device the noise level at the user can be assumed to be the same as the noise level at the device.]
[Sheet 8 of 13, FIG. 8: device 110, user 5, and noise sources 190a and 190b, with example distance values (e.g., 7.0) annotated on labeled distances 604 and 610.]
[Sheet 9 of 13, FIG. 9: estimating the noise experienced by user 5, given device 110, noise sources 190a and 190b, user-to-device distance 602, and additional labeled quantities 904, 910, and 930.]
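FIGS. 6 through 9, together with the abstract, describe using the positions of the device, the user, and the noise sources to estimate the noise level at the user rather than at the device. One plausible way to do that, assuming free-field spreading so that sound pressure level changes by 20*log10 of the distance ratio, is sketched below; the propagation model and every name here are illustrative assumptions, not the patent's method. For the single-microphone case of FIG. 7, the simplification is to use the levels measured at the device directly.

```python
import math

def noise_level_at_user_db(noise_levels_at_device_db,
                           source_to_device_m, source_to_user_m):
    """Sketch: re-project noise levels measured at the device onto the
    user's location using an inverse-distance (free-field) model, then
    sum the contributions as acoustic powers.
    """
    total_power = 0.0
    for level_db, d_device, d_user in zip(noise_levels_at_device_db,
                                          source_to_device_m,
                                          source_to_user_m):
        # Level changes by 20*log10(d_device / d_user) when moving the
        # listening point from the device to the user.
        level_at_user = level_db + 20.0 * math.log10(d_device / d_user)
        total_power += 10.0 ** (level_at_user / 10.0)
    return 10.0 * math.log10(total_power)

# Example: two noise sources measured at the device as 50 and 46 dB SPL;
# source 1 is closer to the user than to the device, source 2 farther.
print(round(noise_level_at_user_db([50.0, 46.0], [2.0, 3.0], [1.0, 4.5]), 1))
# ≈ 56 dB at the user, dominated by the nearer source 1.
```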
[Sheet 10 of 13, FIG. 10: user profile storage 1002 holding a profile 1004 that associates, for each device, a Device ID, an IP Address, a Device Name (e.g., Echo, Dot, TV), and a Device Location (e.g., kitchen, upstairs office, living room).]
[Sheet 11 of 13, FIG. 11: block diagram of device 110: microphone(s) 202, speaker, antenna 1114, display 1120, I/O device interfaces 1102, controller(s)/processor(s) 1104, memory 1106, and storage 1108 on bus 1124, with a volume correction component 440, ASR module 250, NLU module 260, command processor 290, and beamformer component 1160; the device connects to network(s) 199.]
[Sheet 12 of 13, FIG. 12: block diagram of server 120: I/O device interfaces 1202, controller(s)/processor(s) 1204, memory 1206, and storage 1208 on bus 1224, with an ASR module 250, NLU module 260, command processor 290, and synchronization component 1250; the server connects to network 199.]
[Sheet 13 of 13, FIG. 13: example computer network: speech-controlled device 110, smart watch 110c, tablet computer 110d, vehicle 110e, a smart phone, and camera(s) 1302 connected through network 199 to server 120 and application server 125.]
VOLUME ADJUSTMENT FOR LISTENING ENVIRONMENT

BACKGROUND

Speech recognition systems have progressed to the point where humans can interact with computing devices using their voices. Such systems employ techniques to identify the words spoken by a human user based on the various qualities of a received audio input. Speech recognition combined with natural language understanding processing techniques enable speech-based user control of a computing device to perform tasks based on the user's spoken commands. The combination of speech recognition and natural language understanding processing techniques is referred to herein as speech processing. Speech processing may also involve converting a user's speech into text data which may then be provided to various text-based software applications.

Speech processing may be used by computers, hand-held devices, telephone computer systems, kiosks, and a wide variety of other devices to improve human-computer interactions. Output of speech processing systems may include synthesized speech.

BRIEF DESCRIPTION OF DRAWINGS

For a more complete understanding of the present disclosure, reference is now made to the following description taken in conjunction with the accompanying drawings.

FIG. 1 illustrates a speech processing system configured to adjust output volume according to embodiments of the present disclosure.
FIG. 2 illustrates certain components of a device configured to input and output audio according to embodiments of the present disclosure.
FIG. 3 illustrates directional based audio processing and beamforming according to embodiments of the present disclosure.
FIG. 4 is a diagram of components of a speech processing system according to embodiments of the present disclosure.
FIG. 5 is a diagram of components for text-to-speech (TTS) processing according to embodiments of the present disclosure.
FIG. 6 illustrates determining relative positions of audio emitting and capturing elements in an environment according to embodiments of the present disclosure.
FIG. 7 illustrates estimating noise experienced by a user according to embodiments of the present disclosure.
FIG. 8 illustrates determining relative positions of audio emitting and capturing elements in an environment according to embodiments of the present disclosure.
FIG. 9 illustrates estimating noise experienced by a user according to embodiments of the present disclosure.
FIG. 10 illustrates data stored and associated with user profiles according to embodiments of the present disclosure.
FIG. 11 is a block diagram conceptually illustrating example components of a device according to embodiments of the present disclosure.
FIG. 12 is a block diagram conceptually illustrating example components of a server according to embodiments of the present disclosure.
FIG. 13 illustrates an example of a computer network for use with the system.

DETAILED DESCRIPTION

Automatic speech recognition (ASR) is a field of computer science, artificial intelligence, and linguistics concerned with transforming audio data associated with speech into text data representative of that speech. Similarly, natural language understanding (NLU) is a field of computer science, artificial intelligence, and linguistics concerned with enabling computers to derive meaning from text input containing natural language. ASR and NLU are often used together as part of a speech processing system. Text-to-speech (TTS) is a field concerning transforming textual data into audio data that is synthesized to resemble human speech.

A speech processing system may be configured as a relatively self-contained system where one device captures audio, performs speech processing, and executes a command corresponding to the input speech. Alternatively, a speech processing system may be configured as a distributed system where a number of different devices combine to capture audio of a spoken utterance, perform speech processing, and execute a command corresponding to the utterance. In a distributed system, one device may capture audio while other device(s) perform speech processing, etc. Output audio data resulting from execution of the command, for example synthesized speech responding to the input utterance, may then be sent back to the same device that captured the original utterance, or may be sent to one or more other devices.

Listening environments for users may vary in terms of the amount of noise experienced by the user. Various noise sources may combine to make it difficult for a user to hear and/or understand output audio from a device. To improve the ability of a user to detect and understand output audio from a device, a system may determine a location of a user and noise sources relative to a device, may determine the noise level at the user's location, and may determine a gain for output audio data to ensure the output audio is at a sufficient volume when it reaches the user's location. The system may also incorporate user feedback regarding the output volume for purposes of determining the output volume at a future time under similar noise conditions. While the present techniques are illustrated for a speech processing system, they are applicable to other systems as well, including a voice over internet protocol (VoIP) system, different communication systems, or other such systems that involve the exchange of audio data.

FIG. 1 shows a speech processing system 100. Although the figures and discussion illustrate certain operational steps of the system 100 in a particular order, the steps described may be performed in a different order (as well as certain steps removed or added) without departing from the intent of the disclosure. As shown in FIG. 1, the system 100 may include one or more speech-controlled devices 110 local to user 5, as well as one or more networks 199 and one or more servers 120 connected to speech-controlled device(s) 110 across network(s) 199. The server(s) 120 (which may be one or more different physical devices) may be capable of performing traditional speech processing (e.g., ASR, NLU, command processing, etc.) or other operations as described herein. A single server 120 may perform all speech processing or multiple servers 120 may combine to perform all speech processing. Further, the server(s) 120 may execute certain commands, such as answering spoken utterances of the users 5, or operating other devices (light switches, appliances, etc.). In addition, certain speech detection or command execution functions may be performed by the speech-controlled device 110. Further, the system may be in communication with external data sources, such as a knowledge base, external service provider devices, or the like.
In one example, as shown in FIG. 1, a user 5 may be in an environment with device 110 as well as noise sources 190a and 190b. The user 5 may speak an utterance which is represented in input audio 11. The device 110 may detect (130) the input audio, which may include both the utterance as well as noise from the noise sources. The device 110 may send (132) audio data to the server 120 for speech processing. The device 110 may determine (134) the user's location relative to the device 110, either through range finding, image processing, estimation or other techniques. The device 110 may also determine (136) the location(s) of noise sources in the environment, also through range finding, image processing, estimation or other techniques. Using the location information as well as sound pressure information of the noise as detected in the input audio, the device may determine (138) the noise level at the user's location. The device may then calculate (140) a gain for output audio data to ensure a desired volume of the output audio when it reaches the user's location. The desired volume may be an absolute volume and/or a volume relative to the noise at the user's location. The device 110 may receive (142) output audio data from the server corresponding to the utterance and may output (144) audio corresponding to the output audio data using the calculated gain to determine the initial output audio volume at the device. The gain should ensure that the output audio is at a sufficient volume by the time it reaches the user 5.
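To make step 140 concrete, the following sketch (not taken from the patent; the function name, the free-field 20*log10 distance model, and all numeric values are assumptions for illustration) computes a playback gain in dB so that the output audio arrives at the user's location a chosen margin above the noise level there.

```python
import math

def output_gain_db(noise_at_user_db, user_distance_m,
                   reference_level_db, reference_distance_m=1.0,
                   desired_snr_db=10.0):
    """Sketch of step 140: choose a gain (in dB) for output audio.

    Assumes free-field spreading, i.e. level drops ~20*log10(d/d_ref)
    between the device and the user, and that the device plays audio at
    `reference_level_db` SPL measured at `reference_distance_m` when the
    gain is 0 dB. Both assumptions are illustrative, not the patent's.
    """
    # Level the unadjusted output would have once it reaches the user.
    attenuation_db = 20.0 * math.log10(user_distance_m / reference_distance_m)
    level_at_user_db = reference_level_db - attenuation_db

    # Level we want at the user: the noise level there plus a margin
    # (a "volume relative to the noise at the user's location").
    target_at_user_db = noise_at_user_db + desired_snr_db

    # Gain needed to close the gap (in practice it would also be clamped
    # to the device's physical output limits).
    return target_at_user_db - level_at_user_db

# Example: 55 dB SPL of noise at a user 3 m away, device producing
# 65 dB SPL at 1 m with unity gain, 10 dB desired margin.
print(round(output_gain_db(55.0, 3.0, 65.0), 1))  # ~9.5 dB of extra gain
```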
As illustrated in FIG. 2, a device 110 may include, among other components, a microphone array 210, an output audio speaker 116, a beamformer component 1160 (as illustrated in FIG. 11), or other components. While illustrated in two dimensions, the microphone array 210 may be angled or otherwise configured in a three-dimensional configuration, with different elevations of the individual microphones or the like. The microphone array 210 may include a number of different individual microphones. As illustrated in FIG. 2, the array 210 includes eight (8) microphones 202. The individual microphones may capture sound and pass the resulting audio signal created by the sound to a downstream component, such as an analysis filterbank. Each individual piece of audio data captured by a microphone may be in a time domain. To isolate audio from a particular direction, the system may compare the audio data (or audio signals related to the audio data, such as audio signals in a sub-band domain) to determine a time difference of detection of a particular segment of audio data. If the audio data for a first microphone includes the segment of audio data earlier in time than the audio data for a second microphone, then the system may determine that the source of the audio that resulted in the segment of audio data may be located closer to the first microphone than to the second microphone (which resulted in the audio being detected by the first microphone before being detected by the second microphone).
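The time-difference comparison described above can be illustrated with a simple cross-correlation between two microphone channels. This is only a sketch of the general idea; the function name, the use of NumPy, and the sample values are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def tdoa_seconds(mic_a, mic_b, sample_rate_hz):
    """Sketch: estimate the time difference of arrival between two
    microphone channels via the peak of their cross-correlation.

    Returns a positive value when the audio reaches mic_a first,
    i.e. the source is likely closer to mic_a.
    """
    mic_a = np.asarray(mic_a, dtype=float) - np.mean(mic_a)
    mic_b = np.asarray(mic_b, dtype=float) - np.mean(mic_b)
    correlation = np.correlate(mic_a, mic_b, mode="full")
    # Output index (len(mic_b) - 1) corresponds to zero lag.
    lag_samples = np.argmax(correlation) - (len(mic_b) - 1)
    # mic_b lagging behind mic_a shows up as a negative lag here,
    # so flip the sign to report "mic_a earlier" as positive.
    return -lag_samples / sample_rate_hz

# Example: the same burst reaches mic_b five samples after mic_a.
rng = np.random.default_rng(0)
burst = rng.standard_normal(256)
mic_a = np.concatenate([burst, np.zeros(5)])
mic_b = np.concatenate([np.zeros(5), burst])
print(tdoa_seconds(mic_a, mic_b, 16000))  # ~ 5 / 16000 s
```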
Using such direction isolation/beamforming techniques, a device 110 may isolate directionality of audio sources. As shown in FIG. 3, a particular direction may be associated with a particular microphone of a microphone array. For example, direction 1 is associated with one microphone, direction 2 is associated with another microphone 202, and so on. If audio is detected first by a particular microphone, the device 110 may determine that the source of the audio is associated with the direction of the microphone. Further, the device 110 may use other techniques for determining a source direction of particular audio, including determining what microphone detected the audio with a largest amplitude (which in turn may result in a highest strength of the audio signal portion corresponding to the audio). Other techniques (either in the time domain or in the sub-band domain) may also be used.

For example, if audio data corresponding to a user's speech is first detected and/or is most strongly detected by a microphone associated with direction 7, the device may determine that the user is located in a location in direction 7. The device may isolate audio coming from direction 7 using techniques known to the art. Thus, the device 110 may boost audio coming from direction 7, thus increasing the amplitude of audio data corresponding to speech from user 5 relative to other audio captured from other directions. In this manner, noise from noise sources that is coming from all the other directions will be dampened relative to the desired audio (e.g., speech from user 5) coming from direction 7. Further, the device may engage in other filtering/beamforming operations to determine a direction associated with incoming audio, where the direction may not necessarily be associated with a unique microphone. Thus a device may be capable of having a number of detectable directions that may not be exactly the same as the number of microphones. Using beamforming and other audio isolation techniques, the device 110 may isolate the desired audio (e.g., the utterance) from other audio (e.g., noise) for purposes of the operations described herein.
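One elementary way to realize the directional boost described above is delay-and-sum beamforming: each channel is time-aligned for a chosen look direction and the aligned channels are averaged, reinforcing audio from that direction relative to others. The sketch below assumes the per-channel delays for the look direction are already known (for example, from the array geometry); it is illustrative only and not the patent's beamformer.

```python
import numpy as np

def delay_and_sum(channels, delays_samples):
    """Sketch of a delay-and-sum beamformer.

    channels: list of equal-length 1-D arrays, one per microphone.
    delays_samples: integer delay to apply to each channel so that audio
        from the chosen look direction lines up across channels.
    """
    length = len(channels[0])
    aligned = np.zeros((len(channels), length))
    for i, (signal, delay) in enumerate(zip(channels, delays_samples)):
        if delay >= 0:
            aligned[i, delay:] = signal[:length - delay]
        else:
            aligned[i, :length + delay] = signal[-delay:]
    # Averaging reinforces the aligned (look-direction) audio and
    # partially cancels audio arriving from other directions.
    return aligned.mean(axis=0)

# Example: look-direction audio reaches mic 0 two samples before mic 1,
# so mic 0 is delayed by 2 samples to line the channels up.
rng = np.random.default_rng(1)
s = rng.standard_normal(128)
mic0 = np.concatenate([s, np.zeros(2)])   # hears it first
mic1 = np.concatenate([np.zeros(2), s])   # hears it two samples later
boosted = delay_and_sum([mic0, mic1], [2, 0])
print(np.corrcoef(boosted[2:], s)[0, 1])  # close to 1.0
```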
Once input audio 11 is captured, input audio data 111 corresponding to the input audio may be sent to a remote device, such as server 120, for further processing. FIG. 4 is a conceptual diagram of how a spoken utterance is processed. The various components illustrated may be located on a same or different physical devices. Communication between various components illustrated in FIG. 4 may occur directly or across a network(s) 199. An audio capture component, such as the microphone 202 of the speech-controlled device 110 (or other device), captures input audio 11 corresponding to a spoken utterance. The device 110, using a wakeword detection component 420, then processes the audio, or audio data corresponding to the audio, to determine if a keyword (such as a wakeword) is detected in the audio. Following detection of a wakeword, the speech-controlled device 110 sends audio data 111, corresponding to the input audio 11, to a server(s) 120 that includes an ASR component 450. The audio data 111 may be output from an acoustic front end (AFE) 456 located on the speech-controlled device 110 prior to transmission. Alternatively, the audio data 111 may be in a different form for processing by a remote AFE 456, such as the AFE 456 located with the ASR component 450.

The wakeword detection component 420 works in conjunction with other components of the speech-controlled device 110, for example the microphone 202, to detect keywords in audio 11. For example, the speech-controlled device 110 may convert audio 11 into audio data, and process the audio data with the wakeword detection component 420 to determine whether speech is detected, and if so, if the audio data comprising speech matches an audio signature and/or model corresponding to a particular keyword.
The speech-controlled device 110 may use various techniques to determine whether audio data includes speech. Some embodiments may apply voice activity detection (VAD) techniques. Such techniques may determine whether speech is present in an audio input based on various quantitative aspects of the audio input, such as the spectral slope between one or more frames of the audio input; the energy levels of the audio input in one or more spectral bands; the signal-to-noise ratios of the audio input in one or more spectral bands; or other quantitative aspects. In other embodiments, the speech-controlled device 110 may implement a limited classifier configured to distinguish speech from background noise. The classifier may be implemented by techniques such as linear classifiers, support vector machines, and decision trees. In still other embodiments, Hidden Markov Model (HMM) or Gaussian Mixture Model (GMM) techniques may be applied to compare the audio input to one or more acoustic models in speech storage, which acoustic models may include models corresponding to speech, noise (such as environmental noise or background noise), or silence. Still other techniques may be used to determine whether speech is present in the audio input.
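The frame-level quantities listed above (energy, band levels, signal-to-noise ratio) lend themselves to a very small detector. The following sketch shows only an energy-threshold VAD with invented parameter values; it is not the classifier, HMM, or GMM approaches the passage also mentions.

```python
import numpy as np

def simple_energy_vad(samples, sample_rate_hz, frame_ms=20,
                      snr_threshold_db=12.0):
    """Sketch of an energy-based voice activity detector.

    Marks a frame as speech when its energy exceeds a running estimate
    of the noise floor by `snr_threshold_db`. Values are illustrative.
    """
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    if not frames:
        return []
    frame_db = [10.0 * np.log10(np.mean(f ** 2) + 1e-12) for f in frames]
    # Seed the noise-floor estimate from the first frame, then track it
    # slowly during frames judged to be non-speech.
    noise_db = frame_db[0]
    decisions = []
    for level in frame_db:
        is_speech = level > noise_db + snr_threshold_db
        if not is_speech:
            noise_db = 0.95 * noise_db + 0.05 * level
        decisions.append(is_speech)
    return decisions

# Example: half a second of quiet noise followed by a louder burst.
fs = 16000
rng = np.random.default_rng(2)
audio = np.concatenate([0.01 * rng.standard_normal(fs // 2),
                        0.2 * rng.standard_normal(fs // 2)])
flags = simple_energy_vad(audio, fs)
print(sum(flags[:25]), sum(flags[25:]))  # few (ideally 0) vs. ~25 speech frames
```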
Once speech is detected in the audio captured by the speech-controlled device 110, the speech-controlled device 110 may use the wakeword detection component 420 to perform wakeword detection to determine when a user intends to speak a query to the speech-controlled device 110. This process may also be referred to as keyword detection, with the wakeword being a specific example of a keyword. Specifically, keyword detection is typically performed without performing linguistic analysis, textual analysis, or semantic analysis. Instead, incoming audio (or audio data) is analyzed to determine if specific characteristics of the audio match preconfigured acoustic waveforms, audio signatures, or other data to determine if the incoming audio "matches" stored audio data corresponding to a keyword.

Thus, the wakeword detection component 420 may compare audio data to stored models or data to detect a wakeword. One approach for wakeword detection applies general large vocabulary continuous speech recognition (LVCSR) systems to decode the audio signals, with wakeword searching conducted in the resulting lattices or confusion networks. LVCSR decoding may require relatively high computational resources. Another approach for wakeword spotting builds HMMs for each key wakeword word and non-wakeword speech signals, respectively. The non-wakeword speech includes other spoken words, background noise, etc. There can be one or more HMMs built to model the non-wakeword speech characteristics, which are named filler models. Viterbi decoding is used to search the best path in the decoding graph, and the decoding output is further processed to make the decision on keyword presence. This approach can be extended to include discriminative information by incorporating a hybrid DNN-HMM decoding framework. In another embodiment, the wakeword spotting system may be built on deep neural network (DNN)/recursive neural network (RNN) structures directly, without HMM involved. Such a system may estimate the posteriors of wakewords with context information, either by stacking frames within a context window for DNN, or using RNN. Following-on posterior threshold tuning or smoothing is applied for decision making. Other techniques for wakeword detection, such as those known in the art, may also be used.
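The last approach above, a DNN or RNN that emits per-frame wakeword posteriors followed by smoothing and a threshold, can be illustrated independently of any particular network. In the sketch below the posteriors are simply a made-up array standing in for the network output; only the smoothing-and-threshold decision stage is shown, and the names and values are assumptions.

```python
import numpy as np

def wakeword_decision(posteriors, window=30, threshold=0.8):
    """Sketch of posterior smoothing and thresholding for keyword spotting.

    posteriors: per-frame probabilities that the wakeword is present,
        as produced by some DNN/RNN acoustic model (not shown here).
    window: number of frames averaged when smoothing.
    threshold: smoothed posterior above which the wakeword is declared.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    smoothed = np.convolve(posteriors, np.ones(window) / window, mode="same")
    hits = np.flatnonzero(smoothed > threshold)
    if hits.size == 0:
        return None               # no wakeword detected
    return int(hits[0])           # frame index where the detection fires

# Example with synthetic posteriors: low everywhere except a burst of
# high probability in the middle, where the detection should fire.
frames = np.full(200, 0.05)
frames[90:130] = 0.95
print(wakeword_decision(frames))  # fires somewhere around frame 90-110
```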
Once the wakeword is detected, the speech-controlled device 110 may "wake" and begin transmitting audio data 111 corresponding to input audio 11 to the server(s) 120 for speech processing. The audio data 111 may be sent to the server(s) 120 for routing to a recipient device or may be sent to the server(s) 120 for speech processing for interpretation of the included speech (either for purposes of enabling voice communications and/or for purposes of executing a command in the speech). The audio data 111 may include data corresponding to the wakeword, or the portion of the audio data 111 corresponding to the wakeword may be removed by the speech-controlled device 110 prior to sending.

Upon receipt by the server(s) 120, an ASR module 450 may convert the audio data 111 into text. The ASR module 450 transcribes the audio data 111 into text data representing words of speech contained in the audio data 111. The text data may then be used by other components for various purposes, such as executing system commands, inputting data, etc. A spoken utterance in the audio data 111 is input to a processor configured to perform ASR, which then interprets the spoken utterance based on a similarity between the spoken utterance and pre-established language models 454 stored in an ASR model knowledge base (i.e., ASR model storage 452). For example, the ASR module 450 may compare the audio data 111 with models for sounds (e.g., subword units or phonemes) and sequences of sounds to identify words that match the sequence of sounds spoken in the spoken utterance of the audio data 111.

The different ways a spoken utterance may be interpreted (i.e., the different hypotheses) may each be assigned a probability or a confidence score representing a likelihood that a particular set of words matches those spoken in the spoken utterance. The confidence score may be based on a number of factors including, for example, a similarity of the sound in the spoken utterance to models for language sounds (e.g., an acoustic model 453 stored in the ASR model storage 452), and a likelihood that a particular word that matches the sound would be included in the sentence at the specific location (e.g., using a language model 454 stored in the ASR model storage 452). Thus, each potential textual interpretation of the spoken utterance (i.e., hypothesis) is associated with a confidence score. Based on the considered factors and the assigned confidence score, the ASR module 450 outputs the most likely text recognized in the audio data 111. The ASR module 450 may also output multiple hypotheses in the form of a lattice or an N-best list, with each hypothesis corresponding to a confidence score or other score (e.g., such as probability scores, etc.).
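As a toy illustration of how acoustic and language-model evidence might be combined into the per-hypothesis scores described above, the sketch below ranks a small hand-written N-best list. The log-probability values and the weighting are invented for illustration; real decoders score lattices of much finer-grained units.

```python
def rank_hypotheses(hypotheses, language_model_weight=0.8):
    """Sketch: combine acoustic and language-model log-probabilities into
    one confidence-like score per hypothesis and sort best-first.
    """
    scored = []
    for text, acoustic_logprob, lm_logprob in hypotheses:
        total = acoustic_logprob + language_model_weight * lm_logprob
        scored.append((text, total))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Invented example N-best list: (text, acoustic log-prob, LM log-prob).
n_best = [
    ("turn the volume up",  -12.1, -4.0),
    ("turn the volume cup", -11.9, -9.5),
    ("tern the volume up",  -14.0, -7.2),
]
for text, score in rank_hypotheses(n_best):
    print(f"{score:7.2f}  {text}")
# The well-formed sentence wins despite a slightly worse acoustic score.
```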
The device or devices including the ASR module 450 may include an AFE 456 and a speech recognition engine 458. The AFE 456 may transform raw audio data captured by the microphone 202 into data for processing by the speech recognition engine 458. The speech recognition engine 458 compares the speech recognition data with acoustic models 453, language models 454, and other data models and information for recognizing the speech conveyed in the audio data 111.

The speech recognition engine 458 may process data output from the AFE 456 with reference to information stored in the ASR model storage 452. Alternatively, post-front-end processed data (e.g., feature vectors) may be received by the device executing ASR processing from another source besides the internal AFE 456. For example, the speech-controlled device 110 may process audio data 111 into feature vectors (e.g., using an on-device AFE 456) and transmit that information to the server 120 across the network 199 for ASR processing. Feature vectors may arrive at the server 120 encoded, in which case they may be decoded prior to processing by the processor executing the speech recognition engine 458.

The speech recognition engine 458 attempts to match received feature vectors to language phonemes and words as known in the stored acoustic models 453 and language models 454. The speech recognition engine 458 computes recognition scores for the feature vectors based on acoustic information and language information. The acoustic information is used to calculate an acoustic score representing a likelihood that the intended sound represented by a group of feature vectors matches a language phoneme. The language