(12) United States Patent — Petit et al.
(10) Patent No.: US 8,321,213 B2
(45) Date of Patent: *Nov. 27, 2012
(54) ACOUSTIC VOICE ACTIVITY DETECTION (AVAD) FOR ELECTRONIC SYSTEMS

(75) Inventors: Nicolas Petit, San Francisco, CA (US); Gregory Burnett, Dodge Center, MN (US); Zhinian Jing, San Francisco, CA (US)

(73) Assignee: Aliphcom, Inc., San Francisco, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 540 days. This patent is subject to a terminal disclaimer.

(21) Appl. No.: 12/606,146
(22) Filed: Oct. 26, 2009

(65) Prior Publication Data
     US 2010/0128894 A1      May 27, 2010

Related U.S. Application Data

(63) Continuation-in-part of application No. 12/139,333, filed on Jun. 13, 2008, and a continuation-in-part of application No. 11/805,987, filed on May 25, 2007, now abandoned.

(60) Provisional application No. 61/108,426, filed on Oct. 24, 2008.
(51) Int. Cl.
     G10L 11/06      (2006.01)

(52) U.S. Cl. .......................... 704/208; 704/214

(58) Field of Classification Search .... 704/208, 704/210, 214, 215; 381/99, 100, 46
     See application file for complete search history.

(56) References Cited

     U.S. PATENT DOCUMENTS

     5,459,814 A *    10/1995  Gupta et al. ....... 704/233
     7,171,357 B2*     1/2007  Boland ............. 704/231
     7,246,058 B2*     7/2007  Burnett ............ 704/226
     7,464,029 B2*    12/2008  Visser et al. ...... 704/210
     8,019,091 B2*     9/2011  Burnett et al. ..... 381/71.8
     2009/0089053 A1*  4/2009  Wang et al. ........ 704/233

     * cited by examiner

Primary Examiner — Abul Azad
(74) Attorney, Agent, or Firm — Kokka & Backus, PC
(57) ABSTRACT

Acoustic Voice Activity Detection (AVAD) methods and systems are described. The AVAD methods and systems, including corresponding algorithms or programs, use microphones to generate virtual directional microphones which have very similar noise responses and very dissimilar speech responses. The ratio of the energies of the virtual microphones is then calculated over a given window size and the ratio can then be used with a variety of methods to generate a VAD signal. The virtual microphones can be constructed using either an adaptive or a fixed filter.

42 Claims, 35 Drawing Sheets
[Representative front-page figure: the flow diagram of FIG. 5 (process 500, steps 502 through 510), reproduced on Sheet 3 of the drawings.]
[Drawing Sheets 1-35 of US 8,321,213 B2 (Nov. 27, 2012). The sheets are images; only the following captions, flowchart text, and axis labels are recoverable:

Sheet 1: FIG. 2. Sheet 2: FIG. 3.
Sheet 3: FIG. 5, flow diagram 500. 502: Forming first virtual microphone by combining first signal of first physical microphone and second signal of second physical microphone. 504: Forming filter that describes relationship for speech between first physical microphone and second physical microphone. 506: Forming second virtual microphone by applying filter to first signal to generate first intermediate signal, and summing first intermediate signal and second signal. 508: Generating energy ratio of energies of first virtual microphone and second virtual microphone. 510: Detecting acoustic voice activity of speaker when energy ratio is greater than threshold value.
Sheet 4: FIG. 6, V1 (top) and V2 (bottom) for fixed beta in noise only; x-axis: time (sec).
Sheet 5: FIG. 7, V1 (top) and V2 (bottom) for fixed beta for speech only; x-axis: time (sec).
Sheet 6: FIG. 8, V1 (top) and V2 (bottom) for fixed beta for speech in noise; x-axis: time (sec).
Sheet 7: FIG. 9, V1 (top) and V2 (bottom) for adaptive beta in noise only; x-axis: time (sec).
Sheet 8: FIG. 10, V1 (top) and V2 (bottom) for adaptive beta for speech only; x-axis: time (sec).
Sheet 9: FIG. 11, V1 (top) and V2 (bottom) for adaptive beta for speech in noise; x-axis: time (sec).
Sheet 10: FIG. 12, block diagrams 1230 with Sensors, Processor, Detection, and Denoising blocks (the sheet carries two such diagrams; per the brief description these are the NAVSAD and PSAD systems of FIGS. 12 and 13).
Sheet 11: noise-removal block diagram (FIG. 14 per the brief description) with signal s(n), noise n(n), and a cleaned-speech output.
Sheet 12: FIG. 15, flow diagram 1250 of the detection algorithm. Legible flowchart text: Constants: V = 0 if noise, 1 if UV, 2 if V; VTC = voiced threshold for corr; VTS = voiced threshold for std. dev.; ff = forgetting factor for std. dev.; num_ma = # of taps in m.a. filter; UV_ma = UV std dev m.a. thresh; UV_std = UV std dev threshold; UV = binary values denoting UV detected in each subband; num_begin = # win at "beginning". Variables: bh1 = LMS calc of MIC 1-2 TF; keep_old = 1 if last win V/UV, 0 otherwise; sd_ma_vector = last NV sd values; sd_ma = m.a. of the last NV sd. NAVSAD path: read 20 msec data from m1, m2, gems; calculate XCORR of m1, gems; step 10 msec; calc mean(abs(XCORR)) = MC; calc STD DEV of gems = GSD; is MC > VTC and GSD > VTS? If so, V(window) = 2. PSAD path: UV = [0,0]; filter m1 and m2 into 2 bands, 1500-2500 and 2500-3500 Hz; calculate bh1 using Pathfinder for each subband; new_sum = sum(abs(bh1)); if not keep_old or at beginning, add new_sum to new_sum_vector (ff numbers long); new_std = STD DEV of new_sum_vector; is new_std > UV_ma*sd_ma and new_std > UV_sd, or are we at the beginning? If so, UV(subband) = 2, bh1_old = bh1, keep_old = 0; otherwise bh1 = bh1_old, keep_old = 1; if not keep_old or at beginning, shift sd_ma_vector to right, replace first value in sd_ma_vector with old_std, and filter sd_ma_vector with moving-average filter to get sd_ma; old_std = new_std; after both subbands checked, is CEIL(SUM(UV)/2) = 1?
Sheet 13: FIG. 16A, Gems and Mean Correlation; FIG. 16B, Gems and Standard Deviation; x-axes 0 to 4.
Sheet 14: FIG. 17, plot 1700 with Voicing, Noise, and Acoustic (1706) traces.
Sheet 15: FIG. 18, linear array with midline.
Sheet 16: FIG. 19, plot 1900, d1 versus delta M for delta d = 1, 2, 3, 4 cm.
Sheet 17: FIG. 20, plot 2000, acoustic data (solid) and gain parameter (dashed); x-axis: time (samples).
Sheet 18: FIG. 21, plot 2100, Mic 1 and V for "pop pan" in \headmic\micgemsp1.bin; voicing signal, audio signal 2104, gems signal 2106, unvoiced level, not voiced; x-axis: time (samples).
Sheet 19: FIG. 22, system 2200 with SIGNAL s(n), NOISE n(n), Noise Removal, Voicing Information 2204, and a cleaned-speech output.
Sheets 20-22: figures without recoverable text (per the brief description, FIGS. 23-26).
Sheet 23: FIG. 27, head-worn device diagram with element 702.
Sheet 24: FIG. 28, flow diagram 2800. 2802: Receive acoustic signals at a first physical microphone and a second physical microphone. 2804: Output first microphone signal from first physical microphone and second microphone signal from second physical microphone. 2806: Form first virtual microphone using first combination of first microphone signal and second microphone signal. 2808: Form second virtual microphone using second combination of first microphone signal and second microphone signal. 2810: Generate denoised output signals having less acoustic noise than received acoustic signals. FIG. 29, flow diagram 2900. 2902: Form physical microphone array including first physical microphone and second physical microphone. 2904: Form virtual microphone array including first virtual microphone and second virtual microphone using signals from physical microphone array.
Sheet 25: FIGS. 30-31, linear response of V2 to a speech source at 0.10 meters; linear response of V2 to a noise source at 1 meters.
Sheet 26: FIGS. 32-33, linear response of V1 to a speech source at 0.10 meters; linear response of V1 to a noise source at 1 meters.
Sheet 27: FIG. 34, linear response of V1 to a speech source at 0.1 meters (labeled in Hz).
Sheet 28: FIG. 35, frequency response at 0 degrees; cardioid speech response and V2 speech response; y-axis: Response (dB); x-axis: Frequency (Hz), 1000 to 8000.
Sheet 29: FIG. 36, V1 (top, dashed) and V2 speech response vs. B assuming ds = 0.1 m; FIG. 37, V1/V2 for speech (dB) versus B assuming ds = 0.1 m.
Sheet 30: FIG. 38, B factor vs. actual ds assuming ds = 0.1 m and theta = 0; x-axis: actual ds (meters), 0.05 to 0.5. FIG. 39, B versus theta assuming ds = 0.1 m; x-axis: theta (degrees), -80 to 80.
Sheet 31: FIG. 40, amplitude (dB, top) and phase (degrees, bottom) versus frequency (Hz), 0 to 8000.
Sheet 32: FIG. 41, N(s) for B = 1.2 and D = -7.2e-006 seconds; amplitude (dB) and phase (degrees) versus frequency (Hz), 0 to 8000.
Sheet 33: FIG. 42, cancellation with d1 = 1, theta1 = 0, d2 = 1, and theta2 = 30; amplitude (dB) and phase (degrees) versus frequency (Hz), 0 to 8000.
Sheet 34: FIG. 43, cancellation with d1 = 1, theta1 = 0, d2 = 1, and theta2 = 45; amplitude (dB) and phase (degrees) versus frequency (Hz), 0 to 8000.
Sheet 35: FIG. 44, original V1 (top) and cleaned V1 (bottom) with simplified VAD (dashed) in noise; traces labeled Noisy and Cleaned; x-axis: Time (samples at 8 kHz/sec), 0 to 2.5.]
ACOUSTIC VOICE ACTIVITY DETECTION (AVAD) FOR ELECTRONIC SYSTEMS

RELATED APPLICATIONS

This application claims the benefit of U.S. Patent Application No. 61/108,426, filed Oct. 24, 2008.

This application is a continuation in part of U.S. patent application Ser. No. 11/805,987, filed May 25, 2007.

This application is a continuation in part of U.S. patent application Ser. No. 12/139,333, filed Jun. 13, 2008.

TECHNICAL FIELD

The disclosure herein relates generally to noise suppression. In particular, this disclosure relates to noise suppression systems, devices, and methods for use in acoustic applications.

BACKGROUND

The ability to correctly identify voiced and unvoiced speech is critical to many speech applications including speech recognition, speaker verification, noise suppression, and many others. In a typical acoustic application, speech from a human speaker is captured and transmitted to a receiver in a different location. In the speaker's environment there may exist one or more noise sources that pollute the speech signal, the signal of interest, with unwanted acoustic noise. This makes it difficult or impossible for the receiver, whether human or machine, to understand the user's speech.

Typical methods for classifying voiced and unvoiced speech have relied mainly on the acoustic content of single microphone data, which is plagued by problems with noise and the corresponding uncertainties in signal content. This is especially problematic with the proliferation of portable communication devices like mobile telephones. There are methods known in the art for suppressing the noise present in the speech signals, but these normally require a robust method of determining when speech is being produced. Non-acoustic methods have been employed successfully in commercial products such as the Jawbone headset produced by Aliphcom, Inc., San Francisco, Calif. (Aliph), but an acoustic-only solution is desired in some cases (e.g., for reduced cost, as a supplement to the non-acoustic sensor, etc.).

INCORPORATION BY REFERENCE

Each patent, patent application, and/or publication mentioned in this specification is herein incorporated by reference in its entirety to the same extent as if each individual patent, patent application, and/or publication was specifically and individually indicated to be incorporated by reference.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a configuration of a two-microphone array with speech source S, under an embodiment.

FIG. 2 is a block diagram of V2 construction using a fixed β(z), under an embodiment.

FIG. 3 is a block diagram of V2 construction using an adaptive β(z), under an embodiment.

FIG. 4 is a block diagram of V1 construction, under an embodiment.

FIG. 5 is a flow diagram of acoustic voice activity detection, under an embodiment.

FIG. 6 shows experimental results of the algorithm using a fixed beta when only noise is present, under an embodiment.

FIG. 7 shows experimental results of the algorithm using a fixed beta when only speech is present, under an embodiment.

FIG. 8 shows experimental results of the algorithm using a fixed beta when speech and noise is present, under an embodiment.

FIG. 9 shows experimental results of the algorithm using an adaptive beta when only noise is present, under an embodiment.

FIG. 10 shows experimental results of the algorithm using an adaptive beta when only speech is present, under an embodiment.

FIG. 11 shows experimental results of the algorithm using an adaptive beta when speech and noise is present, under an embodiment.

FIG. 12 is a block diagram of a NAVSAD system, under an embodiment.

FIG. 13 is a block diagram of a PSAD system, under an embodiment.

FIG. 14 is a block diagram of a denoising subsystem, referred to herein as the Pathfinder system, under an embodiment.

FIG. 15 is a flow diagram of a detection algorithm for use in detecting voiced and unvoiced speech, under an embodiment.

FIGS. 16A, 16B, and 17 show data plots for an example in which a subject twice speaks the phrase "pop pan", under an embodiment.

FIG. 16A plots the received GEMS signal for this utterance along with the mean correlation between the GEMS signal and the Mic 1 signal and the threshold T1 used for voiced speech detection, under an embodiment.

FIG. 16B plots the received GEMS signal for this utterance along with the standard deviation of the GEMS signal and the threshold T2 used for voiced speech detection, under an embodiment.

FIG. 17 plots voiced speech detected from the acoustic or audio signal, along with the GEMS signal and the acoustic noise; no unvoiced speech is detected in this example because of the heavy background babble noise, under an embodiment.

FIG. 18 is a microphone array for use under an embodiment of the PSAD system.

FIG. 19 is a plot of ΔM versus d1 for several Δd values, under an embodiment.

FIG. 20 shows a plot of the gain parameter as the sum of the absolute values of H1(z) and the acoustic data or audio from microphone 1, under an embodiment.

FIG. 21 is an alternative plot of acoustic data presented in FIG. 20, under an embodiment.

FIG. 22 is a two-microphone adaptive noise suppression system, under an embodiment.

FIG. 23 is a generalized two-microphone array (DOMA) including an array and speech source S configuration, under an embodiment.

FIG. 24 is a system for generating or producing a first order gradient microphone V using two omnidirectional elements O1 and O2, under an embodiment.

FIG. 25 is a block diagram for a DOMA including two physical microphones configured to form two virtual microphones V1 and V2, under an embodiment.

FIG. 26 is a block diagram for a DOMA including two physical microphones configured to form N virtual microphones V1 through VN, where N is any number greater than one, under an embodiment.
FIG. 27 is an example of a headset or head-worn device that includes the DOMA, as described herein, under an embodiment.

FIG. 28 is a flow diagram for denoising acoustic signals using the DOMA, under an embodiment.

FIG. 29 is a flow diagram for forming the DOMA, under an embodiment.

FIG. 30 is a plot of linear response of virtual microphone V2 with β = 0.8 to a 1 kHz speech source at a distance of 0.1 m, under an embodiment.

FIG. 31 is a plot of linear response of virtual microphone V2 with β = 0.8 to a 1 kHz noise source at a distance of 1.0 m, under an embodiment.

FIG. 32 is a plot of linear response of virtual microphone V1 with β = 0.8 to a 1 kHz speech source at a distance of 0.1 m, under an embodiment.

FIG. 33 is a plot of linear response of virtual microphone V1 with β = 0.8 to a 1 kHz noise source at a distance of 1.0 m, under an embodiment.

FIG. 34 is a plot of linear response of virtual microphone V1 with β = 0.8 to a speech source at a distance of 0.1 m for frequencies of 100, 500, 1000, 2000, 3000, and 4000 Hz, under an embodiment.

FIG. 35 is a plot showing comparison of frequency responses for speech for the array of an embodiment and for a conventional cardioid microphone, under an embodiment.

FIG. 36 is a plot showing speech response for V1 (top, dashed) and V2 (bottom, solid) versus B with ds assumed to be 0.1 m, under an embodiment.

FIG. 37 is a plot showing a ratio of V1/V2 speech responses shown in FIG. 36 versus B, under an embodiment.

FIG. 38 is a plot of B versus actual ds assuming that ds = 10 cm and theta = 0, under an embodiment.

FIG. 39 is a plot of B versus theta with d = 10 cm and assuming ds = 10 cm, under an embodiment.

FIG. 40 is a plot of amplitude (top) and phase (bottom) response of N(s) with B = 1 and D = -7.2 μsec, under an embodiment.

FIG. 41 is a plot of amplitude (top) and phase (bottom) response of N(s) with B = 1.2 and D = -7.2 μsec, under an embodiment.

FIG. 42 is a plot of amplitude (top) and phase (bottom) response of the effect on the speech cancellation in V2 due to a mistake in the location of the speech source with θ1 = 0 degrees and θ2 = 30 degrees, under an embodiment.

FIG. 43 is a plot of amplitude (top) and phase (bottom) response of the effect on the speech cancellation in V2 due to a mistake in the location of the speech source with θ1 = 0 degrees and θ2 = 45 degrees, under an embodiment.

FIG. 44 shows experimental results for a 2d0 = 19 mm array using a linear β of 0.83 and B1 = B2 = 1 on a Bruel and Kjaer Head and Torso Simulator (HATS) in a very loud (~85 dBA) music/speech noise environment.

DETAILED DESCRIPTION

Acoustic Voice Activity Detection (AVAD) methods and systems are described herein. The AVAD methods and systems, which include algorithms or programs, use microphones to generate virtual directional microphones which have very similar noise responses and very dissimilar speech responses. The ratio of the energies of the virtual microphones is then calculated over a given window size and the ratio can then be used with a variety of methods to generate a VAD signal. The virtual microphones can be constructed using either a fixed or an adaptive filter. The adaptive filter generally results in a more accurate and noise-robust VAD signal but requires training. In addition, restrictions can be placed on the filter to ensure that it is training only on speech and not on environmental noise.

In the following description, numerous specific details are introduced to provide a thorough understanding of, and enabling description for, embodiments. One skilled in the relevant art, however, will recognize that these embodiments can be practiced without one or more of the specific details, or with other components, systems, etc. In other instances, well-known structures or operations are not shown, or are not described in detail, to avoid obscuring aspects of the disclosed embodiments.
FIG. 1 is a configuration of a two-microphone array of the AVAD with speech source S, under an embodiment. The AVAD of an embodiment uses two physical microphones (O1 and O2) to form two virtual microphones (V1 and V2). The virtual microphones of an embodiment are directional microphones, but the embodiment is not so limited. The physical microphones of an embodiment include omnidirectional microphones, but the embodiments described herein are not limited to omnidirectional microphones. The virtual microphone (VM) V2 is configured in such a way that it has minimal response to the speech of the user, while V1 is configured so that it does respond to the user's speech but has a very similar noise magnitude response to V2, as described in detail herein. The PSAD VAD methods can then be used to determine when speech is taking place. A further refinement is the use of an adaptive filter to further minimize the speech response of V2, thereby increasing the speech energy ratio used in PSAD and resulting in better overall performance of the AVAD.

The PSAD algorithm as described herein calculates the ratio of the energies of two directional microphones M1 and M2:

    R = ( Σ_i M1(z_i)² ) / ( Σ_i M2(z_i)² )

where the "z_i" indicates the discrete frequency domain and "i" ranges from the beginning of the window of interest to the end, but the same relationship holds in the time domain. The summation can occur over a window of any length; 200 samples at a sampling rate of 8 kHz has been used to good effect. Microphone M1 is assumed to have a greater speech response than microphone M2. The ratio R depends on the relative strength of the acoustic signal of interest as detected by the microphones.
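Concretely, the windowed energy ratio can be computed directly in the time domain. The following Python sketch is illustrative only; the function name, the non-overlapping windowing, and the epsilon guard are assumptions, not from the patent:

```python
import numpy as np

def energy_ratio(m1: np.ndarray, m2: np.ndarray, window: int = 200) -> np.ndarray:
    """Ratio of microphone energies per non-overlapping window.

    m1, m2: time-domain samples from directional microphones M1 and M2
            (M1 is assumed to have the greater speech response).
    window: window length in samples; 200 samples at 8 kHz per the text.
    """
    n = min(len(m1), len(m2)) // window
    r = np.empty(n)
    eps = 1e-12  # guards against division by zero in silent windows
    for k in range(n):
        s = slice(k * window, (k + 1) * window)
        r[k] = np.sum(m1[s] ** 2) / (np.sum(m2[s] ** 2) + eps)
    return r
```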
For matched omnidirectional microphones (i.e. they have the same response to acoustic signals for all spatial orientations and frequencies), the size of R can be calculated for speech and noise by approximating the propagation of speech and noise waves as spherically symmetric sources. For these the energy of the propagating wave decreases as 1/r²:

    R = M1(z)² / M2(z)² = d2/d1 = (d1 + d)/d1

The distance d1 is the distance from the acoustic source to M1, d2 is the distance from the acoustic source to M2, and d = d2 - d1 (see FIG. 1). It is assumed that O1 is closer to the speech source (the user's mouth) so that d is always positive. If the microphones and the user's mouth are all on a line, then d = 2d0, the distance between the microphones.
For matched omnidirectional microphones, the magnitude of R depends only on the relative distance between the microphones and the acoustic source. For noise sources, the distances are typically a meter or more, and for speech sources, the distances are on the order of 10 cm, but the distances are not so limited. Therefore for a 2-cm array typical values of R are:

    R_S ≈ d2/d1 = 12 cm / 10 cm = 1.2
    R_N ≈ d2/d1 = 102 cm / 100 cm = 1.02

where the "S" subscript denotes the ratio for speech sources and "N" the ratio for noise sources. There is not a significant amount of separation between noise and speech sources in this case, and therefore it would be difficult to implement a robust solution using simple omnidirectional microphones.
A better implementation is to use directional microphones where the second microphone has minimal speech response. As described herein, such microphones can be constructed using omnidirectional microphones O1 and O2:

    V1(z) = -β(z)α(z)O2(z) + O1(z)z^(-γ)
    V2(z) = α(z)O2(z) - β(z)O1(z)z^(-γ)

where α(z) is a calibration filter used to compensate O2's response so that it is the same as O1, β(z) is a filter that describes the relationship between O1 and calibrated O2 for speech, and γ is a fixed delay that depends on the size of the array. There is no loss of generality in defining α(z) as above, as either microphone may be compensated to match the other. For this configuration V1 and V2 have very similar noise response magnitudes and very dissimilar speech response magnitudes if

    γ = d/c (seconds)

where again d = 2d0 and c is the speed of sound in air, which is temperature dependent and approximately

    c = 331.3 · sqrt(1 + T/273.15) m/sec

where T is the temperature of the air in Celsius.
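A minimal sketch of this construction follows, assuming FIR realizations of α(z) and β(z) and a delay γ rounded to whole samples (for a small array the true γ is a fraction of a sample, so a practical implementation would use a fractional-delay filter). All function names here are illustrative:

```python
import numpy as np
from scipy.signal import lfilter

def speed_of_sound(t_celsius: float) -> float:
    # c = 331.3 * sqrt(1 + T/273.15) m/s, per the text.
    return 331.3 * np.sqrt(1.0 + t_celsius / 273.15)

def form_virtual_mics(o1, o2, alpha_fir, beta_fir, gamma_samples):
    """Form V1 and V2 from omnidirectional signals o1, o2.

    alpha_fir: FIR taps of the calibration filter alpha(z) applied to O2.
    beta_fir:  FIR taps of the speech-relationship filter beta(z).
    gamma_samples: the fixed delay gamma, rounded to whole samples here.
    """
    o2_cal = lfilter(alpha_fir, [1.0], o2)  # alpha(z) O2(z)
    o1_del = np.concatenate(                # O1(z) z^-gamma
        [np.zeros(gamma_samples), o1[:len(o1) - gamma_samples]])
    v1 = -lfilter(beta_fir, [1.0], o2_cal) + o1_del  # -b(z)a(z)O2 + O1 z^-g
    v2 = o2_cal - lfilter(beta_fir, [1.0], o1_del)   # a(z)O2 - b(z)O1 z^-g
    return v1, v2
```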
The filter β(z) can be calculated using wave theory to be

    β(z) = d1/d2 = d1/(d1 + d)

where again d1 is the distance from the user's mouth to O1. FIG. 2 is a block diagram of V2 construction using a fixed β(z), under an embodiment. This fixed (or static) β works sufficiently well if the calibration filter α(z) is accurate and d1 and d2 are accurate for the user. This fixed-β algorithm, however, neglects important effects such as reflection, diffraction, poor array orientation (i.e. the microphones and the mouth of the user are not all on a line), and the possibility of different d1 and d2 values for different users.

The filter β(z) can also be determined experimentally using an adaptive filter. FIG. 3 is a block diagram of V2 construction using an adaptive β̃(z), under an embodiment, where:

    β̃(z) = α(z)O2(z) / (z^(-γ)O1(z))

The adaptive process varies β̃(z) to minimize the output of V2 when only speech is being received by O1 and O2. A small amount of noise may be tolerated with little ill effect, but it is preferred that only speech is being received when the coefficients of β̃(z) are calculated. Any adaptive process may be used; a normalized least-mean squares (NLMS) algorithm was used in the examples below.

The V1 can be constructed using the current value for β̃(z) or the fixed filter β(z) can be used for simplicity. FIG. 4 is a block diagram of V1 construction, under an embodiment.
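The patent leaves the adaptive process open, noting only that NLMS was used in its examples. A minimal NLMS sketch consistent with that description follows; the tap count, step size, and regularization constant are assumed values:

```python
import numpy as np

def nlms_adapt_beta(o1_delayed, o2_cal, n_taps=16, mu=0.5, eps=1e-8):
    """Adapt the FIR taps of beta~(z) to minimize V2 on speech-only data.

    o1_delayed: O1 already delayed by gamma; o2_cal: O2 already calibrated
    by alpha(z). Returns the adapted taps and the V2 (error) signal.
    """
    w = np.zeros(n_taps)                 # beta~(z) taps
    v2 = np.zeros(len(o2_cal))
    for n in range(n_taps - 1, len(o2_cal)):
        x = o1_delayed[n - n_taps + 1:n + 1][::-1]   # newest sample first
        v2[n] = o2_cal[n] - np.dot(w, x)             # V2 = a*O2 - b~*O1 z^-g
        w += mu * v2[n] * x / (np.dot(x, x) + eps)   # normalized LMS update
    return w, v2
```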
Now the ratio R is

    R = ||V1(z)|| / ||V2(z)||
      = ||(-β̃(z)α(z)O2(z) + O1(z)z^(-γ))|| / ||(α(z)O2(z) - β̃(z)O1(z)z^(-γ))||

where double bar indicates norm and again any size window may be used. If β̃(z) has been accurately calculated, the ratio for speech should be relatively high (e.g., greater than approximately 2) and the ratio for noise should be relatively low (e.g., less than approximately 1.1). The ratio calculated will depend on both the relative energies of the speech and noise as well as the orientation of the noise and the reverberance of the environment. In practice, either the adapted filter β̃(z) or the static filter β(z) may be used for V1(z) with little effect on R, but it is important to use the adapted filter β̃(z) in V2(z) for best performance. Many techniques known to those skilled in the art (e.g., smoothing, etc.) can be used to make R more amenable to use in generating a VAD and the embodiments herein are not so limited.

The ratio R can be calculated for the entire frequency band of interest, or can be calculated in frequency subbands. One effective subband discovered was 250 Hz to 1250 Hz, another was 200 Hz to 3000 Hz, but many others are possible and useful.

Once generated, the vector of the ratio R versus time (or the matrix of R versus time if multiple subbands are used) can be used with any detection system (such as one that uses fixed and/or adaptive thresholds) to determine when speech is occurring. While many detection systems and methods are known to exist by those skilled in the art and may be used, the method described herein for generating an R so that the speech is easily discernable is novel. It is important to note that the R does not depend on the type of noise or its orientation or frequency content; R simply depends on the V1 and V2 spatial response similarity for noise and spatial response dissimilarity for speech. In this way it is very robust and can operate smoothly in a variety of noisy acoustic environments.
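One possible detector built on R is sketched below; the fixed threshold, window length, and one-pole smoothing are illustrative choices only (the text permits fixed and/or adaptive thresholds and full-band or subband ratios):

```python
import numpy as np

def vad_from_ratio(v1, v2, window=200, threshold=2.0, smooth=0.9):
    """Binary VAD from windowed ||V1||/||V2|| ratios.

    threshold: fixed decision level; the text suggests speech ratios above
               ~2 and noise ratios below ~1.1 for a well-adapted beta~.
    smooth:    one-pole smoothing of R, one of many possible refinements.
    """
    n = min(len(v1), len(v2)) // window
    vad = np.zeros(n, dtype=bool)
    r_s = 1.0  # smoothed ratio state, initialized near the noise value
    for k in range(n):
        s = slice(k * window, (k + 1) * window)
        r = np.linalg.norm(v1[s]) / (np.linalg.norm(v2[s]) + 1e-12)
        r_s = smooth * r_s + (1.0 - smooth) * r
        vad[k] = r_s > threshold
    return vad
```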
FIG. 5 is a flow diagram of acoustic voice activity detection 500, under an embodiment. The detection comprises forming a first virtual microphone by combining a first signal of a first physical microphone and a second signal of a second physical microphone 502. The detection comprises forming a filter that describes a relationship for speech between the first physical microphone and the second physical microphone 504. The detection comprises forming a second virtual microphone by applying the filter to the first signal to generate a first intermediate signal, and summing the first intermediate signal and the second signal 506. The detection comprises generating an energy ratio of energies of the first virtual microphone and the second virtual microphone 508. The detection comprises detecting acoustic voice activity of a speaker when the energy ratio is greater than a threshold value 510.
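Read as code, the steps of FIG. 5 compose directly. The sketch below wires together the illustrative helpers from the earlier sketches; it is one reading of the flow diagram, not the patent's reference implementation:

```python
# Sketch wiring the FIG. 5 steps to the illustrative helpers above.
def acoustic_vad(o1, o2, alpha_fir, beta_fir, gamma_samples):
    # 502/506: form the first and second virtual microphones V1 and V2.
    v1, v2 = form_virtual_mics(o1, o2, alpha_fir, beta_fir, gamma_samples)
    # 504 is the formation of beta(z) itself; done adaptively, e.g.
    #   beta_fir, _ = nlms_adapt_beta(o1_delayed, o2_calibrated)
    # on speech-only data before this call.
    # 508/510: windowed energy ratio thresholded into voice activity.
    return vad_from_ratio(v1, v2, threshold=2.0)
```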
The accuracy of the adaptation to the β̃(z) of the system is a factor in determining the effectiveness of the AVAD. A more accurate adaptation to the actual β̃(z) of the system leads to lower energy of the speech response in V2, and a higher ratio R. The noise (far-field) magnitude response is largely unchanged by the adaptation process, so the ratio R will be near unity for accurately adapted beta. For purposes of accuracy, the system can be trained on speech alone, or the noise should be low enough in energy so as not to affect or to have a minimal effect on the training.

To make the training as accurate as possible, the coefficients of the filter β̃(z) of an embodiment are generally updated under the following conditions, but the embodiment is not so limited: speech is being produced (requires a relatively high SNR or other method of detection such as an Aliph Skin Surface Microphone (SSM) as described in U.S. patent application Ser. No. 10/769,302, filed Jan. 30, 2004, which is incorporated by reference herein in its entirety); no wind is detected (wind can be detected using many different methods known in the art, such as examining the microphones for uncorrelated low-frequency noise); and the current value of R is much larger than a smoothed history of R values (this ensures that training occurs only when strong speech is present).
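These three conditions amount to a gate on the coefficient update. A sketch follows, with hypothetical detector callbacks and an assumed numeric margin standing in for "much larger than":

```python
# Sketch of the update gate for the beta~ coefficients described above;
# detect_speech, detect_wind, and the 10x margin are illustrative stand-ins.
def should_update_beta(frame, r_current, r_history_smoothed,
                       detect_speech, detect_wind, margin=10.0):
    return (detect_speech(frame)           # e.g. high SNR, or an SSM sensor
            and not detect_wind(frame)     # e.g. uncorrelated low-freq noise
            and r_current > margin * r_history_smoothed)  # strong speech only
```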
These pr

the axis of the array. As described herein, and with reference to FIG. 1,

    γ = (d2 - d1)/c (seconds)
    d1 = sqrt(ds² - 2·ds·d0·cos(θ) + d0²)
    d2 = sqrt(ds² + 2·ds·d0·cos(θ) + d0²)

where ds is the distance from the midpoint of the array to the speech source. Varying ds from 10 to 15 cm and allowing θ to vary between 0 and ±30 degrees, the maximum difference in γ results from the difference of γ at 0 degrees (58.8 μsec) and γ at ±30 degrees for ds = 10 cm (50.8 μsec). This means that the maximum expected phase difference is 58.8 - 50.8 = 8.0 μsec, or 0.064 samples at an 8 kHz sampling rate. Since

    φ(f) = 2πf·t = 2πf·(8.0×10⁻⁶) rad

the maximum phase difference realized at 4 kHz is only 0.2 rad or about 11.4 degrees, a small amount, but not a negligible one. Therefore the β̃ filter should be almost linear phase, but some allowance should be made for differences in position and angle. In practice a slightly larger amount was used (0.071 samples at 8 kHz) in order to compensate for poor calibration and diffraction effects, and this worked well. The limit on the phase in the example below was implemented as the ratio of the central tap energy to the combined energy of the other taps:

    phase limit ratio = (center tap)² / ||β̃||²
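The delay figures quoted above can be reproduced from the stated geometry. The following sketch assumes a 2d0 = 2 cm array and c = 340 m/s (a room-temperature value that matches the 58.8 μsec figure):

```python
import math

C = 340.0   # speed of sound (m/s); assumed value matching the quoted 58.8 usec
D0 = 0.01   # half the microphone spacing: 2*d0 = 2 cm array
FS = 8000.0 # sampling rate (Hz)

def gamma(ds: float, theta_deg: float) -> float:
    """Delay (d2 - d1)/c in seconds for the FIG. 1 geometry."""
    th = math.radians(theta_deg)
    d1 = math.sqrt(ds**2 - 2 * ds * D0 * math.cos(th) + D0**2)
    d2 = math.sqrt(ds**2 + 2 * ds * D0 * math.cos(th) + D0**2)
    return (d2 - d1) / C

g0 = gamma(0.10, 0.0)    # ~58.8e-6 s at 0 degrees
g30 = gamma(0.10, 30.0)  # ~50.8e-6 s at +-30 degrees, ds = 10 cm
dg = g0 - g30            # ~8.0e-6 s
print(dg * FS)                  # ~0.064 samples at 8 kHz
print(2 * math.pi * 4000 * dg)  # ~0.2 rad (~11.4 degrees) at 4 kHz
```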
