(12) United States Patent
Petit et al.

(10) Patent No.: US 8,321,213 B2
(45) Date of Patent: *Nov. 27, 2012

(54) ACOUSTIC VOICE ACTIVITY DETECTION (AVAD) FOR ELECTRONIC SYSTEMS
`
`(75)
`
`Inventors: Nicolas Petit, San Francisco, CA (US);
`Gregory Burnett, Dodge Center, MN
`(US); Zhinian Jing, San Francisco, CA
`US(US)
`(73) Assignee: uy Inc., San Francisco, CA
`:
`:
`:
`:
`+.
`(*) Notice:
`Subject to any disclaimer, the term ofthis
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 540 days.
`oo
`This patent is subject to a terminal dis-
`claimer.
`
(21) Appl. No.: 12/606,146

(22) Filed: Oct. 26, 2009

(65) Prior Publication Data
     US 2010/0128894 A1     May 27, 2010

Related U.S. Application Data

(63) Continuation-in-part of application No. 12/139,333, filed on Jun. 13, 2008, and a continuation-in-part of application No. 11/805,987, filed on May 25, 2007, now abandoned.

(60) Provisional application No. 61/108,426, filed on Oct. 24, 2008.
`
`(51)
`
`(56)
`
`Int. Cl.
`(2006.01)
`GIOL 11/06
`(52) US. CL cc ceecnscneeescnenees 704/208; 704/214
`(58) Field of Classification Search .................. 704/208,
`704/210, 214, 215; 381/99, 100, 46
`See application file for complete search history.
`References Cited
`US. PATENT DOCUMENTS
`5,459,814 A * 10/1995 Guptaetal. oo. 704/233
`7,171,357 B2*
`1/2007 Boland .......
`.. 704/231
`
`7246058 B2*
`7/2007 Burnett _....
`304/296
`....
`7,464,029 B2* 12/2008 Visseret al.
`.. 704/210
`..
`8,019,091 B2*
`9/2011 Burnett etal.
`. 381/71.8
`2009/0089053 A1*
`4/2009 Wanget al. acccccccccsn 704/233
`* cited by examiner
`
`Primary Examiner — Abul Azad
`(74) Attorney, Agent, or Firm — Kokka & Backus, PC
`
(57) ABSTRACT

Acoustic Voice Activity Detection (AVAD) methods and systems are described. The AVAD methods and systems, including corresponding algorithms or programs, use microphones to generate virtual directional microphones which have very similar noise responses and very dissimilar speech responses. The ratio of the energies of the virtual microphones is then calculated over a given window size and the ratio can then be used with a variety of methods to generate a VAD signal. The virtual microphones can be constructed using either an adaptive or a fixed filter.
`
`42 Claims, 35 Drawing Sheets
`
`v0
`
`502
`
`504
`
`Formingfirst virtual microphone by combining
`first signal offirst physical microphone and
`secondsignal of second physical microphone.
`
`Formingfilter that describes relationship for
`speech betweenfirst physical microphone
`and second physical microphone.
`
`Forming secondvirtual microphone by
`applyingfilter to first signal to generate
`first intermediate signal, and summing
`first intermediate signal and second signal.
`
`energy ratio is greater than threshold value. 506
`
`Generating energy ratio of energies of first virtual
`microphone and second virtual microphone.
`
`508
`
`Detecting acoustic voice activity of speaker when
`
`
`
[Drawing sheets 1-35 (U.S. Patent, Nov. 27, 2012, US 8,321,213 B2). The drawings themselves are not reproducible here; the recoverable figure labels and captions are:

Sheet 1: FIG. 2.
Sheet 2: FIG. 3.
Sheet 3: FIG. 5 -- flow diagram 500: forming first virtual microphone by combining first signal of first physical microphone and second signal of second physical microphone (502); forming filter that describes relationship for speech between first physical microphone and second physical microphone (504); forming second virtual microphone by applying filter to first signal to generate first intermediate signal, and summing first intermediate signal and second signal (506); generating energy ratio of energies of first virtual microphone and second virtual microphone (508); detecting acoustic voice activity of speaker when energy ratio is greater than threshold value (510).
Sheet 4: FIG. 6 -- V1 (top) and V2 (bottom) for fixed beta in noise; time (sec).
Sheet 5: FIG. 7 -- V1 (top) and V2 (bottom) for fixed beta, speech only; time (sec).
Sheet 6: FIG. 8 -- V1 (top) and V2 (bottom) for fixed beta, speech in noise; time (sec).
Sheet 7: FIG. 9 -- V1 (top) and V2 (bottom) for adaptive beta in noise; time (sec).
Sheet 8: FIG. 10 -- V1 (top) and V2 (bottom) for adaptive beta, speech only; time (sec).
Sheet 9: FIG. 11 -- V1 (top) and V2 (bottom) for adaptive beta, speech in noise; time (sec).
Sheet 10: FIG. 12 -- NAVSAD system block diagram (1230): sensors, processor, detection and denoising subsystems.
Sheet 11: FIG. 13 -- PSAD system block diagram: noise n(n), noise removal, cleaned speech.
Sheet 12: FIG. 15 -- flow diagram 1250 of the detection algorithm. Constants: V = 0 if noise, 1 if UV, 2 if V; VTC = voiced threshold for corr; VTS = voiced threshold for std. dev.; ff = forgetting factor for std. dev.; num_ma = # of taps in m.a. filter; UV_ma = UV std dev m.a. thresh; UV_std = UV std dev threshold; UV = binary values denoting UV detected in each subband; num_begin = # win at beginning. Variables: bh1 = LMS calc of MIC 1-2 TF; keep_old = 1 if last win V/UV, 0 otherwise; sd_ma_vector = last NV sd values; sd_ma = m.a. of the last NV sd. NAVSAD branch: read 20 msec of data from m1, m2, gems; step 10 msec; calculate XCORR of m1, gems; calc mean(abs(XCORR)) = MC; calc STD DEV of gems = GSD; if MC > VTC and GSD > VTS, V(window) = 2. PSAD branch: UV = [0,0]; filter m1 and m2 into 2 bands, 1500-2500 and 2500-3500 Hz; calculate bh1 using Pathfinder for each subband; new_sum = sum(abs(bh1)); new_std = STD DEV of new_sum_vector; if new_std > UV_ma*sd_ma and new_std > UV_std, or at the beginning, UV(subband) = 2 and bh1_old = bh1, else bh1 = bh1_old; if not keep_old or at beginning, shift sd_ma_vector to right and replace first value with old_std; filter sd_ma_vector with moving average filter to get sd_ma; after both subbands checked, is CEIL(SUM(UV)/2) = 1?
Sheet 13: FIGS. 16A and 16B -- Gems and Mean Correlation (16A); Gems and Standard Deviation (16B); time 0-4.
Sheet 14: FIG. 17 -- 1700: voicing, acoustic, and noise signals (1706).
Sheet 15: FIG. 18 -- linear array and midline.
Sheet 16: FIG. 19 -- 1900: d1 versus delta M for delta d = 1, 2, 3, 4 cm.
Sheet 17: FIG. 20 -- 2000: acoustic data (solid) and gain parameter (dashed); time (samples).
Sheet 18: FIG. 21 -- 2100: Mic 1 and V for "pop pan" in \headmic\micgemsp1.bin; voicing signal, audio signal (2104), unvoiced level, GEMS signal (2106), not voiced; time (samples).
Sheet 19: FIG. 22 -- 2200: two-microphone adaptive noise suppression system; SIGNAL s(n) (2200), NOISE n(n) (2201), noise removal, voicing information (2204), cleaned speech.
Sheets 20-22: no text recoverable (FIGS. 23-26, per the brief description of the drawings).
Sheet 23: FIG. 27 -- headset (702).
Sheet 24: FIGS. 28 and 29 -- flow diagram 2800: receive acoustic signals at a first physical microphone and a second physical microphone (2802); output first microphone signal from first physical microphone and second microphone signal from second physical microphone (2804); form first virtual microphone using the first combination of first microphone signal and second microphone signal (2806); form second virtual microphone using second combination of first microphone signal and second microphone signal (2808); generate denoised output signals having less acoustic noise than received acoustic signals (2810). Flow diagram 2900: form physical microphone array including first physical microphone and second physical microphone (2902); form virtual microphone array including first virtual microphone and second virtual microphone using signals from physical microphone array (2904).
Sheet 25: FIGS. 30 and 31 -- linear response of V2 to a speech source at 0.10 meters; linear response of V2 to a noise source at 1 meters.
Sheet 26: FIGS. 32 and 33 -- linear response of V1 to a speech source at 0.10 meters; linear response of V1 to a noise source at 1 meters.
Sheet 27: FIG. 34 -- linear response of V1 to a speech source at 0.1 meters; frequency (Hz).
Sheet 28: FIG. 35 -- frequency response at 0 degrees: cardioid speech response and V1 speech response; response (dB) versus frequency (Hz), 1000-8000 Hz.
Sheet 29: FIGS. 36 and 37 -- V1 (top, dashed) and V2 speech response vs. B assuming ds = 0.1 m (dB); V1/V2 for speech versus B assuming ds = 0.1 m.
Sheet 30: FIGS. 38 and 39 -- B factor vs. actual ds assuming ds = 0.1 m and theta = 0; B versus theta assuming ds = 0.1 m (theta in degrees).
Sheet 31: FIG. 40 -- amplitude (dB) and phase (degrees) versus frequency (Hz), 0-8000 Hz.
Sheet 32: FIG. 41 -- N(s) for B = 1.2 and D = -7.2e-006 seconds; amplitude (dB) and phase (degrees) versus frequency (Hz), 0-8000 Hz.
Sheet 33: FIG. 42 -- cancellation with d1 = 1, theta1 = 0, d2 = 1, and theta2 = 30; amplitude (dB) and phase (degrees) versus frequency (Hz), 0-8000 Hz.
Sheet 34: FIG. 43 -- cancellation with d1 = 1, theta1 = 0, d2 = 1, and theta2 = 45; amplitude (dB) and phase (degrees) versus frequency (Hz), 0-8000 Hz.
Sheet 35: FIG. 44 -- original V1 (top) and cleaned V1 (bottom) with simplified VAD (dashed) in noise; noisy/cleaned; time (samples at 8 kHz/sec).]
`
ACOUSTIC VOICE ACTIVITY DETECTION (AVAD) FOR ELECTRONIC SYSTEMS

RELATED APPLICATIONS

This application claims the benefit of U.S. Patent Application No. 61/108,426, filed Oct. 24, 2008.

This application is a continuation in part of U.S. patent application Ser. No. 11/805,987, filed May 25, 2007.

This application is a continuation in part of U.S. patent application Ser. No. 12/139,333, filed Jun. 13, 2008.
`
`TECHNICAL FIELD
`
`The disclosure herein relates generally to noise suppres-
`sion. In particular, this disclosure relates to noise suppression
`systems, devices, and methods for use in acoustic applica-
`tions.
`
`BACKGROUND
`
`The ability to correctly identify voiced and unvoiced
`speech is critical to many speech applications including
`speech recognition, speaker verification, noise suppression,
`and many others. In a typical acoustic application, speech
`from a human speaker is captured and transmitted to a
`receiver in a different location. In the speaker’s environment
`there may exist one or more noise sources that pollute the
`speech signal, the signal of interest, with unwanted acoustic
`noise. This makes it difficult or impossible for the receiver,
`whether human or machine, to understand the user’s speech.
Typical methods for classifying voiced and unvoiced speech have relied mainly on the acoustic content of single microphone data, which is plagued by problems with noise and the corresponding uncertainties in signal content. This is especially problematic with the proliferation of portable communication devices like mobile telephones. There are methods known in the art for suppressing the noise present in the speech signals, but these normally require a robust method of determining when speech is being produced. Non-acoustic methods have been employed successfully in commercial products such as the Jawbone headset produced by Aliphcom, Inc., San Francisco, Calif. (Aliph), but an acoustic-only solution is desired in some cases (e.g., for reduced cost, as a supplement to the non-acoustic sensor, etc.).
`
`INCORPORATION BY REFERENCE
`
Each patent, patent application, and/or publication mentioned in this specification is herein incorporated by reference in its entirety to the same extent as if each individual patent, patent application, and/or publication was specifically and individually indicated to be incorporated by reference.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
FIG. 1 is a configuration of a two-microphone array with speech source S, under an embodiment.

FIG. 2 is a block diagram of V2 construction using a fixed β(z), under an embodiment.

FIG. 3 is a block diagram of V2 construction using an adaptive β̃(z), under an embodiment.

FIG. 4 is a block diagram of V1 construction, under an embodiment.

FIG. 5 is a flow diagram of acoustic voice activity detection, under an embodiment.

FIG. 6 shows experimental results of the algorithm using a fixed beta when only noise is present, under an embodiment.

FIG. 7 shows experimental results of the algorithm using a fixed beta when only speech is present, under an embodiment.

FIG. 8 shows experimental results of the algorithm using a fixed beta when speech and noise are present, under an embodiment.

FIG. 9 shows experimental results of the algorithm using an adaptive beta when only noise is present, under an embodiment.

FIG. 10 shows experimental results of the algorithm using an adaptive beta when only speech is present, under an embodiment.

FIG. 11 shows experimental results of the algorithm using an adaptive beta when speech and noise are present, under an embodiment.

FIG. 12 is a block diagram of a NAVSAD system, under an embodiment.

FIG. 13 is a block diagram of a PSAD system, under an embodiment.

FIG. 14 is a block diagram of a denoising subsystem, referred to herein as the Pathfinder system, under an embodiment.

FIG. 15 is a flow diagram of a detection algorithm for use in detecting voiced and unvoiced speech, under an embodiment.

FIGS. 16A, 16B, and 17 show data plots for an example in which a subject twice speaks the phrase "pop pan", under an embodiment.

FIG. 16A plots the received GEMS signal for this utterance along with the mean correlation between the GEMS signal and the Mic 1 signal and the threshold T1 used for voiced speech detection, under an embodiment.

FIG. 16B plots the received GEMS signal for this utterance along with the standard deviation of the GEMS signal and the threshold T2 used for voiced speech detection, under an embodiment.

FIG. 17 plots voiced speech detected from the acoustic or audio signal, along with the GEMS signal and the acoustic noise; no unvoiced speech is detected in this example because of the heavy background babble noise, under an embodiment.

FIG. 18 is a microphone array for use under an embodiment of the PSAD system.

FIG. 19 is a plot of ΔM versus d1 for several Δd values, under an embodiment.

FIG. 20 shows a plot of the gain parameter as the sum of the absolute values of H1(z) and the acoustic data or audio from microphone 1, under an embodiment.

FIG. 21 is an alternative plot of acoustic data presented in FIG. 20, under an embodiment.

FIG. 22 is a two-microphone adaptive noise suppression system, under an embodiment.

FIG. 23 is a generalized two-microphone array (DOMA) including an array and speech source S configuration, under an embodiment.

FIG. 24 is a system for generating or producing a first order gradient microphone V using two omnidirectional elements O1 and O2, under an embodiment.

FIG. 25 is a block diagram for a DOMA including two physical microphones configured to form two virtual microphones V1 and V2, under an embodiment.

FIG. 26 is a block diagram for a DOMA including two physical microphones configured to form N virtual microphones V1 through VN, where N is any number greater than one, under an embodiment.
`
`37
`
`37
`
`
`
`3
`FIG. 271s an example ofa headset or head-worn device that
`includes the DOMA,as described herein, under an embodi-
`ment.
`
`4
`signal but requires training. In addition, restrictions can be
`placed onthe filter to ensure thatit is training only on speech
`and not on environmentalnoise.
`
`US 8,321,213 B2
`
`In the following description, numerousspecific details are
`introduced to provide a thorough understanding of, and
`enabling description for, embodiments. One skilled in the
`relevant art, however, will recognize that these embodiments
`can be practiced without one or more ofthe specific details, or
`with other components, systems, etc. In other instances, well-
`known structures or operations are not shown, or are not
`described in detail, to avoid obscuring aspectsofthe disclosed
`embodiments.
`FIG.1 is a configuration of a two-microphonearray of the
`AVAD with speech source S, under an embodiment. The
`AVADof an embodimentuses two physical microphones (O,
`and O.,) to form two virtual microphones (V, and V.,). The
`virtual microphones of an embodimentare directional micro-
`phones, but the embodimentis not so limited. The physical
`microphones of an embodiment
`include omnidirectional
`microphones, but the embodiments described herein are not
`limited to omnidirectional microphones. The virtual micro-
`phone (VM) V,is configured in sucha waythatit has minimal
`responseto the speech of the user, while V, is configured so
`that it does respondtothe user’s speech but has a very similar
`noise magnitude responseto V,, as describedin detail herein.
`The PSAD VAD methodscan then be used to determine when
`speech is taking place. A further refinementis the use of an
`adaptivefilter to further minimize the speech response ofV3,
`thereby increasing the speech energyratio used in PSAD and
`resulting in better overall performance of the AVAD.
`The PSAD algorithm as described herein calculates the
`ratio of the energies of two directional microphones M, and
`M3:
`
`
`My (z;)?
`My(zi)"
`
`66599
`1
`
`66599 s
`wherethe “z’”
`indicates the discrete frequency domain and
`ranges from the beginning of the window ofinterest to the
`end, but the samerelationship holds in the time domain. The
`summation can occur over a window of any length; 200
`samples at a sampling rate of 8 kHz has been used to good
`effect. Microphone M,is assumedto have a greater speech
`response than microphone M,. The ratio R depends on the
`relative strength of the acoustic signal of interest as detected
`by the microphones.
`For matched omnidirectional microphones(i.e. they have
`the same response to acoustic signals for all spatial orienta-
`tions and frequencies), the size of R can be calculated for
`speech and noise by approximating the propagation of speech
`and noise wavesas spherically symmetric sources. For these
`the energy of the propagating wave decreases as 1/y?:
`
FIG. 28 is a flow diagram for denoising acoustic signals using the DOMA, under an embodiment.

FIG. 29 is a flow diagram for forming the DOMA, under an embodiment.

FIG. 30 is a plot of linear response of virtual microphone V2 with β = 0.8 to a 1 kHz speech source at a distance of 0.1 m, under an embodiment.

FIG. 31 is a plot of linear response of virtual microphone V2 with β = 0.8 to a 1 kHz noise source at a distance of 1.0 m, under an embodiment.

FIG. 32 is a plot of linear response of virtual microphone V1 with β = 0.8 to a 1 kHz speech source at a distance of 0.1 m, under an embodiment.

FIG. 33 is a plot of linear response of virtual microphone V1 with β = 0.8 to a 1 kHz noise source at a distance of 1.0 m, under an embodiment.

FIG. 34 is a plot of linear response of virtual microphone V1 with β = 0.8 to a speech source at a distance of 0.1 m for frequencies of 100, 500, 1000, 2000, 3000, and 4000 Hz, under an embodiment.

FIG. 35 is a plot showing comparison of frequency responses for speech for the array of an embodiment and for a conventional cardioid microphone, under an embodiment.

FIG. 36 is a plot showing speech response for V1 (top, dashed) and V2 (bottom, solid) versus B with ds assumed to be 0.1 m, under an embodiment.

FIG. 37 is a plot showing a ratio of V1/V2 speech responses shown in FIG. 36 versus B, under an embodiment.

FIG. 38 is a plot of B versus actual ds assuming that ds = 10 cm and theta = 0, under an embodiment.

FIG. 39 is a plot of B versus theta with ds = 10 cm and assuming ds = 10 cm, under an embodiment.

FIG. 40 is a plot of amplitude (top) and phase (bottom) response of N(s) with B = 1 and D = -7.2 μsec, under an embodiment.

FIG. 41 is a plot of amplitude (top) and phase (bottom) response of N(s) with B = 1.2 and D = -7.2 μsec, under an embodiment.

FIG. 42 is a plot of amplitude (top) and phase (bottom) response of the effect on the speech cancellation in V2 due to a mistake in the location of the speech source with q1 = 0 degrees and q2 = 30 degrees, under an embodiment.

FIG. 43 is a plot of amplitude (top) and phase (bottom) response of the effect on the speech cancellation in V2 due to a mistake in the location of the speech source with q1 = 0 degrees and q2 = 45 degrees, under an embodiment.

FIG. 44 shows experimental results for a 2d0 = 19 mm array using a linear β of 0.83 and B1 = B2 = 1 on a Bruel and Kjaer Head and Torso Simulator (HATS) in a very loud (~85 dBA) music/speech noise environment.

DETAILED DESCRIPTION

Acoustic Voice Activity Detection (AVAD) methods and systems are described herein. The AVAD methods and systems, which include algorithms or programs, use microphones to generate virtual directional microphones which have very similar noise responses and very dissimilar speech responses. The ratio of the energies of the virtual microphones is then calculated over a given window size and the ratio can then be used with a variety of methods to generate a VAD signal. The virtual microphones can be constructed using either a fixed or an adaptive filter. The adaptive filter generally results in a more accurate and noise-robust VAD signal but requires training. In addition, restrictions can be placed on the filter to ensure that it is training only on speech and not on environmental noise.

In the following description, numerous specific details are introduced to provide a thorough understanding of, and enabling description for, embodiments. One skilled in the relevant art, however, will recognize that these embodiments can be practiced without one or more of the specific details, or with other components, systems, etc. In other instances, well-known structures or operations are not shown, or are not described in detail, to avoid obscuring aspects of the disclosed embodiments.

FIG. 1 is a configuration of a two-microphone array of the AVAD with speech source S, under an embodiment. The AVAD of an embodiment uses two physical microphones (O1 and O2) to form two virtual microphones (V1 and V2). The virtual microphones of an embodiment are directional microphones, but the embodiment is not so limited. The physical microphones of an embodiment include omnidirectional microphones, but the embodiments described herein are not limited to omnidirectional microphones. The virtual microphone (VM) V2 is configured in such a way that it has minimal response to the speech of the user, while V1 is configured so that it does respond to the user's speech but has a very similar noise magnitude response to V2, as described in detail herein. The PSAD VAD methods can then be used to determine when speech is taking place. A further refinement is the use of an adaptive filter to further minimize the speech response of V2, thereby increasing the speech energy ratio used in PSAD and resulting in better overall performance of the AVAD.
`
The PSAD algorithm as described herein calculates the ratio of the energies of two directional microphones M1 and M2:

R = \frac{\sum_{i} M_1(z_i)^2}{\sum_{i} M_2(z_i)^2}

where the "z_i" indicates the discrete frequency domain and "i" ranges from the beginning of the window of interest to the end, but the same relationship holds in the time domain. The summation can occur over a window of any length; 200 samples at a sampling rate of 8 kHz has been used to good effect. Microphone M1 is assumed to have a greater speech response than microphone M2. The ratio R depends on the relative strength of the acoustic signal of interest as detected by the microphones.

For matched omnidirectional microphones (i.e. they have the same response to acoustic signals for all spatial orientations and frequencies), the size of R can be calculated for speech and noise by approximating the propagation of speech and noise waves as spherically symmetric sources. For these the energy of the propagating wave decreases as 1/r²:

R = \frac{M_1(z)^2}{M_2(z)^2} = \frac{d_2}{d_1} = \frac{d_1 + d}{d_1}

The distance d1 is the distance from the acoustic source to M1, d2 is the distance from the acoustic source to M2, and d = d2 - d1 (see FIG. 1). It is assumed that O1 is closer to the speech source (the user's mouth) so that d is always positive. If the microphones and the user's mouth are all on a line, then d = 2d0, the distance between the microphones.
`38
`
`38
`
`
`
For matched omnidirectional microphones, the magnitude of R depends only on the relative distance between the microphones and the acoustic source. For noise sources, the distances are typically a meter or more, and for speech sources, the distances are on the order of 10 cm, but the distances are not so limited. Therefore for a 2-cm array typical values of R are:

R_S \approx \frac{d_2}{d_1} = \frac{12\ \text{cm}}{10\ \text{cm}} = 1.2

R_N \approx \frac{d_2}{d_1} = \frac{102\ \text{cm}}{100\ \text{cm}} = 1.02

where the "S" subscript denotes the ratio for speech sources and "N" the ratio for noise sources. There is not a significant amount of separation between noise and speech sources in this case, and therefore it would be difficult to implement a robust solution using simple omnidirectional microphones.

A better implementation is to use directional microphones where the second microphone has minimal speech response. As described herein, such microphones can be constructed using omnidirectional microphones O1 and O2:

V_1(z) = -\beta(z)\alpha(z)O_2(z) + O_1(z)z^{-\gamma}

V_2(z) = \alpha(z)O_2(z) - \beta(z)O_1(z)z^{-\gamma}

where α(z) is a calibration filter used to compensate O2's response so that it is the same as O1, β(z) is a filter that describes the relationship between O1 and calibrated O2 for speech, and γ is a fixed delay that depends on the size of the array. There is no loss of generality in defining α(z) as above, as either microphone may be compensated to match the other. For this configuration V1 and V2 have very similar noise response magnitudes and very dissimilar speech response magnitudes if

\gamma = \frac{d}{c} \quad \text{(seconds)}

where again d = 2d0 and c is the speed of sound in air, which is temperature dependent and approximately

c = 331.3\sqrt{1 + \frac{T}{273.15}}\ \frac{\text{m}}{\text{sec}}

where T is the temperature of the air in Celsius.

The filter β(z) can be calculated using wave theory to be

\beta(z) = \frac{d_1}{d_2} = \frac{d_1}{d_1 + d}

where again d1 is the distance from the user's mouth to O1. FIG. 2 is a block diagram of V2 construction using a fixed β(z), under an embodiment. This fixed (or static) β works sufficiently well if the calibration filter α(z) is accurate and d1 and d2 are accurate for the user. This fixed-β algorithm, however, neglects important effects such as reflection, diffraction, poor array orientation (i.e. the microphones and the mouth of the user are not all on a line), and the possibility of different d1 and d2 values for different users.
`
The filter β(z) can also be determined experimentally using an adaptive filter. FIG. 3 is a block diagram of V2 construction using an adaptive β̃(z), under an embodiment, where:

\tilde{\beta}(z) = \frac{\alpha(z)O_2(z)}{z^{-\gamma}O_1(z)}

The adaptive process varies β̃(z) to minimize the output of V2 when only speech is being received by O1 and O2. A small amount of noise may be tolerated with little ill effect, but it is preferred that only speech is being received when the coefficients of β̃(z) are calculated. Any adaptive process may be used; a normalized least-mean squares (NLMS) algorithm was used in the examples below.
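One possible realization of this adaptation is sketched below: the taps of β̃(z) are adapted with NLMS during speech, and the error signal is exactly the V2 output that the adaptation drives toward zero. The tap count, step size, and regularization constant are illustrative choices, not values from the patent.

```python
import numpy as np

def nlms_adapt_beta(o1_delayed, o2_cal, num_taps=8, mu=0.5, eps=1e-8):
    """Adapt beta(z) taps so that beta * (delayed O1) tracks calibrated O2."""
    w = np.zeros(num_taps)                           # beta(z) tap estimates
    for n in range(num_taps - 1, len(o1_delayed)):
        x = o1_delayed[n - num_taps + 1:n + 1][::-1]  # x(n), x(n-1), ...
        e = o2_cal[n] - np.dot(w, x)                  # e(n) is the V2 output
        w += mu * e * x / (np.dot(x, x) + eps)        # normalized LMS update
    return w
```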
The V1 can be constructed using the current value for β̃(z) or the fixed filter β(z) can be used for simplicity. FIG. 4 is a block diagram of V1 construction, under an embodiment.

Now the ratio R is

R = \frac{\lVert V_1(z) \rVert}{\lVert V_2(z) \rVert} = \frac{\lVert -\beta(z)\alpha(z)O_2(z) + O_1(z)z^{-\gamma} \rVert}{\lVert \alpha(z)O_2(z) - \beta(z)O_1(z)z^{-\gamma} \rVert}

where double bar indicates norm and again any size window may be used. If β̃(z) has been accurately calculated, the ratio for speech should be relatively high (e.g., greater than approximately 2) and the ratio for noise should be relatively low (e.g., less than approximately 1.1). The ratio calculated will depend on both the relative energies of the speech and noise as well as the orientation of the noise and the reverberance of the environment. In practice, either the adapted filter β̃(z) or the static filter β(z) may be used for V1(z) with little effect on R, but it is important to use the adapted filter β̃(z) in V2(z) for best performance. Many techniques known to those skilled in the art (e.g., smoothing, etc.) can be used to make R more amenable to use in generating a VAD and the embodiments herein are not so limited.
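A minimal decision sketch based on the figures quoted above (speech ratios above roughly 2, noise ratios below roughly 1.1 for a well-adapted β̃) follows; the 1.5 threshold is a placeholder between those values, and the first-order smoother is one example of the smoothing the text mentions.

```python
import numpy as np

def vad_from_ratio(r, threshold=1.5, smooth=0.9):
    """Generate a binary VAD from the windowed ratio R.

    Both the smoothing constant and the threshold are illustrative;
    fixed or adaptive thresholds may be substituted.
    """
    vad = np.zeros(len(r), dtype=int)
    r_s = 1.0                      # R is near unity when only noise is present
    for i, ri in enumerate(r):
        r_s = smooth * r_s + (1.0 - smooth) * ri
        vad[i] = 1 if r_s > threshold else 0
    return vad
```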
The ratio R can be calculated for the entire frequency band of interest, or can be calculated in frequency subbands. One effective subband discovered was 250 Hz to 1250 Hz, another was 200 Hz to 3000 Hz, but many others are possible and useful.

Once generated, the vector of the ratio R versus time (or the matrix of R versus time if multiple subbands are used) can be used with any detection system (such as one that uses fixed and/or adaptive thresholds) to determine when speech is occurring. While many detection systems and methods are known to exist by those skilled in the art and may be used, the method described herein for generating an R so that the speech is easily discernable is novel. It is important to note that the R does not depend on the type of noise or its orientation or frequency content; R simply depends on the V1 and V2 spatial response similarity for noise and spatial response dissimilarity for speech. In this way it is very robust and can operate smoothly in a variety of noisy acoustic environments.
`
`
`
FIG. 5 is a flow diagram of acoustic voice activity detection 500, under an embodiment. The detection comprises forming a first virtual microphone by combining a first signal of a first physical microphone and a second signal of a second physical microphone 502. The detection comprises forming a filter that describes a relationship for speech between the first physical microphone and the second physical microphone 504. The detection comprises forming a second virtual microphone by applying the filter to the first signal to generate a first intermediate signal, and summing the first intermediate signal and the second signal 506. The detection comprises generating an energy ratio of energies of the first virtual microphone and the second virtual microphone 508. The detection comprises detecting acoustic voice activity of a speaker when the energy ratio is greater than a threshold value 510.

The accuracy of the adaptation to the β̃(z) of the system is a factor in determining the effectiveness of the AVAD. A more accurate adaptation to the actual β(z) of the system leads to lower energy of the speech response in V2, and a higher ratio R. The noise (far-field) magnitude response is largely unchanged by the adaptation process, so the ratio R will be near unity for accurately adapted beta. For purposes of accuracy, the system can be trained on speech alone, or the noise should be low enough in energy so as not to affect or to have a minimal effect on the training.

To make the training as accurate as possible, the coefficients of the filter β̃(z) of an embodiment are generally updated under the following conditions, but the embodiment is not so limited: speech is being produced (requires a relatively high SNR or other method of detection such as an Aliph Skin Surface Microphone (SSM) as described in U.S. patent application Ser. No. 10/769,302, filed Jan. 30, 2004, which is incorporated by reference herein in its entirety); no wind is detected (wind can be detected using many different methods known in the art, such as examining the microphones for uncorrelated low-frequency noise); and the current value of R is much larger than a smoothed history of R values (this ensures that training occurs only when strong speech is present).
`
`7
`phoneby applyingthefilterto thefirst signal to generate a first
`intermediate signal, and summingthefirst intermediate sig-
`nal and the second signal 506. The detection comprises gen-
`erating an energy ratio of energies of the first virtual micro-
`phoneandthe secondvirtual microphone 508. The detection
`comprises detecting acoustic voice activity of a speaker when
`the energy ratio is greater than a threshold value 510.
`The accuracy of the adaptationto the B(z) of the system is
`a factorin determiningthe effectiveness ofthe AVAD. A more
`accurate adaptation to the actual B(z) of the system leads to
`lower energy of the speech response in V,, and a higherratio
`R. The noise (far-field) magnitude response is largely
`unchanged by the adaptation process, so the ratio R will be
`near unity for accurately adapted beta. For purposes of accu-
`racy, the system can be trained on speech alone, or the noise
`should be low enoughin energy so as not to affect or to have
`a minimalaffect the training.
`To make the training as accurate as possible, the coeffi-
`cients of the filter 6(z) of an embodiment are generally
`updated underthe following conditions, but the embodiment
`is not so limited: speech is being produced (requiresa rela-
`tively high SNRor other methodofdetection such as an Aliph
`Skin Surface Microphone (SSM)as described in U.S. patent
`application Ser. No. 10/769,302, filed Jan. 30, 2004, which is
`incorporated by reference herein in its entirety); no wind is
`detected (wind can be detected using many different methods
`known in the art, such as examining the microphones for
`uncorrelated low-frequency noise); and the current value ofR
`is much larger than a smoothed history of R values (this
`ensures that training occurs only when strong speech is
`present). These pr
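The geometry above can be checked numerically. The sketch below reproduces the quoted delay and phase figures, assuming d0 = 1 cm (half of the 2-cm array spacing) and a round c = 340 m/sec; the printed values agree to rounding.

```python
import numpy as np

def gamma_seconds(ds, theta_deg, d0=0.01, c=340.0):
    """gamma = (d2 - d1)/c for a source at distance ds and angle theta."""
    th = np.radians(theta_deg)
    d1 = np.sqrt(ds**2 - 2 * ds * d0 * np.cos(th) + d0**2)
    d2 = np.sqrt(ds**2 + 2 * ds * d0 * np.cos(th) + d0**2)
    return (d2 - d1) / c

g0 = gamma_seconds(0.10, 0.0)         # ~58.8e-6 s at 0 degrees
g30 = gamma_seconds(0.10, 30.0)       # ~50.9e-6 s at +-30 degrees, ds = 10 cm
fs = 8000.0
print((g0 - g30) * fs)                # ~0.064 samples of delay spread
print(2 * np.pi * 4000 * (g0 - g30))  # ~0.2 rad maximum phase at 4 kHz
```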