`
`(12) United States Patent
`Petit et al.
`
(10) Patent No.: US 8,321,213 B2
(45) Date of Patent: *Nov. 27, 2012
`
`(54) ACOUSTIC VOICE ACTIVITY DETECTION
`(AVAD) FOR ELECTRONIC SYSTEMS
`
(75) Inventors: Nicolas Petit, San Francisco, CA (US); Gregory Burnett, Dodge Center, MN (US); Zhinian Jing, San Francisco, CA (US)

(73) Assignee: AliphCom, Inc., San Francisco, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 540 days. This patent is subject to a terminal disclaimer.
`(21) Appl. No.: 12/606,146
`
(22) Filed: Oct. 26, 2009
`
`(65)
`
`Prior Publication Data
`US 2010/O128894 A1
`May 27, 2010
`
`Related U.S. Application Data
(63) Continuation-in-part of application No. 12/139,333, filed on Jun. 13, 2008, and a continuation-in-part of application No. 11/805,987, filed on May 25, 2007, now abandoned.
(60) Provisional application No. 61/108,426, filed on Oct. 24, 2008.
`
(51) Int. Cl.
    G10L 11/06 (2006.01)
(52) U.S. Cl. ........ 704/208; 704/214
(58) Field of Classification Search ........ 704/208, 704/210, 214, 215; 381/99, 100, 46
`See application file for complete search history.
`
`(56)
`
`References Cited
`
`U.S. PATENT DOCUMENTS
5,459,814 A * 10/1995 Gupta et al. ........ 704/233
7,171,357 B2 * 1/2007 Boland ........ 704/231
7,246,058 B2 * 7/2007 Burnett ........ 704/226
7,464,029 B2 * 12/2008 Visser et al. ........ 704/210
8,019,091 B2 * 9/2011 Burnett et al. ........ 381/71.8
2009/0089053 A1 * 4/2009 Wang et al. ........ 704/233
`* cited by examiner
`Primary Examiner — Abul Azad
`(74) Attorney, Agent, or Firm — Kokka & Backus, PC
`(57)
`ABSTRACT
Acoustic Voice Activity Detection (AVAD) methods and systems are described. The AVAD methods and systems, including corresponding algorithms or programs, use microphones to generate virtual directional microphones which have very similar noise responses and very dissimilar speech responses. The ratio of the energies of the virtual microphones is then calculated over a given window size and the ratio can then be used with a variety of methods to generate a VAD signal. The virtual microphones can be constructed using either an adaptive or a fixed filter.
`
`42 Claims, 35 Drawing Sheets
`
`
`
[Representative drawing: FIG. 5, flow diagram of acoustic voice activity detection 500 —
502: Forming first virtual microphone by combining first signal of first physical microphone and second signal of second physical microphone.
504: Forming filter that describes relationship for speech between first physical microphone and second physical microphone.
506: Forming second virtual microphone by applying filter to first signal to generate first intermediate signal, and summing first intermediate signal and second signal.
508: Generating energy ratio of energies of first virtual microphone and second virtual microphone.
510: Detecting acoustic voice activity of speaker when energy ratio is greater than threshold value.]
`
`
`
[Drawing sheets 1-35 follow; the figure images are omitted here, and only the legible captions and labels are recovered.]

Sheet 1: FIG. 2.
Sheet 2: FIG. 3.
Sheet 3: FIG. 5, flow diagram of acoustic voice activity detection 500, steps 502-510 (as reproduced on the front page).
Sheet 4: FIG. 6 (x-axis: time (sec)).
Sheet 5: FIG. 7, "V1 (top) and V2 (bottom) for fixed beta, speech only" (x-axis: time (sec)).
Sheet 6: FIG. 8, "V1 (top) and V2 (bottom) for fixed beta, speech in noise" (x-axis: time (sec)).
Sheet 7: FIG. 9 (x-axis: time (sec)).
Sheet 8: FIG. 10, "V1 (top) and V2 (bottom) for adaptive beta, speech only" (x-axis: time (sec)).
Sheet 9: FIG. 11, "V1 (top) and V2 (bottom) for adaptive beta, speech in noise" (x-axis: time (sec)).
Sheet 10: FIG. 12 (microphones 1220 and voicing sensors feeding processor 1230, with detection subsystem 1250 and denoising subsystem 1240) and FIG. 13 (microphones feeding processor 1230, with detection subsystem 1250 and denoising subsystem 1240).
Sheet 11: FIG. 14 (denoising subsystem: noise n(n) removal yielding cleaned speech).
Sheet 12: FIG. 15, flow diagram 1250 of the voiced/unvoiced detection algorithm, with a legend of constants (V = 0 if noise, 1 if UV, 2 if V; VTC = voiced threshold for correlation; VTS = voiced threshold for standard deviation; forgetting factor; number of moving-average taps; UV standard-deviation thresholds; num_begin) and variables (bh1 = LMS calculation of the Mic 1-to-Mic 2 transfer function; keep_old = 1 if last window V/UV; sd_ma vector and its moving average), and a PSAD branch filtering m1 and m2 into two bands, 1500-2500 and 2500-3500 Hz.
Sheet 13: FIG. 16A ("Gems and Mean Correlation") and FIG. 16B ("Gems and Standard Deviation").
Sheet 14: FIG. 17 (voicing 1700; acoustic 1706).
Sheet 15: FIG. 18 (linear array midline).
Sheet 16: FIG. 19, "d1 versus delta M for delta d = 1, 2, 3, 4 cm" 1900 (x-axis: d1 (cm)).
Sheet 17: FIG. 20, "Acoustic data (solid) and gain parameter (dashed)" 2000 (x-axis: time (samples)).
Sheet 18: FIG. 21, Mic 1 and V for "pop pan" 2100: voicing signal, audio signal 2104, gems signal 2106, unvoiced level, not voiced (x-axis: time (samples)).
Sheet 19: FIG. 22, two-microphone adaptive noise suppression system 2200: signal s(n) and noise n(n) sources, voicing information, noise removal, cleaned speech.
Sheets 20-22: FIGS. 23-26 (per the brief description: DOMA array configurations and virtual-microphone block diagrams; labels illegible).
Sheet 23: FIG. 27 (headset or head-worn device, element 2702).
Sheet 24: FIG. 28, flow diagram 2800 for denoising acoustic signals: receive acoustic signals at a first physical microphone and a second physical microphone (2802); output first microphone signal from first physical microphone and second microphone signal from second physical microphone (2804); form first virtual microphone using the first combination of first microphone signal and second microphone signal (2806); form second virtual microphone using second combination of first microphone signal and second microphone signal (2808); generate denoised output signals having less acoustic noise than received acoustic signals (2810). FIG. 29, flow diagram for forming the DOMA: form physical microphone array including first physical microphone and second physical microphone (2902); form virtual microphone array including first virtual microphone and second virtual microphone using signals from physical microphone array (2904).
Sheet 25: FIGS. 30 and 31, linear response of V2 to a speech source at 0.10 meters and to a noise source at 1 meter (polar plots).
Sheet 26: FIGS. 32 and 33, linear response of V1 to a speech source at 0.10 meters and to a noise source at 1 meter (polar plots).
Sheet 27: FIG. 34, linear response of V1 to a speech source at 0.1 meters, frequencies up to 4000 Hz (polar plot).
Sheet 28: FIG. 35, "Frequency response at 0 degrees": cardioid speech response versus V1 speech response (x-axis: Frequency (Hz)).
Sheet 29: FIG. 36, "V1 (top, dashed) and V2 speech response vs. B assuming ds = 0.1m", and FIG. 37, "V1/V2 for speech versus B assuming ds = 0.1m".
Sheet 30: FIG. 38, "B factor vs. actual ds assuming ds = 0.1m and theta = 0" (x-axis: actual ds (meters), 0.05-0.5), and FIG. 39, "B versus theta assuming ds = 0.1m" (x-axis: theta (degrees), -80 to 80).
Sheet 31: FIG. 40 (amplitude and phase response; x-axis: Frequency (Hz), 0-8000).
Sheet 32: FIG. 41 (amplitude and phase response; x-axis: Frequency (Hz), 0-8000).
Sheet 33: FIG. 42, "Cancellation with d1 = 1, theta1 = 0, d2 = 1, and theta2 = 30" (x-axis: Frequency (Hz), 0-8000).
Sheet 34: FIG. 43, "Cancellation with d1 = 1, theta1 = 0, d2 = 1, and theta2 = 45" (x-axis: Frequency (Hz), 0-8000).
Sheet 35: FIG. 44, "Original V1 (top) and cleaned V1 (bottom) with simplified VAD (dashed) in noise" (x-axis: Time (samples at 8 kHz/sec)).
`
`
`
`ACOUSTIC VOICE ACTIVITY DETECTION
`(AVAD) FOR ELECTRONIC SYSTEMS
`
`RELATED APPLICATIONS
`
This application claims the benefit of U.S. Patent Application No. 61/108,426, filed Oct. 24, 2008.

This application is a continuation-in-part of U.S. patent application Ser. No. 11/805,987, filed May 25, 2007.

This application is a continuation-in-part of U.S. patent application Ser. No. 12/139,333, filed Jun. 13, 2008.
`
`TECHNICAL FIELD
`
The disclosure herein relates generally to noise suppression. In particular, this disclosure relates to noise suppression systems, devices, and methods for use in acoustic applications.
`
`10
`
`15
`
`BACKGROUND
`
`2
`FIG. 6 shows experimental results of the algorithm using a
`fixed beta when only noise is present, under an embodiment.
`FIG. 7 shows experimental results of the algorithm using a
`fixed beta when only speech is present, under an embodiment.
`FIG. 8 shows experimental results of the algorithm using a
`fixed beta when speech and noise is present, under an embodi
`ment.
`FIG. 9 shows experimental results of the algorithm using
`an adaptive beta when only noise is present, under an embodi
`ment.
`FIG. 10 shows experimental results of the algorithm using
`an adaptive beta when only speech is present, under an
`embodiment.
`FIG. 11 shows experimental results of the algorithm using
`an adaptive beta when speech and noise is present, under an
`embodiment.
`FIG. 12 is a block diagram of a NAVSAD system, under an
`embodiment
`FIG. 13 is a block diagram of a PSAD system, under an
`embodiment.
`FIG. 14 is a block diagram of a denoising Subsystem,
`referred to herein as the Pathfinder system, under an embodi
`ment.
`FIG. 15 is a flow diagram of a detection algorithm for use
`in detecting Voiced and unvoiced speech, under an embodi
`ment.
`FIGS. 16A, 16B, and 17 show data plots for an example in
`which a subject twice speaks the phrase "pop pan’, under an
`embodiment.
`FIG.16A plots the received GEMS signal for this utterance
`along with the mean correlation between the GEMS signal
`and the Mic 1 signal and the threshold T1 used for voiced
`speech detection, under an embodiment.
`FIG.16B plots the received GEMS signal for this utterance
`along with the standard deviation of the GEMS signal and the
`threshold T2 used for voiced speech detection, under an
`embodiment.
`FIG. 17 plots voiced speech detected from the acoustic or
`audio signal, along with the GEMS signal and the acoustic
`noise; no unvoiced speech is detected in this example because
`of the heavy background babble noise, under an embodiment.
`FIG. 18 is a microphone array for use under an embodi
`ment of the PSAD system.
`FIG. 19 is a plot of AM versus d for several Ad values,
`under an embodiment.
`FIG.20 shows a plot of the gain parameter as the sum of the
`absolute values of H (Z) and the acoustic data or audio from
`microphone 1, under an embodiment.
`FIG. 21 is an alternative plot of acoustic data presented in
`FIG. 20, under an embodiment.
`FIG. 22 is a two-microphone adaptive noise Suppression
`system, under an embodiment.
`FIG. 23 is a generalized two-microphone array (DOMA)
`including an array and speech Source S configuration, under
`an embodiment.
`FIG.24 is a system for generating or producing a first order
`gradient microphone V using two omnidirectional elements
`O and O, under an embodiment.
`FIG. 25 is a block diagram for a DOMA including two
`physical microphones configured to form two virtual micro
`phones V and V, under an embodiment.
`FIG. 26 is a block diagram for a DOMA including two
`physical microphones configured to form N virtual micro
`phones V through V, where N is any number greater than
`one, under an embodiment.
`
`25
`
`30
`
`The ability to correctly identify voiced and unvoiced
`speech is critical to many speech applications including
`speech recognition, speaker verification, noise Suppression,
`and many others. In a typical acoustic application, speech
`from a human speaker is captured and transmitted to a
`receiver in a different location. In the speaker's environment
`there may exist one or more noise sources that pollute the
`speech signal, the signal of interest, with unwanted acoustic
`noise. This makes it difficult or impossible for the receiver,
`whether human or machine, to understand the user's speech.
`Typical methods for classifying voiced and unvoiced
`speech have relied mainly on the acoustic content of single
`microphone data, which is plagued by problems with noise
`and the corresponding uncertainties in signal content. This is
`especially problematic with the proliferation of portable com
`munication devices like mobile telephones. There are meth
`ods known in the art for Suppressing the noise present in the
`speech signals, but these normally require a robust method of
`40
`determining when speech is being produced. Non-acoustic
`methods have been employed Successfully in commercial
`products such as the Jawbone headset produced by Aliphcom,
`Inc., San Francisco, Calif. (Aliph), but an acoustic-only solu
`tion is desired in some cases (e.g., for reduced cost, as a
`Supplement to the non-acoustic sensor, etc.).
`
`35
`
`45
`
`INCORPORATION BY REFERENCE
`
`Each patent, patent application, and/or publication men
`50
`tioned in this specification is herein incorporated by reference
`in its entirety to the same extent as if each individual patent,
`patent application, and/or publication was specifically and
`individually indicated to be incorporated by reference.
`
`55
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIG. 1 is a configuration of a two-microphone array with
`speech source S, under an embodiment.
`FIG. 2 is a block diagram of V2 construction using a fixed
`B(Z), under an embodiment.
`FIG. 3 is a block diagram of V construction using an
`adaptive f3(Z), under an embodiment.
`FIG. 4 is a block diagram of V construction, under an
`embodiment.
`FIG. 5 is a flow diagram of acoustic voice activity detec
`tion, under an embodiment.
`
`60
`
`65
`
FIG. 27 is an example of a headset or head-worn device that includes the DOMA, as described herein, under an embodiment.
FIG. 28 is a flow diagram for denoising acoustic signals using the DOMA, under an embodiment.
FIG. 29 is a flow diagram for forming the DOMA, under an embodiment.
FIG. 30 is a plot of linear response of virtual microphone V2 with B = 0.8 to a 1 kHz speech source at a distance of 0.1 m, under an embodiment.
FIG. 31 is a plot of linear response of virtual microphone V2 with B = 0.8 to a 1 kHz noise source at a distance of 1.0 m, under an embodiment.
FIG. 32 is a plot of linear response of virtual microphone V1 with B = 0.8 to a 1 kHz speech source at a distance of 0.1 m, under an embodiment.
FIG. 33 is a plot of linear response of virtual microphone V1 with B = 0.8 to a 1 kHz noise source at a distance of 1.0 m, under an embodiment.
FIG. 34 is a plot of linear response of virtual microphone V1 with B = 0.8 to a speech source at a distance of 0.1 m for frequencies of 100, 500, 1000, 2000, 3000, and 4000 Hz, under an embodiment.
FIG. 35 is a plot showing comparison of frequency responses for speech for the array of an embodiment and for a conventional cardioid microphone, under an embodiment.
FIG. 36 is a plot showing speech response for V1 (top, dashed) and V2 (bottom, solid) versus B with ds assumed to be 0.1 m, under an embodiment.
FIG. 37 is a plot showing a ratio of V1/V2 speech responses shown in FIG. 36 versus B, under an embodiment.
FIG. 38 is a plot of B versus actual ds assuming that ds = 10 cm and theta = 0, under an embodiment.
FIG. 39 is a plot of B versus theta with ds = 10 cm and d assumed to be 10 cm, under an embodiment.
FIG. 40 is a plot of amplitude (top) and phase (bottom) response of N(s) with B = 1 and D = -7.2 μsec, under an embodiment.
FIG. 41 is a plot of amplitude (top) and phase (bottom) response of N(s) with B = 1.2 and D = -7.2 μsec, under an embodiment.
FIG. 42 is a plot of amplitude (top) and phase (bottom) response of the effect on the speech cancellation in V2 due to a mistake in the location of the speech source with q1 = 0 degrees and q2 = 30 degrees, under an embodiment.
FIG. 43 is a plot of amplitude (top) and phase (bottom) response of the effect on the speech cancellation in V2 due to a mistake in the location of the speech source with q1 = 0 degrees and q2 = 45 degrees, under an embodiment.
FIG. 44 shows experimental results for a 2d0 = 19 mm array using a linear B of 0.83 and B1 = B2 = 1 on a Bruel and Kjaer Head and Torso Simulator (HATS) in a very loud (~85 dBA) music/speech noise environment.
`
`10
`
`15
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`US 8,321,213 B2
`
`4
DETAILED DESCRIPTION

Acoustic Voice Activity Detection (AVAD) methods and systems are described herein. The AVAD methods and systems, which include algorithms or programs, use microphones to generate virtual directional microphones which have very similar noise responses and very dissimilar speech responses. The ratio of the energies of the virtual microphones is then calculated over a given window size and the ratio can then be used with a variety of methods to generate a VAD signal. The virtual microphones can be constructed using either a fixed or an adaptive filter. The adaptive filter generally results in a more accurate and noise-robust VAD signal but requires training. In addition, restrictions can be placed on the filter to ensure that it is training only on speech and not on environmental noise.

In the following description, numerous specific details are introduced to provide a thorough understanding of, and enabling description for, embodiments. One skilled in the relevant art, however, will recognize that these embodiments can be practiced without one or more of the specific details, or with other components, systems, etc. In other instances, well-known structures or operations are not shown, or are not described in detail, to avoid obscuring aspects of the disclosed embodiments.
FIG. 1 is a configuration of a two-microphone array of the AVAD with speech source S, under an embodiment. The AVAD of an embodiment uses two physical microphones (O1 and O2) to form two virtual microphones (V1 and V2). The virtual microphones of an embodiment are directional microphones, but the embodiment is not so limited. The physical microphones of an embodiment include omnidirectional microphones, but the embodiments described herein are not limited to omnidirectional microphones. The virtual microphone (VM) V2 is configured in such a way that it has minimal response to the speech of the user, while V1 is configured so that it does respond to the user's speech but has a very similar noise magnitude response to V2, as described in detail herein. The PSAD VAD methods can then be used to determine when speech is taking place. A further refinement is the use of an adaptive filter to further minimize the speech response of V2, thereby increasing the speech energy ratio used in PSAD and resulting in better overall performance of the AVAD.
The PSAD algorithm as described herein calculates the ratio of the energies of two directional microphones M1 and M2:

    R = \frac{\sum_i M_1(z_i)^2}{\sum_i M_2(z_i)^2}

where the "z" indicates the discrete frequency domain and "i" ranges from the beginning of the window of interest to the end, but the same relationship holds in the time domain. The summation can occur over a window of any length; 200 samples at a sampling rate of 8 kHz has been used to good effect. Microphone M1 is assumed to have a greater speech response than microphone M2. The ratio R depends on the relative strength of the acoustic signal of interest as detected by the microphones.
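By way of illustration, a minimal Python sketch of this windowed ratio (NumPy assumed; the 200-sample window at 8 kHz is the value reported above, while the eps guard against an all-zero window is an added safeguard, not part of the patent):

```python
import numpy as np

def psad_ratio(m1: np.ndarray, m2: np.ndarray, win: int = 200,
               eps: float = 1e-12) -> np.ndarray:
    """Windowed energy ratio R of two microphone signals.

    m1 is assumed to have the greater speech response; win = 200
    samples at an 8 kHz sampling rate is the window the text
    reports as working well.
    """
    n = min(len(m1), len(m2)) // win
    r = np.empty(n)
    for i in range(n):
        s = slice(i * win, (i + 1) * win)
        # Ratio of the energies over the window of interest.
        r[i] = np.sum(m1[s] ** 2) / (np.sum(m2[s] ** 2) + eps)
    return r
```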
For matched omnidirectional microphones (i.e. they have the same response to acoustic signals for all spatial orientations and frequencies), the size of R can be calculated for speech and noise by approximating the propagation of speech and noise waves as spherically symmetric sources. For these the energy of the propagating wave decreases as 1/r^2:
`
    R \approx \frac{d_2}{d_1} = \frac{d_1 + d}{d_1}

The distance d1 is the distance from the acoustic source to M1, d2 is the distance from the acoustic source to M2, and d = d2 - d1 (see FIG. 1). It is assumed that O1 is closer to the speech source (the user's mouth) so that d is always positive. If the microphones and the user's mouth are all on a line, then d = 2d0, the distance between the microphones.
For matched omnidirectional microphones, the magnitude of R depends only on the relative distance between the microphones and the acoustic source. For noise sources, the distances are typically a meter or more, and for speech sources, the distances are on the order of 10 cm, but the distances are not so limited. Therefore for a 2-cm array typical values of R are:

    R_S \approx \frac{d_2}{d_1} = \frac{12\ \mathrm{cm}}{10\ \mathrm{cm}} = 1.2
    R_N \approx \frac{d_2}{d_1} = \frac{102\ \mathrm{cm}}{100\ \mathrm{cm}} = 1.02

where the "S" subscript denotes the ratio for speech sources and "N" the ratio for noise sources. There is not a significant amount of separation between noise and speech sources in this case, and therefore it would be difficult to implement a robust solution using simple omnidirectional microphones.
A better implementation is to use directional microphones where the second microphone has minimal speech response. As described herein, such microphones can be constructed using omnidirectional microphones O1 and O2:

    V_1(z) = O_1(z) \cdot z^{-\gamma} - \beta(z)\alpha(z)O_2(z)
    V_2(z) = \alpha(z)O_2(z) - \beta(z)O_1(z) \cdot z^{-\gamma}

where α(z) is a calibration filter used to compensate O2's response so that it is the same as O1, β(z) is a filter that describes the relationship between O1 and calibrated O2 for speech, and γ is a fixed delay that depends on the size of the array. There is no loss of generality in defining α(z) as above, as either microphone may be compensated to match the other. For this configuration V1 and V2 have very similar noise response magnitudes and very dissimilar speech response magnitudes if

    \gamma = \frac{d}{c}

where again d = 2d0 and c is the speed of sound in air, which is temperature dependent and approximated by

    c = 331.3\sqrt{1 + \frac{T}{273.15}}\ \mathrm{m/sec}

where T is the temperature of the air in Celsius.

The filter β(z) can be calculated using wave theory to be

    \beta(z) = \frac{d_1}{d_1 + d}

where again d1 is the distance from the user's mouth to O1. FIG. 2 is a block diagram of V2 construction using a fixed β(z), under an embodiment. This fixed (or static) β works sufficiently well if the calibration filter α(z) is accurate and d1 and d2 are accurate for the user. This fixed-β algorithm, however, neglects important effects such as reflection, diffraction, poor array orientation (i.e. the microphones and the mouth of the user are not all on a line), and the possibility of different d1 and d2 values for different users.
`
`30
`
`35
`
The adaptive process varies β̃(z) to minimize the output of V2 when only speech is being received by O1 and O2. A small amount of noise may be tolerated with little ill effect, but it is preferred that only speech is being received when the coefficients of β̃(z) are calculated. Any adaptive process may be used; a normalized least-mean squares (NLMS) algorithm was used in the examples below.
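The text names NLMS but does not reproduce the update, so the sketch below is one conventional NLMS realization under that reading. It adapts FIR taps for β̃ so that β̃ applied to the delayed O1 predicts the calibrated O2, which is equivalent to minimizing the V2 output; per the text, it should be run only while speech alone is being received. The tap count and step size are assumptions:

```python
import numpy as np

def adapt_beta_nlms(o1_delayed: np.ndarray, o2_cal: np.ndarray,
                    n_taps: int = 16, mu: float = 0.1,
                    eps: float = 1e-8) -> np.ndarray:
    """NLMS estimate of beta~ minimizing V2 = O2_cal - beta~ * O1_delayed."""
    w = np.zeros(n_taps)
    for n in range(n_taps, len(o1_delayed)):
        x = o1_delayed[n - n_taps:n][::-1]   # newest sample first
        v2 = o2_cal[n] - w @ x               # V2 sample: the error to minimize
        w += (mu / (x @ x + eps)) * v2 * x   # normalized LMS step
    return w
```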
The V1 can be constructed using the current value for β̃(z), or the fixed filter β(z) can be used for simplicity. FIG. 4 is a block diagram of V1 construction, under an embodiment.

Now the ratio R is

    R = \frac{\lVert V_1(z) \rVert}{\lVert V_2(z) \rVert}

where the double bar indicates norm and again any size window may be used. If β(z) has been accurately calculated, the ratio for speech should be relatively high (e.g., greater than approximately 2) and the ratio for noise should be relatively low (e.g., less than approximately 1.1). The ratio calculated will depend on both the relative energies of the speech and noise as well as the orientation of the noise and the reverberance of the environment. In practice, either the adapted filter β̃(z) or the static filter β(z) may be used for V1(z) with little effect on R, but it is important to use the adapted filter β̃(z) in V2(z) for best performance. Many techniques known to those skilled in the art (e.g., smoothing, etc.) can be used to make R more amenable to use in generating a VAD, and the embodiments herein are not so limited.

The ratio R can be calculated for the entire frequency band of interest, or can be calculated in frequency subbands. One effective subband discovered was 250 Hz to 1250 Hz, another was 200 Hz to 3000 Hz, but many others are possible and useful.
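One plausible realization of the subband variant, assuming SciPy is available, is to band-pass both virtual microphone signals before forming the windowed norm ratio; the Butterworth filter and its order are illustrative choices, and the 250-1250 Hz defaults are the first subband mentioned above:

```python
import numpy as np
from scipy.signal import butter, lfilter

def subband_ratio(v1: np.ndarray, v2: np.ndarray, fs: int = 8000,
                  lo: float = 250.0, hi: float = 1250.0,
                  win: int = 200, eps: float = 1e-12) -> np.ndarray:
    """Windowed norm ratio R = ||V1|| / ||V2|| within one subband."""
    b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    f1, f2 = lfilter(b, a, v1), lfilter(b, a, v2)
    n = min(len(f1), len(f2)) // win
    return np.array([np.linalg.norm(f1[i * win:(i + 1) * win]) /
                     (np.linalg.norm(f2[i * win:(i + 1) * win]) + eps)
                     for i in range(n)])
```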
Once generated, the vector of the ratio R versus time (or the matrix of R versus time if multiple subbands are used) can be used with any detection system (such as one that uses fixed and/or adaptive thresholds) to determine when speech is occurring. While many detection systems and methods are known to exist by those skilled in the art and may be used, the method described herein for generating an R so that the speech is easily discernable is novel. It is important to note that the R does not depend on the type of noise or its orientation or frequency content; R simply depends on the V1 and V2 spatial response similarity for noise and spatial response dissimilarity for speech. In this way it is very robust and can operate smoothly in a variety of noisy acoustic environments.

FIG. 5 is a flow diagram of acoustic voice activity detection 500, under an embodiment. The detection comprises forming a first virtual microphone by combining a first signal of a first physical microphone and a second signal of a second physical microphone 502. The detection comprises forming a filter that describes a relationship for speech between the first physical microphone and the second physical microphone 504.
`
The detection comprises forming a second virtual microphone by applying the filter to the first signal to generate a first intermediate signal, and summing the first intermediate signal and the second signal 506. The detection comprises generating an energy ratio of energies of the first virtual microphone and the second virtual microphone 508. The detection comprises detecting acoustic voice activity of a speaker when the energy ratio is greater than a threshold value 510.
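Tying steps 502-510 together, a compact, self-contained sketch of flow 500 under the same simplifications as the earlier sketches (scalar α, integer-sample γ, FIR β̃ taps, and a threshold of 2 taken from the approximate speech ratio quoted earlier; the subtraction forming V2 realizes the "summing" of 506 with the filtered intermediate signal negated, per the V2 equation above):

```python
import numpy as np

def avad_500(o1: np.ndarray, o2: np.ndarray, beta_taps: np.ndarray,
             gamma: int, alpha: float = 1.0, win: int = 200,
             threshold: float = 2.0, eps: float = 1e-12) -> np.ndarray:
    """Steps 502-510 of FIG. 5; returns one boolean VAD flag per window."""
    o1d = np.concatenate([np.zeros(gamma), o1])[:len(o1)]    # O1 * z^-gamma
    o2c = alpha * o2                                         # calibrated O2
    # 502: first virtual microphone from the two physical signals.
    v1 = o1d - np.convolve(o2c, beta_taps)[:len(o1d)]
    # 504/506: apply the speech-relationship filter to the delayed O1,
    # then combine with the calibrated O2 to form the second virtual mic.
    v2 = o2c - np.convolve(o1d, beta_taps)[:len(o2c)]
    # 508: windowed ratio of the virtual microphone energies (as norms).
    n = min(len(v1), len(v2)) // win
    r = np.array([np.linalg.norm(v1[i * win:(i + 1) * win]) /
                  (np.linalg.norm(v2[i * win:(i + 1) * win]) + eps)
                  for i in range(n)])
    # 510: voice activity wherever the ratio exceeds the threshold.
    return r > threshold
```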
The accuracy of the adaptation to the β(z) of the system is a factor in determining the effectiveness of the AVAD. A more accurate adaptation to the actual β(z) of the system leads to lower energy of the speech response in V2, and a higher ratio R. The noise (far-field) magnitude response is largely unchanged by the adaptation process, so the ratio R will be near unity for accurately adapted beta. For purposes of accuracy, the system can be trained on speech alone, or the noise should be low enough in energy so as not to affect, or to have a minimal effect on, the training.
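As a rough self-test in the same spirit, the near-unity noise ratio can be used to verify the adaptation: on a noise-only stretch the windowed ratio should sit near 1, and on speech it should rise well above it. The 1.1 and 2 figures are the approximate values quoted earlier; the use of medians is an assumption:

```python
import numpy as np

def adaptation_ok(r_noise: np.ndarray, r_speech: np.ndarray) -> bool:
    """Heuristic check that beta~ adapted well: R near unity on noise-only
    windows and well above unity on speech-only windows."""
    return float(np.median(r_noise)) < 1.1 and float(np.median(r_speech)) > 2.0
```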
To make the training as accurate as possible, the coeffi-