`Case 6:21-cv-00984-ADA Document 55-1 Filed 05/25/22 Page 1 of 19
`
`
`
`
`
`
`
`
`EXHIBIT 1
`
`
`
US007246058B2

(12) United States Patent
     Burnett

(10) Patent No.: US 7,246,058 B2
(45) Date of Patent: Jul. 17, 2007
`
`(54) DETECTING VOICED AND UNVOICED
`SPEECH USING BOTH ACOUSTIC AND
`NONACOUSTIC SENSORS
`
`(75)
`
`J
`
`nventor:
`
`Gregory C. B
`- Burnett,
`rego
`(US)
`
`Li
`Livermore,
`
`CA
`
`(73) Assignee: Aliph, Inc., San Francisco, CA (US)
(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 688 days.
`
`(21) Appl. No.: 10/159,770
`
`(22)
`
`(65)
`
`4.
`Filed:
`
`May 30, 2002
`
`Prior Publication Data
`
`US 2002/0198705 Al
`
`Dec. 26, 2002
`
Related U.S. Application Data

(60) Provisional application No. 60/294,383, filed on May 30, 2001, provisional application No. 60/335,100, filed on Oct. 30, 2001, provisional application No. 60/332,202, filed on Nov. 21, 2001, provisional application No. 60/362,162, filed on Mar. 5, 2002, provisional application No. 60/362,103, filed on Mar. 5, 2002, provisional application No. 60/362,170, filed on Mar. 5, 2002, provisional application No. 60/361,981, filed on Mar. 5, 2002, provisional application No. 60/362,161, filed on Mar. 5, 2002, provisional application No. 60/368,209, filed on Mar. 27, 2002, provisional application No. 60/368,208, filed on Mar. 27, 2002, provisional application No. 60/368,343, filed on Mar. 27, 2002.

(51) Int. Cl.
     G10L 11/06 (2006.01)
(52) U.S. Cl. ........................ 704/226; 704/214
(58) Field of Classification Search ............... None
     See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS

3,789,166 A    1/1974  Sebesta
(Continued)

FOREIGN PATENT DOCUMENTS

EP    0 637 187 A    2/1995
(Continued)

OTHER PUBLICATIONS

Gregory C. Burnett: "The Physiological Basis of Glottal Electromagnetic Micropower Sensors (GEMS) and Their Use in Defining an Excitation Function for the Human Vocal Tract", Dissertation, University of California at Davis, Jan. 1999, USA.
(Continued)

Primary Examiner—Abul K. Azad
(74) Attorney, Agent, or Firm—Courtney Staniford & Gregory LLP

(57) ABSTRACT

Systems and methods are provided for detecting voiced and unvoiced speech in acoustic signals having varying levels of background noise. The systems receive acoustic signals at two microphones, and generate difference parameters between the acoustic signals received at each of the two microphones. The difference parameters are representative of the relative difference in signal gain between portions of the received acoustic signals. The systems identify information of the acoustic signals as unvoiced speech when the difference parameters exceed a first threshold, and identify information of the acoustic signals as voiced speech when the difference parameters exceed a second threshold. Further, embodiments of the systems include non-acoustic sensors that receive physiological information to aid in identifying voiced speech.

5 Claims, 10 Drawing Sheets
`
`
`
`
`
`
U.S. PATENT DOCUMENTS

4,006,318 A     2/1977  Sebesta et al.
4,591,668 A     5/1986  Iwata
4,653,102 A *   3/1987  Hansen .................. 381/92
4,777,649 A *  10/1988  Carlson et al. .......... 704/233
4,901,354 A     2/1990  Gollmar et al.
5,097,515 A     3/1992  Baba
5,212,764 A     5/1993  Ariyoshi
5,400,409 A     3/1995  Linhard
5,406,622 A     4/1995  Silverberg et al.
5,414,776 A     5/1995  Sims, Jr.
5,473,702 A    12/1995  Yoshida et al.
5,515,865 A     5/1996  Scanlon et al.
5,517,435 A     5/1996  Sugiyama
5,539,859 A     7/1996  Robbe et al.
5,590,241 A *  12/1996  Park et al. ............. 704/227
5,633,935 A     5/1997  Kanamori et al.
5,649,055 A     7/1997  Gupta et al.
5,664,052 A     9/1997  Nishiguchi et al.
5,684,460 A    11/1997  Scanlon et al.
5,729,694 A     3/1998  Holzrichter et al.
5,754,665 A     5/1998  Hosoi et al.
5,835,608 A    11/1998  Warnaka et al.
5,853,005 A    12/1998  Scanlon
5,917,921 A     6/1999  Sasaki et al.
5,966,090 A    10/1999  McEwan
5,986,600 A    11/1999  McEwan
6,006,175 A *  12/1999  Holzrichter ............. 704/208
6,009,396 A    12/1999  Nagata
6,069,963 A     5/2000  Martin et al.
6,191,724 B1    2/2001  McEwan
6,233,551 B1 *  5/2001  Cho et al. .............. 704/208
6,266,422 B1    7/2001  Ikeda
6,430,295 B1 *  8/2002  Handel et al. ........... 704/214
2002/0039425 A1  4/2002  Burnett et al.
`
FOREIGN PATENT DOCUMENTS

EP    0 795 851 A2     9/1997
EP    0 984 660 A2     3/2000
JP    2000 312 395    11/2000
JP    2001 189 987     7/2001
WO    WO 02 07151      1/2002
`
OTHER PUBLICATIONS

Todd J. Gable et al.: "Speaker Verification Using Combined Acoustic and EM Sensor Signal Processing", IEEE Intl. Conf. on Acoustics, Speech & Signal Processing (ICASSP-2001), Salt Lake City, USA, 2001.
A. Hussain: "Intelligibility Assessment of a Multi-Band Speech Enhancement Scheme", Proceedings IEEE Intl. Conf. on Acoustics, Speech & Signal Processing (ICASSP-2000), Istanbul, Turkey, Jun. 2000.
Zhao Li et al.: "Robust Speech Coding Using Microphone Arrays", Signals, Systems and Computers, 1997, Conf. Record of 31st Asilomar Conf., Nov. 2-5, 1997, IEEE Comput. Soc., USA.
L. C. Ng et al.: "Denoising of Human Speech Using Combined Acoustic and EM Sensor Signal Processing", 2000 IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, Proceedings (Cat. No. 00CH37100), Istanbul, Turkey, Jun. 5-9, 2000, XP002186255, ISBN 0-7803-6293-4.
S. Affes et al.: "A Signal Subspace Tracking Algorithm for Microphone Array Processing of Speech", IEEE Transactions on Speech and Audio Processing, NY, USA, vol. 5, No. 5, Sep. 1, 1997, XP000774303, ISSN 1063-6676.
`
`* cited by examiner
`
`
`
[Sheet 1 of 10, FIG. 1: block diagram of the NAVSAD system 100. Microphones 10 and voicing sensors 20 are coupled to a processor 30 containing the detection subsystem 50 and the denoising subsystem 40.]
`
`
`
[Sheet 2 of 10, FIG. 2: block diagram of the PSAD system 200. Microphones 10 are coupled to a processor 30 containing the detection subsystem 50 and the denoising subsystem 40.]
`
`
`
[Sheet 3 of 10, FIG. 3: block diagram of the Pathfinder denoising system: MIC 1 and MIC 2 signal inputs, voicing information 20, noise removal, and cleaned speech output.]
`
`
`
`
[Sheet 4 of 10, FIG. 4: flow diagram of the detection algorithm 50.
NAVSAD path: V(window) = 0; read in 20 msec of data from m1, m2, gems; calculate XCORR of m1, gems; calc mean(abs(XCORR)) = MC; calc STD DEV of gems = GSD; if MC > VTC and GSD > VTS, then V(window) = 2 and bhi = bhi_old; otherwise old_std = new_std, keep_old = 0, bhi_old = bhi.
PSAD path: UV = [0 0]; filter m1 and m2 into 2 bands, 1500-2500 and 2500-3500 Hz; calculate bhi using Pathfinder for each subband; new_sum = sum(abs(bhi)); if not keep_old or at beginning, add new_sum to new_sum_vector (ff numbers long); new_std = STD DEV of new_sum_vector; if not keep_old or at beginning, shift sd_ma_vector to the right; replace first value in sd_ma_vector with old_std; filter sd_ma_vector with moving average filter to get sd_ma; if (new_std > UV_ma*sd_ma and new_std > UV_sd) or at the beginning, UV(subband) = 2, bhi = bhi_old, keep_old = 1; after both subbands checked, if CEIL(SUM(UV)/2) = 1, V(window) = 1.
Constants: V = 0 if noise, 1 if UV, 2 if V; VTC = voiced threshold for corr; VTS = voiced threshold for std. dev.; ff = forgetting factor for std. dev.; num_ma = # of taps in m.a. filter; UV_ma = UV std dev m.a. thresh; UV_std = UV std dev threshold; UV = binary values denoting UV detected in each subband; num_begin = # win at "beginning".
Variables: bhi = LMS calc of MIC 1-2 TF; keep_old = 1 if last win V/UV, 0 otherwise; sd_ma_vector = last NV sd values; sd_ma = m.a. of the last NV sd.]
`
`
`
[Sheet 5 of 10, FIG. 5A: GEMS and mean correlation.]
`
`
`
[Sheet 6 of 10, FIG. 6 (plot 600): voicing 602 plotted with the acoustic noise.]
`
`
`
`
`
`
`
`
`
`
[Sheet 7 of 10, FIG. 7: linear microphone array with the mouth on the array midline.]
`
`
`
[Sheet 8 of 10, FIG. 8 (plot 800): d1 versus ΔM for Δd = 1, 2, 3, 4 cm; horizontal axis d1 (cm), 0 to 30; vertical axis ΔM.]
`
`
`
[Sheet 9 of 10, FIG. 9 (plot 900): acoustic data (solid) and gain parameter (dashed) versus time (samples).]
`
`
`
[Sheet 10 of 10, FIG. 10 (plot 1000): Mic 1 and V for "pop pan" in \headmic\micgems_p1.bin. Traces: voicing signal 1002, audio signal, and GEMS signal 1006, with voiced, unvoiced, and not-voiced levels marked; horizontal axis time (samples), 0 to 4 x 10^4.]
`
`
`
`DETECTING VOICED AND UNVOICED
`SPEECH USING BOTH ACOUSTIC AND
`NONACOUSTIC SENSORS
`
`RELATED APPLICATIONS
`
`This application claims the benefit of U.S. application
`Nos. 60/294,383 filed May 30, 2001; 09/905,361 filed Jul.
`12, 2001; 60/335,100 filed Oct. 30, 2001; 60/332,202 and
`09/990,847, both filed Nov. 21, 2001; 60/362,103, 60/362,
`161, 60/362,162, 60/362,170, and 60/361,981, all filed Mar.
`5, 2002; 60/368,208, 60/368,209, and 60/368,343, all filed
`Mar. 27, 2002; all of which are incorporated herein by
`reference in their entirety.
`
`TECHNICAL FIELD
`
`The disclosed embodiments relate to the processing of
`speech signals.
`
`BACKGROUND
`
`The ability to correctly identify voiced and unvoiced
`speech is critical to many speech applications including
`speech recognition, speaker verification, noise suppression,
`and many others. In a typical acoustic application, speech
`from a human speaker is captured and transmitted to a
`receiver in a different location. In the speaker’s environment
`there may exist one or more noise sources that pollute the
`speech signal, or the signal of interest, with unwanted
`acoustic noise. This makes it difficult or impossible for the
`receiver, whether human or machine,
`to understand the
`user’s speech.
Typical methods for classifying voiced and unvoiced speech have relied mainly on the acoustic content of microphone data, which is plagued by problems with noise and the corresponding uncertainties in signal content. This is especially problematic now with the proliferation of portable communication devices like cellular telephones and personal digital assistants because, in many cases, the quality of service provided by the device depends on the quality of the voice services offered by the device. There are methods known in the art for suppressing the noise present in the speech signals, but these methods demonstrate performance shortcomings that include unusually long computing time, requirements for cumbersome hardware to perform the signal processing, and distortion of the signals of interest.
`
`BRIEF DESCRIPTION OF THE FIGURES
`
FIG. 1 is a block diagram of a NAVSAD system, under an embodiment.

FIG. 2 is a block diagram of a PSAD system, under an embodiment.

FIG. 3 is a block diagram of a denoising system, referred to herein as the Pathfinder system, under an embodiment.

FIG. 4 is a flow diagram of a detection algorithm for use in detecting voiced and unvoiced speech, under an embodiment.

FIG. 5A plots the received GEMS signal for an utterance along with the mean correlation between the GEMS signal and the Mic 1 signal and the threshold for voiced speech detection.

FIG. 5B plots the received GEMS signal for an utterance along with the standard deviation of the GEMS signal and the threshold for voiced speech detection.
`
`20
`
`25
`
`30
`
`40
`
`50
`
`55
`
`60
`
`65
`
`2
FIG. 6 plots voiced speech detected from an utterance along with the GEMS signal and the acoustic noise.

FIG. 7 is a microphone array for use under an embodiment of the PSAD system.

FIG. 8 is a plot of ΔM versus d1 for several Δd values, under an embodiment.

FIG. 9 shows a plot of the gain parameter as the sum of the absolute values of H1(z) and the acoustic data or audio from microphone 1.
`FIG. 10 is an alternative plot of acoustic data presented in
`FIG. 9.
`
`In the figures, the same reference numbers identify iden-
`tical or substantially similar elements or acts.
`Any headings provided herein are for convenience only
`and do not necessarily affect the scope or meaning of the
`claimed invention.
`
`DETAILED DESCRIPTION
`
Systems and methods for discriminating voiced and unvoiced speech from background noise are provided below, including a Non-Acoustic Sensor Voiced Speech Activity Detection (NAVSAD) system and a Pathfinder Speech Activity Detection (PSAD) system. The noise removal and reduction methods provided herein, while allowing for the separation and classification of unvoiced and voiced human speech from background noise, address the shortcomings of typical systems known in the art by cleaning acoustic signals of interest without distortion.
FIG. 1 is a block diagram of a NAVSAD system 100, under an embodiment. The NAVSAD system couples microphones 10 and sensors 20 to at least one processor 30. The sensors 20 of an embodiment include voicing activity detectors or non-acoustic sensors. The processor 30 controls subsystems including a detection subsystem 50, referred to herein as a detection algorithm, and a denoising subsystem 40. Operation of the denoising subsystem 40 is described in detail in the Related Applications. The NAVSAD system works extremely well in any background acoustic noise environment.
`
FIG. 2 is a block diagram of a PSAD system 200, under an embodiment. The PSAD system couples microphones 10 to at least one processor 30. The processor 30 includes a detection subsystem 50, referred to herein as a detection algorithm, and a denoising subsystem 40. The PSAD system is highly sensitive in low acoustic noise environments and relatively insensitive in high acoustic noise environments. The PSAD can operate independently or as a backup to the NAVSAD, detecting voiced speech if the NAVSAD fails.
Note that the detection subsystems 50 and denoising subsystems 40 of both the NAVSAD and PSAD systems of an embodiment are algorithms controlled by the processor 30, but are not so limited. Alternative embodiments of the NAVSAD and PSAD systems can include detection subsystems 50 and/or denoising subsystems 40 that comprise additional hardware, firmware, software, and/or combinations of hardware, firmware, and software. Furthermore, functions of the detection subsystems 50 and denoising subsystems 40 may be distributed across numerous components of the NAVSAD and PSAD systems.
FIG. 3 is a block diagram of a denoising subsystem 300, referred to herein as the Pathfinder system, under an embodiment. The Pathfinder system is briefly described below, and is described in detail in the Related Applications. Two microphones Mic 1 and Mic 2 are used in the Pathfinder system, and Mic 1 is considered the "signal" microphone. With reference to FIG. 1, the Pathfinder system 300 is
`equivalent to the NAVSAD system 100 when the voicing
`activity detector (VAD) 320 is a non-acoustic voicing sensor
`20 and the noise removal subsystem 340 includes the
`detection subsystem 50 and the denoising subsystem 40.
`With reference to FIG. 2,
`the Pathfinder system 300 is
`equivalent to the PSAD system 200 in the absence of the
`VAD 320, and when the noise removal subsystem 340
`includes the detection subsystem 50 and the denoising
`subsystem 40.
The NAVSAD and PSAD systems support a two-level commercial approach in which (i) a relatively less expensive PSAD system supports an acoustic approach that functions in most low- to medium-noise environments, and (ii) a NAVSAD system adds a non-acoustic sensor to enable detection of voiced speech in any environment. Unvoiced speech is normally not detected using the sensor, as it normally does not sufficiently vibrate human tissue. However, in high noise situations detecting the unvoiced speech is not as important, as it is normally very low in energy and easily washed out by the noise. Therefore in high noise environments the unvoiced speech is unlikely to affect the voiced speech denoising. Unvoiced speech information is most important in the presence of little to no noise and, therefore, the unvoiced detection should be highly sensitive in low noise situations, and insensitive in high noise situations. This is not easily accomplished, and comparable acoustic unvoiced detectors known in the art are incapable of operating under these environmental constraints.
The NAVSAD and PSAD systems include an array algorithm for speech detection that uses the difference in frequency content between two microphones to calculate a relationship between the signals of the two microphones. This is in contrast to conventional arrays that attempt to use the time/phase difference of each microphone to remove the noise outside of an "area of sensitivity". The methods described herein provide a significant advantage, as they do not require a specific orientation of the array with respect to the signal.
Further, the systems described herein are sensitive to noise of every type and every orientation, unlike conventional arrays that depend on specific noise orientations. Consequently, the frequency-based arrays presented herein are unique as they depend only on the relative orientation of the two microphones themselves, with no dependence on the orientation of the noise and signal with respect to the microphones. This results in a robust signal processing system with respect to the type of noise, microphones, and orientation between the noise/signal source and the microphones.
The systems described herein use the information derived from the Pathfinder noise suppression system and/or a non-acoustic sensor described in the Related Applications to determine the voicing state of an input signal, as described in detail below. The voicing state includes silent, voiced, and unvoiced states. The NAVSAD system, for example, includes a non-acoustic sensor to detect the vibration of human tissue associated with speech. The non-acoustic sensor of an embodiment is a General Electromagnetic Movement Sensor (GEMS) as described briefly below and in detail in the Related Applications, but is not so limited. Alternative embodiments, however, may use any sensor that is able to detect human tissue motion associated with speech and is unaffected by environmental acoustic noise.

The GEMS is a radio frequency device (2.4 GHz) that allows the detection of moving human tissue dielectric interfaces. The GEMS includes an RF interferometer that uses homodyne mixing to detect small phase shifts associated with target motion. In essence, the sensor sends out weak electromagnetic waves (less than 1 milliwatt) that reflect off of whatever is around the sensor. The reflected waves are mixed with the original transmitted waves and the results analyzed for any change in position of the targets. Anything that moves near the sensor will cause a change in phase of the reflected wave that will be amplified and displayed as a change in voltage output from the sensor. A similar sensor is described by Gregory C. Burnett (1999) in "The physiological basis of glottal electromagnetic micropower sensors (GEMS) and their use in defining an excitation function for the human vocal tract"; Ph.D. Thesis, University of California at Davis.
FIG. 4 is a flow diagram of a detection algorithm 50 for use in detecting voiced and unvoiced speech, under an embodiment. With reference to FIGS. 1 and 2, both the NAVSAD and PSAD systems of an embodiment include the detection algorithm 50 as the detection subsystem 50. This detection algorithm 50 operates in real-time and, in an embodiment, operates on 20 millisecond windows and steps 10 milliseconds at a time, but is not so limited. The voice activity determination is recorded for the first 10 milliseconds, and the second 10 milliseconds functions as a "look-ahead" buffer. While an embodiment uses the 20/10 windows, alternative embodiments may use numerous other combinations of window values.
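
For concreteness, a minimal sketch of this 20/10 framing follows (Python; the helper name and the 8000 Hz sample rate are illustrative assumptions, drawn from the XCORR discussion below rather than prescribed by the patent):

```python
import numpy as np

FS = 8000           # sample rate (Hz); assumed from the 8000 Hz XCORR discussion below
FRAME_MS = 20       # 20 millisecond analysis window, per the embodiment above
STEP_MS = 10        # 10 millisecond step; the trailing half is the "look-ahead" buffer

def windows(signal, fs=FS, frame_ms=FRAME_MS, step_ms=STEP_MS):
    """Yield successive 20 ms windows stepped 10 ms at a time."""
    frame = fs * frame_ms // 1000
    step = fs * step_ms // 1000
    for start in range(0, len(signal) - frame + 1, step):
        yield signal[start:start + frame]
```
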
`
Consideration was given to a number of multi-dimensional factors in developing the detection algorithm 50. The biggest consideration was maintaining the effectiveness of the Pathfinder denoising technique, described in detail in the Related Applications and reviewed herein. Pathfinder performance can be compromised if the adaptive filter training is conducted on speech rather than on noise. It is therefore important not to exclude any significant amount of speech from the VAD to keep such disturbances to a minimum.
`Consideration was also given to the accuracy of the
`characterization between voiced and unvoiced speech sig-
`nals, and distinguishing each of these speech signals from
`noise signals. This type of characterization can be useful in
`such applications as speech recognition and speaker verifi-
`cation.
Furthermore, the systems using the detection algorithm of an embodiment function in environments containing varying amounts of background acoustic noise. If the non-acoustic sensor is available, this external noise is not a problem for voiced speech. However, for unvoiced speech (and voiced if the non-acoustic sensor is not available or has malfunctioned) reliance is placed on acoustic data alone to separate noise from unvoiced speech. An advantage inheres in the use of two microphones in an embodiment of the Pathfinder noise suppression system, and the spatial relationship between the microphones is exploited to assist in the detection of unvoiced speech. However, there may occasionally be noise levels high enough that the speech will be nearly undetectable and the acoustic-only method will fail. In these situations, the non-acoustic sensor (or hereafter just the sensor) will be required to ensure good performance.
In the two-microphone system, the speech source should be relatively louder in one designated microphone when compared to the other microphone. Tests have shown that this requirement is easily met with conventional microphones when the microphones are placed on the head, as any noise should result in an H1 with a gain near unity.
Regarding the NAVSAD system, and with reference to FIG. 1 and FIG. 3, the NAVSAD relies on two parameters to detect voiced speech. These two parameters include the energy of the sensor in the window of interest, determined in an embodiment by the standard deviation (SD), and optionally the cross-correlation (XCORR) between the acoustic signal from microphone 1 and the sensor data. The energy of the sensor can be determined in any one of a number of ways, and the SD is just one convenient way to determine the energy.
For the sensor, the SD is akin to the energy of the signal, which normally corresponds quite accurately to the voicing state, but may be susceptible to movement noise (relative motion of the sensor with respect to the human user) and/or electromagnetic noise. To further differentiate sensor noise from tissue motion, the XCORR can be used. The XCORR is only calculated to 15 delays, which corresponds to just under 2 milliseconds at 8000 Hz.
`
The XCORR can also be useful when the sensor signal is distorted or modulated in some fashion. For example, there are sensor locations (such as the jaw or back of the neck) where speech production can be detected but where the signal may have incorrect or distorted time-based information. That is, they may not have well defined features in time that will match with the acoustic waveform. However, XCORR is more susceptible to errors from acoustic noise, and in high-noise (<0 dB SNR) environments is almost useless. Therefore it should not be the sole source of voicing information.
The sensor detects human tissue motion associated with the closure of the vocal folds, so the acoustic signal produced by the closure of the folds is highly correlated with the closures. Therefore, sensor data that correlates highly with the acoustic signal is declared as speech, and sensor data that does not correlate well is termed noise. The acoustic data is expected to lag behind the sensor data by about 0.1 to 0.8 milliseconds (or about 1-7 samples) as a result of the delay time due to the relatively slower speed of sound (around 330 m/s). However, an embodiment uses a 15-sample correlation, as the acoustic wave shape varies significantly depending on the sound produced, and a larger correlation width is needed to ensure detection.
`
The SD and XCORR signals are related, but are sufficiently different so that the voiced speech detection is more reliable. For simplicity, though, either parameter may be used. The values for the SD and XCORR are compared to empirical thresholds, and if both are above their threshold, voiced speech is declared. Example data is presented and described below.
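
As a sketch of that decision, assuming the windowing helper above and with the threshold names VTC and VTS borrowed from FIG. 4 (their default values here are placeholders, not values from the patent, and the correlation normalization is one reasonable choice among several):

```python
def navsad_voiced(mic1_win, gems_win, vtc=0.5, vts=0.1, max_lag=15):
    """Declare voiced speech when both the mean absolute cross-correlation
    (computed to 15 delays, per the text above) and the GEMS standard
    deviation exceed their empirical thresholds."""
    gsd = np.std(gems_win)                       # SD as the sensor-energy proxy
    m1 = mic1_win - np.mean(mic1_win)
    g = gems_win - np.mean(gems_win)
    scale = np.std(m1) * np.std(g) * len(m1) + 1e-12
    xcorr = [np.dot(m1[lag:], g[:len(g) - lag]) / scale for lag in range(max_lag)]
    mc = np.mean(np.abs(xcorr))                  # mean correlation, as in FIG. 4
    return mc > vtc and gsd > vts
```
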
`
FIGS. 5A, 5B, and 6 show data plots for an example in which a subject twice speaks the phrase "pop pan", under an embodiment. FIG. 5A plots the received GEMS signal 502 for this utterance along with the mean correlation 504 between the GEMS signal and the Mic 1 signal and the threshold T1 used for voiced speech detection. FIG. 5B plots the received GEMS signal 502 for this utterance along with the standard deviation 506 of the GEMS signal and the threshold T2 used for voiced speech detection. FIG. 6 plots voiced speech 602 detected from the acoustic or audio signal 608, along with the GEMS signal 604 and the acoustic noise 606; no unvoiced speech is detected in this example because of the heavy background babble noise 606. The thresholds have been set so that there are virtually no false negatives, and only occasional false positives. A voiced speech activity detection accuracy of greater than 99% has been attained under any acoustic background noise conditions.
The NAVSAD can determine when voiced speech is occurring with high degrees of accuracy due to the non-acoustic sensor data. However, the sensor offers little assistance in separating unvoiced speech from noise, as unvoiced speech normally causes no detectable signal in most non-acoustic sensors. If there is a detectable signal, the NAVSAD can be used, although use of the SD method is dictated as unvoiced speech is normally poorly correlated. In the absence of a detectable signal, use is made of the system and methods of the Pathfinder noise removal algorithm in determining when unvoiced speech is occurring. A brief review of the Pathfinder algorithm is given below, while a detailed description is provided in the Related Applications.
With reference to FIG. 3, the acoustic information coming into Microphone 1 is denoted by m1(n), the information coming into Microphone 2 is similarly labeled m2(n), and the GEMS sensor is assumed available to determine voiced speech areas. In the z (digital frequency) domain, these signals are represented as M1(z) and M2(z). Then

    M1(z) = S(z) + N2(z)
    M2(z) = N(z) + S2(z)

with

    N2(z) = N(z)H1(z)
    S2(z) = S(z)H2(z)

so that

    M1(z) = S(z) + N(z)H1(z)
    M2(z) = N(z) + S(z)H2(z)                                    (1)
`
This is the general case for all two-microphone systems. There is always going to be some leakage of noise into Mic 1, and some leakage of signal into Mic 2. Equation 1 has four unknowns and only two relationships, and cannot be solved explicitly.
However, there is another way to solve for some of the unknowns in Equation 1. Examine the case where the signal is not being generated, that is, where the GEMS signal indicates voicing is not occurring. In this case, s(n) = S(z) = 0, and Equation 1 reduces to

    M1n(z) = N(z)H1(z)
    M2n(z) = N(z)

where the n subscript on the M variables indicates that only noise is being received. This leads to

    M1n(z) = M2n(z)H1(z)

    H1(z) = M1n(z) / M2n(z)                                     (2)

H1(z) can be calculated using any of the available system identification algorithms and the microphone outputs when only noise is being received. The calculation can be done adaptively, so that if the noise changes significantly H1(z) can be recalculated quickly.
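
One conventional choice for that adaptive calculation is a normalized LMS filter; the patent does not prescribe a particular algorithm, so the following is only an illustrative sketch (h1 is a NumPy array of FIR taps):

```python
def update_h1(h1, mic1_win, mic2_win, mu=0.1, eps=1e-8):
    """One NLMS pass over a noise-only window: adapt the FIR estimate
    of H1(z) so that Mic 2 filtered by h1 predicts Mic 1 (Equation 2)."""
    taps = len(h1)
    for n in range(taps, len(mic2_win)):
        x = mic2_win[n - taps:n][::-1]           # most recent Mic 2 samples
        err = mic1_win[n] - np.dot(h1, x)        # prediction error
        h1 += (mu / (np.dot(x, x) + eps)) * err * x
    return h1
```
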
With a solution for one of the unknowns in Equation 1, a solution can be found for another, H2(z), by using the amplitude of the GEMS or similar device along with the amplitude of the two microphones. When the GEMS indicates voicing, but the recent (less than 1 second) history of the microphones indicates low levels of noise, assume that n(n) = N(z) ≈ 0. Then Equation 1 reduces to

    M1s(z) = S(z)
    M2s(z) = S(z)H2(z)

which in turn leads to

    M2s(z) = M1s(z)H2(z)

    H2(z) = M2s(z) / M1s(z)

which is the inverse of the H1(z) calculation, but note that different inputs are being used.
After calculating H1(z) and H2(z) above, they are used to remove the noise from the signal. Rewrite Equation 1 as

    S(z) = M1(z) - N(z)H1(z)
    N(z) = M2(z) - S(z)H2(z)
    S(z) = M1(z) - [M2(z) - S(z)H2(z)]H1(z)
    S(z)[1 - H2(z)H1(z)] = M1(z) - M2(z)H1(z)

and solve for S(z) as

    S(z) = [M1(z) - M2(z)H1(z)] / [1 - H2(z)H1(z)]              (3)

In practice H2(z) is usually quite small, so that H2(z)H1(z) << 1, and

    S(z) ≈ M1(z) - M2(z)H1(z),

obviating the need for the H2(z) calculation.
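
A minimal frequency-domain sketch of this simplified subtraction, assuming the NLMS taps h1 from the sketch above (it ignores the block-convolution edge effects a production implementation would handle):

```python
def denoise_window(mic1_win, mic2_win, h1):
    """Approximate S(z) = M1(z) - M2(z)H1(z) for one analysis window,
    valid when H2(z)H1(z) << 1 as noted above."""
    n = len(mic1_win)
    m1_f = np.fft.rfft(mic1_win)
    m2_f = np.fft.rfft(mic2_win)
    h1_f = np.fft.rfft(h1, n)                    # FIR taps -> frequency response
    return np.fft.irfft(m1_f - m2_f * h1_f, n)   # cleaned-speech estimate s(n)
```
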
With reference to FIG. 2 and FIG. 3, the PSAD system is described. As sound waves propagate, they normally lose energy as they travel due to diffraction and dispersion. Assuming that the sound waves originate from a point source and radiate isotropically, their amplitude will decrease as a function of 1/r, where r is the distance from the originating point. This 1/r dependence of the amplitude is the worst case; if the propagation is confined to a smaller area the reduction will be less. However it is an adequate model for the configurations of interest, specifically the propagation of noise and speech to microphones located somewhere on the user's head.
FIG. 7 is a microphone array for use under an embodiment of the PSAD system. Placing the microphones Mic 1 and Mic 2 in a linear array with the mouth on the array midline, the difference in signal strength between Mic 1 and Mic 2 (assuming the microphones have identical frequency responses) will be proportional to both d1 and Δd. Assuming a 1/r (or in this case 1/d) relationship, it is seen that

    |Mic 1| / |Mic 2| = ΔM(z) ∝ (d1 + Δd) / d1

where ΔM is the difference in gain between Mic 1 and Mic 2 and therefore H1(z), as above in Equation 2. The variable d1 is the distance from Mic 1 to the speech or noise source.
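
A sketch of the two quantities this relationship connects, for one analysis window (the helper names are illustrative; the decision thresholds applied to ΔM are tuned empirically, per FIG. 8):

```python
def measured_gain(mic1_win, mic2_win, eps=1e-8):
    """Measured gain difference dM between the microphones; near unity
    for distant noise (large d1), large for speech close to Mic 1."""
    return np.std(mic1_win) / (np.std(mic2_win) + eps)

def predicted_gain(d1, delta_d):
    """Gain ratio predicted by the 1/d model: (d1 + delta_d) / d1."""
    return (d1 + delta_d) / d1
```
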
`
`8
`FIG.8 is a plot 800 of AM versus d, for several Ad values,
`under an embodiment.It is clear that as Ad becomes larger
`and the noise so



