US 20020198705A1

(19) United States
(12) Patent Application Publication          (10) Pub. No.: US 2002/0198705 A1
     Burnett                                 (43) Pub. Date: Dec. 26, 2002

(54) DETECTING VOICED AND UNVOICED SPEECH USING BOTH ACOUSTIC AND NONACOUSTIC SENSORS

(76) Inventor: Gregory C. Burnett, Livermore, CA (US)

     Correspondence Address:
     PERKINS COIE LLP
     P.O. BOX 2168
     MENLO PARK, CA 94026 (US)

(21) Appl. No.: 10/159,770

(22) Filed: May 30, 2002

Related U.S. Application Data

(60) Provisional application No. 60/294,383, filed on May 30, 2001. Provisional application No. 60/335,100, filed on Oct. 30, 2001. Provisional application No. 60/332,202, filed on Nov. 21, 2001. Provisional application No. 60/362,162, filed on Mar. 5, 2002. Provisional application No. 60/362,103, filed on Mar. 5, 2002. Provisional application No. 60/362,170, filed on Mar. 5, 2002. Provisional application No. 60/361,981, filed on Mar. 5, 2002. Provisional application No. 60/362,161, filed on Mar. 5, 2002. Provisional application No. 60/368,209, filed on Mar. 27, 2002. Provisional application No. 60/368,208, filed on Mar. 27, 2002. Provisional application No. 60/368,343, filed on Mar. 27, 2002.

Publication Classification

(51) Int. Cl. .............................. G10L 11/06
(52) U.S. Cl. .............................. 704/214
(57) ABSTRACT

Systems and methods are provided for detecting voiced and unvoiced speech in acoustic signals having varying levels of background noise. The systems receive acoustic signals at two microphones, and generate difference parameters between the acoustic signals received at each of the two microphones. The difference parameters are representative of the relative difference in signal gain between portions of the received acoustic signals. The systems identify information of the acoustic signals as unvoiced speech when the difference parameters exceed a first threshold, and identify information of the acoustic signals as voiced speech when the difference parameters exceed a second threshold. Further, embodiments of the systems include non-acoustic sensors that receive physiological information to aid in identifying voiced speech.
`
`
`
`
[FIG. 1 — Block diagram of a NAVSAD system 100: MICROPHONES 10 and VOICING SENSORS 20 coupled to PROCESSOR 30, which includes DETECTION SUBSYSTEM 50 and DENOISING SUBSYSTEM 40.]
`
`
`
[FIG. 2 — Block diagram of a PSAD system 200: MICROPHONES 10 coupled to PROCESSOR 30, which includes DETECTION SUBSYSTEM 50 and DENOISING SUBSYSTEM 40.]
`
`
`
[FIG. 3 — Block diagram of the Pathfinder denoising system: signal s(n) and noise n(n) arrive at MIC 1 and MIC 2; a VAD supplies voicing information to the noise removal subsystem, which outputs the cleaned signal.]
`
`
`
`
[FIG. 4 — Flow diagram of the detection algorithm 50.

Constants: V = 0 if noise, 1 if UV, 2 if V; VTC = voiced threshold for corr.; VTS = voiced threshold for std. dev.; ff = forgetting factor for std. dev.; num_ma = # of taps in m.a. filter; UV_ma = UV std dev m.a. thresh; UV_std = UV std dev threshold; UV = binary values denoting UV detected in each subband; num_begin = # win at "beginning".

Variables: bhi = LMS calc of MIC 1-2 TF; keep_old = 1 if last win V/UV, 0 otherwise; sd_ma_vector = last NV sd values; sd_ma = m.a. of the last NV sd values.

NAVSAD branch: set V(window) = 0; read in 20 msec of data from m1, m2, gems, stepping 10 msec; calculate the XCORR of m1 and gems and MC = mean(abs(XCORR)); calculate the STD DEV of gems = GSD. If MC > VTC and GSD > VTS, set V(window) = 2 and keep the old filter (bhi = bhi_old).

PSAD branch: set UV = [0 0]; filter m1 and m2 into two bands, 1500-2500 Hz and 2500-3500 Hz; calculate bhi using Pathfinder for each subband; form new_sum = sum(abs(bhi)). If not keep_old or at the beginning, add new_sum to new_sum_vector (ff numbers long) and set new_std = STD DEV of new_sum_vector. If (new_std > UV_ma*sd_ma and new_std > UV_std) or at the beginning, set UV(subband) = 2, bhi = bhi_old, keep_old = 1; otherwise set old_std = new_std, keep_old = 0, bhi_old = bhi. If not keep_old or at the beginning, shift sd_ma_vector to the right, replace its first value with old_std, and filter sd_ma_vector with a moving average filter to get sd_ma. After both subbands are checked, test whether CEIL(SUM(UV)/2) = 1.]
`
`
`
[FIG. 5A — GEMS signal and mean correlation for an utterance, with the voiced-speech detection threshold.]
`
`
`
[FIG. 6 — Plot 600: voiced speech detected from an utterance, with the voicing signal, GEMS signal, and acoustic noise 606.]
`
`
`
[FIG. 7 — Microphone array for use under an embodiment of the PSAD system.]
`
`
`
[FIG. 8 — Plot 800 of ΔM versus d1 for Δd = 1, 2, 3, 4 cm; x-axis d1 (cm), y-axis ΔM.]
`
`
`
[FIG. 9 — Plot 900: acoustic data 904 (solid) and gain parameter (dashed) versus time (samples).]
`
`
[FIG. 10 — Plot 1000 of Mic 1 and V for "pop pan" in \headmic\micgems_p1.bin: voicing signal 1002, audio, and GEMS signal versus time (samples), with the voiced, unvoiced, and not-voiced levels marked.]
`
`
`
`
`DETECTING VOICED AND UNVOICED SPEECH
`USING BOTH ACOUSTIC AND NONACOUSTIC
`SENSORS
`
`RELATED APPLICATIONS
`
`[0001] This application claims the benefit of U.S. Appli-
`cation Nos. 60/294,383 filed May 30, 2001; 09/905,361 filed
`Jul. 12, 2001; 60/335,100 filed Oct. 30, 2001; 60/332,202
`and 09/990,847, both filed Nov. 21, 2001; 60/362,103,
`60/362,161, 60/362,162, 60/362,170, and 60/361,981, all
`filed Mar. 5, 2002; 60/368,208, 60/368,209, and 60/368,343,
`all filed Mar. 27, 2002; all of which are incorporated herein
`by reference in their entirety.
`
`TECHNICAL FIELD
`
`[0002] The disclosed embodiments relate to the process-
`ing of speech signals.
`
`BACKGROUND
`
[0003] The ability to correctly identify voiced and unvoiced speech is critical to many speech applications including speech recognition, speaker verification, noise suppression, and many others. In a typical acoustic application, speech from a human speaker is captured and transmitted to a receiver in a different location. In the speaker's environment there may exist one or more noise sources that pollute the speech signal, or the signal of interest, with unwanted acoustic noise. This makes it difficult or impossible for the receiver, whether human or machine, to understand the user's speech.
`
[0004] Typical methods for classifying voiced and unvoiced speech have relied mainly on the acoustic content of microphone data, which is plagued by problems with noise and the corresponding uncertainties in signal content. This is especially problematic now with the proliferation of portable communication devices like cellular telephones and personal digital assistants because, in many cases, the quality of service provided by the device depends on the quality of the voice services offered by the device. There are methods known in the art for suppressing the noise present in the speech signals, but these methods demonstrate performance shortcomings that include unusually long computing time, requirements for cumbersome hardware to perform the signal processing, and distorting the signals of interest.
`
`BRIEF DESCRIPTION OF THE FIGURES
`
`[0005] FIG. 1 is a block diagram of a NAVSAD system,
`under an embodiment.
`
[0006] FIG. 2 is a block diagram of a PSAD system, under an embodiment.
`
`[0007] FIG. 3 is a block diagram of a denoising system,
`referred to herein as the Pathfinder system, under an embodi-
`ment.
`
`[0008] FIG. 4 is a flow diagram of a detection algorithm
`for use in detecting voiced and unvoiced speech, under an
`embodiment.
`
[0009] FIG. 5A plots the received GEMS signal for an utterance along with the mean correlation between the GEMS signal and the Mic 1 signal and the threshold for voiced speech detection.
`
`
[0010] FIG. 5B plots the received GEMS signal for an utterance along with the standard deviation of the GEMS signal and the threshold for voiced speech detection.
`
[0011] FIG. 6 plots voiced speech detected from an utterance along with the GEMS signal and the acoustic noise.
`
`[0012] FIG. 7 is a microphone array for use under an
`embodiment of the PSAD system.
`
[0013] FIG. 8 is a plot of ΔM versus d1 for several Δd values, under an embodiment.
`
[0014] FIG. 9 shows a plot of the gain parameter as the sum of the absolute values of H1(z) and the acoustic data or audio from microphone 1.
`
[0015] FIG. 10 is an alternative plot of acoustic data presented in FIG. 9.
`
[0016] In the figures, the same reference numbers identify identical or substantially similar elements or acts.
`
`[0017] Any headings provided herein are for convenience
`only and do not necessarily affect the scope or meaning of
`the claimed invention.
`
`DETAILED DESCRIPTION
`
`[0018] Systems and methods for discriminating voiced
`and unvoiced speech from background noise are provided
`below including a Non-Acoustic Sensor Voiced Speech
`Activity Detection (NAVSAD) system and a Pathfinder
`Speech Activity Detection (PSAD) system. The noise
`removal and reduction methods provided herein, while
`allowing for the separation and classification of unvoiced
`and voiced human speech from background noise, address
`the shortcomings of typical systems known in the art by
`cleaning acoustic signals of interest without distortion.
`
`[0019] FIG. 1 is a block diagram of a NAVSAD system
`100, under an embodiment. The NAVSAD system couples
`microphones 10 and sensors 20 to at least one processor 30.
`The sensors 20 of an embodiment include voicing activity
`detectors or non-acoustic sensors. The processor 30 controls
`subsystems including a detection subsystem 50, referred to
`herein as a detection algorithm, and a denoising subsystem
`40. Operation of the denoising subsystem 40 is described in
`detail in the Related Applications. The NAVSAD system
`works extremely well in any background acoustic noise
`environment.
`
[0020] FIG. 2 is a block diagram of a PSAD system 200, under an embodiment. The PSAD system couples micro-
`phones 10 to at least one processor 30. The processor 30
`includes a detection subsystem 50, referred to herein as a
`detection algorithm, and a denoising subsystem 40. The
`PSAD system is highly sensitive in low acoustic noise
`environments and relatively insensitive in high acoustic
`noise environments. The PSAD can operate independently
`or as a backup to the NAVSAD, detecting voiced speech if
`the NAVSAD fails.
`
`[0021] Note that the detection subsystems 50 and denois-
`ing subsystems 40 of both the NAVSAD and PSAD systems
of an embodiment are algorithms controlled by the processor
`30, but are not so limited. Alternative embodiments of the
`NAVSAD and PSAD systems can include detection sub-
`systems 50 and/or denoising subsystems 40 that comprise
`additional hardware, firmware, software, and/or combina-
`
`
`tions of hardware, firmware, and software. Furthermore,
`functions of the detection subsystems 50 and denoising
`subsystems 40 may be distributed across numerous compo-
`nents of the NAVSAD and PSAD systems.
`
[0022] FIG. 3 is a block diagram of a denoising subsystem 300, referred to herein as the Pathfinder system, under an embodiment. The Pathfinder system is briefly described below, and is described in detail in the Related Applications. Two microphones Mic 1 and Mic 2 are used in the Pathfinder system, and Mic 1 is considered the "signal" microphone. With reference to FIG. 1, the Pathfinder system 300 is equivalent to the NAVSAD system 100 when the voicing activity detector (VAD) 320 is a non-acoustic voicing sensor 20 and the noise removal subsystem 340 includes the detection subsystem 50 and the denoising subsystem 40. With reference to FIG. 2, the Pathfinder system 300 is equivalent to the PSAD system 200 in the absence of the VAD 320, and when the noise removal subsystem 340 includes the detection subsystem 50 and the denoising subsystem 40.

[0023] The NAVSAD and PSAD systems support a two-level commercial approach in which (i) a relatively less expensive PSAD system supports an acoustic approach that functions in most low- to medium-noise environments, and (ii) a NAVSAD system adds a non-acoustic sensor to enable detection of voiced speech in any environment. Unvoiced speech is normally not detected using the sensor, as it normally does not sufficiently vibrate human tissue. However, in high noise situations detecting the unvoiced speech is not as important, as it is normally very low in energy and easily washed out by the noise. Therefore in high noise environments the unvoiced speech is unlikely to affect the voiced speech denoising. Unvoiced speech information is most important in the presence of little to no noise and, therefore, the unvoiced detection should be highly sensitive in low noise situations, and insensitive in high noise situations. This is not easily accomplished, and comparable acoustic unvoiced detectors known in the art are incapable of operating under these environmental constraints.

[0024] The NAVSAD and PSAD systems include an array algorithm for speech detection that uses the difference in frequency content between two microphones to calculate a relationship between the signals of the two microphones. This is in contrast to conventional arrays that attempt to use the time/phase difference of each microphone to remove the noise outside of an "area of sensitivity". The methods described herein provide a significant advantage, as they do not require a specific orientation of the array with respect to the signal.

[0025] Further, the systems described herein are sensitive to noise of every type and every orientation, unlike conventional arrays that depend on specific noise orientations. Consequently, the frequency-based arrays presented herein are unique as they depend only on the relative orientation of the two microphones themselves with no dependence on the orientation of the noise and signal with respect to the microphones. This results in a robust signal processing system with respect to the type of noise, microphones, and orientation between the noise/signal source and the microphones.

[0026] The systems described herein use the information derived from the Pathfinder noise suppression system and/or a non-acoustic sensor described in the Related Applications to determine the voicing state of an input signal, as described in detail below. The voicing state includes silent, voiced, and unvoiced states. The NAVSAD system, for example, includes a non-acoustic sensor to detect the vibration of human tissue associated with speech. The non-acoustic sensor of an embodiment is a General Electromagnetic Movement Sensor (GEMS) as described briefly below and in detail in the Related Applications, but is not so limited. Alternative embodiments, however, may use any sensor that is able to detect human tissue motion associated with speech and is unaffected by environmental acoustic noise.

[0027] The GEMS is a radio frequency device (2.4 GHz) that allows the detection of moving human tissue dielectric interfaces. The GEMS includes an RF interferometer that uses homodyne mixing to detect small phase shifts associated with target motion. In essence, the sensor sends out weak electromagnetic waves (less than 1 milliwatt) that reflect off of whatever is around the sensor. The reflected waves are mixed with the original transmitted waves and the results analyzed for any change in position of the targets. Anything that moves near the sensor will cause a change in phase of the reflected wave that will be amplified and displayed as a change in voltage output from the sensor. A similar sensor is described by Gregory C. Burnett (1999) in "The physiological basis of glottal electromagnetic micropower sensors (GEMS) and their use in defining an excitation function for the human vocal tract"; Ph.D. Thesis, University of California at Davis.
`
[0028] FIG. 4 is a flow diagram of a detection algorithm 50 for use in detecting voiced and unvoiced speech, under an embodiment. With reference to FIGS. 1 and 2, both the NAVSAD and PSAD systems of an embodiment include the detection algorithm 50 as the detection subsystem 50. This detection algorithm 50 operates in real-time and, in an embodiment, operates on 20 millisecond windows and steps 10 milliseconds at a time, but is not so limited. The voice activity determination is recorded for the first 10 milliseconds, and the second 10 milliseconds functions as a "look-ahead" buffer. While an embodiment uses the 20/10 windows, alternative embodiments may use numerous other combinations of window values.
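As an illustration of the 20/10 windowing described above, the following is a minimal sketch that steps a 20 msec analysis window forward 10 msec at a time; the 8 kHz sampling rate is taken from the XCORR discussion later in this description, and the function name is an illustrative assumption, not part of the specification.

```python
FS = 8000                      # sampling rate (Hz), per the 8000 Hz XCORR discussion
WIN = 20 * FS // 1000          # 20 msec analysis window = 160 samples
STEP = 10 * FS // 1000         # 10 msec step = 80 samples

def analysis_windows(signal):
    """Yield overlapping 20 msec windows stepped 10 msec at a time.
    The decision for each window is recorded for its first 10 msec;
    the second 10 msec serves as the "look-ahead" buffer."""
    for start in range(0, len(signal) - WIN + 1, STEP):
        yield signal[start:start + WIN]
```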
`
[0029] Consideration was given to a number of multi-dimensional factors in developing the detection algorithm 50. The biggest consideration was maintaining the effectiveness of the Pathfinder denoising technique, described in detail in the Related Applications and reviewed herein. Pathfinder performance can be compromised if the adaptive filter training is conducted on speech rather than on noise. It is therefore important not to exclude any significant amount of speech from the VAD to keep such disturbances to a minimum.
`
[0030] Consideration was also given to the accuracy of the
`characterization between voiced and unvoiced speech sig-
`nals, and distinguishing each of these speech signals from
`noise signals. This type of characterization can be useful in
`such applications as speech recognition and speaker verifi-
`cation.
`
`
[0031] Furthermore, the systems using the detection algorithm of an embodiment function in environments containing varying amounts of background acoustic noise. If the non-acoustic sensor is available, this external noise is not a problem for voiced speech. However, for unvoiced speech (and voiced if the non-acoustic sensor is not available or has malfunctioned) reliance is placed on acoustic data alone to separate noise from unvoiced speech. An advantage inheres in the use of two microphones in an embodiment of the Pathfinder noise suppression system, and the spatial relationship between the microphones is exploited to assist in the detection of unvoiced speech. However, there may occasionally be noise levels high enough that the speech will be nearly undetectable and the acoustic-only method will fail. In these situations, the non-acoustic sensor (or hereafter just the sensor) will be required to ensure good performance.
[0032] In the two-microphone system, the speech source should be relatively louder in one designated microphone when compared to the other microphone. Tests have shown that this requirement is easily met with conventional microphones when the microphones are placed on the head, as any noise should result in an H1 with a gain near unity.
`
[0033] Regarding the NAVSAD system, and with reference to FIG. 1 and FIG. 3, the NAVSAD relies on two parameters to detect voiced speech. These two parameters include the energy of the sensor in the window of interest, determined in an embodiment by the standard deviation (SD), and optionally the cross-correlation (XCORR) between the acoustic signal from microphone 1 and the sensor data. The energy of the sensor can be determined in any one of a number of ways, and the SD is just one convenient way to determine the energy.
`
[0034] For the sensor, the SD is akin to the energy of the signal, which normally corresponds quite accurately to the voicing state, but may be susceptible to movement noise (relative motion of the sensor with respect to the human user) and/or electromagnetic noise. To further differentiate sensor noise from tissue motion, the XCORR can be used. The XCORR is only calculated to 15 delays, which corresponds to just under 2 milliseconds at 8000 Hz.
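A minimal sketch of these two per-window features follows, using the MC and GSD names from FIG. 4; the normalization of the cross-correlation is an illustrative assumption, since the specification does not fix one.

```python
import numpy as np

def navsad_features(mic1_win, gems_win, n_lags=15):
    """Voicing features for one window: GEMS energy via the standard
    deviation (GSD), and the mean |cross-correlation| (MC) between the
    Mic 1 signal and the GEMS signal over 15 delays (~2 ms at 8 kHz)."""
    gsd = np.std(gems_win)                      # SD as the energy proxy
    g = gems_win - np.mean(gems_win)
    m = mic1_win - np.mean(mic1_win)
    denom = max(np.sqrt(np.sum(g * g) * np.sum(m * m)), 1e-12)
    xcorr = [np.dot(m[lag:], g[:len(g) - lag]) / denom
             for lag in range(n_lags)]          # 15 delays only
    return np.mean(np.abs(xcorr)), gsd          # (MC, GSD)
```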
`
[0035] The XCORR can also be useful when the sensor signal is distorted or modulated in some fashion. For example, there are sensor locations (such as the jaw or back of the neck) where speech production can be detected but where the signal may have incorrect or distorted time-based information. That is, they may not have well defined features in time that will match with the acoustic waveform. However, XCORR is more susceptible to errors from acoustic noise, and in high-noise (<0 dB SNR) environments it is almost useless. Therefore it should not be the sole source of voicing information.
`
[0036] The sensor detects human tissue motion associated with the closure of the vocal folds, so the acoustic signal produced by the closure of the folds is highly correlated with the closures. Therefore, sensor data that correlates highly with the acoustic signal is declared as speech, and sensor data that does not correlate well is termed noise. The acoustic data is expected to lag behind the sensor data by about 0.1 to 0.8 milliseconds (or about 1-7 samples) as a result of the delay time due to the relatively slower speed of sound (around 330 m/s). However, an embodiment uses a 15-sample correlation, as the acoustic wave shape varies significantly depending on the sound produced, and a larger correlation width is needed to ensure detection.
`
`
[0037] The SD and XCORR signals are related, but are sufficiently different so that the voiced speech detection is more reliable. For simplicity, though, either parameter may be used. The values for the SD and XCORR are compared to empirical thresholds, and if both are above their threshold, voiced speech is declared. Example data is presented and described below.
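Continuing the sketch above, the voiced decision then reduces to two empirical thresholds, named VTC and VTS in FIG. 4; the numeric values below are placeholders, not values from the specification.

```python
VTC = 0.1   # voiced threshold for the mean correlation (placeholder value)
VTS = 0.05  # voiced threshold for the GEMS std. dev. (placeholder value)

def is_voiced(mc, gsd):
    """Declare voiced speech only when both features clear their thresholds."""
    return mc > VTC and gsd > VTS
```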
`
[0038] FIGS. 5A, 5B, and 6 show data plots for an example in which a subject twice speaks the phrase "pop pan", under an embodiment. FIG. 5A plots the received GEMS signal 502 for this utterance along with the mean correlation 504 between the GEMS signal and the Mic 1 signal and the threshold T1 used for voiced speech detection. FIG. 5B plots the received GEMS signal 502 for this utterance along with the standard deviation 506 of the GEMS signal and the threshold T2 used for voiced speech detection. FIG. 6 plots voiced speech 602 detected from the acoustic or audio signal 608, along with the GEMS signal 604 and the acoustic noise 606; no unvoiced speech is detected in this example because of the heavy background babble noise 606. The thresholds have been set so that there are virtually no false negatives, and only occasional false positives. A voiced speech activity detection accuracy of greater than 99% has been attained under any acoustic background noise conditions.
`
[0039] The NAVSAD can determine when voiced speech is occurring with high degrees of accuracy due to the non-acoustic sensor data. However, the sensor offers little assistance in separating unvoiced speech from noise, as unvoiced speech normally causes no detectable signal in most non-acoustic sensors. If there is a detectable signal, the NAVSAD can be used, although use of the SD method is dictated as unvoiced speech is normally poorly correlated. In the absence of a detectable signal, use is made of the system and methods of the Pathfinder noise removal algorithm in determining when unvoiced speech is occurring. A brief review of the Pathfinder algorithm is described below, while a detailed description is provided in the Related Applications.
`
[0040] With reference to FIG. 3, the acoustic information coming into Microphone 1 is denoted by m1(n), the information coming into Microphone 2 is similarly labeled m2(n), and the GEMS sensor is assumed available to determine voiced speech areas. In the z (digital frequency) domain, these signals are represented as M1(z) and M2(z). Then

$$M_1(z) = S(z) + N_2(z)$$
$$M_2(z) = N(z) + S_2(z)$$

with

$$N_2(z) = N(z) H_1(z)$$
$$S_2(z) = S(z) H_2(z)$$

so that

$$M_1(z) = S(z) + N(z) H_1(z)$$
$$M_2(z) = N(z) + S(z) H_2(z) \qquad (1)$$
`
[0041] This is the general case for all two microphone systems. There is always going to be some leakage of noise into Mic 1, and some leakage of signal into Mic 2. Equation 1 has four unknowns and only two relationships and cannot be solved explicitly.
`
`
[0042] However, there is another way to solve for some of the unknowns in Equation 1. Examine the case where the signal is not being generated, that is, where the GEMS signal indicates voicing is not occurring. In this case, s(n) = S(z) = 0, and Equation 1 reduces to

$$M_{1n}(z) = N(z) H_1(z)$$
$$M_{2n}(z) = N(z)$$

[0043] where the n subscript on the M variables indicates that only noise is being received. This leads to

$$M_{1n}(z) = M_{2n}(z) H_1(z)$$
$$H_1(z) = \frac{M_{1n}(z)}{M_{2n}(z)} \qquad (2)$$

[0044] H1(z) can be calculated using any of the available system identification algorithms and the microphone outputs when only noise is being received. The calculation can be done adaptively, so that if the noise changes significantly H1(z) can be recalculated quickly.
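The specification does not mandate a particular identification algorithm; the following is a minimal sketch of one common choice, a normalized LMS (NLMS) adaptive FIR estimate of H1 trained only while the VAD reports noise. The filter length, step size, and function name are illustrative assumptions.

```python
import numpy as np

def update_h1(h1, mic2_win, mic1_win, mu=0.1, eps=1e-8):
    """One NLMS pass over a noise-only window: adapt the FIR filter h1
    so that Mic 2 filtered through h1 predicts Mic 1 (M1n = M2n * H1)."""
    L = len(h1)
    for n in range(L - 1, len(mic2_win)):
        x = mic2_win[n - L + 1:n + 1][::-1]          # newest-first tap vector
        e = mic1_win[n] - np.dot(h1, x)              # prediction error vs. Mic 1
        h1 += (mu / (np.dot(x, x) + eps)) * e * x    # normalized LMS step
    return h1

# h1 = np.zeros(20)
# if vad_reports_noise(window):        # adapt only when no speech is present
#     h1 = update_h1(h1, mic2_window, mic1_window)
```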
`
[0045] With a solution for one of the unknowns in Equation 1, a solution can be found for another, H2(z), by using the amplitude of the GEMS or similar device along with the amplitude of the two microphones. When the GEMS indicates voicing, but the recent (less than 1 second) history of the microphones indicates low levels of noise, assume that n(n) = N(z) ≈ 0. Then Equation 1 reduces to

$$M_{1s}(z) = S(z)$$
$$M_{2s}(z) = S(z) H_2(z)$$

[0046] which in turn leads to

$$M_{2s}(z) = M_{1s}(z) H_2(z)$$
$$H_2(z) = \frac{M_{2s}(z)}{M_{1s}(z)}$$

[0047] which is the inverse of the H1(z) calculation, but note that different inputs are being used.

[0048] After calculating H1(z) and H2(z) above, they are used to remove the noise from the signal. Rewrite Equation 1 as

$$S(z) = M_1(z) - N(z) H_1(z)$$
$$N(z) = M_2(z) - S(z) H_2(z)$$
$$S(z) = M_1(z) - [M_2(z) - S(z) H_2(z)] H_1(z)$$
$$S(z) [1 - H_2(z) H_1(z)] = M_1(z) - M_2(z) H_1(z)$$

[0049] and solve for S(z) as:

$$S(z) = \frac{M_1(z) - M_2(z) H_1(z)}{1 - H_2(z) H_1(z)} \qquad (3)$$

[0050] In practice H2(z) is usually quite small, so that H2(z)H1(z) << 1, and

$$S(z) \approx M_1(z) - M_2(z) H_1(z),$$

[0051] obviating the need for the H2(z) calculation.
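As a concrete illustration of the simplified form S(z) ≈ M1(z) − M2(z)H1(z), the sketch below applies the estimated FIR filter h1 in the time domain; the function name and use of convolution are illustrative assumptions.

```python
import numpy as np

def denoise(mic1_win, mic2_win, h1):
    """Approximate clean speech via S(z) ~ M1(z) - M2(z)H1(z):
    subtract Mic 2 filtered through h1 from Mic 1 in the time domain."""
    noise_estimate = np.convolve(mic2_win, h1)[:len(mic1_win)]
    return mic1_win - noise_estimate
```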
`
[0052] With reference to FIG. 2 and FIG. 3, the PSAD system is described. As sound waves propagate, they normally lose energy as they travel due to diffraction and dispersion. Assuming the sound waves originate from a point source and radiate isotropically, their amplitude will decrease as a function of 1/r, where r is the distance from the originating point. This function of 1/r proportional to amplitude is the worst case; if confined to a smaller area the reduction will be less. However, it is an adequate model for the configurations of interest, specifically the propagation of noise and speech to microphones located somewhere on the user's head.

[0053] FIG. 7 is a microphone array for use under an embodiment of the PSAD system. Placing the microphones Mic 1 and Mic 2 in a linear array with the mouth on the array midline, the difference in signal strength in Mic 1 and Mic 2 (assuming the microphones have identical frequency responses) will be proportional to both d1 and Δd. Assuming a 1/r (or in this case 1/d) relationship, it is seen that

$$\Delta M = \frac{|\mathrm{Mic\,1}|}{|\mathrm{Mic\,2}|} = \Delta H_1(z) \propto \frac{d_1 + \Delta d}{d_1}$$

[0054] where ΔM is the difference in gain between Mic 1 and Mic 2 and therefore H1(z), as above in Equation 2. The variable d1 is the distance from Mic 1 to the speech or noise source.
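For example, under this 1/d model a speech source at d1 = 10 cm with Δd = 4 cm gives ΔM = (10 + 4)/10 = 1.4, while a noise source at d1 = 100 cm with the same Δd gives ΔM = (100 + 4)/100 = 1.04, close to the unity gain associated with noise below; the specific distances are illustrative assumptions.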
`
[0055] FIG. 8 is a plot 800 of ΔM versus d1 for several Δd values, under an embodiment. It is clear that as Δd becomes larger and the noise source is closer, ΔM becomes larger. The variable Δd will change depending on the orientation to the speech/noise source, from the maximum value on the array midline to zero perpendicular to the array midline. From the plot 800 it is clear that for small Δd and for distances over approximately 30 centimeters (cm), ΔM is close to unity. Since most noise sources are farther away than 30 cm and are unlikely to be on the midline of the array, it is probable that when calculating H1(z) as above in Equation 2, ΔM (or equivalently the gain of H1(z)) will be close to unity. Conversely, for noise sources that are close (within a few centimeters), there could be a substantial difference in gain depending on which microphone is closer to the noise.
`
[0056] If the "noise" is the user speaking, and Mic 1 is closer to the mouth than Mic 2, the gain increases. Since environmental noise normally originates much farther away from the user's head than speech, noise will be found during the time when the gain of H1(z) is near unity or some fixed value, and speech can be found after a sharp rise in gain. The speech can be unvoiced or voiced, as long as it is of sufficient volume compared to the surrounding noise. The gain will stay somewhat high during the speech portions, then descend quickly after speech ceases. The rapid increase and decrease in the gain of H1(z) should be sufficient to allow the detection of speech under almost any circumstances. The gain in this example is calculated by the sum of the absolute value of the filter coefficients. This sum is not equivalent to the gain, but the two are related in that a rise in the sum of the absolute value reflects a rise in the gain.
`
[0057] As an example of this behavior, FIG. 9 shows a plot 900 of the gain parameter 902 as the sum of the absolute values of H1(z) and the acoustic data 904 or audio from microphone 1. The speech signal was an utterance of the phrase "pop pan", repeated twice. The evaluated bandwidth included the frequency range from 2500 Hz to 3500 Hz, although 1500 Hz to 2500 Hz was additionally used in practice. Note the rapid increase in the gain when the unvoiced speech is first encountered, then the rapid return to normal when the speech ends. The large changes in gain that result from transitions between noise and speech can be detected by any standard signal processing techniques. The standard deviation of the last few gain calculations is used, with thresholds being defined by a running average of the standard deviations and the standard deviation noise floor. The later changes in gain for the voiced speech are suppressed in this plot 900 for clarity.
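A minimal sketch of this gain-tracking idea follows, assuming per-window filter coefficients are already available (for example, from the NLMS sketch above). The names mirror the new_sum_vector/sd_ma_vector/UV_ma/UV_std legend of FIG. 4, but the threshold values and update policy shown are illustrative assumptions.

```python
import numpy as np
from collections import deque

def psad_is_unvoiced(h1, new_sum_vector, sd_ma_vector, uv_ma=1.5, uv_std=0.05):
    """Flag unvoiced speech from a sharp rise in the H1 gain proxy:
    the std. dev. of recent sum(abs(h1)) values is compared against a
    moving average of past std. devs. (the noise floor)."""
    new_sum_vector.append(np.sum(np.abs(h1)))    # gain proxy for this window
    new_std = np.std(new_sum_vector)             # spread of recent gain values
    sd_ma = np.mean(sd_ma_vector) if sd_ma_vector else new_std
    if new_std > uv_ma * sd_ma and new_std > uv_std:
        return True                              # unvoiced speech detected
    sd_ma_vector.append(new_std)                 # update floor on noise only
    return False

# new_sum_vector = deque(maxlen=10); sd_ma_vector = deque(maxlen=50)
```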
`
[0058] FIG. 10 is an alternative plot 1000 of acoustic data presented in FIG. 9. The data used to form plot 900 is presented again in this plot 1000, along with audio data 1004 and GEMS data 1006 without noise to make the unvoiced speech apparent. The voiced signal 1002 has three possible values: 0 for noise, 1 for unvoiced, and 2 for voiced. Denoising is only accomplished when V = 0. It is clear that the unvoiced speech is captured very well, aside from two single dropouts in the unvoiced detection near the end of each "pop". However, these single-window dropouts are not common and do not significantly affect the denoising algorithm. They can easily be removed using standard smoothing techniques.
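The specification does not name a particular smoothing technique; a three-point median filter over the per-window voicing labels, sketched below, is one standard way to remove such single-window dropouts.

```python
import numpy as np

def smooth_voicing(v):
    """Remove single-window dropouts in the voicing track
    (0 = noise, 1 = unvoiced, 2 = voiced) with a 3-point median."""
    v = np.asarray(v)
    out = v.copy()
    for i in range(1, len(v) - 1):
        out[i] = np.median(v[i - 1:i + 2])
    return out

# smooth_voicing([1, 1, 0, 1, 1]) -> [1, 1, 1, 1, 1]: the dropout is filled.
```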
`
[0059] What is not clear from this plot 1000 is that the PSAD system functions as an automatic backup to the NAVSAD. This is because the voiced speech (since it has the same spatial relationship