`
`
(19) United States
(12) Patent Application Publication    Burnett
(10) Pub. No.: US 2002/0198705 A1
(43) Pub. Date: Dec. 26, 2002
`
`(54) DETECTING VOICED AND UNVOICED
`SPEECH USING BOTH ACOUSTIC AND
`NONACOUSTIC SENSORS
`
(76) Inventor: Gregory C. Burnett, Livermore, CA (US)

Correspondence Address:
PERKINS COIE LLP
P.O. BOX 2168
MENLO PARK, CA 94026 (US)
`
(21) Appl. No.: 10/159,770

(22) Filed: May 30, 2002
`
`Related U.S. Application Data
`
(60) Provisional application No. 60/294,383, filed on May 30, 2001. Provisional
application No. 60/335,100, filed on Oct. 30, 2001. Provisional application No.
60/332,202, filed on Nov. 21, 2001. Provisional application No. 60/362,162, filed
on Mar. 5, 2002. Provisional application No. 60/362,103, filed on Mar. 5, 2002.
Provisional application No. 60/362,170, filed on Mar. 5, 2002. Provisional
application No. 60/361,981, filed on Mar. 5, 2002. Provisional application No.
60/362,161, filed on Mar. 5, 2002. Provisional application No. 60/368,209, filed
on Mar. 27, 2002. Provisional application No. 60/368,208, filed on Mar. 27, 2002.
Provisional application No. 60/368,343, filed on Mar. 27, 2002.
`
`Publication Classification
`
(51) Int. Cl.7 ..................................................... G10L 11/06
(52) U.S. Cl. ...................................................... 704/214
`
(57) ABSTRACT
`
`Systems and methods are provided for detecting voiced and
`unvoiced speech in acoustic signals having varying levels of
`background noise. The systems receive acoustic signals at
`two microphones, and generate difference parameters
`between the acoustic signals received at each of the two
`microphones. The difference parameters are representative
`of the relative difference in signal gain between portions of
`the received acoustic signals. The systems identify informa-
`tion of the acoustic signals as unvoiced speech when the
`difference parameters exceed a first threshold, and identify
`information of the acoustic signals as voiced speech when
`the difference parameters exceed a second threshold. Fur-
`ther, embodiments of the systems include non-acoustic
`sensors that receive physiological information to aid in
`identifying voiced speech.
`
[Representative drawing: flow diagram of the detection algorithm of FIG. 4, with its legend of constants (V, VTC, VTS, ff, num_ma, UV_ma, UV_std, UV, num_begin) and variables (bh1, keep_old, sd_ma_vector, sd_ma).]
`
`
`
`
[FIG. 1: block diagram of the NAVSAD system 100. Microphones 10 and voicing sensors 20 feed a processor 30 that hosts the detection subsystem 50 and the denoising subsystem 40.]
`
`
`
[FIG. 2: block diagram of the PSAD system 200. Microphones 10 feed a processor 30 that hosts the detection subsystem 50 and the denoising subsystem 40.]
`
`
`
[FIG. 3: block diagram of the Pathfinder noise removal system 300. MIC 1 receives the speech signal s(n) plus leaked noise, MIC 2 receives the noise n(n) plus leaked speech, a VAD 320 supplies voicing information, and the noise removal block 340 outputs cleaned speech.]
`
`
`
[FIG. 4: flow diagram of the detection algorithm 50, stepping 10 msec at a time. NAVSAD branch: read 20 msec of data from mic 1, mic 2, and the GEMS; calculate the cross-correlation (XCORR) of mic 1 and the GEMS and its mean absolute value MC; calculate the standard deviation of the GEMS, GSD; if MC > VTC and GSD > VTS, declare voiced (V(window) = 2). PSAD branch: otherwise filter mic 1 and mic 2 into the 1500-2500 Hz and 2500-3500 Hz subbands, calculate bh1 using Pathfinder for each subband, form new_sum = sum(abs(bh1)) and its standard deviation new_std over recent windows, and compare new_std to a moving average sd_ma of past values; if new_std > UV_ma*sd_ma and new_std > UV_std (or at the beginning), mark the subband unvoiced and keep the old bh1; if either subband is marked, declare unvoiced (V(window) = 1), else noise (V(window) = 0). Legend: V = 0 if noise, 1 if UV, 2 if V; VTC = voiced threshold for corr; VTS = voiced threshold for std. dev.; ff = forgetting factor for std. dev.; num_ma = number of taps in m.a. filter; UV_ma = UV std dev m.a. threshold; UV_std = UV std dev threshold; UV = binary values denoting UV detected in each subband; num_begin = number of windows at "beginning"; bh1 = LMS calc of MIC 1-2 transfer function; keep_old = 1 if last window V/UV, 0 otherwise; sd_ma_vector = last NV sd values; sd_ma = m.a. of the last NV sd values.]
`
`
`
[FIG. 5A: GEMS signal and mean correlation for the test utterance, with the voiced-speech threshold, versus time in samples (x 10^4). FIG. 5B: GEMS signal and its standard deviation, with the voiced-speech threshold, same time axis.]
`
`
`
[FIG. 6: plot 600 of voiced speech 602 detected from the audio signal 608, together with the GEMS signal 604, versus time.]
`
`
`
[FIG. 7: linear microphone array used in the PSAD system, with the two microphones on a line and the speech source on the array midline.]
`
`
`
[FIG. 8: plot 800 of delta M versus d1 (cm) for delta d = 1, 2, 3, and 4 cm.]
`
`
`
[FIG. 9: plot 900 of acoustic data 904 (solid) and the gain parameter 902 (dashed) versus time in samples (x 10^4).]
`
`
`
[FIG. 10: Mic 1 audio signal 1004 and voicing signal 1002 (with voiced and unvoiced levels), together with the GEMS signal, for the utterance "pop pan", versus time in samples (x 10^4).]
`
`
`
`
`DETECTING VOICED AND UNVOICED SPEECH
`USING BOTH ACOUSTIC AND NONACOUSTIC
`SENSORS
`
`RELATED APPLICATIONS
`[0001] This application claims the benefit of U.S. Appli-
`cation Nos. 60/294,383 filed May 30, 2001; 09/905,361 filed
`Jul. 12, 2001; 60/335,100 filed Oct. 30, 2001; 60/332,202
`and 09/990,847, both filed Nov. 21, 2001; 60/362,103,
`60/362,161, 60/362,162, 60/362,170, and 60/361,981, all
`filed Mar. 5, 2002; 60/368,208, 60/368,209, and 60/368,343,
`all filed Mar. 27, 2002; all of which are incorporated herein
`by reference in their entirety.
`
`TECHNICAL FIELD
`[0002] The disclosed embodiments relate to the process-
`ing of speech signals.
`
`BACKGROUND
`[0003] The ability
`to correctly
`identify voiced and
`unvoiced speech is critical to many speech applications
`including speech recognition, speaker verification, noise
`suppression, and many others. In a typical acoustic appli-
`cation, speech from a human speaker is captured and trans-
`mitted to a receiver in a different location. In the speaker's
`environment there may exist one or more noise sources that
`pollute the speech signal, or the signal of interest, with
`unwanted acoustic noise. This makes it difficult or impos-
`sible for the receiver, whether human or machine, to under-
`stand the user's speech.
`[0004] Typical methods for classifying voiced and
`unvoiced speech have relied mainly on the acoustic content
`of microphone data, which is plagued by problems with
`noise and the corresponding uncertainties in signal content.
`This is especially problematic now with the proliferation of
`portable communication devices like cellular telephones and
`personal digital assistants because, in many cases, the qual-
`ity of service provided by the device depends on the quality
`of the voice services offered by the device. There are
`methods known in the art for suppressing the noise present
`in the speech signals, but these methods demonstrate per-
`formance shortcomings that include unusually long comput-
`ing time, requirements for cumbersome hardware to perform
`the signal processing, and distorting the signals of interest.
`
`BRIEF DESCRIPTION OF THE FIGURES
`[0005] FIG. 1 is a block diagram of a NAVSAD system,
`under an embodiment.
`[0006] FIG. 2 is a block diagram of a PSAD system, under
`an embodiment.
`[0007] FIG. 3 is a block diagram of a denoising system,
`referred to herein as the Pathfinder system, under an embodi-
`ment.
`[0008] FIG. 4 is a flow diagram of a detection algorithm
`for use in detecting voiced and unvoiced speech, under an
`embodiment.
[0009] FIG. 5A plots the received GEMS signal for an
`utterance along with the mean correlation between the
`GEMS signal and the Mic 1 signal and the threshold for
`voiced speech detection.
`
`[0010] FIG. 5B plots the received GEMS signal for an
`utterance along with the standard deviation of the GEMS
`signal and the threshold for voiced speech detection.
`[0011] FIG. 6 plots voiced speech detected from an utter-
`ance along with the GEMS signal and the acoustic noise.
`[0012] FIG. 7 is a microphone array for use under an
`embodiment of the PSAD system.
[0013] FIG. 8 is a plot of ΔM versus d1 for several Δd
values, under an embodiment.
`[0014] FIG. 9 shows a plot of the gain parameter as the
`sum of the absolute values of H1(z) and the acoustic data or
`audio from microphone 1.
`[0015] FIG. 10 is an alternative plot of acoustic data
`presented in FIG. 9.
`[0016]
`In the figures, the same reference numbers identify
`identical or substantially similar elements or acts.
`[0017] Any headings provided herein are for convenience
`only and do not necessarily affect the scope or meaning of
`the claimed invention.
`
`DETAILED DESCRIPTION
`[0018] Systems and methods for discriminating voiced
`and unvoiced speech from background noise are provided
`below including a Non-Acoustic Sensor Voiced Speech
`Activity Detection (NAVSAD) system and a Pathfinder
`Speech Activity Detection (PSAD) system. The noise
`removal and reduction methods provided herein, while
`allowing for the separation and classification of unvoiced
`and voiced human speech from background noise, address
`the shortcomings of typical systems known in the art by
`cleaning acoustic signals of interest without distortion.
`[0019] FIG. 1 is a block diagram of a NAVSAD system
`100, under an embodiment. The NAVSAD system couples
`microphones 10 and sensors 20 to at least one processor 30.
`The sensors 20 of an embodiment include voicing activity
`detectors or non-acoustic sensors. The processor 30 controls
`subsystems including a detection subsystem 50, referred to
`herein as a detection algorithm, and a denoising subsystem
40. Operation of the denoising subsystem 40 is described in
`detail in the Related Applications. The NAVSAD system
`works extremely well in any background acoustic noise
`environment.
`[0020] FIG. 2 is a block diagram of a PSAD system 200,
`under an embodiment. The PSAD system couples micro-
`phones 10 to at least one processor 30. The processor 30
`includes a detection subsystem 50, referred to herein as a
`detection algorithm, and a denoising subsystem 40. The
`PSAD system is highly sensitive in low acoustic noise
`environments and relatively insensitive in high acoustic
`noise environments. The PSAD can operate independently
`or as a backup to the NAVSAD, detecting voiced speech if
`the NAVSAD fails.
`[0021] Note that the detection subsystems 50 and denois-
`ing subsystems 40 of both the NAVSAD and PSAD systems
`of an embodiment are algorithms controlled by the processor
`30, but are not so limited. Alternative embodiments of the
`NAVSAD and PSAD systems can include detection sub-
`systems 50 and/or denoising subsystems 40 that comprise
`additional hardware, firmware, software, and/or combina-
`
`
`tions of hardware, firmware, and software. Furthermore,
`functions of the detection subsystems 50 and denoising
`subsystems 40 may be distributed across numerous compo-
`nents of the NAVSAD and PSAD systems.
`[0022] FIG. 3 is a block diagram of a denoising subsystem
`300, referred to herein as the Pathfinder system, under an
`embodiment. The Pathfinder system is briefly described
`below, and is described in detail in the Related Applications.
`Two microphones Mic 1 and Mic 2 are used in the Pathfinder
`system, and Mic 1 is considered the "signal" microphone.
`With reference to FIG. 1, the Pathfinder system 300 is
`equivalent to the NAVSAD system 100 when the voicing
`activity detector (VAD) 320 is a non-acoustic voicing sensor
`20 and the noise removal subsystem 340 includes the
`detection subsystem 50 and the denoising subsystem 40.
`With reference to FIG. 2, the Pathfinder system 300 is
`equivalent to the PSAD system 200 in the absence of the
`VAD 320, and when the noise removal subsystem 340
`includes the detection subsystem 50 and the denoising
`subsystem 40.
`[0023] The NAVSAD and PSAD systems support a two-
`level commercial approach in which (i) a relatively less
`expensive PSAD system supports an acoustic approach that
`functions in most low- to medium-noise environments, and
`(ii) a NAVSAD system adds a non-acoustic sensor to enable
`detection of voiced speech in any environment. Unvoiced
`speech is normally not detected using the sensor, as it
`normally does not sufficiently vibrate human tissue. How-
`ever, in high noise situations detecting the unvoiced speech
`is not as important, as it is normally very low in energy and
`easily washed out by the noise. Therefore in high noise
`environments the unvoiced speech is unlikely to affect the
`voiced speech denoising. Unvoiced speech information is
`most important in the presence of little to no noise and,
`therefore, the unvoiced detection should be highly sensitive
in low noise situations, and insensitive in high noise situa-
`tions. This is not easily accomplished, and comparable
`acoustic unvoiced detectors known in the art are incapable
`of operating under these environmental constraints.
`[0024] The NAVSAD and PSAD systems include an array
`algorithm for speech detection that uses the difference in
`frequency content between two microphones to calculate a
`relationship between the signals of the two microphones.
`This is in contrast to conventional arrays that attempt to use
`the time/phase difference of each microphone to remove the
`noise outside of an "area of sensitivity". The methods
`described herein provide a significant advantage, as they do
`not require a specific orientation of the array with respect to
`the signal.
`[0025] Further, the systems described herein are sensitive
`to noise of every type and every orientation, unlike conven-
`tional arrays that depend on specific noise orientations.
`Consequently, the frequency-based arrays presented herein
`are unique as they depend only on the relative orientation of
`the two microphones themselves with no dependence on the
`orientation of the noise and signal with respect to the
`microphones. This results in a robust signal processing
`system with respect to the type of noise, microphones, and
`orientation between the noise/signal source and the micro-
`phones.
`
`[0026] The systems described herein use the information
`derived from the Pathfinder noise suppression system and/or
`a non-acoustic sensor described in the Related Applications
`to determine the voicing state of an input signal, as described
`in detail below. The voicing state includes silent, voiced, and
`unvoiced states. The NAVSAD system, for example,
`includes a non-acoustic sensor to detect the vibration of
`human tissue associated with speech. The non-acoustic
`sensor of an embodiment is a General Electromagnetic
`Movement Sensor (GEMS) as described briefly below and
`in detail in the Related Applications, but is not so limited.
`Alternative embodiments, however, may use any sensor that
`is able to detect human tissue motion associated with speech
`and is unaffected by environmental acoustic noise.
`[0027] The GEMS is a radio frequency device (2.4 GHz)
`that allows the detection of moving human tissue dielectric
`interfaces. The GEMS includes an RF interferometer that
`uses homodyne mixing to detect small phase shifts associ-
`ated with target motion. In essence, the sensor sends out
`weak electromagnetic waves (less than 1 milliwatt) that
`reflect off of whatever is around the sensor. The reflected
`waves are mixed with the original transmitted waves and the
`results analyzed for any change in position of the targets.
`Anything that moves near the sensor will cause a change in
`phase of the reflected wave that will be amplified and
`displayed as a change in voltage output from the sensor. A
`similar sensor is described by Gregory C. Burnett (1999) in
`"The physiological basis of glottal electromagnetic
`micropower sensors (GEMS) and their use in defining an
`excitation function for the human vocal tract"; Ph.D. Thesis,
`University of California at Davis.
`[0028] FIG. 4 is a flow diagram of a detection algorithm
`50 for use in detecting voiced and unvoiced speech, under an
`embodiment. With reference to FIGS. 1 and 2, both the
`NAVSAD and PSAD systems of an embodiment include the
`detection algorithm 50 as the detection subsystem 50. This
`detection algorithm 50 operates in real-time and, in an
`embodiment, operates on 20 millisecond windows and steps
`10 milliseconds at a time, but is not so limited. The voice
`activity determination is recorded for the first 10 millisec-
`onds, and the second 10 milliseconds functions as a "look-
`ahead" buffer. While an embodiment uses the 20/10 win-
`dows, alternative embodiments may use numerous other
`combinations of window values.
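The windowing just described is straightforward to express in code. The sketch below is a minimal illustration of the 20 ms window / 10 ms step scheme and its look-ahead half; it assumes 8 kHz sampled audio (the rate mentioned later for the cross-correlation) and NumPy, and the function name is ours, not the patent's.

```python
import numpy as np

def frame_windows(audio, sample_rate=8000, window_ms=20, step_ms=10):
    """Yield (decision_half, lookahead_half) pairs for a sliding 20 ms window.

    The voicing decision is recorded for the first 10 ms of each window;
    the second 10 ms serves as the "look-ahead" buffer."""
    window = int(sample_rate * window_ms / 1000)   # 160 samples at 8 kHz
    step = int(sample_rate * step_ms / 1000)       # 80 samples at 8 kHz
    for start in range(0, len(audio) - window + 1, step):
        frame = audio[start:start + window]
        yield frame[:step], frame[step:]

# usage: iterate over the frames and run the detection algorithm on each
# audio = np.zeros(8000)  # one second of silence as a stand-in
# for decision_part, lookahead in frame_windows(audio):
#     pass  # run the NAVSAD/PSAD decision here
```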
`[0029] Consideration was given to a number of multi-
`dimensional factors in developing the detection algorithm
50. The biggest consideration was maintaining the effec-
tiveness of the Pathfinder denoising technique, described in
`detail in the Related Applications and reviewed herein.
`Pathfinder performance can be compromised if the adaptive
`filter training is conducted on speech rather than on noise. It
`is therefore important not to exclude any significant amount
`of speech from the VAD to keep such disturbances to a
minimum.
`[0030] Consideration was also given to the accuracy of the
`characterization between voiced and unvoiced speech sig-
`nals, and distinguishing each of these speech signals from
`noise signals. This type of characterization can be useful in
`such applications as speech recognition and speaker verifi-
`cation.
`
`
`[0031] Furthermore, the systems using the detection algo-
`rithm of an embodiment function in environments contain-
`ing varying amounts of background acoustic noise. If the
`non-acoustic sensor is available, this external noise is not a
`problem for voiced speech. However, for unvoiced speech
`(and voiced if the non-acoustic sensor is not available or has
`malfunctioned) reliance is placed on acoustic data alone to
`separate noise from unvoiced speech. An advantage inheres
`in the use of two microphones in an embodiment of the
`Pathfinder noise suppression system, and the spatial rela-
`tionship between the microphones is exploited to assist in
`the detection of unvoiced speech. However, there may
`occasionally be noise levels high enough that the speech will
`be nearly undetectable and the acoustic-only method will
`fail. In these situations, the non-acoustic sensor (or hereafter
`just the sensor) will be required to ensure good performance.
`[0032]
`In the two-microphone system, the speech source
`should be relatively louder in one designated microphone
`when compared to the other microphone. Tests have shown
`that this requirement is easily met with conventional micro-
`phones when the microphones are placed on the head, as any
noise should result in an H1 with a gain near unity.
`[0033] Regarding the NAVSAD system, and with refer-
`ence to FIG. 1 and FIG. 3, the NAVSAD relies on two
`parameters to detect voiced speech. These two parameters
`include the energy of the sensor in the window of interest,
`determined in an embodiment by the standard deviation
`(SD), and optionally
`the cross-correlation (XCORR)
`between the acoustic signal from microphone 1 and the
`sensor data. The energy of the sensor can be determined in
`any one of a number of ways, and the SD is just one
`convenient way to determine the energy.
`[0034] For the sensor, the SD is akin to the energy of the
`signal, which normally corresponds quite accurately to the
`voicing state, but may be susceptible to movement noise
`(relative motion of the sensor with respect to the human
`user) and/or electromagnetic noise. To further differentiate
`sensor noise from tissue motion, the XCORR can be used.
`The XCORR is only calculated to 15 delays, which corre-
`sponds to just under 2 milliseconds at 8000 Hz.
[0035] The XCORR can also be useful when the sensor
`signal is distorted or modulated in some fashion. For
`example, there are sensor locations (such as the jaw or back
`of the neck) where speech production can be detected but
`where the signal may have incorrect or distorted time-based
`information. That is, they may not have well defined features
`in time that will match with the acoustic waveform. How-
`ever, XCORR is more susceptible to errors from acoustic
noise, and in high (<0 dB SNR) environments is almost
`useless. Therefore it should not be the sole source of voicing
`information.
`[0036] The sensor detects human tissue motion associated
`with the closure of the vocal folds, so the acoustic signal
`produced by the closure of the folds is highly correlated with
`the closures. Therefore, sensor data that correlates highly
`with the acoustic signal is declared as speech, and sensor
`data that does not correlate well is termed noise. The
`acoustic data is expected to lag behind the sensor data by
`about 0.1 to 0.8 milliseconds (or about 1-7 samples) as a
`result of the delay time due to the relatively slower speed of
`sound (around 330 m/s). However, an embodiment uses a
`15-sample correlation, as the acoustic wave shape varies
`significantly depending on the sound produced, and a larger
`correlation width is needed to ensure detection.
`
`[0037] The SD and XCORR signals are related, but are
`sufficiently different so that the voiced speech detection is
`more reliable. For simplicity, though, either parameter may
`be used. The values for the SD and XCORR are compared
`to empirical thresholds, and if both are above their threshold,
`voiced speech is declared. Example data is presented and
`described below.
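As a rough illustration of this test, the sketch below computes the GEMS standard deviation and a 15-delay cross-correlation against the Mic 1 data for one window and compares both to thresholds. The normalization and the threshold arguments are our assumptions; the disclosure only states that empirically chosen thresholds are used.

```python
import numpy as np

def navsad_voiced(mic1_win, gems_win, corr_thresh, sd_thresh, max_lag=15):
    """Declare voiced speech when both the GEMS energy (standard deviation)
    and the Mic 1 / GEMS cross-correlation exceed their thresholds."""
    gems_sd = np.std(gems_win)            # sensor energy proxy (SD)

    # cross-correlation evaluated only out to 15 delays (~2 ms at 8 kHz)
    m = mic1_win - np.mean(mic1_win)
    g = gems_win - np.mean(gems_win)
    norm = np.sqrt(np.sum(m * m) * np.sum(g * g)) + 1e-12
    lags = [np.sum(m[k:] * g[:len(g) - k]) / norm for k in range(max_lag + 1)]
    mean_corr = np.mean(np.abs(lags))     # MC in the FIG. 4 flow diagram

    return mean_corr > corr_thresh and gems_sd > sd_thresh
```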
[0038] FIGS. 5A, 5B, and 6 show data plots for an
example in which a subject twice speaks the phrase "pop
pan", under an embodiment. FIG. 5A plots the received
GEMS signal 502 for this utterance along with the mean
correlation 504 between the GEMS signal and the Mic 1
signal and the threshold T1 used for voiced speech detection.
FIG. 5B plots the received GEMS signal 502 for this
utterance along with the standard deviation 506 of the
GEMS signal and the threshold T2 used for voiced speech
`detection. FIG. 6 plots voiced speech 602 detected from the
`acoustic or audio signal 608, along with the GEMS signal
`604 and the acoustic noise 606; no unvoiced speech is
`detected in this example because of the heavy background
`babble noise 606. The thresholds have been set so that there
`are virtually no false negatives, and only occasional false
`positives. A voiced speech activity detection accuracy of
`greater than 99% has been attained under any acoustic
`background noise conditions.
`[0039] The NAVSAD can determine when voiced speech
`is occurring with high degrees of accuracy due to the
`non-acoustic sensor data. However, the sensor offers little
`assistance in separating unvoiced speech from noise, as
`unvoiced speech normally causes no detectable signal in
`most non-acoustic sensors. If there is a detectable signal, the
`NAVSAD can be used, although use of the SD method is
`dictated as unvoiced speech is normally poorly correlated. In
`the absence of a detectable signal use is made of the system
`and methods of the Pathfinder noise removal algorithm in
`determining when unvoiced speech is occurring. A brief
`review of the Pathfinder algorithm is described below, while
`a detailed description is provided in the Related Applica-
`tions.
[0040] With reference to FIG. 3, the acoustic information
coming into Microphone 1 is denoted by m1(n), the infor-
mation coming into Microphone 2 is similarly labeled
m2(n), and the GEMS sensor is assumed available to deter-
mine voiced speech areas. In the z (digital frequency)
domain, these signals are represented as M1(z) and M2(z).
Then
`
M1(z) = S(z) + N2(z)
M2(z) = N(z) + S2(z)

with

N2(z) = N(z)H1(z)
S2(z) = S(z)H2(z)

so that

M1(z) = S(z) + N(z)H1(z)
M2(z) = N(z) + S(z)H2(z)    (1)
`
`[0041] This is the general case for all two microphone
`systems. There is always going to be some leakage of noise
`into Mic 1, and some leakage of signal into Mic 2. Equation
`1 has four unknowns and only two relationships and cannot
`be solved explicitly.
`
`
`[0042] However, there is another way to solve for some of
`the unknowns in Equation 1. Examine the case where the
`signal is not being generated-that is, where the GEMS
`signal indicates voicing is not occurring. In this case,
s(n) = S(z) = 0, and Equation 1 reduces to
`
M1n(z) = N(z)H1(z)
M2n(z) = N(z)

[0043] where the n subscript on the M variables indicates
that only noise is being received. This leads to

M1n(z) = M2n(z)H1(z)

H1(z) = M1n(z) / M2n(z)    (2)
`
`[0044] H1(z) can be calculated using any of the available
`system identification algorithms and the microphone outputs
`when only noise is being received. The calculation can be
`done adaptively, so that if the noise changes significantly
`H1(z) can be recalculated quickly.
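Any adaptive system-identification scheme can serve here. As one hedged example, the sketch below uses a normalized LMS filter trained only on a segment the VAD marks as noise-only; the tap count and step size are illustrative choices, not values from the disclosure.

```python
import numpy as np

def train_h1_nlms(mic1, mic2, num_taps=20, mu=0.5, eps=1e-8):
    """Estimate the noise transfer function H1 (Mic 2 -> Mic 1) with NLMS,
    assuming mic1 and mic2 hold a noise-only segment (VAD reports silence)."""
    h1 = np.zeros(num_taps)
    for n in range(num_taps - 1, len(mic1)):
        x = mic2[n - num_taps + 1:n + 1][::-1]       # most recent Mic 2 samples
        e = mic1[n] - np.dot(h1, x)                  # prediction error at Mic 1
        h1 += (mu / (np.dot(x, x) + eps)) * e * x    # normalized LMS update
    return h1
```

Because the update is cheap, it can be re-run whenever the VAD again reports noise, so the estimate tracks a changing noise field.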
[0045] With a solution for one of the unknowns in Equa-
tion 1, solutions can be found for another, H2(z), by using
the amplitude of the GEMS or similar device along with the
amplitude of the two microphones. When the GEMS indi-
cates voicing, but the recent (less than 1 second) history of
the microphones indicates low levels of noise, assume that
n(n) = N(z) = 0. Then Equation 1 reduces to

M1s(z) = S(z)
M2s(z) = S(z)H2(z)

[0046] which in turn leads to

M2s(z) = M1s(z)H2(z)

H2(z) = M2s(z) / M1s(z)
`
`[0047] which is the inverse of the H1(z) calculation, but
`note that different inputs are being used.
[0048] After calculating H1(z) and H2(z) above, they are
used to remove the noise from the signal. Rewrite Equation
1 as

S(z) = M1(z) - N(z)H1(z)
N(z) = M2(z) - S(z)H2(z)
S(z) = M1(z) - [M2(z) - S(z)H2(z)]H1(z)
S(z)[1 - H2(z)H1(z)] = M1(z) - M2(z)H1(z)

[0049] and solve for S(z) as:

S(z) = [M1(z) - M2(z)H1(z)] / [1 - H2(z)H1(z)]    (3)

[0050] In practice H2(z) is usually quite small, so that
H2(z)H1(z) << 1, and

S(z) = M1(z) - M2(z)H1(z),

[0051] obviating the need for the H2(z) calculation.
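A minimal frequency-domain rendering of this simplified subtraction is sketched below, applying S(z) ≈ M1(z) - M2(z)H1(z) to one analysis window. A practical implementation would work block-by-block with overlap-add (or stay in the time domain with the adaptive filter), so treat this as an illustration of the algebra rather than the Pathfinder implementation itself.

```python
import numpy as np

def denoise_window(mic1_win, mic2_win, h1_time):
    """Apply S(z) ~= M1(z) - M2(z)*H1(z) to one analysis window.

    h1_time is the current estimate of the Mic 2 -> Mic 1 noise transfer
    function (for example, from the NLMS sketch above)."""
    n = len(mic1_win)
    M1 = np.fft.rfft(mic1_win, n)
    M2 = np.fft.rfft(mic2_win, n)
    H1 = np.fft.rfft(h1_time, n)       # zero-padded to the window length
    S = M1 - M2 * H1                   # Equation (3) with H2(z) taken as ~0
    return np.fft.irfft(S, n)
```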
`
`[0052] With reference to FIG. 2 and FIG. 3, the PSAD
`system is described. As sound waves propagate, they nor-
`mally lose energy as they travel due to diffraction and
`dispersion. Assuming the sound waves originate from a
`point source and radiate isotropically, their amplitude will
`decrease as a function of 1/r, where r is the distance from the
`originating point. This function of 1/r proportional to ampli-
`tude is the worst case, if confined to a smaller area the
`reduction will be less. However it is an adequate model for
`the configurations of interest, specifically the propagation of
`noise and speech to microphones located somewhere on the
`user's head.
[0053] FIG. 7 is a microphone array for use under an
embodiment of the PSAD system. Placing the microphones
Mic 1 and Mic 2 in a linear array with the mouth on the array
midline, the difference in signal strength in Mic 1 and Mic
2 (assuming the microphones have identical frequency
responses) will be proportional to both d1 and Δd. Assuming
a 1/r (or in this case 1/d) relationship, it is seen that

ΔM = |Mic 1| / |Mic 2| = ΔH1(z) ∝ (d1 + Δd) / d1,

[0054] where ΔM is the difference in gain between Mic 1
and Mic 2 and therefore H1(z), as above in Equation 2. The
variable d1 is the distance from Mic 1 to the speech or noise
source.
[0055] FIG. 8 is a plot 800 of ΔM versus d1 for several Δd
values, under an embodiment. It is clear that as Δd becomes
larger and the noise source is closer, ΔM becomes larger.
The variable Δd will change depending on the orientation to
the speech/noise source, from the maximum value on the
array midline to zero perpendicular to the array midline.
From the plot 800 it is clear that for small Δd and for
distances over approximately 30 centimeters (cm), ΔM is
close to unity. Since most noise sources are farther away
than 30 cm and are unlikely to be on the midline of the array,
it is probable that when calculating H1(z) as above in
Equation 2, ΔM (or equivalently the gain of H1(z)) will be
close to unity. Conversely, for noise sources that are close
(within a few centimeters), there could be a substantial
difference in gain depending on which microphone is closer
to the noise.
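The shape of plot 800 follows directly from the ΔM = (d1 + Δd)/d1 relationship above; the short check below reproduces it numerically for the same 1-4 cm Δd values (the specific distances are only illustrative).

```python
import numpy as np

# Gain ratio between Mic 1 and Mic 2 under the 1/d model:
# delta_M = (d1 + delta_d) / d1
d1 = np.linspace(1.0, 30.0, 100)           # distance from Mic 1 to the source, cm
for delta_d in (1.0, 2.0, 3.0, 4.0):       # mic spacings used in FIG. 8
    delta_M = (d1 + delta_d) / d1
    # e.g., at d1 = 1 cm and delta_d = 4 cm the ratio is 5.0;
    # at d1 = 30 cm it has already fallen to about 1.13 (near unity)
    print(f"delta_d = {delta_d} cm: delta_M at 30 cm = {delta_M[-1]:.2f}")
```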
`[0056]
`If the "noise" is the user speaking, and Mic 1 is
`closer to the mouth than Mic 2, the gain increases. Since
`environmental noise normally originates much farther away
`from the user's head than speech, noise will be found during
`the time when the gain of H1(z) is near unity or some fixed
`value, and speech can be found after a sharp rise in gain. The
`speech can be unvoiced or voiced, as long as it is of
`sufficient volume compared to the surrounding noise. The
`gain will stay somewhat high during the speech portions,
`then descend quickly after speech ceases. The rapid increase
`and decrease in the gain of H1(z) should be sufficient to
`allow the detection of speech under almost any circum-
`stances. The gain in this example is calculated by the sum of
`the absolute value of the filter coefficients. This sum is not
`equivalent to the gain, but the two are related in that a rise
`in the sum of the absolute value reflects a rise in the gain.
`[0057] As an example of this behavior, FIG. 9 shows a
`plot 900 of the gain parameter 902 as the sum of the absolute
`values of H1(z) and the acoustic data 904 or audio from
`
`
`microphone 1. The speech signal was an utterance of the
`phrase "pop pan", repeated twice. The evaluated bandwidth
`included the frequency range from 2500 Hz to 3500 Hz,
although 1500 Hz to 2500 Hz was additionally used in
`practice. Note the rapid increase in the gain when the
`unvoiced speech is first encountered, then the rapid return to
`normal when the speech ends. The large changes in gain that
`result from transitions between noise and speech can be
`detected by any standard signal processing techniques. The
`standard deviation of the last few gain calculations is used,
`with thresholds being defined by a running average of the
`standard deviations and the standard deviation noise floor.
`The later changes in gain for the voiced speech are sup-
`pressed in this plot 900 for clarity.
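One way to realize this thresholding is sketched below: the gain proxy is the per-window sum of absolute H1 filter coefficients, and speech is flagged when the standard deviation of the recent gains rises well above both a running (noise-only) average and an absolute floor. The class name, history length, and threshold constants are our placeholders for the UV_ma and UV_std values of FIG. 4, which the text does not specify.

```python
import numpy as np
from collections import deque

class PSADDetector:
    """Flag unvoiced speech from rapid rises in the gain proxy
    (sum of |H1| filter coefficients), per subband, as in FIG. 4."""

    def __init__(self, history=10, uv_ma_factor=2.0, uv_std_floor=0.05):
        self.gains = deque(maxlen=history)       # recent gain values (new_sum)
        self.noise_stds = deque(maxlen=history)  # std-devs observed during noise
        self.uv_ma_factor = uv_ma_factor         # multiple of the running average
        self.uv_std_floor = uv_std_floor         # absolute noise-floor threshold

    def step(self, h1_coeffs):
        self.gains.append(np.sum(np.abs(h1_coeffs)))   # gain proxy this window
        if len(self.gains) < 2:
            return False                               # warm-up period
        new_std = np.std(self.gains)                   # spread of recent gains
        running_avg = np.mean(self.noise_stds) if self.noise_stds else new_std
        is_speech = (new_std > self.uv_ma_factor * running_avg
                     and new_std > self.uv_std_floor)
        if not is_speech:                   # update the noise statistics only
            self.noise_stds.append(new_std)  # when no speech is detected
        return is_speech
```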
`[0058] FIG. 10 is an alternative plot 1000 of acoustic data
`presented in FIG. 9. The data used to form plot 900 is
`presented again in this plot 1000, along with audio data 1004
`and GEMS data 1006 without noise to make the unvoiced
`speech apparent. The voiced signal 1002 has three pos