US 20020198705A1

(19) United States
(12) Patent Application Publication    (10) Pub. No.: US 2002/0198705 A1
     Burnett                           (43) Pub. Date: Dec. 26, 2002

(54) DETECTING VOICED AND UNVOICED SPEECH USING BOTH ACOUSTIC AND NONACOUSTIC SENSORS

(76) Inventor: Gregory C. Burnett, Livermore, CA (US)

Correspondence Address:
PERKINS COIE LLP
P.O. BOX 2168
MENLO PARK, CA 94026 (US)

(21) Appl. No.: 10/159,770

(22) Filed: May 30, 2002

Related U.S. Application Data

(60) Provisional application No. 60/294,383, filed on May 30, 2001. Provisional application No. 60/335,100, filed on Oct. 30, 2001. Provisional application No. 60/332,202, filed on Nov. 21, 2001. Provisional application No. 60/362,162, filed on Mar. 5, 2002. Provisional application No. 60/362,103, filed on Mar. 5, 2002. Provisional application No. 60/362,170, filed on Mar. 5, 2002. Provisional application No. 60/361,981, filed on Mar. 5, 2002. Provisional application No. 60/362,161, filed on Mar. 5, 2002. Provisional application No. 60/368,209, filed on Mar. 27, 2002. Provisional application No. 60/368,208, filed on Mar. 27, 2002. Provisional application No. 60/368,343, filed on Mar. 27, 2002.

Publication Classification

(51) Int. Cl.7 .................................................. G10L 11/06
(52) U.S. Cl. ................................................... 704/214

(57) ABSTRACT
Systems and methods are provided for detecting voiced and unvoiced speech in acoustic signals having varying levels of background noise. The systems receive acoustic signals at two microphones, and generate difference parameters between the acoustic signals received at each of the two microphones. The difference parameters are representative of the relative difference in signal gain between portions of the received acoustic signals. The systems identify information of the acoustic signals as unvoiced speech when the difference parameters exceed a first threshold, and identify information of the acoustic signals as voiced speech when the difference parameters exceed a second threshold. Further, embodiments of the systems include non-acoustic sensors that receive physiological information to aid in identifying voiced speech.
[Representative drawing: FIG. 4, the flow diagram of the detection algorithm 50, reproduced on Sheet 4 of 10 below.]
[Sheet 1 of 10, FIG. 1: block diagram of the NAVSAD system 100, showing MICROPHONES 10 and VOICING SENSORS 20 coupled to the PROCESSOR 30, which contains the DETECTION SUBSYSTEM 50 and the DENOISING SUBSYSTEM 40.]
[Sheet 2 of 10, FIG. 2: block diagram of the PSAD system 200, showing MICROPHONES 10 coupled to the PROCESSOR 30, which contains the DETECTION SUBSYSTEM 50 and the DENOISING SUBSYSTEM 40.]
[Sheet 3 of 10, FIG. 3: block diagram of the Pathfinder denoising system, showing the SIGNAL s(n) and NOISE n(n) received at MIC 1 and MIC 2, and the voicing information used by the noise removal subsystem.]
[Sheet 4 of 10, FIG. 4: flow diagram of the detection algorithm 50. With V(window) = 0 initially, the loop reads in 20 msec of data from m1, m2, and the GEMS sensor, stepping 10 msec at a time. NAVSAD branch: calculate the XCORR of m1 and the GEMS signal; take MC = mean(abs(XCORR)) and GSD = the standard deviation of the GEMS data; if MC > VTC and GSD > VTS, set V(window) = 2. PSAD branch: with UV = [0 0], filter m1 and m2 into two bands, 1500-2500 Hz and 2500-3500 Hz; calculate bh1 for each subband using Pathfinder; form new_sum = sum(abs(bh1)); if not keep_old or at the beginning, add new_sum to new_sum_vector (ff numbers long) and take new_std = the standard deviation of new_sum_vector. If new_std > UV_ma*sd_ma and new_std > UV_std, or we are at the beginning, then set UV(subband) = 2, bh1 = bh1_old, and keep_old = 1; otherwise set old_std = new_std, bh1_old = bh1, keep_old = 0, shift sd_ma_vector to the right, replace its first value with old_std, and filter sd_ma_vector with a moving average filter to get sd_ma. After both subbands are checked, test whether CEIL(SUM(UV)/2) = 1.
Constants: V = 0 if noise, 1 if UV, 2 if V; VTC = voiced threshold for correlation; VTS = voiced threshold for std. dev.; ff = forgetting factor for std. dev.; num_ma = number of taps in the m.a. filter; UV_ma = UV std dev m.a. threshold; UV_std = UV std dev threshold; UV = binary values denoting UV detected in each subband; num_begin = number of windows at the "beginning". Variables: bh1 = LMS calculation of the Mic 1-2 transfer function; keep_old = 1 if the last window was V/UV, 0 otherwise; sd_ma_vector = the last NV sd values; sd_ma = m.a. of the last NV sd values.]
[Sheet 5 of 10, FIG. 5A: GEMS signal and mean correlation, plotted with the voiced-speech detection threshold.]
[Sheet 6 of 10, FIG. 6: plot 600 of the detected voicing along with the acoustic noise 606.]
[Sheet 7 of 10, FIG. 7: microphone array for use under an embodiment of the PSAD system.]
[Sheet 8 of 10, FIG. 8: plot 800 of ΔM versus d1 (cm) for Δd = 1, 2, 3, 4 cm, with d1 shown out to approximately 30 cm.]
[Sheet 9 of 10, FIG. 9: plot 900 of the ACOUSTIC DATA 904 (solid) and the gain parameter (dashed) versus time (samples).]
[Sheet 10 of 10, FIG. 10: plot 1000 of Mic 1 audio and the voicing signal V for "pop pan" in \headmic\micgems_p1.bin, showing the VOICING SIGNAL 1002 with its VOICED, UNVOICED, and NOT VOICED levels, the AUDIO, and the GEMS SIGNAL, versus time (samples).]
DETECTING VOICED AND UNVOICED SPEECH USING BOTH ACOUSTIC AND NONACOUSTIC SENSORS

RELATED APPLICATIONS

[0001] This application claims the benefit of U.S. Application Nos. 60/294,383 filed May 30, 2001; 09/905,361 filed Jul. 12, 2001; 60/335,100 filed Oct. 30, 2001; 60/332,202 and 09/990,847, both filed Nov. 21, 2001; 60/362,103, 60/362,161, 60/362,162, 60/362,170, and 60/361,981, all filed Mar. 5, 2002; 60/368,208, 60/368,209, and 60/368,343, all filed Mar. 27, 2002; all of which are incorporated herein by reference in their entirety.
TECHNICAL FIELD

[0002] The disclosed embodiments relate to the processing of speech signals.
BACKGROUND

[0003] The ability to correctly identify voiced and unvoiced speech is critical to many speech applications including speech recognition, speaker verification, noise suppression, and many others. In a typical acoustic application, speech from a human speaker is captured and transmitted to a receiver in a different location. In the speaker's environment there may exist one or more noise sources that pollute the speech signal, or the signal of interest, with unwanted acoustic noise. This makes it difficult or impossible for the receiver, whether human or machine, to understand the user's speech.
[0004] Typical methods for classifying voiced and unvoiced speech have relied mainly on the acoustic content of microphone data, which is plagued by problems with noise and the corresponding uncertainties in signal content. This is especially problematic now with the proliferation of portable communication devices like cellular telephones and personal digital assistants because, in many cases, the quality of service provided by the device depends on the quality of the voice services offered by the device. There are methods known in the art for suppressing the noise present in the speech signals, but these methods demonstrate performance shortcomings that include unusually long computing time, requirements for cumbersome hardware to perform the signal processing, and distortion of the signals of interest.
BRIEF DESCRIPTION OF THE FIGURES

[0005] FIG. 1 is a block diagram of a NAVSAD system, under an embodiment.

[0006] FIG. 2 is a block diagram of a PSAD system, under an embodiment.

[0007] FIG. 3 is a block diagram of a denoising system, referred to herein as the Pathfinder system, under an embodiment.

[0008] FIG. 4 is a flow diagram of a detection algorithm for use in detecting voiced and unvoiced speech, under an embodiment.

[0009] FIG. 5A plots the received GEMS signal for an utterance along with the mean correlation between the GEMS signal and the Mic 1 signal and the threshold for voiced speech detection.
[0010] FIG. 5B plots the received GEMS signal for an utterance along with the standard deviation of the GEMS signal and the threshold for voiced speech detection.
[0011] FIG. 6 plots voiced speech detected from an utterance along with the GEMS signal and the acoustic noise.

[0012] FIG. 7 is a microphone array for use under an embodiment of the PSAD system.

[0013] FIG. 8 is a plot of ΔM versus d1 for several Δd values, under an embodiment.

[0014] FIG. 9 shows a plot of the gain parameter as the sum of the absolute values of H1(z) and the acoustic data or audio from microphone 1.

[0015] FIG. 10 is an alternative plot of acoustic data presented in FIG. 9.

[0016] In the figures, the same reference numbers identify identical or substantially similar elements or acts.

[0017] Any headings provided herein are for convenience only and do not necessarily affect the scope or meaning of the claimed invention.
DETAILED DESCRIPTION

[0018] Systems and methods for discriminating voiced and unvoiced speech from background noise are provided below, including a Non-Acoustic Sensor Voiced Speech Activity Detection (NAVSAD) system and a Pathfinder Speech Activity Detection (PSAD) system. The noise removal and reduction methods provided herein, while allowing for the separation and classification of unvoiced and voiced human speech from background noise, address the shortcomings of typical systems known in the art by cleaning acoustic signals of interest without distortion.
[0019] FIG. 1 is a block diagram of a NAVSAD system 100, under an embodiment. The NAVSAD system couples microphones 10 and sensors 20 to at least one processor 30. The sensors 20 of an embodiment include voicing activity detectors or non-acoustic sensors. The processor 30 controls subsystems including a detection subsystem 50, referred to herein as a detection algorithm, and a denoising subsystem 40. Operation of the denoising subsystem 40 is described in detail in the Related Applications. The NAVSAD system works extremely well in any background acoustic noise environment.
[0020] FIG. 2 is a block diagram of a PSAD system 200, under an embodiment. The PSAD system couples microphones 10 to at least one processor 30. The processor 30 includes a detection subsystem 50, referred to herein as a detection algorithm, and a denoising subsystem 40. The PSAD system is highly sensitive in low acoustic noise environments and relatively insensitive in high acoustic noise environments. The PSAD can operate independently or as a backup to the NAVSAD, detecting voiced speech if the NAVSAD fails.
[0021] Note that the detection subsystems 50 and denoising subsystems 40 of both the NAVSAD and PSAD systems of an embodiment are algorithms controlled by the processor 30, but are not so limited. Alternative embodiments of the NAVSAD and PSAD systems can include detection subsystems 50 and/or denoising subsystems 40 that comprise additional hardware, firmware, software, and/or combinations of hardware, firmware, and software. Furthermore, functions of the detection subsystems 50 and denoising subsystems 40 may be distributed across numerous components of the NAVSAD and PSAD systems.
[0022] FIG. 3 is a block diagram of a denoising subsystem 300, referred to herein as the Pathfinder system, under an embodiment. The Pathfinder system is briefly described below, and is described in detail in the Related Applications. Two microphones Mic 1 and Mic 2 are used in the Pathfinder system, and Mic 1 is considered the "signal" microphone. With reference to FIG. 1, the Pathfinder system 300 is equivalent to the NAVSAD system 100 when the voicing activity detector (VAD) 320 is a non-acoustic voicing sensor 20 and the noise removal subsystem 340 includes the detection subsystem 50 and the denoising subsystem 40. With reference to FIG. 2, the Pathfinder system 300 is equivalent to the PSAD system 200 in the absence of the VAD 320, and when the noise removal subsystem 340 includes the detection subsystem 50 and the denoising subsystem 40.
[0023] The NAVSAD and PSAD systems support a two-level commercial approach in which (i) a relatively less expensive PSAD system supports an acoustic approach that functions in most low- to medium-noise environments, and (ii) a NAVSAD system adds a non-acoustic sensor to enable detection of voiced speech in any environment. Unvoiced speech is normally not detected using the sensor, as it normally does not sufficiently vibrate human tissue. However, in high noise situations detecting the unvoiced speech is not as important, as it is normally very low in energy and easily washed out by the noise. Therefore in high noise environments the unvoiced speech is unlikely to affect the voiced speech denoising. Unvoiced speech information is most important in the presence of little to no noise and, therefore, the unvoiced detection should be highly sensitive in low noise situations, and insensitive in high noise situations. This is not easily accomplished, and comparable acoustic unvoiced detectors known in the art are incapable of operating under these environmental constraints.

[0024] The NAVSAD and PSAD systems include an array algorithm for speech detection that uses the difference in frequency content between two microphones to calculate a relationship between the signals of the two microphones. This is in contrast to conventional arrays that attempt to use the time/phase difference of each microphone to remove the noise outside of an "area of sensitivity". The methods described herein provide a significant advantage, as they do not require a specific orientation of the array with respect to the signal.

[0025] Further, the systems described herein are sensitive to noise of every type and every orientation, unlike conventional arrays that depend on specific noise orientations. Consequently, the frequency-based arrays presented herein are unique as they depend only on the relative orientation of the two microphones themselves, with no dependence on the orientation of the noise and signal with respect to the microphones. This results in a robust signal processing system with respect to the type of noise, microphones, and orientation between the noise/signal source and the microphones.
[0026] The systems described herein use the information derived from the Pathfinder noise suppression system and/or a non-acoustic sensor described in the Related Applications to determine the voicing state of an input signal, as described in detail below. The voicing state includes silent, voiced, and unvoiced states. The NAVSAD system, for example, includes a non-acoustic sensor to detect the vibration of human tissue associated with speech. The non-acoustic sensor of an embodiment is a General Electromagnetic Movement Sensor (GEMS) as described briefly below and in detail in the Related Applications, but is not so limited. Alternative embodiments, however, may use any sensor that is able to detect human tissue motion associated with speech and is unaffected by environmental acoustic noise.

[0027] The GEMS is a radio frequency device (2.4 GHz) that allows the detection of moving human tissue dielectric interfaces. The GEMS includes an RF interferometer that uses homodyne mixing to detect small phase shifts associated with target motion. In essence, the sensor sends out weak electromagnetic waves (less than 1 milliwatt) that reflect off of whatever is around the sensor. The reflected waves are mixed with the original transmitted waves and the results analyzed for any change in position of the targets. Anything that moves near the sensor will cause a change in phase of the reflected wave that will be amplified and displayed as a change in voltage output from the sensor. A similar sensor is described by Gregory C. Burnett (1999) in "The physiological basis of glottal electromagnetic micropower sensors (GEMS) and their use in defining an excitation function for the human vocal tract"; Ph.D. Thesis, University of California at Davis.
[0028] FIG. 4 is a flow diagram of a detection algorithm 50 for use in detecting voiced and unvoiced speech, under an embodiment. With reference to FIGS. 1 and 2, both the NAVSAD and PSAD systems of an embodiment include the detection algorithm 50 as the detection subsystem 50. This detection algorithm 50 operates in real-time and, in an embodiment, operates on 20 millisecond windows and steps 10 milliseconds at a time, but is not so limited. The voice activity determination is recorded for the first 10 milliseconds, and the second 10 milliseconds functions as a "look-ahead" buffer. While an embodiment uses the 20/10 windows, alternative embodiments may use numerous other combinations of window values.
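For concreteness, the 20 msec window and 10 msec step of [0028] can be framed as in the following minimal sketch. The 8000 Hz sample rate is taken from the correlation discussion later in the text ([0034]); all names here are illustrative, not from the patent.

```python
import numpy as np

RATE = 8000                  # samples/sec, per the 8000 Hz rate in [0034]
WIN = RATE * 20 // 1000      # 20 msec window -> 160 samples
STEP = RATE * 10 // 1000     # 10 msec step   -> 80 samples

def frames(signal: np.ndarray):
    """Yield 20 msec windows, stepping 10 msec at a time.

    The voicing decision for each window is recorded for its first
    10 msec; the second 10 msec serves as the "look-ahead" buffer.
    """
    for start in range(0, len(signal) - WIN + 1, STEP):
        yield signal[start:start + WIN]
```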
[0029] Consideration was given to a number of multi-dimensional factors in developing the detection algorithm 50. The biggest consideration was maintaining the effectiveness of the Pathfinder denoising technique, described in detail in the Related Applications and reviewed herein. Pathfinder performance can be compromised if the adaptive filter training is conducted on speech rather than on noise. It is therefore important not to exclude any significant amount of speech from the VAD to keep such disturbances to a minimum.
[0030] Consideration was also given to the accuracy of the characterization between voiced and unvoiced speech signals, and to distinguishing each of these speech signals from noise signals. This type of characterization can be useful in such applications as speech recognition and speaker verification.
[0031] Furthermore, the systems using the detection algorithm of an embodiment function in environments containing varying amounts of background acoustic noise. If the non-acoustic sensor is available, this external noise is not a problem for voiced speech. However, for unvoiced speech (and voiced if the non-acoustic sensor is not available or has malfunctioned) reliance is placed on acoustic data alone to separate noise from unvoiced speech. An advantage inheres in the use of two microphones in an embodiment of the Pathfinder noise suppression system, and the spatial relationship between the microphones is exploited to assist in the detection of unvoiced speech. However, there may occasionally be noise levels high enough that the speech will be nearly undetectable and the acoustic-only method will fail. In these situations, the non-acoustic sensor (or hereafter just the sensor) will be required to ensure good performance.

[0032] In the two-microphone system, the speech source should be relatively louder in one designated microphone when compared to the other microphone. Tests have shown that this requirement is easily met with conventional microphones when the microphones are placed on the head, as any noise should result in an H1 with a gain near unity.
[0033] Regarding the NAVSAD system, and with reference to FIG. 1 and FIG. 3, the NAVSAD relies on two parameters to detect voiced speech. These two parameters include the energy of the sensor in the window of interest, determined in an embodiment by the standard deviation (SD), and optionally the cross-correlation (XCORR) between the acoustic signal from microphone 1 and the sensor data. The energy of the sensor can be determined in any one of a number of ways, and the SD is just one convenient way to determine the energy.

[0034] For the sensor, the SD is akin to the energy of the signal, which normally corresponds quite accurately to the voicing state, but may be susceptible to movement noise (relative motion of the sensor with respect to the human user) and/or electromagnetic noise. To further differentiate sensor noise from tissue motion, the XCORR can be used. The XCORR is only calculated to 15 delays, which corresponds to just under 2 milliseconds at 8000 Hz.
[0035] The XCORR can also be useful when the sensor signal is distorted or modulated in some fashion. For example, there are sensor locations (such as the jaw or back of the neck) where speech production can be detected but where the signal may have incorrect or distorted time-based information. That is, they may not have well defined features in time that will match with the acoustic waveform. However, XCORR is more susceptible to errors from acoustic noise, and in high-noise (<0 dB SNR) environments is almost useless. Therefore it should not be the sole source of voicing information.
[0036] The sensor detects human tissue motion associated with the closure of the vocal folds, so the acoustic signal produced by the closure of the folds is highly correlated with the closures. Therefore, sensor data that correlates highly with the acoustic signal is declared as speech, and sensor data that does not correlate well is termed noise. The acoustic data is expected to lag behind the sensor data by about 0.1 to 0.8 milliseconds (or about 1-7 samples) as a result of the delay time due to the relatively slower speed of sound (around 330 m/s). However, an embodiment uses a 15-sample correlation, as the acoustic wave shape varies significantly depending on the sound produced, and a larger correlation width is needed to ensure detection.
[0037] The SD and XCORR signals are related, but are sufficiently different that using both makes the voiced speech detection more reliable. For simplicity, though, either parameter may be used. The values for the SD and XCORR are compared to empirical thresholds, and if both are above their thresholds, voiced speech is declared. Example data is presented and described below.
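A minimal sketch of this voiced-speech test follows, combining the sensor SD and the mean of the 15-delay XCORR against empirical thresholds as described in [0033]-[0037]. The threshold values here are placeholders, not values from the patent.

```python
import numpy as np

VTS = 0.05   # voiced threshold for the sensor std. dev. (assumed value)
VTC = 0.30   # voiced threshold for the mean correlation (assumed value)

def is_voiced(mic1_win: np.ndarray, gems_win: np.ndarray) -> bool:
    gsd = np.std(gems_win)                  # sensor energy proxy (SD)
    n = len(gems_win)
    # Cross-correlation out to 15 delays (just under 2 ms at 8000 Hz),
    # with the acoustic data lagging the sensor data.
    lags = [np.corrcoef(mic1_win[k:], gems_win[:n - k])[0, 1]
            for k in range(15)]
    mc = np.mean(np.abs(lags))              # mean |XCORR|
    return gsd > VTS and mc > VTC           # both above threshold -> voiced
```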
[0038] FIGS. 5A, 5B, and 6 show data plots for an example in which a subject twice speaks the phrase "pop pan", under an embodiment. FIG. 5A plots the received GEMS signal 502 for this utterance along with the mean correlation 504 between the GEMS signal and the Mic 1 signal and the threshold T1 used for voiced speech detection. FIG. 5B plots the received GEMS signal 502 for this utterance along with the standard deviation 506 of the GEMS signal and the threshold T2 used for voiced speech detection. FIG. 6 plots voiced speech 602 detected from the acoustic or audio signal 608, along with the GEMS signal 604 and the acoustic noise 606; no unvoiced speech is detected in this example because of the heavy background babble noise 606. The thresholds have been set so that there are virtually no false negatives, and only occasional false positives. A voiced speech activity detection accuracy of greater than 99% has been attained under any acoustic background noise conditions.
[0039] The NAVSAD can determine when voiced speech is occurring with high degrees of accuracy due to the non-acoustic sensor data. However, the sensor offers little assistance in separating unvoiced speech from noise, as unvoiced speech normally causes no detectable signal in most non-acoustic sensors. If there is a detectable signal, the NAVSAD can be used, although use of the SD method is dictated as unvoiced speech is normally poorly correlated. In the absence of a detectable signal, use is made of the system and methods of the Pathfinder noise removal algorithm in determining when unvoiced speech is occurring. A brief review of the Pathfinder algorithm is given below, while a detailed description is provided in the Related Applications.
[0040] With reference to FIG. 3, the acoustic information coming into Microphone 1 is denoted by m1(n), the information coming into Microphone 2 is similarly labeled m2(n), and the GEMS sensor is assumed available to determine voiced speech areas. In the z (digital frequency) domain, these signals are represented as M1(z) and M2(z). Then

    M1(z) = S(z) + N2(z)
    M2(z) = N(z) + S2(z)

with

    N2(z) = N(z)H1(z)
    S2(z) = S(z)H2(z)

so that

    M1(z) = S(z) + N(z)H1(z)
    M2(z) = N(z) + S(z)H2(z)                                (1)

[0041] This is the general case for all two-microphone systems. There is always going to be some leakage of noise into Mic 1, and some leakage of signal into Mic 2. Equation 1 has four unknowns and only two relationships, and cannot be solved explicitly.
[0042] However, there is another way to solve for some of the unknowns in Equation 1. Examine the case where the signal is not being generated, that is, where the GEMS signal indicates voicing is not occurring. In this case, s(n) = S(z) = 0, and Equation 1 reduces to

    M1n(z) = N(z)H1(z)
    M2n(z) = N(z)

[0043] where the n subscript on the M variables indicates that only noise is being received. This leads to
    M1n(z) = M2n(z)H1(z)

    H1(z) = M1n(z)/M2n(z)                                   (2)
[0044] H1(z) can be calculated using any of the available system identification algorithms and the microphone outputs when only noise is being received. The calculation can be done adaptively, so that if the noise changes significantly H1(z) can be recalculated quickly.
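As one such adaptive identification, a normalized LMS filter run over noise-only samples can serve as a sketch; NLMS is just one of the available algorithms, and the tap count and step size below are illustrative assumptions.

```python
import numpy as np

def nlms_h1(m2: np.ndarray, m1: np.ndarray, taps: int = 20,
            mu: float = 0.1, eps: float = 1e-8) -> np.ndarray:
    """Adapt FIR coefficients h so that filtering m2 approximates m1,
    i.e. an estimate of H1(z) during noise-only (VAD = 0) windows."""
    h = np.zeros(taps)
    for n in range(taps, len(m1)):
        x = m2[n - taps:n][::-1]           # most recent m2 samples
        e = m1[n] - h @ x                  # prediction error
        h += (mu / (x @ x + eps)) * e * x  # normalized LMS update
    return h
```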
[0045] With a solution for one of the unknowns in Equation 1, a solution can be found for another, H2(z), by using the amplitude of the GEMS or similar device along with the amplitudes of the two microphones. When the GEMS indicates voicing, but the recent (less than 1 second) history of the microphones indicates low levels of noise, assume that n(n) = N(z) ≈ 0. Then Equation 1 reduces to

    M1s(z) = S(z)
    M2s(z) = S(z)H2(z)

[0046] which in turn leads to

    M2s(z) = M1s(z)H2(z)

    H2(z) = M2s(z)/M1s(z)

[0047] which is the inverse of the H1(z) calculation, but note that different inputs are being used.
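The corresponding H2(z) estimate can be sketched as a spectral ratio over a voiced, low-noise window, per the n(n) = N(z) ≈ 0 assumption above; the helper name and the eps guard are illustrative.

```python
import numpy as np

def estimate_h2(m1_win: np.ndarray, m2_win: np.ndarray,
                eps: float = 1e-12) -> np.ndarray:
    """H2(z) = M2s(z)/M1s(z), computed on the FFT grid of one window."""
    M1s = np.fft.rfft(m1_win)
    M2s = np.fft.rfft(m2_win)
    return M2s / (M1s + eps)   # eps avoids division by empty bins
```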
[0048] After calculating H1(z) and H2(z) above, they are used to remove the noise from the signal. Rewrite Equation 1 as

    S(z) = M1(z) - N(z)H1(z)
    N(z) = M2(z) - S(z)H2(z)
    S(z) = M1(z) - [M2(z) - S(z)H2(z)]H1(z)
    S(z)[1 - H2(z)H1(z)] = M1(z) - M2(z)H1(z)

[0049] and solve for S(z) as:

    S(z) = [M1(z) - M2(z)H1(z)] / [1 - H2(z)H1(z)]          (3)

[0050] In practice H2(z) is usually quite small, so that H2(z)H1(z) << 1, and

    S(z) ≈ M1(z) - M2(z)H1(z),

[0051] obviating the need for the H2(z) calculation.
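A minimal per-frame sketch of this simplified denoising step, applied in the frequency domain, follows; it assumes h1 was identified during noise-only frames (for example with the NLMS sketch above), and framing/overlap-add details are omitted for brevity.

```python
import numpy as np

def denoise_frame(m1_win: np.ndarray, m2_win: np.ndarray,
                  h1: np.ndarray) -> np.ndarray:
    """Apply S(z) ≈ M1(z) - M2(z)H1(z) to one window of samples."""
    n = len(m1_win)
    H1 = np.fft.rfft(h1, n)                # H1(z) on the FFT grid
    S = np.fft.rfft(m1_win) - np.fft.rfft(m2_win) * H1
    return np.fft.irfft(S, n)              # cleaned speech estimate
```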
[0052] With reference to FIG. 2 and FIG. 3, the PSAD system is described. As sound waves propagate, they normally lose energy as they travel due to diffraction and dispersion. Assuming the sound waves originate from a point source and radiate isotropically, their amplitude will decrease as a function of 1/r, where r is the distance from the originating point. This 1/r dependence of the amplitude is the worst case; if the propagation is confined to a smaller area the reduction will be less. However, it is an adequate model for the configurations of interest, specifically the propagation of noise and speech to microphones located somewhere on the user's head.

[0053] FIG. 7 is a microphone array for use under an embodiment of the PSAD system. Placing the microphones Mic 1 and Mic 2 in a linear array with the mouth on the array midline, the difference in signal strength in Mic 1 and Mic 2 (assuming the microphones have identical frequency responses) will be proportional to both d1 and Δd. Assuming a 1/r (or in this case 1/d) relationship, it is seen that

    ΔM = |Mic 1| / |Mic 2| = ΔH1(z) ∝ (d1 + Δd) / d1
[0054] where ΔM is the difference in gain between Mic 1 and Mic 2 and therefore H1(z), as above in Equation 2. The variable d1 is the distance from Mic 1 to the speech or noise source.
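A short numeric check of the ΔM relationship above, using illustrative distances that mirror the ranges plotted in FIG. 8:

```python
def delta_m(d1_cm: float, dd_cm: float) -> float:
    """ΔM = (d1 + Δd)/d1 for a source at distance d1 and mic spacing Δd."""
    return (d1_cm + dd_cm) / d1_cm

print(delta_m(1.0, 2.0))    # close source:  ΔM = 3.0 (large gain difference)
print(delta_m(30.0, 2.0))   # far source:    ΔM ≈ 1.07 (gain near unity)
```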
[0055] FIG. 8 is a plot 800 of ΔM versus d1 for several Δd values, under an embodiment. It is clear that as Δd becomes larger and the noise source is closer, ΔM becomes larger. The variable Δd will change depending on the orientation to the speech/noise source, from the maximum value on the array midline to zero perpendicular to the array midline. From the plot 800 it is clear that for small Δd and for distances over approximately 30 centimeters (cm), ΔM is close to unity. Since most noise sources are farther away than 30 cm and are unlikely to be on the midline of the array, it is probable that when calculating H1(z) as above in Equation 2, ΔM (or equivalently the gain of H1(z)) will be close to unity. Conversely, for noise sources that are close (within a few centimeters), there could be a substantial difference in gain depending on which microphone is closer to the noise.
[0056] If the "noise" is the user speaking, and Mic 1 is closer to the mouth than Mic 2, the gain increases. Since environmental noise normally originates much farther away from the user's head than speech, noise will be found during the time when the gain of H1(z) is near unity or some fixed value, and speech can be found after a sharp rise in gain. The speech can be unvoiced or voiced, as long as it is of sufficient volume compared to the surrounding noise. The gain will stay somewhat high during the speech portions, then descend quickly after speech ceases. The rapid increase and decrease in the gain of H1(z) should be sufficient to allow the detection of speech under almost any circumstances. The gain in this example is calculated by the sum of the absolute values of the filter coefficients. This sum is not equivalent to the gain, but the two are related in that a rise in the sum of the absolute values reflects a rise in the gain.
[0057] As an example of this behavior, FIG. 9 shows a plot 900 of the gain parameter 902 as the sum of the absolute values of H1(z) and the acoustic data 904 or audio from microphone 1. The speech signal was an utterance of the phrase "pop pan", repeated twice. The evaluated bandwidth included the frequency range from 2500 Hz to 3500 Hz, although 1500 Hz to 2500 Hz was additionally used in practice. Note the rapid increase in the gain when the unvoiced speech is first encountered, then the rapid return to normal when the speech ends. The large changes in gain that result from transitions between noise and speech can be detected by any standard signal processing techniques. The standard deviation of the last few gain calculations is used, with thresholds being defined by a running average of the standard deviations and the standard deviation noise floor. The later changes in gain for the voiced speech are suppressed in this plot 900 for clarity.
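A sketch of this gain-tracking test follows: the standard deviation of the last few sum(|h1|) values is compared against a running moving-average baseline. The buffer lengths and multipliers are illustrative stand-ins for the ff, num_ma, UV_ma, and UV_std constants named in FIG. 4.

```python
import numpy as np
from collections import deque

class PsadDetector:
    def __init__(self, ff: int = 5, num_ma: int = 20,
                 uv_ma: float = 3.0, uv_std: float = 0.01):
        self.sums = deque(maxlen=ff)      # last few gain values
        self.sds = deque(maxlen=num_ma)   # history of their std. devs.
        self.uv_ma, self.uv_std = uv_ma, uv_std

    def update(self, h1: np.ndarray) -> bool:
        """Return True if this window's gain jump indicates unvoiced speech."""
        self.sums.append(np.sum(np.abs(h1)))   # gain proxy: sum(|h1|)
        new_std = np.std(self.sums)            # std. dev. of recent gains
        sd_ma = np.mean(self.sds) if self.sds else new_std
        unvoiced = new_std > self.uv_ma * sd_ma and new_std > self.uv_std
        if not unvoiced:                       # adapt baseline on noise only,
            self.sds.append(new_std)           # mirroring keep_old in FIG. 4
        return unvoiced
```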
[0058] FIG. 10 is an alternative plot 1000 of the acoustic data presented in FIG. 9. The data used to form plot 900 is presented again in this plot 1000, along with audio data 1004 and GEMS data 1006 without noise to make the unvoiced speech apparent. The voicing signal 1002 has three possible values: 0 for noise, 1 for unvoiced, and 2 for voiced. Denoising is only accomplished when V = 0. It is clear that the unvoiced speech is captured very well, aside from two single dropouts in the unvoiced detection near the end of each "pop". However, these single-window dropouts are not common and do not significantly affect the denoising algorithm. They can easily be removed using standard smoothing techniques.
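Pulling the earlier sketches together, the FIG. 4 loop can be outlined as below, which also makes the automatic-backup behavior discussed next concrete: the NAVSAD test runs first, and the PSAD subband test handles windows the sensor does not mark as voiced. The crude FFT band-pass and all reused helpers (frames, is_voiced, nlms_h1, PsadDetector) are the illustrative assumptions from the earlier sketches, not the patent's implementation.

```python
import numpy as np

BANDS = [(1500, 2500), (2500, 3500)]   # Hz, the two subbands of FIG. 4

def bandpass(x: np.ndarray, lo: float, hi: float, rate: int = 8000):
    """Crude FFT band-pass, adequate for a sketch."""
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), 1 / rate)
    X[(f < lo) | (f > hi)] = 0
    return np.fft.irfft(X, len(x))

def detect(m1, m2, gems):
    """Return one voicing value per window: 0 noise, 1 UV, 2 V."""
    detectors = [PsadDetector() for _ in BANDS]   # one per subband
    v = []
    for m1_win, m2_win, g_win in zip(frames(m1), frames(m2), frames(gems)):
        if is_voiced(m1_win, g_win):              # NAVSAD: GEMS SD and XCORR
            v.append(2)
            continue
        # PSAD backup: adapt H1 per subband, test for an unvoiced gain jump.
        uv = [det.update(nlms_h1(bandpass(m2_win, lo, hi),
                                 bandpass(m1_win, lo, hi)))
              for (lo, hi), det in zip(BANDS, detectors)]
        v.append(1 if np.ceil(sum(uv) / 2) == 1 else 0)  # CEIL(SUM(UV)/2) = 1
    return v
```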
[0059] What is not clear from this plot 1000 is that the PSAD system functions as an automatic backup to the NAVSAD. This is because the voiced speech (since it has the same spatial relationship