`
`(19) World Intellectual Property Organization
`International Bureau
`
`
`(43) International Publication Date
`20 November 2003 (20.11.2003)
`
`PCT
`
`(10) International Publication Number
`WO 03/096031 A2
`
(51) International Patent Classification 7: G01R
`
`(21) International Application Number:
`
`PCT/US03/06893
`
`(74) Agent: GREGORY, Richard, L., Jr.; Shemwell Gregory
`& Courtney LLP, 4880 Stevens Creek Blvd., Suite 201, San
`Jose, CA 95129 (US).
`
`(22) International Filing Date:
`
`5 March 2003 (05.03.2003)
`
(25) Filing Language: English

(26) Publication Language: English
`
(30) Priority Data:
60/361,981    5 March 2002 (05.03.2002)    US
60/362,103    5 March 2002 (05.03.2002)    US
60/362,161    5 March 2002 (05.03.2002)    US
60/362,162    5 March 2002 (05.03.2002)    US
60/362,170    5 March 2002 (05.03.2002)    US
`
`(71) Applicant: ALIPHCOM [US/US]; 410 Jesse Street, Unit
`601, San Francisco, CA 94103 (US).
`
(72) Inventors: BURNETT, Gregory, C.; 675 South H Street,
Livermore, CA 94550 (US). PETIT, Nicolas, J.; 3300
Scott Street #207, San Francisco, CA 94123 (US).
ASSEILY, Alexander, M.; 2538 Post Street, San Francisco,
CA 94115 (US). EINAUDI, Andrew, E.; 1570 Grove
Street, San Francisco, CA 94117 (US).
`
`(81) Designated States (national): AE, AG, AL, AM, AT, AU,
`AZ, BA, BB, BG, BR, BY, BZ, CA, CH, CN, CO, CR, CU,
CZ, DE, DK, DM, DZ, EC, EE, ES, FI, GB, GD, GE, GH,
`GM, HR, HU, ID, IL, IN, IS, JP, KE, KG, KP, KR, KZ, LC,
`LK, LR, LS, LT, LU, LV, MA, MD, MG, MK, MN, MW,
`MX, MZ, NO, NZ, OM, PH, PL, PT, RO, RU, SD, SE, SG,
`SK, SL, TJ, TM, TN, TR, TT, TZ, UA, UG, UZ, VN, YU,
ZA, ZM, ZW.
`
`(84) Designated States (regional): ARIPO patent (GH, GM,
`KE, LS, MW, MZ, SD, SL, SZ, TZ, UG, ZM, ZW),
`Eurasian patent (AM, AZ, BY, KG, KZ, MD, RU, TJ, TM),
`European patent (AT, BE, BG, CH, CY, CZ, DE, DK, EE,
ES, FI, FR, GB, GR, HU, IE, IT, LU, MC, NL, PT, RO,
`SE, SI, SK, TR), OAPI patent (BF, BJ, CF, CG, CI, CM,
`GA, GN, GQ, GW, ML, MR, NE, SN, TD, TG).
`
`Published:
`without international search report and to be republished
`upon receipt of that report
`
`
(54) Title: VOICE ACTIVITY DETECTION (VAD) DEVICES AND METHODS FOR USE WITH NOISE SUPPRESSION SYSTEMS
`
[Cover figure (Figure 1): block diagram of signal processing system 100, showing VAD 102, speech signal source s(n) 120, noise source n(n) 122, MIC 1 110, MIC 2 112, and the cleaned speech output.]
(57) Abstract: Voice Activity Detection (VAD) devices, systems and methods are described for use with signal processing systems
to denoise acoustic signals. Components of a signal processing system and/or VAD system receive acoustic signals and voice activity
signals. Control signals are automatically generated from data of the voice activity signals. Components of the signal processing
system and/or VAD system use the control signals to automatically select a denoising method appropriate to data of frequency
subbands of the acoustic signals. The selected denoising method is applied to the acoustic signals to generate denoised acoustic signals.
`
`Page 1 of 72
`
`GOOGLE EXHIBIT 1015
`
`
`
`
For two-letter codes and other abbreviations, refer to the "Guidance
Notes on Codes and Abbreviations" appearing at the beginning
of each regular issue of the PCT Gazette.
`
`
`Attorney Docket No. ALPH.P015WO
`
`Transmittal of Patent Application for Filing
`
Certification Under 37 C.F.R. §1.10 (if applicable)
`
`EV 235 876 028 US
`"Express Mail" Label Number
`
`March 5, 2003
`Date of Deposit
`
I hereby certify that this application, and any other documents referred to as enclosed herein are being
deposited in an envelope with the United States Postal Service "Express Mail Post Office to Addressee" service under
37 CFR §1.10 on the date indicated above and addressed to the Assistant Commissioner for Patents, Washington,
D.C. 20231
`
`Richard L. Gregory, Jr.
`
`(Print Name of Person Mailing Application)
`
Voice Activity Detection (VAD) Devices and Methods For Use With Noise Suppression Systems
`
`INVENTORS:
`
`GREGORY C. BURNETT
`NICOLAS J. PETIT
`ALEXANDER M. ASSEILY
`ANDREW E. EINAUDI
`
`RELATED APPLICATIONS
`
This application claims priority from the following United States Patent
Applications: Application Number 60/362,162, entitled PATHFINDER-BASED
VOICE ACTIVITY DETECTION (PVAD) USED WITH PATHFINDER NOISE
SUPPRESSION, filed March 5, 2002; Application Number 60/362,170, entitled
ACCELEROMETER-BASED VOICE ACTIVITY DETECTION (PVAD) WITH
PATHFINDER NOISE SUPPRESSION, filed March 5, 2002; Application Number
60/361,981, entitled ARRAY-BASED VOICE ACTIVITY DETECTION (AVAD)
AND PATHFINDER NOISE SUPPRESSION, filed March 5, 2002; Application
Number 60/362,161, entitled PATHFINDER NOISE SUPPRESSION USING AN
EXTERNAL VOICE ACTIVITY DETECTION (VAD) DEVICE, filed March 5, 2002;
Application Number 60/362,103, entitled ACCELEROMETER-BASED VOICE
ACTIVITY DETECTION, filed March 5, 2002; and Application Number 60/368,343,
entitled TWO-MICROPHONE FREQUENCY-BASED VOICE ACTIVITY
DETECTION, filed March 27, 2002, all of which are currently pending.
`
Further, this application relates to the following United States Patent
Applications: Application Number 09/905,361, entitled METHOD AND APPARATUS
FOR REMOVING NOISE FROM ELECTRONIC SIGNALS, filed July 12, 2001;
Application Number 10/159,770, entitled DETECTING VOICED AND UNVOICED
SPEECH USING BOTH ACOUSTIC AND NONACOUSTIC SENSORS, filed May
30, 2002; and Application Number 10/301,237, entitled METHOD AND
APPARATUS FOR REMOVING NOISE FROM ELECTRONIC SIGNALS, filed
November 21, 2002.
`
`TECHNICAL FIELD
`
The disclosed embodiments relate to systems and methods for detecting and
processing a desired signal in the presence of acoustic noise.
`
`BACKGROUND
`
Many noise suppression algorithms and techniques have been developed over
the years. Most of the noise suppression systems in use today for speech
communication systems are based on a single-microphone spectral subtraction
technique first developed in the 1970s and described, for example, by S. F. Boll in
"Suppression of Acoustic Noise in Speech using Spectral Subtraction," IEEE Trans. on
ASSP, pp. 113-120, 1979. These techniques have been refined over the years, but the
basic principles of operation have remained the same. See, for example, United States
Patent Number 5,687,243 of McLaughlin, et al., and United States Patent Number
4,811,404 of Vilmur, et al. Generally, these techniques make use of a single-microphone
Voice Activity Detector (VAD) to determine the background noise
characteristics, where "voice" is generally understood to include human voiced speech,
unvoiced speech, or a combination of voiced and unvoiced speech.
`
The VAD has also been used in digital cellular systems. As an example of such
a use, see United States Patent Number 6,453,291 of Ashley, where a VAD
configuration appropriate to the front-end of a digital cellular system is described.
Further, some Code Division Multiple Access (CDMA) systems utilize a VAD to
minimize the effective radio spectrum used, thereby allowing for more system capacity.
Also, Global System for Mobile Communication (GSM) systems can include a VAD to
reduce co-channel interference and to reduce battery consumption on the client or
subscriber device.
`
These typical single-microphone VAD systems are significantly limited in
capability as a result of the analysis of acoustic information received by the single
microphone, wherein the analysis is performed using typical signal processing
techniques. In particular, limitations in performance of these single-microphone VAD
systems are noted when processing signals having a low signal-to-noise ratio (SNR),
and in settings where the background noise varies quickly. Thus, similar limitations are
found in noise suppression systems using these single-microphone VADs.
`
`
`BRIEF DESCRIPTION OF THE FIGURES
`
Figure 1 is a block diagram of a signal processing system including the
Pathfinder noise suppression system and a VAD system, under an embodiment.

Figure 1A is a block diagram of a VAD system including hardware for use in
receiving and processing signals relating to VAD, under an embodiment.

Figure 1B is a block diagram of a VAD system using hardware of the
associated noise suppression system for use in receiving VAD information, under an
alternative embodiment.

Figure 2 is a block diagram of a signal processing system that incorporates a
classical adaptive noise cancellation system, as known in the art.

Figure 3 is a flow diagram of a method for determining voiced and unvoiced
speech using an accelerometer-based VAD, under an embodiment.

Figure 4 shows plots including a noisy audio signal (live recording) along with
a corresponding accelerometer-based VAD signal, the corresponding accelerometer
output signal, and the denoised audio signal following processing by the Pathfinder
system using the VAD signal, under an embodiment.

Figure 5 shows plots including a noisy audio signal (live recording) along with
a corresponding SSM-based VAD signal, the corresponding SSM output signal, and the
denoised audio signal following processing by the Pathfinder system using the VAD
signal, under an embodiment.

Figure 6 shows plots including a noisy audio signal (live recording) along with
a corresponding GEMS-based VAD signal, the corresponding GEMS output signal, and
the denoised audio signal following processing by the Pathfinder system using the
VAD signal, under an embodiment.

Figure 7 shows plots including recorded spoken acoustic data with digitally
added noise along with a corresponding EGG-based VAD signal, and the
corresponding highpass filtered EGG output signal, under an embodiment.

Figure 8 is a flow diagram 80 of a method for determining voiced speech using
a video-based VAD, under an embodiment.

Figure 9 shows plots including a noisy audio signal (live recording) along with
a corresponding single (gradient) microphone-based VAD signal, the corresponding
gradient microphone output signal, and the denoised audio signal following processing
by the Pathfinder system using the VAD signal, under an embodiment.
`
`
Figure 10 shows a single cardioid unidirectional microphone of the microphone
array, along with the associated spatial response curve, under an embodiment.

Figure 11 shows a microphone array of a PVAD system, under an embodiment.

Figure 12 is a flow diagram of a method for determining voiced and unvoiced
speech using H1(z) gain values, under an alternative embodiment of the PVAD.

Figure 13 shows plots including a noisy audio signal (live recording) along
with a corresponding microphone-based PVAD signal, the corresponding PVAD gain
versus time signal, and the denoised audio signal following processing by the
Pathfinder system using the PVAD signal, under an embodiment.

Figure 14 is a flow diagram of a method for determining voiced and unvoiced
speech using a stereo VAD, under an embodiment.

Figure 15 shows plots including a noisy audio signal (live recording) along
with a corresponding SVAD signal, and the denoised audio signal following processing
by the Pathfinder system using the SVAD signal, under an embodiment.

Figure 16 is a flow diagram of a method for determining voiced and unvoiced
speech using an AVAD, under an embodiment.

Figure 17 shows plots including audio signals from each microphone of an
AVAD system along with the corresponding combined energy signal, under an
embodiment.

Figure 18 is a block diagram of a signal processing system including the
Pathfinder noise suppression system and a single-microphone (conventional) VAD
system, under an embodiment.

Figure 19 is a flow diagram of a method for generating voicing information
using a single-microphone VAD, under an embodiment.

Figure 20 is a flow diagram of a method for determining voiced and unvoiced
speech using an airflow-based VAD, under an embodiment.

Figure 21 shows plots including a noisy audio signal along with a
corresponding manually activated/calculated VAD signal, and the denoised audio
signal following processing by the Pathfinder system using the manual VAD signal,
under an embodiment.
`
In the drawings, the same reference numbers identify identical or substantially
similar elements or acts. To easily identify the discussion of any particular element or
act, the most significant digit or digits in a reference number refer to the Figure number
in which that element is first introduced (e.g., element 104 is first introduced and
discussed with respect to Figure 1).
`
`
`DETAILED DESCRIPTION
`
Numerous Voice Activity Detection (VAD) devices and methods are described
below for use with adaptive noise suppression systems. Further, results are presented
below from experiments using the VAD devices and methods described herein as a
component of a noise suppression system, in particular the Pathfinder Noise
Suppression System available from Aliph, San Francisco, California
(http://www.aliph.com), but the embodiments are not so limited. In the description
below, when the Pathfinder noise suppression system is referred to, it should be kept in
mind that noise suppression systems that estimate the noise waveform and subtract it
from a signal and that use or are capable of using VAD information for reliable
operation are included in that reference. Pathfinder is simply a convenient referenced
implementation for a system that operates on signals comprising desired speech signals
along with noise.
`
When using the VAD devices and methods described herein with a noise
suppression system, the VAD signal is processed independently of the noise
suppression system, so that the receipt and processing of VAD information is
independent from the processing associated with the noise suppression, but the
embodiments are not so limited. This independence is attained physically (i.e.,
different hardware for use in receiving and processing signals relating to the VAD and
the noise suppression), through processing (i.e., using the same hardware to receive
signals into the noise suppression system while using independent techniques
(software, algorithms, routines) to process the received signals), and through a
combination of different hardware and different software.
`
In the following description, "acoustic" is generally defined as acoustic waves
propagating in air. Propagation of acoustic waves in media other than air will be noted
as such. References to "speech" or "voice" generally refer to human speech including
voiced speech, unvoiced speech, and/or a combination of voiced and unvoiced speech.
Unvoiced speech or voiced speech is distinguished where necessary. The term "noise
suppression" generally describes any method by which noise is reduced or eliminated
in an electronic signal.
`
Moreover, the term "VAD" is generally defined as a vector or array signal, data,
or information that in some manner represents the occurrence of speech in the digital or
analog domain. A common representation of VAD information is a one-bit digital
signal sampled at the same rate as the corresponding acoustic signals, with a zero value
representing that no speech has occurred during the corresponding time sample, and a
unity value indicating that speech has occurred during the corresponding time sample.
While the embodiments described herein are generally described in the digital domain,
the descriptions are also valid for the analog domain.
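
The one-bit representation described above can be illustrated with a minimal Python sketch. The function name `intervals_to_vad`, its parameters, and the example speech interval are hypothetical and chosen only for illustration; the sketch simply marks each acoustic sample with 0 (no speech) or 1 (speech), so that the VAD signal has the same sample rate as the acoustic signal.

```python
def intervals_to_vad(num_samples, sample_rate, speech_intervals):
    """Build a one-bit VAD signal with one flag per acoustic sample.

    speech_intervals is a list of (start_seconds, end_seconds) pairs
    (an assumed input format, not one defined by the source document).
    """
    vad = [0] * num_samples  # zero: no speech during this time sample
    for start_s, end_s in speech_intervals:
        first = max(0, int(round(start_s * sample_rate)))
        last = min(num_samples, int(round(end_s * sample_rate)))
        for i in range(first, last):
            vad[i] = 1       # unity: speech during this time sample
    return vad


# Two seconds of audio at a toy rate of 8 samples per second,
# with speech assumed between 0.5 s and 1.0 s.
vad = intervals_to_vad(num_samples=16, sample_rate=8,
                       speech_intervals=[(0.5, 1.0)])
```

Because the flags are aligned sample-for-sample with the audio, a downstream noise suppression routine can index both signals with the same counter.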

The VAD devices/methods described herein generally include vibration and
movement sensors, acoustic sensors, and manual VAD devices, but are not so limited.
In one embodiment, an accelerometer is placed on the skin for use in detecting skin
surface vibrations that correlate with human speech. These recorded vibrations are then
used to calculate a VAD signal for use with or by an adaptive noise suppression
algorithm in suppressing environmental acoustic noise from a simultaneously (within a
few milliseconds) recorded acoustic signal that includes both speech and noise.
`
Another embodiment of the VAD devices/methods described herein includes an
acoustic microphone modified with a membrane so that the microphone no longer
efficiently detects acoustic vibrations in air. The membrane, though, allows the
microphone to detect acoustic vibrations in objects with which it is in physical contact
(allowing a good mechanical impedance match), such as human skin. That is, the
acoustic microphone is modified in some way such that it no longer detects acoustic
vibrations in air (where it no longer has a good physical impedance match), but only in
objects with which the microphone is in contact. This configures the microphone, like
the accelerometer, to detect vibrations of human skin associated with the speech
production of that human while not efficiently detecting acoustic environmental noise
in the air. The detected vibrations are processed to form a VAD signal for use in a
noise suppression system, as detailed below.
`
Yet another embodiment of the VAD described herein uses an electromagnetic
vibration sensor, such as a radio frequency (RF) vibrometer or laser vibrometer, which
detects skin vibrations. Further, the RF vibrometer detects the movement of tissue
within the body, such as the inner surface of the cheek or the tracheal wall. Both the
exterior skin and internal tissue vibrations associated with speech production can be
used to form a VAD signal for use in a noise suppression system as detailed below.
`
Further embodiments of the VAD devices/methods described herein include an
electroglottograph (EGG) to directly detect vocal fold movement. The EGG is an
alternating current (AC) based method of measuring vocal fold contact area. When the
EGG indicates sufficient vocal fold contact the assumption that follows is that voiced
speech is occurring, and a corresponding VAD signal representative of voiced speech is
generated for use in a noise suppression system as detailed below. Similarly, an
additional VAD embodiment uses a video system to detect movement of a person's
vocal articulators, an indication that speech is being produced.
`
Another set of VAD devices/methods described below use signals received at
one or more acoustic microphones along with corresponding signal processing
techniques to produce VAD signals accurately and reliably under most environmental
noise conditions. These embodiments include simple arrays and co-located (or nearly
so) combinations of omnidirectional and unidirectional acoustic microphones. The
simplest configuration in this set of VAD embodiments includes the use of a single
microphone located very close to the mouth of the user in order to record signals at a
relatively high SNR. This microphone can be a gradient or "close-talk" microphone,
for example. Other configurations include the use of combinations of unidirectional
and omnidirectional microphones in various orientations and configurations. The
signals received at these microphones, along with the associated signal processing, are
used to calculate a VAD signal for use with a noise suppression system, as described
below. Also described below is a VAD system that is activated manually, as in a
walkie-talkie, or by an observer to the system.
`
As referenced above, the VAD devices and methods described herein are for
use with noise suppression systems like, for example, the Pathfinder Noise Suppression
System (referred to herein as the "Pathfinder system") available from Aliph of San
Francisco, California. While the descriptions of the VAD devices herein are provided
in the context of the Pathfinder Noise Suppression System, those skilled in the art will
recognize that the VAD devices and methods can be used with a variety of noise
suppression systems and methods known in the art.
`
The Pathfinder system is a digital signal processing (DSP) based acoustic noise
suppression and echo-cancellation system. The Pathfinder system, which can couple to
the front-end of speech processing systems, uses VAD information and received
acoustic information to reduce or eliminate noise in desired acoustic signals by
estimating the noise waveform and subtracting it from a signal including both speech
and noise. The Pathfinder system is described further below and in the Related
Applications.
`
Figure 1 is a block diagram of a signal processing system 100 including the
Pathfinder noise suppression system 101 and a VAD system 102, under an
embodiment. The signal processing system 100 includes two microphones MIC 1 110
and MIC 2 112 that receive signals or information from at least one speech signal
source 120 and at least one noise source 122. The path s(n) from the speech signal
source 120 to MIC 1 and the path n(n) from the noise source 122 to MIC 2 are
considered to be unity. Further, H1(z) represents the path from the noise source 122 to
MIC 1, and H2(z) represents the path from the speech signal source 120 to MIC 2. In
contrast to the signal processing system 100 including the Pathfinder system 101,
Figure 2 is a block diagram of a signal processing system 200 that incorporates a
classical adaptive noise cancellation system 202 as known in the art.
`
Components of the signal processing system 100, for example the noise
suppression system 101, couple to the microphones MIC 1 and MIC 2 via wireless
couplings, wired couplings, and/or a combination of wireless and wired couplings.
Likewise, the VAD system 102 couples to components of the signal processing system
100, like the noise suppression system 101, via wireless couplings, wired couplings,
and/or a combination of wireless and wired couplings. As an example, the VAD
devices and microphones described below as components of the VAD system 102 can
comply with the Bluetooth wireless specification for wireless communication with
other components of the signal processing system, but are not so limited.
`
Referring to Figure 1, the VAD signal 104 from the VAD system 102, derived
in a manner described herein, controls noise removal from the received signals without
respect to noise type, amplitude, and/or orientation. When the VAD signal 104
indicates an absence of voicing, the Pathfinder system 101 uses MIC 1 and MIC 2
signals to calculate the coefficients for a model of transfer function H1(z) over
pre-specified subbands of the received signals. When the VAD signal 104 indicates the
presence of voicing, the Pathfinder system 101 stops updating H1(z) and starts
calculating the coefficients for transfer function H2(z) over pre-specified subbands of
the received signals. Updates of H1 coefficients can continue in a subband during
speech production if the SNR in the subband is low (note that H1(z) and H2(z) are
sometimes referred to herein as H1 and H2, respectively, for convenience). The
Pathfinder system 101 of an embodiment uses the Least Mean Squares (LMS)
technique to calculate H1 and H2, as described further by B. Widrow and S. Stearns in
"Adaptive Signal Processing", Prentice-Hall Publishing, ISBN 0-13-004029-0, but is
not so limited. The transfer function can be calculated in the time domain, frequency
domain, or a combination of both the time/frequency domains. The Pathfinder system
subsequently removes noise from the received acoustic signals of interest using
combinations of the transfer functions H1(z) and H2(z), thereby generating at least one
denoised acoustic stream.
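
The VAD-gated coefficient update described above can be sketched in Python as follows. This is a simplified illustration, not the Pathfinder implementation: it assumes a single full-band time-domain FIR model of H1(z), a plain LMS update with a hypothetical step size `mu`, and it omits the H2(z) estimate and the subband processing entirely. The names `vad_gated_lms` and `simulate_noise_only`, the tap count, and the synthetic "true" H1 are all assumptions made for the example.

```python
import random


def vad_gated_lms(mic1, mic2, vad, num_taps=2, mu=0.1):
    """Estimate an FIR model of H1 with LMS, adapting only when vad[n] == 0.

    MIC 2 is treated as the noise reference and MIC 1 as the signal to clean
    (simplifying assumption: the H2 path is ignored).
    """
    h1 = [0.0] * num_taps        # current FIR coefficients for H1(z)
    history = [0.0] * num_taps   # recent MIC 2 samples, newest first
    cleaned = []
    for d, x, speech in zip(mic1, mic2, vad):
        history = [x] + history[:-1]
        noise_estimate = sum(w * u for w, u in zip(h1, history))
        error = d - noise_estimate   # MIC 1 minus the estimated noise
        cleaned.append(error)
        if not speech:               # update H1 only when no voicing is flagged
            h1 = [w + mu * error * u for w, u in zip(h1, history)]
    return h1, cleaned


def simulate_noise_only(num_samples=2000, true_h1=(0.5, -0.3), seed=0):
    """Synthetic test data: white noise at MIC 2, filtered by true_h1 at MIC 1."""
    rng = random.Random(seed)
    mic2 = [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]
    mic1 = []
    for n in range(num_samples):
        previous = mic2[n - 1] if n > 0 else 0.0
        mic1.append(true_h1[0] * mic2[n] + true_h1[1] * previous)
    return mic1, mic2


mic1, mic2 = simulate_noise_only()
h1_est, cleaned = vad_gated_lms(mic1, mic2, vad=[0] * len(mic1))
```

With the VAD held at zero (noise only), the estimate in `h1_est` approaches the synthetic path and the residual in `cleaned` decays toward zero. Passing a VAD of all ones freezes the coefficients at their initial values, which is the behavior that protects speech from being modeled as noise.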
`
The Pathfinder system can be implemented in a variety of ways, but common to
all of the embodiments is reliance on an accurate and reliable VAD device and/or
method. The VAD device/method should be accurate because the Pathfinder system
updates its filter coefficients when there is no speech or when the SNR during speech is
low. If sufficient speech energy is present during coefficient update, subsequent speech
with similar spectral characteristics can be suppressed, an undesirable occurrence. The
VAD device/method should be robust to support high accuracy under a variety of
environmental conditions. Obviously, there are likely to be some conditions under
which no VAD device/method will operate satisfactorily, but under normal
circumstances the VAD device/method should work to provide maximum noise
suppression with few adverse effects on the speech signal of interest.
`
When using VAD devices/methods with a noise suppression system, the VAD
signal is processed independently of the noise suppression system, so that the receipt
and processing of VAD information is independent from the processing associated with
the noise suppression, but the embodiments are not so limited. This independence is
attained physically (i.e., different hardware for use in receiving and processing signals
relating to the VAD and the noise suppression), through processing (i.e., using the same
hardware to receive signals into the noise suppression system while using independent
techniques (software, algorithms, routines) to process the received signals), and through
a combination of different hardware and different software, as described below.

Figure 1A is a block diagram of a VAD system 102A including hardware for
use in receiving and processing signals relating to VAD, under an embodiment. The
VAD system 102A includes a VAD device 130 coupled to provide data to a
corresponding VAD algorithm 140. Note that noise suppression systems of alternative
embodiments can integrate some or all functions of the VAD algorithm with the noise
suppression processing in any manner obvious to those skilled in the art.

Figure 1B is a block diagram of a VAD system 102B using hardware of the
associated noise suppression system 101 for use in receiving VAD information 164,
under an embodiment. The VAD system 102B includes a VAD algorithm 150 that
receives data 164 from MIC 1 and MIC 2, or other components, of the corresponding
signal processing system 100. Alternative embodiments of the noise suppression
system can integrate some or all functions of the VAD algorithm with the noise
suppression processing in any manner obvious to those skilled in the art.
`
Vibration/Movement-based VAD Devices/Methods
`
The vibration/movement-based VAD devices include the physical hardware
devices for use in receiving and processing signals relating to the VAD and the noise
suppression. As a speaker or user produces speech, the resulting vibrations propagate
through the tissue of the speaker and, therefore, can be detected on and beneath the skin
using various methods. These vibrations are an excellent source of VAD information,
as they are strongly associated with both voiced and unvoiced speech (although the
unvoiced speech vibrations are much weaker and more difficult to detect) and generally
are only slightly affected by environmental acoustic noise (some devices/methods, for
example the electromagnetic vibrometers described below, are not affected by
environmental acoustic noise). These tissue vibrations or movements are detected
using a number of VAD devices including, for example, accelerometer-based devices,
skin surface microphone (SSM) devices, electromagnetic (EM) vibrometer devices
including both radio frequency (RF) vibrometers and laser vibrometers, direct glottal
motion measurement devices, and video detection devices.
`
Accelerometer-based VAD Devices/Methods
`
Accelerometers can detect skin vibrations associated with speech. As such, and
with reference to Figure 1 and Figure 1A, a VAD system 102A of an embodiment
includes an accelerometer-based device 130 providing data of the skin vibrations to an
associated algorithm 140. The algorithm of an embodiment uses energy calculation
techniques along with a threshold comparison, as described below, but is not so limited.
Note that more complex energy-based methods are available to those skilled in the art.

Figure 3 is a flow diagram 300 of a method for determining voiced and
unvoiced speech using an accelerometer-based VAD, under an embodiment.
Generally, the energy is calculated by defining a standard window size over which the
calculation is to take place and summing the square of the amplitude over time as

$$\mathrm{Energy} = \sum_{i} x_i^2 ,$$

where i is the digital sample subscript and ranges from the beginning of the window to
the end of the window.
`
Referring to Figure 3, operation begins upon receiving accelerometer data, at
block 302. The processing associated with the VAD includes filtering the data from the
accelerometer to preclude aliasing, and digitizing the filtered data for processing, at
block 304. The digitized data is segmented into windows 20 milliseconds (msec) in
length, and the data is stepped 8 msec at a time, at block 306. The processing further
includes filtering the windowed data, at block 308, to remove spectral information that
is corrupted by noise or is otherwise unwanted. The energy in each window is
calculated by summing the squares of the amplitudes as described above, at block 310.
The calculated energy values can be normalized by dividing the energy values by the
window length; however, this involves an extra calculation and is not needed as long as
the window length is not varied.
`
The calculated, or normalized, energy values are compared to a threshold, at
block 312. The speech corresponding to the accelerometer data is designated as voiced
speech when the energy of the accelerometer data is at or above a threshold value, at
block 314. Likewise, the speech corresponding to the accelerometer data is designated
as unvoiced speech when the energy of the accelerometer data is below the threshold
value, at block 316. Noise suppression systems of alternative embodiments can use
multiple threshold values to indicate the relative strength or confidence of the voicing
signal, but are not so limited. Multiple subbands may also be processed for increased
accuracy.
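
The windowing, energy, and threshold steps of Figure 3 (blocks 306, 310, and 312 through 316) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the data is taken to be already anti-alias filtered and digitized, the per-window filtering of block 308 is omitted, and the function name `energy_vad`, the 8 kHz sample rate, the synthetic test signal, and the threshold of 20.0 are hypothetical values chosen for the example rather than values from the source.

```python
def energy_vad(samples, sample_rate, threshold, window_ms=20, step_ms=8):
    """Return one flag per window: 1 if energy >= threshold, else 0.

    Windows are window_ms long and advance step_ms at a time, matching the
    20 msec / 8 msec figures in the text. Energy is the unnormalized sum of
    squared amplitudes over the window.
    """
    window = int(sample_rate * window_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    flags = []
    for start in range(0, len(samples) - window + 1, step):
        energy = sum(x * x for x in samples[start:start + window])
        flags.append(1 if energy >= threshold else 0)  # at or above: voiced
    return flags


# Synthetic accelerometer data: 100 msec of silence, then 100 msec of
# constant amplitude 0.5, sampled at an assumed 8 kHz.
sample_rate = 8000
samples = [0.0] * 800 + [0.5] * 800
flags = energy_vad(samples, sample_rate, threshold=20.0)
```

Because the window length is fixed, the unnormalized energy can be compared directly to a fixed threshold, which mirrors the remark in the text that normalization is unnecessary unless the window length varies. A multi-threshold variant would replace the single comparison with a lookup into an ordered list of thresholds.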
`
Figure 4 shows plots including a noisy audio signal (live recording) 402 along
with a corresponding accelerometer-based VAD signal 404, the corresponding
accelerometer output signal 412, and the denoised audio signal 422 following
processing by the Pathfinder system using the VAD signal 404, under an embodiment.
In this example, the accelerometer data has been bandpass filtered between 500 and
2500 Hz to remove unwanted acoustic noise that can couple to the accelerometer below
500 Hz. The audio signal 402 was recorded using an Aliph microphone set and
standard accelerometer in a babble noise environment inside a chamber measuring six
(6) feet on a side and having a ceiling height of eight (8) feet. The Pathfinder system is
implemented in real-time, with a delay of approx