US008019091B2

(12) United States Patent
Burnett et al.

(10) Patent No.: US 8,019,091 B2
(45) Date of Patent: *Sep. 13, 2011
(54) VOICE ACTIVITY DETECTOR (VAD)-BASED MULTIPLE-MICROPHONE ACOUSTIC NOISE SUPPRESSION

(75) Inventors: Gregory C. Burnett, Dodge Center, MN (US); Eric F. Breitfeller, Dublin, CA (US)

(73) Assignee: Aliphcom, Inc., San Francisco, CA (US)
(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 713 days.

This patent is subject to a terminal disclaimer.
(21) Appl. No.: 10/667,207

(22) Filed: Sep. 18, 2003

(65) Prior Publication Data

US 2004/0133421 A1    Jul. 8, 2004

Related U.S. Application Data

(63) Continuation-in-part of application No. 09/905,361, filed on Jul. 12, 2001, now abandoned.

(60) Provisional application No. 60/219,297, filed on Jul. 19, 2000.
(51) Int. Cl.
    H03B 29/00    (2006.01)
(52) U.S. Cl. ................. 381/71.8; 704/215
(58) Field of Classification Search ............ 381/70, 381/94.1-94.7, 71.8, 91-92, 122, 71.1; 704/200, 704/231, 233, 246, 214-215
    See application file for complete search history.
(56) References Cited

U.S. PATENT DOCUMENTS

3,789,166 A *   1/1974  Sebesta
4,006,318 A *   2/1977  Sebesta et al.
4,591,668 A *   5/1986  Iwata
4,901,354 A *   2/1990  Gollmar et al.
5,097,515 A *   3/1992  Baba
5,212,764 A     5/1993  Ariyoshi
5,400,409 A     3/1995  Linhard
5,406,622 A *   4/1995  Silverberg et al. .......... 381/94.7
5,414,776 A     5/1995  Sims, Jr.
5,463,694 A *  10/1995  Bradley et al. ............... 381/92
(Continued)

FOREIGN PATENT DOCUMENTS

EP  0 637 187 A *  2/1995
(Continued)
OTHER PUBLICATIONS

Zhao Li et al.: "Robust Speech Coding Using Microphone Arrays," Signals, Systems and Computers, 1997, Conference Record of the 31st Asilomar Conference, Nov. 2-5, 1997, IEEE Comput. Soc., USA.

(Continued)

Primary Examiner - Davetta Goins
Assistant Examiner - Lun-See Lao
(74) Attorney, Agent, or Firm - Gregory & Sawrie LLP
(57) ABSTRACT

Acoustic noise suppression is provided in multiple-microphone systems using Voice Activity Detectors (VAD). A host system receives acoustic signals via multiple microphones. The system also receives information on the vibration of human tissue associated with human voicing activity via the VAD. In response, the system generates a transfer function representative of the received acoustic signals upon determining that voicing information is absent from the received acoustic signals during at least one specified period of time. The system removes noise from the received acoustic signals using the transfer function, thereby producing a denoised acoustic data stream.
20 Claims, 10 Drawing Sheets

[Front-page drawing: noise removal system 200, in which signal source 100 s(n) and noise source 101 n(n) reach Mic 1 (102, m1(n)) and Mic 2 (103, m2(n)) through transfer functions H1(z) and H2(z); VAD 204 supplies voicing information to noise removal element 205, which outputs cleaned speech.]

Page 1 of 21

GOOGLE EXHIBIT 1001
US 8,019,091 B2
Page 2

U.S. PATENT DOCUMENTS

5,473,701 A *  12/1995  Cezanne et al. ............... 381/92
5,473,702 A *  12/1995  Yoshida et al. ............. 381/94.7
5,515,865 A *   5/1996  Scanlon et al.
5,517,435 A *   5/1996  Sugiyama .................... 708/322
5,539,859 A     7/1996  Robbe et al.
5,590,241 A *  12/1996  Park et al. ................. 704/227
5,633,935 A *   5/1997  Kanamori et al. .............. 381/26
5,649,055 A     7/1997  Gupta et al.
5,684,460 A *  11/1997  Scanlon et al.
5,729,694 A *   3/1998  Holzrichter et al. ........... 705/17
5,754,665 A *   5/1998  Hosoi ...................... 381/94.1
5,835,608 A    11/1998  Warnaka et al.
5,853,005 A *  12/1998  Scanlon
5,917,921 A     6/1999  Sasaki et al.
5,966,090 A    10/1999  McEwan
5,986,600 A    11/1999  McEwan
6,006,175 A *  12/1999  Holzrichter ................. 704/208
6,009,396 A    12/1999  Nagata
6,069,963 A *   5/2000  Martin et al.
6,191,724 B1    2/2001  McEwan
6,266,422 B1    7/2001  Ikeda
6,430,295 B1    8/2002  Handel et al.
6,707,910 B1 *  3/2004  Valve et al. ............ 379/388.06
2002/0039425 A1 *   4/2002  Burnett et al.
2003/0228023 A1 *  12/2003  Burnett et al. .......... 381/92

FOREIGN PATENT DOCUMENTS

EP  0 795 851 A2 *   9/1997
EP  0 984 660 A2 *   3/2000
JP  2000 312 395  *  11/2000
JP  2001 189 987  *   7/2001
WO  WO 02/07151   *   1/2002
OTHER PUBLICATIONS

L.C. Ng et al.: "Denoising of Human Speech Using Combined Acoustic and EM Sensor Signal Processing," 2000 IEEE Int'l Conf. on Acoustics, Speech and Signal Processing, Proceedings (Cat. No. 00CH37100), Istanbul, Turkey, Jun. 5-9, 2000, XP002186255, ISBN 0-7803-6293-4.

S. Affes et al.: "A Signal Subspace Tracking Algorithm for Microphone Array Processing of Speech," IEEE Transactions on Speech and Audio Processing, N.Y., USA, vol. 5, no. 5, Sep. 1, 1997, XP000774303, ISSN 1063-6676.

Gregory C. Burnett: "The Physiological Basis of Glottal Electromagnetic Micropower Sensors (GEMS) and Their Use in Defining an Excitation Function for the Human Vocal Tract," Dissertation, University of California at Davis, Jan. 1999, USA.

L.C. Ng et al.: "Speaker Verification Using Combined Acoustic and EM Sensor Signal Processing," ICASSP-2001, Salt Lake City, USA.

A. Hussain: "Intelligibility Assessment of a Multi-Band Speech Enhancement Scheme," Proceedings IEEE Int'l Conf. on Acoustics, Speech & Signal Processing (ICASSP-2000), Istanbul, Turkey, Jun. 2000.

* cited by examiner
[Sheet 1 of 10 — FIG. 1: block diagram of denoising system 1000, in which microphones 10 and voicing sensors 20 feed at least one processor 30 containing denoising subsystem 40. FIG. 2: block diagram of noise removal algorithm 200, in which signal source 100 s(n) and noise source 101 n(n) reach Mic 1 (102, m1(n)) and Mic 2 (103, m2(n)) through transfer functions H1(z) and H2(z); VAD 204 supplies voicing information to noise removal element 205, which outputs cleaned speech.]
[Sheet 2 of 10 — FIG. 3: block diagram of front-end components of the noise removal algorithm generalized to n distinct noise sources, each noise source reaching Mic 1 and Mic 2 through its own transfer functions.]
[Sheet 3 of 10 — FIG. 4: block diagram 400 of front-end components of the noise removal algorithm in the general case of n distinct noise sources and signal reflections, with signal S(z) and reflected paths entering both microphones.]
[Sheet 4 of 10 — FIG. 5: flow diagram 500 of the denoising method: Start → Receive acoustic signals (502) → Receive voice activity (VAD) information (504) → Determine absence of voicing and generate first transfer function (506) → Determine presence of voicing and generate second transfer function (508) → Produce denoised acoustic data stream (510) → End.]
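The FIG. 5 control flow can be sketched in code. This is a minimal illustration, not the patent's implementation: `denoise_stream` and `update` are hypothetical names, per-frame FFTs without overlap-add are a simplifying assumption, and `update` stands in for whichever system-identification routine refines the per-bin transfer-function estimates (the detailed description allows any such algorithm).

```python
import numpy as np

def denoise_stream(m1_frames, m2_frames, vad_flags, update):
    """Sketch of FIG. 5: per frame, update the noise path H1 when the
    VAD reports no voicing (block 506), update the signal path H2 when
    voicing is present (block 508), then denoise (block 510)."""
    bins = len(np.fft.rfft(m1_frames[0]))
    H1 = np.zeros(bins, dtype=complex)  # noise-path estimate, MIC2 -> MIC1
    H2 = np.zeros(bins, dtype=complex)  # signal-path estimate, MIC1 -> MIC2
    cleaned = []
    for m1, m2, voiced in zip(m1_frames, m2_frames, vad_flags):
        M1, M2 = np.fft.rfft(m1), np.fft.rfft(m2)
        if voiced:
            H2 = update(H2, M2, M1)   # H2 ~ M2s/M1s (signal-only frames)
        else:
            H1 = update(H1, M1, M2)   # H1 ~ M1n/M2n (noise-only frames)
        S = (M1 - M2 * H1) / (1.0 - H2 * H1)  # combine both estimates
        cleaned.append(np.fft.irfft(S, n=len(m1)))
    return np.concatenate(cleaned)
```

With a perfect VAD and accurate estimates, the final division recovers the signal; a practical system would add windowing, overlap-add, and subband smoothing.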
`
`
`
[Sheet 5 of 10 — FIG. 6: "Noise Removal Results for American English Female Saying 406-5562"; upper panel: dirty audio (604); lower panel: cleaned audio (602); amplitude versus time.]
[Sheet 6 of 10 — FIG. 7A: VAD system 702A, in which a dedicated VAD device 730 and VAD algorithm 740 supply VAD signal 704 to noise suppression system 701. FIG. 7B: VAD system 702B, in which VAD algorithm 750 derives VAD signal 704 using hardware 764 of the coupled signal processing system 700 for use by noise suppression system 701.]
[Sheet 7 of 10 — FIG. 8: flow diagram 800 of the accelerometer-based VAD method: Receive accelerometer data (802) → Filter and digitize accelerometer data (804) → Segment and step digitized data (806) → Remove spectral information corrupted by noise (808) → Calculate energy in each window (810) → Compare energy to threshold values (812); energy above threshold indicates voiced speech (814), energy below threshold indicates unvoiced speech (816).]
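The FIG. 8 energy-threshold procedure can be sketched as follows. This is an illustrative sketch only: the function name, the 20 ms window and 10 ms step at 8 kHz, the threshold value, and the optional `keep_bins` spectral cleanup are all assumptions, not values from the patent.

```python
import numpy as np

def accelerometer_vad(accel, win=160, step=80, voiced_thresh=0.01,
                      keep_bins=None):
    """Sketch of FIG. 8: window the (already filtered and digitized)
    accelerometer data, optionally drop spectral bins assumed corrupted
    by noise, and compare per-window energy to a threshold; 1 marks a
    window judged voiced, 0 unvoiced."""
    decisions = []
    for start in range(0, len(accel) - win + 1, step):
        frame = accel[start:start + win]
        spectrum = np.fft.rfft(frame)
        if keep_bins is not None:
            spectrum = spectrum[:keep_bins]  # keep only uncorrupted bins
        energy = np.sum(np.abs(spectrum) ** 2) / win
        decisions.append(1 if energy > voiced_thresh else 0)
    return np.array(decisions)
```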
`
`
`
[Sheet 8 of 10 — FIG. 9: plots of a noisy audio signal, the corresponding accelerometer-based VAD and accelerometer traces (912, 922), and the denoised audio signal, versus time (samples at 8 kHz).]
[Sheet 9 of 10 — FIG. 10: plots of a noisy audio signal, the corresponding SSM-based VAD and SSM traces (1012, 1022), and the denoised audio signal, versus time (samples at 8 kHz).]
[Sheet 10 of 10 — FIG. 11: plots of a noisy audio signal, the corresponding GEMS-based VAD trace (1112), and the denoised audio signal, versus time (samples at 8 kHz).]
VOICE ACTIVITY DETECTOR (VAD)-BASED MULTIPLE-MICROPHONE ACOUSTIC NOISE SUPPRESSION

RELATED APPLICATIONS

This patent application is a continuation-in-part of U.S. patent application Ser. No. 09/905,361, filed Jul. 12, 2001, now abandoned, which claims priority from U.S. patent application Ser. No. 60/219,297, filed Jul. 19, 2000. This patent application also claims priority from U.S. patent application Ser. No. 10/383,162, filed Mar. 5, 2003.

FIELD OF THE INVENTION

The disclosed embodiments relate to systems and methods for detecting and processing a desired signal in the presence of acoustic noise.
BACKGROUND

Many noise suppression algorithms and techniques have been developed over the years. Most of the noise suppression systems in use today for speech communication systems are based on a single-microphone spectral subtraction technique first developed in the 1970s and described, for example, by S. F. Boll in "Suppression of Acoustic Noise in Speech using Spectral Subtraction," IEEE Trans. on ASSP, pp. 113-120, 1979. These techniques have been refined over the years, but the basic principles of operation have remained the same. See, for example, U.S. Pat. No. 5,687,243 of McLaughlin, et al., and U.S. Pat. No. 4,811,404 of Vilmur, et al. Generally, these techniques make use of a microphone-based Voice Activity Detector (VAD) to determine the background noise characteristics, where "voice" is generally understood to include human voiced speech, unvoiced speech, or a combination of voiced and unvoiced speech.

The VAD has also been used in digital cellular systems. As an example of such a use, see U.S. Pat. No. 6,453,291 of Ashley, where a VAD configuration appropriate to the front-end of a digital cellular system is described. Further, some Code Division Multiple Access (CDMA) systems utilize a VAD to minimize the effective radio spectrum used, thereby allowing for more system capacity. Also, Global System for Mobile Communication (GSM) systems can include a VAD to reduce co-channel interference and to reduce battery consumption on the client or subscriber device.

These typical microphone-based VAD systems are significantly limited in capability as a result of the addition of environmental acoustic noise to the desired speech signal received by the single microphone, wherein the analysis is performed using typical signal processing techniques. In particular, limitations in performance of these microphone-based VAD systems are noted when processing signals having a low signal-to-noise ratio (SNR), and in settings where the background noise varies quickly. Thus, similar limitations are found in noise suppression systems using these microphone-based VADs.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a block diagram of a denoising system, under an embodiment.

FIG. 2 is a block diagram including components of a noise removal algorithm, under the denoising system of an embodiment, assuming a single noise source and direct paths to the microphones.

FIG. 3 is a block diagram including front-end components of a noise removal algorithm of an embodiment generalized to n distinct noise sources (these noise sources may be reflections or echoes of one another).

FIG. 4 is a block diagram including front-end components of a noise removal algorithm of an embodiment in a general case where there are n distinct noise sources and signal reflections.

FIG. 5 is a flow diagram of a denoising method, under an embodiment.

FIG. 6 shows results of a noise suppression algorithm of an embodiment for an American English female speaker in the presence of airport terminal noise that includes many other human speakers and public announcements.

FIG. 7A is a block diagram of a Voice Activity Detector (VAD) system including hardware for use in receiving and processing signals relating to VAD, under an embodiment.

FIG. 7B is a block diagram of a VAD system using hardware of a coupled noise suppression system for use in receiving VAD information, under an alternative embodiment.

FIG. 8 is a flow diagram of a method for determining voiced and unvoiced speech using an accelerometer-based VAD, under an embodiment.

FIG. 9 shows plots including a noisy audio signal (live recording) along with a corresponding accelerometer-based VAD signal, the corresponding accelerometer output signal, and the denoised audio signal following processing by the noise suppression system using the VAD signal, under an embodiment.

FIG. 10 shows plots including a noisy audio signal (live recording) along with a corresponding SSM-based VAD signal, the corresponding SSM output signal, and the denoised audio signal following processing by the noise suppression system using the VAD signal, under an embodiment.

FIG. 11 shows plots including a noisy audio signal (live recording) along with a corresponding GEMS-based VAD signal, the corresponding GEMS output signal, and the denoised audio signal following processing by the noise suppression system using the VAD signal, under an embodiment.

DETAILED DESCRIPTION

The following description provides specific details for a thorough understanding of, and enabling description for, embodiments of the noise suppression system. However, one skilled in the art will understand that the invention may be practiced without these details. In other instances, well-known structures and functions have not been shown or described in detail to avoid unnecessarily obscuring the description of the embodiments of the noise suppression system. In the following description, "signal" represents any acoustic signal (such as human speech) that is desired, and "noise" is any acoustic signal (which may include human speech) that is not desired. An example would be a person talking on a cellular telephone with a radio in the background. The person's speech is desired and the acoustic energy from the radio is not desired. In addition, "user" describes a person who is using the device and whose speech is desired to be captured by the system.

Also, "acoustic" is generally defined as acoustic waves propagating in air. Propagation of acoustic waves in media other than air will be noted as such. References to "speech" or "voice" generally refer to human speech including voiced speech, unvoiced speech, and/or a combination of voiced and unvoiced speech. Unvoiced speech or voiced speech is distinguished where necessary. The term "noise suppression"
generally describes any method by which noise is reduced or eliminated in an electronic signal.

Moreover, the term "VAD" is generally defined as a vector or array signal, data, or information that in some manner represents the occurrence of speech in the digital or analog domain. A common representation of VAD information is a one-bit digital signal sampled at the same rate as the corresponding acoustic signals, with a zero value representing that no speech has occurred during the corresponding time sample, and a unity value indicating that speech has occurred during the corresponding time sample. While the embodiments described herein are generally described in the digital domain, the descriptions are also valid for the analog domain.

FIG. 1 is a block diagram of a denoising system 1000 of an embodiment that uses knowledge of when speech is occurring derived from physiological information on voicing activity. The system 1000 includes microphones 10 and sensors 20 that provide signals to at least one processor 30. The processor includes a denoising subsystem or algorithm 40.

FIG. 2 is a block diagram including components of a noise removal algorithm 200 of an embodiment. A single noise source and a direct path to the microphones are assumed. An operational description of the noise removal algorithm 200 of an embodiment is provided using a single signal source 100 and a single noise source 101, but is not so limited. This algorithm 200 uses two microphones: a "signal" microphone 1 ("MIC 1") and a "noise" microphone 2 ("MIC 2"), but is not so limited. The signal microphone MIC 1 is assumed to capture mostly signal with some noise, while MIC 2 captures mostly noise with some signal. The data from the signal source 100 to MIC 1 is denoted by s(n), where s(n) is a discrete sample of the analog signal from the source 100. The data from the signal source 100 to MIC 2 is denoted by s2(n). The data from the noise source 101 to MIC 2 is denoted by n(n). The data from the noise source 101 to MIC 1 is denoted by n2(n). Similarly, the data from MIC 1 to noise removal element 205 is denoted by m1(n), and the data from MIC 2 to noise removal element 205 is denoted by m2(n).

The noise removal element 205 also receives a signal from a voice activity detection (VAD) element 204. The VAD 204 uses physiological information to determine when a speaker is speaking. In various embodiments, the VAD can include at least one of an accelerometer, a skin surface microphone in physical contact with skin of a user, a human tissue vibration detector, a radio frequency (RF) vibration and/or motion detector/device, an electroglottograph, an ultrasound device, an acoustic microphone that is being used to detect acoustic frequency signals that correspond to the user's speech directly from the skin of the user (anywhere on the body), an airflow detector, and a laser vibration detector.

The transfer functions from the signal source 100 to MIC 1 and from the noise source 101 to MIC 2 are assumed to be unity. The transfer function from the signal source 100 to MIC 2 is denoted by H2(z), and the transfer function from the noise source 101 to MIC 1 is denoted by H1(z). The assumption of unity transfer functions does not inhibit the generality of this algorithm, as the actual relations between the signal, noise, and microphones are simply ratios and the ratios are redefined in this manner for simplicity.

In conventional two-microphone noise removal systems, the information from MIC 2 is used to attempt to remove noise from MIC 1. However, an (generally unspoken) assumption is that the VAD element 204 is never perfect, and thus the denoising must be performed cautiously, so as not to remove too much of the signal along with the noise. However, if the VAD 204 is assumed to be perfect such that it is equal to zero when there is no speech being produced by the user, and equal to one when speech is produced, a substantial improvement in the noise removal can be made.

In analyzing the single noise source 101 and the direct path to the microphones, with reference to FIG. 2, the total acoustic information coming into MIC 1 is denoted by m1(n). The total acoustic information coming into MIC 2 is similarly labeled m2(n). In the z (digital frequency) domain, these are represented as M1(z) and M2(z). Then,

    M1(z) = S(z) + N2(z)
    M2(z) = N(z) + S2(z)

with

    N2(z) = N(z)H1(z)
    S2(z) = S(z)H2(z),

so that

    M1(z) = S(z) + N(z)H1(z)
    M2(z) = N(z) + S(z)H2(z).    Eq. 1

This is the general case for all two microphone systems. In a practical system there is always going to be some leakage of noise into MIC 1, and some leakage of signal into MIC 2. Equation 1 has four unknowns and only two known relationships and therefore cannot be solved explicitly.

However, there is another way to solve for some of the unknowns in Equation 1. The analysis starts with an examination of the case where the signal is not being generated, that is, where a signal from the VAD element 204 equals zero and speech is not being produced. In this case, s(n) = S(z) = 0, and Equation 1 reduces to

    M1n(z) = N(z)H1(z)
    M2n(z) = N(z),

where the n subscript on the M variables indicates that only noise is being received. This leads to

    H1(z) = M1n(z) / M2n(z).    Eq. 2

The function H1(z) can be calculated using any of the available system identification algorithms and the microphone outputs when the system is certain that only noise is being received. The calculation can be done adaptively, so that the system can react to changes in the noise.

A solution is now available for one of the unknowns in Equation 1. Another unknown, H2(z), can be determined by using the instances where the VAD equals one and speech is being produced. When this is occurring, but the recent (perhaps less than 1 second) history of the microphones indicates low levels of noise, it can be assumed that n(n) = N(z) = 0. Then Equation 1 reduces to

    M1s(z) = S(z)
    M2s(z) = S(z)H2(z),

which in turn leads to
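The Equation 2 relationship H1(z) = M1n(z)/M2n(z) can be estimated from noise-only intervals. The sketch below uses a Welch-style cross-spectral average, which is one possible system-identification approach, not the patent's specific method; the function name, frame size, and hop are illustrative choices.

```python
import numpy as np

def estimate_h1(mic1_noise, mic2_noise, nfft=256, hop=128, eps=1e-12):
    """Estimate H1(z) = M1n(z)/M2n(z) from noise-only segments (Eq. 2)
    via the ratio of averaged cross- and auto-spectra, which is more
    robust than a single-frame division."""
    num = np.zeros(nfft // 2 + 1, dtype=complex)
    den = np.zeros(nfft // 2 + 1)
    win = np.hanning(nfft)
    for start in range(0, len(mic1_noise) - nfft + 1, hop):
        M1 = np.fft.rfft(win * mic1_noise[start:start + nfft])
        M2 = np.fft.rfft(win * mic2_noise[start:start + nfft])
        num += M1 * np.conj(M2)   # cross-spectrum M1 * conj(M2)
        den += np.abs(M2) ** 2    # auto-spectrum of MIC 2
    return num / (den + eps)
```

Averaging over many noise-only frames lets the estimate adapt as the noise changes, matching the adaptive calculation described above.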
`
`
`
    H2(z) = M2s(z) / M1s(z),

which is the inverse of the H1(z) calculation. However, it is noted that different inputs are being used (now only the signal is occurring whereas before only the noise was occurring). While calculating H2(z), the values calculated for H1(z) are held constant and vice versa. Thus, it is assumed that while one of H1(z) and H2(z) are being calculated, the one not being calculated does not change substantially.

After calculating H1(z) and H2(z), they are used to remove the noise from the signal. If Equation 1 is rewritten as

    S(z) = M1(z) - N(z)H1(z)
    N(z) = M2(z) - S(z)H2(z)
    S(z) = M1(z) - [M2(z) - S(z)H2(z)]H1(z)
    S(z)[1 - H2(z)H1(z)] = M1(z) - M2(z)H1(z),

then N(z) may be substituted as shown to solve for S(z) as

    S(z) = [M1(z) - M2(z)H1(z)] / [1 - H2(z)H1(z)].    Eq. 3

If the transfer functions H1(z) and H2(z) can be described with sufficient accuracy, then the noise can be completely removed and the original signal recovered. This remains true without respect to the amplitude or spectral characteristics of the noise. The only assumptions made include use of a perfect VAD, sufficiently accurate H1(z) and H2(z), and that when one of H1(z) and H2(z) are being calculated the other does not change substantially. In practice these assumptions have proven reasonable.

The noise removal algorithm described herein is easily generalized to include any number of noise sources. FIG. 3 is a block diagram including front-end components 300 of a noise removal algorithm of an embodiment, generalized to n distinct noise sources. These distinct noise sources may be reflections or echoes of one another, but are not so limited. There are several noise sources shown, each with a transfer function, or path, to each microphone. The previously named path H2 has been relabeled as H0, so that labeling noise source 2's path to MIC 1 is more convenient. The outputs of each microphone, when transformed to the z domain, are:

    M1(z) = S(z) + N1(z)H1(z) + N2(z)H2(z) + ... + Nn(z)Hn(z)
    M2(z) = S(z)H0(z) + N1(z)G1(z) + N2(z)G2(z) + ... + Nn(z)Gn(z).    Eq. 4

When there is no signal (VAD = 0), then (suppressing z for clarity)

    M1n = N1H1 + N2H2 + ... + NnHn
    M2n = N1G1 + N2G2 + ... + NnGn.    Eq. 5

A new transfer function can now be defined as

    H̃1 = M1n / M2n = (N1H1 + N2H2 + ... + NnHn) / (N1G1 + N2G2 + ... + NnGn),    Eq. 6

where H̃1 is analogous to H1(z) above. Thus H̃1 depends only on the noise sources and their respective transfer functions and can be calculated any time there is no signal being transmitted. Once again, the "n" subscripts on the microphone inputs denote only that noise is being detected, while an "s" subscript denotes that only signal is being received by the microphones.

Examining Equation 4 while assuming an absence of noise produces

    M1s = S
    M2s = SH0.

Thus, H0 can be solved for as before, using any available transfer function calculating algorithm. Mathematically, then,

    H0 = M2s / M1s.

Rewriting Equation 4, using H̃1 defined in Equation 6, provides

    H̃1 = (M1 - S) / (M2 - SH0).    Eq. 7

Solving for S yields

    S = (M1 - M2H̃1) / (1 - H0H̃1),    Eq. 8

which is the same as Equation 3, with H0 taking the place of H2, and H̃1 taking the place of H1. Thus the noise removal algorithm still is mathematically valid for any number of noise sources, including multiple echoes of noise sources. Again, if H0 and H̃1 can be estimated to a high enough accuracy, and the above assumption of only one path from the signal to the microphones holds, the noise may be removed completely.

The most general case involves multiple noise sources and multiple signal sources. FIG. 4 is a block diagram including front-end components 400 of a noise removal algorithm of an embodiment in the most general case where there are n distinct noise sources and signal reflections. Here, signal reflections enter both microphones MIC 1 and MIC 2. This is the most general case, as reflections of the noise source into the microphones MIC 1 and MIC 2 can be modeled accurately as simple additional noise sources. For clarity, the direct path from the signal to MIC 2 is changed from H0(z) to H00(z), and the reflected paths to MIC 1 and MIC 2 are denoted by H01(z) and H02(z), respectively.

The input into the microphones now becomes

    M1(z) = S(z) + S(z)H01(z) + N1(z)H1(z) + N2(z)H2(z) + ... + Nn(z)Hn(z)
    M2(z) = S(z)H00(z) + S(z)H02(z) + N1(z)G1(z) + N2(z)G2(z) + ... + Nn(z)Gn(z).    Eq. 9

When the VAD = 0, the inputs become (suppressing z again)

    M1n = N1H1 + N2H2 + ... + NnHn
    M2n = N1G1 + N2G2 + ... + NnGn,

which is the same as Equation 5. Thus, the calculation of H̃1 in Equation 6 is unchanged, as expected. In examining the situation where there is no noise, Equation 9 reduces to
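The Equation 3 combination of the two estimated transfer functions can be sketched per analysis frame. This is an illustrative sketch, not the patent's implementation: `denoise_frame` is a hypothetical name, and framing/overlap-add details are omitted.

```python
import numpy as np

def denoise_frame(m1_frame, m2_frame, H1, H2):
    """Apply Eq. 3 bin-by-bin to one analysis frame:
    S(z) = [M1(z) - M2(z)H1(z)] / [1 - H2(z)H1(z)],
    where H1 and H2 are per-bin estimates obtained during noise-only
    and signal-only intervals, respectively."""
    M1 = np.fft.rfft(m1_frame)
    M2 = np.fft.rfft(m2_frame)
    S = (M1 - M2 * H1) / (1.0 - H2 * H1)
    return np.fft.irfft(S, n=len(m1_frame))
```

When H1 and H2 match the true mixing exactly, the division recovers the clean signal regardless of the noise's amplitude or spectrum, which is the claim made for Equation 3 above.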
`
`
`
    M1s = S + SH01
    M2s = SH00 + SH02.    Eq. 10

This leads to the definition of H̃2 as

    H̃2 = M2s / M1s = (H00 + H02) / (1 + H01).    Eq. 11

Rewriting Equation 9 again using the definition for H̃1 (as in Equation 7) provides

    H̃1 = [M1 - S(1 + H01)] / [M2 - S(H00 + H02)].

Some algebraic manipulation yields

    S(1 + H01)[1 - H̃1(H00 + H02)/(1 + H01)] = M1 - M2H̃1
    S(1 + H01)[1 - H̃1H̃2] = M1 - M2H̃1,

and finally

    S(1 + H01) = (M1 - M2H̃1) / (1 - H̃1H̃2).    Eq. 12

Equation 12 is the same as Equation 8, with the replacement of H0 by H̃2, and the addition of the (1 + H01) factor on the left side. This extra factor (1 + H01) means that S cannot be solved for directly in this situation, but a solution can be generated for the signal plus the addition of all of its echoes. This is not such a bad situation, as there are many conventional methods for dealing with echo suppression, and even if the echoes are not suppressed, it is unlikely that they will affect the comprehensibility of the speech to any meaningful extent. The more complex calculation of H̃2 is needed to account for the signal echoes in MIC 2, which act as noise sources.

FIG. 5 is a flow diagram 500 of a denoising algorithm, under an embodiment. In operation, the acoustic signals are received, at block 502. Further, physiological information associated with human voicing activity is received, at block 504. A first transfer function representative of the acoustic signal is calculated upon determining that voicing information is absent from the acoustic signal for at least one specified period of time, at block 506. A second transfer function representative of the acoustic signal is calculated upon determining that voicing information is present in the acoustic signal for at least one specified period of time, at block 508. Noise is removed from the acoustic signal using at least one combination of the first transfer function and the second transfer function, producing denoised acoustic data streams, at block 510.

An algorithm for noise removal, or denoising algorithm, is described herein, from the simplest case of a single noise source with a direct path to multiple noise sources with reflections and echoes. The algorithm has been shown herein to be viable under any environmental conditions. The type and amount of noise are inconsequential if a good estimate has been made of H̃1 and H̃2, and if one does not change substantially while the other is calculated. If the user environment is such that echoes are present, they can be compensated for if coming from a noise source. If signal echoes are also present, they will affect the cleaned signal, but the effect should be negligible in most environments.

In operation, the algorithm of an embodiment has shown excellent results in dealing with a variety of noise types, amplitudes, and orientations. However, there are always approximations and adjustments that have to be made when moving from mathematical concepts to engineering applications. One assumption is made in Equation 3, where H2(z) is assumed small and therefore H2(z)H1(z) ≈ 0, so that Equation 3 reduces to

    S(z) ≈ M1(z) - M2(z)H1(z).

This means that only H1(z) has to be calculated, speeding up the process and reducing the number of computations required considerably. With the proper selection of microphones, this approximation is easily realized.

Another approximation involves the filter used in an embodiment. The actual H1(z) will undoubtedly have both poles and zeros, but for stability and simplicity an all-zero Finite Impulse Response (FIR) filter is used. With enough taps the approximation to the actual H1(z) can be very good.

To further increase the performance of the noise suppression system, the spectrum of interest (generally about 125 to 3700 Hz) is divided into subbands. The wider the range of frequencies over which a transfer function must be calculated, the more difficult it is to calculate it accurately. Therefore the acoustic data was divided into 16 subbands, and the denoising algorithm was then applied to each subband in turn. Finally, the 16 denoised data streams were recombined to yield the denoised acoustic data. This works very well, but any combinations of subbands (i.e., 4, 6, 8, 32, equally spaced, perceptually spaced, etc.) can be used and all have been found to work better than a single subband.

The amplitude of the noise was constrained in an embodiment so that the microphones used did not saturate (that is, operate outside a linear response region). It is important that the microphones operate linearly to ensure the best performance. Even with this restriction, very low signal-to-noise ratio (SNR) signals can be denoised (down to -10 dB or less).

The calculation of H1(z) is accomplished every 10 milliseconds using the Least-Mean Squares (LMS) method, a common adaptive transfer function. An explanation may be found in "Adaptive Signal Processing" (
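The adaptive all-zero (FIR) identification of H1(z) described above can be sketched with a normalized LMS update. This is an illustration, not the patent's implementation: the function name, tap count, and step size are arbitrary choices, and the normalized variant of LMS is used here for step-size robustness.

```python
import numpy as np

def lms_h1(mic1, mic2, taps=20, mu=0.1, eps=1e-8):
    """Illustrative (normalized) LMS identification of the MIC 2 -> MIC 1
    noise path H1 during noise-only intervals: adapt an all-zero FIR
    filter h so that mic2 filtered by h approximates mic1."""
    h = np.zeros(taps)
    x_buf = np.zeros(taps)           # most recent mic2 samples, newest first
    for x, d in zip(mic2, mic1):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = x
        y = h @ x_buf                # FIR filter output
        e = d - y                    # prediction error
        h += mu * e * x_buf / (x_buf @ x_buf + eps)  # NLMS weight update
    return h
```

In a full system this update would run on each of the subbands, with the estimate refreshed on the 10 ms schedule described above.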