US007464029B2

(12) United States Patent
Visser et al.
(10) Patent No.: US 7,464,029 B2
(45) Date of Patent: Dec. 9, 2008
(54) ROBUST SEPARATION OF SPEECH SIGNALS IN A NOISY ENVIRONMENT
(75) Inventors: Erik Visser, San Diego, CA (US); Jeremy Toman, San Marcos,
CA (US); Kwokleung Chan, San Diego, CA (US)

(73) Assignee: QUALCOMM Incorporated, San Diego, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended
or adjusted under 35 U.S.C. 154(b) by 246 days.

(21) Appl. No.: 11/187,504

(22) Filed: Jul. 22, 2005
(65) Prior Publication Data
US 2007/0021958 A1    Jan. 25, 2007

(51) Int. Cl.
G10L 19/14 (2006.01)
G10L 11/06 (2006.01)
G10L 21/02 (2006.01)
G10L 15/20 (2006.01)
(52) U.S. Cl. ....................... 704/210; 704/215; 704/228; 704/233
(58) Field of Classification Search ....................... None
See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS
4,649,505 A 3/1987 Zinser, Jr. et al.
4,912,767 A 3/1990 Chang
5,208,786 A 5/1993 Weinstein et al.
5,251,263 A 10/1993 Andrea et al.
(Continued)

FOREIGN PATENT DOCUMENTS
EP 1 006 652 A2 6/2000
WO WO 01/27874 4/2001
WO WO 2006/012578 2/2006
WO WO 2006/028587 3/2006
OTHER PUBLICATIONS

Amari, et al. 1996. A new learning algorithm for blind signal separation.
In D. Touretzky, M. Mozer, and M. Hasselmo (Eds.), Advances in Neural
Information Processing Systems 8 (pp. 757-763). Cambridge: MIT Press.
(Continued)

Primary Examiner-David R. Hudspeth
Assistant Examiner-Brian L. Albertalli
(74) Attorney, Agent, or Firm-Espartaco Diaz Hidalgo; Timothy F. Loomis;
Thomas R. Rouse
(57) ABSTRACT

A method for improving the quality of a speech signal extracted from a
noisy acoustic environment is provided. In one approach, a signal
separation process is associated with a voice activity detector. The voice
activity detector is a two-channel detector, which enables a particularly
robust and accurate detection of voice activity. When speech is detected,
the voice activity detector generates a control signal. The control signal
is used to activate, adjust, or control signal separation processes or
post-processing operations to improve the quality of the resulting speech
signal. In another approach, a signal separation process is provided as a
learning stage and an output stage. The learning stage aggressively
adjusts to current acoustic conditions, and passes coefficients to the
output stage. The output stage adapts more slowly, and generates a
speech-content signal and a noise-dominant signal. When the learning stage
becomes unstable, only the learning stage is reset, allowing the output
stage to continue outputting a high quality speech signal.

44 Claims, 13 Drawing Sheets
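For illustration, the learning-stage/output-stage arrangement of the second approach can be sketched as follows. This is a minimal sketch in which both stages are reduced to coefficient handling; the smoothing rate, the divergence test, and all names are assumptions made for the example, not details taken from the patent.

```python
import numpy as np

class TwoStageSeparator:
    """Minimal sketch: a fast learning stage feeding a slow output stage."""

    def __init__(self, n_taps=32, follow_rate=0.05, diverge_limit=1e3):
        self.learn_w = np.zeros(n_taps)      # aggressively adapted coefficients
        self.out_w = np.zeros(n_taps)        # slowly adapted output coefficients
        self.follow_rate = follow_rate       # how quickly the output stage follows
        self.diverge_limit = diverge_limit   # instability threshold (assumed)

    def accept(self, learned_w):
        """Take newly learned coefficients; on blow-up, reset only the learning stage."""
        self.learn_w = np.asarray(learned_w, dtype=float)
        unstable = (not np.all(np.isfinite(self.learn_w)) or
                    np.max(np.abs(self.learn_w)) > self.diverge_limit)
        if unstable:
            self.learn_w = np.zeros_like(self.learn_w)  # learning stage reset
        else:
            # The output stage drifts slowly toward the learned coefficients,
            # so a reset upstream never interrupts its output.
            self.out_w += self.follow_rate * (self.learn_w - self.out_w)
        return self.out_w
```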
[Representative drawing: FIG. 1, showing microphones 102, a VOICE ACTIVITY
DETECTOR 106, and post processing 107; reproduced as Sheet 1 of 13 below.]
U.S. PATENT DOCUMENTS
5,327,178 A 7/1994 McManigal
5,375,174 A 12/1994 Denenberg
5,383,164 A 1/1995 Sejnowski et al.
5,706,402 A 1/1998 Bell
5,715,321 A 2/1998 Andrea et al.
5,732,143 A 3/1998 Andrea et al.
5,770,841 A 6/1998 Moed et al.
5,999,567 A 12/1999 Torkkola
5,999,956 A 12/1999 Deville
6,002,776 A 12/1999 Bhadkamkar et al.
6,108,415 A 8/2000 Andrea
6,130,949 A * 10/2000 Aoki et al. ................. 381/94.3
6,167,417 A 12/2000 Parra et al.
6,381,570 B2 4/2002 Li et al.
6,424,960 B1 7/2002 Lee et al.
6,526,178 B1 2/2003 Fukuhara
6,549,630 B1 * 4/2003 Bobisuthi .................. 381/94.7
6,606,506 B1 8/2003 Jones
7,099,821 B2 8/2006 Visser et al.
2001/0037195 A1 11/2001 Acero et al.
2002/0110256 A1 8/2002 Watson et al.
2002/0136328 A1 9/2002 Shimizu
2002/0193130 A1 12/2002 Yang et al.
2003/0055735 A1 3/2003 Cameron et al.
2003/0179888 A1 * 9/2003 Burnett et al. ............. 381/71.8
2004/0039464 A1 2/2004 Virolainen et al.
2004/0120540 A1 6/2004 Mullenborn et al.
2004/0136543 A1 7/2004 White et al.
OTHER PUBLICATIONS

Amari, et al. 1997. Stability analysis of learning algorithms for blind
source separation. Neural Networks, 10(8):1345-1351.
Bell, et al. 1995. An information-maximization approach to blind
separation and blind deconvolution. Neural Computation, 7:1129-1159.
Cardoso, J-F. 1992. Fourth-order cumulant structure forcing. Application
to blind array processing. Proc. IEEE SP Workshop on SSAP-92, 136-139.
Comon, P. 1994. Independent component analysis, A new concept? Signal
Processing, 36:287-314.
Griffiths, et al. 1982. An alternative approach to linearly constrained
adaptive beamforming. IEEE Transactions on Antennas and Propagation,
AP-30(1):27-34.
Herault, et al. (1986). Space or time adaptive signal processing by neural
network models. Neural Networks for Computing, In J.S. Denker (Ed.),
Proc. of the AIP Conference (pp. 206-211). New York: American Institute
of Physics.
Hoshuyama, et al. 1999. A robust adaptive beamformer for microphone arrays
with a blocking matrix using constrained adaptive filters. IEEE
Transactions on Signal Processing, 47(10):2677-2684.
Hyvarinen, et al. 1997. A fast fixed-point algorithm for independent
component analysis. Neural Computation, 9:1483-1492.
Hyvarinen, A. 1999. Fast and robust fixed-point algorithms for independent
component analysis. IEEE Trans. on Neural Networks, 10(3):626-634.
Jutten, et al. 1991. Blind separation of sources, Part I: An adaptive
algorithm based on neuromimetic architecture. Signal Processing, 24:1-10.
Lambert, R. H. 1996. Multichannel blind deconvolution: FIR matrix algebra
and separation of multipath mixtures. Doctoral Dissertation, University
of Southern California.
Lee, et al. 1997. A contextual blind separation of delayed and convolved
sources. Proceedings of the 1997 IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP '97), 2:1199-1202.
Lee, et al. 1998. Combining time-delayed decorrelation and ICA: Towards
solving the cocktail party problem. Proceedings of the 1998 IEEE
International Conference on Acoustics, Speech, and Signal Processing
(ICASSP '98), 2:1249-1252.
Murata, et al. 1998. An On-line Algorithm for Blind Source Separation on
Speech Signals. Proc. of 1998 International Symposium on Nonlinear Theory
and its Application (NOLTA98), pp. 923-926, Le Regent, Crans-Montana,
Switzerland.
Molgedey, et al. 1994. Separation of a mixture of independent signals
using time delayed correlations. Physical Review Letters, The American
Physical Society, 72(23):3634-3637.
Parra, et al. 2000. Convolutive blind separation of non-stationary
sources. IEEE Transactions on Speech and Audio Processing, 8(3):320-327.
Platt, et al. 1992. Networks for the separation of sources that are
superimposed and delayed. In J. Moody, S. Hanson, R. Lippmann (Eds.),
Advances in Neural Information Processing 4 (pp. 730-737). San Francisco:
Morgan-Kaufmann.
Tong, et al. 1991. A necessary and sufficient condition for the blind
identification of memoryless systems. Circuits and Systems, IEEE
International Symposium, 1:1-4.
Torkkola, K. 1996. Blind separation of convolved sources based on
information maximization. Neural Networks for Signal Processing: VI
Proceedings of the 1996 IEEE Signal Processing Society Workshop,
pp. 423-432.
Torkkola, K. 1997. Blind deconvolution, information maximization and
recursive filters. IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP '97), 4:3301-3304.
Van Compernolle, et al. 1992. Signal separation in a symmetric adaptive
noise canceler by output decorrelation. Acoustics, Speech, and Signal
Processing, 1992. ICASSP-92, 1992 IEEE International Conference,
4:221-224.
Visser, et al. Blind source separation in mobile environments using a
priori knowledge. Acoustics, Speech, and Signal Processing, 2004.
Proceedings (ICASSP '04). IEEE International Conference on, vol. 3,
May 17-21, 2004, pp. III-893-896.
Visser, et al. Speech enhancement using blind source separation and
two-channel energy based speaker detection. Acoustics, Speech, and Signal
Processing, 2003. Proceedings (ICASSP '03). 2003 IEEE International
Conference on, vol. 1, Apr. 6-10, 2003, pp. I-884-I-887.
Yellin, et al. 1996. Multichannel signal separation: Methods and analysis.
IEEE Transactions on Signal Processing, 44(1):106-118.
First Examination Report dated Oct. 23, 2006 from Indian Application No.
1571/CHENP/2005.
International Search Report from PCT/US03/39593 dated Apr. 29, 2004.
International Search Report from the EPO, Reference No. P400550, dated
Oct. 15, 2007, in regards to European Publication No. EP1570464.
International Preliminary Report on Patentability dated Feb. 1, 2007, with
copy of Written Opinion of ISA dated Apr. 19, 2006, for PCT/US2005/026195
filed on Jul. 22, 2005.
International Preliminary Report on Patentability dated Feb. 1, 2007, with
copy of Written Opinion of ISA dated Mar. 10, 2006, for PCT/US2005/026196
filed on Jul. 22, 2005.
Office Action dated Oct. 31, 2006 from U.S. Appl. No. 10/537,985, filed
Jun. 9, 2005.
Final Office Action dated Apr. 13, 2007 from U.S. Appl. No. 10/537,985,
filed Jun. 9, 2005.
Notice of Allowance with Examiner's Amendment dated Jul. 30, 2007 from
U.S. Appl. No. 10/537,985, filed Jun. 9, 2005.
Notice of Allowance dated Dec. 12, 2007 from U.S. Appl. No. 10/537,985,
filed Jun. 9, 2005.
Office Action dated Mar. 23, 2007 from U.S. Appl. No. 11/463,376, filed
Aug. 9, 2006.
Notice of Allowance dated Dec. 12, 2007 from U.S. Appl. No. 11/463,376,
filed Aug. 9, 2006.
Office Action dated Dec. 27, 2005 from U.S. Appl. No. 10/897,219, filed
Jul. 22, 2004.
Notice of Allowance dated Apr. 10, 2006 from U.S. Appl. No. 10/897,219,
filed Jul. 22, 2004.
International Preliminary Report on Patentability dated Jan. 31, 2008,
with copy of Written Opinion of ISA dated Aug. 31, 2007, for
PCT/US2006/028627 filed on Jul. 21, 2006.

* cited by examiner
[Sheet 1 of 13, FIG. 1: block diagram of a communication system 100;
labeled elements include microphones 102, a SIGNAL SEPARATION PROCESS 104,
a VOICE ACTIVITY DETECTOR 106, POST PROCESSING 107, and TRANSMISSION
121/123/125; the legend distinguishes transducer signals from speech
signals.]
[Sheet 2 of 13, FIG. 2: block diagram with microphones 177/178; labeled
blocks include a LEARNING PROCESS 180, a SIGNAL SEPARATION PROCESS, VOLUME
ADJUSTMENT, NOISE ESTIMATION, NOISE REDUCTION, a VOICE ACTIVITY DETECTOR,
and AGC, yielding a speech signal for TRANSMISSION 191/193; other labeled
reference numerals include 179, 181, 185, 186, 195, and 196.]
[Sheet 3 of 13, FIG. 3: flowchart of a two-microphone voice activity
detection process: position a first microphone closer to the speech source
than a second microphone (206); receive a signal from each of the
microphones (207); monitor a threshold difference and compare energy
levels (208); if the first mic signal has a higher energy level than the
second mic signal (209), the signal is likely speech (212); if the second
mic signal has a higher energy level than the first (210), it is likely
noise (213).]
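For illustration, the energy comparison of FIG. 3 can be sketched as below; the frame interface, the 3 dB threshold, and the function name are assumptions made for the example, not values disclosed in the patent.

```python
import numpy as np

def two_channel_vad(mic1_frame, mic2_frame, threshold_db=3.0):
    """Minimal sketch of the FIG. 3 decision: mic1 is assumed closer to the
    speech source, so frames where it carries noticeably more energy are
    flagged as likely speech; otherwise they are treated as likely noise."""
    e1 = 10.0 * np.log10(np.sum(np.square(mic1_frame)) + 1e-12)
    e2 = 10.0 * np.log10(np.sum(np.square(mic2_frame)) + 1e-12)
    return (e1 - e2) > threshold_db  # True -> likely speech
```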
[Sheet 4 of 13, FIG. 4: flowchart of a voice activity detection process
250 operating on separated signals: position a first microphone closer to
the speech source than a second microphone (251); a signal separation
process generates a noise signal and a speech signal (252); monitor a
threshold difference and compare energy levels (253); if the speech signal
has a higher energy level than the noise signal (254), likely speech
(257); if the noise signal has a higher energy level than the speech
signal (255), likely noise (258).]
[Sheet 5 of 13, FIG. 5: handset 329/330 with a SPEECH SEPARATION PROCESS
332, annotated "Blind Signal Separation, Independent Component Analysis",
and TRANSMISSION 335, annotated "BlueTooth, Wired, IEEE 802.11".]
[Sheet 6 of 13, FIG. 6: labeled blocks include a SPEECH SEPARATION PROCESS
352, SIDE TONE PROCESSING, a SPEAKER, and TRANSMISSION; reference numerals
354, 355, 356, 358, 360, and 362 mark the remaining elements.]
[Sheet 7 of 13, FIG. 7: a SPEECH SEPARATION PROCESS 405 on microphone
inputs 402 yields a speech signal 406 and a noisy signal 407; NOISE
ESTIMATION 410 and a VOICE ACTIVITY DETECTOR 411 issuing a control signal
413 drive NOISE REDUCTION 415, followed by TRANSMISSION 418/420.]
[Sheet 8 of 13, FIG. 8: microphone placement diagram with elements 451 and
452 (Mic 1).]
[Sheet 9 of 13, FIG. 9: flowchart of a separation process: position
transducers (502); receive signals having noise and information (504);
process signals into channels (506), with sub-steps of setting gain (517),
adapting filter coefficients (521), rearranging coefficients (519), and
applying filters (523); identify the channel with both noise and
information (508); process the identified channel to generate an
information signal (510); measure the noise signal and combination signal
(515).]
[Sheet 10 of 13, FIGS. 10 and 11: signal separation network structures 600
with inputs x_j(t).]
[Sheet 11 of 13, FIG. 12: two-stage separation arrangement 750: a LEARNING
STAGE 754 with learning-stage FILTER COEFFICIENTS and an OUTPUT STAGE 758
with output-stage FILTER COEFFICIENTS, supervised by a RESET MONITOR 765
holding DEFAULT COEFFICIENTS 767; other labeled reference numerals include
752, 756, 760, 762, 770, 773, 776, and 777.]
[Sheet 12 of 13, FIG. 13: system 800 with microphones 801/802; a SIGNAL
SEPARATION PROCESS and a SCALING MONITOR 814 feed POST PROCESSING and
TRANSMISSION 823/825; other labeled reference numerals include 803, 806,
807, 812, and 821.]
[Sheet 13 of 13, FIG. 14: flowchart of a wind-noise process 900: position
a first microphone facing a different wind direction than a second
microphone (902); monitor microphone signals for a low-frequency wind
signature (904); deactivate or de-emphasize the microphone hit by wind
(906); operate as a single-channel communication process (908); monitor
microphone signals for the ending of the low-frequency wind signature
(911); reactivate the microphone and activate two-channel separation and
post-processing (913).]
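For illustration, the low-frequency wind-signature test of FIG. 14 might look like the following sketch; the cutoff frequency and energy ratio are placeholder assumptions, not parameters disclosed in the patent.

```python
import numpy as np

def wind_hit(frame, fs, cutoff_hz=200.0, energy_ratio=0.6):
    """Minimal sketch: flag a microphone frame as wind-corrupted when the
    share of its energy below cutoff_hz exceeds energy_ratio."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    low_share = power[freqs < cutoff_hz].sum() / (power.sum() + 1e-12)
    return low_share > energy_ratio
```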
ROBUST SEPARATION OF SPEECH SIGNALS IN A NOISY ENVIRONMENT
RELATED APPLICATIONS

This application is related to U.S. patent application Ser. No.
10/897,219, filed Jul. 22, 2004 (now U.S. Pat. No. 7,099,821, issued Aug.
29, 2006) and entitled "Separation of Target Acoustic Signals in a
Multi-Transducer Arrangement", which is related to a co-pending Patent
Cooperation Treaty application number PCT/US03/39593, entitled "System and
Method for Speech Processing Using Improved Independent Component
Analysis", filed Dec. 11, 2003, which claims priority to U.S. patent
application Ser. No. 60/502,253, both of which are incorporated herein by
reference.
FIELD OF THE INVENTION

The present invention relates to processes and methods for separating a
speech signal from a noisy acoustic environment. More particularly, one
example of the present invention provides a blind signal source process
for separating a speech signal from a noisy environment.
BACKGROUND

An acoustic environment is often noisy, making it difficult to reliably
detect and react to a desired informational signal. For example, a person
may desire to communicate with another person using a voice communication
channel. The channel may be provided, for example, by a mobile wireless
handset, a walkie-talkie, a two-way radio, or other communication device.
To improve usability, the person may use a headset or earpiece connected
to the communication device. The headset or earpiece often has one or more
ear speakers and a microphone. Typically, the microphone extends on a boom
toward the person's mouth, to increase the likelihood that the microphone
will pick up the sound of the person speaking. When the person speaks, the
microphone receives the person's voice signal, and converts it to an
electronic signal. The microphone also receives sound signals from various
noise sources, and therefore also includes a noise component in the
electronic signal. Since the headset may position the microphone several
inches from the person's mouth, and the environment may have many
uncontrollable noise sources, the resulting electronic signal may have a
substantial noise component. Such substantial noise causes an
unsatisfactory communication experience, and may cause the communication
device to operate in an inefficient manner, thereby increasing battery
drain.
In one particular example, a speech signal is generated in a noisy
environment, and speech processing methods are used to separate the speech
signal from the environmental noise. Such speech signal processing is
important in many areas of everyday communication, since noise is almost
always present in real-world conditions. Noise is defined as the
combination of all signals interfering with or degrading the speech signal
of interest. The real world abounds with multiple noise sources, including
single point noise sources, which often transgress into multiple sounds
resulting in reverberation. Unless separated and isolated from background
noise, it is difficult to make reliable and efficient use of the desired
speech signal. Background noise may include numerous noise signals
generated by the general environment, signals generated by background
conversations of other people, as well as reflections and reverberation
generated from each of the signals. In communication where users often
talk in noisy environments, it is desirable to separate the user's speech
signals from background noise. Speech communication mediums, such as cell
phones, speakerphones, headsets, cordless telephones, teleconferences, CB
radios, walkie-talkies, computer telephony applications, computer and
automobile voice command applications and other hands-free applications,
intercoms, microphone systems and so forth, can take advantage of speech
signal processing to separate the desired speech signals from background
noise.
Many methods have been created to separate desired sound signals from
background noise signals, including simple filtering processes. Prior art
noise filters identify signals with predetermined characteristics as white
noise signals, and subtract such signals from the input signals. These
methods, while simple and fast enough for real time processing of sound
signals, are not easily adaptable to different sound environments, and can
result in substantial degradation of the speech signal sought to be
resolved. The predetermined assumptions of noise characteristics can be
over-inclusive or under-inclusive. As a result, portions of a person's
speech may be considered "noise" by these methods and therefore removed
from the output speech signals, while portions of background noise such as
music or conversation may be considered non-noise by these methods and
therefore included in the output speech signals.
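For concreteness, such a fixed subtractive filter can be sketched as spectral subtraction against a predetermined noise spectrum; the function and its parameters are illustrative assumptions, not a method described in this patent. Because the noise estimate never adapts, speech overlapping the assumed spectrum is removed and noise outside it is passed, which is exactly the over- and under-inclusiveness just described.

```python
import numpy as np

def fixed_spectral_subtraction(frame, noise_mag):
    """Minimal sketch of a non-adaptive subtractive noise filter.
    noise_mag: predetermined noise magnitude spectrum (never updated)."""
    spectrum = np.fft.rfft(frame)
    mag = np.abs(spectrum)
    phase = np.angle(spectrum)
    cleaned = np.maximum(mag - noise_mag, 0.0)  # floor negative bins at zero
    return np.fft.irfft(cleaned * np.exp(1j * phase), n=len(frame))
```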
In signal processing applications, typically one or more input signals are
acquired using a transducer sensor, such as a microphone. The signals
provided by the sensors are mixtures of many sources. Generally, the
signal sources as well as their mixture characteristics are unknown.
Without knowledge of the signal sources other than the general statistical
assumption of source independence, this signal processing problem is known
in the art as the "blind source separation (BSS) problem". The blind
separation problem is encountered in many familiar forms. For instance, it
is well known that a human can focus attention on a single source of sound
even in an environment that contains many such sources, a phenomenon
commonly referred to as the "cocktail-party effect." Each of the source
signals is delayed and attenuated in some time-varying manner during
transmission from source to microphone, where it is then mixed with other
independently delayed and attenuated source signals, including multipath
versions of itself (reverberation), which are delayed versions arriving
from different directions. A person receiving all these acoustic signals
may be able to listen to a particular set of sound sources while filtering
out or ignoring other interfering sources, including multipath signals.
Considerable effort has been devoted in the prior art to solving the
cocktail-party effect, both in physical devices and in computational
simulations of such devices. Various noise mitigation techniques are
currently employed, ranging from simple elimination of a signal prior to
analysis to schemes for adaptive estimation of the noise spectrum that
depend on a correct discrimination between speech and non-speech signals.
A description of these techniques is generally characterized in U.S. Pat.
No. 6,002,776 (herein incorporated by reference). In particular, U.S. Pat.
No. 6,002,776 describes a scheme to separate source signals where two or
more microphones are mounted in an environment that contains an equal or
lesser number of distinct sound sources. Using direction-of-arrival
information, a first module attempts to extract the original source
signals while any residual crosstalk between the channels is removed by a
second module. Such an arrangement may be effective in separating
spatially localized point sources with clearly defined direction-of-arrival
but fails to separate out a speech signal in a real-world spatially
distributed noise environment for which no particular direction-of-arrival
can be determined.
Methods, such as Independent Component Analysis ("ICA"), provide
relatively accurate and flexible means for the separation of speech
signals from noise sources. ICA is a technique for separating mixed source
signals (components) which are presumably independent from each other. In
its simplified form, independent component analysis applies an "un-mixing"
matrix of weights to the mixed signals, for example multiplying the matrix
with the mixed signals, to produce separated signals. The weights are
assigned initial values, and then adjusted to maximize joint entropy of
the signals in order to minimize information redundancy. This
weight-adjusting and entropy-increasing process is repeated until the
information redundancy of the signals is reduced to a minimum. Because
this technique does not require information on the source of each signal,
it is known as a "blind source separation" method. Blind separation
problems refer to the idea of separating mixed signals that come from
multiple independent sources.
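The weight-adjusting loop described above can be illustrated with a minimal Infomax-style sketch using the natural-gradient update of Amari et al. (1996); the learning rate, the tanh nonlinearity, and the iteration count are illustrative choices rather than values prescribed by any method in this patent.

```python
import numpy as np

def ica_unmix(X, lr=0.01, iters=500):
    """Minimal Infomax ICA sketch with the natural-gradient update.

    X: (n_sources, n_samples) matrix of mixed signals.
    Returns the un-mixing matrix W such that W @ X approximates the sources.
    """
    n = X.shape[0]
    W = np.eye(n)  # initial weight values
    for _ in range(iters):
        Y = W @ X                 # current source estimates
        G = np.tanh(Y)            # nonlinearity for a super-Gaussian source prior
        # Natural-gradient ascent on joint entropy (Amari et al., 1996):
        # dW ~ (I - E[g(y) y^T]) W, which drives redundancy toward a minimum.
        W += lr * (np.eye(n) - (G @ Y.T) / X.shape[1]) @ W
    return W
```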
Many popular ICA algorithms have been developed to optimize their
performance, including a number which have evolved by significant
modifications of those which only existed a decade ago. For example, the
work described in A. J. Bell and T. J. Sejnowski, Neural Computation
7:1129-1159 (1995), and Bell, A. J., U.S. Pat. No. 5,706,402, is usually
not used in its patented form. Instead, in order to optimize its
performance, this algorithm has gone through several recharacterizations
by a number of different entities. One such change includes the use of the
"natural gradient", described in Amari, Cichocki and Yang (1996). Other
popular ICA algorithms include methods that compute higher-order
statistics such as cumulants (Cardoso, 1992; Comon, 1994; Hyvarinen and
Oja, 1997).
However, many known ICA algorithms are not able to effectively separate
signals that have been recorded in a real environment, which inherently
includes acoustic echoes, such as those due to room architecture related
reflections. It is emphasized that the methods mentioned so far are
restricted to the separation of signals resulting from a linear stationary
mixture of source signals. The phenomenon resulting from the summing of
direct-path signals and their echoic counterparts is termed reverberation
and poses a major issue in artificial speech enhancement and recognition
systems. ICA algorithms may require long filters which can separate those
time-delayed and echoed signals, thus precluding effective real time use.

Known ICA signal separation systems typically use a network of filters,
acting as a neural network, to resolve individual signals from any number
of mixed signals input into the filter network. That is, the ICA network
is used to separate a set of sound signals into a more ordered set of
signals, where each signal represents a particular sound source. For
example, if an ICA network receives a sound signal comprising piano music
and a person speaking, a two-port ICA network will separate the sound into
two signals: one signal having mostly piano music, and another signal
having mostly speech.
Another prior technique is to separate sound based on auditory scene
analysis. In this analysis, vigorous use is made of assumptions regarding
the nature of the sources present. It is assumed that a sound can be
decomposed into small elements such as tones and bursts, which in turn can
be grouped according to attributes such as harmonicity and continuity in
time. Auditory scene analysis can be performed using information from a
single microphone or from several microphones. The field of auditory scene
analysis has gained more attention due to the availability of
computational machine learning approaches leading to computational
auditory scene analysis, or CASA. Although interesting scientifically,
since it involves the understanding of human auditory processing, the
model assumptions and the computational techniques are still in their
infancy with respect to solving a realistic cocktail party scenario.
Other techniques for separating sounds operate by exploiting the spatial
separation of their sources. Devices based on this principle vary in
complexity. The simplest such devices are microphones that have highly
selective, but fixed, patterns of sensitivity. A directional microphone,
for example, is designed to have maximum sensitivity to sounds emanating
from a particular direction, and can therefore be used to enhance one
audio source relative to others. Similarly, a close-talking microphone
mounted near a speaker's mouth may reject some distant sources.
Microphone-array processing techniques are then used to separate sources
by exploiting perceived spatial separation. These techniques are not
practical because sufficient suppression of a competing sound source
cannot be achieved due to their assumption that at least one microphone
contains only the desired signal, which is not practical in an acoustic
environment.
A widely known technique for linear microphone-array processing is often
referred to as "beamforming". In this method the time difference between
signals, due to the spatial separation of the microphones, is used to
enhance the signal. More particularly, it is likely that one of the
microphones will "look" more directly at the speech source, whereas the
other microphone may generate a signal that is relatively attenuated.
Although some attenuation can be achieved, the beamformer cannot provide
relative attenuation of frequency components whose wavelengths are larger
than the array. These techniques are methods for spatial filtering to
steer a beam towards a sound source, thereby placing a null in the other
directions. Beamforming techniques make no assumption about the sound
source, but assume that the geometry between source and sensors, or the
sound signal itself, is known for the purpose of dereverberating the
signal or localizing the sound source.
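A minimal delay-and-sum sketch of this idea follows; the steering delay is assumed known (as the paragraph notes), and the two-microphone interface and function name are illustrative.

```python
import numpy as np

def delay_and_sum(mic1, mic2, delay_samples):
    """Minimal two-microphone delay-and-sum beamformer sketch.
    Signals from the look direction add coherently after alignment, while
    other directions are attenuated. np.roll wraps at the array edges,
    which a real implementation would handle with padding instead."""
    aligned = np.roll(mic2, -int(delay_samples))
    return 0.5 * (mic1 + aligned)
```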
A known technique in robust adaptive beamforming referred to as
"Generalized Sidelobe Canceling" (GSC) is discussed in Hoshuyama, O.,
Sugiyama, A., Hirano, A., A Robust Adaptive Beamformer for Microphone
Arrays with a Blocking Matrix using Constrained Adaptive Filters, IEEE
Transactions on Signal Processing, vol. 47, no. 10, pp. 2677-2684, October
1999. GSC aims at filtering out a single desired source signal z_i from a
set of measurements x, as more fully explained by the GSC principle in
Griffiths, L. J., Jim, C. W., An alternative approach to linearly
constrained adaptive beamforming, IEEE Transactions on Antennas and
Propagation, vol. 30, no. 1, pp. 27-34, January 1982. Generally, GSC
predefines a signal-independent beamformer c that filters the sensor
signals so that the direct path from the desired source remains
undistorted whereas, ideally, other directions should be suppressed. Most
often, the position of the desired source must be pre-determined by
additional localization methods. In the lower, side path, an adaptive
blocking matrix B aims at suppressing all components originating from the
desired signal z_i so that only noise components appear at the output of
B. From these, an adaptive interference canceller a derives an estimate
for the remaining noise component in the output of c, by minimizing an
estimate of the total output power E(z_i*z_i). Thus the fixed beamformer c
and the interference canceller a jointly perform interference suppression.
Since GSC requires the desired speaker to be confined to a limited
tracking region, its applicability is limited to spatially rigid
scenarios.
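The structure just described (fixed beamformer c, blocking matrix B, adaptive canceller a) can be sketched as follows; the two-microphone sum/difference paths and the LMS step size are illustrative simplifications of the cited constrained-filter designs.

```python
import numpy as np

def gsc_two_mic(x1, x2, mu=0.01):
    """Minimal two-microphone GSC sketch (sample-by-sample, scalar canceller).

    Fixed beamformer c: the sum path, preserving the (pre-aligned) look
    direction. Blocking matrix B: the difference path, which cancels the
    desired signal and passes only noise. Adaptive canceller a: a single
    LMS weight minimizing the total output power E(z*z).
    """
    a = 0.0
    out = np.empty_like(x1, dtype=float)
    for n in range(len(x1)):
        c = 0.5 * (x1[n] + x2[n])  # fixed beamformer output
        b = x1[n] - x2[n]          # blocking-matrix (noise reference) output
        z = c - a * b              # interference-cancelled output
        a += mu * z * b            # LMS update minimizing E(z*z)
        out[n] = z
    return out
```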
Another known technique is a class of active-cancellation algorithms,
which is related to sound separation. However, this technique requires a
"reference signal," i.e., a signal derived from only one of the sources.
Active noise-cancellation and echo cancellation techniques make extensive
use of this technique: the contribution of noise to a mixture is reduced
by filtering a known signal that contains only the noise, and subtracting
it from the mixture. This method assumes that one of the measured signals
consists of one and only one source, an assumption which is not realistic
in many real life settings.
Techniques for active cancellation that do not require a reference signal
are called "blind" and are of primary interest in this application. They
are now classified based on the degree of realism of the underlying
assumptions regarding the acoustic processes by which the unwanted signals
reach the microphones. One class of blind active-cancellation techniques
may be called "gain-based", and is also known as "instantaneous mixing":
it is presumed that the waveform produced by each source is received by
the microphones simultaneously, but with varying relative gains.
(Directional microphones are most often used to produce the required
differences in gain.) Thus, a gain-based system attempts to cancel copies
of an undesired source in different microphone signals by applying
relative gains to the microphone signals and subtracting, but not applying
time delays or other filtering. Numerous gain-based methods for blind
active cancellation have been proposed; see Herault and Jutten (1986),
Tong et al. (1991), and Molgedey and Schuster (1994).
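As a minimal sketch of this instantaneous-mixing model, suppose microphone 1 observes s + g*n and microphone 2 observes the undesired source n alone, with relative gain g and no delays; the undesired source then cancels by scaled subtraction. The mixing model and names are assumptions made for the illustration.

```python
import numpy as np

def gain_based_cancel(x1, x2, g):
    """Minimal gain-based (instantaneous-mixing) cancellation sketch.
    Assumes x1 = s + g * x2 with relative gain g and no time delay, so a
    scaled subtraction removes the undesired source. This breaks down as
    soon as real inter-microphone delays or echoes exist, as the text
    goes on to explain."""
    return x1 - g * np.asarray(x2)
```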
The gain-based or instantaneous-mixing assumption is violated when
microphones are separated in space, as in most acoustic applications. A
simple extension of this method is to include a time-delay factor, but
without any other filtering, which will work under anechoic conditions.
However, this simple model of acoustic propagation from the sources to the
microphones is of limited use when echoes and reverberation are present.
The most realistic active-cancellation techniques currently known are
"convolutive": the effect of acoustic propagation from each source to each
microphone is modeled as a convolutive filter. These techniques are more
realistic than gain-based and delay-based techniques because they
explicitly accommodate the effects of inter-microphone separation, echoes
and reverberation. They are also more general since, in principle, gains and