US007464029B2

(12) United States Patent                    (10) Patent No.: US 7,464,029 B2
     Visser et al.                           (45) Date of Patent: Dec. 9, 2008

(54) ROBUST SEPARATION OF SPEECH SIGNALS IN A NOISY ENVIRONMENT

(75) Inventors: Erik Visser, San Diego, CA (US); Jeremy Toman, San Marcos, CA (US); Kwokleung Chan, San Diego, CA (US)

(73) Assignee: QUALCOMM Incorporated, San Diego, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 246 days.

(21) Appl. No.: 11/187,504

(22) Filed: Jul. 22, 2005

(65) Prior Publication Data
     US 2007/0021958 A1        Jan. 25, 2007

(51) Int. Cl.
     G10L 19/14 (2006.01)
     G10L 11/06 (2006.01)
     G10L 21/02 (2006.01)
     G10L 15/20 (2006.01)
(52) U.S. Cl. .......... 704/210; 704/215; 704/228; 704/233
(58) Field of Classification Search .......... None
     See application file for complete search history.

(56) References Cited

     U.S. PATENT DOCUMENTS

     4,649,505 A     3/1987   Zinser, Jr. et al.
     4,912,767 A     3/1990   Chang
     5,208,786 A     5/1993   Weinstein et al.
     5,251,263 A    10/1993   Andrea et al.
          (Continued)

     FOREIGN PATENT DOCUMENTS

     EP   1 006 652 A2      6/2000
     WO   WO 01/27874       4/2001
     WO   WO 2006/012578    2/2006
     WO   WO 2006/028587    3/2006

     OTHER PUBLICATIONS

Amari, et al. 1996. A new learning algorithm for blind signal separation. In D. Touretzky, M. Mozer, and M. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8 (pp. 757-763). Cambridge: MIT Press.
          (Continued)

Primary Examiner—David R. Hudspeth
Assistant Examiner—Brian L Albertalli
(74) Attorney, Agent, or Firm—Espartaco Diaz Hidalgo; Timothy F. Loomis; Thomas R. Rouse
`
(57) ABSTRACT

A method for improving the quality of a speech signal extracted from a noisy acoustic environment is provided. In one approach, a signal separation process is associated with a voice activity detector. The voice activity detector is a two-channel detector, which enables a particularly robust and accurate detection of voice activity. When speech is detected, the voice activity detector generates a control signal. The control signal is used to activate, adjust, or control signal separation processes or post-processing operations to improve the quality of the resulting speech signal. In another approach, a signal separation process is provided as a learning stage and an output stage. The learning stage aggressively adjusts to current acoustic conditions, and passes coefficients to the output stage. The output stage adapts more slowly, and generates a speech-content signal and a noise-dominant signal. When the learning stage becomes unstable, only the learning stage is reset, allowing the output stage to continue outputting a high quality speech signal.

44 Claims, 13 Drawing Sheets
`
[Front-page drawing: FIG. 1, showing transducer/speech signals (105), a voice activity detector, post processing, and transmission; reproduced as Sheet 1 of 13 below.]
`
`
`
`
`
     U.S. PATENT DOCUMENTS

     5,327,178 A       7/1994   McManigal
     5,375,174 A      12/1994   Denenberg
     5,383,164 A       1/1995   Sejnowski et al.
     5,706,402 A       1/1998   Bell
     5,715,321 A       2/1998   Andrea et al.
     5,732,143 A       3/1998   Andrea et al.
     5,770,841 A       6/1998   Moed et al.
     5,999,567 A      12/1999   Torkkola
     5,999,956 A      12/1999   Deville
     6,002,776 A      12/1999   Bhadkamkar et al.
     6,108,415 A       8/2000   Andrea
     6,130,949 A  *   10/2000   Aoki et al. .......... 381/94.3
     6,167,417 A      12/2000   Parra et al.
     6,381,570 B2      4/2002   Li et al.
     6,424,960 B1      7/2002   Lee et al.
     6,526,178 B1      2/2003   Fukuhara
     6,549,630 B1 *    4/2003   Bobisuthi .......... 381/94.7
     6,606,506 B1      8/2003   Jones
     7,099,821 B2      8/2006   Visser et al.
     2001/0037195 A1  11/2001   Acero et al.
     2002/0110256 A1   8/2002   Watson et al.
     2002/0136328 A1   9/2002   Shimizu
     2002/0193130 A1  12/2002   Yang et al.
     2003/0055735 A1   3/2003   Cameron et al.
     2003/0179888 A1 * 9/2003   Burnett et al. .......... 381/71.8
     2004/0039464 A1   2/2004   Virolainen et al.
     2004/0120540 A1   6/2004   Mullenborn et al.
     2004/0136543 A1   7/2004   White et al.
`
     OTHER PUBLICATIONS

Amari, et al. 1997. Stability analysis of learning algorithms for blind source separation. Neural Networks, 10(8):1345-1351.
Bell, et al. 1995. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7:1129-1159.
Cardoso, J-F. 1992. Fourth-order cumulant structure forcing. Application to blind array processing. Proc. IEEE SP Workshop on SSAP-92, 136-139.
Comon, P. 1994. Independent component analysis, A new concept? Signal Processing, 36:287-314.
Griffiths, et al. 1982. An alternative approach to linearly constrained adaptive beamforming. IEEE Transactions on Antennas and Propagation, AP-30(1):27-34.
Herault, et al. 1986. Space or time adaptive signal processing by neural network models. Neural Networks for Computing, in J. S. Denker (Ed.), Proc. of the AIP Conference (pp. 206-211). New York: American Institute of Physics.
Hoshuyama, et al. 1999. A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters. IEEE Transactions on Signal Processing, 47(10):2677-2684.
Hyvarinen, et al. 1997. A fast fixed-point algorithm for independent component analysis. Neural Computation, 9:1483-1492.
Hyvarinen, A. 1999. Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. on Neural Networks, 10(3):626-634.
Jutten, et al. 1991. Blind separation of sources, Part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24:1-10.
Lambert, R. H. 1996. Multichannel blind deconvolution: FIR matrix algebra and separation of multipath mixtures. Doctoral Dissertation, University of Southern California.
Lee, et al. 1997. A contextual blind separation of delayed and convolved sources. Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97), 2:1199-1202.
Lee, et al. 1998. Combining time-delayed decorrelation and ICA: Towards solving the cocktail party problem. Proceedings of the 1998 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '98), 2:1249-1252.
Murata, et al. 1998. An on-line algorithm for blind source separation on speech signals. Proc. of 1998 International Symposium on Nonlinear Theory and its Application (NOLTA '98), pp. 923-926, Le Regent, Crans-Montana, Switzerland.
Molgedey, et al. 1994. Separation of a mixture of independent signals using time delayed correlations. Physical Review Letters, The American Physical Society, 72(23):3634-3637.
Parra, et al. 2000. Convolutive blind separation of non-stationary sources. IEEE Transactions on Speech and Audio Processing, 8(3):320-327.
Platt, et al. 1992. Networks for the separation of sources that are superimposed and delayed. In J. Moody, S. Hanson, R. Lippmann (Eds.), Advances in Neural Information Processing 4 (pp. 730-737). San Francisco: Morgan-Kaufmann.
Tong, et al. 1991. A necessary and sufficient condition for the blind identification of memoryless systems. Circuits and Systems, IEEE International Symposium, 1:1-4.
Torkkola, K. 1996. Blind separation of convolved sources based on information maximization. Neural Networks for Signal Processing: VI. Proceedings of the 1996 IEEE Signal Processing Society Workshop, pp. 423-432.
Torkkola, K. 1997. Blind deconvolution, information maximization and recursive filters. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97), 4:3301-3304.
Van Compernolle, et al. 1992. Signal separation in a symmetric adaptive noise canceler by output decorrelation. Acoustics, Speech, and Signal Processing, 1992. ICASSP-92, 1992 IEEE International Conference, 4:221-224.
Visser, et al. Blind source separation in mobile environments using a priori knowledge. Acoustics, Speech, and Signal Processing, 2004. Proceedings (ICASSP '04), IEEE International Conference on, vol. 3, May 17-21, 2004, pp. III-893-896.
Visser, et al. Speech enhancement using blind source separation and two-channel energy based speaker detection. Acoustics, Speech, and Signal Processing, 2003. Proceedings (ICASSP '03), 2003 IEEE International Conference on, vol. 1, Apr. 6-10, 2003, pp. I-884-I-887.
Yellin, et al. 1996. Multichannel signal separation: Methods and analysis. IEEE Transactions on Signal Processing, 44(1):106-118.
First Examination Report dated Oct. 23, 2006 from Indian Application No. 1571/CHENP/2005.
International Search Report from PCT/US03/39593 dated Apr. 29, 2004.
International Search Report from the EPO, Reference No. P400550, dated Oct. 15, 2007, in regards to European Publication No. EP1570464.
International Preliminary Report on Patentability dated Feb. 1, 2007, with copy of Written Opinion of ISA dated Apr. 19, 2006, for PCT/US2005/026195 filed on Jul. 22, 2005.
International Preliminary Report on Patentability dated Feb. 1, 2007, with copy of Written Opinion of ISA dated Mar. 10, 2006, for PCT/US2005/026196 filed on Jul. 22, 2005.
Office Action dated Oct. 31, 2006 from U.S. Appl. No. 10/537,985, filed Jun. 9, 2005.
Final Office Action dated Apr. 13, 2007 from U.S. Appl. No. 10/537,985, filed Jun. 9, 2005.
Notice of Allowance with Examiner's Amendment dated Jul. 30, 2007 from U.S. Appl. No. 10/537,985, filed Jun. 9, 2005.
Notice of Allowance dated Dec. 12, 2007 from U.S. Appl. No. 10/537,985, filed Jun. 9, 2005.
Office Action dated Mar. 23, 2007 from U.S. Appl. No. 11/463,376, filed Aug. 9, 2006.
Notice of Allowance dated Dec. 12, 2007 from U.S. Appl. No. 11/463,376, filed Aug. 9, 2006.
Office Action dated Dec. 27, 2005 from U.S. Appl. No. 10/897,219, filed Jul. 22, 2004.
Notice of Allowance dated Apr. 10, 2006 from U.S. Appl. No. 10/897,219, filed Jul. 22, 2004.
International Preliminary Report on Patentability dated Jan. 31, 2008, with copy of Written Opinion of ISA dated Aug. 31, 2007, for PCT/US2006/028627, filed on Jul. 21, 2006.

* cited by examiner
`
`
`
`
[FIG. 1 (Sheet 1 of 13): Block diagram of a speech separation system. Transducer/speech signals (105, 106) feed a voice activity detector and a signal separation process (108); the separated signal passes through post processing (110, 112) to transmission (123, 125).]
`
`
`
[FIG. 2 (Sheet 2 of 13): Speech separation system in which a signal separation process and a voice activity detector control a noise reduction and post-processing path ahead of transmission (reference numerals 178, 180, 181, 191).]
`
`
`
[FIG. 3 (Sheet 3 of 13): Flowchart of a two-microphone voice activity detection method: position a first microphone closer to the speech source than a second microphone; receive a signal from each of the microphones; monitor a threshold difference and compare energy levels; if the first microphone signal has the higher energy level, the input is likely speech; if the second microphone signal has the higher energy level, it is likely noise (reference numerals 207-213). A code sketch of this comparison follows FIG. 4.]
`
`
`
[FIG. 4 (Sheet 4 of 13): Flowchart 250 of the same energy comparison applied to separated outputs: position a first microphone closer to the speech source than a second microphone (251); a signal separation process generates a noise signal and a speech signal (252); monitor a threshold difference and compare energy levels (253); if the speech signal has the higher energy level, likely speech; if the noise signal has the higher energy level, likely noise (reference numerals 254-258).]
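The decision logic of FIGS. 3 and 4 is, at bottom, a short-term energy comparison against a threshold difference. The following sketch illustrates that comparison under assumptions of frame-based processing and a designer-chosen decibel threshold; the function and parameter names are hypothetical and not taken from the patent.

    import numpy as np

    def likely_speech(chan_a, chan_b, threshold_db=6.0):
        """Two-channel energy comparison in the spirit of FIGS. 3 and 4.

        chan_a: frame from the first microphone (FIG. 3) or the separated
                speech signal (FIG. 4).
        chan_b: frame from the second microphone or the noise signal.
        Returns True for "likely speech", False for "likely noise".
        """
        # Short-term energies, floored to avoid log(0) on silent frames.
        e_a = float(np.sum(np.square(chan_a))) + 1e-12
        e_b = float(np.sum(np.square(chan_b))) + 1e-12
        # Monitor the threshold difference between the two energy levels.
        return 10.0 * np.log10(e_a / e_b) >= threshold_db

Applied frame by frame (for example over 10-20 ms windows), frames returning True correspond to "likely speech" and the rest to "likely noise".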
`
`
`
[FIG. 5 (Sheet 5 of 13): A speech separation process (e.g., blind signal separation or independent component analysis) feeding a transmission subsystem (e.g., Bluetooth, wired, or IEEE 802.11); reference numerals 327-332.]
`
`
`
[FIG. 6 (Sheet 6 of 13): Speech separation process with processing, speaker side-tone, and transmission paths; reference numerals 351-362.]
`
`
`
[FIG. 7 (Sheet 7 of 13): A voice activity detector (405) and speech separation process (410) produce a speech signal and a noisy signal; noise estimation (413) and noise reduction (415) blocks, gated by the detector's control signal, clean the speech signal ahead of transmission (418, 420). A sketch of this gating follows.]
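FIG. 7 ties the detector's control signal to downstream noise estimation and reduction. One plausible realization, sketched below with hypothetical names and a simple spectral subtraction standing in for whatever the patent's blocks actually compute, updates the noise estimate only while the control signal reports noise:

    import numpy as np

    def update_noise_estimate(noise_mag, frame, voice_active, alpha=0.9):
        """Recursive average of the magnitude spectrum over noise-only
        frames; the estimate is frozen while speech is detected."""
        if not voice_active:
            noise_mag = alpha * noise_mag + (1.0 - alpha) * np.abs(np.fft.rfft(frame))
        return noise_mag

    def reduce_noise(frame, noise_mag, floor=0.05):
        """Spectral subtraction using the running noise estimate."""
        spec = np.fft.rfft(frame)
        mag = np.abs(spec)
        # Subtract the noise magnitude, keeping a small spectral floor.
        cleaned = np.maximum(mag - noise_mag, floor * mag)
        return np.fft.irfft(cleaned * np.exp(1j * np.angle(spec)), n=len(frame))

Here noise_mag would be initialized to zeros of length len(frame)//2 + 1 and carried across frames.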
`
`
`
[FIG. 8 (Sheet 8 of 13): Wireless headset (451) showing the placement of two microphones, Mic 1 and Mic 2.]
`
`
`
[FIG. 9 (Sheet 9 of 13): Flowchart of a separation process: position transducers (502); receive signals having noise and information; process signals into channels (506), including setting gain (517), detecting the transducer arrangement (518), rearranging coefficients (519), adapting filter coefficients (521), and applying filters (523); identify the channel with both noise and information (508), e.g. by measuring the noise signal against the combination signal (545); and process the identified channel to generate an information signal (510).]
`
`
`
[FIGS. 10 and 11 (Sheet 10 of 13): Graphs involving a scaling factor (sc_fact); no further text is recoverable from the drawings.]
`
`
`
[FIG. 12 (Sheet 11 of 13): Two-stage separation process 750. A reset monitor (765) supervises a learning stage (752), whose learning-stage filter coefficients (760, 762) are passed to an output stage with its own output-stage filter coefficients (770, 773); default coefficient sets (SET 1, SET 2, SET 3; 787) are available for re-initialization. A control-flow sketch follows.]
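FIG. 12, read together with the abstract, describes a separation filter split into an aggressively adapting learning stage and a slowly tracking output stage, with a reset monitor that restarts only the learning stage from a default coefficient set. Below is a minimal control-flow sketch; the instability bound, the LMS-like placeholder update, and all names are assumptions for illustration, not the patent's adaptation rule.

    import numpy as np

    class TwoStageSeparator:
        def __init__(self, default_sets, max_coeff=10.0):
            self.default_sets = default_sets        # e.g. SET 1, SET 2, SET 3 (787)
            self.learn = default_sets[0].copy()     # learning-stage coefficients
            self.out = default_sets[0].copy()       # output-stage coefficients
            self.max_coeff = max_coeff              # assumed instability bound

        def step(self, frame, mu_fast=0.1, mu_slow=0.01):
            # Learning stage: adjust aggressively to current acoustics.
            self.learn += mu_fast * self._update(frame, self.learn)
            unstable = (not np.all(np.isfinite(self.learn)) or
                        np.max(np.abs(self.learn)) > self.max_coeff)
            if unstable:
                # Reset monitor (765): restart only the learning stage, so
                # the output stage keeps producing a usable speech signal.
                self.learn = self.default_sets[0].copy()
            else:
                # Pass coefficients to the output stage, which adapts slowly.
                self.out += mu_slow * (self.learn - self.out)
            return np.convolve(frame, self.out, mode="same")

        def _update(self, frame, w):
            # Placeholder gradient; a real system would use its separation
            # rule (e.g. an ICA update) rather than this LMS-like term.
            y = np.convolve(frame, w, mode="same")
            e = frame - y
            return np.array([np.dot(e, np.roll(frame, k))
                             for k in range(len(w))]) / len(frame)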
`
`
`
[FIG. 13 (Sheet 12 of 13): System 800 in which input scaling (807) precedes a signal separation process (808) and post processing (810); a scaling monitor (812) adjusts the scale, and the result feeds transmission (825).]
`
`
`
[FIG. 14 (Sheet 13 of 13): Flowchart 900 of a wind-handling process: position a first microphone to face a different wind direction than a second microphone (902); monitor microphone signals for a low-frequency wind signature (904); deactivate or de-emphasize the microphone hit by wind (906); operate as a single-channel communication process; monitor microphone signals for the ending of the low-frequency wind signature (911); then reactivate the microphone and activate two-channel separation and post processing (913).]
`
`
`
ROBUST SEPARATION OF SPEECH SIGNALS IN A NOISY ENVIRONMENT

RELATED APPLICATIONS

This application is related to U.S. patent application Ser. No. 10/897,219, filed Jul. 22, 2004 (now U.S. Pat. No. 7,099,821, issued Aug. 29, 2006) and entitled "Separation of Target Acoustic Signals in a Multi-Transducer Arrangement", which is related to a co-pending Patent Cooperation Treaty application number PCT/US03/39593, entitled "System and Method for Speech Processing Using Improved Independent Component Analysis", filed Dec. 11, 2003, which claims priority to U.S. patent application Ser. No. 60/502,253, both of which are incorporated herein by reference.

FIELD OF THE INVENTION

The present invention relates to processes and methods for separating a speech signal from a noisy acoustic environment. More particularly, one example of the present invention provides a blind signal source process for separating a speech signal from a noisy environment.

BACKGROUND

An acoustic environment is often noisy, making it difficult to reliably detect and react to a desired informational signal. For example, a person may desire to communicate with another person using a voice communication channel. The channel may be provided, for example, by a mobile wireless handset, a walkie-talkie, a two-way radio, or other communication device. To improve usability, the person may use a headset or earpiece connected to the communication device. The headset or earpiece often has one or more ear speakers and a microphone. Typically, the microphone extends on a boom toward the person's mouth, to increase the likelihood that the microphone will pick up the sound of the person speaking. When the person speaks, the microphone receives the person's voice signal, and converts it to an electronic signal. The microphone also receives sound signals from various noise sources, and therefore also includes a noise component in the electronic signal. Since the headset may position the microphone several inches from the person's mouth, and the environment may have many uncontrollable noise sources, the resulting electronic signal may have a substantial noise component. Such substantial noise causes an unsatisfactory communication experience, and may cause the communication device to operate in an inefficient manner, thereby increasing battery drain.

In one particular example, a speech signal is generated in a noisy environment, and speech processing methods are used to separate the speech signal from the environmental noise. Such speech signal processing is important in many areas of everyday communication, since noise is almost always present in real-world conditions. Noise is defined as the combination of all signals interfering with or degrading the speech signal of interest. The real world abounds with multiple noise sources, including single point noise sources, which often transgress into multiple sounds resulting in reverberation. Unless separated and isolated from background noise, it is difficult to make reliable and efficient use of the desired speech signal. Background noise may include numerous noise signals generated by the general environment, signals generated by background conversations of other people, as well as reflections and reverberation generated from each of the signals. In communication where users often talk in noisy environments, it is desirable to separate the user's speech signals from background noise. Speech communication mediums, such as cell phones, speakerphones, headsets, cordless telephones, teleconferences, CB radios, walkie-talkies, computer telephony applications, computer and automobile voice command applications and other hands-free applications, intercoms, microphone systems and so forth, can take advantage of speech signal processing to separate the desired speech signals from background noise.

Many methods have been created to separate desired sound signals from background noise signals, including simple filtering processes. Prior art noise filters identify signals with predetermined characteristics as white noise signals, and subtract such signals from the input signals. These methods, while simple and fast enough for real time processing of sound signals, are not easily adaptable to different sound environments, and can result in substantial degradation of the speech signal sought to be resolved. The predetermined assumptions of noise characteristics can be over-inclusive or under-inclusive. As a result, portions of a person's speech may be considered "noise" by these methods and therefore removed from the output speech signals, while portions of background noise such as music or conversation may be considered non-noise by these methods and therefore included in the output speech signals.

In signal processing applications, typically one or more input signals are acquired using a transducer sensor, such as a microphone. The signals provided by the sensors are mixtures of many sources. Generally, the signal sources as well as their mixture characteristics are unknown. Without knowledge of the signal sources other than the general statistical assumption of source independence, this signal processing problem is known in the art as the "blind source separation (BSS) problem". The blind separation problem is encountered in many familiar forms. For instance, it is well known that a human can focus attention on a single source of sound even in an environment that contains many such sources, a phenomenon commonly referred to as the "cocktail-party effect." Each of the source signals is delayed and attenuated in some time-varying manner during transmission from source to microphone, where it is then mixed with other independently delayed and attenuated source signals, including multipath versions of itself (reverberation), which are delayed versions arriving from different directions. A person receiving all these acoustic signals may be able to listen to a particular set of sound sources while filtering out or ignoring other interfering sources, including multi-path signals.

Considerable effort has been devoted in the prior art to solve the cocktail-party effect, both in physical devices and in computational simulations of such devices. Various noise mitigation techniques are currently employed, ranging from simple elimination of a signal prior to analysis to schemes for adaptive estimation of the noise spectrum that depend on a correct discrimination between speech and non-speech signals. A description of these techniques is generally characterized in U.S. Pat. No. 6,002,776 (herein incorporated by reference). In particular, U.S. Pat. No. 6,002,776 describes a scheme to separate source signals where two or more microphones are mounted in an environment that contains an equal or lesser number of distinct sound sources. Using direction-of-arrival information, a first module attempts to extract the original source signals while any residual crosstalk between the channels is removed by a second module. Such an arrangement may be effective in separating spatially localized point sources with clearly defined direction-of-arrival, but fails to separate out a speech signal in a real-world spatially distributed noise environment for which no particular direction-of-arrival can be determined.
`
Methods, such as Independent Component Analysis ("ICA"), provide relatively accurate and flexible means for the separation of speech signals from noise sources. ICA is a technique for separating mixed source signals (components) which are presumably independent from each other. In its simplified form, independent component analysis operates an "un-mixing" matrix of weights on the mixed signals, for example multiplying the matrix with the mixed signals, to produce separated signals. The weights are assigned initial values, and then adjusted to maximize joint entropy of the signals in order to minimize information redundancy. This weight-adjusting and entropy-increasing process is repeated until the information redundancy of the signals is reduced to a minimum. Because this technique does not require information on the source of each signal, it is known as a "blind source separation" method. Blind separation problems refer to the idea of separating mixed signals that come from multiple independent sources.

Many popular ICA algorithms have been developed to optimize their performance, including a number which have evolved by significant modifications of those which only existed a decade ago. For example, the work described in A. J. Bell and T. J. Sejnowski, Neural Computation 7:1129-1159 (1995), and Bell, A. J., U.S. Pat. No. 5,706,402, is usually not used in its patented form. Instead, in order to optimize its performance, this algorithm has gone through several recharacterizations by a number of different entities. One such change includes the use of the "natural gradient", described in Amari, Cichocki, Yang (1996). Other popular ICA algorithms include methods that compute higher-order statistics such as cumulants (Cardoso, 1992; Comon, 1994; Hyvaerinen and Oja, 1997).
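The weight-adjusting, entropy-maximizing iteration described above has a compact standard form. The sketch below shows the natural-gradient Infomax update in the manner of Amari et al. (1996) and Bell and Sejnowski (1995) for an instantaneous mixture; it is offered only as an illustration of the technique, not as the patented implementation.

    import numpy as np

    def infomax_ica(x, n_iter=200, mu=0.01):
        """Natural-gradient Infomax ICA.

        x: array of shape (n_channels, n_samples) holding the mixed signals.
        Returns the separated signals and the un-mixing matrix of weights."""
        n = x.shape[0]
        w = np.eye(n)                        # initial weight values
        for _ in range(n_iter):
            y = w @ x                        # candidate separated signals
            g = np.tanh(y)                   # score nonlinearity
            # Natural-gradient step: W += mu * (I - E[g(y) y^T]) W.
            w += mu * (np.eye(n) - (g @ y.T) / x.shape[1]) @ w
        return w @ x, w

Iterating this update increases the joint entropy of the nonlinearly transformed outputs and, as the text notes, drives the information redundancy between channels toward a minimum.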
However, many known ICA algorithms are not able to effectively separate signals that have been recorded in a real environment which inherently include acoustic echoes, such as those due to room architecture related reflections. It is emphasized that the methods mentioned so far are restricted to the separation of signals resulting from a linear stationary mixture of source signals. The phenomenon resulting from the summing of direct path signals and their echoic counterparts is termed reverberation and poses a major issue in artificial speech enhancement and recognition systems. ICA algorithms may require long filters which can separate those time-delayed and echoed signals, thus precluding effective real time use.
`
Known ICA signal separation systems typically use a network of filters, acting as a neural network, to resolve individual signals from any number of mixed signals input into the filter network. That is, the ICA network is used to separate a set of sound signals into a more ordered set of signals, where each signal represents a particular sound source. For example, if an ICA network receives a sound signal comprising piano music and a person speaking, a two-port ICA network will separate the sound into two signals: one signal having mostly piano music, and another signal having mostly speech.
Another prior technique is to separate sound based on auditory scene analysis. In this analysis, vigorous use is made of assumptions regarding the nature of the sources present. It is assumed that a sound can be decomposed into small elements such as tones and bursts, which in turn can be grouped according to attributes such as harmonicity and continuity in time. Auditory scene analysis can be performed using information from a single microphone or from several microphones. The field of auditory scene analysis has gained more attention due to the availability of computational machine learning approaches, leading to computational auditory scene analysis or CASA. Although interesting scientifically, since it involves the understanding of human auditory processing, the model assumptions and the computational techniques are still in their infancy with respect to solving a realistic cocktail party scenario.
Other techniques for separating sounds operate by exploiting the spatial separation of their sources. Devices based on this principle vary in complexity. The simplest such devices are microphones that have highly selective, but fixed, patterns of sensitivity. A directional microphone, for example, is designed to have maximum sensitivity to sounds emanating from a particular direction, and can therefore be used to enhance one audio source relative to others. Similarly, a close-talking microphone mounted near a speaker's mouth may reject some distant sources. Microphone-array processing techniques are then used to separate sources by exploiting perceived spatial separation. These techniques are not practical because sufficient suppression of a competing sound source cannot be achieved due to their assumption that at least one microphone contains only the desired signal, which is not practical in an acoustic environment.
A widely known technique for linear microphone-array processing is often referred to as "beamforming". In this method the time difference between signals, due to the spatial difference of the microphones, is used to enhance the signal. More particularly, it is likely that one of the microphones will "look" more directly at the speech source, whereas the other microphone may generate a signal that is relatively attenuated. Although some attenuation can be achieved, the beamformer cannot provide relative attenuation of frequency components whose wavelengths are larger than the array. These techniques are methods for spatial filtering to steer a beam towards a sound source and therefore putting a null at the other directions. Beamforming techniques make no assumption on the sound source, but assume that the geometry between source and sensors, or the sound signal itself, is known for the purpose of dereverberating the signal or localizing the sound source.
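As a concrete illustration of the time-difference idea, the simplest beamformer delays each microphone signal so that arrivals from the look direction align, then sums. This generic delay-and-sum sketch (integer sample delays and a circular shift are assumed for brevity) is not drawn from the patent:

    import numpy as np

    def delay_and_sum(mics, delays):
        """Delay-and-sum beamformer.

        mics:   list of equal-length 1-D microphone signals.
        delays: per-channel arrival delay, in samples, toward the look direction.
        """
        # Advance each channel by its delay so the look direction adds
        # coherently (np.roll wraps at the edges; acceptable for a sketch).
        aligned = [np.roll(m, -d) for m, d in zip(mics, delays)]
        return np.mean(aligned, axis=0)

Noise from other directions adds incoherently and is attenuated, but, as noted above, components whose wavelengths exceed the array aperture see little relative attenuation.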
A known technique in robust adaptive beamforming referred to as "Generalized Sidelobe Canceling" (GSC) is discussed in Hoshuyama, O., Sugiyama, A., Hirano, A., A Robust Adaptive Beamformer for Microphone Arrays with a Blocking Matrix using Constrained Adaptive Filters, IEEE Transactions on Signal Processing, vol. 47, no. 10, pp. 2677-2684, October 1999. GSC aims at filtering out a single desired source signal z_i from a set of measurements x, as more fully explained in the GSC principle, Griffiths, L. J., Jim, C. W., An alternative approach to linearly constrained adaptive beamforming, IEEE Transactions on Antennas and Propagation, vol. 30, no. 1, pp. 27-34, January 1982. Generally, GSC predefines that a signal-independent beamformer c filters the sensor signals so that the direct path from the desired source remains undistorted whereas, ideally, other directions should be suppressed. Most often, the position of the desired source must be pre-determined by additional localization methods. In the lower, side path, an adaptive blocking matrix B aims at suppressing all components originating from the desired signal z_i so that only noise components appear at the output of B. From these, an adaptive interference canceller a derives an estimate for the remaining noise component in the output of c, by minimizing an estimate of the total output power E(z_i*z_i). Thus the fixed beamformer c and the interference canceller a jointly perform interference suppression. Since GSC requires the desired speaker to be confined to a limited tracking region, its applicability is limited to spatially rigid scenarios.
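For two microphones and a broadside source, the GSC structure reduces to a fixed sum path (the beamformer c), a difference path (the blocking matrix B), and an adaptive filter minimizing output power (the interference canceller a). The following is a generic textbook sketch of that structure, not the constrained filters of the cited Hoshuyama et al. paper:

    import numpy as np

    def gsc_two_mic(x1, x2, n_taps=16, mu=1e-3):
        """Two-microphone Generalized Sidelobe Canceller sketch."""
        d = 0.5 * (x1 + x2)          # fixed beamformer c: desired path undistorted
        u = x1 - x2                  # blocking matrix B: desired signal suppressed
        a = np.zeros(n_taps)         # adaptive interference canceller a
        z = np.zeros_like(d)
        for n in range(n_taps, len(d)):
            u_vec = u[n - n_taps:n][::-1]
            z[n] = d[n] - a @ u_vec  # subtract the noise estimate from the sum path
            a += mu * z[n] * u_vec   # LMS step minimizing E(z_i * z_i)
        return z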
`
Another known technique is a class of active-cancellation algorithms, which is related to sound separation. However, this technique requires a "reference signal," i.e., a signal derived from only one of the sources. Active noise-cancellation and echo cancellation techniques make extensive use of this technique, and the noise reduction is relative to the contribution of noise to a mixture, by filtering a known signal that contains only the noise and subtracting it from the mixture. This method assumes that one of the measured signals consists of one and only one source, an assumption which is not realistic in many real life settings.
Techniques for active cancellation that do not require a reference signal are called "blind" and are of primary interest in this application. They are now classified, based on the degree of realism of the underlying assumptions regarding the acoustic processes by which the unwanted signals reach the microphones. One class of blind active-cancellation techniques may be called "gain-based", also known as "instantaneous mixing": it is presumed that the waveform produced by each source is received by the microphones simultaneously, but with varying relative gains. (Directional microphones are most often used to produce the required differences in gain.) Thus, a gain-based system attempts to cancel copies of an undesired source in different microphone signals by applying relative gains to the microphone signals and subtracting, but not applying time delays or other filtering. Numerous gain-based methods for blind active cancellation have been proposed; see Herault and Jutten (1986), Tong et al. (1991), and Molgedey and Schuster (1994). The gain-based or instantaneous mixing assumption is violated when microphones are separated in space, as in most acoustic applications. A simple extension of this method is to include a time delay factor, but without any other filtering, which will work under anechoic conditions. However, this simple model of acoustic propagation from the sources to the microphones is of limited use when echoes and reverberation are present. The most realistic active-cancellation techniques currently known are "convolutive": the effect of acoustic propagation from each source to each microphone is modeled as a convolutive filter. These techniques are more realistic than gain-based and delay-based techniques because they explicitly accommodate the effects of inter-microphone separation, echoes and reverberation. They are also more general since, in principle, gains and delays are special cases of convolutive filtering.
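A gain-based canceller in the sense just described amounts to a scaled subtraction with no delay or filtering; the sketch below (hypothetical names) makes the assumption, and its fragility, explicit:

    import numpy as np

    def gain_based_cancel(m1, m2, g):
        """Cancel an unwanted source under the instantaneous-mixing assumption.

        m1, m2: microphone signals; g: relative gain with which the unwanted
        source appears in m1 versus m2. No time delays or filtering are
        applied, so this fails once the microphones are separated in space."""
        return m1 - g * m2

Adding a pure time-delay term extends this to anechoic conditions, while the convolutive model below subsumes both gains and delays as special cases.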
Convolutive blind cancellation techniques have been described by many researchers including Jutten et al. (1992), by Van Compernolle and Van Gerven (1992), by Platt and Faggin (1992), Bell and Sejnowski (1995), Torkkola (1996), Lee (1998) and by Parra et al. (2000). In the mathematical model predominantly used in the case of multiple channel observations through an array of microphones, the multiple source model can be formulated as follows:
`
x_i(t) = \sum_{l=0}^{L} \sum_{j=1}^{m} a_{ij}(l) \, s_j(t-l) + n_i(t)
`
where x_i(t) denotes the observed data, s_j(t) is the hidden source signal, n_i(t) is the additive sensory noise signal and a_{ij}(l) is the mixing filter. The parameter m is the number of sources, L is the convolution order and depends on the environment acoustics, and t indicates the time index. The first summation is due to filtering of the sources in the environment and the second summation is due to the mixing of the different sources. Most of the work on ICA has been centered on algorithms for instantaneous mixing scenarios in which the first summation is removed and the task is simplified to i
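To make the model concrete, the following sketch generates observations x_i(t) from hidden sources under the convolutive formulation above; the array shapes and names are illustrative assumptions only:

    import numpy as np

    def convolutive_mix(s, a, noise=None):
        """Convolutive mixing: x_i(t) = sum_l sum_j a_ij(l) s_j(t-l) + n_i(t).

        s: (m, T) hidden source signals.
        a: (k, m, L+1) mixing filters, one length-(L+1) filter per
           sensor/source pair.
        noise: optional (k, T) additive sensory noise."""
        k, m, _ = a.shape
        T = s.shape[1]
        x = np.zeros((k, T))
        for i in range(k):                  # each microphone observation
            for j in range(m):              # filtered contribution of each source
                x[i] += np.convolve(s[j], a[i, j])[:T]
        if noise is not None:
            x += noise
        return x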