`
(12) United States Patent
Balan et al.

(10) Patent No.: US 7,146,315 B2
(45) Date of Patent: Dec. 5, 2006
`
`(54) MULTICHANNEL VOICE DETECTION IN
`ADVERSE ENVIRONMENTS
`
(75) Inventors: Radu Victor Balan, Levittown, PA (US); Justinian Rosca, Princeton Junction, NJ (US); Christophe Beaugeant, Munich (DE)

(73) Assignee: Siemens Corporate Research, Inc., Princeton, NJ (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 925 days.

(21) Appl. No.: 10/231,613

(22) Filed: Aug. 30, 2002

(65) Prior Publication Data
US 2004/0042626 A1, Mar. 4, 2004
`
(51) Int. Cl.
G10L 15/20 (2006.01)
(52) U.S. Cl. ..................... 704/233; 704/247; 381/94.3; 381/56; 381/110; 379/406.04
(58) Field of Classification Search ........ 704/225-228, 704/233, 247, 275; 381/94.3, 56, 110; 379/406.04
See application file for complete search history.
(56) References Cited

U.S. PATENT DOCUMENTS
`5,012.519 A * 4/1991 Adlersberg et al. ......... TO4/226
`5,276,765 A *
`1/1994 Freeman et al. ............ TO4,233
`5,550,924. A * 8/1996 Helf et al. ......
`... 381 94.3
`5,563,944. A * 10/1996 Hasegawa ....... ... 379,406.04
`5,839,101 A * 11/1998 Vahatalo et al. ............ TO4/226
`6,011,853 A
`1/2000 Koski et al. .................. 381.56
`6,070,140 A * 5/2000 Tran ........................... 704/275
`
`6,088,668 A *
`7/2000 Zack .......................... 704,225
`6,097,820 A *
`8/2000 Turner ...........
`... 381 (94.3
`6,141,426 A * 10/2000 Stobba et al. .....
`... 381.110
`6,363,345 B1* 3/2002 Marash et al. ....
`... 704/226
`6,377,637 B1 * 4/2002 Berdugo ...........
`... 375,346
`2003/0004720 A1
`1/2003 Garudadri et al. .......... TO4,247
`
FOREIGN PATENT DOCUMENTS

EP   1081985   3/2001
`
OTHER PUBLICATIONS

Rosca et al.: "Multichannel voice detection in adverse environments," XI European Signal Processing Conference EUSIPCO, Sep. 2, 2002, XP008025382.

(Continued)

Primary Examiner: Vijay B. Chawan
(74) Attorney, Agent, or Firm: Donald B. Paschburg; F. Chau & Associates, LLC.
`
`(57)
`
`ABSTRACT
`
A multichannel source activity detection system, e.g., a voice activity detection (VAD) system, and method that exploits spatial localization of a target audio source is provided. The method includes the steps of receiving a mixed sound signal by at least two microphones; Fast Fourier transforming each received mixed sound signal into the frequency domain; filtering the transformed signals to output a signal corresponding to a spatial signature of a source; summing an absolute value squared of the filtered signal over a predetermined range of frequencies; and comparing the sum to a threshold to determine if a voice is present. Additionally, the filtering step includes multiplying the transformed signals by an inverse of a noise spectral power matrix, a vector of channel transfer function ratios, and a source signal spectral power.
`
`22 Claims, 6 Drawing Sheets
`
`
`
`610
`
`88
`1.
`
`
`
`62
`A.
`FILTER
`(A, B,
`
`8 20 2
`S.
`
`22
`
`2. o
`
`filter
`(A, B
`t
`FILTERS
`
`\
`
`CALIBRATION UN
`
`His LEARNR
`
`---time SPEC, SB
`
`t; F USER
`SEAKNG
`
`WA
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`Seas
`
`LEARNR
`
`
`
`Page 1 of 15
`
`GOOGLE EXHIBIT 1005
`
`
`
OTHER PUBLICATIONS

Aalburg et al.: "Single- and two-channel noise reduction for robust speech recognition in car," ISCA Workshop Multi-Modal Dialogue in Mobile Environments, Jun. 2002, XP002264041.
Balan, R. et al.: "Microphone array speech enhancement by Bayesian estimation of spectral amplitude and phase," Aug. 2002, pp. 209-213, XP010635740.
Philippe Renevey et al.: "Entropy Based Voice Activity Detection in very noisy conditions," Eurospeech 2001 Proceedings, vol. 3, Sep. 2001, pp. 1887-1890, XP007004739.
Srinivasan, K. et al.: "Voice activity detection for cellular networks," Proceedings of the IEEE Workshop on Speech Coding for Telecommunications, Oct. 1993, pp. 85-86, XP002204645.
International Search Report.
* cited by examiner
`
`
`
`
[Drawing sheets 1-6 omitted; only the figure identifications are recoverable:]
[FIG. 1A and FIG. 1B (Sheet 1): schematic diagrams of the two in-car microphone scenarios.]
[FIG. 2 (Sheet 2): block diagram of the first-embodiment VAD system with filter A, summers, VAD comparator, spectral subtraction ("SPEC SUB"), estimator, and learner ("LEARNR") modules.]
[FIG. 3 (Sheet 3): chart of the error types considered for evaluating VAD methods.]
[FIG. 4 (Sheet 4): chart of VAD frame error rates by error type and total error, medium noise, for AMR1, AMR2, and the two-channel VAD (boost factor 100).]
[FIG. 5 (Sheet 5): chart of VAD frame error rates by error type and total error, high noise.]
[FIG. 6 (Sheet 6): block diagram of the second-embodiment VAD system with calibration unit, learner, spectral subtraction, and "# of user speaking" output.]
`
`
`
`MULTICHANNEL VOICE DETECTION IN
`ADVERSE ENVIRONMENTS
`
`BACKGROUND OF THE INVENTION
`
`2
`phones; Fast Fourier transforming each received mixed
`Sound signal into the frequency domain; filtering the trans
`formed signals to output a signal corresponding to a spatial
`signature for each of the transformed signals; Summing an
`absolute value squared of the filtered signals over a prede
`termined range of frequencies; and comparing the Sum to a
`threshold to determine if a voice is present, wherein if the
`Sum is greater than or equal to the threshold, a voice is
`present, and if the Sum is less than the threshold, a voice is
`not present. Additionally, the filtering step includes multi
`plying the transformed signals by an inverse of a noise
`spectral power matrix, a vector of channel transfer function
`ratios, and a source signal spectral power.
`According to another aspects of the present invention, a
`method for determining if a voice is present in a mixed
`Sound signal includes the steps of receiving the mixed Sound
`signal by at least two microphones; Fast Fourier transform
`ing each received mixed sound signal into the frequency
`domain; filtering the transformed signals to output signals
`corresponding to a spatial signature for each of a predeter
`mined number of users; Summing separately for each of the
`users an absolute value squared of the filtered signals over
`a predetermined range of frequencies; determining a maxi
`mum of the Sums; and comparing the maximum Sum to a
`threshold to determine if a voice is present, wherein if the
`Sum is greater than or equal to the threshold, a voice is
`present, and if the Sum is less than the threshold, a voice is
`not present, wherein if a voice is present, a specific user
`associated with the maximum sum is determined to be the
`active speaker. The threshold is adapted with the received
`mixed sound signal.
`According to a further embodiment of the present inven
`tion, a voice activity detector for determining if a voice is
`present in a mixed sound signal is provided. The Voice
`activity detector including at least two microphones for
`receiving the mixed sound signal; a Fast Fourier transformer
`for transforming each received mixed sound signal into the
`frequency domain; a filter for filtering the transformed
`signals to output a signal corresponding to an estimated
`spatial signature of a speaker; a first Summer for Summing an
`absolute value Squared of the filtered signal over a prede
`termined range of frequencies; and a comparator for com
`paring the Sum to a threshold to determine if a voice is
`present, wherein if the Sum is greater than or equal to the
`threshold, a voice is present, and if the Sum is less than the
`threshold, a voice is not present.
`According to yet another aspect of the present invention,
`a voice activity detector for determining if a voice is present
`in a mixed sound signal includes at least two microphones
`for receiving the mixed sound signal; a Fast Fourier trans
`former for transforming each received mixed sound signal
`into the frequency domain; at least one filter for filtering the
`transformed signals to output a signal corresponding to a
`spatial signature of a speaker for each of a predetermined
`number of users; at least one first Summer for Summing
`separately for each of the users an absolute value Squared of
`the filtered signal over a predetermined range of frequencies;
`a processor for determining a maximum of the Sums; and a
`comparator for comparing the maximum sum to a threshold
`to determine if a voice is present, wherein if the sum is
`greater than or equal to the threshold, a voice is present, and
`if the Sum is less than the threshold, a voice is not present,
`wherein if a voice is present, a specific user associated with
`the maximum sum is determined to be the active speaker.
`
`10
`
`15
`
`1. Field of the Invention
`The present invention relates generally to digital signal
`processing systems, and more particularly, to a system and
`method for voice activity detection in adverse environments,
`e.g., noisy environments.
`2. Description of the Related Art
Voice (and more generally acoustic source) activity detection (VAD) is a cornerstone problem in signal processing practice, and often, it has a stronger influence on the overall performance of a system than any other component. Speech coding, multimedia communication (voice and data), speech enhancement in noisy conditions and speech recognition are important applications where a good VAD method or system can substantially increase the performance of the respective system. The role of a VAD method is basically to extract features of an acoustic signal that emphasize differences between speech and noise and then classify them to take a final VAD decision. The variety and the varying nature of speech and background noises make the VAD problem challenging.
Traditionally, VAD methods use energy criteria such as SNR (signal-to-noise ratio) estimation based on long-term noise estimation, such as disclosed in K. Srinivasan and A. Gersho, Voice activity detection for cellular networks, in Proc. of the IEEE Speech Coding Workshop, October 1993, pp. 85-86. Improvements proposed use a statistical model of the audio signal and derive the likelihood ratio as disclosed in Y. D. Cho, K. Al-Naimi, and A. Kondoz, Improved voice activity detection based on a smoothed statistical likelihood ratio, in Proceedings ICASSP 2001, IEEE Press, or compute the kurtosis as disclosed in R. Goubran, E. Nemer and S. Mahmoud, SNR estimation of speech signals using subbands and fourth-order statistics, IEEE Signal Processing Letters, vol. 6, no. 7, pp. 171-174, July 1999. Alternatively, other VAD methods attempt to extract robust features (e.g., the presence of a pitch, the formant shape, or the cepstrum) and compare them to a speech model. Recently, multiple channel (e.g., multiple microphones or sensors) VAD algorithms have been investigated to take advantage of the extra information provided by the additional sensors.
`
`25
`
`35
`
`40
`
`45
`
`SUMMARY OF THE INVENTION
`
Detecting when voices are or are not present is an outstanding problem for speech transmission, enhancement and recognition. Here, a novel multichannel source activity detection system, e.g., a voice activity detection (VAD) system, that exploits spatial localization of a target audio source is provided. The VAD system uses an array signal processing technique to maximize the signal-to-interference ratio for the target source, thus decreasing the activity detection error rate. The system uses outputs of at least two microphones placed in a noisy environment, e.g., a car, and outputs a binary signal (0/1) corresponding to the absence (0) or presence (1) of a driver's and/or passenger's voice signals. The VAD output can be used by other signal processing components, for instance, to enhance the voice signal.
According to one aspect of the present invention, a method for determining if a voice is present in a mixed sound signal is provided. The method includes the steps of receiving the mixed sound signal by at least two micro-
`
`50
`
`55
`
`60
`
`65
`
`Page 9 of 15
`
`
`
`
`3
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`The above and other objects, features, and advantages of
`the present invention will become more apparent in light of
`the following detailed description when taken in conjunction
`with the accompanying drawings in which:
`FIGS. 1A and 1B are schematic diagrams illustrating two
`scenarios for implementing the system and method of the
`present invention, where FIG. 1A illustrates a scenario using
`two fixed inside-the-car microphones and FIG. 1B illustrates
`the scenario of using one fixed microphone and a second
`microphone contained in a mobile phone;
FIG. 2 is a block diagram illustrating a voice activity detection (VAD) system and method according to a first embodiment of the present invention;
`FIG. 3 is a chart illustrating the types of errors considered
`for evaluating VAD methods;
`FIG. 4 is a chart illustrating frame error rates by error type
`and total error for a medium noise, distant microphone
`scenario;
`FIG. 5 is a chart illustrating frame error rates by error type
`and total error for a high noise, distant microphone scenario;
`and
`FIG. 6 is a block diagram illustrating a voice activity
`detection (VAD) system and method according to a second
`embodiment of the present invention.
`
`5
`
`10
`
`15
`
`25
`
`DETAILED DESCRIPTION OF THE
`PREFERRED EMBODIMENTS
`
`30
`
`where k is the frame index, and w is the frequency index.
`More compactly, this model can be rewritten as
`XFKSN
`
`(3)
`
`Preferred embodiments of the present invention will be
`described herein below with reference to the accompanying
`drawings. In the following description, well-known func
`tions or constructions are not described in detail to avoid
`obscuring the invention in unnecessary detail.
`A multichannel VAD (Voice Activity Detection) system
`and method is provided for determining whether speech is
`present or not in a signal. Spatial localization is the key
`underlying the present invention, which can be used equally
`for voice and non-voice signals of interest. To illustrate the
present invention, assume the following scenario: the target source (such as a person speaking) is located in a noisy
`environment, and two or more microphones record an audio
`mixture. For example as shown in FIGS. 1A and 1B, two
`signals are measured inside a car by two microphones where
`one microphone 102 is fixed inside the car and the second
`microphone can either be fixed inside the car 104 or can be
`in a mobile phone 106. Inside the car, there is only one
`speaker, or if more persons are present, only one speaks at
`a time. Assume d is the number of users. Noise is assumed
`diffused, but not necessarily uniform, i.e., the sources of
`noise are not spatially well-localized, and the spectral coher
`ence matrix may be time-varying. Under this scenario, the
`system and method of the present invention blindly identi
`fies a mixing model and outputs a signal corresponding to a
`spatial signature with the largest signal-to-interference-ratio
`(SIR) possibly obtainable through linear filtering. Although
`the output signal contains large artifacts and is unsuitable for
`signal estimation, it is ideal for signal activity detection.
`To understand the various features and advantages of the
`present invention, a detailed description of an exemplary
implementation will now be provided. In Section 1, the mixing model and main statistical assumptions will be provided. Section 2 shows the filter derivations and presents the overall VAD architecture. Section 3 addresses the blind model identification problem. Section 4 discusses the evalu-
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
where X, K, N are complex vectors. The vector K represents the spatial signature of the source S.
The following assumptions are made: (1) the source signal s(t) is statistically independent of the noise signals n_i(t), for all i; (2) the mixing parameters K(ω) are either time-invariant, or slowly time-varying; (3) S(ω) is a zero-mean stochastic process with spectral power R_s(ω) = E[|S|²]; and (4) (N_1, N_2, ..., N_D) is a zero-mean stochastic signal with noise spectral power matrix R_n(ω).
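To make the model concrete, the following sketch (illustrative Python/NumPy, not part of the patent; the direct-path form of K and all numerical values are assumptions) synthesizes one frame of the frequency-domain mixture of Eq. (3):

```python
import numpy as np

rng = np.random.default_rng(0)

D = 2           # number of microphones
n_freq = 129    # one-sided FFT bins for a 256-sample frame
w = np.linspace(0.0, np.pi, n_freq)

# Direct-path spatial signature K(w) with K_1 = 1 (Eq. 2);
# a and theta are illustrative attenuation/delay values.
a, theta = 0.8, 2.0
K = np.stack([np.ones(n_freq), a * np.exp(-1j * w * theta)])

# One frame of source spectrum S(k, w) and diffuse noise N(k, w).
S = rng.standard_normal(n_freq) + 1j * rng.standard_normal(n_freq)
N = 0.5 * (rng.standard_normal((D, n_freq)) + 1j * rng.standard_normal((D, n_freq)))

# Frequency-domain mixing model X = K S + N (Eq. 3), per bin.
X = K * S + N
```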
2. Filter Derivations and VAD Architecture
In this section, an optimal-gain filter is derived and implemented in the overall system architecture of the VAD system.
A linear filter A applied on X produces:

$$Z = A X = A K S + A N$$

The linear filter that maximizes the SNR (SIR) is desired. The output SNR (oSNR) achieved by A is:
`
$$\mathrm{oSNR}(A) = \frac{E[|A K S|^2]}{E[|A N|^2]} = \frac{R_s\, A K K^* A^*}{A R_n A^*} \tag{4}$$
`
Maximizing oSNR over A results in a generalized eigenvalue problem, $A R_n = \lambda A K K^*$, whose maximizer can be obtained based on the Rayleigh quotient theory, as is known in the art:

$$A = \omega\, K^* R_n^{-1}$$

where ω is an arbitrary nonzero scalar. This expression suggests to run the output Z through an energy detector with
`
`
`
`
an input dependent threshold in order to decide whether the source signal is present or not in the current data frame. The voice activity detection (VAD) decision becomes:
`
`6
`are chosen uses the Frobenius norm, as is known in the art,
`and where R is a measured signal spectral covariance
`matrix. Thus, the following should be minimized:
`
`VAD(k) =
`
`1 if XZ. >
`t
`O if otherwise
`
`(5)
`
`i(d2, ... , ap. 02, ... , op) =Xtrace{(R – R, - R.KK)}
`
`(9)
`
`10
`
`15
`
`25
`
`30
`
`35
`
`40
`
where the threshold $\tau = B \sum_{\omega} \sum_{i=1}^{D} |X_i(k,\omega)|^2$ and B > 0 is a constant boosting factor. Since on the one hand A is determined up to a multiplicative constant, and on the other hand, the maximized output energy is desired when the signal is present, it is determined that ω = R_s, the estimated signal spectral power. The filter becomes:

$$A = R_s K^* R_n^{-1} \tag{6}$$

Based on the above, the overall architecture of the VAD of the present invention is presented in FIG. 2. The VAD decision is based on equations 5 and 6. K, R_s, and R_n are estimated from data, as will be described below.
Referring to FIG. 2, signals x_1 and x_2 are input from microphones 102 and 104 on channels 106 and 108, respectively. Signals x_1 and x_2 are time domain signals. The signals x_1, x_2 are transformed into frequency domain signals, X_1 and X_2, respectively, by a Fast Fourier Transformer 110 and are outputted to filter A 120 on channels 112 and 114. Filter 120 processes the signals X_1, X_2 based on Eq. (6) described above to generate output Z corresponding to a spatial signature for each of the transformed signals. The variables R_s, R_n, and K which are supplied to filter 120 will be described in detail below. The output Z is processed and summed over a range of frequencies in summer 122 to produce a sum Σ|Z|², i.e., an absolute value squared of the filtered signal. The sum Σ|Z|² is then compared to a threshold τ in comparator 124 to determine if a voice is present or not. If the sum is greater than or equal to the threshold τ, a voice is determined to be present and comparator 124 outputs a VAD signal of 1. If the sum is less than the threshold τ, a voice is determined not to be present and the comparator outputs a VAD signal of 0.
To determine the threshold, frequency domain signals X_1, X_2 are inputted to a second summer 116, where an absolute value squared of signals X_1, X_2 is summed over the number of microphones D and that sum is summed over a range of frequencies. The result is then multiplied by boosting factor B through multiplier 118 to determine the threshold τ.
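A minimal sketch of this per-frame decision follows (illustrative Python, assuming estimates of K, R_s, and the inverse noise matrices are already available; the function name and the default boost of 100 from Section 5.1 are assumptions, not the patent's implementation):

```python
import numpy as np

def vad_decision(X, K, Rs, Rn_inv, boost=100.0):
    """One-frame VAD decision per Eqs. (5)-(6).

    X      : (D, n_freq) complex spectra of the current frame
    K      : (D, n_freq) channel transfer function ratios (K[0] == 1)
    Rs     : (n_freq,)   estimated source spectral power
    Rn_inv : (n_freq, D, D) inverse noise spectral power matrices
    """
    n_freq = X.shape[1]
    Z = np.empty(n_freq, dtype=complex)
    for f in range(n_freq):
        A = Rs[f] * (K[:, f].conj() @ Rn_inv[f])   # A = Rs K* Rn^{-1} (Eq. 6)
        Z[f] = A @ X[:, f]                          # Z = A X
    energy = np.sum(np.abs(Z) ** 2)                 # sum over frequencies
    tau = boost * np.sum(np.abs(X) ** 2)            # input-dependent threshold
    return 1 if energy >= tau else 0
```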
3. Mixing Model Identification
Now, the estimators for the transfer function ratio K and spectral power densities R_s and R_n are presented. The most recently available VAD signal is also employed in updating the values of K, R_s, and R_n.
3.1 Adaptive Model-Based Estimator of K
With continued reference to FIG. 2, the adaptive estimator 130 estimates a value of K, the user's spatial signature, that makes use of a direct path mixing model to reduce the number of parameters:

$$K_l(\omega) = a_l\, e^{-i \omega \theta_l}, \qquad 2 \leq l \leq D \tag{7}$$
`
`50
`
`55
`
`60
`
The parameters (a_l, θ_l) that best fit into

$$\hat{R} \approx R_n + R_s K K^* \tag{8}$$

are chosen using the Frobenius norm, as is known in the art, where R̂ is a measured signal spectral covariance matrix. Thus, the following should be minimized:

$$I(a_2, \ldots, a_D, \theta_2, \ldots, \theta_D) = \sum_{\omega} \mathrm{trace}\{(\hat{R} - R_n - R_s K K^*)^2\} \tag{9}$$
`
The summation above is across frequencies because the same parameters (a_l, θ_l), 2 ≤ l ≤ D, should explain all frequencies. The gradient of I evaluated at the current estimate (a_l, θ_l), 2 ≤ l ≤ D, is:
`
$$\frac{\partial I}{\partial a_l} = -4 \sum_{\omega} R_s\, \mathrm{real}(K^* E v_l) \tag{10}$$

$$\frac{\partial I}{\partial \theta_l} = -2 a_l \sum_{\omega} \omega R_s\, \mathrm{imag}(K^* E v_l) \tag{11}$$

where $E = \hat{R} - R_n - R_s K K^*$ and v_l is the D-vector of zeros everywhere except on the l-th entry, where it is $e^{-i\omega\theta_l}$: $v_l = [0 \; \cdots \; 0 \; e^{-i\omega\theta_l} \; 0 \; \cdots \; 0]^T$. Then, the updating rule is given by
`
$$a_l \leftarrow a_l - \varepsilon \frac{\partial I}{\partial a_l} \tag{12}$$

$$\theta_l \leftarrow \theta_l - \varepsilon \frac{\partial I}{\partial \theta_l} \tag{13}$$

with 0 < ε ≪ 1 the learning rate.
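One gradient step of Eqs. (10)-(13) might be sketched as follows (hypothetical Python; entry 0 of the parameter vectors stands for the fixed first channel, a_1 = 1 and θ_1 = 0, and the default ε = 0.01 mirrors Section 5.1 but is otherwise an assumption):

```python
import numpy as np

def update_K_params(a, theta, Rx_hat, Rn, Rs, w, eps=0.01):
    """One gradient step on the direct-path parameters per Eqs. (10)-(13).

    a, theta : (D,) current estimates; entry 0 is fixed (K_1 = 1)
    Rx_hat   : (n_freq, D, D) measured signal spectral covariance
    Rn       : (n_freq, D, D) noise spectral power matrices
    Rs       : (n_freq,) source spectral power
    w        : (n_freq,) frequency grid
    """
    D = len(a)
    K = a[:, None] * np.exp(-1j * np.outer(theta, w))        # (D, n_freq)
    grad_a, grad_t = np.zeros(D), np.zeros(D)
    for f in range(len(w)):
        E = Rx_hat[f] - Rn[f] - Rs[f] * np.outer(K[:, f], K[:, f].conj())
        for l in range(1, D):
            v = np.zeros(D, dtype=complex)
            v[l] = np.exp(-1j * w[f] * theta[l])             # v_l as defined above
            g = K[:, f].conj() @ (E @ v)
            grad_a[l] += -4.0 * Rs[f] * g.real               # Eq. (10)
            grad_t[l] += -2.0 * a[l] * w[f] * Rs[f] * g.imag # Eq. (11)
    return a - eps * grad_a, theta - eps * grad_t            # Eqs. (12)-(13)
```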
3.2 Estimation of Spectral Power Densities
The noise spectral power matrix, R_n, is initially measured through a first learning module 132. Thereafter, the estimation of R_n is based on the most recently available VAD signal, generated by comparator 124, simply by the following:

$$R_n^{\text{new}} = \begin{cases} (1-\beta) R_n^{\text{old}} + \beta X X^* & \text{if voice not present} \\ R_n^{\text{old}} & \text{if voice present} \end{cases} \tag{14}$$

where β is a floor-dependent constant. After R_n is determined by Eq. (14), the result is sent to update filter 120.
The signal spectral power R_s is estimated through spectral subtraction. The measured signal spectral covariance matrix, R̂, is determined by a second learning module 126 based on the frequency-domain input signals X_1, X_2 and is input to spectral subtractor 128 along with R_n, which is generated from the first learning module 132. R_s is then determined by the following:

$$R_s = \begin{cases} \hat{R}_{11} - R_{n,11} & \text{if } \hat{R}_{11} > \beta_{ss} R_{n,11} \\ (\beta_{ss} - 1)\, R_{n,11} & \text{if otherwise} \end{cases} \tag{15}$$

where β_ss > 1 is a floor-dependent constant. After R_s is determined by Eq. (15), the result is sent to update filter 120.
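A sketch of these two estimators (illustrative Python; the function names are assumptions, and the defaults β = 0.2 and β_ss = 1.1 mirror the values reported in Section 5.1):

```python
import numpy as np

def update_noise_power(Rn, X, voice_present, beta=0.2):
    """Eq. (14): update the noise spectral power matrix when voice is absent.

    Rn : (n_freq, D, D) current noise spectral power matrices
    X  : (D, n_freq)    spectra of the current frame
    """
    if voice_present:
        return Rn
    XX = np.einsum('if,jf->fij', X, X.conj())   # X X* for every frequency bin
    return (1.0 - beta) * Rn + beta * XX

def estimate_signal_power(Rx11, Rn11, beta_ss=1.1):
    """Eq. (15): source spectral power via spectral subtraction with a floor.

    Rx11, Rn11 : (n_freq,) first-channel measured and noise spectral powers
    """
    return np.where(Rx11 > beta_ss * Rn11, Rx11 - Rn11, (beta_ss - 1.0) * Rn11)
```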
`
`
`
`
`7
4. VAD Performance Criteria
To evaluate the performance of the VAD system of the present invention, the possible errors that can be obtained when comparing the VAD signal with the true source presence signal must be defined. Errors take into account the context of the VAD prediction, i.e., the true VAD state (desired signal present or absent) before and after the state of the present data frame, as follows (see FIG. 3): (1) noise detected as useful signal (e.g., speech); (2) noise detected as signal before the true signal actually starts; (3) signal detected as noise in a true noise context; (4) signal detection delayed at the beginning of signal; (5) noise detected as signal after the true signal subsides; (6) noise detected as signal in between frames with signal presence; (7) signal detected as noise at the end of the active signal part; and (8) signal detected as noise during signal activity.
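One way to make this taxonomy concrete is the following sketch (illustrative Python; the one-frame context window and the boundary handling are assumptions, since the patent does not fix the context length):

```python
def classify_frame_errors(true_vad, pred_vad):
    """Label each mismatched frame with one of the eight error types above."""
    errors = []
    n = len(true_vad)
    for k in range(n):
        t, p = true_vad[k], pred_vad[k]
        if t == p:
            continue
        prev = true_vad[k - 1] if k > 0 else 0
        nxt = true_vad[k + 1] if k < n - 1 else 0
        if p == 1:  # noise detected as signal: types 1, 2, 5, 6
            if prev == 0 and nxt == 1:
                errors.append(2)   # before the true signal starts
            elif prev == 1 and nxt == 0:
                errors.append(5)   # after the true signal subsides
            elif prev == 1 and nxt == 1:
                errors.append(6)   # between frames with signal presence
            else:
                errors.append(1)   # in a pure-noise context
        else:       # signal detected as noise: types 3, 4, 7, 8
            if prev == 0 and nxt == 1:
                errors.append(4)   # delayed detection at signal onset
            elif prev == 1 and nxt == 0:
                errors.append(7)   # at the end of the active part
            elif prev == 1 and nxt == 1:
                errors.append(8)   # during signal activity
            else:
                errors.append(3)   # in a true-noise context
    return errors
```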
The prior art literature is mostly concerned with four error types showing that speech is misclassified as noise (types 3, 4, 7, 8 above). Some only consider errors 1, 4, 5, 8; these are called "noise detected as speech" (1), "front-end clipping" (4), "noise interpreted as speech in passing from speech to noise" (5), and "midspeech clipping" (8), as described in F. Beritelli, S. Casale, and G. Ruggeri, "Performance evaluation and comparison of ITU-T/ETSI voice activity detectors," in Proceedings ICASSP, 2001, IEEE Press.
The evaluation of the present invention aims at assessing the VAD system and method in three problem areas: (1) speech transmission/coding, where error types 3, 4, 7, and 8 should be as small as possible so that speech is rarely if ever clipped and all data of interest (voice but not noise) is transmitted; (2) speech enhancement, where error types 3, 4, 7, and 8 should be as small as possible; nonetheless, errors 1, 2, 5 and 6 are also weighted in depending on how noisy and non-stationary noise is in common environments of interest; and (3) speech recognition (SR), where all errors are taken into account. In particular, error types 1, 2, 5 and 6 are important for non-restricted SR. A good classification of background noise as non-speech allows SR to work effectively on the frames of interest.
5. Experimental Results
Three VAD algorithms were compared: (1-2) implementations of two conventional adaptive multi-rate (AMR) algorithms, AMR1 and AMR2, targeting discontinuous transmission of voice; and (3) a two-channel (TwoCh) VAD system following the approach of the present invention using D = 2 microphones. The algorithms were evaluated on real data recorded in a car environment in two setups, where the two sensors, i.e., microphones, are either closeby or distant. For each case, car noise while driving was recorded separately and additively superimposed on car voice recordings from static situations. The average input SNR for the "medium noise" test suite was 0 dB for the closeby case, and -3 dB for the distant case. In both cases, a second test suite, "high noise," was also considered, where the input SNR dropped another 3 dB.
5.1 Algorithm Implementation
The implementation of the AMR1 and AMR2 algorithms is based on the conventional GSM AMR speech encoder version 7.3.0. The VAD algorithms use results calculated by the encoder, which may depend on the encoder input mode; therefore a fixed mode of MRDTX was used here. The algorithms indicate whether each 20 ms frame (160 samples frame length at 8 kHz) contains signals that should be transmitted, i.e., speech, music or information tones. The output of the VAD algorithm is a boolean flag indicating presence of such signals.
`
`8
`For the TwoCh VAD based on the MaxSNR filter, adap
`tive model-based K estimator and spectral power density
`estimators as presented above, the following parameters
`were used: boost factor B=100, the learning rates ca-0.01
`(in Kestimation), Ge=0.2 (for R.), and cass -1.1 (in Spectral
`Subtraction). Processing was done blockwise with a frame
`size of 256 samples and a time step of 160 samples.
5.2 Results
Ideal VAD labeling on car voice data only was obtained with a simple power level voice detector. Then, overall VAD errors with the three algorithms under study were obtained. Errors represent the average percentage of frames with decision different from ideal VAD relative to the total number of frames processed.
FIGS. 4 and 5 present individual and overall errors obtained with the three algorithms in the medium and high noise scenarios. Table 1 summarizes average results obtained when comparing the TwoCh VAD with AMR2. Note that in the described tests, the mono AMR algorithms utilized the best (highest SNR) of the two channels (which was chosen by hand).
`
TABLE 1

Data                   Med. Noise    High Noise
Best mic (closeby)        54.5           25
Worst mic (closeby)       56.5           29
Best mic (distant)        65.5           50
Worst mic (distant)       68.7           54

Percentage improvement in overall error rate over AMR2 for the two-channel VAD across two data and microphone configurations.
TwoCh VAD is superior to the other approaches when comparing error types 1, 4, 5, and 8. In terms of errors of types 3, 4, 7, and 8 only, AMR2 has a slight edge over the TwoCh VAD solution, which really uses no special logic or hangover scheme to enhance results. However, with different settings of parameters (particularly the boost factor) TwoCh VAD becomes competitive with AMR2 on this subset of errors. Nonetheless, in terms of overall error rates, TwoCh VAD was clearly superior to the other approaches.
`Referring to FIG. 6, a block diagram illustrating a voice
`activity detection (VAD) system and method according to a
`second embodiment of the present invention is provided. In
`the second embodiment, in addition to determining if a voice
`is present or not, the system and method determines which
`speaker is speaking the utterance when the VAD decision is
`positive.
It is to be understood that several elements of FIG. 6 have the same structure and functions as those described in reference to FIG. 2, and therefore, are depicted with like reference numerals and will not be described in detail in relation to FIG. 6. Furthermore, this embodiment is described for a system of two microphones, wherein the extension to more than two microphones would be obvious to one having ordinary skill in the art.
`In this embodiment, instead of estimating the ratio chan
`nel transfer function, K, it will be determined by calibrator
`650, during an initial calibration phase, for each speaker out
`of a total of d speakers. Each speaker will have a different
`K whenever there is sufficient spatial diversity between the
`speakers and the microphones, e.g., in a car when the
`speakers are not sitting symmetrically with respect to the
`microphones.
`During the calibration phase, in the absence (or low level)
`of noise, each of the d users speaks a sentence separately.
Based on the two clean recordings, x_1(t) and x_2(t), as
`
`
`
`
`9
`received by microphones 602 and 604, the ratio channel
`transfer function K(co) is estimated for an user by:
`
`US 7,146,315 B2
`
`10
`
`F
`
`XX:(l, co)Xi (l, so)
`2. X(l, (o)
`
`K(co) = =
`
`F
`
`2
`
`(16)
`
`5
`
where X_1(l, ω), X_2(l, ω) represent the discrete windowed Fourier transform at frequency ω and time-frame index l of the clean signals x_1, x_2. Thus, a set of ratios of channel transfer functions K_l(ω), 1 ≤ l ≤ d, one for each speaker, is obtained. Despite the apparently simpler form of the ratio channel transfer function, such as

$$K(\omega) = \frac{X_2(\omega)}{X_1(\omega)}$$

a calibrator 650 based directly on this simpler form would not be robust. Hence, the calibrator 650 based on Eq. (16) minimizes a least-squares problem and thus is more robust to non-linearities and noises.
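The calibration of Eq. (16) reduces to two sums per frequency bin, as in this sketch (illustrative Python; the function name and array layout are assumptions):

```python
import numpy as np

def calibrate_K(X1, X2):
    """Least-squares ratio channel transfer function of Eq. (16).

    X1, X2 : (F, n_freq) windowed Fourier transforms of the two clean
             calibration recordings (F time frames).
    Returns K(w), shape (n_freq,).
    """
    num = np.sum(X2 * X1.conj(), axis=0)    # sum over frames of X2 conj(X1)
    den = np.sum(np.abs(X1) ** 2, axis=0)   # sum over frames of |X1|^2
    return num / den
```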
Once K has been determined for each speaker, the VAD decision is implemented in a similar fashion to that described above in relation to FIG. 2. However, the second embodiment of the present invention detects if a voice of any of the d speakers is present, and if so, estimates which one is speaking, and updates the noise spectral power matrix R_n and the threshold τ. Although the embodiment of FIG. 6 illustrates a method and system concerning two speakers, it is to be understood that the present invention is not limited to two speakers and can encompass an environment with a plurality of speakers.
After the initial calibration phase, signals x_1 and x_2 are input from microphones 602 and 604 on channels 606 and 608, respectively. Signals x_1 and x_2 are time domain signals. The signals x_1, x_2 are transformed into frequency domain signals, X_1 and X_2, respectively, by a Fast Fourier Transformer 610 and are outputted to a plurality of filters 620-1, 620-2 on channels 612 and 614. In this embodiment, there will be one filter for each speaker interacting with the system. Therefore, for each of the d speakers, 1 ≤ l ≤ d, the filter is computed as:

$$(A_l, B_l) = R_s K_l^* R_n^{-1} \tag{17}$$

and the following is outputted from each filter 620-1, 620-2:

$$S_l = A_l X_1 + B_l X_2 \tag{18}$$
The spectral power densities, R_s and R_n, to be supplied to the filters will be calculated as described above in relation to the first embodiment through first learning module 626, second learning module 632 and spectral subtractor 628. The K_l of each speaker will be inputted to the filters from the calibration unit 650, determined during the calibration phase.
The output S_l from each of the filters is summed over a range of frequencies in summers 622-1 and 622-2 to produce a sum E_l, an absolute value squared of the filtered signal, as determined below:

$$E_l = \sum_{\omega} |S_l(\omega)|^2 \tag{19}$$

As can be seen from FIG. 6, for each filter there is a summer, and it can be appreciated that for each speaker of the system 600, there is a filter/summer combination.
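Anticipating the maximum selection by processor 623 described next, the per-speaker path of Eqs. (17)-(19) might be sketched as follows (illustrative Python for the two-microphone case; the return convention and parameter names are assumptions):

```python
import numpy as np

def multi_speaker_vad(X, K_ratios, Rs, Rn_inv, boost=100.0):
    """Per-speaker filtering (Eq. 17), output (Eq. 18), energy (Eq. 19),
    followed by a maximum-and-threshold test.

    X        : (2, n_freq) frame spectra from the two microphones
    K_ratios : list of (n_freq,) calibrated ratios K_l(w), one per speaker
    Rs       : (n_freq,) source spectral power
    Rn_inv   : (n_freq, 2, 2) inverse noise spectral power matrices
    """
    n_freq = X.shape[1]
    energies = []
    for Kl in K_ratios:
        E_l = 0.0
        for f in range(n_freq):
            Kvec = np.array([1.0, Kl[f]])             # K_1 = 1 by convention
            AB = Rs[f] * (Kvec.conj() @ Rn_inv[f])    # (A_l, B_l), Eq. (17)
            S_l = AB[0] * X[0, f] + AB[1] * X[1, f]   # Eq. (18)
            E_l += abs(S_l) ** 2                      # Eq. (19)
        energies.append(E_l)
    tau = boost * np.sum(np.abs(X) ** 2)              # adapted threshold
    l_max = int(np.argmax(energies))
    if energies[l_max] >= tau:
        return 1, l_max    # voice present; index of the active speaker
    return 0, None         # no voice detected
```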
The sums E_l are then sent to processor 623 to determine a maximum value of all the