US007146315B2

(12) United States Patent
Balan et al.

(10) Patent No.: US 7,146,315 B2
(45) Date of Patent: Dec. 5, 2006
(54) MULTICHANNEL VOICE DETECTION IN ADVERSE ENVIRONMENTS

(75) Inventors: Radu Victor Balan, Levittown, PA (US); Justinian Rosca, Princeton Junction, NJ (US); Christophe Beaugeant, Munich (DE)

(73) Assignee: Siemens Corporate Research, Inc., Princeton, NJ (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 925 days.

(21) Appl. No.: 10/231,613

(22) Filed: Aug. 30, 2002

(65) Prior Publication Data
US 2004/0042626 A1, Mar. 4, 2004

(51) Int. Cl.
G10L 15/20 (2006.01)

(52) U.S. Cl. 704/233; 704/247; 381/94.3; 381/56; 381/110; 379/406.04

(58) Field of Classification Search: 704/225-228, 704/233, 247, 275; 381/94.3, 56, 110; 379/406.04
See application file for complete search history.
(56) References Cited

U.S. PATENT DOCUMENTS

5,012,519 A * 4/1991 Adlersberg et al. ... 704/226
5,276,765 A * 1/1994 Freeman et al. ... 704/233
5,550,924 A * 8/1996 Helf et al. ... 381/94.3
5,563,944 A * 10/1996 Hasegawa ... 379/406.04
5,839,101 A * 11/1998 Vahatalo et al. ... 704/226
6,011,853 A 1/2000 Koski et al. ... 381/56
6,070,140 A * 5/2000 Tran ... 704/275
6,088,668 A * 7/2000 Zack ... 704/225
6,097,820 A * 8/2000 Turner ... 381/94.3
6,141,426 A * 10/2000 Stobba et al. ... 381/110
6,363,345 B1 * 3/2002 Marash et al. ... 704/226
6,377,637 B1 * 4/2002 Berdugo ... 375/346
2003/0004720 A1 1/2003 Garudadri et al. ... 704/247

FOREIGN PATENT DOCUMENTS

EP 1081985 7/2001

OTHER PUBLICATIONS

Rosca et al., "Multichannel voice detection in adverse environments," XI European Signal Processing Conference (EUSIPCO), Sep. 2, 2002, XP008025382.

(Continued)

Primary Examiner: Vijay B. Chawan
(74) Attorney, Agent, or Firm: Donald B. Paschburg; F. Chau & Associates, LLC
(57) ABSTRACT

A multichannel source activity detection system, e.g., a voice activity detection (VAD) system, and method that exploits spatial localization of a target audio source is provided. The method includes the steps of receiving a mixed sound signal by at least two microphones; Fast Fourier transforming each received mixed sound signal into the frequency domain; filtering the transformed signals to output a signal corresponding to a spatial signature of a source; summing an absolute value squared of the filtered signal over a predetermined range of frequencies; and comparing the sum to a threshold to determine if a voice is present. Additionally, the filtering step includes multiplying the transformed signals by an inverse of a noise spectral power matrix, a vector of channel transfer function ratios, and a source signal spectral power.

22 Claims, 6 Drawing Sheets
[Representative drawing: block diagram of the second-embodiment VAD system (FIG. 6), showing two microphone inputs feeding an FFT 610, per-speaker filters (A_1, A_2), a calibration unit 650, learner and spectral subtraction modules, a "# of user speaking" output, and the VAD decision output.]
OTHER PUBLICATIONS (continued)

Aalburg et al., "Single- and two-channel noise reduction for robust speech recognition in car," ISCA Workshop Multi-Modal Dialogue in Mobile Environments, Jun. 2002, XP002264041.
Balan, R. et al., "Microphone array speech enhancement by Bayesian estimation of spectral amplitude and phase," Aug. 2002, pp. 209-213, XP010635740.
Renevey, P. et al., "Entropy based voice activity detection in very noisy conditions," Eurospeech 2001 Proceedings, vol. 3, Sep. 2001, pp. 1887-1890, XP007004739.
Srinivasan, K. et al., "Voice activity detection for cellular networks," Proceedings of the IEEE Workshop on Speech Coding for Telecommunications, Oct. 1993, pp. 85-86, XP002204645.
International Search Report.
* cited by examiner
[Drawing sheets 1-6 of 6 (images):
FIG. 1A: scenario with two fixed inside-the-car microphones (102, 104).
FIG. 1B: scenario with one fixed microphone (102) and a second microphone contained in a mobile phone (106).
FIG. 2: block diagram of the first-embodiment VAD system: FFT 110, filter A 120, summers 116 and 122, multiplier 118, comparator 124, learning modules 126 and 132, spectral subtractor 128, and adaptive estimator 130.
FIG. 3: chart of the error types considered for evaluating VAD methods.
FIG. 4: frame error rates by error type and total error for the medium noise scenario (AMR1, AMR2, TwoCh VAD; boost factor 100).
FIG. 5: frame error rates by error type and total error for the high noise scenario.
FIG. 6: block diagram of the second-embodiment VAD system: microphones 602 and 604, FFT 610, per-speaker filters 620-1 and 620-2, summers 622-1 and 622-2, processor 623, learning modules 626 and 632, spectral subtractor 628, calibration unit 650, and the outputs VAD and "# of user speaking".]
MULTICHANNEL VOICE DETECTION IN ADVERSE ENVIRONMENTS

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates generally to digital signal processing systems, and more particularly, to a system and method for voice activity detection in adverse environments, e.g., noisy environments.

2. Description of the Related Art

Voice (and more generally acoustic source) activity detection (VAD) is a cornerstone problem in signal processing practice, and often it has a stronger influence on the overall performance of a system than any other component. Speech coding, multimedia communication (voice and data), speech enhancement in noisy conditions, and speech recognition are important applications where a good VAD method or system can substantially increase the performance of the respective system. The role of a VAD method is basically to extract features of an acoustic signal that emphasize differences between speech and noise and then classify them to take a final VAD decision. The variety and varying nature of speech and background noises make the VAD problem challenging.

Traditionally, VAD methods use energy criteria such as SNR (signal-to-noise ratio) estimation based on long-term noise estimation, such as disclosed in K. Srinivasan and A. Gersho, "Voice activity detection for cellular networks," in Proc. of the IEEE Speech Coding Workshop, October 1993, pp. 85-86. Proposed improvements use a statistical model of the audio signal and derive the likelihood ratio, as disclosed in Y. D. Cho, K. Al-Naimi, and A. Kondoz, "Improved voice activity detection based on a smoothed statistical likelihood ratio," in Proceedings ICASSP 2001, IEEE Press, or compute the kurtosis, as disclosed in R. Goubran, E. Nemer, and S. Mahmoud, "SNR estimation of speech signals using subbands and fourth-order statistics," IEEE Signal Processing Letters, vol. 6, no. 7, pp. 171-174, July 1999. Alternatively, other VAD methods attempt to extract robust features (e.g., the presence of a pitch, the formant shape, or the cepstrum) and compare them to a speech model. Recently, multiple-channel (e.g., multiple microphones or sensors) VAD algorithms have been investigated to take advantage of the extra information provided by the additional sensors.

SUMMARY OF THE INVENTION

Detecting when voices are or are not present is an outstanding problem for speech transmission, enhancement, and recognition. Here, a novel multichannel source activity detection system, e.g., a voice activity detection (VAD) system, that exploits spatial localization of a target audio source is provided. The VAD system uses an array signal processing technique to maximize the signal-to-interference ratio for the target source, thus decreasing the activity detection error rate. The system uses outputs of at least two microphones placed in a noisy environment, e.g., a car, and outputs a binary signal (0/1) corresponding to the absence (0) or presence (1) of a driver's and/or passenger's voice signals. The VAD output can be used by other signal processing components, for instance, to enhance the voice signal.

According to one aspect of the present invention, a method for determining if a voice is present in a mixed sound signal is provided. The method includes the steps of receiving the mixed sound signal by at least two microphones; Fast Fourier transforming each received mixed sound signal into the frequency domain; filtering the transformed signals to output a signal corresponding to a spatial signature for each of the transformed signals; summing an absolute value squared of the filtered signals over a predetermined range of frequencies; and comparing the sum to a threshold to determine if a voice is present, wherein if the sum is greater than or equal to the threshold, a voice is present, and if the sum is less than the threshold, a voice is not present. Additionally, the filtering step includes multiplying the transformed signals by an inverse of a noise spectral power matrix, a vector of channel transfer function ratios, and a source signal spectral power.

According to another aspect of the present invention, a method for determining if a voice is present in a mixed sound signal includes the steps of receiving the mixed sound signal by at least two microphones; Fast Fourier transforming each received mixed sound signal into the frequency domain; filtering the transformed signals to output signals corresponding to a spatial signature for each of a predetermined number of users; summing separately for each of the users an absolute value squared of the filtered signals over a predetermined range of frequencies; determining a maximum of the sums; and comparing the maximum sum to a threshold to determine if a voice is present, wherein if the sum is greater than or equal to the threshold, a voice is present, and if the sum is less than the threshold, a voice is not present, wherein if a voice is present, a specific user associated with the maximum sum is determined to be the active speaker. The threshold is adapted with the received mixed sound signal.

According to a further embodiment of the present invention, a voice activity detector for determining if a voice is present in a mixed sound signal is provided. The voice activity detector includes at least two microphones for receiving the mixed sound signal; a Fast Fourier transformer for transforming each received mixed sound signal into the frequency domain; a filter for filtering the transformed signals to output a signal corresponding to an estimated spatial signature of a speaker; a first summer for summing an absolute value squared of the filtered signal over a predetermined range of frequencies; and a comparator for comparing the sum to a threshold to determine if a voice is present, wherein if the sum is greater than or equal to the threshold, a voice is present, and if the sum is less than the threshold, a voice is not present.

According to yet another aspect of the present invention, a voice activity detector for determining if a voice is present in a mixed sound signal includes at least two microphones for receiving the mixed sound signal; a Fast Fourier transformer for transforming each received mixed sound signal into the frequency domain; at least one filter for filtering the transformed signals to output a signal corresponding to a spatial signature of a speaker for each of a predetermined number of users; at least one first summer for summing separately for each of the users an absolute value squared of the filtered signal over a predetermined range of frequencies; a processor for determining a maximum of the sums; and a comparator for comparing the maximum sum to a threshold to determine if a voice is present, wherein if the sum is greater than or equal to the threshold, a voice is present, and if the sum is less than the threshold, a voice is not present, wherein if a voice is present, a specific user associated with the maximum sum is determined to be the active speaker.
BRIEF DESCRIPTION OF THE DRAWINGS

The above and other objects, features, and advantages of the present invention will become more apparent in light of the following detailed description when taken in conjunction with the accompanying drawings, in which:

FIGS. 1A and 1B are schematic diagrams illustrating two scenarios for implementing the system and method of the present invention, where FIG. 1A illustrates a scenario using two fixed inside-the-car microphones and FIG. 1B illustrates the scenario of using one fixed microphone and a second microphone contained in a mobile phone;

FIG. 2 is a block diagram illustrating a voice activity detection (VAD) system and method according to a first embodiment of the present invention;

FIG. 3 is a chart illustrating the types of errors considered for evaluating VAD methods;

FIG. 4 is a chart illustrating frame error rates by error type and total error for a medium noise, distant microphone scenario;

FIG. 5 is a chart illustrating frame error rates by error type and total error for a high noise, distant microphone scenario; and

FIG. 6 is a block diagram illustrating a voice activity detection (VAD) system and method according to a second embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Preferred embodiments of the present invention will be described herein below with reference to the accompanying drawings. In the following description, well-known functions or constructions are not described in detail to avoid obscuring the invention in unnecessary detail.

A multichannel VAD (voice activity detection) system and method is provided for determining whether speech is present or not in a signal. Spatial localization is the key underlying the present invention, which can be used equally for voice and non-voice signals of interest. To illustrate the present invention, assume the following scenario: the target source (such as a person speaking) is located in a noisy environment, and two or more microphones record an audio mixture. For example, as shown in FIGS. 1A and 1B, two signals are measured inside a car by two microphones, where one microphone 102 is fixed inside the car and the second microphone can either be fixed inside the car 104 or can be in a mobile phone 106. Inside the car, there is only one speaker, or if more persons are present, only one speaks at a time. Assume d is the number of users. Noise is assumed diffuse, but not necessarily uniform, i.e., the sources of noise are not spatially well-localized, and the spectral coherence matrix may be time-varying. Under this scenario, the system and method of the present invention blindly identifies a mixing model and outputs a signal corresponding to a spatial signature with the largest signal-to-interference ratio (SIR) possibly obtainable through linear filtering. Although the output signal contains large artifacts and is unsuitable for signal estimation, it is ideal for signal activity detection.

To understand the various features and advantages of the present invention, a detailed description of an exemplary implementation will now be provided. In Section 1, the mixing model and main statistical assumptions will be provided. Section 2 shows the filter derivations and presents the overall VAD architecture. Section 3 addresses the blind model identification problem. Section 4 discusses the evaluation criteria used, and Section 5 discusses implementation issues and experimental results on real data.

1. Mixing Model and Statistical Assumptions

The time-domain mixing model assumes D microphone signals x_1(t), ..., x_D(t), which record a source s(t) and noise signals n_1(t), ..., n_D(t):

x_i(t) = \sum_{k=1}^{L_i} a_{ik}\, s(t - \tau_{ik}) + n_i(t), \quad i = 1, \ldots, D,   (1)

where (a_{ik}, \tau_{ik}) are the attenuation and delay on the k-th path to microphone i, and L_i is the total number of paths to microphone i.

In the frequency domain, convolutions become multiplications. Therefore, the source is redefined so that the first channel transfer function, K_1, becomes unity:

X_i(k, \omega) = K_i(\omega) S(k, \omega) + N_i(k, \omega), \quad i = 1, \ldots, D,   (2)

where k is the frame index and \omega is the frequency index. More compactly, this model can be rewritten as

X = K S + N,   (3)
where X, K, N are complex vectors. The vector K represents the spatial signature of the source S.

The following assumptions are made: (1) the source signal s(t) is statistically independent of the noise signals n_i(t), for all i; (2) the mixing parameters K(\omega) are either time-invariant or slowly time-varying; (3) S(\omega) is a zero-mean stochastic process with spectral power R_s(\omega) = E[|S|^2]; and (4) (N_1, N_2, \ldots, N_D) is a zero-mean stochastic signal with noise spectral power matrix R_n(\omega).
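For concreteness, the following sketch (not part of the patent disclosure; all names, constants, and the choice of a single direct path per channel are this example's assumptions) builds the frequency-domain model of Eqs. (1)-(3) numerically for D = 2 microphones:

```python
# Minimal sketch of the mixing model of Eqs. (1)-(3), assuming one direct
# path per microphone. Frame size and sampling rate follow Section 5.1.
import numpy as np

fs = 8000          # sampling rate (Hz)
frame = 256        # frame size in samples
D = 2              # number of microphones

rng = np.random.default_rng(0)
s = rng.standard_normal(fs)          # stand-in source signal s(t)
a = [1.0, 0.8]                       # attenuations a_i (a_1 normalized to 1)
tau = [0, 3]                         # delays tau_i in samples (tau_1 = 0)

# Eq. (1) with one path: x_i(t) = a_i s(t - tau_i) + n_i(t)
# (np.roll is a circular shift standing in for a pure delay)
x = np.stack([a[i] * np.roll(s, tau[i]) + 0.1 * rng.standard_normal(fs)
              for i in range(D)])

# Eq. (2): per-frame FFT gives X_i(k, w) = K_i(w) S(k, w) + N_i(k, w)
X = np.stack([np.fft.rfft(x[i, :frame]) for i in range(D)])   # frame k = 0

# Spatial signature with K_1 = 1 (source redefined so the first channel
# transfer function is unity): K_i(w) = (a_i/a_1) e^{-i w (tau_i - tau_1)}
w = 2 * np.pi * np.fft.rfftfreq(frame)       # frequency (rad/sample)
K = np.stack([(a[i] / a[0]) * np.exp(-1j * w * (tau[i] - tau[0]))
              for i in range(D)])            # Eq. (3): X = K S + N
```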
2. Filter Derivations and VAD Architecture

In this section, an optimal-gain filter is derived and implemented in the overall system architecture of the VAD system. A linear filter A applied on X produces:

Z = A X = A K S + A N.

The linear filter that maximizes the SNR (SIR) is desired. The output SNR (oSNR) achieved by A is:

oSNR = \frac{E[|A K S|^2]}{E[|A N|^2]} = \frac{R_s\, A K K^* A^*}{A R_n A^*}.   (4)

Maximizing oSNR over A results in a generalized eigenvalue problem, A R_n = \lambda A K K^*, whose maximizer can be obtained based on the Rayleigh quotient theory, as is known in the art:

A = \Theta\, K^* R_n^{-1},

where \Theta is an arbitrary nonzero scalar. This expression suggests running the output Z through an energy detector with an input-dependent threshold in order to decide whether the source signal is present or not in the current data frame.
The voice activity detection (VAD) decision becomes:

VAD(k) = \begin{cases} 1 & \text{if } \sum_{\omega} |Z(k,\omega)|^2 > \tau \\ 0 & \text{otherwise} \end{cases}   (5)

where the threshold is \tau = \beta \sum_{\omega} \sum_{i=1}^{D} |X_i(k,\omega)|^2 and \beta > 0 is a constant boosting factor. Since on the one hand A is determined up to a multiplicative constant, and on the other hand a maximized output energy is desired when the signal is present, \Theta = R_s, the estimated signal spectral power, is chosen. The filter becomes:

A = R_s K^* R_n^{-1}.   (6)

Based on the above, the overall architecture of the VAD of the present invention is presented in FIG. 2. The VAD decision is based on equations (5) and (6). K, R_s, and R_n are estimated from data, as will be described below.

Referring to FIG. 2, signals x_1 and x_2 are input from microphones 102 and 104 on channels 106 and 108, respectively. Signals x_1 and x_2 are time domain signals. The signals x_1, x_2 are transformed into frequency domain signals, X_1 and X_2 respectively, by a Fast Fourier transformer 110 and are output to filter A 120 on channels 112 and 114. Filter 120 processes the signals X_1, X_2 based on Eq. (6) described above to generate an output Z corresponding to a spatial signature for each of the transformed signals. The variables R_s, R_n, and K which are supplied to filter 120 will be described in detail below. The output Z is processed and summed over a range of frequencies in summer 122 to produce a sum \sum|Z|^2, i.e., an absolute value squared of the filtered signal. The sum is then compared to a threshold \tau in comparator 124 to determine if a voice is present or not. If the sum is greater than or equal to the threshold \tau, a voice is determined to be present and comparator 124 outputs a VAD signal of 1. If the sum is less than the threshold \tau, a voice is determined not to be present and the comparator outputs a VAD signal of 0.

To determine the threshold, the frequency domain signals X_1, X_2 are input to a second summer 116, where an absolute value squared of the signals X_1, X_2 is summed over the number of microphones D, and that sum is summed over a range of frequencies to produce the sum \sum\|X\|^2. This sum is then multiplied by the boosting factor \beta through multiplier 118 to determine the threshold \tau.
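The filter of Eq. (6) and the decision of Eq. (5) can be sketched compactly. The fragment below (not part of the patent; the function name, array layouts, and default boost factor are this example's choices) applies the filter per frequency bin and thresholds the output energy against \tau = \beta \sum\|X\|^2:

```python
# Sketch of Eqs. (5)-(6) for one frame. X: (D, F) FFT of the frame;
# K: (D, F) spatial signature; Rs: (F,) source spectral power;
# Rn: (D, D, F) noise spectral power matrix; beta: boosting factor.
import numpy as np

def vad_decision(X, K, Rs, Rn, beta=100.0):
    """Return (vad, Z): the binary decision of Eq. (5) and the filter output."""
    D, F = X.shape
    Z = np.zeros(F, dtype=complex)
    for f in range(F):
        # Eq. (6): A = Rs K^* Rn^{-1}  (a 1 x D row filter per frequency)
        A = Rs[f] * (np.conj(K[:, f]) @ np.linalg.inv(Rn[:, :, f]))
        Z[f] = A @ X[:, f]                  # Z = A X
    energy = np.sum(np.abs(Z) ** 2)         # sum over the frequency range
    tau = beta * np.sum(np.abs(X) ** 2)     # threshold: beta * sum ||X||^2
    return int(energy > tau), Z
```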
3. Mixing Model Identification

Now, the estimators for the transfer function ratio K and the spectral power densities R_s and R_n are presented. The most recently available VAD signal is also employed in updating the values of K, R_s, and R_n.

3.1 Adaptive Model-Based Estimator of K

With continued reference to FIG. 2, the adaptive estimator 130 estimates a value of K, the user's spatial signature, that makes use of a direct path mixing model to reduce the number of parameters:

K_l(\omega) = a_l e^{-i\omega\delta_l}, \quad l = 2, \ldots, D, \quad K_1 = 1.   (7)

The parameters (a_l, \delta_l) that best fit into

R_x = R_n + R_s K K^*   (8)

are chosen using the Frobenius norm, as is known in the art, where \hat{R}_x is the measured signal spectral covariance matrix. Thus, the following should be minimized:

I(a_2, \ldots, a_D, \delta_2, \ldots, \delta_D) = \sum_{\omega} \mathrm{trace}\{(\hat{R}_x - R_n - R_s K K^*)^2\}.   (9)

The summation above is across frequencies because the same parameters (a_l, \delta_l), 2 \le l \le D, should explain all frequencies. The gradient of I evaluated on the current estimate (a_l, \delta_l), 2 \le l \le D, is:

\frac{\partial I}{\partial a_l} = -4 \sum_{\omega} R_s\, \mathrm{real}(K^* E v_l),   (10)

\frac{\partial I}{\partial \delta_l} = -2 a_l \sum_{\omega} \omega R_s\, \mathrm{imag}(K^* E v_l),   (11)

where E = \hat{R}_x - R_n - R_s K K^* and v_l is the D-vector of zeros everywhere except on the l-th entry, where it is e^{i\omega\delta_l}: v_l = [0 \cdots 0\; e^{i\omega\delta_l}\; 0 \cdots 0]^T. Then, the updating rule is given by

a_l \leftarrow a_l - \epsilon \frac{\partial I}{\partial a_l},   (12)

\delta_l \leftarrow \delta_l - \epsilon \frac{\partial I}{\partial \delta_l},   (13)

with \epsilon the learning rate.
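One gradient step of Eqs. (10)-(13) can be sketched as follows (not part of the patent; written for D = 2 so that K(\omega) = [1, a e^{-i\omega\delta}]^T, with array layouts and the function name chosen for this example):

```python
# One gradient-descent update of the direct-path parameters (a, delta) of
# Section 3.1. w: (F,) frequencies; Rx_hat, Rn: (2, 2, F) covariance
# estimates; Rs: (F,) source spectral power; eps: learning rate.
import numpy as np

def update_K_params(a, delta, w, Rx_hat, Rn, Rs, eps=0.01):
    """Apply Eqs. (10)-(13) once and return the updated (a, delta)."""
    dI_da, dI_ddelta = 0.0, 0.0
    for f in range(len(w)):
        K = np.array([1.0, a * np.exp(-1j * w[f] * delta)])    # Eq. (7)
        E = Rx_hat[:, :, f] - Rn[:, :, f] - Rs[f] * np.outer(K, np.conj(K))
        v = np.array([0.0, np.exp(1j * w[f] * delta)])         # v_l in Eqs. (10)-(11)
        KEv = np.conj(K) @ E @ v                               # scalar K^* E v_l
        dI_da += -4.0 * Rs[f] * np.real(KEv)                   # Eq. (10)
        dI_ddelta += -2.0 * a * w[f] * Rs[f] * np.imag(KEv)    # Eq. (11)
    return a - eps * dI_da, delta - eps * dI_ddelta            # Eqs. (12)-(13)
```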
3.2 Estimation of Spectral Power Densities

The noise spectral power matrix, R_n, is initially measured through a first learning module 132. Thereafter, the estimation of R_n is based on the most recently available VAD signal, generated by comparator 124, simply by the following:

R_n^{(k)} = \begin{cases} (1 - \beta) R_n^{(k-1)} + \beta X X^* & \text{if voice not present} \\ R_n^{(k-1)} & \text{if voice present} \end{cases}   (14)

where \beta is a learning constant. After R_n is determined by Eq. (14), the result is sent to update filter 120.

The signal spectral power R_s is estimated through spectral subtraction. The measured signal spectral covariance matrix, \hat{R}_x, is determined by a second learning module 126 based on the frequency-domain input signals X_1, X_2 and is input to spectral subtractor 128 along with R_n, which is generated from the first learning module 132. R_s is then determined by the following:

R_s = \begin{cases} \hat{R}_{x,11} - R_{n,11} & \text{if } \hat{R}_{x,11} > \beta_{SS} R_{n,11} \\ (\beta_{SS} - 1) R_{n,11} & \text{otherwise} \end{cases}   (15)

where \beta_{SS} > 1 is a floor-dependent constant. After R_s is determined by Eq. (15), the result is sent to update filter 120.
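A minimal sketch of both estimators follows (not part of the patent; the per-bin layout, function names, and default constants are this example's assumptions, with the defaults taken from Section 5.1):

```python
# Sketch of the VAD-gated noise update of Eq. (14) and the spectral
# subtraction of Eq. (15), for one frequency bin. X: (D,) FFT vector of the
# current frame; Rn, Rx_hat: (D, D) matrices for this bin.
import numpy as np

def update_Rn(Rn, X, voice_present, beta_n=0.2):
    """Eq. (14): refresh the noise spectral power matrix on noise-only frames."""
    if voice_present:
        return Rn
    return (1.0 - beta_n) * Rn + beta_n * np.outer(X, np.conj(X))

def estimate_Rs(Rx_hat, Rn, beta_ss=1.1):
    """Eq. (15): spectral subtraction on the (1,1) entries, with a noise floor."""
    if np.real(Rx_hat[0, 0]) > beta_ss * np.real(Rn[0, 0]):
        return np.real(Rx_hat[0, 0] - Rn[0, 0])
    return (beta_ss - 1.0) * np.real(Rn[0, 0])
```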
4. VAD Performance Criteria

To evaluate the performance of the VAD system of the present invention, the possible errors that can be obtained when comparing the VAD signal with the true source presence signal must be defined. Errors take into account the context of the VAD prediction, i.e., the true VAD state (desired signal present or absent) before and after the state of the present data frame, as follows (see FIG. 3): (1) noise detected as useful signal (e.g., speech); (2) noise detected as signal before the true signal actually starts; (3) signal detected as noise in a true noise context; (4) signal detection delayed at the beginning of signal; (5) noise detected as signal after the true signal subsides; (6) noise detected as signal in between frames with signal presence; (7) signal detected as noise at the end of the active signal part; and (8) signal detected as noise during signal activity.

The prior art literature is mostly concerned with the four error types showing that speech is misclassified as noise (types 3, 4, 7, and 8 above). Some only consider errors 1, 4, 5, and 8; these are called "noise detected as speech" (1), "front-end clipping" (4), "noise interpreted as speech in passing from speech to noise" (5), and "midspeech clipping" (8), as described in F. Beritelli, S. Casale, and G. Ruggeri, "Performance evaluation and comparison of ITU-T/ETSI voice activity detectors," in Proceedings ICASSP, 2001, IEEE Press.

The evaluation of the present invention aims at assessing the VAD system and method in three problem areas: (1) speech transmission/coding, where error types 3, 4, 7, and 8 should be as small as possible so that speech is rarely if ever clipped and all data of interest (voice but also noise) is transmitted; (2) speech enhancement, where error types 3, 4, 7, and 8 should be as small as possible, although errors 1, 2, 5, and 6 are also weighted in depending on how noisy and non-stationary the noise is in common environments of interest; and (3) speech recognition (SR), where all errors are taken into account. In particular, error types 1, 2, 5, and 6 are important for non-restricted SR. A good classification of background noise as non-speech allows SR to work effectively on the frames of interest.
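As a simple illustration of this bookkeeping (not part of the patent), the sketch below computes only the coarse split between the two error families; the finer context-dependent subtypes of FIG. 3 are deliberately omitted:

```python
# Frame-level error counting in the spirit of FIG. 3. true_vad and pred_vad
# are 0/1 arrays per frame. Speech misclassified as noise covers types
# 3, 4, 7, 8; noise misclassified as speech covers types 1, 2, 5, 6.
import numpy as np

def vad_error_rates(true_vad, pred_vad):
    true_vad, pred_vad = np.asarray(true_vad), np.asarray(pred_vad)
    n = len(true_vad)
    speech_as_noise = np.sum((true_vad == 1) & (pred_vad == 0)) / n
    noise_as_speech = np.sum((true_vad == 0) & (pred_vad == 1)) / n
    return speech_as_noise, noise_as_speech, speech_as_noise + noise_as_speech
```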
5. Experimental Results

Three VAD algorithms were compared: (1-2) implementations of two conventional adaptive multi-rate (AMR) algorithms, AMR1 and AMR2, targeting discontinuous transmission of voice; and (3) a two-channel (TwoCh) VAD system following the approach of the present invention using D = 2 microphones. The algorithms were evaluated on real data recorded in a car environment in two setups, where the two sensors, i.e., microphones, are either closeby or distant. For each case, car noise while driving was recorded separately and additively superimposed on car voice recordings from static situations. The average input SNR for the "medium noise" test suite was zero dB for the closeby case, and -3 dB for the distant case. In both cases, a second test suite, "high noise", was also considered, where the input SNR dropped another 3 dB.

5.1 Algorithm Implementation

The implementation of the AMR1 and AMR2 algorithms is based on the conventional GSM AMR speech encoder version 7.3.0. The VAD algorithms use results calculated by the encoder, which may depend on the encoder input mode; therefore, a fixed mode of MRDTX was used here. The algorithms indicate whether each 20 ms frame (160 samples frame length at 8 kHz) contains signals that should be transmitted, i.e., speech, music or information tones. The output of the VAD algorithm is a boolean flag indicating the presence of such signals.
For the TwoCh VAD based on the max-SNR filter, adaptive model-based K estimator, and spectral power density estimators as presented above, the following parameters were used: boost factor \beta = 100, the learning rates \epsilon = 0.01 (in K estimation) and \beta = 0.2 (for R_n), and \beta_{SS} = 1.1 (in spectral subtraction). Processing was done blockwise with a frame size of 256 samples and a time step of 160 samples.
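To show how these settings drive the blockwise processing, the sketch below (not part of the patent) strings together the hypothetical helpers from the earlier sketches (vad_decision, update_K_params, update_Rn, estimate_Rs). The rank-one per-frame covariance estimate is a simplification standing in for the smoothed estimate of learning module 126, and the initial Rn0 stands in for the noise-only measurement of learning module 132:

```python
# Blockwise two-channel VAD loop with the Section 5.1 parameters.
import numpy as np

FRAME, STEP, BOOST = 256, 160, 100.0   # frame size, time step, boost factor

def run_two_channel_vad(x, Rn0, a0=1.0, delta0=0.0):
    """x: (2, T) time-domain signals; Rn0: initial noise estimate (2, 2, F)."""
    w = 2 * np.pi * np.fft.rfftfreq(FRAME)
    F = len(w)
    Rn, a, delta = Rn0.copy(), a0, delta0
    decisions = []
    for start in range(0, x.shape[1] - FRAME + 1, STEP):
        X = np.stack([np.fft.rfft(x[i, start:start + FRAME]) for i in range(2)])
        Rx_hat = np.einsum('if,jf->ijf', X, np.conj(X))     # rank-one estimate
        Rs = np.array([estimate_Rs(Rx_hat[:, :, f], Rn[:, :, f]) for f in range(F)])
        K = np.stack([np.ones(F), a * np.exp(-1j * w * delta)])   # Eq. (7), D = 2
        vad, _ = vad_decision(X, K, Rs, Rn, beta=BOOST)           # Eqs. (5)-(6)
        Rn = np.stack([update_Rn(Rn[:, :, f], X[:, f], vad)       # Eq. (14)
                       for f in range(F)], axis=-1)
        a, delta = update_K_params(a, delta, w, Rx_hat, Rn, Rs, eps=0.01)
        decisions.append(vad)
    return decisions
```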
5.2 Results

Ideal VAD labeling was obtained on the car voice data only, with a simple power level voice detector. Then, overall VAD errors with the three algorithms under study were obtained. Errors represent the average percentage of frames with a decision different from the ideal VAD, relative to the total number of frames processed.

FIGS. 4 and 5 present the individual and overall errors obtained with the three algorithms in the medium and high noise scenarios. Table 1 summarizes average results obtained when comparing the TwoCh VAD with AMR2. Note that in the described tests, the mono AMR algorithms utilized the best (highest SNR) of the two channels (which was chosen by hand).
TABLE 1

Data                   Med. Noise   High Noise
Best mic (closeby)         54.5         25
Worst mic (closeby)        56.5         29
Best mic (distant)         65.5         50
Worst mic (distant)        68.7         54

Percentage improvement in overall error rate over AMR2 for the two-channel VAD across two data and microphone configurations.
TwoCh VAD is superior to the other approaches when comparing error types 1, 4, 5, and 8. In terms of errors of types 3, 4, 7, and 8 only, AMR2 has a slight edge over the TwoCh VAD solution, which uses no special logic or hangover scheme to enhance results. However, with different settings of parameters (particularly the boost factor), TwoCh VAD becomes competitive with AMR2 on this subset of errors. Nonetheless, in terms of overall error rates, TwoCh VAD was clearly superior to the other approaches.
Referring to FIG. 6, a block diagram illustrating a voice activity detection (VAD) system and method according to a second embodiment of the present invention is provided. In the second embodiment, in addition to determining if a voice is present or not, the system and method determines which speaker is speaking the utterance when the VAD decision is positive.

It is to be understood that several elements of FIG. 6 have the same structure and functions as those described in reference to FIG. 2, and therefore, are depicted with like reference numerals and will not be described in detail in relation to FIG. 6. Furthermore, this embodiment is described for a system of two microphones, wherein the extension to more than two microphones would be obvious to one having ordinary skill in the art.

In this embodiment, instead of estimating the ratio channel transfer function K, it will be determined by calibrator 650, during an initial calibration phase, for each speaker out of a total of d speakers. Each speaker will have a different K whenever there is sufficient spatial diversity between the speakers and the microphones, e.g., in a car when the speakers are not sitting symmetrically with respect to the microphones.

During the calibration phase, in the absence (or low level) of noise, each of the d users speaks a sentence separately. Based on the two clean recordings, x_1(t) and x_2(t), as
received by microphones 602 and 604, the ratio channel transfer function K(\omega) is estimated for a user by:

K(\omega) = \frac{\sum_{l=1}^{F} X_2(l, \omega)\, X_1^*(l, \omega)}{\sum_{l=1}^{F} |X_1(l, \omega)|^2},   (16)

where X_1(l, \omega), X_2(l, \omega) represent the discrete windowed Fourier transforms at frequency \omega and time-frame index l of the clean signals x_1, x_2. Thus, a set of ratios of channel transfer functions K_l(\omega), 1 \le l \le d, one for each speaker, is obtained. Despite the apparently simpler form of the ratio channel transfer function, such as

K(\omega) = \frac{X_2(\omega)}{X_1(\omega)},

a calibrator 650 based directly on this simpler form would not be robust. Hence, the calibrator 650 based on Eq. (16) minimizes a least-squares problem and thus is more robust to non-linearities and noises.
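Eq. (16) reduces to a few lines in practice. The sketch below (not part of the patent; the function name and array layout are this example's choices) computes one speaker's K(\omega) from that speaker's clean calibration frames:

```python
# Least-squares calibration of Eq. (16). X1, X2: (F_frames, N_freq)
# windowed FFTs of one speaker's clean two-channel recording.
import numpy as np

def calibrate_K(X1, X2):
    """K(w) = sum_l X2(l,w) X1*(l,w) / sum_l |X1(l,w)|^2, per frequency bin."""
    num = np.sum(X2 * np.conj(X1), axis=0)   # cross term, summed over frames l
    den = np.sum(np.abs(X1) ** 2, axis=0)    # channel-1 energy per bin
    return num / den

# As the text above notes, this least-squares form is preferred over the
# naive per-frame ratio X2(w)/X1(w), which is not robust to noise.
```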
Once K has been determined for each speaker, the VAD decision is implemented in a similar fashion to that described above in relation to FIG. 2. However, the second embodiment of the present invention detects if a voice of any of the d speakers is present, and if so, estimates which one is speaking, and updates the noise spectral power matrix R_n and the threshold \tau. Although the embodiment of FIG. 6 illustrates a method and system concerning two speakers, it is to be understood that the present invention is not limited to two speakers and can encompass an environment with a plurality of speakers.

After the initial calibration phase, signals x_1 and x_2 are input from microphones 602 and 604 on channels 606 and 608, respectively. Signals x_1 and x_2 are time domain signals. The signals x_1, x_2 are transformed into frequency domain signals, X_1 and X_2 respectively, by a Fast Fourier transformer 610 and are output to a plurality of filters 620-1, 620-2 on channels 612 and 614. In this embodiment, there will be one filter for each speaker interacting with the system. Therefore, for each of the d speakers, 1 \le l \le d, the filter becomes:

A^{(l)} = R_s (K^{(l)})^* R_n^{-1} = (A_1^{(l)}, A_2^{(l)}),   (17)

and the following is output from each filter 620-1, 620-2:

S_l = A_1^{(l)} X_1 + A_2^{(l)} X_2.   (18)

The spectral power densities R_s and R_n to be supplied to the filters are calculated as described above in relation to the first embodiment, through first learning module 626, second learning module 632, and spectral subtractor 628. The K of each speaker is input to the filters from the calibration unit 650, determined during the calibration phase. The output S_l from each of the filters is summed over a range of frequencies in summers 622-1 and 622-2 to produce a sum E_l, an absolute value squared of the filtered signal, as determined below:

E_l = \sum_{\omega} |S_l(\omega)|^2.   (19)
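The per-speaker filtering, the energies of Eq. (19), and the max-then-threshold decision can be sketched together (not part of the patent; names and array layouts are this example's choices, and the filter of Eq. (17) is applied per frequency bin as in the first-embodiment sketch):

```python
# Second-embodiment sketch: per-speaker filters of Eqs. (17)-(18), energies
# of Eq. (19), and selection of the active speaker. X: (2, F) frame FFT;
# Ks: list of calibrated (2, F) spatial signatures, one per speaker;
# Rs: (F,) source power; Rn: (2, 2, F) noise power matrix.
import numpy as np

def multi_speaker_vad(X, Ks, Rs, Rn, beta=100.0):
    """Return (vad, speaker): VAD decision and active-speaker index (or None)."""
    D, F = X.shape
    energies = []
    for K in Ks:                                  # one filter per speaker
        S = np.zeros(F, dtype=complex)
        for f in range(F):
            # Eq. (17): A^(l) = Rs (K^(l))^* Rn^{-1}
            A = Rs[f] * (np.conj(K[:, f]) @ np.linalg.inv(Rn[:, :, f]))
            S[f] = A @ X[:, f]                    # Eq. (18)
        energies.append(np.sum(np.abs(S) ** 2))   # Eq. (19): E_l
    l_max = int(np.argmax(energies))              # processor 623: max of the sums
    tau = beta * np.sum(np.abs(X) ** 2)           # adapted threshold
    vad = int(energies[l_max] >= tau)
    return vad, (l_max if vad else None)
```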
As can be seen from FIG. 6, for each filter there is a summer, and it can be appreciated that for each speaker of the system 600, there is a filter/summer combination. The sums E_l are then sent to processor 623 to determine a maximum value of all the sums.
