Freeman et al.

[11] Patent Number: 5,276,765
[45] Date of Patent: Jan. 4, 1994

[54] VOICE ACTIVITY DETECTION
[75] Inventors: Daniel K. Freeman; Ivan Boyd, both of Ipswich, England
[73] Assignee: British Telecommunications public limited company, London, England
[21] Appl. No.: 952,147
[22] PCT Filed: Mar. 10, 1989
[86] PCT No.: PCT/GB89/00247
     § 371 Date: Aug. 15, 1990
     § 102(e) Date: Aug. 15, 1990
[87] PCT Pub. No.: WO89/08910
     PCT Pub. Date: Sep. 21, 1989

Related U.S. Application Data
[63] Continuation of Ser. No. 555,445, Aug. 15, 1990, abandoned.

[30] Foreign Application Priority Data
Mar. 11, 1988 [GB] United Kingdom 8805795
Aug. 6, 1988 [GB] United Kingdom 8813346
Aug. 24, 1988 [GB] United Kingdom 8820105

[51] Int. Cl.5: G10L 5/00
[52] U.S. Cl.: 395/2
[58] Field of Search: 395/2; 381/71, 94, 46-50

[56] References Cited
U.S. PATENT DOCUMENTS
4,227,046  10/1980  Nakajima et al.   381/47
4,283,601   8/1981  Nakajima et al.   381/47
4,338,738  11/1982  Kahn              381/94
4,672,669   6/1987  DesBlache et al.  381/46
4,696,039   9/1987  Doddington        381/46
4,731,846   3/1988  Secrest et al.    381/49

OTHER PUBLICATIONS
Rabiner et al., "Application of an LPC Distance Measure to the Voiced-Unvoiced-Silence Detection Problem", IEEE Trans. on ASSP, vol. ASSP-25, No. 4, Aug. 1977, pp. 338-343.
McAulay, "Optimum Speech Classification and Its Application to Adaptive Noise Cancellation", 1977 IEEE ICASSP, Hartford, CN, May 9-11, 1977, pp. 425-428.
Un, "Improving LPC Analysis of Noisy Speech by Autocorrelation Subtraction Method", ICASSP '81, Atlanta, GA, Mar. 30, 31, Apr. 1981, pp. 1082-1085.

Primary Examiner: David D. Knepper
Attorney, Agent, or Firm: Nixon & Vanderhye

[57] ABSTRACT
Voice activity detector (VAD) for use in an LPC coder in a mobile radio system uses autocorrelation coefficients $R_0, R_1, \ldots$ of the input signal, weighted and combined, to provide a measure M which depends on the power within that part of the spectrum containing no noise, which is thresholded against a variable threshold to provide a speech/no speech logic output. The measure is formula (I), where $H_i$ are the autocorrelation coefficients of the impulse response of an Nth order FIR inverse noise filter derived from LPC analysis of previous non-speech signal frames. Threshold adaption and coefficient update are controlled by a second VAD responsive to rate of spectral change between frames.

23 Claims, 3 Drawing Sheets
`
[Front-page drawing: excerpt of FIG. 3 showing the threshold adapter, LPC analysis and pitch analysis blocks and the speech signal input; reference numerals 20, 21, 22, 23, 27, 29, 30.]
`
`GOOGLE EXHIBIT 1022
`
`
`
U.S. Patent, Jan. 4, 1994, Sheet 1 of 3, 5,276,765

[FIG. 1: block diagram of the first embodiment. Labelled blocks: input 1, ADC 2, LPC analysis 3 (LPC coefficients for speech coding), ACF 4 (autocorrelation coefficients Ri), multiplier 5, adder 6, speech/non-speech logic output 8, second input 11, ADC 12, LPC analysis 13 (LPC coefficients of noise), ACF 14.]
`
`
`
Sheet 2 of 3

[FIG. 2: block diagram of the second embodiment. Labelled blocks: input 1, ADC 2, LPC analysis 3 (LPC coefficients Li), ACF 4 (Ri), ACF 14, switch 16, buffer 15 ("noise" values), multiply and add 5,6, thresholder 7, speech/non-speech output 8.]
`
`
`
Sheet 3 of 3

[FIG. 3: block diagram of the third embodiment. Labelled blocks: input 1, ADC 2, inverse filter analysis 3, autocorrelator 4, autocorrelator 14 (Ai), buffer 15, multiply and add 5,6, speech/non-speech output, threshold adapter 29, block 30, and control signal generator 20 containing LPC analysis 21, blocks 22 and 23, and pitch analysis 27.]
`
`
`
VOICE ACTIVITY DETECTION

This is a continuation of application Ser. No. 07/555,445, filed Aug. 15, 1990, now abandoned.

BACKGROUND OF THE INVENTION

A voice activity detector is a device which is supplied with a signal with the object of detecting periods of speech, or periods containing only noise. Although the present invention is not limited thereto, one application of particular interest for such detectors is in mobile radio telephone systems where the knowledge as to the presence or otherwise of speech can be used and exploited by a speech coder to improve the efficient utilisation of radio spectrum, and where also the noise level (from a vehicle-mounted unit) is likely to be high.

The essence of voice activity detection is to locate a measure which differs appreciably between speech and non-speech periods. In apparatus which includes a speech coder, a number of parameters are readily available from one or other stage of the coder, and it is therefore desirable to economise on processing needed by utilising some such parameter. In many environments, the main noise sources occur in known defined areas of the frequency spectrum. For example, in a moving car much of the noise (e.g., engine noise) is concentrated in the low frequency regions of the spectrum. Where such knowledge of the spectral position of noise is available, it is desirable to base the decision as to whether speech is present or absent upon measurements taken from that portion of the spectrum which contains relatively little noise. It would, of course, be possible in practice to pre-filter the signal before analysing to detect speech activity, but where the voice activity detector follows the output of a speech coder, prefiltering would distort the voice signal to be coded.

SUMMARY OF THE INVENTION

According to the invention there is provided a voice activity detection apparatus comprising means for receiving an input signal, means for periodically adaptively generating an estimate of the noise signal component of the input signal, means for periodically forming a measure M of the spectral similarity between a portion of the input signal and the noise signal component, means for comparing a parameter derived from the measure M with a threshold value T, and means for producing an output to indicate the presence or absence of speech in dependence upon whether or not that value is exceeded.

Preferably, the measure is the Itakura-Saito Distortion Measure.

BRIEF DESCRIPTION OF THE DRAWINGS

Other aspects of the present invention are as defined in the claims.

Some embodiments of the invention will now be described, by way of example, with reference to the accompanying drawings, in which:
FIG. 1 is a block diagram of a first embodiment of the invention;
FIG. 2 shows a second embodiment of the invention;
FIG. 3 shows a third, preferred embodiment of the invention.

DETAILED DESCRIPTION OF THE DRAWINGS

The general principle underlying a first Voice Activity Detector according to a first embodiment of the invention is as follows.

A frame of n signal samples $(s_0, s_1, s_2, s_3, s_4, \ldots, s_{n-1})$, when passed through a notional fourth order FIR digital filter of impulse response $(1, h_0, h_1, h_2, h_3)$, would result in a filtered signal (ignoring samples from previous frames)

$$
\begin{aligned}
s' = {} & (s_0), \\
& (s_1 + h_0 s_0), \\
& (s_2 + h_0 s_1 + h_1 s_0), \\
& (s_3 + h_0 s_2 + h_1 s_1 + h_2 s_0), \\
& (s_4 + h_0 s_3 + h_1 s_2 + h_2 s_1 + h_3 s_0), \\
& (s_5 + h_0 s_4 + h_1 s_3 + h_2 s_2 + h_3 s_1), \\
& (s_6 + h_0 s_5 + h_1 s_4 + h_2 s_3 + h_3 s_2), \\
& (s_7 + \ldots)
\end{aligned}
$$

The zero order autocorrelation coefficient is the sum of each term squared, which may be normalized, i.e. divided by the total number of terms (for constant frame lengths it is easier to omit the division); that of the filtered signal is thus

$$
R'_0 = \sum_{i=0}^{n-1} (s'_i)^2
$$

and this is therefore a measure of the power of the notional filtered signal s'; in other words, of that part of the signal s which falls within the passband of the notional filter.

Expanding, neglecting the first 4 terms,

$$
\begin{aligned}
R'_0 = {} & (s_4 + h_0 s_3 + h_1 s_2 + h_2 s_1 + h_3 s_0)^2 \\
& + (s_5 + h_0 s_4 + h_1 s_3 + h_2 s_2 + h_3 s_1)^2 \\
& + \ldots \\
= {} & s_4^2 + h_0 s_4 s_3 + h_1 s_4 s_2 + h_2 s_4 s_1 + h_3 s_4 s_0 \\
& + h_0 s_4 s_3 + h_0^2 s_3^2 + h_0 h_1 s_3 s_2 + h_0 h_2 s_3 s_1 + h_0 h_3 s_3 s_0 \\
& + h_1 s_4 s_2 + h_0 h_1 s_3 s_2 + h_1^2 s_2^2 + h_1 h_2 s_2 s_1 + h_1 h_3 s_2 s_0 \\
& + h_2 s_4 s_1 + h_0 h_2 s_3 s_1 + h_1 h_2 s_2 s_1 + h_2^2 s_1^2 + h_2 h_3 s_1 s_0 \\
& + h_3 s_4 s_0 + h_0 h_3 s_3 s_0 + h_1 h_3 s_2 s_0 + h_2 h_3 s_1 s_0 + h_3^2 s_0^2 \\
& + \ldots \\
= {} & R_0 (1 + h_0^2 + h_1^2 + h_2^2 + h_3^2) \\
& + R_1 (2h_0 + 2h_0 h_1 + 2h_1 h_2 + 2h_2 h_3) \\
& + R_2 (2h_1 + 2h_1 h_3 + 2h_0 h_2) \\
& + R_3 (2h_2 + 2h_0 h_3) \\
& + R_4 (2h_3)
\end{aligned}
$$

So $R'_0$ can be obtained from a combination of the autocorrelation coefficients $R_i$, weighted by the bracketed constants which determine the frequency band to which the value of $R'_0$ is responsive. In fact, the bracketed terms are the autocorrelation coefficients of the impulse response of the notional filter, so that the expression above may be simplified to

$$
R'_0 = R_0 H_0 + 2 \sum_{i=1}^{N} R_i H_i, \qquad (1)
$$

where N is the filter order and $H_i$ are the (un-normalised) autocorrelation coefficients of the impulse response of the filter.

In other words, the effect on the signal autocorrelation coefficients of filtering a signal may be simulated by producing a weighted sum of the autocorrelation coefficients of the (unfiltered) signal, using the impulse response that the required filter would have had.

Thus, a relatively simple algorithm, involving a small number of multiplication operations, may simulate the effect of a digital filter requiring typically a hundred times this number of multiplication operations.
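As a minimal sketch of equation (1), the following Python compares the energy obtained by explicitly filtering a frame with the energy obtained from the weighted sum of autocorrelation coefficients. The function names and the sample values are illustrative assumptions, not part of the patent. One difference should be noted: the sketch uses the full convolution of the frame with the impulse response (start-up and tail samples included), for which the identity holds exactly, whereas the expansion in the text neglects the first few terms of the frame.

```python
def autocorr(x, max_lag):
    """Un-normalised autocorrelation: R[k] = sum_n x[n] * x[n + k], k = 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def filtered_energy_direct(s, h):
    """Energy of the full convolution of frame s with impulse response h."""
    y = [0.0] * (len(s) + len(h) - 1)
    for i, si in enumerate(s):
        for j, hj in enumerate(h):
            y[i + j] += si * hj
    return sum(v * v for v in y)


def filtered_energy_from_acf(R, H):
    """Equation (1): R'0 = R0*H0 + 2 * sum_{i>=1} R_i * H_i."""
    return R[0] * H[0] + 2.0 * sum(r * g for r, g in zip(R[1:], H[1:]))


if __name__ == "__main__":
    # Hypothetical frame and a second-order impulse response (1, h0, h1).
    s = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0]
    h = [1.0, -2.0, 1.0]
    order = len(h) - 1
    R = autocorr(s, order)   # signal autocorrelation coefficients R_i
    H = autocorr(h, order)   # impulse-response autocorrelation coefficients H_i
    print(filtered_energy_direct(s, h), filtered_energy_from_acf(R, H))
```

The direct route costs roughly one multiplication per tap per sample, while the autocorrelation route costs N + 1 multiplications per frame once the coefficients $R_i$ are available, which is the saving the text refers to.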
`
`
`
`
This filtering operation may alternatively be viewed as a form of spectrum comparison, with the signal spectrum being matched against a reference spectrum (the inverse of the response of the notional filter). Since the notional filter in this application is selected so as to approximate the inverse of the noise spectrum, this operation may be viewed as a spectral comparison between speech and noise spectra, and the zeroth autocorrelation coefficient thus generated (i.e. the energy of the inverse filtered signal) as a measure of dissimilarity between the spectra. The Itakura-Saito distortion measure is used in LPC to assess the match between the predictor filter and the input spectrum, and in one form is expressed as

$$
M = R_0 A_0 + 2 \sum_{i=1}^{N} R_i A_i,
$$

where $A_0$ etc. are the autocorrelation coefficients of the LPC parameter set. It will be seen that this is closely similar to the relationship derived above, and when it is remembered that the LPC coefficients are the taps of an FIR filter having the inverse spectral response of the input signal, so that the LPC coefficient set is the impulse response of the inverse LPC filter, it will be apparent that the Itakura-Saito Distortion Measure is in fact merely a form of equation 1, wherein the filter response H is the inverse of the spectral shape of an all-pole model of the input signal.

In fact, it is also possible to transpose the spectra, using the LPC coefficients of the test spectrum and the autocorrelation coefficients of the reference spectrum, to obtain a different measure of spectral similarity.

The I-S Distortion measure is further discussed in "Speech Coding based upon Vector Quantisation" by A Buzo, A H Gray, R M Gray and J D Markel, IEEE Trans on ASSP, Vol ASSP-28, No 5, October 1980.

Since the frames of signal have only a finite length, and a number of terms (N, where N is the filter order) are neglected, the above result is an approximation only; it gives, however, a surprisingly good indicator of the presence or absence of speech and thus may be used as a measure M in speech detection. In an environment where the noise spectrum is well known and stationary, it is quite possible to simply employ fixed $h_0$, $h_1$ etc. coefficients to model the inverse noise filter.

However, apparatus which can adapt to different noise environments is much more widely useful.

Referring to FIG. 1, in a first embodiment, a signal from a microphone (not shown) is received at an input 1 and converted to digital samples s at a suitable sampling rate by an analogue to digital converter 2. An LPC analysis unit 3 (in a known type of LPC coder) then derives, for successive frames of n (e.g. 160) samples, a set of N (e.g. 8 or 12) LPC filter coefficients $L_i$ which are transmitted to represent the input speech. The speech signal s also enters a correlator unit 4 (normally part of the LPC coder 3, since the autocorrelation vector $R_i$ of the speech is also usually produced as a step in the LPC analysis, although it will be appreciated that a separate correlator could be provided). The correlator 4 produces the autocorrelation vector $R_i$, including the zero order correlation coefficient $R_0$ and at least 2 further autocorrelation coefficients $R_1$, $R_2$, $R_3$. These are then supplied to a multiplier unit 5.

A second input 11 is connected to a second microphone located distant from the speaker so as to receive only background noise. The input from this microphone is converted to a digital input sample train by AD converter 12 and LPC analysed by a second LPC analyser 13. The "noise" LPC coefficients produced from analyser 13 are passed to correlator unit 14, and the autocorrelation vector thus produced is multiplied term by term with the autocorrelation coefficients $R_i$ of the input signal from the speech microphone in multiplier 5, and the weighted coefficients thus produced are combined in adder 6 according to Equation 1, so as to apply a filter having the inverse shape of the noise spectrum from the noise-only microphone (which in practice is the same as the shape of the noise spectrum in the signal-plus-noise microphone) and thus filter out most of the noise. The resulting measure M is thresholded by thresholder 7 to produce a logic output 8 indicating the presence or absence of speech; if M is high, speech is deemed to be present.

This embodiment does, however, require two microphones and two LPC analysers, which adds to the expense and complexity of the equipment necessary.

Alternatively, another embodiment uses a corresponding measure formed using the autocorrelations from the noise microphone 11 and the LPC coefficients from the main microphone 1, so that an extra autocorrelator rather than an LPC analyser is necessary.

These embodiments are therefore able to operate within different environments having noise at different frequencies, or within a changing noise spectrum in a given environment.

Referring to FIG. 2, in the preferred embodiment of the invention, there is provided a buffer 15 which stores a set of LPC coefficients (or the autocorrelation vector of the set) derived from the microphone input 1 in a period identified as being a "non speech" (i.e. noise only) period. These coefficients are then used to derive a measure using equation 1, which also of course corresponds to the Itakura-Saito Distortion Measure, except that a single stored frame of LPC coefficients corresponding to an approximation of the inverse noise spectrum is used, rather than the present frame of LPC coefficients.

The LPC coefficient vector $L_i$ output by analyser 3 is also routed to a correlator 14, which produces the autocorrelation vector of the LPC coefficient vector. The buffer memory 15 is controlled by the speech/non-speech output of thresholder 7, in such a way that during "speech" frames the buffer retains the "noise" autocorrelation coefficients, but during "noise" frames a new set of LPC coefficients may be used to update the buffer, for example by a multiple switch 16, via which outputs of the correlator 14, carrying each autocorrelation coefficient, are connected to the buffer 15. It will be appreciated that correlator 14 could be positioned after buffer 15. Further, the speech/no-speech decision for coefficient update need not be from output 8, but could be (and preferably is) otherwise derived.

Since frequent periods without speech occur, the LPC coefficients stored in the buffer are updated from time to time, so that the apparatus is thus capable of tracking changes in the noise spectrum. It will be appreciated that such updating of the buffer may be necessary only occasionally, or may occur only once at the start of operation of the detector, if (as is often the case) the noise spectrum is relatively stationary over time, but in a mobile radio environment frequent updating is preferred.
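The FIG. 2 arrangement can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patented implementation: the class name `NoiseTemplateVAD` is hypothetical, the threshold is fixed, the inverse filter taps for each frame are supplied by the caller (standing in for LPC analyser 3), and the buffer update is gated by the detector's own output, which the text describes as possible but not preferred.

```python
def autocorr(x, max_lag):
    """Un-normalised autocorrelation: R[k] = sum_n x[n] * x[n + k], k = 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


class NoiseTemplateVAD:
    """Sketch of the FIG. 2 detector with a fixed threshold (an assumption)."""

    def __init__(self, initial_taps, threshold):
        # initial_taps: inverse-filter impulse response, e.g. a simple high-pass
        self.order = len(initial_taps) - 1
        # Role of buffer 15: autocorrelation vector of the stored "noise" taps.
        self.template = autocorr(initial_taps, self.order)
        self.threshold = threshold

    def measure(self, R):
        """Equation 1 with the stored template: M = R0*A0 + 2 * sum R_i*A_i."""
        A = self.template
        return R[0] * A[0] + 2.0 * sum(r * a for r, a in zip(R[1:], A[1:]))

    def process(self, frame, frame_taps):
        """Classify one frame; frame_taps are that frame's inverse-filter taps."""
        R = autocorr(frame, self.order)          # role of correlator 4
        M = self.measure(R)                      # roles of multiplier 5, adder 6
        speech = M > self.threshold              # role of thresholder 7
        if not speech:
            # Role of switch 16: refresh the buffer only during "noise" frames.
            self.template = autocorr(frame_taps, self.order)
        return speech, M


if __name__ == "__main__":
    vad = NoiseTemplateVAD([1.0, -1.0], threshold=5.0)   # first-difference high-pass
    print(vad.process([1.0, 1.0, 1.0, 1.0], [1.0, -0.75]))    # slowly varying frame
    print(vad.process([1.0, -1.0, 1.0, -1.0], [1.0, 0.75]))   # rapidly varying frame
```

Starting from a fixed high-pass response and then switching to coefficients captured in noise frames keeps the sketch usable before any noise frame has been seen.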
`
`
`
`
In a modification of this embodiment, the system initially employs equation 1 with coefficient terms corresponding to a simple fixed high pass filter, and then subsequently starts to adapt by switching over to using "noise period" LPC coefficients. If, for some reason, speech detection fails, the system may return to using the simple high pass filter.

It is possible to normalise the above measure by dividing through by $R_0$, so that the expression to be thresholded has the form

$$
M = A_0 + 2 \sum_{i=1}^{N} \frac{R_i A_i}{R_0}
$$

This measure is independent of the total signal energy in a frame and is thus compensated for gross signal level changes, but gives rather less marked contrast between "noise" and "speech" levels and is hence preferably not employed in high-noise environments.

Instead of employing LPC analysis to derive the inverse filter coefficients of the noise signal (from either the noise microphone or noise only periods, as in the various embodiments described above), it is possible to model the inverse noise spectrum using an adaptive filter of known type; as the noise spectrum changes only slowly (as discussed below), a relatively slow coefficient adaption rate common for such filters is acceptable. In one embodiment, which corresponds to FIG. 1, LPC analysis unit 13 is simply replaced by an adaptive filter (for example a transversal FIR or lattice filter), connected so as to whiten the noise input by modelling the inverse filter, and its coefficients are supplied as before to autocorrelator 14.

In a second embodiment, corresponding to that of FIG. 2, LPC analysis means 3 is replaced by such an adaptive filter, and buffer means 15 is omitted, but switch 16 operates to prevent the adaptive filter from adapting its coefficients during speech periods.

A second Voice Activity Detector for use with another embodiment of the invention will now be described.

From the foregoing, it will be apparent that the LPC coefficient vector is simply the impulse response of an FIR filter which has a response approximating the inverse spectral shape of the input signal. When the Itakura-Saito Distortion Measure between adjacent frames is formed, this is in fact equal to the power of the signal, as filtered by the LPC filter of the previous frame. So if spectra of adjacent frames differ little, a correspondingly small amount of the spectral power of a frame will escape filtering and the measure will be low. Correspondingly, a large interframe spectral difference produces a high Itakura-Saito Distortion Measure, so that the measure reflects the spectral similarity of adjacent frames. In a speech coder, it is desirable to minimise the data rate, so frame length is made as long as possible; in other words, if the frame length is long enough, then a speech signal should show a significant spectral change from frame to frame (if it does not, the coding is redundant). Noise, on the other hand, has a slowly varying spectral shape from frame to frame, and so in a period where speech is absent from the signal the Itakura-Saito Distortion Measure will correspondingly be low, since applying the inverse LPC filter from the previous frame "filters out" most of the noise power.

Typically, the Itakura-Saito Distortion Measure between adjacent frames of a noisy signal containing intermittent speech is higher during periods of speech than periods of noise; the degree of variation (as illustrated by the standard deviation) is also higher, and less intermittently variable.

It is noted that the standard deviation of the standard deviation of M is also a reliable measure; the effect of taking each standard deviation is essentially to smooth the measure.

In this second form of Voice Activity Detector, the measured parameter used to decide whether speech is present is preferably the standard deviation of the Itakura-Saito Distortion Measure, but other measures of variance and other spectral distortion measures (based for example on FFT analysis) could be employed.

It is found advantageous to employ an adaptive threshold in voice activity detection. Such thresholds must not be adjusted during speech periods or the speech signal will be thresholded out. It is accordingly necessary to control the threshold adapter using a speech/non-speech control signal, and it is preferable that this control signal should be independent of the output of the threshold adapter. The threshold T is adaptively adjusted so as to keep the threshold level just above the level of the measure M when noise only is present. Since the measure will in general vary randomly when noise is present, the threshold is varied by determining an average level over a number of blocks, and setting the threshold at a level proportional to this average. In a noisy environment this is not usually sufficient, however, and so an assessment of the degree of variation of the parameter over several blocks is also taken into account.

The threshold value T is therefore preferably calculated according to

$$
T = M' + K \cdot d
$$

where M' is the average value of the measure over a number of consecutive frames, d is the standard deviation of the measure over those frames, and K is a constant (which may typically be 2).

In practice, it is preferred not to resume adaptation immediately after speech is indicated to be absent, but to wait to ensure the fall is stable (to avoid rapid repeated switching between the adapting and non-adapting states).

Referring to FIG. 3, in a preferred embodiment of the invention incorporating the above aspects, an input 1 receives a signal which is sampled and digitised by analogue to digital converter (ADC) 2, and supplied to the input of an inverse filter analyser 3, which in practice is part of a speech coder with which the voice activity detector is to work, and which generates coefficients $L_i$ (typically 8) of a filter corresponding to the inverse of the input signal spectrum. The digitised signal is also supplied to an autocorrelator 4 (which is part of analyser 3), which generates the autocorrelation vector $R_i$ of the input signal (or at least as many low order terms as there are LPC coefficients). Operation of these parts of the apparatus is as described in FIGS. 1 and 2. Preferably, the autocorrelation coefficients $R_i$ are then averaged over several successive speech frames (typically 5-20 ms long) to improve their reliability. This may be achieved by storing each set of autocorrelation coefficients output by autocorrelator 4 in a buffer 4a, and employing an averager 4b to produce a weighted sum of the current autocorrelation coefficients $R_i$ and those from previous frames stored in and supplied from
`
`
`
`
buffer 4a. The averaged autocorrelation coefficients $Ra_i$ thus derived are supplied to weighting and adding means 5,6 which receives also the autocorrelation vector $A_i$ of stored noise-period inverse filter coefficients $L_i$ from an autocorrelator 14 via buffer 15, and forms from $Ra_i$ and $A_i$ a measure M preferably defined as:

$$
M = A_0 + 2 \sum \frac{Ra_i A_i}{R_0},
$$

This measure is then thresholded by thresholder 7 against a threshold level, and the logical result provides an indication of the presence or absence of speech at output 8.

In order that the inverse filter coefficients $L_i$ correspond to a fair estimate of the inverse of the noise spectrum, it is desirable to update these coefficients during periods of noise (and, of course, not to update during periods of speech). It is, however, preferable that the speech/non-speech decision on which the updating is based does not depend upon the result of the updating, or else a single wrongly identified frame of signal may result in the voice activity detector subsequently going "out of lock" and wrongly identifying following frames. Preferably, therefore, there is provided a control signal generating circuit 20, effectively a separate voice activity detector, which forms an independent control signal indicating the presence or absence of speech to control inverse filter analyser 3 (or buffer 15) so that the inverse filter autocorrelation coefficients $A_i$ used to form the measure M are only updated during "noise" periods. The control signal generator circuit 20 includes LPC analyser 21 (which again may be part of a speech coder and, specifically, may be performed by analyser 3), which produces a set of LPC coefficients $M_i$ corresponding to the input signal, and an autocorrelator 21a (which may be performed by autocorrelator 3a) which derives the autocorrelation coefficients $B_i$ of $M_i$. If analyser 21 is performed by analyser 3, then $M_i = L_i$ and $B_i = A_i$. These autocorrelation coefficients are then supplied to weighting and adding means 22, 23 (equivalent to 5, 6) which receive also the autocorrelation vector $R_i$ of the input signal from autocorrelator 4. A measure of the spectral similarity between the input speech frame and the preceding speech frame is thus calculated; this may be the Itakura-Saito distortion measure between $R_i$ of the present frame and $B_i$ of the preceding frame, as disclosed above, or it may instead be derived by calculating the Itakura-Saito distortion measure for $R_i$ and $B_i$ of the present frame, and subtracting (in subtractor 25) the corresponding measure for the previous frame stored in buffer 24, to generate a spectral difference signal (in either case, the measure is preferably energy-normalised by dividing by $R_0$). The buffer 24 is then, of course, updated. This spectral difference signal, when thresholded by a thresholder 26 is, as discussed above, an indicator of the presence or absence of speech. We have found, however, that although this measure is excellent for distinguishing noise from unvoiced speech (a task which prior art systems are generally incapable of) it is in general rather less able to distinguish noise from voiced speech. Accordingly, there is preferably further provided within circuit 20 a voiced speech detection circuit comprising a pitch analyser 27 (which in practice may operate as part of a speech coder, and in particular may measure the long term predictor lag value produced in a multipulse LPC coder). The pitch analyser 27 produces a logic signal which is "true" when voiced speech is detected, and this signal, together with the threshold measure derived from thresholder 26 (which will generally be "true" when unvoiced speech is present) are supplied to the inputs of a NOR gate 28 to generate a signal which is "false" when speech is present and "true" when noise is present. This signal is supplied to buffer 15 (or to inverse filter analyser 3) so that inverse filter coefficients $L_i$ are only updated during noise periods.

Threshold adapter 29 is also connected to receive the non-speech signal control output of control signal generator circuit 20. The output of the threshold adapter 29 is supplied to thresholder 7. The threshold adapter operates to increment or decrement the threshold in steps which are a proportion of the instant threshold value, until the threshold approximates the noise power level (which may conveniently be derived from, for example, weighting and adding circuits 22, 23). When the input signal is very low, it may be desirable that the threshold is automatically set to a fixed, low, level since at the low signal levels the effect of signal quantisation produced by ADC 2 can produce unreliable results.

There may be further provided "hangover" generating means 30, which operates to measure the duration of indications of speech after thresholder 7 and, when the presence of speech has been indicated for a period in excess of a predetermined time constant, the output is held high for a short "hangover" period. In this way, clipping of the middle of low-level speech bursts is avoided, and appropriate selection of the time constant prevents triggering of the hangover generator 30 by short spikes of noise which are falsely indicated as speech. It will of course be appreciated that all the above functions may be executed by a single suitably programmed digital processing means such as a Digital Signal Processing (DSP) chip, as part of an LPC codec thus implemented (this is the preferred implementation), or as a suitably programmed microcomputer or microcontroller chip with an associated memory device.

Conveniently, as described above, the voice detection apparatus may be implemented as part of an LPC codec. Alternatively, where autocorrelation coefficients of the signal or related measures (partial correlation, or "parcor", coefficients) are transmitted to a distant station, the voice detection may take place distantly from the codec.

I claim:

1. Voice activity detection apparatus comprising:
(i) means for receiving an electrical input signal in which the presence or absence of signals representing speech is to be detected;
(ii) means responsive to said means for receiving for periodically adaptively generating an electrical signal representing an estimated noise signal component of the input signal by producing the autocorrelation coefficients $A_i$ of the impulse response of a FIR filter having a response approximating the inverse of the short term spectrum of the noise signal component;
(iii) means responsive to said means for receiving for periodically forming from the input signal and the estimated noise representing signal an electrical signal representing a measure M of the spectral similarity between a portion of the input signal and the said estimated noise signal component, said measure forming means comprises means for producing electrical signals representing the autocorrelation coefficients $R_i$ of the input signal, and means connected to receive $R_i$ and $A_i$ signals, and to calculate the measure M therefrom; and
(iv) electrical means responsive to said means for forming for comparing the electrical signals representing said measure with a threshold value representing signal to produce an electrical output indicating the presence or absence of speech in the electrical input signal.

2. Apparatus according to claim 1, further comprising an input arranged to receive a second electrical input signal, similarly subject to noise, from which speech is absent; in which the generating means comprise LPC [...]

[...]

$$
M = R_0 A_0 + 2 \sum R_i A_i.
$$

6. Apparatus according to claim 1 or 4, in which [...]

[...] between a portion of the input signal and earlier portions of the input signal.

14. Apparatus according to claim 13 in which the similarity measure generating means of said second voice activity detection means comprises means for providing, from LPC filter data and autocorrelation data relating to a present portion of the input signal, a present distortion measure; means for providing an equivalent past frame distortion measure corresponding to a preceding portion of the input signal, and means for generating a signal indicating the degree of similarity therebetween as an indicator of speech presence or absence.
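The present-versus-past distortion comparison recited in claim 14 might be sketched in Python as below. All names are hypothetical, and several choices are assumptions made for illustration: the measure is energy-normalised by $R_0$ as the description prefers, the similarity signal is taken as the absolute difference between the present and stored measures, and a fixed threshold is used in place of any adaptive one.

```python
def autocorr(x, max_lag):
    """Un-normalised autocorrelation: R[k] = sum_n x[n] * x[n + k], k = 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def normalised_measure(R, B):
    """Energy-normalised measure (R0*B0 + 2 * sum R_i*B_i) / R0; 0.0 for a silent frame."""
    if R[0] == 0:
        return 0.0
    return (R[0] * B[0] + 2.0 * sum(r * b for r, b in zip(R[1:], B[1:]))) / R[0]


class SpectralChangeDetector:
    """Sketch of the second detector: flags frames whose measure changes sharply."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.previous = None          # stored past-frame measure

    def process(self, R, taps):
        """R: autocorrelation of the present frame; taps: its LPC inverse-filter taps."""
        B = autocorr(taps, len(R) - 1)            # autocorrelation of the taps
        present = normalised_measure(R, B)        # present distortion measure
        # Assumption: absolute difference serves as the similarity indicator.
        diff = 0.0 if self.previous is None else abs(present - self.previous)
        self.previous = present                   # update the stored measure
        return diff > self.threshold, diff


if __name__ == "__main__":
    det = SpectralChangeDetector(threshold=0.5)
    print(det.process([4.0, 3.0], [1.0, -0.75]))  # first frame, nothing to compare
    print(det.process([4.0, 3.0], [1.0, -0.75]))  # unchanged spectrum
    print(det.process([4.0, 0.0], [1.0, 0.0]))    # changed spectrum
```

In a fuller arrangement this decision would be combined with a voiced-speech indication before gating any coefficient update, as the description explains.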