United States Patent [19]
Bossemeyer, Jr.

[11] Patent Number: 6,012,027
[45] Date of Patent: Jan. 4, 2000

US006012027A

[54] CRITERIA FOR USABLE REPETITIONS OF AN UTTERANCE DURING SPEECH REFERENCE ENROLLMENT

[75] Inventor: Robert Wesley Bossemeyer, Jr., St. Charles, Ill.

[73] Assignee: Ameritech Corporation, Hoffman Estates, Ill.

[21] Appl. No.: 08/932,073

[22] Filed: Sep. 17, 1997

Related U.S. Application Data

[63] Continuation-in-part of application No. 08/863,462, May 27, 1997.

[51] Int. Cl.7 .................................. G10L 5/06
[52] U.S. Cl. .................................. 704/243; 704/248; 704/253
[58] Field of Search .......................... 704/239, 243, 248, 251, 253, 231, 238
`56
`
`[
`
`l
`
`C't d
`R f
`l e
`e erences
`US. PATENT DOCUMENTS
`
`313;: 32:2;et al' """"""""""""""" 38:33:
`$3323;
`4,618,984 10/1986 Das etal ..... 704/244
`4,694,493
`9/1987 Sakoe.
`4,773,093
`9/1988 Higgins et a1.
`.
`4,797,929
`1/1989 Gerson et a1.
`4,912,766
`3/1990 Forse ....................................... 704/225
`
`.
`
`350
`
`........................ 704/251
`
`11/1990 Dautrich et al.
`4,972,485
`6/1993 Bahl et al- ~
`5,222,146
`11/1993 McNair.
`5,265,191
`12/1993 Green .
`5,274,695
`3/1994 Picone et a1.
`.
`5,293,452
`9/1995 Ittycheriah et al..
`5,452,397
`2/1996 Jakatdar
`.................................. 704/251
`5,495,553
`2/1997 Chou et al.
`.
`5,606,644
`.
`4/1997 Chow et a1.
`5,617,486
`9/1997 Vysotsky ................................. 704/243
`5,664,058
`............. 235/472.03
`5,698,834 12/1997 Worthington et a1.
`
`2/1998 Gammel .................... 379/8803
`5,717,738
`5,774,841
`6/1998 Salazar et al.
`.......................... 704/225
`Primary Examiner—David R. Hudspeth
`Assistant Examiner—Donald L. Storm
`Attorney, Agent, or Firm—Bruce E. Stuckman; Dale B.
`Halling
`[57]
`
`ABSTRACT
`
`.
`A speech reference enrollment method 1nvolves the follow-
`ing steps: (a) requesting a user speak a vocabulary word; (b)
`detecting a first utterance (354); (c) requesting the user
`speak the vocabulary word; (d) detecting a second utterance
`(358); (e) determining a first similarity between the first
`utterance and the second utterance (362); (f) when the first
`similarity is less than a predetermined similarity, requesting
`the user speak the vocabulary word; (g) detecting a third
`utterance (366); (h) determining a second similarity between
`the fir“ utterance and the third utterance (370); and (1) When
`the second similarity is greater than or equal to the prede-
`termined similarity, creating a reference (364).
`
`20 Claims, 15 Drawing Sheets
`
`352
`
`
`
`354
` 356
`358
`
`
`
`
`
`
`
`DETERMINE FIRST SIMILARITY BETWEEN
`FEATURES -1 AND FEATURES -2
`
`
`
`IS
`FIRST
`<
`SIMILARITY
`PREDETERMINED
`SIMILARY
`a.
`
`YES
`
`
`
`THIRD UTTERANCE
`AND FEATURES -3
`
`NO
`
`364
`
`362
`
`FORM
`
`REFERENCE
`
`
`
`
`
`DETERMINE SECOND SIMILARITY BETWEEN
`FEATURES -1 AND FEATURES -3
`
`IS
`SECOND
`
`368
`
`N0
`
`SIMI|;ARITY
`PREDE‘FERMINED
`SIMILARY
`a
`
`
`YES
`
`370
`
`DETERMINE THIRD SIMILARITY BETWEEN /
`FEATURES -2 AND FEATURES -3
`
`372
`
`
`
`376
`IS
`THIRD
`
`SIMILARITY
`FORM
`2
`REFERENCE
`PREDETERMINED
`
`
`
`SIMILARY
`'7
`374
`
`
`
`NO
`
`START ENROLLMENT OVER
`
`378
`
`
`
`Page
`
`10f22
`
`GOOGLE EXHIBIT 1004
`
`Page 1 of 22
`
`GOOGLE EXHIBIT 1004
`
`
`
[Sheet 1 of 15, FIG. 1: block diagram of an embodiment of a speaker verification system 10, showing a feature extractor, code book generator, code book, comparator, and decision weighting and combining blocks.]
`
`
`
`
`
`
`
`
`
[Sheet 2 of 15, FIG. 2: flow chart of the steps used to form a speaker verification decision (steps 40-54).]
`
`
`
[Sheet 3 of 15, FIG. 3: flow chart of code book generation, segmenting speech into voiced and unvoiced sounds, extracting features (e.g., cepstrums) from voiced sounds, and storing the training utterance.]
`
`
`
[Sheet 4 of 15, FIG. 4: flow chart of the verification decision (steps 100-118).]
`
`
`
[Sheet 5 of 15, FIG. 5: schematic diagram of a dial-up service incorporating a speaker verification method.]
`
`
`
[Sheet 6 of 15, FIG. 6: flow chart of the steps used in a dial-up service (steps 170-186).]
`
`
`
[Sheet 7 of 15, FIG. 7: flow chart of the steps used in another embodiment of a dial-up service (steps 200-224).]
`
`
`
[Sheet 8 of 15, FIG. 8: block diagram of a speech reference system in an intelligent network phone system.]
`
`
`
`
[Sheet 9 of 15, FIG. 9a: flow chart of the speech reference enrollment method (steps 350-370).]
`
`
`
[Sheet 10 of 15, FIG. 9b: flow chart of the speech reference enrollment method, continued (steps 372-378).]
`
`
`
[Sheet 11 of 15, FIG. 10: flow chart of the utterance duration check (steps 400-410).]
`
`
`
[Sheet 12 of 15, FIG. 11: flow chart of the signal-to-noise ratio check (steps 420-430).]
`
`
`
[Sheet 13 of 15, FIGS. 12 and 13: graphs of the amplitude of an utterance versus time and of the number of voiced speech frames versus time.]
`
`
`
`
`
`
[Sheet 14 of 15, FIG. 14: amplitude histogram of an utterance.]
`
`
`
[Sheet 15 of 15, FIG. 15: block diagram of an automatic gain control circuit.]
`
`
`
`
`
`
`
`6,012,027
`
`1
`CRITERIA FOR USABLE REPETITIONS OF
`AN UTTERANCE DURING SPEECH
`REFERENCE ENROLLMENT
`
This application is a continuation-in-part of the patent application having Ser. No. 08/863,462, filed May 27, 1997, entitled "Method of Accessing a Dial-up Service" and assigned to the same assignee as the present application.
`
`FIELD OF THE INVENTION
`
`The present invention is related to the field of speech
`recognition systems and more particularly to a speech ref-
`erence enrollment method.
`
`BACKGROUND OF THE INVENTION
`
Both speech recognition and speaker verification applications often use an enrollment process to obtain reference speech patterns for later use. Speech recognition systems that use an enrollment process are generally speaker dependent systems. Both speech recognition systems using an enrollment process and speaker verification systems will be referred to herein as speech reference systems. The performance of speech reference systems is limited by the quality of the reference patterns obtained in the enrollment process. Prior art enrollment processes ask the user to speak the vocabulary word being enrolled and use the extracted features as the reference pattern for the vocabulary word. These systems suffer from unexpected background noise occurring while the user is uttering the vocabulary word during the enrollment process. This unexpected background noise is then incorporated into the reference pattern. Since the unexpected background noise does not occur every time the user utters the vocabulary word, it degrades the speech reference system's ability to match the reference pattern with a subsequent utterance.
`Thus there exists a need for an enrollment process for
`speech reference systems that does not incorporate unex-
`pected background noise in the reference patterns.
`
`SUMMARY OF THE INVENTION
`
`A speech reference enrollment method that overcomes
`these and other problems involves the following steps: (a)
`requesting a user speak a vocabulary word; (b) detecting a
`first utterance; (c) requesting the user speak the vocabulary
`word; (d) detecting a second utterance; (e) determining a
`first similarity between the first utterance and the second
`utterance; (f) when the first similarity is less than a prede-
`termined similarity, requesting the user speak the vocabulary
`word; (g) detecting a third utterance; (h) determining a
`second similarity between the first utterance and the third
`utterance; and (i) when the second similarity is greater than
`or equal to the predetermined similarity, creating a refer-
`ence.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIG. 1 is a block diagram of an embodiment of a speaker
`verification system;
`FIG. 2 is a flow chart of an embodiment of the steps used
`to form a speaker verification decision;
`FIG. 3 is a flow chart of an embodiment of the steps used
`to form a code book for a speaker verification decision;
`FIG. 4 is a flow chart of an embodiment of the steps used
`to form a speaker verification decision;
`FIG. 5 is a schematic diagram of a dial-up service that
`incorporates a speaker verification method;
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`2
`FIG. 6 is a flow chart of an embodiment of the steps used
`in a dial-up service;
`FIG. 7 is a flow chart of an embodiment of the steps used
`in a dial-up service;
`FIG. 8 is a block diagram of a speech reference system
`using a speech reference enrollment method according to the
`invention in an intelligent network phone system;
FIGS. 9a & b are flow charts of an embodiment of the steps used in the speech reference enrollment method;
`FIG. 10 is a flow chart of an embodiment of the steps used
`in an utterance duration check;
`FIG. 11 is a flow chart of an embodiment of the steps used
`in a signal to noise ratio check;
`FIG. 12 is a graph of the amplitude of an utterance versus
`time;
`FIG. 13 is a graph of the number of voiced speech frames
`versus time for an utterance;
`FIG. 14 is an amplitude histogram of an utterance; and
`FIG. 15 is a block diagram of an automatic gain control
`circuit.
`
`DETAILED DESCRIPTION OF THE DRAWINGS
`
A speech reference enrollment method as described herein can be used for both speaker verification methods and speech recognition methods. Several improvements in
`speaker verification methods that can be used in conjunction
`with the speech enrollment method are first described. Next
`a dial-up service that takes advantage of the enrollment
`method is described. The speech enrollment method is then
`described in detail.
`
`FIG. 1 is a block diagram of an embodiment of a speaker
verification system 10. It is important to note that the speaker verification system can be physically implemented in a number of ways. For instance, the system can be
`implemented as software in a general purpose computer
`connected to a microphone; or the system can be imple-
`mented as firmware in a general purpose microprocessor
`connected to memory and a microphone; or the system can
`be implemented using a Digital Signal Processor (DSP), a
`controller, a memory, and a microphone controlled by the
`appropriate software. Note that since the process can be
`performed using software in a computer, then a computer
`readable storage medium containing computer readable
`instructions can be used to implement the speaker verifica-
`tion method. These various system architectures are appar-
`ent to those skilled in the art and the particular system
`architecture selected will depend on the application.
A microphone 12 receives input speech and converts
`the sound waves to an electrical signal. A feature extractor
`14 analyzes the electrical signal and extracts key features of
`the speech. For instance, the feature extractor first digitizes
`the electrical signal. A cepstrum of the digitized signal is
`then performed to determine the cepstrum coefficients. In
`another embodiment, a linear predictive analysis is used to
`find the linear predictive coding (LPC) coefficients. Other
`feature extraction techniques are also possible.
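For illustration, a minimal sketch of cepstral feature extraction follows; the frame length, hop size, window choice, and coefficient count are assumptions and are not taken from the patent.

    import numpy as np

    def cepstrum_features(signal, frame_len=256, hop=128, n_coeffs=12):
        # Frame the digitized signal, window each frame, and take the real
        # cepstrum: the inverse FFT of the log magnitude spectrum. The first
        # n_coeffs values of each frame form that frame's feature vector.
        features = []
        window = np.hamming(frame_len)
        for start in range(0, len(signal) - frame_len + 1, hop):
            frame = signal[start:start + frame_len] * window
            log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)  # avoid log(0)
            features.append(np.fft.irfft(log_mag)[:n_coeffs])
        return np.array(features)  # shape: (n_frames, n_coeffs)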
`A switch 16 is shown attached to the feature extractor 14.
`
`This switch 16 represents that a different path is used in the
`training phase than in the verification phase. In the training
`phase the cepstrum coefficients are analyzed by a code book
`generator 18. The output of the code book generator 18 is
`stored in the code book 20. In one embodiment, the code
`book generator 18 compares samples of the same utterance
`from the same speaker to form a generalized representation
of the utterance for that person. This generalized representation is a training utterance in the code book. The training
`utterance represents the generalized cepstrum coefficients of
`a user speaking the number “one” as an example. A training
`utterance could also be a part of speech, a phoneme, or a
`number like “twenty one” or any other segment of speech.
`In addition to the registered users’ samples, utterances are
`taken from a group of non-users. These utterances are used
`to form a composite that represents an impostor code having
`a plurality of impostor references.
`In one embodiment, the code book generator 18 segre-
`gates the speakers (users and non-users) into male and
`female groups. The male enrolled references (male group)
are aggregated to determine a male variance vector. The
`female enrolled references (female group) are aggregated to
`determine a female variance vector. These gender specific
`variance vectors will be used when calculating a weighted
`Euclidean distance (measure of closeness) in the verification
`phase.
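A rough sketch of such a variance-weighted distance follows; inverse-variance weighting is a common formulation and an assumption here, not a formula quoted from the patent.

    import numpy as np

    def weighted_euclidean_distance(test, reference, variance):
        # Weight each squared component difference by the inverse of the
        # gender-specific variance for that component, so high-variance
        # (less reliable) feature dimensions contribute less.
        diff = np.asarray(test) - np.asarray(reference)
        return float(np.sum(diff * diff / np.asarray(variance)))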
`In the verification phase the switch 16 connects the
`feature extractor 14 to the comparator 22. The comparator
`22 performs a mathematical analysis of the closeness
between a test utterance from a speaker with an enrolled reference stored in the code book 20 and between the test utterance and an impostor reference distribution. In one
`embodiment, a test utterance such as a spoken “one” is
`compared with the “one” enrolled reference for the speaker
`and the “one” impostor reference distribution. The compara-
`tor 22 determines a measure of closeness between the “one”
`enrolled reference, the “one” test utterance and the “one”
`impostor reference distribution. When the test utterance is
closer to the enrolled reference than the impostor reference distribution, the speaker is verified as the true speaker.
`Otherwise the speaker is determined to be an impostor. In
`one embodiment, the measure of closeness is a modified
`weighted Euclidean distance. The modification in one
`embodiment involves using a generalized variance vector
instead of an individual variance vector for each of the registered users. In another embodiment, a male variance
`vector is used for male speakers and a female variance
`vector is used for a female speaker.
A decision weighting and combining system 24 uses the measure of closeness to determine if the test utterance is closest to the enrolled reference or the impostor reference distribution. When the test utterance is closer to the enrolled reference than the impostor reference distribution, a verified decision is made. When the test utterance is not closer to the enrolled reference than the impostor reference distribution, an unverified decision is made. These are preliminary decisions. Usually, the speaker is required to speak several utterances (e.g., "one", "three", "five", "twenty one"). A decision is made for each of these test utterances. Each of the plurality of decisions is weighted and combined to form the verification decision.
`
`The decisions are weighted because not all utterances
`provide equal reliability. For instance, “one” could provide
`a much more reliable decision than “eight”. As a result, a
`more accurate verification decision can be formed by first
`weighting the decisions based on the underlying utterance.
`Two weighting methods can be used. One weighting method
uses a historical approach. Sample utterances are compared to the enrolled references to determine a probability of false alarm PFA (the speaker is not an impostor but the decision is impostor) and a probability of miss PM (the speaker is an impostor but the decision is true speaker). The PFA and PM are probabilities of error and are used to weight each decision. In one embodiment the weighting factors (weights) are described by the equations below:
`
a_i = log[(1 - P_Mi) / P_FAi]        when decision i is verified (true speaker)

a_i = log[P_Mi / (1 - P_FAi)]        when decision i is not verified (impostor)
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
When the sum of the weighted decisions is greater than zero, then the verification decision is a true speaker. Otherwise the verification decision is an impostor.
The other method of weighting the decisions is based on an immediate evaluation of the quality of the decision. In one embodiment, this is calculated by using a Chi-squared detector. The decisions are then weighted on the confidence determined by the Chi-squared detector. In another embodiment, a large sample approximation is used. Thus if the test statistic is t, find b such that χ²(b) = t. Then a decision is an impostor if it exceeds the 1 - α quantile of the χ² distribution.
`One weighting scheme is shown below:
`
1.5, if b > c_accept
1.0, if 1 - α ≤ b ≤ c_accept
-1.0, if c_reject ≤ b ≤ 1 - α
-1.25, if b < c_reject
`When the sum of the weighted decisions is greater than
`zero, then the verification decision is a true speaker. When
the sum of the weighted decisions is less than or equal to zero,
`the decision is an impostor.
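The two weighting schemes can be sketched compactly as follows; the function names are illustrative, and c_accept, c_reject, and alpha are the acceptance and rejection bounds from the scheme above (with c_reject < 1 - alpha ≤ c_accept assumed).

    import math

    def historical_weight(verified, p_miss, p_fa):
        # Weight a preliminary decision by its historical error rates,
        # following the log-ratio equations above.
        if verified:
            return math.log((1.0 - p_miss) / p_fa)
        return math.log(p_miss / (1.0 - p_fa))

    def confidence_weight(b, alpha, c_accept, c_reject):
        # Weight a preliminary decision by the Chi-squared confidence scheme.
        if b > c_accept:
            return 1.5
        if b >= 1.0 - alpha:      # 1 - alpha <= b <= c_accept
            return 1.0
        if b >= c_reject:         # c_reject <= b <= 1 - alpha
            return -1.0
        return -1.25              # b < c_reject

    def is_true_speaker(weighted_decisions):
        # Verification decision: true speaker when the weighted sum is positive.
        return sum(weighted_decisions) > 0.0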
`In another embodiment, the feature extractor 14 segments
`the speech signal into voiced sounds and unvoiced sounds.
`Voiced sounds generally include vowels, while most other
sounds are unvoiced. The unvoiced sounds are discarded before the cepstrum coefficients are calculated in both the
`training phase and the verification phase.
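The patent does not specify the voiced/unvoiced classifier; a common energy and zero-crossing heuristic is sketched below, with thresholds that are assumptions.

    import numpy as np

    def is_voiced(frame, energy_thresh=0.01, zcr_thresh=0.25):
        # Voiced frames (vowel-like sounds) tend to have high short-time
        # energy and a low zero-crossing rate; unvoiced frames the reverse.
        energy = float(np.mean(frame ** 2))
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) / 2.0))
        return energy > energy_thresh and zcr < zcr_thresh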
`These techniques of weighting the decisions, using gender
`dependent cepstrums and only using voiced sounds can be
`combined or used separately in a speaker verification sys-
`tem.
`
`FIG. 2 is a flow chart of an embodiment of the steps used
`to form a speaker verification decision. The process starts, at
`step 40, by generating a code book at step 42. The code book
`has a plurality of enrolled references for each of the plurality
`of speakers (registered users, plurality of people) and a
`plurality of impostor references. The enrolled references in
one embodiment are the cepstrum coefficients for a particular user speaking a particular utterance (e.g., "one"). The enrolled references are generated by a user speaking the utterances. The cepstrum coefficients of each of the utterances are determined to form the enrolled references. In one embodiment a speaker is asked to repeat the utterance and a generalization of the two utterances is saved as the enrolled reference. In another embodiment both utterances are saved as enrolled references.
`
In one embodiment, a data base of male speakers is used to determine a male variance vector and a data base of female speakers is used to determine a female variance vector. In another embodiment, the data bases of male and female speakers are used to form a male impostor code book and a female impostor code book. The gender specific variance vectors are stored in the code book. At step 44, a plurality of test utterances (input set of utterances) from a speaker are received. In one embodiment the cepstrum coefficients of the test utterances are calculated. Each of the plurality of test utterances is compared to the plurality of enrolled references for the speaker at step 46. Based on the comparison, a plurality of decisions are formed, one for each of the plurality of enrolled references. In one embodiment, the comparison is determined by a weighted Euclidean distance between the test utterance and the enrolled reference and between the test utterance and an impostor reference distribution. In another embodiment, the weighted Euclidean distance is calculated with the male variance vector if the speaker is a male or the female variance vector if the speaker is a female. Each of the plurality of decisions is weighted to form a plurality of weighted decisions at step 48. The weighting can be based on historical error rates for the utterance or based on a confidence level (confidence measure) of the decision for the utterance. The plurality of weighted decisions are combined at step 50. In one embodiment the step of combining involves summing the weighted decisions. A verification decision is then made based on the combined weighted decisions at step 52, ending the process at step 54. In one embodiment if the sum is greater than zero, the verification decision is the speaker is a true speaker, otherwise the speaker is an impostor.
`FIG. 3 is a flow chart of an embodiment of the steps used
`to form a code book for a speaker verification decision. The
`process starts, at step 70, by receiving an input utterance at
step 72. In one embodiment, the input utterances are then segmented into voiced sounds and unvoiced sounds at step 74. The cepstrum coefficients are then calculated using the voiced sounds at step 76. The coefficients are stored as an enrolled reference for the speaker at step 78. The process
`then returns to step 72 for the next input utterance, until all
`the enrolled references have been stored in the code book.
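A minimal sketch of this loop (steps 70 through 78) follows, assuming an extract_features callable such as the cepstrum sketch above applied to the voiced sounds.

    def generate_code_book(training_utterances, extract_features):
        # training_utterances maps each speaker to a list of raw utterances.
        # Each utterance is reduced to features and stored as an enrolled
        # reference for that speaker (steps 72-78).
        code_book = {}
        for speaker, utterances in training_utterances.items():
            code_book[speaker] = [extract_features(u) for u in utterances]
        return code_book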
`
`FIG. 4 is a flow chart of an embodiment of the steps used
`to form a speaker verification decision. The process starts, at
`step 100, by receiving input utterances at step 102. Next, it
`is determined if the speaker is male or female at step 104. In
`a speaker verification application, the speaker purports to be
`someone in particular. If the person purports to be someone
`that is a male, then the speaker is assumed to be male even
if the speaker is a female. The input utterances are then segmented into voiced sounds and unvoiced sounds at step 106. Features (e.g., cepstrum coefficients) are extracted from the voiced sounds to form the test utterances, at step 108. At step 110, the weighted Euclidean distance (WED) is calculated using a generalized male variance vector if the purported speaker is a male. When the purported speaker is a female, the female variance vector is used. The WED is calculated between the test utterance and the enrolled reference for the speaker and between the test utterance and the male (or female if appropriate) impostor reference distribution. A decision is formed for each test utterance based on the WED at step 112. The decisions are then weighted based on a
`confidence level (measure of confidence) determined using
`a Chi-squared detector at step 114. The weighted decisions
`are summed at step 116. A verification decision is made
`based on the sum of the weighted decisions at step 118.
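A condensed sketch of this pipeline (steps 110 through 118) follows, reusing the weighted_euclidean_distance sketch above; feature vectors are assumed fixed-length, and weight_fn stands in for whichever weighting (historical or Chi-squared confidence) maps a preliminary accept/reject decision to a signed weight.

    def verify_speaker(test_utterances, enrolled_refs, impostor_refs,
                       variance, weight_fn):
        # For each test utterance, compare its distance to the enrolled
        # reference against its distance to the impostor reference
        # distribution (step 112), weight the decision (step 114), sum the
        # weights (step 116), and decide (step 118).
        weights = []
        for test, ref, imp in zip(test_utterances, enrolled_refs, impostor_refs):
            d_ref = weighted_euclidean_distance(test, ref, variance)
            d_imp = weighted_euclidean_distance(test, imp, variance)
            weights.append(weight_fn(d_ref < d_imp))
        return sum(weights) > 0.0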
`Using the speaker verification decisions discussed above
results in an improved speaker verification system that is
`more reliable than present techniques.
A dial-up service that uses a speaker verification method as described above is shown in FIG. 5. The dial-up service is shown as a banking service. A user dials a service number on their telephone 150. The public switched telephone network (PSTN) 152 then connects the user's phone 150 with a dial-up service computer 154 at a bank 156. The dial-up service need not be located within a bank. The service will be explained in conjunction with the flow chart shown in FIG. 6. The process starts, at step 170, by dialing a service number (communication service address, number) at step 172. The user (requester) is then prompted by the computer 154 to speak a plurality of digits (access code, plurality of numbers, access number) to form a first utterance (first digitized utterance) at step 174. The digits are recognized using speaker independent voice recognition at step 176. When the user has used the dial-up service previously, the user is verified based on the first utterance at step 178. When the user is verified as a true speaker at step 178, access to the dial-up service is allowed at step 180. When the user cannot be verified, the user is requested to input a personal identification number (PIN) at step 182. The PIN can be entered by the user either by speaking the PIN or by entering the PIN on a keypad. At step 184 it is determined if the PIN is valid. When the PIN is not valid, the user is denied access at step 186. When the PIN is valid the user is allowed access to the service at step 180. Using the above method the dial-up service uses a speaker verification system as a PIN option, but does not deny access to the user if it cannot verify the user.
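The access logic of FIG. 6 can be sketched as follows; the callable names are illustrative placeholders for the steps above.

    def dial_up_access(first_utterance, previous_user, verify, request_pin,
                       pin_is_valid):
        # Returning users are verified by voice first (step 178); everyone
        # else falls back to a PIN (steps 182-184), so a failed voice
        # verification never locks out a legitimate user.
        if previous_user and verify(first_utterance):
            return True                      # step 180: allow access
        return pin_is_valid(request_pin())   # steps 182-186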
`FIG. 7 is a flow chart of another embodiment of the steps
`used in a dial-up service. The process starts, step 200, by the
`user speaking an access code to form a plurality of utter-
`ances at step 202. At step 204 it is determined if the user has
previously accessed the service. When the user has previously used the service, the speaker verification system attempts to verify the user (identity) at step 206. When the
`speaker verification system can verify the user, the user is
`allowed access to the system at step 208. When the system
`cannot verify the user, a PIN is requested at step 210. Note
`the user can either speak the PIN or enter the PIN on a
`keypad. At step 212 it is determined if the PIN is valid.
`When the PIN is not valid the user is denied access at step
`214. When the PIN is valid, the user is allowed access at step
`208.
`
`When the user has not previously accessed the commu-
`nication service at step 204, the user is requested to enter a
PIN at step 216. At step 218 it is determined if the PIN is valid. When the PIN is not valid, access to the service is denied at step 220. When the PIN is valid the user is
`asked to speak the access code a second time to form a
`second utterance (plurality of second utterances, second
`digitized utterance) at step 222. The similarity between the
`first utterance (step 202) and the second utterance is com-
`pared to a threshold at step 224. In one embodiment the
`similarity is calculated using a weighted Euclidean distance.
`When the similarity is less than or equal to the threshold, the
`user is asked to speak the access code again at step 222. In
`this case the second and third utterances would be compared
`for the required similarity. In practice, the user would not be
`required to repeat the access code at step 222 more than once
`or twice and the system would then allow the user access.
When the similarity is greater than the threshold, a combination of the two utterances is stored as a reference utterance at step 226. In another embodiment both utterances are stored as enrolled references. Next access to the service is allowed at step 208. The
`enrolled reference is used to verify the user the next time
`they access the service. Note that the speaker verification
`part of the access to the dial-up service in one embodiment
`uses all the techniques discussed for a verification process.
`In another embodiment the verification process only uses
one of the speaker verification techniques. Finally, in
`another embodiment the access number has a predetermined
`digit that is selected from a first set of digits (predefined set
`of digits) if the user is a male. When the user is a female, the
`predetermined digit is selected from a second set of digits.
This allows the system to determine if the user is supposed to
`be a male or a female. Based on this information, the male
`variance vector or female variance vector is used in the
`
`speaker verification process.
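The similarity-gated enrollment at steps 222 through 226 above can be sketched like this; get_utterance and similarity are illustrative callables, and max_retries reflects the once-or-twice note above.

    def enroll_access_code(get_utterance, similarity, threshold, max_retries=2):
        # Repeat the access code until two consecutive utterances are
        # similar enough, then store a combination of the two as the
        # enrolled reference (step 226).
        previous = get_utterance()
        for _ in range(max_retries):
            current = get_utterance()
            if similarity(previous, current) > threshold:
                return [(a + b) / 2.0 for a, b in zip(previous, current)]
            previous = current  # next pass compares the second and third, etc.
        return None  # in practice the system would allow access anyway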
`
`
`
`
`6,012,027
`
`7
`FIG. 8 is a block diagram of a speech reference system
`300 using a speech reference enrollment method according
`to the invention in an intelligent network phone system 302.
`The speech reference system 300 can perform speech rec-
`ognition or speaker verification. The speech reference sys-
`tem 300 is implemented in a service node or intelligent
`peripheral (SN/IP). When the speech reference system 300
`is implemented in a service node, it is directly connected to
`a telephone central office—service switching point (CO/
`SSP) 304—308. The central office—service switching points
`304—308 are connected to a plurality of telephones 310—320.
`When the speech reference system 300 is implemented in an
`intelligent peripheral, it is connected to a service control
`point (SCP) 322. In this scheme a call from one of the
`plurality of telephones 310—320 invoking a special feature,
`such as speech recognition, requires processing by the
`service control point 322. Calls requiring special processing
`are detected at CO/SSP 304—308. This triggers the CO/SSP
`304—308 to interrupt call processing while the CO/SSP
304—308 transmits a query to the SCP 322, requesting information to recognize a word spoken by the user. The query is carried over a signaling system 7 (SS7) link 324 and routed
`to the appropriate SCP 322 by a signal transfer point (STP)
`326. The SCP 322 sends a request for the intelligent periph-
`eral 300 to perform speech recognition. The speech refer-
`ence system 300 can be implemented using a computer
`capable of reading and executing computer readable instruc-
`tions stored on a computer readable storage medium 328.
`The instructions on the storage medium 328 instruct the
`computer how to perform the enrollment method according
`to the invention.
`
`FIGS. 9a & b are flow charts of the speech reference
enrollment method. This method can be used with any speech reference system, including those used as part of an intelligent telephone network as shown in FIG. 8. The
`enrollment process starts, step 350, by receiving a first
`utterance of a vocabulary word from a user at step 352. Next,
`a plurality of features are extracted from the first utterance
`at step 354. In one embodiment, the plurality of features are
`the cepstrum coefficients of the utterance. At step 356, a
second utterance is received. In one embodiment the first utterance and the second utterance are received in response to a request that the user speak the vocabulary word. Next, the plurality of features are extracted from the second utterance at step 358. Note that the same features are
`extracted for both utterances. At step 360, a first similarity
`is determined between the plurality of features from the first
`utterance and the plurality of features from the second
`utterance. In one embodiment, the similarity is determined
using a hidden Markov model Viterbi scoring system. Then
`it is determined if the first similarity is less than a prede-
`termined similarity at step 362. When the first similarity is
not less than the predetermined similarity, then a reference pattern (reference utterance) of the vocabulary word is formed at step 364. The reference pattern, in one embodiment, is an averaging of the features from the first and second utterance. In another embodiment, the reference pattern consists of storing the features from both the first utterance and the
`second utterance, with a pointer from both to the vocabulary
`word.
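Putting the flow of FIGS. 9a & b together, a sketch follows; get_features stands in for the receive-and-extract steps, and similar for the HMM Viterbi scoring mentioned above.

    def enroll(get_features, similar, predetermined_similarity):
        f1 = get_features()                               # steps 352-354
        f2 = get_features()                               # steps 356-358
        if similar(f1, f2) >= predetermined_similarity:   # steps 360-362
            return form_reference(f1, f2)                 # step 364
        f3 = get_features()                               # steps 366-368
        if similar(f1, f3) >= predetermined_similarity:   # step 370
            return form_reference(f1, f3)
        if similar(f2, f3) >= predetermined_similarity:   # steps 372-374
            return form_reference(f2, f3)
        return None                                       # step 378: start over

    def form_reference(a, b):
        # One embodiment averages the two feature sets (step 364).
        return [(x + y) / 2.0 for x, y in zip(a, b)]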
`
`When the first simi