`Ittycheriah et al.
`
`[54] APPARATUS AND METHODS FOR SPEECH
`RECOGNITION INCLUDING INDIVIDUAL
`OR SPEAKER CLASS DEPENDENT
`DECODING HISTORY CACHES FOR FAST
`
`WORD ACCEPTANCE ()R REJECTION
`
`[75] Inventors: Abraham Poovakunnel Ittycheriah;
`lsjtepgmne CHerman Maes’ both of
`an my’ on“
`
`US005937383A
`[11] Patent Number:
`[45] Date 0f Patent:
`
`5,937,383
`Aug. 10, 1999
`
`[56]
`
`References Cited
`U.S. PATENT DOCUMENTS
`
`4,783,803 11/1988 Baker et al. .......................... .. 704/241
`
`4,866,778
`9/1989 Baker . . . . . . . . . . . .
`4,972,485 11/1990 Dautrich et a1. ..
`glggms et al
`5,677,991 10/1997 Hsu et a1. ................. .. 704/255
`5,754,978
`5/1998 Perez-Mendez et a1
`.... .. 704/255
`
`,
`
`,
`
`rong . . . . . . . . . . .
`
`
`
`~~~~ ~~ . . . . ..
`
`. . . . .. 704/242
`.... .. 704/242
`
`_
`
`_
`
`_
`
`_
`
`5,758,319
`
`5/1998 Knittle .............................. .. 704/251
`
`0r 0ra 10n,
`
`[73] AsslgneeZ glternatlgnal gusmeljslggchmes
`p
`mon
`[21] Appl. No.: 08/869,025
`[22]
`Filed;
`Jun, 4, 1997
`
`,
`
`.
`
`.
`
`[60]
`
`Related US. Application Data
`Provisional application No. 60/011,058, Feb. 2, 1996.
`6
`
`............................................
`
`_
`
`_
`
`_
`
`_
`
`8/1998 Yegnanarayanan et a1. ......... .. 704/255
`5,794,196
`Primary Examzner—R1chemond Dorv1l
`[57]
`ABSTRACT
`A method and an apparatus are provided for performing
`speech recognition on speech segments frequently input by
`a user. The method and the apparatus include use of keyword
`scoring in connection With a speech recognition vocabulary,
`a temporary score, and a predetermined margin to determine
`an appropriate Output as being representative of the input
`
`.
`
`.
`
`. .............. ..
`
`.
`
`,
`
`S eech Se ment'
`
`[58] Field of Search ................................... .. 704/255, 275,
`704/270, 239, 240, 241, 242, 236
`
`p
`
`g
`
`25 Claims, 1 Drawing Sheet
`
`INPUT
`SPEECH
`
`ACOUSTIC
`FRONT-END
`
`IL
`
`Atlllg?sglN-r
`UNIT
`
`SCORE
`
`/
`
`I8
`
`20 @
`
`USER
`|D
`UN IT
`
`CACHE
`DATABASE
`
`YES
`
`DECODED
`KEYWORD
`
`NO '6
`/ DECODED KEYWORD
`AND SCORES
`CLASSICAL
`SPEECH
`ECOGNITION
`ENGINE
`
`BASEFORM
`DECODED
`SCRIPT
`
`Page 1 of 6
`
`
`
`U.S. Patent
`
`Aug. 10, 1999
`
`5,937,383
`
`\\
`
`NN
`
`
`
`omo;>m_v_omoouma
`
`mmmoowQ24.
`
`0.
`
`._<o_mm<d
`
`zummmm
`
`zoEzooumm
`
`mz_ozm
`
`Become
`
`omo;>m«_mm»
`
`mE<~_<._z8
`
`mmoumI
`
`zm_onm_m<m_
`
`Booumo
`
`Eaum
`
`_.2“.
`
`
`
`m:u<u
`
`m_m<m_E<o
`
`mmoum
`
`_mmmE>
`
`»zmzzu:<
`
`:2:
`
`u_.5:8<
`
`ozmézofi
`
`5n_z_
`
`_._umm_n_m
`
`Page 2 of 6
`
`Page 2 of 6
`
`
`
`
`
`
`
`
`1
`APPARATUS AND METHODS FOR SPEECH
`RECOGNITION INCLUDING INDIVIDUAL
`OR SPEAKER CLASS DEPENDENT
`DECODING HISTORY CACHES FOR FAST
`WORD ACCEPTANCE OR REJECTION
`
`This application is based on provisional patent applica
`tion Ser. No. 60/011,058, ?led Feb. 2, 1996.
`
`BACKGROUND OF THE INVENTION
`
`10
`
`The present invention relates to speech recognition sys
`tems and, more particularly, to apparatus and methods for
`performing fast Word acceptance or rejection using decoding
`history caches.
`Speech recognition is an emerging technology. More and
`more often it is replacing classical data entry or order taking,
`Which typically require ?lling out of forms, typing or
`interacting With human operators. Typically an initial step in
`a computeriZed speech recognition system involves the
`computation of a set of acoustic features (feature vector)
`from sampled speech. The sampled speech may be provided
`by a user of the system via an audio-to-electrical transducer,
`such as a microphone, and converted from an analog rep
`resentation to a digital representation before sampling.
`Typically, a classical acoustic front-end (processor) is
`employed to compute the acoustic features from the sampled
`speech. The acoustic features are then submitted to a speech
`recognition engine Where the utterances are recogniZed
`thereby generating a decoded or recogniZed script Which is
`representative of the sampled input speech.
`Classical speech recognition systems typically compare
`the likelihood of all possible Word hypotheses or sequences
`of Word hypotheses and select the most probable hypotheses
`as the recogniZed script based on acoustic and language
`modeling scores. This process is referred to as a detailed
`match search. When a comparison of all possible hypotheses
`is impractical, Which is often the case, the set of possible
`hypotheses compared is limited by a process knoWn as the
`fast match search Which is performed to rapidly limit the set
`of possible hypotheses by eliminating, after a quick scoring,
`hypotheses falling too far behind the top ranking hypoth
`eses.
`Unfortunately, for high volume speech recognition
`applications, for eXample, a corporate name voice dialer, the
`amount of hypotheses to consider for knoWn detailed match
`and fast match searches is still prohibitively large.
`
`SUMMARY OF THE INVENTION
`
`It is an object of the present invention to provide appa
`ratus and methods for rapidly limiting the amount of hypoth
`eses to search based on the history of previous use of a
`speech recognition system by a user.
`It is another object of the present invention to de?ne a
`cache database Which stores keywords (vocabulary such as
`names and/or commands) frequently used by a user, as Well
`as score information attributed to the keyWords.
`In one aspect of the present invention, a method for
`performing speech recognition on speech segments fre
`quently input by a user comprises the steps of: inputting at
`least one keyWord spoken by the user; decoding the at least
`one keyWord by scoring the at least one keyWord against a
`speech recognition vocabulary to generate a decoded key
`Word and at least one score for the decoded keyWord; storing
`the decoded keyWord and the at least one score; inputting a
`speech segment spoken by the user; comparing the input
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`5,937,383
`
`2
`speech segment to the decoded keyWord in order to generate
`a temporary score; and comparing the temporary score
`against the at least one stored score and if the temporary
`score is one of Within a predetermined margin of, equivalent
`to, and larger than the at least one stored score, then the
`decoded keyWord is output as being representative of the
`input speech segment, else the input speech segment is
`scored against the speech recognition vocabulary to generate
`a second decoded keyWord and at least one score for the
`second decoded keyWord. The method also preferably
`includes the step of storing the second decoded keyWord and
`the at least one score associated thereWith.
`Advantageously, in this manner, keyWords and scores
`associated thereWith are stored in a decoding history cache
`in order to reduce the amount of hypotheses to consider in
`the decoding process. Further, the term “keyWord” is gen
`erally de?ned to include such Words and phrases as names,
`commands, and sentences Which include both functional and
`non-functional Words.
`These and other objects, features and advantages of the
`present invention Will become apparent from the folloWing
`detailed description of illustrative embodiments thereof,
`Which is to be read in connection With the accompanying
`draWings.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIG. 1 is a block diagram illustrating one embodiment of
`a speech recognition system employing a cache database
`according to the present invention.
`
`DETAILED DESCRIPTION OF THE
`PREFERRED EMBODIMENTS
`
`In a speech recognition application such as voice dialing
`(an operation Wherein a name of a party is input, recogniZed,
`matched With the telephone number associated With that
`party and then automatically dialed), each name utterance
`input by the system users must typically be compared With
`baseforms for all the Words of the vocabulary. HoWever, the
`present invention recogniZes that, most of the time, a user
`interacts regularly With a limited amount of people.
`Therefore, due in part to such a fact, the present invention
`provides a unique manner of building and implementing a
`cache database for each user Within a speech recognition
`environment. Such a cache implemented in a voice dialing
`application preferably contains the names of the latest
`persons queried by a particular user. This concept may be
`eXtended to include other forms of vocabulary in the cache
`such as, for eXample, commands. In the case of a voice
`controlled navigator, such a unique cache database may be
`built and implemented to store recent command history as
`Well as to use the conteXt to load the related command for
`execution by the navigation system. Given the description of
`the invention provided herein, it is to be appreciated that one
`of ordinary skill in the art Will be able to contemplate
`numerous other applications for employing the principles of
`the present invention.
`Generally, the present invention provides for storing data
`in a cache and utiliZing such data in order to quickly accept
`or reject keyWords, i.e., names, commands or sentences,
`Without impacting the overall recognition process and,
`therefore, Without adversely affecting recognition perfor
`mance. Such a cache database formed according to the
`invention contains the Word history of a keyWord (name,
`command or sentence) as Well as the likelihood (score)
`obtained When the Word Was added to the cache. Further,
`Whenever an input speech segment is uttered, the keyWords
`
`Page 3 of 6
`
`
`
`3
`of the cache are checked ?rst before the segment is subjected
`to a classical speech recognition engine. If it is determined,
`as Will be explained, that a keyword in the cache presents the
`highest likelihood of being the closest match to the speech
`segment input by the user, then that keyWord is output as the
`recogniZed script (name, command or sentence). In other
`Words, you Will try to align the baseforms of the Words
`stored in the cache and get a score for all these comparisons.
`The scores are compared With What is stored in the cache, to
`decide if one hypothesis is acceptable or not. Preferably, it
`is required that the score be Within a margin of the stored
`score and that it is among the top ranking scores among all
`the Words in the cache. If no Word in the cache is determined
`to be the closest match, then the input speech segment is
`subjected to the classical speech recognition engine.
`Subsequently, the decoded Word(s) resulting from the rec
`ognition preformed by the engine is then preferably added to
`the cache along With associated scores. In this manner, not
`only is the overall speech recognition system not impacted,
`but performance is substantially enhanced due to the imple
`mentation of such cache features.
`HoWever, it is to be understood that the likelihoods
`compared to make the acceptance/rejection determination
`are only comparable if associated With the same speaker.
`That is, the likelihood associated With the keyWord input to
`the system by a speaker must be compared to likelihoods of
`Words previously input to the cache by the same speaker.
`Thus, such a condition implies that the speakers identity be
`knoWn. As Will be explained, such identi?cation may pref
`erably be achieved by text-independent speaker identi?ca
`tion or by speaker-independent speaker classi?cation, as Will
`be explained; hoWever, any other conventional method for
`identifying the speaker may be employed, for instance, by
`keying in a passWord or by a similar non-speech based
`identi?cation process.
`Referring noW to FIG. 1, preferred apparatus 10 for
`implementing the invention is shoWn. Speci?cally, an acous
`tic front-end 12 is operatively coupled to a Viterbi alignment
`unit 14, as Well as to a classical speech recognition engine
`16. It is to be understood that the acoustic front-end 12 is a
`processor Which extracts acoustic features from a sample of
`input speech provided thereto. The acoustic front-end 12
`may be a conventional acoustic front-end and may generate
`conventional acoustic feature vectors such as, for example,
`mel cepstral vectors. The speci?c feature extraction process
`employed is not critical to the invention. Next, a cache
`database 18 is provided Which is responsive to a user
`identi?cation unit 20, Which Will be explained, and Which is
`operatively coupled to a score comparator 22, the Viterbi
`alignment unit 14 and the classical speech recognition
`engine 16. Given the above description of component
`interconnection, an explanation of the operation of such
`preferred apparatus Will folloW beloW.
`The functions performed by the Viterbi alignment unit 14
`are similar to those performed by the classical speech
`recognition engine 16. That is, the acoustic features
`extracted in acoustic front-end 12 are subjected to a proba
`bilistic pattern matching technique using Hidden Markov
`models (HMMs) in both the alignment unit 14 and the
`engine 16. Akey difference, as Will be explained, is that the
`Viterbi alignment unit 14 is only concerned With the data
`base of Words stored in the cache database 18, While the
`engine 16 is concerned With a larger scale language database
`(vocabulary) associated thereWith. It is to be understood that
`Hidden Markov modeling provides different scores for dif
`ferent hypotheses. Examples of such scores are rank (i.e., the
`ranking of a decoded Word), fast match scores (i.e., the
`
`10
`
`15
`
`25
`
`35
`
`45
`
`55
`
`65
`
`5,937,383
`
`4
`likelihoods of a simpli?ed scheme), detailed match scores
`(i.e., the likelihoods of a detailed scheme), language mod
`eling scores (i.e., probability according to a language model)
`and/or any combination of these scores.
`Speci?cally, When a user uses the system, his or her
`identify is determined by user ID unit 20. This may be
`accomplished by one of several methods. For instance, the
`user may enter a personal identi?cation number (PIN) or a
`passWord at a keypad, keyboard or any other device knoWn
`in the art for accepting such identifying indicia. Still further,
`the identity of the speaker may be determined using auto
`matic speaker identi?cation methods, for example, as dis
`closed in US. Ser. No. 08/788,471. HoWever, it is to be
`understood that the method for identifying the speaker
`employed by unit 20 is not critical to the invention. The
`important function served by the user ID unit 20 is to permit
`the cache 18 to knoW Which portion of its database corre
`sponds to a particular speaker.
`It is to be understood that initially the cache database 18
`does not have any keyWords or corresponding scores stored
`therein, that is, until a particular user inputs certain names or
`commands. When the user ?rst inputs a name or a command,
`the acoustic front-end 12 generates a set of acoustic features
`Which are then provided to the classical speech recognition
`engine 16. The Word is decoded in accordance With the
`Hidden Markov models and the decoded Word, along With
`the different scores (discussed above) associated With the
`Word, are stored in the cache database 18 preferably parti
`tioned for that particular user. It is to be understood that
`rather than a large cache database partitioned for each
`system user, individual cache databases may be formed for
`each user.
`Next, When the same user enters another speech segment,
`the folloWing operation is performed. First, each decoded
`keyWord in the cache database 18 corresponding to the same
`user is considered as a hypothesis With respect to the neWly
`entered speech segment. This evaluation is done by the
`Viterbi alignment unit 14. The resulting scores from such
`comparison are then presented to the score comparator 22.
`Also, the scores Which Were previously stored in the cache
`database 18 pertaining to the particular user are presented to
`the score comparator 22 Wherein the scores from unit 14 are
`compared to the stored scores. If the neW scores from unit
`14 are Within a margin of, the same as, or larger than the
`stored scores corresponding to one of the decoded keyWords
`in the cache, then that keyWord is accepted and is output
`from the score comparator 22 as the decoded input speech
`segment. HoWever, if the neW scores are not Within a margin
`or, or are less than any of the stored scores, then the decoded
`keyWords in the cache are rejected and the input speech
`segment is subjected to the complete vocabulary of the
`classical speech recognition engine 16 Whereby a decoded
`script is generated.
`After the input speech segment is decoded by the engine
`16, the decoded Word and associated scores are provided to
`the cache database 18 Where they are stored as a neW
`keyWord and related scores for that particular user. In this
`Way, the cache stores the latest and most frequently used
`keyWords that a user may input to the system. In the voice
`dialing phone system application, this Will preferably
`include the latest and most frequently called names. In the
`voice controlled navigation application, this Will preferably
`include the latest and most frequently input commands. It is
`to be appreciated that the output of the comparator 22
`(decoded keyWord) and the output of the engine 16 (decoded
`script) may be connected to the actual phone system or
`navigation system such that the phone or navigation systems
`
`Page 4 of 6
`
`
`
`5,937,383
`
`5
`are responsive to the decoded keyword or script and thus
`perform their respective automatic functions (i.e., dialing or
`navigating). As mentioned, one skilled in the art Will con
`template other uses for the novel principles disclosed herein.
`It is possible to eXtend the concept of a cache database for
`storing keywords and scores to unknoWn speakers Who have
`not enrolled in the speech recognition system and, therefore,
`have no prior history. This may be accomplished by vector
`quantization (VQ) clustering of the speakers as is disclosed
`in Us. Ser. No. 08/787,031. Also, a neW speaker may be
`associated With a class of speakers With similar character
`istics to himself or herself Wherein the acceptance or rejec
`tion decision is based on past history and an average
`previous likelihood associated With other speakers Within
`that class. Such a speaker classi?cation approach may be
`done completely speaker independently, in Which case, the
`cache database may preferably be built common to all users
`in the class.
`It is to be appreciated that the components of the embodi
`ments described herein may be implemented in hardWare,
`softWare or a combination thereof. Preferably, the preferred
`embodiment is implemented on an appropriately pro
`grammed general purpose digital computer.
`Although illustrative embodiments of the present inven
`tion have been described herein With reference to the accom
`panying draWings, it is to be understood that the invention
`is not limited to those precise embodiments, and that various
`other changes and modi?cations may be affected therein by
`one skilled in the art Without departing from the scope or
`spirit of the invention.
`What is claimed is:
`1. Amethod for performing speech recognition on speech
`segments frequently input by a user, the method comprising
`the steps of:
`(a) inputting at least one keyWord spoken by the user;
`(b) decoding the at least one keyWord by scoring the at
`least one keyWord against a speech recognition vocabu
`lary to generate a decoded keyWord and at least one
`score for the decoded keyWord;
`(c) storing the decoded keyWord and the at least one score;
`(d) inputting a speech segment spoken by the user;
`(e) comparing the input speech segment to the decoded
`keyWord in order to generate a temporary score; and
`(f) comparing the temporary score against the at least one
`stored score and if the temporary score is one of Within
`a predetermined margin of, equivalent to, and larger
`than the at least one stored score, then the decoded
`keyWord is output as being representative of the input
`speech segment, else the input speech segment is
`scored against the speech recognition vocabulary to
`generate a second decoded keyWord and at least one
`score for the second decoded keyWord.
`2. The method of claim 1, further comprising the step of
`storing the second decoded keyWord and the at least one
`score associated thereWith.
`3. The method of claim 1, further comprising the step of
`storing the decoded keyWord and scores associated there
`With in accordance With a predetermined identity of the user.
`4. The method of claim 3, further comprising the step of
`identifying the user via text-independent speaker identi?ca
`tion.
`5. The method of claim 3, further comprising the step of
`identifying the user via speaker-independent speaker clas
`si?cation.
`
`1O
`
`15
`
`25
`
`35
`
`45
`
`55
`
`65
`
`6
`6. The method of claim 1, Wherein the at least one
`keyWord is a name.
`7. The method of claim 6, Wherein said method is utiliZed
`in a name-based voice dialing phone system.
`8. The method of claim 1, Wherein the at least one
`keyWord is a command.
`9. The method of claim 8, Wherein said method is utiliZed
`in a command-based voice controlled system.
`10. The method of claim 1, Wherein the at least one
`keyWord is from a large vocabulary associated With a speech
`recognition system.
`11. Apparatus for performing speech recognition on
`speech segments frequently input by a user, the apparatus
`comprising:
`means for inputting at least one keyWord spoken by the
`user;
`means for decoding the at least one keyWord by scoring
`the at least one keyWord against a speech recognition
`vocabulary to generate a decoded keyWord and at least
`one score for the decoded keyWord;
`means for storing the decoded keyWord and the at least
`one score;
`means for inputting a speech segment spoken by the user;
`means for comparing the input speech segment to the
`decoded keyWord in order to generate a temporary
`score; and
`means for comparing the temporary score against the at
`least one stored score and if the temporary score is one
`of Within a predetermined margin of, equivalent to, and
`larger than the at least one stored score, then the
`decoded keyWord is output as being representative of
`the input speech segment, else the input speech seg
`ment is scored against the speech recognition vocabu
`lary to generate a second decoded keyWord and at least
`one score for the second decoded keyWord.
`12. The apparatus of claim 11, further comprising means
`for storing the second keyWord and the at least one score
`associated thereWith.
`13. The apparatus of claim 11, further comprising means
`for storing the decoded keyWord and scores associated
`thereWith in accordance With a predetermined identity of the
`user.
`14. The apparatus of claim 13, further comprising means
`for identifying the user via text-independent speaker iden
`ti?cation.
`15. The method of claim 13, further comprising means for
`identifying the user via speaker-independent speaker clas
`si?cation.
`16. The apparatus of claim 11, Wherein the at least one
`keyWord is a name.
`17. The apparatus of claim 16, Wherein said apparatus is
`utiliZed in a name-based voice dialing phone system.
`18. The apparatus of claim 11, Wherein the at least one
`keyWord is a command.
`19. The apparatus of claim 18, Wherein said apparatus is
`utiliZed in a command-based voice controlled system.
`20. The apparatus of claim 11, Wherein the at least one
`keyWord is from a large vocabulary associated With a speech
`recognition system.
`21. A system for recogniZing keyWords frequently input
`by a speaker, the system comprising:
`a speech recognition engine for decoding at least one
`keyWord uttered by the speaker by scoring the at least
`
`Page 5 of 6
`
`
`
`5,937,383
`
`7
`one keyword against a speech recognition vocabulary
`to generate a decoded keyword and at least one score
`for the decoded keyWord;
`a cache database for storing the decoded keyWord and the
`at least one score associated thereWith in accordance
`With a predetermined identity of the speaker;
`means for performing a Viterbi alignment process on an
`input speech segment uttered by the speaker Wherein
`the input speech segment is compared to the decoded
`keyWord to generate a temporary score; and
`a comparator for comparing the temporary score against
`the at least one stored score and if the temporary score
`is one of Within a predetermined margin of, equivalent
`to, and larger than the at least one stored score, then the
`decoded keyWord is output as being representative of
`the input speech segment, else the input speech seg
`ment is scored against the speech recognition vocabu
`
`15
`
`8
`lary to generate a second decoded keyWord and at least
`one score for the second decoded keyWord.
`22. The system of claim 21, Wherein the second keyWord
`and the at least one score associated thereWith are stored in
`the cache database.
`23. The system of claim 21, Wherein the identity of the
`speaker is determined via text-independent speaker identi
`?cation.
`24. The system of claim 21, Wherein the identity of the
`speaker is determined via speaker-independent speaker clas
`si?cation.
`25. The system of claim 21, Wherein the at least one
`keyWord is one of a name, a command, and at least one Word
`from a large vocabulary associated With a speech recogni
`tion system.
`
`Page 6 of 6
`
`