throbber
Ulllted States Patent [19]
`Ittycheriah et al.
`
`[54] APPARATUS AND METHODS FOR SPEECH
`RECOGNITION INCLUDING INDIVIDUAL
`OR SPEAKER CLASS DEPENDENT
`DECODING HISTORY CACHES FOR FAST
`
`WORD ACCEPTANCE ()R REJECTION
`
`[75] Inventors: Abraham Poovakunnel Ittycheriah;
`lsjtepgmne CHerman Maes’ both of
`an my’ on“
`
`US005937383A
`[11] Patent Number:
`[45] Date 0f Patent:
`
`5,937,383
`Aug. 10, 1999
`
`[56]
`
`References Cited
`U.S. PATENT DOCUMENTS
`
`4,783,803 11/1988 Baker et al. .......................... .. 704/241
`
`4,866,778
`9/1989 Baker . . . . . . . . . . . .
`4,972,485 11/1990 Dautrich et a1. ..
`glggms et al
`5,677,991 10/1997 Hsu et a1. ................. .. 704/255
`5,754,978
`5/1998 Perez-Mendez et a1
`.... .. 704/255
`
`,
`
`,
`
`rong . . . . . . . . . . .
`
`
`
`~~~~ ~~ . . . . ..
`
`. . . . .. 704/242
`.... .. 704/242
`
`_
`
`_
`
`_
`
`_
`
`5,758,319
`
`5/1998 Knittle .............................. .. 704/251
`
`0r 0ra 10n,
`
`[73] AsslgneeZ glternatlgnal gusmeljslggchmes
`p
`mon
`[21] Appl. No.: 08/869,025
`[22]
`Filed;
`Jun, 4, 1997
`
`,
`
`.
`
`.
`
`[60]
`
`Related US. Application Data
`Provisional application No. 60/011,058, Feb. 2, 1996.
`6
`
`............................................
`
`_
`
`_
`
`_
`
`_
`
`8/1998 Yegnanarayanan et a1. ......... .. 704/255
`5,794,196
`Primary Examzner—R1chemond Dorv1l
`[57]
`ABSTRACT
`A method and an apparatus are provided for performing
`speech recognition on speech segments frequently input by
`a user. The method and the apparatus include use of keyword
`scoring in connection With a speech recognition vocabulary,
`a temporary score, and a predetermined margin to determine
`an appropriate Output as being representative of the input
`
`.
`
`.
`
`. .............. ..
`
`.
`
`,
`
`S eech Se ment'
`
`[58] Field of Search ................................... .. 704/255, 275,
`704/270, 239, 240, 241, 242, 236
`
`p
`
`g
`
`25 Claims, 1 Drawing Sheet
`
`INPUT
`SPEECH
`
`ACOUSTIC
`FRONT-END
`
`IL
`
`Atlllg?sglN-r
`UNIT
`
`SCORE
`
`/
`
`I8
`
`20 @
`
`USER
`|D
`UN IT
`
`CACHE
`DATABASE
`
`YES
`
`DECODED
`KEYWORD
`
`NO '6
`/ DECODED KEYWORD
`AND SCORES
`CLASSICAL
`SPEECH
`ECOGNITION
`ENGINE
`
`BASEFORM
`DECODED
`SCRIPT
`
`Page 1 of 6
`
`

`

`
`
`GYOMASYG300940
`
`ZZ
`
`YOLVEVdWOD
`
`HaadS
`
`NOILINSOOSa
`
`SNISNS
`
`EE
`
`S3YOOSGNV
`WIISSVTD
`
`GYOMA3ySHA3NOIS}|asvavivd
`aaao23d3HOV?
`dado93d |“Sls
`
`
`3Y0SaaaLIAMLSNODVINN!
`
`
`INaWNOION3-LNOYSHO3adS
`
`U.S. Patent
`
`Aug. 10, 1999
`
`5,937,383
`
`y
`
`+7|Z|
`
`WuOsSSVa
`
`4idld9S
`
`Page 2 of 6
`
`Page 2 of 6
`
`
`
`
`
`

`

`1
`APPARATUS AND METHODS FOR SPEECH
`RECOGNITION INCLUDING INDIVIDUAL
`OR SPEAKER CLASS DEPENDENT
`DECODING HISTORY CACHES FOR FAST
`WORD ACCEPTANCE OR REJECTION
`
`This application is based on provisional patent applica
`tion Ser. No. 60/011,058, ?led Feb. 2, 1996.
`
`BACKGROUND OF THE INVENTION
`
`10
`
`The present invention relates to speech recognition sys
`tems and, more particularly, to apparatus and methods for
`performing fast Word acceptance or rejection using decoding
`history caches.
`Speech recognition is an emerging technology. More and
`more often it is replacing classical data entry or order taking,
`Which typically require ?lling out of forms, typing or
`interacting With human operators. Typically an initial step in
`a computeriZed speech recognition system involves the
`computation of a set of acoustic features (feature vector)
`from sampled speech. The sampled speech may be provided
`by a user of the system via an audio-to-electrical transducer,
`such as a microphone, and converted from an analog rep
`resentation to a digital representation before sampling.
`Typically, a classical acoustic front-end (processor) is
`employed to compute the acoustic features from the sampled
`speech. The acoustic features are then submitted to a speech
`recognition engine Where the utterances are recogniZed
`thereby generating a decoded or recogniZed script Which is
`representative of the sampled input speech.
`Classical speech recognition systems typically compare
`the likelihood of all possible Word hypotheses or sequences
`of Word hypotheses and select the most probable hypotheses
`as the recogniZed script based on acoustic and language
`modeling scores. This process is referred to as a detailed
`match search. When a comparison of all possible hypotheses
`is impractical, Which is often the case, the set of possible
`hypotheses compared is limited by a process knoWn as the
`fast match search Which is performed to rapidly limit the set
`of possible hypotheses by eliminating, after a quick scoring,
`hypotheses falling too far behind the top ranking hypoth
`eses.
`Unfortunately, for high volume speech recognition
`applications, for eXample, a corporate name voice dialer, the
`amount of hypotheses to consider for knoWn detailed match
`and fast match searches is still prohibitively large.
`
`SUMMARY OF THE INVENTION
`
`It is an object of the present invention to provide appa
`ratus and methods for rapidly limiting the amount of hypoth
`eses to search based on the history of previous use of a
`speech recognition system by a user.
`It is another object of the present invention to de?ne a
`cache database Which stores keywords (vocabulary such as
`names and/or commands) frequently used by a user, as Well
`as score information attributed to the keyWords.
`In one aspect of the present invention, a method for
`performing speech recognition on speech segments fre
`quently input by a user comprises the steps of: inputting at
`least one keyWord spoken by the user; decoding the at least
`one keyWord by scoring the at least one keyWord against a
`speech recognition vocabulary to generate a decoded key
`Word and at least one score for the decoded keyWord; storing
`the decoded keyWord and the at least one score; inputting a
`speech segment spoken by the user; comparing the input
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`5,937,383
`
`2
`speech segment to the decoded keyWord in order to generate
`a temporary score; and comparing the temporary score
`against the at least one stored score and if the temporary
`score is one of Within a predetermined margin of, equivalent
`to, and larger than the at least one stored score, then the
`decoded keyWord is output as being representative of the
`input speech segment, else the input speech segment is
`scored against the speech recognition vocabulary to generate
`a second decoded keyWord and at least one score for the
`second decoded keyWord. The method also preferably
`includes the step of storing the second decoded keyWord and
`the at least one score associated thereWith.
`Advantageously, in this manner, keyWords and scores
`associated thereWith are stored in a decoding history cache
`in order to reduce the amount of hypotheses to consider in
`the decoding process. Further, the term “keyWord” is gen
`erally de?ned to include such Words and phrases as names,
`commands, and sentences Which include both functional and
`non-functional Words.
`These and other objects, features and advantages of the
`present invention Will become apparent from the folloWing
`detailed description of illustrative embodiments thereof,
`Which is to be read in connection With the accompanying
`draWings.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIG. 1 is a block diagram illustrating one embodiment of
`a speech recognition system employing a cache database
`according to the present invention.
`
`DETAILED DESCRIPTION OF THE
`PREFERRED EMBODIMENTS
`
`In a speech recognition application such as voice dialing
`(an operation Wherein a name of a party is input, recogniZed,
`matched With the telephone number associated With that
`party and then automatically dialed), each name utterance
`input by the system users must typically be compared With
`baseforms for all the Words of the vocabulary. HoWever, the
`present invention recogniZes that, most of the time, a user
`interacts regularly With a limited amount of people.
`Therefore, due in part to such a fact, the present invention
`provides a unique manner of building and implementing a
`cache database for each user Within a speech recognition
`environment. Such a cache implemented in a voice dialing
`application preferably contains the names of the latest
`persons queried by a particular user. This concept may be
`eXtended to include other forms of vocabulary in the cache
`such as, for eXample, commands. In the case of a voice
`controlled navigator, such a unique cache database may be
`built and implemented to store recent command history as
`Well as to use the conteXt to load the related command for
`execution by the navigation system. Given the description of
`the invention provided herein, it is to be appreciated that one
`of ordinary skill in the art Will be able to contemplate
`numerous other applications for employing the principles of
`the present invention.
`Generally, the present invention provides for storing data
`in a cache and utiliZing such data in order to quickly accept
`or reject keyWords, i.e., names, commands or sentences,
`Without impacting the overall recognition process and,
`therefore, Without adversely affecting recognition perfor
`mance. Such a cache database formed according to the
`invention contains the Word history of a keyWord (name,
`command or sentence) as Well as the likelihood (score)
`obtained When the Word Was added to the cache. Further,
`Whenever an input speech segment is uttered, the keyWords
`
`Page 3 of 6
`
`

`

`3
`of the cache are checked ?rst before the segment is subjected
`to a classical speech recognition engine. If it is determined,
`as Will be explained, that a keyword in the cache presents the
`highest likelihood of being the closest match to the speech
`segment input by the user, then that keyWord is output as the
`recogniZed script (name, command or sentence). In other
`Words, you Will try to align the baseforms of the Words
`stored in the cache and get a score for all these comparisons.
`The scores are compared With What is stored in the cache, to
`decide if one hypothesis is acceptable or not. Preferably, it
`is required that the score be Within a margin of the stored
`score and that it is among the top ranking scores among all
`the Words in the cache. If no Word in the cache is determined
`to be the closest match, then the input speech segment is
`subjected to the classical speech recognition engine.
`Subsequently, the decoded Word(s) resulting from the rec
`ognition preformed by the engine is then preferably added to
`the cache along With associated scores. In this manner, not
`only is the overall speech recognition system not impacted,
`but performance is substantially enhanced due to the imple
`mentation of such cache features.
`HoWever, it is to be understood that the likelihoods
`compared to make the acceptance/rejection determination
`are only comparable if associated With the same speaker.
`That is, the likelihood associated With the keyWord input to
`the system by a speaker must be compared to likelihoods of
`Words previously input to the cache by the same speaker.
`Thus, such a condition implies that the speakers identity be
`knoWn. As Will be explained, such identi?cation may pref
`erably be achieved by text-independent speaker identi?ca
`tion or by speaker-independent speaker classi?cation, as Will
`be explained; hoWever, any other conventional method for
`identifying the speaker may be employed, for instance, by
`keying in a passWord or by a similar non-speech based
`identi?cation process.
`Referring noW to FIG. 1, preferred apparatus 10 for
`implementing the invention is shoWn. Speci?cally, an acous
`tic front-end 12 is operatively coupled to a Viterbi alignment
`unit 14, as Well as to a classical speech recognition engine
`16. It is to be understood that the acoustic front-end 12 is a
`processor Which extracts acoustic features from a sample of
`input speech provided thereto. The acoustic front-end 12
`may be a conventional acoustic front-end and may generate
`conventional acoustic feature vectors such as, for example,
`mel cepstral vectors. The speci?c feature extraction process
`employed is not critical to the invention. Next, a cache
`database 18 is provided Which is responsive to a user
`identi?cation unit 20, Which Will be explained, and Which is
`operatively coupled to a score comparator 22, the Viterbi
`alignment unit 14 and the classical speech recognition
`engine 16. Given the above description of component
`interconnection, an explanation of the operation of such
`preferred apparatus Will folloW beloW.
`The functions performed by the Viterbi alignment unit 14
`are similar to those performed by the classical speech
`recognition engine 16. That is, the acoustic features
`extracted in acoustic front-end 12 are subjected to a proba
`bilistic pattern matching technique using Hidden Markov
`models (HMMs) in both the alignment unit 14 and the
`engine 16. Akey difference, as Will be explained, is that the
`Viterbi alignment unit 14 is only concerned With the data
`base of Words stored in the cache database 18, While the
`engine 16 is concerned With a larger scale language database
`(vocabulary) associated thereWith. It is to be understood that
`Hidden Markov modeling provides different scores for dif
`ferent hypotheses. Examples of such scores are rank (i.e., the
`ranking of a decoded Word), fast match scores (i.e., the
`
`10
`
`15
`
`25
`
`35
`
`45
`
`55
`
`65
`
`5,937,383
`
`4
`likelihoods of a simpli?ed scheme), detailed match scores
`(i.e., the likelihoods of a detailed scheme), language mod
`eling scores (i.e., probability according to a language model)
`and/or any combination of these scores.
`Speci?cally, When a user uses the system, his or her
`identify is determined by user ID unit 20. This may be
`accomplished by one of several methods. For instance, the
`user may enter a personal identi?cation number (PIN) or a
`passWord at a keypad, keyboard or any other device knoWn
`in the art for accepting such identifying indicia. Still further,
`the identity of the speaker may be determined using auto
`matic speaker identi?cation methods, for example, as dis
`closed in US. Ser. No. 08/788,471. HoWever, it is to be
`understood that the method for identifying the speaker
`employed by unit 20 is not critical to the invention. The
`important function served by the user ID unit 20 is to permit
`the cache 18 to knoW Which portion of its database corre
`sponds to a particular speaker.
`It is to be understood that initially the cache database 18
`does not have any keyWords or corresponding scores stored
`therein, that is, until a particular user inputs certain names or
`commands. When the user ?rst inputs a name or a command,
`the acoustic front-end 12 generates a set of acoustic features
`Which are then provided to the classical speech recognition
`engine 16. The Word is decoded in accordance With the
`Hidden Markov models and the decoded Word, along With
`the different scores (discussed above) associated With the
`Word, are stored in the cache database 18 preferably parti
`tioned for that particular user. It is to be understood that
`rather than a large cache database partitioned for each
`system user, individual cache databases may be formed for
`each user.
`Next, When the same user enters another speech segment,
`the folloWing operation is performed. First, each decoded
`keyWord in the cache database 18 corresponding to the same
`user is considered as a hypothesis With respect to the neWly
`entered speech segment. This evaluation is done by the
`Viterbi alignment unit 14. The resulting scores from such
`comparison are then presented to the score comparator 22.
`Also, the scores Which Were previously stored in the cache
`database 18 pertaining to the particular user are presented to
`the score comparator 22 Wherein the scores from unit 14 are
`compared to the stored scores. If the neW scores from unit
`14 are Within a margin of, the same as, or larger than the
`stored scores corresponding to one of the decoded keyWords
`in the cache, then that keyWord is accepted and is output
`from the score comparator 22 as the decoded input speech
`segment. HoWever, if the neW scores are not Within a margin
`or, or are less than any of the stored scores, then the decoded
`keyWords in the cache are rejected and the input speech
`segment is subjected to the complete vocabulary of the
`classical speech recognition engine 16 Whereby a decoded
`script is generated.
`After the input speech segment is decoded by the engine
`16, the decoded Word and associated scores are provided to
`the cache database 18 Where they are stored as a neW
`keyWord and related scores for that particular user. In this
`Way, the cache stores the latest and most frequently used
`keyWords that a user may input to the system. In the voice
`dialing phone system application, this Will preferably
`include the latest and most frequently called names. In the
`voice controlled navigation application, this Will preferably
`include the latest and most frequently input commands. It is
`to be appreciated that the output of the comparator 22
`(decoded keyWord) and the output of the engine 16 (decoded
`script) may be connected to the actual phone system or
`navigation system such that the phone or navigation systems
`
`Page 4 of 6
`
`

`

`5,937,383
`
`5
`are responsive to the decoded keyword or script and thus
`perform their respective automatic functions (i.e., dialing or
`navigating). As mentioned, one skilled in the art Will con
`template other uses for the novel principles disclosed herein.
`It is possible to eXtend the concept of a cache database for
`storing keywords and scores to unknoWn speakers Who have
`not enrolled in the speech recognition system and, therefore,
`have no prior history. This may be accomplished by vector
`quantization (VQ) clustering of the speakers as is disclosed
`in Us. Ser. No. 08/787,031. Also, a neW speaker may be
`associated With a class of speakers With similar character
`istics to himself or herself Wherein the acceptance or rejec
`tion decision is based on past history and an average
`previous likelihood associated With other speakers Within
`that class. Such a speaker classi?cation approach may be
`done completely speaker independently, in Which case, the
`cache database may preferably be built common to all users
`in the class.
`It is to be appreciated that the components of the embodi
`ments described herein may be implemented in hardWare,
`softWare or a combination thereof. Preferably, the preferred
`embodiment is implemented on an appropriately pro
`grammed general purpose digital computer.
`Although illustrative embodiments of the present inven
`tion have been described herein With reference to the accom
`panying draWings, it is to be understood that the invention
`is not limited to those precise embodiments, and that various
`other changes and modi?cations may be affected therein by
`one skilled in the art Without departing from the scope or
`spirit of the invention.
`What is claimed is:
`1. Amethod for performing speech recognition on speech
`segments frequently input by a user, the method comprising
`the steps of:
`(a) inputting at least one keyWord spoken by the user;
`(b) decoding the at least one keyWord by scoring the at
`least one keyWord against a speech recognition vocabu
`lary to generate a decoded keyWord and at least one
`score for the decoded keyWord;
`(c) storing the decoded keyWord and the at least one score;
`(d) inputting a speech segment spoken by the user;
`(e) comparing the input speech segment to the decoded
`keyWord in order to generate a temporary score; and
`(f) comparing the temporary score against the at least one
`stored score and if the temporary score is one of Within
`a predetermined margin of, equivalent to, and larger
`than the at least one stored score, then the decoded
`keyWord is output as being representative of the input
`speech segment, else the input speech segment is
`scored against the speech recognition vocabulary to
`generate a second decoded keyWord and at least one
`score for the second decoded keyWord.
`2. The method of claim 1, further comprising the step of
`storing the second decoded keyWord and the at least one
`score associated thereWith.
`3. The method of claim 1, further comprising the step of
`storing the decoded keyWord and scores associated there
`With in accordance With a predetermined identity of the user.
`4. The method of claim 3, further comprising the step of
`identifying the user via text-independent speaker identi?ca
`tion.
`5. The method of claim 3, further comprising the step of
`identifying the user via speaker-independent speaker clas
`si?cation.
`
`1O
`
`15
`
`25
`
`35
`
`45
`
`55
`
`65
`
`6
`6. The method of claim 1, Wherein the at least one
`keyWord is a name.
`7. The method of claim 6, Wherein said method is utiliZed
`in a name-based voice dialing phone system.
`8. The method of claim 1, Wherein the at least one
`keyWord is a command.
`9. The method of claim 8, Wherein said method is utiliZed
`in a command-based voice controlled system.
`10. The method of claim 1, Wherein the at least one
`keyWord is from a large vocabulary associated With a speech
`recognition system.
`11. Apparatus for performing speech recognition on
`speech segments frequently input by a user, the apparatus
`comprising:
`means for inputting at least one keyWord spoken by the
`user;
`means for decoding the at least one keyWord by scoring
`the at least one keyWord against a speech recognition
`vocabulary to generate a decoded keyWord and at least
`one score for the decoded keyWord;
`means for storing the decoded keyWord and the at least
`one score;
`means for inputting a speech segment spoken by the user;
`means for comparing the input speech segment to the
`decoded keyWord in order to generate a temporary
`score; and
`means for comparing the temporary score against the at
`least one stored score and if the temporary score is one
`of Within a predetermined margin of, equivalent to, and
`larger than the at least one stored score, then the
`decoded keyWord is output as being representative of
`the input speech segment, else the input speech seg
`ment is scored against the speech recognition vocabu
`lary to generate a second decoded keyWord and at least
`one score for the second decoded keyWord.
`12. The apparatus of claim 11, further comprising means
`for storing the second keyWord and the at least one score
`associated thereWith.
`13. The apparatus of claim 11, further comprising means
`for storing the decoded keyWord and scores associated
`thereWith in accordance With a predetermined identity of the
`user.
`14. The apparatus of claim 13, further comprising means
`for identifying the user via text-independent speaker iden
`ti?cation.
`15. The method of claim 13, further comprising means for
`identifying the user via speaker-independent speaker clas
`si?cation.
`16. The apparatus of claim 11, Wherein the at least one
`keyWord is a name.
`17. The apparatus of claim 16, Wherein said apparatus is
`utiliZed in a name-based voice dialing phone system.
`18. The apparatus of claim 11, Wherein the at least one
`keyWord is a command.
`19. The apparatus of claim 18, Wherein said apparatus is
`utiliZed in a command-based voice controlled system.
`20. The apparatus of claim 11, Wherein the at least one
`keyWord is from a large vocabulary associated With a speech
`recognition system.
`21. A system for recogniZing keyWords frequently input
`by a speaker, the system comprising:
`a speech recognition engine for decoding at least one
`keyWord uttered by the speaker by scoring the at least
`
`Page 5 of 6
`
`

`

`5,937,383
`
`7
`one keyword against a speech recognition vocabulary
`to generate a decoded keyword and at least one score
`for the decoded keyWord;
`a cache database for storing the decoded keyWord and the
`at least one score associated thereWith in accordance
`With a predetermined identity of the speaker;
`means for performing a Viterbi alignment process on an
`input speech segment uttered by the speaker Wherein
`the input speech segment is compared to the decoded
`keyWord to generate a temporary score; and
`a comparator for comparing the temporary score against
`the at least one stored score and if the temporary score
`is one of Within a predetermined margin of, equivalent
`to, and larger than the at least one stored score, then the
`decoded keyWord is output as being representative of
`the input speech segment, else the input speech seg
`ment is scored against the speech recognition vocabu
`
`15
`
`8
`lary to generate a second decoded keyWord and at least
`one score for the second decoded keyWord.
`22. The system of claim 21, Wherein the second keyWord
`and the at least one score associated thereWith are stored in
`the cache database.
`23. The system of claim 21, Wherein the identity of the
`speaker is determined via text-independent speaker identi
`?cation.
`24. The system of claim 21, Wherein the identity of the
`speaker is determined via speaker-independent speaker clas
`si?cation.
`25. The system of claim 21, Wherein the at least one
`keyWord is one of a name, a command, and at least one Word
`from a large vocabulary associated With a speech recogni
`tion system.
`
`Page 6 of 6
`
`

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket