US005983186A

United States Patent [19]                [11] Patent Number: 5,983,186
Miyazawa et al.                          [45] Date of Patent: *Nov. 9, 1999

[54] VOICE-ACTIVATED INTERACTIVE SPEECH RECOGNITION DEVICE AND METHOD

[75] Inventors: Yasunaga Miyazawa; Mitsuhiro Inazumi; Hiroshi Hasegawa;
     Isao Edatsune; Osamu Urano, all of Suwa, Japan

[73] Assignee: Seiko Epson Corporation, Tokyo, Japan

[*] Notice: This patent issued on a continued prosecution application filed
     under 37 CFR 1.53(d), and is subject to the twenty year patent term
     provisions of 35 U.S.C. 154(a)(2).

[21] Appl. No.: 08/700,181

[22] Filed: Aug. 20, 1996

[30] Foreign Application Priority Data
     Aug. 21, 1995 [JP] Japan .................... 7-212248

[51] Int. Cl.6 ..................................... G10L 5/00
[52] U.S. Cl. ................... 704/275; 704/251; 704/233
[58] Field of Search ........... 179/15, 251; 704/212, 275, 246, 233, 255; 381/57

[56] References Cited

     U.S. PATENT DOCUMENTS

     2,338,551   1/1944  Stanko ..................... 381/57
     4,052,568  10/1977  Jankowski .................. 179/15
     5,562,453  10/1996  Wen ....................... 704/275
     5,577,164  11/1996  Kaneko
     5,668,929   9/1997  Foster .................... 704/246
     5,704,009  12/1997  Cline ..................... 704/246
     5,794,198   8/1998  Yegnanarayanan et al. ..... 704/255
     5,799,279   8/1998  Gould et al. .............. 704/275

     FOREIGN PATENT DOCUMENTS

     62-253093  11/1987  Japan
      6-4097     1/1994  Japan
      6-119476   4/1994  Japan

Primary Examiner—David R. Hudspeth
Attorney, Agent, or Firm—Michael T. Gabrik

[57] ABSTRACT

Techniques for implementing adaptable voice activation operations for
interactive speech recognition devices and instruments. Specifically, such
speech recognition devices and instruments include an input sound signal
power or volume detector in communication with a central CPU for bringing
the CPU out of an initial sleep state upon detection of perceived voice that
exceeds a predetermined threshold volume level and is continuously perceived
for at least a certain period of time. If both these conditions are
satisfied, the CPU is transitioned into an active mode so that the perceived
voice can be analyzed against a set of registered keywords to determine if a
"power on" command or similar instruction has been received. If so, the CPU
maintains an active state in which normal speech recognition processing
ensues until a "power off" command is received. However, if the perceived
and analyzed voice cannot be recognized, it is deemed to be background noise
and the minimum threshold is selectively updated to accommodate the volume
level of the perceived but unrecognized voice. Other aspects include
tailoring the volume level of the synthesized voice response according to
the perceived volume level as detected by the input sound signal power
detector, as well as modifying audible response volume in accordance with
updated volume threshold levels.

9 Claims, 4 Drawing Sheets
[Representative drawing: FIG. 4 — flow chart of voice activation processing,
reproduced as Sheet 4 of 4 below.]
[Sheet 1 of 4 — FIG. 1: block diagram of the first embodiment, showing sound
signal input unit 1, sound signal analyzer 2, phrase detector 3, speech
reference templates memory 4, speech comprehension interaction controller 5,
response data memory 6, speech synthesizer 7, speech output unit 8, and
input sound signal power detector 9.]
[Sheet 2 of 4 — FIGS. 2A-2E: stylized voice signal for "asu no tenki wa ..."
plotted against a time axis (FIG. 2A), with detection signals versus time
for phrase 1 "tenki" (weather), phrase 2 "asu" (tomorrow), phrase 3 "nanji"
(what time), and phrase 4 "ohayoo" (good morning), each on a 0-1.0 scale
(FIGS. 2B-2E).]
[Sheet 3 of 4 — FIG. 3A: input sound signal waveform versus time; FIG. 3B:
time sequence of the power signal of the input sound signal, computed over
short (roughly 10 ms to 20-some ms) windows between times t0 and t1 and
compared against threshold th1. FIGS. 5A and 5B (same sheet): stylized
waveforms in which the threshold and response output levels follow the
noise level.]
[Sheet 4 of 4 — FIG. 4: flow chart of voice activation processing (steps
S1-S15). The device puts the CPU to sleep; tests whether an input sound
signal equal to or greater than the threshold is present; if a registered
phrase is detected, sets the flag to active mode and enters speech
comprehension interaction control; tests whether the input sound signal is
finished for a single interaction and outputs a response; and, on a sleep
mode request, sets the flag to sleep mode and sets the device in sleep
mode.]
VOICE-ACTIVATED INTERACTIVE SPEECH RECOGNITION DEVICE AND METHOD

CROSS REFERENCE TO RELATED APPLICATIONS

This application is related to copending application Ser. No. 08/700,175,
filed on the same date as the present application, Attorney's Docket number
P2504a, entitled "A Cartridge Based Interactive Voice Recognition Method and
Apparatus", and copending application Ser. No. 08/669,874, filed on the same
date as the present application, Attorney's Docket number P2505a, entitled
"A Speech Recognition Device and Processing Method", all commonly assigned
with the present invention to the Seiko Epson Corporation of Tokyo, Japan.
This application is also related to the following copending applications:
application Ser. No. 08/078,027, filed Jun. 18, 1993, entitled "Speech
Recognition System"; application Ser. No. 08/102,859, filed Aug. 6, 1993,
entitled "Speech Recognition Apparatus"; application Ser. No. 08/485,134,
filed Jun. 7, 1995, entitled "Speech Recognition Apparatus Using Neural
Network and Learning Method Therefore"; and application Ser. No. 08/536,550,
filed Sep. 29, 1995, entitled "Interactive Voice Recognition Method and
Apparatus Using Affirmative/Negative Content Discrimination"; again all
commonly assigned with the present invention to the Seiko Epson Corporation
of Tokyo, Japan.
BACKGROUND OF THE INVENTION

1. Field of the Invention

The invention generally relates to interactive speech recognition
instruments which recognize speech and produce an audible response or
specified action based on developed recognition results, and is particularly
concerned with voice-based activation of such instruments.
2. Description of the Related Art

Speech recognition devices can be generally classified into two types. The
first type is the specific-speaker speech recognition device that can only
recognize the speech of a specific speaker, and the second general type is
the non-specific speaker speech recognition device that can recognize the
speech of non-specific speakers.

In the case of a specific speaker speech recognition device, a specific
speaker first registers his or her speech signal patterns as reference
templates by entering recognizable words one at a time according to a
specified interactive procedure. After registration, when the speaker issues
one of the registered words, speech recognition is performed by comparing
the feature pattern of the entered word to the registered speech templates.
One example of this kind of interactive speech recognition device is a
speech recognition toy. The child who uses the toy pre-registers, for
example, about 10 phrases such as "Good morning," "Good night" and "Good
day," as multiple speech instructions. In practice, when the speaker says
"Good morning," his speech signal is compared to the speech signal of the
registered "Good morning." If there is a match between the two speech
signals, an electrical signal corresponding to the speech instruction is
generated, which then makes the toy perform a specified action.

As the name implies, of course, this type of specific speaker speech
recognition device can recognize only the speech of a specific speaker or
speech possessing a similar pattern. Furthermore, since the phrases to be
recognized must be registered one at a time as part of device
initialization, using such a device is quite cumbersome and complex.
By contrast, a non-specific speaker speech recognition device creates
standard speech feature patterns of the recognition target phrases described
above, using "canned" speech exemplars spoken by a large number (e.g.,
around 200) of speakers. Phrases spoken by a non-specific speaker/user are
then compared to these pre-registered recognizable phrases for recognition.

However, such speech recognition devices usually become ready to perform
recognition operations and responses only when an external switch is turned
on or external power delivered to the device is turned on, regardless of
whether the device uses specific or non-specific speaker recognition. But,
in some types of speech recognition devices, it would be more convenient if
the device were in a standby state waiting for speech input at all times,
and performed recognition operations by sensing speech input, without the
need for the user to turn on the switch every time.

Take a stuffed toy utilizing speech recognition, for example. If the toy can
be kept in a speech input standby state, i.e., a so-called sleep mode, and
can instantly respond when the child calls out its name, it can respond
quickly without the need for plugging the device in or pressing a button,
thereby greatly enhancing its appeal as a user-friendly device, especially
for younger children where applying external power may raise safety
concerns. In addition to toys, the same can be said of all electronic
instruments that utilize speech recognition.

Some issues must be resolved when keeping the device in a sleep mode and
having it perform recognition operations by sensing speech input, as
explained above. These include, for example, power consumption and the
ability of the device to differentiate between phrases to be recognized and
noise, and to act only in response to phrases to be recognized. In
particular, since most toys run on batteries, minimizing battery drain is a
major issue. Additionally, product prices must also be kept low to maintain
commercial appeal for such devices, so using expensive, conventional
activation circuitry is undesirable. So, heretofore, there have been a large
number of technical restrictions on commercializing interactive speech
devices which also feature voice activation.
OBJECTS OF THE INVENTION

It is, therefore, an object of the present invention to enable the device to
remain in a sleep mode and perform recognition operations only when a
recognizable speech input is detected, to minimize power consumption during
the sleep mode, to enable speech to be recognized with high accuracy even if
noise is present in the usage environment, and to enhance the commercial
appeal of the device by retaining low cost relative to conventional designs.

SUMMARY OF THE INVENTION
In accordance with this and related objects, a voice-activated interactive
speech mechanism according to the present invention includes: 1) a sound
signal input unit for receiving ambient signals projected to the mechanism;
2) a sound signal analyzer in communication with the sound signal input unit
for analyzing sounds perceived by the sound signal input unit and generating
voice feature parameters responsive to these analyzed sounds; 3) a phrase
detector in communication with the sound signal analyzer for receiving
generated voice feature parameters for the perceived sounds, comparing the
received data against a set of speech reference templates in an effort to
find a match, and generating phrase detection data should a recognizable
phrase be found; 4) a speech recognition interactive controller in
communication with the phrase detector which receives generated phrase
detection data, understands the meaning of the input speech, determines
appropriate response content, and performs various controls on the mechanism
based on the interpreted speech; 5) a speech synthesizer in communication
with the interaction controller for generating synthesized speech output
based on the determined response; 6) a speech output unit in communication
with the interaction controller and the speech synthesizer for broadcasting
the generated synthesized speech; and 7) an input sound signal power
detector in communication with at least the sound signal input unit and the
interaction controller for detecting the volume, magnitude or amplitude of
input sound signals based on sound signal waveforms perceived by the sound
signal input unit or capture device. Preferably, this power detector
includes processing circuitry for forcing the mechanism to selectively enter
or terminate a low-power sleep mode. Moreover, preferably, during this sleep
mode, either the interaction controller or the input sound signal power
detector itself determines whether input sound signals, as detected by the
input sound signal power detector, are at least at a predetermined threshold
volume level above the background noise. If so, a determination is then made
whether or not the threshold filtered input sound signal corresponds to a
recognizable phrase, and, if so, the device shifts from the sleep mode into
the active mode.
Also, a hardware or software timer can be used to determine if a given
perceived sound meets or exceeds the predetermined threshold power level for
a specified duration of time. If a given perceived sound signal that is
higher in level than this threshold is continuously present for at least
this specified duration of time and the input sound signal is determined not
to be a recognizable phrase, the input sound signal is determined to be
background noise present in the environment. This aids proper voice
activation in a noisy ambient environment. Moreover, the threshold power
level may be updated in real time to account for detection of this
background noise.
Furthermore, the sound signal power level detector according to the present
invention may be used as an ambient noise feedback device to enable the
speech recognition mechanism to take into account perceived noise levels in
formulating the volume of response messages and other audible functions. In
so doing, the mechanism may set an initial threshold for eliminating noise,
and perform power detection for a specified duration of time using this
threshold as the reference. Specifically, 1) if an input sound signal that
is higher in level than the current threshold is perceived for at least a
specified duration of time, and 2) if the input sound signal is determined
not to be a recognizable phrase, the input sound signal is judged by the
mechanism to be ambient background noise. At the same time, the threshold is
updated to a value greater than the perceived background noise level.
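As a rough illustration of this noise-adaptation rule, consider the
following Python sketch; the function name, the margin factor, and the
frame representation are illustrative assumptions rather than details taken
from this disclosure:

```python
def update_threshold(power_frames, threshold, recognized, margin=1.2):
    """Noise-adaptation sketch: `power_frames` holds per-frame power values
    spanning the required duration, and `recognized` is the phrase
    detector's verdict over the same span (both assumed inputs)."""
    sustained = all(p >= threshold for p in power_frames)
    if sustained and not recognized:
        # A sustained, unrecognizable sound is judged to be background
        # noise, so the threshold is raised above the observed noise level
        # (the 1.2 margin is an arbitrary illustrative choice).
        threshold = margin * max(power_frames)
    return threshold
```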
Also, the sound signal power level detector according to the present
invention may be used by the mechanism to generate an audible response
having a volume level corresponding to the perceived power levels of the
input sound signal.
Voice-activated interactive speech processing according to the present
invention includes: 1) sound signal capture for receiving ambient sound
signals projected to a receiving device; 2) sound signal analysis for
analyzing these sounds and generating voice feature parameters responsive
thereto; 3) phrase detection for comparing generated feature parameter data
for the perceived sounds against a set of speech reference templates in an
effort to find a match, and issuing phrase detection data should a
recognizable phrase be found; 4) overall speech recognition interactive
control for receiving generated phrase detection data, comprehending the
meaning of the input speech, determining appropriate response content, and
performing varied tasks responsive to the interpreted speech; 5) speech
synthesis for generating synthesized speech output based on the determined
response; 6) speech reproduction for broadcasting the generated synthesized
speech; and 7) input sound signal power detection for detecting the power,
magnitude or amplitude of input sound signals based on perceived sound
signal waveforms. Preferably, such speech processing includes the ability to
force a speech recognition device to selectively enter or terminate a
low-power sleep mode. Moreover, preferably, during this sleep mode, a
determination is made whether input sound signals, as detected during input
sound signal capture, are at least at a predetermined threshold signal level
above the background noise. If so, a determination is then made whether or
not the threshold filtered input sound signal corresponds to a recognizable
phrase, and, if so, the device shifts from the sleep mode into the active
mode.
Also, hardware or software timer processing can be incorporated in speech
processing according to the present invention to determine if a given
perceived sound meets or exceeds the predetermined threshold power level for
a specified duration of time. If a given perceived sound signal that is
higher in level than this threshold is continuously present for at least
this specified duration of time, and if the input sound signal is determined
not to be a recognizable phrase, the input sound signal is judged to be
steady noise present in the environment. This aids proper voice activation
in a noisy ambient environment. Moreover, the threshold power level may be
updated in real time to account for detection of this steady noise.

Furthermore, according to the present invention, input signal power
detection may be used for ambient noise feedback purposes to enable a speech
recognition mechanism to take into account perceived noise levels in
formulating the volume of response messages and other audible functions. In
so doing, the mechanism may set an initial threshold for eliminating noise,
and perform power detection for a specified duration of time using this
threshold as the reference. Specifically, for an input sound signal that is
higher in level than the threshold and continuously present for a specified
duration of time, if the input sound signal is determined not to be a
recognizable phrase, the input sound signal is judged by the mechanism to be
steady noise present in the environment. Also, the threshold is concurrently
updated to a value that is greater than the steady noise level.

Also, sound signal power level detection according to the present invention
may incorporate generating an audible response having a volume level
corresponding to the perceived power levels of the input sound signal.
As explained hereinabove, when a speech recognition device according to the
present invention is in the sleep mode based on a sleep mode request, it
determines whether or not the volume level or power level of a perceived
input sound signal at least matches a predetermined threshold volume level,
and also whether or not the input sound signal constitutes a recognizable
phrase. If both of these conditions are satisfied, the device shifts from
the low-power sleep mode into the active mode. Otherwise (i.e., in a
low-level sound or high-level noise situation), the sleep state is
maintained. As a result, only phrases to be recognized are processed for
recognition, while reducing deleterious noise effects.
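The two-condition activation test and the return to sleep can be pictured as
a small state machine, sketched below in Python; the callback names and
structure are assumptions for illustration (cf. the FIG. 4 flow chart):

```python
from enum import Enum

class Mode(Enum):
    SLEEP = 0   # low-power standby, CPU mostly idle
    ACTIVE = 1  # normal speech recognition processing

def step(mode, frame_power, threshold, is_recognizable, run_interaction):
    """One pass of the activation loop. `is_recognizable` stands in for the
    phrase detector and `run_interaction` for the speech comprehension
    interaction controller; both are hypothetical callables."""
    if mode is Mode.SLEEP:
        # Wake only if the sound meets the threshold AND is a known phrase;
        # low-level sounds and loud unrecognizable noise keep the device asleep.
        if frame_power >= threshold and is_recognizable():
            return Mode.ACTIVE
        return Mode.SLEEP
    # Active mode: recognize and respond until a sleep mode request arrives.
    sleep_requested = run_interaction()
    return Mode.SLEEP if sleep_requested else Mode.ACTIVE
```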
Furthermore, when the device is in the sleep mode, only those portions of
the device that consume small amounts of power, such as the sound signal
input unit and the input sound signal power detector area, need be active,
thereby keeping total power consumption at a relatively low level (e.g.,
power-consuming speech synthesis and reproduction circuitry may be powered
down at this time) in comparison with conventional speech recognition
devices.
According to the present invention, if an input sound signal that is higher
in level than the threshold is continuously present for a specified duration
of time and is determined not to be a recognizable phrase, it is judged to
be steady noise present in the environment. In this way, relatively loud
ambient noise sounds continuously present for an extended duration in the
environment can be considered extraneous and accounted for, and thus the
effects of steady noise present in the environment can be reduced. Thus,
according to the present invention, voice activation operations can be
responsive to a changing noise environment, as would be the case when
carrying a speech recognition toy from a quiet bedroom into the cabin of a
family vehicle.
The noise level in the environment is preferably judged based on the power
signal from the input sound signal power detector, and a response at a voice
level that corresponds to the noise level is output. Therefore, the response
can be output at a high voice level if the noise level in the environment is
high, making the response easier to hear even when some noise is present in
the environment. Of course, when the ambient environment becomes quiet,
speech processing according to the present invention permits attenuation of
the threshold and increased device responsiveness to external sounds.
Additionally, since the threshold is updated to a value that is greater than
the steady noise level, and the noise level at a certain point in time is
judged based on the magnitude of the threshold at that point in time, an
index of the noise level can be obtained from the threshold, making it
simple to determine the current noise level according to the present
invention. Furthermore, even if the noise level changes, the response can be
generated at a voice level that corresponds to the noise level, making it
possible to output the response at a voice level that better suits the noise
in the environment.
Furthermore, according to the present invention, the response may be output
at a voice level that corresponds to the power of the input sound signal.
Therefore, if the speaker's voice is loud, the response will also be loud;
and if the speaker's voice is soft, the response will also be soft, enabling
an interactive conversation at a volume level appropriate for the situation.
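A minimal sketch of this volume matching, assuming input power and output
volume are both normalized to a 0-1 range; the linear mapping and its limits
are illustrative choices, not taken from this disclosure:

```python
def response_volume(input_power, v_min=0.2, v_max=1.0):
    """Map the perceived power of the input sound signal to the level of
    the synthesized response: loud speaker -> loud reply, soft -> soft."""
    clamped = min(max(input_power, 0.0), 1.0)
    return v_min + (v_max - v_min) * clamped

print(response_volume(0.9))  # -> 0.92, a near-maximum reply level
```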
Other objects and attainments, together with a fuller understanding of the
invention, will become apparent and appreciated by referring to the
following description of specific, preferred embodiments and the appended
claims, taken in conjunction with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings, wherein like reference symbols refer to like parts:

FIG. 1 is a block diagram for explaining the first embodiment of the present
invention;

FIGS. 2A-2E diagrammatically illustrate a sample input voice waveform and
resultant partial word lattice for explaining phrase detection by the phrase
detector and speech recognition by the speech comprehension interaction
controller according to the present invention;

FIGS. 3A and 3B diagrammatically illustrate a simplified input sound signal
waveform and its corresponding power signal;
FIG. 4 is a flow chart for explaining voice activation processing according
to the first embodiment; and

FIGS. 5A and 5B diagrammatically illustrate a stylized input sound waveform
in which both the threshold and response output levels are set according to
the noise level according to the present invention.
DESCRIPTION OF THE PREFERRED EMBODIMENTS

The preferred embodiments of the invention are explained hereinbelow with
reference to specific figures where appropriate. Note that the invention has
been applied to a toy in these embodiments, and more particularly to a
stuffed toy dog intended for small children. Furthermore, these embodiments
will be explained in which teachings of the present invention are applied to
a non-specific speaker speech recognition device that can recognize the
speech of non-specific speakers.
FIG. 1 is a configuration diagram of the first embodiment of the present
invention. FIG. 1 schematically shows sound signal capture unit 1, sound
signal analyzer 2, phrase detector 3, speech reference templates memory 4,
speech comprehension interaction controller 5, response data memory 6,
speech synthesizer 7, speech output unit 8 and input sound signal power
detector 9. Note that, of these configuration elements, sound signal
analyzer 2, phrase detector 3, speech reference templates memory 4, speech
comprehension interaction controller 5, response data memory 6, and speech
synthesizer 7 are contained inside the belly of the stuffed toy dog (wherein
analyzer 2, phrase detector 3, controller 5 and synthesizer 7 are
constituent members of integrated CPU 10), and sound signal capture unit
(here a microphone) 1 and speech output unit (here a speaker) 8 are
installed in the ear and the mouth, respectively, of the stuffed toy, for
example. The functions of these elements are explained in sequence below.
Sound signals (including noise), such as a speaker's voice, are input into
the sound signal capture unit comprising a conventional microphone, an
amplifier, a lowpass filter, an A/D converter, etc., which are not
illustrated here for the sake of simplicity, since their structure does not
particularly impact the teachings of the present invention. The sound signal
input from the microphone is first passed through the amplifier and the
lowpass filter and converted into an appropriate sound waveform. This
waveform is converted into a digital signal (e.g., 12 kHz, 16 bits) by the
A/D converter, which is then sent to sound signal analyzer 2. Sound signal
analyzer 2 uses a programmed CPU to analyze at short intervals the frequency
of the waveform signal sent from sound signal capture unit 1, then extracts
the multi-dimensional speech feature vector that expresses the frequency
characteristics (the LPC cepstrum coefficient is normally used) thereof, and
generates the time series (hereafter referred to as the "feature vector
array") of this characteristic vector for subsequent matching and
recognition operations.
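The analysis stage can be sketched as follows. This sketch uses
non-overlapping frames and an FFT-based real cepstrum as a stand-in for the
LPC cepstrum analysis named above, so the frame length, coefficient count,
and windowing are all illustrative assumptions:

```python
import numpy as np

def feature_vector_array(signal, rate=12000, frame_ms=20, n_coeff=10):
    """Frame the 12 kHz digitized signal and emit one short feature vector
    per frame (the "feature vector array" time series described above)."""
    frame_len = rate * frame_ms // 1000            # 240 samples per frame
    feats = []
    for i in range(0, len(signal) - frame_len + 1, frame_len):
        frame = np.asarray(signal[i:i + frame_len], dtype=float)
        spectrum = np.abs(np.fft.rfft(frame * np.hamming(frame_len))) + 1e-10
        cepstrum = np.fft.irfft(np.log(spectrum))  # real cepstrum of the frame
        feats.append(cepstrum[1:n_coeff + 1])      # keep low-order coefficients
    return np.array(feats)                         # shape: (n_frames, n_coeff)
```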
The speech reference templates memory 4 preferably comprises a ROM device
that stores (registers) voice vector reference templates of the recognition
target phrases, created in advance using the speech issued for each word by
a large number of typical speakers chosen according to the contemplated uses
of the speech recognition device. Here, since a stuffed toy is used for the
example, about 10 phrases used for greeting, such as "Good morning," "Good
night," "Good day," "tomorrow," and "weather," for example, will be used as
the recognition targets. However, recognition target phrases are not limited
to these particular phrases, and a wide variety of phrases can be
registered, as will be apparent to those ordinarily skilled in the art.
Furthermore, the number of phrases that can be registered certainly need not
be limited to 10, and depends only on the size of addressable memory 4
utilized.
Also, although not shown in FIG. 1, phrase detector 3 comprises a general or
special-purpose processor (CPU) and a ROM device storing the processing
program, and determines if, and with what degree of certainty, target
phrases registered in reference templates memory 4 may be present in the
input voice. Hidden Markov Model (HMM) or DP matching can be used by phrase
detector 3, as is well known in the art of word-spotting processing
technology. However, in these embodiments, keyword-spotting processing
technology using the dynamic recurrent neural network (DRNN) method is
preferably used, as disclosed in U.S. application Ser. No. 08/078,027, filed
Jun. 18, 1993, entitled "Speech Recognition System", commonly assigned with
the present invention to Seiko-Epson Corporation of Tokyo, Japan, which is
incorporated fully herein by reference. This method is also disclosed in the
counterpart laid-open Japanese applications H6-4097 and H6-119476. DRNN is
used in order to perform voice recognition of virtually continuous speech by
non-specific speakers and to output word detection data as described herein.
The following is a brief explanation of the specific processing performed by
phrase detector 3 with reference to FIGS. 2A-2E. Phrase detector 3
determines the confidence level at which a word registered in speech
reference templates memory 4 occurs at a specific location in the input
voice. Now, suppose that the speaker inputs the example Japanese-language
phrase "asu no tenki wa . . . ", meaning "Concerning tomorrow's weather".
Assume that in this case the stylized voice signal shown in FIG. 2A
represents the audio waveform for this expression.
In the expression "asu no tenki wa . . . ", the contextual keywords or
target phrases include "asu" (tomorrow) and "tenki" (weather). These are
stored in the form of vector series reference templates in speech reference
templates memory 4 as part of a predetermined word registry, which in this
case represents approximately 10 distinct target words or phrases. If 10
phrases are registered, signals are output by phrase detector 3 in order to
detect phrases corresponding to these 10 phrases (designated phrase 1,
phrase 2, phrase 3 . . . up to phrase 10). From information such as the
detected signal values, the phrase detector determines the confidence level
at which the corresponding words occur in the input voice.
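One hypothetical way to represent these per-phrase detection signals is a
list of detection records; the record fields and time values below are
assumptions, while the phrase names and confidence levels follow the
FIGS. 2A-2E example:

```python
from typing import NamedTuple

class Detection(NamedTuple):
    phrase: str        # registered keyword
    start: float       # starting point on the time axis (illustrative units)
    end: float         # ending point on the time axis
    confidence: float  # peak detection signal level, ~0 to 1.0

# Partial word lattice for the input "asu no tenki wa . . . ":
lattice = [
    Detection("asu",   0.1, 0.4, 0.8),   # rises at w1 (cf. FIG. 2C)
    Detection("tenki", 0.6, 1.0, 0.9),   # rises at w2 (cf. FIG. 2B)
    Detection("nanji", 0.6, 1.0, 0.6),   # also rises at w2 (cf. FIG. 2D)
]
```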
More specifically, if the word "tenki" (weather) occurs in the input voice
as phrase 1, the detection subunit that is waiting for the signal "tenki"
(weather) initiates an analog signal which rises at the portion "tenki" in
the input voice, as shown in FIG. 2B. Similarly, if the word "asu"
(tomorrow) occurs in the input voice as word 2, the detection subunit that
is waiting for the signal "asu" rises at the portion "asu" in the input
voice, as shown in FIG. 2C.

In FIGS. 2B and 2C, the numerical values 0.9 and 0.8 indicate respective
confidence levels that the spoken voice contains the particular
pre-registered keyword. The relative level or magnitude of this level can
fluctuate between ~0 and 1.0, with 0 indicating a zero confidence match
factor and 1.0 representing a 100% confidence match factor. In the case of a
high confidence level, such as 0.9 or 0.8, the registered or target phrase
having a high confidence level can be considered to be a recognition
candidate relative to the input voice.
Thus, the registered word "asu" occurs with a confidence level of 0.8 at
position w1 on the time axis. Similarly, the registered word or phrase
"tenki" occurs with a confidence level of 0.9 at position w2 on the time
axis.

Also, the example of FIGS. 2A-2E shows that, when the word "tenki" (weather)
is input, the signal that is waiting for phrase 3 (phrase 3 is assumed to be
the registered word "nanji" ("What time . . . ")) also rises at position w2
on the time axis, with a relatively uncertain confidence level of
approximately 0.6. Thus, if two or more registered phrases exist as
recognition candidates at the same time relative to an input voice signal,
the recognition candidate word is determined by one of two methods: either
by 1) selecting the potential recognition candidate having the highest
degree of similarity to the input voice (using absolute confidence level
comparisons) as the actually recognized keyword or phrase; or by 2)
selecting one of the words as the recognized word by creating a predefined
correlation table expressing contextual rules between words. In this case,
the confidence level for "tenki" (weather) indicates that it has the highest
degree of similarity to the input voice during time portion w2 on the time
axis, even though "nanji" can also be recognized as a potential recognition
candidate. Therefore, "tenki" is selected as the actual recognition
candidate for this example. Based on these confidence levels, interaction
controller 5 performs the recognition of input voices.
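Method 1, selection by absolute confidence comparison, reduces to a one-line
rule, sketched below; the (phrase, confidence) tuple format is an
assumption, and method 2's correlation table is not shown:

```python
def pick_candidate(candidates):
    """Among registered phrases detected over the same time portion, keep
    the one with the highest confidence level (method 1 above)."""
    return max(candidates, key=lambda c: c[1])

# At time portion w2 in the example, "tenki" (0.9) beats "nanji" (0.6):
print(pick_candidate([("tenki", 0.9), ("nanji", 0.6)]))  # -> ('tenki', 0.9)
```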
Collectively, the detection information, including starting and ending
points on the time axis and the maximum magnitude of the detection signal
indicating the confidence level, for each pre-registered word contained in
the non-specific speaker word registry within speech reference templates
memory 4, is known as a word lattice. In FIGS. 2B-2E, only a partial
four-dimensional lattice is shown for the sake of clarity, but a word
lattice including detection information for every pre-registered