[19] United States Patent
Miyazawa et al.

[11] Patent Number: 5,983,186
[45] Date of Patent: *Nov. 9, 1999

[54] VOICE-ACTIVATED INTERACTIVE SPEECH RECOGNITION DEVICE AND METHOD
`
[75] Inventors: Yasunaga Miyazawa; Mitsuhiro Inazumi; Hiroshi Hasegawa; Isao Edatsune; Osamu Urano, all of Suwa, Japan
`
FOREIGN PATENT DOCUMENTS

62-253093   11/1987   Japan
 6-4097      1/1994   Japan
 6-119476    4/1994   Japan
[73] Assignee: Seiko Epson Corporation, Tokyo, Japan
`
`Primary Examiner—David R. Hudspeth
`Assistant Examiner—Daniel Abebe
`Attorney, Agent, or Firm—Michael T. Gabrik
`
[*] Notice: This patent issued on a continued prosecution application filed under 37 CFR 1.53(d), and is subject to the twenty-year patent term provisions of 35 U.S.C. 154(a)(2).
`
`[21] Appl. No.: 08/700,181
`
[22] Filed: Aug. 20, 1996

[30] Foreign Application Priority Data

Aug. 21, 1995  [JP]  Japan .................................... 7-212248
`
[51] Int. Cl.6 ........................................ G10L 5/00
[52] U.S. Cl. .......................... 704/275; 704/251; 704/233
[58] Field of Search ................... 179/15, 251; 704/212, 275, 246, 233, 255; 381/57

[56] References Cited

U.S. PATENT DOCUMENTS
`
2,338,551   1/1944  Stanko ................................. 381/57
4,052,568  10/1977  Jankowski .............................. 179/15
5,562,453  10/1996  Wen .................................... 704/275
5,577,164  11/1996  Kaneko ................................. 704/275
5,668,929   9/1997  Foster ................................. 704/246
5,704,009  12/1997  Cline .................................. 704/246
5,794,198   8/1998  Yegnanarayanan et al. .................. 704/255
5,799,279   8/1998  Gould et al. ........................... 704/275
`
`
[57] ABSTRACT
`
Techniques for implementing adaptable voice activation operations for interactive speech recognition devices and instruments. Specifically, such speech recognition devices and instruments include an input sound signal power or volume detector in communication with a central CPU for bringing the CPU out of an initial sleep state upon detection of perceived voice that exceeds a predetermined threshold volume level and is continuously perceived for at least a certain period of time. If both these conditions are satisfied, the CPU is transitioned into an active mode so that the perceived voice can be analyzed against a set of registered key words to determine if a “power on” command or similar instruction has been received. If so, the CPU maintains an active state and normal speech recognition processing ensues until a “power off” command is received. However, if the perceived and analyzed voice cannot be recognized, it is deemed to be background noise and the minimum threshold is selectively updated to accommodate the volume level of the perceived but unrecognized voice. Other aspects include tailoring the volume level of the synthesized voice response according to the perceived volume level as detected by the input sound signal power detector, as well as modifying audible response volume in accordance with updated volume threshold levels.
`
`9 Claims, 4 Drawing Sheets
`
`
`
`
`
[Sheet 1 of 4, FIG. 1: block diagram of the first embodiment, showing sound signal input unit 1, sound signal analyzer 2, phrase detector 3, speech reference templates memory 4, speech comprehension interaction controller 5, response data memory 6, speech synthesizer 7, speech output unit 8, and input sound signal power detector 9.]
`
`
`
[Sheet 2 of 4, FIGS. 2A-2E: stylized voice signal for the input "asu no tenki wa" along the time axis (FIG. 2A), and detection signals for phrase 1 "tenki" (weather, confidence 0.9), phrase 2 "asu" (tomorrow, confidence 0.8), phrase 3 "nanji" (what time), and phrase 4 "ohayo" (good morning) (FIGS. 2B-2E).]
`
`
`
[Sheet 3 of 4, FIGS. 3A and 3B: a simplified input sound signal waveform and the time sequence of its corresponding power signal. FIGS. 5A and 5B: a stylized input waveform showing thresholds (th1, th2) and response output levels set according to the noise level.]
`
`
`
[Sheet 4 of 4, FIG. 4: flowchart of voice activation processing. (s1) Is an input sound signal equal to or greater than the threshold present? If so, (s2) wake up the CPU and (s3) analyze the sound signal and detect a phrase. If the device is in sleep mode, test whether the detected phrase is a keyword: if not, put the CPU back to sleep; if so, set the flag to active mode and output a response. In active mode, speech comprehension interaction control runs until the input sound signal is finished for a single interaction; if a sleep mode request is received, set the flag to sleep mode and set the device in sleep mode.]
`
`
`
`VOICE-ACTIVATED INTERACTIVE SPEECH
`RECOGNITION DEVICE AND METHOD
`
`CROSS REFERENCE TO RELATED
`APPLICATIONS
`
This application is related to copending application Ser. No. 08/700,175, filed on the same date as the present application, Attorney’s Docket number P2504a, entitled “A Cartridge Based Interactive Voice Recognition Method and Apparatus”, and copending application Ser. No. 08/669,874, filed on the same date as the present application, Attorney’s Docket number P2505a, entitled “A Speech Recognition Device and Processing Method”, both commonly assigned with the present invention to the Seiko Epson Corporation of Tokyo, Japan. This application is also related to the following copending applications: application Ser. No. 08/078,027, filed Jun. 18, 1993, entitled “Speech Recognition System”; application Ser. No. 08/102,859, filed Aug. 6, 1993, entitled “Speech Recognition Apparatus”; application Ser. No. 08/485,134, filed Jun. 7, 1995, entitled “Speech Recognition Apparatus Using Neural Network and Learning Method Therefore”; and application Ser. No. 08/536,550, filed Sep. 29, 1996, entitled “Interactive Voice Recognition Method and Apparatus Using Affirmative/Negative Content Discrimination”; again all commonly assigned with the present invention to the Seiko Epson Corporation of Tokyo, Japan.
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
`
`The invention generally relates to interactive speech rec-
`ognition instruments which recognize speech and produce
`an audible response or specified action based on developed
`recognition results, and is particularly concerned with voice-
`based activation of such instruments.
`
`2. Description of the Related Art
`Speech recognition devices can be generally classified
`into two types. The first type is the specific-speaker speech
`recognition device that can only recognize the speech of a
`specific speaker, and the second general type is the non-
`specific speaker speech recognition device that can recog-
`nize the speech of non-specific speakers.
In the case of a specific-speaker speech recognition device, a specific speaker first registers his or her speech signal patterns as reference templates by entering recognizable words one at a time according to a specified interactive procedure. After registration, when the speaker utters one of the registered words, speech recognition is performed by comparing the feature pattern of the entered word to the registered speech templates. One example of this kind of interactive speech recognition device is a speech recognition toy. The child who uses the toy pre-registers, for example, about 10 phrases such as “Good morning,” “Good night” and “Good day,” as multiple speech instructions. In practice, when the speaker says “Good morning,” his speech signal is compared to the speech signal of the registered “Good morning.” If there is a match between the two speech signals, an electrical signal corresponding to the speech instruction is generated, which then makes the toy perform a specified action.
As the name implies, of course, this type of specific-speaker speech recognition device can recognize only the speech of a specific speaker or speech possessing a similar pattern. Furthermore, since the phrases to be recognized must be registered one at a time as part of device initialization, using such a device is quite cumbersome and complex.
`
By contrast, a non-specific speaker speech recognition device creates standard speech feature patterns of the recognition target phrases described above, using “canned” speech exemplars spoken by a large number (e.g., around 200) of speakers. Phrases spoken by a non-specific speaker/user are then compared to these pre-registered recognizable phrases for recognition.
However, such speech recognition devices usually become ready to perform recognition operations and responses only when an external switch is turned on or external power delivered to the device is switched on, regardless of whether the device uses specific or non-specific speaker recognition. But, for some types of speech recognition devices, it would be more convenient if the device were in a standby state waiting for speech input at all times, and performed recognition operations by sensing speech input, without the need for the user to turn on the switch every time.
Take a stuffed toy utilizing speech recognition, for example. If the toy can be kept in a speech input standby state, i.e., a so-called sleep mode, and can instantly respond when the child calls out its name, it can respond quickly without the need for plugging the device in or pressing a button, thereby greatly enhancing its appeal as a user-friendly device, especially for younger children, where applying external power may raise safety concerns. In addition to toys, the same can be said of all electronic instruments that utilize speech recognition.
Some issues must be resolved when keeping the device in a sleep mode and having it perform recognition operations by sensing speech input, as explained above. These include, for example, power consumption and the ability of the device to differentiate between phrases to be recognized and noise, and to act only in response to phrases to be recognized. In particular, since most toys run on batteries, minimizing battery drain is a major issue. Additionally, product prices must also be kept low to maintain commercial appeal for such devices, so using expensive, conventional activation circuitry is undesirable. So, heretofore, there have been a large number of technical restrictions on commercializing interactive speech devices which also feature voice activation.
`
`OBJECTS OF THE INVENTION
`
It is therefore an object of the present invention to enable the device to remain in a sleep mode and perform recognition operations only when a recognizable speech input is detected, to minimize power consumption during the sleep mode, to enable the speech to be recognized at high accuracy even if noise is present in the usage environment, and to enhance the commercial appeal of the device by retaining low cost relative to conventional designs.
`SUMMARY OF THE INVENTION
`
`In accordance with this and related objects, a voice
`activated interactive speech mechanism according to the
`present invention includes: 1) a sound signal input unit for
`receiving ambient signals projected to the mechanism; 2) a
`sound signal analyzer in communication with the sound
`signal input unit for analyzing sounds perceived by the
`sound signal input unit and generating voice feature param-
`eters responsive to these analyzed sounds; 3) a phrase
`detector in communication with the sound signal analyzer
`for receiving generated voice feature parameters for the
`perceived sounds, comparing the received data against a set
`of speech reference templates in an effort to find a match,
`
`and generating phrase detection data should a recognizable
`phrase be found; 4) a speech recognition interactive con-
`troller in communication with the phrase detector which
`receives generated phrase detection data, understands the
`meaning of the input speech, determines appropriate
`response content, and performs various controls on the
`mechanism based on the interpreted speech; 5) a speech
`synthesizer in communication with the interaction controller
`for generating synthesized speech output based on the
`determined response; 6) a speech output unit in communi-
`cation with the interaction controller and the speech synthe-
`sizer for broadcasting the generated synthesized speech; and
`7) an input sound signal power detector in communication
`with at least the sound signal input unit and the interaction
`controller for detecting the volume, magnitude or amplitude
`of input sound signals based on sound signal waveforms
`perceived by the sound signal input unit or capture device.
`Preferably, this power detector includes processing circuitry
`for forcing the mechanism to selectively enter or terminate
`a low-power sleep mode. Moreover, preferably, during this
`sleep mode, either the interaction controller or the input
`sound signal power detector itself determines whether input
`sound signals, as detected by the input sound signal power
`detector, are at least at a predetermined threshold volume
level above the background noise. If so, a determination is then made whether or not the threshold-filtered input sound signal corresponds to a recognizable phrase, and, if it does, the device is shifted from the sleep mode into the active mode.
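By way of illustration only, the sleep-to-active decision described above may be summarized in the following minimal Python sketch. This is not the patented circuitry: the names WakeController and is_registered_phrase are hypothetical, the latter standing in for the phrase detector's template matching described below.

    def is_registered_phrase(frame):
        # Stand-in for the phrase detector's template-matching step.
        return False

    class WakeController:
        def __init__(self, threshold, min_duration_frames):
            self.threshold = threshold                # minimum power to consider
            self.min_duration = min_duration_frames   # frames power must persist
            self.active = False                       # False = sleep mode
            self._run = 0                             # consecutive loud frames

        def on_frame(self, power, frame):
            """Process one input frame; return True while in active mode."""
            if self.active:
                return True
            if power >= self.threshold:
                self._run += 1
                # Loud enough for long enough: check for a registered phrase.
                if self._run >= self.min_duration and is_registered_phrase(frame):
                    self.active = True                # shift into active mode
            else:
                self._run = 0                         # loudness not sustained
            return self.active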
`Also, a hardware or software timer can be used to
`determine if a given perceived sound meets or exceeds the
`predetermined threshold power level for a specified duration
`of time. If a given perceived sound signal that is higher in
`level than this threshold is continuously present for at least
`this specified duration of time and if the input sound signal
`is determined not to be a recognizable phrase, the input
`sound signal is determined to be background noise present
`in the environment. This aids proper voice activation in a
`noisy ambient environment. Moreover, the threshold power
`level may be updated in real time to account for detection of
`this background noise.
`Furthermore,
`the sound signal power level detector,
`according to the present
`invention, may be used as an
`ambient noise feedback device to enable the speech recog-
`nition mechanism to take into account perceived noise levels
in formulating the volume of response messages and other
`audible functions. In so doing, the mechanism may set an
`initial threshold for eliminating noise, and perform power
`detection for a specified duration of time using this threshold
`as the reference. Specifically, 1) if an input sound signal that
`is higher in level than the current threshold is perceived for
`at least a specified duration of time, and, 2) if the input sound
`signal is determined not to be a recognizable phrase, the
`input sound signal is judged by the mechanism to be ambient
`background noise. At the same time the threshold is updated
`to a value greater than the perceived background noise level.
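A corresponding sketch of this background-noise judgment and threshold update follows; the names are again hypothetical, and the margin factor is illustrative only, the scheme requiring merely that the new threshold exceed the perceived noise level.

    def update_threshold(power_history, threshold, recognized, margin=1.2):
        """Return an updated threshold after one detection window.

        power_history: power values spanning the specified duration of time
        recognized:    whether the phrase detector matched a registered phrase
        """
        sustained = all(p >= threshold for p in power_history)
        if sustained and not recognized:
            # Continuously loud yet unrecognizable: judged to be steady
            # background noise; raise the threshold above the noise level.
            return max(power_history) * margin
        return threshold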
`Also, the sound signal power level detector according to
`the present invention may be used by the mechanism to
`generate an audible response having a volume level corre-
`sponding to the perceived power levels of the input sound
`signal.
Voice activated interactive speech processing according to the present invention includes: 1) sound signal capture for receiving ambient sound signals projected to a receiving device; 2) sound signal analysis for analyzing these sounds and generating voice feature parameters responsive thereto; 3) phrase detection for comparing generated feature parameter data for the perceived sounds against a set of speech
`reference templates in an effort to find a match, and issuing
`phrase detection data should a recognizable phrase be found;
`4) overall speech recognition interactive control for receiv-
`ing generated phrase detection data, comprehending the
`meaning of the input speech, determining appropriate
`response content, and performing varied tasks responsive to
`the interpreted speech; 5) speech synthesis for generating
`synthesized speech output based on the determined
`response; 6) speech reproduction for broadcasting the gen-
`erated synthesized speech; and 7) input sound signal power
`detection for detecting the power, magnitude or amplitude of
`input sound signals based on perceived sound signal wave-
`forms. Preferably, such speech processing includes the abil-
`ity to force a speech recognition device to selectively enter
`or terminate a low-power sleep mode. Moreover, preferably,
`during this sleep mode, a determination is made whether
`input sound signals, as detected during input sound signal
`capture, are at least at a predetermined threshold signal level
above the background noise. If so, a determination is then made whether or not the threshold-filtered input sound signal corresponds to a recognizable phrase, and, if it does, the device is shifted from the sleep mode into the active mode.
`Also, hardware or software timer processing can be
`incorporated in speech processing according to the present
`invention to determine if a given perceived sound meets or
`exceeds the predetermined threshold power level for a
`specified duration of time. If a given perceived sound signal
`that is higher in level than this threshold is continuously
`present for at least this specified duration of time, and, if the
`input sound signal is determined not to be a recognizable
`phrase, the input sound signal is judged to be steady noise
`present in the environment. This aids proper voice activation
`in a noisy ambient environment. Moreover, the threshold
`power level may be updated in real time to account for
`detection of this steady noise.
Furthermore, according to the present invention, input signal power detection may be used for ambient noise feedback purposes to enable a speech recognition mechanism to take into account perceived noise levels in formulating the volume of response messages and other audible functions. In so doing, the mechanism may set an initial threshold for eliminating noise, and perform power detection for a specified duration of time using this threshold as the reference. Specifically, for an input sound signal that is higher in level than the threshold and is continuously present for a specified duration of time, if the input sound signal is determined not to be a recognizable phrase, the input sound signal is judged by the mechanism to be steady noise present in the environment. Also, the threshold is concurrently updated to a value that is greater than the steady noise level.
`Also, sound signal power level detection according to the
`present invention may incorporate generating an audible
`response having a volume level corresponding to the per-
`ceived power levels of the input sound signal.
As explained hereinabove, when a speech recognition device according to the present invention is in the sleep mode based on a sleep mode request, it determines whether or not the volume level or power level of a perceived input sound signal at least matches a predetermined threshold volume level and also whether or not the input sound signal constitutes a recognizable phrase. If both of these conditions are satisfied, the device shifts from the low-power sleep mode into the active mode. Otherwise (i.e., a low-level sound or high-level noise situation), the sleep state is maintained. As a result, only phrases to be recognized are processed for recognition while reducing deleterious noise effects.
`Furthermore, when the device is in the sleep mode, only
`
those portions of the device that consume small amounts of power, such as the sound signal input unit and the input sound signal power detector, need be active, thereby keeping the total power consumption at a relatively low level (e.g., power-consuming speech synthesis and reproduction circuitry may be powered down at this time) in comparison with conventional speech recognition devices.
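As a rough sketch of this selective powering (the component names here are hypothetical and not part of the disclosure):

    # During sleep, only the low-power front end stays energized.
    SLEEP_POWERED = {"sound_signal_input_unit", "input_power_detector"}

    def apply_power_state(component_names, active):
        """Return a map of component name -> powered flag for this mode."""
        return {name: active or name in SLEEP_POWERED
                for name in component_names}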
According to the present invention, if an input sound signal that is higher in level than the threshold is continuously present for a specified duration of time and is determined not to be a recognizable phrase, it is judged to be steady noise present in the environment. In this way, relatively loud ambient noise sounds continuously present for an extended duration in the environment can be considered extraneous and accounted for, and thus the effects of steady noise present in the environment can be reduced. Thus, according to the present invention, voice activation operations can be responsive to a changing noise environment, as would be the case when carrying a speech recognition toy from a quiet bedroom into the cabin of a family vehicle.
`The noise level in the environment is preferably judged
`based on the power signal from the input sound signal power
`detector, and a response at a voice level that corresponds to
`the noise level is output. Therefore, the response can be
`output at a high voice level
`if the noise level
`in the
`environment is high, making the response easier to hear even
`when some noise is present in the environment. Of course,
`when the ambient environment becomes quiet, speech pro-
`cessing according to the present invention permits attenua-
`tion of the threshold and increased device responsiveness to
`external sounds.
`
`Additionally, since the threshold is updated to a value that
`is greater than the steady noise level, and the noise level at
`a certain point in time is judged based on the magnitude of
`the threshold at that point in time, the index of the noise level
`can be obtained based on the threshold, making it simple to
`determine the current noise level according to the present
`invention. Furthermore, even if the noise level changes, the
`response can be generated at a voice level that corresponds
`to the noise level, making it possible to output the response
`at a voice level that better suits the noise in the environment.
`
Furthermore, according to the present invention, the response may be output at a voice level that corresponds to the power of the input sound signal. Therefore, if the speaker’s voice is loud, the response will also be loud; and if the speaker’s voice is soft, the response will also be soft, enabling an interactive conversation at a volume level appropriate for the situation.
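Combining the two volume cues just described, a response level might be derived as in the following sketch. The scaling constants are purely illustrative; the threshold serves as the index of ambient noise, as explained above.

    def response_volume(input_power, threshold, noise_gain=0.5,
                        voice_gain=0.5, max_vol=1.0):
        """Louder surroundings or a louder speaker yield a louder response;
        a quiet room and a soft voice yield a soft one."""
        level = noise_gain * threshold + voice_gain * input_power
        return min(level, max_vol)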
Other objects and attainments together with a fuller understanding of the invention will become apparent and appreciated by referring to the following description of specific, preferred embodiments and appended claims, taken in conjunction with the accompanying drawings:
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`In the drawings, wherein like reference symbols refer to
`like parts:
`FIG. 1 is a block diagram for explaining the first embodi-
`ment of the present invention;
FIGS. 2A-2E diagrammatically illustrate a sample input voice waveform and resultant partial word lattice for explaining phrase detection by the phrase detector and speech recognition by the speech comprehension interaction controller according to the present invention;
FIGS. 3A and 3B diagrammatically illustrate a simplified input sound signal waveform and its corresponding power signal;
`
`FIG. 4 is a flow chart for explaining voice activation
`processing according to the first embodiment; and
FIGS. 5A and 5B diagrammatically illustrate a stylized input sound waveform in which both the threshold and response output levels are set according to the noise level according to the present invention.
`
`DESCRIPTION OF THE PREFERRED
`EMBODIMENTS
`
The preferred embodiments of the invention are explained hereinbelow with reference to the figures where appropriate. Note that the invention has been applied to a toy in these embodiments, and more particularly to a stuffed toy dog intended for small children. Furthermore, these embodiments will be explained in which teachings of the present invention are applied to a non-specific speaker speech recognition device that can recognize the speech of non-specific speakers.
`FIG. 1 is a configuration diagram of the first embodiment
`of the present invention. FIG. 1 schematically shows sound
`signal capture unit 1, sound signal analyzer 2, phrase detec-
`tor 3, speech reference templates memory 4, speech com-
`prehension interaction controller 5, response data memory 6,
`speech synthesizer 7, speech output unit 8 and input sound
signal power detector 9. Note that, of these configuration elements, sound signal analyzer 2, phrase detector 3, speech reference templates memory 4, speech comprehension interaction controller 5, response data memory 6, and speech synthesizer 7 are contained inside the belly of the stuffed toy dog (wherein analyzer 2, phrase detector 3, controller 5 and synthesizer 7 are constituent members of integrated CPU 10), and sound signal capture unit (here a microphone) 1 and speech output unit (here a speaker) 8 are installed in the ear and the mouth, respectively, of the stuffed toy, for example.
`The functions of these elements are explained in sequence
`below.
`
`Sound signals (including noise), such as a speaker’s
`voice, are input into the sound signal capture unit compris-
`ing a conventional microphone, an amplifier, a lowpass
`filter, an A/D converter, etc. which are not illustrated here for
`the sake of simplicity, since their structure does not particu-
`larly impact the teachings of the present invention. The
`sound signal
`input from the microphone is first passed
`through the amplifier and the lowpass filter and converted
`into an appropriate sound waveform. This waveform is
converted into a digital signal (e.g., 12 kHz, 16 bits) by the
`A/D converter, which is then sent to sound signal analyzer
`2. Sound signal analyzer 2 uses a programmed CPU to
`analyze at short intervals the frequency of the waveform
`signal sent from sound signal capture unit 1, then extracts
`the multi-dimensional speech feature vector that expresses
`the frequency characteristics (LPC-CEPSTRUM coefficient
`is normally used) thereof, and generates the time series
`(hereafter referred to as “feature vector array”) of this
`characteristic vector for subsequent matching and recogni-
`tion operations.
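The analysis step may be pictured with the following sketch, which frames the 12 kHz digitized signal and reduces each frame to a feature vector. Crude log-spectral features are used here solely for brevity; the analyzer described above extracts LPC cepstrum coefficients instead.

    import numpy as np

    def feature_vector_array(samples, frame_len=240, hop=120, dims=10):
        """samples: digitized speech at 12 kHz (frame_len=240 is 20 ms).
        Returns the time series of per-frame feature vectors."""
        frames = []
        for start in range(0, len(samples) - frame_len + 1, hop):
            frame = samples[start:start + frame_len] * np.hamming(frame_len)
            spectrum = np.abs(np.fft.rfft(frame)) + 1e-10
            frames.append(np.log(spectrum[:dims]))   # simplified features
        return np.array(frames)                      # "feature vector array"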
`The speech reference templates memory 4 preferably
`comprises a ROM device that stores (registers) voice vector
`reference templates of the recognition target phrases, created
`in advance using the speech issued for each word by a large
`number of typical speakers chosen according to the contem-
`plated uses of the speech recognition device. Here, since a
`stuffed toy is used for the example, about 10 phrases used for
`greeting, such as “Good morning,” “Good night” and “Good
`day,” “tomorrow,” and “weather,” for example, will be used
`as the recognition targets. However,
`recognition target
`
`phrases are not limited to these particular phrases, and a
`wide variety of phrases can be registered, as will be apparent
to those ordinarily skilled in the art. Furthermore, the number of phrases that can be registered certainly need not be limited to 10, and is dependent only on the size of addressable memory 4 utilized.
Also, although not shown in FIG. 1, phrase detector 3 comprises a general or special-purpose processor (CPU) and a ROM device storing the processing program, and determines if, and to what degree of certainty, the target phrases registered in reference templates memory 4 may be present in the input voice. Hidden Markov Model (HMM) or DP matching can be used by phrase detector 3, as is well known in the art of word-spotting processing technology. However, in these embodiments, keyword-spotting processing technology using the dynamic recurrent neural network (DRNN) method is preferably used, as disclosed in U.S. application Ser. No. 08/078,027, filed Jun. 18, 1993, entitled “Speech Recognition System”, commonly assigned with the present invention to Seiko Epson Corporation of Tokyo, Japan, which is incorporated fully herein by reference. Also, this method is disclosed in the counterpart laid-open Japanese applications H6-4097 and H6-119476. DRNN is used in order to perform voice recognition of virtually continuous speech by nonspecific speakers and to output word detection data as described herein.
`
The following is a brief explanation of the specific processing performed by phrase detector 3 with reference to FIGS. 2A-2E. Phrase detector 3 determines the confidence level at which a word registered in speech reference templates memory 4 occurs at a specific location in the input voice. Now, suppose that the speaker inputs an example Japanese language phrase “asu no tenki wa...” meaning “Concerning tomorrow’s weather”. Assume that in this case the stylized voice signal shown in FIG. 2A represents the audio waveform for this expression.
In the expression “asu no tenki wa...”, the contextual keywords or target phrases include “asu” (tomorrow) and “tenki” (weather). These are stored in the form of vector series reference templates in speech reference templates memory 4 as parts of a predetermined word registry, which in this case represents approximately 10 distinct target words or phrases. If 10 phrases are registered, signals are output by the phrase detector 3 in order to detect phrases corresponding to these 10 phrases (designated phrase 1, phrase 2, phrase 3 ... up to phrase 10). From information such as detected signal values, the phrase detector determines the confidence level at which the corresponding words occur in the input voice.
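The per-phrase detection signals of FIGS. 2B-2E can be sketched as follows; the similarity function is a hypothetical stand-in for the DRNN scoring, and each detector reports its peak confidence and the time position at which it occurs.

    def spot_phrases(features, templates, similarity):
        """Return {phrase: (confidence, time_position)} for the input voice."""
        results = {}
        for phrase, template in templates.items():
            # One detection signal per registered phrase, over all positions.
            scores = [similarity(features, template, t)
                      for t in range(len(features))]
            best_t = max(range(len(scores)), key=scores.__getitem__)
            results[phrase] = (scores[best_t], best_t)
        return results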
More specifically, if the word “tenki” (weather) occurs in the input voice as phrase 1, the detection subunit that is waiting for the signal “tenki” (weather) initiates an analog signal which rises at the portion “tenki” in the input voice, as shown in FIG. 2B. Similarly, if the word “asu” (tomorrow) occurs in the input voice as phrase 2, the signal from the detection subunit that is waiting for “asu” rises at the portion “asu” in the input voice, as shown in FIG. 2C.
In FIGS. 2B and 2C, the numerical values 0.9 and 0.8 indicate respective confidence levels that the spoken voice contains the particular pre-registered keyword. This level can fluctuate between ~0 and 1.0, with 0 indicating a zero confidence match factor and 1.0 representing a 100% confidence match factor. In the case of a high confidence level, such as 0.9 or 0.8, the registered or target phrase having a high confidence level can be considered to be a recognition candidate relative to the input voice.
`
Thus, the registered word “asu” occurs with a confidence level of 0.8 at position w1 on the time axis. Similarly, the registered word or phrase “tenki” occurs with a confidence level of 0.9 at position w2 on the time axis.
Also, the example of FIGS. 2A-2E shows that, when the word “tenki” (weather) is input, the signal that is waiting for phrase 3 (phrase 3 is assumed to be the registered word “nanji” (“What time...”)) also rises at position w2 on the time axis, with a relatively uncertain confidence level of approximately 0.6. Thus, if two or more registered phrases exist as recognition candidates at the same time relative to an input voice signal, the recognition candidate word is determined by one of two methods: either by 1) selecting the potential recognition candidate having the highest degree of similarity to the input voice (using absolute confidence level comparisons) as the actually recognized keyword or phrase; or by 2) selecting one of the words as the recognized word by creating a predefined correlation table expressing contextual rules between words. In this case, the confidence level for “tenki” (weather) indicates that it has the highest degree of similarity to the input voice during time portion w2 on the time axis, even though “nanji” can also be recognized as a potential recognition candidate. Therefore, “tenki” is selected as the actual recognition candidate for this example. Based on these confidence levels, interaction controller 5 performs the recognition o