United States Patent [19]
Miyazawa et al.

US005983186A

[11] Patent Number: 5,983,186
[45] Date of Patent: *Nov. 9, 1999

[54] VOICE-ACTIVATED INTERACTIVE SPEECH RECOGNITION DEVICE AND METHOD

[75] Inventors: Yasunaga Miyazawa; Mitsuhiro Inazumi; Hiroshi Hasegawa; Isao Edatsune; Osamu Urano, all of Suwa, Japan

[73] Assignee: Seiko Epson Corporation, Tokyo, Japan

[*] Notice: This patent issued on a continued prosecution application filed under 37 CFR 1.53(d), and is subject to the twenty year patent term provisions of 35 U.S.C. 154(a)(2).

[21] Appl. No.: 08/700,181
[22] Filed: Aug. 20, 1996

[30] Foreign Application Priority Data
Aug. 21, 1995 [JP] Japan 7-212248

[51] Int. Cl.6: G10L 5/00
[52] U.S. Cl.: 704/275; 704/251; 704/233
[58] Field of Search: 179/15, 251; 704/212, 704/275, 246, 233, 255; 381/57

[56] References Cited

U.S. PATENT DOCUMENTS
2,338,551   1/1944   Stanko   381/57
4,052,568   10/1977   [name illegible]   179/15
[entries illegible in source]
5,562,453   10/1996   [name illegible]   704/275
5,577,164   11/1996   Kanefia   704/275
5,668,929   9/1997   Foster   704/246
5,704,009   12/1997   Cline   704/246
5,794,198   8/1998   Yegnanarayanan et al.   704/255
5,799,279   8/1998   Gould et al.   704/275

FOREIGN PATENT DOCUMENTS
62-253093   11/1987   Japan
6-4097   1/1994   Japan
6-119476   4/1994   Japan

Primary Examiner: David R. Hudspeth
Assistant Examiner: Daniel Abebe
Attorney, Agent, or Firm: Michael T. Gabrik

[57] ABSTRACT

Techniques for implementing adaptable voice activation operations for interactive speech recognition devices and instruments. Specifically, such speech recognition devices and instruments include an input sound signal power or volume detector in communication with a central CPU for bringing the CPU out of an initial sleep state upon detection of perceived voice that exceeds a predetermined threshold volume level and is continuously perceived for at least a certain period of time. If both these conditions are satisfied, the CPU is transitioned into an active mode so that the perceived voice can be analyzed against a set of registered key words to determine if a “power on” command or similar instruction has been received. If so, the CPU maintains an active state and normal speech recognition processing ensues until a “power off” command is received. However, if the perceived and analyzed voice cannot be recognized, it is deemed to be background noise and the minimum threshold is selectively updated to accommodate the volume level of the perceived but unrecognized voice. Other aspects include tailoring the volume level of the synthesized voice response according to the perceived volume level as detected by the input sound signal power detector, as well as modifying audible response volume in accordance with updated volume threshold levels.
`
`9 Claims, 4 Drawing Sheets
`
`
[Representative drawing: flow chart of FIG. 4]
`
`Page 1 of 14
`
`GOOGLE EXHIBIT 1003
`
U.S. Patent    Nov. 9, 1999    Sheet 1 of 4    5,983,186

[FIG. 1: block diagram. Legible block labels: sound signal input unit; input sound signal power detector; speech comprehension interaction controller; speech reference templates memory; response data memory; speech synthesizer; speech output unit.]
`
Sheet 2 of 4

[FIG. 2A: voice signal waveform along the time axis, with “asu” (tomorrow) and “tenki” (weather) marked]
[FIG. 2B: phrase 1, “tenki” (weather), detection signal versus time]
[FIG. 2C: phrase 2, “asu” (tomorrow), detection signal versus time]
[FIG. 2D: phrase 3, “nanji” (what time), detection signal versus time]
[FIG. 2E: phrase 4, “ohayou” (good morning), detection signal versus time]
`
Sheet 3 of 4

[FIG. 3A: input sound signal waveform versus time]
[FIG. 3B: time sequence of power signal of input sound signal versus time]
[FIGS. 5A and 5B: stylized input sound waveforms versus time, with threshold level marked]
`
Sheet 4 of 4

[FIG. 4: flow chart. Legible box labels: START; Is input sound signal equal to or greater than threshold present?; Puts CPU to sleep; Wakes up CPU; Analyzes sound signal; Detects phrase; Device in sleep mode?; Is it a keyword?; Sets flag to active mode; Is there sleep mode request?; Outputs response; Speech comprehension interaction control; Is input sound signal finished for single interaction?; Sets flag to sleep mode; Sets device in sleep mode.]
`
VOICE-ACTIVATED INTERACTIVE SPEECH RECOGNITION DEVICE AND METHOD

CROSS REFERENCE TO RELATED APPLICATIONS

This Application is related to copending application Ser. No. 08/700,175 filed on the same date of the present application, Attorney’s Docket number P2504a, entitled “A Cartridge Based Interactive Voice Recognition Method and Apparatus”, copending application Ser. No. 08/669,874, filed on the same date of the present application, Attorney’s Docket number P2505a, entitled “A Speech Recognition Device and Processing Method”, all commonly assigned with the present invention to the Seiko Epson Corporation of Tokyo, Japan. This application is also related to the following copending applications: application Ser. No. 08/078,027, filed Jun. 18, 1993, entitled “Speech Recognition System”; application Ser. No. 08/102,859, filed Aug. 6, 1993, entitled “Speech Recognition Apparatus”; application Ser. No. 08/485,134, filed Jun. 7, 1995, entitled “Speech Recognition Apparatus Using Neural Network and Learning Method Therefore”; and application Ser. No. 08/536,550, filed Sep. 29, 1996, entitled “Interactive Voice Recognition Method and Apparatus Using Affirmative/Negative Content Discrimination”; again all commonly assigned with the present invention to the Seiko Epson Corporation of Tokyo, Japan.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The invention generally relates to interactive speech recognition instruments which recognize speech and produce an audible response or specified action based on developed recognition results, and is particularly concerned with voice-based activation of such instruments.

2. Description of the Related Art

Speech recognition devices can be generally classified into two types. The first type is the specific-speaker speech recognition device that can only recognize the speech of a specific speaker, and the second general type is the non-specific speaker speech recognition device that can recognize the speech of non-specific speakers.
In the case of a specific speaker speech recognition device, a specific speaker first registers his or her speech signal patterns as reference templates by entering recognizable words one at a time according to a specified interactive procedure. After registration, when the speaker issues one of the registered words, speech recognition is performed by comparing the feature pattern of the entered word to the registered speech templates. One example of this kind of interactive speech recognition device is a speech recognition toy. The child who uses the toy pre-registers, for example, about 10 phrases such as “Good morning,” “Good night” and “Good day,” as multiple speech instructions. In practice, when the speaker says “Good morning,” his speech signal is compared to the speech signal of the registered “Good morning.” If there is a match between the two speech signals, an electrical signal corresponding to the speech instruction is generated, which then makes the toy perform a specified action.
As the name implies, of course, this type of specific speaker speech recognition device can recognize only the speech of a specific speaker or speech possessing a similar pattern. Furthermore, since the phrases to be recognized must be registered one at a time as part of device initialization, using such a device is quite cumbersome and complex.
`
By contrast, a non-specific speaker speech recognition device creates standard speech feature patterns of the recognition target phrases described above, using “canned” speech exemplars spoken by a large number (e.g., around 200) of speakers. Phrases spoken by a non-specific speaker/user are then compared to these pre-registered recognizable phrases for recognition.
However, such speech recognition devices usually become ready to perform recognition operations and responses only when an external switch is turned on or external power delivered to the device is turned on, regardless of whether the device uses specific or non-specific speaker recognition. But, in some types of speech recognition devices, it would be more convenient if the device were in a standby state waiting for speech input at all times, and performed recognition operations by sensing speech input, without the need for the user to turn on the switch every time.
Take a stuffed toy utilizing speech recognition, for example. If the toy can be kept in a speech input standby state, i.e., a so-called sleep mode, and can instantly respond when the child calls out its name, it can respond quickly without the need for plugging the device in or pressing a button, thereby greatly enhancing its appeal as a user-friendly device, especially for younger children where applying external power may raise safety concerns. In addition to toys, the same can be said of all electronic instruments that utilize speech recognition.
Some issues must be resolved when keeping the device in a sleep mode and having it perform recognition operations by sensing speech input, as explained above. These include, for example, power consumption and the ability of the device to differentiate between phrases to be recognized and noise, and to act only in response to phrases to be recognized. In particular, since most toys run on batteries, minimizing battery drain is a major issue. Additionally, product prices must also be kept low to maintain commercial appeal for such devices, so using expensive, conventional activation circuitry is undesirable. So, heretofore, there have been a large number of technical restrictions on commercializing interactive speech devices which also feature voice activation.
`
OBJECTS OF THE INVENTION

It is, therefore, an object of the present invention to enable the device to remain in a sleep mode and perform recognition operations only when a recognizable speech input is detected, to minimize power consumption during the sleep mode, to enable the speech to be recognized at high accuracy even if noise is present in the usage environment, and to enhance commercial appeal of the device by retaining low cost over conventional designs.
SUMMARY OF THE INVENTION

In accordance with this and related objects, a voice activated interactive speech mechanism according to the present invention includes: 1) a sound signal input unit for receiving ambient signals projected to the mechanism; 2) a sound signal analyzer in communication with the sound signal input unit for analyzing sounds perceived by the sound signal input unit and generating voice feature parameters responsive to these analyzed sounds; 3) a phrase detector in communication with the sound signal analyzer for receiving generated voice feature parameters for the perceived sounds, comparing the received data against a set of speech reference templates in an effort to find a match, and generating phrase detection data should a recognizable phrase be found; 4) a speech recognition interactive controller in communication with the phrase detector which receives generated phrase detection data, understands the meaning of the input speech, determines appropriate response content, and performs various controls on the mechanism based on the interpreted speech; 5) a speech synthesizer in communication with the interaction controller for generating synthesized speech output based on the determined response; 6) a speech output unit in communication with the interaction controller and the speech synthesizer for broadcasting the generated synthesized speech; and 7) an input sound signal power detector in communication with at least the sound signal input unit and the interaction controller for detecting the volume, magnitude or amplitude of input sound signals based on sound signal waveforms perceived by the sound signal input unit or capture device. Preferably, this power detector includes processing circuitry for forcing the mechanism to selectively enter or terminate a low-power sleep mode. Moreover, preferably, during this sleep mode, either the interaction controller or the input sound signal power detector itself determines whether input sound signals, as detected by the input sound signal power detector, are at least at a predetermined threshold volume level above the background noise. If so, a determination is then made whether or not the threshold filtered input sound signal corresponds to a recognizable phrase, and, if so, the device is shifted from the sleep mode into the active mode.
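The two-stage wake decision described above (a threshold test first, a phrase test second) can be sketched as follows. This is a minimal illustration only, not the patented circuitry: the function name `next_mode`, the string mode labels, and the `recognize` callback are all hypothetical, and power is treated as a plain number in arbitrary units.

```python
from typing import Callable, Optional


def next_mode(mode: str, power: float, threshold: float,
              recognize: Callable[[], Optional[str]]) -> str:
    """Return the device mode ("sleep" or "active") after one input sound.

    `recognize` is a hypothetical callback that returns the recognized
    phrase, or None when the sound is not a recognizable phrase.
    """
    if mode != "sleep":
        return mode                  # wake logic only applies in sleep mode
    if power < threshold:
        return "sleep"               # too quiet: recognition is never attempted
    phrase = recognize()             # run only once the threshold is met
    return "active" if phrase is not None else "sleep"
```

Passing the recognizer as a callback keeps the costly phrase test from running for sounds below the threshold, which mirrors the ordering of the two determinations in the text.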
Also, a hardware or software timer can be used to determine if a given perceived sound meets or exceeds the predetermined threshold power level for a specified duration of time. If a given perceived sound signal that is higher in level than this threshold is continuously present for at least this specified duration of time and if the input sound signal is determined not to be a recognizable phrase, the input sound signal is determined to be background noise present in the environment. This aids proper voice activation in a noisy ambient environment. Moreover, the threshold power level may be updated in real time to account for detection of this background noise.
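As a rough software analogue of the timer check, the sketch below counts consecutive power readings at or above the threshold and labels a sustained, unrecognized sound as background noise. The frame-based representation, the function names, and the `min_frames` parameter are assumptions made for illustration; the text does not specify how the timer is implemented.

```python
from typing import Sequence


def sustained_above(frame_powers: Sequence[float], threshold: float,
                    min_frames: int) -> bool:
    """True if at least `min_frames` consecutive readings meet the threshold."""
    run = 0
    for p in frame_powers:
        run = run + 1 if p >= threshold else 0   # reset on any quiet frame
        if run >= min_frames:
            return True
    return False


def is_background_noise(frame_powers: Sequence[float], threshold: float,
                        min_frames: int, recognized: bool) -> bool:
    """A sustained sound that is not a recognizable phrase is treated as noise."""
    return sustained_above(frame_powers, threshold, min_frames) and not recognized
```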
Furthermore, the sound signal power level detector, according to the present invention, may be used as an ambient noise feedback device to enable the speech recognition mechanism to take into account perceived noise levels in formulating the volume of response message and other audible functions. In so doing, the mechanism may set an initial threshold for eliminating noise, and perform power detection for a specified duration of time using this threshold as the reference. Specifically, 1) if an input sound signal that is higher in level than the current threshold is perceived for at least a specified duration of time, and, 2) if the input sound signal is determined not to be a recognizable phrase, the input sound signal is judged by the mechanism to be ambient background noise. At the same time the threshold is updated to a value greater than the perceived background noise level.
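The threshold update can be illustrated with a small sketch. The text only requires that the new threshold be greater than the perceived noise level; taking the loudest noise reading and adding a fixed `headroom` is one assumed way to satisfy that, and both the function name and the default value are hypothetical.

```python
from typing import Sequence


def raised_threshold(noise_powers: Sequence[float], headroom: float = 0.1) -> float:
    """Return a threshold strictly above every perceived noise reading.

    `headroom` is an assumed positive margin in the same arbitrary units
    as the power readings.
    """
    if not noise_powers:
        raise ValueError("need at least one noise reading")
    if headroom <= 0:
        raise ValueError("headroom must be positive")
    return max(noise_powers) + headroom
```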
Also, the sound signal power level detector according to the present invention may be used by the mechanism to generate an audible response having a volume level corresponding to the perceived power levels of the input sound signal.
Voice activated interactive speech processing according to the present invention includes: 1) sound signal capture for receiving ambient sound signals projected to a receiving device; 2) sound signal analysis for analyzing these sounds and generating voice feature parameters responsive thereto; 3) phrase detection for comparing generated feature parameter data for the perceived sounds against a set of speech reference templates in an effort to find a match, and issuing phrase detection data should a recognizable phrase be found; 4) overall speech recognition interactive control for receiving generated phrase detection data, comprehending the meaning of the input speech, determining appropriate response content, and performing varied tasks responsive to the interpreted speech; 5) speech synthesis for generating synthesized speech output based on the determined response; 6) speech reproduction for broadcasting the generated synthesized speech; and 7) input sound signal power detection for detecting the power, magnitude or amplitude of input sound signals based on perceived sound signal waveforms. Preferably, such speech processing includes the ability to force a speech recognition device to selectively enter or terminate a low-power sleep mode. Moreover, preferably, during this sleep mode, a determination is made whether input sound signals, as detected during input sound signal capture, are at least at a predetermined threshold signal level above the background noise. If so, a determination is then made whether or not the threshold filtered input sound signal corresponds to a recognizable phrase, and, if so, the device is shifted from the sleep mode into the active mode.
Also, hardware or software timer processing can be incorporated in speech processing according to the present invention to determine if a given perceived sound meets or exceeds the predetermined threshold power level for a specified duration of time. If a given perceived sound signal that is higher in level than this threshold is continuously present for at least this specified duration of time, and, if the input sound signal is determined not to be a recognizable phrase, the input sound signal is judged to be steady noise present in the environment. This aids proper voice activation in a noisy ambient environment. Moreover, the threshold power level may be updated in real time to account for detection of this steady noise.
Furthermore, according to the present invention, input signal power detection may be used for ambient noise feedback purposes to enable a speech recognition mechanism to take into account perceived noise levels in formulating the volume of response message and other audible functions. In so doing, the mechanism may set an initial threshold for eliminating noise, and perform power detection for a specified duration of time using this threshold as the reference. Specifically, for an input sound signal that is higher in level than the threshold which is continuously present for a specified duration of time, if the input sound signal is determined not to be a recognizable phrase, the input sound signal is judged by the mechanism to be steady noise present in the environment. Also, the threshold is concurrently updated to a value that is greater than the steady noise level.
Also, sound signal power level detection according to the present invention may incorporate generating an audible response having a volume level corresponding to the perceived power levels of the input sound signal.
As explained hereinabove, when a speech recognition device according to the present invention is in the sleep mode based on a sleep mode request, it determines whether or not the volume level or power level of a perceived input sound signal at least matches a predetermined threshold volume level and also whether or not the input sound signal constitutes a recognizable phrase. If both of these conditions are satisfied, the device shifts from the low-power sleep mode into the active mode. Otherwise (i.e., a low-level sound or high-level noise situation), the sleep state is maintained. As a result, only phrases to be recognized are processed for recognition while reducing deleterious noise effects. Furthermore, when the device is in the sleep mode, only those portions of the device that consume small amounts of power, such as the sound signal input unit and the input sound signal power detector area, need be active, thereby keeping the total power consumption at a relatively low level (e.g., power-consuming speech synthesis and reproduction circuitry may be powered down at this time) in comparison with conventional speech recognition devices.
According to the present invention, if an input sound signal that is higher in level than the threshold is continuously present for a specified duration of time and is determined not to be a recognizable phrase, it is judged to be steady noise present in the environment. In this way, relatively loud ambient noise sounds continuously present for an extended duration in the environment can be considered extraneous and accounted for, and thus the effects of steady noise present in the environment can be reduced. Thus, according to the present invention, voice activation operations can be responsive to a changing noise environment, as would be the case when carrying a speech recognition toy from a quiet bedroom into the cabin of a family vehicle.
The noise level in the environment is preferably judged based on the power signal from the input sound signal power detector, and a response is output at a voice level that corresponds to the noise level. Therefore, the response can be output at a high voice level if the noise level in the environment is high, making the response easier to hear even when some noise is present in the environment. Of course, when the ambient environment becomes quiet, speech processing according to the present invention permits attenuation of the threshold and increased device responsiveness to external sounds.
Additionally, since the threshold is updated to a value that is greater than the steady noise level, and the noise level at a certain point in time is judged based on the magnitude of the threshold at that point in time, the index of the noise level can be obtained based on the threshold, making it simple to determine the current noise level according to the present invention. Furthermore, even if the noise level changes, the response can be generated at a voice level that corresponds to the noise level, making it possible to output the response at a voice level that better suits the noise in the environment.
Furthermore, according to the present invention, the response may be output at a voice level that corresponds to the power of the input sound signal. Therefore, if the speaker’s voice is loud, the response will also be loud; and if the speaker’s voice is soft, the response will also be soft, enabling interactive conversations at a volume level appropriate for the situation.
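One way to picture a response level that tracks the input power is a clamped linear mapping, sketched below. The linear shape, the function name `response_volume`, and the `min_volume`, `max_volume`, and `full_scale` parameters are assumptions for illustration; the text only states that a louder input yields a louder response.

```python
def response_volume(input_power: float, min_volume: float = 0.2,
                    max_volume: float = 1.0, full_scale: float = 1.0) -> float:
    """Map perceived input power to an output volume (assumed linear, clamped)."""
    ratio = input_power / full_scale
    ratio = min(max(ratio, 0.0), 1.0)        # clamp to the 0..1 range
    return min_volume + ratio * (max_volume - min_volume)
```

The same mapping could take the current threshold as its input instead of the speaker's power, in line with the use of the threshold as an index of the noise level.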
Other objects and attainments together with a fuller understanding of the invention will become apparent and appreciated by referring to the following description of specific, preferred embodiments and appended claims, taken in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings, wherein like reference symbols refer to like parts:
FIG. 1 is a block diagram for explaining the first embodiment of the present invention;
FIGS. 2A-2E diagrammatically illustrate a sample input voice waveform and resultant partial word lattice for explaining phrase detection by the phrase detector and speech recognition by the speech comprehension interaction controller according to the present invention;
FIGS. 3A and 3B diagrammatically illustrate a simplified input sound signal waveform and its corresponding power signal;
FIG. 4 is a flow chart for explaining voice activation processing according to the first embodiment; and
FIGS. 5A and 5B diagrammatically illustrate a stylized input sound waveform in which both the threshold and response output levels are set according to the noise level according to the present invention.
`
DESCRIPTION OF THE PREFERRED EMBODIMENTS

The preferred embodiments of the invention are explained hereinbelow with reference to specific figures, where appropriate. Note that the invention has been applied to a toy in these embodiments, and more particularly to a stuffed toy dog, intended for small children. Furthermore, the embodiments are explained for the case in which the teachings of the present invention are applied to a non-specific speaker speech recognition device that can recognize the speech of non-specific speakers.
FIG. 1 is a configuration diagram of the first embodiment of the present invention. FIG. 1 schematically shows sound signal capture unit 1, sound signal analyzer 2, phrase detector 3, speech reference templates memory 4, speech comprehension interaction controller 5, response data memory 6, speech synthesizer 7, speech output unit 8 and input sound signal power detector 9. Note that, of these configuration elements, sound signal analyzer 2, phrase detector 3, speech reference templates memory 4, speech comprehension interaction controller 5, response data memory 6, and speech synthesizer 7 are contained inside the belly of the stuffed toy dog (wherein analyzer 2, phrase detector 3, controller 5 and synthesizer 7 are constituent members of integrated CPU 10), and sound signal capture unit (here a microphone) 1 and speech output unit (here a speaker) 8 are installed in the ear and the mouth, respectively, of the stuffed toy, for example. The functions of these elements are explained in sequence below.
`
Sound signals (including noise), such as a speaker’s voice, are input into the sound signal capture unit comprising a conventional microphone, an amplifier, a lowpass filter, an A/D converter, etc., which are not illustrated here for the sake of simplicity, since their structure does not particularly impact the teachings of the present invention. The sound signal input from the microphone is first passed through the amplifier and the lowpass filter and converted into an appropriate sound waveform. This waveform is converted into a digital signal (e.g., 12 kHz, 16 bits) by the A/D converter, which is then sent to sound signal analyzer 2. Sound signal analyzer 2 uses a programmed CPU to analyze at short intervals the frequency of the waveform signal sent from sound signal capture unit 1, then extracts the multi-dimensional speech feature vector that expresses the frequency characteristics (LPC-CEPSTRUM coefficients are normally used) thereof, and generates the time series (hereafter referred to as “feature vector array”) of this feature vector for subsequent matching and recognition operations.
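Analysis “at short intervals” implies cutting the digitized signal into short frames before a feature vector is computed for each one. The sketch below shows only that framing step for a 12 kHz signal; the 20 ms frame length and the non-overlapping layout are assumptions, and the LPC-cepstrum computation itself is deliberately left out.

```python
from typing import List, Sequence


def split_frames(samples: Sequence[int], sample_rate: int = 12_000,
                 frame_ms: int = 20) -> List[Sequence[int]]:
    """Split samples into consecutive, non-overlapping frames.

    A trailing partial frame is dropped. The 20 ms default is an assumed
    value, not one taken from the text.
    """
    frame_len = sample_rate * frame_ms // 1000      # 240 samples at 12 kHz
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]
```

Each returned frame would then be turned into one feature vector, and the sequence of those vectors forms the feature vector array mentioned above.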
The speech reference templates memory 4 preferably comprises a ROM device that stores (registers) voice vector reference templates of the recognition target phrases, created in advance using the speech issued for each word by a large number of typical speakers chosen according to the contemplated uses of the speech recognition device. Here, since a stuffed toy is used for the example, about 10 phrases used for greeting, such as “Good morning,” “Good night,” “Good day,” “tomorrow,” and “weather,” for example, will be used as the recognition targets. However, recognition target phrases are not limited to these particular phrases, and a wide variety of phrases can be registered, as will be apparent to those ordinarily skilled in the art. Furthermore, the number of phrases that can be registered certainly need not be limited to 10, and is dependent only on the size of addressable memory 4 utilized.
Also, although not shown in FIG. 1, phrase detector 3 comprises a general or special-purpose processor (CPU) and a ROM device storing the processing program, and determines if, and to what degree of certainty, target phrases registered in reference templates memory 4 may be present in the input voice. Hidden Markov Model (HMM) or DP matching can be used by phrase detector 3, as is well known in the art of word-spotting processing technology. However, in these embodiments, keyword-spotting processing technology using the dynamic recurrent neural network (DRNN) method is preferably used as disclosed in U.S. application Ser. No. 08/078,027, filed Jun. 18, 1993, entitled “Speech Recognition System”, commonly assigned with the present invention to Seiko-Epson Corporation of Tokyo, Japan, which is incorporated fully herein by reference. Also, this method is disclosed in the counterpart laid open Japanese applications H6-4097 and H6-119476. DRNN is used in order to perform voice recognition of virtually continuous speech by nonspecific speakers and to output word detection data as described herein.
`
`The following is a brief explanation of the specific
`processing performed by phrase detector 3 with reference to
`FIGS. 2A—2E. Phase detector 3 determines the confidence
`
`level at which a word registered in speech reference tem-
`plates memory 4 occurs at a specific location in the input
`voice. Now, suppose that the speaker inputs an example
`Japanese language phrase “asu no tenki wa.
`.
`. ” meaning
`“Concerning tomorrow’s weather”. Assume that in this case
`the stylized voice signal shown in FIG. 2A represents the
`audio waveform for this expression.
`. ”, the contextual
`.
`In the expression “asu no tenki wa .
`keywords or target phrases include “asu” (tomorrow) and
`“tenki” (weather). These are stored in the form of vector
`series reference templates in speech reference templates
`memory 4 as parts of the a predetermined word registry,
`which in this case, represents approximately 10 target dis-
`tinct words or phrases. If 10 phrases are registered, signals
`are output by the phrase detector 3 in order to detect phrases
`corresponding to these 10 phrases (designated phrase 1,
`phrase 2, phrase 3 .
`.
`. up to phrase 10). From the information
`such as detected signal values, the phrase detector deter-
`mines the confidence level at which the corresponding
`words occur in the input voice.
`More specifically, if the word “tenki” (weather) occurs in
`the input voice as phrase 1, the detection subunit that is
`waiting for the signal “tenki” (weather) initiates an analog
`signal which rises at the portion “tenki” in the input voice,
`as shown in FIG. 2B. Similarly,
`if the word “asu”
`(tomorrow) occurs in the input voice as word 2, the detection
`subunit that is waiting for the signal “asu” rises at the portion
`“asu” in the input voice, as shown in FIG. 2C.
`In FIGS. 2B and 2C, the numerical values 0.9 and 0.8
`indicate respective confidence levels that the spoken voice
`contains the particular pre-registered keyword . The relative
`level or magnitude of this level can fluctuate between ~0 and
`1.0, with 0 indicating a zero confidence match factor and 1.0
`representing a 100% confidence match factor. In the case of
`a high confidence level, such as 0.9 or 0.8, the registered or
`target phrase having a high confidence level can be consid-
`ered to be a recognition candidate relative to the input voice.
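
The confidence levels of FIGS. 2B and 2C can be illustrated with a minimal sketch. The `Detection` record, the `candidates` helper, the 0.5 cut-off, and the low-confidence third entry are all assumptions made for illustration; the description calls 0.9 and 0.8 high confidence levels but does not state a numeric threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    phrase: str        # registered word or phrase
    confidence: float  # 0.0 (zero confidence) to 1.0 (100% confidence)


def candidates(detections, threshold=0.5):
    """Return detections whose confidence meets a hypothetical threshold."""
    for d in detections:
        if not 0.0 <= d.confidence <= 1.0:
            raise ValueError(f"confidence out of range for {d.phrase!r}")
    return [d for d in detections if d.confidence >= threshold]


detections = [
    Detection("tenki", 0.9),         # phrase 1, FIG. 2B
    Detection("asu", 0.8),           # phrase 2, FIG. 2C
    Detection("placeholder", 0.1),   # hypothetical low-confidence phrase
]

recognition_candidates = [d.phrase for d in candidates(detections)]
print(recognition_candidates)
```

Under these assumptions only "tenki" and "asu" are kept as recognition candidates; a different threshold would change which phrases pass.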
`Thus, the registered word “asu” occurs with a confidence
`level of 0.8 at position w1 on the time axis. Similarly, the
`registered word or phrase “tenki” occurs with a confidence
`level of 0.9 at position w2 on the time axis.
`Also, the example of FIGS. 2A-2E shows that, when the
`word “tenki” (weather) is input, the signal that is waiting for
`phrase 3 (phrase 3 is assumed to be the registered word
`“nanji” (“What time . . . ”)) also rises at position w2 on the
`time axis with a relatively uncertain confidence level of
`approximately 0.6. Thus, if two or more registered phrases exist as
`recognition candidates at the same time relative to an input
`voice signal, the recognition candidate word is determined
`by one of two methods: either by 1) selecting the potential
`recognition candidate having the highest degree of similarity
`to the input voice (using absolute confidence level
`comparisons) as the actually recognized keyword or phrase;
`or by 2) selecting one of the words as the
`recognized word based on a predefined correlation table
`expressing contextual rules between words. In this case, the
`confidence level for “tenki” (weather) indicates that it has
`the highest degree of similarity to the input voice during
`time portion w2 on the time axis, even though “nanji” can
`also be recognized as a potential recognition candidate.
`Therefore, “tenki” is selected as the actual recognition
`candidate for this example. Based on these confidence
`levels, interaction controller 5 performs the recognition o
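
The two selection methods described above for simultaneous candidates at position w2 can be sketched as follows. The confidence values come from the example of FIGS. 2A-2E; the function names and the contents and form of the `CORRELATION` table are hypothetical, since the description only says that such a table expresses contextual rules between words.

```python
from typing import Optional

# (phrase, confidence) pairs competing at position w2 in the example.
AT_W2 = [("tenki", 0.9), ("nanji", 0.6)]

# Hypothetical correlation table: (preceding keyword, candidate) pairs that
# are assumed to be contextually plausible. Not taken from the source.
CORRELATION = {("asu", "tenki")}


def select_by_confidence(cands):
    """Method 1: return the phrase with the highest absolute confidence."""
    return max(cands, key=lambda pc: pc[1])[0]


def select_by_context(previous, cands, table) -> Optional[str]:
    """Method 2 (assumed form): keep only candidates the table allows after
    `previous`, then take the most confident one; None if none is allowed."""
    allowed = [pc for pc in cands if (previous, pc[0]) in table]
    return select_by_confidence(allowed) if allowed else None


print(select_by_confidence(AT_W2))
print(select_by_context("asu", AT_W2, CORRELATION))
```

With these inputs both methods select "tenki", matching the outcome described in the text; with a different table the second method could select another candidate or none.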
