A SINGLE-CHIP SPEECH DIALOGUE MODULE
AND ITS EVALUATION ON A PERSONAL ROBOT, PAPERO-MINI

Miki Sato, Toru Iwasawa, Akihiko Sugiyama, Toshihiro Nishizawa, Yosuke Takano

NEC Common Platform Software Research Laboratories
Kawasaki 211-8666, JAPAN
ABSTRACT

This paper presents a single-chip speech dialogue module and its evaluation on a personal robot. The module is implemented on an application processor that was developed primarily for mobile phones, to provide a compact size, low power consumption, and low cost. It performs speech recognition with preprocessing functions such as direction-of-arrival (DOA) estimation, noise cancellation, beamforming with a microphone array, and echo cancellation. Text-to-speech (TTS) conversion is also provided. Evaluation results obtained with a new personal robot, PaPeRo-mini, a scaled-down version of PaPeRo, demonstrate an 85% correct rate in DOA estimation, and as much as 54% and 30% higher speech recognition rates in noisy environments and during robot utterances, respectively. These results are shown to be comparable to those obtained by PaPeRo.
Index Terms— speech recognition, DOA estimation, noise cancellation, microphone array, echo cancellation, speech dialogue module

1. INTRODUCTION
Speech dialogue systems have been receiving particular attention as a user interface for a wide variety of interactive applications, such as robots and car navigation systems. These applications are generally controlled by voice commands given from a distance. A given command is processed by a speech recognition system to generate a corresponding operation. It is also necessary to transform text information into an audible form by using a text-to-speech (TTS) conversion system. However, it is still challenging to perform off-microphone speech recognition, where the microphone is placed at a distance from the talker [1]. The target signal suffers serious interference from other signals and ambient noise in noisy environments. Therefore, noise robustness is essential for speech recognition systems in real environments.

To reduce the undesirable influence of ambient noise and interference, signal-processing functions have been used to preprocess the noisy speech. Among these functions are estimation of the direction of arrival (DOA) [2, 3], noise cancellation [4], beamforming with a microphone array [5], and echo cancellation [6]. DOA estimation identifies the direction of the voice command so that the microphone directivity can be steered towards the speech source. An adaptive noise canceller (ANC) and a microphone array (MA) reduce the undesirable influence that cannot be sufficiently suppressed by the directional microphone. An acoustic echo canceller (AEC) suppresses the echo, i.e., the part of the robot's own speech that leaks into the microphone signal and contaminates the voice command.

In robot applications, these functions are generally implemented in software on a platform based on a personal computer (PC) [7].
Fig. 1. Block diagram of the Speech Dialogue Module: multichannel audio input from the microphones feeds DOA estimation and noise reduction (ANC, MA, and AEC), whose outputs are passed to speech recognition; TTS conversion drives speech synthesis to the loudspeaker; an RT Component Translator connects these functions to other applications through the network interface.
It is sometimes necessary to share computational power with other applications on the same platform. Considering that an increasing number of complex applications are required on a robot, it is desirable to have the speech dialogue module on a separate platform so that the PC-based platform can be fully devoted to more complex and computationally intensive applications on the robot. On such a separate module, the performance of the speech dialogue functions becomes more stable and can be guaranteed. In addition, thanks to its portability, a compact speech dialogue module helps bring human-robot interaction to robots that would otherwise not have such an interface.

This paper presents a compact speech dialogue module and its evaluation on a personal robot. The module offers dialogue functions similar to those of the personal robot PaPeRo [8] on an application processor that was developed primarily for mobile phones, to provide a compact size, low power consumption, and low cost. In the following section, the functions of the speech dialogue module are described together with the hardware for their implementation. Section 3 presents evaluation results of a near-field DOA estimator, a noise-robust ANC with variable stepsizes, an adaptive beamformer for the MA, and a noise-robust AEC in real environments.
2. SPEECH DIALOGUE MODULE

2.1. Speech Dialogue Functions

A block diagram of the speech dialogue module is illustrated in Fig. 1. The module consists of speech recognition (word recognition), DOA estimation, noise reduction, and TTS conversion as speech dialogue functions. Noise reduction has three subfunctions, namely an adaptive noise canceller (ANC), a microphone-array (MA) beamformer, and an acoustic echo canceller (AEC). They operate separately, and the desired output is selected manually. These functions work as RT (Robot Technology) components for RT middleware [9], and can be controlled by other network-connected applications.
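As a rough sketch of this structure, the three noise-reduction subfunctions can be modelled as interchangeable processors whose output is chosen by a single switch before recognition; the class and method names below are hypothetical, not the module's actual API.

    from enum import Enum, auto

    class NoiseReductionMode(Enum):
        ANC = auto()   # adaptive noise canceller
        MA = auto()    # microphone-array beamformer
        AEC = auto()   # acoustic echo canceller

    def preprocess(frames, mode, anc, ma, aec):
        """Route multichannel input frames through the manually selected
        noise-reduction subfunction and return a single cleaned channel."""
        processors = {NoiseReductionMode.ANC: anc,
                      NoiseReductionMode.MA: ma,
                      NoiseReductionMode.AEC: aec}
        return processors[mode].process(frames)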
Fig. 2. Speech Dialogue Module (photograph).

Fig. 3. Task distribution between ARM and DSP cores: speech recognition, TTS conversion, the RT Component Translator, and the control blocks for DOA estimation and noise reduction run on the ARM, while the DOA estimation, ANC, MA, and AEC cores run on the DSP, exchanging multichannel input data through shared memory and a multi-ring buffer in internal memory.
Fig. 4. Computational load for each function (348 MIPS in total: AEC 142, MA 82, ANC 78, DOA estimation 46).
Table 1. Specifications of the Speech Dialogue Module

  Item             Specification
  CPU              ARM9 (192 MHz) × 3, DSP (SPXK6, 192 MHz) × 1
  Memory           128 MB (+128 MB w/ extended board), Flash 64 MB
  Audio interface  Microphone inputs 2 ch × 2 (+16 ch w/ audio board), speaker output 2 ch
  Image interface  Camera input × 2, video output, LCD output
  Other interface  USB, LAN, IrDA, GPIO, CF card (w/ extended board)
  Size             55 mm × 100 mm × 32 mm (w/ audio and extended boards)
2.2. Hardware

Figure 2 shows a picture of the speech dialogue module, whose specifications are listed in Table 1. An application processor, the MP211 [10], primarily designed for mobile phones, is employed to provide sufficient processing power. It consists of one DSP and three ARM9 cores and runs the Linux operating system. As the audio input interface, the module is equipped with synchronous microphone inputs, extensible to 16 channels, on an extended board, in addition to two types of 2-channel synchronous microphone inputs on the main board. The module also has peripheral interfaces such as 2-channel loudspeaker outputs, 2-channel camera inputs, an LCD output, and USB and LAN interfaces. A compact flash (CF) memory card can be used on an extended board.
2.3. Implementation of the functions

The functions of the module were distributed between an ARM9 core and a DSP core, both running at 192 MHz. The task distribution between the ARM9 and the DSP is depicted in Fig. 3. Speech recognition, TTS conversion, and the RT component translator operate on the ARM9. DOA estimation and noise reduction are decomposed into core functions and control blocks; the noise-reduction core consists of three subcores, namely the ANC core, the MA core, and the AEC core. The control blocks operate on the ARM core and the core-function blocks on the DSP core. The input signals are converted into digital form at a sampling rate of 11,025 Hz and saved in multi-ring buffers in the internal memory of the ARM. The computational load of each noise-reduction function is compared in Fig. 4.

Fig. 5. PaPeRo-mini (left) and PaPeRo (right).
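The multi-ring buffering can be pictured with the following minimal sketch; it only illustrates the idea under assumed parameters (four channels, roughly one second of audio per channel at 11,025 Hz) and is not the module's actual code.

    import numpy as np

    FS = 11025  # sampling rate used by the module [Hz]

    class MultiRingBuffer:
        """One ring buffer per input channel; a writer appends blocks of samples,
        readers fetch the most recent frame for DOA estimation or noise reduction."""

        def __init__(self, channels=4, capacity=FS):  # ~1 s of audio per channel
            self.buf = np.zeros((channels, capacity), dtype=np.int16)
            self.capacity = capacity
            self.write_pos = 0

        def push(self, block):
            """block: int16 array of shape (channels, n)."""
            n = block.shape[1]
            idx = (self.write_pos + np.arange(n)) % self.capacity
            self.buf[:, idx] = block
            self.write_pos = (self.write_pos + n) % self.capacity

        def latest(self, n):
            """Return the most recent n samples of every channel, oldest first."""
            idx = (self.write_pos - n + np.arange(n)) % self.capacity
            return self.buf[:, idx]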
3. EVALUATION

3.1. Platform: PaPeRo-mini

The speech dialogue module was installed in PaPeRo-mini, a scaled-down version of PaPeRo, a partner robot based on a Windows PC. Figure 5 depicts PaPeRo and PaPeRo-mini, whose specifications are compared in Table 2. PaPeRo-mini is an autonomous mobile robot with a size of 250 × 170 × 179 mm (HWD) and a weight of 2.5 kg. Eight equispaced omnidirectional microphones are mounted around the neck, and 2-channel loudspeakers are mounted at the bottom. It also has CCD cameras, ultrasonic sensors, infrared sensors, touch sensors, a pyroelectric sensor, and an LCD.

Table 2. Specifications of PaPeRo-mini and PaPeRo

  Item          PaPeRo-mini                      PaPeRo
  CPU           MP211                            Pentium-M 1.6 GHz
  OS            Linux                            Windows XP
  Audio input   Omnidirectional mic × 8          Omnidirectional mic × 7, directional mic × 1
  Audio output  Stereo loudspeakers,             Stereo loudspeakers,
                line output × 2                  line output × 2
  Image input   Stereo CCD camera                Stereo CCD camera
  Image output  Composite video, LCD             Composite video, RGB
  Other I/F     IrDA, USB                        Remote control, USB
  Battery       Li-ion 74 Wh, operating time 8 h Li-ion 60 Wh, operating time 2 h
  Size          250 × 170 × 179 mm               385 × 248 × 245 mm
  Weight        2.5 kg                           5.0 kg
3.2. DOA (direction of arrival) Estimation [3]

Figure 6 (a) depicts the evaluation environment, a room with a background noise level of 40 dBA. One sentence spoken by five different males and females was presented 10 times at 75 dB from a loudspeaker at a height of 1.0 m. PaPeRo-mini was placed 1.5 m away and rotated in steps of 30 degrees to create 12 different DOAs. The microphone arrangements of PaPeRo-mini and PaPeRo are illustrated in Fig. 7.
Fig. 6. Evaluation environment (3.5 m × 3.5 m room, ceiling height 2.4 m). (a) DOA, (b) ANC/MA/AEC.

Fig. 7. Microphone arrangement (top view, distances in cm). (a) PaPeRo-mini, (b) PaPeRo.
Fig. 8. DOA estimation result: detection and correct-answer rates [%] for DOAs from 0 to 330 degrees in 30-degree steps and their average, with the PaPeRo result for comparison.
Figure 8 depicts the evaluation result. Any DOA estimate other than those rejected for insufficient power, insufficient correlation, or inconsistent DOAs among different microphone pairs is counted as a detection. A correct answer allows a margin of ±15 degrees around the true DOA. For comparison, the evaluation result of PaPeRo is also depicted in Figure 8. The parameters for height adjustment [3] were set to d = 1.5 m and h = 1.0 m, assuming typical robot use at home. The correct-answer rate degrades slightly more at 60 degrees than at other angles. Nevertheless, the average correct-answer rate reaches 83%, which is comparable to that of PaPeRo.
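The near-field estimator of [3] is based on a signed binary code and height adjustment and is beyond a short sketch; the fragment below only illustrates the general pairwise idea, a far-field cross-correlation DOA estimate with the power and correlation rejection rules mentioned above, using assumed thresholds.

    import numpy as np

    C = 340.0   # speed of sound [m/s]
    FS = 11025  # sampling rate [Hz]

    def pair_doa(x1, x2, mic_distance, power_thresh=1e-4, corr_thresh=0.3):
        """DOA [deg] from one microphone pair, or None when the estimate is rejected."""
        if np.mean(x1.astype(float) ** 2) < power_thresh:   # insufficient power
            return None
        corr = np.correlate(x1, x2, mode="full")
        lag = int(np.argmax(np.abs(corr))) - (len(x2) - 1)
        peak = np.max(np.abs(corr)) / (np.linalg.norm(x1) * np.linalg.norm(x2) + 1e-12)
        if peak < corr_thresh:                              # insufficient correlation
            return None
        # Far-field model: inter-microphone delay = d * cos(theta) / c.
        cos_theta = np.clip(lag * C / (FS * mic_distance), -1.0, 1.0)
        return float(np.degrees(np.arccos(cos_theta)))

Estimates from several microphone pairs would then be checked for mutual consistency before a final DOA is reported, matching the rejection rule described above.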
3.3. ANC (adaptive noise canceller) [4]

Speech recognition was performed on noise-cancelled speech produced by the ANC. The evaluation environment, prepared in the same room as that for DOA estimation, is depicted in Figure 6 (b). 450 utterances by nine different males, females, and children were presented at a distance of 1.0 m, a height of 1.0 m, and a level of 70 dB. A loudspeaker playing a commercial TV program was placed 1.0 m away as the noise source, in a direction of 90, 135, or 180 degrees, at a level of 60-65 dB. A dictionary of 50 recognition and noise-rejection words [11] was used for speech recognition. For signal input, a front-side and a rear-side microphone among the eight around the neck of PaPeRo-mini were used as the primary and reference microphones.

Figure 9 demonstrates the speech recognition rate. For comparison, the average speech-recognition rate of PaPeRo [4] is also depicted in Figure 9; it was evaluated on 1500 utterances by 30 different males, females, and children at distances of 0.5 m and 1.5 m at a level of 70 dB, with the noise level set at 55-60 and 65-70 dB. The recognition rate with the ANC is improved by more than 40%, and the maximum improvement reaches 54% with noise arriving from behind the robot. When there is no noise, the recognition rate of PaPeRo-mini is more than 15% lower than that of PaPeRo. This comes from the microphones: PaPeRo uses a directional microphone, while PaPeRo-mini uses an omnidirectional microphone. Thanks to the ANC, however, the recognition rates in noisy environments are almost comparable.
Fig. 9. Speech recognition result (ANC): recognition rate [%] with the ANC on and off for PaPeRo-mini and PaPeRo, for noise arriving from 90, 135, and 180 degrees and for clean speech.
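As a rough illustration of this two-microphone structure (primary = front microphone, reference = rear microphone), the sketch below runs an NLMS filter with a crude variable stepsize; the actual variable-stepsize subfilter design of [4] is more elaborate, and the constants here are assumptions.

    import numpy as np

    def anc_nlms(primary, reference, taps=64, mu_max=0.5, eps=1e-6):
        """Return the noise-cancelled signal: primary minus the filtered reference."""
        w = np.zeros(taps)
        out = np.zeros(len(primary))
        for n in range(taps, len(primary)):
            x = reference[n - taps:n][::-1]     # reference-microphone history
            e = primary[n] - np.dot(w, x)       # error = cleaned sample
            out[n] = e
            norm = np.dot(x, x) + eps
            # Crude variable stepsize: adapt less when the error (likely target
            # speech) dominates, to limit distortion of the desired signal.
            mu = mu_max * norm / (norm + e * e)
            w += mu * e * x / norm
        return out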
3.4. MA (microphone array) [5]

For the MA, the evaluation conditions were the same as those for the ANC except for the noise directions: the noise source was placed in a direction of 30, 60, or 90 degrees. Four microphones arranged linearly with 0.02 m spacing were mounted on the front side of PaPeRo-mini. Figure 10 depicts the speech recognition rate in comparison with the average rate of PaPeRo under the same conditions as for PaPeRo-mini. Thanks to the MA, the recognition rate is improved by more than 20%, and the maximum improvement reaches 40% with noise arriving from the front of the robot. The recognition rate of PaPeRo-mini with the MA is comparable to that of PaPeRo.
Fig. 10. Speech recognition result (MA): recognition rate [%] with the MA on and off for PaPeRo-mini and PaPeRo, for noise arriving from 30, 60, and 90 degrees and for clean speech.
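The beamformer actually used is the robust adaptive beamformer of [5]; as a simpler point of reference, a fixed delay-and-sum stage for the four-microphone, 0.02 m pitch array could look like the sketch below (delays are rounded to whole samples, which is coarse at 11,025 Hz).

    import numpy as np

    C = 340.0       # speed of sound [m/s]
    FS = 11025      # sampling rate [Hz]
    SPACING = 0.02  # microphone pitch [m]

    def delay_and_sum(channels, steer_deg):
        """channels: array of shape (4, n); steer_deg: target direction from broadside."""
        n_mics, n = channels.shape
        out = np.zeros(n)
        for m in range(n_mics):
            # Delay each channel so that a wavefront from steer_deg adds coherently.
            tau = m * SPACING * np.sin(np.radians(steer_deg)) / C
            shift = int(round(tau * FS))
            out += np.roll(channels[m], -shift)
        return out / n_mics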
3.5. AEC (acoustic echo canceller) [6]

Speech recognition was performed on echo-cancelled speech produced by the AEC. The evaluation conditions were the same as those for the ANC and the MA except that there was no noise source. A music source was presented as the echo at 60-65 dB from a loudspeaker mounted on the bottom of PaPeRo-mini. The same microphone as the primary microphone for the ANC was used to capture the echo and the target speech. Figure 11 depicts the speech recognition rate in comparison with the PaPeRo data obtained in the same environment; the echo level for PaPeRo was 55-60 and 65-70 dB. Thanks to the AEC, the recognition rate is improved by 30% with sound radiated from the robot's loudspeaker. The speech recognition rate of PaPeRo-mini was equivalent to that of PaPeRo.
Fig. 11. Speech recognition result (AEC): recognition rate [%] with the AEC on and off for PaPeRo-mini and PaPeRo, with the robot loudspeaker on and off.
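A minimal sketch of the echo-cancellation idea is given below: an NLMS filter models the loudspeaker-to-microphone path, and adaptation is frozen when a simple normalized cross-correlation cue suggests double talk. The detector of [6] is more sophisticated (it uses a noise offset), so the cue and threshold here are assumptions.

    import numpy as np

    def aec_nlms(mic, loudspeaker, taps=256, mu=0.5, eps=1e-6, dt_thresh=0.6):
        """Return the echo-cancelled microphone signal."""
        w = np.zeros(taps)
        out = np.zeros(len(mic))
        for n in range(taps, len(mic)):
            x = loudspeaker[n - taps:n][::-1]   # far-end (robot speech) history
            e = mic[n] - np.dot(w, x)           # echo-cancelled sample
            out[n] = e
            # Double-talk cue: normalized cross-correlation between the microphone
            # block and the loudspeaker block; a low value suggests near-end speech,
            # so the filter is not adapted in that case.
            m_blk = mic[n - taps:n]
            l_blk = loudspeaker[n - taps:n]
            rho = abs(np.dot(m_blk, l_blk)) / (np.linalg.norm(m_blk) * np.linalg.norm(l_blk) + eps)
            if rho > dt_thresh:
                w += mu * e * x / (np.dot(x, x) + eps)
        return out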
4. CONCLUSION

A single-chip speech dialogue module and its evaluation on a personal robot have been presented. The module has been implemented on a single-chip application processor to provide a compact size, low power consumption, and portability. It is equipped with direction-of-arrival (DOA) estimation, adaptive noise cancellation (ANC), microphone-array (MA) beamforming, and acoustic echo cancellation (AEC) for speech recognition in noisy environments. Evaluation results obtained with PaPeRo-mini in real environments have demonstrated an 85% correct rate in DOA estimation, and as much as 54% and 30% higher speech recognition rates in noisy environments and during robot utterances, respectively.
5. ACKNOWLEDGMENT

This development was supported in part by a common platform development project for next-generation robots of NEDO (New Energy and Industrial Technology Development Organization).
6. REFERENCES
[1] H.-G. Hirsch and D. Pearce, “The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions,” Proc. ISCA ITRW ASR2000, Sep. 2000.

[2] Y. Kaneda, “Sound source localization; robot audition system from the signal processing point of view,” Proc. 22nd AI Challenges, Vol. 22, pp. 1-8, Oct. 2005 (in Japanese).

[3] M. Sato, A. Sugiyama, O. Hoshuyama, N. Yamashita, and Y. Fujita, “Near-field sound-source localization based on a signed binary code,” IEICE Trans. Fundamentals, Vol. E88-A, No. 8, pp. 2078-2086, Aug. 2005.

[4] M. Sato, A. Sugiyama, and S. Ohnaka, “An adaptive noise canceller with low signal-distortion based on variable stepsize subfilters for human-robot communication,” IEICE Trans. Fundamentals, Vol. E88-A, No. 8, pp. 2055-2061, Aug. 2005.

[5] O. Hoshuyama, A. Sugiyama, and A. Hirano, “A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters,” IEEE Trans. Signal Processing, Vol. 47, pp. 2677-2684, Oct. 1999.

[6] A. Sugiyama, J. Berclaz, and M. Sato, “Noise-robust double-talk detection based on normalized cross correlation and a noise offset,” Proc. ICASSP 2005, pp. 153-156, Mar. 2005.

[7] M. Sato, A. Sugiyama, O. Hoshuyama, N. Yamashita, S. Ohnaka, and Y. Fujita, “The voice interface of the personal robot PaPeRo,” J. Acoust. Soc. Japan, Vol. 62, No. 3, pp. 1-9, Mar. 2006 (in Japanese).

[8] Y. Fujita, “Personal robot PaPeRo,” J. Robotics and Mechatronics, Vol. 14, No. 1, pp. 60-63, Jan. 2002.

[9] N. Ando, T. Suehiro, K. Kitagaki, T. Kotoku, and W.-K. Yoon, “RT-middleware: distributed component middleware for RT (robot technology),” Proc. IROS 2005, pp. 3933-3938, Aug. 2005.

[10] S. Torii et al., “A 600MIPS 120mW 70μA leakage triple-CPU mobile application processor chip,” Proc. ISSCC 2005, 7.5, Feb. 2005.

[11] T. Iwasawa, “Speech recognition interface for the personal robot PaPeRo,” Proc. 13th AI Challenges, Vol. 13, pp. 17-23, Jun. 2001.