`
`I, Gordon MacPherson, am over twenty-one (21) years of age. I have never been
`convicted of a felony, and I am fully competent to make this declaration. I declare the following
`to be true to the best of my knowledge, information and belief:
`
1. I am Director Board Governance & IP Operations of The Institute of Electrical and
Electronics Engineers, Incorporated (“IEEE”).

2. IEEE is a neutral third party in this dispute.

3. I am not being compensated for this declaration and IEEE is only being reimbursed
for the cost of the article I am certifying.
`
`4. Among my responsibilities as Director Board Governance & IP Operations, I act as a
`custodian of certain records for IEEE.
`
5. I make this declaration based on my personal knowledge and information contained
in the business records of IEEE.
`
`6. As part of its ordinary course of business, IEEE publishes and makes available
`technical articles and standards. These publications are made available for public
`download through the IEEE digital library, IEEE Xplore.
`
7. It is the regular practice of IEEE to publish articles and other writings including
article abstracts and make them available to the public through IEEE Xplore. IEEE
maintains copies of publications in the ordinary course of its regularly conducted
activities.
`
`8. The article below has been attached as Exhibit A to this declaration:
`
`A. Miki Sato et al.; “A single-chip speech dialogue module and its evaluation
`on a personal robot, PaPeRo-mini”, 2009 IEEE International Conference
`on Acoustics, Speech and Signal Processing, April 19 – 24, 2009.
`
9. I obtained a copy of Exhibit A through IEEE Xplore, where it is maintained in the
ordinary course of IEEE’s business. Exhibit A is a true and correct copy of the
Exhibit, as it existed on or about December 29, 2021.
`
10. The article and abstract from IEEE Xplore show the date of publication. IEEE
Xplore populates this information using the metadata associated with the publication.
`
`445 Hoes Lane Piscataway, NJ 08854
`
`DocuSign Envelope ID: 7FCDEB04-9D8A-4D7A-9401-7811BCC4CFCA
`
`Page 1 of 17
`
`SONOS EXHIBIT 1049
`
`
`
`
`11. Miki Sato et al.; “A single-chip speech dialogue module and its evaluation on a
`personal robot, PaPeRo-mini” was published in the 2009 IEEE International
`Conference on Acoustics, Speech and Signal Processing. The 2009 IEEE
`International Conference on Acoustics, Speech and Signal Processing was held from
`April 19 – 24, 2009. Copies of the conference proceedings were made available no
`later than the last day of the conference. The article is currently available for public
`download from the IEEE digital library, IEEE Xplore.
`
`12. I hereby declare that all statements made herein of my own knowledge are true and
`that all statements made on information and belief are believed to be true, and further
`that these statements were made with the knowledge that willful false statements and
`the like are punishable by fine or imprisonment, or both, under 18 U.S.C. § 1001.
`
`I declare under penalty of perjury that the foregoing statements are true and correct.
`
`
`
`
Executed on: 1/6/2022
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`EXHIBIT A
`
`
`
`
`
`A single-chip speech dialogue module and its evaluation
`on a personal robot, PaPeRo-mini
`Publisher: IEEE
`
Miki Sato; Toru Iwasawa; Akihiko Sugiyama; Toshihiro Nishizawa; Yosuke Takano
`
Abstract:
This paper presents a single-chip speech dialogue module and its evaluation on
a personal robot. This module is implemented on an application processor that
was developed primarily for mobile phones to provide a compact size, low
power-consumption, and low cost. It performs speech recognition with
preprocessing functions such as direction-of-arrival (DOA) estimation, noise
cancellation, beamforming with an array of microphones, and echo cancellation.
The module is also equipped with text-to-speech (TTS) conversion. Evaluation
results obtained on a new personal robot, PaPeRo-mini, which is a scaled-down
version of PaPeRo, demonstrate an 85% correct rate in DOA estimation, and as much
as 54% and 30% higher speech recognition rates in noisy environments and
during robot utterances, respectively. These results are shown to be
comparable to those obtained by PaPeRo.
`
Published in: 2009 IEEE International Conference on Acoustics, Speech and
Signal Processing

Date of Conference: 19-24 April 2009

Date Added to IEEE Xplore: 26 May 2009

INSPEC Accession Number: 10701554

DOI: 10.1109/ICASSP.2009.4960429

Publisher: IEEE

Conference Location: Taipei, Taiwan
`
`SECTION 1.
`INTRODUCTION
`
Speech dialogue systems have been receiving particular
attention as a user interface for a wide variety of interactive
applications, such as robots and car navigation systems. These
applications are generally controlled by voice commands from a
distance. A given command is processed by a speech recognition
system to generate a corresponding operation. It is also necessary
to transform text information into an audible form by using a
text-to-speech (TTS) conversion system. However, it is still
challenging to perform off-microphone speech recognition,
where the microphone is placed at a distance from the talker [1].
The target signal is seriously interfered with by other signals and the
ambient noise in noisy environments. Therefore, noise
robustness is essential to speech recognition systems in the real
environment.

To reduce undesirable influence by the ambient noise and the
interference, signal-processing functions have been used for
preprocessing the noisy speech. Among these functions are
estimation of the direction of arrival (DOA) [2], [3], noise
cancellation [4], beam-forming with a microphone array [5],
and echo cancellation [6]. DOA estimation identifies the
direction of the voice command so that the microphone
directivity is steered towards the speech source. An adaptive
noise canceller (ANC) and a microphone array (MA) reduce
undesirable influence which cannot be sufficiently offset by the
directional microphone. An acoustic echo canceller (AEC)
suppresses an echo that is a part of robot speech leaking into the
microphone signal and contaminating the voice command.
`
`In robot applications, these functions are generally implemented
`by software on a platform based on a personal computer (PC) [7].
`It is sometimes necessary to share computational power with
`other applications on the same platform. Considering that a
`larger number of complex applications are required on a robot, it
`is desirable to have a speech dialogue module on a separate
`platform so that the PC-based platform can be fully devoted to
`
`more complex and computationally intense applications on the
`robot. On such a separate module, the performance of the speech
`dialogue functions becomes more stable and guaranteed. In
`addition, a compact speech dialogue module helps promote
`human-robot interactions, with its portability, on more robots
`that otherwise would not have such an interface.
`
`This paper presents a compact speech dialogue module and its
`evaluation on a personal robot. This module offers dialogue
`functions similar to a personal robot PaPeRo [8] on an
`application processor that was developed primarily for mobile
`phones to provide a compact size, low power-consumption, and
`low cost. In the following section, functions of the speech
`dialogue module are described with the hardware for their
`implementation. Section 3 presents evaluation results of a near-
`field DOA estimator, a noise-robust ANC with variable stepsizes,
`an adaptive beamformer for MA, and a noise-robust AEC in the
`real environment.
`
`SECTION 2.
`SPEECH DIALOGUE MODULE
`
`2.1. Speech Dialogue Functions
`A block diagram of the speech dialogue module is illustrated in
`Fig. 1. This module consists of speech recognition (word
`recognition), DOA estimation, noise reduction, and TTS
`conversion as speech dialogue functions. Noise reduction has
`three subfunctions, namely, an adaptive noise canceller (ANC), a
`microphone-array (MA) beamformer, and an acoustic echo
`canceller (AEC). They operate separately and a desired output is
`manually selected. These functions work as RT (Robot
`Technology) components for RT middleware [9], and can be
`controlled by other network-connected applications.
`
`Fig. 1. Block diagram of Speech Dialogue Module.
`
`
`
`2.2. Hardware
`
Figure 2 shows a picture of the speech dialogue module, whose
specifications are listed in Table 1. An application
processor, MP211 [10], primarily designed for mobile phones, is
employed for sufficient processing power. It consists of one
DSP and three ARM9 cores and runs a Linux operating
system. For the audio input interface, this module is equipped with
synchronous microphone inputs on an extended board that are
extensible to 16 channels, as well as 2 types of 2-channel
synchronous microphone inputs on the main board. In addition,
the module has peripheral interfaces such as 2-channel
loudspeaker outputs, 2-channel camera inputs, an LCD output,
and USB and LAN interfaces. A compact flash
memory (CF) card can be used on an extended board.
`
`Fig. 2. Speech Dialogue Module.
`
`
`
TABLE 1. Specifications of the Speech Dialogue Module

Item              Specification
CPU               ARM9 (192 MHz) × 3
                  DSP (SPXK6 192 MHz) × 1
Memory            128MB
                  +128MB (w/ extended board)
                  Flash 64MB
Audio Interface   microphone inputs 2ch × 2 + 16ch (w/ audio board)
                  speaker output 2ch
Image Interface   camera input × 2
                  video output, LCD output
Other Interface   USB, LAN, IrDA, GPIO
                  CF Card (w/ extended board)
Size              55 mm × 100 mm × 32 mm (w/ audio and extended boards)

2.3. Implementation of the functions

The functions of the module were distributed to an ARM9 core and a
DSP core running at 192 MHz. The task distribution between the
ARM9 and the DSP is depicted in Fig. 3. Speech recognition,
TTS conversion, and the RT component translator operate on the
ARM9. DOA estimation and noise reduction are decomposed into
core-function and control blocks. The noise-reduction core
consists of three sub-cores, namely, the ANC core, MA core, and AEC
core. The control blocks operate on the ARM core and the
core-function blocks on the DSP core. The input signals are converted
into digital form at a rate of 11025 Hz and saved in multi-ring
buffers in an internal memory of the ARM. The computational
load for each noise-reduction function is compared in Fig. 4.
`
`
`Fig. 3. Task distribution between ARM and DSP cores.
`
`
`
`
`
`Fig. 4. Computational load for each function.
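
The input path described in Section 2.3 (signals digitized at 11025 Hz and saved in multi-ring buffers) can be illustrated with a single ring buffer per channel. The following is only a minimal sketch in Python: the class name, the frame length, the channel count, and the overwrite-oldest policy are assumptions made for illustration and are not taken from the paper.

```python
class RingBuffer:
    """Fixed-capacity buffer that overwrites the oldest samples when full."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._data = [0.0] * capacity
        self._capacity = capacity
        self._start = 0   # index of the oldest stored sample
        self._count = 0   # number of valid samples currently stored

    def write(self, samples):
        """Append samples, discarding the oldest ones if the buffer is full."""
        for s in samples:
            end = (self._start + self._count) % self._capacity
            self._data[end] = s
            if self._count < self._capacity:
                self._count += 1
            else:
                # Buffer was full: the oldest sample has just been overwritten.
                self._start = (self._start + 1) % self._capacity

    def read(self, n):
        """Remove and return up to n of the oldest samples."""
        n = min(n, self._count)
        out = [self._data[(self._start + i) % self._capacity] for i in range(n)]
        self._start = (self._start + n) % self._capacity
        self._count -= n
        return out

    def __len__(self):
        return self._count


SAMPLE_RATE_HZ = 11025                              # rate stated in Section 2.3
FRAME_MS = 20                                       # hypothetical frame length
FRAME_SAMPLES = SAMPLE_RATE_HZ * FRAME_MS // 1000   # samples per frame

# One buffer per input channel (the channel count here is illustrative).
buffers = [RingBuffer(capacity=4 * FRAME_SAMPLES) for _ in range(4)]
```

In such a design the producer (audio capture) and the consumer (a processing core) only exchange read and write positions, which is one plausible reason to place the buffers in memory that both sides can reach.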
`
`SECTION 3.
`EVALUATION
`
`3.1. Platform: PaPeRo-mini
`
The speech dialogue module was installed in a PaPeRo-mini,
which is a scaled-down version of PaPeRo, a partner robot based
on a Windows PC. Figure 5 depicts PaPeRo and PaPeRo-mini,
whose specifications are compared in Table 2. PaPeRo-mini is an
autonomous mobile robot with a size of 250 × 170 × 179 mm
(HWD) and a weight of 2.5 kg. Eight equispaced omnidirectional
microphones are mounted around the neck and 2-channel
loudspeakers are mounted in the bottom. It also has CCD
cameras, ultrasonic sensors, infrared sensors, touch sensors, a
pyroelectric sensor, and an LCD.
`
`
`Fig. 5. PaPeRo-mini (Left) and PaPeRo (Right).
`
TABLE 2. Specifications of PaPeRo-mini and PaPeRo

               PaPeRo-mini               PaPeRo
CPU            MP211                     Pentium-M 1.6 GHz
OS             Linux                     Windows XP
Audio Input    Omnidirectional Mic x8    Omnidirectional Mic x7
                                         Directional Mic x1
Audio Output   Stereo Loudspeakers       Stereo Loudspeakers
               Line Output x2            Line Output x2
Image Input    Stereo CCD Camera         Stereo CCD Camera
Image Output   Composite Video           Composite Video
               LCD                       RGB
Other I/F      IrDA                      Remote Control
               USB                       USB
Battery        Li-ion 74Wh               Li-ion 60Wh
               Operating Time 8h         Operating Time 2h
Size           250 × 170 × 179 mm        385 × 248 × 245 mm
Weight         2.5 kg                    5.0 kg

3.2. DOA (direction of arrival) Estimation [3]

Figure 6 (a) depicts the evaluation environment in a room with a
background noise level of 40 dBA. One sentence, spoken by 5
different male and female speakers, was presented 10 times at 75 dB
from a loudspeaker at a height of 1.0 m. PaPeRo-mini was
placed 1.5 m away and turned in steps of 30 degrees to make
12 different DOAs. The microphone arrangements of PaPeRo-mini
and PaPeRo are illustrated in Fig. 7.
`
`
`Fig. 6. Evaluation Environment. (a) DOA, (b) ANC/MA/AEC.
`
`
`
`
`
`Fig. 7. Microphone Arrangement (Top View, Distance in cm). (a)
`PaPeRo-mini, (b) PaPeRo
`
Figure 8 depicts the evaluation result. Any DOA estimation
result other than those rejected for insufficient power or
correlation, or for inconsistent DOAs among different microphone
pairs, is counted as a detection. An answer is counted as correct
when it is within ±15 degrees of the true DOA. For comparison, the
evaluation result of PaPeRo is also depicted in Figure 8. Parameters
for height adjustment [3] were selected as d = 1.5 m and h = 1.0 m
for typical robot use at home. The correct answer
rate is slightly lower for 60 degrees than for the other directions.
However, the average rate of correct answers reaches 83%, which is
comparable to PaPeRo.

Fig. 8. DOA Estimation Result.

3.3. ANC (adaptive noise canceller) [4]

Speech recognition was performed with noise-cancelled speech
by the ANC. The evaluation environment, prepared in the same
room as that for DOA estimation, is depicted in Figure 6 (b). 450
utterances by 9 different males, females and children were
presented at a distance of 1.0 m, a height of 1.0 m, and a level of
70 dB. A loudspeaker presenting a commercial TV program was
placed 1.0 m away as the noise source in a direction of 90, 135, or
180 degrees at a level of 60-65 dB. A dictionary of 50 recognition
and noise-rejection words [11] was used for speech recognition.
For signal input, a front-side and a rear-side microphone among
the eight around the neck of PaPeRo-mini were used as the
primary and the reference microphones, respectively.
`
Figure 9 shows the speech recognition rate. For
comparison, the average speech-recognition rate of PaPeRo [4]
is also depicted in Figure 9. It was evaluated on 1500 utterances by
30 different males, females and children at distances of 0.5 m
and 1.5 m at a level of 70 dB. The noise level was set at 55-60 and
65-70 dB. The recognition rate with the ANC is improved by more
than 40% and the maximum improvement reaches 54% with
noise arriving from behind the robot. When there is no noise, the
recognition rate of PaPeRo-mini is more than 15% lower than
that of PaPeRo. This difference comes from the microphones: PaPeRo
uses a directional microphone, while PaPeRo-mini uses an
omnidirectional microphone. However, due to the ANC, the
recognition rates in noisy environments are almost comparable.
`
`Fig. 9. Speech Recognition Result (ANC).
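
The two-microphone arrangement in Section 3.3 (a primary microphone carrying speech plus noise, and a reference microphone carrying mostly noise) can be illustrated with a textbook normalized LMS (NLMS) canceller. This sketch is not the paper's method: the paper refers to a noise-robust ANC with variable step sizes [4], whereas the code below uses a fixed step size, and the signal model, tap count, and parameter values are all assumptions chosen for illustration.

```python
import math
import random


def nlms_cancel(primary, reference, taps=8, mu=0.1, eps=1e-8):
    """Subtract an adaptively filtered copy of `reference` from `primary`."""
    w = [0.0] * taps   # adaptive filter coefficients
    x = [0.0] * taps   # most recent reference samples, newest first
    out = []
    for d, r in zip(primary, reference):
        x = [r] + x[:-1]
        y = sum(wi * xi for wi, xi in zip(w, x))   # estimate of the noise in d
        e = d - y                                  # noise-cancelled output sample
        norm = eps + sum(xi * xi for xi in x)      # input power for normalization
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        out.append(e)
    return out


def power(signal):
    """Mean squared value of a sequence."""
    return sum(v * v for v in signal) / len(signal)


# Synthetic test signals (purely illustrative, not the paper's data).
rng = random.Random(0)
n = 4000
speech = [0.5 * math.sin(2 * math.pi * 0.01 * k) for k in range(n)]
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Hypothetical acoustic path: the noise reaches the primary microphone
# attenuated by 0.8 and delayed by 2 samples.
leak = [0.0, 0.0] + [0.8 * v for v in noise[:-2]]
primary = [s + v for s, v in zip(speech, leak)]

cleaned = nlms_cancel(primary, noise)

half = n // 2   # compare noise power after the filter has had time to adapt
noise_before = power(leak[half:])
noise_after = power([c - s for c, s in zip(cleaned[half:], speech[half:])])
```

The target speech acts as a disturbance to the adaptation, which is one motivation for step-size control of the kind the paper cites; the fixed step size here is the simplest possible stand-in.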
`
`
`
`3.4. MA (microphone array) [5]
`
In the case of the MA, the conditions for evaluation were the same as
those for the ANC except for the noise directions. The noise source
was placed in a direction of 30, 60, or 90 degrees. For the MA,
four microphones arranged linearly with 0.02 m spacing were
mounted on the front side of PaPeRo-mini. Figure 10 depicts the
speech recognition rate in comparison with an average rate by
PaPeRo in the same condition as that for PaPeRo-mini. Due to
the MA, the recognition rate is improved by more than 20% and
the maximum improvement reaches 40% with noise arriving
from the front of the robot. The recognition rate of PaPeRo-mini
with the MA is comparable to that of PaPeRo.
`
`
`Fig. 10. Speech Recognition Result (MA).
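
The array geometry in Section 3.4 (four microphones in a line with 0.02 m spacing, sampled at 11025 Hz) can be explored with a simple far-field model. The sketch below computes per-microphone arrival delays and the response of a fixed delay-and-sum beamformer. It is not the adaptive beamformer the paper refers to [5]; the far-field plane-wave model, the speed of sound, and the function names are assumptions.

```python
import cmath
import math

SPEED_OF_SOUND_M_S = 343.0   # assumed value for room-temperature air
SAMPLE_RATE_HZ = 11025       # sampling rate stated in Section 2.3
NUM_MICS = 4                 # linear array described in Section 3.4
SPACING_M = 0.02             # microphone spacing described in Section 3.4


def arrival_delays_samples(angle_deg):
    """Delay of a far-field plane wave at each microphone, in samples.

    The angle is measured from broadside (0 degrees = straight ahead).
    """
    step_s = SPACING_M * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND_M_S
    return [m * step_s * SAMPLE_RATE_HZ for m in range(NUM_MICS)]


def delay_and_sum_gain(freq_hz, angle_deg, steer_deg=0.0):
    """Magnitude response of a delay-and-sum beamformer steered to steer_deg."""
    arrive = arrival_delays_samples(angle_deg)
    steer = arrival_delays_samples(steer_deg)
    total = 0j
    for a, s in zip(arrive, steer):
        # Residual delay after compensating for the steering direction.
        delay_s = (a - s) / SAMPLE_RATE_HZ
        total += cmath.exp(-2j * math.pi * freq_hz * delay_s)
    return abs(total) / NUM_MICS
```

Under these assumptions the delay between adjacent microphones stays below one sample period, and a fixed beam steered to the front attenuates a 1 kHz source at 90 degrees only slightly, which suggests why an adaptive beamformer may be preferable for such a small array.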
`
`
`
`3.5. AEC (acoustic echo canceller) [6]
`
Speech recognition was performed with echo-cancelled speech by
the AEC. The conditions of evaluation were the same as those for the
ANC and the MA except that there was no noise source. A
music source was presented as an echo at 60-65 dB from a
loudspeaker mounted on the bottom of PaPeRo-mini. The same
microphone as the primary microphone for the ANC was
used to capture the echo and the target speech. Figure 11 depicts
the speech recognition rate in comparison with the PaPeRo data
in the same environment. The echo level for PaPeRo was at 55-60
and 65-70 dB. Due to the AEC, the recognition rate is improved
by 30% with the sound from the loudspeaker of the robot. The
speech recognition rate of PaPeRo-mini was equivalent to that of
PaPeRo.
`
`Fig. 11. Speech Recognition Result (AEC).
`
`
`
`SECTION 4.
`CONCLUSION
`
A single-chip speech dialogue module and its evaluation on a
personal robot have been presented. This module has been
implemented on a single-chip application processor to provide a
compact size, low power-consumption, and portability. It has
been equipped with direction of arrival (DOA) estimation,
adaptive noise cancellation (ANC), microphone array (MA)
beamforming, and acoustic echo cancellation (AEC) for speech
recognition in noisy environments. Evaluation results obtained on
PaPeRo-mini in a real environment have demonstrated an 85%
correct rate in DOA estimation, and as much as 54% and 30%
higher speech recognition rates in noisy environments and
during robot utterances, respectively.
`
`ACKNOWLEDGMENT
`
`This development was supported in part by a common platform
`development project for next-generation robots of NEDO (New
`Energy and Industrial Technology Development Organization).
`
`
`
`
`978-1-4244-2354-5/09/$25.00 ©2009 IEEE
`
`3697
`
`ICASSP 2009
`
`Authorized licensed use limited to: Everything Demo User. Downloaded on December 29,2021 at 15:00:21 UTC from IEEE Xplore. Restrictions apply.
`
`A SINGLE-CHIP SPEECH DIALOGUE MODULE
`AND ITS EVALUATION ON A PERSONAL ROBOT, PAPERO-MINI
`
`Miki Sato, Toru Iwasawa, Akihiko Sugiyama, Toshihiro Nishizawa, Yosuke Takano
`
NEC Common Platform Software Research Laboratories
`Kawasaki 211-8666, JAPAN
`
ABSTRACT

This paper presents a single-chip speech dialogue module and its
evaluation on a personal robot. This module is implemented on
an application processor that was developed primarily for mobile
phones to provide a compact size, low power-consumption, and low
cost. It performs speech recognition with preprocessing functions
such as direction-of-arrival (DOA) estimation, noise cancellation,
beamforming with an array of microphones, and echo cancellation.
The module is also equipped with text-to-speech (TTS) conversion.
Evaluation results obtained on a new personal robot, PaPeRo-mini,
which is a scaled-down version of PaPeRo, demonstrate an 85% correct
rate in DOA estimation, and as much as 54% and 30% higher speech
recognition rates in noisy environments and during robot utterances,
respectively. These results are shown to be comparable to those
obtained by PaPeRo.

Index Terms: speech recognition, DOA estimation, noise cancellation,
microphone array, echo cancellation, speech dialogue module

1. INTRODUCTION

Speech dialogue systems have been receiving particular attention
as a user interface for a wide variety of interactive applications, such
as robots and car navigation systems. These applications are
generally controlled by voice commands from a distance. A given
command is processed by a speech recognition system to generate
a corresponding operation. It is also necessary to transform text
information into an audible form by using a text-to-speech (TTS)
conversion system. However, it is still challenging to perform
off-microphone speech recognition, where the microphone is placed at a
distance from the talker [1]. The target signal is seriously interfered
with by other signals and the ambient noise in noisy environments.
Therefore, noise robustness is essential to speech recognition systems in
the real environment.

To reduce undesirable influence by the ambient noise and the
interference, signal-processing functions have been used for
preprocessing the noisy speech. Among these functions are estimation of
the direction of arrival (DOA) [2, 3], noise cancellation [4],
beamforming with a microphone array [5], and echo cancellation [6].
DOA estimation identifies the direction of the voice command so
that the microphone directivity is steered towards the speech source.
An adaptive noise canceller (ANC) and a microphone array (MA)
reduce undesirable influence which cannot be sufficiently offset by
the directional microphone. An acoustic echo canceller (AEC)
suppresses an echo that is a part of robot speech leaking into the
microphone signal and contaminating the voice command.

In robot applications, these functions are generally implemented
by software on a platform based on a personal computer (PC) [7].
It is sometimes necessary to share computational power with other
applications on the same platform. Considering that a larger number
of complex applications are required on a robot, it is desirable
to have a speech dialogue module on a separate platform so that the
PC-based platform can be fully devoted to more complex and
computationally intense applications on the robot. On such a separate
module, the performance of the speech dialogue functions becomes
more stable and guaranteed. In addition, a compact speech dialogue
module helps promote human-robot interactions, with its portability,
on more robots that otherwise would not have such an interface.

This paper presents a compact speech dialogue module and its
evaluation on a personal robot. This module offers dialogue
functions similar to a personal robot PaPeRo [8] on an application
processor that was developed primarily for mobile phones to provide a
compact size, low power-consumption, and low cost. In the following
section, functions of the speech dialogue module are described
with the hardware for their implementation. Section 3 presents
evaluation results of a near-field DOA estimator, a noise-robust ANC
with variable stepsizes, an adaptive beamformer for MA, and a
noise-robust AEC in the real environment.

Fig. 1. Block diagram of Speech Dialogue Module.
[Diagram labels: Microphones, Multichannel Audio Input, DOA Estimation,
Noise Reduction (ANC, MA, AEC), Speech Recognition, TTS Conversion,
Speech Synthesis, Loudspeaker, RT Component Translator, Network I/F,
Other Applications]
`
`2. SPEECH DIALOGUE MODULE
`
`2.1. Speech Dialogue Functions
`
A block diagram of the speech dialogue module is illustrated in Fig.
1. This module consists of speech recognition (word recognition),
DOA estimation, noise reduction, and TTS conversion as speech
dialogue functions. Noise reduction has three subfunctions, namely,
an adaptive noise canceller (ANC), a microphone-array (MA)
beamformer, and an acoustic echo canceller (AEC). They operate
separately and a desired output is manually selected. These functions
work as RT (Robot Technology) components for RT middleware [9],
and can be controlled by other network-connected applications.

Fig. 2. Speech Dialogue Module.

Fig. 3. Task distribution between ARM and DSP cores.
[Diagram labels: DSP (DOA Est. Core, ANC Core, MA Core, AEC Core);
ARM1 (RT Component Translator, Speech Recognition, TTS Conversion,
DOA Est. Control, Noise Reduction Control); Shared Memory (inputs for
DOA estimation and noise reduction, DOA, Noise-Cancelled Signal);
Internal Memory (Multi-ring Buffer, Input #1 to Input #20, Output)]

Table 1. Specifications of the Speech Dialogue Module

Item              Specification
CPU               ARM9 (192 MHz) × 3
                  DSP (SPXK6 192 MHz) × 1
Memory            128MB
                  +128MB (w/ extended board)
                  Flash 64MB
Audio Interface   microphone inputs 2ch × 2
                  +16ch (w/ audio board)
                  speaker output 2ch
Image Interface   camera input × 2
                  video output, LCD output
Other Interface   USB, LAN, IrDA, GPIO
                  CF Card (w/ extended board)
Size              55 mm × 100 mm × 32 mm
                  (w/ audio and extended boards)

Fig. 4. Computational load for each function.
[Chart values: AEC 142, MA 82, ANC 78, DOA 46; total 348 MIPS]
`
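
As a quick arithmetic check on the values shown in Fig. 4 (AEC 142, MA 82, ANC 78, DOA 46, total 348 MIPS), the per-function loads can be summed and expressed as shares of the total. The snippet is only a worked calculation; the variable names are illustrative, and the paper does not state whether all functions ever run at the same time.

```python
# Per-function computational loads read from Fig. 4, in MIPS.
loads_mips = {"AEC": 142, "MA": 82, "ANC": 78, "DOA": 46}

# Sum of the individual loads; this should match the total given in the figure.
total_mips = sum(loads_mips.values())

# Share of the total for each function, in percent (one decimal place).
shares_percent = {
    name: round(100 * mips / total_mips, 1) for name, mips in loads_mips.items()
}
```

The sum reproduces the stated total of 348 MIPS, with the AEC accounting for the largest share.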
`2.2. Hardware
`
Figure 2 shows a picture of the speech dialogue module, whose
specifications are listed in Table 1. An application processor, MP211
[10], primarily designed for mobile phones, is employed for
sufficient processing power. It consists of one DSP and three ARM9
cores and runs a Linux operating system. For the audio input
interface, this module is equipped with synchronous microphone inputs
on an extended board that are extensible to 16 channels, as well as
2 types of 2-channel synchronous microphone inputs on the main
board. In addition, the module has peripheral interfaces
such as 2-channel loudspeaker outputs, 2-channel camera inputs, an
LCD output, and USB and LAN interfaces. A compact flash memory (CF)
card can be used on an extended board.

2.3. Implementation of the functions

The functions of the module were distributed to an ARM9 core and a DSP
core running at 192 MHz. The task distribution between the ARM9
and the DSP is depicted in Fig. 3. Speech recognition, TTS
conversion, and the RT component translator operate on the ARM9. DOA
estimation and noise reduction are decomposed into core-function
and control blocks. The noise-reduction core consists of three
sub-cores, namely, the ANC core, MA core, and AEC core. The control
blocks operate on the ARM core and the core-function blocks on the
DSP core. The input signals are converted into digital form at a rate
of 11025 Hz and saved in multi-ring buffers in an internal memory
of the ARM. The computational load for each noise-reduction
function is compared in Fig. 4.

Fig. 5. PaPeRo-mini (Left) and PaPeRo (Right).
`
`3. EVALUATION
`
`3.1. Platform: PaPeRo-mini
`
The speech dialogue module was installed in a PaPeRo-mini, which
is a scaled-down version of PaPeRo, a partner robot based on a
Windows PC. Figure 5 depicts PaPeRo and PaPeRo-mini, whose
specifications are compared in Table 2. PaPeRo-mini is an autonomous
mobile robot with a size of 250 × 170 × 179 mm (HWD) and a
weight of 2.5 kg. Eight equispaced omnidirectional microphones
are mounted around the neck and 2-channel loudspeakers are mounted
in the bottom. It also has CCD cameras, ultrasonic sensors, infrared
sensors, touch sensors, a pyroelectric sensor, and an LCD.
`
`3.2. DOA (direction of arrival) Estimation [3]
`
Figure 6 (a) depicts the evaluation environment in a room with a
background noise level of 40 dBA. One sentence, spoken by 5 different
male and female speakers, was presented 10 times at 75 dB from a
loudspeaker at a height of 1.0 m. PaPeRo-mini was placed 1.5 m away
and turned in steps of 30 degrees to make 12 different DOAs. The
microphone arrangements of PaPeRo-mini and PaPeRo are illustrated
in Fig. 7.
`
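
The scoring rule used for DOA estimation in Section 3.2 (12 true directions in 30-degree steps, with an answer counted as correct when it falls within ±15 degrees of the true DOA) can be sketched as follows. The function names are hypothetical, and treating the margin as inclusive is an assumption; the paper does not specify the boundary case.

```python
# The 12 true directions produced by turning the robot in 30-degree steps.
TRUE_DOAS_DEG = list(range(0, 360, 30))


def angular_error_deg(estimate_deg, truth_deg):
    """Smallest absolute difference between two bearings, from 0 to 180 degrees."""
    diff = (estimate_deg - truth_deg) % 360.0
    return min(diff, 360.0 - diff)


def is_correct(estimate_deg, truth_deg, margin_deg=15.0):
    """True when the estimate lies within the margin of the true direction.

    The boundary (exactly margin_deg away) is counted as correct here,
    which is an assumption.
    """
    return angular_error_deg(estimate_deg, truth_deg) <= margin_deg
```

The modulo step matters because bearings wrap around: an estimate of 350 degrees for a true direction of 0 degrees is only 10 degrees off and should be scored as correct.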
`
Fig. 7. Microphone Arrangement (Top View, Distance in cm). (a)
PaPeRo-mini, (b) PaPeRo
`
Fig. 8. DOA Estimation Result.
[Chart: Detection and Correct Answer Rates [%] versus DOA [deg], for 0 to
330 degrees in 30-degree steps, the average, and PaPeRo]
`
For comparison, the average speech-recognition rate of PaPeRo [4] is
depicted in Figure 9. It was evaluated on 1500 utterances by 30 different
males, females and children at distances of 0.5 m and 1.5 m at a
level of 70 dB. The noise level was set at 55-60 and 65-70 dB. The
recognition rate with the ANC is improved by more than 40% and
the maximum improvement reaches 54% with noise arriving from
behind the robot. When there is no noise, the recognition rate of
PaPeRo-mini is more than 15% lower than that of PaPeRo. This difference
comes from the microphones: PaPeRo uses a directional microphone, while
PaPeRo-mini uses an omnidirectional microphone. However, due to
the ANC, the recognition rates in noisy environments are almost
comparable.
`
`3.4. MA (microphone array) [5]
`
In the case of the MA, the conditions for evaluation were the same as
those for the ANC except for the noise directions. The noise source was
placed in a direction of 30, 60, or 90 degrees. For the MA, four
microphones arranged linearly with 0.02 m spacing were mounted on
the front side of PaPeRo-mini. Figure 10 depicts the speech
recognition rate in comparison with an average rate by PaPeRo in the same
condition as that for PaPeRo-mini. Due to the MA, the recognition
rate is improved by more than 20% and the maximum improvement
reaches 40% with noise arriving from the front of the robot. The
recognition rate of PaPeRo-mini with the MA is comparable to that
of PaPeRo.
`
`3.5. AEC (acoustic echo canceller) [6]
`
`Speech recognition was performed with echo-cancelled speech by
`AEC. The condition of evaluation was the same as that for the ANC
`
`Table 2. Specifications of PaPeRo-mini and PaPeRo
`
`CPU
`OS
`Audio
`Input
`Audio
`Output
`Image
`Input
`Image
`Output
`Other
`I/F
`Battery
`
`Size
`Weight
`
`PaPeRo
`PaPeRo-mini
`Pentium-M 1.6 GHz
`MP211
`Windows XP
`Linux
`Omnidirectional Mic x8 Omnidirectional Mic x7
`Directional Mic x1
`Stereo Loudspeakers
`Line Output x2
`Stereo CCD Camera
`
`Stereo Loudspeakers
`Line Output x2
`Stereo CCD Camera
`
`Composite Video
`LCD
`IrDA
`USB
`Li-ion 74Wh
`Operating Time 8h