`
VOLUME 15    NUMBER 4    ITASD8    (ISSN 1558-7916)
`
`PAPERS
`Speech Analysis
A Soft Voice Activity Detection Using GARCH Filter and Variance Gamma Distribution ...... R. Tahmasbi and S. Rezaei 1129
Single and Multiple F0 Contour Estimation Through Parametric Spectrogram Modeling of Speech in Noisy Environments ...... J. Le Roux, H. Kameoka, N. Ono, A. de Cheveigné, and S. Sagayama 1135

Speech Coding
Memory-Based Vector Quantization of LSF Parameters by a Power Series Approximation ...... T. Eriksson and F. Nordén 1146
Rate Allocation for Noncollaborative Multiuser Speech Communication Systems Based on Bargaining Theory ...... B. J. Borgström, M. van der Schaar, and A. Alwan 1156
Wideband Speech Coding Advances in VMR-WB Standard ...... M. Jelínek and R. Salami 1167

Speech Enhancement
A Spectral Conversion Approach to Single-Channel Speech Enhancement ...... A. Mouchtaris, J. Van der Spiegel, P. Mueller, and P. Tsakalides 1180
Noisy Speech Enhancement Using Harmonic-Noise Model and Codebook-Based Post-Processing ...... E. Zavarehei, S. Vaseghi, and Q. Yan 1194

Speech Adaptation/Normalization
Environmental Independent ASR Model Adaptation/Compensation by Bayesian Parametric Representation ...... X. Wang and D. O’Shaughnessy 1204

Speech Synthesis and Generation
Simulation of Losses Due to Turbulence in the Time-Varying Vocal System ...... P. Birkholz, D. Jackèl, and B. J. Kröger 1218
Variable-Length Unit Selection in TTS Using Structural Syntactic Cost ...... C.-H. Wu, C.-C. Hsia, J.-F. Chen, and J.-F. Wang 1227

Speech Data Mining and Document Retrieval
Audio Signal Feature Extraction and Classification Using Local Discriminant Bases ...... K. Umapathy, S. Krishnan, and R. K. Rao 1236

Content-Based Audio Processing
Melody Transcription From Music Audio: Approaches and Evaluation ...... G. E. Poliner, D. P. W. Ellis, A. F. Ehmann, E. Gómez, S. Streich, and B. Ong 1247
Melody Extraction and Musical Onset Detection via Probabilistic Models of Framewise STFT Peak Data ...... H. Thornburg, R. J. Leistikow, and J. Berger 1257

Audio Coding
Low Bit-Rate Object Coding of Musical Audio Using Bayesian Harmonic Models ...... E. Vincent and M. D. Plumbley 1273
`(Contents Continued on Back Cover)
`
`
`
`
`(Contents Continued from Front Cover)
`
Audio Analysis and Synthesis
Joint Detection and Tracking of Time-Varying Harmonic Components: A Flexible Bayesian Approach ...... C. Dubois and M. Davy 1283

Audio for Multimedia
Robust Data Hiding in Audio Using Allpass Filters ...... H. M. A. Malik, R. Ansari, and A. A. Khokhar 1296

Echo Cancellation
System Identification in the Short-Time Fourier Transform Domain With Crossband Filtering ...... Y. Avargel and I. Cohen 1305
An Improvement of the Two-Path Algorithm Transfer Logic for Acoustic Echo Cancellation ...... F. Lindström, C. Schüldt, and I. Claesson 1320

Loudspeaker and Microphone Array Signal Processing
Direction of Arrival Estimation Using the Parameterized Spatial Correlation Matrix ...... J. Dmochowski, J. Benesty, and S. Affes 1327
Multichannel Bin-Wise Robust Frequency-Domain Adaptive Filtering and Its Application to Adaptive Beamforming ...... W. Herbordt, H. Buchner, S. Nakamura, and W. Kellermann 1340

Large Vocabulary Continuous Recognition/Search
Efficient WFST-Based One-Pass Decoding With On-The-Fly Hypothesis Rescoring in Extremely Large Vocabulary Continuous Speech Recognition ...... T. Hori, C. Hori, Y. Minami, and A. Nakamura 1352

Robust Speech Recognition
A Study of Variable-Parameter Gaussian Mixture Hidden Markov Modeling for Noisy Speech Recognition ...... X. Cui and Y. Gong 1366

General Topics in Speech Recognition
Template-Based Continuous Speech Recognition ...... M. De Wachter, M. Matton, K. Demuynck, P. Wambacq, R. Cools, and D. Van Compernolle 1377
Exploiting Temporal Correlation of Speech for Error Robust and Bandwidth Flexible Distributed Speech Recognition ...... Z.-H. Tan, P. Dalsgaard, and B. Lindberg 1391
A Framework for Secure Speech Recognition ...... P. Smaragdis and M. Shashanka 1404

Acoustic Modeling for Automatic Speech Recognition
Automatic Model Complexity Control Using Marginalized Discriminative Growth Functions ...... X. Liu and M. Gales 1414
Trajectory Clustering for Solving the Trajectory Folding Problem in Automatic Speech Recognition ...... Y. Han, J. de Veth, and L. Boves 1425

Speaker Characterization and Recognition
Joint Factor Analysis Versus Eigenchannels in Speaker Recognition ...... P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel 1435
Speaker and Session Variability in GMM-Based Speaker Verification ...... P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel 1448
Automatic Speaker Clustering Using a Voice Characteristic Reference Space and Maximum Purity Estimation ...... W.-H. Tsai, S.-S. Cheng, and H.-M. Wang 1461

Source Separation and Signal Enhancement
Separation of Singing Voice From Music Accompaniment for Monaural Recordings ...... Y. Li and D. L. Wang 1475

Signal Processing for Music
Parameterized Finite Difference Schemes for Plates: Stability, the Reduction of Directional Dispersion and Frequency Warping ...... S. Bilbao, L. Savioja, and J. O. Smith 1488
`
CORRESPONDENCE
On the Ramsey Class of Interleavers for Robust Speech Recognition in Burst-Like Packet Loss ...... A. M. Gómez, A. M. Peinado, V. Sánchez, and A. J. Rubio 1496

EDICS—Editor's Information Classification Scheme ...... 1500
Information for Authors ...... 1502
`
ANNOUNCEMENTS
Call for Papers—Special Issue on New Approaches to Statistical Speech Processing ...... 1504
Call for Papers—2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics ...... 1505
Call for Papers—IEEE TRANSACTIONS ON MULTIMEDIA Special Issue on Multimedia Applications in Mobile/Wireless Context ...... 1506
`
`
`
`IEEE SIGNAL PROCESSING SOCIETY
The Signal Processing Society is an organization, within the framework of the IEEE, of members with principal professional interest in the technology of transmission, recording, reproduction, processing, and measurement of speech and other signals by digital electronic, electrical, acoustic, mechanical, and optical means, the components and systems to accomplish these and related aims, and the environmental, psychological, and physiological factors concerned therewith. All members of the IEEE are eligible for membership in the Society and will receive this TRANSACTIONS upon payment of the annual Society membership fee of $27.00 plus an annual subscription fee of $34.00. For information on joining, write to the IEEE at the address below. Member copies of Transactions/Journals are for personal use only.
`
`Publications Board Chair
`K. J. R. LIU, VP-Publications
`Univ. Maryland
`College Park, MD 20742
`
`Signal Processing Letters
`A. B. GERSHMAN, Editor-in-Chief
`Darmstadt Univ. Technol.
`D-64283 Darmstadt, Germany
`
`ALEX ACERO
`Microsoft Res.
`Redmond, WA 98052-6399
`
`ABEER ALWAN
`Dept. of Elect. Eng.
`UCLA
Los Angeles, CA 90095
`
`BILL BYRNE
`Eng. Dept.
Cambridge, CB2 1PZ, U.K.
`
`ISRAEL COHEN
`Technion–Israel Inst. of Technol.
`Technion City, Haifa 32000, Israel
`
`YARIV EPHRAIM
`George Mason Univ.
`Dept. of ECE
Fairfax, VA 22030-4444
`
`DILEK HAKKANI-TÜR
`Intl. Comput. Sci. Inst. (ICSI)
Berkeley, CA 94704
`
`MARY HARPER
`Purdue Univ.
`Sch. of Elect. & Comput. Eng.
West Lafayette, IN 47907-1285
`
SOCIETY PUBLICATIONS
Trans. on Signal Processing
A.-J. VAN DER VEEN, Editor-in-Chief
Delft Univ. Technol.
2628 CD Delft, The Netherlands

Trans. on Image Processing
C. A. BOUMAN, Editor-in-Chief
Purdue Univ.
W. Lafayette, IN 47906
`
`Trans. on Information Forensics
`and Security
`P. MOULIN, Editor-in-Chief
`Univ. Illinois
`Urbana, IL 61801
`
`SP Magazine
`S.-F. CHANG, Editor-in-Chief
`Columbia Univ.
`New York, NY 10027
`
`TRANSACTIONS ASSOCIATE EDITORS
MARK HASEGAWA-JOHNSON
Univ. of Illinois
Elect. & Comput. Eng.
Beckman Inst.
Urbana, IL 61801

SYLVAIN MARCHAND
Univ. of Bordeaux
351, cours de la Libération
F-33405 Talence Cedex, France
`
`TIMOTHY J. HAZEN
`MIT Comput. Sci. and
`Artificial Intelligence Lab.
`Cambridge, MA 02139
`
HONG-GOO KANG
Yonsei Univ.
Seoul 120-749, South Korea
`
`SIMON KING
`Ctr. for Speech Technol. Res.
`Univ. of Edinburgh
`Edinburgh, EH8 9LW, U.K.
`
`SEN KUO
`Dept. of Elect. Eng.
`Northern Illinois Univ.
`Dekalb, IL 60115
`
`SHOJI MAKINO
`NTT Communication Res. Labs.
`Kyoto, 619-0237, Japan
`
`RAINER MARTIN
`Ruhr-Univ. Bochum
`Inst. of Communication Acoustics
`Bochum, Germany 44780
`
`HELEN MENG
`Chinese Univ. of Hong Kong
Shatin, New Territories
`Hong Kong, SAR, China
`
`MAURIZIO OMOLOGO
`ITC-IRST
`38050, Povo-Trento, Italy
`
RUDOLF RABENSTEIN
`Telecommunications Inst. 1
`Univ. of Erlangen-Nuremberg
`Cauerstrasse 7
`D-91058 Erlangen, Germany
`
`SUSANTO RAHARDJA
`Div. of Info. Eng.
`21 Heng Mui Keng Terrace
`Singapore, 119613
`ersusanto@ntu.edu.sg
`
`Trans. on Audio, Speech,
`and Language Processing
`M. OSTENDORF, Editor-in-Chief
`Univ. Washington
`Seattle, WA 98195-2500
`
`Trans. on SP, IP, ASL, and IFS
`SPS Publication Office
`IEEE Signal Processing Society
`Piscataway, NJ 08854
`
`GAEL RICHARD
`37-39 rue Dareau, Bureau/Office DA 412
`75014 Paris, France
`GERHARD RIGOLL
`Munich Univ. of Technol.
`D-80290 Munich, Germany
`HIROSHI SAWADA
`NTT Communication Sci. Labs.
`NTT Corp.
`Kyoto, 619-0237, Japan
`MALCOLM SLANEY
`Yahoo! Res.
`Santa Clara, CA 95054
`ARUN C. SURENDRAN
`Comm. Collaboration and Signal
`Processing at Microsoft Res.
`One Microsoft Way
`Redmond, WA 98052
`GEORGE TZANETAKIS
`Dept. of Comput. Sci.
`Univ. of Victoria
Victoria, BC V8W 3P6, Canada
VESA VÄLIMÄKI
`TKK-Helsinki Univ. of Technol.
`Dept. of Elect. Comm. Eng.
`Lab. Acoustics Audio Signal Processing
`FI-02015 TKK, Espoo, Finland
`
IEEE Officers
LEAH H. JAMIESON, President and CEO
LEWIS M. TERMAN, President-Elect
CELIA L. DESMOND, Secretary
DAVID G. GREEN, Treasurer
MICHAEL R. LIGHTNER, Past President
MOSHE KAM, Vice President, Educational Activities
JOHN B. BAILLIEUL, Vice President, Publication Services and Products
PEDRO A. RAY, Vice President, Regional Activities
GEORGE W. ARNOLD, President, IEEE Standards Association
PETER W. STAECKER, Vice President, Technical Activities
JOHN W. MEREDITH, President, IEEE-USA
RICHARD V. COX, Director, Division IX—Signals and Applications
`
`IEEE Executive Staff
`JEFFRY W. RAYNES, CAE, Executive Director & Chief Operating Officer
`MATTHEW LOEB, Corporate Strategy & Communications
`DONALD CURTIS, Human Resources
`RICHARD D. SCHWARTZ, Business Administration
`ANTHONY DURNIAK, Publications Activities
CHRIS BRANTLEY, IEEE-USA
`JUDITH GORMAN, Standards Activities
`MARY WARD-CALLAN, Technical Activities
`CECELIA JANKOWSKI, Regional Activities
SALLY A. WASELIK, Information Technology
`BARBARA COBURN STOLER, Educational Activities
`
`IEEE Periodicals
`Transactions/Journals Department
`Staff Director: FRAN ZAPPULLA
`Editorial Director: DAWN MELLEY Production Director: ROBERT SMREK
`Managing Editor: MARTIN J. MORAHAN Associate Editor: ANDREW SWARTZ
`
`IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING (ISSN 1558-7916) is published eight times a year in January, February, March, May, July, August, September,
`and November by the Institute of Electrical and Electronics Engineers, Inc. Responsibility for the contents rests upon the authors and not upon the IEEE, the Society/Council, or its
`members. IEEE Corporate Office: 3 Park Avenue, 17th Floor, New York, NY 10016-5997. IEEE Operations Center: 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331. NJ
`Telephone: +1 732 981 0060. Price/Publication Information: Individual copies: IEEE Members $20.00 (first copy only), nonmembers $90.00 per copy. (Note: Postage and handling
`charge not included.) Member and nonmember subscription prices available upon request. Available in microfiche and microfilm. Copyright and Reprint Permissions: Abstracting is
`permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy fee indicated in the code at the bottom of the first page is paid
`through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. For all other copying, reprint, or republication permission, write to Copyrights and Permissions
`Department, IEEE Publications Administration, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331. Copyright © 2007 by the Institute of Electrical and Electronics Engineers,
`Inc. All rights reserved. Periodicals Postage Paid at New York, NY and at additional mailing offices. Postmaster: Send address changes to IEEE TRANSACTIONS ON AUDIO, SPEECH, AND
`LANGUAGE PROCESSING, IEEE, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331. GST Registration No. 125634188. Printed in U.S.A.
`
`Digital Object Identifier 10.1109/TASL.2007.897012
`
`
`
`IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 4, MAY 2007
`
`1327
`
`Direction of Arrival Estimation Using the
`Parameterized Spatial Correlation Matrix
`
`Jacek Dmochowski, Jacob Benesty, Senior Member, IEEE, and Sofiène Affes, Senior Member, IEEE
`
Abstract—The estimation of the direction-of-arrival (DOA) of one or more acoustic sources is an area that has generated much interest in recent years, with applications like automatic video camera steering and multiparty stereophonic teleconferencing entering the market. DOA estimation algorithms are hindered by the effects of background noise and reverberation. Methods based on the time-differences-of-arrival (TDOA) are commonly used to determine the azimuth angle of arrival of an acoustic source. TDOA-based methods compute each relative delay using only two microphones, even though additional microphones are usually available. This paper deals with DOA estimation based on spatial spectral estimation, and establishes the parameterized spatial correlation matrix as the framework for this class of DOA estimators. This matrix jointly takes into account all pairs of microphones, and is at the heart of several broadband spatial spectral estimators, including steered-response power (SRP) algorithms. This paper reviews and evaluates these broadband spatial spectral estimators, comparing their performance to TDOA-based locators. In addition, an eigenanalysis of the parameterized spatial correlation matrix is performed and reveals that such analysis allows one to estimate the channel attenuation from factors such as uncalibrated microphones. This estimate generalizes the broadband minimum variance spatial spectral estimator to more general signal models. A DOA estimator based on the multichannel cross correlation coefficient (MCCC) is also proposed. The performance of all proposed algorithms is included in the evaluation. It is shown that adding extra microphones helps combat the effects of background noise and reverberation. Furthermore, the link between accurate spatial spectral estimation and corresponding DOA estimation is investigated. The application of the minimum variance and MCCC methods to the spatial spectral estimation problem leads to better resolution than that of the commonly used fixed-weighted SRP spectrum. However, this increased spatial spectral resolution does not always translate to more accurate DOA estimation.
`
Index Terms—Circular arrays, delay-and-sum beamforming (DSB), direction-of-arrival (DOA) estimation, linear spatial prediction, microphone arrays, multichannel cross correlation coefficient (MCCC), spatial correlation matrix, time delay estimation.
`
`I. INTRODUCTION
`
Propagating signals contain much information about the sources that emit them. Indeed, the location of a signal source is of much interest in many applications, and there exists a large and increasing need to locate and track sound sources.
`
Manuscript received September 6, 2006; revised November 8, 2006. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Hiroshi Sawada.
The authors are with the Institut National de la Recherche Scientifique-Énergie, Matériaux, et Télécommunications (INRS-EMT), Université du Québec, Montréal, QC H5A 1K6, Canada (e-mail: dmochow@emt.inrs.ca).
`Digital Object Identifier 10.1109/TASL.2006.889795
`
For example, a signal-enhancing beamformer [1], [2] must continuously monitor the position of the desired signal source in order to provide the desired directivity and interference suppression. This paper is concerned with estimating the direction-of-arrival (DOA) of acoustic sources in the presence of significant levels of both noise and reverberation.
The two major classes of broadband DOA estimation techniques are those based on the time-differences-of-arrival (TDOA) and spatial spectral estimators. The latter terminology arises from the fact that spatial frequency corresponds to the wavenumber vector, whose direction is that of the propagating signal. Therefore, by looking for peaks in the spatial spectrum, one is determining the DOAs of the dominant signal sources.

The TDOA approach is based on the relationship between DOA and relative delays across the array. The problem of estimating these relative delays is termed "time delay estimation" [3]. The generalized cross-correlation (GCC) approach of [4], [5] is the most popular time delay estimation technique. Alternative methods of estimating the TDOA include phase regression [6] and linear prediction preprocessing [7]. The resulting relative delays are then mapped to the DOA by an appropriate inverse function that takes into account array geometry.
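As a point of reference, the following is a minimal sketch of a GCC time delay estimator with the popular PHAT weighting; it illustrates the general technique referenced in [4], [5] rather than the specific configuration evaluated in this paper, and the sampling rate and frame length are arbitrary assumptions.

```python
import numpy as np

def gcc_phat(x0, x1, fs, max_tau=None):
    """Estimate the delay (in seconds) of x1 relative to x0 using the
    generalized cross-correlation with PHAT weighting; a positive value
    means the wavefront reaches microphone 1 after microphone 0."""
    n = len(x0) + len(x1)                       # zero-pad to avoid circular wrap-around
    X0 = np.fft.rfft(x0, n)
    X1 = np.fft.rfft(x1, n)
    cross = X1 * np.conj(X0)
    cross /= np.abs(cross) + 1e-12              # PHAT: keep only the phase information
    cc = np.fft.irfft(cross, n)
    max_shift = n // 2 if max_tau is None else min(int(max_tau * fs), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = np.argmax(np.abs(cc)) - max_shift     # lag (in samples) of the correlation peak
    return lag / fs

# toy check: the second channel is the first one delayed by 8 samples
fs = 16000
rng = np.random.default_rng(0)
s = rng.standard_normal(4096)
print(round(gcc_phat(s, np.roll(s, 8), fs) * fs))   # prints 8
```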
Even though multiple-microphone arrays are commonplace in time delay estimation algorithms, there has not emerged a clearly preferred way of combining the various measurements from multiple microphones. Notice that in the TDOA approach, the time delays are estimated using only two microphones at a time, even though one usually has several more sensor outputs at one's disposal. The averaging of measurements from independent pairs of microphones is not an optimal way of combining the measurements, as each computed time delay is derived from only two microphones, and thus often contains significant levels of corrupting noise and interference. It is thus well known that current TDOA-based DOA estimation algorithms are plagued by the effects of both noise and especially reverberation.

To that end, Griebel and Brandstein [8] map all "realizable" combinations of microphone-pair delays to the corresponding source locations, and maximize simultaneously the sum (across various microphone pairs) of cross-correlations across all possible locations. This approach is notable, as it jointly maximizes the results of the cross-correlations between the various microphone pairs.
`The spatial spectral estimation problem is well defined in the
`narrowband signal community. There are three major methods:
`the steered conventional beamformer approach (also termed
`the “Bartlett” estimate), the minimum variance estimator (also
`termed the “Capon” or maximum-likelihood estimator), and
`the linear spatial predictive spectral estimator. Reference [9]
`
`1558-7916/$25.00 © 2007 IEEE
`
`
`
`1328
`
`IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 4, MAY 2007
`
`provides an excellent overview of these approaches. These
`three approaches are unified in their use of the narrowband
`spatial correlation matrix, as outlined in the next section.
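As a point of reference for these narrowband estimators, the sketch below computes the Bartlett (steered conventional beamformer) and Capon (minimum variance) spatial spectra from a sample spatial correlation matrix. The uniform linear array, half-wavelength spacing, and source angles are illustrative assumptions only (the paper itself works with broadband signals and circular arrays), and the linear predictive estimator is omitted for brevity.

```python
import numpy as np

def steering_vector(theta, n_mics, spacing_wavelengths=0.5):
    """Narrowband plane-wave steering vector for a uniform linear array."""
    n = np.arange(n_mics)
    return np.exp(-2j * np.pi * spacing_wavelengths * n * np.sin(theta))

def bartlett_capon_spectra(R, angles, n_mics):
    R_inv = np.linalg.inv(R)
    p_bart, p_capon = [], []
    for theta in angles:
        a = steering_vector(theta, n_mics)
        p_bart.append(np.real(a.conj() @ R @ a))             # steered conventional (Bartlett) power
        p_capon.append(1.0 / np.real(a.conj() @ R_inv @ a))   # minimum variance (Capon) spectrum
    return np.array(p_bart), np.array(p_capon)

# toy usage: two narrowband sources at -20 and +30 degrees in white noise
n_mics, snapshots = 8, 2000
rng = np.random.default_rng(1)
A = np.column_stack([steering_vector(t, n_mics) for t in np.deg2rad([-20.0, 30.0])])
S = rng.standard_normal((2, snapshots)) + 1j * rng.standard_normal((2, snapshots))
noise = rng.standard_normal((n_mics, snapshots)) + 1j * rng.standard_normal((n_mics, snapshots))
X = A @ S + 0.1 * noise
R = X @ X.conj().T / snapshots                                # sample spatial correlation matrix
grid = np.deg2rad(np.linspace(-90, 90, 361))
p_b, p_c = bartlett_capon_spectra(R, grid, n_mics)
print(np.rad2deg(grid[np.argmax(p_c)]))                       # strongest Capon peak, near a true DOA
```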
The situation is more scattered in the broadband signal case. Various spectral estimators have been proposed, but there does not exist any common framework for organizing these approaches. The steered conventional beamformer approach applies to broadband signals. The delay-and-sum beamformer (DSB) is steered to all possible DOAs to determine the DOA which emits the most energy. An alternative formulation of this approach is termed the "steered-response power" (SRP) method, which exploits the fact that the DSB output power may be written as a sum of cross-correlations. The computational requirements of the SRP method are a hindrance to practical implementation [8]. A detailed treatment of steered-beamformer approaches to source localization is given in [10], and the statistical optimality of the approach is shown in [11]–[13]. Krolik and Swingler develop a broadband minimum variance estimator based on the steered conventional beamformer [14], which may be viewed as an adaptive weighted SRP algorithm. There have also been approaches that generalize narrowband localization algorithms (i.e., MUSIC [15]) to broadband signals through subband processing and subsequent combining (see [16], for example). A broadband linear spatial predictive approach to time delay estimation is outlined in [17] and [18]. This approach, which is limited to linear array geometries, makes use of all the channels in a joint fashion via the time delay parameterized spatial correlation matrix.
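The equivalence between the DSB output power and a sum of cross-correlations, which underlies the SRP method, is easy to verify numerically. The sketch below does so for an arbitrary set of integer steering delays; the data and the delay values are placeholders, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n_mics, n_samples = 4, 1000
x = rng.standard_normal((n_mics, n_samples))          # toy microphone signals
delays = np.array([0, 3, 5, 2])                       # assumed integer steering delays, in samples

# delay-and-sum beamformer output for the assumed steering delays
aligned = np.array([np.roll(x[n], -delays[n]) for n in range(n_mics)])
y = aligned.sum(axis=0)
dsb_power = np.sum(y ** 2)

# steered-response power: sum of zero-lag cross-correlations of the aligned channels
srp = sum(np.dot(aligned[i], aligned[j]) for i in range(n_mics) for j in range(n_mics))

print(np.allclose(dsb_power, srp))                    # True: the two quantities coincide
```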
This paper attempts to unify broadband spatial spectral estimators into a single framework and compares their performance from a DOA estimation standpoint to TDOA-based algorithms. This unified framework is the azimuth parameterized spatial correlation matrix, which is at the heart of all broadband spatial spectral estimators.
In addition, several new ideas are presented. First, due to the parametrization, well-known narrowband array processing notions [19] are applied to the DOA estimation problem, generalizing these ideas to the broadband case. A DOA estimator based on the eigenanalysis of the parameterized spatial correlation matrix ensues. More importantly, it is shown that this eigenanalysis allows one to estimate the channel attenuation from factors such as uncalibrated microphones. The existing minimum variance approach to broadband spatial spectral estimation is reformulated in the context of a more general signal model which accounts for such attenuation factors. Furthermore, the ideas of [17] and [18] are extended to more general array geometries (i.e., circular) via the azimuth parameterized spatial correlation matrix, resulting in a minimum entropy DOA estimator.
Circular arrays (see [20]–[22], for example) offer some advantages over their linear counterparts. A circular array provides spatial discrimination over the entire 360° azimuth range, which is particularly important for applications that require front-to-back signal enhancement, such as teleconferencing. Furthermore, a circular array geometry allows for more compact designs. While the contents of this paper apply generally to planar array geometries, the circular geometry is used throughout the simulation portion.
`
`Fig. 1. Circular array geometry.
`
Section II presents the signal propagation model in planar (i.e., circular) arrays and serves as the foundation for the remainder of the paper. Section III reviews the role of the traditional, nonparameterized spatial correlation matrix in narrowband DOA estimation, and shows how the parameterized version of the spatial correlation matrix allows for generalization to broadband signals. Section IV describes the existing and proposed broadband spatial spectral estimators in terms of the parameterized spatial correlation matrix. Section V outlines the simulation model employed throughout this paper and evaluates the performance of all spatial spectral estimators and TDOA-based methods in both reverberation- and noise-limited environments. Concluding statements are given in Section VI.

The spatial spectral estimation approach to DOA estimation has limitations in certain reverberant environments. If an interfering signal or reflection arrives at the array with a higher energy than the direct-path signal, the DOA estimate will be false, even though the spatial spectral estimate is accurate. Such situations arise when the source is oriented towards a reflective barrier and away from the array. This problem is beyond the scope of this paper and is not addressed herein. Rather, the focus of this paper is on the evaluation of spatial spectral estimators in noisy and reverberant environments and on their application to DOA estimation.
`
`II. SIGNAL MODEL
`
Assume a planar array of $N$ elements in a 2-D geometry, shown in Fig. 1 (i.e., circular geometry), whose outputs are denoted by $x_n(k)$, $n = 0, 1, \ldots, N-1$, where $k$ is the time index. Denoting the azimuth angle of arrival by $\phi$, propagation of the signal from a far-field source to microphone $n$ is modeled as

$$x_n(k) = \alpha_n\, s\!\left[k - t - \mathcal{F}_n(\phi)\right] + w_n(k) \qquad (1)$$

where $\alpha_n$, $n = 0, 1, \ldots, N-1$, are the attenuation factors due to channel effects, $t$ is the propagation time, in samples, from the unknown source $s(k)$ to microphone 0, $w_n(k)$ is an additive noise signal at the $n$th microphone, and $\mathcal{F}_n(\phi)$, $n = 1, 2, \ldots, N-1$, is the relative delay between microphones 0 and $n$.
`
`
In matrix form, the array signal model becomes

$$\mathbf{x}(k) =
\begin{bmatrix} x_0(k) \\ x_1(k) \\ \vdots \\ x_{N-1}(k) \end{bmatrix} =
\begin{bmatrix} \alpha_0\, s(k-t) \\ \alpha_1\, s\!\left[k-t-\mathcal{F}_1(\phi)\right] \\ \vdots \\ \alpha_{N-1}\, s\!\left[k-t-\mathcal{F}_{N-1}(\phi)\right] \end{bmatrix} +
\begin{bmatrix} w_0(k) \\ w_1(k) \\ \vdots \\ w_{N-1}(k) \end{bmatrix}. \qquad (2)$$

The function $\mathcal{F}_n(\phi)$ relates the angle of arrival to the relative delays between microphone elements 0 and $n$, and is derived for the case of an equispaced circular array in the following manner. When operating in the far-field, the time delay between microphone $n$ and the center of the array is given by [23]

$$\tau_n(\phi) = \frac{r}{c}\cos\left(\phi - \psi_n\right) \qquad (3)$$

where the azimuth angle (relative to the selected angle reference) of the $n$th microphone is denoted by $\psi_n = 2\pi n/N$, $n = 0, 1, \ldots, N-1$, $r$ denotes the array radius, and $c$ is the speed of signal propagation. It easily follows that

$$\mathcal{F}_n(\phi) = \tau_0(\phi) - \tau_n(\phi) = \frac{r}{c}\left[\cos\phi - \cos\left(\phi - \psi_n\right)\right]. \qquad (4)$$
`
It is also worth mentioning that the additive noise $w_n(k)$ may be temporally correlated with the desired signal $s(k)$. In that case, a reverberant environment is modeled. The anechoic environment is modeled by making the additive noise temporally uncorrelated with the source signal. In either case, the additive noise may be spatially correlated across the sensors.

It should also be stated that the signal model presented above makes use of the far-field assumption, in that the incoming wave is assumed to be planar, such that all sensors perceive the same DOA. An error is incurred if the signal source is actually located in the near-field; in that case, the relative delays are also a function of the range. In the most general case (i.e., a source in the near-field of a 3-D geometry), the function $\mathcal{F}_n$ takes three parameters: the azimuth, range, and elevation. This paper focuses on a specific subset of this general model: a source located in the far-field with only a slight elevation, such that a single parameter suffices. This is commonly the case in a teleconferencing environment. Nevertheless, the concepts of this paper, although presented in a far-field planar context, easily generalize to the near-field spherical case by including the range and elevation in the forthcoming parametrization.
`
III. PARAMETERIZED SPATIAL CORRELATION MATRIX

In narrowband signal applications, a common space-time statistic is that of the spatial correlation matrix [19], which is given by

$$\mathbf{R} = E\left\{\mathbf{x}(k)\,\mathbf{x}^{H}(k)\right\} \qquad (5)$$

where

$$\mathbf{x}(k) = \left[\,x_0(k)\ \ x_1(k)\ \ \cdots\ \ x_{N-1}(k)\,\right]^{T} \qquad (6)$$

the superscript $H$ denotes conjugate transpose, as complex signals are commonly used in narrowband applications, and $T$ denotes the transpose of a matrix or vector. To steer these array outputs to a particular DOA, one applies a complex weight to each sensor output, whose phase performs the steering, and then sums the sensor outputs to form the output beam. Now, if the input signal is no longer narrowband, each frequency requires its own complex weight to appropriately phase-shift the signal at that frequency. In the context of broadband spatial spectral estimation, the spatial correlation matrix may be computed at each temporal frequency, and the resulting spatial spectrum is now a function of the temporal frequency. For broadband applications, these narrowband estimates may be assimilated into a time-domain statistic, a procedure termed "focusing," which is described in [24]. The resulting structure is termed a "focused covariance matrix."
In this paper, broadband spatial spectral estimation is addressed in another manner. Instead of implementing the steering delays in the complex weighting at each sensor, the delays are actually implemented as a time delay in the spatial correlation matrix, which is now parameterized. Thus, each microphone output is appropriately delayed before computing this parameterized spatial correlation matrix:

$$\mathbf{x}(k, \phi) = \left[\,x_0(k)\ \ x_1\!\left(k + \mathcal{F}_1(\phi)\right)\ \ \cdots\ \ x_{N-1}\!\left(k + \mathcal{F}_{N-1}(\phi)\right)\,\right]^{T} \qquad (7)$$

and real signals are assumed from this point on. The delays are a function of the assumed azimuth DOA, which becomes the parameter. The parameterized spatial correlation matrix is formally written as shown by (8) and (9) at the bottom of the page.
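As a rough discrete-time sketch of this parameterization (not the authors' implementation), the following estimates the azimuth-parameterized spatial correlation matrix by advancing each channel by its rounded relative delay and averaging the outer products of the steered snapshots. The function names and the delay_fn callable, which is assumed to return the per-microphone delays F_n(phi) in samples, are hypothetical.

```python
import numpy as np

def parameterized_scm(x, delays_samples):
    """Estimate the parameterized spatial correlation matrix R(phi):
    each channel is advanced by its (rounded) relative delay F_n(phi)
    before the outer products of the steered snapshots are averaged."""
    n_mics, n_samples = x.shape
    shifts = np.rint(delays_samples).astype(int)
    # keep only the sample range where every shifted channel stays in bounds
    lo, hi = max(0, -shifts.min()), n_samples - max(0, shifts.max())
    steered = np.array([x[n, lo + shifts[n]: hi + shifts[n]] for n in range(n_mics)])
    return steered @ steered.T / steered.shape[1]

def steered_power(x, phi_grid, delay_fn):
    """Scan candidate azimuths; the steered power is the sum of all entries of R(phi)."""
    return np.array([parameterized_scm(x, delay_fn(phi)).sum() for phi in phi_grid])
```

Summing all entries of the parameterized matrix recovers the DSB output power, so scanning steered_power over an azimuth grid and taking the maximizer corresponds to the fixed-weighted SRP estimate discussed later in the paper.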
The matrix appearing in this formulation is not simply the array observation matrix, as is commonly used in narrowband beamforming models. Instead, it