`
Sound Capture and Processing
Practical Approaches
`
`
`
`
`
`
Sound Capture and Processing
`Practical Approaches
`
Ivan J. Tashev
`Microsoft Research, USA
`
WILEY
`
A John Wiley and Sons, Ltd., Publication
`
`
`
`To my family: the time to write
`this book was taken from them
`
`This edition first published 2009
© 2009 John Wiley & Sons, Ltd.
`
`Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom
`
For details of our global editorial offices, for customer services and for information about how to apply for permission to
reuse the copyright material in this book please see our website at www.wiley.com.
`
`The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright,
`Designs and Patents Act 1988.
`
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
`any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by
`the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.
`
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available
`in electronic books.
`
Designations used by companies to distinguish their products are often claimed as trademarks. All brand names
and product names used in this book are trade names, service marks, trademarks or registered trademarks of their
respective owners. The publisher is not associated with any product or vendor mentioned in this book. This publication
is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on
the understanding that the publisher is not engaged in rendering professional services. If professional advice or other
expert assistance is required, the services of a competent professional should be sought.
`
MATLAB® is a trademark of The MathWorks, Inc., and is used with permission. The MathWorks does not warrant
the accuracy of the text or exercises in this book. This book's use or discussion of MATLAB® software or
related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical
approach or particular use of MATLAB® software.
`
Library of Congress Cataloging-in-Publication Data
`
Tashev, Ivan J. (Ivan Jelev)
Sound capture and processing : practical approaches / Ivan J. Tashev.
`p. cm.
`Includes index.
`ISBN 978-0-470-31983-3 (cloth)
`1. Speech processing systems. 2. Sound-Recording and reproducing-Digital
techniques. 3. Signal processing-Digital techniques. I. Title.
`TK7882.S65T37 2009
621.382'8-dc22
`
`2009011987
`
`A catalogue record for this book is available from the British Library.
`
`ISBN 978-0-470-31983-3 (H/B)
`
Typeset in 11/13pt Times by Thomson Digital, Noida, India.
`Printed and bound in Great Britain by CPI Antony Rowe, Chippenham, Wiltshire.
`
`
`
`Contents
`
About the Author  xv
Foreword  xvii
Preface  xix
Acknowledgements  xxi

1 Introduction  1
1.1 The Need for, and Consumers of, Sound Capture and Audio Processing Algorithms  1
1.2 Typical Sound Capture System  2
1.3 The Goal of this Book and its Target Audience  3
1.4 Prerequisites  4
1.5 Book Structure  4
1.6 Exercises  5

2 Basics  7
2.1 Noise: Definition, Modeling, Properties  7
2.1.1 Statistical Properties  7
2.1.2 Spectral Properties  9
2.1.3 Temporal Properties  11
2.1.4 Spatial Characteristics  11
2.2 Signal: Definition, Modeling, Properties  12
2.2.1 Statistical Properties  13
2.2.2 Spectral Properties  16
2.2.3 Temporal Properties  17
2.2.4 Spatial Characteristics  18
2.3 Classification: Suppression, Cancellation, Enhancement  19
2.3.1 Noise Suppression  19
2.3.2 Noise Cancellation  20
2.3.3 Active Noise Cancellation  20
2.3.4 De-reverberation  21
2.3.5 Speech Enhancement  21
2.3.6 Acoustic Echo Reduction  21
2.4 Sampling and Quantization  23
2.4.1 Sampling Process and Sampling Theorem  23
`
`
`
2.4.2 Quantization  25
2.4.3 Signal Reconstruction  27
2.4.4 Errors During Real Discretization  29
2.4.4.1 Discretization with a Non-ideal Sampling Function  29
2.4.4.2 Sampling with Averaging  30
2.4.4.3 Sampling Signals with Finite Duration  31
2.5 Audio Processing in the Frequency Domain  32
2.5.1 Processing in the Frequency Domain  32
2.5.2 Properties of the Frequency Domain Representation  33
2.5.3 Discrete Fourier Transformation  35
2.5.4 Short-time Transformation and Weighting  36
2.5.5 Overlap-add Process  37
2.5.6 Spectrogram: Time-Frequency Representation of the Signal  40
2.5.7 Other Methods for Transformation to the Frequency Domain  42
2.5.7.1 Lapped Transformations  42
2.5.7.2 Cepstral Analysis  43
2.6 Bandwidth Limiting  45
2.7 Signal-to-Noise-Ratio: Definition and Measurement  48
2.8 Subjective Quality Measurement  49
2.9 Other Methods for Quality and Enhancement Measurement  50
2.10 Summary  52
Bibliography  53

3 Sound and Sound Capturing Devices  55
3.1 Sound and Sound Propagation  55
3.1.1 Sound as a Longitudinal Mechanical Wave  55
3.1.2 Frequency of the Sound Wave  56
3.1.3 Speed of Sound  58
3.1.4 Wavelength  60
3.1.5 Sound Wave Parameters  61
3.1.5.1 Intensity  61
3.1.5.2 Sound Pressure Level  61
3.1.5.3 Power  62
3.1.5.4 Sound Attenuation  63
3.1.6 Huygens' Principle, Diffraction, and Reflection  63
3.1.7 Doppler Effect  65
3.1.8 Weighting Curves and Measuring Sound Pressure Levels  66
3.2 Microphones  68
3.2.1 Definition  68
3.2.2 Microphone Classification by Conversion Type  69
3.3 Omnidirectional and Pressure Gradient Microphones  70
3.3.1 Pressure Microphone  70
3.3.2 Pressure-gradient Microphone  71
3.4 Parameter Definitions  73
3.4.1 Microphone Sensitivity  73
3.4.2 Microphone Noise and Output SNR  74
`
3.4.3 Directivity Pattern  74
3.4.4 Frequency Response  75
3.4.5 Directivity Index  75
3.4.6 Ambient Noise Suppression  77
3.4.7 Additional Electrical Parameters  77
3.4.8 Manufacturing Tolerances  78
3.5 First-order Directional Microphones  82
3.6 Noise-canceling Microphones and the Proximity Effect  84
3.7 Measurement of Microphone Parameters  87
3.7.1 Sensitivity  87
3.7.2 Directivity Pattern  87
3.7.3 Self Noise  90
3.8 Microphone Models  92
3.9 Summary  92
Bibliography  93

4 Single-channel Noise Reduction  95
4.1 Noise Suppression as a Signal Estimation Problem  96
4.2 Suppression Rules  96
4.2.1 Noise Suppression as Gain-based Processing  96
4.2.2 Definition of A-Priori and A-Posteriori SNRs  97
4.2.3 Wiener Suppression Rule  98
4.2.4 Artifacts and Distortions  99
4.2.5 Spectral Subtraction Rule  100
4.2.6 Maximum-likelihood Suppression Rule  100
4.2.7 Ephraim and Malah Short-term MMSE Suppression Rule  102
4.2.8 Ephraim and Malah Short-term Log-MMSE Suppression Rule  103
4.2.9 More Efficient Solutions  103
4.2.10 Exploring Other Probability Distributions of the Speech Signal  105
4.2.11 Probability-based Suppression Rules  108
4.2.12 Comparison of the Suppression Rules  111
4.3 Uncertain Presence of the Speech Signal  115
4.3.1 Voice Activity Detectors  115
4.3.1.1 ROC Curves  116
4.3.1.2 Simple VAD with Dual-time-constant Integrator  118
4.3.1.3 Statistical-model-based VAD with Likelihood Ratio Test  122
4.3.1.4 VAD with Floating Threshold and Hangover Scheme with State Machine  123
4.3.2 Modified Suppression Rule  124
4.3.3 Presence Probability Estimators  126
4.4 Estimation of the Signal and Noise Parameters  126
4.4.1 Noise Models: Updating and Statistical Parameters  126
4.4.2 A-Priori SNR Estimation  127
4.5 Architecture of a Noise Suppressor  130
4.6 Optimizing the Entire System  137
4.7 Specialized Noise-reduction Systems  139
`
`95
`96
`96
`96
`97
`98
`99
`100
`100
`102
`103
`103
`105
`108
`lll
`ll5
`ll5
`ll6
`ll8
`122
`
`123
`124
`126
`126
`126
`127
`130
`137
`139
`
`
`
4.7.1 Adaptive Noise Cancellation  139
4.7.2 Psychoacoustic Noise Suppression  142
4.7.2.1 Human Hearing Organ  142
4.7.2.2 Loudness  143
4.7.2.3 Masking Effects  144
4.7.2.4 Perceptually Balanced Noise Suppressors  149
4.7.3 Suppression of Predictable Components  150
4.7.4 Noise Suppression Based on Speech Modeling  157
4.8 Practical Tips and Tricks for Noise Suppression  158
4.8.1 Model Initialization and Tracking  158
4.8.2 Averaging in the Frequency Domain  159
4.8.3 Limiting  159
4.8.4 Minimal Gain  159
4.8.5 Overflow and Underflow  160
4.8.6 Dealing with High Signal-to-Noise Ratios  160
4.8.7 Fast Real-time Implementation  161
4.9 Summary  161
Bibliography  162

5 Sound Capture with Microphone Arrays  165
5.1 Definitions and Types of Microphone Array  165
5.1.1 Transducer Arrays and their Applications  165
5.1.2 Specifics of Array Processing for Audio Applications  169
5.1.3 Types of Microphone Arrays  171
5.1.3.1 Linear Microphone Arrays  171
5.1.3.2 Circular Microphone Arrays  172
5.1.3.3 Planar Microphone Arrays  173
5.1.3.4 Volumetric (3D) Microphone Arrays  173
5.1.3.5 Specialized Microphone Arrays  174
5.2 The Sound Capture Model and Beamforming  174
5.2.1 Coordinate System  174
5.2.2 Sound Propagation and Capture  176
5.2.2.1 Near-field Model  176
5.2.2.2 Far-field Model  177
5.2.3 Spatial Aliasing and Ambiguity  178
5.2.4 Spatial Correlation of the Microphone Signals  181
5.2.5 Delay-and-Sum Beamformer  182
5.2.6 Generalized Filter-and-Sum Beamformer  187
5.3 Terminology and Parameter Definitions  188
5.3.1 Terminology  188
5.3.2 Directivity Pattern and Directivity Index  190
5.3.3 Beam Width  192
5.3.4 Array Gain  193
5.3.5 Uncorrelated Noise Gain  194
5.3.6 Ambient Noise Gain  194
5.3.7 Total Noise Gain  195
`
5.3.8 IDOA Space Definition  195
5.3.9 Beamformer Design Goal and Constraints  197
5.4 Time-invariant Beamformers  198
5.4.1 MVDR Beamformer  198
5.4.2 More Realistic Design - Adding the Microphone Self Noise  201
5.4.3 Other Criteria for Optimality  202
5.4.4 Beam Pattern Synthesis  203
5.4.4.1 Beam Pattern Synthesis with the Cosine Function  203
5.4.4.2 Beam Pattern Synthesis with Dolph-Chebyshev Polynomials  205
5.4.4.3 Practical Use of Beam Pattern Synthesis  207
5.4.5 Beam Width Optimization  207
5.4.6 Beamformer with Direct Optimization  210
5.5 Channel Mismatch and Handling  213
5.5.1 Reasons for Channel Mismatch  213
5.5.2 How Manufacturing Tolerances Affect the Beamformer  215
5.5.3 Calibration and Self-calibration Algorithms  218
5.5.3.1 Classification of Calibration Algorithms  218
5.5.3.2 Gain Self-calibration Algorithms  219
5.5.3.3 Phase Self-calibration Algorithm  222
5.5.3.4 Self-calibration Algorithms - Practical Use  222
5.5.4 Designs Robust to Manufacturing Tolerances  223
5.5.4.1 Tolerances as Uncorrelated Noise  223
5.5.4.2 Cost Functions and Optimization Goals  224
5.5.4.3 MVDR Beamformer Robust to Manufacturing Tolerances  225
5.5.4.4 Beamformer with Direct Optimization Robust to Manufacturing Tolerances  225
5.5.4.5 Balanced Design for Handling the Manufacturing Tolerances  230
5.6 Adaptive Beamformers  231
5.6.1 MVDR and MPDR Adaptive Beamformers  231
5.6.2 LMS Adaptive Beamformers  231
5.6.2.1 Widrow Beamformer  232
5.6.2.2 Frost Beamformer  232
5.6.3 Generalized Side-lobe Canceller  233
5.6.3.1 Griffiths-Jim Beamformer  233
5.6.3.2 Robust Generalized Side-lobe Canceller  235
5.6.4 Adaptive Algorithms for Microphone Arrays - Summary  236
5.7 Microphone-array Post-processors  236
5.7.1 Multimicrophone MMSE Estimator  237
5.7.2 Post-processor Based on Power Densities Estimation  238
5.7.3 Post-processor Based on Noise-field Coherence  240
5.7.4 Spatial Suppression and Filtering in the IDOA Space  241
5.7.4.1 Spatial Noise Suppression  242
5.7.4.2 Spatial Filtering  244
`
`
`
5.7.4.3 Spatial Filter in Side-lobe Canceller Scheme  247
5.7.4.4 Combination with LMS Adaptive Filter  248
5.8 Specific Algorithms for Small Microphone Arrays  250
5.8.1 Linear Beamforming Using the Directivity of the Microphones  251
5.8.2 Spatial Suppressor Using Microphone Directivity  254
5.8.2.1 Time-invariant Linear Beamformers  255
5.8.2.2 Feature Extraction and Statistical Models  256
5.8.2.3 Probability Estimation and Features Fusion  258
5.8.2.4 Estimation of Optimal Time-invariant Parameters  258
5.9 Summary  260
Bibliography  261

6 Sound Source Localization and Tracking with Microphone Arrays  263
6.1 Sound Source Localization  263
6.1.1 Goal of Sound Source Localization  263
6.1.2 Major Scenarios  264
6.1.3 Performance Limitations  266
6.1.4 How Humans and Animals Localize Sounds  266
6.1.5 Anatomy of a Sound Source Localizer  270
6.1.6 Evaluation of Sound Source Localizers  271
6.2 Sound Source Localization from a Single Frame  272
6.2.1 Methods Based on Time Delay Estimation  272
6.2.1.1 Time Delay Estimation for One Pair of Microphones  272
6.2.1.2 Combining the Pairs  278
6.2.2 Methods Based on Steered-response Power  280
6.2.2.1 Conventional Steered-response Power Algorithms  281
6.2.2.2 Weighted Steered-response Power Algorithm  281
6.2.2.3 Maximum-likelihood Algorithm  282
6.2.2.4 MUSIC Algorithm  282
6.2.2.5 Combining the Bins  284
6.2.2.6 Comparison of the Steered-response Power Algorithms  285
6.2.2.7 Particle Filters  286
6.3 Post-processing Algorithms  291
6.3.1 Purpose  291
6.3.2 Simple Clustering  294
6.3.2.1 Grouping the Measurements  294
6.3.2.2 Determining the Number of Cluster Candidates  294
6.3.2.3 Averaging the Measurements in Each Cluster Candidate  295
6.3.2.4 Reduction of the Potential Sound Sources  296
6.3.3 Localization and Tracking of Multiple Sound Sources  296
6.3.3.1 k-Means Clustering  297
6.3.3.2 Fuzzy C-means Clustering  298
6.3.3.3 Tracking the Dynamics  299
6.4 Practical Approaches and Tips  300
6.4.1 Increasing the Resolution of Time-delay Estimates  300
6.4.2 Practical Alternatives for Finding the Peaks  301
`
6.4.3 Peak Selection and Weighting  301
6.4.4 Assigning Confidence Levels and Precision  302
6.5 Summary  303
Bibliography  304

7 Acoustic Echo-reduction Systems  307
7.1 General Principles and Terminology  307
7.1.1 Problem Description  307
7.1.2 Acoustic Echo Cancellation  309
7.1.3 Acoustic Echo Suppression  311
7.1.4 Evaluation Parameters  312
7.2 LMS Solution for Acoustic Echo Cancellation  313
7.3 NLMS and RLS Algorithms  315
7.4 Double-talk Detectors  316
7.4.1 Principle and Evaluation  316
7.4.2 Geigel Algorithm  317
7.4.3 Cross-correlation Algorithms  317
7.4.4 Coherence Algorithms  319
7.5 Non-linear Acoustic Echo Cancellation  320
7.5.1 Non-linear Distortions  320
7.5.2 Non-linear AEC with Adaptive Volterra Filters  321
7.5.3 Non-linear AEC Using Orthogonalized Power Filters  322
7.5.4 Non-linear AEC in the Frequency Domain  323
7.6 Acoustic Echo Suppression  323
7.6.1 Estimation of the Residual Energy  323
7.6.2 Suppressing the Echo Residual  325
7.7 Multichannel Acoustic Echo Reduction  327
7.7.1 The Non-uniqueness Problem  327
7.7.2 Tracking the Changes  329
7.7.3 Decorrelation of the Channels  329
7.7.4 Multichannel Acoustic Echo Suppression  330
7.7.5 Reducing the Degrees of Freedom  331
7.8 Practical Aspects of the Acoustic Echo-reduction Systems  334
7.8.1 Shadow Filters  334
7.8.2 Center Clipper  334
7.8.3 Feedback Prevention  335
7.8.4 Tracking the Clock Drifts  335
7.8.5 Putting Them All Together  336
7.9 Summary  337
Bibliography  338

8 De-reverberation  341
8.1 Reverberation and Modeling  341
8.1.1 Reverberation Effect  341
8.1.2 How Reverberation Affects Humans  345
8.1.3 Reverberation and Speech Recognition  347
`
`
`
`
`
`Foreword
`
`Just a couple of decades ago we would think of "sound capture and
`processing" as the problems of designing microphones for converting
`sounds from the real world into electrical signals, as well as amplifying,
`editing, recording, and transmitting such signals, mostly using analog
`hardware technologies. That's because our intended applications were
`mostly analog telephony, broadcasting, and voice and music recording.
`We have come a long way: small digital audio players have replaced bulky
`portable cassette tape players, and people make voice calls mostly via
digital mobile phones and voice communication software in their computers.
Thanks to the evolution of digital signal processing technologies, we now focus mostly
on processing sounds not as analog electrical signals, but rather as digital files or data streams in
a computer or digital device. We can do a lot more with digital sound processing, such as
transcribe speech into text, identify persons speaking, recognize music from humming, remove
noises much more efficiently, add special effects, and so much more. Thus, today we think of
`sound capture as the problem of digitally processing the signals captured by microphones so as
`to improve their quality for best performance in digital communications, broadcasting,
`recording, recognition, classification, and other applications.
This book by Ivan Tashev provides a comprehensive yet concise overview of the fundamental
problems and core signal processing algorithms for digital sound capture, including
`ambient noise reduction, acoustic echo cancellation, and reduction of reverberation. After
`introducing the necessary basic aspects of digital audio signal processing, the book presents
`basic physical properties of sound and propagation of sound waves, as well as a review of
`microphone technologies, providing the reader with a strong understanding of key aspects of
digitized sounds. The book discusses the fundamental problems of noise reduction, which are
`usually solved via techniques based on statistical models of the signals of interest (typically
`voice) and of interfering signals. An important discussion of properties of the human auditory
`system is also presented; auditory models can play a very important role in algorithms for
`enhancing audio signals in communication and recording/playback applications, where the
`final destination is the human ear.
`Microphone arrays have become increasingly important in the past decade or so. Thanks to
`the rapid evolution and reduction in cost of analog and digital electronics in recent years, it is
`inexpensive to capture sound through several channels, using an array of microphones. That
`opens new opportunities for improving sound capture, such as detecting the direction of
`incoming sounds and applying spatial filtering techniques. The book includes two excellent
`
`
`
`
chapters whose coverage goes from the basics of microphone array configurations and
delay-and-sum beamforming, to modern sophisticated algorithms for high-performance
multichannel signal enhancement.
`Acoustic echoes and reverberation are the two most important kinds of signal degradations in
many sound capture scenarios. If you're a professional singer, you probably don't mind holding
a microphone or wearing a headset with a microphone close to your mouth, but most of us prefer
microphones to be invisible, far away from our mouths. That means the microphone will capture
`not only our own voices, but also reverberation components because of sound reflections from
`nearby walls, as well as echoes of signals that are being played back from loudspeakers.
`Removing such undesirable artifacts presents significant technical challenges, which are well
`addressed in the final two chapters, which present modern algorithms for tackling them.
`A key quality of this book is that it presents not only fundamental theoretical analyses,
`models, and algorithms, but it also considers many practical aspects that are very important for
`the design of real-world engineering solutions to sound capture problems. Thus, this book
`should be of great appeal to both students and engineers.
`I have had the pleasure of working with Ivan on research and development of sound capture
`systems and algorithms. His enthusiasm, deep engineering and mathematical knowledge, and
`pragmatic approaches were all contagious. His work has had significant practical impact, for
`example the introduction of multichannel sound capture and processing modules in the
`Microsoft Windows operating system. I have learned a considerable amount about sound
`capturing and processing from my interactions with Ivan, and I am sure you will, as well, by
`reading this book. Enjoy!
`
`Henrique Malvar
`Managing Director
`Microsoft Research
`Redmond Laboratory
`
`Preface
`
`Capturing and processing sounds is critical in mobile and handheld devices, communication
`systems, and computers using automatic speech recognition. Devices and technologies for
`proper conversion of sounds to electric signals and removing unwanted parts, such as noise and
reverberation, have been used since the first telephones. They have evolved, becoming more and
more complex. In many cases the existing algorithms exceed the ability of the typical processors
in these devices and computers to provide real-time processing of the captured signal.
`This book will discuss the basic principles for building an audio processing stack, sound
`capturing devices, single-channel speech-enhancement algorithms, and microphone arrays for
`sound capture and sound source localization. Further, algorithms will be described for acoustic
`echo cancellation and de-reverberation - building blocks of a sound capture and processing
`stack for telecommunication and speech recognition. Wherever possible the various algorithms
`are discussed in the order of their development and publication. In all cases the aim is to try to
give the larger picture - where the technology came from, what worked and what had to be
adapted for the needs of audio processing. This gives a better perspective for further
`development of new audio signal processing algorithms.
Even the best equations and signal processing algorithms are not worth anything until they are
implemented and verified by processing real data. That is why, in this book, stress is
`placed on experimenting with recorded sounds and implementation of the algorithms. In
`practice, frequently a simpler model with fewer parameters to estimate works better than a more
`precise but more complex model with a larger number of parameters. With the latter one has
`either to sacrifice estimation precision or to increase the estimation time. This balance of
`simplicity, precision, and reaction time is critical for real-time systems, where on top of
`everything we have to watch out for parameters such as latency, consumed memory, and CPU
`time.
`Most of the algorithms and approaches described in this book are based on statistical models.
`In mathematics, a single example cannot prove but can disprove a theorem. In statistical signal
`processing, a single example is ... just a sample. What matters is careful evaluation of the
`algorithms with a good corpus of speech or audio signals, distributed in their signal-to-noise
`ratios, type of noise, and other parameters - as close as possible to the real problem we are trying
`to solve.
`The solution of practically any signal processing problem can be improved by tuning the
`parameters of the algorithm, provided we have a proper criterion for optimality. There are
always adaptation time constants and thresholds that cannot be estimated analytically; their values
have to be adjusted experimentally. The mathematical models and solutions we use are usually
`
`
`
`xx
`
`Preface
`
`optimal in one or another way. If they reflect properly the nature of the process they model, then
`we have a good solution and the results are satisfactory. In all cases it is important to remember
`that we do not want a "minimum mean-square error solution," or a "maximum-likelihood
`solution," or even a "log minimum mean-square error solution." We do not want to improve
`the signal-to-noise ratio. What we want is for listeners to perceive the sound quality of the
`processed signal as better - improved - compared to the input signal. From this perspective,
the final judge of an algorithm's quality is the human ear, so use it to verify the solution.
Hearing is an important sense for humans and animals. In many places this book provides
examples of how humans and animals hear and localize sounds; this better explains some
signal processing approaches and suggests biology-inspired designs for sound capture and
processing systems.
`In many cases the signal processing chain consists of several algorithms for sound capture
and speech enhancement. Practice shows that a sequence of separately optimized
`algorithms usually provides suboptimal results. Tuning and optimization of the designed sound
`capturing system end-to-end is a must if we want to achieve best results.
`For further information please visit http://www.wiley.com/go/tashev sound
`
`Ivan Tashev
`Redmond, WA
`USA
`
`Acknowledgements
`
I want to thank the Book Program at MathWorks, and especially Dee Savageau, Naomi
Fernandes, and Meg Vulliez, for their help and responsiveness. The MATLAB® scripts that are
part of this book were tested with MATLAB® R2007a, provided as part of this program.
I am grateful to my colleagues from Microsoft Research: Alex Acero, Amitav Das, Li Deng,
`Dinei Florencio, Cormac Herley, Zicheng Liu, Mike Seltzer, and Cha Zhang. They read the
`chapters of this book and provided valuable feedback.
And last, but not least, I want to say what a great pleasure it was to work with the nice and helpful
people from John Wiley & Sons, Ltd. During the long process from proposal, through writing,
`copyediting, and finalizing the book with all the details, they were always professional,
`understanding, and ready to suggest the right solution. I was lucky enough to work with Tiina
`Ruonamaa, Sarah Hinton, Sarah Tilley, and Catlin Flint - thank you all for everything you did
`during the process of writing this book!
`
`
`
`4
`
`Single-channel Noise Reduction
`
`This chapter deals with noise reduction of a single channel. We assume that we have a
`mixture of a useful signal, usually human speech, and an unwanted signal - which we
`call noise. The goal of this type of processing is to provide an estimate of the useful
`signal - an enhanced signal with better properties and characteristics.
`The problem with a noisy speech signal is that a human listener can understand a
`lower percentage of the spoken words. In addition, this understanding requires more
`mental effort on the part of the listener. This means that the listener can quickly lose
`attention - an unwanted outcome during meetings over a noisy telephone line, for
`example. If the noisy signal is sent to a speech recognition engine, the noise reduces the
`recognition rate as it masks speech features important for the recognizer.
`With noise-reduction algorithms, as with most other signal processing algorithms,
`there are multiple trade-offs. One is between better reduction of the unwanted noise
`signal and introduction of undesired effects - additional signals and distortions in the
`wanted speech signal. From this perspective, while improvement in the signal-to-noise
`ratio (SNR) remains the main evaluation criterion of the efficiency of these algorithms,
`subjective listening tests or objective sound quality evaluations are also important. The
`perfect noise-reduction algorithm will make the main speaker's voice more under(cid:173)
`standable so that it seems to stand out, while preserving relevant background noise
`(train station, party sounds, and so on). Such an algorithm should not introduce
noticeable distortions in either foreground (wanted speech) or background (unwanted
`noise) signals.
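Since the signal-to-noise ratio is the recurring objective measure here, a minimal sketch of how an input SNR can be computed when the signal and noise are available separately may be useful. The book's own scripts are in MATLAB; this sketch (like the others in this chapter) is plain Python, and every name in it is illustrative rather than taken from the book:

```python
import math
import random

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB, given separate signal and noise sample sequences."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10.0 * math.log10(p_signal / p_noise)

# Toy example: one second of a 440 Hz tone at 16 kHz, plus weak white noise.
random.seed(0)
signal = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [random.gauss(0.0, 0.1) for _ in range(16000)]
print(f"input SNR: {snr_db(signal, noise):.1f} dB")  # roughly 17 dB for these levels
```

In practice the clean signal and the noise are not available separately, which is why the chapter later estimates their parameters from the mixture; this sketch only fixes the definition.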
`Most single-channel algorithms are based on building statistical models of the
speech and noise signals. In this chapter we will look at the commonly used approaches
`for suppression of noise, the algorithms to distinguish between noise and voice (called
`"voice activity detectors"), and some adaptive noise-canceling algorithms. Exercises
`with implementation of some of these algorithms will be provided for better
`understanding of the processes inside the noise suppressors.
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`probabilistic rules. The maximum-likelihood suppression rule is definitely worst in
`this sense.
`From the LSD perspective, the front runners are MMSE and log-MMSE (which is
`optimal in the log-MMSE sense). Good results are shown by the entire group of
`efficient alternatives. Note that Wiener and probabilistic rules are worse from this
`perspective, which means that they do not deal well with low levels of noise and speech.
The best average SNR improvement definitely belongs to the Wiener and probabilistic
rules, followed by the efficient alternatives and spectral subtraction. The maximum-likelihood
rule, as expected, has the lowest improvement in SNR. It is outperformed
by the approximate Wiener suppression rule.
The highest MOS score and the best sound are achieved by log-MMSE and MAP
`SAE, followed closely by the group of efficient alternatives. The maximum-likelihood
`suppression rule sounds worse owing to a substantial amount of noise.
`Figure 4.9 shows the relationship between the average improvement of SNR and the
`average MOS score - the last two columns in Table 4.2. It is clear that, to a certain
`degree, the noise suppression helps, and the signals with more suppressed noise achieve
`better perceptual sound quality. Enforcing the noise suppression further actually
`decreases the sound quality, regardless of the better SNR. This is good evidence that,
when evaluating noise-suppressing algorithms, improvement in the SNR should not
be used as the only criterion, or even as the main evaluation criterion. Ultimately the
`goal of this type of speech enhancement is to make the output signal sound better for the
`human listener. From this perspective, the MOS is a much better criterion. When
`targeting speech recognition, the best criterion is, of course, the highest recognition rate.
`
certain errors. Thus it is important how robust each of these suppression rules
is to those errors.
`
`EXERCISE
`
`Look at the MATLAB script SuppressionRule.m which returns the suppression rule
`values for the given vectors of a-priori and a-posteriori SNRs:
`
Gain = SuppressionRule(gamma, xi, SuppressionType)
`
The argument SuppressionType is a number from 0 to 9 and determines which
suppression rule is to be used. The script contains implementations of most of the
suppression rules discussed so far. Finish the implementation of the rest of the
suppression rules.
Write a MATLAB script that computes the suppression rules as a function of the
a-priori and a-posteriori SNRs in the range of ±30 dB. Limit the gain values to the range
from -40 dB to +20 dB and plot the rules in three dimensions using the mesh
function.
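For readers working through the exercise without MATLAB, the shape of these rules can also be explored with a short Python sketch. It implements three of the rules discussed in this chapter in their standard textbook forms, with gamma and xi denoting the a-posteriori and a-priori SNRs as in the exercise; the function names and the text-mode printout are illustrative, and the script is not a port of SuppressionRule.m:

```python
import math

def suppression_gain(gamma, xi, rule="wiener"):
    """Linear gain for one frequency bin.

    gamma: a-posteriori SNR (noisy power over noise power), linear scale.
    xi:    a-priori SNR (clean power over noise power), linear scale.
    The formulas are the standard textbook forms of these rules.
    """
    if rule == "wiener":
        return xi / (1.0 + xi)
    if rule == "spectral_subtraction":
        return math.sqrt(max(0.0, (gamma - 1.0) / gamma))
    if rule == "maximum_likelihood":
        # Never drops below 0.5, which is why this rule suppresses the least.
        return 0.5 * (1.0 + math.sqrt(max(0.0, (gamma - 1.0) / gamma)))
    raise ValueError(f"unknown rule: {rule}")

def to_db(gain, floor_db=-40.0, ceil_db=20.0):
    """Gain in dB, clamped to the plotting range suggested in the exercise."""
    gain_db = 20.0 * math.log10(max(gain, 1e-12))
    return min(max(gain_db, floor_db), ceil_db)

# Evaluate along the diagonal gamma = xi over +/-30 dB, a text-mode
# stand-in for the 3-D mesh plot asked for in the exercise.
snr_grid_db = range(-30, 31, 10)
for rule in ("wiener", "spectral_subtraction", "maximum_likelihood"):
    snr_lin = [10.0 ** (s / 10.0) for s in snr_grid_db]
    row = [round(to_db(suppression_gain(g, g, rule)), 1) for g in snr_lin]
    print(f"{rule:22s}", row)
```

Comparing the printed rows makes the behavior discussed above visible: the maximum-likelihood gains stay high even at low SNR, while the Wiener gains fall steeply.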
`
`4.3 Uncertain Presence of the Speech Signal
`
`All the suppression rules discussed above were derived under the assumption of the
`presence of both noise and speech signals. The speech signal, however, is not always
present in the short-term spectral representations. Even continuous speech has
pauses with durations of 100-200 ms, which, compared with the typical frame sizes of
10-40 ms, means that there will be a substantial number of audio frames without a
speech signal at all. Trying to estimate the speech signal in these frames leads to
distortions and musical noise.
`Classification of audio frames into "noise only" and "contains some speech" is in
`general a detection and estimation problem [10]. Stable and reliable work of the voice
`activity detector (VAD) is critical for achieving good noise-suppression results. Frame
`classification is used further to build statistical models of the noise and speech signals,
`so it leads to modification of the suppression rule as well.
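Before turning to the detectors themselves, the frame classification just described can be illustrated with a toy Python detector. It tracks a noise-floor estimate with two time constants (rising slowly, falling quickly), in the spirit of the dual-time-constant VAD of Section 4.3.1.2; the class name, threshold, and time constants below are illustrative choices, not the book's algorithm:

```python
import random

class SimpleEnergyVAD:
    """Toy frame classifier: 'noise only' versus 'contains some speech'.

    Tracks a noise-floor estimate with two time constants (slow rise, fast
    fall) and flags frames whose energy is well above the floor. All
    constants here are illustrative, not tuned values from the book.
    """

    def __init__(self, threshold=3.0, up=0.01, down=0.2):
        self.noise_floor = None
        self.threshold = threshold  # speech if energy > threshold * floor
        self.up = up                # slow adaptation when energy rises
        self.down = down            # fast adaptation when energy falls

    def is_speech(self, frame):
        energy = sum(x * x for x in frame) / len(frame)
        if self.noise_floor is None:      # first frame seeds the noise model
            self.noise_floor = energy
            return False
        alpha = self.up if energy > self.noise_floor else self.down
        self.noise_floor += alpha * (energy - self.noise_floor)
        return energy > self.threshold * self.noise_floor

# Twenty 10 ms noise-only frames (160 samples at 16 kHz) stay classified
# as noise; a loud burst is flagged as speech.
random.seed(1)
vad = SimpleEnergyVAD()
noise_frames = [[random.gauss(0.0, 0.05) for _ in range(160)] for _ in range(20)]
decisions = [vad.is_speech(f) for f in noise_frames]
burst_detected = vad.is_speech([0.5] * 160)
print(any(decisions), burst_detected)
```

The asymmetric time constants keep the noise-floor estimate near the quiet parts of the signal, so short speech bursts do not pull it up; real detectors, as the following subsections show, add per-bin statistics, floating thresholds, and hangover logic on top of this idea.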
`
4.3.1 Voice Activity Detectors
`
`Voice activity detectors are algorithms for detecting the presence of speech in a
`mixed signal consisting of speech plus noise. They can vary from a simple binary
`decision (yes/no) for the entire audio frame to precise estimators of the speech
`presence probability for each frequency bin. Most modern noise-suppression
`systems contain at least one VAD, in many cases two or more. The commonly
`used algorithms base their decision on the assumption of a quasi-stationary noise;
`
[Figure 4.9 Relationship between the average SNR improvement and the average MOS score of the suppression rules; only the MOS axis labels (4.0-4.4) survive in the scan.]