`
`Signal Processing 89 (2009) 1038–1049
`
`Contents lists available at ScienceDirect
`
`Signal Processing
`
`journal homepage: www.elsevier.com/locate/sigpro
`
`Verified speaker localization utilizing voicing level in split-bands
`Afsaneh Asaei a,b,c, , Mohammad Javad Taghizadeh c, Marjan Bahrololum c,
`Mohammed Ghanbari d
`a IDIAP Research Institute, Martigny, Switzerland
`b Swiss Federal Institute of Technology at Lausanne (EPFL), Switzerland
`c Iran Telecommunication Research Center, Tehran, Iran
`d Department of Computing and Electronic Systems, University of Essex, Colchester, UK
`
`a r t i c l e i n f o
`
`a b s t r a c t
`
`Article history:
`Received 26 December 2007
`Received in revised form
`31 October 2008
`Accepted 4 December 2008
`Available online 24 December 2008
`
`Keywords:
`Microphone array
`Speaker verification
`Speaker localization
`Reverberation
`Beamforming
`Speech recognition
`
`This paper proposes a joint verification-localization structure based on split-band
`analysis of speech signal and the mixed voicing level. To address the problems in
`reverberant acoustic environments, a new fundamental frequency estimation algorithm
`is proposed based on high resolution spectral estimation. In the reconstruction of the
`distorted speech this information is utilized to reduce the side effect of acoustic noise on
`the voicing parts. A speaker verification system examines the features of
`the
`reconstructed speech in order
`to authorize the speaker before localization.
`This procedure prevents localization and beamforming for non-speech and specially
`the unwanted speakers in multi-speaker scenarios. The verification is implemented with
`the Gaussian Mixture Model and a new filtering scheme is proposed based on the
`voicing likelihood of each frequency band measured in the previous steps for efficient
`localization of the authorized speaker. The performance of the proposed VSL (verified
`speaker localization) front-end is evaluated in various reverberant and noisy environ-
`ments. The VSL is utilized in the development of distant-talking automatic speech
`recognition by microphone array where the system can lock on a specific source and
`hence the recognition quality improves noticeably.
`& 2008 Elsevier B.V. All rights reserved.
`
`1.
`
`Introduction
`
`For a hands-free speech interface, it is very important
`to capture distant talking speech with high quality. An
`ideal solution for this purpose is sound acquisition by
`microphone array. A microphone array can acquire the
`desired speech signals selectively by steering the beam
`pattern directivity of the array towards the desired
`speaker. This process is called beamforming and due to
`the directivity pattern steering, it can spatially filter out
`
` Corresponding author at:
`IDIAP Research Institute, Martigny,
`Switzerland. Tel.: +41 27 72177 73.
`E-mail addresses: afsaneh.asaei@idiap.ch (A. Asaei),
`taghizadehmj@itrc.ac.ir (M.J. Taghizadeh), bahrololum@itrc.ac.ir
`(M. Bahrololum), ghan@essex.ac.uk (M. Ghanbari).
`
`0165-1684/$ - see front matter & 2008 Elsevier B.V. All rights reserved.
`doi:10.1016/j.sigpro.2008.12.003
`
`noises from other directions regardless of the noise
`nature. The main obstacles to achieve reasonable perfor-
`mance in array based systems are the reverberation and
`the presence of ambient noise of acoustic environment.
`These parameters affect the accuracy of speaker localiza-
`tion and beamforming in capturing the desired spatial
`signal and suppressing the others. To tackle this problem,
`various methods have been proposed recently, but they all
`seem to give erroneous estimations in speaker direction
`finding under the presence of high noise and reverbera-
`tion. These conventional algorithms in multi-speaker
`environments not only have difficulty in localizing the
`multiple sound sources accurately, but they also fail to
`localize the target talker among the known multiple
`speaker positions. These localization techniques can
`be loosely classified into three general categories:
`
`IPR PETITION
`US RE48,371
`Sonos Ex. 1031
`
`
`
`ARTICLE IN PRESS
`
`A. Asaei et al. / Signal Processing 89 (2009) 1038–1049
`
`1039
`
`(i) those adopting high resolution spectral concepts,
`(ii)
`techniques based upon maximizing the steered
`response power (SRP) of a beamformer and (iii) ap-
`proaches employing time-difference of arrival (TDOA)
`information.
`The first class of these techniques, characterizes any
`localization scheme that is dependent upon applications
`of the spatio-spectral correlation matrix [1]. Interestingly,
`all of these methods are all designed for narrowband
`signals and are very sensitive to source and microphone
`modeling [2] implying complexities within the speaker
`localization process [3,4]. The second class of the afore-
`mentioned strategies is based on maximizing the output
`power of a steered beamformer or SRP. In this case, a
`beamformer is used to scan over a predefined spatial
`region by adjusting its steering delays [5]. A filtering
`process can also be employed to increase accuracy
`whereby filters are designed in such a way to boost the
`power of the desired signal even if they may increase
`distortion. This is the main distinction between the
`popular beamforming techniques in speech acquisition
`systems and that of localization [6,7]. This category has
`the most robustness in source localization in practical
`situations and is preferable in enabling reliable localiza-
`tion of speech signals with short frames [8]. The third
`category is realized in two phases. Firstly, it detects a set
`of TDOA of the wave-front between different microphone
`pairs mostly based on the generalized cross-correlation
`(GCC) function maximization [9]. In computing the cross-
`correlation function, to increase accuracy, some weighting
`schemes are also employed. The most important weight-
`ings are ML (Maximum Likelihood) and PHAT (phase
`transform) [10,11]. Second, geometrical constraints are
`used to infer
`the source position. Due to its low
`computational cost, this technique has attracted many
`interests. However, pair-wise techniques suffer consider-
`ably from multipath propagation [8]. Since the primary
`goal of microphone array based systems is practicality in
`the real environment, we have considered this subject for
`real applications. In the scenario which is the subject of
`this investigation, we have focused on SRP based localiza-
`tion.
`All the above mentioned attempts were aimed to
`improve the localization accuracy in the presence of
`acoustic noise and reverberation and could not achieve
`satisfactory results in the presence of spurious speech
`sources such as the voice of unwanted speakers. In this
`scenario, speaker verification is needed to authorize the
`speech. This stage of speaker verification by microphone
`array is addressed in [12], where a microphone array is
`utilized to capture the speech and provide input for
`automatic speech identification. A 2-D matched filter
`microphone array is proposed to improve the identifica-
`tion scores in a reverberant environment. In this algo-
`rithm, the identification is addressed after the array-based
`analysis of the received signal. Investigations by Giana-
`kopoulos et al. [13] are concentrated on the implementa-
`tion of the front-end signal pre-processing tasks such as
`filtering, acquisition and beamforming to improve speaker
`recognition. This procedure suffers from over computation
`of localization and beamforming in the multi-speaker
`
`scenarios. In [14] an adaptive near-field beamformer is
`implemented for hands-free speaker recognition. In [15]
`speech enhancement techniques are utilized to reduce the
`acoustic degradation of source signal and improve speaker
`verification in the noisy environments. In [16] a speaker
`identification algorithm based on the angle of arrival of
`the speech is proposed. Since the convergence rate is
`large, the new algorithm has practical limitations and
`participants are required to remain seated during the
`experiment. Hence, limited number of investigators has
`studied speaker recognition and although the effective-
`ness of beamforming is proven in robust hands-free
`speaker recognition [17], but verification always comes
`after the localization, beamforming and other computa-
`tional array processing algorithms.
`In this paper, the idea of verification prior to localiza-
`tion is proposed. It has been observed through extensive
`testing that the quality of the voiced parts is very
`important for verification. Therefore, we have enhanced
`these parts and used them for verification. For the verified
`speech, localization is performed and the enhanced signal
`is acquired through sub-array beamforming. The verifica-
`tion result is tested again after beamforming to ensure a
`high accuracy. We name this front-end block as verified
`speaker localization (VSL). The multi-channel speech
`enhancement based on localization and beamforming is
`only run for the desired voices and the whole system
`becomes robust to unwanted noises as well as other
`spontaneous sources of energy. The over computation of
`beamforming and post processing for unwanted speech
`signals is also prevented which reduces the computational
`complexity of the front-end task in multi-speaker scenar-
`ios considerably.
`Organization of the paper is as follows: The general
`architecture of the proposed VSL front-end is explained in
`Section 2. It includes a brief overview of VSL components,
`details of the split-band reconstruction, speaker verifica-
`tion and localization. Scenario of testing and the results
`achieved are described in Section 3. A VSL based far-field
`automatic speech recognition (ASR) is also introduced in
`this section and the effect of the VSL front-end on the
`performance of this system is evaluated. Finally, conclud-
`ing remarks are given in Section 4.
`
`2. General overview of the proposed VSL front-end
`
`The main elements of the proposed front-end signal
`pre-processing block are: acquisition, reconstruction of
`the voiced parts, verification, localization and beamform-
`ing. The order in which they interact with each other is
`shown in Fig. 1.
`The acquired speech is first analyzed in split-bands to
`measure the voicing level. For this purpose in the
`reverberant acoustic environments, a new fundamental
`frequency estimation algorithm is proposed based on the
`subspace approach in high resolution spectral estimation.
`A reconstruction stage for the degraded voiced bands is
`also proposed prior to the verification. The verification is
`implemented using Gaussian Mixture Model and a new
`SRP filtering scheme is proposed based on the voicing
`
`
`
`1040
`
`A. Asaei et al. / Signal Processing 89 (2009) 1038–1049
`
`ARTICLE IN PRESS
`
`Acquisition
`
`Reconstruction of
`the voiced parts
`
`Speaker
`Verification
`
`Speaker
`Localization
`
`Beamforming
`
`Fig. 1. A general architecture of the proposed VSL front-end.
`
`likelihood of each frequency band measured in the
`previous steps to effectively localize the authorized
`speaker.
`In the traditional methods, as discussed in the
`introductory part, whenever a source of energy is detected
`by the localization algorithm, the beamforming will then
`be applied to acquire the enhanced signal. These two
`processes are computational
`intensive in the far-field
`interfaces. In the proposed VSL front-end, a new localiza-
`tion algorithm improves the speaker localization accuracy
`as well as the robustness against the reverberation and
`noise, while the verification which is performed prior to
`localization prevents the over computation of localization
`and beamforming for unwanted sources (specially tran-
`sient or unauthorized speakers). Therefore, the whole
`system will have the capability to update the location
`information of any specific individual. On the other hand,
`since the localization is based on short speech frames, it is
`also capable of tracking a moving speaker. These two
`capabilities indicate that the system can lock on a speaker,
`while ignoring other speech sources. Since localization
`and beamforming are highly computational demanding
`[11] and achieving an enhanced speech for far-field
`applications needs heavy processing, this lock on char-
`acteristic improves the front-end task both in terms of
`computation and robustness in far-field applications such
`as teleconferencing, voice control and speech recognition
`where the presence of unwanted speech signals is highly
`probable.
`In the proposed VSL front-end, the received signal is
`first segmented based on detection of the non-speech
`activity for more than 2 s. Each segment is analyzed for
`voicing level measurement at speech sub-bands corre-
`sponding to the fundamental frequency harmonics. The
`voiced parts are then reconstructed at split bands
`regarding the harmonic bands of the speech spectrum
`and the signal is analyzed for authentication within a
`verification algorithm. For the verified speech, misdetec-
`tion of source localization due to reverberation and
`acoustic noise is reduced through the voicing level
`measurement. The beamforming algorithm uses this
`information to steer the beam pattern towards the
`direction of the speaker to acquire the source signal while
`suppressing the noise from other directions. Details of
`each component are discussed in the following sections.
`
`2.1. Microphone array signal model
`
`In this paper, we assume the sound wave propagation
`follows a linear wave equation [18]. Hence, the acoustic
`path between the sound sources and microphones can be
`modeled as a linear system [19]. This assumption is
`plausible in small-room microphone array environments
`and is usually employed in the array-processing techni-
`ques [20]. With these assumptions the produced signal by
`
`(1)
`
`the mth microphone at location dm can be expressed as
`xmðtÞ ¼ sðtÞ hsðdm; ds; tÞ þ vmðtÞ
`where hs(dm,ds,t) is the room impulse response from the
`speech source s(t) at location ds to microphone m. The
`operator * is convolution. nm is a white Gaussian and is
`assumed to be uncorrelated to s(t).
`The impulse response h, characterizes all the acoustic
`paths from the source to location dm, including the direct
`path. In general, hs varies with environmental changes,
`such as temperature, humidity,
`furniture and people
`inside the room. It is reasonable to assume these factors
`to remain fixed in the period of each experiment.
`Separating the direct path component from the rest of
`the acoustic paths, the following expression can be
`defined for hs(dm,ds,t):
`hsðdm; ds; tÞ ¼ a
`dðt tmÞ þ uðdm; ds; tÞ
`rm
`where rm is the distance between the source and the mth
`microphone, tm is the propagation delay equal to the ratio
`of rm to the speed of sound. The constant a depends on the
`medium and the system of units used. u(dm,ds,t) char-
`acterizes all the acoustic paths except the direct path.
`Substituting this equation into (1), the signal model at
`microphone m is given by
`xmðtÞ ¼ a
`sðt tmÞ þ sðtÞ uðdm; ds; tÞ þ vmðtÞ
`rm
`
`(2)
`
`(3)
`
`The first term is the direct path component which is
`important for localization, the second term is the model of
`reverberation and the third term is the uncorrelated noise.
`
`2.2. Split-band reconstruction
`
`A typical simulated room impulse response is illu-
`strated in Fig. 2. The largest peak corresponds to the direct
`path and the other peaks are due to the surrounding walls
`reverberation. Assuming the total system of microphone
`array and room as a linear system [21], the received signal
`at each microphone is the convolution of this impulse
`response with the original source signal. This effect
`impairs the received signal quality at the microphone
`array and reduces the periodicity of the voiced segments.
`Hence we have considered this side-effect and have
`enhanced these harmonic parts through reconstruction.
`The first step is the estimation of the fundamental
`frequency. However, due to the distortion of periodicity
`and harmonicity, conventional
`fundamental
`frequency
`extraction algorithms such as autocorrelation function
`(ACF), average magnitude difference function (AMDF),
`Cepstrum, simple inverse filtering tracking (SIFT) and
`harmonic product spectrum (HPS) give erroneous results.
`Since the estimation accuracy of the fundamental fre-
`quency in the presence of noise and reverberation is very
`important for the performance of the whole system, we
`
`
`
`ARTICLE IN PRESS
`
`A. Asaei et al. / Signal Processing 89 (2009) 1038–1049
`
`1041
`
`(4)
`
`El ¼
`
`Since the room can be modeled as a linear system, the
`frequency content of the received signal is similar to the
`original sound and it is only distorted in amplitude
`and phase. Therefore reverberation converts the global
`maximum of the spectrum to a local maximum with no
`frequency displacement.
`Through a large number of experiments we have
`verified the robustness of the algorithm to different
`reverberant noisy environments. The algorithm was also
`verified for robustness to sudden closure, such as in a
`vowel-to-nasal transition, where waveform periodicity is
`reduced but the fundamental frequency did not change.
`After estimation of the fundamental frequency, the
`algorithm is used to measure the voicing level in each
`frequency band. An accurate measure of voicing level was
`applied to multi-band excitation (MBE) coders [26]. The
`voicing decision was made by calculating the normalized
`error El between the original and the modeled speech
`spectrum in each frequency band of the fundamental
`P
`frequency harmonics:
`P
`jXðoÞ ^Xðo; o0Þj2
`blo¼al
`jXðoÞj2
`blo¼al
`where X(o) is the speech spectrum of the received signal
`at the reference microphone channel (#5), o0 is the
`fundamental frequency, al and bl are the first and last
`harmonics in the lth band, and ^Xðo; o0Þ is the estimated
`speech spectrum calculated in each frequency band as the
`spectral shape of a Hanning window with a constant
`amplitude.
`To determine the voicing decision, the normalized
`error, El, of the lth frequency band is compared with an
`adaptive threshold [27]. If the normalized error is less
`than a threshold, the corresponding frequency band
`belongs to the target voice and it is reconstructed in the
`split-bands based on the fundamental frequency harmo-
`nics.
`Since higher harmonics are more susceptible to
`reverberation and acoustic noise [28] decision on voicing
`for the frame was carried out on the majority of the lower
`half of the speech frequency band. For those intervals
`when all of the speakers are talking simultaneously, the
`speech frames loose their periodicity and these frames are
`not involved in the other phases of the VSL processing.
`The speech signal due to acoustic noise is distorted.
`The distortion can be reduced in voiced parts by precise
`extraction of the fundamental frequency and then using it
`to reconstruct the speech spectrum. The split-band mixed
`voicing decision calculated for each frequency band is
`utilized to synthesize the voiced speech spectrum. Each
`harmonic band has a shape similar to the spectral shape of
`the window used prior to the Fourier transform, whereas
`the non-voiced bands are random in nature. Therefore, a
`voiced harmonic band can be finely synthesized as a
`multiplication of the frequency response of a suitable
`window centered at the harmonic of fundamental fre-
`quency corresponding to that band with constant ampli-
`tude measured with respect to the original signal [29].
`Reconstruction of the harmonic bands is given by
`Eq. (5). This reconstruction is performed up to the highest
`
`0
`
`0.02
`
`0.04 0.06
`0.08
`Time (s)
`
`0.1
`
`0.12
`
`0.14
`
`Fig. 2. Room impulse response.
`
`0.7
`0.6
`0.5
`0.4
`0.3
`0.2
`0.1
`0
`-0.1
`-0.2
`
`Amplitude
`
`have extracted the fundamental frequency on the sub-
`space to benefit
`from the high resolution spectral
`estimation property of this technique.
`The subspace based spectral estimation is an accurate
`method for detecting the discrete frequencies of a signal
`and hence we used the multiple signal classification
`(MUSIC) [22,23] in our algorithm. The MUSIC algorithm
`detects complex sinusoids by performing eigendecompo-
`sition on the data vector covariance matrix of the received
`signal. Andrews et al. [24] have already proposed the pitch
`determination algorithm based on MUSIC. Here we have
`modified their approach for the reverberant signals. To
`find the fundamental
`frequency,
`the autocorrelation
`matrix of the speech signal is computed from its power
`spectrum via FT. Since the fundamental frequency of
`speech sources is less than 800 Hz [25], we have applied
`the MUSIC algorithm only to the lower
`frequency
`components of the speech spectrum. With an 800-point
`DFT of 20 ms of the speech signal at the sampling
`frequency of 16 kHz, the frequency components of a
`MUSIC spectrum will be at 20,40,y,800 Hz. The total
`number of these components is 40 and the eigenvalues
`are computed from the received signal autocorrelation
`matrix. The number of harmonics contained in the
`spectrum is an important parameter of
`the MUSIC
`algorithm. If it is set too large, the spectrum will be easily
`affected by the noise and if it is too small, the spectral
`estimation becomes inaccurate and the error will be
`increased. For our experiments, the set of dominant
`eigenvalues {lk} which span over the signal subspace are
`chosen so as to satisfy l1XlkXl1=8, where l1 is the
`eigenvalue of the first fundamental component. The FFT is
`applied to the logarithm of the MUSIC power spectrum
`and the peak location of the signal determines the
`estimated fundamental frequency. To reduce the compu-
`tational cost, we have estimated the fundamental fre-
`quency at the precision of 20 Hz. This was done by
`searching the pseudospectrum of the signal with 1 Hz
`precision at the vicinity of 80 Hz around the pre-estimated
`fundamental frequency. The corresponding frequency of
`the local maxima is detected as the fundamental
`frequency.
`
`
`
`1042
`
`A. Asaei et al. / Signal Processing 89 (2009) 1038–1049
`
`ARTICLE IN PRESS
`
`voiced band of the speech spectrum.
`^Xðo; o0Þ ¼ Ak;o0 WðoÞ 1pkpK; dakepopdbke
`(5)
`where ak ¼ (k 0.5)o0, bk ¼ (k+0.5)o0, d e stands for the
`nearest integer greater than or equal to, K is the number of
`harmonics in the 8 kHz speech frequency bandwidth,
`W(o) is the frequency response of the Hanning window
`centered at
`the kth harmonic of
`the fundamental
`frequency and Ak;o0 is the kth harmonic amplitude defined
`P
`as:
`P
`dbke
`o¼dakeXðoÞWðoÞ
`dbke
`o¼dakejWðoÞj2
`For concatenation of the reconstructed successive frames,
`we use linear interpolation to remove frequency mis-
`matches [30]. Fig. 3 displays a clean speech, noisy signal
`and the synthesized speech from its noisy origin by
`spectrogram. This figure shows how reconstruction
`
`Ak;o0 ¼
`
`(6)
`
`procedure reduces the acoustical noise and retrieves the
`harmonicity of voicing speech.
`
`2.3. Speaker verification
`
`Mixture models belong to a family of density model
`that comprises of a number of component functions,
`usually Gaussian. The distribution of feature vectors was
`extracted from a speaker’s speech modeled by a Gaussian
`mixture density. This is a method that has been proven to
`be one of the most successful approaches for text-
`independent speaker verification. Therefore we have
`implemented speaker modeling based on the Gaussian
`Mixture Models (GMM).
`In this algorithm Gaussian
`mixtures are used to model arbitrary densities of the
`speech signal [31–33].
`A block diagram of the implemented speaker verifica-
`tion system is shown in Fig. 4. There are two steps in the
`
`Fig. 3. Spectrogram of: (a) clean close-talk speech; (b) far field signal; and (c) synthesized speech.
`
`
`
`ARTICLE IN PRESS
`
`A. Asaei et al. / Signal Processing 89 (2009) 1038–1049
`
`1043
`
`Training phase
`
`Speech Signal
`
`Silence
`Removal
`
`Feature
`Extraction
`
`Noise
`ction
`Reduction
`
`Train a model
`based on GMM
`
`Trained
`model
`
`Verification phase
`
`Speech Signal (x)
`
`Silence
`Removal
`
`Feature
`Extraction
`
`Noise
`Reduction
`
`Comparison
`Λ (x)
`
`Accept
`/Reject
`
`Fig. 4. The block diagram of the two phases in a speaker verification system.
`
`Threshold ( )
`
`(8)
`
`of the Eq. (7) is calculated and the verification decision is
`made. Otherwise, the claimed speaker will be rejected.
`
`^C ¼ arg maxflogðpðxjCiÞg
`1pipN
`where p(x|Ci) is the likelihood of the input vector x for
`mixture model Ci of speaker i. N is the total number of all
`registered speakers. A block diagram of the proposed
`speaker verification model is shown in Fig. 5. The model is
`able to prevent misdetection between two speakers of
`similar sounds by finding the associated model of the
`imposter speaker and rejects it prior to the computation of
`the threshold function. This sub-system has been tested
`on-line and proven to be highly robust and accurate in
`practice [35].
`
`2.4. SRP sound localization
`
`process of speaker verification. In the first step, called the
`training phase, each registered speaker has to provide
`speech samples. After removal of silence intervals, the
`MFCC (Mel Frequency Cepstrum Coefficient) features are
`extracted for speech frames and the effect of channel
`distortion is reduced by Cepstral Mean Subtraction (CMS).
`To build a reference model
`for every speaker,
`the
`parameters of a GMM are calculated for each speaker by
`determining the mean vector and covariance matrix of the
`Gaussian densities and the mixture weights. In addition, a
`threshold is also set from the training samples; the
`threshold is important for the final rejection or acceptance
`of a user, and it is independent of the acoustic character-
`istics of the environment. The second phase is the actual
`verification, where the input speech is compared with the
`stored reference models and the recognition decision is
`made to accept or reject a speaker.
`It should be noted that to enhance speaker recognition,
`we have developed a pre-step gender recognition based
`on the maximum a posterior probability for a given
`observation sequence in the gender recognition phase. It is
`based on a GMM model and is separately trained for
`female and male speakers.
`
`2.3.1. Decision parameter
`The general approach proposed by Reynolds et al. [34]
`for speaker verification is to apply a likelihood ratio test to
`an input utterance to determine if the claimed speaker
`should be accepted or rejected. Given an utterance of
`speech signal x, a claimed speaker is identified with the
`corresponding model Cc and an anti-model Ct. Discard-
`ing the constant prior probabilities for claimant and
`imposter speakers, the likelihood ratio in the log domain
`becomes:
`LðvÞ ¼ log pðxjCcÞ log pðxjCcÞ
`(7)
`The term pi(x|Cc) is the likelihood of the utterance if it is
`from the claimed speaker and pðxjCcÞ is the likelihood of
`utterance if it is not from the claimed speaker. The
`likelihood ratio is compared to a threshold A and the test
`speaker is accepted if L(x)4A, otherwise it is rejected.
`To increase system accuracy, we first find the model ^C
`which has the maximum a posteriori probability for the
`speaker sequence using Eq. (8). In case this model is
`the model of the claimed speaker, the likelihood ratio
`
`Source location in spherical coordinates is represented
`by range r, azimuth y and elevation f. If the source range
`is larger than the array dimension within a specific
`threshold [36], its wave front is received in a planar form
`and the range accuracy becomes ambiguous. To alleviate
`the ambiguity, the source position could be specified
`within specific y and f and its directional vector is
`defined:
`
`(9)
`
`375
`
`264
`
`ðsÞ
`o
`~z
`
`cos f sin y
`cos f cos y
`sin y
`
`In this case the steering delay of microphone m relative to
`
`a reference microphone is calculated through
`Dm ¼ drm cos b
`c
`
`
`
` Fs
`
`(10)
`
`where drm is the distance between the reference micro-
`phone and microphone m, c is the speed of sound, b is the
`ratio of wavefront angle to the microphones intersection
`line and Fs is the sampling frequency. d e stands for the
`nearest integer greater than or equal to.
`In the algorithm based on the SRP, a microphone array
`beam pattern is steered towards the candidate positions,
`the so called beamforming. The output power of a
`beamformer is then computed and the source position is
`determined based on the beamformer maximum power
`
`
`
`1044
`
`A. Asaei et al. / Signal Processing 89 (2009) 1038–1049
`
`ARTICLE IN PRESS
`
`Speech
`Signal
`(X)
`
`Silence
`Removal
`
`Feature
`Extraction
`
`Noise
`Reduction
`
`Gender
`Recognition
`
`Trained
`Gaussian
`Mixture
`Model in the
`First Phase
`
`Max Likelihood
`
`Ψˆ
`
`Comparison
`
`Fig. 5. The proposed speaker verification model.
`
`No
`
`No
`
`No
`
`Reject
`
`Reject
`
`Reject
`
`Is verified
`gender,
`Claimed?
`
`Yes
`
`Is obtained
`model,
`Claimed?
`
`Yes
`
`Λ(X) <
`
`Yes
`
`Accept
`
`The side effect of reverberation is presented in Fig. 6,
`where speech frames with periodic structures are less
`influenced by reverberation and noise and must have
`higher weights than the other parts in the localization
`algorithm.
`in each
`An accurate measurement of voicing level
`frequency band was described in Section 2.2. As men-
`tioned, the voicing decision is made by calculating the
`normalized error El between the original and the modeled
`speech spectrum for each frequency band l. For voiced
`frames, El has a value close to zero; and values near to 1
`correspond to the noisy and non-periodic intervals.
`Therefore, the calculated error from Eq. (4) is used to
`measure the degree of periodicity for each frequency band
`and can be employed in a filtering scheme for SRP
`localization. The filter to be employed at each channel is
`
`Gl;mðoÞ ¼ 1 El
`jXmðoÞj ; o 2 ½al; bl
`
`(14)
`
`where El is calculated for the received signal at micro-
`phone m, Xm(o). This filter will emphasize the voiced
`frames. Furthermore, the influence of the signal amplitude
`will be omitted and only the phase information will
`be used in the localization process which leads to
`improvement in the robustness of this algorithm to both
`noise and reverberation.
`In practice, for small arrays it is sufficient to compute
`the fundamental frequency harmonics at the reference
`microphone and then the error for each frequency band is
`calculated by this pitch period. Therefore, if a channel
`signal for some reason is degraded, its influence will be
`reduced. The employed filter in beamforming algorithm
`and the output of the steered array is computed by
`Eq. (15) and its power is calculated for that particular
`point in space. We name the proposed method SRP-H, as it
`is based on beamforming and analyses of the fundamental
`frequency harmonics of the speech signal.
`1 El;m
`jXmðoÞj XmðoÞe joDm
`
`XL
`XM
`
`m¼1
`
`l¼1
`
`Y SRP-H
`D1...DM
`
`ðoÞ ¼
`
`(15)
`
`using the ML estimation. The output of a filter-and-sum
`beamformer in the frequency domain is defined:
`
`XM
`
`YðoÞ ¼
`
`GmðoÞXmðoÞe joDm
`
`(11)
`
`(12)
`
`m¼1
`where Xm(o) is the discrete Fourier transform of the
`received signal from microphone m, Gm(o) is the corre-
`sponding filter for microphone m, D1,y,DM are integer
`values representing the steering delays derived from
`Eq. (10) for candidate positions in the space. To avoid
`rounding up errors of Dm to integer values, prior to
`localization, the signal is upsampled to 96 kHz. The SRP is
`calculated by
`PD1...DM ¼ YðoÞY TðoÞ
`where Y(o)
`the
`is the horizontal output vector of
`beamformer and YT is its transpose. Although steering
`delays are continuous variables, Eq. (10) is computed for
`discrete locations in space.
`In calculating SRP, choice of a suitable filter has a
`considerable impact on the robustness of localization to
`both noise and reverberation. In a well-known SRP-PHAT
`algorithm [10], the filter used at each channel is given by
`
`GmðoÞ ¼
`
`1
`jXmðoÞj
`
`for m ¼ 1 . . . M
`
`(13)
`
`2.4.1. A new filtering scheme exploiting harmonic structures
`Since the received signal at microphone array is
`contaminated by multi-path and noise signals,
`the
`periodicity of
`the voiced segments is reduced. This
`phenomenon can be seen in Fig. 6, where T60 identifies
`the room reverberation time. It is measured as 10 times
`the logarithm of the normalized squared impulse re-
`sponse amplitude and is represented in the form of e.g.
`T60, implying the time needed for this value to decay from
`0 to 60 dB [37].
`
`
`
`ARTICLE IN PRESS
`
`A. Asaei et al. / Signal Processing 89 (2009) 1038–1049
`
`1045
`
`Fig. 6. Periodicity degradation due to noise and reverberation: upper graph: initial signal of word /a/; middle graph: signal affected by reverberation; and
`lower graph: signal affected by reverberation and noise.
`
`3. Test Scenario and the tentative results
`
`The VSL front-end performance was evaluated under
`different conditions of noise and reverberation. The
`impact of these acoustic parameters on a VSL based ASR
`is quantified in this section. Fig. 7 shows a VSL based ASR.
`The original speech has been chosen from TIMIT
`database and the far-field signal received at each micro-
`phone channel was simulated based on the theories
`explained in Section 2.1. The TIMIT speech data was
`recorded with a close-talking microphone of sampling
`frequency of 16 kHz. The New York City subset comprising
`of 13 females and 22 males were used. This database was
`divided into a training set and a testing set. The training
`set for each speaker comprised of ten sentences and the