Signal Processing 89 (2009) 1038–1049
journal homepage: www.elsevier.com/locate/sigpro
doi:10.1016/j.sigpro.2008.12.003
Verified speaker localization utilizing voicing level in split-bands

Afsaneh Asaei a,b,c,*, Mohammad Javad Taghizadeh c, Marjan Bahrololum c, Mohammed Ghanbari d

a IDIAP Research Institute, Martigny, Switzerland
b Swiss Federal Institute of Technology at Lausanne (EPFL), Switzerland
c Iran Telecommunication Research Center, Tehran, Iran
d Department of Computing and Electronic Systems, University of Essex, Colchester, UK

* Corresponding author at: IDIAP Research Institute, Martigny, Switzerland. Tel.: +41 27 72177 73. E-mail addresses: afsaneh.asaei@idiap.ch (A. Asaei), taghizadehmj@itrc.ac.ir (M.J. Taghizadeh), bahrololum@itrc.ac.ir (M. Bahrololum), ghan@essex.ac.uk (M. Ghanbari).
Article info

Article history: Received 26 December 2007; received in revised form 31 October 2008; accepted 4 December 2008; available online 24 December 2008.

Keywords: Microphone array; Speaker verification; Speaker localization; Reverberation; Beamforming; Speech recognition

Abstract

This paper proposes a joint verification-localization structure based on split-band analysis of the speech signal and the mixed voicing level. To address the problems posed by reverberant acoustic environments, a new fundamental frequency estimation algorithm based on high-resolution spectral estimation is proposed. This information is utilized in the reconstruction of the distorted speech to reduce the side effect of acoustic noise on the voiced parts. A speaker verification system then examines the features of the reconstructed speech in order to authorize the speaker before localization. This procedure prevents localization and beamforming for non-speech sources and, especially, for unwanted speakers in multi-speaker scenarios. The verification is implemented with a Gaussian Mixture Model, and a new filtering scheme based on the voicing likelihood of each frequency band measured in the previous steps is proposed for efficient localization of the authorized speaker. The performance of the proposed VSL (verified speaker localization) front-end is evaluated in various reverberant and noisy environments. The VSL front-end is utilized in the development of distant-talking automatic speech recognition by microphone array, where the system can lock on to a specific source and hence the recognition quality improves noticeably.

© 2008 Elsevier B.V. All rights reserved.
1. Introduction
For a hands-free speech interface, it is very important to capture distant-talking speech with high quality. An ideal solution for this purpose is sound acquisition by a microphone array. A microphone array can acquire the desired speech signals selectively by steering the directivity pattern of the array towards the desired speaker. This process is called beamforming and, owing to the directivity pattern steering, it can spatially filter out
noises from other directions regardless of the nature of the noise. The main obstacles to achieving reasonable performance in array-based systems are reverberation and the presence of ambient noise in the acoustic environment. These factors affect the accuracy of speaker localization and of beamforming in capturing the desired spatial signal and suppressing the others. To tackle this problem, various methods have been proposed recently, but they all seem to give erroneous estimates of the speaker direction in the presence of high noise and reverberation. In multi-speaker environments these conventional algorithms not only have difficulty in localizing multiple sound sources accurately, but they also fail to localize the target talker among the known multiple speaker positions. These localization techniques can be loosely classified into three general categories:
(i) those adopting high-resolution spectral concepts, (ii) techniques based upon maximizing the steered response power (SRP) of a beamformer, and (iii) approaches employing time-difference of arrival (TDOA) information.
The first class characterizes localization schemes that depend on the spatio-spectral correlation matrix [1]. Interestingly, all of these methods are designed for narrowband signals and are very sensitive to source and microphone modeling [2], which complicates the speaker localization process [3,4]. The second class of the aforementioned strategies is based on maximizing the output power of a steered beamformer, or SRP. In this case, a beamformer is used to scan a predefined spatial region by adjusting its steering delays [5]. A filtering process can also be employed to increase accuracy, whereby the filters are designed to boost the power of the desired signal even if they increase distortion; this is the main distinction between the popular beamforming techniques used in speech acquisition systems and those used for localization [6,7]. This category is the most robust for source localization in practical situations and is preferable for reliable localization of speech signals using short frames [8]. The third category is realized in two phases. First, it detects a set of TDOAs of the wave front between different microphone pairs, mostly based on maximization of the generalized cross-correlation (GCC) function [9]. To increase accuracy, weighting schemes are employed in computing the cross-correlation function; the most important weightings are ML (maximum likelihood) and PHAT (phase transform) [10,11]. Second, geometrical constraints are used to infer the source position. Due to its low computational cost, this technique has attracted much interest; however, pair-wise techniques suffer considerably from multipath propagation [8]. Since the primary goal of microphone-array-based systems is practicality in real environments, we have targeted real applications, and in the scenario that is the subject of this investigation we have focused on SRP-based localization.
All of the above attempts aim to improve localization accuracy in the presence of acoustic noise and reverberation, but they do not achieve satisfactory results in the presence of spurious speech sources such as the voices of unwanted speakers. In this scenario, speaker verification is needed to authorize the speech. Speaker verification with a microphone array is addressed in [12], where a microphone array is utilized to capture the speech and provide the input for automatic speaker identification; a 2-D matched-filter microphone array is proposed to improve the identification scores in a reverberant environment, and the identification is performed after the array-based analysis of the received signal. Gianakopoulos et al. [13] concentrate on the implementation of front-end signal pre-processing tasks such as filtering, acquisition and beamforming to improve speaker recognition; this procedure suffers from over-computation of localization and beamforming in multi-speaker scenarios. In [14] an adaptive near-field beamformer is implemented for hands-free speaker recognition. In [15] speech enhancement techniques are utilized to reduce the acoustic degradation of the source signal and improve speaker verification in noisy environments. In [16] a speaker identification algorithm based on the angle of arrival of the speech is proposed; owing to its slow convergence, the algorithm has practical limitations and participants are required to remain seated during the experiment. Hence, only a limited number of investigators have studied array-based speaker recognition, and although the effectiveness of beamforming for robust hands-free speaker recognition has been proven [17], verification always comes after localization, beamforming and the other computationally demanding array processing algorithms.
In this paper, the idea of verification prior to localization is proposed. It has been observed through extensive testing that the quality of the voiced parts is very important for verification; therefore, we enhance these parts and use them for verification. For the verified speech, localization is performed and the enhanced signal is acquired through sub-array beamforming. The verification result is tested again after beamforming to ensure high accuracy. We name this front-end block the verified speaker localization (VSL). The multi-channel speech enhancement based on localization and beamforming is run only for the desired voices, so the whole system becomes robust to unwanted noises as well as to other spontaneous sources of energy. The over-computation of beamforming and post-processing for unwanted speech signals is also prevented, which considerably reduces the computational complexity of the front-end task in multi-speaker scenarios.
The paper is organized as follows. The general architecture of the proposed VSL front-end is explained in Section 2, including a brief overview of the VSL components and details of the split-band reconstruction, speaker verification and localization. The test scenario and the results achieved are described in Section 3; a VSL-based far-field automatic speech recognition (ASR) system is also introduced in that section and the effect of the VSL front-end on its performance is evaluated. Finally, concluding remarks are given in Section 4.
2. General overview of the proposed VSL front-end

The main elements of the proposed front-end signal pre-processing block are acquisition, reconstruction of the voiced parts, verification, localization and beamforming. The order in which they interact is shown in Fig. 1.

The acquired speech is first analyzed in split bands to measure the voicing level. For this purpose, a new fundamental frequency estimation algorithm for reverberant acoustic environments is proposed, based on the subspace approach to high-resolution spectral estimation. A reconstruction stage for the degraded voiced bands is also applied prior to verification. The verification is implemented using a Gaussian Mixture Model.
Fig. 1. General architecture of the proposed VSL front-end: Acquisition → Reconstruction of the voiced parts → Speaker verification → Speaker localization → Beamforming.
A new SRP filtering scheme, based on the voicing likelihood of each frequency band measured in the previous steps, is proposed to localize the authorized speaker effectively.
In the traditional methods, as discussed in the introduction, whenever a source of energy is detected by the localization algorithm, beamforming is applied to acquire the enhanced signal. These two processes are computationally intensive in far-field interfaces. In the proposed VSL front-end, a new localization algorithm improves the speaker localization accuracy as well as the robustness against reverberation and noise, while the verification performed prior to localization prevents the over-computation of localization and beamforming for unwanted sources (especially transient sources or unauthorized speakers). Therefore, the whole system has the capability to update the location information of a specific individual. On the other hand, since the localization is based on short speech frames, it is also capable of tracking a moving speaker. These two capabilities indicate that the system can lock on to a speaker while ignoring other speech sources. Since localization and beamforming are computationally demanding [11] and achieving enhanced speech for far-field applications needs heavy processing, this lock-on characteristic improves the front-end task both in terms of computation and of robustness in far-field applications such as teleconferencing, voice control and speech recognition, where the presence of unwanted speech signals is highly probable.
In the proposed VSL front-end, the received signal is first segmented based on detection of non-speech activity lasting more than 2 s. Each segment is analyzed to measure the voicing level in the speech sub-bands corresponding to the fundamental frequency harmonics. The voiced parts are then reconstructed in split bands defined by the harmonic bands of the speech spectrum, and the signal is analyzed for authentication by a verification algorithm. For the verified speech, misdetection in source localization due to reverberation and acoustic noise is reduced through the voicing level measurement. The beamforming algorithm uses this information to steer the beam pattern towards the direction of the speaker, acquiring the source signal while suppressing noise from other directions. Details of each component are discussed in the following sections.
2.1. Microphone array signal model

In this paper, we assume that sound wave propagation follows a linear wave equation [18]. Hence, the acoustic path between the sound sources and the microphones can be modeled as a linear system [19]. This assumption is plausible in small-room microphone array environments and is usually employed in array-processing techniques [20]. With these assumptions, the signal produced by the mth microphone at location d_m can be expressed as

$$x_m(t) = s(t) \ast h_s(d_m, d_s, t) + v_m(t) \qquad (1)$$

where h_s(d_m, d_s, t) is the room impulse response from the speech source s(t) at location d_s to microphone m, the operator * denotes convolution, and v_m(t) is white Gaussian noise assumed to be uncorrelated with s(t).
The impulse response h_s characterizes all the acoustic paths from the source to location d_m, including the direct path. In general, h_s varies with environmental changes such as temperature, humidity, furniture and people inside the room. It is reasonable to assume that these factors remain fixed during each experiment. Separating the direct-path component from the rest of the acoustic paths, the following expression can be written for h_s(d_m, d_s, t):

$$h_s(d_m, d_s, t) = \frac{a}{r_m}\,\delta(t - \tau_m) + u(d_m, d_s, t) \qquad (2)$$

where r_m is the distance between the source and the mth microphone, τ_m is the propagation delay, equal to the ratio of r_m to the speed of sound, the constant a depends on the medium and the system of units used, and u(d_m, d_s, t) characterizes all the acoustic paths except the direct path. Substituting this expression into (1), the signal model at microphone m becomes

$$x_m(t) = \frac{a}{r_m}\,s(t - \tau_m) + s(t) \ast u(d_m, d_s, t) + v_m(t) \qquad (3)$$

The first term is the direct-path component, which is important for localization; the second term models the reverberation; and the third term is the uncorrelated noise.
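To make the model of Eqs. (1)-(3) concrete, the following minimal Python sketch simulates one microphone signal as a direct-path impulse scaled by a/r_m plus a synthetic, exponentially decaying tail standing in for u(d_m, d_s, t), with additive white Gaussian noise. The function name, the exponential tail and the SNR parameterization are illustrative assumptions, not the simulation code used by the authors.

import numpy as np
from scipy.signal import fftconvolve

def simulate_microphone_signal(s, fs, r_m, rt60=0.3, snr_db=20.0, c=343.0, a=1.0, rng=None):
    # Toy realization of Eq. (3): direct path + synthetic reverberant tail + white noise.
    rng = np.random.default_rng() if rng is None else rng
    tau_m = r_m / c                                  # propagation delay tau_m = r_m / c
    n_delay = int(round(tau_m * fs))

    # u(d_m, d_s, t) approximated by an exponentially decaying noise tail
    # (assumption; a measured or image-method room impulse response would be used in practice).
    n_tail = int(rt60 * fs)
    t = np.arange(1, n_tail + 1) / fs
    tail = 0.05 * rng.standard_normal(n_tail) * 10.0 ** (-3.0 * t / rt60)

    h = np.zeros(n_delay + 1 + n_tail)
    h[n_delay] = a / r_m                             # direct-path term (a / r_m) * delta(t - tau_m)
    h[n_delay + 1:] = tail                           # remaining acoustic paths

    x = fftconvolve(s, h)[:len(s)]                   # s(t) convolved with h_s(d_m, d_s, t)

    v = rng.standard_normal(len(x))                  # uncorrelated noise v_m(t) at the requested SNR
    v *= np.sqrt(np.mean(x ** 2) / 10.0 ** (snr_db / 10.0)) / (np.std(v) + 1e-12)
    return x + v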
2.2. Split-band reconstruction

A typical simulated room impulse response is illustrated in Fig. 2. The largest peak corresponds to the direct path and the other peaks are due to reflections from the surrounding walls. Assuming the total system of microphone array and room to be linear [21], the received signal at each microphone is the convolution of this impulse response with the original source signal. This effect impairs the quality of the signal received at the microphone array and reduces the periodicity of the voiced segments. Hence we have taken this side effect into account and enhanced the harmonic parts through reconstruction. The first step is the estimation of the fundamental frequency. However, due to the distortion of periodicity and harmonicity, conventional fundamental frequency extraction algorithms such as the autocorrelation function (ACF), average magnitude difference function (AMDF), cepstrum, simple inverse filtering tracking (SIFT) and harmonic product spectrum (HPS) give erroneous results. Since the accuracy of the fundamental frequency estimate in the presence of noise and reverberation is very important for the performance of the whole system, we extract the fundamental frequency in the signal subspace to benefit from the high-resolution spectral estimation property of this approach.

Fig. 2. A typical simulated room impulse response (amplitude versus time, approximately 0–0.14 s).

Subspace-based spectral estimation is an accurate method for detecting the discrete frequencies of a signal, and hence we have used multiple signal classification (MUSIC) [22,23] in our algorithm. The MUSIC algorithm detects complex sinusoids by performing an eigendecomposition of the covariance matrix of the received data vector. Andrews et al. [24] have already proposed a pitch determination algorithm based on MUSIC; here we modify their approach for reverberant signals. To find the fundamental frequency, the autocorrelation matrix of the speech signal is computed from its power spectrum via the Fourier transform. Since the fundamental frequency of speech sources is below 800 Hz [25], we apply the MUSIC algorithm only to the lower frequency components of the speech spectrum. With an 800-point DFT of 20 ms of speech at a sampling frequency of 16 kHz, the frequency components of the MUSIC spectrum lie at 20, 40, ..., 800 Hz. The total number of these components is 40, and the eigenvalues are computed from the autocorrelation matrix of the received signal. The number of harmonics contained in the spectrum is an important parameter of the MUSIC algorithm: if it is set too large, the spectrum is easily affected by noise, and if it is set too small, the spectral estimation becomes inaccurate and the error increases. In our experiments, the set of dominant eigenvalues {λ_k} that span the signal subspace is chosen so as to satisfy λ_1 ≥ λ_k ≥ λ_1/8, where λ_1 is the eigenvalue of the first fundamental component. The FFT is applied to the logarithm of the MUSIC power spectrum and the peak location determines the estimated fundamental frequency. To reduce the computational cost, the fundamental frequency is first estimated with a precision of 20 Hz; it is then refined by searching the pseudospectrum of the signal with 1 Hz precision in the vicinity of 80 Hz around the pre-estimated fundamental frequency. The frequency corresponding to this local maximum is taken as the fundamental frequency.
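A simplified sketch of this coarse-to-fine MUSIC search is given below. It builds a Toeplitz autocorrelation matrix directly in the time domain, splits the signal and noise subspaces with the λ_k ≥ λ_1/8 rule, and peak-picks the pseudospectrum on a 20 Hz grid before refining with 1 Hz resolution. The covariance order, the ±40 Hz refinement window and the direct peak-picking (instead of the FFT-of-log-spectrum step described above) are simplifying assumptions.

import numpy as np
from scipy.linalg import toeplitz, eigh

def music_f0(frame, fs, order=40):
    # Coarse-to-fine fundamental frequency search on a MUSIC pseudospectrum.
    frame = frame - np.mean(frame)
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order)]) / n
    R = toeplitz(r)                                  # autocorrelation (covariance) matrix

    w, V = eigh(R)                                   # eigenvalues in ascending order
    w, V = w[::-1], V[:, ::-1]
    p = int(np.sum(w >= w[0] / 8.0))                 # signal subspace: lambda_k >= lambda_1 / 8
    E_n = V[:, p:]                                   # noise subspace

    def pseudospectrum(freqs):
        k = np.arange(order)
        A = np.exp(2j * np.pi * np.outer(k, freqs) / fs)   # complex-sinusoid steering vectors
        return 1.0 / (np.sum(np.abs(E_n.conj().T @ A) ** 2, axis=0) + 1e-12)

    f_coarse = np.arange(20.0, 801.0, 20.0)          # 20 Hz grid over 20-800 Hz
    f0 = f_coarse[np.argmax(pseudospectrum(f_coarse))]
    f_fine = np.arange(max(20.0, f0 - 40.0), f0 + 40.0 + 1.0, 1.0)   # 1 Hz refinement
    return f_fine[np.argmax(pseudospectrum(f_fine))]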
Since the room can be modeled as a linear system, the frequency content of the received signal is similar to that of the original sound; it is only distorted in amplitude and phase. Therefore, reverberation converts the global maximum of the spectrum into a local maximum with no frequency displacement.

Through a large number of experiments we have verified the robustness of the algorithm in different reverberant and noisy environments. The algorithm was also verified to be robust to sudden closures, such as in a vowel-to-nasal transition, where the waveform periodicity is reduced but the fundamental frequency does not change.

After estimation of the fundamental frequency, the algorithm measures the voicing level in each frequency band. An accurate measure of voicing level has been applied in multi-band excitation (MBE) coders [26]. The voicing decision is made by calculating the normalized error E_l between the original and the modeled speech spectrum in each frequency band of the fundamental frequency harmonics:

$$E_l = \frac{\sum_{\omega=a_l}^{b_l} |X(\omega) - \hat X(\omega, \omega_0)|^2}{\sum_{\omega=a_l}^{b_l} |X(\omega)|^2} \qquad (4)$$

where X(ω) is the speech spectrum of the received signal at the reference microphone channel (#5), ω_0 is the fundamental frequency, a_l and b_l are the first and last harmonics in the lth band, and X̂(ω, ω_0) is the estimated speech spectrum, calculated in each frequency band as the spectral shape of a Hanning window with a constant amplitude.
To determine the voicing decision, the normalized error E_l of the lth frequency band is compared with an adaptive threshold [27]. If the normalized error is below the threshold, the corresponding frequency band belongs to the target voice and is reconstructed in the split bands based on the fundamental frequency harmonics.

Since higher harmonics are more susceptible to reverberation and acoustic noise [28], the voicing decision for the frame is carried out on the majority of the lower half of the speech frequency band. For those intervals in which all of the speakers are talking simultaneously, the speech frames lose their periodicity and these frames are not involved in the other phases of the VSL processing.

The speech signal is distorted by acoustic noise. This distortion can be reduced in the voiced parts by precise extraction of the fundamental frequency and then using it to reconstruct the speech spectrum. The split-band mixed voicing decision calculated for each frequency band is utilized to synthesize the voiced speech spectrum. Each harmonic band has a shape similar to the spectral shape of the window used prior to the Fourier transform, whereas the non-voiced bands are random in nature. Therefore, a voiced harmonic band can be synthesized accurately as the frequency response of a suitable window, centered at the harmonic of the fundamental frequency corresponding to that band, multiplied by a constant amplitude measured with respect to the original signal [29].

Reconstruction of the harmonic bands is given by Eq. (5) and is performed up to the highest
voiced band of the speech spectrum:

$$\hat X(\omega, \omega_0) = A_{k,\omega_0}\, W(\omega), \quad 1 \le k \le K, \; \lceil a_k \rceil \le \omega < \lceil b_k \rceil \qquad (5)$$

where a_k = (k − 0.5)ω_0, b_k = (k + 0.5)ω_0, ⌈·⌉ denotes rounding up to the nearest integer, K is the number of harmonics in the 8 kHz speech bandwidth, W(ω) is the frequency response of the Hanning window centered at the kth harmonic of the fundamental frequency, and A_{k,ω_0} is the kth harmonic amplitude, defined as

$$A_{k,\omega_0} = \frac{\sum_{\omega=\lceil a_k \rceil}^{\lceil b_k \rceil} X(\omega)\, W(\omega)}{\sum_{\omega=\lceil a_k \rceil}^{\lceil b_k \rceil} |W(\omega)|^2} \qquad (6)$$

For the concatenation of the reconstructed successive frames, we use linear interpolation to remove frequency mismatches [30]. Fig. 3 shows the spectrograms of a clean speech signal, its noisy far-field counterpart and the speech synthesized from the noisy signal; it illustrates how the reconstruction procedure reduces the acoustic noise and retrieves the harmonicity of voiced speech.
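The following sketch ties Eqs. (4)-(6) together for a single frame: for each harmonic band it fits the shifted Hanning-window spectrum to the observed spectrum (Eq. (6)), synthesizes the band (Eq. (5)) and computes the normalized error E_l (Eq. (4)). The FFT length, the fixed error threshold in place of the adaptive threshold of [27], and the complex least-squares amplitude fit are illustrative assumptions.

import numpy as np

def split_band_analysis(frame, fs, f0, n_fft=1024, threshold=0.35):
    # Voicing error E_l (Eq. 4) per harmonic band and voiced-band synthesis (Eqs. 5-6).
    win = np.hanning(len(frame))
    X = np.fft.rfft(frame * win, n_fft)
    W0 = np.fft.fft(win, n_fft)                      # spectrum of the analysis window

    X_hat = X.copy()
    errors = []
    K = int((fs / 2) // f0)                          # harmonics inside the analysed bandwidth
    for k in range(1, K + 1):
        lo = int(np.ceil((k - 0.5) * f0 * n_fft / fs))
        hi = min(int(np.ceil((k + 0.5) * f0 * n_fft / fs)), len(X))
        if lo >= hi:
            break
        c = int(round(k * f0 * n_fft / fs))          # bin of the kth harmonic
        W = np.roll(W0, c)[lo:hi]                    # window response centred on the harmonic
        A_k = np.vdot(W, X[lo:hi]) / (np.vdot(W, W).real + 1e-12)   # Eq. (6), least-squares amplitude
        band_hat = A_k * W                           # Eq. (5)
        E_l = np.sum(np.abs(X[lo:hi] - band_hat) ** 2) / (np.sum(np.abs(X[lo:hi]) ** 2) + 1e-12)
        errors.append(E_l)                           # Eq. (4)
        if E_l < threshold:                          # fixed threshold standing in for the adaptive one [27]
            X_hat[lo:hi] = band_hat                  # voiced band: replace by the harmonic model
    return np.asarray(errors), np.fft.irfft(X_hat, n_fft)[:len(frame)]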
2.3. Speaker verification

Mixture models belong to a family of density models that consist of a number of component functions, usually Gaussian. The distribution of the feature vectors extracted from a speaker's speech is modeled by a Gaussian mixture density, an approach that has proven to be one of the most successful for text-independent speaker verification. We have therefore implemented speaker modeling based on Gaussian Mixture Models (GMM). In this algorithm, Gaussian mixtures are used to model arbitrary densities of the speech signal [31–33].

Fig. 3. Spectrograms of (a) clean close-talk speech, (b) the far-field signal and (c) the synthesized speech.

Fig. 4. Block diagram of the two phases of the speaker verification system. Training phase: speech signal → silence removal → feature extraction → noise reduction → training of a GMM → trained model. Verification phase: speech signal x → silence removal → feature extraction → noise reduction → comparison of Λ(x) with a threshold → accept/reject.

A block diagram of the implemented speaker verification system is shown in Fig. 4. There are two steps in the
process of speaker verification. The first step, called the training phase, requires each registered speaker to provide speech samples. After removal of the silence intervals, MFCC (Mel frequency cepstral coefficient) features are extracted from the speech frames and the effect of channel distortion is reduced by cepstral mean subtraction (CMS). To build a reference model for every speaker, the parameters of a GMM are calculated for each speaker by determining the mean vectors, covariance matrices and mixture weights of the Gaussian densities. In addition, a threshold is set from the training samples; this threshold is important for the final rejection or acceptance of a user and is independent of the acoustic characteristics of the environment. The second phase is the actual verification, where the input speech is compared with the stored reference models and the decision is made to accept or reject the speaker.

It should be noted that, to enhance speaker recognition, we have developed a gender recognition pre-step based on the maximum a posteriori probability of the observation sequence. It is based on a GMM and is trained separately for female and male speakers.
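As a concrete illustration of the feature pipeline used in both phases (silence removal aside), the short sketch below extracts MFCCs and applies cepstral mean subtraction. The librosa call and the 25 ms / 10 ms framing are assumptions made for the example; the paper does not specify its exact analysis parameters.

import numpy as np
import librosa

def mfcc_cms(signal, fs, n_mfcc=13):
    # MFCC features followed by cepstral mean subtraction (CMS).
    mfcc = librosa.feature.mfcc(y=np.asarray(signal, dtype=float), sr=fs, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * fs), hop_length=int(0.010 * fs))
    return (mfcc - mfcc.mean(axis=1, keepdims=True)).T   # frames x coefficients, per-coefficient mean removed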
2.3.1. Decision parameter

The general approach proposed by Reynolds et al. [34] for speaker verification is to apply a likelihood ratio test to an input utterance to determine whether the claimed speaker should be accepted or rejected. Given a speech utterance x, the claimed speaker is associated with the corresponding model C_c and an anti-model C̄_c. Discarding the constant prior probabilities for claimant and imposter speakers, the likelihood ratio in the log domain becomes

$$\Lambda(x) = \log p(x \mid C_c) - \log p(x \mid \bar C_c) \qquad (7)$$

The term p(x|C_c) is the likelihood of the utterance if it comes from the claimed speaker, and p(x|C̄_c) is its likelihood if it does not. The likelihood ratio is compared with a threshold θ and the test speaker is accepted if Λ(x) > θ; otherwise it is rejected.

To increase system accuracy, we first find the model Ĉ that has the maximum a posteriori probability for the observed sequence using Eq. (8). Only if this model is the model of the claimed speaker is the likelihood ratio of Eq. (7) calculated and the verification decision made; otherwise, the claimed speaker is rejected:

$$\hat C = \arg\max_{1 \le i \le N} \log p(x \mid C_i) \qquad (8)$$

where p(x|C_i) is the likelihood of the input vector x for the mixture model C_i of speaker i, and N is the total number of registered speakers. A block diagram of the proposed speaker verification model is shown in Fig. 5. The model is able to prevent misdetection between two speakers with similar voices by finding the model associated with the imposter speaker and rejecting the claim prior to the computation of the threshold function. This sub-system has been tested on-line and has proven to be highly robust and accurate in practice [35].

Fig. 5. The proposed speaker verification model: after silence removal, feature extraction and noise reduction, gender recognition and a maximum-likelihood search over the GMMs trained in the first phase are performed; the claim is accepted only if the recognized gender and the best-matching model agree with the claimed identity and the likelihood-ratio test against the threshold is passed.
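A compact sketch of the decision logic of Eqs. (7) and (8) is shown below, using scikit-learn Gaussian mixtures: one GMM per registered speaker, a pooled background model used as the anti-model, the arg-max check of Eq. (8) and the log-likelihood ratio test of Eq. (7). The number of mixture components, the diagonal covariances, the pooled anti-model and the omission of the gender pre-step are assumptions made for brevity.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_models(features_per_speaker, n_components=32):
    # One GMM per registered speaker plus a pooled background model used as the anti-model.
    models = {spk: GaussianMixture(n_components, covariance_type='diag', max_iter=200).fit(f)
              for spk, f in features_per_speaker.items()}
    background = GaussianMixture(n_components, covariance_type='diag', max_iter=200).fit(
        np.vstack(list(features_per_speaker.values())))
    return models, background

def verify(x, claimed, models, background, theta):
    # Eq. (8): the best-matching model must be the claimed one; Eq. (7): log-likelihood ratio test.
    best = max(models, key=lambda spk: models[spk].score(x))      # arg max_i log p(x | C_i)
    if best != claimed:
        return False
    ratio = models[claimed].score(x) - background.score(x)        # Lambda(x) = log p(x|C_c) - log p(x|anti-model)
    return ratio > theta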
2.4. SRP sound localization

The source location in spherical coordinates is represented by range r, azimuth θ and elevation φ. If the source range exceeds the array dimension beyond a specific threshold [36], the wave front is received in an approximately planar form and the range estimate becomes ambiguous. To alleviate this ambiguity, the source position is specified only by θ and φ, and its direction vector is defined as

$$\vec z_o^{(s)} = \begin{bmatrix} \cos\varphi \sin\theta \\ \cos\varphi \cos\theta \\ \sin\varphi \end{bmatrix} \qquad (9)$$

In this case the steering delay of microphone m relative to a reference microphone is calculated through

$$\Delta_m = \left\lceil \frac{d_{rm} \cos\beta}{c}\, F_s \right\rceil \qquad (10)$$

where d_{rm} is the distance between the reference microphone and microphone m, c is the speed of sound, β is the angle between the source direction and the line connecting the two microphones, F_s is the sampling frequency and ⌈·⌉ denotes rounding up to the nearest integer.
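The direction vector of Eq. (9) and the integer steering delays of Eq. (10) can be computed as in the sketch below; the helper name, the far-field projection used for d_rm cos β and the sign convention are choices made only for this illustration.

import numpy as np

def steering_delays(mic_positions, ref_index, theta, phi, fs, c=343.0):
    # Far-field steering delays of Eq. (10) for a source at azimuth theta / elevation phi (Eq. 9).
    z = np.array([np.cos(phi) * np.sin(theta),       # unit direction vector of Eq. (9)
                  np.cos(phi) * np.cos(theta),
                  np.sin(phi)])
    d = np.asarray(mic_positions) - np.asarray(mic_positions)[ref_index]   # baselines to the reference
    # (d @ z) equals d_rm * cos(beta): the projection of each baseline onto the propagation direction.
    return np.ceil((d @ z) / c * fs).astype(int)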
In the SRP-based algorithm, the beam pattern of the microphone array is steered towards candidate positions (the so-called beamforming), the output power of the beamformer is computed, and the source position is determined from the maximum of the beamformer output power using ML estimation. The output of a filter-and-sum beamformer in the frequency domain is defined as

$$Y(\omega) = \sum_{m=1}^{M} G_m(\omega)\, X_m(\omega)\, e^{j\omega\Delta_m} \qquad (11)$$

where X_m(ω) is the discrete Fourier transform of the signal received at microphone m, G_m(ω) is the corresponding filter for microphone m, and Δ_1, ..., Δ_M are the integer steering delays derived from Eq. (10) for candidate positions in space. To reduce the error introduced by rounding Δ_m to integer values, the signal is upsampled to 96 kHz prior to localization. The SRP is calculated by

$$P_{\Delta_1 \ldots \Delta_M} = Y(\omega)\, Y^{T}(\omega) \qquad (12)$$

where Y(ω) is the horizontal output vector of the beamformer and Y^T is its transpose. Although the steering delays are continuous variables, Eq. (10) is computed for discrete locations in space.

In calculating the SRP, the choice of a suitable filter has a considerable impact on the robustness of the localization to both noise and reverberation. In the well-known SRP-PHAT algorithm [10], the filter used at each channel is given by

$$G_m(\omega) = \frac{1}{|X_m(\omega)|}, \quad m = 1, \ldots, M \qquad (13)$$
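The following sketch evaluates the steered response power of Eqs. (11)-(13) with PHAT weighting over a grid of candidate directions. It reuses the hypothetical steering_delays helper from the sketch in Section 2.4 and applies the steering as a continuous phase shift, which sidesteps the 96 kHz upsampling mentioned above; the frame layout and the candidate grid are assumptions.

import numpy as np

def srp_phat(frames, mic_positions, ref_index, fs, candidates, c=343.0):
    # Steered response power with PHAT weighting (Eqs. 11-13) over candidate directions.
    # frames: (M, N) time-aligned microphone frames; candidates: iterable of (theta, phi) pairs.
    M, N = frames.shape
    X = np.fft.rfft(frames, axis=1)
    Xw = X / (np.abs(X) + 1e-12)                     # G_m(omega) = 1 / |X_m(omega)| applied to X_m
    omega = 2.0 * np.pi * np.fft.rfftfreq(N, 1.0 / fs)

    powers = []
    for theta, phi in candidates:
        delays = steering_delays(mic_positions, ref_index, theta, phi, fs, c) / fs
        Y = np.sum(Xw * np.exp(1j * omega[None, :] * delays[:, None]), axis=0)   # Eq. (11)
        powers.append(float(np.sum(np.abs(Y) ** 2)))  # Eq. (12), accumulated over frequency
    return np.array(powers)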
2.4.1. A new filtering scheme exploiting harmonic structures

Since the signal received at the microphone array is contaminated by multipath and noise components, the periodicity of the voiced segments is reduced. This phenomenon can be seen in Fig. 6, where T60 denotes the room reverberation time. The decay curve is measured as 10 times the logarithm of the normalized squared impulse response amplitude, and T60 is the time needed for this value to decay by 60 dB [37].
The side effect of reverberation is presented in Fig. 6: speech frames with periodic structures are less influenced by reverberation and noise and should therefore receive higher weights than the other parts in the localization algorithm.

Fig. 6. Periodicity degradation due to noise and reverberation: upper graph, initial signal of the word /a/; middle graph, signal affected by reverberation; lower graph, signal affected by reverberation and noise.

An accurate measurement of the voicing level in each frequency band was described in Section 2.2. As mentioned, the voicing decision is made by calculating the normalized error E_l between the original and the modeled speech spectrum for each frequency band l. For voiced frames, E_l has a value close to zero, whereas values near 1 correspond to noisy and non-periodic intervals. Therefore, the error calculated from Eq. (4) is used to measure the degree of periodicity of each frequency band and can be employed in a filtering scheme for SRP localization. The filter employed at each channel is

$$G_{l,m}(\omega) = \frac{1 - E_l}{|X_m(\omega)|}, \quad \omega \in [a_l, b_l] \qquad (14)$$

where E_l is calculated for the signal X_m(ω) received at microphone m. This filter emphasizes the voiced frames. Furthermore, the influence of the signal amplitude is removed and only the phase information is used in the localization process, which improves the robustness of the algorithm to both noise and reverberation.

In practice, for small arrays it is sufficient to compute the fundamental frequency harmonics at the reference microphone and then to calculate the error for each frequency band using this pitch period. Therefore, if a channel signal is degraded for some reason, its influence is reduced. The filter is employed in the beamforming algorithm, the output of the steered array is computed by Eq. (15), and its power is calculated for that particular point in space. We name the proposed method SRP-H, as it is based on beamforming and analysis of the fundamental frequency harmonics of the speech signal:

$$Y^{\mathrm{SRP\text{-}H}}_{\Delta_1 \ldots \Delta_M}(\omega) = \sum_{l=1}^{L} \sum_{m=1}^{M} \frac{1 - E_{l,m}}{|X_m(\omega)|}\, X_m(\omega)\, e^{j\omega\Delta_m} \qquad (15)$$
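For completeness, a sketch of the SRP-H steering of Eqs. (14) and (15) is given below. It assumes that the per-band errors E_{l,m} and the band edges (expressed here as FFT-bin ranges) have been computed with the split-band analysis of Section 2.2, and it again reuses the hypothetical steering_delays helper; the data layout is an assumption of the example, not a specification from the paper.

import numpy as np

def srp_h(frames, band_errors, band_edges, mic_positions, ref_index, fs, candidates, c=343.0):
    # SRP-H steering of Eq. (15): phase-only normalization weighted by the voicing factor (1 - E_l).
    # band_errors: (M, L) errors E_{l,m}; band_edges: list of (lo, hi) FFT-bin ranges per band.
    M, N = frames.shape
    X = np.fft.rfft(frames, axis=1)
    G = np.zeros(X.shape)
    for m in range(M):
        for l, (lo, hi) in enumerate(band_edges):
            G[m, lo:hi] = (1.0 - band_errors[m, l]) / (np.abs(X[m, lo:hi]) + 1e-12)   # Eq. (14)
    omega = 2.0 * np.pi * np.fft.rfftfreq(N, 1.0 / fs)

    powers = []
    for theta, phi in candidates:
        delays = steering_delays(mic_positions, ref_index, theta, phi, fs, c) / fs
        Y = np.sum(G * X * np.exp(1j * omega[None, :] * delays[:, None]), axis=0)     # Eq. (15)
        powers.append(float(np.sum(np.abs(Y) ** 2)))
    return np.array(powers)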
3. Test scenario and the tentative results

The performance of the VSL front-end was evaluated under different conditions of noise and reverberation, and the impact of these acoustic parameters on a VSL-based ASR system is quantified in this section. Fig. 7 shows a VSL-based ASR system.

The original speech was chosen from the TIMIT database, and the far-field signal received at each microphone channel was simulated based on the model explained in Section 2.1. The TIMIT speech data was recorded with a close-talking microphone at a sampling frequency of 16 kHz. The New York City subset, comprising 13 female and 22 male speakers, was used. This database was divided into a training set and a testing set. The training set for each speaker comprised ten sentences and the
