`Speech Synthesis and
`Recognition
`
Second Edition

John Holmes and Wendy Holmes

Taylor & Francis
London and New York
`
`
`
`First edition by the late Dr J.N. Holmes published 1988 by Van Nostrand
`Reinhold
`Second edition published 2001 by Taylor & Francis
`11 New Fetter Lane, London EC4P 4EE
`
`Simultaneously published in the USA and Canada
`by Taylor & Francis
`29 West 35th Street, New York, NY 10001
`
`Taylor & Francis is an imprint of the Taylor & Francis Group
`
© 2001 Wendy J. Holmes
`
`Publisher's Note
`This book has been prepared from camera-ready copy provided by the authors.
Printed and bound in Great Britain by Biddles Ltd, Guildford and King's Lynn
`
`All rights reserved. No part of this book may be reprinted or reproduced or
`utilised in any form or by any electronic, mechanical, or other means, now
`known or hereafter invented, including photocopying and recording, or in any
`information storage or retrieval system, without permission in writing from the
`publishers.
`
`Every effort has been made to ensure that the advice and information in this
`book is true and accurate at the time of going to press. However, neither the
`publisher nor the authors can accept any legal responsibility or liability for any
`errors or omissions that may be made. In the case of drug administration, any
`medical procedure or the use of technical equipment mentioned within this
`book, you are strongly advised to consult the manufacturer's guidelines.
`
British Library Cataloguing in Publication Data
`A catalogue record for this book is available from the British Library
`
Library of Congress Cataloging in Publication Data
`
`Holmes, J.N.
`Speech synthesis and recognition/John Holmes and Wendy Holmes.--2nd ed
`p.cm.
`Includes bibliographical references and index.
ISBN 0-7484-0856-8 (hc.) -- ISBN 0-7484-0857-6 (pbk.)
`1. Speech processing systems. I. Holmes, Wendy (Wendy J.) II. Title.
`
TK7882.S65 H64 2002
`006.4'54--dc21
`
`ISBN 0-7484-0856-8 (hbk)
`ISBN 0-7484-0857-6 (pbk)
`
`2001044279
`
`
`
`
CONTENTS

Preface to the First Edition
Preface to the Second Edition
List of Abbreviations

1 Human Speech Communication
1.1 Value of speech for human-machine communication
1.2 Ideas and language
1.3 Relationship between written and spoken language
1.4 Phonetics and phonology
1.5 The acoustic signal
1.6 Phonemes, phones and allophones
1.7 Vowels, consonants and syllables
1.8 Phonemes and spelling
1.9 Prosodic features
1.10 Language, accent and dialect
1.11 Supplementing the acoustic signal
1.12 The complexity of speech processing
Chapter 1 summary
Chapter 1 exercises

2 Mechanisms and Models of Human Speech Production
2.1 Introduction
2.2 Sound sources
2.3 The resonant system
2.4 Interaction of laryngeal and vocal tract functions
2.5 Radiation
2.6 Waveforms and spectrograms
2.7 Speech production models
2.7.1 Excitation models
2.7.2 Vocal tract models
Chapter 2 summary
Chapter 2 exercises

3 Mechanisms and Models of the Human Auditory System
3.1 Introduction
3.2 Physiology of the outer and middle ears
3.3 Structure of the cochlea
3.4 Neural response
3.5 Psychophysical measurements
3.6 Analysis of simple and complex signals
3.7 Models of the auditory system
3.7.1 Mechanical filtering
3.7.2 Models of neural transduction
3.7.3 Higher-level neural processing
Chapter 3 summary
Chapter 3 exercises

4 Digital Coding of Speech
4.1 Introduction
4.2 Simple waveform coders
4.2.1 Pulse code modulation
4.2.2 Delta modulation
4.3 Analysis/synthesis systems (vocoders)
4.3.1 Channel vocoders
4.3.2 Sinusoidal coders
4.3.3 LPC vocoders
4.3.4 Formant vocoders
4.3.5 Efficient parameter coding
4.3.6 Vocoders based on segmental/phonetic structure
4.4 Intermediate systems
4.4.1 Sub-band coding
4.4.2 Linear prediction with simple coding of the residual
4.4.3 Adaptive predictive coding
4.4.4 Multipulse LPC
4.4.5 Code-excited linear prediction
4.5 Evaluating speech coding algorithms
4.5.1 Subjective speech intelligibility measures
4.5.2 Subjective speech quality measures
4.5.3 Objective speech quality measures
4.6 Choosing a coder
Chapter 4 summary
Chapter 4 exercises

5 Message Synthesis from Stored Human Speech Components
5.1 Introduction
5.2 Concatenation of whole words
5.2.1 Simple waveform concatenation
5.2.2 Concatenation of vocoded words
5.2.3 Limitations of concatenating word-size units
5.3 Concatenation of sub-word units: general principles
5.3.1 Choice of sub-word unit
5.3.2 Recording and selecting data for the units
5.3.3 Varying durations of concatenative units
5.4 Synthesis by concatenating vocoded sub-word units
5.5 Synthesis by concatenating waveform segments
5.5.1 Pitch modification
5.5.2 Timing modification
5.5.3 Performance of waveform concatenation
5.6 Variants of concatenative waveform synthesis
5.7 Hardware requirements
Chapter 5 summary
Chapter 5 exercises

6 Phonetic synthesis by rule
6.1 Introduction
6.2 Acoustic-phonetic rules
6.3 Rules for formant synthesizers
6.4 Table-driven phonetic rules
6.4.1 Simple transition calculation
6.4.2 Overlapping transitions
6.4.3 Using the tables to generate utterances
6.5 Optimizing phonetic rules
6.5.1 Automatic adjustment of phonetic rules
6.5.2 Rules for different speaker types
6.5.3 Incorporating intensity rules
6.6 Current capabilities of phonetic synthesis by rule
Chapter 6 summary
Chapter 6 exercises

7 Speech Synthesis from Textual or Conceptual Input
7.1 Introduction
7.2 Emulating the human speaking process
7.3 Converting from text to speech
7.3.1 TTS system architecture
7.3.2 Overview of tasks required for TTS conversion
7.4 Text analysis
7.4.1 Text pre-processing
7.4.2 Morphological analysis
7.4.3 Phonetic transcription
7.4.4 Syntactic analysis and prosodic phrasing
7.4.5 Assignment of lexical stress and pattern of word accents
7.5 Prosody generation
7.5.1 Timing pattern
7.5.2 Fundamental frequency contour
7.6 Implementation issues
7.7 Current TTS synthesis capabilities
7.8 Speech synthesis from concept
Chapter 7 summary
Chapter 7 exercises

8 Introduction to automatic speech recognition: template matching
8.1 Introduction
8.2 General principles of pattern matching
8.3 Distance metrics
8.3.1 Filter-bank analysis
8.3.2 Level normalization
8.4 End-point detection for isolated words
8.5 Allowing for timescale variations
8.6 Dynamic programming for time alignment
8.7 Refinements to isolated-word DP matching
8.8 Score pruning
8.9 Allowing for end-point errors
8.10 Dynamic programming for connected words
8.11 Continuous speech recognition
8.12 Syntactic constraints
8.13 Training a whole-word recognizer
Chapter 8 summary
Chapter 8 exercises

9 Introduction to stochastic modelling
9.1 Feature variability in pattern matching
9.2 Introduction to hidden Markov models
9.3 Probability calculations in hidden Markov models
9.4 The Viterbi algorithm
9.5 Parameter estimation for hidden Markov models
9.5.1 Forward and backward probabilities
9.5.2 Parameter re-estimation with forward and backward probabilities
9.5.3 Viterbi training
9.6 Vector quantization
9.7 Multi-variate continuous distributions
9.8 Use of normal distributions with HMMs
9.8.1 Probability calculations
9.8.2 Estimating the parameters of a normal distribution
9.8.3 Baum-Welch re-estimation
9.8.4 Viterbi training
9.9 Model initialization
9.10 Gaussian mixtures
9.10.1 Calculating emission probabilities
9.10.2 Baum-Welch re-estimation
9.10.3 Re-estimation using the most likely state sequence
9.10.4 Initialization of Gaussian mixture distributions
9.10.5 Tied mixture distributions
9.11 Extension of stochastic models to word sequences
9.12 Implementing probability calculations
9.12.1 Using the Viterbi algorithm with probabilities in logarithmic form
9.12.2 Adding probabilities when they are in logarithmic form
9.13 Relationship between DTW and a simple HMM
9.14 State durational characteristics of HMMs
Chapter 9 summary
Chapter 9 exercises

10 Introduction to front-end analysis for automatic speech recognition
10.1 Introduction
10.2 Pre-emphasis
10.3 Frames and windowing
10.4 Filter banks, Fourier analysis and the mel scale
10.5 Cepstral analysis
10.6 Analysis based on linear prediction
10.7 Dynamic features
10.8 Capturing the perceptually relevant information
10.9 General feature transformations
10.10 Variable-frame-rate analysis
Chapter 10 summary
Chapter 10 exercises

11 Practical techniques for improving speech recognition performance
11.1 Introduction
11.2 Robustness to environment and channel effects
11.2.1 Feature-based techniques
11.2.2 Model-based techniques
11.2.3 Dealing with unknown or unpredictable noise corruption
11.3 Speaker-independent recognition
11.3.1 Speaker normalization
11.4 Model adaptation
11.4.1 Bayesian methods for training and adaptation of HMMs
11.4.2 Adaptation methods based on linear transforms
11.5 Discriminative training methods
11.5.1 Maximum mutual information training
11.5.2 Training criteria based on reducing recognition errors
11.6 Robustness of recognizers to vocabulary variation
Chapter 11 summary
Chapter 11 exercises

12 Automatic speech recognition for large vocabularies
12.1 Introduction
12.2 Historical perspective
12.3 Speech transcription and speech understanding
12.4 Speech transcription
12.5 Challenges posed by large vocabularies
12.6 Acoustic modelling
12.6.1 Context-dependent phone modelling
12.6.2 Training issues for context-dependent models
12.6.3 Parameter tying
12.6.4 Training procedure
12.6.5 Methods for clustering model parameters
12.6.6 Constructing phonetic decision trees
12.6.7 Extensions beyond triphone modelling
12.7 Language modelling
12.7.1 N-grams
12.7.2 Perplexity and evaluating language models
12.7.3 Data sparsity in language modelling
12.7.4 Discounting
12.7.5 Backing off in language modelling
12.7.6 Interpolation of language models
12.7.7 Choice of more general distribution for smoothing
12.7.8 Improving on simple N-grams
12.8 Decoding
12.8.1 Efficient one-pass Viterbi decoding for large vocabularies
12.8.2 Multiple-pass Viterbi decoding
12.8.3 Depth-first decoding
12.9 Evaluating LVCSR performance
12.9.1 Measuring errors
12.9.2 Controlling word insertion errors
12.9.3 Performance evaluations
12.10 Speech understanding
12.10.1 Measuring and evaluating speech understanding performance
Chapter 12 summary
Chapter 12 exercises

13 Neural networks for speech recognition
13.1 Introduction
13.2 The human brain
13.3 Connectionist models
13.4 Properties of ANNs
13.5 ANNs for speech recognition
13.5.1 Hybrid HMM/ANN methods
Chapter 13 summary
Chapter 13 exercises

14 Recognition of speaker characteristics
14.1 Characteristics of speakers
14.2 Verification versus identification
14.2.1 Assessing performance
14.2.2 Measures of verification performance
14.3 Speaker recognition
14.3.1 Text dependence
14.3.2 Methods for text-dependent/text-prompted speaker recognition
14.3.3 Methods for text-independent speaker recognition
14.3.4 Acoustic features for speaker recognition
14.3.5 Evaluations of speaker recognition performance
14.4 Language recognition
14.4.1 Techniques for language recognition
14.4.2 Acoustic features for language recognition
Chapter 14 summary
Chapter 14 exercises

15 Applications and performance of current technology
15.1 Introduction
15.2 Why use speech technology?
15.3 Speech synthesis technology
15.4 Examples of speech synthesis applications
15.4.1 Aids for the disabled
15.4.2 Spoken warning signals, instructions and user feedback
15.4.3 Education, toys and games
15.4.4 Telecommunications
15.5 Speech recognition technology
15.5.1 Characterizing speech recognizers and recognition tasks
15.5.2 Typical recognition performance for different tasks
15.5.3 Achieving success with ASR in an application
15.6 Examples of ASR applications
15.6.1 Command and control
15.6.2 Education, toys and games
15.6.3 Dictation
15.6.4 Data entry and retrieval
15.6.5 Telecommunications
15.7 Applications of speaker and language recognition
15.8 The future of speech technology applications
Chapter 15 summary
Chapter 15 exercises

16 Future research directions in speech synthesis and recognition
16.1 Introduction
16.2 Speech synthesis
16.2.1 Speech sound generation
16.2.2 Prosody generation and higher-level linguistic processing
16.3 Automatic speech recognition
16.3.1 Advantages of statistical pattern-matching methods
16.3.2 Limitations of HMMs for speech recognition
16.3.3 Developing improved recognition models
16.4 Relationship between synthesis and recognition
16.5 Automatic speech understanding
Chapter 16 summary
Chapter 16 exercises

17 Further Reading
17.1 Books
17.2 Journals
17.3 Conferences and workshops
17.4 The Internet
17.5 Reading for individual chapters

References
Solutions to Exercises
Glossary
Index
`
`
`
CHAPTER 8
`
`Introduction to Automatic Speech
`Recognition: Template Matching
`
`8.1 INTRODUCTION
`
`Much of the early work on automatic speech recognition (ASR), starting in the
`1950s, involved attempting
`to apply rules based either on acoustic/phonetic
`knowledge or in many cases on simple ad hoc measurements of properties of the
`speech signal for different types of speech sound. The intention was to decode the
`signal directly into a sequence of phoneme-like units. These early methods,
extensively reviewed by Hyde (1972), achieved very little success. The poor results
`were mainly because co-articulation causes the acoustic properties of individual
`phones to vary very widely, and any rule-based hard decisions about phone identity
`will often be wrong if they use only local information. Once wrong decisions have
`been made at an early stage, it is extremely difficult to recover from the errors later.
`An alternative to rule-based methods is to use pattern-matching techniques.
`Primitive pattern-matching approaches were being investigated at around the same
`time as the early rule-based methods, but major improvements in speech recognizer
`performance did not occur until more general pattern-matching techniques were
`invented. This chapter describes typical methods that were developed for spoken
`word recognition during the 1970s. Although these methods were widely used in
`commercial speech recognizers in the 1970s and 1980s, they have now been largely
superseded by more powerful methods (to be described in later chapters), which
`can be understood as a generalization of the simpler pattern-matching techniques
`introduced here. A thorough understanding of the principles of the first successful
`pattern-matching methods is thus a valuable introduction to the later techniques.
`
`8.2 GENERAL PRINCIPLES OF PATTERN MATCHING
`
`When a person utters a word, as we saw in Chapter 1, the word can be considered
as a sequence of phonemes (the linguistic units) and the phonemes will be realized
`as phones. Because of inevitable co-articulation, the acoustic patterns associated
`with individual phones overlap in time, and therefore depend on the identities of
their neighbours. Even for a word spoken in isolation, therefore, the acoustic
`pattern is related in a very complicated way to the word's linguistic structure.
`However, if the same person repeats the same isolated word on separate
`occasions, the pattern is likely to be generally similar, because the same phonetic
`relationships will apply. Of course, there will probably also be differences, arising
`from many causes. For example, the second occurrence might be spoken faster or
`more slowly; there may be differences in vocal effort; the pitch and its variation
during the word could be different; one example may be spoken more precisely
`than the other, etc. It is obvious that the waveform of separate utterances of the
`same word may be very different. There are likely to be more similarities between
`spectrograms because (assuming that a short time-window is used, see Section 2.6),
`they better illustrate the vocal-tract resonances, which are closely related to the
`positions of the articulators. But even spectrograms will differ in detail due to the
`above types of difference, and timescale differences will be particularly obvious.
`A well-established approach to ASR is to store in the machine example
acoustic patterns (called templates) for all the words to be recognized, usually
spoken by the person who will subsequently use the machine. Any incoming word
can then be compared in turn with all words in the store, and the one that is most
`similar is assumed to be the correct one. In general none of the templates will match
`perfectly, so to be successful this technique must rely on the correct word being
`more similar to its own template than to any of the alternatives.
`It is obvious that in some sense the sound pattern of the correct word is likely
`to be a better match than a wrong word, because it is made by more similar
`articulatory movements. Exploiting this similarity is, however, critically dependent
`on how the word patterns are compared, i.e. on how the 'distance' between two
word examples is calculated. For example, it would be useless to compare
waveforms, because even very similar repetitions of a word will differ appreciably
`in waveform detail from moment to moment, largely due to the difficulty of
`repeating the intonation and timing exactly.
`It is implicit in the above comments that it must also be possible to identify
`the start and end points of words that are to be compared.
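
As a concrete illustration (not from the book) of the template-matching scheme just
described, the following Python sketch picks the vocabulary word whose stored template
is closest to an incoming pattern. The template store and the word_distance function are
assumed to be supplied; how the distance itself should be computed is the subject of the
rest of this chapter.

```python
# Minimal sketch of whole-word template matching (illustrative assumption,
# not code from the book). `templates` maps each vocabulary word to a stored
# example pattern; `word_distance` is any function returning a distance
# between two word patterns, such as the frame-based one sketched in
# Section 8.3.
def recognize(incoming, templates, word_distance):
    """Return the word whose template is most similar to the incoming pattern."""
    return min(templates, key=lambda word: word_distance(incoming, templates[word]))
```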
`
`8.3 DISTANCE METRICS
`
`In this section we will consider the problem of comparing the templates with the
`incoming speech when we know that corresponding points
`in time will be
`associated with similar articulatory events. In effect, we appear to be assuming that
`the words to be compared are spoken in isolation at exactly the same speed, and
`that their start and end points can be reliably determined.
`In practice these
`assumptions will very rarely be justified, and methods of dealing with the resultant
`problems will be discussed later in the chapter.
`In calculating a distance between two words it is usual to derive a short-term
`distance that is local to corresponding parts of the words, and to integrate this
`distance over the entire word duration. Parameters representing the acoustic signal
`must be derived over some span of time, during which the properties are assumed
`not to change much. In one such span of time the measurements can be stored as a
`set of numbers, or feature vector, which may be regarded as representing a point
`in multi-dimensional space. The properties of a whole word can then be described
as a succession of feature vectors (often referred to as frames), each representing a
`time slice of, say, 10-20 ms. The integral of the distance between the patterns then
`reduces to a sum of distances between corresponding pairs of feature vectors. To be
`useful, the distance must not be sensitive to small differences in intensity between
`otherwise similar words, and it should not give too much weight to differences in
`pitch. Those features of the acoustic signal that are determined by the phonetic
`properties should obviously be given more weight in the distance calculation.
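
Under these simplifying assumptions the whole-word distance reduces to a sum over
frames, as in the sketch below (again an illustration rather than code from the book). It
assumes the two patterns contain the same number of frames, an assumption that is
relaxed later in the chapter, and takes the per-frame metric as a parameter.

```python
# Illustrative whole-word distance: the sum of per-frame distances between two
# patterns, each represented as a sequence of feature vectors (frames).
# `frame_distance` is any local metric, for example the squared Euclidean
# distance between log channel levels discussed in Section 8.3.2.
def word_distance(pattern_a, pattern_b, frame_distance):
    if len(pattern_a) != len(pattern_b):
        raise ValueError("equal-length patterns assumed; timescale differences are dealt with later")
    return sum(frame_distance(a, b) for a, b in zip(pattern_a, pattern_b))
```

With a particular frame metric fixed (for example using functools.partial), this function
could serve as the word_distance assumed in the recognition sketch of Section 8.2.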
`
`
8.3.1 Filter-bank analysis
`
`The most obvious approach in choosing a distance metric which has some of the
`desirable properties is to use some representation of the short-term power spectrum.
`It has been explained in Chapter 2 how the short-term spectrum can represent the
`effects of moving formants, excitation spectrum, etc.
`Although in tone languages pitch needs to be taken into account, in Western
`languages there is normally only slight correlation between pitch variations and the
`phonetic content of a word. The likely idiosyncratic variations of pitch that will
`occur from occasion to occasion mean that, except for tone languages, it is
`normally safer to ignore pitch in whole-word pattern-matching recognizers. Even
`for tone languages it is probably desirable to analyse pitch variations separately
`from effects due to the vocal tract configuration. It is best, therefore, to make the
`bandwidth of the spectral resolution such that it will not resolve the harmonics of
`the fundamental of voiced speech. Because the excitation periodicity is evident in
`the amplitude variations of the output from a broad-band analysis, it is also
`necessary to apply some time-smoothing to remove it. Such time-smoothing will
`also remove most of the fluctuations
`that result from randomness in turbulent
`excitation.
`At higher frequencies the precise formant positions become less significant,
and the resolving power of the ear (critical bandwidth - see Chapter 3) is such that
`detailed spectral information is not available to human listeners at high frequencies.
`It is therefore permissible to make the spectral analysis less selective, such that the
`effective filter bandwidth is several times the typical harmonic spacing. The desired
analysis can thus be provided by a set of bandpass filters whose bandwidths and
spacings are roughly equal to those of critical bands and whose range of centre
frequencies covers the frequencies most important for speech perception (say from
300 Hz up to around 5 kHz). The total number of band-pass filters is therefore not
likely to be more than about 20, and successful results have been achieved with as
few as 10. When the necessary time-smoothing is included, the feature vector will
represent the signal power in the filters averaged over the frame interval.

Figure 8.1 Spectrographic displays of a 10-channel filter-bank analysis (with a non-linear
frequency spacing of the channels), shown for one example of the word "three" and two
examples of the word "eight". It can be seen that the examples of "eight" are generally similar,
although the lower one has a shorter gap for the [t] and a longer burst.

`The usual name for this type of speech analysis is filter-bank analysis.
`Whether it is provided by a bank of discrete filters, implemented in analogue or
`digital form, or is implemented by sampling the outputs from short-term Fourier
`transforms, is a matter of engineering convenience. Figure 8.1 displays word
patterns from a typical 10-channel filter-bank analyser for two examples of one
`word and one example of another. It can be seen from the frequency scales that the
`channels are closer together in the lower-frequency regions.
`A consequence of removing the effect of the fundamental frequency and of
`using filters at least as wide as critical bands is to reduce the amount of information
`needed to describe a word pattern to much less than is needed for the waveform.
`Thus storage and computation in the pattern-matching process are much reduced.
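
The fragment below sketches one way such a filter-bank analysis might be computed
digitally, by sampling the outputs of short-term Fourier transforms and summing power
in about ten non-linearly spaced bands. The logarithmic band spacing, Hanning window
and 20 ms frame length are illustrative assumptions standing in for the critical-band-like
design described above, not values taken from the text.

```python
import numpy as np

def filterbank_features(signal, sample_rate, frame_ms=20, n_channels=10,
                        f_min=300.0, f_max=5000.0):
    """Per-frame band powers from a simple FFT-based filter-bank analysis.

    `signal` is a one-dimensional array of waveform samples.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    # Channel edges: roughly critical-band-like spacing, approximated here
    # by logarithmic spacing between f_min and f_max.
    edges = np.geomspace(f_min, f_max, n_channels + 1)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    window = np.hanning(frame_len)
    feats = np.zeros((n_frames, n_channels))
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2       # short-term power spectrum
        for ch in range(n_channels):
            in_band = (freqs >= edges[ch]) & (freqs < edges[ch + 1])
            # Power in the channel, averaged over the frame interval
            # (the frame itself provides the time-smoothing).
            feats[i, ch] = power[in_band].sum() / frame_len
    return feats
```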
`
`8.3.2 Level normalization
`
`Mean speech level normally varies by a few dB over periods of a few seconds, and
`changes in spacing between the microphone and the speaker's mouth can also cause
`changes of several dB. As these changes will be of no phonetic significance, it is
`desirable to minimize their effects on the distance metric. Use of filter-bank power
`directly gives most weight to more intense regions of the spectrum, where a change
`of 2 or 3 dB will represent a very large absolute difference. On the other hand, a
`3 dB difference in one of the weaker formants might be of similar phonetic
`significance, but will cause a very small effect on the power. This difficulty can be
`avoided to a large extent by representing the power logarithmically, so that similar
`power ratios have the same effect on the distance calculation whether they occur in
`intense or weak spectral regions. Most of the phonetically unimportant variations
`discussed above will then have much less weight in the distance calculation than the
`differences in spectrum level that result from formant movements, etc.
`Although comparing levels logarithmically is advantageous, care must be
exercised in very low-level sounds, such as weak fricatives or during stop-
`consonant closures. At these times the logarithm of the level in a channel will
`depend more on the ambient background noise level than on the speech signal. If
`the speaker is in a very quiet environment the logarithmic level may suffer quite
`wide irrelevant variations as a result of breath noise or the rustle of clothing. One
`way of avoiding this difficulty is to add a small constant to the measured level
`before taking logarithms. The value of the constant would be chosen to dominate
`the greatest expected background noise level, but to be small compared with the
`level usually found during speech.
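
A one-line sketch of this floor-and-log step is given below; the value of the constant is
an arbitrary placeholder and would in practice be chosen relative to the measured
background noise level, as described above.

```python
import numpy as np

def log_channel_levels(band_powers, noise_floor=1e-6):
    """Log levels with a small additive constant (assumed value), so that
    near-silent frames are dominated by the constant rather than by
    background noise or breath and clothing rustle."""
    return np.log(np.asarray(band_powers, dtype=float) + noise_floor)
```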
`Differences in vocal effort will mainly have the effect of adding a constant to
`all components of the log spectrum, rather than changing the shape of the spectrum
`cross-section. Such differences can be made to have no effect on the distance
metric by subtracting the mean of the logarithm of the spectrum level of each frame
`from all the separate spectrum components for the frame. In practice this amount of
`level compensation is undesirable because extreme level variations are of some
phonetic significance. For example, a substantial part of the acoustic difference
between [f] and any vowel is the difference in level, which can be as much as
30 dB. Recognition accuracy might well suffer if level differences of this
magnitude were ignored. A useful compromise is to compensate only partly for
level variations, by subtracting some fraction (say in the range 0.7 to 0.9) of the
mean logarithmic level from each spectral channel. There are also several other
techniques for achieving a similar effect.

Figure 8.2 Graphical representation of the distance between frames of the spectrograms
shown in Figure 8.1. The larger the blob the smaller the distance. It can be seen that there is a
continuous path of fairly small distances between the bottom left and top right when the two
examples of "eight" are compared, but not when "eight" is compared with "three".

`A suitable distance metric for use with a filter bank is the sum of the squared
`differences between the logarithms of power levels in corresponding channels (i.e.
`the square of the Euclidean distance in the multi-dimensional space). A graphical
`representation of the Euclidean distance between frames for the words used in
Figure 8.1 is shown in Figure 8.2.
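
A hedged sketch of this frame-level comparison, combining partial level compensation
with the squared Euclidean distance just described, is given below. The compensation
fraction of 0.8 is simply one value from the suggested 0.7 to 0.9 range, and the function
is intended as the local metric for the word-distance sketch in Section 8.3.

```python
import numpy as np

def normalize_levels(log_frame, fraction=0.8):
    """Partial level compensation: subtract a fraction of the mean log level."""
    log_frame = np.asarray(log_frame, dtype=float)
    return log_frame - fraction * log_frame.mean()

def frame_distance(log_frame_a, log_frame_b, fraction=0.8):
    """Squared Euclidean distance between two partially level-normalized
    log filter-bank frames."""
    diff = normalize_levels(log_frame_a, fraction) - normalize_levels(log_frame_b, fraction)
    return float(np.dot(diff, diff))
```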