`Page 1 of 30
`
`GOOGLE EXHIBIT 1010
`
`
`
PROCEEDINGS OF THE IEEE — SEPTEMBER 1997
THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS

[Cover] Special Issue on Automated Biometrics. Papers on:
+ Fingerprint Identity Authentication
+ Fingerprint Features
+ Speaker Recognition
+ Evaluation of Identification and Verification Systems

1915 Classic Paper: "The Pure Electron Discharge and Its Applications in Radio Telegraphy and Telephony" by Irving Langmuir

Scanning the Past: Ralph Bown & The Golden Age of Propagation Research

Page 2 of 30
`
`
`
PROCEEDINGS OF THE IEEE
published monthly by the Institute of Electrical and Electronics Engineers, Inc.
September 1997  Vol. 85  No. 9

SPECIAL ISSUE ON AUTOMATED BIOMETRICS
Edited by Weicheng Shen and Rajiv Khanna

1343  Scanning the Special Issue on Automated Biometrics, W. Shen and R. Khanna

PAPERS

1347  Prolog, W. Shen and R. Khanna
1348  Iris Recognition: An Emerging Biometric Technology, R. P. Wildes
1364  Prolog, W. Shen and R. Khanna
1365  An Identity-Authentication System Using Fingerprints, A. K. Jain, L. Hong, S. Pankanti, and R. Bolle
1389  Prolog, W. Shen and R. Khanna
1390  Fingerprint Features—Statistical Analysis and System Performance Estimates (Invited Paper), A. R. Roddy and J. D. Stosz
1422  Prolog, W. Shen and R. Khanna
1423  Face Recognition: Eigenface, Elastic Matching, and Neural Nets (Invited Paper), J. Zhang, Y. Yan, and M. Lades
1436  Prolog, W. Shen and R. Khanna
1437  Speaker Recognition: A Tutorial (Invited Paper), J. P. Campbell, Jr.
1463  Prolog, W. Shen and R. Khanna
1464  Evaluation of Automated Biometrics-Based Identification and Verification Systems, W. Shen, M. Surette, and R. Khanna
1479  Prolog, W. Shen, R. Khanna, and J. D. Woodward
1480  Biometrics: Privacy's Foe or Privacy's Friend? J. D. Woodward
1493  Comments on "The Pure Electron Discharge and Its Applications in Radio Telegraphy and Telephony" (Invited Paper), C. K. Birdsall
1496  The Pure Electron Discharge and Its Applications in Radio Telegraphy and Telephony (Classic Paper), I. Langmuir

BOOK REVIEWS

1509  The Balanced Scorecard: Translating Strategy into Action by R. S. Kaplan and D. P. Norton, Reviewed by R. C. Dorf and M. Rattanen

SCANNING THE PAST

1511  Ralph Bown and the Golden Age of Propagation Research, J. E. Brittain
1514  FUTURE SPECIAL ISSUES/SPECIAL SECTIONS OF THE PROCEEDINGS
1516  PROCEEDINGS CLASSIC PAPER REPRINT SCHEDULE
`
`Page 3 of 30
`
`
`
`
PROCEEDINGS OF THE IEEE
`1997 EDITORIAL BOARD
`Richard B. Fair, Editor
`James E. Brittain, Associate Editor, History
`
`Winser E. Alexander
`Roger Barr
`Albert Benveniste
`G. M. Borsuk
Bimal K. Bose
`Lawrence Carin
`Giovanni De Micheli
`Per Enge
`E. K. Gannett
`Erol Gelenbe
T. G. Giallorenzi
`J. D. Gibson
`Bijan Jabbari
`Dwight L. Jaggard
`Peter Kaiser
`
Sung-Mo (Steve) Kang
`Murat Kunt
`Chen-Ching Liu
`Massimo Maresca
`K. W. Martin
`Yale N. Patt
`Theo Pavlidis
`P. B. Schneck
`Marwan Simaan
`L. M. Terman
`Fawwaz T. Ulaby
`Paul P. Wang
`H. R. Wittmann
`Francis T. S. Yu
`
`1997 IEEE PUBLICATIONS BOARD
`Friedolf Smits, Chair
`Tariq S. Durrani, Vice Chair
`
`IEEE STAFF
`Daniel J. Senese, Executive Director
`
`Frederick T. Andrews
`David Daut
`Kenneth Dawson
`Stephen L. Diamond
`Richard B. Fair
`Gregg Gibson
`Roger Hoyt
`W. Dexter Johnston, Jr.
`Marcel Keschner
`
`Deborah Flaherty Kizer
`Prasad Kodali
`Frank Lord
`William Middleton
`Robert T. Nash
`Charles Robinson
`Allan C. Schell
`Steven Unger
`George W. Zobrist
`
`STAFF EXECUTIVES
`Anthony J. Ferraro, Publications
`Richard D. Schwartz, Business
`Administration
`
`MANAGING DIRECTORS
`Donald Curtis, Human Resources
Cecelia Jankowski, Regional Activities
Peter A. Lewis, Educational
`Activities
`Andrew G. Salem, Standards Activities
`W. Thomas Suttle,
`Professional Activities
`John Witsken,
`Information Technology
`
`PUBLICATIONS DIRECTORS
`
Kenneth Moore, IEEE Press
`Lewis Moore,
`Publications Administration
`Fran Zappulla,
`Staff Director, IEEE Periodicals
`
`PROCEEDINGS STAFF
`
`Jim Calder, Managing Editor
`Margery Scanlon,
`Editorial Coordinator
`Gail S. Ferenc, Transactions Manager
Valerie Cammarata, Editorial Manager
`Geraldine E. Krolin,
`Managing Editor, TRANSACTIONS/JOURNALS
`Tonya Ugoretz Buzby, Associate Editor
`
Frank Caruthers, Jim Esch,
Howard Falk, Richard A. O'Donnell,
Kevin Self, George Likourezos,
Contributing Editors
`Stephen Goldberg, Cover Artist
`Susan Schneiderman, Richard C. Faust,
`Advertising Sales
`
Manuscripts should be submitted in triplicate to the Editor at the IEEE Operations Center. A summary of instructions for preparation is found in the most recent January issue of this journal. Detailed instructions are contained in "Information for IEEE Transactions and Journals Authors," available on request. After a manuscript has been accepted for publication, the author's organization will be requested to honor a charge of $110 per printed page (one-page minimum charge) to cover part of the publication cost. Responsibility for contents of papers rests upon the authors and not on the IEEE or its members.

Copyright: It is the policy of the IEEE to own the copyright to the technical contributions it publishes on behalf of the interests of the IEEE, its authors, and their employers and to facilitate the appropriate reuse of this material by others. To comply with the U.S. Copyright Law, authors are requested to sign an IEEE copyright form before publication. This form, a copy of which is found in the most recent January issue of the journal, returns to authors and their employers rights to reuse their material for their own purposes.
`
`
`
`
`
PROCEEDINGS OF THE IEEE (ISSN 0018-9219) is published monthly by the Institute of Electrical and Electronics Engineers, Inc. IEEE Corporate Office: 345 East 47th Street, New York, NY 10017-2394 USA. IEEE Operations Center: 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331 USA. NJ Telephone: 732-981-0060 (fax: 732-562-5456, email: j.calder@ieee.org). Member copies of the PROCEEDINGS are for personal use only.

Annual Subscription: Member and nonmember prices available on request. Single copies: IEEE members $10.00 (first copy only), nonmembers $20.00 per copy. (Note: Add $4.00 postage and handling charge for any order from $1.00 to $50.00, including prepaid orders.) Other: Available in microfiche and microfilm. Change of address must be received by the first of a month to be effective for the following month's issue. Send new address, plus mailing label showing old address, to the IEEE Operations Center.

Copyright and Reprint Permission: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy fee indicated in the code at the bottom of the first page is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. For all other copying, reprint, or republication permission, write to Copyrights and Permissions Department, IEEE Publications Administration, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331. Copyright © 1997 by the Institute of Electrical and Electronics Engineers, Inc. All rights reserved. Periodicals Postage Paid at New York, NY and at additional mailing offices.

Postmaster: Send address changes to PROCEEDINGS OF THE IEEE, IEEE, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331. GST Registration No. 125634188. Printed in U.S.A.

Advertising correspondence should be addressed to PROCEEDINGS Advertising Department, IEEE Operations Center, 445 Hoes Lane, Piscataway, NJ 08855-1331.
`
`CONTRIBUTIONS
`
`The PROCEEDINGS OF THE IEEE publishes comprehensive, in-depth
`review, tutorial, and survey papers written for technically knowledgeable
`readers who are not necessarily specialists in the subjects being treated.
`The papers are of long-range interest and broad significance. Applications
`and technological issues, as well as theory, are emphasized. The topics
`include all aspects of electrical and computer engineering and science.
From time to time, papers on managerial, historical, economic, and
`ethical aspects of technology are published. Papers are authored by
`recognized authorities and reviewed by experts. They include extensive
`introductions written at a level suitable for the nonspecialist, with ample
`references for those who wish to probe further. Several issues a year are
`devoted to a single subject of special importance.
IMPORTANT: Prospective authors, before preparing a full-length manuscript, should submit a proposal containing a description of the topic and its importance to PROCEEDINGS readers, a detailed outline of the proposed paper and its type of coverage, and a brief biography showing the author's qualifications for writing the paper (including reference to previously published material as well as information on the author's relation to the topic). If the proposal receives a favorable review, the author will be encouraged to prepare the paper, which after submittal will go through the normal review process. Guidelines for proposals are available from the address below or the PROCEEDINGS home page: http://www.ieee.org/pubs/transjour/proc.

Technical letters are no longer published in the PROCEEDINGS. Comments on and corrections to material published in this journal will be considered, however.
`
Please send proposals to the Editor, PROCEEDINGS OF THE IEEE, 445 Hoes Lane, Piscataway, NJ 08855-1331 USA. (Telephone: 732-562-5478.)
`
COVER  This special issue covers the subject of automated biometric systems that are used to verify individual identity using unique biometric measurements of the human body. Our cover illustrates the concept of one of the growing number of applications of these systems.
`
`Page 4 of 30
`
`
`
`
`Speaker Recognition: A Tutorial
`
`
`
JOSEPH P. CAMPBELL, JR., SENIOR MEMBER, IEEE
`
`Invited Paper
`
A tutorial on the design and development of automatic speaker-recognition systems is presented. Automatic speaker recognition is the use of a machine to recognize a person from a spoken phrase. These systems can operate in two modes: to identify a particular person or to verify a person's claimed identity. Speech processing and the basic components of automatic speaker-recognition systems are shown, and design tradeoffs are discussed. Then, a new automatic speaker-recognition system is given. This recognizer performs with 98.9% correct identification. Last, the performances of various systems are compared.

Keywords—Access control, authentication, biomedical measurements, biomedical signal processing, biomedical transducers, biometric, communication system security, computer network security, computer security, corpus, data bases, identification of persons, public safety, site security monitoring, speaker recognition, speech processing, verification.
`
I. INTRODUCTION
`
In keeping with this special issue on biometrics, the focus of this paper is on facilities and network access-control applications of speaker recognition. Speech processing is a diverse field with many applications. Fig. 1 shows a few of these areas and how speaker recognition relates to the rest of the field; this paper focuses on the three boxed areas.

Speaker recognition encompasses verification and identification. Automatic speaker verification (ASV) is the use of a machine to verify a person's claimed identity from his voice. The literature abounds with different terms for speaker verification, including voice verification, speaker authentication, voice authentication, talker authentication, and talker verification. In automatic speaker identification (ASI), there is no a priori identity claim, and the system decides who the person is, what group the person is a member of, or (in the open-set case) that the person is unknown. General overviews of speaker recognition have been given in [2], [12], [17], [37], [51], [52], and [59].

Speaker verification is defined as deciding if a speaker is whom he claims to be. This is different than the speaker
`
Manuscript received April 20, 1997; revised June 27, 1997.
The author is with the National Security Agency, R22, Ft. Meade, MD 20755-6516 USA, and the Whiting School of Engineering, The Johns Hopkins University, Baltimore, MD 21218 USA (e-mail: j.campbell@ieee.org).
Publisher Item Identifier S 0018-9219(97)06947-8.
`
[Fig. 1. Speech processing. Speech processing divides into analysis/synthesis, recognition, and coding; recognition divides into speech recognition, speaker recognition, and language identification; speaker recognition divides into speaker identification, speaker detection, and speaker verification. The leaves are characterized by text dependence (text independent or text dependent), speaker cooperation (unwitting or cooperative speakers), and speech quality (variable or high quality).]
`
`
identification problem, which is deciding if a speaker is a specific person or is among a group of persons. In speaker verification, a person makes an identity claim (e.g., by entering an employee number or presenting his smart card). In text-dependent recognition, the phrase is known to the system and can be fixed or prompted (visually or orally). The claimant speaks the phrase into a microphone. This signal is analyzed by a verification system that makes the binary decision to accept or reject the user's identity claim or possibly to report insufficient confidence and request additional input before making the decision.
A typical ASV setup is shown in Fig. 2. The claimant, who has previously enrolled in the system, presents an encrypted smart card containing his identification information. He then attempts to be authenticated by speaking a prompted phrase(s) into the microphone. There is generally a tradeoff between accuracy and test-session duration. In addition to his voice, ambient room noise and delayed versions of his voice enter the microphone via reflective acoustic surfaces. Prior to a verification session, users must enroll in the system (typically under supervised conditions). During this enrollment, voice models are generated and stored (possibly on a smart card) for use in later verification
`PROCEEDINGS OF THE IEEE, VOL. 85, NO. 9, SEPTEMBER 1997
`
`1437
`
`U.S. Government work not protected by U.S. copyright.
`
`Page 5 of 30
`
`
`
`
`
`
`
[Fig. 3. Generic speaker-verification system: filtering and A/D conversion produce digital speech; feature extraction and pattern matching against the model for the claimed ID yield match scores for the accept/reject decision.]
`
B. Problem Formulation

Speech is a complicated signal produced as a result of several transformations occurring at several different levels: semantic, linguistic, articulatory, and acoustic. Differences in these transformations appear as differences in the acoustic properties of the speech signal. Speaker-related differences are a result of a combination of anatomical differences inherent in the vocal tract and the learned speaking habits of different individuals. In speaker recognition, all these differences can be used to discriminate between speakers.
`
C. Generic Speaker Verification

The general approach to ASV consists of five steps: digital speech data acquisition, feature extraction, pattern matching, making an accept/reject decision, and enrollment to generate speaker reference models. A block diagram of this procedure is shown in Fig. 3. Feature extraction maps each interval of speech to a multidimensional feature space. (A speech interval typically spans 10-30 ms of the speech waveform and is referred to as a frame of speech.) This sequence of feature vectors x_i is then compared to speaker models by pattern matching. This results in a match score z_i for each vector or sequence of vectors. The match score measures the similarity of the computed input feature vectors to models of the claimed speaker or feature vector patterns for the claimed speaker. Last, a decision is made to either accept or reject the claimant according to the match score or sequence of match scores, which is a hypothesis-testing problem.

For speaker recognition, features that exhibit high speaker discrimination power, high interspeaker variability, and low intraspeaker variability are desired. Many forms of pattern matching and corresponding models are possible. Pattern-matching methods include dynamic time warping (DTW), the hidden Markov model (HMM), artificial neural networks, and vector quantization (VQ). Template models are used in DTW, statistical models are used in HMM, and codebook models are used in VQ.
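Of the pattern-matching methods just listed, DTW is the easiest to sketch in a few lines. The following is a generic illustration of template matching by dynamic time warping, not the paper's system; the one-dimensional "feature vectors" are made-up example values.

```python
# Dynamic time warping (DTW): a generic sketch of template-style
# pattern matching. Each list element stands in for a per-frame
# speech feature vector.

def euclidean(a, b):
    """Local distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def dtw_distance(template, test):
    """Minimum cumulative distance aligning `test` to `template`."""
    n, m = len(template), len(test)
    INF = float("inf")
    # cost[i][j] = best cumulative distance aligning the first i
    # template frames with the first j test frames.
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(template[i - 1], test[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

# A claimant's utterance is scored against the claimed speaker's
# enrollment template; a low score supports acceptance.
template = [[1.0], [2.0], [3.0], [2.0]]
same = [[1.0], [2.0], [2.0], [3.0], [2.0]]   # similar but time-warped
other = [[5.0], [6.0], [7.0], [6.0]]         # dissimilar
assert dtw_distance(template, same) < dtw_distance(template, other)
```

The warping path absorbs timing differences between two renditions of the same phrase, which is why DTW pairs naturally with text-dependent verification.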
`
D. Overview

The purpose of this introductory section is to present a general framework and motivation for speaker recognition, an overview of the entire paper, and a presentation of previous work in speaker recognition.

Section II contains an overview of speech processing, including speech signal acquisition, the data base used in later experiments, speech production, linear prediction (LP), transformations, and the cepstrum. Section III
`
[Fig. 2. Typical speaker-verification setup: the claimant presents a smart card and speaks into a microphone; ambient noise and acoustic reflections also enter the microphone.]
`
`
`
Table 1. Sources of Verification Error

Misspoken or misread prompted phrases
Extreme emotional states (e.g., stress or duress)
Time-varying (intra- or intersession) microphone placement
Poor or inconsistent room acoustics (e.g., multipath and noise)
Channel mismatch (e.g., using different microphones for enrollment and verification)
Sickness (e.g., head colds can alter the vocal tract)
Aging (the vocal tract can drift away from models with age)
`
sessions. There is also generally a tradeoff between accuracy and the duration and number of enrollment sessions.

Many factors can contribute to verification and identification errors. Table 1 lists some of the human and environmental factors that contribute to these errors, a few of which are shown in Fig. 2. These factors generally are outside the scope of algorithms or are better corrected by means other than algorithms (e.g., better microphones). These factors are important, however, because no matter how good a speaker-recognition algorithm is, human error (e.g., misreading or misspeaking) ultimately limits its performance.
`
A. Motivation

ASV and ASI are probably the most natural and economical methods for solving the problems of unauthorized use of computer and communications systems and multilevel access control. With the ubiquitous telephone network and microphones bundled with computers, the cost of a speaker-recognition system might only be for software.

Biometric systems automatically recognize a person by using distinguishing traits (a narrow definition). Speaker recognition is a performance biometric, i.e., you perform a task to be recognized. Your voice, like other biometrics, cannot be forgotten or misplaced, unlike knowledge-based (e.g., password) or possession-based (e.g., key) access-control methods. Speaker-recognition systems can be made somewhat robust against noise and channel variations [33], [49], ordinary human changes (e.g., time-of-day voice changes and minor head colds), and mimicry by humans and tape recorders [22].
`
Page 6 of 30
`
`
`
presents feature selection, the divergence measure, and the Bhattacharyya distance. This section is highlighted by the development of the divergence shape measure and the Bhattacharyya distance shape. Section IV introduces pattern matching, and Section V presents classification, decision theory, and receiver operating characteristic (ROC) curves. Section VI describes a simple but effective speaker-recognition algorithm. Section VII demonstrates the performance of various speaker-recognition algorithms, and Section VIII concludes by summarizing this paper.
`
E. Previous Work

There is considerable speaker-recognition activity in industry, national laboratories, and universities. Among those who have researched and designed several generations of speaker-recognition systems are AT&T (and its derivatives); Bolt, Beranek, and Newman; the Dalle Molle Institute for Perceptual Artificial Intelligence (Switzerland); ITT; Massachusetts Institute of Technology Lincoln Labs; National Tsing Hua University (Taiwan); Nagoya University (Japan); Nippon Telegraph and Telephone (Japan); Rensselaer Polytechnic Institute; Rutgers University; and Texas Instruments (TI). The majority of ASV research is directed at verification over telephone lines [36]. Sandia National Laboratories, the National Institute of Standards and Technology [35], and the National Security Agency [8] have conducted evaluations of speaker-recognition systems.

Table 2 shows a sampling of the chronological advancement in speaker verification. The following terms are used to define the columns in Table 2: "source" refers to a citation in the references, "org" is the company or school where the work was done, "features" are the signal measurements (e.g., cepstrum), "input" is the type of input speech (laboratory, office quality, or telephone), "text" indicates whether a text-dependent or text-independent mode of operation is used, "method" is the heart of the pattern-matching process, "pop" is the population size of the test (number of people), and "error" is the equal error percentage for speaker-verification systems "v" or the recognition error percentage for speaker-identification systems "i" given the specified duration of test speech in seconds. This data is presented to give a simplified general view of past speaker-recognition research. The references should be consulted for important distinctions that are not included, e.g., differences in enrollment, differences in cross-gender impostor trials, differences in normalizing "cohort" speakers [53], differences in partitioning the impostor and cohort sets, and differences in known versus unknown impostors [8]. It should be noted that it is difficult to make meaningful comparisons between the text-dependent and the generally more difficult text-independent tasks. Text-independent approaches, such as Gish's segmental Gaussian model [18] and Reynolds' Gaussian Mixture Model [49], need to deal with unique problems (e.g., sounds or articulations present in the test material but not in training). It is also difficult to compare between the binary-choice verification task and the generally more difficult multiple-choice identification task [12], [39].
`
`CAMPBELL: SPEAKER RECOGNITION
`
`Page 7 of 30
`
The general trend shows accuracy improvements over time with larger tests (enabled by larger data bases), thus increasing confidence in the performance measurements. For high-security applications, these speaker-recognition systems would need to be used in combination with other authenticators (e.g., smart card). The performance of current speaker-recognition systems, however, makes them suitable for many practical applications. There are more than a dozen commercial ASV systems, including those from ITT, Lernout & Hauspie, T-NETIX, Veritel, and Voice Control Systems. Perhaps the largest scale deployment of any biometric to date is Sprint's Voice FONCARD®, which uses TI's voice verification engine.

Speaker-verification applications include access control, telephone banking, and telephone credit cards. The accounting firm of Ernst and Young estimates that high-tech computer thieves in the United States steal $3-5 billion annually. Automatic speaker-recognition technology could substantially reduce this crime by reducing these fraudulent transactions.
`
As automatic speaker-verification systems gain widespread use, it is imperative to understand the errors made by these systems. There are two types of errors: the false acceptance of an invalid user (FA or Type I) and the false rejection of a valid user (FR or Type II). It takes a pair of subjects to make a false acceptance error: an impostor and a target. Because of this hunter-and-prey relationship, in this paper, the impostor is referred to as a wolf and the target as a sheep. False acceptance errors are the ultimate concern of high-security speaker-verification applications; however, they can be traded off for false rejection errors.
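The FA/FR tradeoff can be made concrete with a small numerical sketch. The match scores below are made-up illustration values, not any system's actual output; sweeping the decision threshold trades false acceptances against false rejections.

```python
# False-acceptance vs. false-rejection tradeoff at a score threshold.
# Higher score = more similar to the claimed speaker's model.

def fa_fr(wolf_scores, sheep_scores, threshold):
    """Error rates under an accept-if-score >= threshold rule."""
    fa = sum(s >= threshold for s in wolf_scores) / len(wolf_scores)
    fr = sum(s < threshold for s in sheep_scores) / len(sheep_scores)
    return fa, fr

# Hypothetical scores for impostors ("wolves") and valid users ("sheep").
wolves = [0.1, 0.2, 0.3, 0.4, 0.5]
sheep = [0.4, 0.6, 0.7, 0.8, 0.9]

# A strict threshold favors security (low FA, higher FR)...
fa_hi, fr_hi = fa_fr(wolves, sheep, 0.55)
# ...a lenient one favors convenience (higher FA, low FR).
fa_lo, fr_lo = fa_fr(wolves, sheep, 0.35)
assert fa_hi < fa_lo and fr_hi > fr_lo
```

Plotting FA against FR over all thresholds yields the system's operating curve; the point where the two rates are equal is the equal error rate quoted in Table 2.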
After reviewing the methods of speaker recognition, a simple speaker-recognition system will be presented. A data base of 186 people collected over a three-month period was used in closed-set speaker identification experiments. A speaker-recognition system using methods presented here is practical to implement in software on a modest personal computer. The example system uses features and measures for speaker recognition based upon speaker-discrimination criteria (the ultimate goal of any recognition system). Experimental results show that these new features and measures yield 1.1% closed-set speaker identification error on data bases of 44 and 43 people. The features and measures use long-term statistics based upon an information-theoretic shape measure between line spectrum pair (LSP) frequency features. This new measure, the divergence shape, can be interpreted geometrically as the shape of an information-theoretic measure called divergence. The LSP's were found to be very effective features in this divergence shape measure.
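As a rough sketch of the kind of measure described, the following computes the symmetric divergence between two Gaussian densities (restricted to diagonal covariances to keep the arithmetic plain) and its covariance-only part. The paper's exact definitions of the divergence shape appear in its Section III, which is not reproduced in this excerpt, so treat these formulas as the standard Gaussian J-divergence rather than the paper's own equations.

```python
# Symmetric (J-)divergence between diagonal-covariance Gaussians,
# split into a mean term and a covariance-only "shape" term.

def divergence(mu1, var1, mu2, var2):
    """J-divergence between two diagonal-covariance Gaussians."""
    shape = 0.0
    mean_term = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        shape += 0.5 * (v1 - v2) * (1.0 / v2 - 1.0 / v1)
        mean_term += 0.5 * (1.0 / v1 + 1.0 / v2) * (m1 - m2) ** 2
    return shape + mean_term

def divergence_shape(var1, var2):
    """Covariance-only term: unaffected by mean (e.g., channel) offsets."""
    return sum(0.5 * (v1 - v2) * (1.0 / v2 - 1.0 / v1)
               for v1, v2 in zip(var1, var2))

var_a, var_b = [1.0, 2.0], [1.5, 1.0]
# With equal means, the mean term vanishes and only the shape remains.
assert abs(divergence([0, 0], var_a, [0, 0], var_b)
           - divergence_shape(var_a, var_b)) < 1e-12
```

The appeal of a covariance-only measure is visible here: a constant offset added to both means (as a fixed channel might introduce) changes the full divergence but leaves the shape term untouched.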
The following section contains an overview of digital signal acquisition, speech production, speech signal processing, LP, and mel cepstra.
`
II. SPEECH PROCESSING
`
`Speech processing extracts the desired information from
`a speech signal. To process a signal by a digital computer,
`
`
`
Table 2. Selected Chronology of Speaker-Recognition Progress

Source | Org | Features | Method | Input | Text | Pop | Error
Atal 1974 [1] | AT&T | Cepstrum | Pattern Match | Lab | Dependent | 10 | i: 2%@0.5s; v: 2%@1s
Markel and Davis 1979 [34] | STI | LP | Long-Term Statistics | Lab | Independent | 17 | i: 2%@39s
Furui 1981 [16] | AT&T | Normalized Cepstrum | Pattern Match | Telephone | Dependent | 10 | v: 0.2%@3s
Schwartz et al. 1982 [56] | BBN | LAR | Nonparametric | Telephone | Independent | 21 | i: 2.5%@2s
Li and Wrench 1983 [31] | ITT | LP, Cepstrum | Pattern Match | Lab | Independent | — | i: 21%@3s; i: 4%@10s
Doddington 1985 [12] | TI | Filter-bank | DTW | Lab | Dependent | 200 | v: 0.8%@6s
Soong et al. 1985 [57] | AT&T | LP | VQ (size 64) Likelihood Ratio Distortion | Telephone | 10 isolated digits | 100 | i: 5%@1.5s; i: 1.5%@3.5s
Higgins and Wohlford 1986 [23] | ITT | Cepstrum | DTW Likelihood Scoring | — | Independent | — | v: 10%@2.5s; v: 4.5%@10s
Attili et al. 1988 [3] | — | Cepstrum, LP, Autocorr | Projected Long-Term Statistics | — | Dependent | 90 | v: 1%@3s
Higgins et al. 1991 [22] | ITT | LAR, LP-Cepstrum | DTW Likelihood Scoring | Office | Dependent | 186 | v: 1.7%@10s
Tishby 1991 [60] | AT&T | LP | HMM (AR mix) | — | 10 isolated digits | — | v: 2.8%@1.5s; v: 0.8%@3.5s
Reynolds 1995 [48]; Reynolds and Carlson 1995 [49] | MIT-LL | Mel-Cepstrum | HMM (GMM) | Office | Dependent | 138 | i: 0.8%@10s; v: 0.12%@10s
Che and Lin 1995 [9] | Rutgers | Cepstrum | HMM | Office | Dependent | 138 | i: 0.56%@2.5s; i: 0.14%@10s; v: 0.62%@2.5s
Colombi et al. 1996 [10] | AFIT | — | HMM monophone | Office | Dependent | 138 | i: 0.22%@10s; v: 0.28%@10s
Reynolds 1996 [50] | MIT-LL | Mel-Cepstrum, Mel-dCepstrum | HMM (GMM) | Telephone | Independent | 416 | v: 11%/16%@3s; v: 6%/8%@10s; v: 3%/5%@30s (matched/mismatched handset)
`
the signal must be represented in digital form so that it can be used by a digital computer.
`
A. Speech Signal Acquisition

Initially, the acoustic sound pressure wave is transformed into a digital signal suitable for voice processing. A microphone or telephone handset can be used to convert the acoustic wave into an analog signal. This analog signal is conditioned with antialiasing filtering (and possibly additional filtering to compensate for any channel impairments). The antialiasing filter limits the bandwidth of the signal to approximately the Nyquist rate (half the sampling rate) before sampling. The conditioned analog signal is then sampled to form a digital signal by an analog-to-digital (A/D) converter. Today's A/D converters for speech applications typically sample with 12-16 bits of resolution at 8000-20000 samples per second. Oversampling is commonly used to allow a simpler analog antialiasing filter and to control the fidelity of the sampled signal precisely (e.g., sigma-delta converters).
In local speaker-verification applications, the analog channel is simply the microphone, its cable, and analog signal conditioning. Thus, the resulting digital signal can be very high quality, lacking distortions produced by
Page 8 of 30
`
`
Table 3. The YOHO Corpus

"Combination lock" phrases (e.g., "twenty-six, eighty-one, fifty-seven")
138 subjects: 106 males, 32 females
Collected with a STU-III electret-microphone telephone handset over a 3-month period in a real-world office environment
4 enrollment sessions per subject with 24 phrases per session
10 verification sessions per subject at approximately 3-day intervals with 4 phrases per session
Total of 1380 validated test sessions
8 kHz sampling with 3.8 kHz analog bandwidth (STU-III like)
1.2 gigabytes of data

[Fig. 4. Human vocal system, labeling the nasal cavity, hard palate, soft palate (velum), hyoid bone, epiglottis, cricoid cartilage, and lungs. Reprinted with permission from J. Flanagan, Speech Analysis and Perception, 2nd ed. New York and Berlin: Springer-Verlag, 1972, p. 10, Fig. 2.1. © Springer-Verlag.]
`
`
* oral cavity (forward of the velum and bounded by the lips, tongue, and palate);
* nasal pharynx (above the velum, rear end of nasal cavity);
* nasal cavity (above the palate and extending from the pharynx to the nostrils).

An adult male vocal tract is approximately 17 cm long [14].

The vocal folds (formerly known as vocal cords) are shown in Fig. 4. The larynx is composed of the vocal folds, the top of the cricoid cartilage, the arytenoid cartilages, and the thyroid cartilage (also known as the "Adam's apple"). The vocal folds are stretched between the thyroid cartilage and the arytenoid cartilages. The area between the vocal folds is called the glottis.

As the acoustic wave passes through the vocal tract, its frequency content (spectrum) is altered by the resonances of the vocal tract. Vocal tract resonances are called formants. Thus, the vocal tract shape can be estimated from the spectral shape (e.g., formant location and spectral tilt) of the voice signal.

Voice verification systems typically use features derived only from the vocal tract. As seen in Fig. 4, the human vocal mechanism is driven by an excitation source, which also contains speaker-dependent information. The excitation is generated by airflow from the lungs, carried by the trachea (also called the "windpipe") through the vocal folds (or the arytenoid cartilages). The excitation can be characterized as phonation, whispering, frication, compression, vibration, or a combination of these.
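One standard route from a frame of speech to its smooth spectral envelope (and hence its formant structure) is linear prediction, which the paper develops later. The sketch below is a minimal autocorrelation-method implementation; the synthetic "voiced" frame, its decay rate, and the model order are example values, and real systems would also window the frame.

```python
# Linear prediction (LP) by the autocorrelation method: the LP model
# captures vocal-tract resonances, so its residual energy should be
# far below the frame energy for a resonant signal.
import math

def autocorr(x, max_lag):
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LP normal equations; returns (coefficients, error)."""
    a = [0.0] * (order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = sum(a[j] * r[i - j] for j in range(1, i))
        k = -(r[i] + acc) / err          # reflection coefficient
        a_prev = a[:]
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

# A toy voiced frame: a decaying resonance near 500 Hz, sampled at 8 kHz.
fs = 8000
frame = [math.exp(-0.002 * n) * math.sin(2 * math.pi * 500 * n / fs)
         for n in range(240)]
order = 10
r = autocorr(frame, order)
a, err = levinson_durbin(r, order)
# Residual energy well below frame energy: the resonance is modeled.
assert 0 < err < 0.05 * r[0]
```

The prediction coefficients define an all-pole filter whose peaks approximate the formant locations, which is why LP-derived features (LP coefficients, LP-cepstra, LSP's) recur throughout Table 2.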
`
transmission of analog signals over long-distance telephone lines.
`
B. YOHO Speaker-Verification Corpus

The work presented here is based on high-quality signals for benign-channel speaker-verification applications. The primary data base for this work is known as the YOHO Speaker-Verification Corpus, which was collected by ITT under a U.S. government contract. The YOHO data base was the first large-scale, scientifically controlled and collected, high-quality speech data base for speaker-verification testing at high confidence levels. Table 3 describes the YOHO data base [21]. YOHO is available from the Linguistic Data Consortium (University of Pennsylvania), and test plans have been developed for its use [8]. This data base already is in digital form, emulating the third generation Secure Terminal Unit's (STU-III) secure voice telephone input characteristics, so the first signal processing block of the verification system in Fig. 3 (signal conditioning and acquisition) is taken care of.

In a text-dependent speaker-verification scenario, the phrases are known to the system (e.g., the claimant is prompted to say them). The syntax used in the YOHO data base is "combination lock" phrases. For example, the prompt might read, "Say: twenty-six, eighty-one, fifty-seven."

YOHO was designed for U.S. government evaluation of speaker-verification systems in "office" environments. In addition to office environments, there are enormous consumer markets that must contend with noisy speech (e.g., telephone services) and far-field microphones (e.g., com