Heckerman et al.

(10) Patent No.: US 6,260,011 B1
(45) Date of Patent: Jul. 10, 2001
`
(54) METHODS AND APPARATUS FOR AUTOMATICALLY SYNCHRONIZING ELECTRONIC AUDIO FILES WITH ELECTRONIC TEXT FILES
`
`(75) Inventors: David E. Heckerman, Bellevue; Fileno
`A. Alleva, Redmond; Robert L.
`Rounthwaite, Fall City; Daniel Rosen,
`Bellevue; Mei-Yuh Hwang, Redmond;
`Yoram Yaacovi, Redmond; John L.
`Manferdelli, Redmond, all of WA (US)
(73) Assignee: Microsoft Corporation, Redmond, WA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.
(21) Appl. No.: 09/531,054
(22) Filed: Mar. 20, 2000
(51) Int. Cl. ........................... G10L 15/08; G10L 15/26; G10L 11/06
(52) U.S. Cl. .......................... 704/235; 704/231; 704/251; 704/254; 704/278
(58) Field of Search ..................... 704/260, 252, 270.1, 270, 243, 231, 236, 278, 238
`
(56) References Cited

U.S. PATENT DOCUMENTS

3,700,815 * 10/1972 Doddington et al. ........ 704/246
4,779,209 * 10/1988 Stapleford et al. ........ 704/278
5,008,871 * 4/1991 Howells et al. ........ 369/28
5,333,275 * 7/1994 Wheatley et al. ........ 704/243
5,649,060 * 7/1997 Ellozy et al. ........ 704/278
5,737,725 * 4/1998 Case ........ 704/270
5,758,024   5/1998 Alleva ........ 395/63
5,794,197   8/1998 Alleva et al. ........ 704/255
5,960,447 * 9/1999 Holt et al. ........ 704/235
6,076,059 * 6/2000 Glickman et al. ........ 704/260
OTHER PUBLICATIONS

Hauptmann et al., "Story Segmentation & Detection of Commercials in Broadcast News Video", Research & Technology Advances in Digital Libraries, Apr. 24, 1998.*

Pedro J. Moreno, Chris Joerg, Jean-Manuel Van Thong, and Oren Glickman, "A Recursive Algorithm for the Forced Alignment of Very Long Audio Segments", Cambridge Research Laboratory, pp. 1-4, Nov. 20, 1998.

Lawrence Rabiner and Biing-Hwang Juang, Fundamentals of Speech Recognition, pp. 434-495 (1993).
`
`* cited by examiner
`
Primary Examiner - Richemond Dorvil
Assistant Examiner - Daniel A. Nolan
(74) Attorney, Agent, or Firm - Straub & Pokotylo; Michael P. Straub
(57) ABSTRACT

Automated methods and apparatus for synchronizing audio and text data, e.g., in the form of electronic files, representing audio and text expressions of the same work or information are described. A statistical language model is generated from the text data. A speech recognition operation is then performed on the audio data using the generated language model and a speaker independent acoustic model. Silence is modeled as a word which can be recognized. The speech recognition operation produces a time indexed set of recognized words, some of which may be silence. The recognized words are globally aligned with the words in the text data. Recognized periods of silence which correspond to expected periods of silence, and which are adjoined by one or more correctly recognized words, are identified as points where the text and audio files should be synchronized, e.g., by the insertion of bi-directional pointers. In one embodiment, for a text location to be identified for synchronization purposes, both words which bracket, e.g., precede and follow, the recognized silence must be correctly identified. Pointers corresponding to identified locations of silence to be used for synchronization purposes are inserted into the text and/or audio files at the identified locations. Audio time stamps obtained from the speech recognition operation may be used as the bi-directional pointers. Synchronized text and audio data may be output in a variety of file formats.
`
`36 Claims, 9 Drawing Sheets
`
[Front-page figure: the text corpus 10 feeds a statistical language model generation module 314; a speech recognizer module 312 uses the resulting language model, an acoustic model, and the audio corpus 20 to produce recognized text + time stamps 406; a text/audio alignment module produces synchronized text/audio 412; a control module 310 and program data 138 are also shown.]
`
`
`
[FIG. 1: unsynchronized text corpus and audio corpus corresponding to the same exemplary literary work.]
`
`
`
[FIG. 2: exemplary computer system, including a processing unit, system memory, hard disk and other drives with their interfaces, video adapter, network interface, and stored operating system, application programs, other program modules, and program data.]
`
`
`
`
`
`
`
[FIG. 3: application programs 136, including a word processor 302, an electronic book 304, a spreadsheet 306, and an audio/text synchronization program 308 comprising a control module 310, a speech recognizer module 312, a statistical language model generation module 314, and a speech recognition training module 316.]

[FIG. 4: program data 138, including unaligned text files 402, unaligned audio files, recognized text + time stamps 406, a statistical LM 408, an acoustic model 410, and synchronized text/audio 412.]
`
`
`
[FIG. 5: flow of data between modules: the text corpus feeds the statistical language model generation module; the speech recognizer module uses the resulting language model and an acoustic model to produce recognized text + time stamps from the audio corpus; the text/audio alignment module then produces synchronized text/audio 412, under the direction of the control module 310 operating on program data 138.]
`
`
`
[FIG. 6: flow diagram of the synchronization method: START 602; generate a statistical language model (LM) from the text corpus 604; perform a speech recognition operation on the audio corpus corresponding to the text corpus, using the generated language model and a speaker independent acoustic model, to produce recognized text + time stamps 606; globally align the recognized text with the text in the text corpus 607; identify locations in the recognized text where silence is adjoined by correctly identified word(s) by comparing the recognized text to the text in the globally aligned text corpus 608; use the identified locations to index the text and/or audio corpus, thereby generating synchronized text and audio files 610; store the synchronized text/audio files 612; STOP 614.]
`
`
`
[FIG. 7: global alignment of the actual text "Call me John. I long for you." with the recognized word sequence, recognized silence included.]
`
`
`
[FIG. 8: synchronized text and audio 412, with both the audio corpus and the text corpus carrying text with time stamps.]

[FIG. 9: exemplary content of the aligned corpuses of FIG. 8, each reading "Call me John. (110) I long for you. ...", where (110) is the inserted time stamp.]
`
`
`
[FIG. 10: synchronized text and audio 412, with pointers that identify a file as well as a location within it.]

[FIG. 11: exemplary content of the aligned corpuses of FIG. 10, with pointers of the form (AUDIO 1, 110) and (TEXT 1, 110) that name the corresponding file and an index into it, e.g., "Call me John. (AUDIO 1, 110) I long for you. ...".]
`
`
`
[FIG. 12: synchronized text and audio 412, with the corpuses divided into separate files at synchronization points.]

[FIG. 13: exemplary content of the files of FIG. 12: audio files 1204 "Call me John." and 1205 "I long for you. ..."; text files 1214 "Call me John." and 1215 "I long for you. ...".]
`
`
`
METHODS AND APPARATUS FOR AUTOMATICALLY SYNCHRONIZING ELECTRONIC AUDIO FILES WITH ELECTRONIC TEXT FILES
`
FIELD OF THE INVENTION

The present invention relates to methods and apparatus for synchronizing electronic audio and text data in an automated manner, and to using the synchronized data, e.g., audio and text files.
`
BACKGROUND OF THE INVENTION

Historically, books and other literary works have been expressed in the form of text. Given the growing use of computers, text is now frequently represented and stored in electronic form, e.g., in the form of text files. Accordingly, in the modern age, users of computer devices can obtain electronic copies of books and other literary works.

Frequently text is read aloud so that the content of the text can be provided to one or more people in an oral, as opposed to written, form. The reading of stories to children and the reading of text to the physically impaired are common examples where text is read aloud. The commercial distribution of literary works in both electronic and audio versions has been commonplace for a significant period of time. The widespread availability of personal computers and other computer devices capable of displaying text and playing audio files stored in electronic form has begun to change the way in which text versions of literary works and their audio counterparts are distributed.
Electronic distribution of books and other literary works in the form of electronic text and audio files can now be accomplished via compact discs and/or the Internet. Electronic versions of literary works in both text and audio versions can now be distributed far more cheaply than paper copies. While the relatively low cost of distributing electronic versions of a literary work provides authors and distributors an incentive for distributing literary works in electronic form, consumers can benefit from having such works in electronic form as well.

Consumers may wish to switch between audio and text versions of a literary work. For example, in the evening an individual may wish to read a book. However, on their way to work, the same individual may want to listen to the same version of the literary work from the point, e.g., sentence or paragraph, where they left off reading the night before.

Consumers attempting to improve their reading skills can also find text and audio versions in the form of electronic files beneficial. For example, an individual attempting to improve his/her reading skills may wish to listen to the audio version of a book while having the text corresponding to the audio being presented highlighted on a display device. Also, many vision-impaired or hearing-impaired readers might benefit from having linked audio and text versions of the literary work.
While electronic text and audio versions of many literary works exist, relatively few of these works include the links between the audio and text versions needed to support easy accessing of the same point in both versions of a work. Without such links between the text and audio versions of a work, it is difficult to easily switch between the two versions of the work or to highlight text corresponding to the portion of the audio version being played at a given moment in time.
Links or indexes used to synchronize audio and text versions of the same work may be manually generated via human intervention. However, such human involvement can be costly and time consuming. Accordingly, there is a need for methods and apparatus for automating the synchronization of electronic text and audio versions of a work.
Previous attempts to automate the synchronization of electronic text files and audio files of the same work have focused primarily on the indexing of audio files corresponding to radio and other broadcasts with electronic text files representing transcripts of the broadcasts. Such indexing is designed to allow an individual viewing an excerpt from a transcript over the Internet to hear an audio clip corresponding to the excerpt. In such applications, the precision required in the alignment is often considered not to be critical, and an error in alignment of up to 2 seconds is considered by some to be acceptable.
While the task of aligning audio files corresponding to TV and radio broadcasts with text transcripts of the broadcasts is similar in nature to the task of aligning text files of books or other literary works with audio versions made therefrom, there are important differences between the two tasks which arise from the differing content of the files being aligned and the ultimate use of the aligned files.
In the case of recordings of literary and other text documents which are read aloud and recorded for commercial purposes, a single reader is often responsible for the reading of the entire text. The reader is often carefully chosen by the company producing the audio version of the literary work for proper pronunciation, inflection, general understandability and overall accuracy. In addition, audio recordings of books and other literary works are normally generated in a sound controlled environment designed to keep background noise to a minimum. Thus commercial audio versions of books or other literary works intended to be offered for sale, either alone or in combination with a text copy, are often of reasonably good quality with a minimum of background noise. Furthermore, they tend to accurately reflect the punctuation in the original work. In the case of commercial audio versions of literary works, a single individual may be responsible for the audio versions of several books or stories, since commercial production companies tend to use the same reader to produce the audio versions of multiple literary works, e.g., books.
In the case of transcripts produced from, e.g., radio broadcasts, television broadcasts, or court proceedings, multiple speakers with different pronunciation characteristics, e.g., accents, frequently contribute to the same transcript. Each speaker may contribute only a small portion of the total recording. The original audio may have a fair amount of background noise, e.g., music or other noise. In addition, in TV and radio broadcasts, speech from multiple speakers may overlap, making it difficult to distinguish the end of a sentence spoken by one speaker from the start of a sentence from a new speaker. Furthermore, punctuation in the transcript may be less accurate than desired, given that the transcript may be based on unrehearsed conversational speech generated without regard to how it might later be transcribed using written punctuation marks.
In the case of attempting to synchronize text and audio versions of literary works, given the above discussed uses of such files, accurately synchronizing the starting points of paragraphs and sentences is often more important than being able to synchronize individual words within sentences.
In view of the above discussion, it is apparent that there is a need for new methods and apparatus which can be used to accurately synchronize audio and text files. It is desirable that at least some methods and apparatus be well suited for synchronizing text and audio versions of literary works. It is also desirable that the methods and apparatus be capable of synchronizing the starting points of sentences and/or paragraphs in audio and text files with a high degree of accuracy.
SUMMARY OF THE PRESENT INVENTION

The present invention is directed to methods and apparatus for automatically generating synchronized audio and text data, e.g., files, from unsynchronized electronic audio and text versions of the same work, e.g., literary work, program or document.

Synchronizing long audio files, e.g., 30 minutes and longer, with corresponding text in an automated manner presents significant difficulties, since absolute certainty as to points in the audio and text versions which correlate to each other exists only at the beginning and end of the complete text and audio versions of the same work.
When synchronizing text and audio versions of the same work, it is highly desirable to synchronize at least one point per paragraph, preferably at the start of each paragraph. When positions within a paragraph are also to be synchronized, the start of sentences is a particularly useful location to synchronize, since people tend to prefer reading or listening to speech from the start, as opposed to the middle, of sentences.
The inventors of the present invention recognized that silence normally occurs at the ends of paragraphs and sentences but, for the most part, does not occur between words within a sentence during ordinary speech. They also recognized that in many audio versions of literary works and other text documents read aloud, the amount of background noise is intentionally kept to a minimum. This makes periods of silence in an audio version of a literary work relatively easy to detect. In addition, the locations where silence occurs are relatively easy to predict from punctuation and/or other content within the text version of the work.
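The predictability of silence from punctuation can be made concrete with a short sketch. The following Python fragment is purely illustrative and not part of the patent; the tokenizer and the choice of sentence-final marks are assumptions. It returns the word positions after which a pause would be expected, using the sample sentence of FIG. 7.

```python
import re

# Assumed set of pause-inducing marks; the text says only that silence
# locations are predictable from punctuation and/or other content.
PAUSE_MARKS = {".", "!", "?"}

def expected_silence_positions(text):
    """Return word counts after which silence is expected, based on
    sentence-final punctuation in the text version of the work."""
    tokens = re.findall(r"\w+|[.!?]", text)
    positions = []
    words_seen = 0
    for tok in tokens:
        if tok in PAUSE_MARKS:
            positions.append(words_seen)  # pause follows the last word seen
        else:
            words_seen += 1
    return positions

# Pauses expected after the 3rd and 7th words:
print(expected_silence_positions("Call me John. I long for you."))  # [3, 7]
```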
Given that silence may occur within a sentence in an audio version of a literary work, e.g., because of a pause by the reader which is not reflected in the text by punctuation, the detection of periods of silence alone may be insufficient to reliably synchronize audio and text versions of a literary work. This is particularly the case in long audio sequences.
The inventors of the present application recognized that by performing speech recognition, spoken words in an audio work, in addition to periods of silence, could be detected automatically and used for purposes of synchronizing the text and audio versions of the work. Unfortunately, with known speech recognition techniques, recognition errors occur. In addition, even when recognition errors do not occur, differences may exist between an audio and text version of the same work due, e.g., to reading errors on the part of the individual or individuals responsible for generating the audio version of the work.
The present invention uses a combination of silence detection and detection of actual words for purposes of synchronizing audio and text versions of the same work.
In accordance with the present invention, a speech recognition operation is performed on an audio corpus to recognize actual words and periods of silence. For speech recognition purposes, silence may be modeled as a word. A time indexed set of recognized words and periods of silence is produced by the speech recognition process of the present invention. The results of the speech recognition operation are globally aligned with the text corpus by matching as much of the recognized text as possible to the corresponding text of the work without changing the sequence of the recognized or actual text.
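The patent does not prescribe a particular alignment algorithm; one way to realize such an order-preserving global alignment, sketched below under the assumptions that both sequences have been lower-cased and that silence is represented by a "<sil>" token, is a longest-common-subsequence matcher such as Python's difflib:

```python
import difflib

def global_align(recognized, reference):
    """Globally align recognized words (which may include '<sil>'
    tokens) with the reference text without reordering either
    sequence. Returns (recognized_index, reference_index) pairs
    for the words that match."""
    matcher = difflib.SequenceMatcher(a=recognized, b=reference, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            pairs.append((block.a + k, block.b + k))
    return pairs

# Recognizer output with one misrecognized word ("wrong" for "long"):
recognized = ["call", "me", "john", "<sil>", "i", "wrong", "for", "you"]
reference  = ["call", "me", "john", "i", "long", "for", "you"]
print(global_align(recognized, reference))
# [(0, 0), (1, 1), (2, 2), (4, 3), (6, 5), (7, 6)]
```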
When periods of detected silence correspond to expected locations within the actual text, e.g., ends of sentences and paragraphs, one or more words adjoining the period of silence in the recognized text are compared to one or more corresponding words adjoining the expected location of silence in the actual text. If the words adjoining the silence were properly recognized, both the recognized word or words adjoining the silence and the actual word or words adjoining the expected point of silence will match. When there is a match, the identified location of the silence in the audio file and the corresponding location in the text file are identified as corresponding audio and text locations where a pointer correlating the two files should be inserted.
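A minimal sketch of this comparison step follows, reusing the alignment pairs from the fragment above; the data layout and the one-word-on-each-side test are assumptions consistent with the description:

```python
def find_sync_points(recognized, alignment, expected_silence_after):
    """Return indices of '<sil>' tokens in the recognized sequence
    that fall at an expected silence location and are adjoined by
    correctly recognized words on both sides."""
    matched = {r for r, _ in alignment}   # correctly recognized word indices
    ref_of = dict(alignment)              # recognized index -> reference index
    points = []
    for i, word in enumerate(recognized):
        if word != "<sil>":
            continue
        prev_ok = (i - 1) in matched
        next_ok = (i + 1) in matched
        # The silence must occur where the text predicts it: the word
        # before it maps to a reference position followed by a pause.
        at_expected = prev_ok and (ref_of[i - 1] + 1) in expected_silence_after
        if at_expected and next_ok:
            points.append(i)
    return points

recognized = ["call", "me", "john", "<sil>", "i", "long", "for", "you"]
alignment = [(0, 0), (1, 1), (2, 2), (4, 3), (5, 4), (6, 5), (7, 6)]
print(find_sync_points(recognized, alignment, {3, 7}))  # [3]
```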
In one particular embodiment, for a location corresponding to detected silence to be used for purposes of file synchronization, the recognized words bracketing, i.e., preceding and following, the detected silence must be properly recognized, e.g., must match the words in the actual text bracketing the location believed to correspond to the detected silence.
When a location in a text file corresponding to detected silence is identified for purposes of file synchronization, a pointer to the recognized silence in the audio file is added at the location in the text file having been identified as corresponding to the recognized silence. This results in the ends of sentences and/or paragraphs in the text file being synchronized with corresponding occurrences of silence in the audio file.

Each pointer added to the text file may be, e.g., a time index or time stamp into the corresponding audio file. A similar pointer, e.g., time index or stamp, may be added to the audio file if the corresponding audio file does not already include such values.
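The sketch below illustrates one such insertion, writing the audio time stamp of each synchronization point into the text in the spirit of the "(110)" marker shown in FIG. 9; the (word, start time) tuple format of the recognizer output and the marker syntax are assumptions:

```python
def insert_time_pointers(reference, rec, alignment, sync_points):
    """Insert time-stamp pointers into the reference text at each
    synchronization point. `rec` is recognizer output as
    (word, start_time_ms) pairs; `sync_points` holds indices of
    qualifying '<sil>' tokens in that output."""
    ref_of = dict(alignment)
    # Each sync point maps to the reference slot just after the word
    # preceding the silence; the pointer value is the silence's time.
    inserts = {ref_of[i - 1] + 1: rec[i][1] for i in sync_points}
    out = []
    for j, word in enumerate(reference):
        out.append(word)
        if j + 1 in inserts:
            out.append("(%d)" % inserts[j + 1])  # pointer into the audio
    return " ".join(out)

rec = [("call", 0), ("me", 40), ("john", 70), ("<sil>", 110),
       ("i", 150), ("long", 180), ("for", 230), ("you", 260)]
reference = ["call", "me", "john", "i", "long", "for", "you"]
alignment = [(0, 0), (1, 1), (2, 2), (4, 3), (5, 4), (6, 5), (7, 6)]
print(insert_time_pointers(reference, rec, alignment, [3]))
# call me john (110) i long for you
```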
Pointers inserted into the audio and text files for synchronization purposes may take on a wide range of forms in addition to time stamp values. For example, pointers may include a filename or file identifier in conjunction with an index value used to access a particular point within the identified file. In such cases, the pointers added to audio files may include a file name or file identifier which identifies the corresponding text file. Pointers added to the text files in such embodiments may include a file name or file identifier which identifies the corresponding audio file.
As part of the speech recognition process of the present invention, statistical language models, generated from the text corpus to be synchronized, may be used. Statistical language models, e.g., tri-gram language models, predict the statistical probability that a hypothesized word or words will occur in the context of one or more previously recognized words. Since the synchronization of audio and text files in accordance with the present invention relies heavily on the accurate identification of silence in the context of preceding and/or subsequent words, it was recognized that statistical language models, as opposed to simple language models, were more likely to produce speech recognition results that were useful in synchronizing audio and text files based on the detection of silence in the context of expected words. In accordance with the present invention, statistical language models are generated from the text corpus which is to be synchronized with a corresponding audio corpus.
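For illustration, a tri-gram model of this kind estimates P(w3 | w1, w2) from counts over the text corpus. The sketch below uses plain maximum-likelihood estimates, omitting the smoothing a production recognizer would require; treating silence as an ordinary token in the training sequence is an assumption consistent with the description above:

```python
from collections import Counter, defaultdict

def train_trigram(words):
    """Estimate P(w3 | w1, w2) by maximum likelihood from the text
    corpus that is to be synchronized (no smoothing)."""
    tri = Counter(zip(words, words[1:], words[2:]))
    bi = Counter(zip(words, words[1:]))
    model = defaultdict(dict)
    for (w1, w2, w3), count in tri.items():
        model[(w1, w2)][w3] = count / bi[(w1, w2)]
    return model

# Silence can be modeled as a word in the training sequence:
words = "call me john <sil> i long for you".split()
model = train_trigram(words)
print(model[("me", "john")])  # {'<sil>': 1.0}
```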
While the use of statistical language models for speech recognition purposes is one feature of the present invention, it is recognized that other types of language models may be employed instead without departing from the overall invention.
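Chaining the sketches above end to end gives a miniature version of the whole procedure. Here the speech recognition step is mocked with the FIG. 7 example output rather than produced by an actual recognition engine, so this is a structural illustration only:

```python
# Mocked recognizer output (word, start_time_ms); a real system would
# obtain this by decoding the audio corpus with the statistical LM and
# a speaker independent acoustic model.
rec = [("call", 0), ("me", 40), ("john", 70), ("<sil>", 110),
       ("i", 150), ("long", 180), ("for", 230), ("you", 260)]
reference = "call me john i long for you".split()

alignment = global_align([w for w, _ in rec], reference)
expected = set(expected_silence_positions("Call me John. I long for you."))
points = find_sync_points([w for w, _ in rec], alignment, expected)
print(insert_time_pointers(reference, rec, alignment, points))
# call me john (110) i long for you
```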
Numerous additional features and advantages of the present invention will be discussed in the detailed description which follows.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates unsynchronized electronic text and audio corpuses corresponding to the same exemplary literary work.

FIG. 2 illustrates a computer system implemented in accordance with one embodiment of the present invention.

FIG. 3 illustrates a set of application programs included in the computer system of FIG. 2.

FIG. 4 illustrates a set of program data included in the computer system of FIG. 2.

FIG. 5 illustrates the flow of data and information between various modules of the present invention.

FIG. 6 is a flow diagram illustrating the steps of the present invention involved in synchronizing text and audio files.

FIG. 7 illustrates an exemplary text corpus and the global alignment of the content of the text corpus with recognized speech.

FIGS. 8, 10 and 12 illustrate exemplary synchronized text and audio corpuses created in accordance with various embodiments of the present invention.

FIGS. 9, 11 and 13 illustrate exemplary content of the aligned audio and text corpuses shown in FIGS. 8, 10 and 12, respectively.
`
DETAILED DESCRIPTION

As discussed above, the present invention is directed to methods and apparatus for automatically synchronizing electronic audio and text data, e.g., files, corresponding to the same work, e.g., literary work, radio program, document or information.
FIG. 1 illustrates a set 9 of unsynchronized text and audio files corresponding to, e.g., the same exemplary literary work. A plurality of N text files 12, 14 form a text corpus 10 which represents the complete text of the exemplary literary work. Text files 12, 14 may be in any one of a plurality of electronic formats, e.g., an ASCII format, used to store text information. A plurality of M audio files 22, 24 form an audio corpus 20 which represents a complete audio version of the exemplary work. Audio files 22, 24 may be in the form of WAVE or other electronic audio file formats used to store speech, music and/or other audio signals. Note that the number N of text files which form the text corpus 10 may be different than the number M of audio files which form the audio corpus 20.
While the text corpus 10 and audio corpus 20 correspond to the same literary work, the audio and text files are unsynchronized; that is, there are no links or reference points in the files which can be used to correlate the informational content of the two files. Thus, it is not possible to easily access a point in the audio corpus 20 which corresponds to the same point in the literary work as a point in the text corpus 10. This makes it difficult to access the same location in the literary work when switching between text and audio modes of presenting the literary work.
FIG. 2 and the following discussion provide a brief, general description of an exemplary apparatus, e.g., computer system, in which at least some aspects of the present invention may be implemented. The computer system may be implemented as a portable device, e.g., a notebook computer or a device for presenting books or other literary works stored in electronic form.
The present invention will be described in the general context of computer-executable instructions, such as program modules, being executed by a personal computer. However, the methods of the present invention may be effected by other apparatus. Program modules may include applications, routines, programs, objects, components, data structures, etc. that perform a task(s) or implement particular abstract data types. Moreover, those skilled in the art will appreciate that at least some aspects of the present invention may be practiced with other configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network computers, minicomputers, set top boxes, mainframe computers, and the like. At least some aspects of the present invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote memory storage devices.
With reference to FIG. 2, an exemplary apparatus 100 for implementing at least some aspects of the present invention includes a general purpose computing device in the form of a conventional personal computer 120. The personal computer 120 may include a processing unit 121, a system memory 122, and a system bus 123 that couples various system components, including the system memory 122, to the processing unit 121. The system bus 123 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The system memory may include read only memory (ROM) 124 and/or random access memory (RAM) 125. A basic input/output system 126 (BIOS), containing basic routines that help to transfer information between elements within the personal computer 120, such as during start-up, may be stored in ROM 124. The personal computer 120 may also include a hard disk drive 127 for reading from and writing to a hard disk (not shown), a magnetic disk drive 128 for reading from or writing to a (e.g., removable) magnetic disk 129, and a (magneto-)optical disk drive 130 for reading from or writing to a removable (magneto-)optical disk 131 such as a compact disk or other (magneto-)optical media. The hard disk drive 127, magnetic disk drive 128, and (magneto-)optical disk drive 130 may be coupled with the system bus 123 by a hard disk drive interface 132, a magnetic disk drive interface 133, and a (magneto-)optical drive interface 134, respectively. The drives and their associated storage media provide nonvolatile storage of machine readable instructions, data structures, program modules and other data for the personal computer 120. Although the exemplary environment described herein employs a hard disk, a removable magnetic disk 129 and a removable (magneto-)optical disk 131, those skilled in the art will appreciate that other types of storage media, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories (RAMs), read only memories (ROM), and the like, may be used instead of, or in addition to, the storage devices introduced above.
A number of program modules may be stored on the hard disk 127, magnetic disk 129, (magneto-)optical disk 131, ROM 124 or RAM 125. In FIG. 2 an operating system 135, one (1) or more application programs 136, other program modules 137, and/or program data 138 are shown as being stored in RAM 125. Operating system 135', application program(s) 136', other program modules 137' and program data 138' are shown as being stored on hard disk drive 127. As will be discussed below in regard to FIG. 3, in the exemplary embodiment the application programs include an audio/text synchronization program implemented in accordance with the present invention. In addition, program data 138, described in further detail with regard to FIG. 4, includes an acoustic model and other data used by the audio/text synchronization program 308 of the present invention.
A user may enter commands and information into the personal computer 120 through input devices, such as a keyboard 140 and pointing device 142, for example. Other input devices (not shown) such as a microphone, joystick, game pad, satellite dish, scanner, or the like may also be included. These and other input devices are often connected to the processing unit 121 through a serial port interface 146 coupled to the system bus. However, input devices may be connected by other interfaces, such as a parallel port, a game port or a universal serial bus (USB). A monitor 147 or other type of display device may also be connected to the system bus 123 via an interface, such as a video adapter 148, for example. In addition to the monitor, the personal computer 120 may include a sound card 161 coupled to speaker(s) 162 and other peripheral output devices (not shown), such as printers, for example.
The personal computer 120 may operate in a networked environment which defines logical connections to one (1) or more remote computers, such as a remote computer 149. The remote computer 149 may be another personal computer, a server, a router, a network PC, a peer device or other common network node, and may include many or all of the elements described above relative to the personal computer 120. The logical connections depicted in FIG. 1A include a local area network (LAN) 151 and a wide area network (WAN) 152, an intranet and the Internet.

When used in a LAN, the personal computer 120 may be connected to the LAN 151 through a network interface adapter (or "NIC") 153. When used in a WAN, such as the Internet, the personal computer 120 may include a modem 154 or other means for establishing communications over the wide area network 152. The modem 154, which may be internal or external, may be connected to the system bus 123 via the serial port interface 146. In a networked environment, at least some of the program modules depicted relative to the personal computer 120 may be stored in a remote memory storage device. The network connections shown are exemplary, and other means of establishing a communications link between the computers may be used.
FIG. 3 illustrates the set of application programs 136 stored in the memory 125 in greater detail. As illustrated, the application programs 136 include a word processor program 302, an electronic book program 304, a spread sheet program 306 and an audio/text synchronization program 308 of the present invention. The electronic book program 304 is capable of accessing and presenting the content of audio and/or text files to the user of t