Heckerman et al.

(10) Patent No.: US 6,260,011 B1
(45) Date of Patent: Jul. 10, 2001
`
`(54) METHODS AND APPARATUS FOR
`AUTOMATICALLY SYNCHRONIZING
`ELECTRONIC AUDIO FILES WITH
`ELECTRONIC TEXT FILES
`
(75) Inventors: David E. Heckerman, Bellevue; Fileno A. Alleva, Redmond; Robert L. Rounthwaite, Fall City; Daniel Rosen, Bellevue; Mei-Yuh Hwang, Redmond; Yoram Yaacovi, Redmond; John L. Manferdelli, Redmond, all of WA (US)

(73) Assignee: Microsoft Corporation, Redmond, WA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.
`
`(21) Appl. No.: 09/531,054
`
(22) Filed: Mar. 20, 2000
`
(51) Int. Cl.7 .................... G10L 15/08; G10L 15/26; G10L 11/06
(52) U.S. Cl. .................... 704/235; 704/231; 704/251; 704/254; 704/278
(58) Field of Search .................... 704/260, 252, 704/270.1, 270, 243, 231, 236, 278, 238
`
(56) References Cited
`
U.S. PATENT DOCUMENTS

3,700,815 * 10/1972 Doddington et al. .......... 704/246
4,779,209 * 10/1988 Stapleford et al. .......... 704/278
5,008,871 * 4/1991 Howells et al. .......... 369/28
5,333,275 * 7/1994 Wheatley et al. .......... 704/243
5,649,060 * 7/1997 Ellozy et al. .......... 704/278
5,737,725 * 4/1998 Case .......... 704/270
5,758,024 5/1998 Alleva .......... 395/63
5,794,197 8/1998 Alleva et al. .......... 704/255
5,960,447 * 9/1999 Holt et al. .......... 704/235
6,076,059 * 6/2000 Glickman et al. .......... 704/260
`
OTHER PUBLICATIONS
`
Hauptmann et al., "Story Segmentation & Detection of Commercials in Broadcast News Video", Research & Technology Advances in Digital Libraries, Apr. 24, 1998.*

Pedro J. Moreno, Chris Joerg, Jean-Manuel Van Thong, and Oren Glickman, "A Recursive Algorithm for the Forced Alignment of Very Long Audio Segments", Cambridge Research Laboratory, pp. 1-4, Nov. 20, 1998.
`
`Lawrence Rabiner and Biing-Hwang Juang, Fundamentals
`of Speech Recognition, pp. 434-495 (1993).
`
`* cited by examiner
`
Primary Examiner-Richemond Dorvil
Assistant Examiner-Daniel A. Nolan
(74) Attorney, Agent, or Firm-Straub & Pokotylo; Michael P. Straub

(57) ABSTRACT
`
Automated methods and apparatus for synchronizing audio and text data, e.g., in the form of electronic files, representing audio and text expressions of the same work or information are described. A statistical language model is generated from the text data. A speech recognition operation is then performed on the audio data using the generated language model and a speaker independent acoustic model. Silence is modeled as a word which can be recognized. The speech recognition operation produces a time-indexed set of recognized words, some of which may be silence. The recognized words are globally aligned with the words in the text data. Recognized periods of silence which correspond to expected periods of silence, and which are adjoined by one or more correctly recognized words, are identified as points where the text and audio files should be synchronized, e.g., by the insertion of bi-directional pointers. In one embodiment, for a text location to be identified for synchronization purposes, both words which bracket, e.g., precede and follow, the recognized silence must be correctly identified. Pointers corresponding to identified locations of silence to be used for synchronization purposes are inserted into the text and/or audio files at the identified locations. Audio time stamps obtained from the speech recognition operation may be used as the bi-directional pointers. Synchronized text and audio data may be output in a variety of file formats.
`
`36 Claims, 9 Drawing Sheets
`
`Ultratec Exhibit 1026
`Ultratec v Sorenson IP Holdings Page 1 of 19
`
`
`
[Sheet 1 of 9, FIG. 1: unsynchronized set of files. A text corpus 10 comprises text files 12 (TEXT 1) through 14 (TEXT N); an audio corpus 20 comprises audio files 22 (AUDIO 1) through 24 (AUDIO M).]
`
`
`
[Sheet 2 of 9, FIGURE 2: block diagram of an exemplary computer system, including a system bus, RAM holding an operating system, application programs, other program modules and program data, hard disk and magnetic disk drives with their interfaces, a sound card with speakers, a video adapter with graphics accelerator, and a serial port interface to a remote computer.]
`
`
`
[Sheet 3 of 9. FIG. 3: application programs 136, including a word processor 302, an electronic book 304, a spreadsheet 306, and an audio/text synchronization program 308 comprising a control module 310, a speech recognizer module 312, a statistical language model generation module 314, a speech recognition training module 316 and a text/audio alignment module 318. FIG. 4: program data 138, including unaligned text files 402, unaligned audio files 404, recognized text + time stamps 406, a statistical LM 408, an acoustic model 410 and synchronized text/audio 412.]
`
`
`
[Sheet 4 of 9, FIG. 5: data flow 500. The text corpus 10 feeds the statistical language model generation module 314, producing the statistical LM 408. The speech recognizer module 312 processes the audio corpus 20 using the statistical LM 408 and the acoustic model 410, producing recognized text with time stamps 406. The text/audio alignment module 318 produces the synchronized text/audio 412, stored as program data 138, under the direction of the control module 310.]
`
`
`
[Sheet 5 of 9, FIG. 6: flow diagram. START; step 602: generate a statistical language model (LM) 604 from the text corpus 10; step 606: perform a speech recognition operation on the audio corpus 20 corresponding to the text corpus, using the generated language model and a speaker independent acoustic model 410; step 607: globally align the recognized text 406 with the text in the text corpus; step 608: identify locations in the recognized text where silence is adjoined by correctly identified word(s) by comparing the recognized text to the text in the globally aligned text corpus; step 610: use the identified locations to index the text and/or audio corpus, thereby generating synchronized text and audio files; step 612: store the synchronized text/audio files; step 614: STOP.]
`
`
`
[Sheet 6 of 9, FIG. 7: global alignment example. Actual text 702: "Call me John. I long for you." Recognized words 706 with start and stop times 710, 712, e.g., CALL (12-48), ME (50-85), JOHN (90-109), <SIL> (110-117); the word LONG is misrecognized as SONG, so the dashed region around the first silence is not usable for synchronization.]
`
`
`
[Sheet 7 of 9. FIG. 8: synchronized text & audio 412; the text corpus and the audio corpus are linked by text-with-time-stamp entries 804, 814. FIG. 9: exemplary content of entries 804 and 814: "Call me John. [110] I long for you ...."]
`
`
`
[Sheet 8 of 9. FIG. 10: synchronized text & audio 412; file TEXT 1 in the text corpus is linked to file AUDIO 1 in the audio corpus via pointers 1004, 1014. FIG. 11: exemplary content; pointer 1004 in the text file: "Call me John. [AUDIO 1, 110] I long for you ...."; pointer 1014 in the audio file: "Call me John. [TEXT 1, 110] I long for you ...."]
`
`
`
[Sheet 9 of 9. FIG. 12: synchronized text & audio 412; text files TEXT 1 (1202) through TEXT Y (1206) in the text corpus are linked to audio files AUDIO 1 (1212) through AUDIO Y (1216) in the audio corpus. FIG. 13: exemplary content of text segments 1204 ("Call me John.") and 1205 ("I long for you ....") and the corresponding audio segments 1214, 1215.]
`
`
`
`METHODS AND APPARATUS FOR
`AUTOMATICALLY SYNCHRONIZING
`ELECTRONIC AUDIO FILES WITH
`ELECTRONIC TEXT FILES
`
`FIELD OF THE INVENTION
`
`The present invention relates to methods and apparatus
`for synchronizing electronic audio and text data in an
`automated manner, and to using the synchronized data, e.g.,
`audio and text files.
`
`BACKGROUND OF THE INVENTION
`
Historically, books and other literary works have been expressed in the form of text. Given the growing use of computers, text is now frequently represented and stored in electronic form, e.g., in the form of text files. Accordingly, in the modern age, users of computer devices can obtain electronic copies of books and other literary works.

Frequently text is read aloud so that the content of the text can be provided to one or more people in an oral, as opposed to written, form. The reading of stories to children and the reading of text to the physically impaired are common examples where text is read aloud. The commercial distribution of literary works in both electronic and audio versions has been commonplace for a significant period of time. The widespread availability of personal computers and other computer devices capable of displaying text and playing audio files stored in electronic form has begun to change the way in which text versions of literary works and their audio counterparts are distributed.

Electronic distribution of books and other literary works in the form of electronic text and audio files can now be accomplished via compact discs and/or the Internet. Electronic versions of literary works in both text and audio versions can now be distributed far more cheaply than paper copies. While the relatively low cost of distributing electronic versions of a literary work provides authors and distributors an incentive for distributing literary works in electronic form, consumers can benefit from having such works in electronic form as well.

Consumers may wish to switch between audio and text versions of a literary work. For example, in the evening an individual may wish to read a book. However, on their way to work, the same individual may want to listen to the same version of the literary work from the point, e.g., sentence or paragraph, where they left off reading the night before. Consumers attempting to improve their reading skills can also find text and audio versions in the form of electronic files beneficial. For example, an individual attempting to improve his/her reading skills may wish to listen to the audio version of a book while having the text corresponding to the audio highlighted on a display device. Also, many vision-impaired or hearing-impaired readers might benefit from having linked audio and text versions of a literary work.

While electronic text and audio versions of many literary works exist, relatively few of these works include the links between the audio and text versions needed to support easy accessing of the same point in both versions of a work. Without such links between the text and audio versions of a work, it is difficult to easily switch between the two versions of the work or to highlight text corresponding to the portion of the audio version being played at a given moment in time.

Links or indexes used to synchronize audio and text versions of the same work may be manually generated via human intervention. However, such human involvement can be costly and time consuming. Accordingly, there is a need for methods and apparatus for automating the synchronization of electronic text and audio versions of a work.

Previous attempts to automate the synchronization of electronic text files and audio files of the same work have focused primarily on the indexing of audio files corresponding to radio and other broadcasts with electronic text files representing transcripts of the broadcasts. Such indexing is designed to allow an individual viewing an excerpt from a transcript over the Internet to hear an audio clip corresponding to the excerpt. In such applications, the precision required in the alignment is often considered not to be critical, and an alignment error of up to 2 seconds is considered by some to be acceptable.

While the task of aligning audio files corresponding to TV and radio broadcasts with text transcripts of the broadcasts is similar in nature to the task of aligning text files of books or other literary works with audio versions made therefrom, there are important differences between the two tasks which arise from the differing content of the files being aligned and the ultimate use of the aligned files.

In the case of recordings of literary and other text documents which are read aloud and recorded for commercial purposes, a single reader is often responsible for the reading of the entire text. The reader is often carefully chosen by the company producing the audio version of the literary work for proper pronunciation, inflection, general understandability and overall accuracy. In addition, audio recordings of books and other literary works are normally generated in a sound-controlled environment designed to keep background noise to a minimum. Thus commercial audio versions of books or other literary works intended to be offered for sale, either alone or in combination with a text copy, are often of reasonably good quality with a minimum of background noise. Furthermore, they tend to accurately reflect the punctuation in the original work and, in the case of commercial audio versions of literary works, a single individual may be responsible for the audio versions of several books or stories, since commercial production companies tend to use the same reader to produce the audio versions of multiple literary works, e.g., books.

In the case of transcripts produced from, e.g., radio broadcasts, television broadcasts, or court proceedings, multiple speakers with different pronunciation characteristics, e.g., accents, frequently contribute to the same transcript. Each speaker may contribute only a small portion of the total recording. The original audio may have a fair amount of background noise, e.g., music or other noise. In addition, in TV and radio broadcasts, speech from multiple speakers may overlap, making it difficult to distinguish the end of a sentence spoken by one speaker from the start of a sentence from a new speaker. Furthermore, punctuation in the transcript may be less accurate than desired, given that the transcript may be based on unrehearsed conversational speech generated without regard to how it might later be transcribed using written punctuation marks.

In the case of attempting to synchronize text and audio versions of literary works, given the above discussed uses of such files, accurately synchronizing the starting points of paragraphs and sentences is often more important than being able to synchronize individual words within sentences.

In view of the above discussion, it is apparent that there is a need for new methods and apparatus which can be used to accurately synchronize audio and text files. It is desirable that at least some methods and apparatus be well suited for
`
`
`
`
synchronizing text and audio versions of literary works. It is also desirable that the methods and apparatus be capable of synchronizing the starting points of sentences and/or paragraphs in audio and text files with a high degree of accuracy.

SUMMARY OF THE PRESENT INVENTION

The present invention is directed to methods and apparatus for automatically generating synchronized audio and text data, e.g., files, from unsynchronized electronic audio and text versions of the same work, e.g., literary work, program or document.

The synchronization of long audio files, e.g., 30 minutes and longer, with corresponding text in an automated manner presents significant difficulties, since absolute certainty as to points in the audio and text versions which correlate to each other exists only at the beginning and end of the complete text and audio versions of the same work.

When synchronizing text and audio versions of the same work, it is highly desirable to synchronize at least one point per paragraph, preferably at the start of each paragraph. When positions within a paragraph are also to be synchronized, the start of sentences is a particularly useful location to synchronize, since people tend to prefer reading or listening to speech from the start, as opposed to the middle, of sentences.

The inventors of the present invention recognized that silence normally occurs at the ends of paragraphs and sentences but, for the most part, does not occur between words within a sentence during ordinary speech. They also recognized that in many audio versions of literary works and other text documents read aloud, the amount of background noise is intentionally kept to a minimum. This makes periods of silence in an audio version of a literary work relatively easy to detect. In addition, the locations where silence occurs are relatively easy to predict from punctuation and/or other content within the text version of the work.

Given that silence may occur within a sentence in an audio version of a literary work, e.g., because of a pause by the reader which is not reflected in the text by punctuation, the detection of periods of silence alone may be insufficient to reliably synchronize audio and text versions of a literary work. This is particularly the case in long audio sequences.

The inventors of the present application recognized that by performing speech recognition, spoken words in an audio work, in addition to periods of silence, could be detected automatically and used for purposes of synchronizing the text and audio versions of the work. Unfortunately, with known speech recognition techniques, recognition errors occur. In addition, even when recognition errors do not occur, differences may exist between an audio and a text version of the same work due, e.g., to reading errors on the part of the individual or individuals responsible for generating the audio version of the work.

The present invention uses a combination of silence detection and detection of actual words for purposes of synchronizing audio and text versions of the same work.

In accordance with the present invention, a speech recognition operation is performed on an audio corpus to recognize actual words and periods of silence. For speech recognition purposes, silence may be modeled as a word. A time-indexed set of recognized words and periods of silence is produced by the speech recognition process of the present invention. The results of the speech recognition operation are globally aligned with the text corpus by matching as much of the recognized text as possible to the corresponding text of the work without changing the sequence of the recognized or actual text.
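The global alignment just described can be sketched with a standard longest-matching-subsequence routine. The Python below is an illustrative sketch, not part of the patent; the function name is invented, and the word lists mirror the FIG. 7 example, in which LONG is misrecognized as SONG.

```python
from difflib import SequenceMatcher

def global_align(actual_words, recognized_words):
    """Globally align two word sequences without reordering either one.

    Returns (actual_index, recognized_index) pairs for words that match
    in order; words that do not match anything are simply skipped.
    """
    matcher = SequenceMatcher(a=actual_words, b=recognized_words, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            pairs.append((block.a + k, block.b + k))
    return pairs

actual = ["CALL", "ME", "JOHN", "<SIL>", "LONG", "FOR", "YOU", "<SIL>"]
recognized = ["CALL", "ME", "JOHN", "<SIL>", "SONG", "FOR", "YOU", "<SIL>"]
print(global_align(actual, recognized))
# [(0, 0), (1, 1), (2, 2), (3, 3), (5, 5), (6, 6), (7, 7)]
```

Note that position 4 is absent from the result: the misrecognized word is left unaligned rather than forced into a match, which is what allows the later silence test to reject unreliable locations.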
`
`4
`When periods of detected silence correspond to expected
`locations within the actual text, e.g., ends of sentences and
`paragraphs, one or more words adjoining the period of
`silence in the recognized text are compared to one or more
`5 corresponding words adjoining the expected location of
`silence in the actual text. If the words adjoining the text were
`properly recognized, both the recognized word or words
`adjoining the silence and the actual word or words adjoining
`the expected point of silence will match. When there is a
`match, the identified location of the silence in the audio file
`and the corresponding location in the text file are identified
`as corresponding audio and text locations where a pointer
`correlating the two files should be inserted.
`In one particular embodiment, for a location correspond(cid:173)
`ing to detected silence to be used for purposes of file
`15 synchronization, the recognized words bracketing i.e., pre(cid:173)
`ceding and following, the detected silence must be properly
`recognized, e.g., match, the words in the actual text brack(cid:173)
`eting the location believed to correspond to the detected
`silence.
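The bracketing test of this embodiment can be sketched as follows. The Python below is an illustrative assumption (the function name, data layout and alignment-pair set are invented for the example), not the patent's implementation; the data again mirrors FIG. 7.

```python
SIL = "<SIL>"

def silence_sync_points(actual, recognized, aligned):
    """Return recognized-word indices of silences usable for syncing.

    A recognized silence qualifies only when the words bracketing it
    (preceding and following) were correctly recognized, i.e. both
    bracket positions appear in the alignment. `aligned` is a set of
    (actual_index, recognized_index) pairs from the global alignment.
    """
    points = []
    for a_idx, r_idx in aligned:
        if actual[a_idx] != SIL or recognized[r_idx] != SIL:
            continue  # only silences expected in the actual text qualify
        before_ok = (a_idx - 1, r_idx - 1) in aligned
        # a silence that ends the recording has no following word to check
        after_ok = r_idx == len(recognized) - 1 or (a_idx + 1, r_idx + 1) in aligned
        if before_ok and after_ok:
            points.append(r_idx)
    return sorted(points)

actual = ["CALL", "ME", "JOHN", "<SIL>", "LONG", "FOR", "YOU", "<SIL>"]
recognized = ["CALL", "ME", "JOHN", "<SIL>", "SONG", "FOR", "YOU", "<SIL>"]
aligned = {(0, 0), (1, 1), (2, 2), (3, 3), (5, 5), (6, 6), (7, 7)}
print(silence_sync_points(actual, recognized, aligned))
# [7]: the first silence is rejected because the misrecognized "SONG" follows it
```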
When a location in a text file corresponding to detected silence is identified for purposes of file synchronization, a pointer to the recognized silence in the audio file is added at the location in the text file identified as corresponding to the recognized silence. This results in the ends of sentences and/or paragraphs in the text file being synchronized with corresponding occurrences of silence in the audio file.

Each pointer added to the text file may be, e.g., a time index or time stamp into the corresponding audio file. A similar pointer, e.g., time index or stamp, may be added to the audio file if the corresponding audio file does not already include such values.
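The insertion of time-stamp pointers can be sketched as below. This is an illustrative sketch only; the square-bracketed time-stamp format follows the style of FIG. 9, and the function name and data are invented for the example.

```python
def insert_time_stamps(words, sync_points):
    """Insert bi-directional time-stamp pointers into a word list.

    `sync_points` maps a word index (a silence location identified for
    synchronization) to the audio time stamp of that silence.
    """
    out = []
    for i, word in enumerate(words):
        out.append(word)
        if i in sync_points:
            out.append(f"[{sync_points[i]}]")  # pointer into the audio file
    return " ".join(out)

text = ["Call", "me", "John.", "I", "long", "for", "you."]
print(insert_time_stamps(text, {2: 110}))
# Call me John. [110] I long for you.
```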
Pointers inserted into the audio and text files for synchronization purposes may take on a wide range of forms in addition to time stamp values. For example, pointers may include a filename or file identifier in conjunction with an index value used to access a particular point within the identified file. In such cases, the pointers added to audio files may include a file name or file identifier which identifies the corresponding text file. Pointers added to the text files in such embodiments may include a file name or file identifier which identifies the corresponding audio file.

As part of the speech recognition process of the present invention, statistical language models, generated from the text corpus to be synchronized, may be used. Statistical language models, e.g., tri-gram language models, predict the statistical probability that a hypothesized word or words will occur in the context of one or more previously recognized words. Since the synchronization of audio and text files in accordance with the present invention relies heavily on the accurate identification of silence in the context of preceding and/or subsequent words, it was recognized that statistical language models, as opposed to simple language models, were more likely to produce speech recognition results useful in synchronizing audio and text files based on the detection of silence in the context of expected words. In accordance with the present invention, statistical language models are generated from the text corpus which is to be synchronized with a corresponding audio corpus.
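A tri-gram model of the kind described can be sketched as counts over the text corpus. The Python below is an illustrative simplification (maximum-likelihood estimates, no smoothing or backoff) and not the patent's language model; silence is treated as an ordinary word, as the summary above suggests.

```python
from collections import defaultdict

def train_trigram_counts(words):
    """Count trigrams and their bigram contexts over a word sequence."""
    tri = defaultdict(int)
    bi = defaultdict(int)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        tri[(w1, w2, w3)] += 1
        bi[(w1, w2)] += 1
    return tri, bi

def trigram_prob(tri, bi, w1, w2, w3):
    """Maximum-likelihood P(w3 | w1, w2); 0.0 for unseen contexts."""
    context = bi.get((w1, w2), 0)
    return tri.get((w1, w2, w3), 0) / context if context else 0.0

corpus = "call me john <sil> i long for you <sil> call me john <sil>".split()
tri, bi = train_trigram_counts(corpus)
print(trigram_prob(tri, bi, "me", "john", "<sil>"))
# 1.0: in this toy corpus, silence always follows "me john"
```

Because the model is trained on the very text to be synchronized, it strongly predicts silence exactly where the punctuation of that text expects it, which is what makes the recognized silences reliable anchors.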
While the use of statistical language models for speech recognition purposes is one feature of the present invention, it is recognized that other types of language models may be employed instead without departing from the overall invention.

Numerous additional features and advantages of the present invention will be discussed in the detailed description which follows.
`
`
`
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIG. 1 illustrates unsynchronized electronic text and
`audio corpuses corresponding to the same exemplary literary
`work.
`FIG. 2 illustrates a computer system implemented in
`accordance with one embodiment of the present invention.
`FIG. 3 illustrates a set of application programs included in
`the computer system of FIG. 2.
`FIG. 4 illustrates a set of program data included in the
`computer system of FIG. 2.
`FIG. 5 illustrates the flow of data and information
`between various modules of the present invention.
`FIG. 6 is a flow diagram illustrating the steps of the
`present invention involved in synchronizing text and audio
`files.
`FIG. 7 illustrates an exemplary text corpus and the global
`alignment of the content of the text corpus with recognized
`speech.
`FIGS. 8, 10 and 12 illustrate exemplary synchronized text
`and audio corpuses created in accordance with various
`embodiments of the present invention.
`FIGS. 9, 11 and 13 illustrate exemplary content of the
`aligned audio and text corpuses shown in FIGS. 8, 10 and
`12, respectively.
`
`DETAILED DESCRIPTION
`
`As discussed above, the present invention is directed to
`methods and apparatus for automatically synchronizing
`electronic audio and text data, e.g., files, corresponding to
`the same work, e.g., literary work, radio program, document
`or information.
`FIG. 1 illustrates a set 9 of unsynchronized text and audio
`files corresponding to, e.g., the same exemplary literary
`work. A plurality of N text files 12, 14 form a text corpus 10
`which represents the complete text of the exemplary literary
`work. Text files 12, 14 may be in any one of a plurality of
`electronic formats, e.g., an ASCII format, used to store text
`information. A plurality of M audio files 22, 24 form an
`audio corpus 20 which represents a complete audio version
`of the exemplary work. Audio files 22, 24 may be in the form
`of WAVE or other electronic audio file formats used to store
`speech, music and/or other audio signals. Note that the
`number N of text files which form the text corpus 10 may be
`different than the number M of audio files which form the
`audio corpus 20.
`While the text corpus 10 and audio corpus 20 correspond
`to the same literary work, the audio and text files are
`unsynchronized, that is, there are no links or reference points
`in the files which can be used to correlate the informational
content of the two files. Thus, it is not possible to easily access a point in the audio corpus 20 which corresponds to the same point in the literary work as a point in the text corpus 10. This makes it difficult to access the same location
`in the literary work when switching between text and audio
`modes of presenting the literary work.
FIG. 2 and the following discussion provide a brief, general description of an exemplary apparatus, e.g., computer system, in which at least some aspects of the present invention may be implemented. The computer system may be implemented as a portable device, e.g., notebook computer or a device for presenting books or other literary works stored in electronic form.
`The present invention will be described in the general
context of computer-executable instructions, such as program modules, being executed by a personal computer.
However, the methods of the present invention may be effected by other apparatus. Program modules may include applications, routines, programs, objects, components, data structures, etc. that perform a task(s) or implement particular abstract data types. Moreover, those skilled in the art will appreciate that at least some aspects of the present invention may be practiced with other configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network computers, minicomputers, set top boxes, mainframe computers, and the like. At least some aspects of the present invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote memory storage devices.

With reference to FIG. 2, an exemplary apparatus 100 for implementing at least some aspects of the present invention includes a general purpose computing device in the form of a conventional personal computer 120. The personal computer 120 may include a processing unit 121, a system memory 122, and a system bus 123 that couples various system components, including the system memory 122, to the processing unit 121. The system bus 123 may be any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The system memory may include read only memory (ROM) 124 and/or random access memory (RAM) 125. A basic input/output system 126 (BIOS), containing basic routines that help to transfer information between elements within the personal computer 120, such as during start-up, may be stored in ROM 124. The personal computer 120 may also include a hard disk drive 127 for reading from and writing to a hard disk (not shown), a magnetic disk drive 128 for reading from or writing to a (e.g., removable) magnetic disk 129, and a (magneto-)optical disk drive 130 for reading from or writing to a removable (magneto-)optical disk 131 such as a compact disk or other (magneto-)optical media. The hard disk drive 127, magnetic disk drive 128, and (magneto-)optical disk drive 130 may be coupled with the system bus 123 by a hard disk drive interface 132, a magnetic disk drive interface 133, and a (magneto-)optical drive interface 134, respectively. The drives and their associated storage media provide nonvolatile storage of machine readable instructions, data structures, program modules and other data for the personal computer 120. Although the exemplary environment described herein employs a hard disk, a removable magnetic disk 129 and a removable (magneto-)optical disk 131, those skilled in the art will appreciate that other types of storage media, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories (RAMs), read only memories (ROM), and the like, may be used instead of, or in addition to, the storage devices introduced above.

A number of program modules may be stored on the hard disk 127, magnetic disk 129, (magneto-)optical disk 131, ROM 124 or RAM 125. In FIG. 2 an operating system 135, one (1) or more application programs 136, other program modules 137, and/or program data 138 are shown as being stored in RAM 125. Operating system 135', application program(s) 136', other program modules 137' and program data 138' are shown as being stored on hard disk drive 127. As will be discussed below in regard to FIG. 3, in the exemplary embodiment the application programs include an audio/text synchronization program implemented in accor-
`
`
`
`
used to generate the synchronized audio and/or text files from unsynchronized audio and text files. The modules include a control module 310, a speech recognizer module 312, a statistical language model generation module 314, an optional speech recognition training module 316 and a text/audio alignment module 318. The control module 310 is responsible for controlling the interaction of the various other modules which comprise the audio/text synchronization program 136 and is responsible for controlling the accessing and storage of audio and text files. The speech recognizer module 312 is used for performing speech recognition, as a function of a language model and an acoustic model