(12) United States Patent
Heckerman et al.

(10) Patent No.: US 6,260,011 B1
(45) Date of Patent: Jul. 10, 2001
(54) METHODS AND APPARATUS FOR AUTOMATICALLY SYNCHRONIZING ELECTRONIC AUDIO FILES WITH ELECTRONIC TEXT FILES

(75) Inventors: David E. Heckerman, Bellevue; Fileno A. Alleva, Redmond; Robert L. Rounthwaite, Fall City; Daniel Rosen, Bellevue; Mei-Yuh Hwang, Redmond; Yoram Yaacovi, Redmond; John L. Manferdelli, Redmond, all of WA (US)

(73) Assignee: Microsoft Corporation, Redmond, WA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/531,054

(22) Filed: Mar. 20, 2000

(51) Int. Cl.7: G10L 15/08; G10L 15/26; G10L 11/06

(52) U.S. Cl.: 704/235; 704/231; 704/251; 704/254; 704/278

(58) Field of Search: 704/260, 252, 270.1, 270, 243, 231, 236, 278, 238
`(56)
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`3,700,815 * 10/1972 Doddington et al. ................ 704/246
`4,779,209 * 10/1988 Stapleford et al. .................. 704/278
`5,008,871 * 4/1991 Howells et al.
`....................... 369/28
`5,333,275 * 7/1994 Wheatley et al.
`................... 704/243
`5,649,060 * 7/1997 Ellozy et al.
`........................ 704/278
`5,737,725 * 4/1998 Case ..................................... 704/270
`5,758,024
`5/1998 Alleva .................................... 395/63
`8/1998 Alleva et al. ........................ 704/255
`5,794,197
`5,960,447 * 9/1999 Holt et al. ............................ 704/235
`6,076,059 * 6/2000 Glickman et al. ................... 704/260
`
`OIBER PUBLICATIONS
`
`Hauptmann et al, "Story Segmentation & Detection of
`Commercials in roadcast News Video", Research & Tech(cid:173)
`nology Advances in Digital Libraries, Apr. 24, 1998. *
`
`Pedro J Moreno, Chris Joerg, Jean-Manuel Van Thong, and
`Oren Glickman, "A Recursive Algorithm for the Forced
`Alignment of Very Long Audio Segments"; Cambridge
`Research Laboratory, pp. 1-4, Nov. 20, 1998.
`
`Lawrence Rabiner and Biing-Hwang Juang, Fundamentals
`of Speech Recognition, pp. 434-495 (1993).
`
`* cited by examiner
`
Primary Examiner-Richemond Dorvil
Assistant Examiner-Daniel A. Nolan
(74) Attorney, Agent, or Firm-Straub & Pokotylo; Michael P. Straub

(57) ABSTRACT
Automated methods and apparatus for synchronizing audio and text data, e.g., in the form of electronic files, representing audio and text expressions of the same work or information are described. A statistical language model is generated from the text data. A speech recognition operation is then performed on the audio data using the generated language model and a speaker independent acoustic model. Silence is modeled as a word which can be recognized. The speech recognition operation produces a time indexed set of recognized words, some of which may be silence. The recognized words are globally aligned with the words in the text data. Recognized periods of silence which correspond to expected periods of silence, and which are adjoined by one or more correctly recognized words, are identified as points where the text and audio files should be synchronized, e.g., by the insertion of bi-directional pointers. In one embodiment, for a text location to be identified for synchronization purposes, both words which bracket, e.g., precede and follow, the recognized silence must be correctly identified. Pointers corresponding to identified locations of silence to be used for synchronization purposes are inserted into the text and/or audio files at the identified locations. Audio time stamps obtained from the speech recognition operation may be used as the bi-directional pointers. Synchronized text and audio data may be output in a variety of file formats.
36 Claims, 9 Drawing Sheets
[FIG. 1: Unsynchronized set: a text corpus 10 made up of text files TEXT 1 ... TEXT N (12, 14) and an audio corpus 20 made up of audio files AUDIO 1 ... AUDIO M (22, 24).]
[FIG. 2: Exemplary computer system, including a personal computer 120 with a system bus, RAM holding the operating system, application program(s), other program modules and program data, hard disk drive and magnetic disk drive interfaces, a video adapter (with graphics accelerator), a sound card, speakers, a serial port interface 134, and a remote computer 151.]
[FIG. 3: Application programs 136, including a word processor 302, an electronic book 304, a spreadsheet 306, and an audio/text synchronization program 308 comprising a control module 310, a speech recognizer module 312, a statistical language model generation module 314, a speech recognition training module 316, and a text/audio alignment module 318.]

[FIG. 4: Program data 138, including unaligned text files 402, unaligned audio files 404, recognized text + time stamps 406, statistical LM 408, acoustic model 410, and synchronized text/audio 412.]
[FIG. 5: Data flow 500: the text corpus 10 feeds the statistical language model generation module 314 to produce the statistical LM 408; the speech recognizer module 312 processes the audio corpus 20 using the statistical LM 408 and the acoustic model 410 to produce recognized text + time stamps 406; the text/audio alignment module 318 combines these to produce synchronized text/audio 412; the control module 310 and program data 138 coordinate the process.]
[FIG. 6: Flow diagram of the synchronization method (steps 602-614): start; generate a statistical language model (LM) from the text corpus; perform a speech recognition operation on the audio corpus corresponding to the text corpus, using the generated language model and a speaker independent acoustic model; globally align the recognized text with the text in the text corpus; identify locations in the recognized text where silence is adjoined by correctly identified word(s), by comparing the recognized text to the text in the globally aligned text corpus; use the identified locations to index the text and/or audio corpus, thereby generating synchronized text and audio files; store the synchronized text/audio files; stop.]
[FIG. 7: Global alignment example. The actual text "Call me John. I long for you." (702, 704) is aligned with the time indexed recognized words CALL (12-48), ME (50-85), JOHN (90-109), <SIL> (110-117), I (118-125), SONG (126-142), FOR (144-156), YOU (157-169), <SIL> (170-185); "LONG" is misrecognized as "SONG".]
[FIG. 8: Synchronized text and audio 412: a file 804 in the text corpus and a file 814 in the audio corpus are linked through text with time stamps.]

[FIG. 9: Exemplary content of files 804 and 814, each reading: "Call me John. [110] I long for you ...."]

`U.S. Patent
`
`Jul. 10, 2001
`
`Sheet 8 of 9
`
`US 6,260,011 Bl
`
`412
`
`002
`
`1012
`
`TEXT 1
`
`--
`
`1004
`1014
`-- ---------- -- -- AUDIO 1
`
`1016
`1006
`
`TEXT
`CORPUS
`
`AUDIO
`CORPUS
`
`SYNCRONIZED TEXT & AUDIO
`
`Fig. 10
`
`1004
`
`~
`
`1014
`
`~
`
`Call me John. [AUDIO 1,
`11 O] I long for you ....
`
`Call me John. [TEXT 1,
`11 O] /long for you . ...
`
`Fig. 11
`
`Ultratec Exhibit 1026
`Ultratec v Sorenson IP Holdings Page 9 of 19
`
`

`

`U.S. Patent
`
`Jul. 10, 2001
`
`Sheet 9 of 9
`
`US 6,260,011 Bl
`
`412
`
`1202
`
`1212
`
`1204
`121
`--------- -- -- AUDIO 1
`TEXT 1
`.--------...J..-.+- 1205
`.___T_E_X_T_2 _
`___. -
`--------- -- -- AUDIO 2
`
`1216
`1206
`
`TEXT Y
`
`-
`
`--------- -- -- AUDIO Y
`
`TEXT
`CORPUS
`
`AUDIO
`CORPUS
`
`SYNCRONIZED TEXT & AUDIO
`
`Fig. 12
`
`1204
`
`Call me John.
`
`1205
`~
`I long for you ....
`
`1214
`...........--:
`Call me John.
`
`1215
`ee:::
`I long for you . ...
`
`Fig. 13
`
`Ultratec Exhibit 1026
`Ultratec v Sorenson IP Holdings Page 10 of 19
`
`

`

METHODS AND APPARATUS FOR AUTOMATICALLY SYNCHRONIZING ELECTRONIC AUDIO FILES WITH ELECTRONIC TEXT FILES

FIELD OF THE INVENTION

The present invention relates to methods and apparatus for synchronizing electronic audio and text data in an automated manner, and to using the synchronized data, e.g., audio and text files.
BACKGROUND OF THE INVENTION

Historically, books and other literary works have been expressed in the form of text. Given the growing use of computers, text is now frequently represented and stored in electronic form, e.g., in the form of text files. Accordingly, in the modern age, users of computer devices can obtain electronic copies of books and other literary works.

Frequently, text is read aloud so that the content of the text can be provided to one or more people in an oral, as opposed to written, form. The reading of stories to children and the reading of text to the physically impaired are common examples where text is read aloud. The commercial distribution of literary works in both electronic and audio versions has been commonplace for a significant period of time. The widespread availability of personal computers and other computer devices capable of displaying text and playing audio files stored in electronic form has begun to change the way in which text versions of literary works and their audio counterparts are distributed.

Electronic distribution of books and other literary works in the form of electronic text and audio files can now be accomplished via compact discs and/or the Internet. Electronic versions of literary works in both text and audio versions can now be distributed far more cheaply than paper copies. While the relatively low cost of distributing electronic versions of a literary work provides authors and distributors an incentive for distributing literary works in electronic form, consumers can benefit from having such works in electronic form as well.

Consumers may wish to switch between audio and text versions of a literary work. For example, in the evening an individual may wish to read a book. However, on their way to work, the same individual may want to listen to the same version of the literary work from the point, e.g., sentence or paragraph, where they left off reading the night before.

Consumers attempting to improve their reading skills can also find text and audio versions in the form of electronic files beneficial. For example, an individual attempting to improve his/her reading skills may wish to listen to the audio version of a book while having the text corresponding to the audio highlighted on a display device. Also, many vision-impaired or hearing-impaired readers might benefit from having linked audio and text versions of a literary work.

While electronic text and audio versions of many literary works exist, relatively few of these works include the links between the audio and text versions needed to support easy access to the same point in both versions of a work. Without such links between the text and audio versions of a work, it is difficult to easily switch between the two versions of the work or to highlight text corresponding to the portion of the audio version being played at a given moment in time. Links or indexes used to synchronize audio and text versions of the same work may be manually generated via human intervention. However, such human involvement can be costly and time consuming. Accordingly, there is a need for methods and apparatus for automating the synchronization of electronic text and audio versions of a work.

Previous attempts to automate the synchronization of electronic text files and audio files of the same work have focused primarily on the indexing of audio files corresponding to radio and other broadcasts with electronic text files representing transcripts of the broadcasts. Such indexing is designed to allow an individual viewing an excerpt from a transcript over the Internet to hear an audio clip corresponding to the excerpt. In such applications, the precision required in the alignment is often considered not to be critical, and an alignment error of up to 2 seconds is considered by some to be acceptable.

While the task of aligning audio files corresponding to TV and radio broadcasts with text transcripts of the broadcasts is similar in nature to the task of aligning text files of books or other literary works with audio versions made therefrom, there are important differences between the two tasks which arise from the differing content of the files being aligned and the ultimate use of the aligned files.

In the case of recordings of literary and other text documents which are read aloud and recorded for commercial purposes, a single reader is often responsible for the reading of the entire text. The reader is often carefully chosen by the company producing the audio version of the literary work for proper pronunciation, inflection, general understandability and overall accuracy. In addition, audio recordings of books and other literary works are normally generated in a sound-controlled environment designed to keep background noise to a minimum. Thus, commercial audio versions of books or other literary works intended to be offered for sale, either alone or in combination with a text copy, are often of reasonably good quality with a minimum of background noise. Furthermore, they tend to accurately reflect the punctuation in the original work. In addition, in the case of commercial audio versions of literary works, a single individual may be responsible for the audio versions of several books or stories, since commercial production companies tend to use the same reader to produce the audio versions of multiple literary works, e.g., books.

In the case of transcripts produced from, e.g., radio broadcasts, television broadcasts, or court proceedings, multiple speakers with different pronunciation characteristics, e.g., accents, frequently contribute to the same transcript. Each speaker may contribute only a small portion of the total recording. The original audio may have a fair amount of background noise, e.g., music or other noise. In addition, in TV and radio broadcasts, speech from multiple speakers may overlap, making it difficult to distinguish the end of a sentence spoken by one speaker from the start of a sentence spoken by a new speaker. Furthermore, punctuation in the transcript may be less accurate than desired, given that the transcript may be based on unrehearsed conversational speech generated without regard to how it might later be transcribed using written punctuation marks.

In the case of attempting to synchronize text and audio versions of literary works, given the above discussed uses of such files, accurately synchronizing the starting points of paragraphs and sentences is often more important than being able to synchronize individual words within sentences.

In view of the above discussion, it is apparent that there is a need for new methods and apparatus which can be used to accurately synchronize audio and text files. It is desirable that at least some such methods and apparatus be well suited for synchronizing text and audio versions of literary works. It is also desirable that the methods and apparatus be capable of synchronizing the starting points of sentences and/or paragraphs in audio and text files with a high degree of accuracy.
SUMMARY OF THE PRESENT INVENTION

The present invention is directed to methods and apparatus for automatically generating synchronized audio and text data, e.g., files, from unsynchronized electronic audio and text versions of the same work, e.g., literary work, program or document.

The synchronization of long audio files, e.g., 30 minutes and longer, with corresponding text in an automated manner presents significant difficulties, since absolute certainty as to points in the audio and text versions which correlate to each other exists only at the beginning and end of the complete text and audio versions of the same work.

When synchronizing text and audio versions of the same work, it is highly desirable to synchronize at least one point per paragraph, preferably at the start of each paragraph. When positions within a paragraph are also to be synchronized, the start of sentences is a particularly useful location to synchronize, since people tend to prefer reading or listening to speech from the start, as opposed to the middle, of sentences.
The inventors of the present invention recognized that silence normally occurs at the ends of paragraphs and sentences but, for the most part, does not occur between words within a sentence during ordinary speech. They also recognized that in many audio versions of literary works and other text documents read aloud, the amount of background noise is intentionally kept to a minimum. This makes periods of silence in an audio version of a literary work relatively easy to detect. In addition, the locations where silence occurs are relatively easy to predict from punctuation and/or other content within the text version of the work.

Given that silence may occur within a sentence in an audio version of a literary work, e.g., because of a pause by the reader which is not reflected in the text by punctuation, the detection of periods of silence alone may be insufficient to reliably synchronize audio and text versions of a literary work. This is particularly the case in long audio sequences.
The inventors of the present application recognized that by performing speech recognition, spoken words in an audio work, in addition to periods of silence, could be detected automatically and used for purposes of synchronizing the text and audio versions of the work. Unfortunately, with known speech recognition techniques, recognition errors occur. In addition, even when recognition errors do not occur, differences may exist between the audio and text versions of the same work due, e.g., to reading errors on the part of the individual or individuals responsible for generating the audio version of the work.

The present invention uses a combination of silence detection and detection of actual words for purposes of synchronizing audio and text versions of the same work.
In accordance with the present invention, a speech recognition operation is performed on an audio corpus to recognize actual words and periods of silence. For speech recognition purposes, silence may be modeled as a word. A time indexed set of recognized words and periods of silence is produced by the speech recognition process of the present invention. The results of the speech recognition operation are globally aligned with the text corpus by matching as much of the recognized text as possible to the corresponding text of the work without changing the sequence of the recognized or actual text.
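To make the global alignment step concrete, the following Python sketch (not from the patent; the RecWord structure and word-level tokenization are assumptions) aligns the recognizer's time indexed output against the actual text with a standard longest-matching-blocks routine, matching as many words as possible without reordering either sequence.

```python
import difflib
from dataclasses import dataclass

@dataclass
class RecWord:
    """One time indexed token from the recognizer (hypothetical structure)."""
    word: str   # recognized word, or "<SIL>" for a recognized period of silence
    start: int  # start time stamp
    stop: int   # stop time stamp

def global_align(recognized, actual_words):
    """Globally align recognized words with the actual text, matching as much
    of the recognized text as possible without changing the sequence of
    either. Returns (recognized_index, actual_index) pairs of matched words."""
    rec = [r.word.upper() for r in recognized]
    ref = [w.upper() for w in actual_words]
    matcher = difflib.SequenceMatcher(None, rec, ref, autojunk=False)
    return [(m.a + k, m.b + k)
            for m in matcher.get_matching_blocks()
            for k in range(m.size)]
```

With the FIG. 7 example, where the actual text is assumed to be tokenized as CALL ME JOHN <SIL> I LONG FOR YOU <SIL>, the misrecognized "SONG" simply fails to match "LONG", while the correctly recognized words and silences on either side still align.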
When periods of detected silence correspond to expected locations within the actual text, e.g., ends of sentences and paragraphs, one or more words adjoining the period of silence in the recognized text are compared to one or more corresponding words adjoining the expected location of silence in the actual text. If those words were properly recognized, the recognized word or words adjoining the silence and the actual word or words adjoining the expected point of silence will match. When there is a match, the identified location of the silence in the audio file and the corresponding location in the text file are identified as corresponding audio and text locations where a pointer correlating the two files should be inserted.
In one particular embodiment, for a location corresponding to detected silence to be used for purposes of file synchronization, the recognized words bracketing, i.e., preceding and following, the detected silence must be properly recognized, e.g., must match the words in the actual text bracketing the location believed to correspond to the detected silence.
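Continuing the sketch above, one assumed reading of this bracketing rule in Python (illustrative only, building on the global_align output) is:

```python
def find_sync_points(recognized, pairs):
    """Select recognized silences usable as synchronization points: the
    silence must align with an expected silence in the actual text, and the
    recognized words immediately preceding and following it must also have
    been matched, i.e., correctly recognized (the bracketing rule).
    Returns (audio_time_stamp, actual_token_index) anchors."""
    matched = dict(pairs)  # recognized-token index -> actual-token index
    anchors = []
    for i, tok in enumerate(recognized):
        if (tok.word == "<SIL>" and i in matched
                and (i - 1) in matched and (i + 1) in matched):
            # Use the silence's start time as the audio-side time stamp and
            # the aligned position in the actual text as the text location.
            anchors.append((tok.start, matched[i]))
    return anchors
```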
When a location in a text file corresponding to detected silence is identified for purposes of file synchronization, a pointer to the recognized silence in the audio file is added at the location in the text file identified as corresponding to the recognized silence. This results in the ends of sentences and/or paragraphs in the text file being synchronized with the corresponding occurrences of silence in the audio file.

Each pointer added to the text file may be, e.g., a time index or time stamp into the corresponding audio file. A similar pointer, e.g., time index or stamp, may be added to the audio file if the corresponding audio file does not already include such values.
Pointers inserted into the audio and text files for synchronization purposes may take on a wide range of forms in addition to time stamp values. For example, pointers may include a filename or file identifier in conjunction with an index value used to access a particular point within the identified file. In such cases, the pointers added to audio files may include a file name or file identifier which identifies the corresponding text file. Pointers added to the text files in such embodiments may include a file name or file identifier which identifies the corresponding audio file.
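As a purely illustrative sketch of these pointer formats (the helper and its parameters are assumptions; only the bracketed-marker syntax mirrors FIGS. 9 and 11):

```python
def insert_pointers(actual_tokens, anchors, audio_file_id=None):
    """Insert pointer markers into the text at the identified anchors. With
    audio_file_id=None a bare time stamp such as "[110]" is used (FIG. 9);
    otherwise a file-qualified pointer such as "[AUDIO 1, 110]" (FIG. 11)."""
    out = list(actual_tokens)
    # Work right to left so earlier token indices remain valid.
    for time_stamp, idx in sorted(anchors, key=lambda a: a[1], reverse=True):
        marker = (f"[{time_stamp}]" if audio_file_id is None
                  else f"[{audio_file_id}, {time_stamp}]")
        out[idx] = marker  # the expected-silence token becomes the pointer
    return " ".join(out)
```

Applied to the running example, this produces text of the general form "CALL ME JOHN [110] I LONG FOR YOU ...", matching the content shown in FIG. 9.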
As part of the speech recognition process of the present invention, statistical language models, generated from the text corpus to be synchronized, may be used. Statistical language models, e.g., tri-gram language models, predict the statistical probability that a hypothesized word or words will occur in the context of one or more previously recognized words. Since the synchronization of audio and text files in accordance with the present invention relies heavily on the accurate identification of silence in the context of preceding and/or subsequent words, it was recognized that statistical language models, as opposed to simple language models, were more likely to produce speech recognition results that were useful in synchronizing audio and text files based on the detection of silence in the context of expected words. In accordance with the present invention, statistical language models are generated from the text corpus which is to be synchronized with a corresponding audio corpus.
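As a deliberately simplified illustration of generating such a model from the text corpus, the sketch below estimates unsmoothed maximum-likelihood tri-gram probabilities; a production language model would add smoothing and back-off, which this sketch omits.

```python
from collections import Counter, defaultdict

def train_trigram_lm(corpus_words):
    """Count tri-grams over the text corpus and return a function estimating
    P(w3 | w1, w2) by maximum likelihood. If "<SIL>" tokens are inserted at
    punctuation, silence is predicted in context like any other word."""
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(corpus_words, corpus_words[1:], corpus_words[2:]):
        counts[(w1, w2)][w3] += 1
    def prob(w1, w2, w3):
        ctx = counts.get((w1, w2))
        return ctx[w3] / sum(ctx.values()) if ctx else 0.0
    return prob
```

For example, a corpus containing "... ME JOHN <SIL> ..." yields a high prob("ME", "JOHN", "<SIL>"), steering the recognizer toward hypothesizing silence exactly where the text expects it.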
While the use of statistical language models for speech recognition purposes is one feature of the present invention, it is recognized that other types of language models may be employed instead without departing from the overall invention.

Numerous additional features and advantages of the present invention will be discussed in the detailed description which follows.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates unsynchronized electronic text and audio corpuses corresponding to the same exemplary literary work.

FIG. 2 illustrates a computer system implemented in accordance with one embodiment of the present invention.

FIG. 3 illustrates a set of application programs included in the computer system of FIG. 2.

FIG. 4 illustrates a set of program data included in the computer system of FIG. 2.

FIG. 5 illustrates the flow of data and information between various modules of the present invention.

FIG. 6 is a flow diagram illustrating the steps of the present invention involved in synchronizing text and audio files.

FIG. 7 illustrates an exemplary text corpus and the global alignment of the content of the text corpus with recognized speech.

FIGS. 8, 10 and 12 illustrate exemplary synchronized text and audio corpuses created in accordance with various embodiments of the present invention.

FIGS. 9, 11 and 13 illustrate exemplary content of the aligned audio and text corpuses shown in FIGS. 8, 10 and 12, respectively.
DETAILED DESCRIPTION

As discussed above, the present invention is directed to methods and apparatus for automatically synchronizing electronic audio and text data, e.g., files, corresponding to the same work, e.g., literary work, radio program, document or information.

FIG. 1 illustrates a set 9 of unsynchronized text and audio files corresponding to, e.g., the same exemplary literary work. A plurality of N text files 12, 14 form a text corpus 10 which represents the complete text of the exemplary literary work. Text files 12, 14 may be in any one of a plurality of electronic formats, e.g., an ASCII format, used to store text information. A plurality of M audio files 22, 24 form an audio corpus 20 which represents a complete audio version of the exemplary work. Audio files 22, 24 may be in the form of WAVE or other electronic audio file formats used to store speech, music and/or other audio signals. Note that the number N of text files which form the text corpus 10 may be different than the number M of audio files which form the audio corpus 20.

While the text corpus 10 and audio corpus 20 correspond to the same literary work, the audio and text files are unsynchronized; that is, there are no links or reference points in the files which can be used to correlate the informational content of the two files. Thus, it is not possible to easily access a point in the audio corpus 20 which corresponds to the same point in the literary work as a point in the text corpus 10. This makes it difficult to access the same location in the literary work when switching between text and audio modes of presenting the literary work.
FIG. 2 and the following discussion provide a brief, general description of an exemplary apparatus, e.g., computer system, in which at least some aspects of the present invention may be implemented. The computer system may be implemented as a portable device, e.g., a notebook computer or a device for presenting books or other literary works stored in electronic form.

The present invention will be described in the general context of computer-executable instructions, such as program modules, being executed by a personal computer. However, the methods of the present invention may be effected by other apparatus. Program modules may include applications, routines, programs, objects, components, data structures, etc. that perform a task(s) or implement particular abstract data types. Moreover, those skilled in the art will appreciate that at least some aspects of the present invention may be practiced with other configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network computers, minicomputers, set top boxes, mainframe computers, and the like. At least some aspects of the present invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote memory storage devices.
With reference to FIG. 2, an exemplary apparatus 100 for implementing at least some aspects of the present invention includes a general purpose computing device in the form of a conventional personal computer 120. The personal computer 120 may include a processing unit 121, a system memory 122, and a system bus 123 that couples various system components including the system memory 122 to the processing unit 121. The system bus 123 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The system memory may include read only memory (ROM) 124 and/or random access memory (RAM) 125. A basic input/output system 126 (BIOS), containing basic routines that help to transfer information between elements within the personal computer 120, such as during start-up, may be stored in ROM 124. The personal computer 120 may also include a hard disk drive 127 for reading from and writing to a hard disk (not shown), a magnetic disk drive 128 for reading from or writing to a (e.g., removable) magnetic disk 129, and a (magneto-) optical disk drive 130 for reading from or writing to a removable (magneto-) optical disk 131 such as a compact disk or other (magneto-) optical media. The hard disk drive 127, magnetic disk drive 128, and (magneto-) optical disk drive 130 may be coupled with the system bus 123 by a hard disk drive interface 132, a magnetic disk drive interface 133, and a (magneto-) optical drive interface 134, respectively. The drives and their associated storage media provide nonvolatile storage of machine readable instructions, data structures, program modules and other data for the personal computer 120. Although the exemplary environment described herein employs a hard disk, a removable magnetic disk 129 and a removable (magneto-) optical disk 131, those skilled in the art will appreciate that other types of storage media, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories (RAMs), read only memories (ROMs), and the like, may be used instead of, or in addition to, the storage devices introduced above.

A number of program modules may be stored on the hard disk 127, magnetic disk 129, (magneto-) optical disk 131, ROM 124 or RAM 125. In FIG. 2 an operating system 135, one or more application programs 136, other program modules 137, and/or program data 138 are shown as being stored in RAM 125. Operating system 135', application program(s) 136', other program modules 137' and program data 138' are shown as being stored on hard disk drive 127. As will be discussed below in regard to FIG. 3, in the exemplary embodiment the application programs include an audio/text synchronization program implemented in accordance with the present invention.
... used to generate the synchronized audio and/or text files from unsynchronized audio and text files. The modules include a control module 310, a speech recognizer module 312, a statistical language model generation module 314, an optional speech recognition training module 316 and a text/audio alignment module 318. The control module 310 is responsible for controlling the interaction of the various other modules which comprise the audio/text synchronization program 136 and is responsible for controlling the accessing and storage of audio and text files. The speech recognizer module 312 is used for performing speech recognition, as a function of a language model and an acoustic model ...
