Fischer et al.

(10) Patent No.: US 7,412,643 B1
(45) Date of Patent: Aug. 12, 2008

US007412643B1

(54) METHOD AND APPARATUS FOR LINKING REPRESENTATION AND REALIZATION DATA

(75) Inventors: Uwe Fischer, Schoenaich (DE); Stefan Hoffmann, Weil im Schoenbuch (DE); Werner Kriechbaum, Ammerbuch-Breitenholz (DE); Gerhard Stenzel, Herrenberg (DE)

(73) Assignee: International Business Machines Corporation, Armonk, NY (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 09/447,871

(22) Filed: Nov. 23, 1999

(51) Int. Cl.
    G06F 17/00 (2006.01)
    G06F 3/00 (2006.01)
    G10L 15/00 (2006.01)

(52) U.S. Cl. ....................... 715/200; 715/201; 715/234; 715/730; 704/246; 704/251

(58) Field of Classification Search .............. 715/500.1, 715/501.1, 513, 730, 804; 345/723, 730; 704/246, 251, 260
    See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS
`5,649,060 A * 7/1997 Ellozy et al. ................ 704/278
`5,737,725 A * 4, 1998 Case .......................... TO4,260
`5,857,099 A *
`1/1999 Mitchell et al. ............. TO4/235
`5,929,849 A * 7/1999 Kikinis ....................... 725,113
`5,963.215 A * 10/1999 Rosenzweig ................ 345,649
`6,076,059 A * 6/2000 Glickman et al. ........... TO4,260
`6,098,082 A * 8/2000 Gibbon et al. ........... T15,501.1
`
`6,172,675 B1* 1/2001 Ahmad et al. ............ T15,500.1
`6,243,676 B1* 6/2001 Witteman ................... TO4,243
`6,249,765 B1* 6/2001 Adler et al. ...........
`... 704,500
`6,260,011 B1* 7/2001 Heckerman et al. ......... 704/235
`6,263,507 B1 * 7/2001 Ahmad et al. .........
`... 725,134
`6,271,892 B1* 8/2001 Gibbon et al. ....
`... 348,700
`6,282,511 B1* 8/2001 Mayer ..............
`... TO4,270
`6,336,093 B2
`1/2002 Fasciano ..................... 704/278
`(Continued)
FOREIGN PATENT DOCUMENTS

JP    2000-138664    5/2000

(Continued)

OTHER PUBLICATIONS

Gibbon et al., "Generating Hypermedia Documents from Transcriptions of Television Programs Using Parallel Text Alignment", AT&T Labs Research, Feb. 1998, pp. 26-33.*

(Continued)

Primary Examiner—William Bashore
Assistant Examiner—Maikhanh Nguyen
(74) Attorney, Agent, or Firm—Scully, Scott, Murphy & Presser, P.C.; Lisa M. Yamonaco
`
`(57)
`
`ABSTRACT
`
`A method and apparatus for creating links between a repre
`sentation, (e.g. text data.) and a realization, (e.g. correspond
`ing audio data.) is provided. According to the invention the
`realization is structured by combining a time-stamped ver
`sion of the representation generated from the realization with
`structural information from the representation. Thereby so
`called hyper links between representation and realization are
`created. These hyper links are used for performing search
`operations in realization data equivalent to those which are
`possible in representation data, enabling an improved access
`to the realization (e.g. via audio databases).
`
`14 Claims, 7 Drawing Sheets
`
`ANAYZE THE STRUCTURE OF
`REPRESENTATION 10 AND
`SEPARATESTRUCTURE 105
`AND CONFENT 04
`
`301
`
`ANALYEE THE REALIZATION
`AND CREATE A TIME
`STAMPED REPRESENTATION 107
`
`302
`
`
`
`
`
`
`
`CREATE AN ALIGNED
`REPRESENTATION OS BY
`ALIGNING CONTENT 104 AND
`TIME-SIAMPED REPRESENTATION 107
`
`CREATE HYPERLINKS 111 BY
`COMBINING ALIGNED REPRESENTATION
`109 AND STRUCTURE 105
`
`303
`
`304
`
`-1-
`
`Amazon v. Audio Pod
`US Patent 9,319,720
`Amazon EX-1048
`
`
`
`
U.S. PATENT DOCUMENTS

6,357,042 B2 * 3/2002 Srinivasan et al. ............ 725/32
6,404,978 B1   6/2002 Abe .......................... 386/55
6,434,520 B1 * 8/2002 Kanevsky et al. ............. 704/243
6,462,754 B1   10/2002 Chakraborty et al. ......... 345/723
6,473,778 B1   10/2002 Gibbon .................... 715/501.1
6,603,921 B1 * 8/2003 Kanevsky et al. .............. 386/96
6,636,238 B1 * 10/2003 Amir et al. ................. 345/730
6,728,753 B1 * 4/2004 Parasnis et al. ............. 709/203
6,791,571 B1 * 9/2004 Lamb ........................ 345/619
2001/0023436 A1   9/2001 Srinivasan et al. ......... 709/219
2002/0059604 A1 * 5/2002 Papagan et al. .............. 725/51

FOREIGN PATENT DOCUMENTS

JP    2001-111543    4/2001
JP    2001-313633    11/2001

OTHER PUBLICATIONS

Amir et al., "CueVideo: Automated video/audio indexing and browsing", IBM Almaden Research Center, ACM, 1996, p. 326.*
Favela et al., "Image-retrieval agent: integrating image content and text", CICESE Research Center, IEEE, Oct. 1999, pp. 36-39.
S. Srinivasan et al., "What is in that video anyway?", In Search of Better Browsing, 6th IEEE Int. Conference on Multimedia Computing and Systems, Jun. 7-11, 1999, Florence, Italy, pp. 1-6.*
D. Ponceleon et al., "Key to Effective Video Retrieval: Effective Cataloging and Browsing," Proceedings of the 6th ACM International Conference on Multimedia, 1998, pp. 99-107.*
`* cited by examiner
`
`-2-
`
`
`
U.S. Patent    Aug. 12, 2008    Sheet 1 of 7    US 7,412,643 B1

[FIG. 1: schematic block diagram of the aligner; legible labels in the drawing include PLAIN REPRESENTATION and TIME-STAMPED REPRESENTATION]
`
`
`
U.S. Patent    Aug. 12, 2008    Sheet 2 of 7    US 7,412,643 B1

FIG. 2 (textual representation 101 of a book in SGML):

<!DOCTYPE book SYSTEM "book.dtd" [
<!ENTITY % ISOlat1 SYSTEM "isolat1.ent">
%ISOlat1;
]>
<book>
<chapter>
<heading>Key Note Speech</heading>
<section>
<p><s>
It's a great honor for me to share this stage with the Lord Mayor and Chief
Executive of Hanover; Mr. Jung; and in a few minutes, Chancellor Kohl.</s><s> I've
been looking forward to this evening for a long time, because I've known for many
years how important CeBIT is to the global information technology industry.</s></p>

[FIG. 4: plain representation 104, i.e. the word list 400 ("It's", "a", "great", "honor", ..., "Chancellor", "Kohl.", "I've", "been", "looking", ..., "technology", "industry.")]

[FIG. 6: structural information 105, i.e. the same words paired with locators 501, 502, ...]
`
`
`
U.S. Patent    Aug. 12, 2008    Sheet 3 of 7    US 7,412,643 B1

FIG. 3

301: ANALYZE THE STRUCTURE OF REPRESENTATION 101 AND SEPARATE STRUCTURE 105 AND CONTENT 104

302: ANALYZE THE REALIZATION AND CREATE A TIME-STAMPED REPRESENTATION 107

303: CREATE AN ALIGNED REPRESENTATION 109 BY ALIGNING CONTENT 104 AND TIME-STAMPED REPRESENTATION 107

304: CREATE HYPER LINKS 111 BY COMBINING ALIGNED REPRESENTATION 109 AND STRUCTURE 105
`
`
`
U.S. Patent    Aug. 12, 2008    Sheet 4 of 7    US 7,412,643 B1

FIG. 5 (tree structure of the representation with locators 501, 502, ...):

<book> 1
  <chapter> 11
    <heading> 111
      Key 1111
      note 1112
      speech 1113
    </heading>
    <section> 112
      <p> 1121
        <s> 11211
          It's 112111
          a 112112
          great 112113
          ...
        </s>
        <s> 11212
          I've 112121
          been 112122
          looking 112123
          ...
        </s>
      </p>
`
`
`
U.S. Patent    Aug. 12, 2008    Sheet 5 of 7    US 7,412,643 B1

[FIG. 7: time-stamped representation 107: words 701 with start times 702 and end times 703 in milliseconds, including misrecognized words 704 such as "booking" instead of "looking"]

[FIG. 8: time-stamped aligned representation 109: the correct words 400 combined with the time locators 702, 703]

FIG. 9 (hyper-link document 900 in HyTime format):

<!DOCTYPE linkweb SYSTEM "linkweb.dtd" [
<!ENTITY sgml_link SYSTEM "lou.sgm" CDATA SGML>
]>
<linkweb>
<audio linkends="sgml54 audio54">
<urlloc id="audio54">file=d:\lou\lou-beta.mpg start=588 end=24703 unit=ms</urlloc>
<treeloc id="sgml54" locsrc="sgml_link">1 1 2 1 1</treeloc>
<audio linkends="sgml55 audio55">
<urlloc id="audio55">file=d:\lou\lou-beta.mpg start=24703 end=38839 unit=ms</urlloc>
<treeloc id="sgml55" locsrc="sgml_link">1 1 2 1 2</treeloc>
`
`
`
U.S. Patent    Aug. 12, 2008    Sheet 6 of 7    US 7,412,643 B1

[Drawing sheet 6: figure text not legible in the scan]
`
`
`
U.S. Patent    Aug. 12, 2008    Sheet 7 of 7    US 7,412,643 B1

[Drawing sheet 7: mapping tables relating text locators, time locators (start/end in ms) and segments of the audio file lou-beta.mpg; most of the text is rotated and not legible in the scan]
`
`
`
`
`
METHOD AND APPARATUS FOR LINKING REPRESENTATION AND REALIZATION DATA
`
`FIELD OF THE INVENTION
`
The present invention is directed to the field of multimedia data handling. It is more particularly directed to linking multimedia representation and realization data.
`
`10
`
`BACKGROUND OF THE INVENTION
`
`15
`
In recent years a new way of presenting information has been established. In this new multimedia approach, information is presented by combining several media, e.g. written text, audio and video. However, when using e.g. the audio data, finding and addressing specific structures (pages, chapters, etc. corresponding to the equivalent textual representation of the audio data) is either time consuming, complex, or impossible. A solution to overcome these problems is to link text and audio. The concept of linking text and audio is already used by some information providers. However, it is not widely used. One of the reasons for this is that it is a resource-consuming process to build the hyper-links between the audio data and the corresponding textual representation. This means either a huge investment on the producer's side, or a limited number of links, which limits the value for the user. As a result of this limiting state of the art, user queries directed to databases containing multimedia material have to be quite general in most cases. For example, a user asks "In which document do the words 'Italian' and 'inflation' occur?" In response to this query, the complete audio document in which the requested data is enclosed is returned.
`
`25
`
`2
The hyper links are created by an apparatus according to the present invention. In one embodiment they are stored in a hyper document. These hyper links are used for performing search operations in audio data equivalent to those which are possible in representation data. This enables improved access to the realization (e.g. via audio databases). Furthermore, it is not only possible to search for elements of the input data (e.g. words) within the resulting hyper links or hyper document; it is also possible to navigate within the resulting data (e.g. the hyper document) and define the scope of the playback. In this context the word navigation means things like "go to next paragraph", "show the complete section that includes this paragraph", etc. In an embodiment, the scope of the playback is defined by clicking a display of a sentence, a paragraph, a chapter, etc. in a hyper-link document.

Thereby the segments of the realization (e.g. the audio stream) become immediately accessible. In accordance with the present invention, these capabilities are not created through a manual process. All or part of this information is extracted and put together automatically.

The time-alignment process of the present invention connects the realization domain with the representation domain and therefore allows certain operations which are generally difficult to perform in the realization domain to be shifted into the representation domain, where the corresponding operation is relatively easy to perform. For example, in recorded speech, standard text-mining technologies can be used to locate sequences of interest. The structure information can be used to segment the audio signal into meaningful units like sentences, paragraphs or chapters.

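As an illustration of this shift of work from the realization domain into the representation domain, the following minimal Python sketch searches the plain text for a phrase and uses word-level links (simple tuples of word index, start time and end time in milliseconds, a simplified stand-in for the hyper links 111 described below) to return the corresponding interval in the audio stream. The sample words, times and helper name are illustrative assumptions, not part of the patent.

# Minimal sketch: a text search in the representation domain is answered with a
# time interval in the realization domain via word-level links (index, start, end).
words = ["It's", "a", "great", "honor", "for", "me"]
links = [(0, 588, 800), (1, 800, 910), (2, 910, 1400),
         (3, 1400, 1900), (4, 1900, 2050), (5, 2050, 2200)]   # illustrative times in ms

def find_interval(query):
    """Locate a word sequence in the text and map it to an audio interval."""
    q = query.lower().split()
    for i in range(len(words) - len(q) + 1):
        if [w.lower() for w in words[i:i + len(q)]] == q:
            return links[i][1], links[i + len(q) - 1][2]   # start of first, end of last word
    return None

print(find_interval("great honor"))   # -> (910, 1900): play back just this span
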
An aspect of the present invention enables the automatic creation of link and navigation information between text and related audio or video. This gives producers of multimedia applications a huge process improvement. On the one hand, an advantage is that the software creates hyper links to the audio on a word-by-word or sentence-by-sentence basis, depending upon which is the more appropriate granularity for the application. Other embodiments use another basis that is appropriate for the problem to be solved. Therefore a major disadvantage of previous techniques, namely the limited number of links, is eliminated. On the other hand, the technique of the present invention dramatically reduces the amount of manual work necessary to synchronize a text transcript with its spoken audio representation, while the result creates a higher value for the user. It also eliminates another disadvantage of the previous techniques, namely the high cost of building such linked multimedia documents.

Another aspect of the present invention is to generate a high level of detail, such that applications can be enhanced with new functions, or even new applications may be developed. Single or multiple words within a text can be aligned with the audio. Thus single or multiple words within a speech can be played, as can one word in a sentence in a language-learning application, or any sentence in a lesson, document, speech, etc.
`
`35
`
`40
`
`45
`
`50
`
`55
`
`SUMMARY OF THE INVENTION
`
`Accordingly, it is an aspect of the present invention to
`provide an enhanced method and apparatus to link text and
`audio data. It recognizes that most acoustic multimedia data
`have a common property which distinguishes them from
`visual data. These data can be expressed in two equivalent
`forms: as a textual or symbolic representation, e.g. score,
`Script or book, and as realizations, e.g. an audio stream. As
`used in an example of the present invention an audio stream is
`eitheran audio recording or the audio track of a video record
`ing or similar data.
`Information typically is presented as textual representa
`tion. The representation contains both the description of the
`content of the realization and the description of the structure
`of the realization. This information is used in the present
`invention to provide a method and apparatus for linking the
`representation and the realization.
`Starting from a textual or symbolic representation, (e.g. a
`structured electronic text document.) and one or multiple
`realizations (e.g. digital audio files like audio recording
`which represent the corresponding recorded spoken words.)
`so called hyper links between the representation, (e.g. the
`text.) and the related realization, (e.g. the audio part.) are
`created. An embodiment is provided such that the realization
`is structured by combining a time-stamped (or otherwise
`marked) version of the representation generated from the
`realization with structural information from the representa
`tion. Errors within the time stamped representation are elimi
`nated by aligning the time-stamped version of the represen
`tation generated from the realization with the content of the
`original representation in beforehand.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`60
`
`65
`
These and other aspects, features, and advantages of the present invention will become apparent upon further consideration of the following detailed description of the invention when read in conjunction with the drawing figures, in which:

FIG. 1 shows an example of a schematic block diagram of the aligner in accordance with the present invention;
FIG. 2 shows an example of a textual representation of a book in SGML;
`
`-10-
`
`
`
`US 7,412,643 B1
`
`3
FIG. 3 shows an example of a flow chart diagram describing a method of combining representation and realization in accordance with the present invention;
FIG. 4 shows an example of a plain representation as created by a structural analyzer;
FIG. 5 shows an example of a tree structure of a representation with locators;
FIG. 6 shows an example of structural information as created by the structural analyzer;
FIG. 7 shows an example of a time-stamped representation as created by the temporal analyzer;
FIG. 8 shows an example of a time-stamped aligned representation as created by the time aligner;
FIG. 9 shows an example of a hyper-link document with hyper links as created by the link generator;
FIG. 10 shows an example of an aligner for other file formats in accordance with the present invention;
FIG. 11 shows an example of an aligner with enhancer in accordance with the present invention;
FIG. 12 shows an example of a first mapping table as used in an audio database in accordance with the present invention;
FIG. 13 shows an example of a second mapping table as used in an audio database in accordance with the present invention;
FIG. 14 shows an example of a third mapping table as used in an audio database in accordance with the present invention; and
FIG. 15 shows an example of a fourth mapping table as used in an audio database in accordance with the present invention.
`
`DETAILED DESCRIPTION OF THE INVENTION
`
FIG. 1 shows an example embodiment of an aligner 100 according to the present invention. The aligner 100 comprises a structural analyzer 103 with input means. The structural analyzer 103 is connected via two output means to a time aligner 108 and a link generator 110. The aligner 100 further comprises a temporal analyzer 106 with input means. The temporal analyzer 106 is connected via output means to the time aligner 108. The time aligner 108, with two input means for receiving data from the structural analyzer 103 as well as from the temporal analyzer 106, is connected via output means to the link generator 110. The link generator 110, with two input means for receiving data from the structural analyzer 103 as well as from the time aligner 108, has an output means for sending data.

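To make the data flow of FIG. 1 concrete, the following Python sketch wires the four components together as plain functions. The function bodies are placeholder stubs (the individual stages are described in detail below), and all names are illustrative assumptions rather than part of the patent.

# Sketch of the aligner 100 data flow: representation 101 and realization 102 in,
# hyper links 111 out. Each stage is a stub standing in for a component of FIG. 1.
def structural_analyzer(representation):         # component 103
    plain, structure = [], []                    # plain representation 104, structure 105
    # ... parse the markup, emit words and (word, locator) pairs ...
    return plain, structure

def temporal_analyzer(realization):              # component 106
    # ... speech recognition with word-level time stamps ...
    return []                                    # time-stamped representation 107

def time_aligner(plain, timed):                  # component 108
    # ... align recognized words to the original words ...
    return []                                    # time-stamped aligned representation 109

def link_generator(aligned, structure):          # component 110
    # ... combine structural locators with time intervals ...
    return []                                    # hyper links 111

def aligner_100(representation, realization):
    plain, structure = structural_analyzer(representation)
    timed = temporal_analyzer(realization)
    aligned = time_aligner(plain, timed)
    return link_generator(aligned, structure)

The sketches accompanying steps 301 through 304 below fill in these stubs.
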
As shown in FIG. 1, the structuring process starts from a representation 101 and a realization 102. Usually the representation 101 and the realization 102 are each stored in a separate file, but each of the data sets may actually be distributed among several files or merged into one complex hyper-media file. In an alternate embodiment, both the representation 101 and the realization 102 may be fed into the system as a data stream.

The representation 101 is a descriptive mark-up document, e.g. the textual representation of a book, or the score of a symphony. An example of a textual representation of a book marked up in Standard Generalized Markup Language (SGML) as defined in ISO 8879 is shown in FIG. 2. The SGML document comprises parts defining the structural elements of the book (characterized by the tag signs < ... >) and the plain content of the book. Instead of SGML, other markup languages, e.g. Extensible Markup Language (XML) or LaTeX, may be similarly used.

An example of a realization 102 is an audio stream in an arbitrary standard format, e.g. WAVE or MPEG. It may for example be a RIFF-WAVE file with the following characteristics: 22050/11025 Hz, 16 bit, mono. In the example the realization 102 can be a narrated book in the form of a digital audio book.

An example of a procedure for combining representation 101 and realization 102 according to the present invention is illustrated in FIG. 3. In a first processing step 301, the representation 101 is fed into the structural analyzer 103. The structural analyzer 103 analyzes the representation 101 and separates the original plain representation 104 and the structural information 105. The plain representation 104 includes the plain content of the representation 101, that is, the representation 101 stripped of all the mark-up. As an example, the plain representation 104 (comprising the original words 400) of the representation 101 is shown in FIG. 4.

An example of structural information 105 appropriate for audio books is text with locators. Therefore, in the above embodiment the structural analyzer 103 builds a tree structure of the SGML-tagged text 101 of the audio book and creates locators which determine the coordinates of the elements (e.g. words) within the structure of the representation 101. Those skilled in the art will not fail to appreciate that the imposed structure is not restricted to a hierarchical tree like a table of contents; other structures, e.g. a lattice or an index, may be used as well.

The process of document analysis and creation of structural information 105 as carried out in step 301 is now described. In FIG. 5 a tree structure with corresponding locators 501, 502, ..., as built during this process, is illustrated for the SGML-formatted example depicted in FIG. 2.

After the representation 101 is obtained, the SGML file is fed into the structural analyzer 103. The structural analyzer 103 searches for start elements (with the SGML tag structure < ... >) and stop elements (with the SGML tag structure </ ... >) of the representation 101. If the event is a start element, a new locator is created. In the present embodiment, for the event <book> the locator "1" is created, for the event <chapter> the locator "11", etc. If the event is a data element, like <heading> or <s> (sentence), the content (words) together with the current locators is used to build the structural information 105, and the plain text is used to build the plain representation 104. In case the event is an end element, the structural analyzer 103 leaves the current locator and the procedure continues to examine the further events. If no further event exists, the procedure ends.

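The event-driven procedure of step 301 can be pictured with the following Python sketch. It uses a simple regular-expression tokenizer instead of a full SGML parser and child-counting locators in the style of FIG. 5; everything beyond the idea of "a start element opens a locator, a data element emits (word, locator) pairs, an end element closes the locator" is an illustrative assumption.

import re

def analyze_structure(markup):
    """Sketch of the structural analyzer 103: split a marked-up representation 101
    into the plain representation 104 (words) and the structural information 105
    (word, locator) pairs, using child-counting locators as in FIG. 5."""
    plain, structure = [], []
    locator, children = "", [0]              # current locator, per-level child counters
    for token in re.findall(r"<[^>]+>|[^<\s]+", markup):
        if token.startswith("</"):           # end element: leave the current locator
            locator, children = locator[:-1], children[:-1]
        elif token.startswith("<"):          # start element: create a new locator
            children[-1] += 1
            locator += str(children[-1])     # single-digit child counts assumed for brevity
            children.append(0)
        else:                                # data element: word tagged with its locator
            children[-1] += 1
            plain.append(token)
            structure.append((token, locator + str(children[-1])))
    return plain, structure

sample = "<book><chapter><heading>Key note speech</heading></chapter></book>"
print(analyze_structure(sample))
# (['Key', 'note', 'speech'], [('Key', '1111'), ('note', '1112'), ('speech', '1113')])
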
An example embodiment of the structural information 105 output by the structural analyzer 103 is shown in FIG. 6. The structural information 105 contains the elements of the representation 101 (corresponding to the plain representation 104), e.g. the words, in the first column, and the corresponding locators 501, 502, ... in the second column.

In step 302 of FIG. 3, which may be carried out before, after or at the same time as step 301, the realization 102, e.g. the audio stream, is fed into the temporal analyzer 106. The temporal analyzer 106 generates a time-stamped (or otherwise marked) representation 107 from the realization 102. It is advantageous to generate a time-stamped representation 107 of the complete realization 102. However, some embodiments create marked or time-stamped representations 107 of only parts of the realization 102.

The time-stamped representation 107 includes the transcript and time-stamps of all elementary representational units, like e.g. words or word clusters. In the above example a speech recognition engine is used as the temporal analyzer 106 to generate a raw time-tagged transcript 107 of the audio file 102. Many commercially available speech recognition engines might be used, for example IBM's ViaVoice. However, in addition to the recognition of words, the temporal/marker analyzer 106 should be able to allocate time stamps and/or marks for each word. An example of such a time-stamped representation 107 is the transcript shown in FIG. 7. The start times 702 and the end times 703 in milliseconds are assigned to each word 701 of the resulting representation. The start and end time locators 702, 703 are time locators that specify an interval in the audio stream data using the coordinate system appropriate for the audio format, e.g. milliseconds for WAVE files. The time-stamped representation 107 as shown in FIG. 7 may include words 704 which have not been recognized correctly, e.g. "Hohl" instead of "Kohl" or "booking" instead of "looking".

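In code, the time-stamped representation 107 of FIG. 7 is simply a sequence of (word, start, end) records. The sketch below collects it from a hypothetical recognizer interface; the recognize iterator, its fields and the stub data are assumptions, since the actual engine API (e.g. ViaVoice) is not specified here.

def temporal_analyze(recognize, audio_path):
    """Sketch of the temporal analyzer 106: collect word-level time stamps from a
    speech recognition engine into the time-stamped representation 107."""
    timed_words = []
    for word, start_ms, end_ms in recognize(audio_path):   # hypothetical engine call
        timed_words.append((word, start_ms, end_ms))
    return timed_words

# With a stub recognizer the result mirrors FIG. 7, including misrecognitions:
stub = lambda path: [("I've", 24703, 24950), ("been", 24950, 25200),
                     ("booking", 25200, 25700)]
print(temporal_analyze(stub, "lou-beta.mpg"))
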
In step 303 of FIG. 3, the plain representation 104 derived from step 301 and the time-stamped representation 107 derived from step 302 are fed to the time aligner 108. The time aligner 108 aligns the plain representation 104 and the time-stamped representation 107. Thereby, for the aligned elements, the time locator (start time 702, end time 703) from the time-stamped representation 107 is attached to the content elements (e.g. words) from the plain representation 104, leading to the time-stamped aligned representation 109. The time aligner 108 creates an optimal alignment of the words 701 from the time-stamped representation 107 and the words contained in the plain representation 104. This can be done by a variety of dynamic programming techniques. Such an alignment automatically corrects isolated errors 704 made by the temporal analyzer 106 by aligning the misrecognized words 704 with their correct counterparts, e.g. "Hohl" with "Kohl", "booking" with "looking". Missing parts of the representation 101 and/or missing realization 102 result in segments of the plain representation 104 and/or the time-stamped representation 107 remaining unaligned. An example of an aligned representation 109 combining the correct words 400 and the time locators 702, 703 is shown in FIG. 8.

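One way to realize such a time aligner is a classic edit-distance dynamic program over the two word sequences. The sketch below aligns the recognized, time-stamped words (FIG. 7) to the correct words of the plain representation 104 and copies the time locators onto the correct words, so that isolated misrecognitions such as "booking"/"looking" still receive their time interval. The scoring and the data layout are illustrative assumptions, not the patent's prescribed algorithm.

def time_align(plain_words, timed_words):
    """Sketch of the time aligner 108: attach the time locators (start 702, end 703)
    of the time-stamped representation 107 to the words of the plain representation 104
    using an edit-distance dynamic program with a backtrace."""
    n, m = len(plain_words), len(timed_words)
    cost = [[i + j if i == 0 or j == 0 else 0 for j in range(m + 1)] for i in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if plain_words[i - 1].lower() == timed_words[j - 1][0].lower() else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,   # match / substitution
                             cost[i - 1][j] + 1,         # word missing in the recognition
                             cost[i][j - 1] + 1)         # spurious recognized word
    aligned, i, j = [], n, m
    while i > 0 and j > 0:                               # backtrace the optimal path
        sub = 0 if plain_words[i - 1].lower() == timed_words[j - 1][0].lower() else 1
        if cost[i][j] == cost[i - 1][j - 1] + sub:
            word, (_, start, end) = plain_words[i - 1], timed_words[j - 1]
            aligned.append((word, start, end))           # correct word keeps the time locator
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + 1:
            aligned.append((plain_words[i - 1], None, None))   # unaligned segment
            i -= 1
        else:
            j -= 1
    while i > 0:                                         # leftover unaligned words, if any
        aligned.append((plain_words[i - 1], None, None))
        i -= 1
    aligned.reverse()
    return aligned

plain = ["I've", "been", "looking", "forward"]                     # words 400
timed = [("I've", 24703, 24950), ("been", 24950, 25200),           # words 701 with start 702
         ("booking", 25200, 25700), ("forward", 25700, 26100)]     # and end 703 in milliseconds
print(time_align(plain, timed))   # "looking" receives the interval of "booking"
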
In step 304 of FIG. 3, the structural information 105 and the time-stamped aligned representation 109, e.g. in the form of data streams, are fed into a link generator 110. The link generator 110 then combines the locators 501, 502, ... of each element from the structural information 105 with the respective time locators 702, 703 from the time-stamped aligned representation 109, thereby creating connections between equivalent elements of representation 101 and realization 102, so-called time-alignment hyper links 111. In an embodiment these hyper links 111 are stored in a hyper-link document. In an alternative embodiment these hyper links are transferred to a database.

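A minimal sketch of the combining step, assuming the word-level outputs of the two previous sketches: it zips the (word, locator) pairs of the structural information 105 with the (word, start, end) triples of the aligned representation 109 into hyper-link records, which could then be serialized into a hyper-link document or written to a database. The record layout is an illustrative assumption, not the HyTime syntax of FIG. 9.

def generate_links(structure, aligned):
    """Sketch of the link generator 110: combine structural locators 501, 502, ...
    with time locators 702, 703 into hyper links 111."""
    links = []
    for (word, locator), (_, start, end) in zip(structure, aligned):
        if start is not None:                      # skip unaligned segments
            links.append({"locator": locator, "word": word,
                          "start_ms": start, "end_ms": end})
    return links

structure = [("I've", "112121"), ("been", "112122"), ("looking", "112123")]
aligned = [("I've", 24703, 24950), ("been", 24950, 25200), ("looking", 25200, 25700)]
for link in generate_links(structure, aligned):
    print(link)   # e.g. {'locator': '112121', 'word': "I've", 'start_ms': 24703, 'end_ms': 24950}
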
It is advantageous for the hyper-link document to be a HyTime document conforming to the ISO/IEC 10744:1992 standard, or a type of document using another convention to express hyper links, e.g. DAISY, XLink, SMIL, etc.

Whereas in the above example the locators of each word are combined, it is also possible to combine the locators of sentences or paragraphs or pages with the corresponding time locators. An example of a hyper-link document 900 in HyTime format is shown in FIG. 9. Therein, for each sentence, the locators 501, 502, ... for the representation 101 and the time locators 702, 703, ... for the realization 102 are combined in hyper links. An alternate embodiment creates hyper links 111 wherein the locators for each word or for each other element (paragraph, page, etc.) are combined.

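The sentence-level combination of FIG. 9 can be approximated from word-level hyper links by grouping on the sentence part of the locator (the first five digits in the FIG. 5 numbering) and taking the earliest start and latest end time of each group. The grouping rule and the output layout below are illustrative assumptions, not the HyTime syntax.

def sentence_links(word_links, sentence_digits=5):
    """Group word-level hyper links 111 into one time interval per sentence locator."""
    sentences = {}
    for link in word_links:
        key = link["locator"][:sentence_digits]            # e.g. "11211" for the first <s>
        start, end = sentences.get(key, (link["start_ms"], link["end_ms"]))
        sentences[key] = (min(start, link["start_ms"]), max(end, link["end_ms"]))
    return sentences

word_links = [{"locator": "112111", "start_ms": 588, "end_ms": 910},
              {"locator": "112112", "start_ms": 910, "end_ms": 1020},
              {"locator": "112121", "start_ms": 24703, "end_ms": 24950}]
print(sentence_links(word_links))   # {'11211': (588, 1020), '11212': (24703, 24950)}
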
It will be understood and appreciated by those skilled in the art that the inventive concepts described by the present application may be embodied in a variety of system contexts. Some of the typical application domains are described in the following.

Sometimes either the representation or the realization (or both) is not available in a native or operating data format directly processable by the aligner 100. In this case the available data has to be converted from its native format into the data format which can be used by the aligner 100 directly.

Thus, in some cases, the native alien format of the original representation is not the same format as the native alien format of the realization. The representation is given in a native data format (A). The realization is given in a native data format (B). These data formats are different. In an embodiment, the representation (A) is converted into an operating data format (A') and the realization (B) is converted into an operating data format (B').

FIG. 10 illustrates an example of an aligner 1000 for other file formats in accordance with the present invention. Using the aligner 1000 it becomes possible to create hyper links or hyper-link documents defined in the native format of the representation and/or realization. For the representation, for example, a large variety of such native representational formats exists, ranging from proprietary text formats like e.g. Microsoft Word or Lotus WordPro to text structuring languages like e.g. Troff or TeX.

This aligner 1000 includes the aligner 100 shown in FIG. 1. Additionally, a first converter 1001, and/or a second converter 1002, and a link transformer 1003 are elements of the aligner 1000. These elements are connected to each other as shown in FIG. 10.

In an embodiment the following procedure is applied. First the native representation 1004 is converted by the first converter 1001 into a representation 101 in an operating or standard format, e.g. SGML. Additionally, the first converter 1001 produces the information necessary to re-convert the resulting hyper links 111 into the native format. Such information can be e.g. a representation mapping table 1006 (a markup mapping table).

The native realization 1005 is converted by a second converter 1002 into a realization 102 in the operating or standard format, e.g. WAVE. In addition, a realization mapping table 1007 (a time mapping table) is created by the second converter 1002.

In the described example it is assumed that both the representation and the realization have to be converted before being processed by the aligner 100. A situation is however possible in which only the representation 101 or only the realization 102 has to be converted. Accordingly, the procedure has to be adapted to the particular situation.

Both converters 1001, 1002 are programmed according to the source and destination formats. The detailed implementation of the converters 1001, 1002 and the way of creating the mapping tables 1006, 1007 are accomplished in ways known to those skilled in the art. Next, both the representation and the realization, each in the operating/standard format, are fed into the aligner 100. The aligner 100 creates the hyper links 111 as described above. Next, the hyper links 111 or the corresponding hyper-link document 900 and the mapping tables 1006, 1007 are used by the link transformer 1003 to create native hyper links 1008 in the format of the original representation 1004. For this purpose the link transformer 1003 uses the mapping tables 1006 and/or 1007 to replace the locators in the hyper links 111 with locators using the appropriate coordinate systems for the native representation 1004 and/or native realization 1005, as specified by the mapping tables 1006, 1007. For example, if the native representation 1004 was written in HTML format, it would first be converted into SGML format by the first converter 1001. The hyper links 111 created by the aligner 100 would then be retransformed into HTML by the link transformer 1003 using the mapping table 1006.

`
`-12-
`
`
`
`US 7,412,643 B1
`
`5
`
`10
`
`15
`
`7
Sometimes either the representation 101 and/or the realization 102 is enhanced by using information from the other. Examples include automatic subtitling, time-stamping the dialogues in a script, etc.

FIG. 11 illustrates an example of an aligner 1100 with an enhancer according to the present invention. The enhancer 1101 is employed to create enhanced versions of the representation 101 and/or the realization 102. The enhancer 1101 uses the hyper links 111 or the hyper-link document 900 from the aligner 100 and the original representation 101 and/or the original realization 102 to create an enhanced representation and/or an enhanced realization 1102, or both. Thereby the enhancer 1101 includes the hyper links 111 in the original representation 101 or realization 102. A typical example of an enhanced representation 1102 is the addition of audio clips to an HTML file. Other examples are the addition of a synchronized representation to MIDI or RIFF files. It is noted that the aligner 1100 with enhancer can of course be combined with the principle of the aligner 1000 for other file formats as described above.

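As one purely illustrative picture of such an enhancement, the sketch below wraps words of an HTML representation in anchors that carry the time interval from the hyper links 111, so that a player could start playback of the corresponding audio clip. The element names, attributes and audio file name are assumptions, not part of the patent.

def enhance_html(words, links):
    """Sketch of the enhancer 1101: embed hyper links 111 into an HTML representation
    by wrapping each linked word in an anchor carrying its audio time interval."""
    out = []
    for i, word in enumerate(words):
        if i < len(links):
            start = links[i]["start_ms"] / 1000.0
            end = links[i]["end_ms"] / 1000.0
            out.append('<a href="speech.mp3#t=%.3f,%.3f">%s</a>' % (start, end, word))
        else:
            out.append(word)                 # word without a link stays plain text
    return " ".join(out)

links = [{"start_ms": 24703, "end_ms": 24950}, {"start_ms": 24950, "end_ms": 25200}]
print(enhance_html(["I've", "been", "looking"], links))
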
Telecast applications (TV, digital audio broadcasting, etc.) use an interleaved system stream that carries the representation 101, the realization 102 and synchronization information. The synchronization information is created in the form of a system stream by an aligner with a multiplexer in accordance with the present invention. Again, FIG. 11 may be used to illustrate the system. This aligner with multiplexer may be implemented to use the aligner 100 as described above. A multiplexer (corresponding to the enhancer 1101 in FIG. 11) is employed to generate an interleaved system stream (corresponding to the enhanced representation 1102 in FIG. 11). In this way, the multiplexer combines the original repre