`
`(19) World Intellectual Property Organization
`International Bureau
`
`1111111111111111 IIIIII 1111111111111111111111111111111111111111 IIII IIIIIII IIII IIII IIII
`
`(43) International Publication Date
`10 October 2002 (10.10.2002)
`
`PCT
`
`(10) International Publication Number
`WO 02/080524 A2
`
(51) International Patent Classification⁷: H04N 5/00

(21) International Application Number: PCT/IB02/00896

(22) International Filing Date: 19 March 2002 (19.03.2002)

(25) Filing Language: English

(26) Publication Language: English

(30) Priority Data: 09/822,447  30 March 2001 (30.03.2001)  US

(71) Applicant: KONINKLIJKE PHILIPS ELECTRONICS N.V. [NL/NL]; Groenewoudseweg 1, NL-5621 BA Eindhoven (NL).

(72) Inventors: DIMITROVA, Nevenka; Prof. Holstlaan 6, NL-5656 AA Eindhoven (NL). AGNIHOTRI, Lalitha; Prof. Holstlaan 6, NL-5656 AA Eindhoven (NL). MCGEE, Thomas, F.; Prof. Holstlaan 6, NL-5656 AA Eindhoven (NL). JASINSCHI, Radu, S.; Prof. Holstlaan 6, NL-5656 AA Eindhoven (NL).

(74) Agent: GROENENDAAL, Antonius, W., M.; Internationaal Octrooibureau B.V., Prof. Holstlaan 6, NL-5656 AA Eindhoven (NL).

(81) Designated States (national): CN, JP, KR.

(84) Designated States (regional): European patent (AT, BE, CH, CY, DE, DK, ES, FI, FR, GB, GR, IE, IT, LU, MC, NL, PT, SE, TR).

Published: without international search report and to be republished upon receipt of that report

For two-letter codes and other abbreviations, refer to the "Guidance Notes on Codes and Abbreviations" appearing at the beginning of each regular issue of the PCT Gazette.
`
`(54) Title: STREAMING VIDEO BOOKMARKS
`
`
[Front-page figure: block diagram of a video indexing device. Raw NTSC video is divided into frames and passed through a macroblock creator and a DCT transformer to produce DCT macroblocks; a media processor containing a significant scene processor, keyframe filterer and frame memory exchanges frames and macroblocks with a host processor (reference numerals 202-260).]
(57) Abstract: A method, apparatus and systems for bookmarking an area of interest of stored video content are provided. As a viewer is watching a video and finds an area of interest, they can bookmark the particular segment of the video and then return to that segment with relative simplicity. This can be accomplished by pressing a button, clicking with a mouse or otherwise sending a signal to a device for marking a particular location of the video that is of interest. Frame identifiers can also be used to select a desired video from an index and to then retrieve the video from a medium containing multiple videos.
`
`
`
`
`
`Streaming video bookmarks
`
`5
`
`BACKGROUND OF THE INVENTION
`The invention relates generally to accessing stored video content and more
`particularly to a method and apparatus for bookmarking video content for identifying
`meaningful segments of a video signal for convenient retrieval at a later time.
`Users often obtain videos stored in VHS format, DVD, disks, files or
`otherwise for immediate viewing or for viewing at a later time. Frequently, the videos can be
`of great length and might have varied content. For example, a viewer might record several
`hours of content, including various television programs or personal activities on a single
`video cassette, hard drive or other storage medium. It is often difficult for viewers to return
`to particularly significant portions of a video. It is often inconvenient to record frame counter
`numbers or recording time information, particularly while viewing a video.
`Users frequently use frustrating hit-or-miss methods for returning to segments
`of particular interest. For example, a viewer might record or obtain a video that includes
`performances of a large number of comedians or figure skaters, but only be interested in the
`performances of a relatively small number of these individuals. Also, a viewer might be
recording the broadcast while watching the Super Bowl or World Series, and wish to return to
`five or six memorable plays of the game.
Current methods for locating particular segments of interest have been inconvenient to use and, accordingly, it is desirable to provide an improved apparatus and method for bookmarking a meaningful segment of a video.
`
`10
`
`15
`
`25
`
`SUMMARY OF THE INVENTION
Generally speaking, in accordance with the invention, a method, apparatus and systems for bookmarking an area of interest of stored video content are provided. As a viewer
`is watching a video and finds an area of interest, they can bookmark the particular segment of
`the video and then return to that segment with relative simplicity. This can be accomplished
`by pressing a button, clicking with a mouse or otherwise sending a signal to a device for
`marking a particular location of the video that is of interest. The boundaries of the entire
`segment can then be automatically identified using various superhistograms, frame
`
`-1-
`
`
`
`signatures, cut detection methods, closed caption information, audio information, and so on,
`by analyzing the visual, audio and transcript portions of the video signal. The visual
`information can be analyzed for changes in color, edge and shape to determine change of
`individuals by face changes, key frames, video texts and the like. Various audio features
`such as silence, noise, speech, music, and combinations thereof can be analyzed to determine
`the beginning and ending of a segment. Closed captioning information can also be analyzed
`for words, categories and the like. By processing this information to determine the
`boundaries of a meaningful segment of the video, the bookmark will not merely correspond
`to a specific point of the video, but to an entire automatically created segment of the content.
`Thus, not only can bookmark methods, systems and devices in accordance
`with the invention enable a user to conveniently return to a segment of a video of interest, the
`user can be brought to the beginning of the segment and can optionally only view the
`particular segment of interest, or scroll through or view only segments of interest in
`sequence.
`
For example, if a bookmark signal is sent while a particular speaker is
`speaking in a video of a situation comedy, identifying the current speaker when the
`bookmark signal is delivered can identify segment boundaries by determining when that
`speaker begins and stops speaking. This information can be useful for certain types of
`content, such as identifying a segment of a movie, but not for others. Histogram information
`such as change of color-palette signals can also help identify segment changes. Closed
`captions and natural language processing techniques can provide further information for
`delineating one topic from the next and will also help in identifying boundaries based on
`topics, dialogues and so forth. By selecting or combining evidence from the above segment
`identification techniques, the boundaries of the segment can be determined and established.
`The above can also be combined with analysis of the structure of the program as a whole to
`further identify the segments.
In one embodiment of the invention, the bookmark signal identifies a frame and the segment is based on time, such as 30 seconds or 1 minute, or video length, such as a selected number of frames, for example, before and after the selected frame. Alternatively, the segment can be set to a predefined length, such as 30 seconds or 1 minute from the
`segment beginning. Thus, if a bookmark signal is sent towards the end of a long segment,
`only the first part of the segment and possibly just the portion with the bookmark signal will
`be stored. Each segment can include EPG data, a frame or transcript information or
`combinations thereof. Indices of segments can be reviewed from remote locations, such as
`
`5
`
`IO
`
`15
`
`20
`
`25
`
`30
`
`-2-
`
`
`
`via the internet or world wide web and videos can be selected by searching through such an
`index.
`
In one embodiment of the invention, new scenes are detected on a running
`basis as a video is being watched. When a bookmark signal is activated, the system then
`looks for the end of the scene and records/indexes the bookmarked scene or stores the scene
`separately.
`
`In one embodiment of the invention, when a user watching video activates the
`bookmark feature, the unique characteristics of the individual frame are recorded. Then, if a
`user has a large volume of video content in a storage medium and wants to return to a
`bookmarked scene or segment, but cannot remember the identity of the movie, television
`program or sporting event, the characteristics of the frame, as a unique or relatively unique
identifier, are searched and the scene (or entire work) can be retrieved. Thus, a viewer could
`scroll through a series of video bookmarks until the desired scene is located and go directly to
`the scene or to the beginning of the work. Users can even keep personal lists of favorite
`bookmarked segments of not only video, but music, audio and other stored content and can
`access content from various internet or web accessible content sources by transmitting the
`frame identifier or segment identifier to the content source.
`Bookmarks in accordance with the invention can be backed up to a remote
`device, such as a PDA or other computerized storage device. Such a device can categorize
`the bookmarks, such as by analyzing EPG data, frame information, transcript information,
`such as by doing a key word search, or other video features. In fact, the systems and methods
`in accordance with the invention can also be used to bookmark and categorize various types
`of electronic content, such as segments from audio books, music, radio programs, text
`documents, multimedia presentations, photographs or other images, and so on. It can also be
`advantageous to store bookmarks as different levels, so that certain privacy and/or parental
`guidance issues can be addressed. In certain embodiments of the invention, the bookmarks
`can be accessed through web pages, mobile communication devices, PDAs, watches and
`other electronic devices.
`Thus, an individual can store EPG data, textual data or some other information
as well as the bookmarks to give a richer perspective of the video. This textual information
`could be part or all of the transcript, the EPG data related to a synopsis or actor, a keyframe
`and so on. This information could be further used to characterize the segment and bookmark.
`
`5
`
`IO
`
`15
`
`20
`
`25
`
`30
`
`-3-
`
`
`
`
`Accordingly, it is an object of the invention to provide an improved method,
`system and device for bookmarking and retrieving video and other content which overcomes
`drawbacks of existing methods, systems and devices.
`
`10
`
`5 BRIEF DESCRIPTION OF THE DRAWINGS
`For a fuller description of the invention, reference is had to the following
`description, taken in connection with the accompanying drawings, in which:
`FIG. 1 illustrates a video analysis process for segmenting video content in
`accordance with embodiments of the invention;
`FIGS. 2A and 2B are block diagrams of devices used in creating a visual index
`of segments in accordance with embodiments of the invention;
`FIG. 3 is a schematic diagram showing the selection of frame information
`from a video image in accordance with embodiments of the invention;
FIG. 4 is a chart showing three levels of a segmentation analysis in accordance with embodiments of the invention; and
`FIG. 5 shows the process flow for the incoming video.
`
`25
`
`DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
`Often a viewer would like to bookmark a segment of the video they are
watching for future retrieval. Bookmarking video can make it much easier to return to
`particular segments of interest. As a user watches a live video or video stored on a tape, disk,
`DVD, VHS tape or otherwise, they can press a button or otherwise cause a signal to be sent
`to a device electronically coupled to the video to enter a marking point. This marking point
`(or the signature of the frame) can be recorded in free areas of the tape (such as control areas)
`or medium on which the video is recorded or the time or frame count for the particular point
`of the tape can be recorded on a separate storage medium.
FIG. 5 shows the process flow. The incoming video can be divided (formatted) into frames in step 501. Next, for each of the frames, a signature is developed and stored in step 502. If the user has selected the frame for bookmarking, the frame is identified and the signature, with its frame position and video information, is stored as a bookmark in step 503. The boundaries around the bookmark are then identified and their information can be stored as well in step 504. The segment identification, such as the segment boundaries, or the video itself can be stored, depending on the user, in step 505.
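The flow of FIG. 5 can be sketched in a few lines of code. The following Python fragment is a minimal, hypothetical rendering of steps 501-505; compute_signature() and find_segment_boundaries() are placeholder names standing in for the signature and boundary-detection techniques described in the sections that follow, not part of the disclosure itself.

```python
# Minimal sketch of the FIG. 5 process flow (steps 501-505).
# compute_signature() and find_segment_boundaries() are hypothetical
# placeholders for the signature and boundary methods described below.

def compute_signature(frame):
    # Placeholder for any signature type described below (DCT block/region
    # signatures, color histograms, MAD, closed captions, ...).
    return hash(frame)

def find_segment_boundaries(signatures, position):
    # Placeholder: scan outward from the bookmarked frame while the
    # signature stays the same, approximating a scene boundary (step 504).
    start = end = position
    while start > 0 and signatures[start - 1] == signatures[position]:
        start -= 1
    while end < len(signatures) - 1 and signatures[end + 1] == signatures[position]:
        end += 1
    return start, end

def process_video(frames, bookmarked_positions):
    signatures = [compute_signature(f) for f in frames]        # steps 501-502
    bookmarks = []
    for position in bookmarked_positions:                      # step 503
        start, end = find_segment_boundaries(signatures, position)  # step 504
        bookmarks.append({"position": position,                # step 505
                          "signature": signatures[position],
                          "segment": (start, end)})
    return bookmarks

# Frames with identical content stand in for frames of the same scene.
print(process_video(["a", "a", "b", "b", "b", "c"], [3]))
```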
`
`30
`
`-4-
`
`
`
`
`In one embodiment of the invention, a user might store the bookmarks on a
`PDA, server or other storage device. This can act as a look up table. A user can also verify if
`they have viewed or obtained a specific video by comparing a bookmark or frame
`information to frame information of the video, stored, for example on an external server. A
`viewer might download video and then after viewing, delete the video, keeping only the
bookmark(s) and then retrieve the video from an external source when additional viewing is
`desired. Thus, storage resources can be maximized and the efficiency of centralized content
`storage sources can be utilized.
`In one embodiment of the invention, when a viewer clicks on a video, the
`frame being displayed at that time is extracted out for analysis. A signature, histogram,
`closed captioning or some other low-level feature or combination of features, could represent
`this frame. Examples will be provided below.
`Although systems in accordance with the invention can be set up to return to
`the exact point where the bookmark signal is activated, in enhanced systems or applications a
meaningful segment of the video can be bookmarked and users can have the option of
`returning to either the exact point or to the beginning of a meaningful segment, rather than to
`the middle of a segment or to the end of a segment, as a user might not decide to bookmark a
`segment until after it has been viewed and found to be of interest.
`Identifying the segment to which a bookmark corresponds can be
`accomplished in various manners. For example, in a preferred embodiment of the invention,
`the entire video or large portions thereof can be analyzed in accordance with the invention
`and broken down into segments. Then, when a bookmark signal is activated, the segment
which is occurring when the signal is activated (or the prior segment, or both) can be
`bookmarked. In another embodiment of the invention, the analysis to determine the
boundaries of a segment is not conducted until after the bookmark signal is activated. This
`information (video signature, start and end time of the tape, frame count and so forth) can be
`stored in the same location identified above.
`In still another embodiment of the invention, a method of identifying items of
`content such as videos, audio, images, text and combinations thereof, and the like can be
`performed by creating a bookmark comprising a selected segment of the content item having
`sufficient identifying information to identify the content item and retaining the segment
`identifying the item on a storage medium, such as a storage medium at a service provider.
`Users could then download the bookmarks at a remote location at their election. Users could
`
`20
`
`25
`
`30
`
`-5-
`
`
`
`then use the bookmarks to identify the original item of content from which the bookmark was
`created. These downloads of bookmarks can be created in accordance with personal profiles.
`
`5
`
`10
`
`15
`
`20
`
`25
`
`30
`
DCT Frame Signatures
When the viewer selects a frame, one type of frame signature can be derived from the composition of the DCT (Discrete Cosine Transform) coefficients. A frame signature representation is derived for each grouping of similarly valued DCT blocks in a frame, i.e., a frame signature is derived from region signatures within the frame. Each region signature is derived from block signatures as explained in the section below. Qualitatively, the frame signatures contain information about the prominent regions in the video frames representing identifiable objects. The signature of this frame can then be used to retrieve this portion of the video.
`Referring to FIG. 3, extracting block, region and frame signatures can be
`performed as follows. Based on the DC and highest values of the AC coefficients, a signature
`is derived for each block 301 in a video frame 302. Then, blocks 301 with similar signatures
are compared and size and location of groups of blocks 301 are determined in order to derive
`region signatures.
`The block signature 310 can be eight bits long, out of which three bits 320 are
`devoted to the DC signature and five bits 330 are devoted to the AC values. The DC part 320
`of the signature 310 is derived by determining where the DC value falls within a specified
`range of values (e.g. -2400 to 2400). The range can be divided into a preselected number of
`intervals. In this case, eight intervals are used (eight values are represented by three bits).
`Depending on the type of application, the size of the whole signature can be changed to
`accommodate a larger number of intervals and therefore finer granularity representation.
`Each interval is assigned a predefined mapping from the range of DC values to the DC part
`320 of the signature. Five bits 330 are used to represent the content of the AC values. Each
AC value is compared to a threshold, e.g., 200, and if the value is greater than the threshold,
`the corresponding bit in the AC signature is set to one. An example is shown in FIG. 3, where
`only value 370 is greater than the threshold of 200.
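As an illustration, the eight-bit block signature described above might be computed as follows. This is a minimal sketch: the bit layout (DC interval in the top three bits, AC flags in the low five) and the clamping of out-of-range DC values are assumptions made for the example; the text fixes only the interval count and the thresholding idea.

```python
# Sketch of the eight-bit block signature: three bits locate the DC
# coefficient within eight intervals of [-2400, 2400], and five bits flag
# whether each of the five highest AC coefficients exceeds the threshold.
# The bit packing (DC part in the top three bits) is an assumption.

def block_signature(dc, ac_values, dc_range=(-2400, 2400), threshold=200):
    lo, hi = dc_range
    dc = min(max(dc, lo), hi - 1)              # clamp into the range
    interval = (dc - lo) * 8 // (hi - lo)      # 0..7 -> three DC bits
    signature = interval << 5
    for i, ac in enumerate(ac_values[:5]):     # five highest AC coefficients
        if ac > threshold:
            signature |= 1 << (4 - i)          # set the matching AC bit
    return signature                           # eight-bit value, 0..255

# As in FIG. 3, only one AC value (370) exceeds the threshold of 200.
print(bin(block_signature(dc=1200, ac_values=[50, 370, 10, 0, 120])))
```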
`After deriving block signatures for each frame, regions of similarly valued
`block signatures are determined. Regions consist of two or more blocks that share similar
`
`-6-
`
`
`
`WO 02/080524
`
`PCT/IB02/00896
`
`7
`
`block signatures. In this process, a region growing method can be used for isolating regions
`in the image. Traditionally, region growing methods use pixel color and neighborhood
`concepts to detect regions. In one embodiment of the invention, block signature is used as a
`basis for growing regions. Each region can then be assigned a region signature, e.g.:
regionSignature(blockSignature, regionSize, Rx, Ry), where Rx and Ry are the coordinates
`of the center of the region. Each region corresponds roughly to an object in the image.
A selected frame can be represented by the most prominent groupings (regions) of DCT blocks. An n-word long signature is derived for a frame, where n determines the number of important regions (defined by the application) and a word consists of a
`predetermined number of bytes. Each frame can be represented by a number of prominent
`regions. In one embodiment of the invention, the number of regions in the image is limited
`and only the largest regions are kept. Because one frame is represented by a number of
`regions, the similarity between frames can be regulated by choosing the number of regions
`that are similar, based on their block signature, size and location. The regions can be sorted
`by region size and then the top n region signatures can be selected as a representative of the
frame: frame(regionSignature1, ..., regionSignaturen). It should be noted that this
`representation of keyframes is based on the visual appearance of the images, and does not
`attempt to describe any semantics of the images.
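A region-growing pass over the block signatures might look like the following sketch. The exact-match merge criterion and the inclusion of single-block groups are simplifying assumptions; the text calls for "similar" signatures and defines regions as two or more blocks.

```python
# Sketch of region growing over block signatures. Neighbouring blocks with
# the same signature are merged into regions, each summarised as
# regionSignature(blockSignature, regionSize, Rx, Ry).

from collections import deque

def grow_regions(sig_grid):
    """sig_grid: 2-D list of block signatures -> list of region tuples."""
    rows, cols = len(sig_grid), len(sig_grid[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c]:
                continue
            sig, blocks, queue = sig_grid[r][c], [], deque([(r, c)])
            seen[r][c] = True
            while queue:                       # flood fill over 4-neighbours
                y, x = queue.popleft()
                blocks.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and sig_grid[ny][nx] == sig):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            size = len(blocks)
            rx = sum(x for _, x in blocks) / size    # region centroid
            ry = sum(y for y, _ in blocks) / size
            regions.append((sig, size, rx, ry))
    # Sort by region size; the caller keeps the top n as the frame signature.
    return sorted(regions, key=lambda reg: reg[1], reverse=True)

frame_signature = grow_regions([[200, 200, 64],
                                [200, 64, 64]])[:2]   # top-2 regions
print(frame_signature)
```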
`
`Frame Searching
`To find the position in the video, a frame comparison procedure compares a
`bookmarked frame F" with all frames F' in a list of frames. Their respective region
`signatures are compared according to their size:
`
`5
`
`10
`
`15
`
`20
`
`frame_difference = Llregion_size,'-region_size,"I
`
`II
`
`i=I
`
The frame difference can be calculated for the regions in the frame signature with the same centroids. In this case, the position of the objects as well as the signature value is taken into account. On the other hand, there are cases when the position is irrelevant and we need to compare just the region sizes and disregard the position of the regions. If the frame difference is zero, then we can use the position information from the matching frame to retrieve that section of the video.
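A sketch of this comparison, under the position-independent variant (regions paired by size rank rather than by centroid), could read:

```python
# Sketch of the frame-difference measure: sum of absolute differences of
# corresponding region sizes. Pairing the i-th largest regions after
# sorting by size is an assumption; the centroid-matched variant would
# pair regions by Rx, Ry instead.

def frame_difference(regions_a, regions_b):
    sizes_a = sorted((size for _, size, _, _ in regions_a), reverse=True)
    sizes_b = sorted((size for _, size, _, _ in regions_b), reverse=True)
    n = max(len(sizes_a), len(sizes_b))
    sizes_a += [0] * (n - len(sizes_a))        # pad with empty regions
    sizes_b += [0] * (n - len(sizes_b))
    return sum(abs(a - b) for a, b in zip(sizes_a, sizes_b))

def find_frame(bookmarked_regions, frame_list):
    # A difference of zero locates the bookmarked position in the video.
    for position, candidate_regions in enumerate(frame_list):
        if frame_difference(bookmarked_regions, candidate_regions) == 0:
            return position
    return None
```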
`Other Frame Signature Types
Signatures can be created by using a combination of features from the frames, such as the maximum absolute difference (MAD) between the preceding and/or following frame, the intensity of the frame, the bitrate used for the frame, whether the frame is interlaced or progressive, whether the frame is from a 16:9 or 4:3 format, and so forth. This type of information could be used in any combination to identify the frame, and a retrieval process similar to that described above can be used.
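For illustration, a combined signature of this kind might be assembled as below; the specific field set, and frames given as flat sequences of 8-bit luminance values, are assumptions made for the sketch.

```python
# Sketch of a combined signature built from such features.

def mad(frame_a, frame_b):
    # Maximum absolute pixel difference between two equal-size frames.
    return max(abs(a - b) for a, b in zip(frame_a, frame_b))

def combined_signature(prev_frame, frame, bitrate, interlaced, aspect):
    return {"mad": mad(prev_frame, frame),         # vs. the preceding frame
            "intensity": sum(frame) / len(frame),  # mean luminance
            "bitrate": bitrate,                    # bits spent on the frame
            "interlaced": interlaced,              # True / False
            "aspect": aspect}                      # "16:9" or "4:3"

print(combined_signature([10, 10], [12, 250],
                         bitrate=38000, interlaced=False, aspect="4:3"))
```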
`
`5
`
Color Histograms
Instead of using the signatures described above, one could calculate a color histogram for the frame and use this for retrieval. The color histogram could consist of any number of bins.
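A minimal sketch of such a histogram, assuming 8-bit RGB pixels and eight bins per channel (512 bins in all; any bin count would do):

```python
# Sketch of a color histogram usable as a frame identifier. The quantization
# to eight bins per channel is an assumption; the text leaves it open.

def color_histogram(pixels, bins_per_channel=8):
    histogram = [0] * bins_per_channel ** 3
    for r, g, b in pixels:
        ri = r * bins_per_channel // 256
        gi = g * bins_per_channel // 256
        bi = b * bins_per_channel // 256
        histogram[(ri * bins_per_channel + gi) * bins_per_channel + bi] += 1
    return histogram

# Two near-identical red pixels fall into the same bin.
print(max(color_histogram([(255, 0, 0), (250, 10, 5)])))
```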
`
`Closed Captioning
`Closed captioning data could also be used to bookmark the segment by
`extracting out the key words that represent the section.
`
`Combinations
Any combination of the above could also be used to bookmark the frame or section.
`
Defining the segments
The segments could be manually bookmarked by the viewer by having the viewer click on the start and end points of the video. Alternatively, the bookmarking could happen automatically using a technique such as a superhistogram; automatic techniques for determining the boundaries of a segment are discussed below. For example, a scene will often maintain a certain color palette, and a change in scene usually entails a break in this color palette. While the video is playing, automatic video analysis can be performed to extract the histograms. When the viewer clicks on the video, the color histogram for that frame is compared to the previously captured frames to identify the start of the scene; then the same comparison can be done to find the end of the scene. Using this information, it is now possible to store only the segment of interest for the viewer. This information can also be used for more meaningful retrieval of the full video. For instance, instead of going directly to the position of when the viewer clicked, one could actually go to the start of the scene that contains that frame.
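The boundary search just described might be sketched as follows; the L1 histogram distance and the single fixed threshold are illustrative assumptions.

```python
# Sketch of the scene-boundary search around a bookmarked frame: walk
# backward and forward from the clicked frame until the color palette
# breaks.

def histogram_distance(h1, h2):
    return sum(abs(a - b) for a, b in zip(h1, h2))

def scene_bounds(histograms, clicked, threshold):
    start = clicked
    while (start > 0 and
           histogram_distance(histograms[start - 1], histograms[clicked]) < threshold):
        start -= 1                     # walk back to the palette break
    end = clicked
    while (end < len(histograms) - 1 and
           histogram_distance(histograms[end + 1], histograms[clicked]) < threshold):
        end += 1                       # walk forward to the next break
    return start, end                  # the segment to store or retrieve
```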
`
`Example
The viewer is watching a video of the Wizard of Oz movie. The current view contains frames where Dorothy, the Tin Man, the Cowardly Lion and the Scarecrow go into the Emerald City from the poppy field. The viewer clicks on the video, e.g., when the Horse of a Different Color passes. In one embodiment of the invention, the frame/scene analysis has been continuous. The system then extracts the selected frame and generates both the DCT frame signature as well as the color histogram, for example. The analysis program searches
`
`-8-
`
`
`
`WO 02/080524
`
`PCT/IB02/00896
`
`9
`
`through the previous stored frames until it finds one that does not belong to the same color
`palette. This denotes the start of the scene. The program has continued analyzing the video
until it locates the end of the scene by virtue of another significant change in color palette. If
`the user had already decided to record the whole video, the start and end points are marked.
`In another embodiment of the invention, only the segment is stored. Meanwhile the program
`has been analyzing and storing the DCT frame information for the individual frames.
`Sometime later, if the viewer views the bookmarked frame and decides to retrieve the portion
`of the video, the DCT frame information is compared with the stored information until a
`match is found. Then the marked points around this frame are used to retrieve that portion of
`the video.
`
`Segmenting the video can be performed using analysis techniques such as
`those discussed in US Pat. Nos. 6,137,544 and 6,125,229, the contents of which are
`incorporated herein by reference.
`Segmenting a video signal can also be accomplished with use of a layered
`probabilistic system which can be referred to as a Bayesian Engine or BE. Such a system is
`described in J. Pearl, "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
`Inference," Morgan Kaufmann Publishers, Inc. San Mateo, California (1988). Such a system
can be understood with reference to FIG. 4.
`FIG. 4 shows a three layer probabilistic framework in three layers: low level
`410, mid-level 420 and high level 430. Low level layer 410 describes signal processing
`parameters for a video signal 401. These can include visual features such as color, edge, and
shape; audio parameters, such as average energy, bandwidth, pitch, mel-frequency cepstral
`coefficients, linear prediction coding coefficients, and zero-crossings; and the transcript,
`which can be pulled from the ASCII characters of the closed captions. If closed caption
`information is not available, voice recognition methods can be used to convert the audio to
`transcript characters.
`The arrows indicate the combinations of low-level 410 features that create
`mid-level 420 features. Mid-level 420 features are associated with whole frames or
`collections of frames, while low level 410 features are associated with pixels or short time
`intervals. Keyframes (first frame of a shot), faces, and video text are mid-level visual
features. Silence, noise, speech, music and combinations thereof are mid-level 420 audio features.
`Keywords and the closed caption/transcript categories also are part of mid-level 420.
`High level features can describe semantic video content obtained through the
`integration of mid-level 420 features across the different modalities.
`
`5
`
`10
`
`15
`
`20
`
`25
`
`30
`
`-9-
`
`
`
`
`This approach is highly suitable because probabilistic frameworks are
`designed to deal with uncertain information, and they are appropriate for representing the
integration of information. The BE's probabilistic integration employs either intra- or inter-modalities. Intra-modality integration refers to integration of features within a single domain. For example, integration of color, edge, and shape information for videotext represents intra-modality integration because it all takes place in the visual domain. Integration of mid-level audio categories with the visual categories face and videotext offers an example of inter-modalities.
`
Bayesian networks are directed acyclic graphs (DAGs) in which the nodes correspond to (stochastic) variables. The arcs describe a direct causal relationship between the linked variables. The strength of these links is given by conditional probability distributions (cpds). More formally, let the set $\Xi \equiv \{\xi_1, \ldots, \xi_N\}$ of $N$ variables define a DAG. For each variable there exists a sub-set of variables of $\Xi$, $\Pi_{\xi_i}$, the parents set of $\xi_i$, i.e., the predecessors of $\xi_i$ in the DAG, such that $P(\xi_i \mid \Pi_{\xi_i}) = P(\xi_i \mid \xi_1, \ldots, \xi_{i-1})$, where $P(\cdot \mid \cdot)$ is a cpd, strictly positive. Now, given the joint probability density function (pdf) $P(\xi_1, \ldots, \xi_N)$, using the chain rule:

$$P(\xi_1, \ldots, \xi_N) = P(\xi_N \mid \xi_{N-1}, \ldots, \xi_1) \times \cdots \times P(\xi_2 \mid \xi_1)\,P(\xi_1).$$

According to this equation, the parent set $\Pi_{\xi_i}$ has the property that $\xi_i$ and $\{\xi_1, \ldots, \xi_N\} \setminus \Pi_{\xi_i}$ are conditionally independent given $\Pi_{\xi_i}$.
In FIG. 4, the flow diagram of the BE has the structure of a DAG made up of three layers. In each layer, each element corresponds to a node in the DAG. The directed arcs join one node in a given layer with one or more nodes of the preceding layer. Two sets of arcs join the elements of the three layers. For a given layer and for a given element we compute a joint pdf (probability density function) as previously described. More precisely, for an element (node) $\xi_i^{(l)}$ associated with the $l$-th layer, the joint pdf is:

$$P\!\left(\xi_i^{(l)} \mid \Pi_i^{(l)}\right) \times \left\{ P\!\left(\xi_1^{(l-1)} \mid \Pi_1^{(l-1)}\right) \cdots P\!\left(\xi_{N^{(l-1)}}^{(l-1)} \mid \Pi_{N^{(l-1)}}^{(l-1)}\right) \right\} \qquad (1)$$
`
`-10-
`
`
`
`
where for each element $\xi_i^{(l)}$ there exists a parent set $\Pi_i^{(l)}$, the union of the parent sets for a given level $l$ being $\Pi^{(l)} \equiv \bigcup_{i=1}^{N^{(l)}} \Pi_i^{(l)}$. There can exist an overlap between the different parent sets for each level.
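As a toy illustration of the chain-rule factorization above, consider three binary variables where each node's parent set is just its single predecessor; the cpd values below are invented for the example.

```python
# Toy illustration of the chain-rule factorization for xi1 -> xi2 -> xi3.
# The cpd values are made up; keys are (value, condition) pairs.

p_xi1 = {0: 0.4, 1: 0.6}
p_xi2_given_xi1 = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7}
p_xi3_given_xi2 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}

def joint(x1, x2, x3):
    # P(xi1, xi2, xi3) = P(xi3 | xi2) * P(xi2 | xi1) * P(xi1)
    return p_xi3_given_xi2[(x3, x2)] * p_xi2_given_xi1[(x2, x1)] * p_xi1[x1]

# The joint pdf sums to one over all eight assignments.
print(sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)))
```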
`Topic segmentation (and classification) performed by BE is shown in the third
`layer (high-level) of FIG. 4. The complex nature of multimedia content requires integration
`across multiple domains. It is preferred to use the comprehensive set of data from the audio,
`visual, and transcript domains.
`In the BE structure, FIG. 4, for each of the three layers, each node and arrow is
associated with a cpd. In the low-level layer the cpd's are assigned by the AE as described
`above. For the mid-level layer, twenty closed captions categories (for example) are
`generated: weather, international, crime, sports, movie, fashion, tech stock, music,
`automobile, war, economy, energy, stock, violence, financial, national (affairs), biotech,
`disaster, art, and politics. It is advantageous to use a knowledge tree for each category made
`up of an association table of keywords and categories. After a statistical processing, the
`system performs categorization using category vote histograms. If a word in the closed
`captions file matches a knowledge base keyword, then the corresponding category gets a
`vote. The probability, for each category, is given by the ratio between the total number of
votes per keyword and the total number of votes for a closed captions paragraph.
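A minimal sketch of this voting scheme, with a made-up miniature of the twenty-category keyword table:

```python
# Sketch of the category-vote scheme: each closed-captions word that
# matches a knowledge-base keyword casts a vote, and a category's
# probability is its share of all votes in the paragraph.

keyword_categories = {"rain": "weather", "storm": "weather",
                      "stocks": "financial", "nasdaq": "financial",
                      "goal": "sports"}

def categorize(paragraph):
    votes = {}
    for word in paragraph.lower().split():
        category = keyword_categories.get(word)
        if category:
            votes[category] = votes.get(category, 0) + 1
    total = sum(votes.values())
    return {cat: n / total for cat, n in votes.items()} if total else {}

print(categorize("The storm moved in as Nasdaq stocks fell"))
# -> {'weather': 0.333..., 'financial': 0.666...}
```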
Systems in accordance with the invention can perform segmentation by segmenting the TV program into commercial vs. non-commercial parts and classifying the non-commercial parts into segments based on two high-level categories: financial news and talk shows, for example (performed by the BE).
`Initial segmentation can be done using closed caption data to divide the video
`into program and commercial segments. Next the closed captions of the program segments
`are analyzed for single, double, and triple arrows. Double arrows indicate a speaker change.
`The system marks text between successive double arrows with a start and end time in order to
`use it as an atomic closed captions unit. Systems in accordance with the invention can use
`these units as the segmenting building blocks. In order to determine a segment's high-level
indexing (whether it is financial news or a talk show, for example), Scout computes two joint
`probabilities. These are defined as:
p-FIN-TOPIC = p-VTEXT * p-KWORDS * p-FACE * p-AUDIO-FIN * p-CC-FIN * p-FACETEXT-FIN   (2),

p-TALK-TOPIC = p-VTEXT * p-KWORDS * p-FACE * p-AUDIO-TALK * p-CC-TALK * p-FACETEXT-TALK   (3).
`
The audio probabilities p-AUDIO-FIN for financial news and p-AUDIO-TALK for talk shows are created by the combination of different individual audio category probabilities. The closed captions probabilities p-CC-FIN for financial news and p-CC-TALK for talk shows are chosen as the largest probability out of the list of twenty probabilities. The face and videotext probabilities p-FACETEXT-FIN and p-FACETEXT-TALK are obtained by comparing the face and videotext probabilities p-FACE and p-TEXT which determine, for each individual closed caption unit, the probability of face and text occurrence. One heuristic builds on the fact that talk shows are dominated by faces while financial news has both faces and text. The high-level segmenting is done on each closed captions unit by computing in a n