(12) Patent Application Publication
(10) Pub. No.: US 2003/0107592 A1
(43) Pub. Date: Jun. 12, 2003
Li et al.
`(54) SYSTEM AND METHOD FOR RETRIEVING
`INFORMATION RELATED TO PERSONS IN
`VIDEO PROGRAMS
`(75) Inventors: Dongge Li, Ossining, NY (US);
`Nevenka Dimitrova, Yorktown Heights,
`NY (US); Lalitha Agnihotri, Fishkill,
`NY (US)
`Correspondence Address:
`PHILIPS ELECTRONICS NORTH
`AMERICAN CORP
`580 WHITE PLAINS RD
`TARRYTOWN, NY 10591 (US)
(73) Assignee: KONINKLIJKE PHILIPS ELECTRONICS N.V.
`
(21) Appl. No.: 10/014,234

(22) Filed: Dec. 11, 2001

Publication Classification
`
(51) Int. Cl.7 ....................................................... G09G 5/00
`(52) U.S. Cl. .............................................................. 345/745
`(57)
`ABSTRACT
An information tracking device receives content data, such as a video or television signal, from one or more information sources and analyzes the content data according to query criteria to extract relevant stories. The query criteria utilize a variety of information, such as but not limited to a user request, a user profile, and a knowledge base of known relationships. Using the query criteria, the information tracking device calculates a probability of a person or event occurring in the content data and spots and extracts stories accordingly. The results are indexed, ordered, and then displayed on a display device.
`
[Cover drawing: schematic of a remote user site with a set-top box/processor and display device connected via a network 200 to information sources]
`
[FIG. 1 (Sheet 1 of 6): schematic overview of the information retrieval system]
`
`
`
[FIG. 2 (Sheet 2 of 6): schematic of the alternate embodiment with the content analyzer at the remote site]
`
`
`
[FIG. 3 (Sheet 3 of 6): flow diagram of information retrieval; video in; story management (link, organize, score, rating) with temporal and causal relations]
`
`
`
[FIG. 4 (Sheet 4 of 6): flow diagram of person spotting and recognition; video in]
`
`
`
[FIG. 5 (Sheet 5 of 6): flow diagram of story extraction; video/text/audio in; visual segmentation (cuts, face shots, constant faces); audio segmentation & classification; transcript topics by time slicing; information fusion; internal story segmentation & annotation; person finding; inferencing & name resolution; spotted story out]
`
`
`
[FIG. 6 (Sheet 6 of 6): flow diagram of story indexing; story; index by name, topic, keywords; causality relationship extraction; temporal relationship extraction; profile & relevance feedback]
`
`
`
`
SYSTEM AND METHOD FOR RETRIEVING INFORMATION RELATED TO PERSONS IN VIDEO PROGRAMS
`
`FIELD OF THE INVENTION
`0001. The present invention relates to a person tracker
`and method of retrieving information related to a targeted
`person from multiple information Sources.
`
`BACKGROUND OF INVENTION
[0002] With some 500+ channels of available television content and endless streams of content accessible via the Internet, it might seem that one would always have access to desirable content. To the contrary, however, viewers are often unable to find the type of content they are seeking. This can lead to a frustrating experience.
`0003) When a user watches television there often occur
`times when the user would be interested in learning further
`information about perSons in the program the user is watch
`ing. Present Systems, however, fail to provide a mechanism
`for retrieving information related to a targeted Subject, Such
`as an actor or actress, or an athlete. For example, EP 1031
`964 is directed to an automated Search device. For example,
`a user with access to 200 television Stations Speaks his desire
`for watching, for example, Robert Redford movies or games
`shows. Voice recognition Systems cause a Search of available
`content and present the user with Selections based on the
`request. Thus, the System is an advanced channel Selecting
`System and does not go Outside the presented channels to
`obtain additional information for the user. Further, U.S. Pat.
`No. 5,596,705 presents the user with a multi-level presen
`tation of, for example, a movie. The Viewer can watch the
`movie or with the System, formulate queries to obtain
`additional information regarding the movie. However, it
`appears that the Search is of a closed System of movie related
`content In contrast, the disclosure of invention goes outside
`of the available television programs and outside of a single
`Source of content. Several examples are given. A user is
`watching a live cricket match and can retrieve detailed
`Statistics on the player at bat. A user watching a movie wants
`to know more about the actor on the Screen and additional
`information is located from various Web Sources, not a
`parallel Signal transmitted with the movie. A user Sees an
`actress on the Screen who looks familiar, but can’t remember
`her name. The System identifies all the programs the user has
`watched that the actress has been in. Thus, the proposal
`represents a broader, or open-ended Search System for
`accessing a much larger universe content than either of the
`two cited references.
`0004. On the Internet, a user looking for content can type
`a Search request into a Search engine. However, these Search
`engines are often hit or miss and can be very inefficient to
`use. Furthermore, current Search engines are unable to
`continuously access relevant content to update results over
`time. There are also specialized web sites and news groups
`(e.g., sports sites, movie sites, etc.) for users to access.
`However, these Sites require users to log in and inquire about
`a particular topic each time the user desires information.
[0005] Moreover, there is no system available that integrates information retrieving capability across various media types, such as television and the Internet, and can extract people or stories about such persons from multiple channels and sites. In one system, disclosed in EP 915621, URLs are embedded in a closed caption portion of a transmission so that the URLs can be extracted to retrieve the corresponding web pages in synchronization with the television signal. However, such systems fail to allow for user interaction.
[0006] Thus there is a need for a system and method for permitting a user to create a targeted request for information, which request is processed by a computing device having access to multiple information sources to retrieve information related to the subject of the request.
`
`SUMMARY OF THE INVENTION
`0007. The present invention overcomes the shortcomings
`of the prior art. Generally, a perSon tracker comprises a
`content analyzer comprising a memory for Storing content
`data received from an information Source and a processor for
`executing a Set of machine-readable instructions for analyZ
`ing the content data according to query criteria. The perSon
`tracker further comprises an input device communicatively
`connected to the content analyzer for permitting a user to
`interact with the content analyzer and a display device
`communicatively connected to the content analyzer for
`displaying a result of analysis of the content data performed
`by the content analyzer. According to the Set of machine
`readable instructions, the processor of the content analyzer
`analyzes the content data to extract and indeX one or more
`Stories related to the query criteria.
[0008] More specifically, in an exemplary embodiment, the processor of the content analyzer uses the query criteria to spot a subject in the content data and retrieve information about the spotted person for the user. The content analyzer also further comprises a knowledge base which includes a plurality of known relationships, including a map of known faces and voices to names and other related information. The celebrity finder system is implemented based on the fusion of cues from audio, video, and available video-text or closed caption information. From the audio data, the system can recognize speakers based on the voice. From the visual cues, the system can track face trajectories and recognize faces for each of the face trajectories. Whenever available, the system can extract names from video text and closed caption data. A decision-level fusion strategy can then be used to integrate the different cues to reach a result. When the user sends a request related to the identity of the person shown on the screen, the person tracker can recognize that person according to the embedded knowledge, which may be stored in the tracker or loaded from a server. Appropriate responses can then be created according to the identification results. If additional or background information is desired, a request may also be sent to the server, which then searches through a candidate list or various external sources, such as the Internet (e.g., a celebrity web site), for a potential answer or clues that will enable the content analyzer to determine an answer.
`0009. In general, the processor, according to the machine
`readable instructions performs Several Steps to make the
`most relevant matches to a user's request or interests,
`including but not limited to perSon Spotting, Story extraction,
`inferencing and name resolution, indexing, results presen
`tation, and user profile management. More specifically,
`according to an exemplary embodiment, a perSon Spotting
`
`Page 8
`
`
`
`US 2003/0107592 A1
`
`Jun. 12, 2003
`
`function of the machine-readable instructions extracts faces,
`Speech, and text from the content data, makes a first match
`of known faces to the extracted faces, makes a Second match
`of known Voices to the extracted Voices, Scans the extracted
`text to make a third match to known names, and calculates
`a probability of a particular perSon being present in the
`content databased on the first, Second, and third matches. In
`addition, a Story extraction function preferably Segments
`audio, Video and transcript information of the content data,
`performs information fusion, internal Story Segmentation/
`annotation, and inferencing and name resolution to extract
`relevant Stories.
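By way of non-limiting illustration only, the decision-level combination of the first, second, and third matches described above might be sketched as follows. The weights, threshold, helper names, and scores are assumptions of this sketch; the disclosure does not prescribe a particular fusion formula.

```python
# Sketch of decision-level fusion for person spotting.
# Each cue contributes a match score in [0, 1].

FACE_W, VOICE_W, NAME_W = 0.4, 0.4, 0.2  # assumed weights, not from the disclosure


def person_probability(face_score: float, voice_score: float, name_score: float) -> float:
    """Combine the first (face), second (voice), and third (name/transcript)
    matches into a probability that a particular person is present."""
    p = FACE_W * face_score + VOICE_W * voice_score + NAME_W * name_score
    return max(0.0, min(1.0, p))


def spot_persons(candidates: dict[str, tuple[float, float, float]],
                 threshold: float = 0.6) -> list[str]:
    """Return candidate names whose fused probability exceeds a threshold."""
    return [name for name, scores in candidates.items()
            if person_probability(*scores) >= threshold]


# Example: face and voice both match "Actor A" strongly, and the name
# appears in the transcript; "Actor B" has only weak cues.
print(spot_persons({"Actor A": (0.9, 0.8, 1.0), "Actor B": (0.2, 0.1, 0.0)}))
```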
[0010] The above and other features and advantages of the present invention will become readily apparent from the following detailed description thereof, which is to be read in connection with the accompanying drawings.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
[0011] In the drawing figures, which are merely illustrative, and wherein like reference numerals depict like elements throughout the several views:
[0012] FIG. 1 is a schematic diagram of an overview of an exemplary embodiment of an information retrieval system in accordance with the present invention;
[0013] FIG. 2 is a schematic diagram of an alternate embodiment of an information retrieval system in accordance with the present invention;
[0014] FIG. 3 is a flow diagram of a method of information retrieval in accordance with the present invention;
[0015] FIG. 4 is a flow diagram of a method of person spotting and recognition in accordance with the present invention;
[0016] FIG. 5 is a flow diagram of a method of story extraction; and
[0017] FIG. 6 is a flow diagram of a method of indexing the extracted stories.
`
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0018] The present invention is directed to an interactive system and method for retrieving information from multiple media sources according to a request of a user of the system.
[0019] In particular, an information retrieval and tracking system is communicatively connected to multiple information sources. Preferably, the information retrieval and tracking system receives media content from the information sources as a constant stream of data. In response to a request from a user (or triggered by a user's profile), the system analyzes the content data and retrieves the data most closely related to the request. The retrieved data is either displayed or stored for later display on a display device.
[0020] System Architecture
[0021] With reference to FIG. 1, there is shown a schematic overview of a first embodiment of an information retrieval system 10 in accordance with the present invention. A centralized content analysis system 20 is interconnected to a plurality of information sources 50. By way of non-limiting example, information sources 50 may include cable or satellite television and the Internet. The content analysis system 20 is also communicatively connected to a plurality of remote user sites 100, described further below.
[0022] In the first embodiment, shown in FIG. 1, the centralized content analysis system 20 comprises a content analyzer 25 and one or more data storage devices 30. The content analyzer 25 and the storage devices 30 are preferably interconnected via a local or wide area network. The content analyzer 25 comprises a processor 27 and a memory 29, which are capable of receiving and analyzing information received from the information sources 50. The processor 27 may be a microprocessor with associated operating memory (RAM and ROM), and may include a second processor for pre-processing the video, audio, and text components of the data input. The processor 27, which may be, for example, an Intel Pentium chip or other more powerful multiprocessor, is preferably powerful enough to perform content analysis on a frame-by-frame basis, as described below. The functionality of the content analyzer 25 is described in further detail below in connection with FIGS. 3-5.
[0023] The storage devices 30 may be a disk array or may comprise a hierarchical storage system of tera-, peta-, and exabyte scale, including optical storage devices, each preferably having hundreds or thousands of gigabytes of storage capability for storing media content. One skilled in the art will recognize that any number of different storage devices 30 may be used to support the data storage needs of the centralized content analysis system 20 of an information retrieval system 10 that accesses several information sources 50 and can support multiple users at any given time.
[0024] As described above, the centralized content analysis system 20 is preferably communicatively connected to a plurality of remote user sites 100 (e.g., a user's home or office) via a network 200. Network 200 is any global communications network, including but not limited to the Internet, a wireless/satellite network, a cable network, and the like. Preferably, network 200 is capable of transmitting data to the remote user sites 100 at relatively high data transfer rates to support media-rich content retrieval, such as live or recorded television.
[0025] As shown in FIG. 1, each remote site 100 includes a set-top box 110 or other information receiving device. A set-top box is preferable because most set-top boxes, such as TiVo®, WebTV®, or UltimateTV®, are capable of receiving several different types of content. For instance, the UltimateTV® set-top box from Microsoft® can receive content data from both digital cable services and the Internet. Alternatively, a satellite television receiver could be connected to a computing device, such as a home personal computer 140, which can receive and process web content via a home local area network. In either case, all of the information receiving devices are preferably connected to a display device 115, such as a television or CRT/LCD display.
[0026] Users at the remote user sites 100 generally access and communicate with the set-top box 110 or other information receiving device using various input devices 120, such as a keyboard, a multi-function remote control, a voice-activated device or microphone, or a personal digital assistant. Using such input devices 120, users can input specific requests to the person tracker, which uses the requests to search for information related to a particular person, as described further below.
[0027] In an alternate embodiment, shown in FIG. 2, a content analyzer 25 is located at each remote site 100 and is communicatively connected to the information sources 50. In this alternate embodiment, the content analyzer 25 may be integrated with a high capacity storage device, or a centralized storage device (not shown) can be utilized. In either instance, the need for a centralized analysis system 20 is eliminated in this embodiment. The content analyzer 25 may also be integrated into any other type of computing device 140 that is capable of receiving and analyzing information from the information sources 50, such as, by way of non-limiting example, a personal computer, a hand-held computing device, a gaming console having increased processing and communications capabilities, a cable set-top box, and the like. A secondary processor, such as the TriMedia™ Tricodec card, may be used in said computing device 140 to pre-process video signals. However, to avoid confusion, the content analyzer 25, the storage device 130, and the set-top box 110 are each depicted separately in FIG. 2.
[0028] Functioning of the Content Analyzer
[0029] As will become evident from the following discussion, the functionality of the information retrieval system 10 has equal applicability to both television/video-based content and web-based content. The content analyzer 25 is preferably programmed with a firmware and software package to deliver the functionalities described herein. Upon connecting the content analyzer 25 to the appropriate devices, i.e., a television, home computer, cable network, etc., the user would preferably input a personal profile using input device 120, which will be stored in a memory 29 of the content analyzer 25. The personal profile may include information such as, for example, the user's personal interests (e.g., sports, news, history, gossip, etc.), persons of interest (e.g., celebrities, politicians, etc.), or places of interest (e.g., foreign cities, famous sites, etc.), to name a few. Also, as described below, the content analyzer 25 preferably stores a knowledge base from which to draw known data relationships, such as that G. W. Bush is the President of the United States.
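A minimal sketch of how such a profile and knowledge base might be represented follows. The field names and the lookup helper are assumptions of the sketch only; the disclosure does not specify a data layout.

```python
# Sketch of the personal profile and knowledge base stored in memory 29.
# All keys and field names here are illustrative assumptions.

user_profile = {
    "interests": ["sports", "news", "history"],
    "persons_of_interest": ["G. W. Bush"],
    "places_of_interest": ["Paris", "Giza"],
}

# Knowledge base of known relationships, e.g. mapping a name to a role.
knowledge_base = {
    "G. W. Bush": {"relation": "President of the United States", "field": "news"},
}


def known_relationship(name):
    """Return the stored relationship for a name, if any."""
    entry = knowledge_base.get(name)
    return entry["relation"] if entry else None


print(known_relationship("G. W. Bush"))  # -> President of the United States
```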
[0030] With reference to FIG. 3, the functionality of the content analyzer will be described in connection with the analysis of a video signal. In step 302, the content analyzer 25 performs a video content analysis using audio, visual, and transcript processing to perform person spotting and recognition using, for example, a list of celebrity or politician names, voices, or images in the user profile and/or knowledge base and an external data source, as described below in connection with FIG. 4. In a real-time application, the incoming content stream (e.g., live cable television) is buffered either in the storage device 30 at the central site 20 or in the local storage device 130 at the remote site 100 during the content analysis phase. In other, non-real-time applications, upon receipt of a request or other prescheduled event (described below), the content analyzer 25 accesses the storage device 30 or 130, as applicable, and performs the content analysis.
[0031] The content analyzer 25 of the person tracking system 10 receives a viewer's request for information related to a certain celebrity shown in a program and uses the request to return a response, which can help the viewer better search or manage TV programs of interest. Here are four examples:
`0032 1. User is watching a cricket match. A new
`player comes to bat. The user asks the system 10 for
`detailed Statistics on this player based on this match
`and previous matches this year.
`0033 2. User sees an interesting actor on the screen
`and wants to know more about him. The system 10
`locates Some profile information about the actor from
`the Internet or retrieves news about the actor from
`recently issued Stories.
`0034 3. User sees an actress on the screen who
`looks familiar, but the user cannot remember the
`actress's name. System 10 responds with all the
`programs that this actress has been in along with her
`C.
`0035 4. A user who is very interested in the latest
`news involving a celebrity Sets her personal Video
`recorder to record all the news about the celebrity.
`The system 10 scans the news channels, and celeb
`rity and talk shows, for example, for the celebrity
`and records of channels all matching programs.
[0036] Because most cable and satellite television signals carry hundreds of channels, it is preferable to target only those channels that are most likely to produce relevant stories. For this purpose, the content analyzer 25 may be programmed with a knowledge base 450 or field database to aid the processor 27 in determining a "field type" for the user's request. For example, the name Dan Marino in the field database might be mapped to the field "sports". Similarly, the term "terrorism" might be mapped to the field "news". In either instance, upon determination of a field type, the content analyzer would then scan only those channels relevant to the field (e.g., news channels for the field "news"). While these categorizations are not required for operation of the content analysis process, using the user's request to determine a field type is more efficient and would lead to quicker story extraction. In addition, it should be noted that the mapping of particular terms to fields is a matter of design choice and could be implemented in any number of ways.
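Since the mapping is a design choice, one of many possible implementations is a plain lookup table, sketched below. The example entries come from the paragraph above; the channel lists and fallback behavior are assumptions of the sketch.

```python
# Sketch of a field database used to target likely channels.
# The term-to-field entries mirror the examples above; channel lists
# are illustrative assumptions.

FIELD_DB = {"Dan Marino": "sports", "terrorism": "news"}
FIELD_CHANNELS = {"sports": ["ESPN"], "news": ["CNN", "BBC World"]}


def channels_for_request(request_terms):
    """Determine field types for the request terms and return only the
    channels relevant to those fields; fall back to scanning everything."""
    fields = {FIELD_DB[t] for t in request_terms if t in FIELD_DB}
    if not fields:
        return ["<all channels>"]
    return [ch for f in sorted(fields) for ch in FIELD_CHANNELS.get(f, [])]


print(channels_for_request(["Dan Marino"]))  # -> ['ESPN']
print(channels_for_request(["unknown term"]))  # -> ['<all channels>']
```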
[0037] Next, in step 304, the video signal is further analyzed to extract stories from the incoming video. Again, the preferred process is described below in connection with FIG. 5. It should be noted that the person spotting and recognition can also be executed in parallel with story extraction as an alternative implementation.
[0038] An exemplary method of performing content analysis on a video signal, such as a television NTSC signal, which is the basis for both the person spotting and story extraction functionality, will now be described. Once the video signal is buffered, the processor 27 of the content analyzer 25 preferably uses a Bayesian or fusion software engine, as described below, to analyze the video signal. For example, each frame of the video signal may be analyzed so as to allow for the segmentation of the video data.
[0039] With reference to FIG. 4, a preferred process of performing person spotting and recognition will be described. At level 410, face detection, speech detection, and transcript extraction are performed substantially as described above. Next, at level 420, the content analyzer 25 performs face model and voice model extraction by matching the extracted faces and speech to known face and voice models stored in the knowledge base. The extracted transcript is also scanned to match known names stored in the knowledge base. At level 430, using the model extraction and name matches, a person is spotted or recognized by the content analyzer. This information is then used in conjunction with the story extraction functionality as shown in FIG. 5.
[0040] By way of example only, a user may be interested in political events in the Mid-East but will be away on vacation on a remote island in South East Asia and thus unable to receive news updates. Using input device 120, the user can enter keywords associated with the request. For example, the user might enter Israel, Palestine, Iraq, Iran, Ariel Sharon, Saddam Hussein, etc. These key terms are stored in a user profile in a memory 29 of the content analyzer 25. As discussed above, a database of frequently used terms or persons is stored in the knowledge base of the content analyzer 25. The content analyzer 25 looks up and matches the inputted key terms with terms stored in the database. For example, the name Ariel Sharon is matched to Israeli Prime Minister, Israel is matched to the Mid-East, and so on. In this scenario, these terms might be linked to a news field type. In another example, the names of sports figures might return a sports field result.
`0041) Using the field result, the content analyzer 25
`accesses the most likely areas of the information Sources to
`find related content. For example, the information retrieval
`System might access news channels or news related web
`Sites to find information related to the request terms.
[0042] With reference now to FIG. 5, an exemplary method of story extraction will be described and shown. First, in steps 502, 504, and 506, the video/audio source is preferably analyzed to segment the content into visual, audio, and textual components, as described below. Next, in steps 508 and 510, the content analyzer 25 performs information fusion and internal segmentation and annotation. Lastly, in step 512, using the person recognition result, the segmented story is inferenced and the names are resolved with the spotted subject.
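For illustration only, the sequencing of steps 502-512 might be skeletonized as follows. The stub segmenters are placeholders standing in for the segmentation methods of paragraph [0043]; only the ordering of the steps is taken from the disclosure.

```python
# Skeleton of the story-extraction flow of FIG. 5 (steps 502-512).
# The stub bodies are placeholders; a real implementation would plug in
# the visual, audio, and transcript segmentation methods described below.

def segment_visual(video):      # step 502: cuts, face shots, text detection
    return [("visual", s) for s in video]


def segment_audio(audio):       # step 504: audio segmentation & classification
    return [("audio", s) for s in audio]


def segment_transcript(text):   # step 506: transcript topics by time slicing
    return [("topic", s) for s in text]


def extract_stories(video, audio, text, spotted_person):
    # step 508: information fusion of the three segmented streams
    fused = segment_visual(video) + segment_audio(audio) + segment_transcript(text)
    # step 510: internal story segmentation and annotation
    story = {"segments": fused}
    # step 512: inferencing and name resolution against the spotted person
    story["person"] = spotted_person
    return [story]


print(extract_stories(["cut1"], ["speech1"], ["topic1"], "Actor A"))
```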
`0043. Such methods of video segmentation include but
`are not limited to cut detection, face detection, text detec
`tion, motion estimation/segmentation/detection, camera
`motion, and the like. Furthermore, an audio component of
`the Video Signal may be analyzed. For example, audio
`Segmentation includes but is not limited to Speech to text
`conversion, audio effects and event detection, Speaker iden
`tification, program identification, music classification, and
`dialogue detection based on Speaker identification. Gener
`ally speaking, audio Segmentation involves using low-level
`audio features Such as bandwidth, energy and pitch of the
`audio data input. The audio data input may then be further
`Separated into various components, Such as music and
`Speech. Yet further, a Video Signal may be accompanied by
`transcript data (for closed captioning System), which can
`also be analyzed by the processor 27. As will be described
`further below, in operation, upon receipt of a retrieval
`request from a user, the processor 27 calculates a probability
`of the occurrence of a Story in the Video signal based upon
`the plain language of the request and can extract the
`requested Story.
`
[0044] Prior to performing segmentation, the processor 27 receives the video signal as it is buffered in a memory 29 of the content analyzer 25, and the content analyzer accesses the video signal. The processor 27 de-multiplexes the video signal to separate the signal into its video and audio components, and in some instances a text component. Alternatively, the processor 27 attempts to detect whether the audio stream contains speech. An exemplary method of detecting speech in the audio stream is described below. If speech is detected, then the processor 27 converts the speech to text to create a time-stamped transcript of the video signal. The processor 27 then adds the text transcript as an additional stream to be analyzed.
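A minimal sketch of building the time-stamped transcript stream follows. The helpers `detect_speech` and `speech_to_text` are hypothetical stand-ins for the speech detection and conversion steps just described; the chunk format is an assumption of the sketch.

```python
# Sketch: derive a time-stamped transcript stream from audio chunks.
# `detect_speech` and `speech_to_text` are hypothetical placeholders.

def detect_speech(chunk) -> bool:
    # Placeholder: a real detector would use the low-level audio
    # features (bandwidth, energy, pitch) described below.
    return chunk.get("has_voice", False)


def speech_to_text(chunk) -> str:
    return chunk.get("words", "")


def transcript_stream(audio_chunks):
    """Yield (timestamp, text) pairs for chunks that contain speech,
    forming the additional stream to be analyzed."""
    for chunk in audio_chunks:
        if detect_speech(chunk):
            yield (chunk["t"], speech_to_text(chunk))


demo = [{"t": 0.0, "has_voice": True, "words": "hello"},
        {"t": 1.5, "has_voice": False}]
print(list(transcript_stream(demo)))  # -> [(0.0, 'hello')]
```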
[0045] Whether speech is detected or not, the processor 27 then attempts to determine segment boundaries, i.e., the beginning or end of a classifiable event. In a preferred embodiment, the processor 27 performs significant scene change detection first by extracting a new keyframe when it detects a significant difference between sequential I-frames of a group of pictures. As noted above, the frame grabbing and keyframe extracting can also be performed at pre-determined intervals. The processor 27 preferably employs a DCT-based implementation for frame differencing using a cumulative macroblock difference measure. Unicolor keyframes or frames that appear similar to previously extracted keyframes get filtered out using a one-byte frame signature. The processor 27 bases this probability on the relative amount above the threshold using the differences between the sequential I-frames.
`0046. A method of frame filtering is described in U.S.
`Pat. No. 6,125,229 to Dimitrova et al. the entire disclosure
`of which is incorporated herein by reference, and briefly
`described below. Generally Speaking the processor receives
`content and formats the Video signals into frames represent
`ing pixel data (frame grabbing). It should be noted that the
`process of grabbing and analyzing frames is preferably
`performed at pre-defined intervals for each recording device.
`For instance, when the processor begins analyzing the Video
`Signal, keyframes can be grabbed every 30 Seconds.
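The keyframe selection and filtering of paragraphs [0045]-[0046] might be sketched as below. The 30-second interval comes from the disclosure; the threshold value is an assumption, and a plain pixel difference stands in for the DCT-based cumulative macroblock difference measure.

```python
import numpy as np

GRAB_INTERVAL_S = 30   # from the disclosure: grab keyframes every 30 seconds
DIFF_THRESHOLD = 20.0  # assumed threshold; the disclosure gives no value


def frame_difference(a: np.ndarray, b: np.ndarray) -> float:
    """Mean absolute pixel difference, a simple proxy for the DCT-based
    cumulative macroblock difference measure."""
    return float(np.mean(np.abs(a.astype(np.int16) - b.astype(np.int16))))


def frame_signature(frame: np.ndarray) -> int:
    """One-byte signature used to filter unicolor or near-duplicate keyframes."""
    return int(frame.mean()) & 0xFF


def select_keyframes(frames):
    """Keep frames that differ significantly from their predecessor and
    whose one-byte signature has not been seen before."""
    keyframes, prev, seen = [], None, set()
    for frame in frames:
        if prev is None or frame_difference(prev, frame) > DIFF_THRESHOLD:
            sig = frame_signature(frame)
            if sig not in seen:
                keyframes.append(frame)
                seen.add(sig)
        prev = frame
    return keyframes


# Demo on synthetic 8x8 grayscale frames: black, white, and a
# near-duplicate white frame that gets filtered out.
frames = [np.zeros((8, 8), np.uint8),
          np.full((8, 8), 255, np.uint8),
          np.full((8, 8), 254, np.uint8)]
print(len(select_keyframes(frames)))  # -> 2
```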
[0047] Once these frames are grabbed, every selected keyframe is analyzed. Video segmentation is known in the art and is generally explained in the publications entitled N. Dimitrova, T. McGee, L. Agnihotri, S. Dagtas, and R. Jasinschi, "On Selective Video Content Analysis and Filtering," presented at the SPIE Conference on Image and Video Databases, San Jose, 2000; and "Text, Speech, and Vision for Video Segmentation: The Infomedia Project" by A. Hauptmann and M. Smith, AAAI Fall 1995 Symposium on Computational Models for Integrating Language and Vision, 1995, the entire disclosures of which are incorporated herein by reference. Any segment of the video portion of the recorded data including visual (e.g., a face) and/or text information relating to a person captured by the recording devices will indicate that the data relates to that particular individual and, thus, may be indexed according to such segments. As known in the art, video segmentation includes, but is not limited to:
[0048] Significant scene change detection: wherein consecutive video frames are compared to identify abrupt scene changes (hard cuts) or soft transitions (dissolve, fade-in, and fade-out). An explanation of significant scene change detection is provided in the publication by N. Dimitrova, T. McGee, and H. Elenbaas, entitled "Video Keyframe Extraction and Filtering: A Keyframe is Not a Keyframe to Everyone", Proc. ACM Conf. on Knowledge and Information Management, pp. 113-120, 1997, the entire disclosure of which is incorporated herein by reference.
[0049] Face detection: wherein regions of each of the video frames are identified which contain skin tone and which correspond to oval-like shapes. In the preferred embodiment, once a face image is identified, the image is compared to a database of known facial images stored in the memory to determine whether the facial image shown in the video frame corresponds to the user's viewing preference. An explanation of face detection is provided in the publication by Gang Wei and Ishwar K. Sethi, entitled "Face Detection for Image Annotation", Pattern Recognition Letters, Vol. 20, No. 11, November 1999, the entire disclosure of which is incorporated herein by reference.
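As a sketch only: the disclosure describes skin-tone and oval-shape detection per Wei and Sethi, which is a different technique from the stock OpenCV Haar-cascade detector used below as a readily available substitute for locating candidate face regions.

```python
import cv2
import numpy as np

# Haar-cascade face detector shipped with OpenCV (a stand-in, not the
# skin-tone/oval method of the cited publication).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")


def detect_faces(frame_bgr: np.ndarray):
    """Return bounding boxes (x, y, w, h) of candidate face regions."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)


# Each detected region would then be compared against the database of
# known facial images, as described in the paragraph above.
print(detect_faces(np.zeros((120, 120, 3), np.uint8)))  # no faces -> empty
```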
[0050] Motion estimation/segmentation/detection: wherein moving objects are determined in video sequences and the trajectory of the moving object is analyzed. In order to determine the movement of objects in video sequences, known operations such as optical flow estimation, motion compensation, and motion segmentation are preferably employed. An explanation of motion estimation/segmentation/detection is provided in the publication by Patrick Bouthemy and Francois Edouard, entitled "Motion Segmentation and Qualitative Dynamic Scene Analysis from an Image Sequence", International Journal of Computer Vision, Vol. 10, No. 2, pp. 157-182, April 1993, the entire disclosure of which is incorporated herein by reference.
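Since optical flow estimation is named above as one of the known operations, a minimal sketch using OpenCV's Farneback dense optical flow follows; the parameters are stock OpenCV values, not from the disclosure.

```python
import cv2
import numpy as np


def motion_field(prev_gray: np.ndarray, next_gray: np.ndarray) -> np.ndarray:
    """Dense optical flow (per-pixel dx, dy) between consecutive grayscale
    frames, computed with the Farneback method and stock parameters."""
    return cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)


# Demo: a bright square shifted two pixels to the right between frames.
a = np.zeros((64, 64), np.uint8); a[20:40, 20:40] = 255
b = np.zeros((64, 64), np.uint8); b[20:40, 22:42] = 255
flow = motion_field(a, b)
print(flow[30, 30])  # approximately [2, 0]: rightward motion
```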
`0051. The audio component of the video signal may also
`be analyzed and monitored for the occurrence of words/
`Sounds that are relevant to the user's request. Audio Seg
`mentation includes the following types of analysis of Video
`programs: Speech-to-text conversion, audio effects and event
`detection, Speaker identification, program identification,
`music classification, and dialog detection based on Speaker
`identification.
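A minimal sketch of the speech/non-speech classification described here and in the following paragraph, using the low-level features (energy and a spectral centroid as a rough bandwidth/pitch cue). The thresholds are assumptions; a practical system would train a classifier on such features.

```python
import numpy as np


def low_level_features(samples: np.ndarray, rate: int):
    """Short-term energy and spectral centroid of an audio segment."""
    energy = float(np.mean(samples ** 2))
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), 1.0 / rate)
    centroid = float(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-9))
    return energy, centroid


def is_speech(samples: np.ndarray, rate: int = 16000) -> bool:
    """Crude speech/non-speech decision; thresholds are assumptions.
    Speech energy concentrates roughly below 4 kHz."""
    energy, centroid = low_level_features(samples, rate)
    return energy > 1e-4 and 100.0 < centroid < 4000.0


# Demo: a 300 Hz tone (speech-band energy) vs. silence.
t = np.linspace(0, 1, 16000, endpoint=False)
print(is_speech(0.1 * np.sin(2 * np.pi * 300 * t)))  # -> True
print(is_speech(np.zeros(16000)))                    # -> False
```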
[0052] Audio segmentation and classification includes division of the audio signal into speech and non-speech portions. The first step in audio segmentation involves segment classification using low-level audio features such as bandwidth, energy, and pitch. Channel separation is employed to sepa