US006956573B1

Bergen et al.

(10) Patent No.: US 6,956,573 B1
(45) Date of Patent: Oct. 18, 2005

(54) METHOD AND APPARATUS FOR EFFICIENTLY REPRESENTING STORING AND ACCESSING VIDEO INFORMATION

(75) Inventors: James Russell Bergen, Hopewell, NJ (US); Curtis R. Carlson, Princeton, NJ (US); Rakesh Kumar, Monmouth Jct., NJ (US); Harpreet Sawhney, Cranbury, NJ (US)

(73) Assignee: Sarnoff Corporation, Princeton, NJ (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 1414 days.

(21) Appl. No.: 08/970,889
(22) Filed: Nov. 14, 1997

Related U.S. Application Data
(60) Provisional application No. 60/031,003, filed on Nov. 15, 1996.

(51) Int. Cl.: G06T 15/00
(52) U.S. Cl.: 345/473
(58) Field of Search: 345/327, 473; 382/284, 220, 305, 236; 715/716

(56) References Cited

U.S. PATENT DOCUMENTS

4,941,125 A    7/1990  Boyne ............... 364/900
5,485,611 A    1/1996  Astle ................ 395/600
5,550,965 A    8/1996  Gabbe et al. ......... 395/154
5,635,982 A *  6/1997  Zhang et al. ......... 348/231
5,649,032 A    7/1997  Burt et al. .......... 382/284
5,657,402 A    8/1997  Bender et al. ........ 382/284
5,706,417 A    1/1998  Adelson .............. 395/129
5,751,286 A *  5/1998  Barber et al. ........ 345/348
5,821,945 A * 10/1998  Yeo et al. ........... 345/440
5,915,044 A    6/1999  Gardos et al. ........ 382/236

OTHER PUBLICATIONS

Shibata et al., "Content-Based Structuring of Video Information," 0-8186-7436-9/96, 1996 IEEE.*
Jaillon et al., "Image Mosaicing Applied to Three-Dimensional Surfaces," 1051-4651/94, 1994 IEEE.*
Smoliar et al., "Content-Based Video Indexing and Retrieval," 1070-986X/94, 1994 IEEE.*
Hong Jiang Zhang, Atreyi Kankanhalli, Stephen W. Smoliar, "Automatic Partitioning of Full-Motion Video," Multimedia Systems, pp. 10-28, 1993.
Y. Gong, C. H. Chuan, Z. Yongwei, M. Sakauchi, "A Generic Video Parsing System With A Scene Description Language (SDL)," Real-Time Imaging, Vol. 2, pp. 45-59, 1996.
H. D. Wactlar, T. Kanade, M. A. Smith, S. M. Stevens, "Intelligent Access to Digital Video: Informedia Project," IEEE Computer Society, Vol. 29, No. 5, pp. 46-52, 1996.
M. Christel, S. Stevens, T. Kanade, M. Mauldin, R. Reddy and H. Wactlar, "Techniques for the Creation and Exploration of Digital Video Libraries," Multimedia Tools and Applications, Vol. 2, pp. 1-33, 1996.

* cited by examiner

Primary Examiner: Almis R. Jankus
(74) Attorney, Agent, or Firm: William J. Burke

(57) ABSTRACT

A method and concomitant apparatus for comprehensively representing video information in a manner facilitating indexing of the video information. Specifically, a method according to the invention comprises the steps of dividing a continuous video stream into a plurality of video scenes, and at least one of the steps of dividing, using intra-scene motion analysis, at least one of the plurality of scenes into one or more layers; representing, as a mosaic, at least one of the plurality of scenes; computing, for at least one layer or scene, one or more content-related appearance attributes; and storing, in a database, the content-related appearance attributes or said mosaic representations.

23 Claims, 10 Drawing Sheets
`
`
`
Sheet 1 of 10, FIG. 1: high level block diagram of the video information processing system 100, including a video source, an authoring sub-system (segmentor, analysis engine, video information database), an ancillary information source, an image vault, an access engine, a network 160, and clients 170-1 through 170-n with display, controller and input device.

Sheet 2 of 10, FIG. 2: flow diagram of the segmentation routine (steps 205-255): calculate frame (N) description; calculate frame (N+1) description; calculate FFD of frame (N) and frame (N+1); threshold test; set scene cut flag.

Sheet 3 of 10, FIG. 3: flow diagram of the authoring routine (steps 310-345): divide scenes into foreground and background; compute intra-scene attributes; store intra-scene attribute data; compute inter-scene attributes; store inter-scene attribute data; inter-scene representations.

Sheet 4 of 10, FIG. 4: block diagram of the Video-Map client 470: camera 476-2, GPS receiver 476-1, display 472, video book program, video map program, controller 474, network interface 473 (to networks 160), storage unit 477, storage unit interface 478, and keypad 175.

Sheet 5 of 10, FIG. 5: a user holding the Video-Map embodiment of FIG. 4 and an exemplary screen display of an annotated image of the New York City skyline.

Sheet 6 of 10, FIG. 6: implementation and use steps of the Video-Map embodiment: authoring 605 (video as a map 610; annotated reference video/image database creation 612; representation of scenes as a collection of views; annotation of reference imagery with database/ancillary information 613; appearance based indexing information), access (presentation/creation of annotated image 620; indexing into the database 614; using ancillary information such as GPS, compass, etc. 622; using video information 624), and distribution 630.

Sheet 7 of 10, FIG. 7: graphical representation of the relative memory requirements of two scene storage methods, showing a scene sequence S1-Sn (710), frame sequences F1-Fm (720), a background mosaic, and elements 732, 738, 739, 742, 748, 768, 769 and 770.

Sheet 8 of 10, FIG. 8: flow diagram of the query execution routine: query type and query specification (805, 810); compute features for the specified query (820); transmit appropriate feature vector(s) to the database search engine (830); search through the stored multi-dimensional tree structures to retrieve potential matching data (840); linearly search through retrieved data to find either the top k matches or all matches within a given threshold (850); format the representative frames/mosaics of all the matching video shots for presentation along with (optionally) a visual or numeric indicator of the quality of match (860); transmit the result in a visual storyboard form for display at the browser/query/user end (870).

Sheet 9 of 10, FIG. 9: flow diagram of the attribute generation method 900: input frame (910); decimate frame to produce image pyramid; select feature and associated filter; apply N feature filters to each subband of pyramid; rectify filter outputs; generate feature map for each rectified filter output; integrate feature maps for each subband to produce attribute pyramid; store attribute pyramid; additional feature?

Sheet 10 of 10, FIG. 10: high-level function diagram of the attribute generation method: feature filters, rectify, feature maps ("feature energy"), integrate locally, and attribute pyramids.
`
`
`
`
`1
`METHOD AND APPARATUS FOR
`EFFICIENTLY REPRESENTING STORING
`AND ACCESSING WIDEO INFORMATION
`
`The invention claims benefit of U.S. Provisional Appli
`cation No. 60/031,003, filed Nov. 15, 1996.
`The invention relates to Video processing techniques and,
`more particularly, the invention relates to a method and
`apparatus for efficiently Storing and accessing Video infor
`mation.
`
`1O
`
`BACKGROUND OF THE DISCLOSURE
`
`15
`
The capturing of analog video signals in the consumer, industrial and government/military environments is well known. For example, a moderately priced personal computer including a video capture board is typically capable of converting an analog video input signal into a digital video signal, and storing the digital video signal in a mass storage device (e.g., a hard disk drive). However, the usefulness of the stored digital video signal is limited due to the sequential nature of present video access techniques. These techniques treat the stored video information as merely a digital representation of a sequential analog information stream. That is, stored video is accessed in a linear manner using familiar VCR-like commands such as PLAY, STOP, FAST FORWARD, REWIND and the like. Moreover, a lack of annotation and manipulation tools, due to, e.g., the enormous amount of data inherent in a video signal, precludes the use of rapid access and manipulation techniques common in database management applications.

Therefore, a need exists in the art for a method and apparatus for analyzing and annotating raw video information to produce a video information database having properties that facilitate a plurality of non-linear access techniques.

SUMMARY OF THE INVENTION
`
The invention is a method and apparatus for comprehensively representing video information in a manner facilitating indexing of the video information. Specifically, a method according to the invention comprises the steps of dividing a continuous video stream into a plurality of video scenes, and at least one of the steps of dividing, using intra-scene motion analysis, at least one of the plurality of scenes into one or more layers; representing, as a mosaic, at least one of the plurality of scenes; computing, for at least one layer or scene, one or more content-related appearance attributes; and storing, in a database, the content-related appearance attributes or said mosaic representations.

BRIEF DESCRIPTION OF THE DRAWINGS
`
The teachings of the present invention can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:

FIG. 1 depicts a high level block diagram of a video information processing system according to the invention;
FIG. 2 is a flow diagram of a segmentation routine suitable for use in the video information processing system of FIG. 1;
FIG. 3 is a flow diagram of an authoring routine suitable for use in the video information processing system of FIG. 1;
FIG. 4 depicts a "Video-Map" embodiment of the invention suitable for use as a stand-alone system, or as a client within the video information processing system of FIG. 1;
FIG. 5 shows a user holding the Video-Map embodiment of FIG. 4, and an exemplary screen display of an annotated image of the skyline of New York City;
FIG. 6 depicts exemplary implementation and use steps of the Video-Map embodiment of FIG. 4;
FIG. 7 is a graphical representation of the relative memory requirements of two scene storage methods;
FIG. 8 is a flow diagram of a query execution routine according to the invention; and
FIGS. 9 and 10 are, respectively, a flow diagram 900 and a high-level function diagram 1000 of an attribute generation method according to the invention.

DETAILED DESCRIPTION
`
The invention claims benefit of U.S. Provisional Application No. 60/031,003, filed Nov. 15, 1996, and incorporated herein by reference in its entirety.

The invention will be described within the context of a video information processing system. It will be recognized by those skilled in the art that various other embodiments of the invention may be realized using the teachings of the following description. As examples of such embodiments, a video-on-demand embodiment and a "Video-Map" embodiment will also be described.

The invention is directed toward providing an information database suitable for providing scene-based video information to a user. The representation may include motion or may be motionless, depending on the application. Briefly, the process of constructing the scene-based video representation may be conceptualized as a plurality of analysis steps operative upon the appropriate portions of an evolving scene representation. That is, each of the various video processing techniques that will be described below is operative on some, but not all, of the information associated with a particular scene. To illustrate this point, consider the following video processing steps (all of which will be described in more detail below): segmenting, mosaic construction, motion analysis, appearance analysis and ancillary data capture.

Segmenting comprises the process of dividing a continuous video stream into a plurality of segments, or scenes, where each scene comprises a plurality of frames, one of which is designated a "key frame."
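As a rough, hedged illustration of this segmenting step (and of the frame-description, FFD and threshold flow of FIG. 2), the sketch below uses a normalized gray-level histogram as an assumed stand-in for the per-frame description, computes a frame-to-frame difference (FFD) between consecutive descriptions, and flags a scene cut when the FFD exceeds a threshold. The function names, the histogram choice and the threshold value are illustrative, not taken from the patent.

    import numpy as np

    def frame_description(frame, bins=64):
        # Assumed per-frame description: a normalized gray-level histogram.
        hist, _ = np.histogram(frame, bins=bins, range=(0, 255), density=True)
        return hist

    def segment_scenes(frames, threshold=0.25):
        """Return lists of frame indices, one list per detected scene."""
        scenes, current = [], [0]
        prev = frame_description(frames[0])
        for n in range(1, len(frames)):
            desc = frame_description(frames[n])
            ffd = np.abs(desc - prev).sum()   # frame-to-frame difference (FFD)
            if ffd > threshold:               # threshold exceeded: set scene cut flag
                scenes.append(current)
                current = []
            current.append(n)
            prev = desc
        scenes.append(current)
        return scenes

The first frame of each returned scene could then serve as that scene's "key frame."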
Mosaic construction comprises the process of computing, for a given scene or video segment, a variety of "mosaic" representations and associated frame coordinate transforms, such as background mosaics, synopsis mosaics, depth layers, parallax maps, frame-mosaic coordinate transforms, and frame-reference image coordinate transforms. For example, in one mosaic representation a single mosaic is constructed to represent the background scenery in a scene, while individual frames in the scene include only foreground information that is related to the mosaic by an affine or a projective transformation. Thus, the 2D mosaic representation efficiently utilizes memory by storing the background information of a scene only once.
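A minimal sketch of the 2D background-mosaic idea described above, under the assumption that each frame's frame-to-mosaic transform is already known (for example, from the motion analysis described next). The 3x3 homography form, the median compositing rule and the use of OpenCV's warpPerspective are illustrative choices, not details specified by the patent.

    import numpy as np
    import cv2  # assumed available; warpPerspective applies a 3x3 projective transform

    def build_background_mosaic(frames, frame_to_mosaic, mosaic_size):
        """frames: list of gray images; frame_to_mosaic: list of 3x3 homographies."""
        w, h = mosaic_size
        warped = [cv2.warpPerspective(f.astype(np.float32), H, (w, h))
                  for f, H in zip(frames, frame_to_mosaic)]
        # The per-pixel median across warped frames suppresses transient foreground
        # objects, leaving a single stored copy of the background scenery.
        return np.median(np.stack(warped), axis=0)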
Motion analysis comprises the process of computing, for a given scene or video segment, a description of the scene or video segment in terms of: (1) layers of motion and structure corresponding to objects, surfaces and structures at different depths and orientations; (2) independently moving objects; (3) foreground and background layer representations; and (4) parametric and parallax/depth representations for layers, object trajectories and camera motion. This analysis in particular leads to the creation of the associated mosaic representations for the foreground, background and other layers in the scene/segment.
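One hedged sketch of the foreground/background separation such a motion analysis can produce: estimate a single parametric (affine) motion for the dominant background between two consecutive frames, warp the first frame by that motion, and treat pixels with large residuals as independently moving foreground. The use of OpenCV feature tracking, RANSAC affine fitting and a fixed residual threshold are assumptions made for illustration only; the frames are assumed to be 8-bit grayscale.

    import numpy as np
    import cv2  # assumed available

    def background_foreground_masks(frame_a, frame_b, residual_thresh=20.0):
        """Estimate the dominant (background) affine motion from frame_a to frame_b,
        then label pixels with large motion-compensated residuals as foreground."""
        pts_a = cv2.goodFeaturesToTrack(frame_a, maxCorners=500,
                                        qualityLevel=0.01, minDistance=7)
        pts_b, status, _ = cv2.calcOpticalFlowPyrLK(frame_a, frame_b, pts_a, None)
        good_a = pts_a[status.ravel() == 1]
        good_b = pts_b[status.ravel() == 1]
        A, _ = cv2.estimateAffine2D(good_a, good_b, method=cv2.RANSAC)  # robust 2x3 affine fit
        h, w = frame_a.shape
        warped = cv2.warpAffine(frame_a, A, (w, h))
        residual = np.abs(frame_b.astype(np.float32) - warped.astype(np.float32))
        foreground = residual > residual_thresh
        return ~foreground, foreground  # background mask, foreground mask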
Appearance analysis is the process of computing, for a frame or a layer (e.g., background, depth) of a scene or video segment, content-related attribute information such as color or texture descriptors represented as a collection of feature vectors.

Ancillary data capture comprises the process of capturing, through ancillary data streams (time, sensor data, telemetry) or manual entry, ancillary data related to some or all of the scenes or video segments.
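The captured ancillary data can be thought of as a small per-scene record; one possible shape for such a record is sketched below. Every field name here is hypothetical, chosen only to mirror the ancillary data streams (time, sensor data, telemetry) and manual annotations mentioned above.

    from dataclasses import dataclass, field
    from typing import Optional, Tuple

    @dataclass
    class AncillaryRecord:
        scene_id: int
        timestamp: float                               # capture time for the scene/segment
        gps: Optional[Tuple[float, float]] = None      # (latitude, longitude), if available
        compass_heading: Optional[float] = None        # sensor data, e.g., camera heading
        telemetry: dict = field(default_factory=dict)  # other sensor/telemetry readings
        annotation: str = ""                           # manually entered commentary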
Part of the invention is the selective use of the above-mentioned video processing steps to provide a comprehensive method of representing video information in a manner facilitating indexing of the video information. That is, the video information may be represented using some or all of the above-mentioned video processing steps, and each video processing step may be implemented in a more or less complex manner. Thus, the invention provides a comprehensive, yet flexible method of representing video for indexing that may be adapted to many different applications.

For example, a network newscast application may be adequately represented as a 2D mosaic formed using a motion analysis processing step that only separates a background layer (i.e., the news set) from a foreground object (i.e., the anchorperson). A more complex example is the representation of a baseball game as multiple layers, such as a cloud layer, a field layer and a player layer. Factors including the complexity of a scene, the type of camera motion for the scene, and the critical (or non-critical) nature of the scene content may be used as guides in determining the appropriate representation level of the scene.
FIG. 1 is a high level block diagram of a video information processing system 100 according to the invention. The video information processing system 100 comprises three functional subsystems: an authoring sub-system, an access sub-system and a distribution sub-system. The three functional subsystems non-exclusively utilize various functional blocks within the video information processing system 100. Each of the three sub-systems will be described in more detail below, and with respect to the various drawings. Briefly, the authoring sub-system 120, 140 is used to generate and store a representation of pertinent aspects of raw video information and, specifically, to logically segment, analyze and efficiently represent raw video information to produce a video information database having properties that facilitate a plurality of access techniques. The access sub-system 130, 125, 150 is used to access the video information database according to access techniques such as textual or visual indexing and attribute query techniques, dynamic browsing techniques and other iterative and relational information retrieval techniques. The distribution sub-system 130, 160, 170 is used to process accessed video information to produce video information streams having properties that facilitate controllably accurate or appropriate information stream retrieval and compositing by a client. Client-side compositing comprises the steps necessary to retrieve specific information in a form sufficient to achieve a client-side purpose.
Video information processing system 100 receives a video signal S1 from a video signal source (not shown). The video signal S1 is coupled to an authoring sub-system 120 and an image vault 150. The authoring subsystem 120 processes the video signal S1 to produce a video information database 125 having properties that facilitate a plurality of access techniques. For example, the video representative information resulting from the previously-mentioned comprehensive representation steps (i.e., segmenting, mosaic construction, motion analysis, appearance analysis and ancillary data capture) is stored in video information database 125. Video information database 125, in response to a control C1 requesting, e.g., video frames or scenes substantially matching some or all of the stored video representative information, generates an output signal S4 that flexibly provides video information representation information satisfying the request.
The video information database 125 is optionally coupled to an ancillary information source 140. The ancillary information source is used to provide non-video information associated with the video information stored in the database 125. Such information may include, e.g., positional information identifying, e.g., camera positions used to produce particular video segments or scenes. Such information may also comprise annotations, both visual and audible, that, e.g., identify portions of one or more frames or scenes, or provide some commentary relevant to one or more frames or scenes.

The image vault 150, illustratively a disk array or server specifically designed to store and distribute video information, stores the video information carried by video signal S1. The image vault 150, in response to a control signal C2 requesting, e.g., a specific video program, generates a video output signal S5.
An access engine 130, illustratively a video-on-demand server, generates control signals C1 and C2 for controlling, respectively, the annotated video database 125 and the image vault 150. The access engine 130 also receives the video output signal S5 from the image vault 150, and the output signal S4 from the video information database 125. The access engine 130, in response to a control signal C3, illustratively a video browser request or a video server request, produces a signal S6.

The access engine 130 is coupled to one or more clients (170-1 through 170-n) via a distribution network 160, illustratively a cable television network or a telecommunications network. Each client is associated with a control signal path (C3-1 through C3-n) and a signal path (S6-1 through S6-n). Each client 170 includes a display 172 and a controller 174. The controller 174 is responsive to user input via an input device 175, illustratively a remote control unit or a keyboard. In operation, a client 170 provides, e.g., textual and/or visual browsing and query requests to the access engine 130. The access engine responsively utilizes information stored in the annotated video database 125 and the image vault 150 to produce the signal S6 responsive to the client request.
The authoring and access subsystems will first be described in a general manner with respect to the video information processing system 100 of FIG. 1. The distribution subsystem will then be described within the context of several embodiments of the invention. In describing the several embodiments of the invention, several differences in the implementation of the authoring and access subsystems with respect to the embodiments will be noted.

The inventors have recognized that the problems of video sequence segmentation and video sequence searching may be addressed by the use of a short, yet highly representative description of the contents of the images. This description is in the form of a low-dimensional vector of real-valued quantities defined by the inventors as a multi-dimensional feature vector (MDFV). The MDFV "descriptor" comprises a vector descriptor of a predetermined dimensionality that is representative of one or more attributes associated with an image. An MDFV is generated by subjecting an image to a predetermined set of digital filters, where each filter is tuned to a specific range of spatial frequencies and orientations. The filters, when taken together, cover a wide range of spatial frequencies and orientations. The respective output signals from the filters are converted into an energy representation by, e.g., integrating the squared modulus of the filtered image over the image region. The MDFV comprises these energy measures.
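A hedged sketch of computing such an MDFV: the image is passed through a small bank of filters tuned to different orientations and spatial frequencies (the Gabor-like kernels below are an assumed choice, not the patent's specific filter set), each filtered output is squared and integrated over the image region, and the resulting energy measures are collected into one low-dimensional vector.

    import numpy as np
    from scipy.ndimage import convolve

    def oriented_kernel(theta, freq, size=9, sigma=2.0):
        # Assumed Gabor-like kernel tuned to one orientation and spatial frequency.
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        xr = x * np.cos(theta) + y * np.sin(theta)
        return np.exp(-(x**2 + y**2) / (2.0 * sigma**2)) * np.cos(2.0 * np.pi * freq * xr)

    def mdfv(image, orientations=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4),
             frequencies=(0.1, 0.25, 0.4)):
        """Multi-dimensional feature vector: one energy measure per filter."""
        image = image.astype(np.float32)
        energies = []
        for theta in orientations:
            for freq in frequencies:
                response = convolve(image, oriented_kernel(theta, freq))
                # Energy: integrate the squared modulus of the filtered image.
                energies.append(float(np.sum(response ** 2)))
        return np.array(energies)

With four orientations and three frequencies this yields a 12-dimensional descriptor; the dimensionality is simply the number of filters in the assumed bank.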
FIGS. 9 and 10 are, respectively, a flow diagram 900 and a high-level function diagram 1000 of an attribute generation method according to the invention. The method of FIG. 9 will be described with reference to FIG. 10. Specifically, the method 900 and function diagram 1000 are directed toward the processing of an input image I0 to produce attribute information (i.e., MDFV) in the form of an attribute pyramid.

For the purposes of appearance-based indexing, two kinds of multi-dimensional features are computed: (1) features that capture distributions without capturing any spatial constraints; and (2) features that compute local appearance and are grouped together to capture the global spatial arrangement.

The first type of features that are computed do not preserve the spatial arrangement of features within a layer or object. As described previously, the input video signal S1 optionally is divided into layers and moving objects. In particular, a layer may be the complete background scene or a portion of the background scene (with respect to objects deemed to be part of a foreground portion of the scene). For each of the layers (including potentially the complete background scene) a multi-dimensional statistical distribution is computed to capture the global appearance of the layer. Specific examples of these distributions are: (1) histograms of multi-dimensional color features chosen from a suitable space, such as Lab, YUV or RGB; (2) histograms of multi-dimensional texture-like features where each feature is the output of Gaussian and derivative and/or Gabor filters, where each filter is defined for a specific orientation and scale. These filters, which are arranged individually or as filter banks, may be efficiently computed using pyramid techniques. Multi-dimensional histograms and, in particular, many one-dimensional histograms, are defined using the output of the filters (or filter banks) at each location in a scene layer. In particular, a collection of single-dimensional histograms, such as disclosed in the above-referenced U.S. application Ser. No. 08/511,258, may be used.
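A small sketch of this first kind of feature: a global, spatial-constraint-free distribution computed for one layer, here a joint R,G,B color histogram restricted to the layer's mask. The bin count and the choice of RGB (rather than Lab or YUV) are illustrative assumptions.

    import numpy as np

    def layer_color_histogram(frame_rgb, layer_mask, bins=8):
        """Joint R,G,B histogram over the pixels belonging to one layer."""
        pixels = frame_rgb[layer_mask]                 # shape (n_pixels, 3)
        hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                                 range=((0, 256), (0, 256), (0, 256)))
        return hist / max(hist.sum(), 1.0)             # normalized distribution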
The second type of features that are computed preserve the spatial arrangement of the features within a layer or an object. The following steps are followed to create this representation. First, the locations of distinctive features are computed. Second, multi-dimensional feature vectors are computed for each location.
The locations of distinctive features are those locations in the layer or object where the appearance has some saliency. The inventors define saliency as a local maximum response of a given feature with respect to spatial scale. For instance, if a corner-like feature is selected to define saliency, then a filter corresponding to a corner detector is computed at a collection of closely spaced spatial scales for the filter. The scale may also be defined using the levels of a feature pyramid. The response of the filter is computed at each spatial location and across multiple scales. Locations where the filter response is a maximum both with respect to scale and with respect to neighboring spatial locations are chosen as salient features.
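The salient-location selection described above might be sketched as follows: compute a corner-like response (here a Harris-style measure built from image gradients, an assumed choice of detector) at several closely spaced scales, then keep locations whose response is a local maximum both across neighboring spatial positions and across scale. The scale samples and threshold are illustrative.

    import numpy as np
    from scipy.ndimage import gaussian_filter, maximum_filter, sobel

    def corner_response(image, scale):
        # Harris-style corner measure at one spatial scale (assumed detector).
        ix, iy = sobel(image, axis=1), sobel(image, axis=0)
        sxx = gaussian_filter(ix * ix, scale)
        syy = gaussian_filter(iy * iy, scale)
        sxy = gaussian_filter(ix * iy, scale)
        return (sxx * syy - sxy ** 2) - 0.04 * (sxx + syy) ** 2

    def salient_locations(image, scales=(1.0, 1.5, 2.25, 3.4), rel_thresh=0.1):
        image = image.astype(np.float32)
        stack = np.stack([corner_response(image, s) for s in scales])  # (scale, y, x)
        local_max = stack == maximum_filter(stack, size=(3, 3, 3))     # maxima over space and scale
        strong = stack > rel_thresh * stack.max()
        s_idx, ys, xs = np.nonzero(local_max & strong)
        return [(y, x, scales[i]) for y, x, i in zip(ys, xs, s_idx)]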
Multi-dimensional feature vectors are next computed at each salient location. That is, filter responses for filters at multiple scales and orientations are computed. These may be defined using Gaussian and derivative filters or Gabor filters. A collection of these filters that systematically sample the space of orientations and scales (within reasonable limits, for instance scale changes between 1/8 and 8, but in principle the limits may be arbitrary) is computed. This collection at each of the salient points becomes the multi-dimensional feature representation for that point. For each layer and object, a collection of these features along with their spatial locations is stored in a database using a kd-tree (R-tree) like multi-dimensional data structure.
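The storage step just described (a collection of salient-point feature vectors and their locations per layer or object, held in a kd-tree-like multi-dimensional structure) might look like the sketch below, which uses SciPy's cKDTree as a stand-in for the kd-tree (R-tree) structure named in the text. The class and field names are hypothetical.

    import numpy as np
    from scipy.spatial import cKDTree

    class LayerFeatureIndex:
        """Salient-point descriptors for one layer/object, searchable by similarity."""

        def __init__(self, descriptors, locations):
            self.descriptors = np.asarray(descriptors, dtype=np.float32)  # (N, D)
            self.locations = list(locations)            # (y, x, scale) for each descriptor
            self.tree = cKDTree(self.descriptors)       # kd-tree over the feature space

        def query(self, descriptor, k=5):
            # Return the k most similar stored features and where they occur.
            dists, idx = self.tree.query(np.asarray(descriptor, dtype=np.float32), k=k)
            return [(self.locations[i], float(d))
                    for i, d in zip(np.atleast_1d(idx), np.atleast_1d(dists))]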
The attribute generation method 900 of FIG. 9 is entered at step 905, when an input frame is made available. At step 910 the input frame is retrieved, and at step 915 the input frame is subjected to a known pyramid processing step (e.g., decimation) to produce an image pyramid. In FIG. 10, the input frame is depicted as an input image I0, and the pyramid processing step produces an image pyramid comprising three image pyramid subbands, I1, I2 and I3. I1 is produced by, e.g., subsampling I0; I2 is produced by, e.g., subsampling I1; and I3 is produced by, e.g., subsampling I2. Since each subband of the image pyramid will be processed in the same manner, only the processing of subband I1 will be described in detail. Moreover, an image pyramid comprising any number of subbands may be used. A suitable pyramid generation method is described in commonly assigned and copending U.S. application Ser. No. 08/511,258, entitled METHOD AND APPARATUS FOR GENERATING IMAGE TEXTURES, filed Aug. 4, 1995, and incorporated herein by reference in its entirety.
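A minimal sketch of the decimation step 915, under the assumption that each pyramid subband is obtained by low-pass filtering and subsampling the previous image (I1 from I0, I2 from I1, and so on). The Gaussian pre-filter and the factor-of-two subsampling are illustrative choices, not the method of the referenced application.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def image_pyramid(frame, levels=3, sigma=1.0):
        """Return [I1, I2, ..., I_levels], each produced by subsampling the previous image."""
        pyramid, current = [], frame.astype(np.float32)
        for _ in range(levels):
            blurred = gaussian_filter(current, sigma)   # low-pass filter before subsampling
            current = blurred[::2, ::2]                 # decimate by two in each dimension
            pyramid.append(current)
        return pyramid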
After generating an image pyramid (step 915), the attribute generation method 900 of FIG. 9 proceeds to step 920, where an attribute feature and an associated filtering scheme are selected, and to step 925, where N feature filters are used to filter each of the subbands of the image pyramid. In FIG. 10, the image subband I1 is coupled to a digital filter F1 comprising three subfilters f1-f3. Each of the three subfilters is tuned to a specific, narrow range of spatial frequencies and orientations. The type of filtering used, the number of filters used, and the range of each filter are adjusted to emphasize the type of attribute information produced. For example, the inventors have determined that color attributes are appropriately emphasized by using Gaussian filters, while texture attributes are appropriately emphasized by using oriented filters (i.e., filters looking for contrast information in differing pixel orientations). It must be noted that more or fewer than three sub-filters may be used, and that the filters may be of different types.
After filtering each of the image pyramid subbands (step 925), the attribute generation method 900 of FIG. 9 proceeds to step 930, where the filter output signals are rectified to remove any negative components. In FIG. 10, the output signal from each of the three subfilters f1-f3 of digital filter F1 is coupled to a respective subrectifier within a rectifier R1. The rectifier R1 removes negative terms by, e.g., squaring the respective output signals.

After rectifying each of the filter output signals (step 930), the attribute generation method 900 of FIG. 9 proceeds to step 935, where a feature map is generated for the attributes represented by each rectified filter output signal. In FIG. 10, feature map FM1 comprises three feature maps associated with, e.g., three spatial frequencies and orientations of subband image I1. The three feature maps are then integrated to produce a single attribute representation FM1' of subband image I1.
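A compact, hedged sketch of steps 925 through 935 for a single pyramid subband as just described: apply a bank of feature filters, rectify each output by squaring, smooth each rectified output into a feature map, and sum the maps into a single attribute image for that subband. The filter kernels are passed in (they could be, for instance, the oriented kernels sketched earlier for the MDFV), and the Gaussian local integration is an assumed choice.

    import numpy as np
    from scipy.ndimage import convolve, gaussian_filter

    def subband_attribute(subband, kernels, integration_sigma=3.0):
        """Filter one pyramid subband with N kernels, rectify, build feature maps,
        and integrate them into a single attribute representation."""
        subband = subband.astype(np.float32)
        feature_maps = []
        for kernel in kernels:                          # step 925: N feature filters
            response = convolve(subband, kernel)
            rectified = response ** 2                   # step 930: squaring removes negative terms
            feature_maps.append(gaussian_filter(rectified, integration_sigma))  # step 935: feature map
        return np.sum(feature_maps, axis=0)             # integrated attribute for this subband

Applying this to every subband of the image pyramid yields the attribute pyramid assembled at step 940.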
After generating the feature maps (step 935), the attribute generation method 900 of FIG. 9 proceeds to step 940, where the respective feature maps of each subband are integrated together in one or more integration operations to produce an attribute pyramid. In FIG. 10, the previously described processing of subband image I1 is performed for subband images I2 and I3 in substantially the same manner.
After producing the attribute pyramid related to a particular attribute (step 940), the routine 900 of FIG. 9 proceeds to step 945, where the attribute pyramid is stored, and to step 950, where a query is made as to whether any additional features of the image pyramid are to be examined. If the query at step 950 is affirmatively answered, then the routine 900 proceeds to step 920, where the next feature and its associated filter are selected. Steps 925-950 are then repeated. If the query at step 950 is negatively answered, then the routine 900 proceeds to step 955, where a query is made as to whether the next frame should be processed. If the query at step 955 is affirmatively answered, then the routine 900 proceeds to step 910, where the next frame is input. Steps 915-955 are then repeated. If the query at step 955 is negatively answered, then the routine 900 exits at step 960.
It is important to note that the attribute information generated using the above-described attribute generation method 900, 1000 occupies much less memory space than the video frame itself. Moreover, a plurality of such attributes stored in non-pyramid or pyramid form comprise an index to the underlying video information that may be efficiently accessed and searched, as will be described below.
The first functional subsystem of the video information processing system 100 of FIG. 1, the authoring sub-system 120, will now be described in detail. As previously noted, the authoring sub-system 120 is used to generate and store a representation of pertinent aspects of raw video information, such as information present in video signal S1. In the information processing system 100 of FIG. 1, the authoring subsystem 120 is implemented using three functional blocks: a video segmentor 122, an analysis engine 124 and a video information database 125. Specifically, the video segmentor 122 segments the video signal S1 into a plurality of logical segments, such as scenes, to produce a segmented video signal S2, including scene cut indicia. The analysis engine 124 analyzes one or more of a plurality of video information frames included within each segment (i.e., scene) in the segmented video signal S2 to produce an information stream S3. The information stream S3 couples, to an information database 125, information components generated by the analysis engine 124 that are used in the construction of the video information database 125. The video information database 125 may also include various annotations to the stored video information and ancillary information.
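To make the data flow of this three-block authoring subsystem concrete, here is a hedged structural sketch: a segmentor that emits scene boundaries (signal S2), an analysis engine that turns each scene into representation components (signal S3), and a database mapping that stores them. The class and record field names are hypothetical; segment_scenes and mdfv refer to earlier sketches in this description, and mosaic construction is omitted from this sketch.

    class AuthoringSubsystem:
        """Illustrative skeleton of segmentor 122 -> analysis engine 124 -> database 125."""

        def __init__(self, database, segmenter, descriptor_fn):
            self.database = database            # any mapping acting as video information database 125
            self.segmenter = segmenter          # e.g., the segment_scenes sketch above (segmentor 122)
            self.descriptor_fn = descriptor_fn  # e.g., the mdfv sketch above (analysis engine 124)

        def author(self, frames):
            # Segmentor 122: divide the raw video signal S1 into scenes (signal S2, scene cut indicia).
            for scene_id, frame_ids in enumerate(self.segmenter(frames)):
                key_frame = frame_ids[0]
                # Analysis engine 124: compute representation components (information stream S3).
                record = {
                    "frames": frame_ids,
                    "key_frame": key_frame,
                    "appearance": self.descriptor_fn(frames[key_frame]),
                    "mosaic": None,             # mosaic construction omitted in this sketch
                }
                # Video information database 125: store the scene representation for indexing.
                self.database[scene_id] = record
            return self.database

For example, AuthoringSubsystem({}, segment_scenes, mdfv).author(frames) would populate a plain dictionary with one record per detected scene.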
The segmentation, or "scene c