`
`(19) World Intellectual Property Organization
`International Bureau
`
(43) International Publication Date: 31 January 2002 (31.01.2002)

(10) International Publication Number: WO 02/08948 A2

(51) International Patent Classification⁷: G06F 17/00
`
(21) International Application Number: PCT/US01/23631

(72) Inventors; and
(75) Inventors/Applicants (for US only): SULL, Sanghoon [KR/KR]; Gaeop 4-cha Woosung Apt. 8-402, DoGop-Dong, KangNam-Ku, Seoul, 135-270 (KR). KIM, Hyeokman [KR/KR]; A-Nam Apt. 101-308, MyungRyun-Dong, Jong-Ro-Ku, Seoul, 110-521 (KR). CHOI, Hyungseok [KR/KR]; HyunDai Apt. 103-104, SsangMoon 4-Dong, Dobong-Ku, Seoul, 132-034 (KR). CHUNG, Min, Gyo [KR/KR]; DaeWon Apt. 806-901, GumGok-Dong, PunDang-Ku, SungNam City, Kyonggi, 463-480 (KR). YOON, Ja-Cheon [KR/KR]; SangRok Soo Apt. 204-303, IlWonBon-Dong, KangNam-Ku, Seoul, 135-947 (KR). OH, Jeongtaek [KR/KR]; DaeRim Apt. 207-2104, ChungGye-dong, NoWon-gu, Seoul, 139-220 (KR). LEE, Sangwook [KR/KR]; 102-801 Oksu Heights Apt., 100 Oksu Dong, Sundong-Ku, Seoul, 133-100 (KR). SONG, S., Moon-Ho [KR/KR]; Yongsan-gu Ichon-Dong 402, Gangchon Apt. 102-702, Seoul, 133-100 (KR). KIM, Jung, Rim [KR/KR]; Lotte Apt. 108-1701, Kuro-Dong,
`
(22) International Filing Date: 23 July 2001 (23.07.2001)

(25) Filing Language: English

(26) Publication Language: English

(30) Priority Data:
60/221,394    24 July 2000 (24.07.2000)       US
60/221,843    28 July 2000 (28.07.2000)       US
60/222,373    31 July 2000 (31.07.2000)       US
60/271,908    27 February 2001 (27.02.2001)   US
60/291,728    17 May 2001 (17.05.2001)        US

(71) Applicant (for all designated States except US): VIVCOM, INC. [US/US]; 4180 Wallis Ct., Palo Alto, CA 94306 (US).
`
(54) Title: SYSTEM AND METHOD FOR INDEXING, SEARCHING, IDENTIFYING, AND EDITING PORTIONS OF ELECTRONIC MULTIMEDIA FILES
`
[Front-page drawing: List of Multimedia Bookmarks (222), with Positional Information (224) and Content Information (226).]
`
(57) Abstract: A method and system are provided for tagging, indexing, searching, retrieving, manipulating, and editing video images on a wide area network such as the Internet. A first set of methods is provided for enabling users to add bookmarks to multimedia files, such as movies, and audio files, such as music. The multimedia bookmark facilitates the searching of portions or segments of multimedia files, particularly when used in conjunction with a search engine. Additional methods are provided that reformat a video image for use on a variety of devices that have a wide range of resolutions by selecting some material (in the case of smaller resolutions) or more material (in the case of larger resolutions) from the same multimedia file. Still more methods are provided for interrogating images that contain textual information (in graphical form) so that the text may be copied to a tag or bookmark that can itself be indexed and searched to facilitate later retrieval via a search engine.
`
Amazon v. Audio Pod
US Patent 10,805,111
`
`Amazon EX-1060
`
`
Kuro-gu, Seoul, 152-055 (KR). LEE, Keansub [KR/KR]; 972-2 Pyokjokgol Jugong Apt. 836-1701, Yongtong-Dong, Paldal-gu, Suwon City, Kyonggi, 463-060 (KR). CHUN, Seong, Soo [KR/KR]; Dusan Apt. 425-1402, Imae-dong, Pundang-gu, Songnam City, Kyonggi, 463-060 (KR). OH, Sangwook [KR/KR]; 609-42 Yongdam2-Dong, Cheju City, Cheju, 690-042 (KR). KIM, Yunam [KR/KR]; 2529-3daeYu JanRah Mansion 302, Nollyun-Dong, CheJu City, Cheju, 690-180 (KR).
`
(74) Agents: CHICHESTER, Ronald, L. et al.; Baker Botts L.L.P., One Shell Plaza, 910 Louisiana, Houston, TX 77002 (US).
`
(81) Designated States (national): AE, AG, AL, AM, AT, AT (utility model), AU, AZ, BA, BB, BG, BR, BY, BZ, CA, CH, CN, CO, CR, CU, CZ, DE (utility model), DK (utility model), DM, DZ, EC, EE (utility model), ES, FI (utility model), GB, GD, GE, GH, GM, HR, HU, ID, IL, IN, IS, JP, KE, KG, KP, KR (utility model), KZ, LC, LK, LR, LS, LT, LU, LV, MA, MD, MG, MK, MN, MW, MX, MZ, NO, NZ, PL, PT, RO, RU, SD, SE, SG, SI, SK (utility model), SL, TJ, TM, TR, TT, TZ, UA, UG, US, UZ, VN, YU, ZA, ZW.

(84) Designated States (regional): ARIPO patent (GH, GM, KE, LS, MW, MZ, SD, SL, SZ, TZ, UG, ZW), Eurasian patent (AM, AZ, BY, KG, KZ, MD, RU, TJ, TM), European patent (AT, BE, CH, CY, DE, DK, ES, FI, FR, GB, GR, IE, IT, LU, MC, NL, PT, SE, TR), OAPI patent (BF, BJ, CF, CG, CI, CM, GA, GN, GQ, GW, ML, MR, NE, SN, TD, TG).
`
`Published:
`without international search report and to be republished
`upon receipt of that report
`
For two-letter codes and other abbreviations, refer to the "Guidance Notes on Codes and Abbreviations" appearing at the beginning of each regular issue of the PCT Gazette.
`
`
`
`
`WO 02/08948
`
`PCT/US01/23631
`
SYSTEM AND METHOD FOR INDEXING, SEARCHING, IDENTIFYING, AND EDITING PORTIONS OF ELECTRONIC MULTIMEDIA FILES
`
Background of the Invention
`
`Field of the Invention
`
The present invention relates generally to marking multimedia files. More specifically, the present invention relates to applying or inserting tags into multimedia files for indexing and searching, as well as for editing portions of multimedia files, all to facilitate the storing, searching, and retrieving of multimedia information.
`
Background of the Related Art

1. Multimedia Bookmarks
`
With the phenomenal growth of the Internet, the amount of multimedia content that can be accessed by the public has virtually exploded. There are occasions where a user who once accessed particular multimedia content needs or desires to access the content again at a later time, possibly at or from a different place. For example, in the case of data interruption due to a poor network condition, the user may be required to access the content again.

In another case, a user who once viewed multimedia content at work may want to continue viewing the content at home. Most users would want to resume accessing the content from the point where they left off. Moreover, subsequent access may be initiated by a different user in an exchange of information between users. Unfortunately, multimedia content is represented in a streaming file format, so a user has to view the file from the beginning in order to look for the exact point where the first user left off.
`
In order to save the time involved in browsing the data from the beginning, the concept of a bookmark may be used. A conventional bookmark marks a document, such as a static web page, for later retrieval by saving a link (address) to the document. For example, Internet browsers support a bookmark facility by saving an address, called a Uniform Resource Identifier (URI), to a particular file. Internet Explorer, manufactured by the Microsoft Corporation of Redmond, Washington, uses the term “favorite” to describe a similar concept.
`
`
Conventional bookmarks, however, store only the information related to the location of a file, such as the directory name with a file name, a Uniform Resource Locator (URL), or the URI. The files referred to by conventional bookmarks are treated in the same way regardless of the data formats used to store the content. Typically, a simple link is used for multimedia content as well. For example, a URI is used to link to a multimedia content file through the Internet. Each time the file is revisited using the bookmark, the multimedia content associated with the bookmark is always played from the beginning.
`
Figure 1 illustrates a list 108 of conventional bookmarks 110, each comprising positional information 112 and a title 114. The positional information 112 of a conventional bookmark is composed of a URI as well as a bookmarked position 106. The bookmarked position is a relative time or byte position measured from the beginning of the multimedia content. The title 114 can be specified by a user or delivered with the content, and it is typically used to help the user easily recognize the bookmarked URI in a bookmark list 108. For a conventional bookmark without a bookmarked position, when a user wants to replay the specified multimedia file, the file is played from the beginning each time, regardless of how much of the file the user has already viewed. The user has no choice but to record the last-accessed position on a memo and to move manually to the last stopped point. If the multimedia file is viewed by streaming, the user must go through a series of buffering operations to find the last-accessed position, thus wasting much time. Even for a conventional bookmark with a bookmarked position, the same problem occurs when the multimedia content is delivered as a live broadcast, since the bookmarked position within the multimedia content is not usually available, as well as when the user wants to replay one of the variations of the bookmarked multimedia content.
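To make the distinction above concrete, here is a minimal Python sketch of the two conventional bookmark forms: URI only, and URI plus a bookmarked position. The class and function names are hypothetical illustrations, not taken from the source, and the position is assumed to be a time offset in seconds rather than a byte offset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConventionalBookmark:
    """Hypothetical model of bookmark 110: positional information plus a title."""
    uri: str                               # location of the multimedia file
    title: str                             # user-supplied or delivered with the content
    position_sec: Optional[float] = None   # bookmarked position, measured from the start


def playback_start(bookmark: ConventionalBookmark) -> float:
    """Return the media time (seconds) at which replay begins."""
    # Without a bookmarked position, replay always starts at the beginning.
    if bookmark.position_sec is None:
        return 0.0
    return bookmark.position_sec
```

Neither form records anything about the content itself, so a stored offset is only meaningful for the exact file it was taken from.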
`
Further, conventional bookmarks do not provide a convenient way of switching between different data formats. Multimedia content may be generated and stored in a variety of formats. For example, video may be stored in formats such as MPEG, ASF, RM, MOV, and AVI. Audio may be stored in formats such as MID, MP3, and WAV. There may be occasions where a user wants to switch the play of content from one format to another. Since different data formats produced from the same multimedia content are often encoded independently, the same segment is stored at different temporal positions within the different formats. Since conventional bookmarks have no facility to store any content information, users have no choice but to review the multimedia content from the beginning and to search manually for the last-accessed segment within the content.
`
Time information may be incorporated into a bookmark to return to the last-accessed segment within the multimedia content. The use of time information alone, however, fails to return to exactly the same segment at a later time, for the following reasons.

If a bookmark incorporating time information was used to save the last-accessed segment during the preview of a multimedia content broadcast, the bookmark information would not be valid for returning to the last-accessed segment during a regular full-version broadcast. Similarly, if a bookmark incorporating time information was used to save the last-accessed segment during a real-time broadcast, the bookmark would not be effective during later access, because the later available version may have been edited, or a time code was not available during the real-time broadcast.
`
In many video and audio archiving systems, several differently compressed files, called "variations", may be produced from a single source multimedia content. Many web-casting sites provide multiple streaming files for a single video content, with different bandwidths for each video format. For example, CNN.com provides five different streaming videos for a single video content: two different types of streaming videos with bandwidths of 28.8 kbps and 80 kbps, both encoded in Microsoft's Advanced Streaming Format (ASF). CNN.com also provides the RM streaming format of RealNetworks, Inc. of Seattle, Washington (RM), and a streaming video with smart bandwidth encoded in Apple Computer, Inc.’s QuickTime streaming format (MOV). In this case, the five video files may start and end at different time points from the viewpoint of the source video content, since each variation may be produced by an independent encoding process with different values chosen for encoding format, bandwidth, resolution, etc. This results in mismatched time points, because a specific time point of the source video content may be presented as different media time points in the five video files.
When a multimedia bookmark is utilized, the mismatched positions cause a problem of mis-positioned playback. Consider a simple case where one makes a multimedia bookmark on a master file of a multimedia content (for example, video encoded in a given format), and tries to play another variation (for example, video encoded in a different format) from the bookmarked position. If the two variations do not start at the same position of the source content, the playback will not start at the bookmarked position. That is, the playback will start at a position that is temporally shifted by the difference between the start positions of the two variations.
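The shift can be expressed as simple arithmetic on a shared source timeline. The sketch below assumes that the start offset of each variation relative to the source content is already known (the text does not say how such offsets would be obtained), and the function names are hypothetical.

```python
def to_source_time(media_time: float, variation_start: float) -> float:
    """Media time within a variation -> time on the source content timeline."""
    return media_time + variation_start


def remap(media_time: float, from_start: float, to_start: float) -> float:
    """Translate a bookmarked media time from one variation to another."""
    target = to_source_time(media_time, from_start) - to_start
    if target < 0:
        raise ValueError("bookmarked point precedes the start of the target variation")
    return target


# Assumed offsets: the master file starts 2.0 s into the source, the ASF variation 5.0 s in.
bookmark = 60.0
corrected = remap(bookmark, from_start=2.0, to_start=5.0)
# Seeking to the raw bookmark value in the ASF file lands late on the source timeline,
# by exactly the difference between the two start positions.
naive_error = to_source_time(bookmark, 5.0) - to_source_time(bookmark, 2.0)
```

Here `corrected` is 57.0 s in the ASF file, and `naive_error` is the 3.0 s shift described above.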
`
The entire multimedia presentation is often lengthy. However, there are frequent occasions when the presentation is interrupted, voluntarily or forcibly, and terminates before finishing. Examples include a user who starts playing a video at work, leaves the office, and desires to continue watching the video at home, or a user who may be forced to stop watching the video and log out due to a system shutdown. It is thus necessary to save the termination position of the multimedia file into persistent storage in order to return directly to the point of termination without a time-consuming playback of the multimedia file from the beginning.
`
The interrupted presentation of the multimedia file will usually resume exactly at the previously saved termination position. However, in some cases, it is desirable to begin the playback of the multimedia file a certain time before the termination point, since such rewinding could help refresh the user’s memory.
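A minimal sketch of saving the termination position to persistent storage and resuming slightly earlier follows. JSON text stands in for the persistent store, the class name is hypothetical, and the 10-second default rewind is an arbitrary assumption; the text only says "a certain time before".

```python
import json
from typing import Dict


class ResumeStore:
    """Hypothetical store of last termination positions (seconds), keyed by URI."""

    def __init__(self, serialized: str = "{}") -> None:
        # Restore previously persisted positions (e.g. read back from disk or a server).
        self._positions: Dict[str, float] = json.loads(serialized)

    def save(self, uri: str, position_sec: float) -> None:
        """Record where playback of `uri` was terminated."""
        self._positions[uri] = position_sec

    def dumps(self) -> str:
        """Serialize the positions so they can be written to persistent storage."""
        return json.dumps(self._positions)

    def resume_position(self, uri: str, rewind_sec: float = 10.0) -> float:
        """Position to resume from: a little before the saved point, never below zero."""
        saved = self._positions.get(uri, 0.0)
        return max(0.0, saved - rewind_sec)
```

Setting `rewind_sec=0.0` gives the usual behavior of resuming exactly at the saved point.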
`
In the prior art, the EPG (Electronic Program Guide) has played a crucial role as a provider of TV programming information. The EPG facilitates a user’s efforts to search for TV programs that he or she wants to view. However, the EPG’s two-dimensional presentation (channels vs. time slots) becomes cumbersome as terrestrial, cable, and satellite systems send out thousands of programs through hundreds of channels. Navigating through a large table of rows and columns in order to search for desired programs is frustrating.
`
One of the features provided by recent set-top boxes (STBs) is personal video recording (PVR), which allows simultaneous recording and playback. Such an STB usually contains a digital video encoder/decoder based on an international digital video compression standard such as MPEG-1/2, as well as large local storage for the digitally compressed video data. Some of the recent STBs also allow connection to the Internet. Thus, STB users can experience new services such as time-shifting and web-enhanced television (TV).
`
However, there still exist some problems for PVR-enabled STBs. The first problem is that even the latest STBs alone cannot fully satisfy users’ ever-increasing desire for diverse functionalities. The STBs now on the market are very limited in terms of computing power and memory, so it is not easy to execute most CPU- and memory-intensive applications. For example, people who are bored with plain playback of the recorded video may desire more advanced features such as video browsing/summary and search. All of those features require metadata for the recorded video. The metadata are usually data describing content, such as the title, genre, and summary of a television program. The metadata also include audiovisual characteristic data, such as raw image data corresponding to a specific frame of the video stream. Some of the description is structured around "segments" that represent spatial, temporal, or spatio-temporal components of the audio-visual content. In the case of video content, a segment may be a single frame, a single shot consisting of successive frames, or a group of several successive shots. Each segment may be described by some elementary semantic information using text. The segment is referenced by the metadata using media locators such as frame numbers or time codes. However, the generation of such video metadata usually requires intensive computation and a human operator’s help, so, practically speaking, it is not feasible to generate the metadata in current STBs. Thus, one possible solution to this problem is to generate the metadata in a server connected to the STB and to deliver it to the STB via a network. In this scenario, however, it is essential to know the start position of the recorded video with respect to the video stream used to generate the metadata in the server/content provider, in order to match the temporal position referenced by the metadata to the position in the recorded video.
`
The second problem is related to the discrepancy between two time instants: the time instant at which the STB starts recording the user-requested TV program, and the time instant at which the TV program is actually broadcast. Suppose, for instance, that a user initiated a PVR request for a TV program scheduled to go on the air at 11:30 AM, but the actual broadcasting time is 11:31 AM. In this case, when the user wants to play the recorded program, the user has to watch an unwanted segment, lasting one minute, at the beginning of the recorded video. This time mismatch could inconvenience a user who wants to view only the requested program. However, the time mismatch problem can be solved by using metadata delivered from the server, for example, reference frames/segments representing the beginning of the TV program. The exact location of the TV program can then be easily found by simply matching the reference frames with all the recorded frames for the program.
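The matching step can be sketched as a sliding-window search. This assumes each frame has already been reduced to a small numeric feature vector (for instance, a downsampled grayscale thumbnail) and uses mean absolute difference as the frame distance. Both choices, and the function names, are illustrative assumptions rather than details given in the text.

```python
from typing import Sequence

Frame = Sequence[float]  # hypothetical per-frame feature vector


def frame_distance(a: Frame, b: Frame) -> float:
    """Mean absolute difference between two equally sized feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def find_program_start(recorded: Sequence[Frame], reference: Sequence[Frame]) -> int:
    """Index of the recorded frame at which the reference frames match best."""
    if not reference or len(reference) > len(recorded):
        raise ValueError("need a non-empty reference no longer than the recording")
    best_index, best_cost = 0, float("inf")
    # Slide the reference sequence over every possible start position.
    for start in range(len(recorded) - len(reference) + 1):
        cost = sum(frame_distance(recorded[start + k], ref)
                   for k, ref in enumerate(reference))
        if cost < best_cost:
            best_index, best_cost = start, cost
    return best_index
```

Dividing the returned index by the frame rate gives the offset to skip; in the 11:31 AM example at 30 frames per second, the match would be expected near frame 1800. The exhaustive scan is the simplest correct form of the idea, not an efficient one.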
`
2. Search
`
The rapid expansion of the World Wide Web (WWW) and mobile communications has also brought great interest in efficient multimedia data search, browsing, and management. Content-based image retrieval (CBIR) is a powerful concept for finding images based on image content, and content-based image search and browsing have been tested using many CBIR systems. See, M. Flickner, Harpreet Sawhney, Wayne Niblack, Jonathan Ashley, Q. Huang, Byron Dom, Monika Gorkani, Jim Hafner, Denis Lee, Dragutin Petkovic, David Steele and Peter Yanker, "Query by image and video content: The QBIC system," IEEE Computer, Vol. 28, No. 9, pp. 23-32, Sept., 1995; Carson, Chad, et al., "Region-Based Image Querying [Blobworld]," Workshop on Content-Based Access of Image and Video Libraries, Puerto Rico, Jun., 1997; J. R. Smith and S. Chang, "Visually searching the web for content," IEEE Multimedia Magazine, Vol. 4, No. 3, pp. 12-20, Summer 1997, also Columbia U. CU/CTR Technical Report 459-96-25; A. Pentland, R. W. Picard and S. Sclaroff, "Photobook: tools for content-based manipulation of image databases," in Proc. of SPIE Conf. on Storage and Retrieval for Image and Video Databases II, No. 2185, pp. 34-47, San Jose, CA, Feb., 1994; J. R. Bach, C. Fuller, A. Gupta, A. Hampapur, B. Horowitz, R. Humphrey, R. C. Jain and C. Shu, "Virage image search engine: an open framework for image management," Symposium on Electronic Imaging: Science and Technology -- Storage & Retrieval for Image and Video Databases IV, IS&T/SPIE’96, Feb., 1996; J. R. Smith and S. Chang, "VisualSEEk: A Fully Automated Content-Based Image Query System," ACM Multimedia Conference, Boston, MA, Nov. 1996; Jing Huang, S. Ravi Kumar, Mandar Mitra, Wei-Jing Zhu and Ramin Zabih, "Image Indexing Using Color Correlograms," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 762-768, Jun., 1997; and Simone Santini and Ramesh Jain, "The 'El Nino' Image Database System," in International Conference on Multimedia Computing and Systems, pp. 524-529, Jun., 1999.
`
Currently, most content-based image search engines rely on low-level image features such as color, texture, and shape. While high-level image descriptors are potentially more intuitive for common users, the derivation of high-level descriptors is still in its experimental stages in the field of computer vision and requires complex vision processing. On the other hand, despite their efficiency and ease of implementation, low-level image features have the main disadvantage of being perceptually non-intuitive for both expert and non-expert users, and therefore they do not normally represent users' intent effectively. Furthermore, they are highly sensitive to small amounts of image variation in feature shape, size, position, orientation, brightness, and color. Perceptually similar images are often highly dissimilar in terms of low-level image features. Searches made using low-level features are often unsuccessful, and it usually takes many trials to find images satisfactory to a user.
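The brightness sensitivity mentioned above can be shown with a toy example. It assumes a grayscale image summarized by a 4-bin intensity histogram and compared by histogram intersection; these are common low-level choices picked for illustration, not ones the text singles out.

```python
from typing import List, Sequence


def gray_histogram(pixels: Sequence[int], bins: int = 4, levels: int = 256) -> List[float]:
    """Normalized intensity histogram of 8-bit grayscale pixel values."""
    hist = [0] * bins
    for p in pixels:
        hist[p * bins // levels] += 1
    return [count / len(pixels) for count in hist]


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Similarity in [0, 1]; 1.0 means identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


image = [60] * 50 + [200] * 50          # half dark, half bright
brighter = [p + 10 for p in image]       # same picture, slightly brighter
same = histogram_intersection(gray_histogram(image), gray_histogram(image))
shifted = histogram_intersection(gray_histogram(image), gray_histogram(brighter))
```

A brightness change of 10 out of 256 levels, which a viewer might barely notice, drops the similarity from 1.0 to 0.5 because the dark pixels cross a bin boundary.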
`
Efforts have been made to overcome the limitations of low-level features. Relevance feedback is a popular idea for incorporating a user’s perceptual feedback into the image search. See, Y. Rui, T. Huang, and S. Mehrotra, "A relevance feedback architecture in content-based multimedia information retrieval systems," in IEEE Workshop on Content-based Access of Image and Video Libraries, Puerto Rico, pp. 82-89, Jun., 1997; Yong Rui, Thomas S. Huang, Michael Ortega, and Sharad Mehrotra, "Relevance Feedback: A Power Tool in Interactive Content-Based Image Retrieval," in IEEE Trans. on Circuits and Systems for Video Technology, Special Issue on Segmentation, Description, and Retrieval of Video Content, pp. 644-655, Vol. 8, No. 5, Sept., 1998; G. Aggarwal, P. Dubey, S. Ghosal, A. Kulshreshtha, and A. Sarkar, "iPURE: perceptual and user-friendly retrieval of images," in Proc. of IEEE International Conference on Multimedia and Exposition, Vol. 2, pp. 693-696, Jul., 2000; Ye Lu, Chunhui Hu, Xingquan Zhu, HongJiang Zhang and Qiang Yang, "A unified framework for semantics and feature based relevance feedback in image retrieval systems," in Proc. of ACM International Conference on Multimedia, pp. 31-37, Oct., 2000; H. Muller, W. Muller, S. Marchand-Maillet, and T. Pun, "Strategies for positive and negative relevance feedback in image retrieval," in Proc. of IEEE Conference on Pattern Recognition, Vol. 1, pp. 1043-1046, Sept., 2000; S. Aksoy, R. M. Haralick, F. A. Cheikh, and M. Gabbouj, "A weighted distance approach to relevance feedback," in Proc. of IEEE Conference on Pattern Recognition, Vol. 4, pp. 812-815, Sept., 2000; I. J. Cox, M. L. Miller, T. P. Minka, T. V. Papathomas, and P. N. Yianilos, "The Bayesian image retrieval system, PicHunter: theory, implementation, and psychophysical experiments," in IEEE Transactions on Image Processing, Vol. 9, pp. 20-37, Jan., 2000; P. Muneesawang and Guan Ling, "Multi-resolution-histogram indexing and relevance feedback learning for image retrieval," in Proc. of IEEE International Conference on Image Processing, Vol. 2, pp. 526-529, Jan., 2001. A user can manually establish relevance between a query and retrieved images, and the relevant images can be used to refine the query. When the refinement is made by adjusting a set of low-level feature weights, however, the user’s intent is still represented by low-level features, and their basic limitations still remain.
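Feature-weight refinement can be sketched as follows. The sketch assumes a weighted Euclidean distance and sets each weight inversely proportional to the spread of that feature among the images the user marked relevant, so that features on which the relevant images agree count more. This is one common heuristic, used here for illustration; it is not presented as the scheme of any specific cited paper, and the two-feature database is invented.

```python
import math
from statistics import pstdev
from typing import Dict, List, Sequence


def weighted_distance(x: Sequence[float], y: Sequence[float], w: Sequence[float]) -> float:
    return math.sqrt(sum(wi * (a - b) ** 2 for a, b, wi in zip(x, y, w)))


def reweight(relevant: List[Sequence[float]], eps: float = 1e-6) -> List[float]:
    """Weights inversely proportional to each feature's spread among relevant images."""
    dims = len(relevant[0])
    inverse = [1.0 / (pstdev([v[d] for v in relevant]) + eps) for d in range(dims)]
    total = sum(inverse)
    return [i / total for i in inverse]  # normalize so the weights sum to 1


def rank(query: Sequence[float], database: Dict[str, Sequence[float]],
         w: Sequence[float]) -> List[str]:
    """Image names ordered from most to least similar to the query."""
    return sorted(database, key=lambda name: weighted_distance(query, database[name], w))


# Hypothetical features: [dominant-color score, texture score].
database = {"sunset_a": [0.9, 0.9], "beach_b": [0.5, 0.2]}
query = [0.9, 0.2]
# The user marks images with the same color but varied texture as relevant.
weights = reweight([[0.9, 0.1], [0.9, 0.5], [0.9, 0.9]])
```

With equal weights `beach_b` ranks first; after reweighting, color dominates and `sunset_a` ranks first. The refined query is still expressed purely in low-level feature terms, which is the limitation the text points out.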
`
Several approaches have been made to integrate human perceptual responses and low-level features in image retrieval. One notable approach is to adjust an image’s feature distance attributes based on human perceptual input. See, Simone Santini and Ramesh Jain, "The 'El Nino' Image Database System," in International Conference on Multimedia Computing and Systems, pp. 524-529, Jun., 1999. Another approach, called "blob world," combines low-level features to derive slightly higher-level descriptions and presents the "blobs" of grouped features to a user to provide a better understanding of feature characteristics. See, Carson, Chad, et al., "Region-Based Image Querying [Blobworld]," Workshop on Content-Based Access of Image and Video Libraries, Puerto Rico, Jun., 1997. While those schemes successfully reflect a user’s intent to some degree, it remains to be seen how grouping of features or feature distance modification can achieve perceptual relevance in image retrieval.
`
A more traditional computer vision approach to the derivation of high-level object descriptors based on generic object recognition has been presented for image retrieval. See, David A. Forsyth and Margaret Fleck, "Body Plans," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 678-683, Jun., 1997. Due to its limited feasibility for general image objects and its complex processing, its utility is still restricted.
With the rapid proliferation of large image/video databases, there has been an increasing demand for effective methods to search large image/video databases automatically by their content. For a query image/video clip given by a user, these methods search the databases for the images/videos that are most similar to the query. In other words, the goal of image/video search is to find the best matches to the query image/video in the database.
`
`
Several approaches have been made towards the development of fast, effective multimedia search methods. Milanese et al. utilized hierarchical clustering to organize an image database into visually similar groupings. See, R. Milanese, D. Squire, and T. Pun, "Correspondence analysis and hierarchical indexing for content-based image retrieval," in Proc. IEEE Int. Conf. Image Processing, Vol. 3, Lausanne, Switzerland, pp. 859-862, Sept., 1996. Zhang and Zhong provided a hierarchical self-organizing map (HSOM) method to organize an image database into a two-dimensional grid. See, H. J. Zhang and D. Zhong, "A scheme for visual feature based image indexing," in Proc. SPIE/IS&T Conf. Storage Retrieval Image Video Database III, Vol. 2420, pp. 36-46, San Jose, CA, Feb., 1995. However, a weakness of HSOM is that it is generally too computationally expensive to apply to a large multimedia database.
`
In addition, there are other well-known solutions using the Voronoi diagram, the Kd-tree, and the R-tree. See, J. Bentley, "Multidimensional binary search trees used for associative searching," Comm. of the ACM, Vol. 18, No. 9, pp. 509-517, 1975; S. Brin, "Near neighbor search in large metric spaces," in Proc. 21st Conf. on Very Large Databases (VLDB’95), Zurich, Switzerland, pp. 574-584, 1995. However, it is also known that those approaches are not adequate for high-dimensional feature vector spaces, and thus they are useful only in low-dimensional feature spaces.
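For reference, a Kd-tree (the structure introduced in the Bentley paper cited above) can be sketched in a few lines. This is a textbook-style nearest-neighbor search written for illustration; the point coordinates are made up.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    point: Point
    axis: int                      # coordinate used to split at this node
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(points: Sequence[Point], depth: int = 0) -> Optional[Node]:
    """Recursively split on the median, cycling through the coordinates."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(ordered[mid], axis,
                build(ordered[:mid], depth + 1),
                build(ordered[mid + 1:], depth + 1))


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node: Optional[Node], target: Point,
            best: Optional[Point] = None) -> Optional[Point]:
    """Nearest stored point to `target`, pruning subtrees that cannot hold a closer one."""
    if node is None:
        return best
    if best is None or sq_dist(target, node.point) < sq_dist(target, best):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # The far side can only help if the splitting plane is closer than the current best.
    if diff ** 2 < sq_dist(target, best):
        best = nearest(far, target, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
```

The pruning test compares the squared distance to the splitting plane with the best squared distance found so far. In high-dimensional feature spaces this test rarely eliminates a branch, so the search degrades toward visiting every node, which is consistent with the limitation noted above.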
`
Peer-to-Peer Searching

Peer-to-Peer (P2P) is a class of applications making the most of previously unused resources (for example, storage, content, and/or CPU cycles) that are available on the peers at the edges of networks. P2P computing allows the peers to share resources and services, to aggregate CPU cycles, or to chat with each other by direct exchange. Two of the more popular implementations of P2P computing are Napster and Gnutella. Napster has its peers register files with a broker, and uses the broker to search for files to copy. The broker plays the role of the server in a client-server model to facilitate the interaction between the peers. Gnutella has peers register files with network neighbors, and searches the P2P network for files to copy. Since this model does not require a centralized broker, Gnutella is considered to be a true P2P system.
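The two search models can be contrasted in a small simulation. The sketch is a deliberate simplification: the broker is an in-memory index, and the flooding search forwards a query breadth-first with a hop limit (time-to-live), which reflects the general idea behind Gnutella's query propagation rather than its actual wire protocol. All peer and file names are invented.

```python
from collections import deque
from typing import Dict, Iterable, Set


class Broker:
    """Napster-style central index: peers register files, queries go to the broker."""

    def __init__(self) -> None:
        self.index: Dict[str, Set[str]] = {}

    def register(self, peer: str, filename: str) -> None:
        self.index.setdefault(filename, set()).add(peer)

    def search(self, filename: str) -> Set[str]:
        return set(self.index.get(filename, set()))


def flood_search(neighbors: Dict[str, Iterable[str]], shared: Dict[str, Set[str]],
                 origin: str, filename: str, ttl: int) -> Set[str]:
    """Gnutella-style search: forward the query hop by hop until the TTL runs out."""
    hits: Set[str] = set()
    seen = {origin}
    frontier = deque([(origin, ttl)])
    while frontier:
        peer, remaining = frontier.popleft()
        if peer != origin and filename in shared.get(peer, set()):
            hits.add(peer)
        if remaining == 0:
            continue  # the query is not forwarded any further
        for nxt in neighbors.get(peer, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, remaining - 1))
    return hits
```

The broker answers any query with one lookup but is a central point of failure; flooding needs no central server, but a file held beyond the hop limit is simply not found.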
`
3. Editing
`
`
In the prior art, video files were edited with video editing software by copying several segments of the input videos and pasting them into an output video. The prior art method, however, confronts the two major problems described below.

The first problem of the prior art method is that it requires additional storage to store the new version of an edited video file. Conventional video editing software generally uses the original input video file to create an edited video. In most cases, editors having a large database of videos attempt to edit the videos to create a new one. In this case, storage is wasted on duplicated portions of the video.
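One way to avoid the duplicated storage described above is to represent an edited video as a list of references into the source files instead of copied frames. The sketch below illustrates that idea under assumed, hypothetical names (`Clip`, `locate`) and second-based times; it is an illustration, not a method stated in this passage.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Clip:
    """A reference to [start_sec, end_sec) of a source file; no frames are copied."""
    source_uri: str
    start_sec: float
    end_sec: float

    @property
    def duration(self) -> float:
        return self.end_sec - self.start_sec


def total_duration(clips: List[Clip]) -> float:
    return sum(c.duration for c in clips)


def locate(clips: List[Clip], t: float) -> Tuple[str, float]:
    """Map a time in the edited output to (source URI, time within that source)."""
    for clip in clips:
        if t < clip.duration:
            return clip.source_uri, clip.start_sec + t
        t -= clip.duration  # skip past this clip in the output timeline
    raise ValueError("time lies beyond the end of the edited video")
```

Because each clip points back into its source, metadata already attached to that source interval could in principle be looked up rather than regenerated.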
`
The second problem with the prior art method is that an entirely new set of metadata has to be generated for a newly created video. If the metadata are not edited in accordance with the editing of the video, the metadata may not accurately reflect the content, even if the metadata for the specific segment of the input video have already been constructed. Because considerable effort is required to create the metadata of videos, it is desirable to efficiently reuse existing metadata, if possible.
`
Metadata of a video segment contain textual information such as time information (for example, the starting frame number and duration, or the starting frame number and the finishing frame number), title, keyword, and annotation, as well as image information such as the key frame of a segment. The metadata of segments can form a hierarchical structure in which a larger segment contains smaller segments. Because it is hard to store both the video and its metadata in a single file, the video metadata are stored separately as a metafile, or stored in a database management system (DBMS).
`
If metadata having a hierarchical structure are used, browsing a whole video, searching for a segment using the keyword and annotation of each segment, and using the key frames of each segment for a visual summary of the video are all supported. Also, such metadata support not only the existing simple playback, but also the playback and repeated playback of a specific segment. Therefore, the use of hierarchically structured metadata is becoming popular.
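The hierarchical segment metadata described in the two paragraphs above can be sketched as a tree. Field names follow the items listed in the text (start frame and duration, title, keywords, key frame); the class itself, the helper functions, and the example newscast are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Segment:
    title: str
    start_frame: int
    duration: int                                   # length in frames
    keywords: List[str] = field(default_factory=list)
    key_frame: Optional[int] = None                 # frame used in a visual summary
    children: List["Segment"] = field(default_factory=list)

    @property
    def end_frame(self) -> int:
        return self.start_frame + self.duration


def walk(seg: Segment, depth: int = 0) -> Iterator[Tuple[int, Segment]]:
    """Depth-first traversal, e.g. for rendering a browsable table of contents."""
    yield depth, seg
    for child in seg.children:
        yield from walk(child, depth + 1)


def is_well_nested(seg: Segment) -> bool:
    """True if every child segment lies within its parent's frame range."""
    return all(seg.start_frame <= c.start_frame and c.end_frame <= seg.end_frame
               and is_well_nested(c) for c in seg.children)


def find_by_keyword(root: Segment, keyword: str) -> List[Segment]:
    return [s for _, s in walk(root) if keyword in s.keywords]


news = Segment("Evening news", 0, 9000, children=[
    Segment("Weather", 0, 1800, ["weather"]),
    Segment("Sports", 1800, 7200, ["sports", "soccer"], children=[
        Segment("Goal highlight", 3000, 300, ["soccer"], key_frame=3120),
    ]),
])
```

A keyword search returns segments at every level, so a result can be played back (or repeated) by its own frame range without touching the rest of the video.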
`
4. Transcoding
With the advance of information technology, such as the popularity of the Internet, multimedia presentation is proliferating into ever-increasing kinds of media, including wireless media. Multimedia data are accessed by ever-increasing kinds of devices, such as hand-held computers (HHCs), personal digital assistants (PDAs), and smart cellular phones. There is a need for accessing multimedia content in a universal fashion from a wide variety of devices. See, J. R. Smith, R. Mohan and C. Li, "Transcoding Internet Content for Heterogeneous Client Devices," in Proc. ISCAS, Monterey, California, 1998.
`
Several approaches have been made to effectively enable such universal multimedia access (UMA). A data representation, the InfoPyramid, is a framework for aggregating the individual components of multimedia content with content descriptions, and methods and rules for handling the content and content descriptions. See, C. Li, R. Mohan and J. R. Smith, "Multimedia Content Description in the InfoPyramid," in Proc. IEEE Intern. Conf. on Acoustics, Speech and Signal Processing, May, 1998. The InfoPyramid describes content in different modalities, at different resolutions, and at multiple abstractions. A transcoding tool then dynamically selects from the InfoPyramid the resolutions or modalities that best meet the client capabilities.
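The selection step can be sketched as choosing, from a set of prepared versions, the richest one that fits the client. This is a simplified stand-in and not the InfoPyramid's actual data model: each version is assumed to be described only by modality, frame size, and bit rate, and modalities are assumed to rank video above image above text. All values are invented.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

MODALITY_RANK = {"text": 0, "image": 1, "video": 2}  # assumed preference order


@dataclass(frozen=True)
class Version:
    modality: str
    width: int
    height: int
    kbps: float


@dataclass(frozen=True)
class Client:
    width: int
    height: int
    kbps: float


def select_version(versions: Sequence[Version], client: Client) -> Optional[Version]:
    """Richest version whose frame size and bit rate the client can handle."""
    fitting = [v for v in versions
               if v.width <= client.width and v.height <= client.height
               and v.kbps <= client.kbps]
    if not fitting:
        return None
    # Prefer a richer modality, then a larger frame, then a higher bit rate.
    return max(fitting, key=lambda v: (MODALITY_RANK[v.modality], v.width * v.height, v.kbps))


pyramid = [
    Version("video", 640, 480, 300.0),
    Version("video", 176, 144, 64.0),
    Version("image", 176, 144, 20.0),
    Version("text", 0, 0, 1.0),
]
```

A desktop client receives the 640x480 video, a PDA the small video, a slow phone link the still image, and a text-only device the text version.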
`
J. R. Smith proposed the notion of an importance value for each region of an image as a hint for reducing the overall data size, in bits, of the transcoded image. See, J. R. Smith, R. Mohan and C. Li, "Content-based Transcoding of Images in the Internet," in Proc. IEEE Intern. Conf. on Image Processing, Oct., 1998; S. Paek and J. R. Smith, "Detecting Image Purpose in World-Wide Web Documents," in Proc. SPIE/IS&T Photonics West, Document Recognition, Jan., 1998. The importance value describes the relative importance of a region/block in the image presentation compared with the other regions. This value ranges from 0 to 1, where 1 stands for the most important region and 0 for the least important. For example, the regions of high importance are compressed with a lower compression factor than the remaining part of the image. The other parts of the image are first blurred and then compressed with a higher compression factor in order to reduce the overall data size of the compressed image.
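The importance-value scheme can be sketched as a per-region encoding plan. The linear mapping from importance to an encoder quality setting (higher quality meaning a lower compression factor), the 20 to 90 quality range, and the 0.5 blur threshold are all assumptions made for illustration; the text does not give specific values.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Union


@dataclass(frozen=True)
class Region:
    name: str
    importance: float  # 0 = least important, 1 = most important


def encoding_plan(regions: Sequence[Region], min_quality: int = 20, max_quality: int = 90,
                  blur_below: float = 0.5) -> Dict[str, Dict[str, Union[int, bool]]]:
    """Per-region quality setting and blur flag derived from importance values."""
    plan = {}
    for r in regions:
        if not 0.0 <= r.importance <= 1.0:
            raise ValueError(f"importance of {r.name!r} must lie in [0, 1]")
        # Higher importance -> higher quality, i.e. a lower compression factor.
        quality = round(min_quality + r.importance * (max_quality - min_quality))
        # Less important regions are blurred before being compressed harder.
        plan[r.name] = {"quality": quality, "blur": r.importance < blur_below}
    return plan
```

For example, a face region with importance 1.0 would keep quality 90 without blurring, while a background region with importance 0.0 would be blurred and encoded at quality 20.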
`
`When an image is transmitted to a variety of client devices with different
`
`display siz