`
`(19) World Intellectual Property Organization
`International Bureau
`
`1111111111111111 IIIIII 111111111111111111111111111111 IIIII 111111111111111 IIII IIII
`
`(43) International Publication Date
`31 January 2002 (31.01.2002)
`
`PCT
`
`(10) International Publication Number
`WO 02/08948 A2
`
(51) International Patent Classification 7: G06F 17/00

(21) International Application Number: PCT/US01/23631

(22) International Filing Date: 23 July 2001 (23.07.2001)

(25) Filing Language: English

(26) Publication Language: English
`
(30) Priority Data:
60/221,394   24 July 2000 (24.07.2000)       US
60/221,843   28 July 2000 (28.07.2000)       US
60/222,373   31 July 2000 (31.07.2000)       US
60/271,908   27 February 2001 (27.02.2001)   US
60/291,728   17 May 2001 (17.05.2001)        US

(71) Applicant (for all designated States except US): VIVCOM, INC. [US/US]; 4180 Wallis Ct., Palo Alto, CA 94306 (US).
`
(72) Inventors; and
(75) Inventors/Applicants (for US only): SULL, Sanghoon [KR/KR]; Gaeop 4-cha Woosung Apt. 8-402, DoGop-Dong, KangNam-Ku, Seoul, 135-270 (KR). KIM, Hyeokman [KR/KR]; A-Nam Apt. 101-308, MyungRyun-Dong, Jong-Ro-Ku, Seoul, 110-521 (KR). CHOI, Hyungseok [KR/KR]; HyunDai Apt. 103-104, SsangMoon 4-Dong, Dobong-Ku, Seoul, 132-034 (KR). CHUNG, Min, Gyo [KR/KR]; DaeWon Apt. 806-901, GumGok-Dong, PunDang-Ku, SungNam City, Kyonggi, 463-480 (KR). YOON, Ja-Cheon [KR/KR]; SangRok Soo Apt. 204-303, IIWonBon-Dong, KangNam-Ku, Seoul, 135-947 (KR). OH, Jeongtaek [KR/KR]; DaeRim Apt. 207-2104, ChungGye-dong, NoWon-gu, Seoul, 139-220 (KR). LEE, Sangwook [KR/KR]; 102-801 Oksu Heights Apt., 100 Oksu Dong, Sundong-Ku, Seoul, 133-100 (KR). SONG, S., Moon-Ho [KR/KR]; Yongsan-gu Ichon-Dong 402, Gangchon Apt. 102-702, Seoul, 133-100 (KR). KIM, Jung, Rim [KR/KR]; Lotte Apt. 108-1701, Kuro-Dong,

[Continued on next page]
`
(54) Title: SYSTEM AND METHOD FOR INDEXING, SEARCHING, IDENTIFYING, AND EDITING PORTIONS OF ELECTRONIC MULTIMEDIA FILES
`
(57) Abstract: A method and system are provided for tagging, indexing, searching, retrieving, manipulating, and editing video images on a wide area network such as the Internet. A first set of methods is provided for enabling users to add bookmarks to multimedia files, such as movies, and audio files, such as music. The multimedia bookmark facilitates the searching of portions or segments of multimedia files, particularly when used in conjunction with a search engine. Additional methods are provided that reformat a video image for use on a variety of devices that have a wide range of resolutions by selecting some material (in the case of smaller resolutions) or more material (in the case of larger resolutions) from the same multimedia file. Still more methods are provided for interrogating images that contain textual information (in graphical form) so that the text may be copied to a tag or bookmark that can itself be indexed and searched to facilitate later retrieval via a search engine.
`
[Front-page figure (200): a web browser window (File, Edit, Go, Bookmarks, Options, Directory menus) displaying a "List of Multimedia Bookmarks" (202, 208), each bookmark comprising positional information and content information (reference numerals 214, 220, 224, 226).]
`
`
`
`
Kuro-gu, Seoul, 152-055 (KR). LEE, Keansub [KR/KR]; 972-2 Pyokjokgol Jugong Apt. 836-1701, Yongtong-Dong, Paldal-gu, Suwon City, Kyonggi, 463-060 (KR). CHUN, Seong, Soo [KR/KR]; Dusan Apt. 425-1402, Imae-dong, Pundang-gu, Songnam City, Kyonggi, 463-060 (KR). OH, Sangwook [KR/KR]; 609-42 Yongdam 2-Dong, Cheju City, Cheju, 690-042 (KR). KIM, Yunam [KR/KR]; 2529-3 DaeYu HanRah Mansion 302, NoHyun-Dong, Cheju City, Cheju, 690-180 (KR).
`
`(74) Agents: CHICHESTER, Ronald, L. et al.; Baker Botts
`L.L.P., One Shell Plaza, 910 Louisiana, Houston, TX 77002
`(US).
`
(81) Designated States (national): AE, AG, AL, AM, AT, AT (utility model), AU, AZ, BA, BB, BG, BR, BY, BZ, CA, CH, CN, CO, CR, CU, CZ, DE (utility model), DK (utility model), DM, DZ, EC, EE (utility model), ES, FI (utility model), GB, GD, GE, GH, GM, HR, HU, ID, IL, IN, IS, JP, KE, KG, KP, KR (utility model), KZ, LC, LK, LR, LS, LT, LU, LV, MA, MD, MG, MK, MN, MW, MX, MZ, NO, NZ, PL, PT, RO, RU, SD, SE, SG, SI, SK (utility model), SL, TJ, TM, TR, TT, TZ, UA, UG, US, UZ, VN, YU, ZA, ZW.
`
(84) Designated States (regional): ARIPO patent (GH, GM, KE, LS, MW, MZ, SD, SL, SZ, TZ, UG, ZW), Eurasian patent (AM, AZ, BY, KG, KZ, MD, RU, TJ, TM), European patent (AT, BE, CH, CY, DE, DK, ES, FI, FR, GB, GR, IE, IT, LU, MC, NL, PT, SE, TR), OAPI patent (BF, BJ, CF, CG, CI, CM, GA, GN, GQ, GW, ML, MR, NE, SN, TD, TG).
`
Published: without international search report and to be republished upon receipt of that report.

For two-letter codes and other abbreviations, refer to the "Guidance Notes on Codes and Abbreviations" appearing at the beginning of each regular issue of the PCT Gazette.
`
`
`
`
`
SYSTEM AND METHOD FOR INDEXING, SEARCHING, IDENTIFYING, AND EDITING PORTIONS OF ELECTRONIC MULTIMEDIA FILES
`
`Background of the Invention
`
`Field of the Invention
`
The present invention relates generally to marking multimedia files. More specifically, the present invention relates to applying or inserting tags into multimedia files for indexing and searching, as well as for editing portions of multimedia files, all to facilitate the storing, searching, and retrieving of the multimedia information.
`
`Background of the Related Art
`
`1.
`
`Multimedia Bookmarks
`
`15
`
`With the phenomenal growth of the Internet, the amount of multimedia content
`
`that can be accessed by the public has vi1iually exploded. There are occasions where a
`
`user who once accessed particular multimedia content needs or desires to access the
`
`content again at a later time, possibly at or from a different place. For example, in the
`
`case of data interruption due to a poor network condition, the user may be required to
`
`20
`
`access the content again. In another case, a user who once viewed multimedia content
`
`at work may want to continue to view the content at home. Most users would want to
`
`restart accessing the content from the point where they had left off. Moreover,
`
`subsequent access may be initiated by a different user in an exchange of information
`
`between users. Unfortunately, multimedia content is represented in a streaming file
`
`25
`
`format so that a user has _to view the file from the beginning in order to look for the
`
`exact point where the first user left off.
`
In order to save the time involved in browsing the data from the beginning, the concept of a bookmark may be used. A conventional bookmark marks a document such as a static web page for later retrieval by saving a link (address) to the document. For example, Internet browsers support a bookmark facility by saving an address called a Uniform Resource Identifier (URI) to a particular file. Internet Explorer, manufactured by the Microsoft Corporation of Redmond, Washington, uses the term "favorite" to describe a similar concept.
`
`-1-
`
`
`
`WO 02/08948
`
`PCT/US0l/23631
`
`-2-
`
Conventional bookmarks, however, store only the information related to the location of a file, such as the directory name with a file name, a Universal Resource Locator (URL), or the URI. The files referred to by conventional bookmarks are treated in the same way regardless of the data formats for storing the content. Typically, a simple link is also used for multimedia content. For example, to link to a multimedia content file through the Internet, a URI is used. Each time the file is revisited using the bookmark, the multimedia content associated with the bookmark is always played from the beginning.
`
Figure 1 illustrates a list 108 of conventional bookmarks 110, each comprising positional information 112 and a title 114. The positional information 112 of a conventional bookmark is composed of a URI as well as a bookmarked position 106. The bookmarked position is a relative time or byte position measured from the beginning of the multimedia content. The title 114 can be specified by a user or delivered with the content, and it is typically used to help the user easily recognize the bookmarked URI in a bookmark list 108. For a conventional bookmark that does not use a bookmarked position, when a user wants to replay the specified multimedia file, the file is played from the beginning each time, regardless of how much of the file the user has already viewed. The user has no choice but to record the last accessed position in a memo and to move manually to the last stopped point. If the multimedia file is viewed by streaming, the user must go through a series of buffering steps to find the last accessed position, thus wasting much time. Even for a conventional bookmark with a bookmarked position, the same problem occurs when the multimedia content is delivered in a live broadcast, since the bookmarked position within the multimedia content is not usually available, as well as when the user wants to replay one of the variations of the bookmarked multimedia content.
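
For purposes of illustration only, the following sketch contrasts the two kinds of bookmarks discussed above; the class and field names are hypothetical and do not appear in this specification:

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class ConventionalBookmark:
        """Stores only where the file lives; playback always restarts at the beginning."""
        uri: str                       # e.g. "http://example.com/movie.asf"
        title: str = ""

    @dataclass
    class MultimediaBookmark:
        """Also stores a position inside the content and descriptive content information."""
        uri: str                            # location of the multimedia file
        position_seconds: float = 0.0       # relative time from the beginning of the content
        byte_offset: Optional[int] = None   # alternative positional information
        title: str = ""
        content_info: dict = field(default_factory=dict)  # e.g. key frame, format, keywords

    # Resuming playback with a multimedia bookmark (hypothetical player API):
    #   player.open(bm.uri); player.seek(bm.position_seconds); player.play()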
`
Further, conventional bookmarks do not provide a convenient way of switching between different data formats. Multimedia content may be generated and stored in a variety of formats. For example, video may be stored in formats such as MPEG, ASF, RM, MOV, and AVI. Audio may be stored in formats such as MID, MP3, and WAV. There may be occasions where a user wants to switch the playback of content from one format to another. Since different data formats produced from the same multimedia content are often encoded independently, the same segment is stored at different temporal positions within the different formats. Since conventional bookmarks have no facility to store any content information, users have no choice but to review the multimedia content from the beginning and to search manually for the last-accessed segment within the content.
`
`5
`
`Time information may be incorporated into a bookmark to return to the last-
`
`accessed segment within the multimedia content. The use of time information only,
`
`however, fails to return to exactly the same segment at a later time for the following
`
`reasons.
`
`If a bookmark incorporating time information was used to save the last(cid:173)
`
`accessed segment during the preview of multimedia content broadcast, the bookmark
`
`IO
`
`information would not be valid during a regular full-version broadcast, so as to return
`
`to the last-accessed segment. Similarly, if a bookmark incorporating time information
`
`was used to save the last-accessed segment during real-time broadcast, the bookmark
`
`would not be effective during later access because the later available version may have
`
`been edited or a time code was not available during the real-time broadcast.
`
`15
`
`Many video and audio archiving systems, consisting of several differently
`
`compressed files called "variations", could be produced from a single source
`
`multimedia content. Many web-casting sites provide multiple streaming files for a
`
`single video content with different bandwidths according to each video format. For
`
`example, CNN.com provides five different streaming videos for a single video content:
`
`20
`
`two different types of streaming videos with the bandwidths of 28. 8 kbps and 80 kbps,
`
`both encoded in Microsoft's Advanced Streaming Format (ASF). CNN.com also
`
`provides RM streaming format by RealNetworks, Inc. of Seattle, Washington (RM),
`
`and a streaming video with the smart bandwidth encoded in Apple Computer, Inc. 's
`
`QuickTime streaming format (MOV). In this case, the five video files may start and
`
`25
`
`end at different time points from the viewpoint of the source video content, since each
`
`variation may be produced by an independent encoding process varying the values
`
`chosen for encoding formats, bandwidths, resolutions, etc. This results in mismatches
`
`of time points because a specific time point of the source video content may be
`
`presented as different media time points in the five video files.
`
`30
`
`When a multimedia bookmark is utilized, the mismatches of positions cause a
`
`problem of mis-positioned playback. Consider a simple case where one makes a
`
`multimedia bookmark on a master file of a multimedia content (for example, video
`
`-3-
`
`
`
`WO 02/08948
`
`PCT/US0l/23631
`
`-4-
`
`encoded in a given format), and tries to play another variation (for example, video
`
`encoded in a different format) from the bookmarked position. If the two variations do
`
`not start at the same position of the source content, the playback will not start at the
`
`bookmarked position. That is, the playback will start at the position that is temporally
`
`5
`
`shifted with the difference between the strut positions of the two variations.
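
For purposes of illustration only, the arithmetic behind this temporal shift can be sketched as follows, assuming each variation records the offset of its first frame relative to the source content (the function and parameter names are hypothetical):

    def map_position(bookmark_pos: float,
                     start_offset_bookmarked: float,
                     start_offset_target: float) -> float:
        """Map a bookmarked position (seconds) in one variation to another variation.

        Each start_offset is the time in the source content at which that variation
        begins. Without this correction, playback of the target variation starts
        shifted by the difference of the two offsets.
        """
        source_time = start_offset_bookmarked + bookmark_pos   # position on the source timeline
        return max(0.0, source_time - start_offset_target)     # position in the target variation

    # Example: bookmark at 120 s in a variation that starts 5 s into the source;
    # the target variation starts 12 s into the source, so play from 113 s.
    print(map_position(120.0, 5.0, 12.0))  # -> 113.0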
`
The entire multimedia presentation is often lengthy. However, there are frequent occasions when the presentation is interrupted, voluntarily or forcibly, and terminates before finishing. Examples include a user who starts playing a video at work, leaves the office, and desires to continue watching the video at home, or a user who may be forced to stop watching the video and log out due to a system shutdown. It is thus necessary to save the termination position of the multimedia file into persistent storage in order to return directly to the point of termination without a time-consuming playback of the multimedia file from the beginning.
`
The interrupted presentation of the multimedia file will usually resume exactly at the previously saved termination position. However, in some cases it is desirable to begin the playback of the multimedia file a certain time before the terminated point, since such rewinding can help refresh the user's memory.
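
For purposes of illustration only, a minimal sketch of this resume-with-rewind behavior (the function name and the 10-second interval are arbitrary illustrative assumptions):

    def resume_position(saved_position: float, rewind_seconds: float = 10.0) -> float:
        """Start playback slightly before the saved termination point.

        Rewinding by a short interval helps the user recall the context of the
        interrupted presentation; the result is clamped so it never precedes
        the beginning of the file.
        """
        return max(0.0, saved_position - rewind_seconds)

    # e.g. terminated at 754.2 s -> resume playback from 744.2 s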
`
In the prior art, the EPG (Electronic Program Guide) has played a crucial role as a provider of TV programming information. The EPG facilitates a user's efforts to search for TV programs that he or she wants to view. However, the EPG's two-dimensional presentation (channels vs. time slots) becomes cumbersome as terrestrial, cable, and satellite systems send out thousands of programs through hundreds of channels. Navigating a large table of rows and columns in order to search for desired programs is frustrating.
`
`25
`
`One of the features provided by the recent set-top box (STB) is the personal
`
`video recording (PVR) that allows simultaneous recording and playback. Such STB
`
`usually contains digital video encoder/decoder based on an international digital video
`
`compression standard such as MPEG-1/2, as well as the lat·ge local storage for the
`
`digitally compressed video data. Some of the recent STBs also allow connection to the
`
`30
`
`Internet. Thus, STB users can experience new services such as time-shifting and web(cid:173)
`
`enhanced television (TV).
`
However, there still exist some problems for PVR-enabled STBs. The first problem is that even the latest STBs alone cannot fully satisfy users' ever-increasing desire for diverse functionalities. The STBs now on the market are very limited in terms of computing power and memory, so it is not easy to execute most CPU- and memory-intensive applications. For example, people who are bored with plain playback of the recorded video may desire more advanced features such as video browsing/summary and search. All of those features require metadata for the recorded video. The metadata are usually data describing content, such as the title, genre, and summary of a television program. The metadata also include audiovisual characteristic data such as raw image data corresponding to a specific frame of the video stream. Some of the description is structured around "segments" that represent spatial, temporal, or spatio-temporal components of the audio-visual content. In the case of video content, a segment may be a single frame, a single shot consisting of successive frames, or a group of several successive shots. Each segment may be described by some elementary semantic information using text. The segment is referenced by the metadata using media locators such as frame numbers or time codes. However, the generation of such video metadata usually requires intensive computation and a human operator's help, so, practically speaking, it is not feasible to generate the metadata in the current STB. Thus, one possible solution for this problem is to generate the metadata in a server connected to the STB and to deliver it to the STB via the network. However, in this scenario, it is essential to know the start position of the recorded video with respect to the video stream used to generate the metadata in the server/content provider, in order to match the temporal position referenced by the metadata to the position of the recorded video.
`
The second problem is related to the discrepancy between two time instants: the time instant at which the STB starts the recording of the user-requested TV program, and the time instant at which the TV program is actually broadcast. Suppose, for instance, that a user initiated a PVR request for a TV program scheduled to go on the air at 11:30 AM, but the actual broadcasting time is 11:31 AM. In this case, when the user wants to play the recorded program, the user has to watch an unwanted segment, lasting one minute, at the beginning of the recorded video. This time mismatch could bring some inconvenience to a user who wants to view only the requested program. However, the time mismatch problem can be solved by using metadata delivered from the server, for example, reference frames or a segment representing the beginning of the TV program. The exact location of the TV program can then be easily found by simply matching the reference frames against the recorded frames for the program.
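
For purposes of illustration only, one hedged sketch of such reference-frame matching follows; the coarse luminance-histogram signature and the matching threshold are illustrative assumptions, not a method prescribed by this disclosure:

    from typing import List, Optional, Sequence

    def frame_signature(frame: Sequence[int], bins: int = 16) -> List[float]:
        """Very coarse luminance histogram used as a frame signature (illustrative only)."""
        hist = [0] * bins
        for pixel in frame:                        # frame given as flat luminance values 0..255
            hist[min(pixel * bins // 256, bins - 1)] += 1
        total = float(len(frame)) or 1.0
        return [h / total for h in hist]

    def find_program_start(reference: Sequence[int],
                           recorded_frames: List[Sequence[int]],
                           threshold: float = 0.1) -> Optional[int]:
        """Return the index of the first recorded frame matching the reference frame."""
        ref_sig = frame_signature(reference)
        for idx, frame in enumerate(recorded_frames):
            sig = frame_signature(frame)
            distance = sum(abs(a - b) for a, b in zip(ref_sig, sig))  # L1 histogram distance
            if distance < threshold:
                return idx     # frames before idx form the unwanted leading segment
        return None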
`
`2.
`
`Search
`
`5
`
`The rapid expansion of the World Wide Web (WWW) and mobile
`
`communications has also brought great interest in efficient multimedia data search,
`
`browsing and management. Content-based image retrieval (CBIR) is a powerful
`
`concept for finding images based on image contents, and content-based image search
`
`and browsing have been tested using many CBIR systems. See, M. Flickner, Harpreet
`
`10
`
`Sawhney, Wayne Niblack, Jonathan Ashley, Q. Huang, Byron Dom, Monika Gorkani,
`
`Jim Hafine, Denis Lee, Dragutin Petkovic, David Steele and Peter Yanker, "Query by
`
`image and video content: The QBIC system," IEEE Computer, Vol. 28. No. 9, pp. 23-
`
`32, Sept., 1995; Carson, Chad et al., "Region-Based Image Querying [Blobworld],"
`
`Workshop on Content-Based Access of Image and Video Libraries, Puerto Rico, Jun.
`
`15
`
`1997; J. R. Smith and S. Chang, "Visually searching the web for content," IEEE
`
`Multimedia Magazine, Vol. 4, No. 3, pp. 12-20, Summer 1997, also Columbia U.
`
`CU/CTR Technical Report 459-96-25; A. Pentland, R. W. Picard and S. Sclaroff, "A
`
`Photo book: tools for content-based manipulation of image databases," in Proc. Of SPIE
`
`Conf. On Storage and Retrieval for Image and Video Databases-II, No. 2185, pp. 34-
`
`20
`
`47, San Jose, CA, Feb., 1944; J. R. Bach, C. Fuller, A. Guppy, A. Hampapur, B.
`
`Horowitz, R. Humphrey, R. C. Jain and C. Shu, "Virage image search engine: an open
`
`framework for image management," Symposium on Electronic Imaging: Science and
`
`Technology --Storage & Retrieval for Image and Video Databases IV, IS&T/SPIE'96,
`
`Feb., 1996; J. R. Smith and S. Chang, "VisualSEEk: A Fully Automated Content-
`
`25
`
`Based Image Query System," ACM Multimedia Conference, Boston, MA, Nov. 1996;
`
`Jing Huang, S. Ravi Kumar, Mandar Mitra, Wei-Jing Zhu and Ramin Zabih. "Image
`
`Indexing Using Color C01Telograms, 11 in IEEE Conference on Computer Vision and
`
`Pattem Recognition, pp. 762-768, Jun., 1997; and Simone Santini, and Ramesh Jain,
`
`"The 'El Nino' Image Database System," in Intemational Conference on Multimedia
`
`30 Computing and Systems, pp. 524-529, Jun., 1999.
`
Currently, most content-based image search engines rely on low-level image features such as color, texture, and shape. While high-level image descriptors are potentially more intuitive for common users, the derivation of high-level descriptors is still in its experimental stages in the field of computer vision and requires complex vision processing. Despite their efficiency and ease of implementation, on the other hand, the main disadvantage of low-level image features is that they are perceptually non-intuitive for both expert and non-expert users and, therefore, do not normally represent users' intent effectively. Furthermore, they are highly sensitive to small image variations in feature shape, size, position, orientation, brightness, and color. Perceptually similar images are often highly dissimilar in terms of low-level image features. Searches made with low-level features are often unsuccessful, and it usually takes many trials to find images satisfactory to a user.
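
For purposes of illustration only, the following sketch shows what a search by low-level features typically looks like, using a coarse color histogram and an L1 distance; these particular choices are common CBIR practice and are not taken from this disclosure:

    from typing import Dict, List, Sequence, Tuple

    def color_histogram(pixels: Sequence[Tuple[int, int, int]],
                        bins_per_channel: int = 4) -> List[float]:
        """Quantize RGB pixels into a coarse joint histogram (a typical low-level feature)."""
        size = bins_per_channel ** 3
        hist = [0.0] * size
        for r, g, b in pixels:
            idx = ((r * bins_per_channel // 256) * bins_per_channel +
                   (g * bins_per_channel // 256)) * bins_per_channel + (b * bins_per_channel // 256)
            hist[idx] += 1.0
        total = float(len(pixels)) or 1.0
        return [h / total for h in hist]

    def rank_by_similarity(query_hist: List[float],
                           database: Dict[str, List[float]]) -> List[Tuple[str, float]]:
        """Rank database images by L1 distance to the query histogram (smaller = more similar)."""
        scored = [(name, sum(abs(a - b) for a, b in zip(query_hist, h)))
                  for name, h in database.items()]
        return sorted(scored, key=lambda item: item[1])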
`
Efforts have been made to overcome the limitations of low-level features. Relevance feedback is a popular idea for incorporating a user's perceptual feedback into the image search. See, Y. Rui, T. Huang, and S. Mehrotra, "A relevance feedback architecture in content-based multimedia information retrieval systems," in IEEE Workshop on Content-based Access of Image and Video Libraries, Puerto Rico, pp. 82-89, Jun., 1997; Yong Rui, Thomas S. Huang, Michael Ortega, and Sharad Mehrotra, "Relevance Feedback: A Power Tool in Interactive Content-Based Image Retrieval," in IEEE Trans. on Circuits and Systems for Video Technology, Special Issue on Segmentation, Description, and Retrieval of Video Content, pp. 644-655, Vol. 8, No. 5, Sept., 1998; G. Aggarwal, P. Dubey, S. Ghosal, A. Kulshreshtha, and A. Sarkar, "iPURE: perceptual and user-friendly retrieval of images," in Proc. of IEEE International Conference on Multimedia and Exposition, Vol. 2, pp. 693-696, Jul., 2000; Ye Lu, Chunhui Hu, Xingquan Zhu, HongJiang Zhang and Qiang Yang, "A unified framework for semantics and feature based relevance feedback in image retrieval systems," in Proc. of ACM International Conference on Multimedia, pp. 31-37, Oct., 2000; H. Muller, W. Muller, S. Marchand-Maillet, and T. Pun, "Strategies for positive and negative relevance feedback in image retrieval," in Proc. of IEEE Conference on Pattern Recognition, Vol. 1, pp. 1043-1046, Sept., 2000; S. Aksoy, R. M. Haralick, F. A. Cheikh, and M. Gabbouj, "A weighted distance approach to relevance feedback," in Proc. of IEEE Conference on Pattern Recognition, Vol. 4, pp. 812-815, Sept., 2000; I. J. Cox, M. L. Miller, T. P. Minka, T. V. Papathomas, and P. N. Yianilos, "The Bayesian image retrieval system, PicHunter: theory, implementation, and psychophysical experiments," in IEEE Transactions on Image Processing, Vol. 9, pp. 20-37, Jan., 2000; P. Muneesawang and Guan Ling, "Multi-resolution-histogram indexing and relevance feedback learning for image retrieval," in Proc. of IEEE International Conference on Image Processing, Vol. 2, pp. 526-529, Jan., 2001. A user can manually establish relevance between a query and retrieved images, and the relevant images can be used for refining the query. When the refinement is made by adjusting a set of low-level feature weights, however, the user's intent is still represented by low-level features and their basic limitations still remain.
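
For purposes of illustration only, a hedged sketch of weight-based relevance feedback follows; the inverse-variance re-weighting rule shown is a generic heuristic and is not the method of any particular cited system:

    from typing import List

    def update_feature_weights(relevant_feature_vectors: List[List[float]],
                               epsilon: float = 1e-6) -> List[float]:
        """Re-weight feature dimensions by the inverse variance over relevant examples.

        Dimensions on which the user-marked relevant images agree (low variance)
        are judged informative and receive higher weight in the next query round.
        Assumes at least one relevant feature vector is given.
        """
        n = len(relevant_feature_vectors)
        dims = len(relevant_feature_vectors[0])
        weights = []
        for d in range(dims):
            values = [v[d] for v in relevant_feature_vectors]
            mean = sum(values) / n
            var = sum((x - mean) ** 2 for x in values) / n
            weights.append(1.0 / (var + epsilon))
        total = sum(weights)
        return [w / total for w in weights]   # normalized weights

    def weighted_distance(a: List[float], b: List[float], w: List[float]) -> float:
        """Weighted L1 distance used to re-rank candidates after feedback."""
        return sum(wi * abs(ai - bi) for ai, bi, wi in zip(a, b, w))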
`
Several approaches have been made to the integration of human perceptual responses and low-level features in image retrieval. One notable approach is to adjust an image feature's distance attributes based on the human perceptual input. See, Simone Santini and Ramesh Jain, "The 'El Nino' Image Database System," in International Conference on Multimedia Computing and Systems, pp. 524-529, Jun., 1999. Another approach, called "blob world," combines low-level features to derive slightly higher-level descriptions and presents the "blobs" of grouped features to a user to provide a better understanding of feature characteristics. See, Carson, Chad, et al., "Region-Based Image Querying [Blobworld]," Workshop on Content-Based Access of Image and Video Libraries, Puerto Rico, Jun., 1997. While those schemes successfully reflect a user's intent to some degree, it remains to be seen how grouping of features or feature distance modification can achieve perceptual relevance in image retrieval. A more traditional computer vision approach, deriving high-level object descriptors based on generic object recognition, has also been presented for image retrieval. See, David A. Forsyth and Margaret Fleck, "Body Plans," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 678-683, Jun., 1997. Due to its limited feasibility for general image objects and its complex processing, its utility is still restricted.
`
With the rapid proliferation of large image/video databases, there has been an increasing demand for effective methods to search the large image/video databases automatically by their content. For a query image/video clip given by a user, these methods search the databases for the images/videos that are most similar to the query. In other words, the goal of the image/video search is to find the best matches to the query image/video in the database.
`
`-8-
`
`
`
`WO 02/08948
`
`PCT/US0l/23631
`
`-9-
`
Several approaches have been made towards the development of fast, effective multimedia search methods. Milanese et al. utilized hierarchical clustering to organize an image database into visually similar groupings. See, R. Milanese, D. Squire, and T. Pun, "Correspondence analysis and hierarchical indexing for content-based image retrieval," in Proc. IEEE Int. Conf. Image Processing, Vol. 3, Lausanne, Switzerland, pp. 859-862, Sept., 1996. Zhang and Zhong provided a hierarchical self-organizing map (HSOM) method to organize an image database into a two-dimensional grid. See, H. J. Zhang and D. Zhong, "A scheme for visual feature based image indexing," in Proc. SPIE/IS&T Conf. Storage and Retrieval for Image and Video Databases III, Vol. 2420, pp. 36-46, San Jose, CA, Feb., 1995. However, a weakness of HSOM is that it is generally too computationally expensive to apply to a large multimedia database.

In addition, there are other well-known solutions using the Voronoi diagram, Kd-tree, and R-tree. See, J. Bentley, "Multidimensional binary search trees used for associative searching," Comm. of the ACM, Vol. 18, No. 9, pp. 509-517, 1975; S. Brin, "Near neighbor search in large metric spaces," in Proc. 21st Conf. on Very Large Databases (VLDB '95), Zurich, Switzerland, pp. 574-584, 1995. However, it is also known that those approaches are not adequate for high-dimensional feature vector spaces, and thus they are useful only in low-dimensional feature spaces.
`
Peer-to-Peer Searching

Peer-to-Peer (P2P) is a class of applications that makes the most of previously unused resources (for example, storage, content, and/or CPU cycles) available on the peers at the edges of networks. P2P computing allows the peers to share resources and services, to aggregate CPU cycles, or to chat with each other, by direct exchange. Two of the more popular implementations of P2P computing are Napster and Gnutella. Napster has its peers register files with a broker, and uses the broker to search for files to copy. The broker plays the role of the server in a client-server model to facilitate the interaction between the peers. Gnutella has peers register files with network neighbors, and searches the P2P network for files to copy. Since this model does not require a centralized broker, Gnutella is considered to be a true P2P system.
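
For purposes of illustration only, the brokered (Napster-style) model can be sketched as follows; the class and method names are hypothetical, and this is not the actual Napster protocol:

    from collections import defaultdict
    from typing import Dict, List, Set

    class Broker:
        """Central index in the brokered model; a Gnutella-style system would
        instead flood the query to network neighbors rather than ask a broker."""
        def __init__(self) -> None:
            self._index: Dict[str, Set[str]] = defaultdict(set)  # file name -> peer addresses

        def register(self, peer_address: str, files: List[str]) -> None:
            for name in files:
                self._index[name].add(peer_address)

        def search(self, file_name: str) -> List[str]:
            """Return peers advertising the requested file; the copy itself then
            happens by direct exchange between the two peers."""
            return sorted(self._index.get(file_name, set()))

    # Usage:
    broker = Broker()
    broker.register("10.0.0.5:6699", ["songA.mp3", "songB.mp3"])
    broker.register("10.0.0.9:6699", ["songB.mp3"])
    print(broker.search("songB.mp3"))   # -> ['10.0.0.5:6699', '10.0.0.9:6699']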
`
`3.
`
`Editing
`
`-9-
`
`
`
`WO 02/08948
`
`PCT/US0l/23631
`
`- 10 -
`
In the prior art, video files were edited through video editing software by copying several segments of the input videos and pasting them into an output video. The prior art method, however, confronts the two major problems mentioned below.

The first problem of the prior art method is that it requires additional storage to store the new version of an edited video file. Conventional video editing software generally uses the original input video file to create an edited video. In most cases, editors having a large database of videos attempt to edit those videos to create a new one. In this case, storage is wasted on duplicated portions of the video. The second problem with the prior art method is that whole new metadata have to be generated for a newly created video. If the metadata are not edited in accordance with the editing of the video, then even if the metadata for the specific segments of the input video are already constructed, the metadata may not accurately reflect the content. Because considerable effort is required to create the metadata of videos, it is desirable to reuse existing metadata efficiently, if possible.
`
`15
`
`Metadata of a video segment contain textual info1mation such as time
`
`information ( for example, struting frame number and duration, or struiing frame
`
`number as well as the finishing frame number), title, keyword, and annotation, as well
`
`as image information such as the key frame of a segment. The metadata of segments
`
`can form a hierarchical structure where the larger segment contains the smaller
`
`20
`
`segments. Because it is hard to store both the video and their metadata into a single
`
`file, the video metadata ru·e separately stored as a metafile, or stored in a database
`
`management system (DBMS).
`
If metadata having a hierarchical structure are used, then browsing a whole video, searching for a segment using the keywords and annotation of each segment, and using the key frames of each segment for a visual summary of the video are all supported. Also, such metadata support not only the existing simple playback, but also the playback and repeated playback of a specific segment. Therefore, the use of hierarchically-structured metadata is becoming popular.
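
For purposes of illustration only, hierarchically-structured segment metadata can be sketched as follows; the field names are illustrative and are not those of this disclosure or of any metadata standard:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class SegmentMetadata:
        """One node in a segment hierarchy: a program contains scenes, a scene contains shots."""
        title: str
        start_frame: int
        duration_frames: int                  # time information expressed as a frame count
        keywords: List[str] = field(default_factory=list)
        annotation: str = ""
        key_frame: Optional[int] = None       # frame number used for the visual summary
        children: List["SegmentMetadata"] = field(default_factory=list)

        def find_by_keyword(self, word: str) -> List["SegmentMetadata"]:
            """Search the hierarchy for segments annotated with a given keyword."""
            hits = [self] if word in self.keywords else []
            for child in self.children:
                hits.extend(child.find_by_keyword(word))
            return hits

    # Example: a program containing one scene that contains two shots.
    program = SegmentMetadata("Program", 0, 9000, children=[
        SegmentMetadata("Scene 1", 0, 4500, keywords=["interview"], key_frame=10, children=[
            SegmentMetadata("Shot 1", 0, 2000, key_frame=5),
            SegmentMetadata("Shot 2", 2000, 2500, key_frame=2100),
        ]),
    ])
    print([s.title for s in program.find_by_keyword("interview")])  # -> ['Scene 1']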
`4.
`Transcoding
`
`30
`
`With the advance of information technology, such as the populru·ity of the
`
`Internet, multimedia presentation proliferates into ever increasing kinds of media,
`
`including wireless media. Multimedia data are accessed by ever increasing kinds of
`
`-10-
`
`
`
`WO 02/08948
`
`PCT/US0l/23631
`
`- 11-
`
`devices such as hand-held computers (IDICs), personal digital assistants (PDAs), and
`
`smart cellular phones. There is a need for accessing multimedia content in a universal
`
`fashion from a wide variety of devices. See, J. R. Smith, R. Mohan and C. Li,
`
`"Transcoding Internet Content for Heterogeneous Client Devices," in Proc. ISCASA,
`
`5 Monterey, California, 1998.
`
Several approaches have been made to effectively enable such universal multimedia access (UMA). One data representation, the InfoPyramid, is a framework for aggregating the individual components of multimedia content with content descriptions, and methods and rules for handling the content and content descriptions. See, C. Li, R. Mohan and J. R. Smith, "Multimedia Content Description in the InfoPyramid," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, May, 1998. The InfoPyramid describes content in different modalities, at different resolutions, and at multiple abstractions. A transcoding tool then dynamically selects the resolutions or modalities that best meet the client capabilities from the InfoPyramid.
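
For purposes of illustration only, a hedged sketch of that selection step follows; the capability fields and the greedy pick-the-richest-version rule are illustrative assumptions, not the InfoPyramid's actual algorithm:

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class ContentVersion:
        """One entry of a content pyramid: a modality at a given resolution and bitrate."""
        modality: str          # "video", "image", "text", ...
        width: int
        height: int
        bitrate_kbps: int

    @dataclass
    class ClientCapabilities:
        supported_modalities: List[str]
        max_width: int
        max_height: int
        max_bitrate_kbps: int

    def select_version(versions: List[ContentVersion],
                       client: ClientCapabilities) -> Optional[ContentVersion]:
        """Pick the richest version the client can handle (greedy illustrative rule)."""
        usable = [v for v in versions
                  if v.modality in client.supported_modalities
                  and v.width <= client.max_width
                  and v.height <= client.max_height
                  and v.bitrate_kbps <= client.max_bitrate_kbps]
        return max(usable, key=lambda v: (v.width * v.height, v.bitrate_kbps), default=None)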
`
`15
`
`J. R. Smi



