throbber
Miodern
`Information
`Atg tee]
`
`WESLEY Comcast, Exhibit-1005
`
`Ricardo Baeza-Yates
`
`Berthier Ribeiro-Neto
`
`Comcast, Exhibit-1005
`
`1
`
`

`

`ModernInformation Retrieval
`
`Ricardo Baeza-Yates
`Berthier Ribeiro—Neto
`
`
`
`ACM Press
`
`New York
`
`aV
`
`V
`
`PEARSON
`
`ddison
`Meier
`
`aA
`
`2
`
`

`

`Pearson Education Limited
`
`Edinburgh Gate
`Harlow
`Essex CM20 2JE
`
`England
`
`and Associated Companies throughout the world
`
`Visit us on the World Wide Webat:
`www.pearsoned.co.uk
`
`Copyright © 1999 by the ACMpress, A Division of the Association for Computing
`Machinary, Inc. (ACM).
`
`The rights of the authors of this Work have been asserted by them in accordance with
`the-Copyright, Designs and Patents Act 1988.
`
`All rights reserved. No part ofthis publication may be reproduced, stored in a
`retrieval system, or transmitted in any form or by any means, electronic, mechanical,
`photocopying, recording or otherwise, without either the prior written permission of
`the publisher or a licence permitting restricted copying in the United Kingdom issued
`by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London WIT 4LP.
`
`While the publisher has made every attemptto trace all copyright owners and obtain
`permission to reproduce material, in a few cases this has proved impossible.
`Copyright holders of material which has not been acknowledged are encouraged to
`contact the publisher.
`:
`
`Many of the designations used by manufacturers andsellers to distinguish their
`products are claimed as trademarks. Addison Wesley Longman Limited has made
`every attempt to supply trade mark information about manufacturers and their
`products mentionedin this book. A list of the trademark designations and their
`owners appears on page viii.
`
`Typeset in Computer Modern by 56
`Printed and bound in the United States of America
`
`First printed 1999
`
`ISBN 0-201-39829-X
`
`British Library Cataloguing-in-Publication Data
`A catalogue record for this book is available from the British Library
`
`Library of Congress Cataloging-in-Publication Data
`Baeza-Yates, R. (Ricardo)
`- Modern information retrieval / Ricardo Baeza-Yates, Berthier Ribeiro-Neto.
`p. cm.
`Includes bibliographical reference and index.
`ISBN 0-201-39829-X
`
`1. Information storage and retrieval systems. I. Ribeiro, Berthier de Araiijo
`Neto, 1960-.
`II. Title.
`Z667.B34
`1999
`025.04—dce21
`
`99-10033
`CIP
`
`3
`
`

`

`Preface
`
`
`
`Information retrieval (IR.) has changed considerably in recent years with the expansion
`of the World Wide Web and the advent of modern and inexpensive graphical user
`interfaces and mass storage devices. As a result, traditional IR textbooks have become
`quite out of date and this has led to the introduction of new IR. books. Nevertheless, we
`believe that there is still great need for a book that approaches thefield in a rigorous
`and complete way from a computer-science perspective (as opposed to a user-centered
`perspective). This book is an effort to partially fulfill this gap and should be useful for
`a first course on information retrieval as well as for a graduate course on the topic.
`The book comprises two portions which complement and balance each other.
`The core portion includes nine chapters authored or coauthored by the designers of
`the book. The second portion, which is fully integrated with the first, is formed by
`six state-of-the-art chapters written by leading researchers in their fields. The same
`notation and glossary are used in all the chapters. Thus, despite the fact that several
`people have contributed to the text, this book is really much more a textbook than
`an edited collection of chapters written by separate authors. Furthermore, unlike a
`collection of chapters, we have carefully designed the contents and organization of the
`book to present a cohesive view of all the important aspects of modern information
`retrieval.
`'
`:
`From IR models to indexing text, from IR visual tools and interfaces to the Web,
`from IR. multimedia to digital libraries, the book provides both breadth of coverage and
`richness of detail. It is our hope that, given the now clear relevance and significance of
`information retrieval to modern society, the book will contribute to further disseminate
`the study of the discipline at information science, computer science, and library science
`departments throughout the world.
`
`
`
`Ricardo Baeza-Yates, Santiago, Chile
`Berthier Ribeiro-Neto, Belo Horizonte, Brazil
`January, 1999
`
`4
`
`

`

`To Helena, Rosa, and our children
`
`territério de homenslivres
`que seré nosso pais
`e sera patria de todos.
`Irm4aos, cantai ese mundo
`que nao verei, mas vira
`um dia, dentro de mil anos,
`talvez mais. ..nao tenho pressa.
`
`de Cidade Prevista no livro
`A Rosa do Povo, 1945
`
`Carlos Drummond de Andrade
`
`territory of free men
`that will be our country
`and will be the nation ofall
`Brothers, sing this world
`which I’ll not see, but which will come
`one day, in a thousand years,
`maybe more...no hurry.
`
`from Prevised City in the book
`The Rose of the People, 1945
`
`Amolos libros
`exploradores,
`libros con bosque o nieve,
`profundidad o cielo
`
`de Oda al Libro(I),
`
`Pablo Neruda
`
`I love books
`that explore,
`books with a forest or snow,
`depth or sky
`
`from Ode to the Book (I),
`
`5
`
`

`

`vill
`
`ACKNOWLEDGEMENTS
`
`Fifth, we thank the support of our institutions, the Departments of Com-
`puter Science of the University of Chile and of the Federal University of Minas
`Gerais, as well as the funding provided by national research agencies (CNPq in
`Brazil and CONICYTin Chile) and international collaboration projects, in par-
`ticular CYTED project VII.13 AMYRI (Environment for Information Managing
`and Retrieval in the World Wide Web) and Finep project SIAM (Information
`Systems for Mobile Computers) under the Pronex program.
`Most important, to Helena, Rosa, and our children, who put up with a
`string of trips abroad, lost weekends, and odd working hours.
`
`List of Trademarks
`Alta Vista is a trademark of Compaq Computer Corporation
`FrameMaker is a trademark of Adobe Systems Incorporated
`IBM SP2 is a trademark of International Business Machines Corporation
`Netscape Communicator is a trademark of Netscape Communications Corporation
`Solaris, Sun 3/50 and Sun UltraSparc-1 are trademarks of Sun Microsystems, Inc.
`Thinking Machines CM-2 is a trademark of Thinking Machines Corporation
`Unix is licensed through X/Open Company Ltd
`
`6
`
`

`

`Contents
`
`
`
`Preface
`
`Acknowledgements
`
`Biographies
`
`Vv
`
`vii
`
`Xvii
`
`1.2
`
`1.3
`
`1
`Introduction
`1.1. Motivation... 2.2... ee, 1
`1.1.1
`Information versus Data Retrieval. ............
`1
`1.1.2
`Information Retrieval at the Center of the Stage... ..
`2
`‘1.1.38
`Focus of the Book... .....0.0.0002...02...0..
`3
`Basic Concepts
`.........0..0.0 0000000 eee eee
`3
`1.2.1 The User Task... 2. ee en 4
`1.2.2 Logical View of the Documents ..............
`5
`Past, Present, and Future ..................0..
`6
`1.3.1 Early Developments... ................0..,
`6
`1.3.2.
`Information Retrieval in the Library ...........
`7
`1.3.3 The Web and Digital Libraries. .........2.2002.
`7
`1.3.4
` PracticalIssues ...........0.......2..00.
`8
`The Retrieval Process. ........0.000..000. 000004 8
`9
`(Organization of the Book. ..........0.0..0. 00.0004
`LO
`1.5.1 Book Topics ...........00.0.00...0.000.4 Ll
`1.5.2 Book Chapters ....................00.
`12
`1.66 Howto Use this Book: ........0..0..0.0.....0..
`15
`1.6.1 Teaching Suggestions ....... re 15
`1.6.2 The Book’s Web Page ................0..
`16
`Bibliographic Discussion .........002...... 004004
`17
`
`1.4
`1.5
`
`1.7
`
`19
`2 Modeling
`2.1
`Introduction ..............00.. Se eee 19
`
`7
`
`

`

`CONTENTS
`
`2.4
`2.5
`
`2.6
`
`2.7
`
`23
`<A Formal Characterization of IR Models .............
`24
`Classic Information Retrieval. ............02. 050084
`24
`2.5.1 Basic ConceptS ... 1.2... 0. ee ee ee
`2.5.2 Boolean Model .......... 0.20 eee ee eee 25
`2.5.3 Vector Model ......ee 27
`2.5.4
`Probabilistic Model .............-2.2. 2.50008.
`30
`2.5.5 Brief Comparison of Classic Models ............
`34
`Alternative Set Theoretic Models -...........2200.4-
`34
`2.6.1
`Fuzzy Set Model ......ee ee 34
`2.6.2 Extended Boolean Model..................
`38
`Alternative Algebraic Models ..............2004.
`Al
`2.7.1 Generalized Vector Space Model... ...........
`Al
`2.7.2. Latent Semantic Indexing Model .............
`4A
`2.7.3. Neural Network Model .................0..
`46
`Alternative Probabilistic Models... .............04.4.
`A8
`2.8.1 Bayesian Networks .... 2.2.2... 2.2.2. 02002208.
`48
`2.8.2
`Inference Network Model..................
`49
`2.8.3. Belief Network Model... ..........-...206:
`56
`2.8.4 Comparison of Bayesian Network Models.........
`59
`2.8.5 Computational Costs of Bayesian Networks .......
`60
`2.8.6 The Impact of Bayesian Network Models .........
`61
`Structured Text Retrieval Models .................
`61
`2.9.1 Model Based on Non-Overlapping Lists. .........
`62
`2.9.2. Model Based on Proximal Nodes .............
`63
`2.10 Models for Browsing .... 1... eee ee ee 65
`2.10.1 Flat Browsing ............ 2.002.220 00 008.
`65
`2.10.2 Structure Guided Browsing ................
`66
`2.10.3 The Hypertext Model... .........2.02.2.404.
`66
`‘Trends and Research Issues... .. 2.2... 2.0040 eee eae
`69
`2.11
`2.12 Bibliographic Discussion ..............064%Fe ee 69
`
`2.8
`
`2.9
`
`
`
`73
`Retrieval Evaluation
`3.1
`Introduction .. 1... ee 73
`3.2
`Retrieval Performance Evaluation ...............4.2.
`74
`3.2.1 Recall and Precision ................0.04.
`75
`3.2.2 Alternative Measures ...........--2-02+02084 82
`3.3 Reference Collections ... 1... 0.02.00. pe eee ee es .6684
`3.3.1 The TREC Collection ................0.4.
`84
`3.3.2 The CACM and ISI Collections ..............
`91
`3.3.3 The Cystic Fibrosis Collection ...............
`94
`‘Trends and Research Issues. .........-..0-00-0. 502004
`96
`Bibliographic Discussion ... 2... 6... 2 ee ee eee ee 96
`
`3.4
`3.5
`
`99
`Query Languages
`4.1
`Introduction ..0... 99
`
`8
`
`

`

`CONTENTS
`
`xi
`
`Single-Word Queries ...........ek ee eee
`4.2.1
`4.2.2 Context Queries... ..........2....0.02.20004
`4.2.3 Boolean Queries... .............0.000008.4
`4.2.4 Natural Language... ..........0.00.2.0000.%
`Pattern Matching ..................00..00.0000.
`Structural Queries... .....0...0...0.0...20- 00004
`4.4.1
`Fixed Structure ............0.0....2..0000.
`4.4.2 Hypertext .. 2... 2... ee ee ee
`4.4.3 Hierarchical Structure ..................0.,
`Query Protocols. ..........0 000000 eee eee eee.
`‘Trends and Research Issues. ..............00000.4
`Bibliographic Discussion ............0....0.000.
`
`4.3
`4.4
`
`4.5
`4.6
`4.7
`
`Query Operations
`5.1
`Introduction... 2.2... .. 0.0.0... 002. eee eee
`5.2
`User Relevance Feedback .........0.000 00000 eee
`5.2.1 Query Expansion and Term Reweighting for the Vector
`Model
`.. 2...0.
`‘Term Reweighting for the Probabilistic Model ......
`5.2.2
`5.2.3 A Variant of Probabilistic Term Reweighting.......
`5.2.4 Evaluation of Relevance Feedback Strategies .......
`5.3 Automatic Local Analysis
`.. 5 .....0..0....0..00000.
`5.3.1 Query Expansion Through Local Clustering .......
`5.3.2 Query Expansion Through Local Context Analysis
`5.4 Automatic Global Analysis... ......0..0.0.0....0.4.
`5.4.1 Query Expansion based on a Similarity Thesaurus.
`.
`.
`5.4.2. Query Expansion based on a Statistical Thesaurus
`‘Trends and Research Issues... ..........020000000.
`Bibliographic Discussion .................0000.
`
`5.5
`5.6
`
`Text and Multimedia Languages and Properties
`6.1
`Introduction ........0.. 0.0.0.0 0000002. ee een
`6.2 Metadata... ........2. 0.20000 000000 ee
`6.3
`Text ...........2.....04. oe ee
`6.3.1
`Formats ............ 2.000000 08 eee eee
`6.3.2
`Information Theory.....................
`6.3.3 Modeling Natural Language ................
`6.3.4
`Similarity Models ............0....0..2.004
`6.4 Markup Languages ...............2.00 00000088
`6.4.1
`SGML .... 2...ee
`64.22 HTML ...... 2... 2.2.2.0. 000 00000000084
`643 XML... .. ee eee
`6.5 Multimedia. .........0..0.0..002..0000.002.008.
`6.5.1
`Formats ...............ee ee
`
`‘101
`
`102
`
`103
`104
`
`106
`
`. 108
`
`114
`
`116
`
`117
`
`117
`
`118
`
`120
`
`121
`
`122
`
`123
`
`124
`
`129
`
`131
`
`131
`
`134
`
`137
`
`138
`
`141
`
`141
`
`142
`
`145
`
`148
`
`156
`
`9
`
`

`

`xi
`
`CONTENTS
`
`6.5.4 HyTime ............-.2. Se ee 159
`‘Trends and Research Issues... . 2... 2. eee ee ee es 160
`Bibliographic Discussion ... 1... 1. ee eee eee ee 162
`
`6.6
`6.7
`
`7.3.
`7.4
`
`163
`Text Operations
`7.1
`Introduction...2... 163
`7.2
`Document Preprocessing ... 2... 6 ee ee es 165
`7.2.1 Lexical Analysis of the Text ................ 165
`7.2.2 Elimination of Stopwords................4.
`167
`7.2.8
`Stemming ...... 2... ee ee 168
`7.2.4
`Index Terms Selection .........-2-.52-205.
`169
`7.2.5 Thesauri... 2... ee ee 170
`Document Clustering .......... 0... 0.02205 08.
`173
`Text Compression... 1... 6. ee es 173
`74.1 Motivation...2... 173
`7.4.2 Basic Concepts .... 1... . 0.5. eee ee eee 175
`7.4.3. Statistical Methods ..............2.2.2.028.
`176
`7.4.4 Dictionary Methods.............2.0.22204.
`183
`7.4.5
`Inverted File Compression .............204- 184
`Comparing Text Compression Techniques.............
`186
`Trends and Research Issues... . 2... 2. ee ee es 188
`Bibliographic Discussion ... 2... 6. ee ee ees
`189
`
`7.5
`7.6
`7.7
`
`8.3.
`
`8.4
`8.5
`
`191
`Indexing and Searching
`8.1
`Introduction... 2... ee eee 191
`8.2
`Inverted Files ......So 192
`8.2.1
`Searching
`. 2... 1 ee ee 195
`8.2.2 Construction. ......... 00 ee ee ee .... 196
`Other Indices for Text
`............02.2.2. 0.052004 199
`8.3.1
`Suffix Trees and Suffix Arrays ..............-. 199
`8.3.2
`Signature Files
`.. 2... 6... ee ee 205
`Boolean Queries... 2... ee ee ees 207
`Sequential Searching ©... 1... ee eee 209
`8.5.1 Brute Force ........0-0 20. eee 209
`8.5.2 Knuth-Morris-Pratt .........0.0.0.02.2.2.20 005004 210
`8.5.3 Boyer-Moore Family .. 1... 2.2... eee ee ee 211
`8.5.4
`Shift--Or 2... 0.0. 212
`8.5.5
`Sufix Automaton ............ 000002 2 eee 213
`8.5.6
`Practical Comparison... 2... 0... 0. eee ee 214
`8.5.7
`Phrases and Proximity .......... Ce ee 215
`Pattern Matching .. 2... 0. ee 215
`8.6.1
`String Matching Allowing Errors ............. 216
`8.6.2 Regular Expressions and Extended Patterns ....... 219
`8.6.3. Pattern Matching Using Indices .............. 220
`Structural Queries... 2... ee 222
`
`8.6
`
`8.7
`
`10
`
`

`

`Sequential Searching ..............2..2.2.. 223
`8.8.1
`8.8.2 Compressed Indices. ..............2..2.0.. 224
`Trends and Research Issues... .............2..2.2.. 226
`8.9
`8.10 Bibliographic Discussion .............., ee 227
`
`CONTENTS
`
`xiii
`
`9.2
`
`9.3
`
`229
`Parallel and Distributed IR
`9.1
`Introduction...2... 229
`9.1.1
`Parallel Computing .................0... 230
`9.1.2
`Performance Measures ................... 231
`Parallel IR... 1... ee 232
`9.2.1
`Introduction ................00..000.2.0.. 232
`9.2.2 MIMD Architectures ..........2...2.2202....
`233
`9.2.3
`SIMD Architectures. ......... ee ee 240
`Distributed IR...2. 249
`9.3.1 Introduction...................2--.2.2..
`249
`9.3.2 Collection Partitioning ................... 251
`9.3.3
`Source Selection... .......0..2.0.00.0 08.
`252
`9.3.4 Query Processing ..............0.00000. 253
`9.3.5 WebIssues............... 00-0000 ce ee 254
`Trends and Research Issues... ..........0-000004. 255
`Bibliographic Discussion ..........2.......020.. 206
`
`9.4
`9.5
`
`257
`User Interfaces and Visualization
`207
`10.1.
`Introduction ........0.0.200200200..020.0.202.2...2...,
`10.2. Human-Computer Interaction ................... 258
`10.2.1 Design Principles ...................0.., 208
`10.2.2 The Role of Visualization ...........2.2..... 259
`10.2.3 Evaluating Interactive Systems ........re 261
`10.3. The Information Access Process ..............2... 262
`10.3.1 Models of Interaction .................2... 262
`10.3.2 Non-Search Parts of the Information Access Process
`..
`265
`10.3.3 Earlier Interface Studies ..........2.2.2.2.22...
`266
`10.4 Starting Points
`..........00202.2002.....0.2.204.
`267
`10.4.1 Lists of Collections ...........2...2.2.0.... 267
`10.4.2 Overviews ........20..0.0.. 00000 be eee,
`268
`10.4.3 Examples, Dialogs, and Wizards.............. 276
`10.4.4 Automated Source Selection ................ 278
`10.56 Query Specification .. 2... le 278
`10.5.1 Boolean Queries... ......0.2.0.200......004.
`279
`10.5.2 From Command Lines to Forms and Menus ....... 280
`10.5.3 Faceted Queries .........02...0200-. 22-24% 281
`10.5.4 Graphical Approaches to Query Specification ..... .
`282
`10.5.5 Phrases and Proximity ..................., 286
`10.5.6 Natural Language and Free Text Queries.........
`287
`
`11
`
`

`

`XIV
`
`CONTENTS
`
`10.7
`
`10.6.2 Query Term Hits Within Document Content .......
`10.6.3 Query Term Hits Between Documents ..........
`10.6.4 SuperBook: Context via Table of Contents ........
`10.6.5 Categories for Results Set Context ........ee
`10.6.6 Using Hyperlinks to Organize Retrieval Results .....
`10.6.7 Tables ................0.4.ee ee
`Using Relevance Judgements .................00.4
`10.7.1 Interfaces for Standard Relevance Feedback .......
`10.7.2 Studies of User Interaction with Relevance Feedback
`Systems .......0.0 0000000. cee eee .
`10.7.3 Fetching Relevant Information in the Background ... .
`10.7.4 Group Relevance Judgements .............0..
`10.7.5 Pseudo-Relevance Feedback ................
`Interface Support for the Search Process .............
`10.8.1 Interfaces for String Matching ...............
`10.8.2 Window Management.................004
`10.8.3 Example Systems ..............0.000004
`10.8.4 Examples of Poor Use of Overlapping Windows .....
`10.8.5 Retaining Search History ..................
`(10.8.6 Integrating Scanning, Selection, and Querying ......
`Trends and Research Issues... .........0...002.000.4
`10.9
`10.10 Bibliographic Discussion ..................0204
`
`10.8
`
`11.2
`
`11.3
`
`11 Multimedia IR: Models and Languages
`Introduction ..2...
`11.1
`Data Modeling... ........0.... 2.000000 00 00048
`11.2.1 Multimedia Data Support in Commercial DBMSs ... .
`11.2.2 The MULTOS Data Model. ........0.....00..
`Query Languages .... 2... 2... 0.2.00. eee ee ee eee
`11.3.1 Request Specification ............2.....00.
`11.3.2 Conditions on Multimedia Data. .............
`11.3.3 Uncertainty, Proximity, and Weights in Query
`Expressions .... 2... -. eee
`11.3.4 Some Proposals .................0.02.200,
`Trends and Research Issues... ........0.0200.0.0.0004
`Bibiographic Discussion... .............0.0. 0004
`
`11.4
`
`11.5
`
`12.2
`
`12.3
`
`12 Multimedia IR: Indexing and Searching
`Introduction... 0... ee
`12.1
`Background — Spatial Access Methods ..............
`A Generic Multimedia Indexing Approach ............
`One-dimensional Time Series... ..............04.
`12.4.1 Distance Function. ..............0....00,
`12.4.2 Feature Extraction and Lower-bounding .........
`12.4.3 Experiments ................00.0.00 0004
`
`12.4
`
`289
`
`293
`
`296
`297
`299
`301
`
`303
`
`304
`
`307
`
`308
`
`308
`
`309
`
`309
`
`311
`
`312
`
`317
`
`317
`
`318
`
`321
`
`322
`
`325
`
`325
`328
`329
`
`331
`
`334
`
`335
`335
`
`337
`
`338
`
`341
`
`342
`
`345
`
`345
`
`347
`
`348
`
`393
`
`353
`
`353
`
`355
`
`12
`
`

`

`CONTENTS
`
`XV
`
`12.5.1 Image Features and Distance Functions. ......... 357
`12.5.2 Lower-bounding ................0..00048. 358
`12.5.3 Experiments. .........0.00.0.0.00 0000 8
`4
`360
`12.6 Automatic Feature Extraction ................... 360
`12.7
`‘Trends and Research Issues... .........0...0000404 361 .
`12.8 Bibliographic Discussion .............0...0.00204 363
`
`367
`13 Searching the Web
`13.1
`Introduction... . 2... ee 367
`13.2 Challenges ........ee 368
`13.3 Characterizing the Web.................2..004 369
`13.3.1 Measuring the Web ..................2.4. 369
`13.3.2 Modeling the Web ............0......004. 371
`13.4 Search Engines .......... Bi ee ee
`373
`13.4.1 Centralized Architecture ./.......0..0..0..0.2.,
`373
`13.4.2 Distributed Architecture ...............00.4. 375
`13.4.3 User Interfaces
`............0..0.-.200.0048. 377
`13.4.4 Ranking ............0.0...0. 02.00 000008.4 380
`13.4.5 Crawling the Web.................0000., 382
`13.4.6 Indices...ee 383
`13.5 Browsing... .. 2.2.0... 0.00.00 00000000 eee ee
`384
`13.5.1 Web Directories .. 0... 2. ee el 384
`13.5.2 Combining Searching with Browsing ........... 386
`13.5.3 Helpful Tools
`..........0.0.0.00...000.00..
`387
`13.6 Metasearchers ...............0.. Se ee ee 387
`13.7 Finding the Needle in the Haystack ............... ~.
`389
`13.7.1 User Problems. ..........0...0.0... 00004 389
`13.7.2 Some Examples ...............02. 000004 390
`13.7.3 Teaching the User... .....0....0.0.0....04. 391
`13.8 Searching using Hyperlinks..................... 392
`13.8.1 Web Query Languages ................02... 392
`13.8.2 Dynamic Search and Software Agents. .......... 393
`13.9 Trends and Research Issues... 2. ........... 0000.
`393
`13.10 Bibliographic Discussion ..........0.0..0.. 0.0004 an
`395
`
`397
`14 Libraries and Bibliographical Systems
`14.1
`Introduction... ...........0 0000.00.00 000084 397
`14.2 Online IR Systems and Document Databases. .......... 398
`14.2.1 Databases ....... 0.2.0.0... .002.000048.
`399
`14.2.2 Online Retrieval Systems. ...... ae ee 403
`14.2.3 IR in Online Retrieval Systems ...........0.. 404
`14.2.4 ‘Natural Language’ Searching ............... 406
`14.3 Online Public Access Catalogs (OPACs) ............. 407
`14.3.1 OPACs and Their Content ....... Dee ee ee 408
`
`13
`
`

`

`xvi
`
`CONTENTS
`
`14.3.4 Alternatives to Vendor OPACs .............. A410
`14.4 Libraries and Digital Library Projects. .............. A12
`14.5
`‘Trends and Research Issues... . 2... 2 ee ee A12
`14.6 Bibliographic Discussion ... 2... 1...
`1 eee ee ee es A13
`
`415
`15 Digital Libraries
`15.1
`Imtroduction .... 2.0... ee es A15
`15.2 Definitions ...1... A17
`15.3. Architectural Issues . 2... 2. ee A18
`15.4 Document Models, Representations, and Access ......... A20
`15.4.1 Multilingual Documents ...............204, A20
`15.4.2 Multimedia Documents ...............-000. A21
`15.4.3 Structured Documents ............2-++2005 A421
`15.4.4 Distributed Collections ...............+2480- A22
`15.4.5 Federated Search ... 2... 2. ee eee 424.
`15.4.6 Access
`. 0.00. ee ee 424
`15.5 Prototypes, Projects, and Interfaces................ A25
`15.5.1 International Range of Efforts ............... A27
`15.5.2 Usability... 2... 2. ee ee eee A28
`15.6 Standards ..... 2.0...
`2 eee A429
`15.6.1 Protocols and Federation. ...............6.. 429
`15.6.2 Metadata
`.. 2... ee A30
`‘Trends and Research Issues... .......-2.-000 50085 A31
`15.7
`15.8 Bibliographical Discussion .......: 2.2.2... 005022044 432
`
`Appendix: Porter’s Algorithm
`
`Glossary
`
`References
`
`Index
`
`433
`
`A37
`
`455
`
`501
`
`14
`
`

`

`Chapter 1
`introduction
`
`
`
`1.1 Motivation
`
`Information retrieval (IR) deals with the representation, storage, organization
`of, and access to information items. The representation and organization of the
`information items should provide the user with easy access to the information in
`which he is interested. Unfortunately, characterization of the user information
`need is not a simple problem. Consider, for instance, the following hypothetical
`user information need in the context of the World Wide Web (or just the Web):
`
`_ Find all the pages (documents) containing information on college ten-
`‘nis teams which: (1) are maintained by an university in the USA and
`(2) participate in the NCAA tennis tournament. To be relevant, the
`page must include information on the national ranking of the team
`in the last three years and the email or phone number of the team
`coach.
`
`Clearly, this full description of the user information need cannot be used directly
`to request information using the current interfaces of Web search engines.
`In-
`stead, the user must first translate this information need into a query which can
`be processed by the search engine (or IR system).
`In its most commonform, this translation yields a set of keywords (or index
`terms) which summarizes the description of the user information need. Given the
`user query, the key goal of an IR system is to retrieve information which might
`be useful or relevant to the user. The emphasis is on the retrieval of information
`as opposed to theretrieval of data.
`
`1.1.1
`
`Information versus Data Retrieval
`
`Data retrieval, in the context of an IR system, consists mainly of determining
`which documents of a collection contain the keywords in the user query which,
`
`15
`
`

`

`2
`
`INTRODUCTION
`
`subject than with retrieving data which satisfies a given query. A data retrieval
`language aims at retrieving all objects which satisfy clearly defined conditions
`such as those in a regular expression or in a relational algebra expression. Thus,
`for a data retrieval system, a single erroneous object among a thousandretrieved
`objects means total failure. For an information retrieval system, however, the
`retrieved objects might be inaccurate and small errors are likely to go unnoticed.
`The main reason for this difference is that information retrieval usually deals
`with natural language text which is not always well structured and could be
`semantically ambiguous. On the other hand, a data retrieval system (such as
`a relational database) deals with data that has a well defined structure and
`semantics.
`,
`Data retrieval, while providing a solution to the user of a database system,
`does not solve the problem of retrieving information about a subject or topic.
`To beeffective in its attempt to satisfy the user information need, the IR system
`must somehow ‘interpret’ the contents of the information items (documents)
`in a collection and rank them according to a degree of relevance to the user
`query. This ‘interpretation’ of a document content involves extracting syntactic
`and semantic information from the document text and using this information
`to match the user information need. The difficulty is not only knowing how
`to extract this information but also knowing how to use it to decide relevance.
`Thus, the notion of relevance is at the center of information retrieval. In fact, the
`primary goal of an IR system is to retrieve all the documents which are relevant
`to a user query while retrieving as few non-relevant documents as possible.
`
`1.1.2
`
`Information Retrieval at the Center of the Stage
`
`In the past 20 years, the area of information retrieval has grown well beyondits
`primary goals of indexing text and searching for useful documents in a collec-
`tion. Nowadays, research in IR includes modeling, documentclassification and
`categorization, systems architecture, user interfaces, data visualization,filtering,
`languages, etc. Despite its maturity, until recently, IR was seen as a narrow
`area of interest mainly to librarians and information experts. Such a tenden-
`tious vision prevailed for many years, despite the rapid dissemination, among
`users of modern personal computers, of IR tools for multimedia and hypertext
`applications. In the beginning of the 1990s, a single fact changed once and for
`all these perceptions — the introduction of the World Wide Web.
`The Web is becoming a universal repository of human knowledge and cul-
`ture which has allowed unprecedent sharing of ideas and information in a scale
`never seen before.
`Its success is based on the conception of a standard user
`interface which is always the same no matter what computational environment
`is used to run the interface. As a result, the user is shielded from details of
`communication protocols, machine location, and operating systems. Further,
`any user can create his own Web documents and make them point to any other
`Web documents without restrictions. This is a key aspect because it turns the
`
`16
`
`

`

`BASIC CONCEPTS
`3
`consequence, any Web user can push his personal agenda with little effort and
`almost at no cost. This universe without frontiers has attracted tremendous
`attention from millions of people everywhere since the very beginning. Further-
`more, it is causing a revolution in the way people use computers and perform
`their daily tasks. For instance, home shopping and home banking are becoming
`very popular and have generated several hundred million dollars in revenues.
`Despite so much success, the Web has introduced new problems of its own.
`Finding useful information on the Web is frequently a tedious and difficult task.
`For instance, to satisfy his information need, the user might navigate the space
`of Weblinks (ie., the hyperspace)
`searching for information of interest. How-
`ever, since the hyperspace is vast and almost unknown, such a navigation task is
`usually inefficient. For naive users, the problem becomes harder, which might en-
`tirely frustrate all their efforts. The main obstacle is the absence of a well defined
`underlying data model for the Web, which implies that information definition
`and structure is frequently of low quality. These difficulties have attracted re-
`newedinterest in IR andits techniques as promising solutions. As a result, almost
`overnight, IR has gained a place with other technologies at the center of the stage.
`
`1.1.3.
`
`Focus of the Book
`
`Despite the great increase in interest in information retrieval, modern textbooks
`on IR with a broad (and extensive) coverage of the various topics in the field
`are still difficult to find.
`In an attempt to partially fulfill this gap, this book
`presents an overall view of research in IR from a computerscientist’s perspec-
`tive. This means that the focus of the book is on computer algorithms and
`techniques used in information retrieval systems. A rather distinct viewpoint
`is taken by librarians and information science researchers, who adopt a human-
`centered interpretation of the IR problem.
`In this interpretation, the focus is
`on trying to understand how people interpret and use information as opposed
`to how to structure, store, and retrieve information automatically. While most
`of this book is dedicated to the computer scientist’s viewpoint of the IR prob-
`lem, the human-centered viewpoint is discussed to some extent in the last two
`chapters.
`~
`We put great emphasis on the integration of the different areas which are
`closed related to the information retrieval problem and thus, should be treated
`together. For that reason, besides covering text retrieval, library systems, user
`interfaces, and the Web, this book also discusses visualization, multimedia, re-
`trieval, and digital libraries.
`
`1.2 Basic Concepts
`Theeffective retrieval of relevant informationis directly affected both by the user
`
`17
`
`

`

`4
`
`INTRODUCTION
`
`Retrieval
`
`Browsingi)
`
`Database
`
`Figure 1.1
`
`Interaction of the user with the retrieval system through distinct tasks.
`
`1.2.1 -The User Task
`
`The user of a retrieval system has to translate his information need into a query
`in the language provided by the system. With an information retrieval system,
`this normally implies specifying a set of words which convey the semantics of
`the information need. With a data retrieval system, a query expression (such as,
`for instance, a regular expression) is used to convey the constraints that must
`be satisfied by objects in the answer set.
`In both cases, we say that the user
`searches for useful information executing a retrieval task.
`Consider now a user who has an interest which is either poorly defined
`or which is inherently broad. For instance,
`the user might be interested in
`documents about car racing in general.
`In this situation, the user might use
`an interactive interface to simply look around in the collection for documents
`related to car racing. For instance, he might find interesting documents about
`Formula 1 racing, about car manufacturers, or about the ‘24 Hours of Le Mans.’
`Furthermore, while reading about the ‘24 Hours of Le Mans’, he might turn his
`attention to a document which provides directions to Le Mans and, from there,
`to documents which cover tourism in France.
`In this situation, we say that
`the user is browsing the documents in the collection, not searching. It is still a
`process of retrieving information, but one whose main objectives are not clearly
`defined in the beginning and whose purpose might change during the interaction
`with the system.
`~
`In this book, we make a clear distinction between the different tasks the
`user of the retrieval system might be engaged in. His task might be of two distinct
`types:
`information or data retrieval and browsing. Classic information retrieval
`systems normally allow information or data retrieval. Hypertext systems are
`usually tuned for providing quick browsing. Modern digital library and Web
`interfaces might attempt to combine these tasks to provide improved retrieval
`
`18
`
`

`

`BASIC CONCEPTS
`
`5
`
`established approach and is not the dominant paradigm.
`Figure 1.1 illustrates the interaction of the user through the different tasks
`we identify. Information and data retrieval are usually provided by most modern
`information retrieval systems (such as Web interfaces). Further, such systems
`might also provide some(still limited) form of browsing. While combining infor-.
`mation and data retrieval with browsing is not yet a commonpractice, it might
`becomeso in the future.
`Both retrieval and browsing are, in the language of the World Wide Web,
`‘pulling’ actions. That is, the user requests the information in an interactive
`manner. An alternative is to do retrieval in

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket