`Information
`Atg tee]
`
`WESLEY Comcast, Exhibit-1005
`
`Ricardo Baeza-Yates
`
`Berthier Ribeiro-Neto
`
`Comcast, Exhibit-1005
`
`1
`
`
`
`ModernInformation Retrieval
`
`Ricardo Baeza-Yates
`Berthier Ribeiro—Neto
`
`
`
`ACM Press
`
`New York
`
`aV
`
`V
`
`PEARSON
`
`ddison
`Meier
`
`aA
`
`2
`
`
`
`Pearson Education Limited
`
`Edinburgh Gate
`Harlow
`Essex CM20 2JE
`
`England
`
`and Associated Companies throughout the world
`
`Visit us on the World Wide Webat:
`www.pearsoned.co.uk
`
`Copyright © 1999 by the ACMpress, A Division of the Association for Computing
`Machinary, Inc. (ACM).
`
`The rights of the authors of this Work have been asserted by them in accordance with
`the-Copyright, Designs and Patents Act 1988.
`
`All rights reserved. No part ofthis publication may be reproduced, stored in a
`retrieval system, or transmitted in any form or by any means, electronic, mechanical,
`photocopying, recording or otherwise, without either the prior written permission of
`the publisher or a licence permitting restricted copying in the United Kingdom issued
`by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London WIT 4LP.
`
`While the publisher has made every attemptto trace all copyright owners and obtain
`permission to reproduce material, in a few cases this has proved impossible.
`Copyright holders of material which has not been acknowledged are encouraged to
`contact the publisher.
`:
`
`Many of the designations used by manufacturers andsellers to distinguish their
`products are claimed as trademarks. Addison Wesley Longman Limited has made
`every attempt to supply trade mark information about manufacturers and their
`products mentionedin this book. A list of the trademark designations and their
`owners appears on page viii.
`
`Typeset in Computer Modern by 56
`Printed and bound in the United States of America
`
`First printed 1999
`
`ISBN 0-201-39829-X
`
`British Library Cataloguing-in-Publication Data
`A catalogue record for this book is available from the British Library
`
`Library of Congress Cataloging-in-Publication Data
`Baeza-Yates, R. (Ricardo)
`- Modern information retrieval / Ricardo Baeza-Yates, Berthier Ribeiro-Neto.
`p. cm.
`Includes bibliographical reference and index.
`ISBN 0-201-39829-X
`
`1. Information storage and retrieval systems. I. Ribeiro, Berthier de Araiijo
`Neto, 1960-.
`II. Title.
`Z667.B34
`1999
`025.04—dce21
`
`99-10033
`CIP
`
`3
`
`
`
`Preface
`
`
`
`Information retrieval (IR.) has changed considerably in recent years with the expansion
`of the World Wide Web and the advent of modern and inexpensive graphical user
`interfaces and mass storage devices. As a result, traditional IR textbooks have become
`quite out of date and this has led to the introduction of new IR. books. Nevertheless, we
`believe that there is still great need for a book that approaches thefield in a rigorous
`and complete way from a computer-science perspective (as opposed to a user-centered
`perspective). This book is an effort to partially fulfill this gap and should be useful for
`a first course on information retrieval as well as for a graduate course on the topic.
`The book comprises two portions which complement and balance each other.
`The core portion includes nine chapters authored or coauthored by the designers of
`the book. The second portion, which is fully integrated with the first, is formed by
`six state-of-the-art chapters written by leading researchers in their fields. The same
`notation and glossary are used in all the chapters. Thus, despite the fact that several
`people have contributed to the text, this book is really much more a textbook than
`an edited collection of chapters written by separate authors. Furthermore, unlike a
`collection of chapters, we have carefully designed the contents and organization of the
`book to present a cohesive view of all the important aspects of modern information
`retrieval.
`'
`:
`From IR models to indexing text, from IR visual tools and interfaces to the Web,
`from IR. multimedia to digital libraries, the book provides both breadth of coverage and
`richness of detail. It is our hope that, given the now clear relevance and significance of
`information retrieval to modern society, the book will contribute to further disseminate
`the study of the discipline at information science, computer science, and library science
`departments throughout the world.
`
`
`
`Ricardo Baeza-Yates, Santiago, Chile
`Berthier Ribeiro-Neto, Belo Horizonte, Brazil
`January, 1999
`
`4
`
`
`
`To Helena, Rosa, and our children
`
`territério de homenslivres
`que seré nosso pais
`e sera patria de todos.
`Irm4aos, cantai ese mundo
`que nao verei, mas vira
`um dia, dentro de mil anos,
`talvez mais. ..nao tenho pressa.
`
`de Cidade Prevista no livro
`A Rosa do Povo, 1945
`
`Carlos Drummond de Andrade
`
`territory of free men
`that will be our country
`and will be the nation ofall
`Brothers, sing this world
`which I’ll not see, but which will come
`one day, in a thousand years,
`maybe more...no hurry.
`
`from Prevised City in the book
`The Rose of the People, 1945
`
`Amolos libros
`exploradores,
`libros con bosque o nieve,
`profundidad o cielo
`
`de Oda al Libro(I),
`
`Pablo Neruda
`
`I love books
`that explore,
`books with a forest or snow,
`depth or sky
`
`from Ode to the Book (I),
`
`5
`
`
`
`vill
`
`ACKNOWLEDGEMENTS
`
`Fifth, we thank the support of our institutions, the Departments of Com-
`puter Science of the University of Chile and of the Federal University of Minas
`Gerais, as well as the funding provided by national research agencies (CNPq in
`Brazil and CONICYTin Chile) and international collaboration projects, in par-
`ticular CYTED project VII.13 AMYRI (Environment for Information Managing
`and Retrieval in the World Wide Web) and Finep project SIAM (Information
`Systems for Mobile Computers) under the Pronex program.
`Most important, to Helena, Rosa, and our children, who put up with a
`string of trips abroad, lost weekends, and odd working hours.
`
`List of Trademarks
`Alta Vista is a trademark of Compaq Computer Corporation
`FrameMaker is a trademark of Adobe Systems Incorporated
`IBM SP2 is a trademark of International Business Machines Corporation
`Netscape Communicator is a trademark of Netscape Communications Corporation
`Solaris, Sun 3/50 and Sun UltraSparc-1 are trademarks of Sun Microsystems, Inc.
`Thinking Machines CM-2 is a trademark of Thinking Machines Corporation
`Unix is licensed through X/Open Company Ltd
`
`6
`
`
`
`Contents
`
`
`
`Preface
`
`Acknowledgements
`
`Biographies
`
`Vv
`
`vii
`
`Xvii
`
`1.2
`
`1.3
`
`1
`Introduction
`1.1. Motivation... 2.2... ee, 1
`1.1.1
`Information versus Data Retrieval. ............
`1
`1.1.2
`Information Retrieval at the Center of the Stage... ..
`2
`‘1.1.38
`Focus of the Book... .....0.0.0002...02...0..
`3
`Basic Concepts
`.........0..0.0 0000000 eee eee
`3
`1.2.1 The User Task... 2. ee en 4
`1.2.2 Logical View of the Documents ..............
`5
`Past, Present, and Future ..................0..
`6
`1.3.1 Early Developments... ................0..,
`6
`1.3.2.
`Information Retrieval in the Library ...........
`7
`1.3.3 The Web and Digital Libraries. .........2.2002.
`7
`1.3.4
` PracticalIssues ...........0.......2..00.
`8
`The Retrieval Process. ........0.000..000. 000004 8
`9
`(Organization of the Book. ..........0.0..0. 00.0004
`LO
`1.5.1 Book Topics ...........00.0.00...0.000.4 Ll
`1.5.2 Book Chapters ....................00.
`12
`1.66 Howto Use this Book: ........0..0..0.0.....0..
`15
`1.6.1 Teaching Suggestions ....... re 15
`1.6.2 The Book’s Web Page ................0..
`16
`Bibliographic Discussion .........002...... 004004
`17
`
`1.4
`1.5
`
`1.7
`
`19
`2 Modeling
`2.1
`Introduction ..............00.. Se eee 19
`
`7
`
`
`
`CONTENTS
`
`2.4
`2.5
`
`2.6
`
`2.7
`
`23
`<A Formal Characterization of IR Models .............
`24
`Classic Information Retrieval. ............02. 050084
`24
`2.5.1 Basic ConceptS ... 1.2... 0. ee ee ee
`2.5.2 Boolean Model .......... 0.20 eee ee eee 25
`2.5.3 Vector Model ......ee 27
`2.5.4
`Probabilistic Model .............-2.2. 2.50008.
`30
`2.5.5 Brief Comparison of Classic Models ............
`34
`Alternative Set Theoretic Models -...........2200.4-
`34
`2.6.1
`Fuzzy Set Model ......ee ee 34
`2.6.2 Extended Boolean Model..................
`38
`Alternative Algebraic Models ..............2004.
`Al
`2.7.1 Generalized Vector Space Model... ...........
`Al
`2.7.2. Latent Semantic Indexing Model .............
`4A
`2.7.3. Neural Network Model .................0..
`46
`Alternative Probabilistic Models... .............04.4.
`A8
`2.8.1 Bayesian Networks .... 2.2.2... 2.2.2. 02002208.
`48
`2.8.2
`Inference Network Model..................
`49
`2.8.3. Belief Network Model... ..........-...206:
`56
`2.8.4 Comparison of Bayesian Network Models.........
`59
`2.8.5 Computational Costs of Bayesian Networks .......
`60
`2.8.6 The Impact of Bayesian Network Models .........
`61
`Structured Text Retrieval Models .................
`61
`2.9.1 Model Based on Non-Overlapping Lists. .........
`62
`2.9.2. Model Based on Proximal Nodes .............
`63
`2.10 Models for Browsing .... 1... eee ee ee 65
`2.10.1 Flat Browsing ............ 2.002.220 00 008.
`65
`2.10.2 Structure Guided Browsing ................
`66
`2.10.3 The Hypertext Model... .........2.02.2.404.
`66
`‘Trends and Research Issues... .. 2.2... 2.0040 eee eae
`69
`2.11
`2.12 Bibliographic Discussion ..............064%Fe ee 69
`
`2.8
`
`2.9
`
`
`
`73
`Retrieval Evaluation
`3.1
`Introduction .. 1... ee 73
`3.2
`Retrieval Performance Evaluation ...............4.2.
`74
`3.2.1 Recall and Precision ................0.04.
`75
`3.2.2 Alternative Measures ...........--2-02+02084 82
`3.3 Reference Collections ... 1... 0.02.00. pe eee ee es .6684
`3.3.1 The TREC Collection ................0.4.
`84
`3.3.2 The CACM and ISI Collections ..............
`91
`3.3.3 The Cystic Fibrosis Collection ...............
`94
`‘Trends and Research Issues. .........-..0-00-0. 502004
`96
`Bibliographic Discussion ... 2... 6... 2 ee ee eee ee 96
`
`3.4
`3.5
`
`99
`Query Languages
`4.1
`Introduction ..0... 99
`
`8
`
`
`
`CONTENTS
`
`xi
`
`Single-Word Queries ...........ek ee eee
`4.2.1
`4.2.2 Context Queries... ..........2....0.02.20004
`4.2.3 Boolean Queries... .............0.000008.4
`4.2.4 Natural Language... ..........0.00.2.0000.%
`Pattern Matching ..................00..00.0000.
`Structural Queries... .....0...0...0.0...20- 00004
`4.4.1
`Fixed Structure ............0.0....2..0000.
`4.4.2 Hypertext .. 2... 2... ee ee ee
`4.4.3 Hierarchical Structure ..................0.,
`Query Protocols. ..........0 000000 eee eee eee.
`‘Trends and Research Issues. ..............00000.4
`Bibliographic Discussion ............0....0.000.
`
`4.3
`4.4
`
`4.5
`4.6
`4.7
`
`Query Operations
`5.1
`Introduction... 2.2... .. 0.0.0... 002. eee eee
`5.2
`User Relevance Feedback .........0.000 00000 eee
`5.2.1 Query Expansion and Term Reweighting for the Vector
`Model
`.. 2...0.
`‘Term Reweighting for the Probabilistic Model ......
`5.2.2
`5.2.3 A Variant of Probabilistic Term Reweighting.......
`5.2.4 Evaluation of Relevance Feedback Strategies .......
`5.3 Automatic Local Analysis
`.. 5 .....0..0....0..00000.
`5.3.1 Query Expansion Through Local Clustering .......
`5.3.2 Query Expansion Through Local Context Analysis
`5.4 Automatic Global Analysis... ......0..0.0.0....0.4.
`5.4.1 Query Expansion based on a Similarity Thesaurus.
`.
`.
`5.4.2. Query Expansion based on a Statistical Thesaurus
`‘Trends and Research Issues... ..........020000000.
`Bibliographic Discussion .................0000.
`
`5.5
`5.6
`
`Text and Multimedia Languages and Properties
`6.1
`Introduction ........0.. 0.0.0.0 0000002. ee een
`6.2 Metadata... ........2. 0.20000 000000 ee
`6.3
`Text ...........2.....04. oe ee
`6.3.1
`Formats ............ 2.000000 08 eee eee
`6.3.2
`Information Theory.....................
`6.3.3 Modeling Natural Language ................
`6.3.4
`Similarity Models ............0....0..2.004
`6.4 Markup Languages ...............2.00 00000088
`6.4.1
`SGML .... 2...ee
`64.22 HTML ...... 2... 2.2.2.0. 000 00000000084
`643 XML... .. ee eee
`6.5 Multimedia. .........0..0.0..002..0000.002.008.
`6.5.1
`Formats ...............ee ee
`
`‘101
`
`102
`
`103
`104
`
`106
`
`. 108
`
`114
`
`116
`
`117
`
`117
`
`118
`
`120
`
`121
`
`122
`
`123
`
`124
`
`129
`
`131
`
`131
`
`134
`
`137
`
`138
`
`141
`
`141
`
`142
`
`145
`
`148
`
`156
`
`9
`
`
`
`xi
`
`CONTENTS
`
`6.5.4 HyTime ............-.2. Se ee 159
`‘Trends and Research Issues... . 2... 2. eee ee ee es 160
`Bibliographic Discussion ... 1... 1. ee eee eee ee 162
`
`6.6
`6.7
`
`7.3.
`7.4
`
`163
`Text Operations
`7.1
`Introduction...2... 163
`7.2
`Document Preprocessing ... 2... 6 ee ee es 165
`7.2.1 Lexical Analysis of the Text ................ 165
`7.2.2 Elimination of Stopwords................4.
`167
`7.2.8
`Stemming ...... 2... ee ee 168
`7.2.4
`Index Terms Selection .........-2-.52-205.
`169
`7.2.5 Thesauri... 2... ee ee 170
`Document Clustering .......... 0... 0.02205 08.
`173
`Text Compression... 1... 6. ee es 173
`74.1 Motivation...2... 173
`7.4.2 Basic Concepts .... 1... . 0.5. eee ee eee 175
`7.4.3. Statistical Methods ..............2.2.2.028.
`176
`7.4.4 Dictionary Methods.............2.0.22204.
`183
`7.4.5
`Inverted File Compression .............204- 184
`Comparing Text Compression Techniques.............
`186
`Trends and Research Issues... . 2... 2. ee ee es 188
`Bibliographic Discussion ... 2... 6. ee ee ees
`189
`
`7.5
`7.6
`7.7
`
`8.3.
`
`8.4
`8.5
`
`191
`Indexing and Searching
`8.1
`Introduction... 2... ee eee 191
`8.2
`Inverted Files ......So 192
`8.2.1
`Searching
`. 2... 1 ee ee 195
`8.2.2 Construction. ......... 00 ee ee ee .... 196
`Other Indices for Text
`............02.2.2. 0.052004 199
`8.3.1
`Suffix Trees and Suffix Arrays ..............-. 199
`8.3.2
`Signature Files
`.. 2... 6... ee ee 205
`Boolean Queries... 2... ee ee ees 207
`Sequential Searching ©... 1... ee eee 209
`8.5.1 Brute Force ........0-0 20. eee 209
`8.5.2 Knuth-Morris-Pratt .........0.0.0.02.2.2.20 005004 210
`8.5.3 Boyer-Moore Family .. 1... 2.2... eee ee ee 211
`8.5.4
`Shift--Or 2... 0.0. 212
`8.5.5
`Sufix Automaton ............ 000002 2 eee 213
`8.5.6
`Practical Comparison... 2... 0... 0. eee ee 214
`8.5.7
`Phrases and Proximity .......... Ce ee 215
`Pattern Matching .. 2... 0. ee 215
`8.6.1
`String Matching Allowing Errors ............. 216
`8.6.2 Regular Expressions and Extended Patterns ....... 219
`8.6.3. Pattern Matching Using Indices .............. 220
`Structural Queries... 2... ee 222
`
`8.6
`
`8.7
`
`10
`
`
`
`Sequential Searching ..............2..2.2.. 223
`8.8.1
`8.8.2 Compressed Indices. ..............2..2.0.. 224
`Trends and Research Issues... .............2..2.2.. 226
`8.9
`8.10 Bibliographic Discussion .............., ee 227
`
`CONTENTS
`
`xiii
`
`9.2
`
`9.3
`
`229
`Parallel and Distributed IR
`9.1
`Introduction...2... 229
`9.1.1
`Parallel Computing .................0... 230
`9.1.2
`Performance Measures ................... 231
`Parallel IR... 1... ee 232
`9.2.1
`Introduction ................00..000.2.0.. 232
`9.2.2 MIMD Architectures ..........2...2.2202....
`233
`9.2.3
`SIMD Architectures. ......... ee ee 240
`Distributed IR...2. 249
`9.3.1 Introduction...................2--.2.2..
`249
`9.3.2 Collection Partitioning ................... 251
`9.3.3
`Source Selection... .......0..2.0.00.0 08.
`252
`9.3.4 Query Processing ..............0.00000. 253
`9.3.5 WebIssues............... 00-0000 ce ee 254
`Trends and Research Issues... ..........0-000004. 255
`Bibliographic Discussion ..........2.......020.. 206
`
`9.4
`9.5
`
`257
`User Interfaces and Visualization
`207
`10.1.
`Introduction ........0.0.200200200..020.0.202.2...2...,
`10.2. Human-Computer Interaction ................... 258
`10.2.1 Design Principles ...................0.., 208
`10.2.2 The Role of Visualization ...........2.2..... 259
`10.2.3 Evaluating Interactive Systems ........re 261
`10.3. The Information Access Process ..............2... 262
`10.3.1 Models of Interaction .................2... 262
`10.3.2 Non-Search Parts of the Information Access Process
`..
`265
`10.3.3 Earlier Interface Studies ..........2.2.2.2.22...
`266
`10.4 Starting Points
`..........00202.2002.....0.2.204.
`267
`10.4.1 Lists of Collections ...........2...2.2.0.... 267
`10.4.2 Overviews ........20..0.0.. 00000 be eee,
`268
`10.4.3 Examples, Dialogs, and Wizards.............. 276
`10.4.4 Automated Source Selection ................ 278
`10.56 Query Specification .. 2... le 278
`10.5.1 Boolean Queries... ......0.2.0.200......004.
`279
`10.5.2 From Command Lines to Forms and Menus ....... 280
`10.5.3 Faceted Queries .........02...0200-. 22-24% 281
`10.5.4 Graphical Approaches to Query Specification ..... .
`282
`10.5.5 Phrases and Proximity ..................., 286
`10.5.6 Natural Language and Free Text Queries.........
`287
`
`11
`
`
`
`XIV
`
`CONTENTS
`
`10.7
`
`10.6.2 Query Term Hits Within Document Content .......
`10.6.3 Query Term Hits Between Documents ..........
`10.6.4 SuperBook: Context via Table of Contents ........
`10.6.5 Categories for Results Set Context ........ee
`10.6.6 Using Hyperlinks to Organize Retrieval Results .....
`10.6.7 Tables ................0.4.ee ee
`Using Relevance Judgements .................00.4
`10.7.1 Interfaces for Standard Relevance Feedback .......
`10.7.2 Studies of User Interaction with Relevance Feedback
`Systems .......0.0 0000000. cee eee .
`10.7.3 Fetching Relevant Information in the Background ... .
`10.7.4 Group Relevance Judgements .............0..
`10.7.5 Pseudo-Relevance Feedback ................
`Interface Support for the Search Process .............
`10.8.1 Interfaces for String Matching ...............
`10.8.2 Window Management.................004
`10.8.3 Example Systems ..............0.000004
`10.8.4 Examples of Poor Use of Overlapping Windows .....
`10.8.5 Retaining Search History ..................
`(10.8.6 Integrating Scanning, Selection, and Querying ......
`Trends and Research Issues... .........0...002.000.4
`10.9
`10.10 Bibliographic Discussion ..................0204
`
`10.8
`
`11.2
`
`11.3
`
`11 Multimedia IR: Models and Languages
`Introduction ..2...
`11.1
`Data Modeling... ........0.... 2.000000 00 00048
`11.2.1 Multimedia Data Support in Commercial DBMSs ... .
`11.2.2 The MULTOS Data Model. ........0.....00..
`Query Languages .... 2... 2... 0.2.00. eee ee ee eee
`11.3.1 Request Specification ............2.....00.
`11.3.2 Conditions on Multimedia Data. .............
`11.3.3 Uncertainty, Proximity, and Weights in Query
`Expressions .... 2... -. eee
`11.3.4 Some Proposals .................0.02.200,
`Trends and Research Issues... ........0.0200.0.0.0004
`Bibiographic Discussion... .............0.0. 0004
`
`11.4
`
`11.5
`
`12.2
`
`12.3
`
`12 Multimedia IR: Indexing and Searching
`Introduction... 0... ee
`12.1
`Background — Spatial Access Methods ..............
`A Generic Multimedia Indexing Approach ............
`One-dimensional Time Series... ..............04.
`12.4.1 Distance Function. ..............0....00,
`12.4.2 Feature Extraction and Lower-bounding .........
`12.4.3 Experiments ................00.0.00 0004
`
`12.4
`
`289
`
`293
`
`296
`297
`299
`301
`
`303
`
`304
`
`307
`
`308
`
`308
`
`309
`
`309
`
`311
`
`312
`
`317
`
`317
`
`318
`
`321
`
`322
`
`325
`
`325
`328
`329
`
`331
`
`334
`
`335
`335
`
`337
`
`338
`
`341
`
`342
`
`345
`
`345
`
`347
`
`348
`
`393
`
`353
`
`353
`
`355
`
`12
`
`
`
`CONTENTS
`
`XV
`
`12.5.1 Image Features and Distance Functions. ......... 357
`12.5.2 Lower-bounding ................0..00048. 358
`12.5.3 Experiments. .........0.00.0.0.00 0000 8
`4
`360
`12.6 Automatic Feature Extraction ................... 360
`12.7
`‘Trends and Research Issues... .........0...0000404 361 .
`12.8 Bibliographic Discussion .............0...0.00204 363
`
`367
`13 Searching the Web
`13.1
`Introduction... . 2... ee 367
`13.2 Challenges ........ee 368
`13.3 Characterizing the Web.................2..004 369
`13.3.1 Measuring the Web ..................2.4. 369
`13.3.2 Modeling the Web ............0......004. 371
`13.4 Search Engines .......... Bi ee ee
`373
`13.4.1 Centralized Architecture ./.......0..0..0..0.2.,
`373
`13.4.2 Distributed Architecture ...............00.4. 375
`13.4.3 User Interfaces
`............0..0.-.200.0048. 377
`13.4.4 Ranking ............0.0...0. 02.00 000008.4 380
`13.4.5 Crawling the Web.................0000., 382
`13.4.6 Indices...ee 383
`13.5 Browsing... .. 2.2.0... 0.00.00 00000000 eee ee
`384
`13.5.1 Web Directories .. 0... 2. ee el 384
`13.5.2 Combining Searching with Browsing ........... 386
`13.5.3 Helpful Tools
`..........0.0.0.00...000.00..
`387
`13.6 Metasearchers ...............0.. Se ee ee 387
`13.7 Finding the Needle in the Haystack ............... ~.
`389
`13.7.1 User Problems. ..........0...0.0... 00004 389
`13.7.2 Some Examples ...............02. 000004 390
`13.7.3 Teaching the User... .....0....0.0.0....04. 391
`13.8 Searching using Hyperlinks..................... 392
`13.8.1 Web Query Languages ................02... 392
`13.8.2 Dynamic Search and Software Agents. .......... 393
`13.9 Trends and Research Issues... 2. ........... 0000.
`393
`13.10 Bibliographic Discussion ..........0.0..0.. 0.0004 an
`395
`
`397
`14 Libraries and Bibliographical Systems
`14.1
`Introduction... ...........0 0000.00.00 000084 397
`14.2 Online IR Systems and Document Databases. .......... 398
`14.2.1 Databases ....... 0.2.0.0... .002.000048.
`399
`14.2.2 Online Retrieval Systems. ...... ae ee 403
`14.2.3 IR in Online Retrieval Systems ...........0.. 404
`14.2.4 ‘Natural Language’ Searching ............... 406
`14.3 Online Public Access Catalogs (OPACs) ............. 407
`14.3.1 OPACs and Their Content ....... Dee ee ee 408
`
`13
`
`
`
`xvi
`
`CONTENTS
`
`14.3.4 Alternatives to Vendor OPACs .............. A410
`14.4 Libraries and Digital Library Projects. .............. A12
`14.5
`‘Trends and Research Issues... . 2... 2 ee ee A12
`14.6 Bibliographic Discussion ... 2... 1...
`1 eee ee ee es A13
`
`415
`15 Digital Libraries
`15.1
`Imtroduction .... 2.0... ee es A15
`15.2 Definitions ...1... A17
`15.3. Architectural Issues . 2... 2. ee A18
`15.4 Document Models, Representations, and Access ......... A20
`15.4.1 Multilingual Documents ...............204, A20
`15.4.2 Multimedia Documents ...............-000. A21
`15.4.3 Structured Documents ............2-++2005 A421
`15.4.4 Distributed Collections ...............+2480- A22
`15.4.5 Federated Search ... 2... 2. ee eee 424.
`15.4.6 Access
`. 0.00. ee ee 424
`15.5 Prototypes, Projects, and Interfaces................ A25
`15.5.1 International Range of Efforts ............... A27
`15.5.2 Usability... 2... 2. ee ee eee A28
`15.6 Standards ..... 2.0...
`2 eee A429
`15.6.1 Protocols and Federation. ...............6.. 429
`15.6.2 Metadata
`.. 2... ee A30
`‘Trends and Research Issues... .......-2.-000 50085 A31
`15.7
`15.8 Bibliographical Discussion .......: 2.2.2... 005022044 432
`
`Appendix: Porter’s Algorithm
`
`Glossary
`
`References
`
`Index
`
`433
`
`A37
`
`455
`
`501
`
`14
`
`
`
`Chapter 1
`introduction
`
`
`
`1.1 Motivation
`
`Information retrieval (IR) deals with the representation, storage, organization
`of, and access to information items. The representation and organization of the
`information items should provide the user with easy access to the information in
`which he is interested. Unfortunately, characterization of the user information
`need is not a simple problem. Consider, for instance, the following hypothetical
`user information need in the context of the World Wide Web (or just the Web):
`
`_ Find all the pages (documents) containing information on college ten-
`‘nis teams which: (1) are maintained by an university in the USA and
`(2) participate in the NCAA tennis tournament. To be relevant, the
`page must include information on the national ranking of the team
`in the last three years and the email or phone number of the team
`coach.
`
`Clearly, this full description of the user information need cannot be used directly
`to request information using the current interfaces of Web search engines.
`In-
`stead, the user must first translate this information need into a query which can
`be processed by the search engine (or IR system).
`In its most commonform, this translation yields a set of keywords (or index
`terms) which summarizes the description of the user information need. Given the
`user query, the key goal of an IR system is to retrieve information which might
`be useful or relevant to the user. The emphasis is on the retrieval of information
`as opposed to theretrieval of data.
`
`1.1.1
`
`Information versus Data Retrieval
`
`Data retrieval, in the context of an IR system, consists mainly of determining
`which documents of a collection contain the keywords in the user query which,
`
`15
`
`
`
`2
`
`INTRODUCTION
`
`subject than with retrieving data which satisfies a given query. A data retrieval
`language aims at retrieving all objects which satisfy clearly defined conditions
`such as those in a regular expression or in a relational algebra expression. Thus,
`for a data retrieval system, a single erroneous object among a thousandretrieved
`objects means total failure. For an information retrieval system, however, the
`retrieved objects might be inaccurate and small errors are likely to go unnoticed.
`The main reason for this difference is that information retrieval usually deals
`with natural language text which is not always well structured and could be
`semantically ambiguous. On the other hand, a data retrieval system (such as
`a relational database) deals with data that has a well defined structure and
`semantics.
`,
`Data retrieval, while providing a solution to the user of a database system,
`does not solve the problem of retrieving information about a subject or topic.
`To beeffective in its attempt to satisfy the user information need, the IR system
`must somehow ‘interpret’ the contents of the information items (documents)
`in a collection and rank them according to a degree of relevance to the user
`query. This ‘interpretation’ of a document content involves extracting syntactic
`and semantic information from the document text and using this information
`to match the user information need. The difficulty is not only knowing how
`to extract this information but also knowing how to use it to decide relevance.
`Thus, the notion of relevance is at the center of information retrieval. In fact, the
`primary goal of an IR system is to retrieve all the documents which are relevant
`to a user query while retrieving as few non-relevant documents as possible.
`
`1.1.2
`
`Information Retrieval at the Center of the Stage
`
`In the past 20 years, the area of information retrieval has grown well beyondits
`primary goals of indexing text and searching for useful documents in a collec-
`tion. Nowadays, research in IR includes modeling, documentclassification and
`categorization, systems architecture, user interfaces, data visualization,filtering,
`languages, etc. Despite its maturity, until recently, IR was seen as a narrow
`area of interest mainly to librarians and information experts. Such a tenden-
`tious vision prevailed for many years, despite the rapid dissemination, among
`users of modern personal computers, of IR tools for multimedia and hypertext
`applications. In the beginning of the 1990s, a single fact changed once and for
`all these perceptions — the introduction of the World Wide Web.
`The Web is becoming a universal repository of human knowledge and cul-
`ture which has allowed unprecedent sharing of ideas and information in a scale
`never seen before.
`Its success is based on the conception of a standard user
`interface which is always the same no matter what computational environment
`is used to run the interface. As a result, the user is shielded from details of
`communication protocols, machine location, and operating systems. Further,
`any user can create his own Web documents and make them point to any other
`Web documents without restrictions. This is a key aspect because it turns the
`
`16
`
`
`
`BASIC CONCEPTS
`3
`consequence, any Web user can push his personal agenda with little effort and
`almost at no cost. This universe without frontiers has attracted tremendous
`attention from millions of people everywhere since the very beginning. Further-
`more, it is causing a revolution in the way people use computers and perform
`their daily tasks. For instance, home shopping and home banking are becoming
`very popular and have generated several hundred million dollars in revenues.
`Despite so much success, the Web has introduced new problems of its own.
`Finding useful information on the Web is frequently a tedious and difficult task.
`For instance, to satisfy his information need, the user might navigate the space
`of Weblinks (ie., the hyperspace)
`searching for information of interest. How-
`ever, since the hyperspace is vast and almost unknown, such a navigation task is
`usually inefficient. For naive users, the problem becomes harder, which might en-
`tirely frustrate all their efforts. The main obstacle is the absence of a well defined
`underlying data model for the Web, which implies that information definition
`and structure is frequently of low quality. These difficulties have attracted re-
`newedinterest in IR andits techniques as promising solutions. As a result, almost
`overnight, IR has gained a place with other technologies at the center of the stage.
`
`1.1.3.
`
`Focus of the Book
`
`Despite the great increase in interest in information retrieval, modern textbooks
`on IR with a broad (and extensive) coverage of the various topics in the field
`are still difficult to find.
`In an attempt to partially fulfill this gap, this book
`presents an overall view of research in IR from a computerscientist’s perspec-
`tive. This means that the focus of the book is on computer algorithms and
`techniques used in information retrieval systems. A rather distinct viewpoint
`is taken by librarians and information science researchers, who adopt a human-
`centered interpretation of the IR problem.
`In this interpretation, the focus is
`on trying to understand how people interpret and use information as opposed
`to how to structure, store, and retrieve information automatically. While most
`of this book is dedicated to the computer scientist’s viewpoint of the IR prob-
`lem, the human-centered viewpoint is discussed to some extent in the last two
`chapters.
`~
`We put great emphasis on the integration of the different areas which are
`closed related to the information retrieval problem and thus, should be treated
`together. For that reason, besides covering text retrieval, library systems, user
`interfaces, and the Web, this book also discusses visualization, multimedia, re-
`trieval, and digital libraries.
`
`1.2 Basic Concepts
`Theeffective retrieval of relevant informationis directly affected both by the user
`
`17
`
`
`
`4
`
`INTRODUCTION
`
`Retrieval
`
`Browsingi)
`
`Database
`
`Figure 1.1
`
`Interaction of the user with the retrieval system through distinct tasks.
`
`1.2.1 -The User Task
`
`The user of a retrieval system has to translate his information need into a query
`in the language provided by the system. With an information retrieval system,
`this normally implies specifying a set of words which convey the semantics of
`the information need. With a data retrieval system, a query expression (such as,
`for instance, a regular expression) is used to convey the constraints that must
`be satisfied by objects in the answer set.
`In both cases, we say that the user
`searches for useful information executing a retrieval task.
`Consider now a user who has an interest which is either poorly defined
`or which is inherently broad. For instance,
`the user might be interested in
`documents about car racing in general.
`In this situation, the user might use
`an interactive interface to simply look around in the collection for documents
`related to car racing. For instance, he might find interesting documents about
`Formula 1 racing, about car manufacturers, or about the ‘24 Hours of Le Mans.’
`Furthermore, while reading about the ‘24 Hours of Le Mans’, he might turn his
`attention to a document which provides directions to Le Mans and, from there,
`to documents which cover tourism in France.
`In this situation, we say that
`the user is browsing the documents in the collection, not searching. It is still a
`process of retrieving information, but one whose main objectives are not clearly
`defined in the beginning and whose purpose might change during the interaction
`with the system.
`~
`In this book, we make a clear distinction between the different tasks the
`user of the retrieval system might be engaged in. His task might be of two distinct
`types:
`information or data retrieval and browsing. Classic information retrieval
`systems normally allow information or data retrieval. Hypertext systems are
`usually tuned for providing quick browsing. Modern digital library and Web
`interfaces might attempt to combine these tasks to provide improved retrieval
`
`18
`
`
`
`BASIC CONCEPTS
`
`5
`
`established approach and is not the dominant paradigm.
`Figure 1.1 illustrates the interaction of the user through the different tasks
`we identify. Information and data retrieval are usually provided by most modern
`information retrieval systems (such as Web interfaces). Further, such systems
`might also provide some(still limited) form of browsing. While combining infor-.
`mation and data retrieval with browsing is not yet a commonpractice, it might
`becomeso in the future.
`Both retrieval and browsing are, in the language of the World Wide Web,
`‘pulling’ actions. That is, the user requests the information in an interactive
`manner. An alternative is to do retrieval in