`
`1
`
`
`
`Modern Information Retrieval
`
`Ricardo Baeza—Yates
`
`Berthier Ribeiro—Neto
`
`
`
`ACM Press
`
`New York
`
`PEARSO N‘
`
`
`
`ddison
`
`
`A Wesley
`
`
`T A
`
`2
`
`
`
`Pearson Education Limited
`
`Edinburgh Gate
`Harlow
`Essex CM20 2IE
`
`England
`
`and Associated Companies throughout the world
`
`Vz'sz't us on the World Wide Web at:
`www.pearsoned.co.uk
`
`Copyright © 1999 by the ACM press, A Division of the Association for Computing
`Machinary, Inc. (ACM).
`
`The rights of the authors of this Work have been asserted by them in accordance with
`the—Copyright, Designs and Patents Act 1988.
`
`All rights reserved. No part of this publication may be reproduced, stored in a
`retrieval system, or transmitted in any form or by any means, electronic, mechanical,
`photocopying, recording or otherwise, without either the prior written permission of
`the publisher or a licence permitting restricted copying in the United Kingdom issued
`by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1T 4LP. 0
`
`While the publisher has made every attempt to trace all copyright owners and obtain
`permission to reproduce material, in a few cases this has proved impossible.
`Copyright holders of material which has not been acknowledged are encouraged to
`contact the publisher.
`'
`
`Many of the designations used by manufacturers and sellers to distinguish their
`products are claimed as trademarks. Addison Wesley Longman Limited has made
`every attempt to supply trade mark information about manufacturers and their
`products mentioned in this book. A list of the trademark designations and their
`owners appears on page viii.
`
`Typeset in Computer Modern by 56
`Printed and bound in the United States of America
`
`First printed 1999
`
`ISBN 0-201-39829—X
`
`British Library Cataloguing-in—Publication Data
`A catalogue record for‘this book is available from the British Library
`
`Library of Congress Cataloging-in—Publication Data
`Baeza—Yates, R. (Ricardo)
`-Modem information retrieval / Ricardo Baeza—Yates, Berthier Ribeiro—Neto.
`p. cm.
`Includes bibliographical reference and index.
`ISBN 0-20l—39829—X
`
`1. Information storage and retrieval systems. I. Ribeiro, Berthier de Arafijo
`Neto, 1960-.
`II. Title.
`Z667.B34
`1999
`025 .04—dc2 1
`
`99-10033
`CIP
`
`3
`
`
`
`Preface
`
`
`
`Information retrieval (IR) has changed considerably in recent years with the expansion
`of the World Wide Web and the advent of modern and inexpensive graphical user
`interfaces and mass storage devices. As a result, traditional IR textbooks have become
`quite out of date and this has led to the introduction of new IR books. Nevertheless, We
`believe that there is still great need for a book that approaches the field in a rigorous
`and complete Way from a computer—science perspective (as opposed to a user~centered
`perspective). This book is an effort to partially fulfill this gap and should be useful for
`a first course on information retrieval as Well as for a graduate course on the topic.
`The book comprises two portions Which complement and balance each other.
`The core portion includes nine chapters authored or coauthored by the designers of
`the book. The second portion, Which is fully integrated With the first, is formed by
`six state~of-the—art chapters Written by leading researchers in their fields. The same
`notation and glossary are used in all the chapters. Thus, despite the fact that several
`people have contributed to the text, this book is really much more a textbook than
`an edited collection of chapters written by separate authors. Furthermore, unlike a
`collection of chapters, We have carefully designed the contents and organization of the
`book to present a cohesive view of all the important aspects of modern information
`retrieval.
`
`From IR models to indexing text, from IR visual tools and interfaces to the Web,
`from IR multimedia tordigital libraries, the book provides both breadth of coverage and
`richness of detail. It is our hope that, given the now clear relevance and significance of
`information retrieval to modern society, the book Will contribute to further disseminate
`the study of the discipline at information science, computer science, and library science
`departments throughout the World.
`
`
`
`Ricardo Baeza—Yates, Santiago, Chile
`Berthier Ribeiro—Neto, Belo Horizonte, Brazil
`January, 1999
`
`4
`
`
`
`To Helena, Rosa, and our children
`
`territério de homens livres
`que sera nosso pals
`e sera patria de todos.
`
`Irmaos, cantai ese mundo
`que nao verei, mas vira
`
`um dia, dentro de mil anos,
`talvez mais. . .nao tenho pressa.
`
`de Cidade Prevista no livro
`
`A Rosa do Povo, 1945
`
`Carlos Drummond de Andrade
`
`territory of free men
`that Will be our country
`and will be the nation of all
`
`Brothers, sing this world
`which I’ll not see, but which will come
`
`one day, in a thousand years,
`maybe more. . .no hurry.
`
`from Prevised City in the book
`The Rose of the People, 1945
`
`Arno los libros
`
`exploradores,
`libros con bosque o nieve,
`profundidad o cielo
`
`de Oda al Libro (I),
`
`Pablo Neruda
`
`I love books
`
`that explore,
`books with a forest or snow,
`depth or sky
`
`from Ode to the Book (I),
`
`5
`
`
`
`Viii
`
`ACKNOWLEDGEMENTS
`
`Fifth, we thank the support of our institutions, the Departments of Com-
`puter Science of the University of Chile and of the Federal University of Minas
`Gerais, as well as the funding provided by national research agencies (CNPq in
`Brazil and CONICYT in Chile) and international collaboration projects, in par-
`ticular CYTED project VII.13 AMYRI (Environment for Information Managing
`and Retrieval in the World Wide Web) and Finep project SIAM (Information
`Systems for Mobile Computers) under the Pronex program.
`Most important, ‘to Helena, Rosa, and our children, who put up with a
`string of trips abroad, lost weekends, and odd working hours.
`
`List of Trademarks
`
`Alta Vista is a trademark of Compaq Computer Corporation
`FrameMaker is a trademark of Adobe Systems Incorporated
`IBM SP2 is a trademark of International Business Machines Corporation
`Netscape Communicator is a trademark of Netscape Communications Corporation
`Solaris, Sun 3/ 50 and Sun UltraSparc—1 are trademarks of Sun Microsystems, Inc.
`Thinking Machines CM—2 is a trademark of Thinking Machines Corporation
`Unix is licensed through X/ Open Company Ltd
`
`6
`
`
`
`Contents
`
`
`
`Preface
`
`Acknowledgements
`
`Biographies
`
`1.2
`
`1.3
`
`Introduction
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`1.1
`. Motivation .
`.
`.
`.
`.
`.
`.
`.
`.
`1.1.1
`Information versus Data Retrieval .
`1.1.2
`Information Retrieval at the Center of the Stage .
`1.1.3
`Focus of the Book .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`Basic Concepts
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`1.2.1 The User Task .
`.
`.
`' .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`1.2.2 Logical View of the Documents
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`Past, Present, and Future
`.2
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`1.3.1 Early Developments .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`1.3.2
`Information Retrieval in the Library .
`.
`.
`.
`.
`.
`.
`1.3.3 The Web and Digital Libraries .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`1.3.4 Practical Issues
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`The Retrieval Process .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`,Organization of the Book .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`1.5.1 Book Topics .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`1.5.2 Book Chapters
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`How to Use this Book .'
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`1.6.1 Teaching Suggestions .
`.
`.
`.
`.
`.
`.
`«.
`.
`.
`.
`.
`.
`.
`.
`.
`1.6.2 The Book’s Web Page
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`Bibliographic Discussion .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`g
`1.4
`1.5
`
`1.6
`
`1.7
`
`.
`
`V
`
`vii
`
`xvii
`
`1
`1
`1
`2
`3
`3
`4
`5
`6
`6
`7
`7
`8
`9
`10
`11
`12
`15
`15
`16
`17
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`2 Modeling
`
`.
`
`I 19
`
`2.1
`
`Introduction .
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.’ .
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`19
`
`7
`
`
`
`CONTENTS
`
`2.4
`
`2.5
`
`A Formal Characterization of IR Models
`
`Classic Information Retrieval .
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`23
`
`24
`
`24
`25
`
`2.5.1 Basic Concepts
`2.5.2 Boolean Model
`
`.
`.
`
`.
`.
`.
`2.5.3 Vector Model
`2.5.4
`Probabilistic Model
`
`.
`.
`
`.
`.
`
`._
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`2.5.5 Brief Comparison of Classic Models .
`Alternative Set Theoretic Models
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`Fuzzy Set Model
`2.6.1
`2.6.2 Extended Boolean Model .
`
`._ .
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`
`27
`30
`
`34
`34
`
`34
`38
`
`41
`41
`44
`46
`
`2.6
`
`2.7
`
`.
`.
`.
`.
`.
`.
`Alternative Algebraic Models
`2.7.1 Generalized Vector Space Model .
`2.7.2 Latent Semantic Indexing Model
`2.7.3 Neural Network Model
`.
`.
`.
`.
`.
`
`2.8
`
`Alternative Probabilistic Models .
`
`.
`.
`.
`.
`2.8.1 Bayesian Networks
`2.8.2
`Inference Network Model .
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`48
`
`48
`49
`
`56
`
`2.8.3 Belief Network Model .
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`2.8.4 Comparison of Bayesian Network Models .
`2.8.5 Computational Costs of Bayesian Networks
`2.8.6 The Impact of Bayesian Network Models .
`Structured Text Retrieval Models .
`.
`.
`.
`.
`.
`.
`.
`.
`
`2.9
`
`2.9.1 Model Based on Non-Overlapping Lists .
`2.9.2 Model Based on Proximal Nodes
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`2.10 Models for Browsing
`.
`.
`.
`.
`.
`.
`2.10.1 Flat Browsing .
`2.10.2 Structure Guided Browsing
`2.10.3 The Hypertext Model .
`.
`.
`2.11 Trends and Research Issues .
`.
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`
`.
`
`59
`60
`61
`61
`
`62
`63
`
`65
`65
`66
`66
`69
`
`69
`
`2.12 Bibliographic Discussion .
`
`Retrieval Evaluation
`
`3.1
`
`Introduction .
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.-
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`73
`
`73
`
`74
`
`3.2
`
`Retrieval Performance Evaluation .
`
`3.2.1 Recall and Precision .
`
`3.2.2 Alternative Measures .
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.‘ .
`
`75
`
`82
`
`84
`
`3.3
`
`Reference Collections .
`
`.
`
`.
`
`.
`
`.
`
`3.3.1 The TREC Collection
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`3.3.2 The CACM and ISI Collections
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`3.4
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`84
`
`91
`
`94
`96
`
`3.3.3 The Cystic Fibrosis Collection .
`Trends and Research Issues .
`.
`.
`.
`.
`.
`.
`
`3.5
`
`Bibliographic Discussion .
`
`Query Languages
`4.1
`Introduction .
`
`A
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`96
`
`99
`99
`
`8
`
`
`
`CONTENTS
`
`Xi
`
`Single—Word Queries
`4.2.1
`4.2.2 Context Queries .
`.
`
`.
`4.2.3 Boolean Queries .
`4.2.4 Natural Language .
`Pattern Matching .
`.
`.
`.
`.
`Structural Queries .
`.
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.*
`.
`
`.
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`.
`
`4.3
`4.4
`
`.
`.
`
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`
`100
`.
`. 101
`
`.
`.
`.
`.
`
`.
`
`102
`103
`104
`106‘
`
`_ 108
`
`4.4.1
`
`Fixed Structure .
`
`.
`
`.
`
`.
`.
`.
`.
`.
`.
`4.4.2 Hypertext
`4.4.3 Hierarchical Structure
`
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`108
`109
`
`113
`114
`
`4.5
`4.6
`
`4.7
`
`.
`.
`.
`.
`.
`.
`Query Protocols .
`Trends and Research Issues .
`
`Bibliographic Discussion .
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`Query Operations
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`116
`
`117
`
`117
`
`5.1
`
`5.2
`
`Introduction .
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`User Relevance Feedback .
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`5.2.1 Query Expansion and Term Reweighting for the Vector
`Model
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`5.5
`
`.
`
`.
`
`.
`._
`.
`.
`.
`.
`.
`.
`.
`.
`
`118
`
`118
`
`120
`121
`122
`123
`124
`129
`131
`131
`134
`137
`
`.
`.
`5.2.2 Term Reweighting for the Probabilistic Model
`.
`.
`5.2.3 A Variant of Probabilistic Term Reweighting .
`.
`.
`5.2.4 Evaluation of Relevance Feedback Strategies .
`.
`.
`5.3 Automatic Local Analysis
`.
`.
`Q
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`5.3.1 Query Expansion Through Local Clustering .
`5.3.2 Query Expansion Through Local Context Analysis
`.
`5.4 Automatic Global Analysis .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`5.4.1 Query Expansion based on a Similarity Thesaurus .
`5.4.2 Query Expansion based on a Statistical Thesaurus
`Trends and Research Issues .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`5.6
`
`Bibliographic Discussion .
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`Text and Multimedia Languages and Properties
`6.1
`Introduction .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`138
`
`141
`141
`
`142
`
`6.2 Metadata .
`
`6.3
`
`Text
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`144
`
`144
`
`6.3.1
`
`Formats
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`.
`.
`Information Theory .
`6.3.2
`6.3.3 Modeling Natural Language .
`6.3.4
`Similarity Models .
`.
`.
`.
`.
`.
`.
`6.4 Markup Languages
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`6.4.1
`SGML .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`
`.
`.
`.
`.
`.
`
`.
`
`145
`145
`148
`149
`149
`
`152
`
`6.4.2 HTML .
`
`6.4.3 XML .y .
`6.5 Multimedia .
`.
`.6
`6.5.1
`Formats
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`
`.
`.
`.
`.
`.1 .
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`_ 154
`156
`157
`
`9
`
`
`
`xii
`
`CONTENTS
`
`.
`.
`.
`.
`.
`.
`.
`6.5.4 HyTime
`Trends and Research Issues.
`Bibliographic Discussion .
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`6.6
`6.7
`
`Text Operations
`.
`.
`.
`.
`7.1
`Introduction .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`7.2
`Document Preprocessing .
`7.2.1 Lexical Analysis of the Text
`7.2.2 Elimination of Stopwords .
`7.2.3
`Stemming .
`.
`.
`.
`.
`.
`.
`.
`.
`7.2.4
`Index Terms Selection
`.
`.
`7.2.5 Thesauri
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`
`7.3
`7.4
`
`Document Clustering .
`Text Compression .
`.
`.
`7.4.1 Motivation .
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`
`159
`160
`162
`
`163
`163
`165
`165
`167
`168
`169
`170
`
`173
`173
`173
`
`175
`176
`
`.
`.
`.
`7.4.2 Basic Concepts
`7.4.3. Statistical Methods .
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`7.4.4 Dictionary Methods .
`.
`.
`.
`.
`7.4.5
`Inverted File Compression .
`Comparing Text Compression Techniques .
`Trends and Research Issues .
`.
`.
`.
`.
`.
`.
`.
`.
`
`Bibliographic Discussion .
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`
`183
`184
`186
`188
`
`189
`
`7.5
`7.6
`
`7.7
`
`Indexing and Searching
`8.1
`Introduction .
`.
`.
`.
`
`8.2
`
`Inverted Files
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`.
`Searching
`8.2.1
`8.2.2 Construction .
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`8.3
`
`8.4
`8.5
`
`Other Indices for Text
`.
`.
`.
`.
`.
`.
`.
`.
`.
`8.3.1
`Suffix Trees and Suflix Arrays .
`8.3.2
`Signature Files
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`Boolean Queries .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`Sequential Searching .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`8.5.1 Brute Force
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`V.
`
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`.
`
`191
`191
`
`192
`
`195
`196
`
`199
`199
`205
`207
`209
`209
`
`8.5.2 Knuth—Morris—Pratt .
`
`.
`
`8.5.3 Boyer—Moore Family .
`8.5.4
`Shift—Or
`.
`.
`.
`.
`.
`.
`.
`.
`
`8.5.5
`
`Suflix Automaton .
`
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`. .
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`Practical Comparison .
`8.5.6
`.
`.
`.
`..
`.
`.
`.
`.
`.
`.
`.
`Phrases and Proximity .
`8.5.7
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`Pattern Matching .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`8.6.1
`String Matching Allowing Errors
`8.6.2 Regular Expressions and Extended Patterns .
`8.6.3 Pattern Matching Using Indices .
`.
`.
`.
`.
`.
`.
`.
`Structural Queries .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`.
`.
`
`210
`
`211
`212
`
`213
`
`214
`215
`215
`216
`219
`220
`222
`
`8.6
`
`8.7
`
`10
`
`
`
`CONTENTS
`
`Sequential Searching .
`8.8.1
`8.8.2 Compressed Indices .
`.
`Trends and Research Issues .
`.
`Bibliographic Discussion
`
`8.9
`
`8.10
`
`Parallel and Distributed IR
`
`9.1
`
`9.2
`
`.
`.
`.
`.
`.
`.
`.
`.
`Introduction .
`9.1.1
`Parallel Computing .
`9.1.2
`Performance Measures
`
`.
`.
`
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`A
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`-
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`Parallel IR .
`.
`.
`.
`9.2.1
`Introduction .
`9.2.2 MIMD Architectures
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`SIMD Architectures .
`9.2.3
`.
`.
`Distributed IR .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`9.3.1
`Introduction .
`.
`.
`.
`.
`9.3.2 Collection Partitioning .
`9.3.3
`Source Selection .
`.
`.
`.
`.
`9.3.4 Query Processing .
`.
`.
`.
`9.3.5 Web Issues .
`.
`.
`.
`.
`.
`.
`.
`Trends and Research Issues .
`.
`.
`Bibliographic Discussion .
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`, .
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`9.3
`
`9.4
`
`9.5
`
`User Interfaces and Visualization
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`Introduction .
`10.1
`10.2 Human—Computer Interaction
`.
`.
`10.2.1 Design Principles
`.
`.
`10.2.2 The Role of Visualization
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`-
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`
`.
`.
`.3
`.
`.
`.
`.
`.
`.
`.
`10.2.3 Evaluating Interactive Systems
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`The Information Access Process .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`10.3.1 Models of Interaction .
`.
`.
`.
`.
`10.3.2 Non—Search Parts of the Information Access Process
`10.3.3 Earlier Interface Studies
`.
`.
`.
`.
`.
`.
`.
`.
`
`10.3
`
`10.4
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`.
`
`.
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`.
`.
`
`.
`
`.
`
`10.5
`
`Starting Points
`10.4.1 Lists of Collections
`.
`.
`.
`.
`
`10.4.2 Overviews
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`.
`.
`10.4.3 Examples, Dialogs, and Wizards .
`.
`.
`.
`.
`.
`10.4.4 Automated Source Selection .
`.
`.
`.
`.
`.
`.
`.
`Query Specification .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`10.5.1 Boolean Queries .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`10.5.2 From Command Lines to Forms and Menus
`.
`10.5.3 Faceted Queries .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`10.5.4 Graphical Approaches to Query Specification
`10.5.5 Phrases and Proximity .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`10.5.6 Natural Language and Free Text Queries .
`.
`
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`
`278
`
`278
`
`279
`
`280
`
`281
`
`282
`
`286
`
`287
`
`11
`
`
`
`xiv
`
`CONTENTS
`
`10.6.2
`
`10.6.3
`
`10.6.4
`
`10.6.5
`
`Query Term Hits Within Document Content .
`.
`.
`.
`.
`Query Term Hits Between Documents
`SuperBook: Context via Table of Contents .
`.
`.
`.
`.
`.
`Categories for Results Set Context
`10.6.6 Using Hyperlinks to Organize Retrieval Results .
`.
`.
`.
`.
`-
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`-
`.
`.
`.
`.
`.
`10.6.7 Tables
`‘
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`10.7 Using Relevance Judgements .
`10.7.1 Interfaces for Standard Relevance Feedback
`
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`10.7.2 Studies of User Interaction With Relevance Feedback
`.
`.
`.
`.
`.
`.
`.
`.
`.
`~
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`
`.
`
`.
`
`289
`
`293
`
`296
`
`297
`299
`
`301
`
`303
`
`304
`
`307
`
`308
`
`308
`
`10.8
`
`Systems
`1
`10.7.3 Fetching Relevant Information in the Background .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`10.7.4 Group Relevance Judgements
`10.7.5 Pseudo—Relevance Feedback
`.
`.
`.
`.
`.
`.
`.
`.
`.
`Interface Support for the Search Process
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`10.8.1 Interfaces for String Matching .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`10.8.2 Window Management .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`10.8.3 Example Systems .
`.
`.
`.
`.
`.
`.
`.
`10.8.4 Examples of Poor Use of Overlapping Windows .
`10.8.5 Retaining Search History .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`«\10.8.6 Integrating Scanning, Selection, and Querying .
`.
`10.9 Trends and Research Issues .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`
`10.10 Bibliographic Discussion .
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`11
`
`Multimedia IR: Models and Languages
`11.1
`Introduction .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`11.2 Data Modeling .
`11.2.1 Multimedia Data Support in Commercial DBMSs .
`11.2.2 The MULTOS Data Model .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`
`.
`.
`.
`
`.
`.
`.
`
`309
`
`309
`
`311
`
`312
`
`317
`
`317
`
`318
`
`321
`
`322
`
`325
`
`325
`328 ‘C
`329
`
`331
`
`334
`
`335
`
`335
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`11.3 Query Languages
`.
`.
`.
`.
`.
`.
`11.3.1 Request Specification .
`11.3.2 Conditions on Multimedia Data .
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`.
`.
`.
`
`11.3.3 Uncertainty, Proximity, and Weights in Query
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`Expressions
`.
`.
`11.3.4 Some Proposals .
`11.4 Trends and Research Issues .
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`.
`.
`
`11.5 Bibiographic Discussion .
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`0
`
`.
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`337
`
`338
`
`341
`
`342
`
`12 Multimedia IR: Indexing and Searching
`12.1
`Introduction .
`i .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`12.2 Background —— Spatial Access Methods .
`12.3 A Generic Multimedia Indexing Approach .
`12.4 One-dimensional Time Series .
`.
`.
`.
`.
`.
`.
`.
`.
`
`12.4.1 Distance Function .
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`
`.
`.
`.
`
`.
`
`12.4.2 Feature Extraction and Lower-bounding .
`12.4.3 Experiments .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`.
`
`345
`
`345
`
`347
`
`348
`
`353
`
`353
`
`353
`
`355
`
`12
`
`
`
`CONTENTS
`
`XV
`
`12.5.1 Image Features and Distance Functions .
`12.5.2 Lower—bounding .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`12.5.3 Experiments .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`12.6 Automatic Feature Extraction .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`12.7 Trends and Research Issues .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`12.8 Bibliographic Discussion .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`
`357
`358
`360
`360
`361 .
`363
`
`13 Searching the Web
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`13.1
`Introduction
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.4
`.
`.
`.
`.
`.
`.
`13.2 Challenges .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`13.3 Characterizing the Web .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`13.3.1 Measuring the Web .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`13.3.2 Modeling the Web
`.
`.
`.
`.
`.
`.
`.
`.1
`.
`.
`.
`.
`13.4 Search Engines
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.’
`13.4.1 Centralized Architecture .
`.
`.
`.
`.
`.
`.
`.
`13.4.2 Distributed Architecture .
`.
`.
`.
`.
`.
`.
`.
`13.4.3 User Interfaces
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`13.4.4 Ranking .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`13.4.5 Crawling the Web .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`13.4.6 Indices .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`13.5 Browsing .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`»
`13.5.1 Web Directories .
`.
`.
`.
`.
`13.5.2 Combining Searching with Browsing .
`13.5.3 Helpful Tools
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`13.6 Metasearchers .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`13.7 Finding the Needle in the Haystack .
`.
`.
`.
`.
`.
`13.7.1 User Problems .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`13.7.2 Some Examples .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`13.7.3 Teaching the User .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`13.8 Searching using Hyperlinks .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`13.8.1 Web Query Languages
`.
`.
`.
`.
`.
`.
`.
`.
`.
`13.8.2 Dynamic Search and Software Agents .
`13.9 Trends and Research Issues.
`.
`.
`;
`.
`.
`.
`.
`.
`13.10 Bibliographic Discussion .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`A.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`367
`367
`.
`.
`.
`368
`.
`.
`.
`369
`.
`.
`.
`369
`.
`.
`.
`371
`.
`.
`.
`373
`.
`.
`.
`373
`.
`.
`.
`375
`.
`.
`.
`. 377
`.
`.
`.
`380
`.
`.
`.
`382
`.
`.
`.
`383
`.
`.
`. ..
`384
`.
`.
`.
`384
`.
`.
`.
`386
`.
`.
`.
`387
`.
`.
`.
`387
`.
`.’
`.
`389
`.
`.
`.
`389
`.
`.
`.
`390
`.
`.
`.
`391
`.
`.
`.
`392
`.
`.
`.
`392
`.
`.
`.
`393
`.
`.
`.
`393
`.
`.
`.
`395
`
`14 Libraries and Bibliographical Systems
`.
`.
`.
`.
`.
`14.1
`Introduction .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`14.2 Online IR Systems and Document Databases .
`14.2.1 Databases
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`14.2.2 Online Retrieval Systems .
`.
`.
`.
`.
`.
`.
`A.
`14.2.3 IR in Online Retrieval Systems
`.
`.
`.
`.
`14.2.4 ‘Natural Language’ Searching .
`.
`.
`.
`.
`14.3 Online Public Access Catalogs (OPACs)
`.
`.
`..
`14.3.1 OPACs and Their Content
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`.
`.
`*.
`
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.* .
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`.
`.
`.
`.
`.
`.
`.
`
`397
`397
`.
`398
`.
`399
`.
`. 403
`. 404
`.
`406
`.
`407
`.
`408
`
`13
`
`
`
`xvi
`
`CONTENTS
`
`14.3.4 Alternatives to Vendor OPACs
`
`14.4 Libraries and Digital Library Projects .
`14.5 Trends and Research Issues .
`.
`.
`.
`.
`.
`.
`
`14.6 Bibliographic Discussion .
`
`15 Digital Libraries
`15.1
`Introduction .
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`
`410
`
`412
`. 412
`
`413
`
`415
`415
`
`417
`
`15.2 Definitions .
`
`.
`
`.
`
`.
`
`.
`
`.
`
`15.3 Architectural Issues .
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`15.4 Document Models, Representations, and Access
`15.4.1 Multilingual Documents
`.
`.
`.
`.
`.
`.
`.
`.
`.
`15.4.2 Multimedia Documents .
`.
`.
`.
`.
`.
`.
`.
`.
`
`15.4.3 Structured Documents
`
`.
`
`.
`
`.
`
`.1 .
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`
`.
`
`.
`
`418
`
`420
`420
`421
`
`421
`
`422
`
`15.4.4 Distributed Collections .
`
`15.4.5 Federated Search .
`
`15.4.6 Access
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`15.5 Prototypes, Projects, and Interfaces .
`15.5.1 International Range of Efforts .
`15.5.2 Usability .
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`15.6 Standards
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`.
`
`.
`
`.
`.
`.
`.
`
`.
`
`424
`
`424
`
`425
`427
`428
`429
`
`429
`
`15.6.1 Protocols and Federation .
`
`15.6.2 Metadata
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`15.7 Trends and Research Issues .
`
`15.8 Bibliographical Discussion .
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`.
`
`Appendix: Porter’s Algorithm
`
`Glossary
`
`References
`
`Index
`
`430
`
`431
`
`432
`
`433
`
`437
`
`455
`
`501
`
`14
`
`
`
`Chapter 1
`introduction
`
`
`
`1.1 Motivation
`
`Information retrieval (IR) deals with the representation, storage, organization
`of, and access to information items. The representation and organization of the
`information items should provide the user with easy access to the information in
`which he is interested. Unfortunately, characterization of the user information
`need is not a simple problem. Consider, for instance, the following hypothetical
`user information need in the context of the World Wide Web (or just the Web):
`
`A Find all the pages (documents) containing information on college ten-
`nis teams which: (1) are maintained by an university in the USA and
`(2) participate in the NCAA tennis tournament. To be relevant, the
`page must include information on the national ranking of the team
`in the last three years and the email or phone number of the team
`coach.
`
`Clearly, this full description of the user information need cannot be used directly
`to request information using the current interfaces of Web search engines.
`In-
`stead, the user must first translate this information need into a query which can
`be processed by the search engine (or IR system).
`In its most common form, this translation yields a set of keywords (or index
`terms) which summarizes the description of the user information need. Given the
`user query, the key goal of an IR system is to retrieve information which might
`be useful or relevant to the user. The emphasis is on the retrieval of information
`as opposed to the retrieval of data.
`
`1.1.1
`
`reformation versus Data Retrieval
`
`Data retrieval, in the context of an IR system, consists mainly of determining
`which documents of a collection contain the keywords in the user query which,
`
`15
`
`
`
`2
`
`INTRODUCTION
`
`subject than with retrieving data which satisfies a given query. A data retrieval
`language aims at retrieving all objects which satisfy clearly defined conditions
`such as those in a regular expression or in a relational algebra expression. Thus,
`for a data retrieval system, a single erroneous object among a thousand retrieved
`objects means total failure. For an information retrieval system, however, the
`retrieved objects might be inaccurate and small errors are likely to go unnoticed.
`The main reason for this difference is that information retrieval usually deals
`with natural language text which is not always well structured and could be
`semantically ambiguous. On the other hand, a data retrieval system (such as
`a relational database) deals with data that has a well defined structure and
`semantics.
`‘
`
`Data retrieval, while providing a solution to the user of a database system,
`does not solve the problem of retrieving information about a subject or topic.
`To be effective in its attempt to satisfy the user information need, the IR system
`must somehow ‘interpret’ the contents of the information items (documents)
`in a ‘collection and rank them according to a degree of relevance to the user
`query. This ‘interpretation’ of a document content involvesextracting syntactic
`and semantic information from the document text and using this information
`to match the user information need. The difficulty is not only knowing how
`to extr