Showing papers on "Inverted index" published in 1994

Open Access

Simple, proven approaches to text retrieval

01 Jan 1994

TL;DR: This technical note describes straightforward document indexing and retrieval techniques that are well established through extensive testing, easy to apply, and usable by end users without special skills or training.

Abstract: This technical note describes straightforward techniques for document indexing and retrieval that have been solidly established through extensive testing and are easy to apply. They are useful for many different types of text material, are viable for very large files, and have the advantage that they do not require special skills or training for searching, but are easy for end users. The document and text retrieval methods described here have a sound theoretical basis, are well established by extensive testing, and the ideas involved are now implemented in some commercial retrieval systems.

Testing in the last few years has, in particular, shown that the methods presented here work very well with full texts, not only title and abstracts, and with large files of texts containing three quarters of a million documents. These tests, the TREC Tests (see Harman 1993–1997; IPM on term weighting exploiting statistical information about term occurrences; on scoring for request-document matching, using these weights, to obtain a ranked search output; and on relevance feedback to modify request weights or term sets in iterative searching. The normal implementation is via an inverted file organisation using a term list with linked document identifiers, plus counting data, and pointers to the actual texts. The user’s request can be a word list, phrases, sentences or extended text.

1 Terms and matching

Index terms are normally content words (but see section 6). In request processing, stop words (e.g. prepositions and conjunctions) are eliminated via a stop word list, and they are usually removed, for economy reasons, in inverted file construction. Terms are also generally stems (or roots) rather than full words, since this means that matches are not missed through trivial word variation, as with singular/plural forms. Stemming can be achieved most simply by the user truncating his request words, to match any inverted index words that include them; but it is a better strategy to truncate using a standard stemming algorithm and suffix list (see Porter 1980), which is nicer for the user and reduces the inverted term list.

The request is taken as an unstructured list of terms. If the terms are unweighted, output could be ranked by the number of matching terms – i.e. for a request with 5 terms first by documents with all 5, then by documents with any 4, etc. However, performance may be improved considerably by giving a weight to each term (or each term-document combination). In this case, output is ranked by sum of weights (see below).
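
The pipeline described in this abstract (stop word removal, stemming, an inverted file with counts, and ranking by sum of term weights) can be sketched in a few lines. This is only an illustration under stated assumptions: the stop word list, the crude suffix stripper `stem` (a stand-in for a real stemmer such as Porter's), and the IDF-style weight in `rank` are all hypothetical choices, since the abstract does not give the weighting formula.

```python
import math
from collections import defaultdict

STOP_WORDS = {"the", "of", "and", "in", "on", "a", "for", "to"}  # illustrative list only


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def terms(text):
    """Lower-case, drop stop words, and stem the remaining words."""
    return [stem(w) for w in text.lower().split() if w not in STOP_WORDS]


def build_index(docs):
    """Inverted file: map each term to {doc_id: within-document count}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for t in terms(text):
            index[t][doc_id] = index[t].get(doc_id, 0) + 1
    return index


def rank(index, n_docs, request):
    """Score documents by the sum of the weights of matching request terms."""
    scores = defaultdict(float)
    for t in set(terms(request)):
        postings = index.get(t, {})
        if not postings:
            continue
        # Assumed weight: rarer terms count for more (IDF-style, not from the source).
        weight = math.log(n_docs / len(postings)) + 1.0
        for doc_id in postings:
            scores[doc_id] += weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Setting every weight to 1.0 in `rank` would reduce this to the unweighted scheme mentioned in the abstract, where documents are ordered by the number of matching terms.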

304 citations

Proceedings Article•DOI•

Incremental updates of inverted lists for text document retrieval

Anthony Tomasic¹, Hector Garcia-Molina¹, Kurt A. Shoens²•Institutions (2)

Stanford University¹, IBM²

24 May 1994

TL;DR: In this paper, the problem of incremental updates of inverted lists is addressed using a new dual-structure index that dynamically separates long and short inverted lists and optimizes retrieval, update, and storage of each type of list.

Abstract: With the proliferation of the world's “information highways” a renewed interest in efficient document indexing techniques has come about. In this paper, the problem of incremental updates of inverted lists is addressed using a new dual-structure index. The index dynamically separates long and short inverted lists and optimizes retrieval, update, and storage of each type of list. To study the behavior of the index, a space of engineering trade-offs which range from optimizing update time to optimizing query performance is described. We quantitatively explore this space by using actual data and hardware in combination with a simulation of an information retrieval system. We then describe the best algorithm for a variety of criteria.
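
To make the idea of separating short and long inverted lists concrete, here is a minimal in-memory sketch. It is not the authors' design: the class name `DualStructureIndex`, the hashed buckets for short lists, and the fixed `long_threshold` promotion rule are assumptions made for illustration; the abstract says only that the index separates the two kinds of list dynamically.

```python
class DualStructureIndex:
    """Toy dual-structure index: short lists share buckets, long lists stand alone."""

    def __init__(self, n_buckets=4, long_threshold=3):
        self.n_buckets = n_buckets
        self.long_threshold = long_threshold  # hypothetical promotion threshold
        self.buckets = [dict() for _ in range(n_buckets)]  # term -> short list
        self.long_lists = {}                                # term -> long list

    def _bucket(self, term):
        return self.buckets[hash(term) % self.n_buckets]

    def add(self, term, doc_id):
        """Append a posting, promoting the list once it grows past the threshold."""
        if term in self.long_lists:
            self.long_lists[term].append(doc_id)
            return
        bucket = self._bucket(term)
        postings = bucket.setdefault(term, [])
        postings.append(doc_id)
        if len(postings) > self.long_threshold:
            # The list has outgrown its bucket: move it to the long-list store.
            self.long_lists[term] = bucket.pop(term)

    def lookup(self, term):
        if term in self.long_lists:
            return list(self.long_lists[term])
        return list(self._bucket(term).get(term, []))

    def is_long(self, term):
        return term in self.long_lists
```

In a disk-based system the two stores would presumably use different layouts and update policies; the sketch only shows the dynamic migration from one store to the other.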

200 citations

Proceedings Article•

Fast Incremental Indexing for Full-Text Information Retrieval

Eric W. Brown¹, James P. Callan¹, W. Bruce Croft¹•Institutions (1)

University of Massachusetts Amherst¹

12 Sep 1994

TL;DR: This work describes the system and presents experimental results showing superior incremental indexing and competitive query processing performance, using a traditional inverted file index built on top of a persistent object store.

Abstract: Full-text information retrieval systems have traditionally been designed for archival environments. They often provide little or no support for adding new documents to an existing document collection, requiring instead that the entire collection be re-indexed. Modern applications, such as information filtering, operate in dynamic environments that require frequent additions to document collections. We provide this ability using a traditional inverted file index built on top of a persistent object store. The data management facilities of the persistent object store are used to produce efficient incremental update of the inverted lists. We describe our system and present experimental results showing superior incremental indexing and competitive query processing performance.

149 citations

Proceedings Article•DOI•

Document filtering for fast ranking

Michael Persin¹•Institutions (1)

RMIT University¹

01 Aug 1994

TL;DR: The experiments show that the proposed evaluation technique reduces both main memory usage and query evaluation time, based on early recognition of which documents are likely to be highly ranked, without degradation in retrieval effectiveness.

Abstract: Ranking techniques are effective for finding answers in document collections but the cost of evaluation of ranked queries can be unacceptably high. We propose an evaluation technique that reduces both main memory usage and query evaluation time, based on early recognition of which documents are likely to be highly ranked. Our experiments show that, for our test data, the proposed technique evaluates queries in 20% of the time and 2% of the memory taken by the standard inverted file implementation, without degradation in retrieval effectiveness.

113 citations

Patent•

Method and system for automatically indexing data in a document using a fresh index table

Kyle G. Peltonen¹, Bartosz Milewski¹•Institutions (1)

Microsoft¹

26 Oct 1994

TL;DR: A system and method for indexing words in documents uses a master index, shadow indexes for edited documents, and a fresh index table; queries search all relevant indexes and compare the results with the fresh index table so that only the most up-to-date data is returned.

Abstract: A system and method for indexing words in documents, the system including a master index for storing the words and for storing associated index data. One of the documents is selected for updating and is edited. Next, a shadow index is created. Each word from the selected edited document is then indexed in the shadow index. A fresh index table is updated to indicate that the shadow index contains the most up-to-date data regarding the selected edited document. Query requests will be processed by searching all relevant indexes and comparing the retrieved results with the data in the fresh index table. Only the most up-to-date data will actually be returned as the query results. Periodically, shadow indexes and the master index can be merged into a new master index. Only the most up-to-date data, as determined by a comparison with the fresh index table, will be stored in the new master index.
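
The update, query, and merge steps in this abstract can be illustrated with a small sketch. The data layout is an assumption: the class name `FreshIndexStore`, the integer index identifiers, and the convention that a document absent from the fresh table is current in the master index are hypothetical, not taken from the patent.

```python
MASTER = 0  # hypothetical identifier for the master index


class FreshIndexStore:
    def __init__(self, docs):
        self.indexes = {MASTER: self._index(docs)}  # index id -> {word: set(doc ids)}
        self.fresh = {}                             # doc id -> id of its newest index
        self.next_id = 1

    @staticmethod
    def _index(docs):
        index = {}
        for doc_id, text in docs.items():
            for word in text.lower().split():
                index.setdefault(word, set()).add(doc_id)
        return index

    def update(self, doc_id, new_text):
        """Index an edited document in a new shadow index and record it as fresh."""
        shadow_id = self.next_id
        self.next_id += 1
        self.indexes[shadow_id] = self._index({doc_id: new_text})
        self.fresh[doc_id] = shadow_id

    def query(self, word):
        """Search every index, keeping hits only from each document's freshest index."""
        hits = set()
        for index_id, index in self.indexes.items():
            for doc_id in index.get(word.lower(), ()):
                if self.fresh.get(doc_id, MASTER) == index_id:
                    hits.add(doc_id)
        return hits

    def merge(self):
        """Fold shadow indexes into a new master, dropping stale entries."""
        new_master = {}
        for index_id, index in self.indexes.items():
            for word, doc_ids in index.items():
                for doc_id in doc_ids:
                    if self.fresh.get(doc_id, MASTER) == index_id:
                        new_master.setdefault(word, set()).add(doc_id)
        self.indexes = {MASTER: new_master}
        self.fresh = {}
        self.next_id = 1
```

The point of the fresh table in this sketch is that an edit never touches the master index directly: stale master entries are simply filtered out at query time and discarded at the next merge.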

63 citations

Proceedings Article•DOI•

Index structures for information filtering under the vector space model

Tak W. Yan¹, Hector Garcia-Molina¹•Institutions (1)

Stanford University¹

14 Feb 1994

TL;DR: In this paper, the authors apply the idea of the standard inverted index to index user profiles, in which they, instead of indexing every term in a profile, select only the significant ones to index.

Abstract: The authors study what data structures and algorithms can be used to efficiently perform large-scale information filtering under the vector space model, a retrieval model established as being effective. They apply the idea of the standard inverted index to index user profiles. They devise an alternative to the standard inverted index, in which they, instead of indexing every term in a profile, select only the significant ones to index. They evaluate their performance and show that the indexing methods require orders of magnitude fewer I/Os to process a document than when no index is used. They also show that the proposed alternative performs better in terms of I/O and CPU processing time in many cases.
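
A rough sketch of indexing profiles rather than documents follows. Everything specific here is an assumption for illustration: profiles are sparse term-weight dictionaries, the "significant" terms are taken to be the `k` highest-weighted ones, and a profile matches when its cosine similarity to the incoming document meets a hypothetical `threshold`. The abstract does not say how significance is decided.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_profile_index(profiles, k=1):
    """Index each profile under only its k highest-weighted terms."""
    index = defaultdict(set)
    for pid, vec in profiles.items():
        top = sorted(vec, key=lambda t: (-vec[t], t))[:k]
        for t in top:
            index[t].add(pid)
    return index


def match(doc_vec, profiles, index, threshold=0.5):
    """Return ids of candidate profiles whose similarity meets the threshold."""
    candidates = set()
    for t in doc_vec:
        candidates |= index.get(t, set())
    return sorted(p for p in candidates if cosine(doc_vec, profiles[p]) >= threshold)
```

Only profiles reached through the index are scored in full, which is where the saving comes from. The naive top-`k` rule used here can miss a profile that shares only unindexed terms with the document, so a real scheme would need a selection rule that rules this out.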

62 citations

Journal Article•DOI•

Memory efficient ranking

Alistair Moffat¹, Justin Zobel², Ron Sacks-Davis•Institutions (2)

University of Melbourne¹, RMIT University²

01 Oct 1994-Information Processing and Management

TL;DR: This work describes an approximate ranking process that makes use of a compact array of in-memory, low-precision approximations for the lengths, which allows the ranking of large document collections using less than one byte of memory per document, an eight-fold reduction compared with conventional techniques.

Abstract: Fast and effective ranking of a collection of documents with respect to a query requires several structures, including a vocabulary, inverted file entries, arrays of term weights and document lengths, a set of partial similarity accumulators, and address tables for inverted file entries and documents. Of all of these structures, the array of document lengths and the set of accumulators are the components accessed most frequently in a ranked query, and it is crucial to acceptable performance that they be held in main memory. Here we describe an approximate ranking process that makes use of a compact array of in-memory, low-precision approximations for the lengths. Combined with another simple rule for reducing the memory required by the partial similarity accumulators, the approximation heuristic allows the ranking of large document collections using less than one byte of memory per document, an eight-fold reduction compared with conventional techniques. Moreover, in our experiments retrieval effectiveness was largely unaffected by the use of these heuristics.
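
One way to picture low-precision length approximations is geometric quantization: each document length is replaced by a small integer code on a logarithmic scale. The abstract does not specify the coding used, so the scheme below, the class name `ApproxLengths`, and the default of 6 bits are assumptions. For simplicity the codes are stored one per byte; getting below one byte per document would additionally require bit packing.

```python
import math
from array import array


class ApproxLengths:
    """Store positive document lengths as small integer codes on a geometric scale."""

    def __init__(self, lengths, bits=6):
        self.lo = min(lengths)
        hi = max(lengths)
        self.levels = 2 ** bits  # bits must be at most 8 to fit the byte array
        # Choose the base so that the largest length lands in the last bucket.
        self.base = (hi / self.lo) ** (1.0 / self.levels) if hi > self.lo else 2.0
        self.codes = array("B", (self._encode(x) for x in lengths))

    def _encode(self, length):
        code = int(math.log(length / self.lo) / math.log(self.base))
        return min(code, self.levels - 1)

    def approx(self, doc_id):
        """Decode to the geometric midpoint of the document's bucket."""
        return self.lo * self.base ** (self.codes[doc_id] + 0.5)
```

With this coding the decoded length differs from the true length by at most a factor of about the square root of `base`, so more bits give a tighter approximation at the cost of more memory.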

37 citations

Proceedings Article•DOI•

Synthetic workload performance analysis of incremental updates

Kurt A. Shoens¹, Anthony Tomasic², Hector Garcia-Molina²•Institutions (2)

IBM¹, Stanford University²

01 Aug 1994

TL;DR: This paper addresses the problem of incremental updates of inverted lists using a dual-structure index data structure that dynamically separates long and short inverted lists and optimizes the retrieval, update, and storage of each type of list.

Abstract: Declining disk and CPU costs have kindled a renewed interest in efficient document indexing techniques. In this paper, the problem of incremental updates of inverted lists is addressed using a dual-structure index data structure that dynamically separates long and short inverted lists and optimizes the retrieval, update, and storage of each type of list. The behavior of this index is studied with the use of a synthetically-generated document collection and a simulation model of the algorithm. The index structure is shown to support rapid insertion of documents, fast queries, and to scale well to large document collections and many disks.

35 citations

Proceedings Article•

Natural Language Information Retrieval: TREC-3 Report

Tomek Strzalkowski, Jose Perez Carballo, Mihnea Marinescu

01 Nov 1994

TL;DR: In this article, the authors report on the recent developments in NYU's natural language information retrieval system especially as related to the 3rd Text Retrieval conference (TREC-3).

Abstract: In this paper we report on the recent developments in NYU's natural language information retrieval system especially as related to the 3rd Text Retrieval conference (TREC-3). The main characteristic of this system is the use of advanced natural language processing to enhance the effectiveness of term-based document retrieval. The system is designed around a traditional statistical backbone consisting of the indexer module, which builds inverted index files from pre-processed documents, and a retrieval engine which searches and ranks the documents in response to user queries. Natural language processing is used to (1) preprocess the documents in order to extract content-carrying terms, (2) discover inter-term dependencies and build a conceptual hierarchy specific to the database domain, and (3) process user's natural language requests into effective search queries. For the present TREC-3 effort, a total of 3.3 GBytes of text articles has been processed (Tipster disks 1 through 3), including material from the Wall Street Journal, the Associated Press newswire, the Federal Register, Ziff Communication's Computer Library, Department of Energy abstracts, U.S. Patents and the San Jose Mercury News, totaling more than 500 million words of English. Since the TREC-2 conference, many components of the system have been redesigned to facilitate its scalability to deal with ever increasing amounts of data. In particular, a randomized index-splitting mechanism has been installed which allows the system to create a number of smaller indexes that can be independently searched.

32 citations

Journal Article•DOI•

Models of a distributed information retrieval system based on thesauri with weights

Zygmunt Mazur¹•Institutions (1)

Wrocław University of Technology¹

15 Jan 1994-Information Processing and Management

TL;DR: The presented retrieval rules may be viewed as the logical approach in implementing a physical distributed information retrieval system.

Abstract: The problem of combining n local information systems S_1, S_2, …, S_n based on thesauri with weights into one system S is considered. The distributed information retrieval system S may be assumed to be a network of the local systems S_j. Both the generalization relation on the set of descriptors and weights of descriptors in descriptions of documents are taken into account. The fundamental properties of distributed systems are described. The inverted file structure is often used to organize data in the information retrieval system. Operations on inverted lists are modified in order to use them in the distributed information system. While retrieving the response to any query in a distributed system, we may use the existing inverted lists from local subsystems. In a distributed system the retrieval process follows almost in the same way as in the method of inverted files in conventional systems. It differs only due to use of the additional union operations on selected inverted lists from any local subsystems. The presented retrieval rules may be viewed as the logical approach in implementing a physical distributed information retrieval system.
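
The extra union step described in this abstract can be shown in miniature. The representation is an assumption: each local system is modelled as a dictionary from descriptor to `{doc_id: weight}`, and results are keyed by `(system_name, doc_id)` so that identical local identifiers do not collide. The generalization relation on descriptors is not modelled here.

```python
def distributed_lookup(local_systems, descriptor):
    """Union the inverted lists for one descriptor across all local systems."""
    result = {}
    for system_name, inverted_file in local_systems.items():
        for doc_id, weight in inverted_file.get(descriptor, {}).items():
            # Qualify the document id with its system so local ids cannot clash.
            result[(system_name, doc_id)] = weight
    return result
```

Each local inverted file is used unchanged; only the final union is specific to the distributed setting.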

17 citations

Journal Article•DOI•

On the efficiency of best-match cluster searches

Fazli Can¹•Institutions (1)

Miami University¹

01 May 1994-Information Processing and Management

TL;DR: A method for combining cluster-based retrieval (CBR) and inverted index search is proposed and shown to be cost effective in terms of paging and CPU time; although counterintuitive to the concept of best-match CBR, it proves much more efficient than conventional approaches.

Abstract: The efficiency of various cluster-based retrieval (CBR) strategies is analyzed. The possibility of combining CBR and inverted index search (IIS) is investigated. A method for combining the two approaches is proposed and shown to be cost effective in terms of paging and CPU time. In the new method, the selection of documents from the best-matching clusters is done using the inverted index for all documents. Although this is counterintuitive to the concept of best-match CBR, the observations prove that it is much more efficient than conventional approaches. In the experiments, the effects of the number of selected clusters, page size, centroid length, and matching function are considered. The experiments show that the storage overhead of the new method would be moderately higher than that of IIS.
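
The combined strategy in this abstract (choose the best-matching clusters, but select documents through an inverted index over all documents) might look roughly like the sketch below. The matching function `overlap`, the cluster centroids as term sets, and the simple term-count score are hypothetical simplifications, not the paper's formulas.

```python
from collections import defaultdict


def overlap(query_terms, term_set):
    """Hypothetical matching function: number of shared terms."""
    return len(set(query_terms) & set(term_set))


def cbr_with_inverted_index(query_terms, centroids, doc_cluster, inverted_index, n_clusters=1):
    # Step 1: pick the best-matching clusters by comparing the query with centroids.
    best = sorted(centroids, key=lambda c: (-overlap(query_terms, centroids[c]), c))[:n_clusters]
    selected = set(best)
    # Step 2: score all documents through the inverted index.
    scores = defaultdict(int)
    for term in query_terms:
        for doc_id in inverted_index.get(term, []):
            scores[doc_id] += 1
    # Step 3: keep only documents that belong to a selected cluster.
    kept = [(d, s) for d, s in scores.items() if doc_cluster[d] in selected]
    return sorted(kept, key=lambda ds: (-ds[1], ds[0]))
```

Step 2 looks wasteful because it scores documents outside the chosen clusters, which is the counterintuitive part the abstract mentions; the claimed benefit is that it avoids reading per-cluster document vectors.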

Distributed queries and incremental updates in information retrieval systems

Anthony Tomasic¹•Institutions (1)

Princeton University¹

01 Jan 1994

TL;DR: A space of engineering trade-offs which range from optimizing update time to optimizing query performance is described, and the best algorithm for a variety of criteria is determined.

Abstract: The proliferation of the world's "information highways" has renewed interest in efficient document indexing techniques. This thesis considers the architecture of information retrieval systems. Distributed queries are studied with analytical and trace-driven simulations. We focus on physical index design, inverted index caching, and database scaling in a distributed shared-nothing system. All three issues are shown to have a strong effect on response time and throughput. Incremental updates of inverted lists are studied using a new dual-structure index data structure. The index dynamically separates long and short inverted lists and optimizes the retrieval, update, and storage of each type of list. To study the behavior of the index, a space of engineering trade-offs which range from optimizing update time to optimizing query performance is described. We quantitatively explore this space by using actual data and hardware in combination with a simulation of an information retrieval system. The best algorithm for a variety of criteria is determined. Finally, implementation of our incremental update algorithms is compared to an existing information retrieval system.

Book Chapter•DOI•

Fast document ranking for large scale information retrieval

Michael Persin¹, Justin Zobel¹, Ron Sacks-Davis¹•Institutions (1)

RMIT University¹

21 Jun 1994

TL;DR: This paper shows that it is possible to use the re-ordering to achieve a net reduction in index size, regardless of whether the index is compressed, and simultaneously achieves savings in cpu time, disk traffic, memory usage, and index size.

Abstract: For large document databases, evaluation of ranked queries can be expensive in cpu time, memory usage, and disk traffic. It has been shown that memory usage can be dramatically reduced by use of a simple filtering heuristic that eliminates most documents from consideration. In this paper we show that, by designing inverted indexes explicitly to support filtering, cpu time and disk traffic can also be dramatically reduced. The principle of the index design is that inverted lists are sorted by in-document frequency rather than by document number. In the context of compressed indexes such a re-ordering could result in a large increase in index size. We show, however, that it is possible to use the re-ordering to achieve a net reduction in index size, regardless of whether the index is compressed. Together, these techniques simultaneously achieve savings in cpu time, disk traffic, memory usage, and index size.
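
The benefit of sorting inverted lists by in-document frequency is that a filter can stop reading a list as soon as the frequencies become too small to matter. The sketch below shows this under simplifying assumptions: the score is a plain sum of term frequencies, and the cut-off `min_freq` is a fixed hypothetical constant rather than a threshold derived during query evaluation.

```python
from collections import defaultdict


def build_frequency_sorted_index(docs):
    """Map each term to postings [(freq, doc_id), ...] sorted by decreasing frequency."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for word in text.lower().split():
            counts[word][doc_id] += 1
    return {
        term: sorted(((f, d) for d, f in per_doc.items()), key=lambda p: (-p[0], p[1]))
        for term, per_doc in counts.items()
    }


def filtered_rank(index, query, min_freq=2):
    """Accumulate scores, abandoning each list once frequencies drop below min_freq."""
    accumulators = defaultdict(int)
    postings_read = 0
    for term in query.lower().split():
        for freq, doc_id in index.get(term, []):
            if freq < min_freq:
                break  # nothing later in this list can have a higher frequency
            accumulators[doc_id] += freq
            postings_read += 1
    ranked = sorted(accumulators.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked, postings_read
```

With lists ordered by document number the same filter would have to scan every posting, because a high-frequency entry could appear anywhere in the list.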

A comparison of Boolean-based retrieval to the WAIS system for retrieval of aeronautical information

Gary Marchionini, Diane Barlow

01 Mar 1994

TL;DR: An information retrieval system using a Boolean-based retrieval engine and inverted file architecture was compared with WAIS, which uses a vector-based engine; relevant documents in the WAIS searches were found to be randomly distributed in the retrieved sets rather than distributed by ranks.

Abstract: An evaluation of an information retrieval system using a Boolean-based retrieval engine and inverted file architecture and WAIS, which uses a vector-based engine, was conducted. Four research questions in aeronautical engineering were used to retrieve sets of citations from the NASA Aerospace Database which was mounted on a WAIS server and available through Dialog File 108 which served as the Boolean-based system (BBS). High recall and high precision searches were done in the BBS and terse and verbose queries were used in the WAIS condition. Precision values for the WAIS searches were consistently above the precision values for high recall BBS searches and consistently below the precision values for high precision BBS searches. Terse WAIS queries gave somewhat better precision performance than verbose WAIS queries. In every case, a small number of relevant documents retrieved by one system were not retrieved by the other, indicating the incomplete nature of the results from either retrieval system. Relevant documents in the WAIS searches were found to be randomly distributed in the retrieved sets rather than distributed by ranks. Advantages and limitations of both types of systems are discussed.

Journal Article•DOI•

The effect of postings information on searching behaviour

Frances E. Wood¹, Nigel Ford¹, Christina Walsh¹•Institutions (1)

University of Sheffield¹

01 Jan 1994-Journal of Information Science

TL;DR: How postings information is used for inverted file searching was investigated by comparing searches of the LISA (Library and Information Science Abstracts) database on CD-ROM with and without postings information.

Abstract: How postings information is used for inverted file searching was investigated by comparing searches, made by postgraduate students at the Department of Information Studies, of the LISA (Library and Information Science Abstracts) database on CD-ROM with and without postings information. Performance (the number of relevant references, precision and recall) was not significantly different but searches with postings information took more time, and more sets were viewed, than in searches without postings. Postings information was used to make decisions to narrow or broaden the search; to view or print the references. The same techniques were used to amend searches whether or not postings information was available.

Journal Article•DOI•

A review of new developments in text retrieval systems

Andy Ewers

01 Nov 1994-Journal of Information Science

TL;DR: This review was undertaken to evaluate the current and planned developments of the leading suppliers of text retrieval software, with the objective of trying to establish trends in the industry.

Abstract: This review was undertaken to evaluate the current and planned developments of the leading suppliers of text retrieval software, with the objective of trying to establish trends in the industry. The products evaluated are those from a mini/mainframe background. They can be described as fully functional text information management systems or text database systems and are all multi-user systems. The majority of these products operate an inverted file structure which indexes every word. The review was conducted with senior executives of the suppliers concerned, listed below with their primary products: Information Dimensions (BASISplus); BRS Dataware (BRS/Search); Excalibur Technologies (EFS); Status/IQ (Status/IQ); Verity (TOPIC).

Proceedings Article•DOI•

Parallel indexing in a Chinese information retrieval system

Kam-Fai Wong¹, Vincent Y. Lum¹•Institutions (1)

The Chinese University of Hong Kong¹

09 Nov 1994

TL;DR: In this article, a parallel Chinese IR system (CIR) has been designed on a SIMD parallel computer, DECmpp, which is configured with 8,192 processing elements.

Abstract: The increasing data size in Chinese information-based applications renders conventional information retrieval (IR) systems unsuitable. This is because they are limited in both storage and speed. To overcome these predicaments, a parallel Chinese IR system (CIR) has been designed. It is being developed on a SIMD parallel computer, DECmpp, which is configured with 8,192 processing elements. It uses full inverted indices for retrieval. The "divide-and-conquer" principle is exercised in exploiting data parallelism in the inverted index files. The inverted indices are first partitioned into fragments. Each fragment is then assigned to an individual processing element. Thereafter, during an index retrieval operation, all index fragments are searched in parallel. Although the principle is simple, realising the parallel indexing algorithm in a naive fashion (i.e. without considering the underlying parallel architecture) would result in poor retrieval performance. During the design of the CIR system, 3 different implementation models for parallel indexing have been considered. In this paper, qualitative evaluation of the 3 models is presented. Based on the result of the evaluation, the model that offers the best run-time performance was adopted.
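
The partition-then-search-in-parallel principle can be illustrated on an ordinary machine, with worker threads standing in for SIMD processing elements. This is not one of the three implementation models evaluated in the paper: the function names and the choice to partition postings by document id modulo the number of fragments are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_index(index, n_fragments):
    """Split {term: [doc ids]} into fragments by doc id modulo n_fragments."""
    fragments = [dict() for _ in range(n_fragments)]
    for term, doc_ids in index.items():
        for doc_id in doc_ids:
            fragments[doc_id % n_fragments].setdefault(term, []).append(doc_id)
    return fragments


def search_fragment(fragment, term):
    """Look up one term in a single fragment."""
    return fragment.get(term, [])


def parallel_lookup(fragments, term):
    """Search every fragment concurrently and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(fragments)) as pool:
        partials = pool.map(search_fragment, fragments, [term] * len(fragments))
        return sorted(d for part in partials for d in part)
```

Each fragment holds a disjoint share of every posting list, so the merged result equals what a single unpartitioned index would return.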

Proceedings Article•

A text retrieval package for the unix operating system

Liam R. E. Quin

06 Jun 1994

TL;DR: The lq-text package is available in source form, has been successfully integrated into a number of other systems and products, and is in use at over 100 sites.

Abstract: This paper describes lq-text, an inverted index text retrieval package written by the author. Inverted index text retrieval provides a fast and effective way of searching large amounts of text. This is implemented by making an index to all of the natural-language words that occur in the text. The actual text remains unaltered in place, or, if desired, can be compressed or archived; the index allows rapid searching even if the data files have been altogether removed. The design and implementation of lq-text are discussed, and performance measurements are given for comparison with other text searching programs such as grep and agrep. The functionality provided is compared briefly with other packages such as glimpse and zbrowser. The lq-text package is available in source form, has been successfully integrated into a number of other systems and products, and is in use at over 100 sites.
