
Showing papers on "Inverted index published in 2002"


Proceedings ArticleDOI
03 Jun 2002
TL;DR: This paper surveys both algorithms and applications for generalizing keyword search to keytree and keygraph searching, because trees and graphs have many applications in next-generation database systems.
Abstract: Modern search engines answer keyword-based queries extremely efficiently. The impressive speed is due to clever inverted index structures, caching, a domain-independent knowledge of strings, and thousands of machines. Several research efforts have attempted to generalize keyword search to keytree and keygraph searching, because trees and graphs have many applications in next-generation database systems. This paper surveys both algorithms and applications, giving some emphasis to our own work.
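
For orientation across the entries below, here is a minimal sketch of the core mechanism most of them build on: an inverted index mapping each term to the documents containing it, queried by intersecting posting lists. The code is illustrative only; the names and simplifications are mine, not taken from the paper.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def conjunctive_query(index, terms):
    """Return ids of documents containing every query term (AND semantics)."""
    postings = [set(index.get(t.lower(), [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = ["clever inverted index structures", "keyword search engines", "index caching"]
index = build_inverted_index(docs)
print(conjunctive_query(index, ["index"]))  # -> [0, 2]
```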

453 citations


01 Dec 2002
TL;DR: This paper proposes a new framework for enforcing privacy in mining frequent itemsets, and combines techniques for efficiently hiding restrictive patterns: a transaction retrieval engine relying on an inverted file and Boolean queries; and a set of algorithms to "sanitize" a database.
Abstract: One crucial aspect of privacy preserving frequent itemset mining is the fact that the mining process deals with a trade-off: privacy and accuracy, which are typically contradictory, and improving one usually incurs a cost in the other. One alternative to address this particular problem is to look for a balance between hiding restrictive patterns and disclosing nonrestrictive ones. In this paper, we propose a new framework for enforcing privacy in mining frequent itemsets. We combine, in a single framework, techniques for efficiently hiding restrictive patterns: a transaction retrieval engine relying on an inverted file and Boolean queries; and a set of algorithms to "sanitize" a database. In addition, we introduce performance measures for mining frequent itemsets that quantify the fraction of mined patterns which are preserved after sanitizing a database. We also report the results of a performance evaluation of our research prototype and an analysis of the results.
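
As a rough illustration of the retrieval side of such a framework (not the authors' code), the following sketch builds an inverted file over transactions and uses a Boolean AND query to find the transactions supporting a hypothetical restrictive pattern, the step that would precede sanitization:

```python
from collections import defaultdict

def build_transaction_index(transactions):
    """Inverted file: item -> set of ids of transactions containing that item."""
    index = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            index[item].add(tid)
    return index

def supporting_transactions(index, pattern):
    """Boolean AND query: ids of transactions containing every item in the pattern."""
    postings = [index.get(item, set()) for item in pattern]
    return set.intersection(*postings) if postings else set()

transactions = [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}]
index = build_transaction_index(transactions)
# Transactions supporting the (hypothetically restrictive) pattern {a, c}:
print(supporting_transactions(index, {"a", "c"}))  # -> {0, 1}
```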

321 citations


Proceedings ArticleDOI
11 Aug 2002
TL;DR: This paper proposes several simple optimisations to well-known integer compression schemes, and shows experimentally that these lead to significant reductions in time, and concludes that fast byte-aligned codes should be used to store integers in inverted lists.
Abstract: Compression reduces both the size of indexes and the time needed to evaluate queries. In this paper, we revisit the compression of inverted lists of document postings that store the position and frequency of indexed terms, considering two approaches to improving retrieval efficiency: better implementation and better choice of integer compression schemes. First, we propose several simple optimisations to well-known integer compression schemes, and show experimentally that these lead to significant reductions in time. Second, we explore the impact of choice of compression scheme on retrieval efficiency. In experiments on large collections of data, we show two surprising results: use of simple byte-aligned codes halves the query evaluation time compared to the most compact Golomb-Rice bitwise compression schemes; and, even when an index fits entirely in memory, byte-aligned codes result in faster query evaluation than does an uncompressed index, emphasising that the cost of transferring data from memory to the CPU cache is less for an appropriately compressed index than for an uncompressed index. Moreover, byte-aligned schemes have only a modest space overhead: the most compact schemes result in indexes that are around 10% of the size of the collection, while a byte-aligned scheme is around 13%. We conclude that fast byte-aligned codes should be used to store integers in inverted lists.
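
One widely used byte-aligned scheme is variable-byte coding of d-gaps. A minimal sketch follows; it illustrates the general technique rather than the authors' optimised implementations, and the bit-flag convention is just one common choice:

```python
def vbyte_encode(gaps):
    """Variable-byte encode positive d-gaps: 7 data bits per byte,
    high bit set on the terminating byte of each integer."""
    out = bytearray()
    for gap in gaps:
        chunk = []
        while True:
            chunk.append(gap & 0x7F)
            gap >>= 7
            if gap == 0:
                break
        chunk.reverse()
        chunk[-1] |= 0x80          # flag the last byte of this integer
        out.extend(chunk)
    return bytes(out)

def vbyte_decode(data):
    """Decode a variable-byte stream back into d-gaps."""
    gaps, value = [], 0
    for b in data:
        value = (value << 7) | (b & 0x7F)
        if b & 0x80:               # terminating byte of one integer
            gaps.append(value)
            value = 0
    return gaps

doc_ids = [3, 7, 11, 120, 121]
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
assert vbyte_decode(vbyte_encode(gaps)) == gaps
```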

280 citations


Patent
04 Nov 2002
TL;DR: In this article, a knowledge management and archival system on a network is described, where a document to be archived is prepared in machine-readable form and loaded into a database; the document can be text, image, video or audio, all of which are indexed during and/or after uploading and stored in the database.
Abstract: The method and system of the present invention provide for a knowledge management and archival system on a network. A document to be archived is prepared in machine-readable form and loaded into a database. The document can be text, image, video or audio, all of which are indexed during and/or after uploading and stored in the database. This knowledge management system comprises a server for uploading a record, a user terminal for creating a plurality of indices for each uploaded record, and a search engine for locating records in response to an index-sensitive inquiry. The knowledge management system can also include indexes, such as a field index, a native index and a content index. The knowledge management system also comprises an application program comprising a system administration utility, a data loading component and database management utilities.

183 citations


Book ChapterDOI
TL;DR: This work adapts a state-of-the-art text-based document ranking algorithm, the vector-space model instantiated with the TFxIDF ranking rule, to the P2P environment, and develops a heuristic for adaptively determining the set of peers that should be contacted for a query.
Abstract: We consider the problem of content search and retrieval in peer-to-peer (P2P) communities. P2P computing is a potentially powerful model for information sharing between ad hoc groups of users because of its low cost of entry and natural model for resource scaling. As P2P communities grow, however, locating information distributed across the large number of peers becomes problematic. We address this problem by adapting a state-of-the-art text-based document ranking algorithm, the vector-space model instantiated with the TFxIDF ranking rule, to the P2P environment. We make three contributions: (a) we show how to approximate TFxIDF using compact summaries of individual peers' inverted indexes rather than the inverted index of the entire communal store; (b) we develop a heuristic for adaptively determining the set of peers that should be contacted for a query; and (c) we show that our algorithm tracks TFxIDF's performance very closely, giving P2P communities a search and retrieval algorithm as good as that possible assuming a centralized server.
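
A compact sketch of the TFxIDF scoring rule referred to above, in a simple textbook form; the exact weighting variant and the peer-summary approximation described in the paper are not reproduced here:

```python
import math
from collections import Counter

def tfidf_scores(query_terms, docs):
    """Score each document against the query with a simple TFxIDF rule:
    score(d) = sum over query terms t of tf(t, d) * log(N / df(t))."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for terms in tokenized:
        df.update(set(terms))          # document frequency of each term
    scores = []
    for terms in tokenized:
        tf = Counter(terms)            # term frequencies within this document
        score = sum(
            tf[t] * math.log(n / df[t])
            for t in query_terms if df.get(t)
        )
        scores.append(score)
    return scores

docs = ["peer to peer search", "peer communities share files", "centralized search server"]
print(tfidf_scores(["peer", "search"], docs))
```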

130 citations


Proceedings ArticleDOI
02 Apr 2002
TL;DR: An algorithm that permutes document numbers so as to create locality in an inverted index by clustering the documents, improving the performance of the best difference coding algorithm tested.
Abstract: An important concern in the design of search engines is the construction of an inverted index. An inverted index, also called a concordance, contains a list of documents (or posting list) for every possible search term. These posting lists are usually compressed with difference coding. Difference coding yields the best compression when the lists to be coded have high locality. Coding methods have been designed to specifically take advantage of locality in inverted indices. Here, we describe an algorithm to permute the document numbers so as to create locality in an inverted index. This is done by clustering the documents. Our algorithm, when applied to the TREC ad hoc database (disks 4 and 5), improves the performance of the best difference coding algorithm we found by fourteen percent. The improvement increases as the size of the index increases, so we expect that greater improvements would be possible on larger datasets.
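
A small illustration of why such a permutation helps (the clustering algorithm itself is not shown): difference coding stores the gaps between consecutive document ids, and assigning similar documents adjacent ids makes those gaps small:

```python
def d_gaps(posting_list):
    """Convert a sorted posting list of document ids into d-gaps."""
    return [posting_list[0]] + [b - a for a, b in zip(posting_list, posting_list[1:])]

# Postings of one term before and after a (hypothetical) clustering permutation
# that assigns similar documents consecutive ids.
original  = [4, 120, 253, 398, 512]
reordered = [4, 5, 6, 7, 8]

print(d_gaps(original))    # large gaps -> long difference codes
print(d_gaps(reordered))   # gaps of 1  -> very short difference codes
```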

126 citations


Proceedings ArticleDOI
01 Dec 2002
TL;DR: A region-based image retrieval framework is presented that integrates a region-based representation that is efficient in terms of storage and retrieval with effective on-line learning capability, and supports relevance feedback based on the vector model with a weighting scheme.
Abstract: We present a region-based image retrieval framework that integrates a region-based representation that is efficient in terms of storage and retrieval with effective on-line learning capability. The framework consists of methods for image segmentation and grouping, indexing using a modified inverted file, relevance feedback, and continuous learning. By exploiting a vector quantization method, a compact region-based image representation is achieved. Based on this representation, an indexing scheme similar to the inverted file technology is proposed. In addition, it supports relevance feedback based on the vector model with a weighting scheme. A continuous learning strategy is also proposed to enable the system to self-improve. Experimental results on a database of 10,000 general-purpose images demonstrate the efficiency and effectiveness of the proposed framework.

86 citations


Patent
22 Nov 2002
TL;DR: In this paper, documents containing information about product offerings in various natural languages are passed through transitional translation layers which convert the data to a single computer language using a universal character set encompassing the character sets used in all supported natural languages.
Abstract: Documents containing information about product offerings in various natural languages are passed through transitional translation layers which convert the data to a single computer language using a universal character set encompassing the character sets used in all supported natural languages. The documents are stored in their original natural languages and in English with documents segmented into components which components are identified by search terms arranged in a taxonomy tree based on product types. The names of the products in the national languages are added to the English language documents enabling quick keyword searches when the product name or number is known. A bi-directional inverted index is provided for access by the keyword search terms so that keywords with the same meaning in different languages are accessible together when the keyword in one of the languages is queried.

82 citations


Proceedings ArticleDOI
11 Mar 2002
TL;DR: This paper examines an XML collection from the viewpoint of Information Retrieval (IR) and views the XML documents as a collection of text documents with additional tags and attempts to adapt existing IR techniques to achieve more sophisticated search on XML documents.
Abstract: Query languages that take advantage of the XML document structure already exist. However, the systems that have been developed to query XML data explore the XML sources from a database perspective. This paper examines an XML collection from the viewpoint of Information Retrieval (IR). As such, we view the XML documents as a collection of text documents with additional tags and we attempt to adapt existing IR techniques to achieve more sophisticated search on XML documents. We employ a class of queries that support path expressions and suggest an efficient index, which extends the inverted file structure to search XML documents. This is accomplished by integrating the XML structure in the inverted file by combining the inverted file with a path index. The proposed structure is a lexicographical index, which may be used for the evaluation of queries that involve path expressions. Moreover, this paper discusses a ranking scheme based on both the term distribution and document structure. Some performance remarks are also presented.

82 citations


Proceedings ArticleDOI
11 Aug 2002
TL;DR: Combined use of an auxiliary nextword index and a conventional inverted file allows evaluation of phrase queries in half the time required to evaluate such queries with an inverted file alone, and the space overhead is only 10% of the size of the inverted file.
Abstract: Search engines need to evaluate queries extremely fast, a challenging task given the vast quantities of data being indexed. A significant proportion of the queries posed to search engines involve phrases. In this paper we consider how phrase queries can be efficiently supported with low disk overheads. Previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. We propose a combination of nextword indexes with inverted files as a solution to this problem. Our experiments show that combined use of an auxiliary nextword index and a conventional inverted file allow evaluation of phrase queries in half the time required to evaluate such queries with an inverted file alone, and the space overhead is only 10% of the size of the inverted file. Further time savings are available with only slight increases in disk requirements.
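
A rough sketch of the nextword idea, assuming a simplified in-memory layout rather than the paper's on-disk structures: for each (word, next word) pair the index records where the pair occurs, so a two-word phrase query is answered by a single lookup instead of intersecting long positional postings:

```python
from collections import defaultdict

def build_nextword_index(docs):
    """Map (firstword, nextword) -> list of (doc_id, position of firstword)."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for pos, (w, nxt) in enumerate(zip(words, words[1:])):
            index[(w, nxt)].append((doc_id, pos))
    return index

def phrase_query(index, phrase):
    """Evaluate a two-word phrase query directly from the nextword index."""
    w, nxt = phrase.lower().split()
    return index.get((w, nxt), [])

docs = ["the quick brown fox", "a quick brown dog", "brown quick"]
index = build_nextword_index(docs)
print(phrase_query(index, "quick brown"))  # -> [(0, 1), (1, 1)]
```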

80 citations


Patent
01 Feb 2002
TL;DR: In this paper, a system and method of searching a database in which documents of different languages are included is described, which includes a synonym or keyword dictionary which is bi-directional and allows for translation of keywords between a first language and other languages.
Abstract: A system and method of searching a database in which documents of different languages are included. The system includes a synonym or keyword dictionary which is bi-directional and allows for translation of keywords between a first language and other languages. The translated keywords for the document are stored in an inverted index, which is then used for searching either in a selected language, a second language, or in all languages, as determined by the user. This use of multilingual searching and a translated synonym dictionary avoids the need for translation of the entire document and avoids inaccuracies which may result from translation.

Proceedings ArticleDOI
04 Nov 2002
TL;DR: Two algorithms for Pagerank are derived from techniques proposed for out-of-core graph algorithms and compared to two existing algorithms proposed by Haveliwala; the implementation of a recently proposed topic-sensitive version of Pagerank is also considered.
Abstract: Over the last few years, most major search engines have integrated link-based ranking techniques in order to provide more accurate search results. One widely known approach is the Pagerank technique, which forms the basis of the Google ranking scheme, and which assigns a global importance measure to each page based on the importance of other pages pointing to it. The main advantage of the Pagerank measure is that it is independent of the query posed by a user; this means that it can be precomputed and then used to optimize the layout of the inverted index structure accordingly. However, computing the Pagerank measure requires implementing an iterative process on a massive graph corresponding to billions of web pages and hyperlinks. In this paper, we study I/O-efficient techniques to perform this iterative computation. We derive two algorithms for Pagerank based on techniques proposed for out-of-core graph algorithms, and compare them to two existing algorithms proposed by Haveliwala. We also consider the implementation of a recently proposed topic-sensitive version of Pagerank. Our experimental results show that for very large data sets, significant improvements over previous results can be achieved on machines with moderate amounts of memory. On the other hand, at most minor improvements are possible on data sets that are only moderately larger than memory, which is the case in many practical scenarios.
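
For reference, a toy in-memory power-iteration version of the Pagerank computation; the paper's contribution is precisely how to carry this out I/O-efficiently when the graph does not fit in memory, which this sketch ignores:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simple in-memory Pagerank. links maps each page to the pages it points to;
    every link target is assumed to be a key of links. Dangling pages (no
    out-links) distribute their rank uniformly."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        dangling = sum(rank[p] for p in pages if not links[p])
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            share = damping * rank[p] / len(links[p]) if links[p] else 0.0
            for q in links[p]:
                new_rank[q] += share
        rank = new_rank
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}
print(pagerank(links))
```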

Proceedings ArticleDOI
11 Aug 2002
TL;DR: The hypothesis is that a disk-based inverted file structure allows for faster calculation of predictions and also that early termination heuristics may be used to further speed up the filtering process and perhaps even improve the quality of the predictions.
Abstract: This paper explores the possibility of using a disk-based inverted file structure for collaborative filtering. Our hypothesis is that this allows for faster calculation of predictions and also that early termination heuristics may be used to further speed up the filtering process and perhaps even improve the quality of the predictions. This was tested in an experiment on the EachMovie dataset. Our results indicate that searching the inverted file structure is many times faster than general in-memory vector search, even for very large profiles. The Continue termination heuristic produces the best-ranked predictions in our experiments, and Quit is the top performer in terms of speed.

Patent
Tapas Kumar Nayak1
01 May 2002
TL;DR: In this article, a full-text search index system and method is generated by creating instances of a database index from an in-memory inverted list of keywords associated with a text identifier and the occurrences of the keyword in the text.
Abstract: A full-text search index system and method is generated by creating instances of a database index from an in-memory inverted list of keywords associated with a text identifier and the occurrences of the keyword in the text. Instances of the index are placed in a priority queue. A merge scheduling process determines when a merge should be initiated, selects instances of the index to be merged and selects a type of merge to perform.
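
A hypothetical sketch of such a merge-scheduling policy; the priority key, threshold and merge rule below are my assumptions for illustration, not the patent's claims. Index instances sit in a priority queue ordered by size, and the two smallest are merged whenever too many accumulate:

```python
import heapq
from collections import defaultdict

def merge_indexes(a, b):
    """Merge two inverted-index instances (term -> sorted list of doc ids)."""
    merged = defaultdict(list)
    for index in (a, b):
        for term, postings in index.items():
            merged[term].extend(postings)
    return {term: sorted(postings) for term, postings in merged.items()}

def schedule_merges(instances, max_instances=2):
    """Keep index instances in a priority queue keyed by total posting count
    and merge the two smallest whenever too many instances accumulate."""
    heap = [(sum(len(p) for p in idx.values()), i, idx) for i, idx in enumerate(instances)]
    heapq.heapify(heap)
    counter = len(instances)            # tie-breaker so dicts are never compared
    while len(heap) > max_instances:
        _, _, small_a = heapq.heappop(heap)
        _, _, small_b = heapq.heappop(heap)
        merged = merge_indexes(small_a, small_b)
        heapq.heappush(heap, (sum(len(p) for p in merged.values()), counter, merged))
        counter += 1
    return [idx for _, _, idx in heap]

instances = [{"cat": [1, 4]}, {"cat": [2], "dog": [2]}, {"dog": [5, 6]}]
print(schedule_merges(instances, max_instances=1))
```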

Patent
31 May 2002
TL;DR: In this article, a system and method for generating an inverted index and processing search queries using the inverted index is presented, where a numeric range query is translated into a query on multiple tokens and combining two or more range queries on different attributes becomes a simple merge of document identification lists.
Abstract: A system and method is provided for generating an inverted index and processing search queries using the inverted index. To increase efficiency for queries having multiple numeric range conditions, numeric attributes are tokenized into a plurality of tokens based on their binary value. The tokens become keys in the inverted index. A numeric range query is translated into a query on multiple tokens and combining two or more range queries on different attributes becomes a simple merge of document identification lists.
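
A hypothetical illustration of the idea; the token format, precision step and decomposition below are my own, not taken from the patent. Each value is indexed under bit-prefix tokens at several precisions, and a range query is covered by coarse tokens in its interior and fine tokens at its boundaries:

```python
def tokenize_value(value, max_bits=8, step=2):
    """Index-time tokens for one numeric value: its bit-prefixes at several
    precisions. The token format '<shift>:<prefix>' is a made-up convention."""
    return [f"{shift}:{value >> shift}" for shift in range(0, max_bits + 1, step)]

def range_to_tokens(lo, hi, max_bits=8, step=2):
    """Translate the inclusive range [lo, hi] into a small set of prefix tokens,
    using coarse tokens for the interior and fine tokens at the boundaries."""
    tokens = []
    shift = 0
    while lo <= hi and shift + step <= max_bits:
        block = 1 << (shift + step)
        # Peel low-end values until lo is aligned to the next coarser block.
        while lo <= hi and lo % block != 0:
            tokens.append(f"{shift}:{lo >> shift}")
            lo += 1 << shift
        # Peel high-end values until hi + 1 is aligned to the next coarser block.
        while lo <= hi and (hi + 1) % block != 0:
            tokens.append(f"{shift}:{hi >> shift}")
            hi -= 1 << shift
        shift += step
    while lo <= hi:                       # the remaining, fully aligned middle
        tokens.append(f"{shift}:{lo >> shift}")
        lo += 1 << shift
    return tokens

print(tokenize_value(13))         # tokens stored for the value 13
print(range_to_tokens(3, 17))     # tokens whose posting lists are merged for [3, 17]
```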

Book ChapterDOI
16 Sep 2002
TL;DR: A high-performance "hybrid" partitioned inverted index is validated through experiments with a 100 Gbyte collection from TREC-9 and -10, showing that this organization outperforms both the document and the term partitioning schemes.
Abstract: The rapid increase in content available in digital forms gives rise to large digital libraries, targeted to support millions of users and terabytes of data. Efficiently retrieving information then is a challenging task due to the size of the collection and its index. In this paper, our high-performance "hybrid" partitioned inverted index is validated through experiments with a 100 Gbyte collection from TREC-9 and -10. The hybrid scheme combines the term and the document approaches to partitioning inverted indices across nodes of a parallel system. Experiments on a parallel system show that this organization outperforms the document and the term partitioning schemes. Our hybrid approach should support highly efficient searching for information in a large-scale digital library, implemented atop a network of computers.

Book ChapterDOI
01 Jan 2002
TL;DR: This chapter presents the main data structures and algorithms for searching large text collections, and shows how mechanisms based upon inverted files can be used to index and search the Web.
Abstract: In this chapter we present the main data structures and algorithms for searching large text collections. We emphasize inverted files, the most used index, but also review suffix arrays, which are useful in a number of specialized applications. We also cover parallel and distributed implementations of these two structures. As an example, we show how mechanisms based upon inverted files can be used to index and search the Web.

Book ChapterDOI
21 Nov 2002
TL;DR: The data structure is extended so that the scores can be computed for any pattern efficiently while keeping the size of the data structures moderate, which is an improvement over existing methods that use O(n log n) bits of space for a text collection of length n.
Abstract: We propose space-efficient data structures for text retrieval systems that have the merits of both theoretical data structures like suffix trees and practical ones like inverted files. Traditional text retrieval systems use inverted files and support ranking queries based on the tf*idf (term frequency times inverse document frequency) scores of documents that contain given keywords, which cannot be answered using only suffix trees. A drawback of these systems is that the scores can be computed only for predetermined keywords. We extend the data structure so that the scores can be computed for any pattern efficiently while keeping the size of the data structures moderate. The size is comparable with the text size, which is an improvement over existing methods that use O(n log n) bits of space for a text collection of length n.


Book ChapterDOI
01 Jan 2002
TL;DR: Search engine users have been observed to browse through very few pages of results for queries that they submit, suggesting that prefetching many results upon processing an initial query is not efficient, since the user who initiated the search will not request most of the prefetched results.
Abstract: The sheer size of the WWW and the efforts of search engines to index significant portions of it have caused many search engines to partition their inverted index of the Web into several disjoint segments (partial indices). The partitioning of the index impacts the manner in which the engines process queries. Most engines also use some form of query result caching, where results of queries that were served are cached for some time. In particular, query results may be prefetched in anticipation of user requests. Such a scenario occurs when the engine retrieves (for a certain query) more results than will initially be returned to the user. Search engine users have been observed to browse through very few pages of results for queries that they submit. This behavior of users suggests that prefetching many results upon processing an initial query is not efficient, since the user who initiated the search will not request most of the prefetched results. However, a policy that abandons result prefetching in favor of retrieving just the first page of search results might not make optimal use of system resources either.

01 Jan 2002
TL;DR: Panache is presented, a distributed inverted index that scales well with the number of nodes in the network and can be shown to use significantly less bandwidth than Gnutella using real-world estimates of network parameters, while retaining high-quality search results.
Abstract: The primary challenge in developing a peer-to-peer file sharing system is implementing an efficient keyword search mechanism. This paper presents Panache, a distributed inverted index that scales well with the number of nodes in the network. Panache addresses three critical needs for searching peer-to-peer file sharing systems—efficient use of bandwidth, relevant search results and accommodation for graceful node transience. To achieve these needs, Panache aggregates popularity information and builds upon other peer-to-peer systems that distribute index information by keyword. Relying on a combination of Bloom filtering, query ordering, and truncated results based on popularity data, Panache can be shown to use significantly less bandwidth than Gnutella using real-world estimates of network parameters, while retaining high-quality search results. Simulation experiments demonstrate that Panache may be viable for Internet deployment, although more comprehensive testing is needed. Panache provides an exciting starting point for future development and optimization.
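
A minimal Bloom filter sketch of the kind of compact keyword summary a peer could exchange instead of full posting lists; the parameters and hashing here are illustrative, not Panache's actual design:

```python
import hashlib

class BloomFilter:
    """Compact set summary: may report false positives, never false negatives."""
    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# A peer summarizes its local keywords; remote peers test membership cheaply.
local_keywords = BloomFilter()
for word in ["inverted", "index", "panache"]:
    local_keywords.add(word)
print("index" in local_keywords, "gnutella" in local_keywords)  # True, (almost surely) False
```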

01 Jan 2002
TL;DR: The composite inverted lists algorithm proposed in this paper can operate either as a local or global inverted list, or both at the same time, depending on the workload generated by the users.
Abstract: The inverted lists strategy is frequently used as an index data structure for very large textual databases. Its implementation and comparative performance has been studied in sequential and parallel applications. In the latter, with relatively few studies, there has been a sort of “which-is-better” discussion about two alternative parallel realizations of the basic data structure and algorithms. We suggest that a mix between the two is actually a better alternative. Depending on the workload generated by the users, the composite inverted lists algorithm we propose in this paper can operate either as a local or global inverted list, or both at the same time.

01 Jan 2002
TL;DR: In this article, the authors explored the performance of a range of data structures for the task of managing sets of millions of strings, and developed new variants of each that are more efficient for this task than previous alternatives.
Abstract: Fundamental structures such as trees and hash tables are used for managing data in a huge variety of circumstances. Making the right choice of structure is essential to efficiency. In previous work we have explored the performance of a range of data structures---different forms of trees, tries, and hash tables---for the task of managing sets of millions of strings, and have developed new variants of each that are more efficient for this task than previous alternatives. In this paper we test the performance of the same data structures on small sets of strings, in the context of document processing for index construction. Our results show that the new structures, in particular our burst trie, are the most efficient choice for this task, thus demonstrating that they are suitable for managing sets of hundreds to millions of distinct strings, and for input of hundreds to billions of occurrences.

Proceedings ArticleDOI
04 Nov 2002
TL;DR: This paper investigates two different approaches for reducing index space of inverted files for XML documents and develops the new XS tree data structure which contains the structural description of a document in a rather compact form, such that these descriptions can be kept in main memory.
Abstract: Query languages for retrieval of XML documents allow for conditions referring both to the content and the structure of documents. In this paper, we investigate two different approaches for reducing index space of inverted files for XML documents. First, we consider methods for compressing index entries. Second, we develop the new XS tree data structure which contains the structural description of a document in a rather compact form, such that these descriptions can be kept in main memory. Experimental results on two large XML document collections show that very high compression rates for indexes can be achieved, but any compression increases retrieval time. On the other hand, highly compressed indexes may be feasible for applications where storage is limited, such as in PDAs or E-book devices.

Proceedings ArticleDOI
24 Jul 2002
TL;DR: This paper presents the design, implementation and evaluation of Mingle, a secure distributed search system that allows users to conveniently establish their identity to Mingle servers, such that subsequent authentication occurs automatically, with minimal manual involvement.
Abstract: This paper presents the design, implementation and evaluation of Mingle, a secure distributed search system. Each participating host runs a Mingle server, which maintains an inverted index of the local file system. Users initiate peer-to-peer keyword searches by typing keywords to lightweight Mingle clients. Central to Mingle are its access control mechanisms and its insistence on user convenience. For access control, we introduce the idea of access-right mapping, which provides a convenient way for file owners to specify access permissions. Access control is supported through a single sign-on mechanism that allows users to conveniently establish their identity to Mingle servers, such that subsequent authentication occurs automatically, with minimal manual involvement. Preliminary performance evaluation suggests that Mingle is both feasible and scalable.

Journal ArticleDOI
TL;DR: This work shows that, by using parallel processing techniques, it is feasible to build a scalable information retrieval system, and that the performance bottleneck in previous work, which parallelized only disk access, can be removed.

Journal Article
TL;DR: This model has the advantages of many popular full-text index models, such as the inverted-list model and the Pat array model, and improves space and time efficiency, as proved by theory and experiment.
Abstract: In this paper, a new full-text index model, the subsequence array model, is put forward. It has the advantages of many popular full-text index models, such as the inverted-list model and the Pat array model, and improves space and time efficiency, as is proved by theory and experiment.

Book ChapterDOI
11 Aug 2002
TL;DR: This paper proposes a new approach to storing and querying XML data in an RDBMS based on the ideas of a numbering schema and an inverted list, which is flexible enough to support XML documents both with and without a schema, and both retrieval and update applications.
Abstract: The common feature of XML query languages is the use of path expressions to query XML data. To speed up the processing of path expression queries, it is important to be able to quickly determine the ancestor-descendant relationship between any pair of nodes in the hierarchy of XML data. At the same time, keyword search is also important for querying XML data with a regular structure, if the user does not know the structure or only knows it partially. In this paper, we propose a new approach to storing and querying XML data in the RDBMS based on the ideas of a numbering schema and an inverted list. Our approach allows us to quickly determine the ancestor-descendant relationship between any pair of nodes in the hierarchy of XML data, which is particularly effective for search paths that are very long or unknown. Examples demonstrate that our approach can effectively support both XQuery-style queries and keyword searches. Our approach is flexible enough to support XML documents both with and without a schema, and both retrieval and update applications.
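
A small sketch of a typical numbering scheme of this kind, assuming a simple interval encoding that is not necessarily the paper's exact schema: each node receives a (start, end) interval from a depth-first traversal, and a node is an ancestor of another exactly when its interval strictly contains the other's:

```python
def assign_intervals(tree, root):
    """Assign each node a (start, end) interval by depth-first traversal:
    a node's interval encloses the intervals of all its descendants."""
    intervals, counter = {}, [0]

    def visit(node):
        start = counter[0]; counter[0] += 1
        for child in tree.get(node, []):
            visit(child)
        intervals[node] = (start, counter[0]); counter[0] += 1

    visit(root)
    return intervals

def is_ancestor(intervals, a, d):
    """True if node a is a proper ancestor of node d."""
    sa, ea = intervals[a]
    sd, ed = intervals[d]
    return sa < sd and ed < ea

# book -> chapters -> sections (a toy XML hierarchy)
tree = {"book": ["ch1", "ch2"], "ch1": ["sec1.1"], "ch2": []}
iv = assign_intervals(tree, "book")
print(is_ancestor(iv, "book", "sec1.1"), is_ancestor(iv, "ch2", "sec1.1"))  # True False
```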

01 Jan 2002
TL;DR: A multi-level index structure is proposed to efficiently handle salient object-based queries with different levels of constraints, and the characteristics of video data are captured in the second level, which is designed to reduce the storage requirement.
Abstract: Several salient object-based data models have been proposed to model video data; however, none of them proposes an index structure to handle salient object-based queries efficiently. Several indexing schemes have been proposed for spatio-temporal relationships among objects, and they are used to optimize timestamp and interval queries, which are rarely used in video databases. Moreover, these index structures are designed without consideration of the granularity levels of constraints on salient objects or the characteristics of the video data. In this paper, we propose a multi-level index structure to efficiently handle salient object-based queries with different levels of constraints. The characteristics of video data are also captured in the second level of the index structure, which is designed to reduce the storage requirement.

01 Jan 2002
TL;DR: The experimental results are promising and show a significant speedup over the conventional and common indexing and query processing method.
Abstract: We demonstrate a parallel implementation of a sparse matrix information retrieval engine. We use a shared-nothing PC cluster. We perform our experiments with TREC disk 4 and 5 data, a 2-gigabyte NIST standard benchmark text collection, on 2, 4, 6, 8, 10, 12 and 14 processing nodes with different queries. We compare the results with those of a sequential inverted index, a conventional and common indexing and query processing method. The experimental results are promising and show a significant speedup.