Showing papers on "Inverted index published in 1997"


Patent
05 Feb 1997
TL;DR: An indexer traverses the hypertext database and finds hypertext information, including the address of the document each hyperlink points to and the anchor text of each hyperlink.
Abstract: A search engine for retrieving documents pertinent to a query indexes documents in accordance with hyperlinks pointing to those documents. The indexer traverses the hypertext database and finds hypertext information including the address of the document the hyperlinks point to and the anchor text of each hyperlink. The information is stored in an inverted index file, which may also be used to calculate document link vectors for each hyperlink pointing to a particular document. When a query is entered, the search engine finds all document vectors for documents having the query terms in their anchor text. A query vector is also calculated, and the dot product of the query vector and each document link vector is calculated. The dot products relating to a particular document are summed to determine the relevance ranking for each document.

373 citations
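
The ranking step described in this abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the sample links, the use of raw term counts as vector weights, and the names `build_anchor_index` and `rank` are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical hyperlinks: (address of the target document, anchor text).
LINKS = [
    ("a.html", "inverted index tutorial"),
    ("a.html", "index basics"),
    ("b.html", "inverted file compression"),
]

def build_anchor_index(links):
    """Build an inverted index from anchor terms to link ids, plus one vector per link."""
    index = defaultdict(set)
    vectors = []  # (target document, term-count vector) for each hyperlink
    for link_id, (target, anchor) in enumerate(links):
        vec = Counter(anchor.lower().split())  # assumed weighting: raw term counts
        vectors.append((target, vec))
        for term in vec:
            index[term].add(link_id)
    return index, vectors

def rank(query, index, vectors):
    """Sum, per target document, the dot products of the query with each matching link vector."""
    qvec = Counter(query.lower().split())
    candidates = set()
    for term in qvec:
        candidates |= index.get(term, set())
    scores = defaultdict(int)
    for link_id in candidates:
        target, vec = vectors[link_id]
        scores[target] += sum(qvec[t] * vec[t] for t in qvec)
    # Highest score first; ties broken by document address.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

INDEX, VECTORS = build_anchor_index(LINKS)
print(rank("inverted index", INDEX, VECTORS))
```

Only links whose anchor text shares a term with the query are visited, which is the point of keeping the anchor terms in an inverted index.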


Patent
13 Jun 1997
TL;DR: A system, method, and various software products improve information retrieval in very large document databases through a predetermined static cache that holds, for terms appearing in a large number of documents, a plurality of documents ordered by the contribution the term makes to each document's score.
Abstract: A system, method, and various software products provide for improved information retrieval in very large document databases through the use of a predetermined static cache. The static cache includes, for terms that appear in a large number of documents, a plurality of documents ordered by the contribution that the term makes to the document score of the document. The contribution is a scalar measure of the influence of the term in the computed document score. The contribution reflects both the within-document frequency and the between-document frequency of the term. In addition, the static cache includes for each term a lookup table that references selected entries for the term in an inverted index. Queries to the database are then processed by first traversing the static cache, obtaining the contribution information therefrom, and computing the document score from this information. Additional term frequency information for other terms in the query is obtained by looking up the document in the lookup tables of the other query terms, and obtaining the term frequency information for such terms from the inverted index, or by searching the contribution caches of the query terms.

299 citations


Patent
01 Dec 1997
TL;DR: A translation memory for computer-assisted translation is based upon an aligned file having a number of source language text strings paired with target language text strings; a posting vector file includes a posting vector associated with each source language text string in the aligned file.
Abstract: A translation memory for computer assisted translation based upon an aligned file having a number of source language text strings paired with target language text strings. A posting vector file includes a posting vector associated with each source language text string in the aligned file. Each posting vector includes a document identification number corresponding to a selected one of the source language text strings in the aligned file and a number of entropy weight values, each of the number of weight values corresponding to a unique letter n-gram that appears in the selected source language text string. Preferably, the translation memory further includes an inverted index comprising a listing of source language letter n-grams and a pointer to each of the posting vectors including an entry for the listed letter n-gram.

167 citations
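
A letter n-gram inverted index of the kind this abstract mentions might look like the sketch below. It is a simplified illustration under stated assumptions: the aligned pairs are invented, trigrams are used, and every n-gram gets a uniform weight of 1 in place of the entropy weights, whose formula the abstract does not give. The names `ngrams`, `build_ngram_index`, and `best_match` are hypothetical.

```python
from collections import defaultdict

# Hypothetical aligned file: (source language string, target language string).
ALIGNED = [
    ("open the file", "ouvrir le fichier"),
    ("close the file", "fermer le fichier"),
    ("print the page", "imprimer la page"),
]

def ngrams(text, n=3):
    """Return the set of letter n-grams in text (trigrams by default)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def build_ngram_index(aligned):
    """Inverted index: letter n-gram -> ids of the source strings containing it."""
    index = defaultdict(set)
    for sid, (source, _target) in enumerate(aligned):
        for gram in ngrams(source):
            index[gram].add(sid)
    return index

def best_match(query, aligned, index):
    """Return the aligned pair whose source string shares the most n-grams with query."""
    overlap = defaultdict(int)  # assumed uniform weight of 1 per shared n-gram
    for gram in ngrams(query):
        for sid in index.get(gram, ()):
            overlap[sid] += 1
    if not overlap:
        return None
    # Prefer the largest overlap; on ties, the earliest string in the file.
    sid = max(overlap, key=lambda s: (overlap[s], -s))
    return aligned[sid]

NGRAM_INDEX = build_ngram_index(ALIGNED)
print(best_match("open a file", ALIGNED, NGRAM_INDEX))
```

Because matching works on letter n-grams rather than whole words, a query that differs slightly from a stored source string can still retrieve its translation.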


Journal ArticleDOI
Udi Manber1
TL;DR: A new text compression scheme speeds up string matching by searching the compressed file directly; frequently searched files can remain compressed indefinitely, saving space while allowing faster search.
Abstract: A new text compression scheme is presented in this article. The main purpose of this scheme is to speed up string matching by searching the compressed file directly. The scheme requires no modification of the string-matching algorithm, which is used as a black box; any string-matching procedure can be used. Instead, the pattern is modified; only the outcome of the matching of the modified pattern against the compressed file is decompressed. Since the compressed file is smaller than the original file, the search is faster both in terms of I/O time and processing time than a search in the original file. For typical text files, we achieve about 30% reduction of space and slightly less of search time. A 30% space saving is not competitive with good text compression schemes, and thus the scheme should not be used where space is the predominant concern. The intended applications of this scheme are files that are searched often, such as catalogs, bibliographic files, and address books. Such files are typically not compressed, but with this scheme they can remain compressed indefinitely, saving space while allowing faster search at the same time. A particular application to an information retrieval system that we developed is also discussed.

151 citations


Patent
01 Dec 1997
TL;DR: An index selection tool helps reduce costs in time and memory in selecting an index configuration or set of indexes for use by a database server in accessing a database in accordance with a workload of queries.
Abstract: An index selection tool helps reduce costs in time and memory in selecting an index configuration or set of indexes for use by a database server in accessing a database in accordance with a workload of queries. The index selection tool attempts to reduce the number of indexes to be considered, the number of index configurations to be enumerated, and the number of invocations of a query optimizer in selecting an index configuration for the workload.

88 citations


Patent
18 Aug 1997
TL;DR: A system for filtering documents includes a document parser, a profile parser, and a comparator, which together determine whether an incoming document matches a user query.
Abstract: A system for filtering documents includes a document parser, a profile parser, and a comparator. The document parser accepts incoming documents as input and provides as output inverted lists of terms contained in the documents. The profile parser accepts as input user queries and provides as output query nets representing the user queries. The comparator compares the inverted lists representing the documents against the query nets representing the user queries to determine if an incoming document matches a user query. A related method for filtering incoming documents includes the steps of receiving an incoming document and parsing it to produce an inverted list of terms contained in the incoming document. The inverted list is then used to retrieve user queries. Any user queries matching less than a predetermined number of terms are immediately discarded. The remaining user queries are scored and user queries having a score less than a predetermined threshold are discarded. The remaining user queries are the queries which the incoming document matches.

84 citations
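
The filtering method in this abstract can be sketched roughly as below. The profiles, the thresholds `min_terms` and `min_score`, and the scoring rule (fraction of a profile's terms found in the document) are assumptions for illustration; the abstract does not specify how queries are scored, and plain term sets stand in for its query nets.

```python
from collections import Counter

# Hypothetical user profiles: profile id -> set of query terms.
PROFILES = {
    "p1": {"inverted", "index", "compression"},
    "p2": {"signature", "file"},
    "p3": {"index", "patent"},
}

def build_profile_index(profiles):
    """Map each term to the ids of the profiles that contain it."""
    index = {}
    for pid, terms in profiles.items():
        for term in terms:
            index.setdefault(term, set()).add(pid)
    return index

def matching_profiles(document, profiles, index, min_terms=2, min_score=0.6):
    """Return, in sorted order, the profiles that an incoming document matches."""
    doc_terms = set(document.lower().split())  # the document's list of distinct terms
    hits = Counter()
    for term in doc_terms:  # use the document's terms to retrieve user queries
        for pid in index.get(term, ()):
            hits[pid] += 1
    matched = []
    for pid, count in hits.items():
        if count < min_terms:  # discard immediately: too few matching terms
            continue
        score = count / len(profiles[pid])  # assumed score: share of profile terms present
        if score >= min_score:  # discard queries scoring below the threshold
            matched.append(pid)
    return sorted(matched)

PROFILE_INDEX = build_profile_index(PROFILES)
print(matching_profiles("a compressed inverted index uses compression", PROFILES, PROFILE_INDEX))
```

Indexing the stored queries rather than the documents suits a filtering workload, where the queries are long-lived and the documents arrive one at a time.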


Patent
15 Aug 1997
TL;DR: An index selection tool helps reduce costs in time and memory in selecting an index configuration or set of indexes for use by a database server in accessing a database in accordance with a workload of queries.
Abstract: An index selection tool helps reduce costs in time and memory in selecting an index configuration or set of indexes for use by a database server in accessing a database in accordance with a workload of queries. The index selection tool attempts to reduce the number of indexes to be considered, the number of index configurations to be enumerated, and the number of invocations of a query optimizer in selecting an index configuration for the workload.

50 citations


Patent
01 Dec 1997
TL;DR: An index selection tool helps reduce costs in time and memory in selecting an index configuration or set of indexes for use by a database server in accessing a database in accordance with a workload of queries.
Abstract: An index selection tool helps reduce costs in time and memory in selecting an index configuration or set of indexes for use by a database server in accessing a database in accordance with a workload of queries. The index selection tool attempts to reduce the number of indexes to be considered, the number of index configurations to be enumerated, and the number of invocations of a query optimizer in selecting an index configuration for the workload.

35 citations


Journal ArticleDOI
TL;DR: This article suggests a new ranking scheme especially adapted for hypertext environments in order to produce more effective retrieval results and yet maintain the effectiveness of the investment made to date in the Boolean model.
Abstract: In most commercial online systems, the retrieval system is based on the Boolean model and its inverted file organization. Since the investment in these systems is so great and changing them could be economically unfeasible, this article suggests a new ranking scheme especially adapted for hypertext environments in order to produce more effective retrieval results and yet maintain the effectiveness of the investment made to date in the Boolean model. To select the retrieved documents, the suggested ranking strategy uses multiple sources of document content evidence. The proposed scheme integrates both the information provided by the index and query terms, and the inherent relationships between documents such as bibliographic references or hypertext links. We will demonstrate that our scheme represents an integration of both subject and citation indexing, and results in a significant improvement over classical ranking schemes used in hybrid Boolean systems, while preserving its efficiency. Moreover, through knowing the nearest neighbor and the hypertext links which constitute additional sources of evidence, our strategy will take them into account in order to further improve retrieval effectiveness and to provide good starting points for browsing in a hypertext or hypermedia environment.

34 citations


25 Jun 1997
TL;DR: This work presents two novel algorithms that efficiently compute the true total ranking with a fixed space requirement independent of the size of the collection.
Abstract: Efficient ranking algorithms for similarity search use an inverted index to avoid scoring documents that have no overlap with the query. Nonetheless, partial scores must be maintained for a significant proportion of the collection. Previous work has focussed on heuristic partial ranking strategies to reduce the memory and time requirements at the cost of no longer computing the true ranks. We present two novel algorithms that efficiently compute the true total ranking with a fixed space requirement independent of the size of the collection.

22 citations
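
The baseline this abstract starts from, inverted-index ranking with one partial score per overlapping document, can be sketched as follows. This shows only the conventional accumulator approach, not the two fixed-space algorithms the work presents, which the abstract does not describe. The postings and integer weights are invented for the example.

```python
from collections import defaultdict

# Hypothetical inverted index: term -> postings list of (doc_id, weight).
POSTINGS = {
    "inverted": [(1, 5), (3, 2)],
    "index": [(1, 4), (2, 7)],
    "ranking": [(2, 1)],
}

def rank_with_accumulators(query_terms, postings):
    """Score only documents that share a term with the query.

    Returns the ranked (doc_id, score) list and the number of accumulators
    (partial scores) that had to be kept in memory.
    """
    accumulators = defaultdict(int)
    for term in query_terms:
        for doc_id, weight in postings.get(term, ()):
            accumulators[doc_id] += weight  # partial score grows term by term
    ranked = sorted(accumulators.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked, len(accumulators)

print(rank_with_accumulators(["inverted", "index"], POSTINGS))
```

The accumulator count grows with the number of documents that overlap the query, which is the memory cost the abstract sets out to bound.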


01 Jan 1997
TL;DR: This paper presents a corpus-based system to expand multi-word index terms using a part-of-speech tagger and a full-fledged derivational morphological system, combined with a shallow parser, to solve the multi-word indexing coverage problem in information retrieval.
Abstract: This paper presents a corpus-based system to expand multi-word index terms using a part-of-speech tagger and a full-fledged derivational morphological system, combined with a shallow parser. The unique contribution of the research is in using these linguistically based tools with filters in order to avoid the problems of semantic degradation typically associated with derivational analysis. The expansion and subsequent conflation of terms increases indexing coverage up to 30%, with precision of nearly 90% for correct identification of related terms. The system core is language independent and provides a uniform platform on which to build multilingual applications. Language-specific modules have been developed for English and French. The fully implemented system is described with particular attention to the role of derivational morphology and phrasal relations. Results and evaluation will be presented in terms of precision and recall, with an analysis of errors. This paper illustrates how the use of natural language processing tools for tasks to which they are especially suited, such as indexing, has the potential to improve performance in IR. System Function and Architecture: Three NLP modules are key to the system: morphology, part-of-speech tagging, and surface syntactic analysis (see Figure 1). The emphasis in our research is on the computational linguistic features of the system, with particular attention to the role of the morphological component, and on the utilization of a toolset to solve the multi-word indexing coverage problem in information retrieval. The system consists of the following procedures: (1) start with a multi-word term list and a large corpus; (2) disambiguate and part-of-speech tag the words in the multi-word term list and corpus; (3) generate term variant patterns from the application of morphosyntactic transformations to multi-word terms; (4) generate all morphologically derived forms of words dynamically; (5) run a shallow parser identifying morphosyntactic variants within the target corpus; (6) link term and term variant occurrences to the initial multi-word list for expanded indexing; (7) build an inverted index file with pointers for term expansions. The system permits the expansion and subsequent conflation of morphosyntactically related terms, such as abattage d'arbre (tree cutting) and les arbres ont été abattus (trees have been cut down) or valeur d'estimation (value of estimation) and estimer la valeur (estimate the value), and syntactically related terms such as plantes aromatiques (aromatic plants) and plantes et extraits aromatiques (aromatic plants and extracts) or séchage par le vide (vacuum drying) and séchage sous vide (drying under vacuum).

Patent
23 Oct 1997
TL;DR: A text data registering and retrieving method capable of improving transaction processing performance is provided, in which the document number of a document for which deletion or replacement has been newly requested is registered in an updated document number list.
Abstract: A text data registering and retrieving method capable of improving the transaction processing performance is provided. The document number of a document for which deletion or replacement has been newly requested is registered in an updated document number list. The text data of the document for which insertion or replacement has been newly requested is registered in an update text buffer. The text data stored temporarily in the update text buffer is registered in a plural-character occurrence file defining a text index in a character component file merge step. The data registered in the plural-character occurrence file is retrieved for query terms. The text data stored in the update text buffer is retrieved for the query terms. The document number of a document updated or deleted is deleted from the result of retrieval in the plural-character occurrence file. Also, the result or the document number obtained in the update text buffer is added to the result of retrieval to provide a final retrieval result.
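
The interplay of the merged index, the update text buffer, and the updated document number list can be sketched as below. This is a simplified illustration, not the patented method: a word-level index stands in for the plural-character occurrence file, and the class and method names are hypothetical.

```python
class BufferedTextIndex:
    """Sketch of retrieval over a merged index plus a buffer of pending updates."""

    def __init__(self):
        self.main_index = {}       # term -> set of document numbers (merged index)
        self.update_buffer = {}    # document number -> text awaiting the merge step
        self.updated_docs = set()  # documents deleted or replaced since the last merge

    def insert(self, doc, text):
        self.update_buffer[doc] = text

    def delete(self, doc):
        self.updated_docs.add(doc)
        self.update_buffer.pop(doc, None)

    def replace(self, doc, text):
        self.updated_docs.add(doc)
        self.update_buffer[doc] = text

    def merge(self):
        """Fold the buffered changes into the main index and clear the pending state."""
        for docs in self.main_index.values():
            docs -= self.updated_docs  # drop stale postings in place
        for doc, text in self.update_buffer.items():
            for term in set(text.lower().split()):
                self.main_index.setdefault(term, set()).add(doc)
        self.update_buffer.clear()
        self.updated_docs.clear()

    def search(self, term):
        """Combine index hits (minus updated documents) with hits from the buffer."""
        term = term.lower()
        from_index = self.main_index.get(term, set()) - self.updated_docs
        from_buffer = {doc for doc, text in self.update_buffer.items()
                       if term in text.lower().split()}
        return sorted(from_index | from_buffer)

idx = BufferedTextIndex()
idx.insert(1, "inverted index")
idx.insert(2, "signature file")
idx.merge()
idx.replace(1, "signature index")
print(idx.search("signature"))
```

Updates become visible to queries immediately, while the comparatively expensive index rewrite is deferred to the merge step.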

Patent
09 Jan 1997
TL;DR: Information matched to a specified purpose and use is retrieved easily by comparing category information, defined as a retrieval index, with the contents information of a database.
Abstract: PROBLEM TO BE SOLVED: To easily retrieve information matched with specified purpose and use by performing the comparison processing of category information defined as a retrieval index with the contents information of a database. SOLUTION: An index preparation means 5 performs the comparison processing between index definition information stored in an index definition information storage means 13 and database contents information and prepares the retrieval index related to a category defined by a user. The generated plural retrieval indexes of different kinds are stored in a retrieval index storage means 6. A data retrieval means 7 retrieves the database storing data to be retrieved or the data by using the retrieval index desired to be utilized from the plural retrieval indexes of the different kinds. A substance data retrieval means 8 accesses the retrieved database or data and retrieves the substance of the data.

Proceedings ArticleDOI
25 Aug 1997
TL;DR: This work proposes an extended version of the almost forgotten inverted file technique, which keeps the complexity of the algorithms polynomial, to implement a retrieval engine based on conceptual graphs on top of the O2 object-oriented DBMS.
Abstract: In information retrieval, the way in which the correspondence procedure works is highly important for the performance of the underlying system as a whole. An inverted file ensures quick access to the information items because the index alone is examined, rather than the actual file of items, in order to determine the items which satisfy a search request. This technique was a prominent feature of the old commercial information retrieval systems (IRSs). However, it has been used only for keyword-based IRSs. Since that time, the inverted file design has not been radically modified. With the recent use of more expressive and richly structured languages in information retrieval, this method has not been used very much lately because it has been overrun by the expressiveness that new indexing languages, such as knowledge representation languages, have brought about. We propose to make use again of an extended version of the almost forgotten inverted file techniques, so that the complexity of our algorithms is polynomial. This allows us to implement, with a few modifications, a retrieval engine based on conceptual graphs on top of the O2 object-oriented DBMS.

Proceedings Article
08 Dec 1997
TL;DR: The implementation of the Shrink and Search Engine (SASE) framework is discussed which unites text compression and indexing to maximize keyword search performance while reducing storage cost.
Abstract: Keyword-based search engines are the basic building block of text retrieval systems. Higher-level systems like content-sensitive search engines and knowledge-based systems still rely on keyword search as the underlying text retrieval mechanism. With the explosive growth in content, Internet and Intranet information repositories require efficient mechanisms to store as well as index data. In this paper we discuss the implementation of the Shrink and Search Engine (SASE) framework, which unites text compression and indexing to maximize keyword search performance while reducing storage cost. SASE features the novel capability of being able to directly search through compressed text without explicit decompression. The implementation includes a search server architecture, which can be accessed from a Java front-end to perform keyword search on the Internet. The performance results show that the compression efficiency of SASE is within 7-17% of GZIP, one of the best lossless compression schemes. The sum of the compressed file size and the inverted indices is only between 55-76% of the original database, while the search performance is comparable to a fully inverted index. The framework allows a flexible trade-off between search performance and storage requirements for the search indices.

Journal ArticleDOI
TL;DR: In MFSF, a signature file is divided into variable-sized vertical frames with different on-bit densities to optimize response time using a partial query evaluation methodology; the desired response time is guaranteed for queries with several terms.
Abstract: A new signature file method, Multi-Frame Signature File (MFSF), is introduced by extending the bit-sliced signature file method. In MFSF, a signature file is divided into variable sized vertical frames with different on-bit densities to optimize the response time using a partial query evaluation methodology. In query evaluation the on-bits of the lower on-bit density frames are used first. As the number of query terms increases, the number of query signature on-bits in the lower on-bit density frames increases and the query stopping condition is reached in fewer evaluation steps. Therefore, in MFSF, the query evaluation time decreases for increasing numbers of query terms. Under the sequentiality assumption of disk blocks, in a PC environment with 30 ms average disk seek time, MFSF provides a projected worst-case response time of 3.54 seconds for a database size of one million records in a uniform distribution multi-term query environment with 1–5 terms per query. Due to partial evaluation, this desired response time is guaranteed for queries with several terms. The comparison of MFSF with the inverted file approach shows that MFSF provides promising research opportunities.
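
For orientation, the plain bit-sliced signature file that MFSF extends can be sketched as follows. This does not implement MFSF's variable-sized frames or on-bit densities. The signature width `F`, the bits per term `K`, the SHA-256-based bit selection, and the sample records are all assumptions made for the example.

```python
import hashlib

F = 64  # signature width in bits (assumed)
K = 3   # bit positions derived per term (assumed)

def term_bits(term):
    """Return the signature bit positions a term switches on (at most K)."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    return {digest[i] % F for i in range(K)}

def build_slices(records):
    """Bit-sliced storage: slices[b] has bit r set when record r's signature has bit b on."""
    slices = [0] * F
    for r, text in enumerate(records):
        for term in set(text.lower().split()):
            for b in term_bits(term):
                slices[b] |= 1 << r
    return slices

def query(terms, records, slices):
    """AND together the slices for the query's on-bits, then verify the survivors."""
    candidates = (1 << len(records)) - 1  # start with every record as a candidate
    for term in terms:
        for b in term_bits(term.lower()):
            candidates &= slices[b]
            if candidates == 0:  # stopping condition: no record can still match
                return []
    # Signatures can produce false drops, so check the actual text of each candidate.
    return [r for r in range(len(records))
            if (candidates >> r) & 1
            and all(t.lower() in records[r].lower().split() for t in terms)]

RECORDS = ["inverted index file", "signature file method", "inverted signature"]
SLICES = build_slices(RECORDS)
print(query(["signature", "file"], RECORDS, SLICES))
```

Each additional query term contributes more on-bits, so the candidate set can empty out after fewer slice reads, which is the effect the abstract describes for multi-term queries.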

Proceedings ArticleDOI
06 Oct 1997
TL;DR: A novel technique, called inverted image indexing and compression, is proposed in this paper and the concept of composite bitplane signature is also introduced.
Abstract: Inverted file indexing and its compression have proved to be highly successful for free-text retrieval. Although the 'inverted' nature of the data structure provides an efficient mechanism for searching key words or terms in large documents, for image retrieval the application of inverted files to the title, caption, or description of the images is not sufficient. One must be able to index and retrieve images based on the visual contents. Many content-based image retrieval techniques treat the image as a whole picture. Analogous to free-text retrieval, a novel technique, called inverted image indexing and compression, is proposed in this paper. Similar to words in a document, each image can have multiple areas which are perceived to be meaningful visual contents. These areas are selected by users and then undergo two processes: automatic signature generation based on wavelet signatures, and user specification of high-level contents using a ternary fact model. The contents in compressed form are inserted into an inverted image file. The concept of composite bitplane signature is also introduced.

Proceedings ArticleDOI
02 Sep 1997
TL;DR: A new hybrid structure (S-Index) is proposed which has a tunable performance and may be tailored to the query profiles of user classes: for frequently queried textbase sections S-Index performs like an inverted index, whereas the bulk of the textbase is indexed in the form of a signature file.
Abstract: Two textbase indexing methods enjoying wide applicability are the inverted index and the Superimposed Coding based Signature File (SC-SF). The former is most efficient in query processing, whereas the latter excels in storage utilization. Building on previous results, we propose a new hybrid structure (S-Index) which has a tunable performance. At the one extreme end, S-Index turns into a signature file with zero information loss, so that queries are processed faster than in ordinary SC-SF. At the other extreme end, S-Index turns into an inverted index. The advantage of the proposed access method is that the textbase index may now be tailored to the query profiles of user classes: for frequently queried textbase sections S-Index performs like an inverted index, whereas the bulk of the textbase is indexed in the form of a signature file. The S-Index structure is presented in detail, together with performance analysis results.

Journal ArticleDOI
TL;DR: Partial combining of the postings into the dictionary is analyzed as a method to give both faster retrieval and improved update properties for the trie hashing inverted file.
Abstract: A prefix trie index (originally called trie hashing) is applied to the problem of providing fast search times, fast load times and fast update properties in a bibliographic or full text retrieval system. For all but the largest dictionaries a single key search in the dictionary under trie hashing takes exactly one disk read. Front compression of search keys is used to enhance performance. Partial combining of the postings into the dictionary is analyzed as a method to give both faster retrieval and improved update properties for the trie hashing inverted file. Statistics are given for a test database consisting of an online catalog at the Graduate School of Library and Information Science Library of the University of Western Ontario. The effect of changing various parameters of prefix tries are tested in this application.
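
Front compression of sorted search keys, which this abstract uses to enhance performance, can be illustrated with the short sketch below. It shows the general technique only; the key list is invented and the abstract does not give the paper's exact encoding.

```python
def front_encode(sorted_keys):
    """Encode each key as (length of prefix shared with the previous key, remaining suffix)."""
    encoded, prev = [], ""
    for key in sorted_keys:
        n = 0
        while n < min(len(prev), len(key)) and prev[n] == key[n]:
            n += 1
        encoded.append((n, key[n:]))
        prev = key
    return encoded

def front_decode(encoded):
    """Rebuild the original keys from the (shared prefix length, suffix) pairs."""
    keys, prev = [], ""
    for n, suffix in encoded:
        prev = prev[:n] + suffix
        keys.append(prev)
    return keys

KEYS = ["index", "indexer", "indexing", "inverted"]
ENCODED = front_encode(KEYS)
print(ENCODED)
```

Adjacent keys in a sorted dictionary tend to share long prefixes, so storing only the differing suffix lets more keys fit in each disk block.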

Journal ArticleDOI
TL;DR: A new method, based on dividing a dictionary into many small subdictionaries by constituent character hashing, is proposed, which has the same error correction ability as the inverted file method.
Abstract: Recently an inverted file method for garbled-spelling correction has been reported. This method requires very large memory, but corrects garbled spelling very fast. In this paper a new method, called the constituent character hashing method, is proposed. This method is based on dividing a dictionary into many small subdictionaries by constituent character hashing. Theoretically speaking, the method has the same error correction ability as the inverted file method. A computer experiment using about 230,000 words shows the following features: 1. The method is faster than the inverted file method for the cases of one or two errors in spelling. 2. The memory requirement of the method is about one-quarter of that of the inverted file method.

Proceedings ArticleDOI
24 Aug 1997
TL;DR: A hyperlinked embedded access structure (HEAS) is proposed which combines the features of inverted file structure and doubly-chained tree structure with unique data compression schemes to generate a lexicon for handwritten address interpretation.
Abstract: Providing a lexicon for all possible postal addresses can support decision making for handwritten address interpretation. The possibility of uncertainty can be reduced by choosing the highest-confidence record in the provided lexicon as the target postal address. For a large database with multi-attribute records, traditional access methods are not efficient enough to generate the lexicon. This paper proposes a hyperlinked embedded access structure (HEAS) which combines the features of inverted file structure and doubly-chained tree structure with unique data compression schemes. The raw United States Postal Service (USPS) database is organized according to the proposed access structure, and the organized database serves as a knowledge base for interpreting handwritten addresses. The organization cost, storage requirement, and query cost are analyzed and compared to conventional inverted file and doubly-chained tree structures.

Patent
27 Mar 1997
TL;DR: A method and apparatus for a fast-access CDROM file structure for indexed databases takes advantage of a sequential, multi-step look-up process for inverted indexes by physically interleaving an initial index with its corresponding inverted index on a CDROM disk.
Abstract: A method and apparatus for a fast access CDROM file structure for indexed databases. The invention takes advantage of a sequential, multi-step look-up process for inverted indexes by physically interleaving an initial index with its corresponding inverted index on a CDROM disk. A speed improvement in accessing the indexes is realized by ensuring that the second of two required seeks in the process is almost always a longitudinal seek to a data block following a data block read after the first seek. Thus, lateral seeks are substantially reduced, thereby reducing average seek time.