Showing papers on "Inverted index published in 2005"

PDF

Open Access

Proceedings Article•DOI•

Hybrid index structures for location-based web search

[...]

Yinghua Zhou¹, Xing Xie², Chuang Wang³, Yuchang Gong¹, Wei-Ying Ma² - Show less +1 more•Institutions (3)

University of Science and Technology of China¹, Microsoft², Huazhong University of Science and Technology³

31 Oct 2005

TL;DR: This paper proposes to use a hybrid index structure, which integrates inverted files and R*-trees, to handle both textual and location aware queries, and designs and implements a complete location-based web search engine.

...read moreread less

Abstract: There is more and more commercial and research interest in location-based web search, i.e. finding web content whose topic is related to a particular place or region. In this type of search, location information should be indexed as well as text information. However, the index of conventional text search engine is set-oriented, while location information is two-dimensional and in Euclidean space. This brings new research problems on how to efficiently represent the location attributes of web pages and how to combine two types of indexes. In this paper, we propose to use a hybrid index structure, which integrates inverted files and R*-trees, to handle both textual and location aware queries. Three different combining schemes are studied: (1) inverted file and R*-tree double index, (2) first inverted file then R*-tree, (3) first R*-tree then inverted file. To validate the performance of proposed index structures, we design and implement a complete location-based web search engine which mainly consists of four parts: (1) an extractor which detects geographical scopes of web pages and represents geographical scopes as multiple MBRs based on geographical coordinates; (2) an indexer which builds hybrid index structures to integrate text and location information; (3) a ranker which ranks results by geographical relevance as well as non-geographical relevance; (4) an interface which is friendly for users to input location-based search queries and to obtain geographical and textual relevant results. Experiments on large real-world web dataset show that both the second and the third structures are superior in query time and the second is slightly better than the third. Additionally, indexes based on R*-trees are proven to be more efficient than indexes based on grid structures.

...read moreread less

297 citations

Journal Article•DOI•

Inverted Index Compression Using Word-Aligned Binary Codes

[...]

Vo Anh¹, Alistair Moffat¹•Institutions (1)

University of Melbourne¹

01 Jan 2005-Information Retrieval

TL;DR: This work examines index representation techniques for document-based inverted files, and presents a mechanism for compressing them using word-aligned binary codes, allowing extremely fast decoding of inverted lists during query processing, while providing compression rates better than other high-throughput representations.

...read moreread less

Abstract: We examine index representation techniques for document-based inverted files, and present a mechanism for compressing them using word-aligned binary codes. The new approach allows extremely fast decoding of inverted lists during query processing, while providing compression rates better than other high-throughput representations. Results are given for several large text collections in support of these claims, both for compression effectiveness and query efficiency.

...read moreread less

269 citations

Patent•

Method and apparatus for identifying, extracting, capturing, and leveraging expertise and knowledge

[...]

Scott Brave, Robert Bradshaw, Jack Jia, Christopher Minson

28 Dec 2005

TL;DR: UseRank as mentioned in this paper is an expertise or knowledge index that tracks the behavior of website visitors, focusing on the four key discoveries of enterprise attributes: Subject Authority, Work Patterns, Content Freshness, and Group Know-how.

...read moreread less

Abstract: The invention comprises a set of complementary techniques that dramatically improve enterprise search and navigation results. The core of the invention is an expertise or knowledge index, called UseRank that tracks the behavior of website visitors. The expertise-index is designed to focus on the four key discoveries of enterprise attributes: Subject Authority, Work Patterns, Content Freshness, and Group Know-how. The invention produces useful, timely, cross-application, expertise-based search and navigation results. In contrast, traditional Information Retrieval technologies such as inverted index, NLP, or taxonomy tackle the same problem with an opposite set of attributes than what the enterprise needs: Content Population, Word Patterns, Content Existence, and Statistical Trends. Overall, the invention emcompasses Baynote Search - a enhancement over existing IR searches, Baynote Guide - a set of community-driven navigations, and Baynote Insights - aggregated views of visitor interests and trends and content gaps.

...read moreread less

228 citations

Proceedings Article•DOI•

Three-level caching for efficient query processing in large Web search engines

[...]

Xiaohui Long¹, Torsten Suel¹•Institutions (1)

New York University¹

10 May 2005

TL;DR: This work proposes and evaluates a three-level caching scheme that adds an intermediate level of caching for additional performance gains, and proposes and study several offline and online algorithms for the resulting weighted caching problem, which turns out to be surprisingly rich in structure.

...read moreread less

Abstract: Large web search engines have to answer thousands of queries per second with interactive response times. Due to the sizes of the data sets involved, often in the range of multiple terabytes, a single query may require the processing of hundreds of megabytes or more of index data. To keep up with this immense workload, large search engines employ clusters of hundreds or thousands of machines, and a number of techniques such as caching, index compression, and index and query pruning are used to improve scalability. In particular, two-level caching techniques cache results of repeated identical queries at the frontend, while index data for frequently used query terms are cached in each node at a lower level.We propose and evaluate a three-level caching scheme that adds an intermediate level of caching for additional performance gains. This intermediate level attempts to exploit frequently occurring pairs of terms by caching intersections or projections of the corresponding inverted lists. We propose and study several offline and online algorithms for the resulting weighted caching problem, which turns out to be surprisingly rich in structure. Our experimental evaluation based on a large web crawl and real search engine query log shows significant performance gains for the best schemes, both in isolation and in combination with the other caching levels. We also observe that a careful selection of cache admission and eviction policies is crucial for best overall performance.

...read moreread less

166 citations

Patent•

Performant and scalable merge strategy for text indexing

[...]

Tapas Kumar Nayak¹•Institutions (1)

Microsoft¹

12 Sep 2005

TL;DR: In this article, a full-text search index system and method is generated by creating instances of a database index from an in-memory inverted list of keywords associated with a text identifier and the occurrences of the keyword in the text.

...read moreread less

Abstract: A full-text search index system and method is generated by creating instances of a database index from an in-memory inverted list of keywords associated with a text identifier and the occurrences of the keyword in the text. Instances of the index are placed in a priority queue. A merge scheduling process determines when a merge should be initiated, selects instances of the index to be merged and selects a type of merge to perform. Instances of an index are assigned a temporal indicator (timestamp). A set of instances is selected to be merged. The set of instances is validated and merged.

...read moreread less

93 citations

Proceedings Article•DOI•

Position Specific Posterior Lattices for Indexing Speech

[...]

Ciprian Chelba¹, Alex Acero¹•Institutions (1)

Microsoft¹

25 Jun 2005

TL;DR: The paper presents the Position Specific Posterior Lattice, a novel representation of automatic speech recognition lattices that naturally lends itself to efficient indexing of position information and subsequent relevance ranking of spoken documents using proximity.

...read moreread less

Abstract: The paper presents the Position Specific Posterior Lattice, a novel representation of automatic speech recognition lattices that naturally lends itself to efficient indexing of position information and subsequent relevance ranking of spoken documents using proximity.In experiments performed on a collection of lecture recordings --- MIT iCampus data --- the spoken document ranking accuracy was improved by 20% relative over the commonly used baseline of indexing the 1-best output from an automatic speech recognizer. The Mean Average Precision (MAP) increased from 0.53 when using 1-best output to 0.62 when using the new lattice representation. The reference used for evaluation is the output of a standard retrieval engine working on the manual transcription of the speech collection.Albeit lossy, the PSPL lattice is also much more compact than the ASR 3-gram lattice from which it is computed --- which translates in reduced inverted index size as well --- at virtually no degradation in word-error-rate performance. Since new paths are introduced in the lattice, the ORACLE accuracy increases over the original ASR lattice.

...read moreread less

91 citations

Proceedings Article•

n-gram/2L: a space and time efficient two-level n-gram inverted index structure

[...]

Min-Soo Kim¹, Kyu-Young Whang¹, Jae-Gil Lee¹, Min-Jae Lee¹•Institutions (1)

KAIST¹

30 Aug 2005

TL;DR: The n-gram/2L index as mentioned in this paper is a two-level inverted index that eliminates the redundancy of the position information that exists in the N-gram inverted index, which has two major advantages: language-neutral and error-tolerant.

...read moreread less

Abstract: The n-gram inverted index has two major advantages: language-neutral and error-tolerant. Due to these advantages, it has been widely used in information retrieval or in similar sequence matching for DNA and protein databases. Nevertheless, the n-gram inverted index also has drawbacks: the size tends to be very large, and the performance of queries tends to be bad. In this paper, we propose the two-level n-gram inverted index (simply, the n-gram/2L index) that significantly reduces the size and improves the query performance while preserving the advantages of the n-gram inverted index. The proposed index eliminates the redundancy of the position information that exists in the n-gram inverted index. The proposed index is constructed in two steps: 1) extracting subsequences of length m from documents and 2) extracting n-grams from those subsequences. We formally prove that this two-step construction is identical to the relational normalization process that removes the redundancy caused by a non-trivial multivalued dependency. The n-gram/2L index has excellent properties: 1) it significantly reduces the size and improves the performance compared with the n-gram inverted index with these improvements becoming more marked as the database size gets larger; 2) the query processing time increases only very slightly as the query length gets longer. Experimental results using databases of 1 GBytes show that the size of the n-gram/2L index is reduced by up to 1.9 ~ 2.7 times and, at the same time, the query performance is improved by up to 13.1 times compared with those of the n-gram inverted index.

...read moreread less

82 citations

Proceedings Article•DOI•

Fast on-line index construction by geometric partitioning

[...]

Nicholas Lester¹, Alistair Moffat², Justin Zobel¹•Institutions (2)

RMIT University¹, University of Melbourne²

31 Oct 2005

TL;DR: This paper describes a mechanism based on controlled partitioning that can be adapted to suit different balances of insertion and querying operations, and is faster and scales better than previous methods.

...read moreread less

Abstract: Inverted index structures are the mainstay of modern text retrieval systems. They can be constructed quickly using off-line merge-based methods, and provide efficient support for a variety of querying modes. In this paper we examine the task of on-line index construction -- that is, how to build an inverted index when the underlying data must be continuously queryable, and the documents must be indexed and available for search as soon they are inserted. When straightforward approaches are used, document insertions become increasingly expensive as the size of the database grows. This paper describes a mechanism based on controlled partitioning that can be adapted to suit different balances of insertion and querying operations, and is faster and scales better than previous methods. Using experiments on 100GB of web data we demonstrate the efficiency of our methods in practice, showing that they dramatically reduce the cost of on-line index construction.

...read moreread less

69 citations

Proceedings Article•DOI•

Fast Vocabulary-Independent Audio Search Using Path-Based Graph Indexing

[...]

Olivier Siohan, Michiel Bacchiani

04 Sep 2005

TL;DR: A fast vocabulary independent audio search approach that operates on phonetic lattices and is suitable for any query, inspired by a general graph indexing method that defines an automatic procedure to select a small number of paths as indexing features, keeping the index size small while allowing fast retrieval of the lattices matching a given query.

...read moreread less

Abstract: Classical audio retrieval techniques consist in transcribing audio documents using a large vocabulary speech recognition system and indexing the resulting transcripts. However, queries that are not part of the recognizer’s vocabulary or have a large probability of getting misrecognized can significantly impair the performance of the retrieval system. Instead, we propose a fast vocabulary independent audio search approach that operates on phonetic lattices and is suitable for any query. However, indexing phonetic lattices so that any arbitrary phone sequence query can be processed efficiently is a challenge, as the choice of the indexing unit is unclear. We propose an inverted index structure on lattices that uses paths as indexing features. The approach is inspired by a general graph indexing method that defines an automatic procedure to select a small number of paths as indexing features, keeping the index size small while allowing fast retrieval of the lattices matching a given query. The effectiveness of the proposed approach is illustrated on broadcast news and Switchboard databases.

...read moreread less

69 citations

Proceedings Article•DOI•

Vision-based Global Localization Using a Visual Vocabulary

[...]

Junqiu Wang¹, Roberto Cipolla², Hongbin Zha¹•Institutions (2)

Peking University¹, University of Cambridge²

18 Apr 2005

TL;DR: This paper presents a novel coarse-to-fine global localization approach that is inspired by object recognition and text retrieval techniques, and shows that the approach is efficient and reliable.

...read moreread less

Abstract: This paper presents a novel coarse-to-fine global localization approach that is inspired by object recognition and text retrieval techniques. Harris-Laplace interest points characterized by SIFT descriptors are used as natural landmarks. These descriptors are indexed into two databases: an inverted index and a location database. The inverted index is built based on a visual vocabulary learned from the feature descriptors. In the location database, each location is directly represented by a set of scale invariant descriptors. The localization process consists of two stages: coarse localization and fine localization. Coarse localization from the inverted index is fast but not accurate enough; whereas localization from the location database using voting algorithm is relatively slow but more accurate. The combination of coarse and fine stages makes fast and reliable localization possible. In addition, if necessary, the localization result can be verified by epipolar geometry between the representative view in database and the view to be localized. Experimental results show that our approach is efficient and reliable.

...read moreread less

55 citations

Patent•

Method and system for indexing and searching databases

[...]

Ji-Rong Wen¹, Ma Wei-Ying¹•Institutions (1)

Microsoft¹

13 May 2005

TL;DR: In this paper, a search system generates an index for databases by generatively sampling the databases and uses that index to identify and formulate queries for searching the databases, referred to as a domain-attribute index.

...read moreread less

Abstract: A search system generates an index for databases by generatively sampling the databases and uses that index to identify and formulate queries for searching the databases. The generated index is referred to as a domain-attribute index and contains a domain-level index and site-level indexes. A site-level index for a database maps site attributes to distinct attribute values within the database. The domain-level index for a domain maps attribute values to database and site attribute pairs that contain those attribute values. To generate a site-level index for a database within a certain domain, the search system starts out with an initial set of the sample data for that domain. The search system generates sampling queries based on the sample data and submits the sampling queries to a database. The search system updates the site-level index based on the sampling results and uses the results to generate more sampling queries.

...read moreread less

Book Chapter•DOI•

Compressed perfect embedded skip lists for quick inverted-index lookups

[...]

Paolo Boldi¹, Sebastiano Vigna¹•Institutions (1)

University of Milan¹

02 Nov 2005

TL;DR: This paper describes how to embed efficiently a compressed perfect skip list in an inverted list, which is possible to skip quickly over unnecessary documents in a web-scale search engine.

...read moreread less

Abstract: Large inverted indices are by now common in the construction of web-scale search engines. For faster access, inverted indices are indexed internally so that it is possible to skip quickly over unnecessary documents. To this purpose, we describe how to embed efficiently a compressed perfect skip list in an inverted list.

...read moreread less

Journal Article•

GPX - Gardens Point XML Information Retrieval at INEX 2004

[...]

Shlomo Geva

01 Jan 2005-Lecture Notes in Computer Science

TL;DR: This paper describes the implementation of a search engine for XML document collections that is keyword based, built upon an XML inverted file system and met the requirements of Content Only and Vague Content and Structure queries in INEX 2004.

...read moreread less

Abstract: Traditional information retrieval (IR) systems respond to user queries with ranked lists of relevant documents. The separation of content and structure in XML documents allows individual XML elements to be selected in isolation. Thus, users expect XML-IR systems to return highly relevant results that are more precise than entire documents. In this paper we describe the implementation of a search engine for XML document collections. The system is keyword based and is built upon an XML inverted file system. We describe the approach that was adopted to meet the requirements of Content Only (CO) and Vague Content and Structure (VCAS) queries in INEX 2004.

...read moreread less

Patent•

Machine learning system

[...]

Jonathan Baxter

29 Sep 2005

TL;DR: In this article, a method for training a classifier to classify elements of a data set (640) according to a characteristic is described, which includes the steps of forming a first labeled subset of elements from the data set with the elements of the first labeled subsets each labeled according to whether the element includes the characteristic.

...read moreread less

Abstract: A method for training a classifier to classify elements of a data set (640) according to a characteristic is described. The data set (640) includes N elements with the elements each characterised by at least one feature. The method includes the steps of forming a first labeled subset of elements from the data set with the elements of the first labeled subset each labeled (610) according to whether the element includes the characteristic, training an algorithmic classifier (620) to classify for the characteristic according to the first labeled subset thereby determining which at least one feature is relevant to classifying for the characteristic; and then querying with the classifier an inverted index, with this inverted index formed (630) over the at least one feature and generated from the data set (640), thereby generating a ranked set of elements from the data set.

...read moreread less

Journal Article•DOI•

A statistics-based approach to incrementally update inverted files

[...]

Wann-Yun Shieh¹, Chung-Ping Chung¹•Institutions (1)

National Chiao Tung University¹

01 Mar 2005-Information Processing and Management

TL;DR: In this paper, a run-time statistics-based approach is proposed to allocate the spare space in an inverted file to avoid reorganization in the inverted file, and unused free space can be well controlled such that file access speed is not affected.

...read moreread less

Abstract: Many information retrieval systems use the inverted file as indexing structure. The inverted file, however, requires inefficient reorganization when new documents are to be added to an existing collection. Most studies suggest dealing with this problem by sparing free space in an inverted file for incremental updates. In this paper, we propose a run-time statistics-based approach to allocate the spare space. This approach estimates the space requirements in an inverted file using only a little most recent statistical data on space usage and document update request rate. For best indexing speed and space efficiency, the amount of the spare space to be allocated is determined by adaptively balancing the trade-offs between reorganization reduction and space utilization. Experiment results show that the proposed space-sparing approach significantly avoids reorganization in updating an inverted file, and in the meantime, unused free space can be well controlled such that the file access speed is not affected.

...read moreread less

Book Chapter•DOI•

Document identifier reassignment through dimensionality reduction

[...]

Roi Blanco, Álvaro Barreiro

21 Mar 2005

TL;DR: This paper presents an efficient solution to the reassignment problem consisting in reducing the input data dimensionality using a SVD transformation, and presents experimental results about the tradeoff between dimensionality reduction and compression, and time performance.

...read moreread less

Abstract: Most modern retrieval systems use compressed Inverted Files (IF) for indexing. Recent works demonstrated that it is possible to reduce IF sizes by reassigning the document identifiers of the original collection, as it lowers the average distance between documents related to a single term. Variable-bit encoding schemes can exploit the average gap reduction and decrease the total amount of bits per document pointer. However, approximations developed so far requires great amounts of time or use an uncontrolled memory size. This paper presents an efficient solution to the reassignment problem consisting in reducing the input data dimensionality using a SVD transformation. We tested this approximation with the Greedy-NN TSP algorithm and one more efficient variant based on dividing the original problem in sub-problems. We present experimental tests and performance results in two TREC collections, obtaining good compression ratios with low running times. We also show experimental results about the tradeoff between dimensionality reduction and compression, and time performance.

...read moreread less

Proceedings Article•DOI•

Okapi-Chamfer matching for articulate object recognition

[...]

Hanning Zhou, Thomas S. Huang¹•Institutions (1)

University of Illinois at Urbana–Champaign¹

17 Oct 2005

TL;DR: This paper proposes to improve the efficiency of articulated object recognition by an Okapi-Chamfer matching algorithm based on the inverted index technique, and test the system with both synthesized and real world images.

...read moreread less

Abstract: Recent years have witnessed the rise of many effective text information retrieval systems. By treating local visual features as terms, training images as documents and input images as queries, we formulate the problem of object recognition into that of text retrieval. Our formulation opens up the opportunity to integrate some powerful text retrieval tools with computer vision techniques. In this paper, we propose to improve the efficiency of articulated object recognition by an Okapi-Chamfer matching algorithm. The algorithm is based on the inverted index technique. The inverted index is a widely used way to effectively organize a collection of text documents. With the inverted index, only documents that contain query terms are accessed and used for matching. To enable inverted indexing in an image database, we build a lexicon of local visual features by clustering the features extracted from the training images. Given a query image, we extract visual features and quantize them based on the lexicon, and then look up the inverted index to identify the subset of training images with non-zero matching score. To evaluate the matching scores in the subset, we combined the modified Okapi weighting formula with the Chamfer distance. The performance of the Okapi-Chamfer matching algorithm is evaluated on a hand posture recognition system. We test the system with both synthesized and real world images. Quantitative results demonstrate the accuracy and efficiency of our system

...read moreread less

Book Chapter•DOI•

TopX and XXL at INEX 2005

[...]

Martin Theobald¹, Ralf Schenkel¹, Gerhard Weikum¹•Institutions (1)

Max Planck Society¹

28 Nov 2005

TL;DR: This paper focuses on the design principles, scoring, query evaluation and results of TopX.

...read moreread less

Abstract: We participated with two different and independent search engines in this year’s INEX round: The XXL Search Engine and the TopX engine. As this is the first participation for TopX, this paper focuses on the design principles, scoring, query evaluation and results of TopX. We shortly discuss the results with XXL afterwards.

...read moreread less

Book Chapter•DOI•

Using prefix-trees for efficiently computing set joins

[...]

Ravindranath Jampani¹, Vikram Pudi¹•Institutions (1)

International Institute of Information Technology, Hyderabad¹

17 Apr 2005

TL;DR: Experiments on real life datasets show that the total execution time of the PRETTI algorithms is significantly less than that of previous approaches, even when the indices required by the algorithms are not precomputed.

...read moreread less

Abstract: Joins on set-valued attributes (set joins) have numerous database applications. In this paper we propose PRETTI (PREfix Tree based seT joIn) – a suite of set join algorithms for containment, overlap and equality join predicates. Our algorithms use prefix trees and inverted indices. These structures are constructed on-the-fly if they are not already precomputed. This feature makes our algorithms usable for relations without indices and when joining intermediate results during join queries with more than two relations. Another feature of our algorithms is that results are output continuously during their execution and not just at the end. Experiments on real life datasets show that the total execution time of our algorithms is significantly less than that of previous approaches, even when the indices required by our algorithms are not precomputed.

...read moreread less

Patent•

Sharing of full text index entries across application boundaries

[...]

David A. Brooks¹, Niklas Heidloff¹, Hong Dai¹, Craig R. Wolpert¹, Igor L. Belakovskiy¹ - Show less +1 more•Institutions (1)

IBM¹

29 Apr 2005

TL;DR: In this paper, a method and system for sharing full text index entries across application boundaries in which documents are obtained by a shared, platform level indexing service, and a determination is made as to whether the received documents are duplicates with regard to previously indexed documents.

...read moreread less

Abstract: A method and system for sharing full text index entries across application boundaries in which documents are obtained by a shared, platform level indexing service, and a determination is made as to whether the received documents are duplicates with regard to previously indexed documents. If a document is determined to be a duplicate, the index representation of the previously indexed copy of the document is modified to indicate that the document is also associated with another application or context. If a document is not a duplicate of a previously indexed document, the document is indexed to support future searches and/or other processing. The index representation of a document includes application category identifiers associating one or more applications or contexts with the document. When a document is indexed, one or more category identifiers are generated and stored in association with that document. The category identifiers for an indexed document may, for example, represent an application that received, stored, or otherwise processed that document. The application category identifiers enable category specific searching by applications sharing a common search index. A software category filter may be provided to process search results from the shared search index, so that only documents associated with certain categories are returned. Accordingly, one or more search categories may be determined for a given search query, based on an application generating the search query, or some other context information, and then used to filter the search results provided from the shared search index.

...read moreread less

Book Chapter•DOI•

Approximate subtree identification in heterogeneous XML documents collections

[...]

Ismael Sanz¹, Marco Mesiti², Giovanna Guerrini³, Rafael Berlanga Llavori¹•Institutions (3)

James I University¹, University of Milan², University of Pisa³

28 Aug 2005

TL;DR: Different similarity measures between a pattern and subtrees of documents in the collection are discussed and an efficient algorithm for the identification of document subtrees, approximately conforming to the pattern, by indexing structures is introduced.

...read moreread less

Abstract: Due to the heterogeneous nature of XML data for internet applications exact matching of queries is often inadequate. The need arises to quickly identify subtrees of XML documents in a collection that are similar to a given pattern. In this paper we discuss different similarity measures between a pattern and subtrees of documents in the collection. An efficient algorithm for the identification of document subtrees, approximately conforming to the pattern, by indexing structures is then introduced.

...read moreread less

Journal Article•DOI•

Comparing inverted files and signature files for searching a large lexicon

[...]

Ben Carterette¹, Fazli Can²•Institutions (2)

University of Massachusetts Amherst¹, Miami University²

01 May 2005-Information Processing and Management

TL;DR: A direct comparision of the two signature file methods for searching for partially-specified queries in a large lexicon stored in main memory finds the signature file method is about as fast as the inverted file method, and significantly smaller.

...read moreread less

Abstract: Signature files and inverted files are well-known index structures. In this paper we undertake a direct comparision of the two for searching for partially-specified queries in a large lexicon stored in main memory. Using n-grams to index lexicon terms, a bit-sliced signature file can be compressed to a smaller size than an inverted file if each n-gram sets only one bit in the term signature. With a signature width less than half the number of unique n-grams in the lexicon, the signature file method is about as fast as the inverted file method, and significantly smaller. Greater flexibility in memory usage and faster index generation time make signature files appropriate for searching large lexicons or other collections in an environment where memory is at a premium.

...read moreread less

Patent•

Document and file indexing system

[...]

Mark Radulovich

12 Dec 2005

TL;DR: A computer system where portions of the indexing application are inserted between the user application and the disk write processing software so that the index information for the particular document being stored is obtained as the document is being stored as discussed by the authors.

...read moreread less

Abstract: A computer system where portions of the indexing application are inserted between the user application and the disk write processing software so that the indexing information for the particular document being stored is obtained as the document is being stored. In a separate parallel operation this document indexing information is provided to the main search index for incorporation. In various embodiments the document and the index can be compressed and encrypted if desired for transmission to a remote computer. The document and the index can be stored locally or remotely, or in any combination. The document or file and the index can be cached locally, if they are stored remotely and the local and remote computers are not in communication. The indexing operations occur on copying operations as well as the writing of modified or new files.

...read moreread less

Journal Article•DOI•

An assessment of a metric space database index to support sequence homology

[...]

Rui Mao¹, Weijia Xu¹, Neha Singh¹, Daniel P. Miranker¹•Institutions (1)

University of Texas at Austin¹

01 Oct 2005-International Journal on Artificial Intelligence Tools

TL;DR: A new hierarchical bulk-load algorithm that alternates between top-down and bottom-up clustering to initialize the index is introduced, and the bi-directional bulk load produces a more effective index than the existing M-tree initialization algorithms.

...read moreread less

Abstract: Hierarchical metric-space clustering methods have been commonly used to organize proteomes into taxonomies. Consequently, it is often anticipated that hierarchical clustering can be leveraged as a basis for scalable database index structures capable of managing the hyper-exponential growth of sequence data. M-tree is one such data structure specialized for the management of large data sets on disk. We explore the application of M-trees to the storage and retrieval of peptide sequence data. Exploiting a technique first suggested by Myers, we organize the database as records of fixed length substrings. Empirical results are promising. However, metric-space indexes are subject to "the curse of dimensionality" and the ultimate performance of an index is sensitive to the quality of the initial construction of the index. We introduce new hierarchical bulk-load algorithm that alternates between top-down and bottom-up clustering to initialize the index. Using the Yeast Proteomes, the bi-directional bulk load produces a more effective index than the existing M-tree initialization algorithms.

...read moreread less

Patent•

Method and framework to support indexing and searching taxonomies in large scale full text indexes

[...]

Nadav Eiron¹, Daniel N. Meredith¹, Joerg Meyer¹, Jan Pieper¹, Andrew Tomkins¹ - Show less +1 more•Institutions (1)

IBM¹

30 Sep 2005

TL;DR: In this article, a posting list for an entity with respect to the locations of the set of terms defining the entity and data associated with the respective terms is constructed, and an inverted list index is used to associate data with each occurrence of an index term.

...read moreread less

Abstract: A system and method of indexing a plurality of entities located in a taxonomy, the entities comprising sets of terms, comprises receiving terms in an index structure; building a posting list for an entity with respect to the locations of the set of terms defining the entity and data associated with the respective terms; and indexing a name of a group comprising the entities within this group at the location of the entities with the data of the group comprising the name of the respective entity at each location. The building of the posting list comprises storing the location of the term and data associated with the term in an entry in the posting list for the term. The method comprises indexing aliases of the name of the group comprising the term, and using an inverted list index to associate data with each occurrence of an index term.

...read moreread less

Book Chapter•DOI•

Efficient evaluation of partial match queries for XML documents using information retrieval techniques

[...]

Young-Ho Park¹, Kyu-Young Whang¹, Byung Suk Lee², Wook-Shin Han³•Institutions (3)

KAIST¹, University of Vermont², Kyungpook National University³

17 Apr 2005

TL;DR: XIR, a novel method for processing partial match queries on heterogeneous XML documents using information retrieval (IR) techniques, is proposed and compared with XRel and XParent using XML documents crawled from the Internet.

...read moreread less

Abstract: We propose XIR, a novel method for processing partial match queries on heterogeneous XML documents using information retrieval (IR) techniques. A partial match query is defined as the one having the descendent-or-self axis “//” in its path expression. In its general form, a partial match query has branch predicates forming branching paths. The objective of XIR is to efficiently support this type of queries for large-scale documents of heterogeneous schemas. XIR has its basis on the conventional schema-level methods using relational tables and significantly improves their efficiency using two techniques: an inverted index technique and a novel prefix match join. The former indexes the labels in label paths as keywords in texts, and allows for finding the label paths matching the queries more efficiently than string match used in the conventional methods. The latter supports branching path expressions, and allows for finding the result nodes more efficiently than containment joins used in the conventional methods. We compare the efficiency of XIR with those of XRel and XParent using XML documents crawled from the Internet. The results show that XIR is more efficient than both XRel and XParent by several orders of magnitude for linear path expressions, and by several factors for branching path expressions.

...read moreread less

Proceedings Article•DOI•

Constructing a searchable encrypted log using encrypted inverted indexes

[...]

Yasuhiro Ohtaki¹•Institutions (1)

Ibaraki University¹

23 Nov 2005

TL;DR: An encrypted version of inverted index is introduced which can be used for a searchable encrypted log and enables to determine all records that contain a keyword while contents of other records are kept secret.

...read moreread less

Abstract: Searchable encrypted log provides a facility to disclose only records which match some specified conditions. However, searching on an encrypted log takes a long time because checking for the condition usually requires a vast number of public key operations. In the field of plain text search, inverted index is a typical data structure to improve the performance of search speed which turns a bunch of comparison into a simple table lookup. This paper introduces an encrypted version of inverted index which can be used for a searchable encrypted log. Encrypted inverted index enables to determine all records that contain a keyword while contents of other records are kept secret. Statistic information such as frequency of each word is also not revealed. A prototype system is implemented to show the efficiency

...read moreread less

Proceedings Article•DOI•

Efficient inverted lists and query algorithms for structured value ranking in update-intensive relational databases

[...]

G. Guo¹, Jayavel Shanmugasundaram¹, Kevin Scott Beyer², Eugene J. Shekita²•Institutions (2)

Cornell University¹, IBM²

05 Apr 2005

TL;DR: This work proposes a new family of inverted list indices and associated query algorithms that can support SVR efficiently in update-intensive databases, where the structured data values (and hence the scores of documents) change frequently.

...read moreread less

Abstract: We propose a new ranking paradigm for relational databases called Structured Value Ranking (SVR). SVR uses structured data values to score (rank) the results of keyword search queries over text columns. Our main contribution is a new family of inverted list indices and associated query algorithms that can support SVR efficiently in update-intensive databases, where the structured data values (and hence the scores of documents) change frequently. Our experimental results on real and synthetic data sets using BerkeleyDB show that we can support SVR efficiently in relational databases.

...read moreread less

Book Chapter•DOI•

SIRIUS: a lightweight XML indexing and approximate search system at INEX 2005

[...]

Eugen Popovici¹, Gildas Ménier¹, Pierre-François Marteau¹•Institutions (1)

Sewanee: The University of the South¹

28 Nov 2005

TL;DR: SIRIUS as mentioned in this paper is a lightweight indexing and search engine for XML documents, which is document oriented and involves an approximate matching scheme of the structure and textual content of XML documents.

...read moreread less

Abstract: This paper reports on SIRIUS, a lightweight indexing and search engine for XML documents The retrieval approach implemented is document oriented It involves an approximate matching scheme of the structure and textual content Instead of managing the matching of whole DOM trees, SIRIUS splits the documents object model in a set of paths In this view, the request is a path-like expression with conditions on the attribute values In this paper, we present the main functionalities and characteristics of this XML IR system and second we relate on our experience on adapting and using it for the INEX 2005 ad-hoc retrieval task Finally, we present and analyze the SIRIUS retrieval performance obtained during the INEX 2005 evaluation campaign and show that despite the lightweight characteristics of SIRIUS we were able to retrieve highly relevant non overlapping XML elements and obtained quite good precision at low recall values

...read moreread less

Book Chapter•DOI•

Database and information retrieval techniques for XML

[...]

Mariano P. Consens¹, Ricardo Baeza-Yates²•Institutions (2)

University of Toronto¹, Pompeu Fabra University²

07 Dec 2005-Lecture Notes in Computer Science

TL;DR: The two distinct cultures of databases and information retrieval now have a natural meeting place in theWeb with its semi-structured XML model, and the need for integrating these two viewpoints becomes even more important.

...read moreread less

Abstract: The world of data has been developed from two main points of view: the structured relational data model and the unstructured text model. The two distinct cultures of databases and information retrieval now have a natural meeting place in theWeb with its semi-structured XML model. As web-style searching becomes an ubiquitous tool, the need for integrating these two viewpoints becomes even more important.

...read moreread less