
Showing papers on "Inverted index published in 2010"


Journal ArticleDOI
TL;DR: A more precise representation based on Hamming embedding (HE) and weak geometric consistency constraints (WGC) is derived and this approach is shown to outperform the state-of-the-art on the three datasets.
Abstract: This article improves recent methods for large scale image search. We first analyze the bag-of-features approach in the framework of approximate nearest neighbor search. This leads us to derive a more precise representation based on Hamming embedding (HE) and weak geometric consistency constraints (WGC). HE provides binary signatures that refine the matching based on visual words. WGC filters matching descriptors that are not consistent in terms of angle and scale. HE and WGC are integrated within an inverted file and are efficiently exploited for all images in the dataset. We then introduce a graph-structured quantizer which significantly speeds up the assignment of the descriptors to visual words. A comparison with the state of the art shows the interest of our approach when high accuracy is needed. Experiments performed on three reference datasets and a dataset of one million images show a significant improvement due to the binary signature and the weak geometric consistency constraints, as well as their efficiency. Estimation of the full geometric transformation, i.e., a re-ranking step on a short-list of images, is shown to be complementary to our weak geometric consistency constraints. Our approach is shown to outperform the state-of-the-art on the three datasets.
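The Hamming-embedding idea of refining visual-word matches with binary signatures can be sketched as follows. The signature length, posting-list contents, and distance threshold here are illustrative assumptions, not the paper's actual parameters.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary signatures."""
    return bin(a ^ b).count("1")

# Inverted file: visual word id -> list of (image_id, binary signature).
# Toy 8-bit signatures; the paper uses much longer ones.
inverted_file = {
    7: [(0, 0b10110010), (1, 0b10110000), (2, 0b01001101)],
}

def matches(word_id, query_sig, max_dist=3):
    """Keep only postings whose signature is close to the query descriptor's,
    instead of accepting every descriptor quantized to the same visual word."""
    return [img for img, sig in inverted_file.get(word_id, [])
            if hamming(sig, query_sig) <= max_dist]
```

A query descriptor quantized to word 7 with signature 0b10110011 matches images 0 and 1 but rejects image 2, whose signature differs in most bits.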

795 citations


Proceedings ArticleDOI
06 Jun 2010
TL;DR: The Bed-tree is a complete solution that meets the requirements of all applications, providing high scalability and fast response time, and identifies the necessary properties of a mapping from the string space to the integer space for supporting searching and pruning for these queries.
Abstract: Strings are ubiquitous in computer systems and hence string processing has attracted extensive research effort from computer scientists in diverse areas. One of the most important problems in string processing is to efficiently evaluate the similarity between two strings based on a specified similarity measure. String similarity search is a fundamental problem in information retrieval, database cleaning, biological sequence analysis, and more. While a large number of dissimilarity measures on strings have been proposed, edit distance is the most popular choice in a wide spectrum of applications. Existing indexing techniques for similarity search queries based on edit distance, e.g., approximate selection and join queries, rely mostly on n-gram signatures coupled with inverted list structures. These techniques are tailored for specific query types only, and their performance remains unsatisfactory especially in scenarios with strict memory constraints or frequent data updates. In this paper we propose the Bed-tree, a B+-tree based index structure for evaluating all types of similarity queries on edit distance and normalized edit distance. We identify the necessary properties of a mapping from the string space to the integer space for supporting searching and pruning for these queries. Three transformations are proposed that capture different aspects of information inherent in strings, enabling efficient pruning during the search process on the tree. Compared to state-of-the-art methods on string similarity search, the Bed-tree is a complete solution that meets the requirements of all applications, providing high scalability and fast response time.
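The n-gram-signature baseline that the Bed-tree competes with can be sketched with a standard count filter: two strings within edit distance k must share a minimum number of n-grams. Names and the toy string collection below are illustrative, not from the paper.

```python
from collections import defaultdict

def ngrams(s, n=2):
    """Set of overlapping n-grams of a string."""
    return {s[i:i+n] for i in range(len(s) - n + 1)}

strings = ["sigmod", "sigkdd", "vldb"]
index = defaultdict(set)                 # n-gram -> ids of strings containing it
for sid, s in enumerate(strings):
    for g in ngrams(s):
        index[g].add(sid)

def candidates(query, k=1, n=2):
    """Count filter: a string within edit distance k of the query must share
    at least len(query) - n + 1 - k*n of its n-grams (when that bound is
    positive). Survivors still need edit-distance verification."""
    votes = defaultdict(int)
    for g in ngrams(query, n):
        for sid in index[g]:
            votes[sid] += 1
    threshold = max(1, len(query) - n + 1 - k * n)
    return sorted(sid for sid, v in votes.items() if v >= threshold)
```

For the query "sigmod" with k=1 the threshold is 3 shared bigrams, so "sigkdd" (which shares only "si" and "ig") is pruned without an edit-distance computation.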

141 citations


Proceedings ArticleDOI
25 Oct 2010
TL;DR: A novel method to address the efficiency and scalability issues for near-duplicate video retrieval by introducing a compact spatiotemporal feature to represent videos and constructing an efficient data structure to index the feature to achieve real-time retrieving performance.
Abstract: Near-duplicate video retrieval is becoming more and more important with the exponential growth of the Web. Though various approaches have been proposed to address this problem, they mainly focus on retrieval accuracy and remain infeasible for querying Web-scale video databases in real time. This paper proposes a novel method to address the efficiency and scalability issues for near-duplicate Web video retrieval. We introduce a compact spatiotemporal feature to represent videos and construct an efficient data structure to index the feature to achieve real-time retrieval performance. This novel feature leverages the relative gray-level intensity distribution within a frame and the temporal structure of videos along the frame sequence. The new index structure is based on the inverted file and allows fast histogram intersection computation between videos. To demonstrate the effectiveness and efficiency of the proposed method, we evaluate its performance on an open Web video data set containing about 10K videos and compare it with four existing methods in terms of precision and time complexity. We also test our method on a data set containing about 50K videos and 11M key-frames. It takes on average 17ms to execute a query against the whole 50K Web video data set.
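Histogram intersection over an inverted file works because only videos sharing at least one non-empty bin with the query are ever touched. A minimal sketch, with made-up histograms and layout (not the paper's exact structure):

```python
from collections import defaultdict

# Toy video "histograms": video id -> {bin: weight}.
videos = {0: {1: 0.5, 3: 0.5}, 1: {1: 0.2, 2: 0.8}, 2: {4: 1.0}}

inverted = defaultdict(list)             # bin -> [(video id, weight)]
for vid, hist in videos.items():
    for b, w in hist.items():
        inverted[b].append((vid, w))

def intersect_scores(query):
    """Histogram intersection sum(min(q_b, v_b)), computed by walking only
    the posting lists of the query's non-empty bins."""
    scores = defaultdict(float)
    for b, qw in query.items():
        for vid, w in inverted[b]:
            scores[vid] += min(qw, w)
    return dict(scores)
```

A query with weight in bins 1 and 3 scores videos 0 and 1 while never visiting video 2, which shares no bin with it.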

127 citations


Journal ArticleDOI
TL;DR: FragBag performs on a par with the computationally expensive, yet highly trusted structural aligners STRUCTAL and CE, and finds more accurate candidate sets than the three other filter methods: The SGM, PRIDE and a method by Zotenko et al.
Abstract: Fast identification of protein structures that are similar to a specified query structure in the entire Protein Data Bank (PDB) is fundamental in structure and function prediction. We present FragBag: An ultrafast and accurate method for comparing protein structures. We describe a protein structure by the collection of its overlapping short contiguous backbone segments, and discretize this set using a library of fragments. Then, we succinctly represent the protein as a "bag-of-fragments"—a vector that counts the number of occurrences of each fragment—and measure the similarity between two structures by the similarity between their vectors. Our representation has two additional benefits: (i) it can be used to construct an inverted index, for implementing a fast structural search engine of the entire PDB, and (ii) one can specify a structure as a collection of substructures, without combining them into a single structure; this is valuable for structure prediction, when there are reliable predictions only of parts of the protein. We use receiver operating characteristic curve analysis to quantify the success of FragBag in identifying neighbor candidate sets in a dataset of over 2,900 structures. The gold standard is the set of neighbors found by six state-of-the-art structural aligners. Our best FragBag library finds more accurate candidate sets than the three other filter methods: the SGM, PRIDE, and a method by Zotenko et al. More interestingly, FragBag performs on a par with the computationally expensive, yet highly trusted structural aligners STRUCTAL and CE.
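The bag-of-fragments representation is directly analogous to a bag of words. A minimal sketch, assuming the per-segment fragment assignments are already computed (the fragment library and RMSD fitting are outside this sketch):

```python
from collections import Counter
from math import sqrt

def bag_of_fragments(assignments):
    """Count how often each library fragment id occurs along the backbone.
    `assignments` is one nearest-fragment id per overlapping segment."""
    return Counter(assignments)

def cosine(a: Counter, b: Counter) -> float:
    """Similarity between two fragment-count vectors (one possible choice
    of vector similarity; the paper evaluates several)."""
    dot = sum(a[f] * b[f] for f in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Because the vectors are sparse counts over a fixed fragment vocabulary, the same inverted-index machinery used for text applies unchanged: each fragment id maps to the list of proteins containing it.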

111 citations


Book ChapterDOI
30 Jun 2010
TL;DR: This paper combines an R-tree with an inverted index by the inclusion of spatial references in posting lists to create a disk-resident, dual-index data structure that is used to proactively prune the search space.
Abstract: In this paper, we present a novel method to efficiently process top-k spatial queries with conjunctive Boolean constraints on textual content. Our method combines an R-tree with an inverted index by the inclusion of spatial references in posting lists. The result is a disk-resident, dual-index data structure that is used to proactively prune the search space. R-tree nodes are visited in best-first order. A node entry is placed in the priority queue if there exists at least one object that satisfies the Boolean condition in the subtree pointed to by the entry; otherwise, the subtree is not further explored. We show via extensive experimentation with real spatial databases that our method has increased performance over alternate techniques while scaling to a large number of objects.
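The pruning rule can be sketched with a toy two-level "tree": subtrees are visited in best-first order by distance, and a subtree is skipped when none of its objects satisfies the conjunctive Boolean condition. Real R-trees, MBR distance bounds, and the disk layout are outside this sketch; all names and data are illustrative.

```python
import heapq

postings = {"cafe": {1, 3}, "wifi": {3, 4}}   # term -> object ids

tree = [                                       # (min distance to query, objects in subtree)
    (1.0, {1, 2}),
    (2.0, {3, 4}),
]

def topk_boolean(terms, k=1):
    """Best-first traversal that prunes subtrees containing no object
    satisfying the conjunctive Boolean condition."""
    qualified = set.intersection(*(postings[t] for t in terms))
    heap = list(tree)
    heapq.heapify(heap)
    results = []
    while heap and len(results) < k:
        dist, objs = heapq.heappop(heap)
        hits = objs & qualified
        if not hits:
            continue                           # prune: subtree cannot contribute
        results.extend(sorted(hits))
    return results[:k]
```

For the query terms "cafe" and "wifi" only object 3 qualifies, so the nearer subtree {1, 2} is pruned despite its better distance bound.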

100 citations


Patent
Qifa Ke1, Ming Liu1, Yi Li1
02 Nov 2010
TL;DR: In this paper, image descriptor identifiers are used for content-based search, where a plurality of descriptors are determined for an image and candidate images that include at least a predetermined number of descriptor identifiers that match those of the image are identified.
Abstract: Image descriptor identifiers are used for content-based search. A plurality of descriptors is determined for an image. The descriptors represent the content of the image at respective interest points identified in the image. The descriptors are mapped to respective descriptor identifiers. The image can thus be represented as a set of descriptor identifiers. A search is performed on an index using the descriptor identifiers as search elements. A method for efficiently searching the inverted index is also provided. Candidate images that include at least a predetermined number of descriptor identifiers that match those of the image are identified. The candidate images are ranked and at least a portion thereof are presented as content-based search results.
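The candidate-selection step (images sharing at least a predetermined number of descriptor identifiers with the query) is a vote count over the inverted index. A minimal sketch with made-up identifiers and threshold:

```python
from collections import defaultdict

# Inverted index: descriptor identifier -> images containing it.
index = defaultdict(set)
for img, ids in {"a": {10, 11, 12}, "b": {10, 99}}.items():
    for d in ids:
        index[d].add(img)

def candidate_images(query_ids, min_matches=2):
    """Images sharing at least `min_matches` descriptor identifiers with
    the query image; these would then be ranked and presented."""
    votes = defaultdict(int)
    for d in query_ids:
        for img in index.get(d, ()):
            votes[img] += 1
    return {img for img, v in votes.items() if v >= min_matches}
```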

91 citations


Journal IssueDOI
TL;DR: The results show that the Simple-8b approach, a 64-bit word-bounded code, is an excellent self-skipping code, and has a clear advantage over its competitors in supporting fast query evaluation when the data being compressed represents the inverted index for a large text collection.
Abstract: Modern computers typically make use of 64-bit words as the fundamental unit of data access. However the decade-long migration from 32-bit architectures has not been reflected in compression technology, because of a widespread assumption that effective compression techniques operate in terms of bits or bytes, rather than words. Here we demonstrate that the use of 64-bit access units, especially in connection with word-bounded codes, does indeed provide the opportunity for improving the compression performance. In particular, we extend several 32-bit word-bounded coding schemes to 64-bit operation and explore their uses in information retrieval applications. Our results show that the Simple-8b approach, a 64-bit word-bounded code, is an excellent self-skipping code, and has a clear advantage over its competitors in supporting fast query evaluation when the data being compressed represents the inverted index for a large text collection. The advantages of the new code also accrue on 32-bit architectures, and for all of Boolean, ranked, and phrase queries; which means that it can be used in any situation. Copyright © 2010 John Wiley & Sons, Ltd.
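The word-bounded idea can be illustrated with a simplified Simple-8b-style codec: each 64-bit word spends 4 bits on a selector and 60 bits on payload, packing as many equal-width integers as fit. The selector table below is a reduced, illustrative subset; the real Simple-8b also has special selectors for long runs.

```python
# (count, bit width) per selector; count * width <= 60 in every row.
SELECTORS = [(60, 1), (30, 2), (20, 3), (15, 4), (12, 5), (10, 6),
             (6, 10), (5, 12), (4, 15), (3, 20), (2, 30), (1, 60)]

def encode(values):
    """Greedily pack values (e.g. d-gaps) into 64-bit words."""
    words, i = [], 0
    while i < len(values):
        for sel, (count, width) in enumerate(SELECTORS):
            chunk = values[i:i + count]
            if len(chunk) == count and all(v < (1 << width) for v in chunk):
                word = sel
                for j, v in enumerate(chunk):
                    word |= v << (4 + j * width)
                words.append(word)
                i += count
                break
        else:
            raise ValueError("value needs more than 60 bits")
    return words

def decode(words):
    """Each word is self-describing: the low 4 bits select count and width,
    so decoding (and skipping) proceeds a whole word at a time."""
    out = []
    for word in words:
        count, width = SELECTORS[word & 0xF]
        payload = word >> 4
        for j in range(count):
            out.append((payload >> (j * width)) & ((1 << width) - 1))
    return out
```

The self-skipping property mentioned in the abstract falls out of this layout: a decoder can step over a whole word of postings by reading only its selector.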

89 citations


Patent
11 May 2010
TL;DR: In this paper, the authors propose a method for enhancing the performance of a medical search engine: generate an inverted index of medical-related documents, expand and augment the user's query, retrieve and rank the relevant documents according to a master expression, present them to the user, and refine the master expression from user feedback responses using machine learning.
Abstract: Method for enhancing the performance of a medical search engine, including the procedures of generating an inverted index of medical related documents, receiving a medical search query from a user, expanding and augmenting the received medical search query thereby generating an enhanced medical search query, retrieving all the medical related documents in the inverted index which are relevant to the enhanced medical search query, ranking the retrieved medical related documents according to a master expression, presenting the ranked retrieved medical related documents to the user, receiving at least one user feedback response from the user to a respective one of the ranked retrieved medical related documents, for each received user feedback response evaluating and storing at least one feature of the respective one of the ranked retrieved medical related documents and modifying the master expression based on the received user feedback response using at least one machine learning algorithm.

84 citations


Proceedings ArticleDOI
25 Oct 2010
TL;DR: This work uses inverted index compression and fast geometric re-ranking on their database to provide a low delay image recognition response for large scale databases.
Abstract: We present a mobile product recognition system for the camera-phone. By snapping a picture of a product with a camera-phone, the user can retrieve online information of the product. The product is recognized by an image-based retrieval system located on a remote server. Our database currently comprises more than one million entries, primarily products packaged in rigid boxes with printed labels, such as CDs, DVDs, and books. We extract low bit-rate descriptors from the query image and compress the location of the descriptors using location histogram coding on the camera-phone. We transmit the compressed query features, instead of a query image, to reduce the transmission delay. We use inverted index compression and fast geometric re-ranking on our database to provide a low delay image recognition response for large scale databases. Experimental timing results on different parts of the mobile product recognition system are reported in this work.

82 citations


Book ChapterDOI
06 Sep 2010
TL;DR: This paper presents two new algorithms for ranking documents against a query without making any assumptions on the structure of the underlying text, significantly faster than existing methods in RAM and even three times faster than a state-of-the-art inverted file implementation for English text when word queries are issued.
Abstract: Text search engines return a set of k documents ranked by similarity to a query. Typically, documents and queries are drawn from natural language text, which can readily be partitioned into words, allowing optimizations of data structures and algorithms for ranking. However, in many new search domains (DNA, multimedia, OCR texts, Far East languages) there is often no obvious definition of words and traditional indexing approaches are not so easily adapted, or break down entirely. We present two new algorithms for ranking documents against a query without making any assumptions on the structure of the underlying text. We build on existing theoretical techniques, which we have implemented and compared empirically with new approaches introduced in this paper. Our best approach is significantly faster than existing methods in RAM, and is even three times faster than a state-of-the-art inverted file implementation for English text when word queries are issued.

81 citations


Proceedings ArticleDOI
13 Jun 2010
TL;DR: A new scalable face representation is developed using both local and global features and it is shown that the inverted index based on local features provides candidate images with good recall, while the multi-reference re-ranking with global hamming signature leads to good precision.
Abstract: State-of-the-art image retrieval systems achieve scalability by using bag-of-words representation and textual retrieval methods, but their performance degrades quickly in the face image domain, mainly because they 1) produce visual words with low discriminative power for face images, and 2) ignore the special properties of the faces. The leading features for face recognition can achieve good retrieval performance, but these features are not suitable for inverted indexing as they are high-dimensional and global, thus not scalable in either computational or storage cost. In this paper we aim to build a scalable face image retrieval system. For this purpose, we develop a new scalable face representation using both local and global features. In the indexing stage, we exploit special properties of faces to design new component-based local features, which are subsequently quantized into visual words using a novel identity-based quantization scheme. We also use a very small hamming signature (40 bytes) to encode the discriminative global feature for each face. In the retrieval stage, candidate images are first retrieved from the inverted index of visual words. We then use a new multi-reference distance to re-rank the candidate images using the hamming signature. On a one-million face database, we show that our local features and global hamming signatures are complementary: the inverted index based on local features provides candidate images with good recall, while the multi-reference re-ranking with global hamming signature leads to good precision. As a result, our system is not only scalable but also outperforms the linear scan retrieval system using the state-of-the-art face recognition feature in terms of quality.
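The re-ranking stage reduces to computing bitwise distances on compact signatures. A minimal sketch with placeholder 40-byte signatures; the paper's multi-reference distance aggregates over several reference signatures, whereas this sketch uses a single query signature for brevity.

```python
def hamming(a: bytes, b: bytes) -> int:
    """Bit differences between two equal-length binary signatures."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# 40-byte global hamming signature per face image (placeholders here).
signatures = {0: bytes(40), 1: bytes([0xFF] * 40)}

def rerank(candidates, query_sig):
    """Order inverted-index candidates by signature distance to the query."""
    return sorted(candidates, key=lambda c: hamming(signatures[c], query_sig))
```

At 40 bytes per face, a million-face database needs only ~40 MB of signatures, which is what makes the re-ranking stage affordable after the recall-oriented inverted-index lookup.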

Book ChapterDOI
30 Aug 2010
TL;DR: A new index structure called Spatial-Keyword Inverted File is proposed to handle location-based web searches in an integrated/efficient manner and to seamlessly find and rank relevant documents, a new distance measure called spatial tf-idf is developed.
Abstract: There is a significant commercial and research interest in location-based web search engines. Given a number of search keywords and one or more locations that a user is interested in, a location-based web search retrieves and ranks the most textually and spatially relevant web pages. In this type of search, both the spatial and textual information should be indexed. Currently, no efficient index structure exists that can handle both the spatial and textual aspects of data simultaneously and accurately. Existing approaches either index space and text separately or use inefficient hybrid index structures with poor performance. Moreover, most of these approaches cannot accurately rank web-pages based on a combination of space and text and are not easy to integrate into existing search engines. In this paper, we propose a new index structure called Spatial-Keyword Inverted File to handle location-based web searches in an integrated/efficient manner. To seamlessly find and rank relevant documents, we develop a new distance measure called spatial tf-idf. We propose four variants of spatial-keyword relevance scores and two algorithms to perform top-k searches. As verified by experiments, our proposed techniques outperform existing index structures in terms of search performance and accuracy.

Proceedings ArticleDOI
24 Mar 2010
TL;DR: This work design, develop, and compare techniques for inverted index compression for image-based retrieval, and shows that these techniques significantly reduce memory usage, by as much as 5x, without loss in recognition accuracy.
Abstract: To perform fast image matching against large databases, a Vocabulary Tree (VT) uses an inverted index that maps from each tree node to database images which have visited that node. The inverted index can require gigabytes of memory, which significantly slows down the database server. In this paper, we design, develop, and compare techniques for inverted index compression for image-based retrieval. We show that these techniques significantly reduce memory usage, by as much as 5x, without loss in recognition accuracy. Our work includes fast decoding methods, an offline database reordering scheme that exploits the similarity between images for additional memory savings, and a generalized coding scheme for soft-binned feature descriptor histograms. We also show that reduced index memory permits memory-intensive image matching techniques that boost recognition accuracy.
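The standard building blocks for compressing such an inverted index are d-gaps plus a variable-length integer code. The sketch below uses variable-byte coding as an illustrative stand-in; the paper's own coding schemes and the soft-binning generalization are not reproduced here.

```python
def vbyte_encode(gaps):
    """Variable-byte code: 7 payload bits per byte, high bit marks the
    last byte of each integer."""
    out = bytearray()
    for g in gaps:
        while g >= 128:
            out.append(g & 0x7F)
            g >>= 7
        out.append(g | 0x80)
    return bytes(out)

def compress_posting(doc_ids):
    """Sort ids and store first-order differences (d-gaps) before coding.
    Reordering the database so similar images get nearby ids shrinks the
    gaps, which is why reordering yields additional memory savings."""
    doc_ids = sorted(doc_ids)
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)
```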

Journal ArticleDOI
TL;DR: It is shown that the inverse classification problem is a powerful and general model which encompasses a number of different criteria, which can be used for a variety of decision support applications which have pre-determined task criteria.
Abstract: In this paper, we examine an emerging variation of the classification problem, which is known as the inverse classification problem. In this problem, we determine the features to be used to create a record which will result in a desired class label. Such an approach is useful in applications in which it is an objective to determine a set of actions to be taken in order to guide the data mining application towards a desired solution. This system can be used for a variety of decision support applications which have pre-determined task criteria. We will show that the inverse classification problem is a powerful and general model which encompasses a number of different criteria. We propose a number of algorithms for the inverse classification problem, which use an inverted list representation for intermediate data structure representation and classification. We validate our approach over a number of real datasets.

Book ChapterDOI
21 Jun 2010
TL;DR: Several challenges remain, and this presentation focuses on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.
Abstract: The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the course of the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.

Book ChapterDOI
06 Sep 2010
TL;DR: In this article, low level image features (such as colors and textures) are converted into a textual form and are indexed into the inverted index by means of the Lucene search engine library.
Abstract: Content-based image retrieval is becoming a popular way for searching digital libraries as the amount of available multimedia data increases. However, the cost of developing from scratch a robust and reliable system with content-based image retrieval facilities for large databases is quite prohibitive. In this paper, we propose to exploit an approach to perform approximate similarity search in metric spaces developed by [3, 6]. The idea at the basis of these techniques is that when two objects are very close to each other they 'see' the world around them in the same way. Accordingly, we can use a measure of dissimilarity between the views of the world at different objects in place of the distance function of the underlying metric space. To employ this idea, the low-level image features (such as colors and textures) are converted into a textual form and indexed into the inverted index by means of the Lucene search engine library. The conversion of the features into textual form allows us to employ Lucene's off-the-shelf indexing and searching abilities with little implementation effort. In this way, we are able to set up a robust information retrieval system that combines full-text search with content-based image retrieval capabilities.

Proceedings ArticleDOI
26 Apr 2010
TL;DR: A new framework based on performing a Traveling Salesman computation on a reduced sparse graph obtained through Locality Sensitive Hashing is proposed, which achieves improved compression while scaling to tens of millions of documents.
Abstract: Web search engines depend on the full-text inverted index data structure. Because the query processing performance is so dependent on the size of the inverted index, a plethora of research has focused on fast and effective techniques for compressing this structure. Recently, several authors have proposed techniques for improving index compression by optimizing the assignment of document identifiers to the documents in the collection, leading to significant reduction in overall index size. In this paper, we propose improved techniques for document identifier assignment. Previous work includes simple and fast heuristics such as sorting by URL, as well as more involved approaches based on the Traveling Salesman Problem or on graph partitioning. These techniques achieve good compression but do not scale to larger document collections. We propose a new framework based on performing a Traveling Salesman computation on a reduced sparse graph obtained through Locality Sensitive Hashing. This technique achieves improved compression while scaling to tens of millions of documents. Based on this framework, we describe a number of new algorithms, and perform a detailed evaluation on three large data sets showing improvements in index size.
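Why document identifier assignment affects index size can be shown with an idealized cost model: a posting list stores d-gaps, and a gap of size g costs roughly log2(g) bits. The numbers below are illustrative, not from the paper's evaluation.

```python
from math import log2

def dgap_cost_bits(doc_ids):
    """Idealized posting-list cost: ~log2(gap) bits per d-gap.
    The first gap is measured from id -1 so a leading id of 0 is legal."""
    ids = sorted(doc_ids)
    gaps = [ids[0] + 1] + [b - a for a, b in zip(ids, ids[1:])]
    return sum(max(1, int(log2(g)) + 1) for g in gaps)

# The same term appears in three documents. Under an arbitrary assignment
# they land far apart; under a similarity-aware ordering (URL sorting, or
# a TSP tour on an LSH-sparsified graph) they receive adjacent ids.
random_assignment = [2, 500, 995]
clustered_assignment = [500, 501, 502]
```

Smaller gaps mean fewer bits: the clustered assignment costs 11 bits against 20 for the scattered one in this toy model, which is the effect the docID-reassignment techniques exploit at scale.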

Journal ArticleDOI
TL;DR: The experiments show that inverted indexes are preferable over purely suffix-array-based techniques for in-memory (English) text search engines.
Abstract: Inverted index data structures are the key to fast text search engines. We first investigate one of the predominant operations on inverted indexes, which asks for intersecting two sorted lists of document IDs of different lengths. We explore compression and performance of different inverted list data structures. In particular, we present Lookup, a new data structure that allows intersection in expected time linear in the smaller list. Based on this result, we present the algorithmic core of a full-text database that allows fast Boolean queries, phrase queries, and document reporting using less space than the input text. The system uses a carefully choreographed combination of classical data compression techniques and inverted-index-based search data structures. Our experiments show that inverted indexes are preferable over purely suffix-array-based techniques for in-memory (English) text search engines. A similar system is now running in practice in each core of the distributed database engine TREX of SAP.
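For contrast with the paper's Lookup structure, here is the classic baseline it improves on: binary-searching the larger sorted list for each element of the smaller one, i.e. O(|small| log |large|) rather than the expected O(|small|) that Lookup achieves.

```python
from bisect import bisect_left

def intersect(small, large):
    """Intersect two sorted docID lists, binary-searching the larger list
    for each element of the smaller; `lo` advances so repeated searches
    never re-scan the already-passed prefix."""
    out = []
    lo = 0
    for x in small:
        i = bisect_left(large, x, lo)
        if i < len(large) and large[i] == x:
            out.append(x)
        lo = i
    return out
```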

Journal ArticleDOI
TL;DR: This paper presents an indexing scheme for the energy- and latency-efficient processing of full-text searches over the wireless broadcast data stream, and proposes a replication strategy of the index list and index tree to further improve the latency performance.
Abstract: In wireless mobile computing environments, broadcasting is an effective and scalable technique to disseminate information to a massive number of clients, wherein the energy usage and latency are considered major concerns. This paper presents an indexing scheme for the energy- and latency-efficient processing of full-text searches over the wireless broadcast data stream. Although a lot of access methods and index structures have been proposed in the past for full-text searches, all of them are targeted for data in disk storage, not wireless broadcast channels. For full-text searches on a wireless broadcast stream, we firstly introduce a naive, inverted list-style indexing method, where inverted lists are placed in front of the data on the wireless channel. In order to reduce the latency overhead, we propose a two-level indexing method which adds another level of index structure to the basic inverted list-style index. In addition, we propose a replication strategy of the index list and index tree to further improve the latency performance. We analyze the performance of the proposed indexing scheme with respect to the latency and energy usage measures, and show the optimality of index replication. The correctness of the analysis is demonstrated through simulation experiments, and the effectiveness of the proposed scheme is shown by implementing a real wireless information delivery system.

Book ChapterDOI
11 Oct 2010
TL;DR: This paper gives the first linear space (and partly succinct) data structures, which can answer multi-pattern queries in O(Σ|P_i| + t^(1/m) n^(1-1/m)) time, where t is the number of output occurrences.
Abstract: Given a collection D of string documents {d_1, d_2, ..., d_|D|} of total length n, which may be preprocessed, a fundamental task is to retrieve the most relevant documents for a given query. The query consists of a set of m patterns {P_1, P_2, ..., P_m}. To measure the relevance of a document with respect to the query patterns, we may define a score, such as the number of occurrences of these patterns in the document, or the proximity of the given patterns within the document. To control the size of the output, we may also specify a threshold (or a parameter K), so that our task is to report all the documents which match the query with score more than threshold (or respectively, the K documents with the highest scores). When the documents are strings (without word boundaries), the traditional inverted-index-based solutions may not be applicable. The single pattern retrieval case has been well solved by [14, 9]. When it comes to two or more patterns, the only non-trivial solution for proximity search and common document listing was given by [14], which took O(n^(3/2)) space. In this paper, we give the first linear space (and partly succinct) data structures, which can answer multi-pattern queries in O(Σ|P_i| + t^(1/m) n^(1-1/m)) time, where t is the number of output occurrences. In the particular case of two patterns, we achieve the bound of O(|P_1| + |P_2| + √(nt) log^2 n). We also show space-time trade-offs for our data structures. Our approach is based on a novel data structure called the weight-balanced wavelet tree, which may be of independent interest.

Book ChapterDOI
11 Oct 2010
TL;DR: This paper introduces a general technique, based on wavelet trees, to maintain a single data structure that offers the combined functionality of two independent orderings for an inverted index, with competitive efficiency and within the space of one compressed inverted index.
Abstract: Several IR tasks rely, to achieve high efficiency, on a single pervasive data structure called the inverted index. This is a mapping from the terms in a text collection to the documents where they appear, plus some supplementary data. Different orderings in the list of documents associated to a term, and different supplementary data, fit widely different IR tasks. Index designers have to choose the right order for one such task, rendering the index difficult to use for others. In this paper we introduce a general technique, based on wavelet trees, to maintain a single data structure that offers the combined functionality of two independent orderings for an inverted index, with competitive efficiency and within the space of one compressed inverted index. We show in particular that the technique allows combining an ordering by decreasing term frequency (useful for ranked document retrieval) with an ordering by increasing document identifier (useful for phrase and Boolean queries). We show that we can support not only the primitives required by the different search paradigms (e.g., in order to implement any intersection algorithm on top of our data structure), but also that the data structure offers novel ways of carrying out many operations of interest, including space-free treatment of stemming and hierarchical documents.

Patent
12 Feb 2010
TL;DR: In this paper, a system for updating an in-memory index is described; the in-memory index is used to provide results to user queries in tandem with the inverted index.
Abstract: Systems and methods for performing an updating process to an in-memory index are provided. Upon receiving notice of document modifications covered by an inverted index associated with a search engine, in the form of an update file, a representation of the modification is published onto various index serving machines. Each index serving machine receiving the update file determines if the modifications are applicable to the index serving machine. If an index serving machine determines that it contains mapping information corresponding to the modified documents, the index serving machine utilizes the update file and associated mapping information to update an in-memory index. In embodiments, the in-memory index is used to provide results to user queries in tandem with the inverted index. In some embodiments, an extra in-memory index is maintained that is revised with constantly incoming metadata updates and the existing in-memory index is periodically swapped with the revised in-memory index.
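The swap scheme in the last sentence (an extra in-memory index absorbs incoming metadata updates and is periodically exchanged with the serving index) amounts to double buffering. A minimal sketch, with class and method names made up for illustration:

```python
import threading

class SwappableIndex:
    """Serve queries from an active in-memory index while a shadow
    copy absorbs incoming updates; swap() publishes the shadow."""
    def __init__(self):
        self._active = {}   # term -> set of doc ids, read by queries
        self._shadow = {}   # term -> set of doc ids, written by updates
        self._lock = threading.Lock()

    def update(self, doc_id, terms):
        # apply an update-file entry to the shadow index only
        with self._lock:
            for t in terms:
                self._shadow.setdefault(t, set()).add(doc_id)

    def swap(self):
        # publish a snapshot of the shadow as the new serving index
        with self._lock:
            self._active = {t: set(d) for t, d in self._shadow.items()}

    def query(self, term):
        return self._active.get(term, set())
```

Queries never block on updates: they only ever read the active copy, which changes atomically at swap time.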

Patent
Matthew Thomas1
20 Apr 2010
TL;DR: In this article, the authors provide tracking or logging requests to resolve non-existent domains (NXDomains) and organize the NXDomains to support searching of the domain names including ranking the domains based on popularity, e.g., number of hits or potential traffic based on the number of requests made for the NXDomain.
Abstract: Methods and systems provide tracking or logging requests to resolve non-existent domains (NXDomains) and organizing the NXDomains to support searching of the domain names, including ranking the NXDomains based on popularity, e.g., number of hits, or potential traffic based on the number of requests made for the NXDomain. NXDomain logs may be organized so that they support searching by creating an inverted index including n-grams of the NXDomains. Searching includes identifying a target substring in one or more of the indexes, selecting those matching NXDomains satisfying some threshold criteria, and displaying the NXDomains in a selected order, such as by demand or popularity associated with, for example, a selected geographical location from which resolution requests targeting the respective NXDomains originate.
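The scheme described here can be sketched minimally: index each NXDomain under its character n-grams, answer a substring search by intersecting the posting sets of the query's n-grams, then verify the candidates and rank them by hit count. All function names and the `hits` map are illustrative, not from the patent:

```python
from collections import defaultdict

def ngrams(s, n=3):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def build_index(domains, n=3):
    index = defaultdict(set)
    for d in domains:
        for g in ngrams(d, n):
            index[g].add(d)
    return index

def search(index, substring, hits, n=3):
    grams = ngrams(substring, n)
    if not grams:           # query shorter than n: no grams to look up
        return []
    cands = set.intersection(*(index.get(g, set()) for g in grams))
    # verify (n-gram candidates can be false positives), rank by demand
    result = [d for d in cands if substring in d]
    return sorted(result, key=lambda d: hits.get(d, 0), reverse=True)
```

The verification step matters because sharing all n-grams of the query does not guarantee the domain contains the query as a contiguous substring.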

Proceedings ArticleDOI
26 Oct 2010
TL;DR: This paper proposes new index compression techniques for versioned document collections that achieve reductions in index size over previous methods; it first proposes several bitwise compression techniques that achieve a compact index structure but are too slow for most applications.
Abstract: Current Information Retrieval systems use inverted index structures for efficient query processing. Due to the extremely large size of many data sets, these index structures are usually kept in compressed form, and many techniques for optimizing compressed size and query processing speed have been proposed. In this paper, we focus on versioned document collections, that is, collections where each document is modified over time, resulting in multiple versions of the document. Consecutive versions of the same document are often similar, and several researchers have explored ideas for exploiting this similarity to decrease index size. We propose new index compression techniques for versioned document collections that achieve reductions in index size over previous methods. In particular, we first propose several bitwise compression techniques that achieve a compact index structure but that are too slow for most applications. Based on the lessons learned, we then propose additional techniques that come close to the sizes of the bitwise technique while also improving on the speed of the best previous methods.
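For context, the baseline such techniques compete with is gap-plus-variable-byte coding of sorted posting lists; the paper's bitwise and version-aware schemes then improve on this by exploiting the redundancy between consecutive versions of a document. A sketch of the baseline:

```python
def vbyte_encode(nums):
    """Variable-byte encode a sorted list of doc ids as gaps."""
    out, prev = bytearray(), 0
    for x in nums:
        gap, prev = x - prev, x
        while gap >= 128:
            out.append(gap & 0x7F)
            gap >>= 7
        out.append(gap | 0x80)   # high bit marks the last byte of a gap
    return bytes(out)

def vbyte_decode(data):
    nums, cur, shift, prev = [], 0, 0, 0
    for b in data:
        if b & 0x80:             # terminating byte: emit the doc id
            prev += cur | ((b & 0x7F) << shift)
            nums.append(prev)
            cur, shift = 0, 0
        else:
            cur |= b << shift
            shift += 7
    return nums
```

Gap encoding already benefits from dense posting lists; versioned collections make the lists denser still, since a term present in one version is usually present in its neighbors.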

Patent
Zhenmin Li1, Chengdu Huang1, Spiros Xanthos1, Qingbo Zhu1, Yuanyuan Zhou1 
23 Sep 2010
TL;DR: In this paper, an inverted index is generated using structure extracted from the semi-structured data, and the inverted index includes a location identifier and a data type identifier for one or more of its entries.
Abstract: Generating an inverted index is disclosed. Semi-structured data from a plurality of sources is parsed to extract structure from at least a portion of the semi-structured data. The inverted index is generated using the extracted structure. The inverted index includes a location identifier and a data type identifier for one or more entries of the inverted index.
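One way to picture this is indexing JSON records: each token's posting carries the source it came from, the path within the record (a location identifier), and the value's type (a data type identifier). A minimal sketch under that assumption, with all names illustrative:

```python
import json

def index_records(sources):
    """Build term -> [(source_id, path, type_name)] postings from JSON records."""
    index = {}
    def walk(src, path, value):
        if isinstance(value, dict):
            for k, v in value.items():
                walk(src, f"{path}.{k}" if path else k, v)
        elif isinstance(value, list):
            for i, v in enumerate(value):
                walk(src, f"{path}[{i}]", v)
        else:
            # leaf: index each token with its location and data type
            for token in str(value).lower().split():
                index.setdefault(token, []).append(
                    (src, path, type(value).__name__))
    for src, text in sources.items():
        walk(src, "", json.loads(text))
    return index
```

Storing the type alongside the location lets queries distinguish, say, the numeric code 5 from the string "5" without re-parsing the source data.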

Book ChapterDOI
11 Oct 2010
TL;DR: It is shown that an inverted index can be replaced by a more space-efficient in-memory encoding, outperforming the query performance of inverted indices when the ratio n_M/δ is ω(log |Σ|).
Abstract: We prove that a document collection, represented as a unique sequence T of n terms over a vocabulary Σ, can be represented in n H_0(T) + o(n)(H_0(T) + 1) bits of space, such that a conjunctive query t_1 ∧ ... ∧ t_k can be answered in O(kδ log log |Σ|) adaptive time, where δ is the instance difficulty of the query, as defined by Barbay and Kenyon in their SODA'02 paper, and H_0(T) is the empirical entropy of order 0 of T. As a comparison, using an inverted index plus the adaptive intersection algorithm by Barbay and Kenyon takes O(kδ log(n_M/δ)) time, where n_m and n_M are the lengths of the shortest and longest occurrence lists, respectively, among those of the query terms. Thus, we can replace an inverted index by a more space-efficient in-memory encoding, outperforming the query performance of inverted indices when the ratio n_M/δ is ω(log |Σ|).
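The adaptive intersection being compared against works roughly as follows: keep an "eliminator" value, gallop (doubling search) for it in each posting list in turn, and replace it whenever a list overshoots, so easy instances finish after few probes. A sketch of that standard technique over plain sorted lists:

```python
from bisect import bisect_left

def gallop(lst, target, lo=0):
    """Smallest index j >= lo with lst[j] >= target (exponential search)."""
    if lo >= len(lst) or lst[lo] >= target:
        return lo
    step = 1
    while lo + step < len(lst) and lst[lo + step] < target:
        step *= 2
    return bisect_left(lst, target, lo + step // 2 + 1,
                       min(lo + step + 1, len(lst)))

def adaptive_intersect(lists):
    """Intersect k >= 1 sorted lists, cycling an eliminator through them."""
    if not lists or any(not l for l in lists):
        return []
    if len(lists) == 1:
        return list(lists[0])
    k, pos, out = len(lists), [0] * len(lists), []
    elim, matched, i = lists[0][0], 1, 0
    while True:
        i = (i + 1) % k
        j = gallop(lists[i], elim, pos[i])
        if j == len(lists[i]):          # some list is exhausted: done
            return out
        pos[i] = j
        if lists[i][j] == elim:
            matched += 1
            if matched == k:            # seen in all k lists: report it
                out.append(elim)
                pos[i] += 1
                if pos[i] == len(lists[i]):
                    return out
                elim, matched = lists[i][pos[i]], 1
        else:                           # overshot: adopt the larger value
            elim, matched = lists[i][j], 1
```

The number of eliminator changes is what the difficulty measure δ captures: on easy instances the eliminator jumps far with each gallop and few values are ever considered.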

Journal ArticleDOI
TL;DR: A novel R-Tree-based inverted index structure, named UI-Tree, is introduced to efficiently support various queries including range queries, similarity joins, and their size estimation, as well as top-k range query, over multidimensional uncertain objects against continuous or discrete cases.
Abstract: With the rapid development of various optical, infrared, and radar sensors and GPS techniques, a huge amount of multidimensional uncertain data is collected and accumulated every day. Recently, considerable research efforts have been made in the field of indexing, analyzing, and mining uncertain data. As shown in a recent book on uncertain data, in order to efficiently manage and mine uncertain data, effective indexing techniques are highly desirable. Based on the observation that the existing index structures for multidimensional data are sensitive to the size or shape of the uncertain regions of uncertain objects and of the queries, in this paper we introduce a novel R-Tree-based inverted index structure, named UI-Tree, to efficiently support various queries, including range queries, similarity joins, and their size estimation, as well as top-k range queries, over multidimensional uncertain objects in both continuous and discrete cases. Comprehensive experiments are conducted on both real data and synthetic data to demonstrate the efficiency of our techniques.

Proceedings ArticleDOI
25 Oct 2010
TL;DR: A new approach to image indexing and retrieval is presented, which integrates appearance with global image geometry in the indexing process, while enjoying robustness against viewpoint change, photometric variations, occlusion, and background clutter.
Abstract: We present a new approach to image indexing and retrieval, which integrates appearance with global image geometry in the indexing process, while enjoying robustness against viewpoint change, photometric variations, occlusion, and background clutter. We exploit shape parameters of local features to estimate image alignment via a single correspondence. Then, for each feature, we construct a sparse spatial map of all remaining features, encoding their normalized position and appearance, typically vector quantized to a visual word. An image is represented by a collection of such feature maps, and RANSAC-like matching is reduced to a number of set intersections. Because the induced dissimilarity is still not a metric, we extend min-wise independent permutations to collections of sets and derive a similarity measure for feature map collections. We then exploit sparseness to build an inverted file whereby the retrieval process is sub-linear in the total number of images, ideally linear in the number of relevant ones. We achieve excellent performance on 10^4 images, with a query time on the order of milliseconds.
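The min-wise independent permutations mentioned here estimate Jaccard set similarity by the fraction of random hash functions under which two sets share the same minimum. A basic sketch of that standard building block (the paper's extension to collections of sets is not reproduced; all names are illustrative):

```python
import random

def minhash_signature(items, num_hashes=64, seed=0):
    """MinHash signature of a set under num_hashes simulated permutations."""
    rng = random.Random(seed)
    prime = (1 << 61) - 1
    # one (a, b) pair per hash function h(x) = (a*x + b) mod prime
    params = [(rng.randrange(1, prime), rng.randrange(prime))
              for _ in range(num_hashes)]
    # note: hash() of str is salted per process; signatures are only
    # comparable when computed within the same run
    return [min((a * hash(x) + b) % prime for x in items)
            for a, b in params]

def estimated_jaccard(sig1, sig2):
    """Fraction of agreeing minima estimates |A ∩ B| / |A ∪ B|."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)
```

Because each signature position is a single integer, signatures fit naturally into an inverted file, which is what makes the sub-linear retrieval in the abstract possible.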

Journal ArticleDOI
TL;DR: This work shows that indexing costs can be significantly reduced further by letting peers form groups in a self-organized fashion, and reduces index maintenance cost by an order of magnitude, while still keeping a complete and correct term index for query processing.