
Showing papers in "arXiv: Information Retrieval in 2009"


Journal Article
TL;DR: In this paper, the authors compare results obtained from the Web of Science and Scopus, and show that the correlations between the measures obtained with both databases for the number of papers and the citations received by countries, as well as for their ranks, are extremely high (R² > .99).
Abstract: For more than 40 years, the Institute for Scientific Information (ISI, now part of Thomson Reuters) produced the only available bibliographic databases from which bibliometricians could compile large-scale bibliometric indicators. ISI's citation indexes, now regrouped under the Web of Science (WoS), were the major sources of bibliometric data until 2004, when Scopus was launched by the publisher Reed Elsevier. For those who perform bibliometric analyses and comparisons of countries or institutions, the existence of these two major databases raises the important question of the comparability and stability of statistics obtained from different data sources. This paper uses macro-level bibliometric indicators to compare results obtained from the WoS and Scopus. It shows that the correlations between the measures obtained with both databases for the number of papers and the number of citations received by countries, as well as for their ranks, are extremely high (R² > .99). There is also a very high correlation when countries' papers are broken down by field. The paper thus provides evidence that indicators of scientific production and citations at the country level are stable and largely independent of the database.

341 citations


Posted Content
TL;DR: The Memento solution is a framework in which archived resources can seamlessly be reached via the URI of their original: protocol-based time travel for the Web.
Abstract: The Web is ephemeral. Many resources have representations that change over time, and many of those representations are lost forever. A lucky few manage to reappear as archived resources that carry their own URIs. For example, some content management systems maintain version pages that reflect a frozen prior state of their changing resources. Archives recurrently crawl the web to obtain the actual representation of resources, and subsequently make those available via special-purpose archived resources. In both cases, the archival copies have URIs that are protocol-wise disconnected from the URI of the resource of which they represent a prior state. Indeed, the lack of temporal capabilities in the most common Web protocol, HTTP, prevents getting to an archived resource on the basis of the URI of its original. This turns accessing archived resources into a significant discovery challenge for both human and software agents, which typically involves following a multitude of links from the original to the archival resource, or searching archives for the original URI. This paper proposes the protocol-based Memento solution to address this problem, and describes a proof-of-concept experiment that includes major servers of archival content, including Wikipedia and the Internet Archive. The Memento solution is based on existing HTTP capabilities applied in a novel way to add the temporal dimension. The result is a framework in which archived resources can seamlessly be reached via the URI of their original: protocol-based time travel for the Web.

149 citations
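
The idea in the Memento abstract above, asking for a past state of a resource through the URI of its original, can be illustrated with a minimal sketch. The abstract does not name any header; the `Accept-Datetime` name used here is taken from RFC 7089, the later standardization of Memento, and the function name and example date are hypothetical. The sketch only builds the request header and sends nothing.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when: datetime) -> dict:
    """Return the header a client could send with a GET on the original URI."""
    # HTTP dates are expressed in GMT, so convert before formatting.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    # Header name per RFC 7089 (an assumption; not stated in the abstract).
    return {"Accept-Datetime": http_date}


# Hypothetical example: ask for the state of a resource as of 1 January 2009.
headers = accept_datetime_header(datetime(2009, 1, 1, tzinfo=timezone.utc))
print(headers)
```

A server or intermediary that supports datetime negotiation would use this header to pick the archived copy closest to the requested time; a server that does not support it would simply ignore the header.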


Posted Content
TL;DR: This work has integrated models from previous work in sensemaking and information seeking behavior to present a canonical social model of user activities before, during, and after search, suggesting where in the search process both explicitly and implicitly shared information may be valuable to individual searchers.
Abstract: Search engine researchers typically depict search as the solitary activity of an individual searcher. In contrast, results from our critical-incident survey of 150 users on Amazon's Mechanical Turk service suggest that social interactions play an important role throughout the search process. Our main contribution is that we have integrated models from previous work in sensemaking and information seeking behavior to present a canonical social model of user activities before, during, and after search, suggesting where in the search process even implicitly shared information may be valuable to individual searchers.

121 citations


Posted Content
TL;DR: In this article, the authors present a method for ranking weblogs based on the link graph and on several similarity characteristics between weblogs, and assign a ranking to each weblog using their algorithm, which is a modified version of PageRank.
Abstract: A large part of the hidden web resides in weblog servers. New content is produced on a daily basis, and the work of traditional search engines turns out to be insufficient due to the nature of weblogs. This work summarizes the structure of the blogosphere and highlights the special features of weblogs. In this paper we present a method for ranking weblogs based on the link graph and on several similarity characteristics between weblogs. First we create an enhanced graph of connected weblogs and add new types of edges and weights utilising many weblog features. Then, we assign a ranking to each weblog using our algorithm, BlogRank, which is a modified version of PageRank. For the validation of our method we run experiments on a weblog dataset, which we process and adapt to our search engine (this http URL). The results suggest that the use of the enhanced graph and the BlogRank algorithm is preferred by the users.

98 citations


Posted Content
TL;DR: A model of possible kinds of collaboration is proposed, which contains four dimensions: intent, depth, concurrency and location, and can be used to classify existing systems and to suggest possible opportunities for design in this space.
Abstract: People can help other people find information in networked information seeking environments. Recently, many such systems and algorithms have proliferated in industry and in academia. Unfortunately, it is difficult to compare the systems in meaningful ways because they often define collaboration in different ways. In this paper, we propose a model of possible kinds of collaboration, and illustrate it with examples from literature. The model contains four dimensions: intent, depth, concurrency and location. This model can be used to classify existing systems and to suggest possible opportunities for design in this space.

91 citations


Posted Content
TL;DR: This document describes the BM25 and BM25F implementation using the Lucene Java Framework, both of which have stood out at TREC by their performance and are considered as state-of-the-art in the IR community.
Abstract: This document describes the BM25 and BM25F implementation using the Lucene Java Framework. Both models have stood out at TREC for their performance and are considered state of the art in the IR community. BM25 is applied to retrieval on plain text documents, that is, documents that do not contain fields, while BM25F is applied to documents with structure.

88 citations
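
For readers unfamiliar with the BM25 model named in the entry above, the following is a minimal sketch of the textbook scoring function for plain (field-less) documents. It is not the Lucene implementation the paper describes: the function name, the default parameters `k1=1.2` and `b=0.75`, the smoothed non-negative idf variant, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in docs against the tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            n_t = doc_freq[term]
            # Smoothed, non-negative idf variant (an assumption of this sketch).
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


# Hypothetical toy collection, already tokenized by whitespace.
docs = [
    "lucene is a java search library".split(),
    "bm25 ranks documents for a query".split(),
    "bm25 and bm25f are ranking functions".split(),
]
scores = bm25_scores("bm25 ranking".split(), docs)
print(scores)
```

The third document matches both query terms and therefore scores highest, while the first matches neither and scores zero. BM25F, which the abstract applies to structured documents, additionally weights term frequencies per field and is not shown here.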


Posted Content
TL;DR: This paper proposed text summarization based on fuzzy logic to improve the quality of the summary created by the general statistic method, and compared the results with the baseline summarizer and Microsoft Word 2007 summarizers.
Abstract: Text summarization can be classified into two approaches: extraction and abstraction. This paper focuses on the extraction approach. The goal of text summarization based on the extraction approach is sentence selection. One of the methods to obtain suitable sentences is to assign each sentence a numerical measure for the summary, called sentence weighting, and then select the best ones. The first step in summarization by extraction is the identification of important features. In our experiment, we used 125 test documents from the DUC2002 data set. Each document is prepared by a preprocessing step: sentence segmentation, tokenization, stop-word removal, and word stemming. Then, we used 8 important features and calculated their score for each sentence. We proposed text summarization based on fuzzy logic to improve the quality of the summary created by the general statistic method. We compared our results with the baseline summarizer and Microsoft Word 2007 summarizers. The results show that the best average precision, recall, and F-measure for the summaries were obtained by the fuzzy method.

84 citations


Journal Article
TL;DR: An adaptive model which combines similarities in users' rating patterns with epidemic-like spreading of news on an evolving network is proposed and provides a general social mechanism for recommender systems and may find its applications also in other types of recommendation.
Abstract: Most news recommender systems try to identify users' interests and news' attributes and use them to obtain recommendations. Here we propose an adaptive model which combines similarities in users' rating patterns with epidemic-like spreading of news on an evolving network. We study the model by computer agent-based simulations, measure its performance and discuss its robustness against bias and malicious behavior. Subject to the approval fraction of news recommended, the proposed model outperforms the widely adopted recommendation of news according to their absolute or relative popularity. This model provides a general social mechanism for recommender systems and may find its applications also in other types of recommendation.

67 citations


Posted Content
TL;DR: The study and discussion presented here are driven by two dissatisfactions: (1) the majority of IR systems today do not facilitate collaboration directly, and (2) the concept of collaboration itself is not well-understood.
Abstract: It is natural for humans to collaborate while dealing with complex problems. In this article I consider this process of collaboration in the context of information seeking. The study and discussion presented here are driven by two dissatisfactions: (1) the majority of IR systems today do not facilitate collaboration directly, and (2) the concept of collaboration itself is not well-understood. I begin by probing the notion of collaboration and propose a model that helps us understand the requirements for a successful collaboration. A model of a Collaborative Information Seeking (CIS) environment is then rendered based on an extended model of information seeking.

61 citations


Posted Content
TL;DR: Experimental study shows that weak ties play a significant role in the link prediction problem, and that emphasizing the contribution of weak ties can remarkably enhance the prediction accuracy.
Abstract: Plenty of algorithms for link prediction have been proposed and applied to various real networks. Among these works, the weights of links are rarely taken into account. In this paper, we use local similarity indices to estimate the likelihood of the existence of links in weighted networks, including Common Neighbor, Adamic-Adar Index, Resource Allocation Index, and their weighted versions. In both the unweighted and weighted cases, the resource allocation index performs the best. To our surprise, the weighted indices perform worse, which reminds us of the well-known Weak Tie Theory. Further extensive experimental study shows that weak ties play a significant role in the link prediction problem, and that emphasizing the contribution of weak ties can remarkably enhance the prediction accuracy.

58 citations
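
The three unweighted local similarity indices named in the abstract above have short standard definitions: Common Neighbors counts shared neighbours, Adamic-Adar sums 1/log(degree) over them, and Resource Allocation sums 1/degree. The sketch below computes all three for one node pair; the function name, the adjacency-set representation, and the toy graph are assumptions, and the weighted versions studied in the paper are not reproduced.

```python
import math


def local_similarity(adj, x, y):
    """Return (CN, AA, RA) scores for the node pair (x, y).

    adj maps each node to the set of its neighbours (undirected, unweighted).
    """
    common = adj[x] & adj[y]
    cn = len(common)
    # A common neighbour has degree >= 2, so log(degree) is never zero.
    aa = sum(1 / math.log(len(adj[z])) for z in common)
    ra = sum(1 / len(adj[z]) for z in common)
    return cn, aa, ra


# Hypothetical toy graph with edges a-c, b-c, a-d, b-d, d-e.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"d"},
}
cn, aa, ra = local_similarity(adj, "a", "b")
print(cn, aa, ra)
```

Both AA and RA down-weight high-degree common neighbours; RA does so more strongly, which is the only difference between the two.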


Posted Content
Abstract: For users, recommendations can sometimes seem odd or counterintuitive. Visualizing recommendations can remove some of this mystery, showing how a recommendation is grouped with other choices. A drawing can also lead a user's eye to other options. Traditional 2D-embeddings of points can be used to create a basic layout, but these methods, by themselves, do not illustrate clusters and neighborhoods very well. In this paper, we propose the use of geographic maps to enhance the definition of clusters and neighborhoods, and consider the effectiveness of this approach in visualizing similarities and recommendations arising from TV shows and music selections. All the maps referenced in this paper can be found in this http URL

Posted Content
TL;DR: This paper uses a method for traversing the irrelevant pages met during crawling to improve the coverage of a specific topic, and a similarity function is used to check the similarity of web pages w.r.t. topic keywords.
Abstract: A focused crawler traverses the web, selecting pages relevant to a predefined topic and neglecting those that are not of concern. While surfing the internet it is difficult to deal with irrelevant pages and to predict which links lead to quality pages. In this paper a technique of effective focused crawling is implemented to improve the quality of web navigation. A similarity function is used to check the similarity of web pages w.r.t. topic keywords, and the priorities of extracted out-links are calculated based on metadata and on the resultant pages generated by the focused crawler. The proposed work also uses a method for traversing the irrelevant pages met during crawling to improve the coverage of a specific topic.

Posted Content
TL;DR: The authors discuss the difference between a symmetrical co-citation matrix and an asymmetrical citation matrix and the appropriate statistical techniques that can be applied to each of these matrices, respectively.
Abstract: Co-occurrence matrices, such as co-citation, co-word, and co-link matrices, have been used widely in the information sciences. However, confusion and controversy have hindered the proper statistical analysis of this data. The underlying problem, in our opinion, involved understanding the nature of various types of matrices. This paper discusses the difference between a symmetrical co-citation matrix and an asymmetrical citation matrix as well as the appropriate statistical techniques that can be applied to each of these matrices, respectively. Similarity measures (like the Pearson correlation coefficient or the cosine) should not be applied to the symmetrical co-citation matrix, but can be applied to the asymmetrical citation matrix to derive the proximity matrix. The argument is illustrated with examples. The study then extends the application of co-occurrence matrices to the Web environment where the nature of the available data and thus data collection methods are different from those of traditional databases such as the Science Citation Index. A set of data collected with the Google Scholar search engine is analyzed using both the traditional methods of multivariate analysis and the new visualization software Pajek that is based on social network analysis and graph theory.
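
The abstract above states that a similarity measure such as the cosine can be applied to the asymmetrical citation matrix to derive a proximity matrix. A minimal sketch of that step follows, under assumed conventions: rows are citing documents, columns are cited items, and the data and function name are hypothetical.

```python
import math


def cosine_proximity(matrix):
    """Cosine similarity between every pair of columns of a rectangular matrix."""
    columns = list(zip(*matrix))
    norms = [math.sqrt(sum(v * v for v in col)) for col in columns]
    size = len(columns)
    proximity = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            # Columns with no citations have no defined direction; leave 0.
            if norms[i] == 0 or norms[j] == 0:
                continue
            dot = sum(a * b for a, b in zip(columns[i], columns[j]))
            proximity[i][j] = dot / (norms[i] * norms[j])
    return proximity


# Hypothetical data: 4 citing documents (rows) x 3 cited items (columns).
citations = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
]
proximity = cosine_proximity(citations)
print(proximity)
```

The result is a square, symmetric proximity matrix over the cited items, derived from the rectangular citation matrix rather than from an already symmetrical co-citation matrix.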

Posted Content
TL;DR: The authors sketch the history of spectral ranking, a general umbrella name for techniques that apply the theory of linear maps (in particular, eigenvalues and eigenvectors) to matrices that do not represent geometric transformations, but rather some kind of relationship between entities.
Abstract: We sketch the history of spectral ranking, a general umbrella name for techniques that apply the theory of linear maps (in particular, eigenvalues and eigenvectors) to matrices that do not represent geometric transformations, but rather some kind of relationship between entities. Albeit recently made famous by the ample press coverage of Google's PageRank algorithm, spectral ranking was devised more than a century ago, and has been studied in tournament ranking, psychology, social sciences, bibliometrics, economy and choice theory. We describe the contribution given by previous scholars in precise and modern mathematical terms: along the way, we show how to express in a general way damped rankings, such as Katz's index, as dominant eigenvectors of perturbed matrices, and then use results on the Drazin inverse to go back to the dominant eigenvectors by a limit process. The result suggests a regularized definition of spectral ranking that yields for a general matrix a unique vector depending on a boundary condition.
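
As a concrete instance of the damped rankings mentioned in the abstract above, Katz's index can be written as the series sum over k >= 1 of alpha^k (A^T)^k 1, where A is the adjacency matrix and alpha is an attenuation factor. The sketch below sums that series by simple iteration. It does not reproduce the paper's eigenvector or Drazin-inverse formulation; the function name, the edge orientation, and the toy graph are assumptions.

```python
def katz_index(adjacency, alpha, iterations=100):
    """Katz's index, computed by iterating x <- alpha * A^T (1 + x).

    adjacency[i][j] is 1 when node i endorses (links to) node j.
    The series converges only when alpha < 1 / (dominant eigenvalue of A).
    """
    n = len(adjacency)
    score = [0.0] * n
    for _ in range(iterations):
        # After t passes, score holds the first t terms of the series.
        score = [
            alpha * sum(adjacency[i][j] * (1.0 + score[i]) for i in range(n))
            for j in range(n)
        ]
    return score


# Hypothetical toy graph with edges 0 -> 2, 1 -> 2, 2 -> 0.
adjacency = [
    [0, 0, 1],
    [0, 0, 1],
    [1, 0, 0],
]
score = katz_index(adjacency, alpha=0.1)
print(score)
```

Node 2 receives two endorsements and ranks highest, node 0 receives one, and node 1 receives none, so its index is zero. Each endorsement path of length k contributes alpha^k, which is what makes the ranking damped.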

Posted Content
TL;DR: In this paper, the authors examine the interplay between the different updating frequencies by using AltaVista and Google for searches at different moments of time, and show that both the retrieval of the results and the structure of the retrieved information erode over time.
Abstract: Internet search engines function in a present which changes continuously. The search engines update their indices regularly, overwriting Web pages with newer ones, adding new pages to the index, and losing older ones. Some search engines can be used to search for information on the internet for specific periods of time. However, these 'date stamps' are not determined by the first occurrence of the pages on the Web, but by the last date at which a page was updated or a new page was added, and the search engine's crawler updated this change in the database. This has major implications for the use of search engines in scholarly research as well as theoretical implications for the conceptions of time and temporality. We examine the interplay between the different updating frequencies by using AltaVista and Google for searches at different moments of time. Both the retrieval of the results and the structure of the retrieved information erode over time.

Posted Content
TL;DR: The results on a French corpus demonstrate that the coupling of an Automatic Summarization system with a Question-Answering system is promising, and the personalized QAAS system obtains the best performance.
Abstract: To select the most relevant sentences of a document, the Cortex summarizer uses an optimal decision algorithm that combines several metrics. The metrics process, weight, and extract pertinent sentences using statistical and informational algorithms. This technique might improve a Question-Answering system, whose function is to provide an exact answer to a question in natural language. In this paper, we present the results obtained by coupling the Cortex summarizer with a Question-Answering system (QAAS). Two configurations have been evaluated. In the first one, a low compression level is selected and the summarization system is only used as a noise filter. In the second configuration, the system actually functions as a summarizer, with a very high level of compression. Our results on a French corpus demonstrate that the coupling of an Automatic Summarization system with a Question-Answering system is promising. The system has then been adapted to generate a customized summary depending on the specific question. Tests on a French multi-document corpus have been carried out, and the personalized QAAS system obtains the best performance.

Posted Content
TL;DR: This paper proposes a general framework for interactive IR that is able to capture the full interaction process in a principled way and relies upon a generalisation of the probability framework of quantum physics.
Abstract: Even the best information retrieval model cannot always identify the most useful answers to a user query. This is in particular the case with web search systems, where it is known that users tend to minimise their effort to access relevant information. It is, however, believed that the interaction between users and a retrieval system, such as a web search engine, can be exploited to provide better answers to users. Interactive Information Retrieval (IR) systems, in which users access information through a series of interactions with the search system, are concerned with building models for IR, where interaction plays a central role. There are many possible interactions between a user and a search system, ranging from query (re)formulation to relevance feedback. However, capturing them within a single framework is difficult and previously proposed approaches have mostly focused on relevance feedback. In this paper, we propose a general framework for interactive IR that is able to capture the full interaction process in a principled way. Our approach relies upon a generalisation of the probability framework of quantum physics, whose strong geometric component can be a key towards a successful interactive IR model.

Posted Content
TL;DR: The notion of topological centrality (TC) reflecting the topological positions of nodes and edges in general networks is proposed, and an approach to calculating the topological centrality is proposed.
Abstract: Recent developments in network structure analysis show that it plays an important role in characterizing complex systems in many branches of science. Different from previous network centrality measures, this paper proposes the notion of topological centrality (TC) reflecting the topological positions of nodes and edges in general networks, and proposes an approach to calculating the topological centrality. The proposed topological centrality is then used to discover communities and build the backbone network. Experiments and applications on a research network show the significance of the proposed approach.

Posted Content
TL;DR: This paper proposes a new algorithm for generating multidimensional association rules by utilizing fuzzy sets; given a database consisting of fuzzy transactions, the Apriori property is employed to prune useless candidate itemsets.
Abstract: Multidimensional association rule mining searches for interesting relationships among the values from different dimensions or attributes in a relational database. In this method the correlation is among a set of dimensions, i.e., the items forming a rule come from different dimensions. Therefore each dimension should be partitioned at the fuzzy set level. This paper proposes a new algorithm for generating multidimensional association rules by utilizing fuzzy sets. Given a database consisting of fuzzy transactions, the Apriori property is employed to prune useless candidate itemsets.

Posted Content
TL;DR: The experiment shows that the proposed method, which is compared to WordNet-based algorithms, is capable in principle of calculating a semantic distance between a pair of words in any language presented in Russian Wiktionary.
Abstract: A set of ontology matching algorithms (for finding correspondences between concepts) is based on a thesaurus that provides the source data for the semantic distance calculations. In this wiki era, new resources may spring up and improve this kind of semantic search. In this paper, a solution to this task based on Russian Wiktionary is compared to WordNet-based algorithms. Metrics are estimated using a test collection containing 353 English word pairs with a relatedness score assigned by human evaluators. The experiment shows that the proposed method is capable in principle of calculating a semantic distance between a pair of words in any language presented in Russian Wiktionary. The calculation of the Wiktionary-based metric required the development of open-source Wiktionary parser software.

Posted Content
TL;DR: This paper describes a powerful algorithm that mines web logs efficiently and reports experimental studies intended to show that the algorithm is more efficient than other GSP (Generalized Sequential Pattern) algorithms.
Abstract: The World Wide Web is a huge data repository and is growing at the explosive rate of about 1 million pages a day. As the information available on the World Wide Web grows, the usage of web sites also grows. A web log records each access to a web page, and the number of entries in web logs is increasing rapidly. These web logs, when mined properly, can provide useful information for decision-making. Web site designers, analysts and management executives are interested in extracting this hidden information from web logs for decision making. The web access pattern, which is the frequently used sequence of accesses, is one of the important pieces of information that can be mined from web logs. This information can be used to gather business intelligence to improve sales and advertisement, to personalize content for a user, to analyze system performance and to improve web site organization. There exist many techniques to mine access patterns from web logs. This paper describes a powerful algorithm that mines web logs efficiently. The proposed algorithm first converts the available web access data into a special doubly linked tree. Each access is called an event. This tree keeps the critical mining-related information in a very compressed form based on the frequent event count. The proposed recursive algorithm uses this tree to efficiently find all access patterns that satisfy user-specified criteria. To show that our algorithm is more efficient than other GSP (Generalized Sequential Pattern) algorithms, we have done experimental studies on sample data.

Posted Content
TL;DR: A methodology to classify remote sensing images using HSV color features and Haar wavelet texture features and then grouping them on the basis of particular threshold value is developed and the experimental results indicate that the use of color and texture feature extraction is very useful for image retrieval.
Abstract: Grouping images into semantically meaningful categories using low-level visual features is a challenging and important problem in content-based image retrieval. The groupings can be used to build effective indices for an image database. Digital image analysis techniques are being used widely in remote sensing, assuming that each terrain surface category is characterized by a spectral signature observed by remote sensors. Even with the remote sensing images of IRS data, integration of spatial information is expected to assist and to improve the image analysis of remote sensing data. In this paper we present a satellite image retrieval based on a mixture of old-fashioned ideas and state-of-the-art learning tools. We have developed a methodology to classify remote sensing images using HSV color features and Haar wavelet texture features and then grouping them on the basis of a particular threshold value. The experimental results indicate that the use of color and texture feature extraction is very useful for image retrieval.

Posted Content
TL;DR: Quality based ranking of retrieved trusted information is provided using WIQA (Web Information Quality Assessment) Framework, which provides enhanced trustworthiness in both specific and broad queries in web searching.
Abstract: The WWW is the most important source of information. However, there is no guarantee of information correctness: search engines retrieve a lot of conflicting information, and the quality of the provided information varies from low to high. We provide enhanced trustworthiness in both specific (entity) and broad (content) queries in web searching. The filtering of trustworthiness is based on 5 factors: Provenance, Authority, Age, Popularity, and Related Links. The trustworthiness is calculated based on these 5 factors and is stored, thereby increasing the performance in retrieving trustworthy websites. The calculated trustworthiness is stored only for static websites. Quality is provided based on policies selected by the user. Quality-based ranking of retrieved trusted information is provided using the WIQA (Web Information Quality Assessment) Framework.

Posted Content
TL;DR: In this article, a linear programming-based assignment optimization formulation is used to maximize the overall affinities of papers assigned to reviewers, and the authors demonstrate their results on reviewer preference data from the IEEE ICDM 2007 conference.
Abstract: Conference paper assignment, i.e., the task of assigning paper submissions to reviewers, presents multi-faceted issues for recommender systems research. Besides the traditional goal of predicting 'who likes what?', a conference management system must take into account aspects such as: reviewer capacity constraints, adequate numbers of reviews for papers, expertise modeling, conflicts of interest, and an overall distribution of assignments that balances reviewer preferences with conference objectives. Among these, issues of modeling preferences and tastes in reviewing have traditionally been studied separately from the optimization of paper-reviewer assignment. In this paper, we present an integrated study of both these aspects. First, due to the paucity of data per reviewer or per paper (relative to other recommender systems applications) we show how we can integrate multiple sources of information to learn paper-reviewer preference models. Second, our models are evaluated not just in terms of prediction accuracy but in terms of the end-assignment quality. Using a linear programming-based assignment optimization formulation, we show how our approach better explores the space of unsupplied assignments to maximize the overall affinities of papers assigned to reviewers. We demonstrate our results on real reviewer preference data from the IEEE ICDM 2007 conference.

Posted Content
TL;DR: Properties of groups that may be relevant to designers of collaborative search systems are discussed, and ways in which understanding such properties could influence the design of interfaces and algorithms for collaborative Web search are proposed.
Abstract: Understanding the similar properties of people involved in group search sessions has the potential to significantly improve collaborative search systems; such systems could be enhanced by information retrieval algorithms and user interface modifications that take advantage of important properties, for example by re-ordering search results using information from group members' combined user profiles. Understanding what makes group members similar can also assist with the identification of groups, which can be valuable for connecting users with others with whom they might undertake a collaborative search. In this workshop paper, we describe our current research efforts towards studying the properties of a variety of group types. We discuss properties of groups that may be relevant to designers of collaborative search systems, and propose ways in which understanding such properties could influence the design of interfaces and algorithms for collaborative Web search.

Posted Content
TL;DR: In this article, the authors proposed an innovative method for an indexing support system that takes as input an ontology and a plain text document and provides as output contextualized keywords of the document.
Abstract: Document indexing is an essential task performed by archivists or automatic indexing tools. To retrieve documents relevant to a query, the keywords describing each document have to be carefully chosen. Archivists have to find out the right topic of a document before starting to extract the keywords. For an archivist indexing specialized documents, experience plays an important role. But indexing documents on different topics is much harder. This article proposes an innovative method for an indexing support system. This system takes as input an ontology and a plain text document and provides as output contextualized keywords of the document. The method has been evaluated by exploiting Wikipedia's category links as a termino-ontological resource.

Posted Content
TL;DR: It will be shown that the technique is not only feasible, but also an elegant solution to the stated problem; what's more, it achieves promising results, both increasing the performance of a major search engine for informational queries, and substantially reducing the time users require to answer complex information needs.
Abstract: Search engines are nowadays one of the most important entry points for Internet users and a central tool to solve most of their information needs. Still, there exists a substantial number of user searches that obtain unsatisfactory results. Needless to say, several lines of research aim to increase the relevancy of the results users retrieve. In this paper the authors frame this problem within the much broader (and older) one of information overload. They argue that users' dissatisfaction with search engines is a currently common manifestation of such a problem, and propose a different angle from which to tackle it. As will be discussed, their approach shares goals with a current hot research topic (namely, learning to rank for information retrieval) but, unlike the techniques commonly applied in that field, their technique cannot be exactly considered machine learning and, additionally, it can be used to change the search engine's response in real time, driven by the users' behavior. Their proposal adapts concepts from Swarm Intelligence (in particular, Ant Algorithms) from an Information Foraging point of view. It will be shown that the technique is not only feasible, but also an elegant solution to the stated problem; what's more, it achieves promising results, both increasing the performance of a major search engine for informational queries, and substantially reducing the time users require to answer complex information needs.

Posted Content
TL;DR: This paper introduces a search system built by the authors for finding courses on the Vietnam OpenCourseWare Program (VOCW), which can be considered the first tool able to handle users' questions in Vietnamese.
Abstract: Building a search system that lets users express their searches as natural-language queries is very important and opens a research direction with much potential. It combines traditional information retrieval methods with research on Question Answering (QA). In this paper, we introduce a search system we built for searching courses on the Vietnam OpenCourseWare Program (VOCW). It can be considered the first tool able to handle users' questions in Vietnamese. The experimental results are rather good when we evaluate the system on precision and on the run time of answering Vietnamese questions.

Posted Content
TL;DR: In this article, an "intellectual indexing that takes into account the point of view of the user" is proposed. But it does not address the problems related to the misunderstanding of natural language and the non-correspondence between the real needs of a user and the results of his query.
Abstract: Information retrieval (IR) is the process by which a user obtains relevant information that meets their needs with the help of an IR system (IRS). However, an IRS exhibits gaps between user relevance and system relevance. These gaps are essentially related to the imperfection of the indexing process (as an approach related to IR), to problems in understanding natural language, and to the mismatch between the real needs of the user and the results of their query. The idea is to consider an "intellectual" indexing that takes into account the point of view of the user. By consulting the document, the user can build information as added value on top of the existing content: new information that enriches the content and provides semantic visibility or facilitates reading through annotations, links to other content, new descriptors, and user-specific abstracts. This amounts to reindexing the content through the contributions or votes of its users.

Posted Content
TL;DR: This paper applies Re-Pair compression, which offers fast decompression at arbitrary positions in main and secondary memory, to the differences between consecutive positions in inverted lists, and introduces variants that in addition speed up the operations required for inverted list intersection.
Abstract: Compression of inverted lists with methods that support fast intersection operations is an active research topic. Most compression schemes rely on encoding differences between consecutive positions with techniques that favor small numbers. In this paper we explore a completely different alternative: We use Re-Pair compression of those differences. While Re-Pair by itself offers fast decompression at arbitrary positions in main and secondary memory, we introduce variants that in addition speed up the operations required for inverted list intersection. We compare the resulting data structures with several recent proposals under various list intersection algorithms, to conclude that our Re-Pair variants offer an interesting time/space tradeoff for this problem, yet further improvements are required for it to improve upon the state of the art.
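To make the starting point of the abstract concrete, here is a minimal sketch of textbook Re-Pair applied to the differences (gaps) of one inverted list. It is a naive, quadratic illustration and does not reproduce the paper's variants, which add support for fast intersection. The function names are hypothetical, and using negative integers for the new grammar symbols is an assumption made so they cannot collide with the positive gaps.

```python
from collections import Counter


def to_gaps(postings):
    """Turn a strictly increasing list of positions into differences."""
    gaps, previous = [], 0
    for position in postings:
        gaps.append(position - previous)
        previous = position
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: rebuild positions by prefix sums."""
    postings, total = [], 0
    for gap in gaps:
        total += gap
        postings.append(total)
    return postings


def repair_compress(sequence):
    """Naive Re-Pair: repeatedly replace the most frequent adjacent pair."""
    sequence = list(sequence)
    rules = {}        # new symbol -> (left, right)
    next_symbol = -1  # negative ids mark grammar symbols (an assumption)
    while True:
        pair_counts = Counter(zip(sequence, sequence[1:]))
        if not pair_counts:
            break
        pair, count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, nothing left to gain
        rules[next_symbol] = pair
        rewritten, i = [], 0
        while i < len(sequence):
            # Scan left to right, replacing non-overlapping occurrences.
            if i + 1 < len(sequence) and (sequence[i], sequence[i + 1]) == pair:
                rewritten.append(next_symbol)
                i += 2
            else:
                rewritten.append(sequence[i])
                i += 1
        sequence = rewritten
        next_symbol -= 1
    return sequence, rules


def expand(symbol, rules):
    """Recursively expand one symbol back into the gaps it stands for."""
    if symbol not in rules:
        return [symbol]
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


def repair_decompress(sequence, rules):
    gaps = []
    for symbol in sequence:
        gaps.extend(expand(symbol, rules))
    return gaps


postings = [3, 5, 7, 10, 12, 14, 17, 19, 21]
gaps = to_gaps(postings)
compressed, rules = repair_compress(gaps)
restored = from_gaps(repair_decompress(compressed, rules))
```

Each grammar symbol expands independently of its neighbours, which is what makes decompression from an arbitrary position possible once the starting symbol is known; locating that symbol and skipping over symbols during intersection require extra information that this sketch does not store.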