
Showing papers in "Information Retrieval in 2003"


Journal ArticleDOI
TL;DR: An extensive empirical evaluation of memory-based learning is performed on a publicly available corpus in the context of anti-spam filtering, a cost-sensitive application of text categorization that attempts to automatically identify the unsolicited commercial messages that flood mailboxes.
Abstract: This paper presents an extensive empirical evaluation of memory-based learning in the context of anti-spam filtering, a novel cost-sensitive application of text categorization that attempts to automatically identify unsolicited commercial messages that flood mailboxes. Focusing on anti-spam filtering for mailing lists, a thorough investigation of the effectiveness of a memory-based anti-spam filter is performed using a publicly available corpus. The investigation includes different attribute and distance-weighting schemes, and studies on the effect of the neighborhood size, the size of the attribute set, and the size of the training corpus. Three different cost scenarios are identified, and suitable cost-sensitive evaluation functions are employed. We conclude that memory-based anti-spam filtering for mailing lists is practically feasible, especially when combined with additional safety nets. Compared to a previously tested Naive Bayes filter, the memory-based filter performs on average better, particularly when the misclassification cost for non-spam messages is high.
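As an illustration of the memory-based approach evaluated here, the sketch below implements a tiny k-nearest-neighbour spam filter with a cost-sensitive decision rule (the lambda parameter penalising blocked legitimate mail). The tokenisation, overlap distance, and toy training messages are assumptions for demonstration, not the authors' implementation.

```python
# Minimal memory-based (k-NN) spam filter sketch with a cost-sensitive
# decision rule: classify as spam only if the spam "vote" outweighs the
# legitimate vote scaled by lam (the cost of losing a legitimate message).

def tokens(text):
    return set(text.lower().split())

def distance(a, b):
    # Simple overlap-based distance between token sets (illustrative only).
    return 1.0 - len(a & b) / max(len(a | b), 1)

def knn_classify(message, training, k=3, lam=9.0):
    """training: list of (text, is_spam). lam: misclassification cost ratio."""
    msg = tokens(message)
    neighbours = sorted(training, key=lambda ex: distance(msg, tokens(ex[0])))[:k]
    spam_votes = sum(1 for _, is_spam in neighbours if is_spam)
    ham_votes = k - spam_votes
    # Cost-sensitive rule: blocking legitimate mail is lam times worse.
    return spam_votes > lam * ham_votes

train = [("cheap meds buy now", True),
         ("meeting agenda attached", False),
         ("win money fast click here", True),
         ("lunch tomorrow?", False)]
print(knn_classify("buy cheap meds here", train, k=3, lam=1.0))
```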

195 citations



Journal ArticleDOI
TL;DR: Five compression techniques are examined: Golomb, Elias gamma, Elias delta, Variable Byte Encoding, and Binary Interpolative Coding. A less space-efficient technique may increase the file size but decrease load/decompress time, thereby increasing throughput.
Abstract: Research into inverted file compression has focused on compression ratio—how small the indexes can be. Compression ratio is important for fast interactive searching. It is taken as read that the smaller the index, the faster the search. The premise “smaller is better”, however, may not be true. To truly build faster indexes it is often necessary to forfeit compression. For inverted lists consisting of only 128 occurrences, compression may only add overhead. Perhaps the inverted list could be stored in 128 bytes in place of 128 words, but it must still be stored on disk. If the minimum disk sector read size is 512 bytes and the word size is 4 bytes, then both the compressed and raw postings would require one disk seek and one disk sector read. A less efficient compression technique may increase the file size, but decrease load/decompress time, thereby increasing throughput. Examined here are five compression techniques: Golomb, Elias gamma, Elias delta, Variable Byte Encoding, and Binary Interpolative Coding. The effects on file size, file seek time, and file read time are all measured, as is decompression time. A quantitative measure of throughput is developed and the performance of each method is determined.
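To make the trade-off concrete, here is a sketch of one of the five schemes, variable-byte encoding of document-gap lists. The framing convention (high bit marks the final byte of a gap) is one common variant, assumed here for illustration.

```python
# Sketch of variable-byte (v-byte) encoding of an inverted list. Document IDs
# are stored as gaps (differences), and each gap is written 7 bits per byte
# with the high bit marking the terminating byte of a number.

def vbyte_encode(doc_ids):
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        chunk = []
        while True:
            chunk.append(gap & 0x7F)
            gap >>= 7
            if gap == 0:
                break
        chunk[0] |= 0x80                 # mark the low-order group; written last after reversal
        out.extend(reversed(chunk))      # most significant 7-bit group first
    return bytes(out)

def vbyte_decode(data):
    doc_ids, value, prev = [], 0, 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:                  # terminator byte: gap complete
            prev += value
            doc_ids.append(prev)
            value = 0
    return doc_ids

postings = [3, 7, 21, 150, 1025]
encoded = vbyte_encode(postings)
print(len(encoded), vbyte_decode(encoded) == postings)   # far fewer bytes than 4 per posting
```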

95 citations


Journal ArticleDOI
TL;DR: It is found that at around 70% word segmentation accuracy an over-segmentation phenomenon begins to occur that leads to a reduction in information retrieval performance, suggesting that words themselves might be too broad a notion to conveniently capture the general semantic meaning of Chinese text.
Abstract: We propose a self-supervised word segmentation technique for text segmentation in Chinese information retrieval. This method combines the advantages of traditional dictionary based, character based and mutual information based approaches, while overcoming many of their shortcomings. Experiments on TREC data show this method is promising. Our method is completely language independent and unsupervised, which provides a promising avenue for constructing accurate multi-lingual or cross-lingual information retrieval systems that are flexible and adaptive. We find that although the segmentation accuracy of self-supervised segmentation is not as high as some other segmentation methods, it is enough to give good retrieval performance. It is commonly believed that word segmentation accuracy is monotonically related to retrieval performance in Chinese information retrieval. However, for Chinese, we find that the relationship between segmentation and retrieval performance is in fact nonmonotonic; that is, at around 70% word segmentation accuracy an over-segmentation phenomenon begins to occur which leads to a reduction in information retrieval performance. We demonstrate this effect by presenting an empirical investigation of information retrieval on Chinese TREC data, using a wide variety of word segmentation algorithms with word segmentation accuracies ranging from 44% to 95%, including 70% word segmentation accuracy from our self-supervised word-segmentation approach. It appears that the main reason for the drop in retrieval performance is that correct compounds and collocations are preserved by accurate segmenters, while they are broken up by less accurate (but reasonable) segmenters, to a surprising advantage. This suggests that words themselves might be too broad a notion to conveniently capture the general semantic meaning of Chinese text. Our research suggests machine learning techniques can play an important role in building adaptable information retrieval systems and different evaluation standards for word segmentation should be given to different applications.
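One signal a mutual-information-based segmenter relies on is the association strength between adjacent characters. The sketch below, using toy counts and an arbitrary threshold, illustrates only that component, not the authors' self-supervised algorithm.

```python
# Illustrative mutual-information segmentation: insert a word boundary
# between adjacent characters whose pointwise mutual information (PMI),
# estimated from bigram/unigram counts in a corpus, falls below a threshold.
# Counts and threshold are toy values for demonstration only.
import math
from collections import Counter

def train_counts(corpus):
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    return unigrams, bigrams

def segment(text, unigrams, bigrams, threshold=0.5):
    n_uni = sum(unigrams.values())
    n_bi = max(sum(bigrams.values()), 1)
    words, current = [], text[0]
    for a, b in zip(text, text[1:]):
        p_ab = bigrams[a + b] / n_bi
        p_a, p_b = unigrams[a] / n_uni, unigrams[b] / n_uni
        pmi = math.log(p_ab / (p_a * p_b)) if p_ab > 0 else float("-inf")
        if pmi < threshold:          # weak association: start a new word
            words.append(current)
            current = b
        else:                        # strong association: extend current word
            current += b
    words.append(current)
    return words

corpus = ["abab", "abab", "cdcd", "cdcd", "abcd"]
uni, bi = train_counts(corpus)
print(segment("abcd", uni, bi))      # -> ['ab', 'cd']
```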

82 citations


Journal ArticleDOI
TL;DR: It is empirically confirmed that P@n should decline when moving to a sample collection and that average precision and R-precision should remain constant, and SD theory suggests the use of recall-fallout plots as operating characteristic (OC) curves.
Abstract: The relationship between collection size and retrieval effectiveness is particularly important in the context of Web search. We investigate it first analytically and then experimentally, using samples and subsets of test collections. Different retrieval systems vary in how the score assigned to an individual document in a sample collection relates to the score it receives in the full collection; we identify four cases. We apply signal detection (SD) theory to retrieval from samples, taking into account the four cases and using a variety of shapes for relevant and irrelevant distributions. We note that the SD model subsumes several earlier hypotheses about the causes of the decreased precision in samples. We also discuss other models which contribute to an understanding of the phenomenon, particularly relating to the effects of discreteness. Different models provide complementary insights. Extensive use is made of test data, some from official submissions to the TREC-6 VLC track and some new, to illustrate the effects and test hypotheses. We empirically confirm predictions, based on SD theory, that P@n should decline when moving to a sample collection and that average precision and R-precision should remain constant. SD theory suggests the use of recall-fallout plots as operating characteristic (OC) curves. We plot OC curves of this type for a real retrieval system and query set and show that curves for sample collections are similar but not identical to the curve for the full collection.
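The recall-fallout operating characteristic curves mentioned above can be traced directly from a ranked result list. The sketch below computes the (fallout, recall) point at each rank cutoff, using toy relevance judgments.

```python
# Trace a recall-fallout operating characteristic (OC) curve from a ranking.
# ranking: list of booleans, True where the document at that rank is relevant.
# Recall  = fraction of all relevant documents retrieved so far.
# Fallout = fraction of all irrelevant documents retrieved so far.

def recall_fallout_points(ranking, total_relevant, total_irrelevant):
    points, rel_seen, irrel_seen = [], 0, 0
    for is_relevant in ranking:
        if is_relevant:
            rel_seen += 1
        else:
            irrel_seen += 1
        points.append((irrel_seen / total_irrelevant, rel_seen / total_relevant))
    return points

ranking = [True, True, False, True, False, False, True, False]
# Suppose the collection holds 4 relevant and 96 irrelevant documents in total.
for fallout, recall in recall_fallout_points(ranking, 4, 96):
    print(f"fallout={fallout:.3f}  recall={recall:.2f}")
```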

76 citations


Journal ArticleDOI
TL;DR: Two novel approaches to query expansion with long-span collocates (words significantly co-occurring with query terms in topic-size windows) are presented: global collocation analysis over the entire collection and local collocation analysis over a subset of retrieved documents.
Abstract: The paper presents two novel approaches to query expansion with long-span collocates—words significantly co-occurring in topic-size windows with query terms. In the first approach—global collocation analysis—collocates of query terms are extracted from the entire collection; in the second—local collocation analysis—from a subset of retrieved documents. The significance of association between collocates was estimated using modified Mutual Information and Z score. The techniques were tested using the Okapi IR system. The effect of different parameters on performance was evaluated: window size, number of expansion terms, measures of collocation significance, and types of expansion terms. We present performance results of these techniques and provide comparison with related approaches.
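A rough sketch of how collocates might be scored in topic-size windows with mutual information and a Z score follows. The window handling and the plain (unmodified) formulas are simplifying assumptions rather than the paper's exact measures.

```python
# Scoring long-span collocates of a query term: count co-occurrence within
# fixed-size text windows, then rank candidates by pointwise mutual
# information and by a Z score against the count expected under independence.
import math
from collections import Counter

def window_counts(docs, window=50):
    term_windows = Counter()     # windows containing each term
    pair_windows = Counter()     # windows containing each unordered term pair
    n_windows = 0
    for doc in docs:
        words = doc.lower().split()
        for start in range(0, len(words), window):
            seen = set(words[start:start + window])
            n_windows += 1
            term_windows.update(seen)
            pair_windows.update((a, b) for a in seen for b in seen if a < b)
    return term_windows, pair_windows, n_windows

def collocates(query_term, term_windows, pair_windows, n_windows, top=5):
    scored = []
    for (a, b), joint in pair_windows.items():
        if query_term not in (a, b):
            continue
        other = b if a == query_term else a
        p_joint = joint / n_windows
        expected = term_windows[a] * term_windows[b] / n_windows
        mi = math.log2(p_joint / ((term_windows[a] / n_windows) * (term_windows[b] / n_windows)))
        z = (joint - expected) / math.sqrt(expected)
        scored.append((other, round(mi, 2), round(z, 2)))
    return sorted(scored, key=lambda t: -t[2])[:top]

docs = ["stock market prices fell sharply", "the stock market rallied today",
        "market analysts expect stock gains", "weather was sunny and warm"]
term_w, pair_w, n = window_counts(docs, window=10)
print(collocates("stock", term_w, pair_w, n))
```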

69 citations



Journal Article
TL;DR: A book surveying digital futures in current contexts: why digitize, developing collections in the digital world, the economic factors, resource discovery, description and use, designing systems for sharing digital resources, portals and personalization for end-user access, preservation, and the new roles of digital librarians in the information age.
Abstract: Contents: digital futures in current contexts; why digitize?; developing collections in the digital world; the economic factors; resource discovery, description and use; developing and designing systems for sharing digital resources; portals and personalization: mechanisms for end-user access; preservation; digital librarians: new roles for the information age; digital futures.

60 citations


Journal ArticleDOI
TL;DR: While the text is probably better suited as a teaching aid than as a must-have reference, researchers, those in related fields, and those with merely a general interest in the subject material will find this a highly accessible volume, with a sufficiently extensive range of references to serve as a good starting point for further reading.
Abstract: by content, whereby the k objects in the database are sought that are most similar to a specific query or object. Text-based internet searches are the most obvious applications here, although searches over images and time series (movies) databases are also discussed, along with the clear problems of extracting content and context (“A picture may be worth a thousand words, but which thousand words to use is a non-trivial problem”). The later chapters contain sufficient one-line commentary reminders of previously discussed results and ideas to allow them to be pursued in a modular fashion, and this is clearly what the Authors intended. Regardless, there are plenty of references to earlier chapters should the reader have skipped ahead overly enthusiastically. If more detail is required than is provided in this book, the close of each chapter thoughtfully provides a “Further Reading” section containing, in total, references to well over four hundred books and articles, including other texts on data mining that have, e.g., more computer science or economics orientations. This text is aimed at senior undergraduates and junior graduates who wish to learn about the principles of data mining. An undergraduate background in a quantitative discipline (i.e., computer science, mathematics, engineering, economics, etc.) is assumed, and this is perhaps a fair assessment of the scientific level of the book. While the necessary basics in the first four chapters should serve as a brief refresher, and perhaps more importantly are examined from a data mining perspective, the remainder of the text manages to prevent unnecessary technical details from distracting the reader from understanding the main concepts and concerns. This is not to say that the text is overly general—to the Authors' credit, detail is provided where it is needed, and there are a good number of specific examples and applications throughout the text to illustrate or clarify a point—rather that a necessary distance is maintained in order to give a broad overview of the subject material. As a consequence, the discussion has a flowing nature that is both informative and easy to digest. While the text is probably better suited as a teaching aid than as a must-have reference, researchers, those in related fields, or those with merely a general interest in discovering more about the subject material will find this a highly accessible volume with a sufficiently extensive range of references to serve as a good starting point for further reading.

58 citations


Journal ArticleDOI
TL;DR: These experiments show that good estimates of the actual probability of relevance can be achieved, and that the logistic model outperforms the linear one and the approximation quality of the different mapping functions is compared.
Abstract: Information Retrieval systems typically sort the result with respect to document retrieval status values (RSV). According to the Probability Ranking Principle, this ranking ensures optimum retrieval quality if the RSVs are monotonically increasing with the probabilities of relevance (as, e.g., for probabilistic IR models). However, advanced applications like filtering or distributed retrieval require estimates of the actual probability of relevance. The relationship between the RSV of a document and its probability of relevance can be described by a “normalisation” function which maps the retrieval status value onto the probability of relevance (“mapping functions”). In this paper, we explore the use of linear and logistic mapping functions for different retrieval methods. In a series of upper-bound experiments, we compare the approximation quality of the different mapping functions. We also investigate the effect on the resulting retrieval quality in distributed retrieval (only merging, without resource selection). These experiments show that good estimates of the actual probability of relevance can be achieved, and that the logistic model outperforms the linear one. Retrieval quality for distributed retrieval is only slightly improved by using the logistic function.
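A minimal sketch of fitting such a logistic mapping function from retrieval status values to probabilities of relevance is given below. The training data are toy values and the use of scikit-learn is a convenience assumption, not the paper's implementation.

```python
# Fit a logistic "mapping function" from retrieval status values (RSVs) to
# probabilities of relevance: P(rel | rsv) = 1 / (1 + exp(-(a + b * rsv))).
import numpy as np
from sklearn.linear_model import LogisticRegression

rsvs = np.array([[0.2], [0.5], [0.9], [1.3], [1.8], [2.4], [3.0], [3.5]])
relevant = np.array([0, 0, 0, 1, 0, 1, 1, 1])   # toy relevance judgments

model = LogisticRegression()
model.fit(rsvs, relevant)

# Map new RSVs onto calibrated probabilities of relevance, e.g. for merging
# result lists from different collections in distributed retrieval.
for rsv in [0.4, 1.5, 2.8]:
    prob = model.predict_proba([[rsv]])[0, 1]
    print(f"RSV {rsv:.1f} -> P(relevant) = {prob:.2f}")
```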

32 citations


Journal ArticleDOI
TL;DR: Several studies conducted using SIMSIFTER are reported that examined the impact of key dimensions such as type of interests, rate of change of interests, and level of user involvement on modeling accuracy and ultimately on filtering effectiveness.
Abstract: Modeling users in information filtering systems is a difficult challenge due to dimensions such as the nature, scope, and variability of interests. Numerous machine-learning approaches have been proposed for user modeling in filtering systems. The focus has been primarily on techniques for user model capture and representation, with relatively simple assumptions made about the type of users' interests. Although many studies claim to deal with “adaptive” techniques, and thus acknowledge that different types of interests must be modeled and even that changes in interests have to be captured, few studies have actually focused on the dynamic nature and the variability of user interests and their impact on the modeling process. A simulation-based information filtering environment called SIMSIFTER was developed to overcome some of the barriers associated with conducting studies on user-oriented factors that can impact interests. SIMSIFTER implemented a user modeling approach known as reinforcement learning that has proven to be effective in previous filtering studies involving humans. This paper reports on several studies conducted using SIMSIFTER that examined the impact of key dimensions such as type of interests, rate of change of interests, and level of user involvement on modeling accuracy and ultimately on filtering effectiveness.
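The reinforcement-learning flavour of profile updating can be illustrated by a simple additive weight-update rule over interest terms, as sketched below. The learning rate, scoring function, and feedback stream are invented for illustration and do not reproduce SIMSIFTER.

```python
# Illustrative reinforcement-style update of a user-interest profile for
# filtering: term weights are increased when the user accepts a document and
# decreased when the user rejects it, letting the profile drift as interests
# change over time.
from collections import defaultdict

def score(profile, doc_terms):
    return sum(profile[t] for t in doc_terms)

def update(profile, doc_terms, feedback, lr=0.2):
    """feedback: +1 if the user accepted the document, -1 if rejected."""
    for term in doc_terms:
        profile[term] += lr * feedback

profile = defaultdict(float)
stream = [({"football", "league"}, +1),
          ({"football", "transfer"}, +1),
          ({"opera", "tickets"}, -1),
          ({"league", "fixtures"}, +1)]

for doc, feedback in stream:
    print(f"score before feedback: {score(profile, doc):+.2f}  doc={sorted(doc)}")
    update(profile, doc, feedback)
```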

Journal ArticleDOI
TL;DR: A probabilistic retrieval model using logistic regression for recognizing multiple-record Web documents against an application ontology, a simple conceptual modeling approach, achieves an averaged recall ratio of 100%, precision ratio of 83.3%, and accuracy ratio of 92.5%.
Abstract: The Web contains a tremendous amount of information. It is challenging to determine which Web documents are relevant to a user query, and even more challenging to rank them according to their degrees of relevance. In this paper, we propose a probabilistic retrieval model using logistic regression for recognizing multiple-record Web documents against an application ontology, a simple conceptual modeling approach. We notice that many Web documents contain a sequence of chunks of textual information, each of which constitutes a “record.” Documents of this type are referred to as multiple-record documents. In our categorization approach, a document is represented by a set of term frequencies of index terms, a density heuristic value, and a grouping heuristic value. We first apply the logistic regression analysis on relevance probabilities using the (i) index terms, (ii) density value, and (iii) grouping value of each training document. Thereafter, the relevance probability of each test document is interpolated from the fitted curves. Contrary to other probabilistic retrieval models, our model makes only a weak independence assumption and is capable of handling any important dependent relationships among index terms. In addition, we use logistic regression, instead of linear regression analysis, because the relevance probabilities of training documents are discrete. Using a test set of car-ad Web documents and another of obituary Web documents, our probabilistic model achieves an averaged recall ratio of 100%, precision ratio of 83.3%, and accuracy ratio of 92.5%.
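The document representation described above (index-term frequencies plus density and grouping heuristic values) might be assembled roughly as follows. The tiny ontology, the token-share density, and the chunk-count grouping value are simplified stand-ins for the paper's definitions.

```python
# Sketch of a feature vector for classifying multiple-record Web pages against
# an application ontology: term frequencies of ontology index terms, a density
# heuristic (share of tokens matching the ontology), and a grouping heuristic
# (here, simply the number of record-like chunks separated by blank lines).
import re

ONTOLOGY_TERMS = ["price", "year", "mileage", "phone"]   # assumed car-ads ontology

def features(document):
    tokens = re.findall(r"[a-z0-9$]+", document.lower())
    counts = [tokens.count(term) for term in ONTOLOGY_TERMS]
    matched = sum(counts)
    density = matched / max(len(tokens), 1)
    grouping = len([chunk for chunk in document.split("\n\n") if chunk.strip()])
    return counts + [density, grouping]

page = """1998 Honda, mileage 80k, price $3500, phone 555-1234

2002 Ford, mileage 40k, price $6900, phone 555-9876"""
print(features(page))   # vectors like this would feed a logistic regression fit
```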

Journal ArticleDOI
TL;DR: This paper employs a decision theoretic approach to find conditions under which an Information Filtering (IF) system is unconditionally superior to another for all users regardless of their cost and benefit profiles and discovers an unexpected dominance relation.
Abstract: In the IR field it is clear that the value of a system depends on the cost and benefit profiles of its users. It would seem obvious that different users would prefer different systems. In the TREC-9 filtering track, systems are evaluated by a utility measure specifying a given cost and benefit. However, in the study of decision systems it is known that, in some cases, one system may be unconditionally better than another. In this paper we employ a decision theoretic approach to find conditions under which an Information Filtering (IF) system is unconditionally superior to another for all users regardless of their cost and benefit profiles. It is well known that if two IF systems have equal precision the system with better recall will be preferred by all users. Similarly, with equal recall, better precision is universally preferred. We confirm these known results and discover an unexpected dominance relation in which a system with lower recall will be universally preferred provided its precision is sufficiently higher.
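For a linear utility of the TREC-filtering kind, U = benefit * TP - cost * FP, the dependence of system preference on a user's cost/benefit profile, and a simple sufficient condition for one system to dominate another, can be worked through as below. The recall/precision numbers are toy values, and this sketch is far simpler than the paper's full decision-theoretic analysis.

```python
# Compare two information-filtering systems under the linear utility
# U = benefit * TP - cost * FP, where TP and FP follow from recall, precision,
# and the number of relevant documents R. Under this utility family, A is
# preferred by every user (any positive benefit/cost) iff TP_A >= TP_B and
# FP_A <= FP_B, with at least one inequality strict.

def tp_fp(recall, precision, n_relevant):
    tp = recall * n_relevant
    fp = tp * (1 - precision) / precision
    return tp, fp

def dominates(sys_a, sys_b, n_relevant):
    tp_a, fp_a = tp_fp(*sys_a, n_relevant)
    tp_b, fp_b = tp_fp(*sys_b, n_relevant)
    return tp_a >= tp_b and fp_a <= fp_b and (tp_a, fp_a) != (tp_b, fp_b)

def utility(recall, precision, n_relevant, benefit=1.0, cost=2.0):
    tp, fp = tp_fp(recall, precision, n_relevant)
    return benefit * tp - cost * fp

A = (0.60, 0.90)   # (recall, precision): lower recall, much higher precision
B = (0.70, 0.50)
R = 100
print(dominates(A, B, R))            # False here: A retrieves fewer relevant docs
for benefit, cost in [(1, 0.1), (1, 1), (1, 5)]:
    print(benefit, cost, utility(*A, R, benefit, cost), utility(*B, R, benefit, cost))
```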

Journal ArticleDOI
TL;DR: Results based on precision-recall data indicate that a MeSH-enhanced search index is capable of delivering noticeable incremental performance gain (as much as 35%) over the original LSI for modest constraints on precision.
Abstract: Latent Semantic Indexing (LSI) is a popular information retrieval model for concept-based searching. As with many vector space IR models, LSI requires an existing term-document association structure such as a term-by-document matrix. The term-by-document matrix, constructed during document parsing, can only capture weighted vocabulary occurrence patterns in the documents. However, for many knowledge domains there are pre-existing semantic structures that could be used to organize and categorize information. The goals of this study are (i) to demonstrate how such semantic structures can be automatically incorporated into the LSI vector space model, and (ii) to measure the effect of these structures on query matching performance. The new approach, referred to as Knowledge-Enhanced LSI, is applied to documents in the OHSUMED medical abstracts collection using the semantic structures provided by the UMLS Semantic Network and MeSH. Results based on precision-recall data (11-point average precision values) indicate that a MeSH-enhanced search index is capable of delivering noticeable incremental performance gain (as much as 35%) over the original LSI for modest constraints on precision. This performance gain is achieved by replacing the original query with the MeSH heading extracted from the query text via regular expression matches.
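The final step, replacing the free-text query with MeSH headings found in it, can be approximated with a regular-expression lookup against a heading list, as sketched below. The tiny heading list and sample query are invented, not drawn from the actual MeSH or UMLS data.

```python
# Sketch of pulling MeSH-style headings out of a free-text query with regular
# expressions, so the query can be replaced by (or enriched with) controlled
# vocabulary before matching against the LSI space.
import re

MESH_HEADINGS = ["myocardial infarction", "aspirin", "hypertension", "diabetes mellitus"]

def extract_headings(query):
    found = []
    for heading in MESH_HEADINGS:
        # Word-boundary, case-insensitive match of the whole heading phrase.
        if re.search(r"\b" + re.escape(heading) + r"\b", query, flags=re.IGNORECASE):
            found.append(heading)
    return found

query = "effect of aspirin after myocardial infarction in patients with hypertension"
print(extract_headings(query))        # use these headings as the enhanced query
```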

Journal ArticleDOI
TL;DR: Dipe-D is meant to improve on existing procedures that identify concepts less systematically, create a query manually, and then sometimes expand that query; in experiments, its query formulation performs at least as well as human beings do.
Abstract: The paper reports the development of Dipe-D, a knowledge-based procedure for the formulation of Boolean queries in information retrieval. Dipe-D creates a query in two steps: (1) the user's information need is developed interactively, while identifying the concepts of the information need, and subsequently (2) the collection of concepts identified is automatically transformed into a Boolean query. In the first step, the subject area—as represented in a knowledge base—is explored by the user. He does this by specifying the concepts that meet his information need in an artificial language and looking through the solution as provided by the computer. The specification language allows one to specify concepts by their features, both in precise terms as well as vaguely. By repeating the process of specifying the information need and exploring the resulting concepts, the user may precisely single out the concepts that describe his information need. In the second step, the program provides the designations (and variants) for the concepts identified, and connects them by appropriate operators. Dipe-D is meant to improve on existing procedures that identify the concepts less systematically, create a query manually, and then sometimes expand that query. Experiments are reported on each of the two steps; they indicate that the first step identifies only relevant concepts, though not all of them, and that the second step performs (at least) as well as human beings do.
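The second Dipe-D step, turning identified concepts into a Boolean query by OR-ing each concept's designations and variants and AND-ing the concept groups together, might look like the sketch below. The concept-to-designation data are invented.

```python
# Sketch of the concept-to-Boolean-query transformation described above:
# each identified concept contributes an OR-group of its designations and
# variants, and the groups are ANDed together into one Boolean query.

def build_boolean_query(concepts):
    """concepts: mapping of concept name -> list of designations/variants."""
    groups = []
    for terms in concepts.values():
        quoted = [f'"{t}"' if " " in t else t for t in terms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)

concepts = {
    "information retrieval": ["information retrieval", "document retrieval", "IR"],
    "query formulation":     ["query formulation", "query construction"],
}
print(build_boolean_query(concepts))
# ("information retrieval" OR "document retrieval" OR IR) AND ("query formulation" OR "query construction")
```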

Journal ArticleDOI
TL;DR: This paper shows how to exploit locality by building, using, and searching partial replicas of text collections in a distributed IR system and presents a novel inference network replica selection function that directs most of the appropriate queries to replicas in a replica hierarchy.
Abstract: The explosion of content in distributed information retrieval (IR) systems requires new mechanisms in order to attain timely and accurate retrieval of unstructured text. This paper shows how to exploit locality by building, using, and searching partial replicas of text collections in a distributed IR system. In this work, a partial replica includes a subset of the documents from larger collection(s) and the corresponding inference network search mechanism. For each query, the distributed system determines if a partial replica is a good match and then searches it; otherwise it searches the original collection. We demonstrate the scenarios where partial replication performs better than systems that use caches which only store previous query and answer pairs. We first use logs from THOMAS and Excite to examine query locality using query similarity versus exact match. We show that searching replicas can improve locality (from 3 to 19%) over the exact match required by caching. Replicas increase locality because they satisfy queries which are distinct but return the same or very similar answers. We then present a novel inference network replica selection function. We vary its parameters and compare it to previous collection selection functions, demonstrating a configuration that directs most of the appropriate queries to replicas in a replica hierarchy. We then explore the performance of partial replication in a distributed IR system. We compare it with caching and partitioning. Our validated simulator shows that the increases in locality due to replication make it preferable to caching alone, and that even a small increase of 4% in locality translates into a performance advantage. We also show a hybrid system with caches and replicas that performs better than either on its own.
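The contrast between exact-match caching and similarity-based replica matching can be seen in a few lines: a cache helps only when a query string repeats verbatim, whereas a replica can serve distinct but similar queries. The Jaccard measure and the 0.5 threshold below are illustrative stand-ins for the paper's inference-network replica selection function.

```python
# Route a query to a cache (exact match only), a partial replica (similar
# enough to the queries the replica was built for), or the original collection.

def jaccard(q1, q2):
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b)

def route(query, cache, replica_queries, threshold=0.5):
    if query in cache:
        return "cache"
    if any(jaccard(query, rq) >= threshold for rq in replica_queries):
        return "replica"
    return "original collection"

cache = {"congress budget 1996"}
replica_queries = ["congress budget 1996", "federal budget congress"]

for q in ["congress budget 1996", "1996 congress budget bill", "visa requirements"]:
    print(q, "->", route(q, cache, replica_queries))
```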

Journal ArticleDOI
TL;DR: A collection of references covering combined techniques for automatic terminology extraction, automatic indexing using selective NLP and first-order thesauri, the linguistic properties of technical terminology, and the design of a (finite-state) parsing grammar.
Abstract: Daille Beatrice (1996) Study and implementation of combined techniques for automatic extraction of terminology. In: Judith L Klavans and Philip Resnik, Eds., The Balancing Act: Combining Symbolic and Statistical Approaches to Language. MIT Press, Cambridge, MA, pp. 49–66.
Evans David A, Ginther-Webster Kimberly, Hart Mary, Lefferts Robert G and Monarch Ira A (1991) Automatic indexing using selective NLP and first-order thesauri. In: Proceedings of RIAO'91. Barcelona, Spain, pp. 624–643.
Justeson John S and Katz Slava M (1995) Technical terminology: Some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1(1):9–27.
Voutilainen Atro (1993) Designing a (finite-state) parsing grammar. In: Emmanuel Roche and Yves Schabes, Eds., Finite State Natural Language Processing. MIT Press, Cambridge, MA, pp. 283–310.