
Showing papers on "Ranking (information retrieval) published in 1994"


Proceedings Article
01 Jan 1994
TL;DR: The authors continue their work in TREC 3, performing runs in the routing, ad-hoc, and foreign language environments, with a major focus on massive query expansion: adding from 300 to 530 terms to each query.
Abstract: The Smart information retrieval project emphasizes completely automatic approaches to the understanding and retrieval of large quantities of text. We continue our work in TREC 3, performing runs in the routing, ad-hoc, and foreign language environments. Our major focus is massive query expansion: adding from 300 to 530 terms to each query. These terms come from known relevant documents in the case of routing, and from just the top retrieved documents in the case of ad-hoc and Spanish. This approach improves effectiveness by 7% to 25% in the various experiments. Other ad-hoc work extends our investigations into combining global similarities, giving an overall indication of how a document matches a query, with local similarities identifying a smaller part of the document which matches the query. Using an overlapping text window definition of local, we achieve a 16% improvement.

579 citations


Patent
Hinrich Schuetze1
16 Jun 1994
TL;DR: In this paper, a thesaurus of word vectors is formed for the words in the corpus of documents, which represent global lexical co-occurrence patterns and relationships between word neighbors.
Abstract: A method and apparatus accesses relevant documents based on a query. A thesaurus of word vectors is formed for the words in the corpus of documents. The word vectors represent global lexical co-occurrence patterns and relationships between word neighbors. Document vectors, which are formed from the combination of word vectors, are in the same multi-dimensional space as the word vectors. A singular value decomposition is used to reduce the dimensionality of the document vectors. A query vector is formed from the combination of word vectors associated with the words in the query. The query vector and document vectors are compared to determine the relevant documents. The query vector can be divided into several factor clusters to form factor vectors. The factor vectors are then compared to the document vectors to determine the ranking of the documents within the factor cluster.
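The pipeline above can be sketched in a few lines. This is a rough, hypothetical simplification: co-occurrence is approximated here by a term-document count matrix rather than the patent's word-neighbour windows, and the function names (`word_vectors`, `rank`) are invented for illustration.

```python
import numpy as np

def word_vectors(docs, k=2):
    """Build count-based word vectors and reduce dimensionality with an SVD."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            counts[index[w], j] += 1
    # Rows of U * S, truncated to k columns, are the reduced word vectors.
    U, S, _ = np.linalg.svd(counts, full_matrices=False)
    reduced = U[:, :k] * S[:k]
    return {w: reduced[index[w]] for w in vocab}

def rank(docs, query, vecs):
    """Form document and query vectors as sums of word vectors; rank by cosine."""
    def vec(text):
        v = sum(vecs[w] for w in text.split() if w in vecs)
        return v / (np.linalg.norm(v) + 1e-12)
    q = vec(query)
    scored = [(float(vec(d) @ q), j) for j, d in enumerate(docs)]
    return [j for _, j in sorted(scored, reverse=True)]
```

Because document and query vectors live in the same reduced space, a query can match documents that share no literal terms with it, which is the point of the global co-occurrence representation.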

519 citations


Proceedings ArticleDOI
Yiming Yang1
01 Aug 1994
TL;DR: The simplicity of the model, the high recall-precision rates, and the efficient computation together make ExpNet preferable as a practical solution for real-world applications.
Abstract: Expert Network (ExpNet) is our new approach to automatic categorization and retrieval of natural language texts. We use a training set of texts with expert-assigned categories to construct a network which approximately reflects the conditional probabilities of categories given a text. The input nodes of the network are words in the training texts, the nodes on the intermediate level are the training texts, and the output nodes are categories. The links between nodes are computed based on statistics of the word distribution and the category distribution over the training set. ExpNet is used for relevance ranking of candidate categories of an arbitrary text in the case of text categorization, and for relevance ranking of documents via categories in the case of text retrieval. We have evaluated ExpNet in categorization and retrieval on a document collection of the MEDLINE database, and observed a performance in recall and precision comparable to the Linear Least Squares Fit (LLSF) mapping method, and significantly better than other methods tested. Computationally, ExpNet has an O(N log N) time complexity which is much more efficient than the cubic complexity of the LLSF method. The simplicity of the model, the high recall-precision rates, and the efficient computation together make ExpNet preferable as a practical solution for real-world applications.
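The three-layer word-text-category propagation can be sketched as below. This is an assumed simplification: the link weights here are raw within-document term frequencies shared equally across a text's categories, not the paper's exact weighting statistics.

```python
from collections import Counter, defaultdict

class ExpNet:
    """Minimal sketch of a three-layer expert network."""

    def __init__(self, training):
        # training: list of (text, list-of-categories) pairs.
        self.word_to_docs = defaultdict(list)   # input layer -> middle layer
        self.doc_cats = []                      # middle layer -> output layer
        for i, (text, cats) in enumerate(training):
            counts = Counter(text.split())
            total = sum(counts.values())
            for w, c in counts.items():
                self.word_to_docs[w].append((i, c / total))
            self.doc_cats.append(cats)

    def rank_categories(self, text):
        """Activate word nodes, propagate through training texts to categories."""
        scores = Counter()
        for w in set(text.split()):
            for i, weight in self.word_to_docs.get(w, []):
                for cat in self.doc_cats[i]:
                    scores[cat] += weight / len(self.doc_cats[i])
        return [c for c, _ in scores.most_common()]
```

Ranking an arbitrary text is a single forward pass over its words' posting lists, which is where the method's efficiency relative to a cubic-cost fit comes from.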

457 citations


Proceedings ArticleDOI
24 May 1994
TL;DR: This work presents a bottom-up algorithm that generates an efficient query evaluation plan based on cost estimates, and identifies a number of directions in which future research can be directed.
Abstract: Many applications require the ability to manipulate sequences of data. We motivate the importance of sequence query processing, and present a framework for the optimization of sequence queries based on several novel techniques. These include query transformations, optimizations that utilize meta-data, and caching of intermediate results. We present a bottom-up algorithm that generates an efficient query evaluation plan based on cost estimates. This work also identifies a number of directions in which future research can be directed.

161 citations


Proceedings ArticleDOI
01 Aug 1994
TL;DR: The experiments show that on average a current generation natural language system provides better retrieval performance than expert searchers using a Boolean retrieval system when searching full-text legal materials.
Abstract: The results of experiments comparing the relative performance of natural language and Boolean query formulations are presented. The experiments show that on average a current generation natural language system provides better retrieval performance than expert searchers using a Boolean retrieval system when searching full-text legal materials. Methodological issues are reviewed and the effect of database size on query formulation strategy is discussed.

140 citations


Journal ArticleDOI
01 Jan 1994
TL;DR: This work approaches the subsumption problem in the setting of object-oriented databases, and finds that reasoning techniques from Artificial Intelligence can be applied and yield efficient algorithms.
Abstract: Subsumption between queries is valuable information, e.g., for semantic query optimization. We approach the subsumption problem in the setting of object-oriented databases, and find that reasoning techniques from Artificial Intelligence can be applied and yield efficient algorithms.

130 citations


Proceedings ArticleDOI
Michael Persin1
01 Aug 1994
TL;DR: The experiments show that the proposed evaluation technique reduces both main memory usage and query evaluation time, based on early recognition of which documents are likely to be highly ranked, without degradation in retrieval effectiveness.
Abstract: Ranking techniques are effective for finding answers in document collections but the cost of evaluation of ranked queries can be unacceptably high. We propose an evaluation technique that reduces both main memory usage and query evaluation time, based on early recognition of which documents are likely to be highly ranked. Our experiments show that, for our test data, the proposed technique evaluates queries in 20% of the time and 2% of the memory taken by the standard inverted file implementation, without degradation in retrieval effectiveness.
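One way to realize "early recognition" is to cap the number of accumulators and sort each inverted list by in-document weight, so the most promising documents for every term are seen first. The sketch below is an assumed simplification of the paper's method; `index`, `ranked_query`, and the accumulator limit are illustrative names.

```python
import heapq

def ranked_query(index, query, max_accumulators=1000):
    """Accumulator-limited ranked evaluation over frequency-sorted postings.

    `index` maps term -> list of (doc_id, weight), sorted by decreasing
    weight.  New accumulators are created only while under the limit, which
    bounds memory; existing accumulators are always updated.
    """
    acc = {}
    # Process rarest terms first: they discriminate most between documents.
    for term in sorted(query, key=lambda t: len(index.get(t, []))):
        for doc, w in index.get(term, []):
            if doc in acc:
                acc[doc] += w
            elif len(acc) < max_accumulators:
                acc[doc] = w
    return heapq.nlargest(10, acc.items(), key=lambda kv: kv[1])
```

Documents that only appear late, with low weights, never get an accumulator at all, which is the source of both the memory and the time savings.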

113 citations


Proceedings ArticleDOI
11 Dec 1994
TL;DR: A state-of-the-art review of ranking, selection and multiple-comparison procedures that are used to compare system designs via computer simulation is presented.
Abstract: We present a state-of-the-art review of ranking, selection and multiple-comparison procedures that are used to compare system designs via computer simulation. We describe methods for four classes of problems: screening a large number of system designs, selecting the best system, comparing all systems to a standard and comparing alternatives to a default. Rather than give a comprehensive review, we present the methods we would be likely to use in practice and emphasize recent results. Where possible, we unify the ranking-and-selection and multiple-comparison perspectives.

86 citations


Journal ArticleDOI
TL;DR: The current version of the Metathesaurus, as utilized by SAPHIRE, was unable to represent the conceptual content of one-fourth of physician-generated MEDLINE queries.

85 citations


Journal ArticleDOI
TL;DR: A linguistic extension has been generated, starting from an existing Boolean weighted retrieval model and formalized within fuzzy set theory, in which numeric query weights are replaced by linguistic descriptors that specify the degree of importance of the terms.

85 citations


Proceedings ArticleDOI
15 Oct 1994
TL;DR: Three extensions are applied to the basic colour-pair technique: the development of a similarity-based ranking formula for colour-pairs matching; the use of segmented objects for object-level retrieval; and the inclusion of perceptually similar colours for fuzzy retrieval.
Abstract: Most general content-based image retrieval techniques use colour and texture as main retrieval indices. A recent technique uses colour pairs to model distinct object boundaries for retrieval. These techniques have been applied to overall image contents without taking into account the characteristics of individual objects. While the techniques work well for the retrieval of images with similar overall contents (including backgrounds), their accuracies are limited because they are unable to take advantage of individual object's visual characteristics, and to perform object-level retrieval. This paper looks specifically at the use of colour-pair technique for fuzzy object-level image retrieval. Three extensions are applied to the basic colour-pair technique: (a) the development of a similarity-based ranking formula for colour-pairs matching; (b) the use of segmented objects for object-level retrieval; and (c) the inclusion of perceptually similar colours for fuzzy retrieval. A computer-aided segmentation technique is developed to segment the images' contents. Experimental results indicate that the extensions have led to substantial improvements in the retrieval performance. These extensions are sufficiently general and can be applied to other content-based image retrieval techniques.
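A colour-pair match can be sketched as a histogram of adjacent, cross-boundary colour pairs compared by normalised intersection. This is a rough stand-in, not the paper's formula: the grid, the pair extraction, and the similarity score are all illustrative assumptions.

```python
from collections import Counter

def colour_pairs(grid):
    """Histogram of adjacent colour pairs across boundaries in a 2-D grid
    of colour labels (a toy proxy for object-boundary colour pairs)."""
    pairs = Counter()
    for r, row in enumerate(grid):
        for c, colour in enumerate(row):
            for dr, dc in ((0, 1), (1, 0)):        # right and down neighbours
                nr, nc = r + dr, c + dc
                if nr < len(grid) and nc < len(row):
                    other = grid[nr][nc]
                    if other != colour:            # only cross-boundary pairs
                        pairs[tuple(sorted((colour, other)))] += 1
    return pairs

def similarity(h1, h2):
    """Normalised histogram intersection, usable as a ranking score."""
    inter = sum(min(h1[p], h2[p]) for p in h1)
    return inter / max(1, max(sum(h1.values()), sum(h2.values())))
```

The fuzzy-colour extension of the paper would additionally credit perceptually similar pairs; here only exact pair matches score.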

Proceedings ArticleDOI
27 Jun 1994
TL;DR: Genetic programming is applied to an information retrieval system in order to improve Boolean query formulation via relevance feedback, which brings together the concepts of information retrieval and genetic programming.
Abstract: Genetic programming is applied to an information retrieval system in order to improve Boolean query formulation via relevance feedback. This approach brings together the concepts of information retrieval and genetic programming. Documents are viewed as vectors in index term space. A Boolean query, viewed as a parse tree, is an organism in the genetic programming sense. Through the mechanisms of genetic programming, the query is modified in order to improve precision and recall. Relevance feedback is incorporated, in part, via user defined measures over a trial set of documents. The fitness of a candidate query can be expressed directly as a function of the relevance of the retrieved set. Preliminary results based on a testbed are given. The form of the fitness function has a significant effect upon performance and the proper fitness functions take into account relevance based on topicality (and perhaps other factors).
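The two key pieces, evaluating a Boolean parse tree and scoring it as an organism, can be sketched as follows. The harmonic mean of precision and recall is one plausible fitness choice, not necessarily the paper's; the tuple encoding of trees is an assumption.

```python
def evaluate(query, doc_terms):
    """Evaluate a Boolean parse tree against one document's term set.
    Trees are nested tuples, e.g. ("and", "cat", ("or", "dog", "pet"))."""
    if isinstance(query, str):
        return query in doc_terms
    op, *args = query
    results = [evaluate(a, doc_terms) for a in args]
    return all(results) if op == "and" else any(results)

def fitness(query, docs, relevant):
    """Fitness of a candidate query organism over a judged trial set:
    harmonic mean of precision and recall of the retrieved set."""
    retrieved = {i for i, d in enumerate(docs) if evaluate(query, d)}
    if not retrieved or not relevant:
        return 0.0
    p = len(retrieved & relevant) / len(retrieved)
    r = len(retrieved & relevant) / len(relevant)
    return 2 * p * r / (p + r) if p + r else 0.0
```

A genetic programming loop would then apply crossover and mutation to the tuple trees, keeping organisms with higher fitness.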


Book
01 Apr 1994
TL;DR: A graphical interface is described, called Cougar, that displays retrieved documents in terms of interactions among their automatically-assigned main topics, thus allowing users to familiarize themselves with the topics and terminology of a text collection.
Abstract: This dissertation investigates the role of contextual information in the automated retrieval and display of full-text documents, using robust natural language processing algorithms to automatically detect structure in and assign topic labels to texts. Many long texts comprise complex topic and subtopic structure, a fact ignored by existing information access methods. I present two algorithms which detect such structure, and two visual display paradigms which use the results of these algorithms to show the interactions of multiple main topics, multiple subtopics, and the relations between main topics and subtopics. The first algorithm, called TextTiling, recognizes the subtopic structure of texts as dictated by their content. It uses domain-independent lexical frequency and distribution information to partition texts into multi-paragraph passages. The results are found to correspond well to reader judgments of major subtopic boundaries. The second algorithm assigns multiple main topic labels to each text, where the labels are chosen from pre-defined, intuitive category sets; the algorithm is trained on unlabeled text. A new iconic representation, called TileBars, uses TextTiles to simultaneously and compactly display query term frequency, query term distribution and relative document length. This representation provides an informative alternative to ranking long texts according to their overall similarity to a query. For example, a user can choose to view those documents that have an extended discussion of one set of terms and a brief but overlapping discussion of a second set of terms. This representation also allows for relevance feedback on patterns of term distribution. TileBars display documents only in terms of words supplied in the user query. For a given retrieved text, if the query words do not correspond to its main topics, the user cannot discern in what context the query terms were used. For example, a query on "contaminants" may retrieve documents whose main topics relate to nuclear power, food, or oil spills. To address this issue, I describe a graphical interface, called Cougar, that displays retrieved documents in terms of interactions among their automatically-assigned main topics, thus allowing users to familiarize themselves with the topics and terminology of a text collection.
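The core of a TextTiling-style pass is comparing lexical overlap between adjacent blocks and placing subtopic boundaries at low-similarity valleys. The toy version below uses a raw Jaccard score against a fixed threshold; the real algorithm smooths the score sequence and selects boundaries by valley depth.

```python
def tile_boundaries(blocks, threshold=0.1):
    """Place a subtopic boundary between adjacent text blocks whose word
    overlap (Jaccard similarity) falls below a threshold (toy sketch)."""
    def words(b):
        return set(b.lower().split())

    boundaries = []
    for i in range(len(blocks) - 1):
        a, b = words(blocks[i]), words(blocks[i + 1])
        sim = len(a & b) / len(a | b) if a | b else 0.0
        if sim < threshold:
            boundaries.append(i + 1)   # boundary falls before block i+1
    return boundaries
```

Because the scoring is purely lexical and domain-independent, the same pass works on any text collection without training.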

Proceedings ArticleDOI
01 Aug 1994
TL;DR: The results show that the most effective sources were the user's written question statement, user terms derived during the interaction, and terms selected from particular database fields.
Abstract: To improve information retrieval effectiveness, research in both the algorithmic and human approaches to query expansion is required. This paper uses the human approach to examine the selection and effectiveness of search term sources for query expansion. The results show that the most effective sources were the user's written question statement, user terms derived during the interaction, and terms selected from particular database fields. These findings indicate the need for the design and testing of automatic relevance feedback techniques that place greater emphasis on these sources.

Proceedings Article
01 Jan 1994
TL;DR: The system description for benchmarking the authors' retrieval strategy on category B of TREC-3, i.e. on c.550 Mbytes of Wall Street Journal newspaper texts, is presented; the results of the retrieval system in terms of precision and recall are disappointing.
Abstract: This paper describes an approach to information retrieval based on a syntactic analysis of the document texts and user queries and, from that analysis, the construction of tree structures (TSAs) to encode and capture language ambiguities. TSAs are constructed at the clause level, so each document may be represented by many TSAs and each query by several TSAs. The TSAs from documents and from queries are then matched, and the degrees of overlap between individual TSAs are computed and then aggregated to yield a score for each document, which is then used in ranking the collection. This paper presents the system description when benchmarking our retrieval strategy on category B of TREC-3, i.e. on c.550 Mbytes of the Wall Street Journal newspaper texts. The implementation is based on a two-stage retrieval where a statistically-based pre-fetch retrieval retrieves the set of WSJ articles for the more computationally expensive language-based processing. The results of our retrieval system in terms of precision and recall are disappointing, and an analysis of why is also included. Part of this analysis includes a direct comparison between our system and some mainstream IR approaches. In addition to performing ad hoc retrieval on texts in English, we have also performed ad hoc retrieval on texts in Spanish using a weighted trigram approach, and this is outlined and performance results given in an appendix.

Journal ArticleDOI
01 Dec 1994
TL;DR: An information retrieval system that allows simultaneous search of text and speech documents is presented, and it is shown that retrieval effectiveness based on such a small indexing vocabulary is similar to the retrieval effectiveness of a Boolean retrieval system.
Abstract: We present an information retrieval system that allows simultaneous search of text and speech documents. The retrieval system accepts vague queries and performs a best-match search to find those documents that are relevant to the query. The output of the retrieval system is a list of ranked documents where the documents at the top of the list best satisfy the user's information need. The relevance of the documents is estimated by means of metadata (document description vectors). The metadata is automatically generated, and it is organized such that queries can be processed efficiently. We introduce a controlled indexing vocabulary for both speech and text documents. The size of the new indexing vocabulary is small (1,000 features) compared with the sizes of indexing vocabularies of conventional text retrieval (10,000 to 100,000 features). We show that the retrieval effectiveness based on such a small indexing vocabulary is similar to the retrieval effectiveness of a Boolean retrieval system.

01 Aug 1994
TL;DR: An approach based on an adaptive technique of "genetic algorithms" for modifying a user's query to improve the retrieval results is presented, achieving substantially higher precision at each fixed recall level than results reported by other research groups.
Abstract: The ability to adjust information retrieval system parameters on the basis of the user's relevance feedback to improve retrieval effectiveness has become an important and active research area. One such adjustment is to modify the query, which reflects the user's information request, via judgments of the previously retrieved results. In this paper we present our idea, based on an adaptive technique of "genetic algorithms," for modifying a user's query to improve the retrieval results. The effectiveness of query modification has been tested on the Cranfield Collection, a classical set of documents, queries and known relevance responses to each query. Results of this study show that the method is highly effective, achieving substantially higher precision levels at each fixed recall level than those reported by other research groups.

Proceedings ArticleDOI
17 Oct 1994
TL;DR: The paper makes suggestions for the generation of a user model as a basis for an adaptive visualization system to extract information about the user by involving the user in interactive computer tests and games.
Abstract: Meaningful scientific visualizations benefit the interpretation of scientific data, concepts and processes. To ensure meaningful visualizations, the visualization system needs to adapt to desires, disabilities and abilities of the user, interpretation aim, resources (hardware, software) available, and the form and content of the data to be visualized. We suggest describing these characteristics with four models: user model, problem domain/task model, resource model and data model. The paper makes suggestions for the generation of a user model as a basis for an adaptive visualization system. We propose to extract information about the user by involving the user in interactive computer tests and games. Relevant abilities tested are color perception, color memory, color ranking, mental rotation, and fine motor coordination.

Journal ArticleDOI
TL;DR: It is shown through performance comparison that the proposed E-Relevance algorithm achieves higher retrieval effectiveness than the others proposed earlier, and avoids the various problems of previous thesaurus-based ranking algorithms.
Abstract: In this paper we investigate document ranking methods in thesaurus-based boolean retrieval systems, and propose a new thesaurus-based ranking algorithm called the Extended Relevance (E-Relevance) algorithm. The E-Relevance algorithm integrates the extended boolean model and the thesaurus-based relevance algorithm. Since the E-Relevance algorithm has all the desirable properties of the extended boolean model, it avoids the various problems of previous thesaurus-based ranking algorithms. The E-Relevance algorithm also ranks documents effectively by using term dependence information from the thesaurus. We have shown through performance comparison that the proposed algorithm achieves higher retrieval effectiveness than the others proposed earlier.

Journal ArticleDOI
TL;DR: A search strategy for hypertext systems based on an extended Boolean model (the p-norm scheme) and supplemented it with links to improve the ranking of the retrieved items in a sequence most likely to fulfill the intent of the user is designed and implemented.
Abstract: In proposing a searching strategy well suited to the hypertext environment, we have considered four criteria: (1) the retrieval scheme should be integrated into a large hypertext environment; (2) the retrieval process should be operable with an unrestricted text collection; (3) the processing time should be reasonable; and (4) the system should be capable of learning in order to improve its retrieval effectiveness. To satisfy these four criteria, we have designed and implemented a search strategy for hypertext systems based on an extended Boolean model (the p-norm scheme) and supplemented it with links to improve the ranking of the retrieved items in a sequence most likely to fulfill the intent of the user. These links, representing additional information about document content, are established according to the requests and relevance judgments. Using a fully automatic procedure, our retrieval scheme can be applied to most existing systems. Based on the CACM test collection, which includes 3,204 documents and the CISI corpus (1,460 documents), we have built a hypertext and evaluated our proposed strategy. The retrieval effectiveness of our solution presents encouraging results.
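The p-norm scheme mentioned above interpolates between strict Boolean logic and vector-style ranking. Given term weights in [0, 1] for a document against the clauses of a query, the standard p-norm formulas can be written directly (a generic sketch of the model, not this paper's link-augmented variant):

```python
def p_norm_or(weights, p=2.0):
    """Extended Boolean OR similarity: p=1 behaves like a vector sum,
    and large p approaches the strict Boolean OR."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1 / p)

def p_norm_and(weights, p=2.0):
    """Extended Boolean AND similarity: missing clauses are penalised
    softly rather than zeroing the score, as strict AND would."""
    n = len(weights)
    return 1 - (sum((1 - w) ** p for w in weights) / n) ** (1 / p)
```

For a document matching one of two OR'ed terms, `p_norm_or([1.0, 0.0])` gives an intermediate score rather than the all-or-nothing 1 or 0 of pure Boolean retrieval, which is what makes ranking possible.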

Proceedings ArticleDOI
Peter Anick1
01 Aug 1994
TL;DR: The challenges of tuning an IR system to the domain of computer troubleshooting, where user queries tend to be very short and natural language query terms are intermixed with terminology from a variety of technical sublanguages are considered.
Abstract: There has been much research in full-text information retrieval on automated and semi-automated methods of query expansion to improve the effectiveness of user queries. In this paper we consider the challenges of tuning an IR system to the domain of computer troubleshooting, where user queries tend to be very short and natural language query terms are intermixed with terminology from a variety of technical sublanguages. A number of heuristic techniques for domain knowledge acquisition are described in which the complementary contributions of query log data and corpus analysis are exploited. We discuss the implications of sublanguage domain tuning for run-time query expansion tools and document indexing, arguing that the conventional devices for more purely "natural language" domains may be inadequate.

Journal ArticleDOI
TL;DR: This paper depart from deductive paradigm with object-oriented extensions, then relax the too strict modus ponens in the classic propositional logic by appropriate inference rules that would capture the relevance of information in the document to the information needed by the user.
Abstract: Relevance of the retrieved documents to a query is, in the sense of information retrieval, a judgement of the user rather than the material implication in the sense of logic. In this paper we depart from the deductive paradigm with object-oriented extensions, then relax the overly strict modus ponens of classical propositional logic with appropriate inference rules that capture the relevance of information in the document to the information needed by the user. In such a framework, a document is relevant to a query if the latter can be deduced from the set of axioms associated with the document using inference rules. As various kinds of inference rules will be used in the deduction, we distinguish between logical, strict, and plausible rules. Answering a query in such a framework can be done either by a special query processor that supports different kinds of inference mechanisms, or by relaxing the original query so that it can be evaluated by an ordinary query processor. Instead of suggesting a new logic model, we make use of the query answering machinery of deductive and object-oriented databases in this approach.

Proceedings ArticleDOI
14 Feb 1994
TL;DR: The methods described in the paper have been used to build a retrieval system with which it is possible to process ranked queries of 40 terms in about 5% of the space required by previous implementations; in as little as 25% of the time; and without measurable degradation in retrieval effectiveness.
Abstract: Ranking techniques have long been suggested as alternatives to conventional Boolean methods for searching document collections. The cost of computing a ranking is, however, greater than the cost of performing a Boolean search, in terms of both memory space and processing time. The authors consider the resources required by the cosine method of ranking, and show that, with a careful application of indexing and selection techniques, both the space and the time required by ranking can be substantially reduced. The methods described in the paper have been used to build a retrieval system with which it is possible to process ranked queries of 40 terms in about 5% of the space required by previous implementations; in as little as 25% of the time; and without measurable degradation in retrieval effectiveness.

Proceedings Article
16 Jun 1994
TL;DR: Among other results, it is shown that χ_r(G) can be computed in polynomial time when restricted to graphs with treewidth at most k, for any fixed k.

Proceedings ArticleDOI
IJsbrand Jan Aalbersberg1
01 Aug 1994
TL;DR: A new full-text document retrieval model that is based on comparing occurrence frequency rank numbers of terms in queries and documents is introduced.
Abstract: This paper introduces a new full-text document retrieval model that is based on comparing the occurrence frequency rank numbers of terms in queries and documents.
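A rank-number comparison can be sketched as follows: rank each term by its occurrence frequency in the query and in the document, then score the document by how closely its ranks track the query's over shared terms. The footrule-style distance below is an assumption for illustration; the paper's exact formula may differ.

```python
from collections import Counter

def freq_ranks(text):
    """Rank numbers of terms by occurrence frequency (1 = most frequent)."""
    counts = Counter(text.split())
    ordered = [w for w, _ in counts.most_common()]
    return {w: r + 1 for r, w in enumerate(ordered)}

def rank_score(query, doc):
    """Score a document by agreement between its term frequency ranks and
    the query's, over shared terms; more shared terms and smaller total
    rank displacement both raise the score."""
    qr, dr = freq_ranks(query), freq_ranks(doc)
    shared = set(qr) & set(dr)
    if not shared:
        return 0.0
    dist = sum(abs(qr[w] - dr[w]) for w in shared)
    return len(shared) / (1 + dist)
```

A document whose dominant terms mirror the query's dominant terms outranks one that merely contains the same words with inverted emphasis.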

Proceedings Article
11 Oct 1994
TL;DR: An architecture for an interactive retrieval system based on abduction is proposed, comprising a schema-level representation of the documents' contents and structure, an abductive retrieval engine, and a user interface which allows the user to control the inference process.
Abstract: The problem of automatic query expansion is studied in the context of a logic-based information retrieval system that employs - in contrast to approaches based on deductive reasoning - an abductive inference engine. Given a query, the abduction process yields a set of possible expansions to the query. An architecture for an interactive retrieval system based on abduction is proposed, comprising a schema-level representation of the documents' contents and structure, an abductive retrieval engine, and a user interface which allows the user to control the inference process. The retrieval engine was tested on a collection of SGML-structured texts. We report on experimental results in the last section of the paper.

01 Jan 1994
TL;DR: This paper counts, in O(n^3) time, the exact number of trees that can be used to evaluate a given query on n relations; the intermediate results of the counting procedure then serve to generate random, uniformly distributed operator trees in O(n^2) time per tree.
Abstract: In this paper we study the space of operator trees that can be used to answer a join query, with the goal of generating elements from this space at random. We solve the problem for queries with acyclic query graphs. We first count, in O(n^3) time, the exact number of trees that can be used to evaluate a given query on n relations. The intermediate results of the counting procedure then serve to generate random, uniformly distributed operator trees in O(n^2) time per tree. We also establish a mapping between the N operator trees for a query and the integers 1 through N, i.e. a ranking, and describe ranking and unranking procedures with complexity O(n^2) and O(n^2 log n), respectively.
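The count-then-unrank idea can be illustrated on the simplest case: binary operator-tree shapes over n leaves, whose count is the Catalan numbers. This is a simplification; the paper counts trees admissible for a specific acyclic query graph, which prunes this space.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_trees(n):
    """Number of binary operator-tree shapes over n leaves (Catalan
    recurrence over the size of the left subtree)."""
    if n <= 1:
        return 1
    return sum(count_trees(k) * count_trees(n - k) for k in range(1, n))

def unrank(n, r):
    """Return the r-th tree shape (0-based) over n leaves as nested tuples.
    The counting table partitions ranks into blocks by left-subtree size,
    mirroring the paper's mapping from {1..N} back to trees."""
    if n == 1:
        return "leaf"
    for k in range(1, n):                      # size of the left subtree
        block = count_trees(k) * count_trees(n - k)
        if r < block:
            rcount = count_trees(n - k)
            return (unrank(k, r // rcount), unrank(n - k, r % rcount))
        r -= block
    raise ValueError("rank out of range")
```

Drawing a uniform random tree is then just `unrank(n, random.randrange(count_trees(n)))`, since each rank maps to exactly one tree.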

Proceedings ArticleDOI
Chen1
01 Jan 1994
TL;DR: The ID5R algorithm, previously developed by Utgoff (1989), is introduced for "intelligent" and system-supported query processing in information retrieval and database management systems.
Abstract: This paper presents an incremental, inductive learning approach to query-by-examples for information retrieval (IR) and database management systems (DBMS). After briefly reviewing conventional information retrieval techniques and the prevailing database query paradigms, we introduce the ID5R algorithm, previously developed by Utgoff (1989), for "intelligent" and system-supported query processing.

Journal ArticleDOI
10 Nov 1994-Nature
TL;DR: Some surprising facts about the most productive institutions emerge when Japanese life-science research is subject to a novel type of assessment.
Abstract: Some surprising facts about the most productive institutions emerge when Japanese life-science research is subject to a novel type of assessment.