
Showing papers on "Ranking (information retrieval)" published in 2001


Proceedings ArticleDOI
01 Apr 2001
TL;DR: A set of techniques for the rank aggregation problem is developed and compared with well-known methods, with the goal of designing rank aggregation techniques that can combat spam in Web searches.
Abstract: We consider the problem of combining ranking results from various sources. In the context of the Web, the main applications include building meta-search engines, combining ranking functions, selecting documents based on multiple criteria, and improving search precision through word associations. We develop a set of techniques for the rank aggregation problem and compare their performance to that of well-known methods. A primary goal of our work is to design rank aggregation techniques that can effectively combat "spam," a serious problem in Web searches. Experiments show that our methods are simple, efficient, and effective.

1,982 citations
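
To make the rank aggregation problem above concrete, here is a minimal sketch of Borda-count aggregation, a classical positional method of the kind such techniques are typically compared against. It is not the technique proposed in the paper. The function name `borda_aggregate`, the toy engine lists, and the convention that an item missing from a list earns zero points from that list are all assumptions made for illustration.

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Combine several ranked lists (best first) using Borda counts."""
    # Universe of candidates: every item seen in any input ranking.
    universe = set()
    for ranking in rankings:
        universe.update(ranking)
    n = len(universe)
    scores = defaultdict(int)
    for ranking in rankings:
        for position, item in enumerate(ranking):
            # The item at position 0 earns n - 1 points, the next n - 2, ...
            scores[item] += n - 1 - position
    # Sort by descending score; ties are broken by item name for determinism.
    return sorted(universe, key=lambda item: (-scores[item], item))

# Hypothetical result lists from three search engines.
engines = [
    ["a", "b", "c", "d"],
    ["b", "a", "d", "c"],
    ["a", "c", "b"],  # partial list: "d" is absent and earns 0 points (assumption)
]
print(borda_aggregate(engines))
```

A positional scheme like this is easy to compute, but a single source that places a spam page at the top can still lift it noticeably, which is one motivation for studying more robust aggregation rules.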


Proceedings ArticleDOI
01 Oct 2001
TL;DR: This work proposes the use of a support vector machine active learning algorithm for conducting effective relevance feedback for image retrieval and achieves significantly higher search accuracy than traditional query refinement schemes after just three to four rounds of relevance feedback.
Abstract: Relevance feedback is often a critical component when designing image databases. With these databases it is difficult to specify queries directly and explicitly. Relevance feedback interactively determines a user's desired output or query concept by asking the user whether certain proposed images are relevant or not. For a relevance feedback algorithm to be effective, it must grasp a user's query concept accurately and quickly, while also only asking the user to label a small number of images. We propose the use of a support vector machine active learning algorithm for conducting effective relevance feedback for image retrieval. The algorithm selects the most informative images to query a user and quickly learns a boundary that separates the images that satisfy the user's query concept from the rest of the dataset. Experimental results show that our algorithm achieves significantly higher search accuracy than traditional query refinement schemes after just three to four rounds of relevance feedback.

1,512 citations


Proceedings ArticleDOI
05 Oct 2001
TL;DR: This paper proposes and evaluates two different approaches to updating a query language model based on feedback documents, one based on a generative probabilistic model of feedback documents and one based on minimization of the KL-divergence over feedback documents.
Abstract: The language modeling approach to retrieval has been shown to perform well empirically. One advantage of this new approach is its statistical foundations. However, feedback, as one important component in a retrieval system, has only been dealt with heuristically in this new retrieval approach: the original query is usually literally expanded by adding additional terms to it. Such expansion-based feedback creates an inconsistent interpretation of the original and the expanded query. In this paper, we present a more principled approach to feedback in the language modeling approach. Specifically, we treat feedback as updating the query language model based on the extra evidence carried by the feedback documents. Such a model-based feedback strategy easily fits into an extension of the language modeling approach. We propose and evaluate two different approaches to updating a query language model based on feedback documents, one based on a generative probabilistic model of feedback documents and one based on minimization of the KL-divergence over feedback documents. Experiment results show that both approaches are effective and outperform the Rocchio feedback approach.

852 citations


Journal ArticleDOI
01 Sep 2001
TL;DR: A framework for information retrieval that combines document models and query models using a probabilistic ranking function based on Bayesian decision theory is presented and an operational retrieval model that extends recent developments in the language modeling approach to information retrieval is suggested.
Abstract: We present a framework for information retrieval that combines document models and query models using a probabilistic ranking function based on Bayesian decision theory. The framework suggests an operational retrieval model that extends recent developments in the language modeling approach to information retrieval. A language model for each document is estimated, as well as a language model for each query, and the retrieval problem is cast in terms of risk minimization. The query language model can be exploited to model user preferences, the context of a query, synonymy and word senses. While recent work has incorporated word translation models for this purpose, we introduce a new method using Markov chains defined on a set of documents to estimate the query models. The Markov chain method has connections to algorithms from link analysis and social networks. The new approach is evaluated on TREC collections and compared to the basic language modeling approach and vector space models together with query expansion using Rocchio. Significant improvements are obtained over standard query expansion methods for strong baseline TF-IDF systems, with the greatest improvements attained for short queries on Web data.

823 citations


Proceedings Article
03 Jan 2001
TL;DR: A simple and efficient online algorithm is described, its performance in the mistake bound model is analyzed, its correctness is proved, and it outperforms online algorithms for regression and classification applied to ranking.
Abstract: We discuss the problem of ranking instances. In our framework each instance is associated with a rank or a rating, which is an integer from 1 to k. Our goal is to find a rank-prediction rule that assigns each instance a rank which is as close as possible to the instance's true rank. We describe a simple and efficient online algorithm, analyze its performance in the mistake bound model, and prove its correctness. We describe two sets of experiments, with synthetic data and with the EachMovie dataset for collaborative filtering. In the experiments we performed, our algorithm outperforms online algorithms for regression and classification applied to ranking.

657 citations
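
One way to picture an online rank-prediction rule of the kind described above is a perceptron-style learner that keeps a weight vector plus k - 1 ordered thresholds and corrects itself only after a mistake. The sketch below follows that idea. The abstract does not spell out the update, so the exact rule, the function names `predict_rank` and `online_update`, and the toy data are assumptions and may differ from the paper's algorithm.

```python
def predict_rank(w, b, x):
    """Return the smallest rank r (1-based) whose threshold exceeds the score."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    for r, threshold in enumerate(b, start=1):
        if score < threshold:
            return r
    return len(b) + 1  # the top rank k has an implicit threshold of +infinity

def online_update(w, b, x, y):
    """One perceptron-style step; mutates w and b only when the prediction is wrong."""
    if predict_rank(w, b, x) == y:
        return
    score = sum(wi * xi for wi, xi in zip(w, x))
    taus = []
    for r, threshold in enumerate(b, start=1):
        # Desired side of threshold r: the score should lie above it iff y > r.
        side = 1 if y > r else -1
        taus.append(side if (score - threshold) * side <= 0 else 0)
    step = sum(taus)
    for i in range(len(w)):
        w[i] += step * x[i]   # move the score toward the correct interval
    for r in range(len(b)):
        b[r] -= taus[r]       # shift only the thresholds that were violated

# Hypothetical 1-D training set with ranks 1..3 (k = 3, so two thresholds).
data = [([1.0], 1), ([2.0], 2), ([3.0], 3)]
w, b = [0.0], [0.0, 0.0]
for _ in range(20):
    for x, y in data:
        online_update(w, b, x, y)
print([predict_rank(w, b, x) for x, _ in data])
```

Treating the rank as an ordered label, rather than as a regression target or an unordered class, is what lets a single projection with several thresholds do the work.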


Journal ArticleDOI
TL;DR: This work presents a computationally simple and theoretically justified method for assigning scores to candidate expansion terms within Rocchio's framework for query reweighting, and discusses the effect on retrieval effectiveness of the main parameters involved in automatic query expansion.
Abstract: Techniques for automatic query expansion from top retrieved documents have shown promise for improving retrieval effectiveness on large collections; however, they often rely on an empirical ground, and there is a shortage of cross-system comparisons. Using ideas from Information Theory, we present a computationally simple and theoretically justified method for assigning scores to candidate expansion terms. Such scores are used to select and weight expansion terms within Rocchio's framework for query reweighting. We compare ranking with information-theoretic query expansion versus ranking with other query expansion techniques, showing that the former achieves better retrieval effectiveness on several performance measures. We also discuss the effect on retrieval effectiveness of the main parameters involved in automatic query expansion, such as data sparseness, query difficulty, number of selected documents, and number of selected terms, pointing out interesting relationships.

404 citations


Proceedings ArticleDOI
01 Sep 2001
TL;DR: XIRQL is a query language based on the document-centric view of XML that integrates ideas from logic-based probabilistic IR models with concepts from the database area.
Abstract: Based on the document-centric view of XML, we present the query language XIRQL. Current proposals for XML query languages lack most IR-related features, which are weighting and ranking, relevance-oriented search, datatypes with vague predicates, and semantic relativism. XIRQL integrates these features by using ideas from logic-based probabilistic IR models, in combination with concepts from the database area. For processing XIRQL queries, a path algebra is presented, that also serves as a starting point for query optimization.

332 citations


Patent
Krishna Bharat1
30 Jan 2001
TL;DR: A re-ranking component in the search engine refines the initially returned document rankings so that documents that are frequently cited in the initial set of relevant documents are preferred over documents that are less frequently cited within the initial set.
Abstract: A search engine for searching a corpus improves the relevancy of the results by refining a standard relevancy score based on the interconnectivity of the initially returned set of documents. The search engine obtains an initial set of relevant documents by matching a user's search terms to an index of a corpus. A re-ranking component in the search engine then refines the initially returned document rankings so that documents that are frequently cited in the initial set of relevant documents are preferred over documents that are less frequently cited within the initial set.

330 citations
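
The re-ranking idea in the patent abstract above can be sketched as follows: count, for each initially returned document, how many other documents in the same result set cite it, then fold that count into the original relevance score. The additive combination, the `weight` parameter, the function name, and the toy link data are assumptions for illustration; the abstract does not state how the refinement is computed.

```python
def rerank_by_local_citations(initial_results, links, weight=1.0):
    """Re-rank (doc_id, score) pairs using citations counted inside the result set only."""
    in_set = {doc_id for doc_id, _ in initial_results}
    local_citations = {doc_id: 0 for doc_id in in_set}
    for source in in_set:
        for target in set(links.get(source, ())):
            # Count a link only when both ends are in the initial set; skip self-links.
            if target in in_set and target != source:
                local_citations[target] += 1
    rescored = [
        (doc_id, score + weight * local_citations[doc_id])
        for doc_id, score in initial_results
    ]
    # Highest refined score first; ties broken by document id.
    return sorted(rescored, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical initial result list and outgoing links ("x9" is outside the set).
initial = [("d1", 3.0), ("d2", 2.5), ("d3", 2.2)]
links = {"d1": ["d3", "x9"], "d2": ["d3"], "d3": ["d1"]}
print([doc_id for doc_id, _ in rerank_by_local_citations(initial, links)])
```

Here "d3" starts last but is cited by both other results, so it moves to the top, while the link to "x9" is ignored because that document was not in the initial set.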


Patent
18 Jan 2001
TL;DR: The relevance of a document to a user's query is determined by calculating a similarity coefficient based on the structures of each pair of query predicates and document predicates.
Abstract: A relevancy ranking and clustering method and system that determines the relevance of a document relative to a user's query using a similarity comparison process. Input queries are parsed into one or more query predicate structures using an ontological parser. The ontological parser parses a set of known documents to generate one or more document predicate structures. A comparison of each query predicate structure with each document predicate structure is performed to determine a matching degree, represented by a real number. A multilevel modifier strategy is implemented to assign different relevance values to the different parts of each predicate structure match to calculate the predicate structure's matching degree. The relevance of a document to a user's query is determined by calculating a similarity coefficient, based on the structures of each pair of query predicates and document predicates. Documents are autonomously clustered using a self-organizing neural network that provides a coordinate system that makes judgments in a non-subjective fashion.

321 citations


Proceedings ArticleDOI
01 Sep 2001
TL;DR: In a different type of experiment, ranking based on link anchor text is twice as effective as ranking based on document content, even though both methods used the same BM25 formula.
Abstract: Link-based ranking methods have been described in the literature and applied in commercial Web search engines. However, according to recent TREC experiments, they are no better than traditional content-based methods. We conduct a different type of experiment, in which the task is to find the main entry point of a specific Web site. In our experiments, ranking based on link anchor text is twice as effective as ranking based on document content, even though both methods used the same BM25 formula. We obtained these results using two sets of 100 queries on an 18.5 million document set and another set of 100 on a 0.4 million document set. This site finding effectiveness begins to explain why many search engines have adopted link methods. It also opens a rich new area for effectiveness improvement, where traditional methods fail.

320 citations
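
Since the abstract above stresses that the same BM25 formula was applied to both anchor text and document content, a small sketch may help show how one scoring function can serve both representations. The version below uses a common BM25 variant with a non-negative IDF; the parameter defaults `k1=1.2` and `b=0.75`, the function name, and the toy pages are assumptions, and the paper's exact variant and settings may differ.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1 - b + b * len(d) / avg_len)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores

# Hypothetical pages, each represented twice: by its own content and by the
# anchor text of links pointing at it. Page 0 plays the site entry point.
pages_content = [["welcome", "to", "our", "site"],
                 ["news", "about", "acme", "acme", "products"],
                 ["unrelated", "text"]]
pages_anchor = [["acme", "acme", "home"], ["acme", "news"], ["other"]]
query = ["acme"]
content_scores = bm25_scores(query, pages_content)
anchor_scores = bm25_scores(query, pages_anchor)
print(content_scores, anchor_scores)
```

In this toy case the entry page never mentions the site name in its own text, so content scoring misses it, while anchor text written by linking pages ranks it first.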


Patent
19 Dec 2001
TL;DR: A system and method for performing domain-specific knowledge based metasearches is presented, in which a metasearch engine accesses and searches text-based documents using generic search engines while simultaneously accessing publication based databases, sequence databases, in-house proprietary databases, and any database capable of being interfaced with a web interface so as to produce search results in text format.
Abstract: A system and method for performing domain-specific knowledge based metasearches. A metasearch engine is provided for accessing and searching text-based documents using generic search engines while simultaneously being able to access publication based databases and sequence databases as well as in-house proprietary databases and any database capable of being interfaced with a web interface so as to produce search results in text format. A data mining module is also provided for organizing raw data obtained by unsupervised clustering, simple relevance ranking, and categorization, all of which are done independently of one another. The system is capable of storing previous search data for use in query refinement or subsequent searches based upon the stored data. A search results collection browser may be provided for analyzing current browsing patterns of the user for developing weighting factors to be used in ordering the results of future searches.

Proceedings ArticleDOI
01 Sep 2001
TL;DR: A new evaluation methodology is proposed that replaces human relevance judgments with a randomly selected mapping of documents to topics, referred to as pseudo-relevance judgments, and its initial results are described.
Abstract: The most prevalent experimental methodology for comparing the effectiveness of information retrieval systems requires a test collection, composed of a set of documents, a set of query topics, and a set of relevance judgments indicating which documents are relevant to which topics. It is well known that relevance judgments are not infallible, but recent retrospective investigation into results from the Text REtrieval Conference (TREC) has shown that differences in human judgments of relevance do not affect the relative measured performance of retrieval systems. Based on this result, we propose and describe the initial results of a new evaluation methodology which replaces human relevance judgments with a randomly selected mapping of documents to topics which we refer to as pseudo-relevance judgments. Rankings of systems with our methodology correlate positively with official TREC rankings, although the performance of the top systems is not predicted well. The correlations are stable over a variety of pool depths and sampling techniques. With improvements, such a methodology could be useful in evaluating systems such as World-Wide Web search engines, where the set of documents changes too often to make traditional collection construction techniques practical.

Book ChapterDOI
02 Apr 2001
TL;DR: An algorithm is presented to synthesize linear ranking functions that can establish termination of program cycles; representing systems of linear inequalities and sets of linear expressions as polyhedral cones allows this search to be reduced to the computation of polars, intersections and projections of polyhedral cones.
Abstract: Deductive verification of progress properties relies on finding ranking functions to prove termination of program cycles. We present an algorithm to synthesize linear ranking functions that can establish such termination. Fundamental to our approach is the representation of systems of linear inequalities and sets of linear expressions as polyhedral cones. This representation allows us to reduce the search for linear ranking functions to the computation of polars, intersections and projections of polyhedral cones, problems which have well-known solutions.

Proceedings ArticleDOI
01 Sep 2001
TL;DR: It is shown empirically that the score distributions of a number of text search engines on a per query basis may be fitted using an exponential distribution for the set of non-relevant documents and a normal distribution for the set of relevant documents.
Abstract: In this paper the score distributions of a number of text search engines are modeled. It is shown empirically that the score distributions on a per query basis may be fitted using an exponential distribution for the set of non-relevant documents and a normal distribution for the set of relevant documents. Experiments show that this model fits TREC-3 and TREC-4 data for not only probabilistic search engines like INQUERY but also vector space search engines like SMART for English. We have also used this model to fit the output of other search engines like LSI search engines and search engines indexing other languages like Chinese. It is then shown that given a query for which relevance information is not available, a mixture model consisting of an exponential and a normal distribution can be fitted to the score distribution. These distributions can be used to map the scores of a search engine to probabilities. We also discuss how the shape of the score distributions arises given certain assumptions about word distributions in documents. We hypothesize that all 'good' text search engines operating on any language have similar characteristics. This model has many possible applications. For example, the outputs of different search engines can be combined by averaging the probabilities (optimal if the search engines are independent) or by using the probabilities to select the best engine for each query. Results show that the technique performs as well as the best current combination techniques.
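
The mapping from raw scores to probabilities described above can be sketched with Bayes' rule once the two component densities are known. The snippet below assumes the mixture has already been fitted; the fitting step is omitted, and the parameter values, the `prior` mixing weight, and all function names are hypothetical choices for illustration, not values from the paper.

```python
import math

def normal_pdf(s, mu, sigma):
    """Density of the normal component (relevant documents)."""
    return math.exp(-0.5 * ((s - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def exponential_pdf(s, lam):
    """Density of the exponential component (non-relevant documents)."""
    return lam * math.exp(-lam * s) if s >= 0 else 0.0

def prob_relevant(s, prior, mu, sigma, lam):
    """Posterior probability of relevance for score s under the two-part mixture."""
    rel = prior * normal_pdf(s, mu, sigma)
    non = (1 - prior) * exponential_pdf(s, lam)
    return rel / (rel + non)

def combine_by_averaging(probabilities):
    """Fuse per-engine relevance probabilities for one document by averaging."""
    return sum(probabilities) / len(probabilities)

# Hypothetical fitted parameters for two engines with different score scales.
engine_a = dict(prior=0.1, mu=0.6, sigma=0.1, lam=10.0)
engine_b = dict(prior=0.1, mu=12.0, sigma=2.0, lam=0.5)

p_a = prob_relevant(0.6, **engine_a)    # score reported by engine A
p_b = prob_relevant(11.0, **engine_b)   # score reported by engine B
print(p_a, p_b, combine_by_averaging([p_a, p_b]))
```

Converting to probabilities first is what makes scores from engines with incompatible scales comparable before they are averaged.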

Proceedings ArticleDOI
01 Sep 2001
TL;DR: A new inverted file structure using quantized weights is described that provides superior retrieval effectiveness compared to conventional inverted file structures when early termination heuristics are employed, and so provides a better cost/performance compromise than previous inverted file organisations.
Abstract: Considerable research effort has been invested in improving the effectiveness of information retrieval systems. Techniques such as relevance feedback, thesaural expansion, and pivoting all provide better quality responses to queries when tested in standard evaluation frameworks. But such enhancements can add to the cost of evaluating queries. In this paper we consider the pragmatic issue of how to improve the cost-effectiveness of searching. We describe a new inverted file structure using quantized weights that provides superior retrieval effectiveness compared to conventional inverted file structures when early termination heuristics are employed. That is, we are able to reach similar effectiveness levels with less computational cost, and so provide a better cost/performance compromise than previous inverted file organisations.

Patent
23 Aug 2001
TL;DR: A method and system for obtaining consumer preferences from consumers over a communication network is presented, in which the system searches the product database for products or services based on the consumer's search criteria.
Abstract: A method and system for obtaining consumer preferences over a communication network from consumers. The system searches the product database for products or services based on consumer's search criteria. The system displays the products or services and/or advertisements related to the consumer's search criteria in accordance with the ranking parameter(s) specified by the user. The consumer's preferences, i.e., the search criteria and the ranking parameters(s), are stored in the database for future references, e.g., determine consumer trends, etc.

Book ChapterDOI
19 Sep 2001
TL;DR: An ontology of place is presented that combines limited coordinate data with qualitative spatial relationships between places and has been implemented with a semantic modelling system linking non-spatial conceptual hierarchies with the place ontology.
Abstract: Geographical context is required of many information retrieval tasks in which the target of the search may be documents, images or records which are referenced to geographical space only by means of place names. Often there may be an imprecise match between the query name and the names associated with candidate sources of information. There is a need therefore for geographical information retrieval facilities that can rank the relevance of candidate information with respect to geographical closeness as well as semantic closeness with respect to the topic of interest. Here we present an ontology of place that combines limited coordinate data with qualitative spatial relationships between places. This parsimonious model of place is intended to support information retrieval tasks that may be global in scope. The ontology has been implemented with a semantic modelling system linking non-spatial conceptual hierarchies with the place ontology. An hierarchical distance measure is combined with Euclidean distance between place centroids to create a hybrid spatial distance measure. This can be combined with thematic distance, based on classification semantics, to create an integrated semantic closeness measure that can be used for a relevance ranking of retrieved objects.

Proceedings ArticleDOI
01 Sep 2001
TL;DR: Experimental results show that query-expansion using document summaries can be considerably more effective than using full-document expansion and a novel approach to term-selection that separates the choice of relevant documents from the selection of a pool of potential expansion terms is presented.
Abstract: Query-expansion is an effective Relevance Feedback technique for improving performance in Information Retrieval. In general query-expansion methods select terms from the complete contents of relevant documents. One problem with this approach is that expansion terms unrelated to document relevance can be introduced into the modified query due to their presence in the relevant documents and distribution in the document collection. Motivated by the hypothesis that query-expansion terms should only be sought from the most relevant areas of a document, this investigation explores the use of document summaries in query-expansion. The investigation explores the use of both context-independent standard summaries and query-biased summaries. Experimental results using the Okapi BM25 probabilistic retrieval model with the TREC-8 ad hoc retrieval task show that query-expansion using document summaries can be considerably more effective than using full-document expansion. The paper also presents a novel approach to term-selection that separates the choice of relevant documents from the selection of a pool of potential expansion terms. Again, this technique is shown to be more effective than standard methods.

Patent
29 Jun 2001
TL;DR: A method and system for constructing a text summarization is presented, in which a user profile indicative of a user's interests is defined in terms of ontology concepts and a document's relevance to the user is determined based upon the user profile.
Abstract: A method and system for constructing a text summarization. At least one domain ontology that includes a set of concepts is selected. A user profile indicative of a user's interests is defined in terms of the ontology concepts. A document's relevance to the user is determined based upon the user profile. If the document is relevant, at least a portion of the ontology is used to extract concepts from the document. The degree of match between the extracted concepts and the user profile concepts is determined and the document text summary is generated if the degree of match exceeds a predetermined threshold. Generating the summary may include selecting sentences based on the concepts in the user profile, ranking the selected sentences by relevance to the user profile, selecting sentences for inclusion in the document text summary based upon the ranking, and merging the selected sentences into the document text summary.

Patent
13 Apr 2001
TL;DR: A weighted preference data search engine uses weighted preference information, including a plurality of search criteria and a corresponding plurality of weights indicating the relative importance of the search criteria, to search a data source and to provide an ordered result list.
Abstract: A search engine for databases, data streams, and other data sources allows user preferences as to the relative importance of search criteria to be used to rank the output of the search engine. A weighted preference generator generates weighted preference information including at least a plurality of weights corresponding to a plurality of search criteria. A weighted preference data search engine uses the weighted preference information to search a data source and to provide an ordered result list based upon the weighted preference information. A method for weighted preference data searching includes determining weighted preference information including a plurality of search criteria and a corresponding plurality of weights signifying the relative importance of the search criteria, and querying a data source and ranking the results based upon the weighted preference information. In addition to allowing client input of the relative importance of various search criteria, the system and method also preferably include the ability to provide a subjective ordering for at least some of the search criteria.

Journal ArticleDOI
TL;DR: A new type of passage is introduced, overlapping fragments of either fixed or variable length, and it is shown that ranking with these arbitrary passages gives substantial improvements in retrieval effectiveness over traditional document ranking schemes, particularly for queries on collections of long documents.
Abstract: Text retrieval systems store a great variety of documents, from abstracts, newspaper articles, and Web pages to journal articles, books, court transcripts, and legislation. Collections of diverse types of documents expose shortcomings in current approaches to ranking. Use of short fragments of documents, called passages, instead of whole documents can overcome these shortcomings: passage ranking provides convenient units of text to return to the user, can avoid the difficulties of comparing documents of different length, and enables identification of short blocks of relevant material among otherwise irrelevant text. In this article, we compare several kinds of passage in an extensive series of experiments. We introduce a new type of passage, overlapping fragments of either fixed or variable length. We show that ranking with these arbitrary passages gives substantial improvements in retrieval effectiveness over traditional document ranking schemes, particularly for queries on collections of long documents. Ranking with arbitrary passages shows consistent improvements compared to ranking with whole documents, and to ranking with previous passage types that depend on document structure or topic shifts in documents.
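
The overlapping fixed-length passages described above can be pictured as a sliding window over a document's tokens, with the document ranked by its best-scoring window. The sketch below shows only that windowing idea. The function names, the default window sizes, the extra final window that covers the document tail, and the use of a simple query-term hit count as the passage score are all assumptions, standing in for whatever similarity measure a real system would use.

```python
def fixed_length_passages(tokens, length, stride):
    """Return (start, passage) windows of `length` tokens, one every `stride` tokens."""
    if length <= 0 or stride <= 0:
        raise ValueError("length and stride must be positive")
    last_start = max(len(tokens) - length, 0)
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:
        starts.append(last_start)  # extra window so the document tail is covered
    return [(start, tokens[start:start + length]) for start in starts]

def best_passage_score(tokens, query_terms, length=4, stride=2):
    """Rank a document by its best passage, scored here by query-term hits."""
    query = set(query_terms)
    return max(
        sum(1 for tok in passage if tok in query)
        for _, passage in fixed_length_passages(tokens, length, stride)
    )

doc = "a b c d e f g".split()
print(fixed_length_passages(doc, 4, 2))
print(best_passage_score(doc, ["f", "g"]))
```

Because windows overlap when the stride is smaller than the length, a relevant run of text is less likely to be split across a passage boundary than with disjoint fixed blocks.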

Proceedings ArticleDOI
01 Sep 2001
TL;DR: Experimental results show that the two-level cache is superior, and that it allows increasing the maximum number of queries processed per second by a factor of three, while preserving the response time.
Abstract: We present an effective caching scheme that reduces the computing and I/O requirements of a Web search engine without altering its ranking characteristics. The novelty is a two-level caching scheme that simultaneously combines cached query results and cached inverted lists on a real case search engine. A set of log queries are used to measure and compare the performance and the scalability of the search engine with no cache, with the cache for query results, with the cache for inverted lists, and with the two-level cache. Experimental results show that the two-level cache is superior, and that it allows increasing the maximum number of queries processed per second by a factor of three, while preserving the response time. These results are new, have not been reported before, and demonstrate the importance of advanced caching schemes for real case search engines.
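
A minimal sketch of the two-level idea above: a first cache holds complete query results, and a second cache holds inverted lists that are consulted only when the result cache misses. The abstract does not describe replacement policies or query processing, so the LRU eviction, the conjunctive (AND) matching, the in-memory dictionary standing in for the on-disk index, and all class and attribute names are assumptions.

```python
from collections import OrderedDict

class LRUCache:
    """Small least-recently-used cache (the eviction policy is an assumption)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.items[key]
        self.misses += 1
        return None

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry

class TwoLevelCache:
    def __init__(self, index, results_capacity, lists_capacity):
        self.index = index  # term -> posting list; stands in for the on-disk index
        self.results = LRUCache(results_capacity)  # level 1: query results
        self.lists = LRUCache(lists_capacity)      # level 2: inverted lists
        self.disk_reads = 0

    def _postings(self, term):
        postings = self.lists.get(term)
        if postings is None:
            self.disk_reads += 1  # simulated I/O on an inverted-list miss
            postings = self.index.get(term, [])
            self.lists.put(term, postings)
        return postings

    def search(self, query):
        key = tuple(sorted(set(query.split())))
        cached = self.results.get(key)
        if cached is not None:
            return cached
        docs = None
        for term in key:  # conjunctive match over all query terms
            ids = set(self._postings(term))
            docs = ids if docs is None else docs & ids
        result = sorted(docs or [])
        self.results.put(key, result)
        return result

index = {"web": [1, 2, 3], "search": [2, 3, 4], "cache": [3]}
engine = TwoLevelCache(index, results_capacity=2, lists_capacity=2)
first = engine.search("web search")   # misses both levels: two simulated reads
second = engine.search("web search")  # answered from the result cache
third = engine.search("web cache")    # new query, but the "web" list is cached
print(first, second, third, engine.disk_reads)
```

The third query illustrates why the two levels complement each other: a query never seen before can still avoid part of its I/O when it shares terms with earlier queries.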

Book ChapterDOI
12 Jul 2001
TL;DR: A computational model is developed to determine the directional similarity between extended spatial objects, which forms a foundation for meaningful spatial similarity operators and confirms the cognitive plausibility of the similarity model.
Abstract: Like people who casually assess similarity between spatial scenes in their routine activities, users of pictorial databases are often interested in retrieving scenes that are similar to a given scene, and ranking them according to degrees of their match. For example, a town architect would like to query a database for the towns that have a landscape similar to the landscape of the site of a planned town. In this paper, we develop a computational model to determine the directional similarity between extended spatial objects, which forms a foundation for meaningful spatial similarity operators. The model is based on the direction-relation matrix. We derive how the similarity assessment of two direction-relation matrices corresponds to determining the least cost for transforming one direction-relation matrix into another. Using the transportation algorithm, the cost can be determined efficiently for pairs of arbitrary direction-relation matrices. The similarity values are evaluated empirically with several types of movements that create increasingly less similar direction relations. The tests confirm the cognitive plausibility of the similarity model.

Proceedings ArticleDOI
01 Sep 2001
TL;DR: A probabilistic cross-lingual retrieval system that uses a generative model to estimate the probability that a document in one language is relevant, given a query in another language, which achieves better retrieval results but requires more computation than the structural query translation technique.
Abstract: This work proposes and evaluates a probabilistic cross-lingual retrieval system. The system uses a generative model to estimate the probability that a document in one language is relevant, given a query in another language. An important component of the model is translation probabilities from terms in documents to terms in a query. Our approach is evaluated when 1) the only resource is a manually generated bilingual word list, 2) the only resource is a parallel corpus, and 3) both resources are combined in a mixture model. The combined resources produce about 90% of monolingual performance in retrieving Chinese documents. For Spanish the system achieves 85% of monolingual performance using only a pseudo-parallel Spanish-English corpus. Retrieval results are comparable with those of the structural query translation technique (Pirkola, 1998) when bilingual lexicons are used for query translation. When parallel texts in addition to conventional lexicons are used, it achieves better retrieval results but requires more computation than the structural query translation technique. It also produces slightly better results than using a machine translation system for CLIR, but the improvement over the MT system is not significant.

Journal ArticleDOI
TL;DR: The general architecture and function of an intelligent recommendation system aimed at supporting a leisure traveller in the task of selecting a tourist destination, bundling a set of products and composing a plan for the travel is described.
Abstract: This paper describes the general architecture and function of an intelligent recommendation system aimed at supporting a leisure traveller in the task of selecting a tourist destination, bundling a set of products and composing a plan for the travel. The system enables the user to identify his own destination and to personalize the travel by aggregating elementary items (additional locations to visit, services and activities). Case-Based Reasoning techniques enable the user to browse a repository of past travels and make possible the ranking of the elementary items included in a recommendation when these are selected from a catalogue. The system integrates data and information originating from external, already existent, tourist portals exploiting an XML-based mediator architecture, data mapping techniques, similarity-based retrieval and online analytical processing.

Patent
16 Mar 2001
TL;DR: A system and method for enhancing information retrieval that takes a user query and generates not only results corresponding to the exact query, but also results related to that query.
Abstract: A query information retrieval content enhancing system, and a method using the system, are disclosed. The system takes a user query and generates not only results corresponding to the exact query, but also results that relate to the exact query. The related results are generated by identifying query keywords and connectors and determining related keywords and/or connectors. The original keywords and connectors and the related keywords and connectors are then submitted to data mining routines that generate the related results. The normal results and related results are then made available to the user through an interface so that the user can review, analyze and manipulate the results.

Patent
04 Oct 2001
TL;DR: In this article, a document organizer processor may analyze the content of documents such as web pages and text documents, downloaded from a computer network, such as the Internet or an intranet, in response to a user's search query.
Abstract: Systems and methods for interactive document search, retrieval, categorization, and summarization are provided. A document organizer processor may analyze the content of documents, such as web pages and text documents, downloaded from a computer network, such as the Internet or an intranet, in response to a user's search query. After receiving a search query from a user, the processor may locate documents related to the query, parse words in the documents into a word set, filter out unnecessary words, group the documents into categories, provide labels for the categories, construct summaries of the documents in each category, determine if any additional words or phrases are to be recommended, present the labels and summaries to the user, and enable the user to iteratively refine the search.

Patent
08 May 2001
TL;DR: In this article, a similarity score is calculated for the query utilizing a feature vector that characterizes attributes and query words associated with the document, and a rank value is assigned to the document based upon the relevance score and the similarity score.
Abstract: A method of ranking search results includes producing a relevance score for a document in view of a query. A similarity score is calculated for the query utilizing a feature vector that characterizes attributes and query words associated with the document. A rank value is assigned to the document based upon the relevance score and the similarity score.
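
The abstract does not say how the feature vectors are compared or how the two scores are combined. As a rough illustration only, the sketch below assumes cosine similarity between a query feature vector and a document feature vector, and a linear combination of relevance and similarity controlled by a hypothetical `weight` parameter. All names and data are invented for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse feature vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_value(relevance, similarity, weight=0.5):
    """Assumed combination rule: a weighted sum of the two scores."""
    return weight * relevance + (1 - weight) * similarity

def rank_documents(query_vec, docs, weight=0.5):
    """Rank documents given as (doc_id, relevance_score, feature_vector) tuples."""
    scored = [
        (doc_id, rank_value(relevance, cosine(query_vec, vec), weight))
        for doc_id, relevance, vec in docs
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy usage: document "b" has the lower relevance score but a closer feature vector.
query_vec = {"jaguar": 1.0, "car": 1.0}
docs = [
    ("a", 0.6, {"jaguar": 1.0, "animal": 1.0}),
    ("b", 0.5, {"jaguar": 1.0, "car": 1.0}),
]
ranked = rank_documents(query_vec, docs)
```

With equal weighting, the similarity score is enough to move document "b" ahead of "a" despite its lower relevance score.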

Patent
22 Aug 2001
TL;DR: In this paper, a method, system, and computer program product for performing searching that generates improved queries, retrieves meaningful and relevant information, and presents the retrieved information to the user in a useful and comprehensive manner is described.
Abstract: A method, system, and computer program product for performing searching that generates improved queries, retrieves meaningful and relevant information, and presents the retrieved information to the user in a useful and comprehensive manner is described. The method of searching comprises the steps of: receiving from a user a search query requesting information, retrieving at least one recommendation relating to the search query, generating an expanded query based on the received query, performing a search using the expanded query to retrieve documents, and generating themes relating to the retrieved documents. The at least one recommendation relating to the search query is retrieved from a recommendation database. The recommendation database is generated by performing the steps of: performing data mining using user search query logs, user search patterns, and user profile information to generate a plurality of recommendations relating to search query strings, generating a data structure including the recommendations relating to search query strings, and generating a text index based on information in the data structure.

Patent
05 Dec 2001
TL;DR: In this paper, the authors propose a method to find a result for a query based on a large body of information such as a collection of documents, and rank the matches in order to provide the most relevant information.
Abstract: The invention offers new approaches to fulfilling an information need, in particular to finding a result for a query based on a large body of information such as a collection of documents. The invention accepts a query containing an unspecified portion that expresses the information need. The invention locates matches for the query within a body of information and returns the matches or portions thereof in addition to or instead of identifiers for documents in which the matches are found. The invention allows placement of term ordering restrictions, and allows intervening words between the search terms as they appear in the searched documents or contexts. The invention ranks the matches in order to provide the most relevant information. One preferred method of ranking considers the number of instances of a match among a plurality of documents. The invention further defines a new type of index that includes contexts in which terms occur and provides methods of searching such indices to fulfill an information need.
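
One way to picture a query with an unspecified portion is a phrase containing a wildcard whose fillers are collected across documents and ranked by frequency. The sketch below is an illustration under stated assumptions, not the patented method: it assumes a single `*` standing for one word, uses a regular expression instead of the context index the abstract mentions, and counts each filler once per document. The function name and sample texts are hypothetical.

```python
import re
from collections import Counter

def rank_matches(query, documents):
    """Return (filler, document_count) pairs for a query with one '*' wildcard.

    The wildcard is assumed to stand for exactly one word. A query without
    '*' raises ValueError when the split result is unpacked.
    """
    before, after = query.split("*", 1)
    pattern = re.compile(
        re.escape(before) + r"(\w+)" + re.escape(after),
        re.IGNORECASE,
    )
    counts = Counter()
    for text in documents:
        # Count each distinct filler once per document.
        fillers = {m.group(1).lower() for m in pattern.finditer(text)}
        counts.update(fillers)
    return counts.most_common()

# Toy usage with invented documents.
documents = [
    "The capital of France is Paris.",
    "Everyone knows the capital of France is Paris, not Lyon.",
    "Some say the capital of France is Lyon.",
]
answers = rank_matches("the capital of France is *", documents)
```

The function returns the matched fillers themselves rather than document identifiers, which mirrors the abstract's point that matches can be returned in addition to or instead of the documents containing them.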