
Showing papers on "Ranking (information retrieval) published in 1999"


Journal ArticleDOI
01 Sep 1999
TL;DR: It is shown that web users type in short queries, mostly look at the first 10 results only, and seldom modify the query, suggesting that traditional information retrieval techniques may not work well for answering web search requests.
Abstract: In this paper we present an analysis of an AltaVista Search Engine query log consisting of approximately 1 billion entries for search requests over a period of six weeks. This represents almost 285 million user sessions, each an attempt to fill a single information need. We present an analysis of individual queries, query duplication, and query sessions. We also present results of a correlation analysis of the log entries, studying the interaction of terms within queries. Our data supports the conjecture that web users differ significantly from the user assumed in the standard information retrieval literature. Specifically, we show that web users type in short queries, mostly look at the first 10 results only, and seldom modify the query. This suggests that traditional information retrieval techniques may not work well for answering web search requests. The correlation analysis showed that the most highly correlated items are constituents of phrases. This result indicates it may be useful for search engines to consider search terms as parts of phrases even if the user did not explicitly specify them as such.

1,255 citations
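The per-query statistics the study reports (query length, result pages viewed) can be computed from a log with a few lines of code. A minimal sketch, assuming a hypothetical tab-separated log format of query text and pages viewed (the real AltaVista log format differs):

```python
# Hypothetical sketch: compute mean terms per query and the fraction of
# sessions that view only the first result page. The log format here is
# invented for illustration, not the actual AltaVista log format.
def log_stats(log_lines):
    total_terms = 0
    first_page_only = 0
    for line in log_lines:
        query, pages_viewed = line.rsplit("\t", 1)
        total_terms += len(query.split())
        if int(pages_viewed) <= 1:
            first_page_only += 1
    n = len(log_lines)
    return total_terms / n, first_page_only / n

log = ["star wars\t1", "cheap flights london\t1", "java\t3", "mp3\t1"]
avg_len, frac_first = log_stats(log)  # 1.75 terms/query, 75% first page only
```

Aggregates of exactly this kind underlie the paper's findings of short queries and shallow result inspection.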


Journal ArticleDOI
01 Aug 1999
TL;DR: A simple, well motivated model of the document-to-query translation process is proposed, and an algorithm for learning the parameters of this model in an unsupervised manner from a collection of documents is described.
Abstract: We propose a new probabilistic approach to information retrieval based upon the ideas and methods of statistical machine translation. The central ingredient in this approach is a statistical model of how a user might distill or "translate" a given document into a query. To assess the relevance of a document to a user's query, we estimate the probability that the query would have been generated as a translation of the document, and factor in the user's general preferences in the form of a prior distribution over documents. We propose a simple, well motivated model of the document-to-query translation process, and describe an algorithm for learning the parameters of this model in an unsupervised manner from a collection of documents. As we show, one can view this approach as a generalization and justification of the "language modeling" strategy recently proposed by Ponte and Croft. In a series of experiments on TREC data, a simple translation-based retrieval system performs well in comparison to conventional retrieval techniques. This prototype system only begins to tap the full potential of translation-based retrieval.

651 citations
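The scoring rule described above ranks documents by the probability that the query was generated as a translation of the document, times a document prior. A minimal sketch with a hand-written toy translation table (the paper learns this table from a collection with an unsupervised, EM-style procedure):

```python
# Sketch of translation-model retrieval scoring: P(query | doc) * P(doc),
# where P(q | doc) sums translation probabilities t(q | w) over document
# words w. The translation table below is a toy assumption, not learned.
def score(query, doc, t_table, prior=1.0):
    p = prior
    dlen = len(doc)
    for q in query:
        # P(q | doc) = sum over doc words w of t(q | w) * P(w | doc)
        p *= sum(t_table.get((q, w), 0.0) for w in doc) / dlen
    return p

t = {("car", "car"): 1.0, ("car", "automobile"): 0.8, ("auto", "automobile"): 0.9}
d1 = ["automobile", "repair"]
d2 = ["car", "repair"]
s1 = score(["car"], d1, t)   # 0.8 / 2 = 0.4
s2 = score(["car"], d2, t)   # 1.0 / 2 = 0.5
```

Note how the model gives d1 credit for "automobile" even though the query term "car" never occurs in it; with t(q|q) = 1 for all terms and no other entries, the model degenerates to the Ponte-Croft language-modeling approach mentioned above.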


Patent
05 May 1999
TL;DR: A system for ranking search results obtained from an information retrieval system includes a search pre-processor, a search engine, and a search post-processor; the pre-processor determines the context of a search query by comparing the terms in the query with a predetermined user context profile.
Abstract: A system for ranking search results obtained from an information retrieval system includes a search pre-processor, a search engine and a search post-processor. The search preprocessor determines the context of the search query by comparing the terms in the search query with a predetermined user context profile. Preferably, the context profile is a user profile or a community profile, which includes a set of terms which have been rated by the user, community, or a recommender system. The search engine generates a search result comprising at least one item obtained from the information retrieval system. The search post-processor ranks each item returned in the search result in accordance with the context of the search query.

512 citations
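The post-processing step described above can be pictured as a re-ranking pass over the engine's results. A hedged sketch, in which the result fields and the additive boost formula are illustrative assumptions, not taken from the patent:

```python
# Sketch of context-profile re-ranking: boost each result's engine score
# by the profile ratings of the terms it contains. Field names and the
# additive scoring formula are assumptions for illustration.
def rerank(results, profile):
    def context_score(item):
        base, terms = item["score"], item["terms"]
        boost = sum(profile.get(t, 0.0) for t in terms)
        return base + boost
    return sorted(results, key=context_score, reverse=True)

profile = {"python": 2.0, "snake": -1.0}   # user- or community-rated terms
results = [
    {"url": "a", "score": 1.0, "terms": ["snake", "reptile"]},
    {"url": "b", "score": 0.9, "terms": ["python", "programming"]},
]
ranked = rerank(results, profile)  # "b" now outranks "a"
```

The key point is that the engine's own ranking is left intact and only reordered by the user's or community's rated vocabulary.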


Patent
Wen-Syan Li1, Quoc Vu1
22 Mar 1999
TL;DR: In this article, the authors present a hypermedia database for managing bookmarks, which allows a user to organize hypertext documents for querying, navigating, sharing and viewing, and also provides access control to the information in the database.
Abstract: The present invention provides a hypermedia database for managing bookmarks, which allows a user to organize hypertext documents for querying, navigating, sharing and viewing. In addition, the hypermedia database also provides access control to the information in the database. The hypermedia database of the present invention parses meta-data from bookmarked documents and indexes and classifies the documents. The present invention supports advanced query and navigation of a collection of bookmarks, especially providing various personalized bookmark services. In one embodiment, the present invention utilizes a proxy server to observe a user's access patterns to provide useful personalized services, such as automated URL bookmarking, document refresh, and bookmark expiration. In addition, a user may also specify various preferences in bookmark management, e.g., ranking schemes (i.e. by referral, access frequency, or popularity) and navigation tree fan-out. A subscription service which retrieves new or updated documents of user-specified interests is also provided.

428 citations


Proceedings Article
07 Sep 1999
TL;DR: This paper describes the query processor of Lore, a DBMS for XML-based data supporting an expressive query language and focuses primarily on Lore's cost-based query optimizer, including heuristics for reducing the large search space.
Abstract: XML is an emerging standard for data representation and exchange on the World-Wide Web. Due to the nature of information on the Web and the inherent flexibility of XML, we expect that much of the data encoded in XML will be semistructured: the data may be irregular or incomplete, and its structure may change rapidly or unpredictably. This paper describes the query processor of Lore, a DBMS for XML-based data supporting an expressive query language. We focus primarily on Lore's cost-based query optimizer. While all of the usual problems associated with cost-based query optimization apply to XML-based query languages, a number of additional problems arise, such as new kinds of indexing, more complicated notions of database statistics, and vastly different query execution strategies for different databases. We define appropriate logical and physical query plans, database statistics, and a cost model, and we describe plan enumeration including heuristics for reducing the large search space. Our optimizer is fully implemented in Lore and preliminary performance results are reported. This is a short version of the paper Query Optimization for Semistructured Data which is available at: http://www-db.stanford.edu/~mchughj/publications/qo.ps

419 citations


Proceedings Article
07 Sep 1999
TL;DR: This paper studies how to determine a range query to evaluate a top-k query by exploiting the statistics available to a relational DBMS, and the impact of the quality of these statistics on the retrieval efficiency of the resulting scheme.
Abstract: In many applications, users specify target values for certain attributes, without requiring exact matches to these values in return. Instead, the result to such queries is typically a rank of the "top k" tuples that best match the given attribute values. In this paper, we study the advantages and limitations of processing a top-k query by translating it into a single range query that traditional relational DBMSs can process efficiently. In particular, we study how to determine a range query to evaluate a top-k query by exploiting the statistics available to a relational DBMS, and the impact of the quality of these statistics on the retrieval efficiency of the resulting scheme.

328 citations
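The translation the paper studies can be sketched in a few lines: issue one range predicate around the target values, then rank the returned tuples by distance. In this sketch the window width is a hand-picked assumption; in the paper it is derived from the DBMS's statistics (histograms):

```python
# Sketch of answering a top-k query via a single range query: filter rows
# to a window around each attribute target, then rank by total distance.
# The width parameter stands in for what histograms would estimate.
def topk_via_range(rows, attr_targets, k, width):
    # 1. range predicate: every attribute within `width` of its target
    def in_range(row):
        return all(abs(row[a] - v) <= width for a, v in attr_targets.items())
    candidates = [r for r in rows if in_range(r)]
    # 2. rank candidates by summed distance to the targets
    def dist(row):
        return sum(abs(row[a] - v) for a, v in attr_targets.items())
    return sorted(candidates, key=dist)[:k]

rows = [{"price": 90}, {"price": 100}, {"price": 130}, {"price": 500}]
top = topk_via_range(rows, {"price": 100}, k=2, width=50)
```

If the window is too narrow the range query may return fewer than k tuples and must be restarted with a wider range; the cost of such restarts is exactly why the quality of the statistics matters.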


Proceedings ArticleDOI
07 Sep 1999
TL;DR: A framework for multidatabase query processing is described that fully includes the quality of information in many facets, such as completeness, timeliness, and accuracy.
Abstract: Integrated access to information that is spread over multiple, distributed, and heterogeneous sources is an important problem in many scientific and commercial domains. While much work has been done on query processing and choosing plans under cost criteria, very little is known about the important problem of incorporating the information quality aspect into query planning. In this paper we describe a framework for multidatabase query processing that fully includes the quality of information in many facets, such as completeness, timeliness, accuracy, etc. We seamlessly include information quality into a multidatabase query processor based on a view-rewriting mechanism. We model information quality at different levels to ultimately find a set of high-quality query-answering plans.

243 citations


Journal ArticleDOI
TL;DR: A general decision-theoretic model is developed, a formula estimating the number of relevant documents in a database from dictionary information is derived, and the computation of the optimum with several brokers is shown to proceed similarly to the single-broker case.
Abstract: In networked IR, a client submits a query to a broker, which is in contact with a large number of databases. In order to yield a maximum number of documents at minimum cost, the broker has to make estimates about the retrieval cost of each database, and then decide for each database whether or not to use it for the current query, and, if so, how many documents to retrieve from it. For this purpose, we develop a general decision-theoretic model and discuss different cost structures. Besides cost for retrieving relevant versus nonrelevant documents, we consider the following parameters for each database: expected retrieval quality, expected number of relevant documents in the database, and cost factors for query processing and document delivery. For computing the overall optimum, a divide-and-conquer algorithm is given. If there are several brokers knowing different databases, a preselection of brokers can only be performed heuristically, but the computation of the optimum can be done similarly to the single-broker case. In addition, we derive a formula which estimates the number of relevant documents in a database based on dictionary information.

212 citations
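The broker's per-database decision can be sketched as a small utility maximization. This is a toy rendition of the decision-theoretic idea, not the paper's divide-and-conquer algorithm: the parameter names and the capped-linear model of expected relevant documents are illustrative assumptions.

```python
# Toy sketch of the decision-theoretic broker: pick how many documents to
# retrieve from one database by maximizing expected relevant documents
# minus query-processing and per-document delivery costs. The capped
# linear precision model is an assumption for illustration.
def best_allocation(db, max_docs):
    best_i, best_u = 0, 0.0
    for i in range(1, max_docs + 1):
        # expected relevant documents among the top i, capped by the
        # estimated total number of relevant documents in the database
        expected_relevant = min(db["precision"] * i, db["relevant"])
        utility = expected_relevant - db["query_cost"] - i * db["doc_cost"]
        if utility > best_u:
            best_i, best_u = i, utility
    return best_i, best_u

db = {"precision": 0.4, "relevant": 2, "query_cost": 0.5, "doc_cost": 0.1}
n, u = best_allocation(db, max_docs=20)  # stops at 5 docs, utility 1.0
```

Running this per database and keeping the allocations with positive utility is the shape of the overall optimization the paper solves exactly.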


Journal ArticleDOI
01 Jun 1999
TL;DR: A theoretical and experimental analysis of the resulting search space and a novel query optimization algorithm that is designed to perform well under the different conditions that may arise are described.
Abstract: We consider the problem of query optimization in the presence of limitations on access patterns to the data (i.e., when one must provide values for one of the attributes of a relation in order to obtain tuples). We show that in the presence of limited access patterns we must search a space of annotated query plans, where the annotations describe the inputs that must be given to the plan. We describe a theoretical and experimental analysis of the resulting search space and a novel query optimization algorithm that is designed to perform well under the different conditions that may arise. The algorithm searches the set of annotated query plans, pruning invalid and non-viable plans as early as possible in the search space, and it also uses a best-first search strategy in order to produce a first complete plan early in the search. We describe experiments to illustrate the performance of our algorithm.

184 citations


Journal ArticleDOI
TL;DR: Analysis shows that Alta Vista, Excite and Infoseek are the top three services, with their relative rank changing depending on how one operationally defines the concept of relevance.
Abstract: Five search engines, Alta Vista, Excite, HotBot, Infoseek, and Lycos, are compared for precision on the first 20 results returned for 15 queries, adding weight for ranking effectiveness. All searching was done from January 31 to March 12, 1997. In the study, steps are taken to ensure that bias has not unduly influenced the evaluation. Friedman's randomized block design is used to perform multiple comparisons for significance. Analysis shows that Alta Vista, Excite and Infoseek are the top three services, with their relative rank changing depending on how one operationally defines the concept of relevance. Correspondence analysis shows that Lycos performed better on short, unstructured queries, whereas HotBot performed better on structured queries.

171 citations


Patent
02 Jul 1999
TL;DR: In this paper, meta-descriptors are generated for multimedia information in a repository by extracting descriptors from the multimedia information, clustering the multimedia information based on the descriptors, assigning meta-descriptors to each cluster, and attaching them to the items in the repository.
Abstract: Multimedia information retrieval is performed using meta-descriptors in addition to descriptors. A “descriptor” is a representation of a feature, a “feature” being a distinctive characteristic of multimedia information, while a “meta-descriptor” is information about the descriptor. Meta-descriptors are generated for multimedia information in a repository ( 10, 12, 14, 16, 18, 20, 22, 24 ) by extracting the descriptors from the multimedia information ( 111 ), clustering the multimedia information based on the descriptors ( 112 ), assigning meta-descriptors to each cluster ( 113 ), and attaching the meta-descriptors to the multimedia information in the repository ( 114 ). The multimedia repository is queried by formulating a query using query-by-example ( 131 ), acquiring the descriptor/s and meta-descriptor/s for a repository multimedia item ( 132 ), generating a query descriptor/s if none of the same type has been previously generated ( 133, 134 ), comparing the descriptors of the repository multimedia item and the query multimedia item ( 135 ), and ranking and displaying the results ( 136, 137 ).

Journal ArticleDOI
TL;DR: This paper used corpus analysis techniques to automatically discover similar words directly from the contents of the databases which are not tagged with part-of-speech labels, resulting in conceptual retrieval rather than requiring exact word matches between queries and documents.
Abstract: Searching online text collections can be both rewarding and frustrating. While valuable information can be found, typically many irrelevant documents are also retrieved, while many relevant ones are missed. Terminology mismatches between the user's query and document contents are a main cause of retrieval failures. Expanding a user's query with related words can improve search performances, but finding and using related words is an open problem. This research uses corpus analysis techniques to automatically discover similar words directly from the contents of the databases which are not tagged with part-of-speech labels. Using these similarities, user queries are automatically expanded, resulting in conceptual retrieval rather than requiring exact word matches between queries and documents. We are able to achieve a 7.6% improvement for TREC 5 queries and up to a 28.5% improvement on the narrow-domain Cystic Fibrosis collection. This work has been extended to multidatabase collections where each subdatabase has a collection-specific similarity matrix associated with it. If the best matrix is selected, substantial search improvements are possible. Various techniques to select the appropriate matrix for a particular query are analyzed, and a 4.8% improvement in the results is validated.
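The expansion step described above can be pictured as a lookup in a term-similarity matrix followed by a threshold. A minimal sketch; the similarity values here are hand-written, whereas the paper computes them from corpus co-occurrence statistics over untagged text:

```python
# Sketch of corpus-based query expansion: each term has a row of similar
# terms with similarity scores (here a toy table; the paper derives these
# from corpus analysis), and terms above a threshold join the query.
def expand_query(query_terms, sim, threshold=0.5):
    expanded = list(query_terms)
    for q in query_terms:
        for term, s in sim.get(q, {}).items():
            if s >= threshold and term not in expanded:
                expanded.append(term)
    return expanded

sim = {"car": {"automobile": 0.8, "truck": 0.55, "road": 0.3}}
q = expand_query(["car"], sim)  # ["car", "automobile", "truck"]
```

The multidatabase extension mentioned in the abstract amounts to keeping one such `sim` matrix per subdatabase and choosing which matrix to apply per query.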

Patent
10 Jun 1999
TL;DR: In this article, an integrated retrieval scheme retrieves data involved in a plurality of semi-structured documents scattered over open networks and collects the required information item by item from the semi-structured documents through a unified interface, without regard to differences in the document structures, presentation styles, and elements of the documents.
Abstract: An integrated retrieval scheme retrieves data involved in a plurality of semi-structured documents scattered over open networks and collects the required information item by item from the semi-structured documents through a unified interface, without regard to differences in the document structures, presentation styles, and elements of the semi-structured documents. The search scheme receives a query consisting of search items and search conditions from a user (S200). The search scheme finds, according to location data that specifies the location of each of the semi-structured documents, the location of each semi-structured document that contains all search items (S210); converts, if necessary, item presentation styles of the entered query into those of the located semi-structured documents according to style conversion data (S220, S225, S230); forms queries for the located semi-structured documents, transmits the queries to the found locations, and obtains the documents (S240); extracts item data from the obtained semi-structured documents according to structure data used to delimit documents into items and attribute data used for conditional retrieval, and prepares a search result (S240); and converts, if necessary, item presentation styles of the search result into the item presentation styles of each user according to the style conversion data (S250).

Journal ArticleDOI
TL;DR: A new scheme implements a recursive HSV-space segmentation technique to identify perceptually prominent color areas, providing robust retrieval results for a wide range of gamma nonlinearity values, which proves to be of great importance since, in general, the image acquisition source is unknown.

Patent
Jay Ponte1
31 Mar 1999
TL;DR: In this paper, the authors present a system for performing online data queries, which is a distributed computer system with a plurality of server nodes each fully redundant and capable of processing a user query request.
Abstract: Disclosed is a system for performing online data queries. The system for performing online data queries is a distributed computer system with a plurality of server nodes each fully redundant and capable of processing a user query request. Each server node includes a data query cache and other caches that may be used in performing data queries. The data query, as well as request allocation, is performed in accordance with an adaptive partitioning technique with a bias towards an initial partitioning scheme. Generic objects are created and used to represent business listings upon which the user may perform queries. Various data processing and integration techniques are included which enhance data queries. An update technique is used for synchronizing data updates as needed in updating the plurality of server nodes. A multi-media data transfer technique is used to transfer non-text or multi-media data between various components of the online query tool. Optimizations for searching, such as the common term optimization, are included for those commonly performed data queries. Also disclosed is a system for targeting advertisements that are displayed to a user of the system.

Proceedings ArticleDOI
Yun-Wu Huang1, Philip S. Yu1
01 Aug 1999
Abstract: IBM T.J. Watson Research Center 30 Saw Mill River Road Hawthorne, NY 10532

Journal ArticleDOI
Wen-Syan Li1, Quoc Vu1, Divakant Agrawal1, Yoshinori Hara1, Hajime Takano1 
17 May 1999
TL;DR: The notion of bookmark management is extended by introducing the functionalities of hypermedia databases, and PowerBookmarks supports advanced query, classification, and navigation functionalities on collections of bookmarks.
Abstract: We extend the notion of bookmark management by introducing the functionalities of hypermedia databases. PowerBookmarks is a Web information organization, sharing, and management tool, which parses metadata from bookmarked URLs and uses it to index and classify the URLs. PowerBookmarks supports advanced query, classification, and navigation functionalities on collections of bookmarks. PowerBookmarks monitors and utilizes users' access patterns to provide many useful personalized services, such as automated URL bookmarking, document refreshing, and bookmark expiration. It also allows users to specify their preference in bookmark management, such as ranking schemes and classification tree structures. Subscription services for new or updated documents of users' interests are also supported.

Patent
Jay Ponte1
30 Jul 1999
TL;DR: In this paper, a method and device improve the quality of documents selected in response to a user query for documents such as Web pages or sites: the user iteratively grades a limited number of documents as relevant or not relevant, the characteristics of the graded documents are analyzed with information retrieval techniques, and the search query is modified based on that analysis, until the user is satisfied with the quality of the documents presented.
Abstract: Disclosed is a method and device for improving the quality of documents selected in response to a user query for documents such as Web pages or sites. The method is one of iteration, and involves the successive review by the user of a limited number of documents as being relevant or not relevant, the analysis of the characteristics of the documents so graded by means of information retrieval techniques, and the modification of the search query based upon that analysis, until the user is satisfied with the quality of the documents presented to him.

Journal ArticleDOI
TL;DR: A new relevance feedback mechanism is described which evaluates the feature distributions of the images judged relevant, or not relevant, by the user and dynamically updates both the similarity measure and the query in order to accurately represent the user's particular information needs.
Abstract: Content-based image retrieval systems require the development of relevance feedback mechanisms that allow the user to progressively refine the system's response to a query. In this paper a new relevance feedback mechanism is described which evaluates the feature distributions of the images judged relevant, or not relevant, by the user and dynamically updates both the similarity measure and the query in order to accurately represent the user's particular information needs. Experimental results demonstrate the effectiveness of this mechanism.

Journal ArticleDOI
TL;DR: Compared to passage ranking with adaptations of current document ranking algorithms, the new “DO-TOS” passage-ranking algorithm requires only a fraction of the resources, at the cost of a small loss of effectiveness.
Abstract: Queries to text collections are resolved by ranking the documents in the collection and returning the highest-scoring documents to the user. An alternative retrieval method is to rank passages, that is, short fragments of documents, a strategy that can improve effectiveness and identify relevant material in documents that are too large for users to consider as a whole. However, ranking of passages can considerably increase retrieval costs. In this article we explore alternative query evaluation techniques, and develop new techniques for evaluating queries on passages. We show experimentally that, appropriately implemented, effective passage retrieval is practical in limited memory on a desktop machine. Compared to passage ranking with adaptations of current document ranking algorithms, our new “DO-TOS” passage-ranking algorithm requires only a fraction of the resources, at the cost of a small loss of effectiveness.
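The basic idea of passage ranking, stripped of all efficiency concerns, is to slide a fixed-length window over each document and score the document by its best window. A minimal sketch with a simple term-overlap score; this is not the article's DO-TOS algorithm, whose contribution is doing this cheaply:

```python
# Sketch of passage-based ranking: split a document into fixed-length
# word windows and score the document by its best passage. The overlap
# score here is a toy stand-in for a real ranking function.
def best_passage_score(doc_words, query_terms, size=6):
    best = 0
    for start in range(0, max(1, len(doc_words) - size + 1)):
        window = doc_words[start:start + size]
        best = max(best, sum(1 for t in query_terms if t in window))
    return best

doc = "the cat sat on the mat while the dog slept".split()
s = best_passage_score(doc, ["cat", "mat"], size=6)  # both terms in one window
```

Scoring by best window rewards documents where the query terms co-occur locally, which is what lets passage ranking surface relevant sections of very long documents.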

Journal ArticleDOI
TL;DR: To clarify the ambiguity of the short queries given by users, the idea is to have users give two to three times more feedback in the same amount of time as conventional feedback mechanisms would require.
Abstract: The World Wide Web is a world of great richness, but finding information on the Web is also a great challenge. Keyword-based querying has been an immediate and efficient way to specify and retrieve the related information that the user inquires about. However, conventional document ranking based on an automatic assessment of document relevance to the query may not be the best approach when little information is given, as in most cases. In order to clarify the ambiguity of the short queries given by users, we propose the idea of concept-based relevance feedback for Web information retrieval. The idea is to have users give two to three times more feedback in the same amount of time as conventional feedback mechanisms would require. Under this design principle, we apply clustering techniques to the initial search results to provide concept-based browsing. We show the performance of various feedback interface designs and compare their pros and cons. We measure precision and relative recall to show how clustering improves performance over conventional similarity ranking and, most importantly, we show how the assistance of concept-based presentation reduces browsing labor.

Patent
Soumen Chakrabarti1, Byron Dom1
12 Mar 1999
TL;DR: In this article, a system and method is proposed for ranking wide area computer network (e.g., Web) pages by popularity in response to a query, using the query and the response from a search engine.
Abstract: A system and method for ranking wide area computer network (e.g., Web) pages by popularity in response to a query. Further, using a query and the response thereto from a search engine, the system and method finds additional key words that might be good extended search terms, essentially generating a local thesaurus on the fly at query time.

01 Jan 1999
TL;DR: A keyword-based model for text mining is described in Feldman and Dagan (1995); the work suggests using a wide range of KDD (Knowledge Discovery in Databases) operations on collections of textual documents, including association discovery among keywords within the documents.
Abstract: As the amount of electronic documents (corpora, dictionaries, newspapers, newswires, etc.) becomes more and more important and diversified, there is a need to extract information automatically from these texts. In order to extract terms and relations between terms, two methods can be used. The first method is the unsupervised approach, which requires a term extraction module and a few predefined types, especially term types, in order to find relationships between terms and to assign appropriate types to the relationships. Works on automatic term recognition usually involve predefinition of a set of term patterns, an extraction procedure and a scoring mechanism to filter out non-relevant candidates. Smadja (1993) describes a set of techniques based on statistical methods for retrieving collocations from large text collections. Daille (1996) presents a combination of linguistic filters and statistical methods to extract two-word terms. This work implements finite automata for each term pattern, then various statistical scores for ranking the extracted terms are compared. Unsupervised identification of term relationships is a more complicated task, reported in works from various fields including Computational Linguistics and Knowledge Discovery in Texts. A keyword-based model for text mining is described in Feldman and Dagan (1995). The work suggests using a wide range of KDD (Knowledge Discovery in Databases) operations on collections of textual documents, including association discovery among keywords within the documents. Cooper and Byrd (1997) reports the T…

Patent
James Conklin1
16 Mar 1999
TL;DR: In this paper, a search and retrieval system pre-processes an input query to map a contextual semantic interpretation, expressed by the user of the input query, to a boolean logic interpretation for processing in the system.
Abstract: A search and retrieval system pre-processes an input query to map a contextual semantic interpretation, expressed by the user of the input query, to a boolean logic interpretation for processing in the search and retrieval system. A knowledge base comprises a plurality of categories, such that subsets of the categories are designated to one of a plurality of groups. A lexicon stores a plurality of terms including definitional characteristics for the terms. To pre-process the query, the search and retrieval system receives an input query comprising a plurality of terms, and processes the terms by referencing the lexicon to identify value terms that comprise a content carrying capacity. The knowledge base is referenced to identify a group for each value term. A processed input query is generated by inserting an AND logical connector between two value terms if the two respective value terms are in different groups and by inserting an OR logical connector between two value terms if the two respective value terms are in the same group. The lexicon is also used to identify phrases as well as connective terms for conversion to a boolean operator.
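The connector-insertion rule described above is simple enough to sketch directly: AND between consecutive value terms from different knowledge-base groups, OR between terms in the same group. The group assignments below are a toy assumption standing in for the patent's knowledge base:

```python
# Sketch of the patent's connector rule: insert AND between consecutive
# value terms from different groups, OR between terms in the same group.
# The group table is an illustrative assumption, not the real knowledge base.
def to_boolean(value_terms, groups):
    out = [value_terms[0]]
    for prev, cur in zip(value_terms, value_terms[1:]):
        op = "OR" if groups[prev] == groups[cur] else "AND"
        out += [op, cur]
    return " ".join(out)

groups = {"car": "vehicles", "truck": "vehicles", "insurance": "finance"}
expr = to_boolean(["car", "truck", "insurance"], groups)
# -> "car OR truck AND insurance"
```

The intuition is that terms in the same group are likely alternatives for one concept (so either should match), while terms in different groups express distinct constraints that must all hold.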

Patent
Mahesh Viswanathan1
18 Jun 1999
TL;DR: In this article, an audio retrieval system and method are provided for augmenting the transcription of an audio file with one or more alternate word or phrase choices, such as next-best guesses for each word and phrase, in addition to the best word sequence identified by the transcription process.
Abstract: An audio retrieval system and method are provided for augmenting the transcription of an audio file with one or more alternate word or phrase choices, such as next-best guesses for each word or phrase, in addition to the best word sequence identified by the transcription process. The audio retrieval system can utilize a primary index file containing the best identified words and/or phrases for each portion of the input audio stream and a supplemental index file containing alternative choices for each word or phrase in the transcript. The present invention allows words that are incorrectly transcribed during speech recognition to be identified in response to a textual query by searching the supplemental index files. During an indexing process, the list of alternative word or phrase choices provided by the speech recognition system are collected to produce a set of supplemental index files. During a retrieval process, the user-specified textual query is matched against the primary and supplemental indexes derived from the transcribed audio to identify relevant documents. An objective ranking function scales matches found in the supplemental index file(s) using a predefined scaling factor, or a value reflecting the confidence value of the corresponding alternative choice as identified by the speech recognition system.
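The two-index retrieval step can be sketched as follows. This is a hedged simplification: the 0.5 scaling factor and the set-based indexes are illustrative assumptions, and the patent alternatively allows scaling by the recognizer's per-word confidence values.

```python
# Sketch of retrieval over a primary transcript index plus a supplemental
# index of alternate recognition choices, with supplemental matches
# down-weighted by a predefined scaling factor (0.5 here is an assumption).
def audio_score(query_terms, primary, supplemental, scale=0.5):
    score = 0.0
    for t in query_terms:
        if t in primary:
            score += 1.0          # matched the best transcription
        elif t in supplemental:
            score += scale        # matched only a next-best guess
    return score

primary = {"meeting", "budget"}       # best word sequence from the recognizer
supplemental = {"fiscal", "gadget"}   # alternate word choices
s = audio_score(["budget", "fiscal"], primary, supplemental)  # 1.0 + 0.5
```

Down-weighting supplemental matches lets misrecognized words still be found without letting low-confidence alternatives dominate the ranking.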

Proceedings ArticleDOI
23 Mar 1999
TL;DR: A statistical method to estimate the usefulness of a search engine for any given query, which can be used by a metasearch engine to choose local search engines to invoke.
Abstract: In this paper, we present a statistical method to estimate the usefulness of a search engine for any given query. The estimates can be used by a metasearch engine to choose local search engines to invoke. For a given query, the usefulness of a search engine in this paper is defined to be a combination of the number of documents in the search engine that are sufficiently similar to the query and the average similarity of these documents. Experimental results indicate that the proposed estimation method is quite accurate.
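The usefulness measure defined above combines a count of sufficiently similar documents with their average similarity. A minimal sketch; pairing the count with the average follows the paper's definition, while the sample similarity values and threshold are invented for illustration:

```python
# Sketch of the usefulness measure: for one search engine, count the
# documents whose query similarity exceeds a threshold and average their
# similarities. The sample values below are invented.
def usefulness(similarities, threshold):
    above = [s for s in similarities if s >= threshold]
    if not above:
        return 0, 0.0
    return len(above), sum(above) / len(above)

engine_a = [0.9, 0.7, 0.2, 0.1]
engine_b = [0.6, 0.55, 0.5, 0.4]
n_a, avg_a = usefulness(engine_a, threshold=0.5)  # 2 docs, avg 0.8
n_b, avg_b = usefulness(engine_b, threshold=0.5)  # 3 docs, avg 0.55
```

A metasearch engine would compare such (count, average) pairs across local engines and forward the query only to the most useful ones; the paper's contribution is estimating these quantities statistically without running the query everywhere.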

01 Jan 1999
TL;DR: The test revealed that when the queries were unexpanded, there were no great differences between different structure types irrespective of the complexity level, and when queries were expanded, the performance of the retrieval system was satisfactory.
Abstract: In this study the effects of query complexity, expansion and structure on retrieval performance – measured as precision and recall – in probabilistic text retrieval were tested. Complexity refers to the number of search facets or intersecting concepts in a query. Facets were divided into major and minor facets on the basis of their importance with respect to a corresponding request. Two complexity levels were tested: high complexity refers to queries using all search facets identified from requests, low complexity was achieved by formulating queries with major facets only. Query expansion was based on a thesaurus, from which the expansion keys were elicited for queries. There were five expansion types: (1) the first query version was an unexpanded, original query with one search key for each search concept (original search concepts) elicited from the test thesaurus; (2) the synonyms of the original search keys were added to the original query; (3) search keys representing the narrower concepts of the original search concepts were added to the original query; (4) search keys representing the associative concepts of the original search concepts were added to the original query; (5) all previous expansion keys were cumulatively added to the original query. Query structure refers to the syntactic structure of a query expression, marked with query operators and parentheses. The structure of queries was either weak (queries with no differentiated relations between search keys, except weights) or strong (different relationships between search keys).
More precisely, strong query structures were based on facets or intersecting concepts. Altogether five weak and eight strong structure types were tested. The test involved 30 test requests which all were formulated into 110 queries representing different structure, expansion and complexity combinations. The test database was a text database of 53,893 newspaper articles. The test was run in InQuery, a probabilistic text retrieval system. The test revealed that when the queries were unexpanded, there were no great differences between different structure types irrespective of the complexity level. When queries were expanded, the performance of the …
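The structure/expansion combinations described above can be sketched as a query builder. This is an illustrative stand-in for the dissertation's 110 hand-formulated queries: the `#and`/`#sum`/`#syn` operator spelling is InQuery-like but should be treated as an assumption, as are the function names and example keys.

```python
# Sketch: a "strong" query intersects facets (#and over facet groups), while
# expansion keys join their original search key inside a synonym group (#syn).
# A "weak" query flattens everything into one undifferentiated bag (#sum).

def build_query(facets, expansions=None, strong=True):
    """facets: list of original search keys, one per search facet.
    expansions: optional dict mapping a key to its thesaurus expansion keys."""
    expansions = expansions or {}
    groups = []
    for key in facets:
        keys = [key] + expansions.get(key, [])
        groups.append("#syn(" + " ".join(keys) + ")" if len(keys) > 1 else key)
    if strong:
        return "#and(" + " ".join(groups) + ")"   # facet-based, strong structure
    return "#sum(" + " ".join(groups) + ")"       # weak, bag-of-keys structure

q = build_query(["car", "pollution"], {"car": ["automobile"]})
print(q)  # #and(#syn(car automobile) pollution)
```

This illustrates the study's central finding: with expansion, keeping facets differentiated (strong structure) prevents expansion keys of one facet from drowning out the other facets.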


Patent
21 Dec 1999
TL;DR: In this article, the authors propose a method for retrieving relevant stories from a collection of stories. The method comprises the steps of identifying at least one query term, applying a co-occurrence matrix to the query term to provide a list of query terms, determining whether a story in the collection contains any terms on that list, and then increasing a relevance measure if it does.
Abstract: A method for retrieving relevant stories from a collection of stories. The method comprises the steps of identifying at least one query term, applying a cooccurrence matrix to the query term to provide a list of query terms, determining if a story in the collection contains any terms on the list of query terms, and then increasing a relevance measure if the story does contain terms on that list. If the relevance measure is higher than a threshold, the story is added to a list of relevant stories.
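The claimed retrieval loop is simple enough to sketch end to end. This is a hedged illustration of the patent's steps, not its implementation: the co-occurrence matrix is represented as a plain dict from a term to its related terms, and the tokenization, weights, and threshold are assumptions.

```python
# Sketch of the claimed steps: expand the query term via a co-occurrence
# matrix, count expanded-term matches per story as a relevance measure, and
# keep stories whose measure exceeds a threshold.

def relevant_stories(query_term, cooccurrence, stories, threshold=1):
    expanded = [query_term] + cooccurrence.get(query_term, [])
    relevant = []
    for story in stories:
        words = set(story.lower().split())
        relevance = sum(1 for t in expanded if t in words)  # one point per hit
        if relevance > threshold:
            relevant.append(story)
    return relevant

cooc = {"election": ["vote", "ballot"]}
stories = ["The vote and ballot counts", "Cooking pasta at home"]
print(relevant_stories("election", cooc, stories))
# → ['The vote and ballot counts']
```

Note that the first story matches even though it never contains the literal query term "election" — the co-occurrence expansion is what retrieves it.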

Book ChapterDOI
22 Sep 1999
TL;DR: A query expansion technique which is based on a statistical similarity measure among terms to improve the effectiveness of the dictionary-based cross-language information retrieval (CLIR) method, and a term similarity-based sense disambiguation technique proposed in earlier work is employed to enhance the accuracy of the dictionary-based query translation method.
Abstract: We propose a query expansion technique which is based on a statistical similarity measure among terms to improve the effectiveness of the dictionary-based cross-language information retrieval (CLIR) method. We employ a term similarity-based sense disambiguation technique proposed in our earlier work to enhance the accuracy of the dictionary-based query translation method. The query expansion technique is then applied to the translation of queries to further improve their retrieval performance. We demonstrate the effectiveness of the two techniques combined using queries in three languages, namely, German, Spanish, and Indonesian, to retrieve English documents from a standard TREC (Text Retrieval Conference) collection. The results of our experiments indicate that the term similarity-based techniques work better when there are more phrases in the queries. In addition, our results also reemphasize other researchers' finding that phrase recognition and translation are critical to CLIR's effectiveness.
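The core of dictionary-based query translation with similarity-based sense disambiguation can be sketched as follows. This is an illustrative reading of the approach, not the paper's algorithm: the bilingual dictionary, the pairwise term-similarity table, and the "pick the candidate most similar to the other terms' candidates" rule below are all assumptions standing in for the paper's statistical resources.

```python
# Sketch: for each source-language term, keep the dictionary translation
# candidate that is most similar (by a corpus-derived statistic) to the
# candidate translations of the *other* query terms, i.e. the query itself
# supplies the disambiguating context.

def translate_query(src_terms, bilingual_dict, sim):
    target = []
    for i, term in enumerate(src_terms):
        candidates = bilingual_dict.get(term, [])
        if not candidates:
            continue  # out-of-dictionary term: drop it in this sketch
        context = [c for j, t in enumerate(src_terms) if j != i
                   for c in bilingual_dict.get(t, [])]
        best = max(candidates,
                   key=lambda c: sum(sim.get((c, x), 0.0) for x in context))
        target.append(best)
    return target

# German "Bank" is ambiguous (financial bank vs. shore); "Geld" (money)
# pulls the translation toward the financial sense.
bank_dict = {"bank": ["bank", "shore"], "geld": ["money"]}
sim = {("bank", "money"): 0.9, ("shore", "money"): 0.1}
print(translate_query(["bank", "geld"], bank_dict, sim))  # ['bank', 'money']
```

The paper's expansion step would then add further high-similarity terms to the translated query; the same similarity table could drive that step.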