
Showing papers on "Ranking (information retrieval)" published in 2003


Journal ArticleDOI
TL;DR: It is shown that using linear combinations of precomputed topic-biased PageRank vectors to generate context-specific importance scores for pages at query time can produce more accurate rankings than a single, generic PageRank vector.
Abstract: The original PageRank algorithm for improving the ranking of search-query results computes a single vector, using the link structure of the Web, to capture the relative "importance" of Web pages, independent of any particular search query. To yield more accurate search results, we propose computing a set of PageRank vectors, biased using a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic. For ordinary keyword search queries, we compute the topic-sensitive PageRank scores for pages satisfying the query using the topic of the query keywords. For searches done in context (e.g., when the search query is performed by highlighting words in a Web page), we compute the topic-sensitive PageRank scores using the topic of the context in which the query appeared. By using linear combinations of these (precomputed) biased PageRank vectors to generate context-specific importance scores for pages at query time, we show that we can generate more accurate rankings than with a single, generic PageRank vector. We describe techniques for efficiently implementing a large-scale search system based on the topic-sensitive PageRank scheme.

1,161 citations
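The query-time combination step described in the abstract above can be sketched as follows; the topic vectors, weights, and function name are illustrative placeholders, not the paper's actual data:

```python
def combine_pagerank(biased_vectors, topic_weights):
    """Linearly combine precomputed topic-biased PageRank vectors into a
    single query-specific importance score per page. All values below are
    toy placeholders, not values from the paper."""
    n = len(next(iter(biased_vectors.values())))
    combined = [0.0] * n
    for topic, weight in topic_weights.items():
        for page, score in enumerate(biased_vectors[topic]):
            combined[page] += weight * score
    return combined

# Toy setup: 3 pages, 2 topic-biased PageRank vectors (each sums to 1).
vectors = {
    "sports":  [0.6, 0.3, 0.1],
    "science": [0.1, 0.2, 0.7],
}
weights = {"sports": 0.3, "science": 0.7}   # e.g. P(topic | query)
scores = combine_pagerank(vectors, weights)
ranking = sorted(range(len(scores)), key=lambda p: -scores[p])
```

Because the weights and each biased vector sum to one, the combined scores again form a probability-like vector, so pages remain directly comparable at query time.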


Proceedings Article
09 Dec 2003
TL;DR: A simple universal ranking algorithm for data lying in a Euclidean space, such as text or image data, which ranks the data with respect to the intrinsic manifold structure collectively revealed by a large amount of data.
Abstract: The Google search engine has enjoyed huge success with its web page ranking algorithm, which exploits global, rather than local, hyperlink structure of the web using random walks. Here we propose a simple universal ranking algorithm for data lying in the Euclidean space, such as text or image data. The core idea of our method is to rank the data with respect to the intrinsic manifold structure collectively revealed by a great amount of data. Encouraging experimental results from synthetic, image, and text data illustrate the validity of our method.

767 citations
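The iteration at the heart of such manifold-ranking methods can be sketched roughly as follows: build an affinity graph over the data, normalize it symmetrically, and repeatedly spread the query's score to its neighbors. The data, parameter values, and function name here are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def manifold_rank(X, query_idx, sigma=1.0, alpha=0.9, iters=200):
    """Spread the query's score over a neighborhood graph:
    f <- alpha * S @ f + (1 - alpha) * y, with S the symmetrically
    normalized affinity matrix. sigma, alpha, and iters are
    illustrative, untuned choices."""
    n = len(X)
    # Gaussian affinities with zero diagonal (no self-loops).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    dinv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * dinv_sqrt[:, None] * dinv_sqrt[None, :]
    y = np.zeros(n)
    y[query_idx] = 1.0
    f = y.copy()
    for _ in range(iters):
        f = alpha * S @ f + (1.0 - alpha) * y
    return f

# Two well-separated clusters on a line; the query is the first point.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [5.0, 0.0], [5.1, 0.0]])
scores = manifold_rank(X, query_idx=0)
```

Points in the query's cluster receive higher scores than points in the distant cluster, even when their raw Euclidean distances to the query would not separate them as cleanly.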


Journal ArticleDOI
28 Jul 2003
TL;DR: A framework for evaluating subtopic retrieval is proposed that generalizes the traditional precision and recall metrics by accounting for intrinsic topic difficulty as well as redundancy in documents; a maximal marginal relevance (MMR) ranking strategy is also proposed.
Abstract: We present a non-traditional retrieval problem we call subtopic retrieval. The subtopic retrieval problem is concerned with finding documents that cover many different subtopics of a query topic. In such a problem, the utility of a document in a ranking is dependent on other documents in the ranking, violating the assumption of independent relevance which is assumed in most traditional retrieval methods. Subtopic retrieval poses challenges for evaluating performance, as well as for developing effective algorithms. We propose a framework for evaluating subtopic retrieval which generalizes the traditional precision and recall metrics by accounting for intrinsic topic difficulty as well as redundancy in documents. We propose and systematically evaluate several methods for performing subtopic retrieval using statistical language models and a maximal marginal relevance (MMR) ranking strategy. A mixture model combined with query likelihood relevance ranking is shown to modestly outperform a baseline relevance ranking on a data set used in the TREC interactive track.

611 citations


Book ChapterDOI
09 Sep 2003
TL;DR: This paper adapts IR-style document-relevance ranking strategies to the problem of processing free-form keyword queries over RDBMSs, and develops query-processing strategies that build on a crucial characteristic of IR-style keyword search: only the few most relevant matches are generally of interest.
Abstract: Applications in which plain text coexists with structured data are pervasive. Commercial relational database management systems (RDBMSs) generally provide querying capabilities for text attributes that incorporate state-of-the-art information retrieval (IR) relevance ranking strategies, but this search functionality requires that queries specify the exact column or columns against which a given list of keywords is to be matched. This requirement can be cumbersome and inflexible from a user perspective: good answers to a keyword query might need to be "assembled" -in perhaps unforeseen ways- by joining tuples from multiple relations. This observation has motivated recent research on free-form keyword search over RDBMSs. In this paper, we adapt IR-style document-relevance ranking strategies to the problem of processing free-form keyword queries over RDBMSs. Our query model can handle queries with both AND and OR semantics, and exploits the sophisticated single-column text-search functionality often available in commercial RDBMSs. We develop query-processing strategies that build on a crucial characteristic of IR-style keyword search: only the few most relevant matches -according to some definition of "relevance"- are generally of interest. Consequently, rather than computing all matches for a keyword query, which leads to inefficient executions, our techniques focus on the top-k matches for the query, for moderate values of k. A thorough experimental evaluation over real data shows the performance advantages of our approach.

581 citations


Proceedings ArticleDOI
Andrei Z. Broder1, David Carmel1, Michael Herscovici1, Aya Soffer1, Jason Zien1 
03 Nov 2003
TL;DR: An efficient query evaluation method based on a two-level approach that reduces the total number of full evaluations by more than 90%, almost without any loss in precision or recall.
Abstract: We present an efficient query evaluation method based on a two-level approach: at the first level, our method iterates in parallel over query term postings and identifies candidate documents using an approximate evaluation taking into account only partial information on term occurrences and no query independent factors; at the second level, promising candidates are fully evaluated and their exact scores are computed. The efficiency of the evaluation process can be improved significantly using dynamic pruning techniques with very little cost in effectiveness. The amount of pruning can be controlled by the user as a function of time allocated for query evaluation. Experimentally, using the TREC Web Track data, we have determined that our algorithm significantly reduces the total number of full evaluations by more than 90%, almost without any loss in precision or recall. At the heart of our approach there is an efficient implementation of a new Boolean construct called WAND, or Weak AND, that might be of independent interest.

435 citations
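The core two-level idea from the abstract above can be sketched in a much-simplified form: an approximate first pass uses per-term score upper bounds to skip documents that cannot enter the current top-k, and only the survivors get a full evaluation. This is a sketch of the principle, not the paper's actual WAND iterator; the postings, bounds, and scorer are toy assumptions:

```python
import heapq

def two_level_topk(postings, upper_bounds, exact_score, k):
    """First level: cheap check against per-term score upper bounds.
    Second level: full evaluation only for candidates whose bound
    beats the current top-k threshold."""
    docs = set().union(*postings.values())
    heap = []          # min-heap holding the current top-k exact scores
    full_evals = 0
    for doc in sorted(docs):
        terms = [t for t, plist in postings.items() if doc in plist]
        bound = sum(upper_bounds[t] for t in terms)
        threshold = heap[0] if len(heap) == k else 0.0
        if bound <= threshold:
            continue               # cannot make the top k: skip full scoring
        full_evals += 1            # second level: exact evaluation
        s = exact_score(doc, terms)
        if len(heap) < k:
            heapq.heappush(heap, s)
        elif s > heap[0]:
            heapq.heapreplace(heap, s)
    return sorted(heap, reverse=True), full_evals

# Toy postings: term -> documents containing it (ids are illustrative).
postings = {"a": {1, 2, 3}, "b": {2, 3, 4}}
upper_bounds = {"a": 1.0, "b": 2.0}          # max per-term contributions
score = lambda doc, terms: sum(0.5 if t == "a" else 1.5 for t in terms)
top, full_evals = two_level_topk(postings, upper_bounds, score, k=2)
```

Even in this tiny example one of the four documents is skipped without a full evaluation; on real collections the upper bounds prune the vast majority of candidates.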


Proceedings ArticleDOI
03 Nov 2003
TL;DR: Two algorithms for determining expertise from email were compared: a content-based approach that takes account only of email text, and a graph-based ranking algorithm (HITS) that takes account of both text and communication patterns.
Abstract: A common method for finding information in an organization is to use social networks---ask people, following referrals until someone with the right information is found. Another way is to automatically mine documents to determine who knows what. Email documents seem particularly well suited to this task of "expertise location", as people routinely communicate what they know. Moreover, because people explicitly direct email to one another, social networks are likely to be contained in the patterns of communication. Can these patterns be used to discover experts on particular topics? Is this approach better than mining message content alone? To find answers to these questions, two algorithms for determining expertise from email were compared: a content-based approach that takes account only of email text, and a graph-based ranking algorithm (HITS) that takes account both of text and communication patterns. An evaluation was done using email and explicit expertise ratings from two different organizations. The rankings given by each algorithm were compared to the explicit rankings with the precision and recall measures commonly used in information retrieval, as well as the d' measure commonly used in signal-detection theory. Results show that the graph-based algorithm performs better than the content-based algorithm at identifying experts in both cases, demonstrating that the graph-based algorithm effectively extracts more information than is found in content alone.

395 citations


Journal ArticleDOI
TL;DR: The identity measure and the best fingerprinting technique are both able to accurately identify coderivative documents, but fingerprinting parameters must be carefully chosen, and even so the identity measure is clearly superior.
Abstract: The widespread use of on-line publishing of text promotes storage of multiple versions of documents and mirroring of documents in multiple locations, and greatly simplifies the task of plagiarizing the work of others. We evaluate two families of methods for searching a collection to find documents that are coderivative, that is, are versions or plagiarisms of each other. The first, the ranking family, uses information retrieval techniques; extending this family, we propose the identity measure, which is specifically designed for identification of co-derivative documents. The second, the fingerprinting family, uses hashing to generate a compact document description, which can then be compared to the fingerprints of the documents in the collection. We introduce a new method for evaluating the effectiveness of these techniques, and demonstrate it in practice. Using experiments on two collections, we demonstrate that the identity measure and the best fingerprinting technique are both able to accurately identify coderivative documents. However, for fingerprinting, parameters must be carefully chosen, and even so the identity measure is clearly superior.

378 citations
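The fingerprinting idea described above can be sketched as follows: hash overlapping word shingles of each document and compare documents by fingerprint overlap. The shingle length is an illustrative parameter choice, and real systems keep only a selected subset of the hashes to stay compact:

```python
import hashlib

def fingerprint(text, shingle_len=3):
    """Hash every overlapping word shingle of the text. Production
    fingerprinting keeps only a subset of these hashes; here we keep
    all of them for simplicity."""
    words = text.lower().split()
    return {
        int(hashlib.md5(" ".join(words[i:i + shingle_len]).encode()).hexdigest(), 16)
        for i in range(max(len(words) - shingle_len + 1, 0))
    }

def resemblance(a, b):
    """Jaccard overlap of the two fingerprints."""
    fa, fb = fingerprint(a), fingerprint(b)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox leaps over the lazy dog"
c = "an entirely unrelated passage about database indexing"
```

A single-word edit, as in `b`, destroys only the few shingles that cross it, so near-duplicates keep a high resemblance while unrelated texts score zero; this sensitivity to parameters such as `shingle_len` is exactly the weakness the paper reports.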


Journal ArticleDOI
TL;DR: This study proposes a new method for query expansion based on user interactions recorded in user logs; it extracts correlations between query terms and document terms by analyzing the logs, and produces much better results than both the classical search method and other query expansion methods.
Abstract: Queries to search engines on the Web are usually short. They do not provide sufficient information for an effective selection of relevant documents. Previous research has proposed the utilization of query expansion to deal with this problem. However, expansion terms are usually determined on term co-occurrences within documents. In this study, we propose a new method for query expansion based on user interactions recorded in user logs. The central idea is to extract correlations between query terms and document terms by analyzing user logs. These correlations are then used to select high-quality expansion terms for new queries. Compared to previous query expansion methods, ours takes advantage of the user judgments implied in user logs. The experimental results show that the log-based query expansion method can produce much better results than both the classical search method and the other query expansion methods.

342 citations


Journal ArticleDOI
TL;DR: A firm can build more effective security strategies by identifying and ranking the severity of potential threats to its IS efforts.
Abstract: A firm can build more effective security strategies by identifying and ranking the severity of potential threats to its IS efforts.

335 citations


Proceedings ArticleDOI
03 Nov 2003
TL;DR: It is shown how time can be incorporated into both query-likelihood models and relevance models, and that the resulting time-based models perform as well as or better than the best of the heuristic techniques.
Abstract: We explore the relationship between time and relevance using TREC ad-hoc queries. A type of query is identified that favors very recent documents. We propose a time-based language model approach to retrieval for these queries. We show how time can be incorporated into both query-likelihood models and relevance models. These models were used for experiments comparing time-based language models to heuristic techniques for incorporating document recency in the ranking. Our results show that time-based models perform as well as or better than the best of the heuristic techniques.

308 citations
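One way to incorporate time into a query-likelihood model, in the spirit of the paper, is to add the log of an exponential prior over document age to a smoothed query-likelihood score. This is a sketch under stated assumptions: the rate and smoothing values are illustrative, and the document itself stands in for the collection language model:

```python
import math

def time_based_score(query, doc_terms, doc_age_days, rate=0.01, mu=100.0):
    """Log-score of a document: Dirichlet-smoothed query likelihood plus
    the log of an exponential prior over document age, so that recent
    documents are preferred. rate and mu are illustrative, untuned values."""
    n = len(doc_terms)
    score = 0.0
    for q in query:
        tf = doc_terms.count(q)
        p_coll = doc_terms.count(q) / max(n, 1)   # stand-in collection model
        score += math.log((tf + mu * p_coll) / (n + mu) + 1e-12)
    # Exponential recency prior: log P(D) = log(rate) - rate * age.
    score += math.log(rate) - rate * doc_age_days
    return score

doc = ["storm", "report", "storm"]
recent = time_based_score(["storm"], doc, doc_age_days=1)
stale = time_based_score(["storm"], doc, doc_age_days=365)
```

For two documents with identical text, the newer one always scores higher, with the gap controlled by the prior's rate parameter.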


Proceedings ArticleDOI
22 Jun 2003
TL;DR: PlanetP is a content-addressable publish/subscribe service for unstructured peer-to-peer (P2P) communities that supports content addressing by providing a gossiping layer used to globally replicate a membership directory and an extremely compact content index.
Abstract: We introduce PlanetP, a content-addressable publish/subscribe service for unstructured peer-to-peer (P2P) communities. PlanetP supports content addressing by providing: (1) a gossiping layer used to globally replicate a membership directory and an extremely compact content index; and (2) a completely distributed content search and ranking algorithm that helps users find the most relevant information. PlanetP is a simple, yet powerful system for sharing information. PlanetP is simple because each peer must only perform a periodic, randomized, point-to-point message exchange with other peers. PlanetP is powerful because it maintains a globally content-ranked view of the shared data. Using simulation and a prototype implementation, we show that PlanetP achieves ranking accuracy that is comparable to a centralized solution and scales easily to several thousand peers while remaining resilient to rapid membership changes.

Proceedings ArticleDOI
28 Jul 2003
TL;DR: A user query classification scheme that uses the difference of term distributions, mutual information, the usage rate of terms as anchor text, and POS information for classification; the best performance was obtained when this classification method was combined with the OKAPI scoring algorithm.
Abstract: The heterogeneous Web exacerbates IR problems, and short user queries make them worse. The contents of web documents are not enough to find good answer documents. Link information and URL information compensate for the insufficiency of content information. However, a static combination of multiple evidences may lower retrieval performance; we need different strategies to find target documents according to the query type. User queries can be classified into three categories: the topic relevance task, the homepage finding task, and the service finding task. In this paper, a user query classification scheme is proposed. This scheme uses the difference of distribution, mutual information, the usage rate as anchor texts, and the POS information for the classification. After classifying a user query, we apply different algorithms and information for better results. For the topic relevance task we emphasize the content information; for the homepage finding task we emphasize the link information and the URL information. We obtained the best performance when our proposed classification method was used with the OKAPI scoring algorithm.

Proceedings ArticleDOI
28 Jul 2003
TL;DR: This paper investigates the pre-conditions for successful combination of document representations formed from structural markup for the task of known-item search, and presents a mixture-based language model to investigate several hypotheses.
Abstract: This paper investigates the pre-conditions for successful combination of document representations formed from structural markup for the task of known-item search. As this task is very similar to work in meta-search and data fusion, we adapt several hypotheses from those research areas and investigate them in this context. To investigate these hypotheses, we present a mixture-based language model and also examine many of the current meta-search algorithms. We find that compatible output from systems is important for successful combination of document representations. We also demonstrate that combining low performing document representations can improve performance, but not consistently. We find that the techniques best suited for this task are robust to the inclusion of poorly performing document representations. We also explore the role of variance of results across systems and its impact on the performance of fusion, with the surprising result that the correct documents have higher variance across document representations than highly ranking incorrect documents.

Proceedings Article
01 Jan 2003
TL;DR: The challenges and several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval, are discussed and results of preliminary experiments are presented.
Abstract: Ranking and returning the most relevant results of a query is a popular paradigm in Information Retrieval. We discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. We present results of preliminary experiments.

Journal ArticleDOI
TL;DR: Experimental results show that the proposed approach to relevant term extraction and term suggestion can provide organized and highly relevant terms, and can exploit the contextual information in a user's query session to make more effective suggestions.
Abstract: This paper proposes an effective term suggestion approach to interactive Web search. Conventional approaches to making term suggestions involve extracting co-occurring keyterms from highly ranked retrieved documents. Such approaches must deal with term extraction difficulties and interference from irrelevant documents, and, more importantly, have difficulty extracting terms that are conceptually related but do not frequently co-occur in documents. In this paper, we present a new, effective log-based approach to relevant term extraction and term suggestion. Using this approach, the relevant terms suggested for a user query are those that co-occur in similar query sessions from search engine logs, rather than in the retrieved documents. In addition, the suggested terms in each interactive search step can be organized according to their relevance to the entire query session, rather than to the most recent single query as in conventional approaches. The proposed approach was tested using a proxy server log containing about two million query transactions submitted to search engines in Taiwan. The experimental results show that the proposed approach can provide organized and highly relevant terms, and can exploit the contextual information in a user's query session to make more effective suggestions.

Proceedings ArticleDOI
28 Jul 2003
TL;DR: It is shown that the CORI algorithm does not do well in environments with a mix of "small" and "very large" databases, and a new resource selection algorithm is proposed that uses information about database sizes as well as database contents.
Abstract: Prior research under a variety of conditions has shown the CORI algorithm to be one of the most effective resource selection algorithms, but the range of database sizes studied was not large. This paper shows that the CORI algorithm does not do well in environments with a mix of "small" and "very large" databases. A new resource selection algorithm is proposed that uses information about database sizes as well as database contents. We also show how to acquire database size estimates in uncooperative environments as an extension of the query-based sampling used to acquire resource descriptions. Experiments demonstrate that the database size estimates are more accurate for large databases than estimates produced by a competing method; the new resource ranking algorithm is always at least as effective as the CORI algorithm; and the new algorithm results in better document rankings than the CORI algorithm.

Patent
12 Jun 2003
TL;DR: In this article, a user interface can present results in the form of browsing multiple hierarchical representations, wherein matching categories are differentiated from non-matching categories, providing an indication of the fitness of the search terms for returning satisfactory results.
Abstract: Systems and methods for data storage, retrieval, manipulation and display provide search engines and computer-based research tools for enabling multiple hierarchical points of view. Category definitions in the hierarchical data structures can include lists of set members, like word arrays of set members, generative descriptions for determining set members, and fitness functions for determining fitness of a presented item for being a member of a set. Significance and interest values can be assigned to search categories to set threshold confidence levels for returning search results and for weighting the results, respectively. A user interface can present results in the form of browsing multiple hierarchical representations, wherein matching categories are differentiated from non-matching categories. Peer ratings can represent the ranking of search term results with relation to results using other search terms, providing an indication of the fitness of the search terms for returning satisfactory results.

Book ChapterDOI
22 Sep 2003
TL;DR: The main objective of this work is to investigate the trade-off between the quality of the induced ranking function and the computational complexity of the algorithm, both depending on the amount of preference information given for each example.
Abstract: We consider supervised learning of a ranking function, which is a mapping from instances to total orders over a set of labels (options). The training information consists of examples with partial (and possibly inconsistent) information about their associated rankings. From these, we induce a ranking function by reducing the original problem to a number of binary classification problems, one for each pair of labels. The main objective of this work is to investigate the trade-off between the quality of the induced ranking function and the computational complexity of the algorithm, both depending on the amount of preference information given for each example. To this end, we present theoretical results on the complexity of pairwise preference learning, and experimentally investigate the predictive performance of our method for different types of preference information, such as top-ranked labels and complete rankings. The domain of this study is the prediction of a rational agent's ranking of actions in an uncertain environment.
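The reduction described above can be sketched as follows: one binary preference predicate per pair of labels, with the final ranking obtained by counting pairwise wins. The threshold "classifiers" below are illustrative stand-ins for learned binary models:

```python
from itertools import combinations

def rank_by_pairwise_votes(labels, prefers, x):
    """Reduce label ranking to binary problems: prefers[(a, b)](x) is
    True if label a should precede label b for instance x. Labels are
    ordered by their number of pairwise wins."""
    votes = {lab: 0 for lab in labels}
    for a, b in combinations(labels, 2):
        if prefers[(a, b)](x):
            votes[a] += 1
        else:
            votes[b] += 1
    # Labels with more pairwise wins are ranked first.
    return sorted(labels, key=lambda lab: -votes[lab])

labels = ["red", "green", "blue"]
# Hypothetical learned preferences: trivial thresholds on a 1-D feature.
prefers = {
    ("red", "green"):  lambda x: x > 0.5,
    ("red", "blue"):   lambda x: x > 0.3,
    ("green", "blue"): lambda x: x > 0.1,
}
ranking = rank_by_pairwise_votes(labels, prefers, x=0.7)
```

With k labels this needs k(k-1)/2 binary models, which is the complexity trade-off the paper analyzes against the amount of preference information available per example.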

Proceedings ArticleDOI
09 Mar 2003
TL;DR: Although the rank-based Borda Count method is believed to be competitive with score-based methods, it is shown that this does not hold for metasearch, whereas Markov chain based methods do compete with score-based methods.
Abstract: Given a set of rankings, the task of ranking fusion is the problem of combining these lists in such a way as to optimize the performance of the combination. The ranking fusion problem is encountered in many situations; metasearch is a prominent example. It deals with the problem of combining the result lists returned by multiple search engines in response to a given query, where each item in a result list is ordered with respect to a search engine and a relevance score. Several ranking fusion methods have been proposed in the literature. They can be classified according to whether (i) they rely on ranks, (ii) they rely on scores, and (iii) they require training data. Our paper makes the following contributions: (i) we report experimental results for the Markov chain rank-based methods, for which no large experimental tests had yet been made; (ii) while it is believed that the rank-based method named Borda Count is competitive with score-based methods, we show that this is not true for metasearch; and (iii) we show that Markov chain based methods compete with score-based methods. This is especially important in the context of metasearch, as scores are usually not available from the search engines.
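Borda Count, the rank-based baseline discussed above, can be sketched in a few lines: each input ranking awards a document (n - position) points, and documents are fused by total points. The result lists here are illustrative:

```python
def borda_fuse(rankings):
    """Fuse rankings by Borda count: each list awards a document
    (n - position) points; ties are broken by document id."""
    docs = {d for r in rankings for d in r}
    n = max(len(r) for r in rankings)
    points = {d: 0 for d in docs}
    for r in rankings:
        for pos, d in enumerate(r):
            points[d] += n - pos
    return sorted(docs, key=lambda d: (-points[d], d))

# Three hypothetical engines return different orderings of the same documents.
fused = borda_fuse([
    ["d1", "d2", "d3"],
    ["d2", "d1", "d3"],
    ["d2", "d3", "d1"],
])
```

Note that the method uses only positions, never scores, which is precisely why it is attractive for metasearch, where engines rarely expose comparable relevance scores.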

20 Jun 2003
TL;DR: This work analytically compares three recent approaches to personalizing PageRank and discusses the tradeoffs of each one.
Abstract: PageRank, the popular link-analysis algorithm for ranking web pages, assigns a query- and user-independent estimate of "importance" to web pages. Query- and user-sensitive extensions of PageRank, which use a basis set of biased PageRank vectors, have been proposed in order to personalize the ranking function in a tractable way. We analytically compare three recent approaches to personalizing PageRank and discuss the tradeoffs of each one.

Journal ArticleDOI
01 Jan 2003
TL;DR: An efficient peer-to-peer information retrieval system, pSearch, that supports state-of-the-art content- and semantic-based full-text searches, avoiding the scalability problem of existing systems that employ centralized indexing or index/query flooding.
Abstract: We describe an efficient peer-to-peer information retrieval system, pSearch, that supports state-of-the-art content- and semantic-based full-text searches. pSearch avoids the scalability problem of existing systems that employ centralized indexing or index/query flooding. It also avoids the nondeterminism that is exhibited by heuristic-based approaches. In pSearch, documents in the network are organized around their vector representations (based on modern document ranking algorithms) such that the search space for a given query is organized around related documents, achieving both efficiency and accuracy.

Journal Article
TL;DR: A new family of topic-ranking algorithms for multi-labeled documents that achieve state-of-the-art results and outperform topic-ranking adaptations of Rocchio's algorithm and of the Perceptron algorithm.
Abstract: We describe a new family of topic-ranking algorithms for multi-labeled documents. The motivation for the algorithms stems from recent advances in online learning algorithms. The algorithms are simple to implement and are also time and memory efficient. We provide a unified analysis of the family of algorithms in the mistake bound model. We then discuss experiments with the proposed family of topic-ranking algorithms on the Reuters-21578 corpus and the new corpus released by Reuters in 2000. On both corpora, the algorithms we present achieve state-of-the-art results and outperform topic-ranking adaptations of Rocchio's algorithm and of the Perceptron algorithm.

Proceedings Article
01 Jan 2003
TL;DR: Studies conducted by the two participating groups compared a search engine using automatic topic distillation features with the same engine with those features disabled, to determine whether the automatic topic distillation features assisted users in performing their tasks and whether humans could achieve better results than the automatic system.
Abstract: The TREC 2003 web track consisted of both a non-interactive stream and an interactive stream. Both streams worked with the .GOV test collection. The non-interactive stream continued an investigation into the importance of homepages in Web ranking, via both a Topic Distillation task and a Navigational task. In the topic distillation task, systems were expected to return a list of the homepages of sites relevant to each of a series of broad queries. This differs from previous homepage experiments in that queries may have multiple correct answers. The navigational task required systems to return a particular desired web page as early as possible in the ranking in response to queries. In half of the queries, the target answer was the homepage of a site and the query was derived from the name of the site (Homepage finding) while in the other half, the target answers were not homepages and the queries were derived from the name of the page (Named page finding). The two types of query were arbitrarily mixed and not identified. The interactive stream focused on human participation in a topic distillation task over the .GOV collection. Studies conducted by the two participating groups compared a search engine using automatic topic distillation features with the same engine with those features disabled in order to determine whether the automatic topic distillation features assisted the users in the performance of their tasks and whether humans could achieve better results than the automatic system.

Journal ArticleDOI
TL;DR: In this paper, the problem of finding the simulated system with the best (maximum or minimum) expected performance when the number of systems is large and initial samples from each system have already been taken is addressed.
Abstract: In this paper we address the problem of finding the simulated system with the best (maximum or minimum) expected performance when the number of systems is large and initial samples from each system have already been taken. This problem may be encountered when a heuristic search procedure--perhaps one originally designed for use in a deterministic environment--has been applied in a simulation-optimization context. Because of stochastic variation, the system with the best sample mean at the end of the search procedure may not coincide with the true best system encountered during the search. This paper develops statistical procedures that return the best system encountered by the search (or one near the best) with a prespecified probability. We approach this problem using combinations of statistical subset selection and indifference-zone ranking procedures. The subset-selection procedures, which use only the data already collected, screen out the obviously inferior systems, while the indifference-zone procedures, which require additional simulation effort, distinguish the best from the less obviously inferior systems.

Proceedings ArticleDOI
20 May 2003
TL;DR: This work formalizes general properties a matchmaker should have, then it presents a matchmaking facilitator, compliant with desired properties, that embeds a NeoClassic reasoner, whose structural subsumption algorithm has been modified to allow match categorization into potential and partial, and ranking of matches within categories.
Abstract: More and more resources are becoming available on the Web, and there is a growing need for infrastructures that, based on advertised descriptions, are able to semantically match demands with supplies. We formalize general properties a matchmaker should have, then we present a matchmaking facilitator compliant with the desired properties. The system embeds a NeoClassic reasoner, whose structural subsumption algorithm has been modified to allow match categorization into potential and partial, and ranking of matches within categories. Experiments carried out show a good correspondence between user and system rankings.

Proceedings ArticleDOI
03 Nov 2003
TL;DR: This work proposes a new method of obtaining expansion terms, based on selecting terms from past user queries that are associated with documents in the collection, that is effective for query expansion for web retrieval.
Abstract: Hundreds of millions of users each day use web search engines to meet their information needs. Advances in web search effectiveness are therefore perhaps the most significant public outcomes of IR research. Query expansion is one such method for improving the effectiveness of ranked retrieval by adding additional terms to a query. In previous approaches to query expansion, the additional terms are selected from highly ranked documents returned from an initial retrieval run. We propose a new method of obtaining expansion terms, based on selecting terms from past user queries that are associated with documents in the collection. Our scheme is effective for query expansion for web retrieval: our results show relative improvements over unexpanded full text retrieval of 26%--29%, and 18%--20% over an optimised, conventional expansion approach.


Proceedings ArticleDOI
28 Jul 2003
TL;DR: Results show that the technique tested elicits longer queries than a standard query elicitation technique, that it is indeed usable, that it increases user satisfaction with the search, and that query length is positively correlated with user satisfaction with the search.
Abstract: Query length in best-match information retrieval (IR) systems is well known to be positively related to effectiveness in the IR task, when measured in experimental, non-interactive environments. However, in operational, interactive IR systems, query length is quite typically very short, on the order of two to three words. We report on a study which tested the effectiveness of a particular query elicitation technique in increasing initial searcher query length, and which tested the effectiveness of queries elicited using this technique, and the relationship in general between query length and search effectiveness in interactive IR. Results show that the specific technique results in longer queries than a standard query elicitation technique, that this technique is indeed usable, that the technique results in increased user satisfaction with the search, and that query length is positively correlated with user satisfaction with the search.

Journal ArticleDOI
TL;DR: Citation analysis, which provides a clear picture of actual use of journals and their articles, is an effective way to determine a journal's influence.
Abstract: Citation analysis, which provides a clear picture of actual use of journals and their articles, is an effective way to determine a journal's influence.

Book ChapterDOI
09 Sep 2003
TL;DR: This work studies pruning techniques for query execution in large engines in the case where there is a global ranking of pages, as provided by PageRank or any other method, in addition to the standard term-based approach, and shows that there is significant potential benefit in such techniques.
Abstract: Large web search engines have to answer thousands of queries per second with interactive response times. A major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. To address this issue, IR and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. Over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. We focus on the question of how such techniques can be efficiently integrated into query processing. In particular, we study pruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by PageRank or any other method, in addition to the standard term-based approach. We describe pruning schemes for this case and evaluate their efficiency on an experimental cluster-based search engine with million web pages. Our results show that there is significant potential benefit in such techniques.