Proceedings ArticleDOI

Optimizing web search using web click-through data

13 Nov 2004, pp. 118-126
TL;DR: A novel iterative reinforced algorithm that utilizes user click-through data to improve search performance; it fully explores the interrelations between queries and web pages, effectively finds "virtual queries" for web pages, and overcomes the challenges of noise, incompleteness, sparseness, and volatility.
Abstract: The performance of web search engines may often deteriorate due to the diversity and noisy information contained within web pages. User click-through data can be used to introduce more accurate descriptions (metadata) for web pages and to improve search performance. However, noise and incompleteness, sparseness, and the volatility of web pages and queries are three major challenges for research work on user click-through log mining. In this paper, we propose a novel iterative reinforced algorithm that utilizes user click-through data to improve search performance. The algorithm fully explores the interrelations between queries and web pages, effectively finds "virtual queries" for web pages, and overcomes the challenges discussed above. Experimental results on a large set of MSN click-through log data show a significant improvement in search performance over the naive query-log mining algorithm as well as the baseline search engine.
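The abstract names the approach but not its mechanics. The following is a minimal sketch of how iterative reinforcement over the query-page click graph can propagate "virtual queries" to pages that were never clicked for them directly; the matrix setup, blending weight, and iteration count are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

# Toy click matrix: C[i, j] = clicks on page j for query i (3 queries, 4 pages).
C = np.array([
    [5.0, 1.0, 0.0, 0.0],
    [0.0, 3.0, 2.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
])

Q2P = C / C.sum(axis=1, keepdims=True)        # query -> page transition weights
P2Q = C.T / C.T.sum(axis=1, keepdims=True)    # page -> query transition weights

# Iteratively reinforce: propagate query descriptions to pages and back.
# A[i, j] is the strength with which query i describes page j; after a few
# rounds a page gains weight for "virtual queries" it was never clicked for.
alpha = 0.5  # illustrative blend of direct and propagated evidence
A = Q2P.copy()
for _ in range(20):
    A = (1 - alpha) * Q2P + alpha * (Q2P @ P2Q @ A)

print(np.round(A, 3))
```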
Citations
Proceedings ArticleDOI
06 Aug 2006
TL;DR: In this paper, the authors show that incorporating implicit feedback can augment other features, improving the accuracy of a competitive web search ranking algorithm by as much as 31% relative to the original performance.
Abstract: We show that incorporating user behavior data can significantly improve the ordering of top results in a real web search setting. We examine alternatives for incorporating feedback into the ranking process and explore the contributions of user feedback compared to other common web search features. We report results of a large-scale evaluation over 3,000 queries and 12 million user interactions with a popular web search engine. We show that incorporating implicit feedback can augment other features, improving the accuracy of a competitive web search ranking algorithm by as much as 31% relative to the original performance.
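As a rough illustration of incorporating implicit feedback into ranking, the sketch below combines a content score with click-derived features in a learned linear model. The feature names (click_rate, dwell_time_s, skip_rate), the toy data, and the use of logistic regression are assumptions for illustration; the paper's actual feature set and ranker differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-(query, document) training rows: a base ranker score plus
# click-derived behavior features, with editorial relevance as the label.
X = np.array([
    # [bm25, click_rate, dwell_time_s, skip_rate]
    [12.1, 0.40, 35.0, 0.10],
    [10.3, 0.05,  4.0, 0.60],
    [ 8.7, 0.30, 50.0, 0.05],
    [11.0, 0.02,  2.0, 0.70],
])
y = np.array([1, 0, 1, 0])  # 1 = relevant, 0 = not relevant

# A simple learned combination of content and behavior features.
model = LogisticRegression().fit(X, y)
scores = model.predict_proba(X)[:, 1]  # re-ranking scores
print(scores.argsort()[::-1])          # documents ordered by predicted relevance
```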

1,119 citations

Journal ArticleDOI
TL;DR: This survey presents a unified view of a large number of recent approaches to AQE that leverage various data sources and employ very different principles and techniques.
Abstract: The relative ineffectiveness of information retrieval systems is largely caused by the inaccuracy with which a query formed by a few keywords models the actual user information need. One well-known method to overcome this limitation is automatic query expansion (AQE), whereby the user’s original query is augmented by new features with a similar meaning. AQE has a long history in the information retrieval community, but it is only in recent years that it has reached a level of scientific and experimental maturity, especially in laboratory settings such as TREC. This survey presents a unified view of a large number of recent approaches to AQE that leverage various data sources and employ very different principles and techniques. The following questions are addressed. Why is query expansion so important for improving search effectiveness? What are the main steps involved in the design and implementation of an AQE component? What approaches to AQE are available and how do they compare? Which issues must still be resolved before AQE becomes a standard component of large operational information retrieval systems (e.g., search engines)?
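One classic family of AQE techniques covered by such surveys is pseudo-relevance feedback: expand the query with terms that dominate the top-ranked documents for the original query. A minimal sketch, with an illustrative toy corpus, tokenization, and cutoff:

```python
from collections import Counter

# Top-ranked documents for the original query (illustrative toy data).
top_docs = [
    "cheap flights airline tickets booking",
    "airline tickets low cost flights deals",
    "flights booking deals travel",
]
query = ["cheap", "flights"]

# Add the most frequent non-query terms from the feedback documents.
counts = Counter(t for doc in top_docs for t in doc.split() if t not in query)
expansion = [t for t, _ in counts.most_common(3)]
print(query + expansion)  # ['cheap', 'flights', 'airline', 'tickets', 'booking']
```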

1,058 citations


Cites background from "Optimizing web search using web click-through data"

  • ...Other problems with their use for AQE are caused by noise, incompleteness, sparseness, and the volatility of Web pages and queries [Xue et al. 2004]....

Proceedings ArticleDOI
Shenghua Bao, Gui-Rong Xue, Xiaoyuan Wu, Yong Yu, Ben Fei, Zhong Su
08 May 2007
TL;DR: Preliminary experimental results show that SSR can find the latent semantic association between queries and annotations, while SPR successfully measures the quality of a webpage from the web users' perspective.
Abstract: This paper explores the use of social annotations to improve web search. Nowadays, many services, e.g. del.icio.us, have been developed for web users to organize and share their favorite webpages online by using social annotations. We observe that the social annotations can benefit web search in two aspects: 1) the annotations are usually good summaries of corresponding webpages; 2) the count of annotations indicates the popularity of webpages. Two novel algorithms are proposed to incorporate the above information into page ranking: 1) SocialSimRank (SSR) calculates the similarity between social annotations and web queries; 2) SocialPageRank (SPR) captures the popularity of webpages. Preliminary experimental results show that SSR can find the latent semantic association between queries and annotations, while SPR successfully measures the quality (popularity) of a webpage from the web users' perspective. We further evaluate the proposed methods empirically with 50 manually constructed queries and 3,000 auto-generated queries on a dataset crawled from del.icio.us. Experiments show that both SSR and SPR benefit web search significantly.
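A minimal sketch of the SimRank-style propagation that SocialSimRank builds on: annotations are similar if they tag similar pages, and pages are similar if they carry similar annotations. The damping factor, normalization, and toy data are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

# Toy annotation matrix: M[a, p] = number of users who tagged page p with
# annotation a. Queries are matched against annotations, so query-page
# similarity reduces to annotation-page association.
M = np.array([
    [3.0, 1.0, 0.0],   # "search"
    [2.0, 0.0, 2.0],   # "engine"
    [0.0, 4.0, 1.0],   # "recipes"
])

A2P = M / M.sum(axis=1, keepdims=True)        # annotation -> page weights
P2A = M.T / M.T.sum(axis=1, keepdims=True)    # page -> annotation weights

# SimRank-style iteration: two annotations are similar if they tag similar
# pages; two pages are similar if they carry similar annotations.
C = 0.8                        # damping factor (illustrative)
Sa = np.eye(3)                 # annotation-annotation similarity
Sp = np.eye(3)                 # page-page similarity
for _ in range(10):
    Sa = C * (A2P @ Sp @ A2P.T); np.fill_diagonal(Sa, 1.0)
    Sp = C * (P2A @ Sa @ P2A.T); np.fill_diagonal(Sp, 1.0)

print(np.round(Sa, 3))  # latent similarity between annotations/query terms
```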

604 citations


Cites methods from "Optimizing web search using web click-through data"

  • ...PageRank [17]). Meanwhile, the interaction log of search engine users also benefits web search by providing the click-through data, which can be used in both similarity ranking (e.g. IA [10]) and static ranking (e.g....

  • ...In modern search engines, several methods have been proposed to find new information as additional metadata to enhance the performance of similarity ranking, e.g., document title [37], anchor text [21, 28, 34], and users’ query logs [10]....

  • ...State-of-the-art techniques include anchor text generation [21, 28, 34], metadata extraction [37], link analysis [34], and search log mining [10]; 2) ordering the web pages according to their qualities....

Proceedings ArticleDOI
Nick Craswell, Martin Szummer
23 Jul 2007
TL;DR: A Markov random walk model is applied to a large click log, producing a probabilistic ranking of documents for a given query, demonstrating its ability to retrieve relevant documents that have not yet been clicked for that query and rank those effectively.
Abstract: Search engines can record which documents were clicked for which query, and use these query-document pairs as "soft" relevance judgments. However, compared to the true judgments, click logs give noisy and sparse relevance information. We apply a Markov random walk model to a large click log, producing a probabilistic ranking of documents for a given query. A key advantage of the model is its ability to retrieve relevant documents that have not yet been clicked for that query and rank those effectively. We conduct experiments on click logs from image search, comparing our ("backward") random walk model to a different ("forward") random walk, varying parameters such as walk length and self-transition probability. The most effective combination is a long backward walk with high self-transition probability.
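A minimal sketch of a lazy random walk over the query-document click graph follows. Note that the paper's "backward" variant conditions on the walk ending at the query rather than starting from it; the self-transition value and toy data here are illustrative.

```python
import numpy as np

# Bipartite click graph: C[q, d] = clicks on document d for query q.
C = np.array([
    [4.0, 1.0, 0.0],
    [0.0, 2.0, 3.0],
])
nq, nd = C.shape
n = nq + nd

# One transition matrix over query and document nodes, with a
# self-transition probability s (0.9 here is illustrative).
s = 0.9
W = np.zeros((n, n))
W[:nq, nq:] = C
W[nq:, :nq] = C.T
P = W / W.sum(axis=1, keepdims=True)      # normalize out-links
P = s * np.eye(n) + (1 - s) * P           # add self-transitions

# Walk from a query node for many steps: documents reachable through
# shared clicks receive probability even with no direct click for it.
v = np.zeros(n); v[0] = 1.0               # start at query 0
for _ in range(100):                      # long walk
    v = v @ P

print(np.round(v[nq:], 4))  # probability mass over documents for query 0
```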

519 citations

Proceedings ArticleDOI
01 Dec 2016
TL;DR: A Product-based Neural Network (PNN) with an embedding layer to learn a distributed representation of the categorical data, a product layer to capture interactive patterns between inter-field categories, and further fully connected layers to explore high-order feature interactions.
Abstract: Predicting user responses, such as clicks and conversions, is of great importance and has found its usage in many Web applications including recommender systems, web search and online advertising. The data in those applications is mostly categorical and contains multiple fields; a typical representation is to transform it into a high-dimensional sparse binary feature representation via one-hot encoding. Faced with this extreme sparsity, traditional models may limit their capacity of mining shallow patterns from the data, i.e. low-order feature combinations. Deep models like deep neural networks, on the other hand, cannot be directly applied to the high-dimensional input because of the huge feature space. In this paper, we propose a Product-based Neural Network (PNN) with an embedding layer to learn a distributed representation of the categorical data, a product layer to capture interactive patterns between inter-field categories, and further fully connected layers to explore high-order feature interactions. Our experimental results on two large-scale real-world ad click datasets demonstrate that PNNs consistently outperform the state-of-the-art models on various metrics.
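A minimal forward-pass sketch of the PNN idea: embeddings per categorical field, pairwise inner products between fields, then fully connected layers. The field sizes, dimensions, and initialization are illustrative; the published model has more machinery (including an outer-product variant) and is trained end to end.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 3 categorical fields, each value mapped to a d-dim embedding.
field_sizes, d = [4, 3, 5], 8
embeddings = [rng.normal(0, 0.1, (v, d)) for v in field_sizes]

def pnn_forward(sample, W1, b1, w2):
    """One forward pass: embeddings -> inner products -> MLP."""
    # Embedding layer: look up one vector per field.
    e = [embeddings[f][idx] for f, idx in enumerate(sample)]
    # Product layer: an inner product for every pair of fields captures
    # inter-field interactions explicitly.
    products = [e[i] @ e[j] for i in range(3) for j in range(i + 1, 3)]
    z = np.concatenate(e + [np.array(products)])
    # Fully connected layers explore higher-order interactions.
    h = np.maximum(0, W1 @ z + b1)          # ReLU hidden layer
    return 1 / (1 + np.exp(-(w2 @ h)))      # predicted click probability

in_dim = 3 * d + 3                           # 3 embeddings + 3 pairwise products
W1, b1 = rng.normal(0, 0.1, (16, in_dim)), np.zeros(16)
w2 = rng.normal(0, 0.1, 16)
print(pnn_forward([1, 2, 0], W1, b1, w2))    # e.g. a (user, ad, context) sample
```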

459 citations


Cites background from "Optimizing web search using web click-through data"

  • ...This predicted probability indicates the user’s interest in a specific item such as a news article, a commercial item or an advertising post, which influences subsequent decision making such as document ranking [2] and ad bidding [3]....

References
Journal ArticleDOI
01 Apr 1998
TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and looks at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
Abstract: In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
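The link-analysis component Google is best known for is PageRank, the static importance score computed from the hyperlink structure the abstract refers to. A compact power-iteration sketch over a toy link graph; the 0.85 damping factor is the commonly cited value.

```python
import numpy as np

# Toy link graph: page -> list of pages it links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n, d = 4, 0.85

M = np.zeros((n, n))
for src, outs in links.items():
    for dst in outs:
        M[dst, src] = 1.0 / len(outs)   # column-stochastic link matrix

r = np.full(n, 1.0 / n)
for _ in range(100):
    r = (1 - d) / n + d * (M @ r)       # random surfer with teleportation

print(np.round(r, 4))                   # importance scores, summing to ~1
```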

14,696 citations


"Optimizing web search using web cli..." refers background or methods in this paper

  • ...For example, Google’s search engine [20] takes the anchor text as its metadata to improve the performance of search....

  • ...For example, anchor text [14][20][21][22], the title, or the surrounding text of the anchor were used to enhance web search....

  • ...Previous research [14][20][22] shows that this method yields better search results than searching on Web page content alone....

Book
15 May 1999
TL;DR: A rigorous and complete textbook for a first course on information retrieval from the computer science (as opposed to a user-centred) perspective, providing an up-to-date, student-oriented treatment of the subject.
Abstract: From the Publisher: This is a rigorous and complete textbook for a first course on information retrieval from the computer science (as opposed to a user-centred) perspective. The advent of the Internet and the enormous increase in the volume of electronically stored information generally have led to substantial work on IR from the computer science perspective - this book provides an up-to-date, student-oriented treatment of the subject.

9,923 citations

Journal ArticleDOI
TL;DR: An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL, and performs slightly better than a much more elaborate system with which it has been compared.
Abstract: The automatic removal of suffixes from words in English is of particular interest in the field of information retrieval. An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL. Although simple, it performs slightly better than a much more elaborate system with which it has been compared. It effectively works by treating complex suffixes as compounds made up of simple suffixes, and removing the simple suffixes in a number of steps. In each step the removal of the suffix is made to depend upon the form of the remaining stem, which usually involves a measure of its syllable length.
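The full Porter algorithm applies five ordered rule steps; the toy sketch below illustrates only its central idea, that a suffix is stripped only when enough stem remains. The rule list and the simple length threshold standing in for Porter's syllable-based measure are simplifications, not the published algorithm.

```python
# Simplified illustration of measure-conditioned suffix stripping; a toy,
# not the full five-step Porter algorithm. Rules and threshold are illustrative.
RULES = [("ational", "ate"), ("tional", "tion"), ("sses", "ss"),
         ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(word, min_stem=3):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem_part = word[: len(word) - len(suffix)]
            # Strip only if enough stem remains (Porter instead conditions
            # on the measure of the stem, roughly its syllable count).
            if len(stem_part) >= min_stem:
                return stem_part + replacement
    return word

for w in ["relational", "caresses", "ponies", "running", "sing"]:
    print(w, "->", stem(w))   # note "sing" is left intact: the stem is too short
```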

7,572 citations

Proceedings ArticleDOI
23 Jul 2002
TL;DR: The goal of this paper is to develop a method that utilizes clickthrough data for training, namely the query-log of the search engine in connection with the log of links the users clicked on in the presented ranking.
Abstract: This paper presents an approach to automatically optimizing the retrieval quality of search engines using clickthrough data. Intuitively, a good information retrieval system should present relevant documents high in the ranking, with less relevant documents following below. While previous approaches to learning retrieval functions from examples exist, they typically require training data generated from relevance judgments by experts. This makes them difficult and expensive to apply. The goal of this paper is to develop a method that utilizes clickthrough data for training, namely the query-log of the search engine in connection with the log of links the users clicked on in the presented ranking. Such clickthrough data is available in abundance and can be recorded at very low cost. Taking a Support Vector Machine (SVM) approach, this paper presents a method for learning retrieval functions. From a theoretical perspective, this method is shown to be well-founded in a risk minimization framework. Furthermore, it is shown to be feasible even for large sets of queries and features. The theoretical results are verified in a controlled experiment. It shows that the method can effectively adapt the retrieval function of a meta-search engine to a particular group of users, outperforming Google in terms of retrieval quality after only a couple of hundred training examples.
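A minimal sketch of the pairwise idea behind this method: each click-derived preference ("the clicked document beats a higher-ranked document that was skipped") becomes a training example on feature differences for a linear SVM. The features and data here are illustrative, not the paper's experimental setup.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Click-derived preferences: (features of preferred doc, features of
# non-preferred doc) for the same query. Feature values are illustrative.
pairs = [
    (np.array([0.9, 0.2, 0.7]), np.array([0.4, 0.1, 0.3])),
    (np.array([0.8, 0.5, 0.6]), np.array([0.5, 0.4, 0.2])),
    (np.array([0.7, 0.9, 0.4]), np.array([0.3, 0.2, 0.1])),
]

# Each preference yields two training rows: (pref - nonpref, +1) and its
# mirror (nonpref - pref, -1), so the classes are balanced.
X = np.array([p - n for p, n in pairs] + [n - p for p, n in pairs])
y = np.array([1] * len(pairs) + [-1] * len(pairs))

svm = LinearSVC(C=1.0).fit(X, y)
w = svm.coef_[0]                   # learned retrieval function: score = w . x
docs = np.array([[0.6, 0.3, 0.5], [0.2, 0.8, 0.1]])
print((docs @ w).argsort()[::-1])  # rank documents by learned score
```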

4,453 citations


"Optimizing web search using web cli..." refers methods in this paper

  • ...Recently, Joachims [11] proposed a method of utilizing click-through data in learning a retrieval function (e.g., a meta-search function)....
