Journal ArticleDOI

Mining Concept Sequences from Large-Scale Search Logs for Context-Aware Query Suggestion

TL;DR: The experimental results clearly show that the novel context-aware query suggestion approach outperforms three baseline methods in both coverage and quality of suggestions.
Abstract: Query suggestion plays an important role in improving usability of search engines. Although some recently proposed methods provide query suggestions by mining query patterns from search logs, none of them models the immediately preceding queries as context systematically, and uses context information effectively in query suggestions. Context-aware query suggestion is challenging in both modeling context and scaling up query suggestion using context. In this article, we propose a novel context-aware query suggestion approach. To tackle the challenges, our approach consists of two stages. In the first, offline model-learning stage, to address data sparseness, queries are summarized into concepts by clustering a click-through bipartite. A concept sequence suffix tree is then constructed from session data as a context-aware query suggestion model. In the second, online query suggestion stage, a user’s search context is captured by mapping the query sequence submitted by the user to a sequence of concepts. By looking up the context in the concept sequence suffix tree, we suggest context-aware queries to the user. We test our approach on large-scale search logs of a commercial search engine containing 4.0 billion Web queries, 5.9 billion clicks, and 1.87 billion search sessions. The experimental results clearly show that our approach outperforms three baseline methods in both coverage and quality of suggestions.
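The two-stage pipeline described above can be sketched as follows. This is a toy illustration only: the concept assignments, the "suffix tree" (flattened here into a plain lookup table), and all queries are invented stand-ins for what the offline stage would learn from real logs.

```python
# Online stage sketch: map a user's query sequence to concept IDs,
# then look up the longest matching context suffix in a table that
# the offline stage would have built from session data.
query_to_concept = {"jaguar price": "C_car", "jaguar dealer": "C_car",
                    "jaguar habitat": "C_animal"}

# Concept-sequence contexts -> suggested queries (invented examples).
suggestion_model = {
    ("C_car",): ["jaguar xf review", "used jaguar for sale"],
    ("C_animal",): ["jaguar diet", "jaguar endangered status"],
}

def suggest(query_sequence):
    concepts = tuple(query_to_concept.get(q) for q in query_sequence)
    # Back off from the longest context suffix to shorter ones.
    for start in range(len(concepts)):
        hit = suggestion_model.get(concepts[start:])
        if hit:
            return hit
    return []
```

Mapping queries to concepts rather than matching raw query strings is what addresses data sparseness: many distinct queries share a concept, so contexts recur often enough to mine.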


Citations
Proceedings ArticleDOI
Milad Shokouhi1
28 Jul 2013
TL;DR: The results suggest that supervised rankers enhanced by personalization features can significantly outperform the existing popularity-based baselines, improving mean reciprocal rank (MRR) by up to 9%.
Abstract: Query auto-completion (QAC) is one of the most prominent features of modern search engines. The list of query candidates is generated according to the prefix entered by the user in the search box and is updated on each new keystroke. Query prefixes tend to be short and ambiguous, and existing models mostly rely on the past popularity of matching candidates for ranking. However, the popularity of certain queries may vary drastically across different demographics and users. For instance, while instagram and imdb have comparable popularities overall and are both legitimate candidates to show for prefix i, the former is noticeably more popular among young female users, and the latter is more likely to be issued by men. In this paper, we present a supervised framework for personalizing auto-completion ranking. We introduce a novel labelling strategy for generating offline training labels that can be used for learning personalized rankers. We compare the effectiveness of several user-specific and demographic-based features and show that among them, the user's long-term search history and location are the most effective for personalizing auto-completion rankers. We perform our experiments on the publicly available AOL query logs, and also on the larger-scale logs of Bing. The results suggest that supervised rankers enhanced by personalization features can significantly outperform the existing popularity-based baselines, improving mean reciprocal rank (MRR) by up to 9%.
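Mean reciprocal rank, the evaluation metric quoted above, is simple to compute: for each test case, take the reciprocal of the rank at which the correct completion first appears (0 if it is absent), then average. The candidate lists below are illustrative only.

```python
# MRR over a set of ranked candidate lists with known correct answers.
def mean_reciprocal_rank(ranked_lists, correct):
    total = 0.0
    for candidates, truth in zip(ranked_lists, correct):
        for rank, cand in enumerate(candidates, start=1):
            if cand == truth:
                total += 1.0 / rank  # reciprocal of first correct rank
                break                # absent answers contribute 0
    return total / len(ranked_lists)

# Truth at rank 1 in the first case and rank 2 in the second,
# so MRR = (1 + 0.5) / 2 = 0.75.
mrr = mean_reciprocal_rank(
    [["instagram", "imdb"], ["imdb", "instagram"]],
    ["instagram", "instagram"])
```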

195 citations

Book
Hang Li1, Jun Xu1
20 Jun 2014
TL;DR: This survey gives a systematic and detailed introduction to newly developed machine learning technologies for query document matching (semantic matching) in search, particularly web search, and focuses on the fundamental problems, as well as the state-of-the-art solutions.
Abstract: Relevance is the most important factor to assure users' satisfaction in search, and the success of a search engine heavily depends on its performance on relevance. It has been observed that most of the dissatisfaction cases in relevance are due to term mismatch between queries and documents (e.g., query "NY times" does not match well with a document only containing "New York Times"), because term matching, i.e., the bag-of-words approach, still functions as the main mechanism of modern search engines. It is no exaggeration to say, therefore, that mismatch between query and document poses the most critical challenge in search. Ideally, one would like to see query and document match with each other if they are topically relevant. Recently, researchers have expended significant effort to address the problem. The major approach is to conduct semantic matching, i.e., to perform deeper query and document understanding to represent their meanings, and then perform better matching between the enriched query and document representations. With the availability of large amounts of log data and advanced machine learning techniques, this becomes more feasible, and significant progress has been made recently. This survey gives a systematic and detailed introduction to newly developed machine learning technologies for query document matching (semantic matching) in search, particularly web search. It focuses on the fundamental problems, as well as the state-of-the-art solutions, of query document matching on the form, phrase, word sense, topic, and structure aspects. The ideas and solutions explained may motivate industrial practitioners to turn the research results into products. The methods introduced and the discussions made may also stimulate academic researchers to find new research directions and approaches.
Matching between query and document is not limited to search and similar problems can be found in question answering, online advertising, cross-language information retrieval, machine translation, recommender systems, link prediction, image annotation, drug design, and other applications, as the general task of matching between objects from two different spaces. The technologies introduced can be generalized into more general machine learning techniques, which is referred to as learning to match in this survey.

179 citations

Book
13 Sep 2016
TL;DR: A Survey of Query Auto Completion in Information Retrieval is an ideal reference on the topic and provides researchers who are working on query auto completion or related problems in the field of information retrieval with a good overview and analysis of state-of-the-art QAC approaches.
Abstract: In information retrieval, query auto completion (QAC), also known as type-ahead and auto-complete suggestion, refers to the following functionality: given a prefix consisting of a number of characters entered into a search box, the user interface proposes alternative ways of extending the prefix to a full query. QAC helps users to formulate their query when they have an intent in mind but not a clear way of expressing this in a query. It helps to avoid possible spelling mistakes, especially on devices with small screens. It saves keystrokes and cuts down the search duration of users, which implies a lower load on the search engine and results in savings in machine resources and maintenance. Because of the clear benefits of QAC, a considerable number of algorithmic approaches to QAC have been proposed in the past few years. Query logs have proven to be a key asset underlying most of the recent research. This monograph surveys this research. It focuses on summarizing the literature on QAC and provides a general understanding of the wealth of QAC approaches that are currently available. A Survey of Query Auto Completion in Information Retrieval is an ideal reference on the topic. Its contributions can be summarized as follows: It provides researchers who are working on query auto completion or related problems in the field of information retrieval with a good overview and analysis of state-of-the-art QAC approaches. In particular, for researchers new to the field, the survey can serve as an introduction to the state-of-the-art. It also offers a comprehensive perspective on QAC approaches by presenting a taxonomy of existing solutions. In addition, it presents solutions for QAC under different conditions such as available high-resolution query logs, in-depth user interactions with QAC using eye-tracking, and elaborate user engagements in a QAC process. It also discusses practical issues related to QAC.
Lastly, it presents a detailed discussion of core challenges and promising open directions in QAC.
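The popularity-based ranking that the survey treats as the standard QAC baseline can be sketched in a few lines: match the prefix against a sorted list of logged queries and rank completions by past frequency. The toy query log below is invented for illustration.

```python
import bisect

# Invented query log: query -> past frequency.
query_log = {"new york": 120, "new york times": 80, "netflix": 300,
             "news": 150}

sorted_queries = sorted(query_log)

def complete(prefix, k=3):
    # Binary-search to the first query >= prefix, then scan while
    # queries still start with the prefix.
    lo = bisect.bisect_left(sorted_queries, prefix)
    matches = []
    for q in sorted_queries[lo:]:
        if not q.startswith(prefix):
            break
        matches.append(q)
    # Rank matching completions by popularity, most frequent first.
    return sorted(matches, key=lambda q: -query_log[q])[:k]
```

Real systems use a trie or a precomputed top-k per prefix instead of a scan, but the ranking signal (past popularity) is the same one the personalized approaches above set out to beat.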

143 citations


Cites background or methods from "Mining Concept Sequences from Large..."

  • ...But exploiting the user’s personal context has led to increases in QAC effectiveness [Bar-Yossef and Kraus, 2011, Liao et al., 2011, Santos et al., 2013, Shokouhi, 2013]....

  • ...This use of user reformulation behavior during query sessions results in significant improvements over the performance of state-of-the-art context-aware query completion or suggestion approaches [He et al., 2009, Bar-Yossef and Kraus, 2011, Liao et al., 2011]....

Proceedings ArticleDOI
03 Jul 2014
TL;DR: The feasibility of exploiting the context to learn user reformulation behavior for boosting prediction performance is investigated and a supervised approach to query auto-completion is proposed, where three kinds of reformulation-related features are considered, including term-level, query-level and session-level features.
Abstract: It is crucial for query auto-completion to accurately predict what a user is typing. Given a query prefix and its context (e.g., previous queries), conventional context-aware approaches often produce queries relevant to the context. The purpose of this paper is to investigate the feasibility of exploiting the context to learn user reformulation behavior for boosting prediction performance. We first conduct an in-depth analysis of how the users reformulate their queries. Based on the analysis, we propose a supervised approach to query auto-completion, where three kinds of reformulation-related features are considered, including term-level, query-level and session-level features. These features carefully capture how the users change preceding queries along the query sessions. Extensive experiments have been conducted on the large-scale query log of a commercial search engine. The experimental results demonstrate a significant improvement over four competitive baselines.
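The term-level reformulation features this paper describes can be illustrated by comparing a candidate completion against the previous query in the session. The feature names below are our own invention, not the paper's feature set.

```python
# Hypothetical term-level reformulation features for a (previous
# query, candidate completion) pair within one session.
def reformulation_features(prev_query, candidate):
    prev, cand = set(prev_query.split()), set(candidate.split())
    common = prev & cand
    return {
        "terms_kept": len(common),          # terms carried over
        "terms_added": len(cand - prev),    # terms newly introduced
        "terms_removed": len(prev - cand),  # terms dropped
        "is_specialization": prev < cand,   # strictly adds terms
    }
```

For example, the reformulation "colon cancer" → "colon cancer symptoms" keeps two terms, adds one, removes none, and is a specialization; a supervised ranker can learn that such narrowing moves are common and rank candidates accordingly.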

107 citations


Cites background or methods from "Mining Concept Sequences from Large..."

  • ...The work [25] further clusters relevant queries with common click-through data into concepts....

  • ...CACB proposed in [25] is a concept-based context-aware query suggestion method....

  • ...The idea is reasonable since disambiguating the user’s search intent by context has been studied extensively in query suggestion [5,9,16,25,26]....

  • ...[25] grouped queries into various concepts....

  • ...Many relevant queries are generated from conventional context-aware methods such as colon cancer symptoms (from [2]), celiac disease (from [16]), and colon cancer (from [25]) in our experiments....

Proceedings ArticleDOI
Bhaskar Mitra1
09 Aug 2015
TL;DR: This paper studies the distributed representation of queries learnt by deep neural network models, such as the Convolutional Latent Semantic Model, and shows that they can be used to represent query reformulations as vectors, and demonstrates that short-term (session) history context features based on these two representations improve the mean reciprocal rank for the QAC ranking task by more than 10% over a supervised ranker baseline.
Abstract: Search logs contain examples of frequently occurring patterns of user reformulations of queries. Intuitively, the reformulation "San Francisco" -- "San Francisco 49ers" is semantically similar to "Detroit" -- "Detroit Lions". Likewise, "London" -- "things to do in London" and "New York" -- "New York tourist attractions" can also be considered similar transitions in intent. The reformulations "movies" -- "new movies" and "york" -- "New York", however, are clearly different despite the lexical similarities between them. In this paper, we study the distributed representation of queries learnt by deep neural network models, such as the Convolutional Latent Semantic Model, and show that they can be used to represent query reformulations as vectors. These reformulation vectors exhibit favourable properties such as mapping semantically and syntactically similar query changes closer in the embedding space. Our work is motivated by the success of continuous space language models in capturing relationships between words and their meanings using offset vectors. We demonstrate a way to extend the same intuition to represent query reformulations. Furthermore, we show that the distributed representations of queries and reformulations are both useful for modelling session context for query prediction tasks, such as for query auto-completion (QAC) ranking. Our empirical study demonstrates that short-term (session) history context features based on these two representations improve the mean reciprocal rank (MRR) for the QAC ranking task by more than 10% over a supervised ranker baseline. Our results also show that by using features based on both these representations together we achieve a better performance than either of them individually.
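The offset-vector idea above can be made concrete in a few lines: represent the reformulation q1 → q2 as vec(q2) − vec(q1) and compare two reformulations by the cosine similarity of their offsets. The 3-dimensional "embeddings" below are hand-made stand-ins for vectors a model such as the CLSM would actually learn.

```python
import numpy as np

# Invented toy embeddings; a real model would learn these from logs.
vec = {
    "san francisco":       np.array([1.0, 0.0, 0.0]),
    "san francisco 49ers": np.array([1.0, 1.0, 0.0]),
    "detroit":             np.array([0.0, 0.0, 1.0]),
    "detroit lions":       np.array([0.0, 1.0, 1.0]),
}

def offset(q1, q2):
    # A reformulation is the difference of the two query vectors.
    return vec[q2] - vec[q1]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Both reformulations add the same "team" direction, so their
# offsets point the same way and the similarity is high.
sim = cosine(offset("san francisco", "san francisco 49ers"),
             offset("detroit", "detroit lions"))
```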

85 citations


Cites methods from "Mining Concept Sequences from Large..."

  • ...[27] have explored session context using latent concept clusters from click-through bipartite graphs, while Guo et al....

References
Journal ArticleDOI
Jeffrey Dean1, Sanjay Ghemawat1
06 Dec 2004
TL;DR: This paper presents MapReduce, a programming model and associated implementation for processing and generating large data sets; the implementation runs on large clusters of commodity machines and is highly scalable.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
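The map/reduce contract described above can be simulated in-process with the canonical word-count example: a map function emits (word, 1) pairs, the pairs are grouped by key (the "shuffle" that the real runtime does across machines), and a reduce function sums each group. This sketch of course omits everything that makes MapReduce interesting at scale (partitioning, scheduling, fault tolerance).

```python
from collections import defaultdict
from itertools import chain

def map_fn(line):
    # Emit an intermediate (key, value) pair per word.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Merge all values that share an intermediate key.
    return word, sum(counts)

def mapreduce(lines):
    groups = defaultdict(list)
    # Shuffle/group step: collect values by intermediate key.
    for key, value in chain.from_iterable(map(map_fn, lines)):
        groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

counts = mapreduce(["the quick fox", "the lazy dog"])
```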

20,309 citations

Journal ArticleDOI
Jeffrey Dean1, Sanjay Ghemawat1
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.

17,663 citations

Proceedings Article
02 Aug 1996
TL;DR: In this paper, a density-based notion of clusters is proposed to discover clusters of arbitrary shape, which can be used for class identification in large spatial databases and is shown to be more efficient than the well-known algorithm CLARANS.
Abstract: Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases raises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data of the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that (1) DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and that (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.
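The density-based idea behind DBSCAN can be sketched as follows. This is a simplified, unoptimized version for illustration: points with at least min_pts neighbours within eps become cores, and clusters grow by expanding from cores; the published algorithm additionally uses spatial indexing for efficiency.

```python
import math

def dbscan(points, eps, min_pts):
    def neighbours(i):
        # eps-neighbourhood, including the point itself.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)      # None = unvisited, -1 = noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1             # noise (may become a border point)
            continue
        cluster += 1
        labels[i] = cluster
        while seeds:                   # expand the cluster from cores
            j = seeds.pop()
            if labels[j] in (None, -1):
                labels[j] = cluster
                nb = neighbours(j)
                if len(nb) >= min_pts:
                    seeds.extend(nb)   # j is a core: keep expanding
    return labels
```

On two tight groups plus one far-away point, this assigns each group its own cluster label and marks the outlier as noise, with no need to specify the number of clusters in advance.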

17,056 citations



"Mining Concept Sequences from Large..." refers methods in this paper

  • ...Wen et al. [2001] combined query content information and click-through information and applied a density-based method, DBSCAN [Ester et al. 1996], to cluster queries....


Journal ArticleDOI
TL;DR: A new graphical display is proposed for partitioning techniques, where each cluster is represented by a so-called silhouette, which is based on the comparison of its tightness and separation, and provides an evaluation of clustering validity.
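The silhouette measure described above can be computed directly from its definition: for each point, let a be its mean distance to the other members of its own cluster, b the lowest mean distance to any other cluster, and its silhouette s = (b − a) / max(a, b); averaging s over all points scores the clustering. The points and labels below are toy data.

```python
import math

def silhouette(points, labels):
    scores = []
    for i, p in enumerate(points):
        same = [q for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:                    # singleton cluster: score 0
            scores.append(0.0)
            continue
        # a: mean distance to own cluster (tightness).
        a = sum(math.dist(p, q) for q in same) / len(same)
        # b: lowest mean distance to another cluster (separation).
        b = min(
            sum(math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == lab) / labels.count(lab)
            for lab in set(labels) if lab != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Values near +1 indicate tight, well-separated clusters and negative values indicate likely misassignments, which is why the cited article uses it to compare cluster quality before and after the split-merge process.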

14,144 citations


"Mining Concept Sequences from Large..." refers methods in this paper

  • ...To evaluate the effectiveness of the split-merge process, we compare the silhouette coefficient [Rousseeuw 1987] of the clusters before and after the split-merge process....
