
Showing papers on "Web query classification published in 2009"


Journal ArticleDOI
TL;DR: As work in Web page classification is reviewed, the importance of these Web-specific features and algorithms are noted, state-of-the-art practices are described, and the underlying assumptions behind the use of information from neighboring pages are tracked.
Abstract: Classification of Web page content is essential to many tasks in Web information retrieval such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. As we review work in Web page classification, we note the importance of these Web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.

502 citations


Book ChapterDOI
06 Nov 2009
TL;DR: An approach to executing SPARQL queries over the Web of Linked Data is presented, using an iterator-based pipeline to discover data that might be relevant for answering a query during the query execution itself; an extension of the iterator paradigm is proposed to avoid blocking.
Abstract: The Web of Linked Data forms a single, globally distributed dataspace. Due to the openness of this dataspace, it is not possible to know in advance all data sources that might be relevant for query answering. This openness poses a new challenge that is not addressed by traditional research on federated query processing. In this paper we present an approach to execute SPARQL queries over the Web of Linked Data. The main idea of our approach is to discover data that might be relevant for answering a query during the query execution itself. This discovery is driven by following RDF links between data sources based on URIs in the query and in partial results. The URIs are resolved over the HTTP protocol into RDF data which is continuously added to the queried dataset. This paper describes concepts and algorithms to implement our approach using an iterator-based pipeline. We introduce a formalization of the pipelining approach and show that classical iterators may cause blocking due to the latency of HTTP requests. To avoid blocking, we propose an extension of the iterator paradigm. The evaluation of our approach shows its strengths as well as the still existing challenges.
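The link-traversal idea above can be sketched in a few lines. The following is a toy, synchronous version, assuming the "web" is a dictionary mapping URIs to triple sets and `dereference` stands in for an HTTP lookup; the paper's actual contribution, the non-blocking iterator pipeline, is not modeled here.

```python
def dereference(uri, web):
    """Simulate resolving a URI into a set of RDF triples (stub for an HTTP GET)."""
    return web.get(uri, set())

def match(pattern, dataset):
    """Yield triples matching a single (s, p, o) pattern; None marks a variable."""
    s, p, o = pattern
    for (ts, tp, to) in dataset:
        if (s is None or s == ts) and (p is None or p == tp) and (o is None or o == to):
            yield (ts, tp, to)

def traverse_and_query(seed_uri, pattern, web):
    """Discover data by following links (URIs) that appear in partial results."""
    dataset, frontier, seen, results = set(), [seed_uri], set(), set()
    while frontier:
        uri = frontier.pop()
        if uri in seen:
            continue
        seen.add(uri)
        dataset |= dereference(uri, web)   # add resolved RDF data to the queried dataset
        for triple in match(pattern, dataset):
            results.add(triple)
            for term in triple:            # follow RDF links found in results
                if term.startswith("http://") and term not in seen:
                    frontier.append(term)
    return results
```

Starting from a URI in the query, each resolved document can surface new URIs, so the queried dataset grows during execution exactly as the abstract describes.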

386 citations


Proceedings ArticleDOI
20 Apr 2009
TL;DR: This paper proposes a general methodology for the problem of query intent classification that can achieve much better coverage when classifying queries in an intent domain even though the number of seed intent examples is very small, and that can be easily applied to various intent domains.
Abstract: Understanding the intent behind a user's query can help a search engine automatically route the query to corresponding vertical search engines to obtain particularly relevant content, greatly improving user satisfaction. There are three major challenges to the query intent classification problem: (1) intent representation; (2) domain coverage; and (3) semantic interpretation. Current approaches to predicting the user's intent mainly utilize machine learning techniques. However, it is difficult, and often requires much human effort, to meet all these challenges with statistical machine learning approaches. In this paper, we propose a general methodology for the problem of query intent classification. With very little human effort, our method can discover large quantities of intent concepts by leveraging Wikipedia, one of the best human knowledge bases. The Wikipedia concepts are used as the intent representation space; thus, each intent domain is represented as a set of Wikipedia articles and categories. The intent of any input query is identified by mapping the query into the Wikipedia representation space. Compared with previous approaches, our proposed method can achieve much better coverage when classifying queries in an intent domain, even though the number of seed intent examples is very small. Moreover, the method is very general and can be easily applied to various intent domains. We demonstrate the effectiveness of this method in three different applications, i.e., travel, job, and person name. In each of the three cases, only a couple of seed intent queries are provided. We perform quantitative evaluations in comparison with two baseline methods, and the experimental results show that our method significantly outperforms the other methods in each intent domain.
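As a rough illustration of mapping a query into a concept representation space, here is a deliberately tiny sketch in which each intent domain is a hand-picked set of concept terms (the real system derives these from Wikipedia articles and categories) and classification is simple term overlap:

```python
# Illustrative only: concept sets are hand-picked assumptions, not mined from Wikipedia.
TRAVEL = {"flight", "hotel", "vacation", "airline", "resort"}
JOB    = {"salary", "hiring", "resume", "career", "vacancy"}

def classify_intent(query, domains):
    """Map a query into the concept space and pick the best-overlapping domain."""
    terms = set(query.lower().split())
    best, best_score = None, 0
    for name, concepts in domains.items():
        score = len(terms & concepts)    # naive overlap in place of concept mapping
        if score > best_score:
            best, best_score = name, score
    return best                          # None if no domain concept matches

domains = {"travel": TRAVEL, "job": JOB}
```

For example, `classify_intent("cheap flight and hotel deals", domains)` lands in the travel domain because two of its terms fall in that concept set.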

317 citations


Proceedings ArticleDOI
29 Mar 2009
TL;DR: This work addresses a novel spatial keyword query called the m-closest keywords (mCK) query, which aims to find the spatially closest tuples which match m user-specified keywords, and introduces a new index called the bR*-tree, which is an extension of the R-tree.
Abstract: This work addresses a novel spatial keyword query called the m-closest keywords (mCK) query. Given a database of spatial objects, each tuple is associated with some descriptive information represented in the form of keywords. The mCK query aims to find the spatially closest tuples which match m user-specified keywords. Given a set of keywords from a document, mCK query can be very useful in geotagging the document by comparing the keywords to other geotagged documents in a database. To answer mCK queries efficiently, we introduce a new index called the bR*-tree, which is an extension of the R*-tree. Based on bR*-tree, we exploit a priori-based search strategies to effectively reduce the search space. We also propose two monotone constraints, namely the distance mutex and keyword mutex, as our a priori properties to facilitate effective pruning. Our performance study demonstrates that our search strategy is indeed efficient in reducing query response time and demonstrates remarkable scalability in terms of the number of query keywords which is essential for our main application of searching by document.
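For intuition, a brute-force mCK baseline (no bR*-tree, no pruning) can be written directly: enumerate keyword-covering tuple sets and keep the one with the smallest diameter. The exponential cost of this enumeration is precisely what the paper's a priori-style pruning avoids; all names below are illustrative.

```python
from itertools import combinations
from math import dist

def mck(objects, keywords):
    """objects: list of ((x, y), keyword). Return the spatially closest set of
    tuples that together match all m user-specified keywords, plus its diameter."""
    m = len(keywords)
    candidates = [o for o in objects if o[1] in keywords]
    best, best_d = None, float("inf")
    for group in combinations(candidates, m):
        if {kw for _, kw in group} != set(keywords):
            continue                     # the group must cover every query keyword
        diameter = max((dist(a, b) for (a, _), (b, _) in combinations(group, 2)),
                       default=0.0)
        if diameter < best_d:
            best, best_d = group, diameter
    return best, best_d
```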

298 citations


Proceedings Article
01 Nov 2009
TL;DR: An overview of the TREC 2009 Web Track is provided, including topic development, evaluation measures, and results, including both a traditional ad hoc retrieval task and a new diversity task.
Abstract: The TREC Web Track explores and evaluates Web retrieval technologies. Currently, the Web Track conducts experiments using the new billion-page ClueWeb09 collection. The TREC 2009 Web Track includes both a traditional ad hoc retrieval task and a new diversity task. The goal of this diversity task is to return a ranked list of pages that together provide complete coverage for a query, while avoiding excessive redundancy in the result list. Topics for the track were created from the logs of a commercial search engine, with the aid of tools developed at Microsoft Research. Given a target query, these tools extracted and analyzed groups of related queries, using co-clicks and other information, to identify clusters of queries that highlight different aspects and interpretations of the target query. These clusters were employed by NIST for topic development. Each resulting topic is structured as a representative set of subtopics, each related to a different user need. Documents were judged with respect to the subtopics, as well as with respect to the topic as a whole. For each subtopic, NIST assessors made a binary judgment as to whether or not the document satisfies the information need associated with the subtopic. These topics were used for both the ad hoc task and the diversity task. A total of 26 groups submitted runs to the track, with many groups participating in both tasks. This report provides an overview of the track, including topic development, evaluation measures, and results.

289 citations


Proceedings ArticleDOI
02 Nov 2009
TL;DR: This paper creates a taxonomy of query refinement strategies and builds a high precision rule-based classifier to detect each type of reformulation, finding that some reformulations are better suited to helping users when the current results are already fruitful, while other reformulations are more effective when the results are lacking.
Abstract: Users frequently modify a previous search query in hope of retrieving better results. These modifications are called query reformulations or query refinements. Existing research has studied how web search engines can propose reformulations, but has given less attention to how people perform query reformulations. In this paper, we aim to better understand how web searchers refine queries and form a theoretical foundation for query reformulation. We study users' reformulation strategies in the context of the AOL query logs. We create a taxonomy of query refinement strategies and build a high precision rule-based classifier to detect each type of reformulation. Effectiveness of reformulations is measured using user click behavior. Most reformulation strategies result in some benefit to the user. Certain strategies like add/remove words, word substitution, acronym expansion, and spelling correction are more likely to cause clicks, especially on higher ranked results. In contrast, users often click the same result as their previous query or select no results when forming acronyms and reordering words. Perhaps the most surprising finding is that some reformulations are better suited to helping users when the current results are already fruitful, while other reformulations are more effective when the results are lacking. Our findings inform the design of applications that can assist searchers; examples are described in this paper.
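A toy version of such a rule-based reformulation classifier might look like the following; the taxonomy here is a simplified guess at a few of the categories mentioned (add/remove words, reordering, substitution, acronym formation), not the paper's actual rule set.

```python
def classify_reformulation(prev, curr):
    """Classify the transition from query `prev` to query `curr` (toy rules)."""
    p, c = prev.lower().split(), curr.lower().split()
    if p == c:
        return "resubmission"
    if set(p) == set(c):
        return "reorder words"
    if set(p) <= set(c):
        return "add words"
    if set(c) <= set(p):
        return "remove words"
    if len(c) == 1 and c[0] == "".join(w[0] for w in p):
        return "form acronym"          # e.g. "world health organization" -> "who"
    if len(p) == len(c):
        return "word substitution"
    return "new query"
```

A high-precision version would add checks for spelling correction and acronym expansion, which need edit-distance and abbreviation resources beyond this sketch.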

286 citations


Proceedings ArticleDOI
20 Apr 2009
TL;DR: This work performs an extensive study of compression techniques for document IDs and presents new optimizations of existing techniques which can achieve significant improvement in both compression and decompression performance.
Abstract: Web search engines use highly optimized compression schemes to decrease inverted index size and improve query throughput, and many index compression techniques have been studied in the literature. One approach taken by several recent studies first performs a renumbering of the document IDs in the collection that groups similar documents together, and then applies standard compression techniques. It is known that this can significantly improve index compression compared to a random document ordering. We study index compression and query processing techniques for such reordered indexes. Previous work has focused on determining the best possible ordering of documents. In contrast, we assume that such an ordering is already given, and focus on how to optimize compression methods and query processing for this case. We perform an extensive study of compression techniques for document IDs and present new optimizations of existing techniques which can achieve significant improvement in both compression and decompression performance. We also propose and evaluate techniques for compressing frequency values for this case. Finally, we study the effect of this approach on query processing performance. Our experiments show very significant improvements in index size and query processing speed on the TREC GOV2 collection of 25.2 million web pages.
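The baseline pipeline being optimized, sorted document IDs turned into d-gaps and then byte-aligned codes, can be sketched with variable-byte coding (the paper studies faster, more sophisticated codes; this only illustrates why reordering helps: clustered docIDs produce small gaps, and small gaps need fewer bytes).

```python
def vbyte_encode(gaps):
    out = bytearray()
    for g in gaps:
        while g >= 128:
            out.append(g & 0x7F)       # low 7 bits, continuation implied
            g >>= 7
        out.append(g | 0x80)           # high bit marks the last byte of a value
    return bytes(out)

def vbyte_decode(data):
    gaps, g, shift = [], 0, 0
    for b in data:
        g |= (b & 0x7F) << shift
        if b & 0x80:
            gaps.append(g)
            g, shift = 0, 0
        else:
            shift += 7
    return gaps

def compress_postings(doc_ids):
    """doc_ids must be sorted; store the first ID plus gaps to predecessors."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)

def decompress_postings(data):
    ids, acc = [], 0
    for g in vbyte_decode(data):
        acc += g                       # prefix-sum the gaps back into docIDs
        ids.append(acc)
    return ids
```

Comparing the encoded sizes of a clustered list versus a scattered one makes the reordering benefit visible directly.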

283 citations


Proceedings ArticleDOI
Dekang Lin1, Xiaoyun Wu1
02 Aug 2009
TL;DR: A simple and scalable algorithm for clustering tens of millions of phrases and using the resulting clusters as features in discriminative classifiers to demonstrate the power and generality of this approach.
Abstract: We present a simple and scalable algorithm for clustering tens of millions of phrases and use the resulting clusters as features in discriminative classifiers. To demonstrate the power and generality of this approach, we apply the method in two very different applications: named entity recognition and query classification. Our results show that phrase clusters offer significant improvements over word clusters. Our NER system achieves the best current result on the widely used CoNLL benchmark. Our query classifier is on par with the best system in KDDCUP 2005 without resorting to labor intensive knowledge engineering efforts.
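The way phrase clusters become classifier features can be illustrated with a hand-made phrase-to-cluster dictionary standing in for the clusters mined from a web corpus; each known phrase in a token span emits its cluster ID as a feature. The dictionary entries below are assumptions for the example.

```python
# Hypothetical cluster assignments; the paper derives these from
# distributional similarity over tens of millions of phrases.
PHRASE_CLUSTERS = {
    "new york": "C_CITY", "san francisco": "C_CITY",
    "barack obama": "C_PERSON", "steve jobs": "C_PERSON",
}

def phrase_cluster_features(tokens, max_len=2):
    """Emit (start, end, cluster_id) for every known phrase in the token span."""
    feats = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in PHRASE_CLUSTERS:
                feats.append((i, i + n, PHRASE_CLUSTERS[phrase]))
    return feats
```

A downstream NER or query classifier would then use these cluster IDs alongside lexical features, which is how unlabeled-corpus statistics reach the discriminative model.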

254 citations


Journal ArticleDOI
TL;DR: A privacy-aware query processor embedded inside a location-based database server to deal with snapshot and continuous queries based on the knowledge of the user's cloaked location rather than the exact location, which achieves a trade-off between query processing cost and answer optimality.
Abstract: In this article, we present a new privacy-aware query processing framework, Capsera, in which mobile and stationary users can obtain snapshot and/or continuous location-based services without revealing their private location information. In particular, we propose a privacy-aware query processor embedded inside a location-based database server to deal with snapshot and continuous queries based on the knowledge of the user's cloaked location rather than the exact location. Our proposed privacy-aware query processor is completely independent of how we compute the user's cloaked location. In other words, any existing location anonymization algorithms that blur the user's private location into cloaked rectilinear areas can be employed to protect the user's location privacy. We first propose a privacy-aware query processor that not only supports three new privacy-aware query types, but also achieves a trade-off between query processing cost and answer optimality. Then, to improve system scalability of processing continuous privacy-aware queries, we propose a shared execution paradigm that shares query processing among a large number of continuous queries. The proposed scalable paradigm can be tuned through two parameters to trade off between system scalability and answer optimality. Experimental results show that our query processor achieves high quality snapshot and continuous location-based services while supporting queries and/or data with cloaked locations.

240 citations


Proceedings ArticleDOI
20 Apr 2009
TL;DR: This work proposes MetaLabeler to automatically determine the relevant set of labels for each instance without intensive human involvement or expensive cross-validation, and applies it to a large scale query categorization problem in Yahoo!, yielding a significant improvement in performance.
Abstract: The explosion of online content has made the management of such content non-trivial. Web-related tasks such as web page categorization, news filtering, query categorization, tag recommendation, etc. often involve the construction of multi-label categorization systems on a large scale. Existing multi-label classification methods either do not scale or have unsatisfactory performance. In this work, we propose MetaLabeler to automatically determine the relevant set of labels for each instance without intensive human involvement or expensive cross-validation. Extensive experiments conducted on benchmark data show that the MetaLabeler tends to outperform existing methods. Moreover, MetaLabeler scales to millions of multi-labeled instances and can be deployed easily. This enables us to apply the MetaLabeler to a large scale query categorization problem in Yahoo!, yielding a significant improvement in performance.
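The core MetaLabeler recipe, score all labels with one model and predict how many to keep with a second "meta" model, reduces to a few lines once both models are given; the two lambdas in the usage example are stand-in stubs, not the paper's trained classifiers.

```python
def predict_labels(score_fn, meta_fn, x, labels):
    """Rank labels by score, then keep the top-k where k comes from the meta-model.

    score_fn(x, label) -> relevance score; meta_fn(x) -> predicted label count.
    """
    ranked = sorted(labels, key=lambda l: score_fn(x, l), reverse=True)
    k = meta_fn(x)
    return ranked[:k]

# Usage with stub models (hypothetical scores, fixed label count of 2):
scores = {"sports": 0.9, "news": 0.7, "tech": 0.1}
out = predict_labels(lambda x, l: scores[l], lambda x: 2, "some doc", list(scores))
```

The point of the meta-model is that the cutoff k is learned per instance rather than set by a global threshold or cross-validation.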

239 citations


Journal ArticleDOI
01 Aug 2009
TL;DR: This paper first transforms the vertices into points in a vector space via graph embedding techniques, converting a pattern match query into a distance-based multi-way join problem over the converted vector space, and proposes several pruning strategies and a join order selection method to perform join processing efficiently.
Abstract: The growing popularity of graph databases has generated interesting data management problems, such as subgraph search, shortest-path query, reachability verification, and pattern match. Among these, a pattern match query is more flexible compared to a subgraph search and more informative compared to a shortest-path or reachability query. In this paper, we address pattern match problems over a large data graph G. Specifically, given a pattern graph (i.e., query Q), we want to find all matches (in G) that have similar connections to those in Q. In order to reduce the search space significantly, we first transform the vertices into points in a vector space via graph embedding techniques, converting a pattern match query into a distance-based multi-way join problem over the converted vector space. We also propose several pruning strategies and a join order selection method to perform join processing efficiently. Extensive experiments on both real and synthetic datasets show that our method outperforms existing ones by orders of magnitude.

Proceedings ArticleDOI
19 Jul 2009
TL;DR: This work addresses the problem of vertical selection, predicting relevant verticals for queries issued to the search engine's main web search page by focusing on 18 different verticals, which differ in terms of semantics, media type, size, and level of query traffic.
Abstract: Web search providers often include search services for domain-specific subcollections, called verticals, such as news, images, videos, job postings, company summaries, and artist profiles. We address the problem of vertical selection, predicting relevant verticals (if any) for queries issued to the search engine's main web search page. In contrast to prior query classification and resource selection tasks, vertical selection is associated with unique resources that can inform the classification decision. We focus on three sources of evidence: (1) the query string, from which features are derived independent of external resources, (2) logs of queries previously issued directly to the vertical, and (3) corpora representative of vertical content. We focus on 18 different verticals, which differ in terms of semantics, media type, size, and level of query traffic. We compare our method to prior work in federated search and retrieval effectiveness prediction. An in-depth error analysis reveals unique challenges across different verticals and provides insight into vertical selection for future work.

Journal ArticleDOI
TL;DR: Experimental results suggest that query expansion using MeSH in PubMed can generally improve retrieval performance, but the improvement may not affect end PubMed users in realistic situations.
Abstract: This paper investigates the effectiveness of using MeSH® in PubMed through its automatic query expansion process: Automatic Term Mapping (ATM). We run Boolean searches based on a collection of 55 topics and about 160,000 MEDLINE® citations used in the 2006 and 2007 TREC Genomics Tracks. For each topic, we first automatically construct a query by selecting keywords from the question. Next, each query is expanded by ATM, which assigns different search tags to terms in the query. Three search tags: [MeSH Terms], [Text Words], and [All Fields] are chosen to be studied after expansion because they all make use of the MeSH field of indexed MEDLINE citations. Furthermore, we characterize the two different mechanisms by which the MeSH field is used. Retrieval results using MeSH after expansion are compared to those solely based on the words in MEDLINE title and abstracts. The aggregate retrieval performance is assessed using both F-measure and mean rank precision. Experimental results suggest that query expansion using MeSH in PubMed can generally improve retrieval performance, but the improvement may not affect end PubMed users in realistic situations.

Proceedings ArticleDOI
06 Jul 2009
TL;DR: The NEXT CEP system is described, which was especially designed for query rewriting and distribution; experiments on the Emulab test-bed show a significant improvement in system scalability due to rewriting and distribution.
Abstract: The nature of data in enterprises and on the Internet is changing. Data used to be stored in a database first and queried later. Today timely processing of new data, represented as events, is increasingly valuable. In many domains, complex event processing (CEP) systems detect patterns of events for decision making. Examples include processing of environmental sensor data, trades in financial markets and RSS web feeds. Unlike conventional database systems, most current CEP systems pay little attention to query optimisation. They do not rewrite queries to more efficient representations or make decisions about operator distribution, limiting their overall scalability. This paper describes the NEXT CEP system that was especially designed for query rewriting and distribution. Event patterns are specified in a high-level query language and, before being translated into event automata, are rewritten in a more efficient form. Automata are then distributed across a cluster of machines for detection scalability. We present algorithms for query rewriting and distributed placement. Our experiments on the Emulab test-bed show a significant improvement in system scalability due to rewriting and distribution.

Proceedings ArticleDOI
19 Jul 2009
TL;DR: This work uses various measures of query quality described in the literature as features to represent sub-queries, and trains a classifier to reduce long queries to more effective shorter ones that lack those extraneous terms.
Abstract: Long queries frequently contain many extraneous terms that hinder retrieval of relevant documents. We present techniques to reduce long queries to more effective shorter ones that lack those extraneous terms. Our work is motivated by the observation that perfectly reducing long TREC description queries can lead to an average improvement of 30% in mean average precision. Our approach involves transforming the reduction problem into a problem of learning to rank all subsets of the original query (sub-queries) based on their predicted quality, and selecting the top sub-query. We use various measures of query quality described in the literature as features to represent sub-queries, and train a classifier. Replacing the original long query with the top-ranked sub-query chosen by the ranker results in a statistically significant average improvement of 8% on our test sets. Analysis of the results shows that query reduction is well-suited for moderately-performing long queries, and a small set of query quality predictors are well-suited for the task of ranking sub-queries.
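The reduction-as-ranking idea can be sketched as exhaustive sub-query enumeration plus a quality predictor; the IDF-averaging predictor in the usage example is a made-up stand-in for the paper's learned ranker, and exhaustive enumeration is only feasible for short queries.

```python
from itertools import combinations

def reduce_query(terms, quality):
    """Return the highest-quality non-empty subset of the query terms.

    quality(list_of_terms) -> float is a pluggable query-quality predictor.
    """
    best, best_q = terms, quality(terms)
    for n in range(1, len(terms)):
        for sub in combinations(terms, n):
            q = quality(list(sub))
            if q > best_q:
                best, best_q = list(sub), q
    return best

# Usage with a toy predictor: average IDF, so low-IDF (extraneous) terms hurt.
idf = {"find": 0.1, "documents": 0.2, "quantum": 3.0, "computing": 2.5, "please": 0.05}
quality = lambda terms: sum(idf[w] for w in terms) / len(terms)
```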

Proceedings ArticleDOI
19 Oct 2009
TL;DR: This paper proposes a new query suggestion scheme named Visual Query Suggestion (VQS), which provides a more effective query interface to formulate an intent-specific query by joint text and image suggestions, and shows that VQS outperforms these engines in terms of both the quality of query suggestion and search performance.
Abstract: Query suggestion is an effective approach to improve the usability of image search. Most existing search engines are able to automatically suggest a list of textual query terms based on users' current query input, which can be called Textual Query Suggestion. This paper proposes a new query suggestion scheme named Visual Query Suggestion (VQS) which is dedicated to image search. It provides a more effective query interface to formulate an intent-specific query by joint text and image suggestions. We show that VQS is able to more precisely and more quickly help users specify and deliver their search intents. When a user submits a text query, VQS first provides a list of suggestions, each containing a keyword and a collection of representative images in a dropdown menu. If the user selects one of the suggestions, the corresponding keyword will be added to complement the initial text query as the new text query, while the image collection will be formulated as the visual query. VQS then performs image search based on the new text query using text search techniques, as well as content-based visual retrieval to refine the search results by using the corresponding images as query examples. We compare VQS with three popular image search engines, and show that VQS outperforms these engines in terms of both the quality of query suggestion and search performance.

Proceedings ArticleDOI
19 Jul 2009
TL;DR: This paper incorporates context information into the problem of query classification by using conditional random field (CRF) models and shows that it can improve the F1 score by 52% as compared to other state-of-the-art baselines.
Abstract: Understanding users' search intent expressed through their search queries is crucial to Web search and online advertisement. Web query classification (QC) has been widely studied for this purpose. Most previous QC algorithms classify individual queries without considering their context information. However, as exemplified by the well-known example of the query "jaguar", many Web queries are short and ambiguous, and their real meanings are uncertain without context information. In this paper, we incorporate context information into the problem of query classification by using conditional random field (CRF) models. In our approach, we use neighboring queries and their corresponding clicked URLs (Web pages) in search sessions as the context information. We perform extensive experiments on real-world search logs and validate the effectiveness and efficiency of our approach. We show that we can improve the F1 score by 52% as compared to other state-of-the-art baselines.

Journal ArticleDOI
TL;DR: A query monitoring system from the ground up is presented, describing various new techniques for query monitoring, their implementation inside a real database system, and a novel interface that presents the observed and predicted information in an accessible manner.
Abstract: Query monitoring refers to the problem of observing and predicting various parameters related to the execution of a query in a database system. In addition to being a useful tool for database users and administrators, it can also serve as an information collection service for resource allocation and adaptive query processing techniques. In this article, we present a query monitoring system from the ground up, describing various new techniques for query monitoring, their implementation inside a real database system, and a novel interface that presents the observed and predicted information in an accessible manner. To enable this system, we introduce several lightweight online techniques for progressively estimating and refining the cardinality of different relational operators using information collected at query execution time. These include binary and multiway joins as well as typical grouping operations and combinations thereof. We describe the various algorithms used to efficiently implement estimators and present the results of an evaluation of a prototype implementation of our framework in an open-source data management system. Our results demonstrate the feasibility and practical utility of the approach presented herein.

Patent
29 Sep 2009
TL;DR: In this paper, an approach for creating and utilizing information representation of queries is presented, where a query application receives a query and expresses the query as a resource description framework graph, and the query application causes at least in part storage of the query resource description graph.
Abstract: An approach is provided for creating and utilizing information representation of queries. A query application receives a query. The query application expresses the query as a resource description framework graph. The query application causes at least in part storage of the query resource description framework graph.

Proceedings ArticleDOI
29 Jun 2009
TL;DR: An overview of the state-of-the-art techniques for supporting keyword search on structured and semi-structured data, including query result definition, ranking functions, result generation and top-k query processing, snippet generation, result clustering, query cleaning, performance optimization, and search quality evaluation are given.
Abstract: Empowering users to access databases using simple keywords can relieve the users from the steep learning curve of mastering a structured query language and understanding complex and possibly fast evolving data schemas. In this tutorial, we give an overview of the state-of-the-art techniques for supporting keyword search on structured and semi-structured data, including query result definition, ranking functions, result generation and top-k query processing, snippet generation, result clustering, query cleaning, performance optimization, and search quality evaluation. Various data models will be discussed, including relational data, XML data, graph-structured data, data streams, and workflows. We also discuss applications that are built upon keyword search, such as keyword based database selection, query generation, and analytical processing. Finally we identify the challenges and opportunities of future research to advance the field.

Patent
15 Dec 2009
TL;DR: In this paper, a spoken search query is received by a portable device, and the portable device then determines its present location; that information is incorporated into a local language model that is used to process the search query.
Abstract: Disclosed herein are systems, methods, and computer-readable storage media for a speech recognition application for directory assistance that is based on a user's spoken search query. The spoken search query is received by a portable device, and the portable device then determines its present location. Upon determining the location of the portable device, that information is incorporated into a local language model that is used to process the search query. Finally, the portable device outputs the results of the search query based on the local language model.

Patent
13 Nov 2009
TL;DR: In this article, a computer-implemented method includes parsing a user query to determine whether the user query assigns a field to the first term, parsing resulting in a parsed query that conforms to a predefined format, performing a search in a metadata repository using the parsed query, the metadata repository embodied in a computer readable medium and including triplets generated based on multiple modes of metadata for video content, search identifying a set of candidate scenes from the video content.
Abstract: A computer-implemented method includes receiving, in a computer system, a user query comprising at least a first term, parsing the user query to at least determine whether the user query assigns a field to the first term, the parsing resulting in a parsed query that conforms to a predefined format, performing a search in a metadata repository using the parsed query, the metadata repository embodied in a computer readable medium and including triplets generated based on multiple modes of metadata for video content, the search identifying a set of candidate scenes from the video content, ranking the set of candidate scenes according to a scoring metric into a ranked scene list, and generating an output from the computer system that includes at least part of the ranked scene list, the output generated in response to the user query.

Proceedings ArticleDOI
09 Feb 2009
TL;DR: The proposed methods can match in precision, and often improve, recommendations based on query-click graphs, without using users' clicks, and the experiments show that it is important to consider transition-type labels on edges for having good quality recommendations.
Abstract: The query-flow graph [Boldi et al., CIKM 2008] is an aggregated representation of the latent querying behavior contained in a query log. Intuitively, in the query-flow graph a directed edge from query qi to query qj means that the two queries are likely to be part of the same search mission. Any path over the query-flow graph may be seen as a possible search task, whose likelihood is given by the strength of the edges along the path. An edge (qi, qj) is also labelled with some information: e.g., the probability that the user moves from qi to qj, or the type of the transition, for instance, the fact that qj is a specialization of qi. In this paper we propose, and experimentally study, query recommendations based on short random walks on the query-flow graph. Our experiments show that these methods can match in precision, and often improve on, recommendations based on query-click graphs, without using users' clicks. Our experiments also show that it is important to consider transition-type labels on edges for having good quality recommendations. Finally, one feature that we had in mind while devising our methods was that of providing diverse sets of recommendations: the experimentation that we conducted provides encouraging results in this sense.
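A minimal simulation of recommendation by short random walks on a query-flow graph might look like this, assuming edge weights are transition probabilities and ignoring the transition-type labels the paper found important:

```python
import random

def recommend(graph, start, steps=3, walks=1000, seed=0):
    """graph: {q: [(q_next, prob), ...]}. Rank walk endpoints by visit count."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(walks):
        q = start
        for _ in range(steps):
            edges = graph.get(q)
            if not edges:
                break                      # dead end: stop the walk early
            r, acc = rng.random(), 0.0
            for nxt, p in edges:           # sample the next query by edge probability
                acc += p
                if r < acc:
                    q = nxt
                    break
        if q != start:
            counts[q] = counts.get(q, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)
```

On a toy graph for the ambiguous query "jaguar", walks concentrate on the branch with the heavier edge weights, which is the intended recommendation behavior.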

Journal ArticleDOI
TL;DR: The overall design of QUICK is described, the core algorithms to enable efficient query construction are presented, and the effectiveness of the system is demonstrated through an experimental study.

Proceedings ArticleDOI
19 Jul 2009
TL;DR: This paper investigates a subset of temporal queries that are implicitly year qualified queries, a query that does not actually contain a year, but yet the user may have implicitly formulated the query with a specific year in mind.
Abstract: Many web search queries have implicit intents associated with them that, if detected and used effectively, can be used to improve search quality. For example, a user who enters the query “toyota camry” may wish to find the official web page for the car, reviews about the car, or the location of the closest Toyota dealership. However, since the user only entered a couple of keywords, it may be difficult to accurately determine which of these implicit intents the user actually meant. Given such an ambiguous query, a search engine must use personalization, click information, query log analysis, and other means for determining implicit intent. Rather than solving the general problem of automatically determining user intent, we focus on queries that have a temporally dependent intent. Temporally dependent queries are queries for which the best search results change with time. Simple examples include “new years” and “presidential elections”, which are events that recur over time. The search results for these queries should reflect the freshest, most current results. A slightly more complex example is the query “turkey”. For this query, it may be useful to return turkey recipes or cooking instructions around the Thanksgiving holiday and travel information during peak vacation times. In all of our examples thus far, the events have occurred with (mostly) predictable periodicity. However, for queries such as “oldest person alive”, the best result changes unpredictably, making it difficult for search engines to consistently return correct results. Therefore, temporally dependent queries come in many different forms and pose many challenges to search engines. In this paper, we investigate a subset of temporal queries that we call implicitly year qualified queries. A year qualified query is a query that contains a year. 
An implicitly year qualified query is a query that does not actually contain a year, but that the user may have implicitly formulated with a specific year in mind. An example implicitly year qualified query is “miss universe”. It is plausible that the user actually meant “miss universe 2008”, “miss universe 2007”, or maybe even “miss universe 1990”, yet did not actually include the year in the query.
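One simple way to flag implicitly year qualified queries is to compare the frequency of a bare query against the frequencies of its year-suffixed variants in a query log. The toy log, the suffix-matching heuristic, and the threshold below are all assumptions for illustration:

```python
import re
from collections import Counter

YEAR = re.compile(r"\b(19|20)\d{2}\b")

# Toy query log; in practice these counts come from large-scale logs.
log = Counter({
    "miss universe": 120, "miss universe 2008": 80, "miss universe 2007": 40,
    "toyota camry": 200, "new years": 90,
})

def is_year_qualified(q):
    return YEAR.search(q) is not None

def implicitly_year_qualified(log, min_ratio=0.2):
    """Flag a bare query if year-suffixed variants of it are frequent
    relative to the bare query itself (hypothetical threshold)."""
    flagged = {}
    for q, n in log.items():
        if is_year_qualified(q):
            continue
        variant_count = sum(
            c for v, c in log.items()
            if v != q and v.startswith(q + " ") and is_year_qualified(v))
        if n and variant_count / n >= min_ratio:
            flagged[q] = variant_count / n
    return flagged
```

Here “miss universe” would be flagged because its year-suffixed variants together occur as often as the bare query, while “toyota camry” would not.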

Book ChapterDOI
02 Jun 2009
TL;DR: The idea is to track the querying behavior of each user, identify which parts of the database may be of interest for the corresponding data analysis task, and recommend queries that retrieve relevant data.
Abstract: Relational database systems are becoming increasingly popular in the scientific community to support the interactive exploration of large volumes of data. In this scenario, users employ a query interface (typically, a web-based client) to issue a series of SQL queries that aim to analyze the data and mine it for interesting information. First-time users, however, may not have the necessary knowledge to know where to start their exploration. Other times, users may simply overlook queries that retrieve important information. To assist users in this context, we draw inspiration from Web recommender systems and propose the use of personalized query recommendations. The idea is to track the querying behavior of each user, identify which parts of the database may be of interest for the corresponding data analysis task, and recommend queries that retrieve relevant data. We discuss the main challenges in this novel application of recommendation systems, and outline a possible solution based on collaborative filtering. Preliminary experimental results on real user traces demonstrate that our framework can generate effective query recommendations.
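A collaborative-filtering recommender of the kind outlined above can be sketched by treating each user's issued queries as a set and scoring unseen queries by the similarity of the users who issued them. The session data and query identifiers below are hypothetical:

```python
import math

# Hypothetical log: which (templated) queries each analyst has issued.
sessions = {
    "alice": {"q_sky_survey", "q_bright_stars", "q_redshift"},
    "bob":   {"q_sky_survey", "q_bright_stars"},
    "carol": {"q_galaxy_pairs"},
}

def cosine(a, b):
    """Cosine similarity between two binary sets of queries."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0

def recommend_queries(user, sessions, k=2):
    """Recommend queries issued by similar users but not yet by `user`."""
    mine = sessions[user]
    scores = {}
    for other, theirs in sessions.items():
        if other == user:
            continue
        sim = cosine(mine, theirs)
        if sim == 0:
            continue  # dissimilar users contribute nothing
        for q in theirs - mine:
            scores[q] = scores.get(q, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

For the toy data, bob's closest neighbor is alice, so her remaining query is recommended to him, while carol's unrelated exploration is ignored.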

Proceedings ArticleDOI
Fernando Diaz1
09 Feb 2009
TL;DR: This paper addresses the issue of integrating search results from a news vertical into web search results, and defines several click-based metrics which allow a system to be monitored and tuned without annotator effort.
Abstract: Aggregated search refers to the integration of content from specialized corpora or verticals into web search results. Aggregation improves search when the user has vertical intent but may not be aware of or desire vertical search. In this paper, we address the issue of integrating search results from a news vertical into web search results. News is particularly challenging because, given a query, the appropriate decision---to integrate news content or not---changes with time. Our system adapts to news intent in two ways. First, by inspecting the dynamics of the news collection and query volume, we can track development of and interest in topics. Second, by using click feedback, we can quickly recover from system errors. We define several click-based metrics which allow a system to be monitored and tuned without annotator effort.
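One elementary click-based metric in this spirit is a smoothed click-through rate on the news block, used to decide whether to keep showing it for a query. The prior counts and the threshold below are illustrative assumptions, not the paper's tuned values:

```python
def news_ctr(impressions, clicks, prior_clicks=1, prior_impressions=20):
    """Smoothed click-through rate for the news block on a query;
    the pseudo-count prior keeps low-traffic estimates conservative."""
    return (clicks + prior_clicks) / (impressions + prior_impressions)

def should_show_news(impressions, clicks, threshold=0.1):
    """Keep integrating news results only while users keep clicking them."""
    return news_ctr(impressions, clicks) >= threshold
```

Because the decision is driven entirely by observed clicks, a system monitored this way can recover from a bad integration decision without any annotator effort, which is the property the paper's metrics aim for.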

Proceedings ArticleDOI
29 Jun 2009
TL;DR: This paper presents a novel data-driven approach, called Query By Output (QBO), which can enhance the usability of database systems and designs several optimization techniques to reduce processing overhead and introduce a set of criteria to rank order output queries by various notions of utility.
Abstract: It has recently been asserted that the usability of a database is as important as its capability. Understanding the database schema and the hidden relationships among attributes in the data both play an important role in this context. Subscribing to this viewpoint, in this paper, we present a novel data-driven approach, called Query By Output (QBO), which can enhance the usability of database systems. The central goal of QBO is as follows: given the output of some query Q on a database D, denoted by Q(D), we wish to construct an alternative query Q′ such that Q(D) and Q′(D) are instance-equivalent. To generate instance-equivalent queries from Q(D), we devise a novel data classification-based technique that can handle the at-least-one semantics that is inherent in the query derivation. In addition to the basic framework, we design several optimization techniques to reduce processing overhead and introduce a set of criteria to rank order output queries by various notions of utility. Our framework is evaluated comprehensively on three real data sets and the results show that the instance-equivalent queries we obtain are interesting and that the approach is scalable and robust to queries of different selectivities.
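The instance-equivalence goal can be illustrated with a brute-force stand-in for the paper's classification-based technique: search single-attribute predicates until one reproduces Q(D) exactly on D. The toy table and the restriction to equality and threshold predicates are assumptions for illustration:

```python
# Toy instance: rows of a relation D and the output Q(D) of some
# unknown query Q over it.
table = [
    {"name": "a", "dept": "cs", "salary": 90},
    {"name": "b", "dept": "cs", "salary": 60},
    {"name": "c", "dept": "ee", "salary": 95},
]
output = [table[0], table[2]]  # Q(D)

def find_equivalent_predicate(table, output):
    """Brute-force search over single-attribute predicates for one that
    selects exactly the output rows (instance equivalence on D)."""
    target = {id(r) for r in output}
    def matches(pred):
        return {id(r) for r in table if pred(r)} == target
    for col in table[0]:
        for v in {r[col] for r in table}:
            if matches(lambda r: r[col] == v):
                return f"{col} = {v!r}"
            if isinstance(v, (int, float)) and matches(lambda r: r[col] >= v):
                return f"{col} >= {v!r}"
    return None
```

Here the recovered predicate `salary >= 90` is instance-equivalent to Q on this D, though, as the paper stresses, it need not be equivalent on other database instances.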

Journal ArticleDOI
TL;DR: This work presents Falcons Object Search, a keyword-based search engine for linked objects, which constructs a comprehensive virtual document including not only associated literals but also the textual descriptions of associated links and linked objects.
Abstract: Along with the rapid growth of the data Web, searching linked objects for information needs and for reuse has become pressing for ordinary Web users and developers, respectively. To meet the challenge, we present Falcons Object Search, a keyword-based search engine for linked objects. To serve various keyword queries, for each object the system constructs a comprehensive virtual document including not only associated literals but also the textual descriptions of associated links and linked objects. The resulting objects are ranked by considering both their relevance to the query and their popularity. For each resulting object, a query-relevant structured snippet is provided to show the associated literals and linked objects matched with the query. Besides, Web-scale class-inclusion reasoning is performed to discover implicit typing information, and users could navigate class hierarchies for incremental class-based results filtering. The results of a task-based experiment show the promising features of the system.
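The virtual-document idea can be sketched by concatenating an object's literals with the labels of the objects it links to, so that keyword queries match linked context too. The toy triples and the `:`-prefix convention for resources are assumptions for illustration:

```python
# Toy RDF-like triples (subject, predicate, object); strings with a
# leading ':' stand for resources, the rest are literals.
triples = [
    (":Berlin", "label", "Berlin"),
    (":Berlin", "population", "3.4 million"),
    (":Berlin", "capitalOf", ":Germany"),
    (":Germany", "label", "Germany"),
]

def virtual_document(obj, triples):
    """Concatenate an object's literals with the textual descriptions
    (here simply the labels) of its links and linked objects."""
    labels = {s: o for s, p, o in triples if p == "label"}
    terms = []
    for s, p, o in triples:
        if s != obj:
            continue
        terms.append(p)
        terms.append(labels.get(o, o) if o.startswith(":") else o)
    return " ".join(terms)
```

A keyword query such as "Germany capital" can now retrieve the `:Berlin` object even though neither term appears among its own literals, which is the point of folding linked descriptions into the virtual document.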

Journal IssueDOI
TL;DR: Results show that Reformulation and Assistance account for approximately 45% of all query reformulations; furthermore, the results demonstrate that the first- and second-order models provide the best predictability, between 28% and 40% overall and higher than 70% for some patterns.
Abstract: Query reformulation is a key user behavior during Web search. Our research goal is to develop predictive models of query reformulation during Web searching. This article reports results from a study in which we automatically classified the query-reformulation patterns for 964,780 Web searching sessions, composed of 1,523,072 queries, to predict the next query reformulation. We employed an n-gram modeling approach to describe the probability of users transitioning from one query-reformulation state to another to predict their next state. We developed first-, second-, third-, and fourth-order models and evaluated each model for accuracy of prediction, coverage of the dataset, and complexity of the possible pattern set. The results show that Reformulation and Assistance account for approximately 45% of all query reformulations; furthermore, the results demonstrate that the first- and second-order models provide the best predictability, between 28% and 40% overall and higher than 70% for some patterns. Implications are that the n-gram approach can be used for improving searching systems and searching assistance. © 2009 Wiley Periodicals, Inc.
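The n-gram modeling approach above can be sketched as counting transitions between reformulation states and predicting the most frequent successor of the current context. The session traces and state names below are toy illustrations, not the study's taxonomy or data:

```python
from collections import Counter, defaultdict

# Hypothetical session traces of reformulation states, e.g.
# New query, Reformulation (RF), Specialization (SP), Assistance (AS).
sessions = [
    ["New", "RF", "SP", "RF"],
    ["New", "RF", "RF", "AS"],
    ["New", "SP", "RF"],
]

def train_ngram(sessions, n=2):
    """n=2 gives a first-order model: P(next state | previous state);
    larger n conditions on longer state histories."""
    counts = defaultdict(Counter)
    for s in sessions:
        for i in range(len(s) - n + 1):
            context, nxt = tuple(s[i:i + n - 1]), s[i + n - 1]
            counts[context][nxt] += 1
    return counts

def predict(counts, context):
    """Predict the most frequent next state for the given context."""
    return counts[tuple(context)].most_common(1)[0][0]

model = train_ngram(sessions, n=2)
```

Raising `n` trades coverage for predictability in exactly the way the article measures: longer contexts are more specific but match fewer sessions.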