
Showing papers on "Web query classification" published in 2016


Proceedings ArticleDOI
24 Oct 2016
TL;DR: A suite of query expansion methods based on word embeddings: Word2Vec's CBOW approach, applied over the search corpus, selects terms that are semantically related to the query, which are then used either to expand the original query or to integrate with the effective pseudo-feedback-based relevance model.
Abstract: We present a suite of query expansion methods that are based on word embeddings. Using Word2Vec's CBOW embedding approach, applied over the entire corpus on which search is performed, we select terms that are semantically related to the query. Our methods either use the terms to expand the original query or integrate them with the effective pseudo-feedback-based relevance model. In the former case, retrieval performance is significantly better than that of using only the query, and in the latter case the performance is significantly better than that of the relevance model.
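A minimal sketch of this style of CBOW-based expansion, using gensim (the toy corpus, hyperparameters, and the top-k selection policy are illustrative assumptions, not the authors' exact setup):

```python
from gensim.models import Word2Vec

# Toy corpus standing in for the search collection (the paper trains over the
# entire corpus on which retrieval is performed).
corpus = [
    ["neural", "word", "embeddings", "for", "retrieval"],
    ["query", "expansion", "with", "word", "embeddings"],
    ["pseudo", "relevance", "feedback", "improves", "retrieval"],
    ["word", "embeddings", "capture", "semantic", "similarity"],
]

# sg=0 selects CBOW, as in the paper; the other hyperparameters are illustrative.
model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=0, epochs=50)

def expand(query_terms, k=3):
    """Hypothetical policy: add the k most similar terms per query term."""
    expansion = []
    for term in query_terms:
        if term in model.wv:
            expansion += [w for w, _ in model.wv.most_similar(term, topn=k)]
    return query_terms + expansion

print(expand(["query", "expansion"]))
```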

194 citations


Proceedings ArticleDOI
12 Sep 2016
TL;DR: This paper proposes to use word embeddings to incorporate and weight terms that do not occur in the query, but are semantically related to the query terms, and develops an embedding-based relevance model, an extension of the effective and robust relevance model approach.
Abstract: Word embeddings, which are low-dimensional vector representations of vocabulary terms that capture the semantic similarity between them, have recently been shown to achieve impressive performance in many natural language processing tasks. The use of word embeddings in information retrieval, however, has only begun to be studied. In this paper, we explore the use of word embeddings to enhance the accuracy of query language models in the ad-hoc retrieval task. To this end, we propose to use word embeddings to incorporate and weight terms that do not occur in the query, but are semantically related to the query terms. We describe two embedding-based query expansion models with different assumptions. Since pseudo-relevance feedback methods that use the top retrieved documents to update the original query model are well-known to be effective, we also develop an embedding-based relevance model, an extension of the effective and robust relevance model approach. In these models, we transform the similarity values obtained by the widely-used cosine similarity with a sigmoid function to have more discriminative semantic similarity values. We evaluate our proposed methods using three TREC newswire and web collections. The experimental results demonstrate that the embedding-based methods significantly outperform competitive baselines in most cases. The embedding-based methods are also shown to be more robust than the baselines.
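The sigmoid transformation of cosine similarities that the abstract mentions can be sketched as follows (the steepness and threshold values `a` and `c` are placeholders; the paper tunes its own parameterization):

```python
import numpy as np

def sigmoid_similarity(q_vec, term_vecs, a=10.0, c=0.5):
    """Cosine similarity pushed through a sigmoid to sharpen the contrast
    between related and unrelated terms; a and c are placeholder values."""
    q = q_vec / np.linalg.norm(q_vec)
    t = term_vecs / np.linalg.norm(term_vecs, axis=1, keepdims=True)
    cos = t @ q
    return 1.0 / (1.0 + np.exp(-a * (cos - c)))

rng = np.random.default_rng(0)
vocab_vecs = rng.normal(size=(1000, 50))      # stand-in embedding matrix
query_vec = vocab_vecs[:3].mean(axis=0)       # e.g., mean of the query term vectors

weights = sigmoid_similarity(query_vec, vocab_vecs)
top = np.argsort(-weights)[:10]                      # candidate expansion terms
expansion_probs = weights[top] / weights[top].sum()  # normalized term weights
```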

135 citations


Posted Content
TL;DR: The proposed Dual Embedding Space Model (DESM) captures evidence on whether a document is about a query term in addition to what is modelled by traditional term-frequency based approaches, and shows that the DESM can re-rank top documents returned by a commercial Web search engine, like Bing, better than a term-matching based signal like TF-IDF.
Abstract: A fundamental goal of search engines is to identify, given a query, documents that have relevant text. This is intrinsically difficult because the query and the document may use different vocabulary, or the document may contain query words without being relevant. We investigate neural word embeddings as a source of evidence in document ranking. We train a word2vec embedding model on a large unlabelled query corpus, but in contrast to how the model is commonly used, we retain both the input and the output projections, allowing us to leverage both the embedding spaces to derive richer distributional relationships. During ranking we map the query words into the input space and the document words into the output space, and compute a query-document relevance score by aggregating the cosine similarities across all the query-document word pairs. We postulate that the proposed Dual Embedding Space Model (DESM) captures evidence on whether a document is about a query term in addition to what is modelled by traditional term-frequency based approaches. Our experiments show that the DESM can re-rank top documents returned by a commercial Web search engine, like Bing, better than a term-matching based signal like TF-IDF. However, when ranking a larger set of candidate documents, we find the embeddings-based approach is prone to false positives, retrieving documents that are only loosely related to the query. We demonstrate that this problem can be solved effectively by ranking based on a linear mixture of the DESM and the word counting features.
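A sketch of the IN-OUT scoring idea behind DESM, under stated assumptions (random matrices stand in for the word2vec input and output projections, and the mixture weight is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
IN = rng.normal(size=(5000, 100))    # word2vec input projections (stand-in)
OUT = rng.normal(size=(5000, 100))   # word2vec output projections (stand-in)

def unit(v):
    return v / np.linalg.norm(v)

def desm_score(query_ids, doc_ids):
    """Average cosine between each query word (IN space) and the document
    centroid (OUT space) -- the dual-space configuration described above."""
    centroid = unit(np.mean([unit(OUT[i]) for i in doc_ids], axis=0))
    return float(np.mean([unit(IN[q]) @ centroid for q in query_ids]))

def mixed_score(query_ids, doc_ids, lexical_score, alpha=0.1):
    """Linear mixture with a lexical signal, the paper's fix for false
    positives on larger candidate sets; alpha is a placeholder."""
    return alpha * desm_score(query_ids, doc_ids) + (1 - alpha) * lexical_score

print(mixed_score([1, 2], [10, 11, 12], lexical_score=0.7))
```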

133 citations


Proceedings ArticleDOI
12 Sep 2016
TL;DR: A theoretical framework for estimating query embedding vectors based on the individual embedding vectors of vocabulary terms is proposed; a number of different implementations of the framework are provided, and the average word embedding (AWE) method is shown to be a special case of it.
Abstract: The dense vector representation of vocabulary terms, also known as word embeddings, have been shown to be highly effective in many natural language processing tasks. Word embeddings have recently begun to be studied in a number of information retrieval (IR) tasks. One of the main steps in leveraging word embeddings for IR tasks is to estimate the embedding vectors of queries. This is a challenging task, since queries are not always available during the training phase of word embedding vectors. Previous work has considered the average or sum of embedding vectors of all query terms (AWE) to model the query embedding vectors, but no theoretical justification has been presented for such a model. In this paper, we propose a theoretical framework for estimating query embedding vectors based on the individual embedding vectors of vocabulary terms. We then provide a number of different implementations of this framework and show that the AWE method is a special case of the proposed framework. We also introduce pseudo query vectors, the query embedding vectors estimated using pseudo-relevant documents. We further extrinsically evaluate the proposed methods using two well-known IR tasks: query expansion and query classification. The estimated query embedding vectors are evaluated via query expansion experiments over three newswire and web TREC collections as well as query classification experiments over the KDD Cup 2005 test set. The experiments show that the introduced pseudo query vectors significantly outperform the AWE method.
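The framework's core observation can be illustrated as a weighted average of term vectors: with uniform weights it reduces to AWE, while feedback-derived weights give something in the spirit of the pseudo query vectors (a sketch; the paper's actual estimators differ):

```python
import numpy as np

rng = np.random.default_rng(2)
E = rng.normal(size=(5000, 100))   # embedding matrix, one row per vocabulary term

def query_embedding(term_ids, weights=None):
    """Weighted average of term vectors; uniform weights reduce to AWE."""
    w = np.ones(len(term_ids)) if weights is None else np.asarray(weights, float)
    w = w / w.sum()
    return (w[:, None] * E[term_ids]).sum(axis=0)

awe = query_embedding([5, 17])                      # plain AWE
# Pseudo query vector, sketched: weight candidate terms by (hypothetical)
# counts observed in the top retrieved documents.
pqv = query_embedding([5, 17, 42, 99], weights=[4, 4, 3, 1])
```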

111 citations


Posted Content
TL;DR: It is found that the word2vec-based query expansion methods perform similarly with and without feedback information, and the proposed method fails to achieve performance comparable to statistical co-occurrence-based feedback methods such as RM3.
Abstract: In this paper, a framework for Automatic Query Expansion (AQE) is proposed using the distributed neural language model word2vec. Using semantic and contextual relations in a distributed and unsupervised framework, word2vec learns a low-dimensional embedding for each vocabulary entry. Within this framework, we devise a query expansion technique where terms related to a query are obtained by a K-nearest-neighbour approach. We explore the performance of the AQE methods, with and without feedback query expansion, and a variant of simple K-nearest neighbours in the proposed framework. Experiments on standard TREC ad-hoc data (Disks 4 and 5 with query sets 301-450 and 601-700) and web data (WT10G with query set 451-550) show significant improvement over standard term-overlap-based retrieval methods. However, the proposed method fails to achieve performance comparable to statistical co-occurrence-based feedback methods such as RM3. We have also found that the word2vec-based query expansion methods perform similarly with and without any feedback information.
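The K-nearest-neighbour expansion step can be sketched in a few lines (random vectors stand in for trained word2vec embeddings; selecting neighbours of the mean query vector is one of several variants the paper explores):

```python
import numpy as np

rng = np.random.default_rng(3)
E = rng.normal(size=(5000, 100))               # stand-in word2vec vectors
E /= np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize once

def knn_expansion_terms(query_ids, K=5):
    """Return the K terms nearest (by cosine) to the mean query vector."""
    q = E[query_ids].mean(axis=0)
    sims = E @ (q / np.linalg.norm(q))
    sims[query_ids] = -np.inf                  # never re-select the query terms
    return np.argsort(-sims)[:K]

print(knn_expansion_terms([0, 1]))
```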

105 citations


Journal ArticleDOI
TL;DR: Experimental results indicate that feature selection and ensemble learning can enhance the predictive performance of classifiers in web page classification, and that the Bagging and Random Subspace ensemble methods together with correlation-based and consistency-based feature selection obtain the best accuracy rates.
Abstract: Web page classification is an important research direction in web mining. The abundant amount of data available on the web makes it essential to develop efficient and robust models for web mining tasks. Web page classification is the process of assigning a web page to a particular predefined category based on labelled data. It serves several other web mining tasks, such as focused web crawling, web link analysis and contextual advertising. Machine learning and data mining methods have been successfully applied to several web mining tasks, including web page classification. Multiple classifier systems are a promising research direction in machine learning, which aims to combine several classifiers by differentiating base classifiers and/or dataset distributions so that more robust classification models can be built. This paper presents a comparative analysis of four different feature selection methods (correlation-based, consistency-based, information gain and chi-square-based feature selection) and four different ensemble learning methods (Boosting, Bagging, Dagging and Random Subspace) built on four different base learners (naive Bayes, the K-nearest neighbour algorithm, the C4.5 algorithm and the FURIA algorithm). The article examines the predictive performance of ensemble methods for web page classification. The experimental results indicate that feature selection and ensemble learning can enhance the predictive performance of classifiers in web page classification. For the DMOZ-50 dataset, the highest average predictive performance (88.1%) is obtained with the combination of consistency-based feature selection with the AdaBoost and naive Bayes algorithms, which is a promising result for web page classification. Experimental results indicate that the Bagging and Random Subspace ensemble methods and the correlation-based and consistency-based feature selection methods obtain better results in terms of accuracy rates.
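A sketch of the feature-selection-plus-ensemble pipeline in scikit-learn, under loud assumptions: 20 Newsgroups stands in for the DMOZ pages, and chi-square selection with Bagging over naive Bayes stands in for the paper's combinations (scikit-learn offers no consistency-based selector or Dagging):

```python
from sklearn.datasets import fetch_20newsgroups   # stand-in for DMOZ pages
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import BaggingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
pipe = make_pipeline(
    TfidfVectorizer(max_features=20000),
    SelectKBest(chi2, k=2000),                            # feature selection stage
    BaggingClassifier(MultinomialNB(), n_estimators=10),  # ensemble stage
)
print(cross_val_score(pipe, data.data, data.target, cv=3).mean())
```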

101 citations


Journal ArticleDOI
TL;DR: This paper presents the design of CloudMdsQL, a cloud multidatastore query language, and its query engine; CloudMdsQL is a functional SQL-like language capable of querying multiple heterogeneous data stores within a single query that may contain embedded invocations to each data store's native query interface.
Abstract: The blooming of different cloud data management infrastructures, specialized for different kinds of data and tasks, has led to a wide diversification of DBMS interfaces and the loss of a common programming paradigm. In this paper, we present the design of a cloud multidatastore query language (CloudMdsQL), and its query engine. CloudMdsQL is a functional SQL-like language, capable of querying multiple heterogeneous data stores (relational and NoSQL) within a single query that may contain embedded invocations to each data store's native query interface. The query engine has a fully distributed architecture, which provides important opportunities for optimization. The major innovation is that a CloudMdsQL query can exploit the full power of local data stores, by simply allowing some local data store native queries (e.g. a breadth-first search query against a graph database) to be called as functions, and at the same time be optimized, e.g. by pushing down select predicates, using bind join, performing join ordering, or planning intermediate data shipping. Our experimental validation, with three data stores (graph, document and relational) and representative queries, shows that CloudMdsQL satisfies the five important requirements for a cloud multidatastore query language.
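The bind join optimization mentioned above can be illustrated outside CloudMdsQL: results from one store are bound as point lookups against another store instead of scanning it (a toy sketch with sqlite3 and a dict standing in for the relational and NoSQL stores):

```python
import sqlite3

rel = sqlite3.connect(":memory:")
rel.execute("CREATE TABLE users(id INTEGER, name TEXT)")
rel.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])

doc_store = {1: {"bio": "likes graphs"}, 3: {"bio": "likes joins"}}  # mock NoSQL

# Push the selection down to the relational store, then bind each id into a
# point lookup on the document store instead of scanning it.
for uid, name in rel.execute("SELECT id, name FROM users WHERE id < 10"):
    doc = doc_store.get(uid)
    if doc is not None:
        print(uid, name, doc["bio"])
```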

90 citations


Proceedings ArticleDOI
13 Mar 2016
TL;DR: A lab-based user study investigated potential indicators of learning in web searching, effective query strategies for learning, and the relationship between search behavior and learning outcomes, finding that searchers' perceived learning outcomes closely matched their actual learning outcomes.
Abstract: Users make frequent use of Web search for learning-related tasks, but little is known about how different Web search interaction strategies affect outcomes for learning-oriented tasks, or what implicit or explicit indicators could reliably be used to assess search-related learning on the Web. We describe a lab-based user study in which we investigated potential indicators of learning in web searching, effective query strategies for learning, and the relationship between search behavior and learning outcomes. Using questionnaires, analysis of written responses to knowledge prompts, and search log data, we found that searchers' perceived learning outcomes closely matched their actual learning outcomes; that the amount searchers wrote in post-search questionnaire responses was highly correlated with their cognitive learning scores; and that the time searchers spent per document while searching was also highly and consistently correlated with higher-level cognitive learning scores. We also found that of the three query interaction conditions we applied, an intrinsically diverse presentation of results was associated with the highest percentage of users achieving combined factual and conceptual knowledge gains. Our study provides deeper insight into which aspects of search interaction are most effective for supporting superior learning outcomes, and the difficult problem of how learning may be assessed effectively during Web search.

84 citations


Journal ArticleDOI
TL;DR: An adaptive algorithm, called ADSUD, is proposed, which redefines the approximate global skyline probability, adaptively chooses local representative tuples according to the minimum probabilistic bounding rectangle, and employs a progressive pruning method together with a reuse mechanism to improve efficiency.
Abstract: Query processing over uncertain data has gained growing attention, because it is necessary to deal with uncertain data in many real-life applications. In this paper, we investigate skyline queries over uncertain data in distributed environments (DSUD queries), whose study is only in an early stage. The state-of-the-art algorithm, called the e-DSUD algorithm, is designed for processing this query. It has the desirable characteristics of progressiveness and minimum bandwidth consumption. However, it still needs to be improved in three respects. (1) Progressiveness: each round, it returns at most one query result. (2) Efficiency: it incurs a significant amount of redundant I/O cost and numerous iterations, which cause a long total query time. (3) Universality: it is restricted to the case where local skyline tuples are incomparable. To address these concerns, we first present a detailed analysis of the e-DSUD algorithm and then develop an improved framework for the DSUD query, namely IDSUD. Based on the new framework, we propose an adaptive algorithm, called ADSUD, for the DSUD query. In the algorithm, we redefine the approximate global skyline probability and adaptively choose local representative tuples according to the minimum probabilistic bounding rectangle. Furthermore, we design a progressive pruning method and apply a reuse mechanism to improve its efficiency. The results of extensive experiments verify the better overall performance of our algorithm compared to the e-DSUD algorithm.
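For readers new to skyline queries, the dominance test that all of these variants, including the distributed uncertain one, build on looks like this (a plain deterministic sketch, not the ADSUD algorithm):

```python
def dominates(a, b):
    """a dominates b if a is no worse in every dimension and strictly better
    in at least one (here: smaller is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    """Keep every point that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

print(skyline([(1, 4), (2, 2), (3, 3), (4, 1)]))  # -> [(1, 4), (2, 2), (4, 1)]
```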

82 citations


Patent
01 Aug 2016
TL;DR: In this article, the authors describe methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for receiving an utterance and environmental data, obtaining a transcription of the utterance, identifying an entity using the environmental data and submitting a query to a natural language query processing engine.
Abstract: Disclosed are methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for receiving an utterance and environmental data, obtaining a transcription of the utterance, identifying an entity using the environmental data, submitting a query to a natural language query processing engine, wherein the query includes at least a portion of the transcription and data that identifies the entity, and obtaining one or more results of the query. [Figure: a disambiguation engine, speech recognition engine, keyword mapping engine, and content recognition engine processing utterances, environmental data, and content item data into a transcription.]

77 citations


Proceedings ArticleDOI
07 Jul 2016
TL;DR: The UQV100 test collection is described, designed to incorporate variability from users and unlock new opportunities for novel investigations and analysis, including for problems such as task-intent retrieval performance and consistency, query clustering, query difficulty prediction, and relevance feedback, among others.
Abstract: We describe the UQV100 test collection, designed to incorporate variability from users. Information need "backstories" were written for 100 topics (or sub-topics) from the TREC 2013 and 2014 Web Tracks. Crowd workers were asked to read the backstories and provide the queries they would use, plus effort estimates of how many useful documents they would have to read to satisfy the need. A total of 10,835 queries were collected from 263 workers. After normalization and spell-correction, 5,764 unique variations remained; these were then used to construct a document pool via Indri-BM25 over the ClueWeb12-B corpus. Qualified crowd workers made relevance judgments relative to the backstories, using a relevance scale similar to the original TREC approach; first to a pool depth of ten per query, then deeper on a set of targeted documents. The backstories, query variations, normalized and spell-corrected queries, effort estimates, run outputs, and relevance judgments are made available collectively as the UQV100 test collection. We also make available the judging guidelines and the gold hits we used for crowd-worker qualification and spam detection. We believe this test collection will unlock new opportunities for novel investigations and analysis, including for problems such as task-intent retrieval performance and consistency (independent of query variation), query clustering, query difficulty prediction, and relevance feedback, among others.

Proceedings ArticleDOI
24 Oct 2016
TL;DR: A learning to rewrite framework is proposed that consists of a candidate generating phase and a candidate ranking phase, which together allow the flexibility to reuse most existing query rewriters and to explicitly optimize search relevance.
Abstract: It is widely known that there exists a semantic gap between web documents and user queries, and bridging this gap is crucial to advance information retrieval systems. The task of query rewriting, aiming to alter a given query to a rewrite query that can close the gap and improve information retrieval performance, has attracted increasing attention in recent years. However, the majority of existing query rewriters are not designed to boost search performance, and consequently their rewrite queries could be sub-optimal. In this paper, we propose a learning to rewrite framework that consists of a candidate generating phase and a candidate ranking phase. The candidate generating phase gives us the flexibility to reuse most existing query rewriters, while the candidate ranking phase allows us to explicitly optimize search relevance. Experimental results on a commercial search engine demonstrate the effectiveness of the proposed framework. Further experiments are conducted to understand the important components of the proposed framework.
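The two-phase shape of the framework can be sketched generically (the rewriters and the scoring function below are toy placeholders, not the paper's learned components):

```python
def generate_candidates(query, rewriters):
    """Phase 1: reuse existing rewriters to produce candidate rewrites."""
    cands = {query}
    for rw in rewriters:
        cands.update(rw(query))
    return list(cands)

def rank_candidates(candidates, relevance_score):
    """Phase 2: order candidates by a model trained on search relevance."""
    return sorted(candidates, key=relevance_score, reverse=True)

rewriters = [lambda q: [q + " tutorial"], lambda q: [q.replace("pic", "picture")]]
cands = generate_candidates("cat pic", rewriters)
best = rank_candidates(cands, relevance_score=len)  # toy scoring function
print(best[0])
```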

Journal ArticleDOI
01 Aug 2016
TL;DR: A system design that natively supports heterogeneous data formats while minimizing query execution times is presented; its implementation, Proteus, outperforms state-of-the-art open-source and commercial systems on both synthetic and real-world workloads without being tied to a single data model or format.
Abstract: Industry and academia are continuously becoming more data-driven and data-intensive, relying on the analysis of a wide variety of heterogeneous datasets to gain insights. The different data models and formats pose a significant challenge for performing analysis over a combination of diverse datasets. Serving all queries using a single, general-purpose query engine is slow. On the other hand, using a specialized engine for each heterogeneous dataset increases complexity: queries touching a combination of datasets require an integration layer over the different engines. This paper presents a system design that natively supports heterogeneous data formats and also minimizes query execution times. For multi-format support, the design uses an expressive query algebra which enables operations over various data models. For minimal execution times, it uses a code generation mechanism to mimic the system and storage most appropriate to answer a query fast. We validate our design by building Proteus, a query engine which natively supports queries over CSV, JSON, and relational binary data, and which specializes itself to each query, dataset, and workload via code generation. Proteus outperforms state-of-the-art open-source and commercial systems on both synthetic and real-world workloads without being tied to a single data model or format, all while exposing users to a single query interface.
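The code generation idea, specializing an operator to the query at hand so per-row interpretation overhead disappears, can be illustrated in miniature (Proteus generates low-level code; this Python `exec` sketch only conveys the principle):

```python
def compile_filter(column, op, value):
    """Generate and compile a predicate function specialized to one query."""
    src = f"def _f(row):\n    return row[{column!r}] {op} {value!r}\n"
    namespace = {}
    exec(src, namespace)          # compile the specialized operator at runtime
    return namespace["_f"]

rows = [{"price": 5}, {"price": 20}]
cheap = compile_filter("price", "<", 10)
print([r for r in rows if cheap(r)])   # -> [{'price': 5}]
```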

Proceedings ArticleDOI
16 May 2016
TL;DR: This paper proposes a framework, called a Labelling AppRoach for Continuous kNN query (LARC), on road networks to cope with KCkNN queries efficiently; it builds a pivot-based reverse label index and a keyword-based pivot tree index to improve the efficiency of keyword-aware k nearest neighbour (KkNN) search.
Abstract: It is nowadays quite common for road networks to have textual contents on the vertices, which describe auxiliary information (e.g., business, traffic, etc.) associated with the vertex. In such road networks, which are modelled as weighted undirected graphs, each vertex is associated with one or more keywords, and each edge is assigned a weight, which can be its physical length or travelling time. In this paper, we study the problem of keyword-aware continuous k nearest neighbour (KCkNN) search on road networks, which computes the k nearest vertices that contain the query keywords issued by a moving object and maintains the results continuously as the object is moving on the road network. Reducing the query processing costs in terms of computation and communication has attracted considerable attention in the database community, with interesting techniques proposed. This paper proposes a framework, called a Labelling AppRoach for Continuous kNN query (LARC), on road networks to cope with KCkNN queries efficiently. First we build a pivot-based reverse label index and a keyword-based pivot tree index to improve the efficiency of keyword-aware k nearest neighbour (KkNN) search by avoiding massive network traversals and sequential probing of keywords. To reduce the frequency of unnecessary result updates, we develop the concepts of dominance interval and region on road networks, which share a similar intuition with the safe region for processing continuous queries in Euclidean space but are more complicated and thus require a more dedicated design. For high-frequency keywords, we recompute the dominance interval when the query results change. In addition, a path-based dominance updating approach is proposed to compute the dominance region efficiently when the query keywords are of low frequency. We conduct extensive experiments by comparing our algorithms with the state-of-the-art methods on real data sets. The empirical observations have verified the superiority of our proposed solution in all aspects of index size, communication cost and computation time.
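A baseline KkNN computation of the kind LARC's indexes are designed to avoid, network traversal with a keyword filter, can be sketched with networkx (the toy graph and keyword sets are assumptions):

```python
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([("a", "b", 2), ("b", "c", 1), ("a", "d", 5)])
keywords = {"a": {"fuel"}, "b": {"cafe"}, "c": {"cafe"}, "d": {"cafe"}}

def kknn(source, keyword, k):
    """Dijkstra expansion, then keep the k nearest keyword-matching vertices."""
    dist = nx.single_source_dijkstra_path_length(G, source, weight="weight")
    hits = [(d, v) for v, d in dist.items() if keyword in keywords[v]]
    return sorted(hits)[:k]

print(kknn("a", "cafe", 2))   # -> [(2, 'b'), (3, 'c')]
```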

Journal ArticleDOI
TL;DR: This paper proposes the first range query processing scheme that achieves index indistinguishability under the indistinguishability against chosen keyword attack (IND-CKA) model, and proposes two algorithms, namely PBtree traversal width minimization and PBtree traversal depth minimization, to improve query processing efficiency.
Abstract: Privacy has been the key road block to cloud computing as clouds may not be fully trusted. This paper is concerned with the problem of privacy-preserving range query processing on clouds. Prior schemes are weak in privacy protection as they cannot achieve index indistinguishability, and therefore allow the cloud to statistically estimate the values of data and queries using domain knowledge and history query results. In this paper, we propose the first range query processing scheme that achieves index indistinguishability under the indistinguishability against chosen keyword attack (IND-CKA). Our key idea is to organize indexing elements in a complete binary tree called PBtree, which satisfies structure indistinguishability (i.e., two sets of data items have the same PBtree structure if and only if the two sets have the same number of data items) and node indistinguishability (i.e., the values of PBtree nodes are completely random and have no statistical meaning). We prove that our scheme is secure under the widely adopted IND-CKA security model. We propose two algorithms, namely PBtree traversal width minimization and PBtree traversal depth minimization, to improve query processing efficiency. We prove that the worst-case complexity of our query processing algorithm using PBtree is $O(\vert R\vert\log n)$ , where $n$ is the total number of data items and $R$ is the set of data items in the query result. We implemented and evaluated our scheme on a real-world dataset with 5 million items. For example, for a query whose results contain 10 data items, it takes only 0.17 ms.
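The structure indistinguishability property is easy to visualize: a complete binary tree's shape depends only on the number of items, so equal-sized datasets are indistinguishable by shape (a sketch of the shape property alone; the PBtree additionally randomizes node values and relies on the encryption scheme for its guarantees):

```python
def complete_tree_shape(n):
    """Shape of a complete binary tree over n items, independent of the values."""
    if n <= 1:
        return n
    left = (n + 1) // 2    # split determined purely by the count
    return {"left": complete_tree_shape(left), "right": complete_tree_shape(n - left)}

# Any two 5-item datasets produce the identical shape, whatever their values:
assert complete_tree_shape(5) == complete_tree_shape(5)
print(complete_tree_shape(5))
```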

Journal ArticleDOI
TL;DR: This paper proposes an ontology-based object-attribute-value (O-A-V) information extraction system, presented as a web model that acts as a user dictionary to refine the search keywords in subsequent query attempts, with the goal of improving standard information retrieval systems.
Abstract: In the internet era, search engines play a vital role in information retrieval from web pages. Search engines arrange the retrieved results using various ranking algorithms. Additionally, retrieval is based on statistical searching techniques or content-based information extraction methods. It is still difficult for the user to understand the abstract details of every web page unless the user opens it separately to view the web content. This key point provided the motivation to propose an ontology-based object-attribute-value (O-A-V) information extraction system as a web model that acts as a user dictionary to refine the search keywords in the query for subsequent attempts. This first model is evaluated using various natural language processing (NLP) queries given as English sentences. Additionally, image search engines, such as Google Images, use content-based image information extraction and retrieval of web pages against the user query. To minimize the semantic gap between the image retrieval results and the expected user results, the domain ontology is built using image descriptions. The second proposed model initially examines natural language user queries using an NLP parser algorithm that identifies the subject-predicate-object (S-P-O) structure of the query. S-P-O extraction extends the idea of the ontology-based O-A-V web model. Given this S-P-O extraction, and considering how complex writing SPARQL protocol and RDF query language (SPARQL) queries is from the user's point of view, a SPARQL auto-generation module is proposed that automatically generates the SPARQL query. The query is then deployed on the ontology, and images are retrieved based on the auto-generated SPARQL query. With the proposed methodology above, this paper seeks answers to the following two questions. First, how can the use of domain ontology and semantics be combined to improve information retrieval and the user experience? Second, does this new unified framework improve standard information retrieval systems? To answer these questions, a document retrieval system and an image retrieval system were built to test our proposed framework. The web document retrieval was tested against three keyword/bag-of-words models and a semantic ontology model. Image retrieval was tested on the IAPR TC-12 benchmark dataset. The precision, recall and accuracy results were then compared against standard information retrieval systems using TREC_EVAL. The results indicated improvements over the standard systems. A controlled experiment was performed by test subjects querying the retrieval system in the absence and presence of our proposed framework. The queries were measured using two metrics, time and click-count. Comparisons were made on the retrieval performed with and without our proposed framework. The results were encouraging.
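The SPARQL auto-generation step reduces to a template fill once the S-P-O triple is extracted; a sketch with a hypothetical namespace and property names (not the paper's ontology):

```python
def spo_to_sparql(subject, predicate, obj):
    """Fill a SPARQL template from an extracted subject-predicate-object triple."""
    return (
        "PREFIX ex: <http://example.org/onto#>\n"
        "SELECT ?image WHERE {\n"
        f"  ?image ex:{predicate} ex:{obj} .\n"
        f"  ?image ex:subject ex:{subject} .\n"
        "}"
    )

print(spo_to_sparql("dog", "locatedIn", "park"))
```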

Journal ArticleDOI
01 Dec 2016
TL;DR: This work introduces a novel query paradigm that considers a user query as an example of the data in which the user is interested and provides a formal specification of their semantics, which are fundamentally different from notions like queries by example, approximate queries and related queries.
Abstract: Modern search engines employ advanced techniques that go beyond the structures that strictly satisfy the query conditions in an effort to better capture the user intentions. In this work, we introduce a novel query paradigm that considers a user query as an example of the data in which the user is interested. We call these queries exemplar queries. We provide a formal specification of their semantics and show that they are fundamentally different from notions like queries by example, approximate queries and related queries. We provide an implementation of these semantics for knowledge graphs and present an exact solution with a number of optimizations that improve performance without compromising the result quality. We study two different congruence relations, isomorphism and strong simulation, for identifying the answers to an exemplar query. We also provide an approximate solution that prunes the search space and achieves considerably better time performance with minimal or no impact on effectiveness. The effectiveness and efficiency of these solutions with synthetic and real datasets are experimentally evaluated, and the importance of exemplar queries in practice is illustrated.
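One of the two congruence relations studied, subgraph isomorphism, can be demonstrated on a toy knowledge graph with networkx (the graph and relation labels are invented for illustration):

```python
import networkx as nx
from networkx.algorithms import isomorphism

kg = nx.DiGraph()
kg.add_edges_from([("google", "mountain_view", {"rel": "hq_in"}),
                   ("apple", "cupertino", {"rel": "hq_in"}),
                   ("alice", "bob", {"rel": "knows"})])

example = nx.DiGraph()                        # the user's example: a company + HQ
example.add_edge("google", "mountain_view", rel="hq_in")

# Every subgraph of the knowledge graph congruent to the example is an answer.
matcher = isomorphism.DiGraphMatcher(
    kg, example, edge_match=lambda e1, e2: e1["rel"] == e2["rel"])
for mapping in matcher.subgraph_isomorphisms_iter():
    print(mapping)   # finds both google/mountain_view and apple/cupertino
```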

Proceedings ArticleDOI
24 Oct 2016
TL;DR: This paper proposes RFMF, a PRF framework based on matrix factorization, a state-of-the-art technique in collaborative recommender systems, and implements the framework for two widely used document retrieval frameworks: language modeling and the vector space model.
Abstract: In information retrieval, pseudo-relevance feedback (PRF) refers to a strategy for updating the query model using the top retrieved documents. PRF has been proven to be highly effective in improving the retrieval performance. In this paper, we look at the PRF task as a recommendation problem: the goal is to recommend a number of terms for a given query along with weights, such that the final weights of terms in the updated query model better reflect the terms' contributions in the query. To do so, we propose RFMF, a PRF framework based on matrix factorization which is a state-of-the-art technique in collaborative recommender systems. Our purpose is to predict the weight of terms that have not appeared in the query and matrix factorization techniques are used to predict these weights. In RFMF, we first create a matrix whose elements are computed using a weight function that shows how much a term discriminates the query or the top retrieved documents from the collection. Then, we re-estimate the created matrix using a matrix factorization technique. Finally, the query model is updated using the re-estimated matrix. RFMF is a general framework that can be employed with any retrieval model. In this paper, we implement this framework for two widely used document retrieval frameworks: language modeling and the vector space model. Extensive experiments over several TREC collections demonstrate that the RFMF framework significantly outperforms competitive baselines. These results indicate the potential of using other recommendation techniques in this task.
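The RFMF idea can be sketched with truncated SVD standing in for the paper's factorization technique: re-estimating a low-rank version of the query/feedback-document term matrix produces nonzero weights for terms absent from the query (all data below is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.random((11, 200))      # rows: query + 10 feedback docs; columns: terms
M[M < 0.7] = 0.0               # sparse, discriminative weights (placeholder)

U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 5
M_hat = (U[:, :k] * s[:k]) @ Vt[:k]     # low-rank re-estimation fills in weights

query_row = M_hat[0]                    # re-estimated weights for the query row
print(np.argsort(-query_row)[:10])      # candidate terms for the updated model
```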

Proceedings ArticleDOI
01 Jun 2016
TL;DR: Using an architecture that combines a stack of DLSTM layers with a traditional CNN layer, new state-of-the-art accuracy is observed on benchmark data sets for query classification.
Abstract: Traditional convolutional neural network (CNN) based query classification uses linear feature mapping in its convolution operation. The recurrent neural network (RNN) differs from a CNN in representing word sequences with their ordering information kept explicitly. We propose using a deep long short-term memory (DLSTM) based feature mapping to learn feature representations for a CNN. The DLSTM, a stack of LSTM units, produces feature representations of different orders at different depths. The bottom LSTM unit, equipped with input and output gates, extracts the first-order feature representation from the current word. To extract higher-order nonlinear feature representations, an LSTM unit at a higher position receives input from two parts: the lower LSTM unit's memory cell from the previous word, and the lower LSTM unit's hidden output from the current word. In this way, the DLSTM captures the nonlinear, non-consecutive interactions within n-grams. Using an architecture that combines a stack of DLSTM layers with a traditional CNN layer, we have observed new state-of-the-art query classification accuracy on benchmark data sets.
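A sketch of the stacked-LSTM-into-CNN architecture in PyTorch, with the caveat that a standard stacked `nn.LSTM` stands in for the paper's DLSTM variant (which additionally wires the lower unit's memory cell into the upper unit), and all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class DLSTMCNN(nn.Module):
    def __init__(self, vocab=10000, emb=128, hidden=128, classes=67):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        # Standard 3-layer stacked LSTM as a stand-in for the DLSTM.
        self.dlstm = nn.LSTM(emb, hidden, num_layers=3, batch_first=True)
        self.conv = nn.Conv1d(hidden, 256, kernel_size=3, padding=1)
        self.fc = nn.Linear(256, classes)

    def forward(self, token_ids):                     # (batch, seq_len)
        h, _ = self.dlstm(self.embed(token_ids))      # (batch, seq_len, hidden)
        c = torch.relu(self.conv(h.transpose(1, 2)))  # (batch, 256, seq_len)
        return self.fc(c.max(dim=2).values)           # max-pool over time

logits = DLSTMCNN()(torch.randint(0, 10000, (4, 7)))  # 4 queries, 7 tokens each
```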

Proceedings ArticleDOI
26 Jun 2016
TL;DR: The combined approach, data-driven query chopping, achieves robust and scalable performance on co-processors, as validated with the open-source GPU-accelerated database engine CoGaDB and the popular star schema and TPC-H benchmarks.
Abstract: Technology limitations are making the use of heterogeneous computing devices much more than an academic curiosity. In fact, the use of such devices is widely acknowledged to be the only promising way to achieve application-speedups that users urgently need and expect. However, building a robust and efficient query engine for heterogeneous co-processor environments is still a significant challenge. In this paper, we identify two effects that limit performance when co-processor resources become scarce. Cache thrashing occurs when the working set of queries does not fit into the co-processor's data cache, resulting in performance degradations up to a factor of 24. Heap contention occurs when multiple operators run in parallel on a co-processor and when their accumulated memory footprint exceeds the main memory capacity of the co-processor, slowing down query execution by up to a factor of six. We propose solutions for both effects. Data-driven operator placement avoids data movements when they might be harmful; query chopping limits co-processor memory usage and thus avoids contention. The combined approach, data-driven query chopping, achieves robust and scalable performance on co-processors. We validate our proposal with our open-source GPU-accelerated database engine CoGaDB and the popular star schema and TPC-H benchmarks.
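Query chopping in miniature: bounding the co-processor's memory footprint by executing over fixed-size chunks rather than the whole input at once (a CPU-side sketch of the idea, not CoGaDB's operator model):

```python
import numpy as np

def chopped_sum(column, chunk_rows=1_000_000):
    """Aggregate in chunks so the working set stays within a fixed budget."""
    total = 0.0
    for start in range(0, len(column), chunk_rows):
        chunk = column[start:start + chunk_rows]   # sized to fit the data cache
        total += float(chunk.sum())                # would run on the co-processor
    return total

col = np.arange(3_500_000, dtype=np.float64)
assert chopped_sum(col) == float(col.sum())
```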

Book
15 Mar 2016
TL;DR: Query reformulation refers to a process of translating a source query into a target plan that abides by certain interface restrictions as discussed by the authors, where the goal is to convert implicit definitions into explicit definitions using an approach known as interpolation.
Abstract: Query reformulation refers to a process of translating a source query—a request for information in some high-level logic-based language—into a target plan that abides by certain interface restrictions. Many practical problems in data management can be seen as instances of the reformulation problem, for example: the problem of translating an SQL query written over a set of base tables into another query written over a set of views; the problem of implementing a query by translating it to a program calling a set of database APIs; and the problem of implementing a query using a collection of web services. In this book we approach query reformulation in a very general setting that encompasses all the problems above, by relating it to a line of research within mathematical logic. For many decades logicians have looked at the problem of converting "implicit definitions" into "explicit definitions," using an approach known as interpolation. We review the theory of interpolation and explain its close connection to query reformulation.

Journal ArticleDOI
Xu Zhou, Kenli Li, Guoqing Xiao, Yantao Zhou, Keqin Li
TL;DR: This paper formulates an uncertain dynamic skyline (UDS) query over a probabilistic product set, proposes effective pruning strategies for the UDS query, and integrates them into effective algorithms.
Abstract: With the development of the economy, products are significantly enriched, and uncertainty has been their inherent quality. The probabilistic dynamic skyline (PDS) query is a powerful tool for customers to use in selecting products according to their preferences. However, this query suffers several limitations: it requires the specification of a probabilistic threshold, which reports undesirable results and disregards important results; it only focuses on the objects that have large dynamic skyline probabilities; and, additionally, the results are not stable. To address this concern, in this paper, we formulate an uncertain dynamic skyline (UDS) query over a probabilistic product set. Furthermore, we propose effective pruning strategies for the UDS query, and integrate them into effective algorithms. In addition, a novel query type, namely the top $k$ favorite probabilistic products (TFPP) query, is presented. The TFPP query is utilized to select $k$ products which can meet the needs of a customer set at the maximum level. To tackle the TFPP query, we propose a TFPP algorithm and its efficient parallelization. Extensive experiments with a variety of experimental settings illustrate the efficiency and effectiveness of our proposed algorithms.

Proceedings ArticleDOI
08 Feb 2016
TL;DR: The TriniT search engine for querying and ranking over extended knowledge graphs that combine relational facts with textual web contents is presented, along with a model for automatic query relaxation that compensates for mismatches between the data and a user's query.
Abstract: Entity search over text corpora is not geared for relationship queries where answers are tuples of related entities and where a query often requires joining cues from multiple documents. With large knowledge graphs, structured querying on their relational facts is an alternative, but often suffers from poor recall because of mismatches between user queries and the knowledge graph or because of weakly populated relations. This paper presents the TriniT search engine for querying and ranking on extended knowledge graphs that combine relational facts with textual web contents. Our query language is designed on the paradigm of SPO triple patterns, but is more expressive, supporting textual phrases for each of the SPO arguments. We present a model for automatic query relaxation to compensate for mismatches between the data and a user's query. Query answers -- tuples of entities -- are ranked by a statistical language model. We present experiments with different benchmarks, including complex relationship queries, over a combination of the Yago knowledge graph and the entity-annotated ClueWeb'09 corpus.
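The query relaxation step can be sketched over SPO triple patterns: each relaxation replaces one constant with a fresh variable, trading precision for recall (the statistical model that ranks relaxed answers is omitted):

```python
def relaxations(pattern):
    """Generate one-step relaxations of an SPO triple pattern."""
    out = []
    for i, slot in enumerate(pattern):
        if not slot.startswith("?"):          # only constants can be relaxed
            relaxed = list(pattern)
            relaxed[i] = f"?v{i}"             # fresh variable for that slot
            out.append(tuple(relaxed))
    return out

print(relaxations(("?musician", "bornIn", "Vienna")))
# -> [('?musician', '?v1', 'Vienna'), ('?musician', 'bornIn', '?v2')]
```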

Proceedings ArticleDOI
01 Oct 2016
TL;DR: The vitrivr architecture is self-contained and addresses all aspects of multimedia search, from offline feature extraction and database management to frontend user interaction, and thus offers a large variety of different query modes which can be seamlessly combined.
Abstract: vitrivr is an open source full-stack content-based multimedia retrieval system with focus on video. Unlike the majority of existing multimedia search solutions, vitrivr is not limited to searching in metadata, but also provides content-based search and thus offers a large variety of different query modes which can be seamlessly combined: query by sketch, which allows the user to draw a sketch of a query image and/or sketch motion paths, query by example, keyword search, and relevance feedback. The vitrivr architecture is self-contained and addresses all aspects of multimedia search, from offline feature extraction and database management to frontend user interaction. The system is composed of three modules: a web-based frontend which allows the user to input the query (e.g., add a sketch) and browse the retrieved results (vitrivr-ui), a database system designed for interactive search in large-scale multimedia collections (ADAM), and a retrieval engine that handles feature extraction and feature-based retrieval (Cineast). The vitrivr source is available on GitHub under the MIT open source (and similar) licenses and is currently undergoing several upgrades as part of the Google Summer of Code 2016.

Journal ArticleDOI
TL;DR: To achieve privacy-preserving spatial range queries, the first predicate-only encryption scheme for inner product range (IPRE) is proposed, which can be used to detect whether a position is within a given circular area in a privacy-preserving way.
Abstract: With the pervasiveness of smart phones, location-based services (LBS) have received considerable attention and become more popular and vital recently. However, the use of LBS also poses a potential threat to users' location privacy. In this paper, aiming at spatial range query, a popular LBS providing information about points of interest (POIs) within a given distance, we present an efficient and privacy-preserving location-based query solution, called EPLQ. Specifically, to achieve privacy-preserving spatial range query, we propose the first predicate-only encryption scheme for inner product range (IPRE), which can be used to detect whether a position is within a given circular area in a privacy-preserving way. To reduce query latency, we further design a privacy-preserving tree index structure in EPLQ. Detailed security analysis confirms the security properties of EPLQ. In addition, extensive experiments are conducted, and the results demonstrate that EPLQ is very efficient in privacy-preserving spatial range query over outsourced encrypted data. In particular, for a mobile LBS user using an Android phone, around 0.9 s is needed to generate a query, and the commodity workstation that plays the role of the cloud in our experiments needs only a few seconds to search POIs.
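The inner-product trick that IPRE encrypts can be shown in plaintext: membership of a point in a circle becomes a sign test on the inner product of two augmented vectors, since x² + y² − 2ax − 2by + a² + b² − r² = (x−a)² + (y−b)² − r². The encryption layer, the actual contribution, is omitted here:

```python
import numpy as np

def augment_point(x, y):
    """Data-side vector for position (x, y)."""
    return np.array([x * x + y * y, x, y, 1.0])

def augment_query(a, b, r):
    """Query-side vector for the circle with center (a, b) and radius r."""
    return np.array([1.0, -2 * a, -2 * b, a * a + b * b - r * r])

def in_range(p, q):
    """Inner product <= 0 exactly when the point lies inside the circle."""
    return float(augment_point(*p) @ augment_query(*q)) <= 0.0

print(in_range((1, 1), (0, 0, 2)))   # True: distance sqrt(2) <= 2
print(in_range((3, 0), (0, 0, 2)))   # False: distance 3 > 2
```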

Proceedings ArticleDOI
26 Jun 2016
TL;DR: This demonstration presents a Cloud Multidatastore Query Language (CloudMdsQL), and its query engine, a functional SQL-like language capable of querying multiple heterogeneous data stores within a single query that may contain embedded invocations to each data store's native query interface.
Abstract: The blooming of different cloud data management infrastructures has turned multistore systems to a major topic in the nowadays cloud landscape. In this demonstration, we present a Cloud Multidatastore Query Language (CloudMdsQL), and its query engine. CloudMdsQL is a functional SQL-like language, capable of querying multiple heterogeneous data stores (relational and NoSQL) within a single query that may contain embedded invocations to each data store's native query interface. The major innovation is that a CloudMdsQL query can exploit the full power of local data stores, by simply allowing some local data store native queries (e.g. a breadth-first search query against a graph database) to be called as functions, and at the same time be optimized. Within our demonstration, we focus on two use cases each involving four diverse data stores (graph, document, relational, and key-value) with its corresponding CloudMdsQL queries. The query execution flows are visualized by an embedded real-time monitoring subsystem. The users can also try out different ad-hoc queries, not necessarily in the context of the use cases.

01 Jan 2016
TL;DR: The experimental results in this thesis indicate that the proposed query auto completion approaches can improve the ranking performance of query completions in terms of well-known metrics, like Mean Reciprocal Rank.
Abstract: Query auto completion is an important feature embedded in today's search engines. It helps users formulate queries that other people have searched for as they finish typing a query prefix. Today's most sophisticated query auto completion approaches rely on collected query logs to provide the best possible queries for each searcher's input. In this thesis, we develop new query auto completion methods for information retrieval. First, we consider the information of both time and user to propose a time-sensitive personalized query auto completion approach. In previous work, these two sources of information have been developed separately. We bring them together and pay special attention to long-tail prefixes. Second, based on a learning-to-rank framework, we propose to extract features originating from so-called homologous queries and from the semantic similarity of terms, which allow the contributions from similar queries and from semantic relatedness to be used for query auto completion. In addition, we study the problem of query auto completion diversification, where we aim to diversify the aspect-level query intents of query completions. This task has not been studied before. Given that only a limited number of query completions can be returned to users of a search engine, it is important to remove redundant queries and improve user satisfaction by finding an acceptable query. Finally, we investigate when to personalize query auto completion by proposing a selectively personalizing query auto completion approach, where the weight of personalization in a query auto completion model is selectively assigned based on the search context in the session. The experimental results in this thesis indicate that our proposed query auto completion approaches can improve the ranking performance of query completions in terms of well-known metrics, like Mean Reciprocal Rank. The unique insights and interesting findings in this thesis may help search engine designers improve the satisfaction of search engine users by providing high-quality query completions.

Journal ArticleDOI
TL;DR: The effectiveness of the proposed L2R-QAC model with the newly added features is analyzed, and it significantly outperforms state-of-the-art QAC models, whether based on learning to rank or on popularity.
Abstract: We propose a learning to rank based query auto completion model (L2R-QAC) that exploits contributions from so-called homologous queries for a QAC candidate, in which two kinds of homologous queries are taken into account. We propose semantic features for QAC, using the semantic relatedness of terms inside a query candidate and of pairs of terms from a candidate and from queries previously submitted in the same session. We analyze the effectiveness of our L2R-QAC model with newly added features, and find that it significantly outperforms state-of-the-art QAC models, either based on learning to rank or on popularity. Query auto completion (QAC) models recommend possible queries to web search users when they start typing a query prefix. Most of today's QAC models rank candidate queries by popularity (i.e., frequency), and in doing so they tend to follow a strict query matching policy when counting the queries. That is, they ignore the contributions from so-called homologous queries, queries with the same terms but ordered differently or queries that expand the original query. Importantly, homologous queries often express a remarkably similar search intent. Moreover, today's QAC approaches often ignore semantically related terms. We argue that users are prone to combine semantically related terms when generating queries. We propose a learning to rank-based QAC approach, where, for the first time, features derived from homologous queries and semantically related terms are introduced. In particular, we consider: (i) the observed and predicted popularity of homologous queries for a query candidate; and (ii) the semantic relatedness of pairs of terms inside a query and pairs of queries inside a session. We quantify the improvement of the proposed new features using two large-scale real-world query logs and show that the mean reciprocal rank and the success rate can be improved by up to 9% over state-of-the-art QAC models.
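The first kind of homologous query, same terms in a different order, yields a simple popularity feature; a sketch (the paper's feature set and learning-to-rank model are much richer):

```python
from collections import Counter

# Credit a candidate with the frequency of all queries that contain the
# same terms in any order, rather than only exact string matches.
query_log = ["new york hotels", "hotels new york", "york new hotels",
             "new york weather"]
freq = Counter(tuple(sorted(q.split())) for q in query_log)

def homologous_popularity(candidate):
    return freq[tuple(sorted(candidate.split()))]

print(homologous_popularity("new york hotels"))   # 3, not 1: order is ignored
```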

Proceedings ArticleDOI
14 Jun 2016
TL;DR: iOLAP is an incremental OLAP query engine that provides a smooth trade-off between query accuracy and latency, and fulfills a full spectrum of user requirements from approximate but timely query execution to a more traditional accurate query execution.
Abstract: The size of data and the complexity of analytics continue to grow along with the need for timely and cost-effective analysis. However, the growth of computation power cannot keep up with the growth of data. This calls for a paradigm shift from the traditional batch OLAP processing model to an incremental OLAP processing model. In this paper, we propose iOLAP, an incremental OLAP query engine that provides a smooth trade-off between query accuracy and latency, and fulfills a full spectrum of user requirements from approximate but timely query execution to a more traditional accurate query execution. iOLAP enables interactive incremental query processing using a novel mini-batch execution model---given an OLAP query, iOLAP first randomly partitions the input dataset into smaller sets (mini-batches) and then incrementally processes through these mini-batches by executing a delta update query on each mini-batch, where each subsequent delta update query computes an update based on the output of the previous one. The key idea behind iOLAP is a novel delta update algorithm that models delta processing as an uncertainty propagation problem, and minimizes the recomputation during each subsequent delta update by minimizing the uncertainties in the partial (including intermediate) query results. We implement iOLAP on top of Apache Spark and have successfully demonstrated it at scale on over 100 machines. Extensive experiments on a multitude of queries and datasets demonstrate that iOLAP can deliver approximate query answers for complex OLAP queries orders of magnitude faster than traditional OLAP engines, while continuously delivering updates every few seconds.
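Mini-batch incremental aggregation in miniature: each batch refines a running answer, so an approximate result is available long before the full scan finishes (iOLAP's delta-update machinery generalizes this far beyond simple averages):

```python
import random

random.seed(42)
data = [random.gauss(100, 15) for _ in range(1_000_000)]
random.shuffle(data)                    # random partitioning, as in iOLAP

count, total = 0, 0.0
batch_size = 100_000
for i in range(0, len(data), batch_size):
    batch = data[i:i + batch_size]
    count += len(batch)                 # delta update of the running state
    total += sum(batch)
    print(f"after {count:>9} rows: avg ~ {total / count:.3f}")  # early answer
```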

Proceedings ArticleDOI
26 Jun 2016
TL;DR: In this article, the authors propose a sampling-based iterative procedure that requires almost no changes to the original query optimizer or query evaluation mechanism of the system, and show that this indeed imposes low overhead and catches cases where three widely used optimizers (PostgreSQL and two commercial systems) make large errors.
Abstract: Despite decades of work, query optimizers still make mistakes on "difficult" queries because of bad cardinality estimates, often due to the interaction of multiple predicates and correlations in the data. In this paper, we propose a low-cost post-processing step that can take a plan produced by the optimizer, detect when it is likely to have made such a mistake, and take steps to fix it. Specifically, our solution is a sampling-based iterative procedure that requires almost no changes to the original query optimizer or query evaluation mechanism of the system. We show that this indeed imposes low overhead and catches cases where three widely used optimizers (PostgreSQL and two commercial systems) make large errors.
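The sampling-based check can be sketched as follows: compare the optimizer's independence-assumption estimate against a sample-based estimate and flag the plan when they diverge (thresholds and data are illustrative; the paper's procedure is iterative and plan-aware):

```python
import random

random.seed(0)
# Correlated data: the two columns are always equal, which an
# independence-assuming optimizer does not know.
table = [(x, x) for x in (random.randint(0, 9) for _ in range(100_000))]

def sampled_cardinality(pred, sample_size=1_000):
    """Scale the sample hit rate up to the full table size."""
    sample = random.sample(table, sample_size)
    return len(table) * sum(pred(r) for r in sample) / sample_size

pred = lambda r: r[0] == 3 and r[1] == 3
optimizer_estimate = 100_000 * 0.1 * 0.1   # 1000 under independence
estimate = sampled_cardinality(pred)       # ~10000: the predicates are correlated

ratio = max(estimate, optimizer_estimate) / max(min(estimate, optimizer_estimate), 1)
if ratio > 4:                              # illustrative threshold
    print(f"large estimation error (x{ratio:.0f}): consider re-optimizing")
```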