
Showing papers on "Web query classification published in 2014"


Proceedings ArticleDOI
19 May 2014
TL;DR: Proposes a secure kNN protocol that protects the confidentiality of the data, the user's input query, and data access patterns, and empirically analyzes the efficiency of the protocols through various experiments.
Abstract: For the past decade, query processing on relational data has been studied extensively, and many theoretical and practical solutions to query processing have been proposed under various scenarios. With the recent popularity of cloud computing, users now have the opportunity to outsource their data as well as the data management tasks to the cloud. However, due to the rise of various privacy issues, sensitive data (e.g., medical records) need to be encrypted before being outsourced to the cloud. In addition, query processing tasks should be handled by the cloud; otherwise, there would be no point in outsourcing the data in the first place. Processing queries over encrypted data without the cloud ever decrypting the data is a very challenging task. In this paper, we focus on solving the k-nearest neighbor (kNN) query problem over an encrypted database outsourced to a cloud: a user issues an encrypted query record to the cloud, and the cloud returns the k closest records to the user. We first present a basic scheme and demonstrate that such a naive solution is not secure. To provide better security, we propose a secure kNN protocol that protects the confidentiality of the data, the user's input query, and data access patterns. We also empirically analyze the efficiency of our protocols through various experiments. These results indicate that our secure protocol is very efficient on the user end, and this lightweight scheme allows a user to use any mobile device to perform the kNN query.

285 citations
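To make the protocol's target concrete, here is the plaintext functionality it emulates, an ordinary kNN query in Python: in the paper's setting the cloud returns exactly this answer while operating only on encrypted records and an encrypted query. The record layout and distance metric below are illustrative assumptions, not details from the paper.

```python
import heapq
import math

def knn(records, query, k):
    """Return the k records closest to the query point under Euclidean
    distance. This is the plaintext functionality; in the paper's setting
    the cloud computes the same answer over encrypted records without
    learning the data, the query, or the access pattern.
    """
    def dist(r):
        return math.dist(r, query)
    return heapq.nsmallest(k, records, key=dist)

# A user's query (normally encrypted) against outsourced records.
records = [(1.0, 2.0), (3.0, 4.0), (0.5, 1.5), (9.0, 9.0)]
print(knn(records, query=(1.0, 1.0), k=2))  # -> [(0.5, 1.5), (1.0, 2.0)]
```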


Journal ArticleDOI
TL;DR: This paper extends the most relevant prior theoretical model of explanations for intelligent systems to account for some missing elements, and defines a new sort of explanation as a minimal set of words, such that removing all words within this set from the document changes the predicted class from the class of interest.
Abstract: Many document classification applications require human understanding of the reasons for data-driven classification decisions by managers, client-facing employees, and the technical team. Predictive models treat documents as data to be classified, and document data are characterized by very high dimensionality, often with tens of thousands to millions of variables (words). Unfortunately, due to the high dimensionality, understanding the decisions made by document classifiers is very difficult. This paper begins by extending the most relevant prior theoretical model of explanations for intelligent systems to account for some missing elements. The main theoretical contribution is the definition of a new sort of explanation as a minimal set of words (terms, generally), such that removing all words within this set from the document changes the predicted class from the class of interest. We present an algorithm to find such explanations, as well as a framework to assess such an algorithm's performance. We demonstrate the value of the new approach with a case study from a real-world document classification task: classifying web pages as containing objectionable content, with the goal of allowing advertisers to choose not to have their ads appear on those pages. A second empirical demonstration on news-story topic classification shows the explanations to be concise and document-specific, and to be capable of providing understanding of the exact reasons for the classification decisions, of the workings of the classification models, and of the business application itself. We also illustrate how explaining the classifications of documents can help to improve data quality and model performance.

269 citations
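The explanation definition is operational enough to sketch. The greedy search below is an assumption on my part (the authors describe a more careful search algorithm and an evaluation framework); it removes words until the predicted class flips, with `predict` standing in for any document classifier.

```python
def explain(words, predict, target_class):
    """Find a small set of words whose removal flips the predicted class.

    `predict` maps a list of words to a class label (a hypothetical
    interface). A greedy sketch of the paper's notion of a minimal
    explanation; treat the search strategy here as illustrative only.
    """
    kept = list(words)
    removed = []
    while kept and predict(kept) == target_class:
        flipped = False
        for w in dict.fromkeys(kept):  # distinct words, stable order
            trial = [x for x in kept if x != w]
            if predict(trial) != target_class:
                removed.append(w)      # removing w flips the class: done
                kept = trial
                flipped = True
                break
        if not flipped:                # no single word suffices yet;
            first = kept[0]            # drop one and keep growing the set
            removed.append(first)
            kept = [x for x in kept if x != first]
    return removed

# Toy classifier: a page is "objectionable" if it contains both words.
predict = lambda ws: "objectionable" if {"bad", "worse"} <= set(ws) else "ok"
print(explain(["a", "bad", "day", "worse", "night"], predict, "objectionable"))
# -> ['bad']: removing "bad" alone changes the predicted class
```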


Proceedings ArticleDOI
18 May 2014
TL;DR: This work addresses a major open problem in private DB: efficient sublinear search for arbitrary Boolean queries. It allows leakage of some search pattern information, but protects the query and data, and provides a high level of privacy for individual terms in the executed search formula.
Abstract: Query privacy in secure DBMS is an important feature, although rarely formally considered outside the theoretical community. Because of the high overheads of guaranteeing privacy in complex queries, almost all previous works addressing practical applications consider limited queries (e.g., just keyword search), or provide a weak guarantee of privacy. In this work, we address a major open problem in private DB: efficient sublinear search for arbitrary Boolean queries. We consider scalable DBMS with provable security for all parties, including protection of the data from both server (who stores encrypted data) and client (who searches it), as well as protection of the query, and access control for the query. We design, build, and evaluate the performance of a rich DBMS system, suitable for real-world deployment on today's medium- to large-scale DBs. On a modern server, we are able to query a formula over a 10TB, 100M-record DB, with 70 searchable index terms per DB row, in time comparable to (insecure) MySQL (many practical queries can be privately executed with work 1.2-3 times slower than MySQL, although some queries are costlier). We support a rich query set, including searching on arbitrary Boolean formulas on keywords and ranges, support for stemming, and free keyword searches over text fields. We identify and permit a reasonable and controlled amount of leakage, proving that no further leakage is possible. In particular, we allow leakage of some search pattern information, but protect the query and data, provide a high level of privacy for individual terms in the executed search formula, and hide the difference between a query that returned no results and a query that returned a very small result set. We also support private and complex access policies, integrated in the search process so that a query with an empty result set and a query that fails the policy are hard to tell apart.

268 citations


Proceedings ArticleDOI
18 Jun 2014
TL;DR: Shows that it is possible to implement a query approximation pipeline that produces approximate answers and reliable error bars at interactive speeds, after finding that existing error bar estimation techniques had not been validated on real query workloads and often fail on real-world production workloads.
Abstract: Modern data analytics applications typically process massive amounts of data on clusters of tens, hundreds, or thousands of machines to support near-real-time decisions. The quantity of data and limitations of disk and memory bandwidth often make it infeasible to deliver answers at interactive speeds. However, it has been widely observed that many applications can tolerate some degree of inaccuracy. This is especially true for exploratory queries on data, where users are satisfied with "close-enough" answers if they arrive quickly. A popular technique for speeding up queries at the cost of accuracy is to execute each query on a sample of data, rather than the whole dataset. To ensure that the returned result is not too inaccurate, past work on approximate query processing has used statistical techniques to estimate "error bars" on returned results. However, existing work in the sampling-based approximate query processing (S-AQP) community has not validated whether these techniques actually generate accurate error bars for real query workloads. In fact, we find that error bar estimation often fails on real-world production workloads. Fortunately, it is possible to quickly and accurately diagnose the failure of error estimation for a query. In this paper, we show that it is possible to implement a query approximation pipeline that produces approximate answers and reliable error bars at interactive speeds.

183 citations
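The "error bar" idea is easy to illustrate with a uniform sample and a bootstrap confidence interval, the kind of estimate whose reliability the paper diagnoses. A minimal sketch, not the paper's pipeline:

```python
import random
import statistics

def sample_mean_with_error(data, sample_size, n_boot=1000, seed=0):
    """Approximate the mean of `data` from a uniform sample and attach a
    bootstrap 95% confidence interval, the kind of "error bar" S-AQP
    systems report. A minimal sketch, not the paper's diagnostic pipeline
    (which detects when exactly this kind of estimate goes wrong).
    """
    rng = random.Random(seed)
    sample = rng.sample(data, sample_size)
    estimate = statistics.mean(sample)
    boot_means = sorted(
        statistics.mean(rng.choices(sample, k=sample_size))
        for _ in range(n_boot)
    )
    lo, hi = boot_means[int(0.025 * n_boot)], boot_means[int(0.975 * n_boot)]
    return estimate, (lo, hi)

data = [random.gauss(100, 15) for _ in range(100_000)]
print(sample_mean_with_error(data, sample_size=1_000))
```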


Book
Hang Li, Jun Xu
20 Jun 2014
TL;DR: This survey gives a systematic and detailed introduction to newly developed machine learning technologies for query document matching (semantic matching) in search, particularly web search, and focuses on the fundamental problems, as well as the state-of-the-art solutions.
Abstract: Relevance is the most important factor to assure users' satisfaction in search, and the success of a search engine heavily depends on its performance on relevance. It has been observed that most of the dissatisfaction cases in relevance are due to term mismatch between queries and documents (e.g., the query "NY times" does not match well with a document only containing "New York Times"), because term matching, i.e., the bag-of-words approach, still functions as the main mechanism of modern search engines. It is no exaggeration to say, therefore, that mismatch between query and document poses the most critical challenge in search. Ideally, one would like to see query and document match with each other if they are topically relevant. Recently, researchers have expended significant effort to address the problem. The major approach is to conduct semantic matching, i.e., to perform more query and document understanding to represent their meanings, and to perform better matching between the enriched query and document representations. With the availability of large amounts of log data and advanced machine learning techniques, this becomes more feasible, and significant progress has been made recently. This survey gives a systematic and detailed introduction to newly developed machine learning technologies for query-document matching (semantic matching) in search, particularly web search. It focuses on the fundamental problems, as well as the state-of-the-art solutions, of query-document matching on the form, phrase, word-sense, topic, and structure aspects. The ideas and solutions explained may motivate industrial practitioners to turn the research results into products. The methods introduced and the discussions made may also stimulate academic researchers to find new research directions and approaches. Matching between query and document is not limited to search; similar problems can be found in question answering, online advertising, cross-language information retrieval, machine translation, recommender systems, link prediction, image annotation, drug design, and other applications, as the general task of matching between objects from two different spaces. The technologies introduced can be generalized into more general machine learning techniques, which is referred to as learning to match in this survey.

179 citations
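The "NY times" example can be made concrete: bag-of-words overlap scores the pair poorly, while a semantic matcher comparing learned representations can score it highly. The embedding vectors below are invented purely for illustration:

```python
import math

def term_overlap(query, doc):
    """Fraction of query terms literally present in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

query, doc = "NY times", "New York Times"
print(term_overlap(query, doc))  # 0.5: only "times" matches literally

# Hypothetical learned representations placing the two phrases nearby:
q_vec, d_vec = [0.90, 0.10, 0.40], [0.85, 0.15, 0.42]
print(cosine(q_vec, d_vec))      # close to 1.0: a semantic match
```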


Journal ArticleDOI
01 Jun 2014
TL;DR: A novel two-phase search algorithm is proposed that carefully selects a set of expansion centers from the query trajectory and exploits upper and lower bounds to prune the search space in the spatial and temporal domains.
Abstract: With the increasing availability of moving-object tracking data, trajectory search and matching is increasingly important. We propose and investigate a novel problem called personalized trajectory matching (PTM). In contrast to conventional trajectory similarity search by spatial distance only, PTM takes into account the significance of each sample point in a query trajectory. A PTM query takes a trajectory with user-specified weights for each sample point in the trajectory as its argument. It returns the trajectory in an argument data set with the highest similarity to the query trajectory. We believe that this type of query may bring significant benefits to users in many popular applications such as route planning, carpooling, friend recommendation, traffic analysis, urban computing, and location-based services in general. PTM query processing faces two challenges: how to prune the search space during the query processing and how to schedule multiple so-called expansion centers effectively. To address these challenges, a novel two-phase search algorithm is proposed that carefully selects a set of expansion centers from the query trajectory and exploits upper and lower bounds to prune the search space in the spatial and temporal domains. An efficiency study reveals that the algorithm explores the minimum search space in both domains. Second, a heuristic search strategy based on priority ranking is developed to schedule the multiple expansion centers, which can further prune the search space and enhance the query efficiency. The performance of the PTM query is studied in extensive experiments based on real and synthetic trajectory data sets.

155 citations
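A minimal rendition of the weighted-matching idea: each query sample point contributes to the similarity score in proportion to its user-assigned weight. The nearest-point formulation below is chosen for brevity and is not the paper's exact similarity measure or its two-phase search algorithm.

```python
import math

def ptm_similarity(query_traj, weights, candidate_traj):
    """Weighted similarity between a query trajectory and a candidate:
    each query sample point contributes according to its user-specified
    weight, with closer matches contributing more. An illustrative
    formulation only; the paper defines its own similarity and a
    two-phase search with spatial and temporal pruning.
    """
    score = 0.0
    for point, w in zip(query_traj, weights):
        nearest = min(math.dist(point, c) for c in candidate_traj)
        score += w / (1.0 + nearest)
    return score

query = [(0, 0), (1, 1), (2, 2)]
weights = [0.2, 0.2, 0.6]  # the user cares most about the last sample point
candidates = {"A": [(0, 0), (1, 1), (5, 5)], "B": [(3, 3), (2, 2), (1, 2)]}
best = max(candidates, key=lambda k: ptm_similarity(query, weights, candidates[k]))
print(best)  # "B": it matches the heavily weighted point (2, 2) exactly
```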


Journal ArticleDOI
TL;DR: This paper proposes a novel method for object detection based on structural feature description and query expansion that is evaluated on high-resolution satellite images and demonstrates its clear advantages over several other object detection methods.
Abstract: Object detection is an important task in very high-resolution remote sensing image analysis. Traditional detection approaches are often not sufficiently robust in dealing with the variations of targets and sometimes suffer from limited training samples. In this paper, we tackle these two problems by proposing a novel method for object detection based on structural feature description and query expansion. The feature description combines both local and global information of objects. After initial feature extraction from a query image and representative samples, these descriptors are updated through an augmentation process to better describe the object of interest. The object detection step is implemented using a ranking support vector machine (SVM), which converts the detection task to a ranking query task. The ranking SVM is first trained on a small subset of training data with samples automatically ranked based on similarities to the query image. Then, a novel query expansion method is introduced to update the initial object model by active learning with human inputs on ranking of image pairs. Once the query expansion process is completed, which is determined by measuring entropy changes, the model is then applied to the whole target data set in which objects in different classes shall be detected. We evaluate the proposed method on high-resolution satellite images and demonstrate its clear advantages over several other object detection methods.

118 citations


Proceedings ArticleDOI
03 Jul 2014
TL;DR: The feasibility of exploiting the context to learn user reformulation behavior for boosting prediction performance is investigated and a supervised approach to query auto-completion is proposed, where three kinds of reformulation-related features are considered, including term-level, query-level and session-level features.
Abstract: It is crucial for query auto-completion to accurately predict what a user is typing. Given a query prefix and its context (e.g., previous queries), conventional context-aware approaches often produce queries relevant to the context. The purpose of this paper is to investigate the feasibility of exploiting the context to learn user reformulation behavior for boosting prediction performance. We first conduct an in-depth analysis of how users reformulate their queries. Based on the analysis, we propose a supervised approach to query auto-completion, where three kinds of reformulation-related features are considered, including term-level, query-level, and session-level features. These features carefully capture how users change preceding queries along query sessions. Extensive experiments have been conducted on the large-scale query log of a commercial search engine. The experimental results demonstrate a significant improvement over four competitive baselines.

107 citations
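A sketch of what term-level reformulation features might look like; the feature names and definitions here are assumptions of mine, as the paper's feature set (and its query- and session-level counterparts) is richer:

```python
def reformulation_features(prev_query, candidate):
    """Term-level reformulation signals between the preceding query and a
    completion candidate. A toy rendition in the spirit of the paper's
    term-level features, not its actual feature set.
    """
    prev = prev_query.lower().split()
    cand = candidate.lower().split()
    kept = [t for t in cand if t in prev]
    added = [t for t in cand if t not in prev]
    union = set(prev) | set(cand)
    return {
        "n_terms_kept": len(kept),
        "n_terms_added": len(added),
        "term_overlap_jaccard": len(set(prev) & set(cand)) / len(union),
    }

# A user extends "new york hotels" while typing; a candidate completion
# that keeps the previous terms and adds one is a plausible reformulation.
print(reformulation_features("new york hotels", "new york cheap hotels"))
```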


Journal ArticleDOI
01 Jun 2014
TL;DR: LegoBase is presented, a query engine written in the high-level programming language Scala that significantly outperforms a commercial in-memory database system as well as an existing query compiler and the compilation overhead is low compared to the overall execution time, making the approach usable in practice for efficiently compiling query engines.
Abstract: In this paper we advocate that it is time for a radical rethinking of database systems design. Developers should be able to leverage high-level programming languages without having to pay a price in efficiency. To realize our vision of abstraction without regret, we present LegoBase, a query engine written in the high-level programming language Scala. The key technique to regain efficiency is to apply generative programming: the Scala code that constitutes the query engine, despite its high-level appearance, is actually a program generator that emits specialized, low-level C code. We show how the combination of high-level and generative programming makes it easy to implement a wide spectrum of optimizations that are difficult to achieve with existing low-level query compilers, and how it can continuously optimize the query engine. We evaluate our approach with the TPC-H benchmark and show that: (a) with all optimizations enabled, our architecture significantly outperforms a commercial in-memory database system as well as an existing query compiler, (b) these performance improvements require programming just a few hundred lines of high-level code instead of the complicated low-level code required by existing query compilers, and, finally, (c) the compilation overhead is low compared to the overall execution time, thus making our approach usable in practice for efficiently compiling query engines.

94 citations
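The generative-programming idea can be miniaturized: the "engine" is a program that emits a specialized low-level loop rather than interpreting a query plan. LegoBase does this in Scala with far more machinery; this Python stand-in only illustrates the principle, and the operator and predicate below are invented for the example.

```python
def compile_filter_scan(column, predicate_c_expr):
    """Emit specialized low-level C for a filter-scan over an int column.

    Illustrates generative programming: instead of interpreting a plan at
    run time, the "engine" prints a tight loop with the predicate inlined.
    A toy stand-in for LegoBase's Scala-to-C generator.
    """
    return f"""
#include <stdio.h>

long filter_scan(const int *{column}, long n) {{
    long matches = 0;
    for (long i = 0; i < n; i++) {{
        if ({predicate_c_expr}) matches++;   /* inlined predicate */
    }}
    return matches;
}}
"""

# Specialize the scan for `price[i] > 100` at "query compile time"; the
# emitted C carries no interpretation overhead.
print(compile_filter_scan("price", "price[i] > 100"))
```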


Proceedings ArticleDOI
04 May 2014
TL;DR: Proposes an autoencoder-based method for the unsupervised identification of subword units and shows that the encoded representation of speech produced by standard autoencoders is more effective than Gaussian posteriorgrams in a spoken query classification task.
Abstract: In this paper we propose an autoencoder-based method for the unsupervised identification of subword units. We experiment with different types and architectures of autoencoders to assess which autoencoder properties are most important for this task. We first show that the encoded representation of speech produced by standard autoencoders is more effective than Gaussian posteriorgrams in a spoken query classification task. Finally, we evaluate the subword inventories produced by the proposed method both in terms of classification accuracy in a word classification task (with lexicon size up to 263 words) and in terms of consistency between subword transcriptions of different word examples of the same word type. The evaluation is carried out on Italian and American English datasets.

90 citations


Patent
30 Sep 2014
TL;DR: Describes a system that receives a first search query from a user, identifies related media assets, and determines whether a media asset from those assets is related to a subsequent second search query.
Abstract: Systems and methods for searching for a media asset are described. In some aspects, the system includes control circuitry that receives a first search query from a user. The control circuitry identifies media assets related to the first search query from a content database. The control circuitry receives a second search query following the first search query. The control circuitry determines whether a media asset from the media assets is related to the second search query. In response to determining that less than a threshold number of media assets from the media assets are related to the second search query, the control circuitry transmits an instruction requesting the user to repeat the second search query. The control circuitry receives a third search query related to the first search query. The control circuitry determines a media asset from the media assets that is related to the third search query.

Journal ArticleDOI
01 Aug 2014
TL;DR: Presents RAW, a prototype query engine that enables querying heterogeneous data sources transparently and employs Just-In-Time access paths, which efficiently couple heterogeneous raw files to the query engine and reduce the overheads of traditional general-purpose scan operators.
Abstract: Database systems deliver impressive performance for large classes of workloads as the result of decades of research into optimizing database engines. High performance, however, is achieved at the cost of versatility. In particular, database systems only operate efficiently over loaded data, i.e., data converted from its original raw format into the system's internal data format. At the same time, data volume continues to increase exponentially and data varies increasingly, with an escalating number of new formats. The consequence is a growing impedance mismatch between the original structures holding the data in the raw files and the structures used by query engines for efficient processing. In an ideal scenario, the query engine would seamlessly adapt itself to the data and ensure efficient query processing regardless of the input data formats, optimizing itself to each instance of a file and of a query by leveraging information available at query time. Today's systems, however, force data to adapt to the query engine during data loading. This paper proposes adapting the query engine to the formats of raw data. It presents RAW, a prototype query engine which enables querying heterogeneous data sources transparently. RAW employs Just-In-Time access paths, which efficiently couple heterogeneous raw files to the query engine and reduce the overheads of traditional general-purpose scan operators. There are, however, inherent overheads with accessing raw data directly that cannot be eliminated, such as converting the raw values. Therefore, RAW also uses column shreds, ensuring that we pay these costs only for the subsets of raw data strictly needed by a query. We use RAW in a real-world scenario and achieve a two-orders-of-magnitude speedup against the existing hand-written solution.
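The column-shreds idea (paying raw-to-binary conversion costs only for fields a query actually touches) can be sketched over a CSV file. A toy illustration of the principle, not RAW's actual Just-In-Time access paths:

```python
import csv
import io

# Stand-in for a raw file that was never loaded into the database.
RAW_FILE = io.StringIO("id,name,price\n1,apple,3\n2,banana,1\n3,cherry,4\n")

def project_columns(raw_file, wanted):
    """Convert only the columns a query actually needs ("column shreds"),
    leaving the rest of each raw row untouched and unconverted.
    """
    reader = csv.reader(raw_file)
    header = next(reader)
    idx = [header.index(c) for c in wanted]
    for row in reader:
        # Only these fields pay the raw-value conversion cost.
        yield tuple(int(row[i]) if header[i] != "name" else row[i]
                    for i in idx)

print(list(project_columns(RAW_FILE, ["price"])))  # -> [(3,), (1,), (4,)]
```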

Patent
27 Jan 2014
TL;DR: A set of attributes is derived from an element of a first digital document, the element being identified from eye-tracking data of a user viewing the document; a search query comprising at least one query term is then received, and matching documents are ranked by an attribute score.
Abstract: In one exemplary embodiment, a set of attributes derived from an element of a first digital document is obtained. The element is identified from eye-tracking data of a user viewing the digital document. A search query of a database comprising at least one query term is received. A set of documents in the database is identified according to the search query. An attribute score is determined for each document. The set of documents is sorted according to the attribute score. Optionally, a commonality between the query term and at least one member of the set of attributes may be determined. The search query may be generated by the user. The database may be a hypermedia database.

Journal ArticleDOI
TL;DR: Presents a query rewriting algorithm for rather general types of ontological constraints, showing how a conjunctive query against a knowledge base expressed using linear and sticky existential rules, that is, members of the recently introduced Datalog± family of ontology languages, can be compiled into a union of conjunctive queries (UCQ).
Abstract: Ontological queries are evaluated against a knowledge base consisting of an extensional database and an ontology (i.e., a set of logical assertions and constraints that derive new intensional knowledge from the extensional database), rather than directly on the extensional database. The evaluation and optimization of such queries is an intriguing new problem for database research. In this article, we discuss two important aspects of this problem: query rewriting and query optimization. Query rewriting consists of the compilation of an ontological query into an equivalent first-order query against the underlying extensional database. We present a novel query rewriting algorithm for rather general types of ontological constraints that is well suited for practical implementations. In particular, we show how a conjunctive query against a knowledge base, expressed using linear and sticky existential rules, that is, members of the recently introduced Datalog± family of ontology languages, can be compiled into a union of conjunctive queries (UCQ) against the underlying database. Ontological query optimization, in this context, attempts to improve this rewriting process so as to produce possibly small and cost-effective UCQ rewritings for an input query.

Journal ArticleDOI
TL;DR: The RASP data perturbation method combines order preserving encryption, dimensionality expansion, random noise injection, and random projection, to provide strong resilience to attacks on the perturbed data and queries.
Abstract: With the wide deployment of public cloud computing infrastructures, using clouds to host data query services has become an appealing solution for its advantages in scalability and cost-saving. However, some data might be so sensitive that the data owner does not want to move it to the cloud unless data confidentiality and query privacy are guaranteed. On the other hand, a secured query service should still provide efficient query processing and significantly reduce the in-house workload to fully realize the benefits of cloud computing. We propose the random space perturbation (RASP) data perturbation method to provide secure and efficient range query and kNN query services for protected data in the cloud. The RASP data perturbation method combines order-preserving encryption, dimensionality expansion, random noise injection, and random projection to provide strong resilience to attacks on the perturbed data and queries. It also preserves multidimensional ranges, which allows existing indexing techniques to be applied to speed up range query processing. The kNN-R algorithm is designed to work with the RASP range query algorithm to process the kNN queries. We have carefully analyzed the attacks on data and queries under a precisely defined threat model and realistic security assumptions. Extensive experiments have been conducted to show the advantages of this approach on efficiency and security.
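Three of the four RASP ingredients (dimensionality expansion, noise injection, random projection) can be sketched in a few lines; the order-preserving encryption layer is omitted, no security property is claimed, and the parameter choices are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

def perturb(points, secret_matrix, noise_scale=0.01):
    """Toy perturbation: append a random extra dimension (dimensionality
    expansion), inject small random noise, then apply a secret invertible
    random projection. RASP additionally layers order-preserving
    encryption and comes with a careful attack analysis; this sketch
    makes no security claims.
    """
    n, _ = points.shape
    expanded = np.hstack([points, rng.random((n, 1))])
    noisy = expanded + rng.normal(0.0, noise_scale, expanded.shape)
    return noisy @ secret_matrix.T

d = 2
A = rng.random((d + 1, d + 1)) + np.eye(d + 1)  # almost surely invertible
data = rng.random((5, d))
print(perturb(data, A))
```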

Proceedings ArticleDOI
07 Apr 2014
TL;DR: Proposes several practical completion suggestion ranking approaches, including a sliding window of query popularity evidence from the past 2-28 days, the query popularity distribution in the last N queries observed with a given prefix, and short-range query popularity prediction based on recently observed trends.
Abstract: Query auto-completion (QAC) is a common interactive feature that assists users in formulating queries by providing completion suggestions as they type. In order for QAC to minimise the user's cognitive and physical effort, it must: (i) suggest the user's intended query after minimal input keystrokes, and (ii) rank the user's intended query highly in completion suggestions. Typically, QAC approaches rank completion suggestions by their past popularity. Accordingly, QAC is usually very effective for previously seen and consistently popular queries. Users are increasingly turning to search engines to find out about unpredictable emerging and ongoing events and phenomena, often using previously unseen or unpopular queries. Consequently, QAC must be both robust and time-sensitive -- that is, able to sufficiently rank both consistently and recently popular queries in completion suggestions. To address this trade-off, we propose several practical completion suggestion ranking approaches, including: (i) a sliding window of query popularity evidence from the past 2-28 days, (ii) the query popularity distribution in the last N queries observed with a given prefix, and (iii) short-range query popularity prediction based on recently observed trends. Using real-time simulation experiments, we extensively investigated the parameters necessary to maximise QAC effectiveness for three openly available query log datasets with prefixes of 2-5 characters: MSN and AOL (both English), and Sogou 2008 (Chinese). Optimal parameters vary for each query log, capturing the differing temporal dynamics and querying distributions. Results demonstrate consistent and language-independent improvements of up to 9.2% over a non-temporal QAC baseline for all query logs with prefix lengths of 2-3 characters. This work is an important step towards more effective QAC approaches.
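Approach (i), the sliding popularity window, reduces to counting prefix-matching queries over recent days. A toy rendition with an invented log format:

```python
from collections import Counter

def rank_completions(query_log, prefix, window_days, now_day, top_n=5):
    """Rank completion suggestions for `prefix` by popularity within a
    sliding window of recent days (approach (i) from the paper, in toy
    form). `query_log` is a list of (day, query) pairs, a format invented
    for this example.
    """
    counts = Counter(
        q for day, q in query_log
        if now_day - window_days < day <= now_day and q.startswith(prefix)
    )
    return [q for q, _ in counts.most_common(top_n)]

log = [(1, "world cup"), (1, "world news"), (27, "world cup final"),
       (28, "world cup final"), (28, "world cup final"), (28, "world news")]
print(rank_completions(log, "world", window_days=2, now_day=28))
# A short recent window favors the emerging query "world cup final".
```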

Proceedings Article
01 Jan 2014
TL;DR: This paper introduces a new join ordering algorithm that performs a SPARQL-tailored query simplification and presents a novel RDF statistical synopsis that accurately estimates cardinalities in large SPARQL queries.
Abstract: The join ordering problem is a fundamental challenge that has to be solved by any query optimizer. Since high-performance RDF systems are often implemented as triple stores (i.e., they represent RDF data as a single table with three attributes, at least conceptually), the query optimization strategies employed by such systems are often adopted from relational query optimization. In this paper we show that the techniques borrowed from traditional SQL query optimization (such as the Dynamic Programming algorithm or greedy heuristics) are not immediately capable of handling large SPARQL queries. We introduce a new join ordering algorithm that performs a SPARQL-tailored query simplification. Furthermore, we present a novel RDF statistical synopsis that accurately estimates cardinalities in large SPARQL queries. Our experiments show that this algorithm is highly superior to the state-of-the-art SPARQL optimization approaches, including RDF-3X's original Dynamic Programming strategy.

Proceedings ArticleDOI
03 Jul 2014
TL;DR: Describes an advanced search engine that supports users in querying documents by means of keywords, entities, and categories: typed words are automatically mapped onto appropriate suggestions for entities and categories, and, based on named-entity disambiguation, the engine returns documents containing the query's entities and prominent entities from the query's categories.
Abstract: This paper describes an advanced search engine that supports users in querying documents by means of keywords, entities, and categories. Users simply type words, which are automatically mapped onto appropriate suggestions for entities and categories. Based on named-entity disambiguation, the search engine returns documents containing the query's entities and prominent entities from the query's categories.

Proceedings ArticleDOI
03 Jul 2014
TL;DR: This paper presents the first large-scale study of user interactions with auto-completion based on query logs of Bing, a commercial search engine, and confirms that lower-ranked auto-completion suggestions receive substantially lower engagement than those ranked higher.
Abstract: Query Auto-Completion (QAC) is a popular feature of web search engines that aims to assist users to formulate queries faster and avoid spelling mistakes by presenting them with possible completions as soon as they start typing. However, despite the wide adoption of auto-completion in search systems, there is little published on how users interact with such services. In this paper, we present the first large-scale study of user interactions with auto-completion based on query logs of Bing, a commercial search engine. Our results confirm that lower-ranked auto-completion suggestions receive substantially lower engagement than those ranked higher. We also observe that users are most likely to engage with auto-completion after typing about half of the query, and in particular at word boundaries. Interestingly, we also noticed that the likelihood of using auto-completion varies with the distance of query characters on the keyboard. Overall, we believe that the results reported in our study provide valuable insights for understanding user engagement with auto-completion, and are likely to inform the design of more effective QAC systems.

Patent
Alyssa Glass, Anlei Dong, Ted Eiche
30 Apr 2014
TL;DR: A plurality of query suggestions is provided in a ranking to a user, and a quality measure of the query suggestions is calculated based on a detected user activity and the position in the ranking of the suggestion with which the user interacted.
Abstract: Methods, systems and programming for evaluating query suggestions quality. In one example, a plurality of query suggestions are provided in a ranking to a user. A user activity with respect to one of the plurality of query suggestions is detected. A position of the one of the plurality of query suggestions in the ranking is determined. A quality measure of the plurality of query suggestions is calculated based, at least in part, on the user activity and the position of the one of the plurality of query suggestions.

Proceedings ArticleDOI
24 Aug 2014
TL;DR: A probabilistic method for identifying and labeling search tasks based on the following intuitive observations: queries that are issued temporally close by users in many sequences of queries are likely to belong to the same search task, meanwhile, different users having the same information needs tend to submit topically coherent search queries.
Abstract: We consider a search task as a set of queries that serve the same user information need. Analyzing search tasks from user query streams plays an important role in building a set of modern tools to improve search engine performance. In this paper, we propose a probabilistic method for identifying and labeling search tasks based on the following intuitive observations: queries that are issued temporally close by users in many sequences of queries are likely to belong to the same search task, meanwhile, different users having the same information needs tend to submit topically coherent search queries. To capture the above intuitions, we directly model query temporal patterns using a special class of point processes called Hawkes processes, and combine topic models with Hawkes processes for simultaneously identifying and labeling search tasks. Essentially, Hawkes processes utilize their self-exciting properties to identify search tasks if influence exists among a sequence of queries for individual users, while the topic model exploits query co-occurrence across different users to discover the latent information needed for labeling search tasks. More importantly, there is mutual reinforcement between Hawkes processes and the topic model in the unified model that enhances the performance of both. We evaluate our method based on both synthetic data and real-world query log data. In addition, we also apply our model to query clustering and search task identification. By comparing with state-of-the-art methods, the results demonstrate that the improvement in our proposed approach is consistent and promising.
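The self-exciting property the paper exploits is visible in the Hawkes intensity itself: each past query temporarily raises the arrival rate of the next one. A minimal sketch with an exponential kernel and arbitrary parameters (the paper couples this process with a topic model):

```python
import math

def hawkes_intensity(t, event_times, mu=0.1, alpha=0.8, beta=1.0):
    """Self-exciting Hawkes intensity:
        lambda(t) = mu + sum over t_i < t of alpha * exp(-beta * (t - t_i))
    Queries arriving shortly after earlier queries raise the intensity,
    which is the property used to group temporally close queries into the
    same search task. Parameters here are arbitrary illustrations.
    """
    return mu + sum(alpha * math.exp(-beta * (t - ti))
                    for ti in event_times if ti < t)

queries_at = [0.0, 0.2, 0.5]               # a burst of queries
print(hawkes_intensity(0.6, queries_at))   # high: likely the same task
print(hawkes_intensity(9.0, queries_at))   # decayed back toward mu
```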

Proceedings Article
27 Jul 2014
TL;DR: A novel method for learning, in a supervised way, semantic representations for words and phrases by embedding queries and documents in special matrices, which provides greater representational power than existing approaches that adopt a vector representation.
Abstract: In web search, users' queries are formulated using only a few terms, and term-matching retrieval functions can fail to retrieve relevant documents. Given a user query, the technique of query expansion (QE) consists in selecting related terms that can enhance the likelihood of retrieving relevant documents. Selecting such expansion terms is challenging and requires a computational framework capable of encoding complex semantic relationships. In this paper, we propose a novel method for learning, in a supervised way, semantic representations for words and phrases. By embedding queries and documents in special matrices, our model has greater representational power than existing approaches that adopt a vector representation. We show that our model produces high-quality query expansion terms. Our expansions improve IR measures beyond expansions from current word-embedding models and well-established traditional QE methods.

Patent
04 Apr 2014
TL;DR: A search query input including one or more n-grams is received from a first user of an online social network, a number of query commands are generated based on the query input, and one or more verticals are searched to identify objects stored by each vertical that match the query commands.
Abstract: In one embodiment, a method includes receiving from a first user of an online social network a search query input including one or more n-grams; generating a number of query commands based on the search query input; and searching one or more verticals to identify one or more objects stored by the vertical that match the query commands. Each vertical stores one or more objects associated with the online social network. The method also includes generating a number of search-result modules. Each search-result module corresponds to a query command of the number of query commands. Each search-result module includes references to one or more of the identified objects matching the query command corresponding to the search-result module. The method also includes scoring the search-result modules; and sending each search-result module having a score greater than a threshold score to the first user for display.

Journal ArticleDOI
TL;DR: The aim of this study is to reduce the number of features to be used to improve runtime and accuracy of the classification of web pages, and it is shown that using the ACO for feature selection improves both accuracy and runtime performance of classification.
Abstract: The increased popularity of the web has caused the inclusion of a huge amount of information on the web, and as a result of this explosive information growth, automated web page classification systems are needed to improve search engines' performance. Web pages have a large number of features such as HTML/XML tags, URLs, hyperlinks, and text contents that should be considered during an automated classification process. The aim of this study is to reduce the number of features to be used to improve the runtime and accuracy of the classification of web pages. In this study, we used an ant colony optimization (ACO) algorithm to select the best features, and then we applied the well-known C4.5, naive Bayes, and k-nearest neighbor classifiers to assign class labels to web pages. We used the WebKB and Conference datasets in our experiments, and we showed that using the ACO for feature selection improves both the accuracy and runtime performance of classification. We also showed that the proposed ACO-based algorithm can select better features with respect to the well-known information gain and chi-square feature selection methods.
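A simplified ant-colony feature-selection loop, to make the mechanism concrete: ants sample feature subsets with pheromone-proportional probability, the best subset found deposits pheromone, and pheromone evaporates. This is a generic sketch, not the paper's exact ACO variant or parameter settings; the scoring function is supplied by the caller.

```python
import random

def aco_feature_select(features, score, n_ants=10, n_iters=20,
                       subset_size=5, evaporation=0.1, seed=0):
    """Simplified ant-colony feature selection. `score` maps a feature
    subset to a quality value (e.g., classifier accuracy); in the paper
    this would come from evaluating C4.5, naive Bayes, or kNN.
    """
    rng = random.Random(seed)
    pheromone = {f: 1.0 for f in features}
    best, best_score = None, float("-inf")
    for _ in range(n_iters):
        for _ in range(n_ants):
            weights = [pheromone[f] for f in features]
            subset = set()
            while len(subset) < subset_size:      # pheromone-biased sampling
                subset.add(rng.choices(features, weights=weights)[0])
            s = score(subset)
            if s > best_score:
                best, best_score = subset, s
        for f in features:                        # evaporate, then reinforce
            pheromone[f] *= (1.0 - evaporation)
            if f in best:
                pheromone[f] += best_score
    return best, best_score

# Toy scorer: features 'f1' and 'f3' are the truly useful ones.
feats = [f"f{i}" for i in range(10)]
useful = {"f1", "f3"}
print(aco_feature_select(feats, lambda s: len(s & useful), subset_size=3))
```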

Patent
14 Mar 2014
TL;DR: In this paper, a system for obtaining and presenting search results in a language that differs from the language in which a query is received is presented, with the search results being based on the search query and associated with the at least one second language (or dialect).
Abstract: Systems, methods, and computer-readable storage media are provided for obtaining and presenting search results in a language that differs from the language in which a query is received. Upon receipt of a search query in a first language, at least one second language (or dialect) to which the search query is directed is determined and one or more search results are retrieved, the search results being based on the search query and associated with the at least one second language (or dialect). Further, embodiments of the present invention relate to generating advertisements including embedded links to landing pages that have been translated into one or more languages (or dialects) associated with a target market. In this way, advertisers are able to more successfully advertise to individuals whose primary language or dialect differs from that of the website and/or the advertiser.

Patent
30 Apr 2014
TL;DR: A structured query comprising references to one or more selected objects is received, search results corresponding to the structured query are generated, wherein each search result corresponds to a particular object accessible by the computing device, and the results are scored based on one or more determined search intents.
Abstract: In one embodiment, a method includes receiving, from a client system of a first user, a structured query comprising references to one or more selected objects accessible by the computing device, generating one or more search results corresponding to the structured query, wherein each search result corresponds to a particular object accessible by the computing device, determining one or more search intents based at least on whether one or more of the selected objects referenced in the structured query match objects corresponding to a search intent indexed in a pattern-detection model, and scoring the search results based on one or more of the search intents.

Patent
27 Aug 2014
TL;DR: A search query is received from a first user and one or more second nodes that match the search query are identified; search intents are then determined based on topics associated with the identified nodes and the node types of the identified nodes, and search results are generated accordingly.
Abstract: In one embodiment, a method includes receiving a search query from a first user and identifying one or more second nodes that match the search query. The method includes determining one or more search intents of the search query. Search intent may be based on one or more topics associated with the identified nodes and one or more node-types of the identified nodes. The method includes generating one or more search results corresponding to the search query, the search-results being generated based on the determined search intents. The method includes sending a search-results page to the client system of the first user for display. The search-results page may include one or more of the generated search results.

Proceedings ArticleDOI
01 Mar 2014
TL;DR: This paper defines a set of candidate points called happy points for the k-regret query, a recently proposed query that integrates the merits of top-k and skyline queries, and proposes two efficient algorithms, each of which performs more efficiently than the best-known fastest algorithm.
Abstract: Returning tuples that users may be interested in is one of the most important goals of multi-criteria decision making. Top-k queries and skyline queries are two representative queries. A top-k query has the merit of returning a limited number of tuples to users but requires users to give their exact utility functions. A skyline query has the merit that users do not need to give their exact utility functions, but it has no control over the number of tuples to be returned. In this paper, we study the k-regret query, a recently proposed query, which integrates the merits of the two representative queries. We first identify some interesting geometric properties of the k-regret query. Based on these properties, we define a set of candidate points called happy points for the k-regret query, which has not been studied in the literature. This result is very fundamental and beneficial not only to all existing algorithms but also to all new algorithms to be developed for the k-regret query. Since the number of happy points is found to be very small, the efficiency of all existing algorithms can be improved significantly. Furthermore, based on other geometric properties, we propose two efficient algorithms, each of which performs more efficiently than the best-known fastest algorithm. Our experimental results show that our proposed algorithms run faster than the best-known method on both synthetic and real datasets. In particular, in our experiments on real datasets, the best-known method took more than 3 hours to answer a k-regret query, while one of our proposed methods took only a few minutes and the other finished within a second.
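The quantity a k-regret query minimizes can be computed directly. The sketch below evaluates the maximum regret ratio of a candidate subset over a sample of linear utility functions; sampling utilities is itself an approximation of the exact measure.

```python
def max_utility(points, u):
    """Best achievable linear utility u over a set of points."""
    return max(sum(a * b for a, b in zip(p, u)) for p in points)

def regret_ratio(db, subset, utilities):
    """Maximum regret ratio of `subset`: the largest relative utility loss
    a user could suffer by choosing only from the subset rather than the
    whole database, over the sampled utility functions. A k-regret query
    seeks a size-k subset minimizing this quantity.
    """
    worst = 0.0
    for u in utilities:
        full, sub = max_utility(db, u), max_utility(subset, u)
        worst = max(worst, (full - sub) / full)
    return worst

db = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7), (0.4, 0.4)]
subset = [(1.0, 0.0), (0.7, 0.7)]
utils = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]
print(regret_ratio(db, subset, utils))  # 0.3, from utility (0.0, 1.0)
```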

Proceedings ArticleDOI
03 Jul 2014
TL;DR: ReQ-ReC (ReQuery-ReClassify), a double-loop retrieval system that combines iterative expansion of a query set with iterative refinements of a classifier, significantly outperforms previous relevance feedback methods that rely on a single ranking function to balance precision and recall.
Abstract: We consider a scenario where a searcher requires both high precision and high recall from an interactive retrieval process. Such scenarios are very common in real life, exemplified by medical search, legal search, market research, and literature review. When access to the entire data set is available, an active learning loop could be used to ask for additional relevance feedback labels in order to refine a classifier. When data is accessed via search services, however, only limited subsets of the corpus can be considered, subsets defined by queries. In that setting, relevance feedback has been used in a query enhancement loop that updates a query. We describe and demonstrate the effectiveness of ReQ-ReC (ReQuery-ReClassify), a double-loop retrieval system that combines iterative expansion of a query set with iterative refinements of a classifier. This permits a separation of concerns, where the query selector's job is to enhance recall while the classifier's job is to maximize precision on the items that have been retrieved by any of the queries so far. The overall process alternates between the query enhancement loop, to increase recall, and the classifier refinement loop, to increase precision. The separation allows the query enhancement process to explore larger parts of the query space. Our experiments show that this distribution of work significantly outperforms previous relevance feedback methods that rely on a single ranking function to balance precision and recall.
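The double-loop control flow can be written down compactly. All four callables below are hypothetical stand-ins for the paper's components (a search service, a relevance labeler, a classifier trainer, and a query expander); the skeleton mirrors the alternation the paper describes, not its exact procedures.

```python
def req_rec(search, label, train, expand, seed_query, rounds=3):
    """Skeleton of the ReQ-ReC double loop: an outer loop grows recall by
    issuing more queries, an inner loop refines a classifier for precision
    over everything retrieved so far.
    """
    queries, retrieved, labels = [seed_query], set(), {}
    classifier = lambda d: True
    for _ in range(rounds):
        for q in queries:                       # recall loop: re-query
            retrieved |= set(search(q))
        for d in sorted(retrieved):             # gather feedback labels
            labels[d] = label(d)
        classifier = train(labels)              # precision loop: re-classify
        queries = expand(queries, labels)       # enlarge the query set
    return {d for d in retrieved if classifier(d)}

# Toy corpus: documents are strings; relevant ones mention "jazz".
corpus = ["jazz club", "jazz history", "rock club", "free jazz", "pop"]
search = lambda q: [d for d in corpus if q in d]
label = lambda d: "jazz" in d
train = lambda labels: (lambda d: labels.get(d, False))
expand = lambda qs, labels: qs + ["club"]       # naive expansion step
print(req_rec(search, label, train, expand, seed_query="jazz"))
```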

Journal ArticleDOI
01 May 2014
TL;DR: The length-constrained maximum-sum region (LCMSR) query is proposed, which returns a spatial-network region that is located within a general region of interest, that does not exceed a given size constraint, and that best matches the query keywords.
Abstract: We consider an application scenario where points of interest (PoIs) each have a web presence and where a web user wants to identify a region that contains PoIs relevant to a set of keywords, e.g., in preparation for deciding where to go to conveniently explore the PoIs. Motivated by this, we propose the length-constrained maximum-sum region (LCMSR) query that returns a spatial-network region that is located within a general region of interest, that does not exceed a given size constraint, and that best matches the query keywords. Such a query maximizes the total weight of the PoIs in it w.r.t. the query keywords. We show that it is NP-hard to answer this query. We develop an approximation algorithm with a (5 + ε) approximation ratio utilizing a technique that scales node weights into integers. We also propose a more efficient heuristic algorithm and a greedy algorithm. Empirical studies on real data offer detailed insight into the accuracy of the proposed algorithms and show that the proposed algorithms are capable of computing results efficiently and effectively.