
Journal ArticleDOI

Pushing the boundaries of crowd-enabled databases with query-driven schema expansion

01 Feb 2012 · Vol. 5, Iss. 6, pp. 538-549

TL;DR: This paper extends crowd-enabled databases by flexible query-driven schema expansion, allowing the addition of new attributes to the database at query time, and leverages the user-generated data found in the Social Web to build perceptual spaces.

Abstract: By incorporating human workers into the query execution process, crowd-enabled databases facilitate intelligent, social capabilities like completing missing data at query time or performing cognitive operators. But despite all their flexibility, crowd-enabled databases still maintain rigid schemas. In this paper, we extend crowd-enabled databases by flexible query-driven schema expansion, allowing the addition of new attributes to the database at query time. However, the number of crowd-sourced mini-tasks to fill in missing values may often be prohibitively large, and the resulting data quality is doubtful. Instead of simple crowd-sourcing to obtain all values individually, we leverage the user-generated data found in the Social Web: by exploiting user ratings, we build perceptual spaces, i.e., highly compressed representations of opinions, impressions, and perceptions of large numbers of users. Using few training samples obtained by expert crowd-sourcing, we can then extract all missing data automatically from the perceptual space with high quality and at low cost. Extensive experiments show that our approach can boost both the performance and quality of crowd-enabled databases, while also providing the flexibility to expand schemas in a query-driven fashion.
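
The pipeline the abstract describes can be pictured in a few lines: compress a user-item rating matrix into a low-dimensional "perceptual space" and then learn a new attribute from a handful of expert-labeled items. The sketch below is an illustration under assumptions of ours (truncated SVD for the compression, scikit-learn's SVR for the regression, synthetic data throughout), not the authors' actual implementation.

```python
# Hypothetical sketch: build a "perceptual space" from a user-item rating
# matrix, then fill in a new attribute from a few expert-labeled items.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(200, 50)).astype(float)  # users x items

# Compress the rating matrix into a low-dimensional perceptual space
# (one coordinate vector per item) via truncated SVD.
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
k = 10
perceptual_space = (np.diag(s[:k]) @ Vt[:k]).T        # items x k

# A few expert-crowdsourced judgments suffice to train a regressor ...
labeled_idx = [0, 3, 7, 12, 21]                        # items judged by experts
expert_scores = [4.5, 2.0, 3.5, 1.0, 5.0]              # e.g., "suspensefulness"

model = SVR(kernel="rbf").fit(perceptual_space[labeled_idx], expert_scores)

# ... and the new attribute is then filled in for all remaining items.
predicted = model.predict(perceptual_space)
print(predicted[:5].round(2))
```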

Topics: Missing data (54%), Data quality (50%)



Citations

Proceedings ArticleDOI
13 May 2013
TL;DR: This paper proposes and extensively evaluates a different crowdsourcing approach based on a push methodology that carefully selects which workers should perform a given task, based on worker profiles extracted from social networks, and shows that this approach consistently yields better results than the usual pull strategies.
Abstract: Crowdsourcing makes it possible to build hybrid online platforms that combine scalable information systems with the power of human intelligence to complete tasks that are difficult to tackle for current algorithms. Examples include hybrid database systems that use the crowd to fill in missing values or to sort items according to subjective dimensions such as picture attractiveness. Current approaches to crowdsourcing adopt a pull methodology where tasks are published on specialized Web platforms and workers pick their preferred tasks on a first-come-first-served basis. While this approach has many advantages, such as simplicity and short completion times, it does not guarantee that a task is performed by the most suitable worker. In this paper, we propose and extensively evaluate a different crowdsourcing approach based on a push methodology. Our proposed system carefully selects which workers should perform a given task based on worker profiles extracted from social networks. Workers and tasks are automatically matched using an underlying categorization structure that exploits entities extracted from the task descriptions on the one hand, and categories liked by the user on social platforms on the other. We experimentally evaluate our approach on tasks of varying complexity and show that our push methodology consistently yields better results than the usual pull strategies.
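
The matching idea reads naturally as a similarity computation between task entities and worker interests. A minimal sketch, assuming Jaccard overlap as the scoring function (the paper's actual categorization structure is richer) and fabricated profiles:

```python
# Illustrative sketch of the push idea: match tasks to workers by overlap
# between entities extracted from the task description and categories the
# worker likes on social platforms.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

task_entities = {"movies", "film noir", "1950s"}
worker_profiles = {
    "alice": {"movies", "cooking", "film noir"},
    "bob":   {"soccer", "travel"},
}

# Push the task to the worker whose profile matches best.
best_worker = max(worker_profiles,
                  key=lambda w: jaccard(task_entities, worker_profiles[w]))
print(best_worker)  # -> "alice"
```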

160 citations


Cites background from "Pushing the boundaries of crowd-ena..."

  • ...In the database community, hybrid human-machine database systems have been proposed [12, 23]....

    [...]


Proceedings ArticleDOI
18 Mar 2013
TL;DR: The problem of evaluating top-k and group-by queries using the crowd to answer either type or value questions is studied, and efficient algorithms that are guaranteed to achieve good results with high probability are given.
Abstract: Group-by and top-k are fundamental constructs in database queries. However, the criteria used for grouping and ordering certain types of data -- such as unlabeled photos clustered by the same person ordered by age -- are difficult to evaluate by machines. In contrast, these tasks are easy for humans to evaluate and are therefore natural candidates for being crowd-sourced. We study the problem of evaluating top-k and group-by queries using the crowd to answer either type or value questions. Given two data elements, the answer to a type question is "yes" if the elements have the same type and therefore belong to the same group or cluster; the answer to a value question orders the two data elements. The assumption here is that there is an underlying ground truth, but that the answers returned by the crowd may sometimes be erroneous. We formalize the problems of top-k and group-by in the crowd-sourced setting, and give efficient algorithms that are guaranteed to achieve good results with high probability. We analyze the crowd-sourced cost of these algorithms in terms of the total number of type and value questions, and show that they are essentially the best possible. We also show that fewer questions are needed when values and types are correlated, or when the error model is one in which the error decreases as the distance between the two elements in the sorted order increases.
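
To make the "value question" setting concrete, here is a toy simulation in which each pairwise comparison is answered by several noisy workers and decided by majority vote; the error model, vote count, and max-finding loop are illustrative stand-ins, not the paper's algorithms:

```python
# Minimal sketch of crowdsourced ordering under noisy value questions.
import random

def crowd_compare(a, b, error_rate=0.2, votes=5):
    """Simulated workers answer 'is a > b?'; majority vote reduces noise."""
    truth = a > b
    answers = [truth if random.random() > error_rate else not truth
               for _ in range(votes)]
    return sum(answers) > votes / 2

def crowd_max(items):
    """Top-1 by a linear scan of noisy pairwise comparisons."""
    best = items[0]
    for x in items[1:]:
        if crowd_compare(x, best):
            best = x
    return best

print(crowd_max([3, 9, 1, 7, 5]))  # usually 9, despite worker errors
```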

129 citations


Journal ArticleDOI
01 Oct 2013
TL;DR: The ZenCrowd system uses a three-stage blocking technique in order to obtain the best possible instance matches while minimizing both computational complexity and latency, and identifies entities from natural language text using state-of-the-art techniques and automatically connects them to the linked open data cloud.
Abstract: We tackle the problems of semiautomatically matching linked data sets and of linking large collections of Web pages to linked data. Our system, ZenCrowd, (1) uses a three-stage blocking technique in order to obtain the best possible instance matches while minimizing both computational complexity and latency, and (2) identifies entities from natural language text using state-of-the-art techniques and automatically connects them to the linked open data cloud. First, we use structured inverted indices to quickly find potential candidate results from entities that have been indexed in our system. Our system then analyzes the candidate matches and refines them whenever deemed necessary using computationally more expensive queries on a graph database. Finally, we resort to human computation by dynamically generating crowdsourcing tasks in case the algorithmic components fail to come up with convincing results. We integrate all results from the inverted indices, from the graph database, and from the crowd using a probabilistic framework in order to make sensible decisions about candidate matches and to identify unreliable human workers. In the following, we give an overview of the architecture of our system and describe in detail our novel three-stage blocking technique and our probabilistic decision framework. We also report on a series of experimental results on a standard data set, showing that our system can achieve a 95% average accuracy on instance matching (as compared to the initial 88% average accuracy of the purely automatic baseline) while drastically limiting the amount of work performed by the crowd. The experimental evaluation of our system on the entity linking task shows an average relative improvement of 14% over our best automatic approach.
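
The three-stage cascade can be sketched schematically: a cheap index lookup first, a costlier rescoring second, and the crowd only when the algorithmic stages lack confidence. Everything below (the index, scores, threshold, and crowd call) is a hypothetical stand-in for illustration, not ZenCrowd's API:

```python
# Schematic sketch of a three-stage matching cascade in the ZenCrowd spirit.
INDEX = {"berlin": [("dbpedia:Berlin", 0.95), ("dbpedia:Berlin,_NH", 0.40)]}

def ask_crowd(mention, candidates):
    # Placeholder: a real system would publish a crowdsourcing task here.
    return candidates[0][0] if candidates else None

def match(mention, threshold=0.9):
    candidates = INDEX.get(mention.lower(), [])          # stage 1: index lookup
    candidates.sort(key=lambda c: c[1], reverse=True)    # stage 2: refinement
    if candidates and candidates[0][1] >= threshold:
        return candidates[0][0]                          # confident match
    return ask_crowd(mention, candidates)                # stage 3: the crowd

print(match("Berlin"))  # -> "dbpedia:Berlin"
```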

84 citations


Cites background from "Pushing the boundaries of crowd-ena..."

  • ..., “10 papers with the most novel ideas”) [22,43]....

    [...]


Proceedings ArticleDOI
18 Mar 2013
TL;DR: It is shown that by assessing the individual risk a tuple poses with respect to the overall result quality, crowd-sourcing efforts for eliciting missing values can be narrowly focused on only those tuples that may degenerate the expected quality most strongly, which leads to an algorithm for computing skyline sets on incomplete data with maximum result quality.
Abstract: Skyline queries are a well-established technique for database query personalization and are widely acclaimed for their intuitive query formulation mechanisms. However, when operating on incomplete datasets, skyline queries are severely hampered and often have to resort to highly error-prone heuristics. Unfortunately, incomplete datasets are a frequent phenomenon, especially when datasets are generated automatically using various information extraction or information integration approaches. Here, the recent trend of crowd-enabled databases promises a powerful solution: during query execution, some database operators can be dynamically outsourced to human workers in exchange for monetary compensation, thereby enabling the elicitation of missing values during runtime. Unfortunately, this powerful feature heavily impacts query response times and (monetary) execution costs. In this paper, we present an innovative hybrid approach combining dynamic crowd-sourcing with heuristic techniques in order to overcome current limitations. We will show that by assessing the individual risk a tuple poses with respect to the overall result quality, crowd-sourcing efforts for eliciting missing values can be narrowly focused on only those tuples that may degenerate the expected quality most strongly. This leads to an algorithm for computing skyline sets on incomplete data with maximum result quality, while optimizing crowd-sourcing costs.
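
A toy rendering of the risk idea, assuming None marks an unknown value and smaller values are better on every dimension: only incomplete tuples whose unknowns can still change the skyline are worth sending to the crowd. The dominance semantics here are deliberately simplified relative to the paper:

```python
# Illustrative sketch: focus crowd-sourcing on tuples whose missing values
# still leave their skyline membership open.
def may_dominate(a, b):
    """a could dominate b under some completion of the unknown (None)
    values: a must not be worse on any dimension known in both tuples."""
    return all(x <= y for x, y in zip(a, b)
               if x is not None and y is not None)

tuples = [(1, None), (0, 3), (4, 4)]

# "Risky" tuples: incomplete, and some other tuple might dominate them;
# eliciting their missing values first protects result quality most.
risky = [t for t in tuples
         if any(v is None for v in t)
         and any(may_dominate(u, t) for u in tuples if u is not t)]
print(risky)  # -> [(1, None)]
```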

59 citations


Proceedings ArticleDOI
22 Jun 2013
TL;DR: A cost-sensitive quantitative analysis method to estimate the profit of a crowdsourcing job, so that questions with no future profit from crowdsourcing can be terminated; the experimental results show that the proposed method outperforms all state-of-the-art methods.
Abstract: Crowdsourcing has created a variety of opportunities for many challenging problems by leveraging human intelligence. For example, applications such as image tagging, natural language processing, and semantic-based information retrieval can exploit crowd-based human computation to supplement existing computational algorithms. Naturally, human workers in crowdsourcing solve problems based on their knowledge, experience, and perception. It is therefore not clear which problems can be better solved by crowdsourcing than by traditional machine-based methods alone, so a cost-sensitive quantitative analysis method is needed. In this paper, we design and implement a cost-sensitive method for crowdsourcing. We estimate the profit of a crowdsourcing job online, so that questions with no future profit from crowdsourcing can be terminated. Two models are proposed to estimate the profit of a crowdsourcing job, namely a linear value model and a generalized non-linear model. Using these models, the expected profit of obtaining new answers for a specific question is computed based on the answers already received. A question is terminated in real time if the marginal expected profit of obtaining more answers is not positive. We extend the method to publish a batch of questions in a HIT. We evaluate the effectiveness of our proposed method using two real-world jobs on AMT. The experimental results show that our proposed method outperforms all state-of-the-art methods.
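
The stopping rule itself is compact: keep buying answers for a question only while the expected marginal profit of one more answer is positive. The value model below (confidence gain times a fixed question value, minus the per-answer cost) is an illustrative linear stand-in, not the paper's exact models:

```python
# Minimal sketch of a marginal-profit stopping rule for one question.
def expected_confidence(n_agree, n_total):
    return n_agree / n_total if n_total else 0.0

def should_continue(answers, value_per_question=0.50, cost_per_answer=0.05):
    n = len(answers)
    conf_now = expected_confidence(sum(answers), n)
    # Expected confidence if one more (majority-consistent) answer arrives.
    conf_next = expected_confidence(sum(answers) + (conf_now >= 0.5), n + 1)
    marginal_profit = (conf_next - conf_now) * value_per_question - cost_per_answer
    return marginal_profit > 0

print(should_continue([True, True, False]))  # -> False: stop buying answers
```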

52 citations


Cites background from "Pushing the boundaries of crowd-ena..."

  • ...[16] expanded database schemas with additional attributes through querying the crowdsourcing systems....

    [...]


References

Journal ArticleDOI
TL;DR: A new method for automatic indexing and retrieval to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.
Abstract: A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term-by-document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100-item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
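
The recipe in this abstract translates almost directly into code: factor the term-by-document matrix with an SVD, fold a query in as a pseudo-document, and rank documents by cosine similarity. A toy version with 2 factors instead of ca. 100, on fabricated counts:

```python
# Compact LSI sketch: SVD, query fold-in, cosine ranking.
import numpy as np

terms = ["crowd", "database", "query", "human"]   # row labels for A
A = np.array([[2, 0, 1],                          # term-by-document counts
              [1, 2, 0],
              [0, 1, 2],
              [2, 0, 0]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T            # documents in factor space

query = np.array([1, 0, 0, 1], dtype=float)       # "crowd human"
q_vec = query @ U[:, :k]                          # fold-in as pseudo-document

cosine = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1)
                             * np.linalg.norm(q_vec))
print(cosine.argsort()[::-1])                     # documents ranked by relevance
```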

12,005 citations


"Pushing the boundaries of crowd-ena..." refers methods in this paper

  • ...Furthermore, we can show that approaches based on classification using metadata and LSI lead to surprisingly bad results (g-mean between 0.41 and 0.50), and show even worse accuracy than randomly applying labels....

    [...]

  • ...This is implemented by using Latent Semantic Indexing (LSI) [21] to generate a 100-dimensional “metadata space” from movie attributes like title, plot, main actors, directors, year, runtime, and country as recorded in IMDb....

    [...]


Journal ArticleDOI
TL;DR: This tutorial gives an overview of the basic ideas underlying Support Vector (SV) machines for function estimation, and includes a summary of currently used algorithms for training SV machines, covering both the quadratic programming part and advanced methods for dealing with large datasets.
Abstract: In this tutorial we give an overview of the basic ideas underlying Support Vector (SV) machines for function estimation. Furthermore, we include a summary of currently used algorithms for training SV machines, covering both the quadratic (or convex) programming part and advanced methods for dealing with large datasets. Finally, we mention some modifications and extensions that have been applied to the standard SV algorithm, and discuss the aspect of regularization from a SV perspective.

9,105 citations


"Pushing the boundaries of crowd-ena..." refers methods in this paper

  • ...Instead of relying on non-linear regression, we can use an SVM classifier [19]....

    [...]
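
The quoted passage swaps regression for classification; a minimal illustration of that choice with scikit-learn's SVC, on fabricated two-dimensional "perceptual space" coordinates and genre labels:

```python
# Toy SVM classification in place of non-linear regression.
from sklearn.svm import SVC

X = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.3]]  # items in a 2-D space
y = ["comedy", "comedy", "thriller", "thriller"]       # expert genre labels

clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([[0.15, 0.85]]))  # -> ['comedy']
```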


Journal ArticleDOI
TL;DR: A critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario is provided.
Abstract: With the continuous expansion of data availability in many large-scale, complex, and networked systems, such as surveillance, security, Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data to support decision-making processes. Although existing knowledge discovery and data engineering techniques have shown great success in many real-world applications, the problem of learning from imbalanced data (the imbalanced learning problem) is a relatively new challenge that has attracted growing attention from both academia and industry. The imbalanced learning problem is concerned with the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews. Due to the inherent complex characteristics of imbalanced data sets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation. In this paper, we provide a comprehensive review of the development of research in learning from imbalanced data. Our focus is to provide a critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario. Furthermore, in order to stimulate future research in this field, we also highlight the major opportunities and challenges, as well as potential important research directions for learning from imbalanced data.

5,368 citations


"Pushing the boundaries of crowd-ena..." refers background in this paper

  • ...A popular measure of classification performance in the presence of class imbalance is the g-mean measure [20], which is the geometric mean of sensitivity (accuracy on all movies truly belonging to the genre) and specificity (accuracy on all movies truly not belonging to the genre). As the g-mean punishes significant differences between sensitivity and specificity, the above naïve classifier would achieve 0% g-mean....

    [...]
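
Spelled out, the g-mean from the quoted passage is the geometric mean of sensitivity and specificity, which collapses to zero whenever a classifier ignores one class entirely. The counts below are made up for illustration:

```python
# g-mean = sqrt(sensitivity * specificity), on fabricated confusion counts.
from math import sqrt

tp, fn = 40, 10          # movies in the genre: correctly / wrongly labeled
tn, fp = 900, 50         # movies not in the genre

sensitivity = tp / (tp + fn)          # 0.80
specificity = tn / (tn + fp)          # ~0.95
g_mean = sqrt(sensitivity * specificity)
print(round(g_mean, 3))               # ~0.871
```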


BookDOI
31 Mar 2010
Abstract: In the field of machine learning, semi-supervised learning (SSL) occupies the middle ground, between supervised learning (in which all training examples are labeled) and unsupervised learning (in which no label data are given). Interest in SSL has increased in recent years, particularly because of application domains in which unlabeled data are plentiful, such as images, text, and bioinformatics. This first comprehensive overview of SSL presents state-of-the-art algorithms, a taxonomy of the field, selected applications, benchmark experiments, and perspectives on ongoing and future research. Semi-Supervised Learning first presents the key assumptions and ideas underlying the field: smoothness, cluster or low-density separation, manifold structure, and transduction. The core of the book is the presentation of SSL methods, organized according to algorithmic strategies. After an examination of generative models, the book describes algorithms that implement the low-density separation assumption, graph-based methods, and algorithms that perform two-step learning. The book then discusses SSL applications and offers guidelines for SSL practitioners by analyzing the results of extensive benchmark experiments. Finally, the book looks at interesting directions for SSL research. The book closes with a discussion of the relationship between semi-supervised learning and transduction. (Adaptive Computation and Machine Learning series)
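
As a concrete taste of the graph-based methods the book covers, the sketch below propagates two known labels to unlabeled points (marked -1) with scikit-learn's LabelPropagation; the data is synthetic and the parameter choices are ours:

```python
# Tiny semi-supervised example: graph-based label propagation.
import numpy as np
from sklearn.semi_supervised import LabelPropagation

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y = np.array([0, -1, -1, 1, -1, -1])   # only two points carry labels

model = LabelPropagation(kernel="rbf", gamma=1.0).fit(X, y)
print(model.transduction_)             # -> [0 0 0 1 1 1]
```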

3,611 citations


Proceedings Article
03 Dec 1996
TL;DR: This work compares support vector regression (SVR) with a committee regression technique (bagging) based on regression trees and ridge regression done in feature space, and expects that SVR will have advantages in high-dimensional spaces because SVR optimization does not depend on the dimensionality of the input space.
Abstract: A new regression technique based on Vapnik's concept of support vectors is introduced. We compare support vector regression (SVR) with a committee regression technique (bagging) based on regression trees and ridge regression done in feature space. On the basis of these experiments, it is expected that SVR will have advantages in high-dimensional spaces because SVR optimization does not depend on the dimensionality of the input space.

3,498 citations


"Pushing the boundaries of crowd-ena..." refers methods in this paper

  • ...perceptual space, we suggest to use Support Vector Regression Machines (SVMs) [14], which are a highly flexible technique to perform non-linear regression and classification, and have been proven to be effective when dealing with perceptual data [15]....

    [...]