Journal ArticleDOI

Pushing the boundaries of crowd-enabled databases with query-driven schema expansion

01 Feb 2012 - Vol. 5, Iss. 6, pp. 538-549
TL;DR: This paper extends crowd-enabled databases with flexible query-driven schema expansion, allowing new attributes to be added to the database at query time, and leverages the user-generated data found in the Social Web to build perceptual spaces.
Abstract: By incorporating human workers into the query execution process, crowd-enabled databases facilitate intelligent, social capabilities like completing missing data at query time or performing cognitive operators. But despite all their flexibility, crowd-enabled databases still maintain rigid schemas. In this paper, we extend crowd-enabled databases by flexible query-driven schema expansion, allowing the addition of new attributes to the database at query time. However, the number of crowd-sourced mini-tasks to fill in missing values may often be prohibitively large, and the resulting data quality is doubtful. Instead of simple crowd-sourcing to obtain all values individually, we leverage the user-generated data found in the Social Web: by exploiting user ratings, we build perceptual spaces, i.e., highly compressed representations of opinions, impressions, and perceptions of large numbers of users. Using few training samples obtained by expert crowd sourcing, we can then extract all missing data automatically from the perceptual space with high quality and at low cost. Extensive experiments show that our approach can boost both performance and quality of crowd-enabled databases, while also providing the flexibility to expand schemas in a query-driven fashion.
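Although the paper defines its own construction, the perceptual-space idea maps onto familiar tooling: compress a user x item rating matrix into low-dimensional item vectors, then train a predictor on a handful of expert-labeled items to fill in the new attribute everywhere else. A minimal sketch of that pattern (the synthetic data, truncated SVD, and logistic-regression predictor are illustrative assumptions, not the paper's exact method):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy user x item rating matrix; in the paper, ratings come from
# Social Web data (all values here are synthetic).
ratings = rng.integers(0, 6, size=(500, 80)).astype(float)

# "Perceptual space": a low-dimensional embedding per item that
# compresses the opinions of many users into a few coordinates.
svd = TruncatedSVD(n_components=10, random_state=0)
item_space = svd.fit_transform(ratings.T)  # one vector per item

# A handful of expert-labeled items (e.g., a new attribute obtained
# via expert crowd sourcing) trains a predictor ...
labeled_idx = rng.choice(80, size=12, replace=False)
labels = np.tile([0, 1], 6)                # placeholder expert labels

clf = LogisticRegression().fit(item_space[labeled_idx], labels)

# ... which then fills in the new attribute for every other item.
predicted_attribute = clf.predict(item_space)
print(predicted_attribute[:10])
```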


Citations
Proceedings ArticleDOI
13 May 2013
TL;DR: This paper proposes and extensively evaluates a different crowdsourcing approach based on a push methodology that carefully selects which workers should perform a given task based on worker profiles extracted from social networks, and shows that this approach consistently yields better results than the usual pull strategies.
Abstract: Crowdsourcing makes it possible to build hybrid online platforms that combine scalable information systems with the power of human intelligence to complete tasks that are difficult to tackle for current algorithms. Examples include hybrid database systems that use the crowd to fill in missing values or to sort items according to subjective dimensions such as picture attractiveness. Current approaches to crowdsourcing adopt a pull methodology where tasks are published on specialized Web platforms and workers pick their preferred tasks on a first-come-first-served basis. While this approach has many advantages, such as simplicity and short completion times, it does not guarantee that the task is performed by the most suitable worker. In this paper, we propose and extensively evaluate a different crowdsourcing approach based on a push methodology. Our proposed system carefully selects which workers should perform a given task based on worker profiles extracted from social networks. Workers and tasks are automatically matched using an underlying categorization structure that exploits entities extracted from the task descriptions on the one hand, and categories liked by the user on social platforms on the other. We experimentally evaluate our approach on tasks of varying complexity and show that our push methodology consistently yields better results than the usual pull strategies.
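The push methodology ultimately reduces to a matching score between the entities extracted from a task description and the categories a worker likes on social platforms. A minimal sketch of such a matcher (the category vectors and the cosine scoring are assumptions chosen for illustration, not the paper's exact model):

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse category-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Categories extracted from a task description (hypothetical output of
# an entity-linking step) and categories liked by workers on a social
# platform; all data is invented for illustration.
task = Counter({"photography": 2, "travel": 1})
workers = {
    "w1": Counter({"photography": 5, "cooking": 3}),
    "w2": Counter({"sports": 4, "travel": 1}),
}

# Push the task to the best-matching worker instead of letting an
# arbitrary worker pull it.
best = max(workers, key=lambda w: cosine(task, workers[w]))
print(best)  # -> w1
```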

165 citations


Cites background from "Pushing the boundaries of crowd-ena..."

  • ...In the database community, hybrid human-machine database systems have been proposed [12, 23]....

    [...]

Proceedings ArticleDOI
18 Mar 2013
TL;DR: The problem of evaluating top-k and group-by queries using the crowd to answer either type or value questions is studied, and efficient algorithms that are guaranteed to achieve good results with high probability are given.
Abstract: Group-by and top-k are fundamental constructs in database queries. However, the criteria used for grouping and ordering certain types of data -- such as unlabeled photos clustered by the same person ordered by age -- are difficult to evaluate by machines. In contrast, these tasks are easy for humans to evaluate and are therefore natural candidates for being crowd-sourced. We study the problem of evaluating top-k and group-by queries using the crowd to answer either type or value questions. Given two data elements, the answer to a type question is "yes" if the elements have the same type and therefore belong to the same group or cluster; the answer to a value question orders the two data elements. The assumption here is that there is an underlying ground truth, but that the answers returned by the crowd may sometimes be erroneous. We formalize the problems of top-k and group-by in the crowd-sourced setting, and give efficient algorithms that are guaranteed to achieve good results with high probability. We analyze the crowd-sourced cost of these algorithms in terms of the total number of type and value questions, and show that they are essentially the best possible. We also show that fewer questions are needed when values and types are correlated, or when the error model is one in which the error decreases as the distance between the two elements in the sorted order increases.
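A value question is in effect a noisy comparator, so any sorting built on it needs redundancy. A small sketch of sorting with majority-voted value questions (the error rate, vote count, and simulated crowd are illustrative assumptions; type questions for grouping would be handled analogously):

```python
import functools
import random

random.seed(1)
truth = {"a": 3, "b": 1, "c": 2}   # hidden ground-truth values

def ask_value(x, y, p_err=0.2):
    """One 'value question': a worker orders x and y, wrong w.p. p_err."""
    correct = truth[x] < truth[y]
    return correct if random.random() > p_err else not correct

def robust_less(x, y, votes=5):
    """Majority vote over several workers to suppress worker errors."""
    yes = sum(ask_value(x, y) for _ in range(votes))
    return yes > votes // 2

items = sorted(truth, key=functools.cmp_to_key(
    lambda x, y: -1 if robust_less(x, y) else 1))
print(items)  # with high probability: ['b', 'c', 'a']
```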

135 citations

Journal ArticleDOI
01 Oct 2013
TL;DR: The ZenCrowd system uses a three-stage blocking technique in order to obtain the best possible instance matches while minimizing both computational complexity and latency, and identifies entities from natural language text using state-of-the-art techniques and automatically connects them to the linked open data cloud.
Abstract: We tackle the problems of semiautomatically matching linked data sets and of linking large collections of Web pages to linked data. Our system, ZenCrowd, (1) uses a three-stage blocking technique in order to obtain the best possible instance matches while minimizing both computational complexity and latency, and (2) identifies entities from natural language text using state-of-the-art techniques and automatically connects them to the linked open data cloud. First, we use structured inverted indices to quickly find potential candidate results from entities that have been indexed in our system. Our system then analyzes the candidate matches and refines them whenever deemed necessary using computationally more expensive queries on a graph database. Finally, we resort to human computation by dynamically generating crowdsourcing tasks in case the algorithmic components fail to come up with convincing results. We integrate all results from the inverted indices, from the graph database, and from the crowd using a probabilistic framework in order to make sensible decisions about candidate matches and to identify unreliable human workers. In the following, we give an overview of the architecture of our system and describe in detail our novel three-stage blocking technique and our probabilistic decision framework. We also report on a series of experimental results on a standard data set, showing that our system can achieve a 95% average accuracy on instance matching (as compared to the initial 88% average accuracy of the purely automatic baseline) while drastically limiting the amount of work performed by the crowd. The experimental evaluation of our system on the entity linking task shows an average relative improvement of 14% over our best automatic approach.
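The three stages form a cascade of increasingly expensive matchers. A toy sketch in that spirit (the lookup table, scoring function, and crowd stub are invented stand-ins, not ZenCrowd's actual interfaces):

```python
# Invented stand-ins for the three stages (not ZenCrowd's real APIs).
INDEX = {"berlin": ["dbpedia:Berlin", "dbpedia:Berlin,_NH"]}

def graph_score(mention, candidate):
    # Placeholder for a computationally expensive graph-database query.
    return 0.95 if candidate == "dbpedia:Berlin" else 0.40

def crowd_vote(mention, candidates):
    # Placeholder for dynamically generated crowdsourcing tasks.
    return candidates[0]

def match_entity(mention, threshold=0.9):
    # Stage 1: cheap inverted-index lookup for candidate entities.
    candidates = INDEX.get(mention.lower(), [])
    if not candidates:
        return None
    # Stage 2: refine candidates with costlier graph queries.
    best = max(candidates, key=lambda c: graph_score(mention, c))
    if graph_score(mention, best) >= threshold:
        return best
    # Stage 3: resort to human computation for unconvincing results.
    return crowd_vote(mention, candidates)

print(match_entity("Berlin"))  # -> dbpedia:Berlin
```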

89 citations


Cites background from "Pushing the boundaries of crowd-ena..."

  • ..., “10 papers with the most novel ideas”) [22,43]....

    [...]

Proceedings ArticleDOI
18 Mar 2013
TL;DR: It is shown that by assessing the individual risk a tuple poses with respect to the overall result quality, crowd-sourcing efforts for eliciting missing values can be narrowly focused on only those tuples that may degenerate the expected quality most strongly, which leads to an algorithm for computing skyline sets on incomplete data with maximum result quality.
Abstract: Skyline queries are a well-established technique for database query personalization and are widely acclaimed for their intuitive query formulation mechanisms. However, when operating on incomplete datasets, skyline queries are severely hampered and often have to resort to highly error-prone heuristics. Unfortunately, incomplete datasets are a frequent phenomenon, especially when datasets are generated automatically using various information extraction or information integration approaches. Here, the recent trend of crowd-enabled databases promises a powerful solution: during query execution, some database operators can be dynamically outsourced to human workers in exchange for monetary compensation, therefore enabling the elicitation of missing values during runtime. Unfortunately, this powerful feature heavily impacts query response times and (monetary) execution costs. In this paper, we present an innovative hybrid approach combining dynamic crowd-sourcing with heuristic techniques in order to overcome current limitations. We will show that by assessing the individual risk a tuple poses with respect to the overall result quality, crowd-sourcing efforts for eliciting missing values can be narrowly focused on only those tuples that may degenerate the expected quality most strongly. This leads to an algorithm for computing skyline sets on incomplete data with maximum result quality, while optimizing crowd-sourcing costs.
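The risk idea can be made concrete as follows: a tuple whose skyline membership flips across the possible completions of its missing values is exactly the tuple worth sending to the crowd. A small sketch under strong simplifying assumptions (a tiny attribute domain, exhaustive enumeration, and an ad-hoc ambiguity measure, none of which is the paper's actual risk model):

```python
import itertools

# Tuples with possibly missing attributes (None = unknown); smaller is
# better in both dimensions. All data is invented for illustration.
tuples = {"t1": (1, None), "t2": (2, 3), "t3": (None, 1), "t4": (4, 4)}
DOMAIN = [1, 2, 3, 4, 5]               # assumed attribute domain

def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and a != b

def membership_rate(name):
    """Fraction of completions of `name` in which it is in the skyline;
    rates near 0.5 mark the most ambiguous (risky) tuples."""
    t = tuples[name]
    slots = [i for i, v in enumerate(t) if v is None]
    # Treat the other tuples' unknowns optimistically (= best value 1).
    others = [tuple(1 if x is None else x for x in v)
              for k, v in tuples.items() if k != name]
    hits = trials = 0
    for fill in itertools.product(DOMAIN, repeat=len(slots)):
        completed = list(t)
        for i, v in zip(slots, fill):
            completed[i] = v
        hits += not any(dominates(o, tuple(completed)) for o in others)
        trials += 1
    return hits / trials

# Ask the crowd about the most ambiguous tuples first.
risk_order = sorted(tuples, key=lambda n: abs(0.5 - membership_rate(n)))
print(risk_order)  # -> ['t1', 't3', 't2', 't4']
```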

68 citations

Proceedings ArticleDOI
22 Jun 2013
TL;DR: A cost-sensitive quantitative analysis method to estimate the profit of a crowdsourcing job so that questions with no future profit from crowdsourcing can be terminated; experimental results show that the proposed method outperforms all state-of-the-art methods.
Abstract: Crowdsourcing has created a variety of opportunities for many challenging problems by leveraging human intelligence. For example, applications such as image tagging, natural language processing, and semantic-based information retrieval can exploit crowd-based human computation to supplement existing computational algorithms. Naturally, human workers in crowdsourcing solve problems based on their knowledge, experience, and perception. It is therefore not clear which problems can be better solved by crowdsourcing than by traditional machine-based methods alone. Therefore, a cost-sensitive quantitative analysis method is needed. In this paper, we design and implement a cost-sensitive method for crowdsourcing. We estimate online the profit of a crowdsourcing job so that questions with no future profit from crowdsourcing can be terminated. Two models are proposed to estimate the profit of a crowdsourcing job, namely the linear value model and the generalized non-linear model. Using these models, the expected profit of obtaining new answers for a specific question is computed based on the answers already received. A question is terminated in real time if the marginal expected profit of obtaining more answers is not positive. We extend the method to publish a batch of questions in a HIT. We evaluate the effectiveness of our proposed method using two real-world jobs on AMT. The experimental results show that our proposed method outperforms all state-of-the-art methods.
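The termination rule amounts to: buy another answer only while its expected marginal profit is positive. A compact sketch with an assumed Bayesian belief over the true answer and an entropy-based value function (the constants and the belief model are illustrative stand-ins, not the paper's linear or generalized non-linear value models):

```python
import math
import random

Q = 0.7          # assumed per-worker accuracy (illustrative)
SCALE = 10.0     # value of a fully confident answer
COST = 0.4       # price paid per crowd answer

def posterior(yes, no):
    """P(true answer is 'yes' | votes), uniform prior, each worker
    correct with probability Q. A stand-in for the paper's models."""
    a = Q ** yes * (1 - Q) ** no
    b = (1 - Q) ** yes * Q ** no
    return a / (a + b)

def value(yes, no):
    """Worth of the current belief: high when confident, 0 at 50/50."""
    p = posterior(yes, no)
    h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return SCALE * (1 - h)

def marginal_profit(yes, no):
    """Expected profit of buying exactly one more answer."""
    p = posterior(yes, no)
    p_yes_vote = p * Q + (1 - p) * (1 - Q)
    expected = (p_yes_vote * value(yes + 1, no)
                + (1 - p_yes_vote) * value(yes, no + 1))
    return expected - value(yes, no) - COST

# Terminate the question once another answer is not worth its cost.
random.seed(2)
yes = no = 0
while marginal_profit(yes, no) > 0:
    yes, no = (yes + 1, no) if random.random() < Q else (yes, no + 1)
print(yes, no, round(posterior(yes, no), 3))
```

With this setup the loop keeps buying answers while the votes are balanced and stops once the posterior is confident enough that the expected entropy gain no longer covers the per-answer cost.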

54 citations


Cites background from "Pushing the boundaries of crowd-ena..."

  • ...[16] expanded database schemas with additional attributes through querying the crowdsourcing systems....

    [...]

References
Book
01 Jan 2000
TL;DR: A comprehensive survey of film theory, tracing its antecedents and its development through Russian formalism, the question of film language, Brecht, the poststructuralist mutation, cultural studies, queer theory, Third World cinema, postmodernism, and digital new media.
Abstract: Preface. Introduction. 1. The Antecedents of Film Theory. 2. Russian Formalism. 3. The Question of Film Language. 4. The Presence of Brecht. 5. The Poststructuralist Mutation. 6. The Rise of Cultural Studies. 7. The Coming Out of Queer Theory. 8. Third World Cinema Revisited. 9. The Politics of Postmodernism. 10. Post Cinema: Digital Theory and the New Media. Index.

404 citations


"Pushing the boundaries of crowd-ena..." refers background in this paper

  • ...classification scheme for movies focusing on those high-level properties that are of major relevance to customers [16]....

    [...]

Journal ArticleDOI
Winter Mason, Duncan J. Watts
TL;DR: It is found that increased financial incentives increase the quantity, but not the quality, of work performed by participants, where the difference appears to be due to an "anchoring" effect.
Abstract: The relationship between financial incentives and performance, long of interest to social scientists, has gained new relevance with the advent of web-based "crowd-sourcing" models of production. Here we investigate the effect of compensation on performance in the context of two experiments, conducted on Amazon's Mechanical Turk (AMT). We find that increased financial incentives increase the quantity, but not the quality, of work performed by participants, where the difference appears to be due to an "anchoring" effect: workers who were paid more also perceived the value of their work to be greater, and thus were no more motivated than workers paid less. In contrast with compensation levels, we find the details of the compensation scheme do matter--specifically, a "quota" system results in better work for less pay than an equivalent "piece rate" system. Although counterintuitive, these findings are consistent with previous laboratory studies, and may have real-world analogs as well.

396 citations

Proceedings Article
01 Jan 2011
TL;DR: This work proposes Qurk, a novel query system for managing HITs that allows crowd-powered processing of relational databases, describes a number of query execution and optimization challenges, and discusses some potential solutions.
Abstract: Amazon's Mechanical Turk ("MTurk") service allows users to post short tasks ("HITs") that other users can receive a small amount of money for completing. Common tasks on the system include labelling a collection of images, combining two sets of images to identify people who appear in both, or extracting sentiment from a corpus of text snippets. Designing a workflow of various kinds of HITs for filtering, aggregating, sorting, and joining data sources together is common, and comes with a set of challenges in optimizing the cost per HIT, the overall time to task completion, and the accuracy of MTurk results. We propose Qurk, a novel query system for managing these workflows, allowing crowd-powered processing of relational databases. We describe a number of query execution and optimization challenges, and discuss some potential solutions.
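The core abstraction is a relational operator whose predicate is evaluated by human workers rather than by code. A toy sketch of such a crowd-powered filter (ask_crowd is a hypothetical stub for posting a HIT and aggregating votes; Qurk's actual interface is declarative and far richer):

```python
def ask_crowd(text):
    # Hypothetical stub: a real system would post a HIT such as
    # "Does this photo show a cat?" and aggregate the workers' votes.
    return "cat" in text

def crowd_filter(rows, column):
    """A filter operator whose predicate is answered by the crowd."""
    for row in rows:
        if ask_crowd(row[column]):
            yield row

photos = [{"id": 1, "caption": "a cat on a sofa"},
          {"id": 2, "caption": "a mountain lake"}]
print(list(crowd_filter(photos, "caption")))  # -> only photo 1
```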

224 citations

Proceedings Article
01 Jan 2005
TL;DR: The Semex system offers users a flexible platform for personal information management and leverages that personal information to enable lightweight information integration tasks that are discouragingly difficult to perform with today's tools.
Abstract: The explosion of the amount of information available in digital form has made search a hot research topic for the information management community. While most of the research on search is focused on the WWW, individual computer users have developed their own vast collections of data on their desktops, and these collections are in critical need of good search tools. We describe the Semex system, which offers users a flexible platform for personal information management. Semex has two main goals. The first goal is to enable browsing personal information by semantically meaningful associations. The challenge is to automatically create such associations between data items on one's desktop, and to create enough of them that Semex becomes an indispensable tool. Our second goal is to leverage the personal information space we created to increase users' productivity. As our first target, Semex leverages the personal information to enable lightweight information integration tasks that are discouragingly difficult to perform with today's tools.

179 citations


"Pushing the boundaries of crowd-ena..." refers background in this paper

  • ...To fully harness the power of the crowd and to provide flexible and adaptive querying, a certain amount of schema malleability [4], [5] is needed....

    [...]

  • ...Originally, malleable schemas [4], [5] have been designed to deal with the vague nature of both data and queries in the real world....

    [...]

Proceedings ArticleDOI
19 Jul 2009
TL;DR: This work presents a novel method for recommending items to users based on expert opinions, a variation of traditional collaborative filtering: rather than applying a nearest-neighbor algorithm to the user-rating data, predictions are computed using a set of expert neighbors from an independent dataset, whose opinions are weighted according to their similarity to the user.
Abstract: Nearest-neighbor collaborative filtering provides a successful means of generating recommendations for web users. However, this approach suffers from several shortcomings, including data sparsity and noise, the cold-start problem, and scalability. In this work, we present a novel method for recommending items to users based on expert opinions. Our method is a variation of traditional collaborative filtering: rather than applying a nearest-neighbor algorithm to the user-rating data, predictions are computed using a set of expert neighbors from an independent dataset, whose opinions are weighted according to their similarity to the user. This method promises to address some of the weaknesses of traditional collaborative filtering, while maintaining comparable accuracy. We validate our approach by predicting ratings for a subset of the Netflix data set. We use ratings crawled from a web portal of expert reviews, measuring results both in terms of prediction accuracy and recommendation list precision. Finally, we explore the ability of our method to generate useful recommendations by reporting the results of a user study in which users preferred the recommendations generated by our approach.
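The prediction rule is standard neighborhood collaborative filtering with experts as the neighbor set. A minimal sketch (the data and the similarity measure are illustrative; the paper's similarity and normalization details differ):

```python
# Expert ratings per movie (synthetic; the paper crawls these from a
# Web portal of professional reviews).
expert_ratings = {
    "e1": {"m1": 5.0, "m2": 2.0},
    "e2": {"m1": 4.0, "m2": 1.0},
}
user_ratings = {"m1": 4.5}            # the target user's sparse history

def similarity(user, expert):
    """Similarity on co-rated items (here: inverse mean abs. error)."""
    common = user.keys() & expert.keys()
    if not common:
        return 0.0
    mae = sum(abs(user[m] - expert[m]) for m in common) / len(common)
    return 1.0 / (1.0 + mae)

def predict(item):
    """Weighted average of expert opinions, weighted by each expert's
    similarity to the user (the expert-CF core idea)."""
    num = den = 0.0
    for ratings in expert_ratings.values():
        if item in ratings:
            w = similarity(user_ratings, ratings)
            num += w * ratings[item]
            den += w
    return num / den if den else None

print(predict("m2"))  # prediction for a movie the user has not rated
```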

154 citations


"Pushing the boundaries of crowd-ena..." refers background in this paper

  • ...Still, even when there really are only small numbers of users and ratings, recent work in recommender systems shows that much can be learnt from the available data as long as there are some active core users, who rated a large number of items each [22]....

    [...]