Journal ArticleDOI

Pushing the boundaries of crowd-enabled databases with query-driven schema expansion

01 Feb 2012 - Proceedings of the VLDB Endowment, Vol. 5, Iss. 6, pp. 538-549
TL;DR: This paper extends crowd-enabled databases with flexible, query-driven schema expansion, allowing new attributes to be added to the database at query time, and leverages the user-generated data found in the Social Web to build perceptual spaces.
Abstract: By incorporating human workers into the query execution process, crowd-enabled databases facilitate intelligent, social capabilities like completing missing data at query time or performing cognitive operators. But despite all their flexibility, crowd-enabled databases still maintain rigid schemas. In this paper, we extend crowd-enabled databases by flexible query-driven schema expansion, allowing the addition of new attributes to the database at query time. However, the number of crowd-sourced mini-tasks to fill in missing values may often be prohibitively large, and the resulting data quality is doubtful. Instead of simple crowd-sourcing to obtain all values individually, we leverage the user-generated data found in the Social Web: by exploiting user ratings we build perceptual spaces, i.e., highly compressed representations of opinions, impressions, and perceptions of large numbers of users. Using a few training samples obtained by expert crowd sourcing, we can then extract all missing data automatically from the perceptual space with high quality and at low cost. Extensive experiments show that our approach can boost both performance and quality of crowd-enabled databases, while also providing the flexibility to expand schemas in a query-driven fashion.
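
As a rough illustration of the pipeline described above (not the paper's exact algorithm), the following Python sketch derives a latent "perceptual space" from a user-item rating matrix via truncated SVD and then learns a new, query-time attribute from a handful of crowd-labeled items; the data, the number of latent dimensions, and the ridge regressor are all illustrative assumptions.

    # Sketch only: latent item space from ratings, then a new attribute
    # learned from a few expert-labeled items and predicted for the rest.
    import numpy as np

    rng = np.random.default_rng(0)
    ratings = rng.integers(1, 6, size=(200, 50)).astype(float)   # users x items

    # 1. "Perceptual space": latent item vectors from the centered rating matrix.
    centered = ratings - ratings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    k = 10
    item_space = vt[:k].T                     # one k-dimensional vector per item

    # 2. Expert crowd labels for a few training items only (hypothetical values).
    labeled_items = [0, 3, 7, 12, 21]
    labels = np.array([4.0, 1.5, 3.0, 5.0, 2.0])

    # 3. Ridge regression from the perceptual space to the new attribute.
    X = item_space[labeled_items]
    lam = 0.1
    w = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ labels)

    # 4. Fill the new column for every item without further crowd-sourcing.
    predicted_attribute = item_space @ w
    print(predicted_attribute[:10])

The point of the sketch is the cost structure the abstract argues for: only the few labeled items require crowd work, while the values for all remaining items are read off the learned mapping.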


Citations
Proceedings ArticleDOI
13 May 2013
TL;DR: This paper proposes and extensively evaluates a different crowdsourcing approach based on a push methodology that carefully selects which workers should perform a given task based on worker profiles extracted from social networks, and shows that this approach consistently yields better results than the usual pull strategies.
Abstract: Crowdsourcing makes it possible to build hybrid online platforms that combine scalable information systems with the power of human intelligence to complete tasks that are difficult to tackle for current algorithms. Examples include hybrid database systems that use the crowd to fill in missing values or to sort items according to subjective dimensions such as picture attractiveness. Current approaches to crowdsourcing adopt a pull methodology where tasks are published on specialized Web platforms and workers pick their preferred tasks on a first-come-first-served basis. While this approach has many advantages, such as simplicity and short completion times, it does not guarantee that the task is performed by the most suitable worker. In this paper, we propose and extensively evaluate a different crowdsourcing approach based on a push methodology. Our proposed system carefully selects which workers should perform a given task based on worker profiles extracted from social networks. Workers and tasks are automatically matched using an underlying categorization structure that exploits entities extracted from the task descriptions on the one hand, and categories liked by the user on social platforms on the other. We experimentally evaluate our approach on tasks of varying complexity and show that our push methodology consistently yields better results than the usual pull strategies.
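
A minimal sketch of the push idea, assuming worker interests are available as category sets harvested from a social network and task topics as entities extracted from the task description; ranking workers by plain Jaccard overlap stands in for the paper's richer categorization structure, and all profiles are hypothetical.

    # Push-style assignment sketch: rank workers for a task by the overlap
    # between task entities and the categories a worker "likes".
    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    task_entities = {"photography", "cameras", "image quality"}
    worker_profiles = {                      # hypothetical worker profiles
        "alice": {"photography", "travel", "cameras"},
        "bob": {"cooking", "soccer"},
        "carol": {"image quality", "design", "photography"},
    }

    ranked = sorted(worker_profiles.items(),
                    key=lambda kv: jaccard(task_entities, kv[1]),
                    reverse=True)
    for worker, profile in ranked:
        print(worker, round(jaccard(task_entities, profile), 2))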

165 citations


Cites background from "Pushing the boundaries of crowd-ena..."

  • ...In the database community, hybrid human-machine database systems have been proposed [12, 23]....

Proceedings ArticleDOI
18 Mar 2013
TL;DR: The problem of evaluating top-k and group-by queries using the crowd to answer either type or value questions is studied, and efficient algorithms that are guaranteed to achieve good results with high probability are given.
Abstract: Group-by and top-k are fundamental constructs in database queries. However, the criteria used for grouping and ordering certain types of data -- such as unlabeled photos clustered by the same person ordered by age -- are difficult to evaluate by machines. In contrast, these tasks are easy for humans to evaluate and are therefore natural candidates for being crowd-sourced. We study the problem of evaluating top-k and group-by queries using the crowd to answer either type or value questions. Given two data elements, the answer to a type question is "yes" if the elements have the same type and therefore belong to the same group or cluster; the answer to a value question orders the two data elements. The assumption here is that there is an underlying ground truth, but that the answers returned by the crowd may sometimes be erroneous. We formalize the problems of top-k and group-by in the crowd-sourced setting, and give efficient algorithms that are guaranteed to achieve good results with high probability. We analyze the crowd-sourced cost of these algorithms in terms of the total number of type and value questions, and show that they are essentially the best possible. We also show that fewer questions are needed when values and types are correlated, or when the error model is one in which the error decreases as the distance between the two elements in the sorted order increases.
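
To make the two crowd primitives concrete, the sketch below runs group-by with type questions and a per-group top-1 with value questions against a noiseless stand-in oracle; the paper's algorithms additionally cope with erroneous crowd answers, which this illustration omits.

    # Illustration of "type" and "value" questions with a noiseless oracle
    # standing in for the crowd. Data is made up (photo id, person, age).
    from functools import cmp_to_key

    items = [("photo1", "anna", 34), ("photo2", "ben", 29),
             ("photo3", "anna", 31), ("photo4", "ben", 40)]

    def type_question(x, y):            # "same person?" -> yes/no
        return x[1] == y[1]

    def value_question(x, y):           # "who is older?" -> ordering
        return -1 if x[2] > y[2] else 1

    # Group-by via type questions.
    groups = []
    for item in items:
        for group in groups:
            if type_question(item, group[0]):
                group.append(item)
                break
        else:
            groups.append([item])

    # Top-1 per group via value questions.
    for group in groups:
        ranked = sorted(group, key=cmp_to_key(value_question))
        print(group[0][1], "->", ranked[0][0])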

135 citations

Journal ArticleDOI
01 Oct 2013
TL;DR: The ZenCrowd system uses a three-stage blocking technique in order to obtain the best possible instance matches while minimizing both computational complexity and latency, and identifies entities from natural language text using state-of-the-art techniques and automatically connects them to the linked open data cloud.
Abstract: We tackle the problems of semiautomatically matching linked data sets and of linking large collections of Web pages to linked data. Our system, ZenCrowd, (1) uses a three-stage blocking technique in order to obtain the best possible instance matches while minimizing both computational complexity and latency, and (2) identifies entities from natural language text using state-of-the-art techniques and automatically connects them to the linked open data cloud. First, we use structured inverted indices to quickly find potential candidate results from entities that have been indexed in our system. Our system then analyzes the candidate matches and refines them whenever deemed necessary using computationally more expensive queries on a graph database. Finally, we resort to human computation by dynamically generating crowdsourcing tasks in case the algorithmic components fail to come up with convincing results. We integrate all results from the inverted indices, from the graph database and from the crowd using a probabilistic framework in order to make sensible decisions about candidate matches and to identify unreliable human workers. In the following, we give an overview of the architecture of our system and describe in detail our novel three-stage blocking technique and our probabilistic decision framework. We also report on a series of experimental results on a standard data set, showing that our system can achieve a 95 % average accuracy on instance matching (as compared to the initial 88 % average accuracy of the purely automatic baseline) while drastically limiting the amount of work performed by the crowd. The experimental evaluation of our system on the entity linking task shows an average relative improvement of 14 % over our best automatic approach.
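
A skeleton of such a three-stage cascade might look as follows; the stage functions are stubs standing in for the inverted-index lookup, the graph-based refinement, and the crowdsourcing fallback, and the confidence threshold is an assumed parameter rather than anything from ZenCrowd itself.

    # Three-stage matching cascade in the spirit of the abstract:
    # cheap candidates -> costlier refinement -> crowd fallback.
    def index_candidates(mention):
        # Stage 1: quick lookup in an inverted index (stubbed).
        return [("dbpedia:Berlin", 0.9), ("dbpedia:Berlin_(band)", 0.4)]

    def graph_refine(mention, candidates):
        # Stage 2: more expensive graph-based re-scoring (stubbed).
        return [(uri, score * 0.95) for uri, score in candidates]

    def ask_crowd(mention, candidates):
        # Stage 3: generate a crowdsourcing task and aggregate answers (stubbed).
        return max(candidates, key=lambda c: c[1])[0]

    def link_entity(mention, threshold=0.8):
        candidates = graph_refine(mention, index_candidates(mention))
        best_uri, best_score = max(candidates, key=lambda c: c[1])
        if best_score >= threshold:
            return best_uri                    # algorithmic stages suffice
        return ask_crowd(mention, candidates)  # fall back to human computation

    print(link_entity("Berlin"))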

89 citations


Cites background from "Pushing the boundaries of crowd-ena..."

  • ..., “10 papers with the most novel ideas”) [22,43]....

Proceedings ArticleDOI
18 Mar 2013
TL;DR: It is shown that by assessing the individual risk a tuple poses with respect to the overall result quality, crowd-sourcing efforts for eliciting missing values can be narrowly focused on only those tuples that would degrade the expected quality most strongly, which leads to an algorithm for computing skyline sets on incomplete data with maximum result quality.
Abstract: Skyline queries are a well-established technique for database query personalization and are widely acclaimed for their intuitive query formulation mechanisms. However, when operating on incomplete datasets, skyline queries are severely hampered and often have to resort to highly error-prone heuristics. Unfortunately, incomplete datasets are a frequent phenomenon, especially when datasets are generated automatically using various information extraction or information integration approaches. Here, the recent trend of crowd-enabled databases promises a powerful solution: during query execution, some database operators can be dynamically outsourced to human workers in exchange for monetary compensation, therefore enabling the elicitation of missing values during runtime. Unfortunately, this powerful feature heavily impacts query response times and (monetary) execution costs. In this paper, we present an innovative hybrid approach combining dynamic crowd-sourcing with heuristic techniques in order to overcome current limitations. We will show that by assessing the individual risk a tuple poses with respect to the overall result quality, crowd-sourcing efforts for eliciting missing values can be narrowly focused on only those tuples that would degrade the expected quality most strongly. This leads to an algorithm for computing skyline sets on incomplete data with maximum result quality, while optimizing crowd-sourcing costs.
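
The following sketch illustrates the underlying intuition on a toy hotel relation (minimize price, maximize rating): a tuple is worth crowd-sourcing only if its skyline membership flips between an optimistic and a pessimistic guess for its missing values. The risk test is a simplified stand-in, not the paper's risk model.

    # Risk-focused value elicitation for skylines over incomplete data.
    # Only tuples whose membership depends on the missing value go to the crowd.
    BEST = {"price": 0.0, "rating": 5.0}
    WORST = {"price": 1e9, "rating": 0.0}

    tuples = [                                 # None marks a missing value
        {"id": "h1", "price": 80.0,  "rating": None},
        {"id": "h2", "price": 120.0, "rating": 4.5},
        {"id": "h3", "price": None,  "rating": 4.0},
        {"id": "h4", "price": 200.0, "rating": 2.0},
    ]

    def dominates(a, b):
        return (a["price"] <= b["price"] and a["rating"] >= b["rating"] and
                (a["price"] < b["price"] or a["rating"] > b["rating"]))

    def fill(t, guess):
        return {k: (guess[k] if v is None else v)
                for k, v in t.items() if k != "id"}

    complete = [fill(t, BEST) for t in tuples if None not in t.values()]

    def risky(t):
        if None not in t.values():
            return False
        optimistic = not any(dominates(c, fill(t, BEST)) for c in complete)
        pessimistic = not any(dominates(c, fill(t, WORST)) for c in complete)
        return optimistic != pessimistic       # membership hinges on the crowd

    print([t["id"] for t in tuples if risky(t)])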

68 citations

Proceedings ArticleDOI
22 Jun 2013
TL;DR: A cost-sensitive quantitative analysis method is proposed to estimate the profit of a crowdsourcing job so that questions with no future profit from crowdsourcing can be terminated; experimental results show that the proposed method outperforms state-of-the-art methods.
Abstract: Crowdsourcing has created a variety of opportunities for many challenging problems by leveraging human intelligence. For example, applications such as image tagging, natural language processing, and semantic-based information retrieval can exploit crowd-based human computation to supplement existing computational algorithms. Naturally, human workers in crowdsourcing solve problems based on their knowledge, experience, and perception. It is therefore not clear which problems can be better solved by crowdsourcing than by traditional machine-based methods alone, and a cost-sensitive quantitative analysis method is needed. In this paper, we design and implement a cost-sensitive method for crowdsourcing. We estimate the profit of the crowdsourcing job online so that questions with no future profit from crowdsourcing can be terminated. Two models are proposed to estimate the profit of a crowdsourcing job, namely the linear value model and the generalized non-linear model. Using these models, the expected profit of obtaining new answers for a specific question is computed based on the answers already received. A question is terminated in real time if the marginal expected profit of obtaining more answers is not positive. We extend the method to publish a batch of questions in a HIT. We evaluate the effectiveness of our proposed method using two real-world jobs on AMT. The experimental results show that our proposed method outperforms all state-of-the-art methods.
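
The stopping rule can be pictured as follows, under an assumed per-answer cost, a fixed worker accuracy, and a simple Bayesian confidence measure standing in for the paper's linear and non-linear value models: buy the next batch of answers for a yes/no question only while its expected profit remains positive.

    # Stop asking the crowd once the marginal expected profit is non-positive.
    # Worker model, cost, and value below are illustrative assumptions.
    from math import comb

    WORKER_ACCURACY = 0.7   # assumed probability a single worker answers correctly
    COST_PER_ANSWER = 0.02  # assumed payment per crowd answer
    VALUE = 1.0             # assumed value of a correctly resolved question

    def posterior_yes(yes, no, q=WORKER_ACCURACY):
        # P(true answer is "yes" | votes), uniform prior over the true answer.
        a = q ** yes * (1 - q) ** no
        b = (1 - q) ** yes * q ** no
        return a / (a + b)

    def confidence(yes, no):
        p = posterior_yes(yes, no)
        return max(p, 1 - p)

    def expected_profit_of_batch(yes, no, batch=3, q=WORKER_ACCURACY):
        # Exact expectation over the next `batch` answers, marginalizing over
        # the unknown true answer and each worker's accuracy.
        p_yes = posterior_yes(yes, no)
        gain = 0.0
        for truth_yes, p_truth in ((True, p_yes), (False, 1 - p_yes)):
            p_vote = q if truth_yes else 1 - q
            for j in range(batch + 1):         # j additional "yes" votes
                p_j = comb(batch, j) * p_vote ** j * (1 - p_vote) ** (batch - j)
                gain += p_truth * p_j * confidence(yes + j, no + batch - j)
        gain -= confidence(yes, no)
        return VALUE * gain - COST_PER_ANSWER * batch

    print(expected_profit_of_batch(2, 1))   # contested vote: profit still positive
    print(expected_profit_of_batch(8, 0))   # unanimous vote: stop asking

With these made-up numbers, the contested 2-vs-1 vote still justifies buying more answers, while the unanimous 8-vs-0 vote does not.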

54 citations


Cites background from "Pushing the boundaries of crowd-ena..."

  • ...[16] expanded database schemas with additional attributes through querying the crowdsourcing systems....

References
Proceedings ArticleDOI
11 Jun 2007
TL;DR: A novel query relaxation scheme enables users to find the best-matching information by exploiting malleable schemas to effectively query vaguely structured information, and ranks the results of the relaxed query according to their respective probability of satisfying the original query's intent.
Abstract: In contrast to classical databases and IR systems, real-world information systems have to deal increasingly with very vague and diverse structures for information management and storage that cannot be adequately handled yet. While current object-relational database systems require clear and unified data schemas, IR systems usually ignore the structured information completely. Malleable schemas, as recently introduced, provide a novel way to deal with vagueness, ambiguity and diversity by incorporating imprecise and overlapping definitions of data structures. In this paper, we propose a novel query relaxation scheme that enables users to find best matching information by exploiting malleable schemas to effectively query vaguely structured information. Our scheme utilizes duplicates in differently described data sets to discover the correlations within a malleable schema, and then uses these correlations to appropriately relax the users' queries. In addition, it ranks results of the relaxed query according to their respective probability of satisfying the original query's intent. We have implemented the scheme and conducted extensive experiments with real-world data to confirm its performance and practicality.
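
A minimal sketch of correlation-based relaxation, assuming the attribute correlations have already been mined from duplicates; the attribute names, weights, and records are hypothetical, and the correlation weight serves as a crude proxy for the probability of satisfying the original intent.

    # Relax a query on one attribute to correlated attributes of a malleable
    # schema and rank hits by the correlation weight. All data is made up.
    correlations = {                # attribute -> [(related attribute, weight)]
        "director": [("director", 1.0), ("filmmaker", 0.9), ("created_by", 0.6)],
    }

    records = [
        {"title": "Movie A", "director": "Jane Doe"},
        {"title": "Movie B", "filmmaker": "Jane Doe"},
        {"title": "Movie C", "created_by": "Jane Doe"},
        {"title": "Movie D", "director": "John Roe"},
    ]

    def relaxed_query(attribute, value):
        hits = []
        for related, weight in correlations.get(attribute, [(attribute, 1.0)]):
            for record in records:
                if record.get(related) == value:
                    hits.append((weight, record["title"]))
        return sorted(hits, reverse=True)

    for weight, title in relaxed_query("director", "Jane Doe"):
        print(round(weight, 2), title)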

63 citations


"Pushing the boundaries of crowd-ena..." refers background in this paper

  • ...On the query side malleability was understood in terms of query relaxation to match more data at runtime [6]....

Proceedings Article
01 Jan 2005
TL;DR: This work considers the question of modeling an application domain whose data may be partially structured and partially unstructured, and proposes the concept of malleable schemas as a modeling tool that enables incorporating both structured and unstructured data from the very beginning, and evolving one's model as it becomes more structured.
Abstract: Large-scale information integration, and in particular, search on the World Wide Web, is pushing the limits on the combination of structured data and unstructured data. By its very nature, as we combine a large number of information sources, our ability to model the domain in a completely structured way diminishes. We argue that in order to build applications that combine structured and unstructured data, there is a need for a new modeling tool. We consider the question of modeling an application domain whose data may be partially structured and partially unstructured. In particular, we are concerned with applications where the border between the structured and unstructured parts of the data is not well defined, not well known in advance, or may evolve over time. We propose the concept of malleable schemas as a modeling tool that enables incorporating both structured and unstructured data from the very beginning, and evolving one's model as it becomes more structured. A malleable schema begins the same way as a traditional schema, but at certain points gradually becomes vague, and we use keywords to describe schema elements such as classes and properties. The key aspect of malleable schemas is that a modeler can capture the important aspects of the domain at modeling time without having to commit to a very strict schema. The vague parts of the schema can later evolve to have more structure, or can remain as such. Users can pose queries in which references to schema elements can be imprecise, and the query processor will consider closely related schema elements as well.

31 citations


"Pushing the boundaries of crowd-ena..." refers background in this paper

  • ...To fully harness the power of the crowd and to provide flexible and adaptive querying, a certain amount of schema malleability [4], [5] is needed....

  • ...Originally, malleable schemas [4], [5] have been designed to deal with the vague nature of both data and queries in the real world....

Proceedings ArticleDOI
23 Oct 2011
TL;DR: This work uses the publicly available user-generated information contained in Wikipedia to identify similarities between items, mapping them to Wikipedia pages and finding similarities in the text and commonalities in the links and categories of each page, in order to improve ranking predictions.
Abstract: One important challenge in the field of recommender systems is the sparsity of available data. This problem limits the ability of recommender systems to provide accurate predictions of user ratings. We overcome this problem by using the publicly available user-generated information contained in Wikipedia. We identify similarities between items by mapping them to Wikipedia pages and finding similarities in the text and commonalities in the links and categories of each page. These similarities can be used in the recommendation process and improve ranking predictions. We find that this method is most effective in cases where ratings are extremely sparse or nonexistent. Preliminary experimental results on the MovieLens dataset are encouraging.
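
The following sketch captures the gist with made-up page features and ratings: item-item similarity is computed from shared Wikipedia links and categories and used to predict a rating for an item that has none; the actual approach also exploits textual similarity and is evaluated on MovieLens.

    # Similarity from Wikipedia links/categories used to fill a sparse rating.
    # Page features and ratings below are illustrative only.
    wiki_features = {
        "Movie A": {"Category:Science fiction", "Link:Space", "Link:Robots"},
        "Movie B": {"Category:Science fiction", "Link:Space", "Link:Aliens"},
        "Movie C": {"Category:Romance", "Link:Paris"},
    }
    known_ratings = {"Movie A": 4.5, "Movie C": 2.0}   # sparse rating data

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    def predict(item):
        sims = [(jaccard(wiki_features[item], wiki_features[other]), r)
                for other, r in known_ratings.items() if other != item]
        total = sum(s for s, _ in sims)
        return sum(s * r for s, r in sims) / total if total else None

    print(predict("Movie B"))   # leans toward Movie A's rating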

29 citations


"Pushing the boundaries of crowd-ena..." refers background in this paper

  • ...Finally, additional meta-data on items beyond plain ratings could be incorporated [25]....

Proceedings ArticleDOI
19 Jul 2010
TL;DR: The proposed selective labeling scheme collects additional labels only for a subset of training samples, specifically for those that are labeled relevant by a judge, and outperforms several methods of using overlapping labels, such as simple k-overlap, majority vote, the highest labels, etc.
Abstract: This paper studies the quality of human labels used to train search engines' rankers. Our specific focus is performance improvements obtained by using overlapping relevance labels, that is, by collecting multiple human judgments for each training sample. The paper explores whether, when, and for which samples one should obtain overlapping training labels, as well as how many labels per sample are needed. The proposed selective labeling scheme collects additional labels only for a subset of training samples, specifically for those that are labeled relevant by a judge. Our experiments show that this labeling scheme improves the NDCG of two Web search rankers on several real-world test sets, with a low labeling overhead of around 1.4 labels per sample. This labeling scheme also outperforms several methods of using overlapping labels, such as simple k-overlap, majority vote, the highest labels, etc. Finally, the paper presents a study of how many overlapping labels are needed to get the best improvement in retrieval accuracy.
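
A sketch of the selective scheme, with simulated judges standing in for human raters: every sample receives one judgment, and overlapping judgments are collected only when the first judge deems the sample relevant; the overlap size and the relevance probabilities are assumptions for illustration.

    # Selective labeling sketch: extra judgments only after a "relevant" label.
    import random

    random.seed(1)

    def ask_judge(sample):                   # stand-in for a human judgment
        return random.random() < sample["true_relevance"]

    samples = [{"id": i, "true_relevance": p}
               for i, p in enumerate([0.9, 0.1, 0.6, 0.2, 0.8])]

    EXTRA_LABELS = 2
    labels_used = 0
    final_labels = {}

    for sample in samples:
        votes = [ask_judge(sample)]
        labels_used += 1
        if votes[0]:                         # "relevant" -> verify with overlap
            votes += [ask_judge(sample) for _ in range(EXTRA_LABELS)]
            labels_used += EXTRA_LABELS
        final_labels[sample["id"]] = sum(votes) > len(votes) / 2

    print(final_labels, "labels per sample:", labels_used / len(samples))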

24 citations


"Pushing the boundaries of crowd-ena..." refers background in this paper

  • ...In [32] and [33] different strategies are developed to infer a single reliable judgment from conflicting responses to the same HIT, mostly by extending the majority voting scheme....

Posted Content
TL;DR: A methodology is proposed to systematically check the claim that meaningful item features can be extracted from collaborative rating data, which is becoming available through social networking services, and initial evidence is presented.
Abstract: Performing effective preference-based data retrieval requires detailed and preferentially meaningful structured information about the current user as well as the items under consideration. A common problem is that representations of items often consist only of mere technical attributes, which do not resemble human perception. This is particularly true for integral items such as movies or songs. It is often claimed that meaningful item features could be extracted from collaborative rating data, which is becoming available through social networking services. However, there is only anecdotal evidence supporting this claim; if it is true, the extracted information could be very valuable for preference-based data retrieval. In this paper, we propose a methodology to systematically check this common claim. We performed a preliminary investigation on a large collection of movie ratings and present initial evidence.

9 citations


"Pushing the boundaries of crowd-ena..." refers background or result in this paper

  • ...Most work in this direction focuses on comparing spaces created by different methods for means of classification [11], [36]....

  • ...However, some initial studies we conducted in [11] indicated the potential benefit of factor models beyond recommendation tasks....
