Journal ArticleDOI

Pushing the boundaries of crowd-enabled databases with query-driven schema expansion

01 Feb 2012 - Vol. 5, Iss. 6, pp. 538-549
TL;DR: This paper extends crowd-enabled databases by flexible query-driven schema expansion, allowing the addition of new attributes to the database at query time, and leverages the user-generated data found in the Social Web to build perceptual spaces.
Abstract: By incorporating human workers into the query execution process, crowd-enabled databases facilitate intelligent, social capabilities like completing missing data at query time or performing cognitive operators. But despite all their flexibility, crowd-enabled databases still maintain rigid schemas. In this paper, we extend crowd-enabled databases by flexible query-driven schema expansion, allowing the addition of new attributes to the database at query time. However, the number of crowd-sourced mini-tasks to fill in missing values may often be prohibitively large and the resulting data quality is doubtful. Instead of simple crowd-sourcing to obtain all values individually, we leverage the user-generated data found in the Social Web: By exploiting user ratings we build perceptual spaces, i.e., highly-compressed representations of opinions, impressions, and perceptions of large numbers of users. Using few training samples obtained by expert crowd sourcing, we can then extract all missing data automatically from the perceptual space with high quality and at low cost. Extensive experiments show that our approach can boost both performance and quality of crowd-enabled databases, while also providing the flexibility to expand schemas in a query-driven fashion.
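A minimal sketch of the pipeline this abstract describes, with every dataset, dimensionality, and model choice assumed for illustration (the paper does not prescribe TruncatedSVD or SVR specifically): user ratings are compressed into a latent "perceptual space", a few expert-labeled samples train a predictor there, and the predictor fills the new attribute column for all items.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVR

# Hypothetical stand-in for Social Web data: an item x user rating matrix.
rng = np.random.default_rng(0)
ratings = rng.random((1000, 500))

# Build the "perceptual space": a highly compressed latent representation
# of the opinions and impressions of many users.
svd = TruncatedSVD(n_components=50, random_state=0)
perceptual_space = svd.fit_transform(ratings)  # one latent vector per item

# Expert crowdsourcing labels only a handful of items with the new
# attribute (e.g., a movie's "suspensefulness"), instead of all 1000.
labeled = rng.choice(1000, size=30, replace=False)
expert_labels = rng.random(30)  # placeholder for real expert judgments

# Learn the attribute in the perceptual space, then fill the whole column:
# this is the query-driven schema expansion step.
model = SVR().fit(perceptual_space[labeled], expert_labels)
new_attribute_column = model.predict(perceptual_space)
```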


Citations
Proceedings ArticleDOI
13 May 2013
TL;DR: This paper proposes and extensively evaluates a different Crowdsourcing approach based on a push methodology that carefully selects which workers should perform a given task, based on worker profiles extracted from social networks, and shows that this approach consistently yields better results than the usual pull strategies.
Abstract: Crowdsourcing makes it possible to build hybrid online platforms that combine scalable information systems with the power of human intelligence to complete tasks that are difficult to tackle for current algorithms. Examples include hybrid database systems that use the crowd to fill missing values or to sort items according to subjective dimensions such as picture attractiveness. Current approaches to Crowdsourcing adopt a pull methodology where tasks are published on specialized Web platforms where workers can pick their preferred tasks on a first-come-first-served basis. While this approach has many advantages, such as simplicity and short completion times, it does not guarantee that the task is performed by the most suitable worker. In this paper, we propose and extensively evaluate a different Crowdsourcing approach based on a push methodology. Our proposed system carefully selects which workers should perform a given task based on worker profiles extracted from social networks. Workers and tasks are automatically matched using an underlying categorization structure that exploits entities extracted from the task descriptions on the one hand, and categories liked by the user on social platforms on the other hand. We experimentally evaluate our approach on tasks of varying complexity and show that our push methodology consistently yields better results than usual pull strategies.
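A hedged sketch of the push idea, with an invented Jaccard scoring function and toy worker profiles (the system's actual matching uses an underlying categorization structure, which is not reproduced here):

```python
def match_score(task_entities, worker_categories):
    """Jaccard overlap between entities extracted from the task description
    and categories the worker has liked on a social platform."""
    if not task_entities or not worker_categories:
        return 0.0
    return len(task_entities & worker_categories) / len(task_entities | worker_categories)

def push_assign(task_entities, workers, k=3):
    """Push methodology: pick the k most suitable workers for the task,
    rather than letting anyone grab it first-come-first-served."""
    ranked = sorted(workers, key=lambda w: match_score(task_entities, workers[w]),
                    reverse=True)
    return ranked[:k]

workers = {"alice": {"databases", "photography"},
           "bob":   {"soccer", "movies"},
           "carol": {"databases", "crowdsourcing"}}
print(push_assign({"databases", "crowdsourcing"}, workers, k=2))
# ['carol', 'alice'] -- carol's liked categories cover the task best
```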

165 citations


Cites background from "Pushing the boundaries of crowd-ena..."

  • ...In the database community, hybrid human-machine database systems have been proposed [12, 23]....


Proceedings ArticleDOI
18 Mar 2013
TL;DR: The problem of evaluating top-k and group-by queries using the crowd to answer either type or value questions is studied, and efficient algorithms that are guaranteed to achieve good results with high probability are given.
Abstract: Group-by and top-k are fundamental constructs in database queries. However, the criteria used for grouping and ordering certain types of data -- such as unlabeled photos clustered by the same person ordered by age -- are difficult to evaluate by machines. In contrast, these tasks are easy for humans to evaluate and are therefore natural candidates for being crowd-sourced. We study the problem of evaluating top-k and group-by queries using the crowd to answer either type or value questions. Given two data elements, the answer to a type question is "yes" if the elements have the same type and therefore belong to the same group or cluster; the answer to a value question orders the two data elements. The assumption here is that there is an underlying ground truth, but that the answers returned by the crowd may sometimes be erroneous. We formalize the problems of top-k and group-by in the crowd-sourced setting, and give efficient algorithms that are guaranteed to achieve good results with high probability. We analyze the crowd-sourced cost of these algorithms in terms of the total number of type and value questions, and show that they are essentially the best possible. We also show that fewer questions are needed when values and types are correlated, or when the error model is one in which the error decreases as the distance between the two elements in the sorted order increases.
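The two question primitives can be illustrated with a simulated, error-free crowd; the paper's algorithms additionally budget questions and tolerate erroneous answers, which this toy omits:

```python
import functools
import random

random.seed(7)
# Hidden ground truth: (id, type/cluster, value). The crowd sees only ids.
elements = [(f"photo{i}", i % 3, random.random()) for i in range(12)]

def ask_type(a, b):
    """Type question: 'are a and b the same person?' (simulated, error-free)."""
    return a[1] == b[1]

def ask_value(a, b):
    """Value question: 'does a look older than b?' (simulated, error-free)."""
    return a[2] > b[2]

# group-by: compare each element against one representative per cluster
clusters = []
for e in elements:
    for c in clusters:
        if ask_type(e, c[0]):
            c.append(e)
            break
    else:
        clusters.append([e])

# top-k inside one cluster, ordered purely by crowd value questions
order = functools.cmp_to_key(lambda a, b: -1 if ask_value(a, b) else 1)
top2 = sorted(clusters[0], key=order)[:2]
print(len(clusters), [e[0] for e in top2])
```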

135 citations

Journal ArticleDOI
01 Oct 2013
TL;DR: The ZenCrowd system uses a three-stage blocking technique in order to obtain the best possible instance matches while minimizing both computational complexity and latency, and identifies entities from natural language text using state-of-the-art techniques and automatically connects them to the linked open data cloud.
Abstract: We tackle the problems of semiautomatically matching linked data sets and of linking large collections of Web pages to linked data. Our system, ZenCrowd, (1) uses a three-stage blocking technique in order to obtain the best possible instance matches while minimizing both computational complexity and latency, and (2) identifies entities from natural language text using state-of-the-art techniques and automatically connects them to the linked open data cloud. First, we use structured inverted indices to quickly find potential candidate results from entities that have been indexed in our system. Our system then analyzes the candidate matches and refines them whenever deemed necessary using computationally more expensive queries on a graph database. Finally, we resort to human computation by dynamically generating crowdsourcing tasks in case the algorithmic components fail to come up with convincing results. We integrate all results from the inverted indices, from the graph database and from the crowd using a probabilistic framework in order to make sensible decisions about candidate matches and to identify unreliable human workers. In the following, we give an overview of the architecture of our system and describe in detail our novel three-stage blocking technique and our probabilistic decision framework. We also report on a series of experimental results on a standard data set, showing that our system can achieve a 95 % average accuracy on instance matching (as compared to the initial 88 % average accuracy of the purely automatic baseline) while drastically limiting the amount of work performed by the crowd. The experimental evaluation of our system on the entity linking task shows an average relative improvement of 14 % over our best automatic approach.
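A schematic of the three-stage cascade in Python, where each stage body is a placeholder (the real system queries inverted indices, a graph database, and a crowdsourcing platform, and combines all results probabilistically):

```python
def stage1_index_lookup(mention, index):
    """Stage 1: cheap candidate generation from a structured inverted index."""
    return index.get(mention.lower(), [])

def stage2_graph_refinement(candidates):
    """Stage 2: placeholder for the costlier graph-database refinement that
    re-scores candidates; a fixed scoring stands in for real graph queries."""
    return [(c, 0.9 if "(city)" in c else 0.6) for c in candidates]

def stage3_crowd_fallback(mention, scored, threshold=0.8):
    """Stage 3: resort to human computation only when no algorithmic match
    is convincing enough."""
    confident = [c for c, s in scored if s >= threshold]
    if confident:
        return confident[0]
    return f"crowd-task({mention})"  # dynamically generated crowdsourcing task

index = {"berlin": ["Berlin (city)", "Berlin (band)"]}
candidates = stage1_index_lookup("Berlin", index)
print(stage3_crowd_fallback("Berlin", stage2_graph_refinement(candidates)))
```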

89 citations


Cites background from "Pushing the boundaries of crowd-ena..."

  • ..., “10 papers with the most novel ideas”) [22,43]....


Proceedings ArticleDOI
18 Mar 2013
TL;DR: It is shown that by assessing the individual risk a tuple poses with respect to the overall result quality, crowd-sourcing efforts for eliciting missing values can be narrowly focused on only those tuples that may degrade the expected quality most strongly, which leads to an algorithm for computing skyline sets on incomplete data with maximum result quality.
Abstract: Skyline queries are a well-established technique for database query personalization and are widely acclaimed for their intuitive query formulation mechanisms. However, when operating on incomplete datasets, skyline queries are severely hampered and often have to resort to highly error-prone heuristics. Unfortunately, incomplete datasets are a frequent phenomenon, especially when datasets are generated automatically using various information extraction or information integration approaches. Here, the recent trend of crowd-enabled databases promises a powerful solution: during query execution, some database operators can be dynamically outsourced to human workers in exchange for monetary compensation, therefore enabling the elicitation of missing values during runtime. Unfortunately, this powerful feature heavily impacts query response times and (monetary) execution costs. In this paper, we present an innovative hybrid approach combining dynamic crowd-sourcing with heuristic techniques in order to overcome current limitations. We will show that by assessing the individual risk a tuple poses with respect to the overall result quality, crowd-sourcing efforts for eliciting missing values can be narrowly focused on only those tuples that may degrade the expected quality most strongly. This leads to an algorithm for computing skyline sets on incomplete data with maximum result quality, while optimizing crowd-sourcing costs.
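A toy illustration of the risk-focusing idea under assumed semantics (smaller is better, `None` marks a missing value); the counting-based risk score below is invented and much simpler than the paper's quality model:

```python
def undecidable(a, b):
    """A dominance test is undecidable if either tuple has a missing value
    in any dimension being compared (smaller is better)."""
    return any(x is None or y is None for x, y in zip(a, b))

def risk(t, data):
    """Toy risk score: the number of dominance tests this incomplete tuple
    leaves open. The paper's actual risk model is more refined."""
    if None not in t:
        return 0
    return sum(undecidable(t, o) for o in data if o is not t)

data = [(1.0, None), (0.5, 0.7), (None, 0.2), (0.9, 0.9)]
budget = 1  # we can afford to crowdsource only one tuple's missing values
incomplete = [t for t in data if None in t]
to_crowdsource = sorted(incomplete, key=lambda t: risk(t, data), reverse=True)[:budget]
print(to_crowdsource)
```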

68 citations

Proceedings ArticleDOI
22 Jun 2013
TL;DR: A cost-sensitive quantitative analysis method estimates the profit of a crowdsourcing job so that questions with no future profit from crowdsourcing can be terminated; experimental results show that the proposed method outperforms all state-of-the-art methods.
Abstract: Crowdsourcing has created a variety of opportunities for many challenging problems by leveraging human intelligence. For example, applications such as image tagging, natural language processing, and semantic-based information retrieval can exploit crowd-based human computation to supplement existing computational algorithms. Naturally, human workers in crowdsourcing solve problems based on their knowledge, experience, and perception. It is therefore not clear which problems can be better solved by crowdsourcing than solving solely using traditional machine-based methods. Therefore, a cost-sensitive quantitative analysis method is needed. In this paper, we design and implement a cost-sensitive method for crowdsourcing. We estimate the profit of the crowdsourcing job online so that questions with no future profit from crowdsourcing can be terminated. Two models are proposed to estimate the profit of a crowdsourcing job, namely the linear value model and the generalized non-linear model. Using these models, the expected profit of obtaining new answers for a specific question is computed based on the answers already received. A question is terminated in real time if the marginal expected profit of obtaining more answers is not positive. We extend the method to publish a batch of questions in a HIT. We evaluate the effectiveness of our proposed method using two real-world jobs on AMT. The experimental results show that our proposed method outperforms all the state-of-the-art methods.
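A sketch of the stopping rule under an assumed linear value model; the confidence measure, prices, and gain estimate are all illustrative, not the paper's:

```python
def majority_confidence(answers):
    """Fraction of answers agreeing with the current majority vote."""
    if not answers:
        return 0.0
    top = max(set(answers), key=answers.count)
    return answers.count(top) / len(answers)

def marginal_profit(answers, value_per_conf=10.0, cost_per_answer=0.05):
    """Expected profit of buying one more answer: the value of the (assumed)
    confidence gain minus the cost of one extra HIT. A toy estimate."""
    gain = (1.0 - majority_confidence(answers)) / (len(answers) + 1)
    return value_per_conf * gain - cost_per_answer

answers = ["cat", "cat", "dog"]
# Terminate the question as soon as more answers stop paying for themselves.
while marginal_profit(answers) > 0:
    answers.append("cat")  # stand-in for a newly purchased crowd answer
print(len(answers), majority_confidence(answers))
```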

54 citations


Cites background from "Pushing the boundaries of crowd-ena..."

  • ...[16] expanded database schemas with additional attributes through querying the crowdsourcing systems....


References
Journal ArticleDOI
TL;DR: In this paper, an associate professor at New York University's Stern School of Business uncovers who the employers are in paid crowdsourcing, what tasks they post, and how much they pay.
Abstract: An associate professor at New York University's Stern School of Business uncovers answers about who the employers are in paid crowdsourcing, what tasks they post, and how much they pay.

750 citations

Journal ArticleDOI
Yehuda Koren1
TL;DR: Modeling temporal dynamics requires a sensitive approach that can better distinguish transient effects from long-term patterns; two leading collaborative filtering recommendation approaches are revamped accordingly.
Abstract: Customer preferences for products are drifting over time. Product perception and popularity are constantly changing as new selection emerges. Similarly, customer inclinations are evolving, leading them to ever redefine their taste. Thus, modeling temporal dynamics is essential for designing recommender systems or general customer preference models. However, this raises unique challenges. Within the ecosystem intersecting multiple products and customers, many different characteristics are shifting simultaneously, while many of them influence each other and often those shifts are delicate and associated with a few data instances. This distinguishes the problem from concept drift explorations, where mostly a single concept is tracked. Classical time-window or instance decay approaches cannot work, as they lose too many signals when discarding data instances. A more sensitive approach is required, which can make better distinctions between transient effects and long-term patterns. We show how to model the time changing behavior throughout the life span of the data. Such a model allows us to exploit the relevant components of all data instances, while discarding only what is modeled as being irrelevant. Accordingly, we revamp two leading collaborative filtering recommendation approaches. Evaluation is made on a large movie-rating dataset underlying the Netflix Prize contest. Results are encouraging and better than those previously reported on this dataset. In particular, methods described in this paper play a significant role in the solution that won the Netflix contest.
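One concrete ingredient from this line of work is the time-drifting user bias b_u(t) = b_u + α_u · dev_u(t), with dev_u(t) = sign(t − t_mean) · |t − t_mean|^β, sketched below; the surrounding prediction rule is simplified and all parameter values are placeholders:

```python
import numpy as np

def dev(t, t_mean, beta=0.4):
    """Time deviation term: sign(t - t_mean) * |t - t_mean|^beta."""
    return np.sign(t - t_mean) * np.abs(t - t_mean) ** beta

def predict(mu, b_u, alpha_u, b_i, p_u, q_i, t, t_mean):
    """Rating prediction with a user bias that drifts over time:
    mu + b_u(t) + b_i + p_u . q_i, where b_u(t) = b_u + alpha_u * dev(t)."""
    return mu + b_u + alpha_u * dev(t, t_mean) + b_i + p_u @ q_i

rng = np.random.default_rng(0)
p_u, q_i = rng.random(20), rng.random(20)  # placeholder latent factors
print(predict(mu=3.6, b_u=0.1, alpha_u=0.02, b_i=-0.3,
              p_u=p_u, q_i=q_i, t=250.0, t_mean=180.0))
```

Unlike time-window or instance-decay schemes, nothing is discarded here: old ratings still contribute, and the drift term absorbs what is transient.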

694 citations

Proceedings ArticleDOI
12 Jun 2011
TL;DR: The design of CrowdDB is described; a major change is that the traditional closed-world assumption for query processing does not hold for human input, and important avenues for future work in the development of crowdsourced query processing systems are outlined.
Abstract: Some queries cannot be answered by machines only. Processing such queries requires human input for providing information that is missing from the database, for performing computationally difficult functions, and for matching, ranking, or aggregating results based on fuzzy criteria. CrowdDB uses human input via crowdsourcing to process queries that neither database systems nor search engines can adequately answer. It uses SQL both as a language for posing complex queries and as a way to model data. While CrowdDB leverages many aspects of traditional database systems, there are also important differences. Conceptually, a major change is that the traditional closed-world assumption for query processing does not hold for human input. From an implementation perspective, human-oriented query operators are needed to solicit, integrate and cleanse crowdsourced data. Furthermore, performance and cost depend on a number of new factors including worker affinity, training, fatigue, motivation and location. We describe the design of CrowdDB, report on an initial set of experiments using Amazon Mechanical Turk, and outline important avenues for future work in the development of crowdsourced query processing systems.
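A hedged sketch of what a human-oriented query operator might look like: a projection that, instead of returning NULL under the closed-world assumption, emits a microtask for the missing value. The task format, pricing, and queue are invented; CrowdDB's actual operators and its CrowdSQL syntax differ.

```python
task_queue = []  # stand-in for a microtask platform such as Mechanical Turk

def crowd_fill(row, column, reward_cents=2):
    """Post a HIT asking a worker for row[column]; return a pending marker
    so the query can proceed under open-world semantics."""
    task_queue.append({"question": f"What is the {column} of {row['name']}?",
                       "reward_cents": reward_cents})
    return "<pending>"

def project(rows, column):
    """Projection operator that routes missing values to the crowd instead
    of silently returning NULL."""
    return [row[column] if column in row else crowd_fill(row, column)
            for row in rows]

companies = [{"name": "IBM", "hq": "Armonk"}, {"name": "AcmeCorp"}]
print(project(companies, "hq"))
print(task_queue)
```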

688 citations


"Pushing the boundaries of crowd-ena..." refers background in this paper

  • ...sourcing platform can only utilize a relatively small human worker pool [1]....


  • ...• Performance: As outlined by [1], a large number of HITs(1) may have to be issued for each crowd-enabled query....


Proceedings ArticleDOI
21 Aug 2011
TL;DR: A novel algorithm, called DSGD, approximately factors large matrices with millions of rows, millions of columns, and billions of nonzero elements, and can be fully distributed and run on web-scale datasets using, e.g., MapReduce.
Abstract: We provide a novel algorithm to approximately factor large matrices with millions of rows, millions of columns, and billions of nonzero elements. Our approach rests on stochastic gradient descent (SGD), an iterative stochastic optimization algorithm. We first develop a novel "stratified" SGD variant (SSGD) that applies to general loss-minimization problems in which the loss function can be expressed as a weighted sum of "stratum losses." We establish sufficient conditions for convergence of SSGD using results from stochastic approximation theory and regenerative process theory. We then specialize SSGD to obtain a new matrix-factorization algorithm, called DSGD, that can be fully distributed and run on web-scale datasets using, e.g., MapReduce. DSGD can handle a wide variety of matrix factorizations. We describe the practical techniques used to optimize performance in our DSGD implementation. Experiments suggest that DSGD converges significantly faster and has better scalability properties than alternative algorithms.
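The SGD kernel that DSGD distributes can be sketched as follows; the stratification itself (partitioning the matrix into row- and column-disjoint blocks whose updates commute) is only indicated in a comment, and all hyperparameters are illustrative:

```python
import random
import numpy as np

def sgd_mf(entries, n_rows, n_cols, rank=10, lr=0.05, reg=0.02, epochs=200):
    """Factor a sparse matrix given as (i, j, value) triples: V ~ W @ H.T."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(n_rows, rank))
    H = rng.normal(scale=0.1, size=(n_cols, rank))
    for _ in range(epochs):
        random.shuffle(entries)
        # DSGD's key observation: entries from row- and column-disjoint
        # blocks ("strata") touch disjoint rows of W and H, so their SGD
        # updates can run on different machines in parallel.
        for i, j, v in entries:
            err = v - W[i] @ H[j]
            w_old = W[i].copy()
            W[i] += lr * (err * H[j] - reg * W[i])
            H[j] += lr * (err * w_old - reg * H[j])
    return W, H

entries = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]
W, H = sgd_mf(entries, n_rows=3, n_cols=2)
print(W[0] @ H[0])  # should approximate the observed value 5.0
```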

669 citations


"Pushing the boundaries of crowd-ena..." refers methods in this paper

  • ...This optimization problem can be solved efficiently using stochastic gradient descent or alternating least squares methods, even on large data sets [13]....


Journal Article
TL;DR: It is shown how the concave-convex procedure can be applied to transductive SVMs, which traditionally require solving a combinatorial search problem; this provides for the first time a highly scalable algorithm in the nonlinear case.
Abstract: We show how the concave-convex procedure can be applied to transductive SVMs, which traditionally require solving a combinatorial search problem. This provides for the first time a highly scalable algorithm in the nonlinear case. Detailed experiments verify the utility of our approach. Software is available at http://www.kyb.tuebingen.mpg.de/bs/people/fabee/transduction.html .
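The CCCP iteration itself is easy to demonstrate on a one-dimensional toy objective (this is deliberately not the transductive SVM problem, just the convex-plus-concave decomposition and tangent-linearization step the paper relies on):

```python
# Decompose the objective f(x) = (x - 2)^2 - 0.5 x^2 into a convex part and
# a concave part, then iterate: linearize the concave part at the current
# point and minimize the resulting convex upper bound in closed form.

def d_concave(x):
    """Derivative of the concave part J_cav(x) = -0.5 x^2."""
    return -x

x = 5.0
for _ in range(50):
    g = d_concave(x)
    # minimize (x - 2)^2 + g * x:  2(x - 2) + g = 0  ->  x = 2 - g / 2
    x = 2.0 - g / 2.0
print(x)  # converges to 4.0, the minimizer of the full objective
```

Each step solves an easy convex subproblem, which is what makes the approach scale where combinatorial search for transductive SVMs does not.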

497 citations


"Pushing the boundaries of crowd-ena..." refers methods in this paper

  • ...org/ CCCP-SVM approach, which has been reported to work significantly faster [28]....
