Journal ArticleDOI

Pushing the boundaries of crowd-enabled databases with query-driven schema expansion

01 Feb 2012 - Vol. 5, Iss. 6, pp. 538-549
TL;DR: This paper extends crowd-enabled databases by flexible query-driven schema expansion, allowing the addition of new attributes to the database at query time, and leverages the user-generated data found in the Social Web to build perceptual spaces.
Abstract: By incorporating human workers into the query execution process, crowd-enabled databases facilitate intelligent, social capabilities like completing missing data at query time or performing cognitive operators. But despite all their flexibility, crowd-enabled databases still maintain rigid schemas. In this paper, we extend crowd-enabled databases by flexible query-driven schema expansion, allowing the addition of new attributes to the database at query time. However, the number of crowd-sourced mini-tasks to fill in missing values may often be prohibitively large and the resulting data quality is doubtful. Instead of simple crowd-sourcing to obtain all values individually, we leverage the user-generated data found in the Social Web: by exploiting user ratings we build perceptual spaces, i.e., highly compressed representations of opinions, impressions, and perceptions of large numbers of users. Using few training samples obtained by expert crowd sourcing, we can then extract all missing data automatically from the perceptual space with high quality and at low cost. Extensive experiments show that our approach can boost both performance and quality of crowd-enabled databases, while also providing the flexibility to expand schemas in a query-driven fashion.
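The approach sketched in the abstract can be illustrated with a small, hypothetical example: factorize a user-item rating matrix into a low-dimensional perceptual space, then train a regressor on a handful of expert-labeled items so that the new attribute can be predicted for every remaining item. This is a minimal sketch assuming NumPy and scikit-learn; the rating matrix, factor dimensionality, item ids, and SVR settings are illustrative placeholders, not the paper's actual configuration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVR

# Toy user-item rating matrix (rows: users, columns: items); values 0..5.
rng = np.random.default_rng(0)
ratings = rng.integers(0, 6, size=(500, 200)).astype(float)

# "Perceptual space": a highly compressed representation of the items,
# obtained here by factorizing the (items x users) rating matrix.
svd = TruncatedSVD(n_components=20, random_state=0)
item_space = svd.fit_transform(ratings.T)             # (200 items, 20 factors)

# A few expert judgments for the new attribute (e.g. "degree of funniness"),
# obtained via crowdsourcing; indices and values are made up.
labeled_items = [0, 5, 17, 42, 99, 150]
expert_labels = np.array([0.9, 0.1, 0.7, 0.3, 0.8, 0.2])

# Fit a regressor on the few labeled items, then fill in the missing
# attribute for every item directly from the perceptual space.
regressor = SVR(kernel="rbf", C=1.0).fit(item_space[labeled_items], expert_labels)
predicted_attribute = regressor.predict(item_space)    # one value per item
```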


Citations
01 Aug 2014
TL;DR: A general Bayesian framework for leveraging disparate categories of workers on the cost-accuracy scale is proposed, asking the right worker the right kind of question at the right time in order to obtain accurate assessments most cost-effectively for test collection construction.
Abstract: Current test collection construction methodologies for Information Retrieval evaluation generally rely on large numbers of document relevance assessments, obtained from experts at great cost. Recently, the use of inexpensive crowd workers has been proposed instead. However, while crowd workers are inexpensive, their assessments are also generally highly inaccurate, rendering their collective assessments far less useful than those obtained from experts in the traditional manner. Our thesis is that instead of using either experts or crowd workers, one can obtain the advantages of both (inexpensive and accurate assessments) by optimally combining them. Another related problem in Information Retrieval evaluation is asking the right kind of question to the assessors when collecting relevance judgments. Traditional methods of collecting relevance judgments are based on collecting binary or graded nominal judgments, but such judgments are limited by factors such as inter-assessor disagreement and the arbitrariness of grades. Previous research has shown that it is easier for assessors to make pairwise preference judgments. However, unless the preferences collected are largely transitive, it is not clear how to combine them in order to obtain document relevance scores. Another difficulty is that the number of pairs that need to be assessed is quadratic in the number of documents. We show how to combine a linear number of pairwise preference judgments from multiple assessors to compute relevance scores for every document. We propose a general Bayesian framework for leveraging disparate categories of workers on the cost-accuracy scale, asking the right worker the right kind of question at the right time in order to obtain accurate assessments most cost-effectively for test collection construction. Experiments with Mechanical Turk workers and expert assessors show promising results for our framework.
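The abstract does not spell out the model used, but the step of turning a linear number of pairwise preference judgments into per-document relevance scores can be illustrated with a standard Bradley-Terry fit. The sketch below is purely illustrative (hypothetical preference pairs, plain NumPy) and is not claimed to be the cited framework.

```python
import numpy as np

def bradley_terry(n_docs, preferences, iters=200):
    """Estimate per-document scores from (winner, loser) preference pairs
    using the standard minorization-maximization updates."""
    wins = np.zeros((n_docs, n_docs))
    for winner, loser in preferences:
        wins[winner, loser] += 1.0
    scores = np.ones(n_docs)
    for _ in range(iters):
        for i in range(n_docs):
            total_wins = wins[i].sum()
            denom = sum((wins[i, j] + wins[j, i]) / (scores[i] + scores[j])
                        for j in range(n_docs) if j != i)
            if denom > 0:
                scores[i] = max(total_wins, 1e-9) / denom
        scores /= scores.sum()   # fix the scale; BT scores are scale-invariant
    return scores

# Hypothetical pairwise judgments collected from several assessors.
prefs = [(0, 1), (0, 2), (1, 2), (0, 1), (2, 1)]
print(bradley_terry(3, prefs))   # doc 0 should receive the highest score
```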

2 citations


Cites methods from "Pushing the boundaries of crowd-ena..."

  • ...Crowdsourcing has been used in database community for building hybrid human-machine database systems [18, 19]....


Book ChapterDOI
07 Nov 2016
TL;DR: This paper argues for an impact-driven quality control model, which fulfills the impact-sourcing vision, thus materializing the social responsibility aspect of crowdsourcing, while ensuring high quality results.
Abstract: Crowdsourcing has been gaining increasing popularity as a highly distributed digital solution that surpasses both borders and time zones. Moreover, it extends economic opportunities to developing countries, thus answering the call of impact sourcing to improve the welfare of poor laborers in need. Nevertheless, it is constantly criticized for the associated quality problems and risks. Attempting to mitigate these risks, a rich body of research has been dedicated to designing countermeasures against free riders and spammers, who compromise the overall quality of the results, and whose undetected presence ruins the financial prospects for other honest workers. Such quality risks materialize even more severely with imbalanced crowdsourcing tasks. In fact, while surveying this literature, a common rule of thumb can indeed be derived: the easier it is to cheat the system and go undetected, the more restrictive and across-the-board discriminating the countermeasures taken. Hence, honest yet low-skilled workers are also placed on par with spammers, and consequently exposed and deprived of much-needed earnings. Therefore, in this paper, we argue for an impact-driven quality control model, which fulfills the impact-sourcing vision, thus materializing the social responsibility aspect of crowdsourcing, while ensuring high quality results.

2 citations


Cites background from "Pushing the boundaries of crowd-ena..."

  • ...Generally, such hiring seeks intelligent information processing skills for numerous tasks, ranging from content annotation [1], information extraction [2], to more complex tasks like sentiment analysis [3] and crowd-enabled database retrieval [4]....


01 Jan 2015
TL;DR: This paper will discuss how to use unstructured reviews to build a structured semantic representation of database items, enabling the implementation of semantic queries and further machine-learning analytics.
Abstract: Social judgements like comments, reviews, discussions, or ratings have become a ubiquitous component of most Web applications, especially in the e-commerce domain. Now, a central challenge is using these judgements to improve the user experience by offering new query paradigms or better data analytics. Recommender systems have already demonstrated how ratings can be effectively used towards that end, allowing users to semantically explore even large item databases. In this paper, we will discuss how to use unstructured reviews to build a structured semantic representation of database items, enabling the implementation of semantic queries and further machine-learning analytics. Thus, we address one of the central challenges of Big Data: making sense of huge collections of unstructured user feedback.
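As a rough illustration of the idea (not the paper's actual pipeline), each database item can be embedded by pooling its reviews into a TF-IDF vector and compressing it with a truncated SVD, yielding a dense, structured representation that supports semantic similarity queries. The review texts and dimensionalities below are toy placeholders, assuming scikit-learn is available.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# One text per database item: all of its reviews pooled together (toy data).
item_reviews = {
    "item_a": "great battery life but the screen feels cheap",
    "item_b": "screen is gorgeous, battery drains far too quickly",
    "item_c": "solid build quality and the battery lasts for days",
}

tfidf = TfidfVectorizer().fit_transform(item_reviews.values())
svd = TruncatedSVD(n_components=2, random_state=0)   # tiny space for toy data
item_vectors = svd.fit_transform(tfidf)              # structured representation

# A simple semantic query: which items are perceived most similarly to item_a?
similarities = cosine_similarity(item_vectors[:1], item_vectors)[0]
for name, score in zip(item_reviews, similarities):
    print(name, round(float(score), 3))
```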

2 citations


Cites background or methods from "Pushing the boundaries of crowd-ena..."

  • ...Our Perceptual Spaces introduced in [9] and [6] rely on a factor model using the following assumptions: Perceptual Spaces use the established assumption that item ratings in the Social Web are a result of a user’s preferences with respect to an item’s attributes [10]....


  • ...In [9], we have shown that certain perceived properties (like the degree of funniness) can be made explicit with only minimal human input using crowdsourcing-based machine regression....


  • ...As our experiments in [9] showed, quality of perceptual spaces increase with the involvement and activity of users: rating data obtained from a restaurant data set (where...


  • ...In the following, we evaluate different review-based embeddings in comparison with our rating-based perceptual space [9] as a baseline....


Journal ArticleDOI
TL;DR: A declarative meta-language, called VisFlow, for requirement specification, and a translator for mapping requirements into executable queries in a variant of SQL augmented with integration artefacts are presented.
Abstract: Data integration continues to baffle researchers even though substantial progress has been made. Although the emergence of technologies such as XML, web services, the semantic web, and cloud computing has helped, a system in which biologists are comfortable articulating new applications and developing them without technical assistance from a computing expert is yet to be realised. The distance between a friendly graphical interface that does little and a 'traditional' system, clunky yet powerful, is deemed too great more often than not. The question that remains unanswered is whether a user can state her query involving a set of complex, heterogeneous and distributed life sciences resources in an easy-to-use language and execute it without further help from a computer-savvy programmer. In this paper, we present a declarative meta-language, called VisFlow, for requirement specification, and a translator for mapping requirements into executable queries in a variant of SQL augmented with integration artefacts.

2 citations

Journal ArticleDOI
31 Jul 2014
TL;DR: Query results are used to help non-expert users in using the multi-database environment and to improve the performance of the multi-database environment, which not only uses disk and memory resources but also relies heavily on network bandwidth.
Abstract: This paper proposes NoXperanto, a novel crowdsourcing approach to address querying over data collections managed by polyglot persistence settings. The main contribution of NoXperanto is the ability to solve complex queries involving different data stores by exploiting queries from expert users (i.e., a crowd of database administrators, data engineers, domain experts, etc.), assuming that these users can submit meaningful queries. NoXperanto exploits the results of "meaningful queries" in order to facilitate the forthcoming query answering processes. In particular, query results are used to: (i) help non-expert users in using the multi-database environment and (ii) improve the performance of the multi-database environment, which not only uses disk and memory resources, but relies heavily on network bandwidth. NoXperanto employs a layer to keep track of the information produced by the crowd, modeled as a Property Graph and managed in a Graph Database Management System (GDBMS). Index Terms: polyglot persistence, crowdsourcing, multi-databases, big data, property graph, graph databases.
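The crowd-knowledge layer described above can be pictured as a property graph. The following sketch uses networkx purely for illustration; node and edge attribute names are hypothetical and do not reflect NoXperanto's actual schema.

```python
import networkx as nx

# Hypothetical sketch of the crowd-knowledge layer: expert queries and the
# data stores they touch, tracked as a property graph.
graph = nx.MultiDiGraph()

graph.add_node("q1", kind="query", author="dba_alice",
               text="match orders with shipping delays over 3 days")
graph.add_node("orders_sql", kind="datastore", engine="PostgreSQL")
graph.add_node("shipping_docs", kind="datastore", engine="MongoDB")

graph.add_edge("q1", "orders_sql", relation="reads", rows_returned=1200)
graph.add_edge("q1", "shipping_docs", relation="reads", rows_returned=87)

# Later, a non-expert request can be routed via prior expert queries that
# already span the relevant stores.
for _, store, data in graph.out_edges("q1", data=True):
    print(f"q1 {data['relation']} {store} ({graph.nodes[store]['engine']})")
```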

2 citations

References
More filters
Journal ArticleDOI
TL;DR: A new method for automatic indexing and retrieval to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.
Abstract: A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100 item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
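The LSI procedure described here (truncated SVD over a term-document matrix, with queries folded in as pseudo-documents and ranked by cosine similarity) can be reproduced in miniature. This sketch assumes scikit-learn and uses far fewer than the ca. 100 factors mentioned above because the toy collection is tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "relation of user perceived response time to error measurement",
    "the generation of random binary unordered trees",
]

# Term-document counts, then a low-rank approximation via truncated SVD.
vectorizer = CountVectorizer()
term_doc = vectorizer.fit_transform(docs)             # documents x terms
lsi = TruncatedSVD(n_components=2, random_state=0)    # "ca. 100" in the paper
doc_vectors = lsi.fit_transform(term_doc)

# A query is folded in as a pseudo-document and ranked by cosine similarity.
query_vector = lsi.transform(vectorizer.transform(["user response time"]))
print(cosine_similarity(query_vector, doc_vectors)[0])
```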

12,443 citations


"Pushing the boundaries of crowd-ena..." refers methods in this paper

  • ...Furthermore, we can show that approaches based on classification using metadata and LSI lead to surprisingly bad results (g-mean between 0.41 and 0.50), and show even worse accuracy than randomly applying labels....


  • ...This is implemented by using Latent Semantic Indexing (LSI) [21] to generate a 100-dimensional “metadata space” from movie attributes like title, plot, main actors, directors, year, runtime, and country as recorded in IMDb....


Journal ArticleDOI
TL;DR: This tutorial gives an overview of the basic ideas underlying Support Vector (SV) machines for function estimation, and includes a summary of currently used algorithms for training SV machines, covering both the quadratic programming part and advanced methods for dealing with large datasets.
Abstract: In this tutorial we give an overview of the basic ideas underlying Support Vector (SV) machines for function estimation. Furthermore, we include a summary of currently used algorithms for training SV machines, covering both the quadratic (or convex) programming part and advanced methods for dealing with large datasets. Finally, we mention some modifications and extensions that have been applied to the standard SV algorithm, and discuss the aspect of regularization from a SV perspective.
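A minimal function-estimation example in the spirit of the tutorial: fit an epsilon-insensitive support vector regressor to noisy samples of an unknown function. The data and hyperparameters below are illustrative, assuming scikit-learn.

```python
import numpy as np
from sklearn.svm import SVR

# Noisy samples of an unknown function; SVR estimates it from the data.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(60, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=60)

# Epsilon-insensitive loss with an RBF kernel; hyperparameters are illustrative.
model = SVR(kernel="rbf", C=10.0, epsilon=0.05).fit(X, y)
print("estimate at x = 1.5:", model.predict([[1.5]])[0])
print("number of support vectors:", len(model.support_))
```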

10,696 citations


"Pushing the boundaries of crowd-ena..." refers methods in this paper

  • ...Instead of relying on non-linear regression, we can use an SVM classifier [19]....


Journal ArticleDOI
TL;DR: A critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario is provided.
Abstract: With the continuous expansion of data availability in many large-scale, complex, and networked systems, such as surveillance, security, Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data to support decision-making processes. Although existing knowledge discovery and data engineering techniques have shown great success in many real-world applications, the problem of learning from imbalanced data (the imbalanced learning problem) is a relatively new challenge that has attracted growing attention from both academia and industry. The imbalanced learning problem is concerned with the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews. Due to the inherent complex characteristics of imbalanced data sets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation. In this paper, we provide a comprehensive review of the development of research in learning from imbalanced data. Our focus is to provide a critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario. Furthermore, in order to stimulate future research in this field, we also highlight the major opportunities and challenges, as well as potential important research directions for learning from imbalanced data.

6,320 citations


"Pushing the boundaries of crowd-ena..." refers background in this paper

  • ...A popular measure of classification performance in the presence of class imbalance is the g-mean measure [20], which is the geometric mean of sensitivity (accuracy on all movies truly belonging to the genre) and specificity (accuracy on all movies truly not belonging to the genre). As the g-mean punishes significant differences between sensitivity and specificity, the above naïve classifier would achieve 0% g-mean....

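The g-mean described in the excerpt above is straightforward to compute from a confusion matrix; the labels below are hypothetical and the sketch assumes scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = movie belongs to the genre, 0 = it does not.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0, 0, 1, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # accuracy on items truly in the genre
specificity = tn / (tn + fp)   # accuracy on items truly not in the genre
g_mean = np.sqrt(sensitivity * specificity)

# A classifier that always answers "not in genre" looks accurate on
# imbalanced data but has sensitivity 0, hence a g-mean of 0.
print(round(float(g_mean), 3))
```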

Proceedings Article
03 Dec 1996
TL;DR: This work compares support vector regression (SVR) with a committee regression technique (bagging) based on regression trees and ridge regression done in feature space and expects that SVR will have advantages in high dimensionality space because SVR optimization does not depend on the dimensionality of the input space.
Abstract: A new regression technique based on Vapnik's concept of support vectors is introduced. We compare support vector regression (SVR) with a committee regression technique (bagging) based on regression trees and ridge regression done in feature space. On the basis of these experiments, it is expected that SVR will have advantages in high dimensionality space because SVR optimization does not depend on the dimensionality of the input space.

4,009 citations


"Pushing the boundaries of crowd-ena..." refers methods in this paper

  • ...perceptual space, we suggest to use Support Vector Regression Machines (SVMs) [14], which are a highly flexible technique to perform non-linear regression and classification, and have been proven to be effective when dealing with perceptual data [15]....


BookDOI
31 Mar 2010
TL;DR: Semi-supervised learning (SSL) as discussed by the authors is the middle ground between supervised learning (in which all training examples are labeled) and unsupervised learning (in which no label data are given).
Abstract: In the field of machine learning, semi-supervised learning (SSL) occupies the middle ground, between supervised learning (in which all training examples are labeled) and unsupervised learning (in which no label data are given). Interest in SSL has increased in recent years, particularly because of application domains in which unlabeled data are plentiful, such as images, text, and bioinformatics. This first comprehensive overview of SSL presents state-of-the-art algorithms, a taxonomy of the field, selected applications, benchmark experiments, and perspectives on ongoing and future research. Semi-Supervised Learning first presents the key assumptions and ideas underlying the field: smoothness, cluster or low-density separation, manifold structure, and transduction. The core of the book is the presentation of SSL methods, organized according to algorithmic strategies. After an examination of generative models, the book describes algorithms that implement the low-density separation assumption, graph-based methods, and algorithms that perform two-step learning. The book then discusses SSL applications and offers guidelines for SSL practitioners by analyzing the results of extensive benchmark experiments. Finally, the book looks at interesting directions for SSL research. The book closes with a discussion of the relationship between semi-supervised learning and transduction. Adaptive Computation and Machine Learning series
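As a small illustration of the graph-based family of SSL algorithms surveyed in the book (not tied to any particular chapter), label spreading propagates a handful of known labels across unlabeled points along the data manifold. The dataset and parameters below are illustrative, assuming scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

# Mostly-unlabeled data: only six points carry a label (-1 means unlabeled).
X, y_true = make_moons(n_samples=200, shuffle=False, noise=0.05, random_state=0)
y = np.full(200, -1)
y[:3] = y_true[:3]            # a few labeled points from class 0
y[100:103] = y_true[100:103]  # a few labeled points from class 1

# Graph-based SSL: labels spread along the data manifold to unlabeled points.
model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y)
accuracy = (model.transduction_ == y_true).mean()
print(f"transductive accuracy: {accuracy:.2f}")
```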

3,773 citations