Journal ArticleDOI

Pushing the boundaries of crowd-enabled databases with query-driven schema expansion

01 Feb 2012 - Vol. 5, Iss. 6, pp. 538-549
TL;DR: This paper extends crowd-enabled databases by flexible query-driven schema expansion, allowing the addition of new attributes to the database at query time, and leverages the user-generated data found in the Social Web to build perceptual spaces.
Abstract: By incorporating human workers into the query execution process, crowd-enabled databases facilitate intelligent, social capabilities like completing missing data at query time or performing cognitive operators. But despite all their flexibility, crowd-enabled databases still maintain rigid schemas. In this paper, we extend crowd-enabled databases by flexible query-driven schema expansion, allowing the addition of new attributes to the database at query time. However, the number of crowd-sourced mini-tasks to fill in missing values may often be prohibitively large and the resulting data quality is doubtful. Instead of simple crowd-sourcing to obtain all values individually, we leverage the user-generated data found in the Social Web: by exploiting user ratings we build perceptual spaces, i.e., highly compressed representations of opinions, impressions, and perceptions of large numbers of users. Using few training samples obtained by expert crowd sourcing, we can then extract all missing data automatically from the perceptual space with high quality and at low cost. Extensive experiments show that our approach can boost both performance and quality of crowd-enabled databases, while also providing the flexibility to expand schemas in a query-driven fashion.
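The approach sketched in the abstract can be illustrated with a small, hypothetical example: factorize a user-item rating matrix into a low-dimensional perceptual space, then train a regressor on a handful of expert-labeled items so that the new attribute can be predicted for every remaining item. This is a minimal sketch assuming NumPy and scikit-learn; the rating matrix, factor dimensionality, item ids, and SVR settings are illustrative placeholders, not the paper's actual configuration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVR

# Toy user-item rating matrix (rows: users, columns: items); values 0..5.
rng = np.random.default_rng(0)
ratings = rng.integers(0, 6, size=(500, 200)).astype(float)

# "Perceptual space": a highly compressed representation of the items,
# obtained here by factorizing the (items x users) rating matrix.
svd = TruncatedSVD(n_components=20, random_state=0)
item_space = svd.fit_transform(ratings.T)             # (200 items, 20 factors)

# A few expert judgments for the new attribute (e.g. "degree of funniness"),
# obtained via crowdsourcing; indices and values are made up.
labeled_items = [0, 5, 17, 42, 99, 150]
expert_labels = np.array([0.9, 0.1, 0.7, 0.3, 0.8, 0.2])

# Fit a regressor on the few labeled items, then fill in the missing
# attribute for every item directly from the perceptual space.
regressor = SVR(kernel="rbf", C=1.0).fit(item_space[labeled_items], expert_labels)
predicted_attribute = regressor.predict(item_space)    # one value per item
```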


Citations
01 Aug 2014
TL;DR: A general Bayesian framework for leveraging disparate categories of workers on the cost-accuracy scale is proposed, asking the right worker the right kind of question at the right time in order to obtain accurate assessments most cost-effectively for test collection construction.
Abstract: Current test collection construction methodologies for Information Retrieval evaluation generally rely on large numbers of document relevance assessments, obtained from experts at great cost. Recently, the use of inexpensive crowd workers has been proposed instead. However, while crowd workers are inexpensive, their assessments are also generally highly inaccurate, rendering their collective assessments far less useful than those obtained from experts in the traditional manner. Our thesis is that instead of using either experts or crowd workers, one can obtain the advantages of both (inexpensive and accurate assessments) by optimally combining them. Another related problem in Information Retrieval evaluation is asking the right kind of question to the assessors when collecting relevance judgments. Traditional methods of collecting relevance judgments are based on collecting binary or graded nominal judgments, but such judgments are limited by factors such as inter-assessor disagreement and the arbitrariness of grades. Previous research has shown that it is easier for assessors to make pairwise preference judgments. However, unless the preferences collected are largely transitive, it is not clear how to combine them in order to obtain document relevance scores. Another difficulty is that the number of pairs that need to be assessed is quadratic in the number of documents. We show how to combine a linear number of pairwise preference judgments from multiple assessors to compute relevance scores for every document. We propose a general Bayesian framework for leveraging disparate categories of workers on the cost-accuracy scale, asking the right worker the right kind of question at the right time in order to obtain accurate assessments most cost-effectively for test collection construction. Experiments with Mechanical Turk workers and expert assessors show promising results for our framework.
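The abstract does not spell out the model used, but the step of turning a linear number of pairwise preference judgments into per-document relevance scores can be illustrated with a standard Bradley-Terry fit. The sketch below is purely illustrative (hypothetical preference pairs, plain NumPy) and is not claimed to be the cited framework.

```python
import numpy as np

def bradley_terry(n_docs, preferences, iters=200):
    """Estimate per-document scores from (winner, loser) preference pairs
    using the standard minorization-maximization updates."""
    wins = np.zeros((n_docs, n_docs))
    for winner, loser in preferences:
        wins[winner, loser] += 1.0
    scores = np.ones(n_docs)
    for _ in range(iters):
        for i in range(n_docs):
            total_wins = wins[i].sum()
            denom = sum((wins[i, j] + wins[j, i]) / (scores[i] + scores[j])
                        for j in range(n_docs) if j != i)
            if denom > 0:
                scores[i] = max(total_wins, 1e-9) / denom
        scores /= scores.sum()   # fix the scale; BT scores are scale-invariant
    return scores

# Hypothetical pairwise judgments collected from several assessors.
prefs = [(0, 1), (0, 2), (1, 2), (0, 1), (2, 1)]
print(bradley_terry(3, prefs))   # doc 0 should receive the highest score
```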

2 citations


Cites methods from "Pushing the boundaries of crowd-ena..."

  • ...Crowdsourcing has been used in database community for building hybrid human-machine database systems [18, 19]....


Book ChapterDOI
07 Nov 2016
TL;DR: This paper argues for an impact-driven quality control model, which fulfills the impact-sourcing vision, thus materializing the social responsibility aspect of crowdsourcing, while ensuring high quality results.
Abstract: Crowdsourcing has been gaining increasing popularity as a highly distributed digital solution that surpasses both borders and time zones. Moreover, it extends economic opportunities to developing countries, thus answering the call of impact sourcing to improve the welfare of poor laborers in need. Nevertheless, it is constantly criticized for the associated quality problems and risks. Attempting to mitigate these risks, a rich body of research has been dedicated to designing countermeasures against free riders and spammers, who compromise the overall quality of the results, and whose undetected presence ruins the financial prospects for other honest workers. Such quality risks materialize even more severely with imbalanced crowdsourcing tasks. In fact, while surveying this literature, a common rule of thumb can indeed be derived: the easier it is to cheat the system and go undetected, the more restrictive and across-the-board discriminating the countermeasures taken. Hence, honest yet low-skilled workers are also placed on par with spammers, and consequently exposed and deprived of much-needed earnings. Therefore, in this paper, we argue for an impact-driven quality control model, which fulfills the impact-sourcing vision, thus materializing the social responsibility aspect of crowdsourcing, while ensuring high quality results.

2 citations


Cites background from "Pushing the boundaries of crowd-ena..."

  • ...Generally, such hiring seeks intelligent information processing skills for numerous tasks, ranging from content annotation [1], information extraction [2], to more complex tasks like sentiment analysis [3] and crowd-enabled database retrieval [4]....


01 Jan 2015
TL;DR: This paper will discuss how to use unstructured reviews to build a structured semantic representation of database items, enabling the implementation of semantic queries and further machine-learning analytics.
Abstract: Social judgements like comments, reviews, discussions, or ratings have become a ubiquitous component of most Web applications, especially in the e-commerce domain. Now, a central challenge is using these judgements to improve the user experience by offering new query paradigms or better data analytics. Recommender systems have already demonstrated how ratings can be effectively used towards that end, allowing users to semantically explore even large item databases. In this paper, we will discuss how to use unstructured reviews to build a structured semantic representation of database items, enabling the implementation of semantic queries and further machine-learning analytics. Thus, we address one of the central challenges of Big Data: making sense of huge collections of unstructured user feedback.
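As a rough illustration of the idea (not the paper's actual pipeline), each database item can be embedded by pooling its reviews into a TF-IDF vector and compressing it with a truncated SVD, yielding a dense, structured representation that supports semantic similarity queries. The review texts and dimensionalities below are toy placeholders, assuming scikit-learn is available.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# One text per database item: all of its reviews pooled together (toy data).
item_reviews = {
    "item_a": "great battery life but the screen feels cheap",
    "item_b": "screen is gorgeous, battery drains far too quickly",
    "item_c": "solid build quality and the battery lasts for days",
}

tfidf = TfidfVectorizer().fit_transform(item_reviews.values())
svd = TruncatedSVD(n_components=2, random_state=0)   # tiny space for toy data
item_vectors = svd.fit_transform(tfidf)              # structured representation

# A simple semantic query: which items are perceived most similarly to item_a?
similarities = cosine_similarity(item_vectors[:1], item_vectors)[0]
for name, score in zip(item_reviews, similarities):
    print(name, round(float(score), 3))
```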

2 citations


Cites background or methods from "Pushing the boundaries of crowd-ena..."

  • ...Our Perceptual Spaces introduced in [9] and [6] rely on a factor model using the following assumptions: Perceptual Spaces use the established assumption that item ratings in the Social Web are a result of a user’s preferences with respect to an item’s attributes [10]....


  • ...In [9], we have shown that certain perceived properties (like the degree of funniness) can be made explicit with only minimal human input using crowdsourcing-based machine regression....


  • ...As our experiments in [9] showed, quality of perceptual spaces increase with the involvement and activity of users: rating data obtained from a restaurant data set (where...


  • ...In the following, we evaluate different review-based embeddings in comparison with our rating-based perceptual space [9] as a baseline....


Journal ArticleDOI
TL;DR: A declarative meta-language, called VisFlow, for requirement specification, and a translator for mapping requirements into executable queries in a variant of SQL augmented with integration artefacts are presented.
Abstract: Data integration continues to baffle researchers even though substantial progress has been made. Although the emergence of technologies such as XML, web services, the semantic web, and cloud computing has helped, a system in which biologists are comfortable articulating new applications and developing them without technical assistance from a computing expert is yet to be realised. The distance between a friendly graphical interface that does little and a 'traditional' system, clunky yet powerful, is deemed too great more often than not. The question that remains unanswered is whether a user can state her query involving a set of complex, heterogeneous and distributed life sciences resources in an easy-to-use language and execute it without further help from a computer-savvy programmer. In this paper, we present a declarative meta-language, called VisFlow, for requirement specification, and a translator for mapping requirements into executable queries in a variant of SQL augmented with integration artefacts.

2 citations

Journal ArticleDOI
31 Jul 2014
TL;DR: Query results are used to help non-expert users in using the multi-database environment and to improve the performance of the multi-database environment, which not only uses disk and memory resources but also relies heavily on network bandwidth.
Abstract: This paper proposes NoXperanto, a novel crowdsourcing approach to address querying over data collections managed by polyglot persistence settings. The main contribution of NoXperanto is the ability to solve complex queries involving different data stores by exploiting queries from expert users (i.e., a crowd of database administrators, data engineers, domain experts, etc.), assuming that these users can submit meaningful queries. NoXperanto exploits the results of "meaningful queries" in order to facilitate the forthcoming query answering processes. In particular, query results are used to: (i) help non-expert users in using the multi-database environment and (ii) improve the performance of the multi-database environment, which not only uses disk and memory resources, but relies heavily on network bandwidth. NoXperanto employs a layer to keep track of the information produced by the crowd, modeled as a Property Graph and managed in a Graph Database Management System (GDBMS). Index Terms: polyglot persistence, crowdsourcing, multi-databases, big data, property graph, graph databases.
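The crowd-knowledge layer described above can be pictured as a property graph. The following sketch uses networkx purely for illustration; node and edge attribute names are hypothetical and do not reflect NoXperanto's actual schema.

```python
import networkx as nx

# Hypothetical sketch of the crowd-knowledge layer: expert queries and the
# data stores they touch, tracked as a property graph.
graph = nx.MultiDiGraph()

graph.add_node("q1", kind="query", author="dba_alice",
               text="match orders with shipping delays over 3 days")
graph.add_node("orders_sql", kind="datastore", engine="PostgreSQL")
graph.add_node("shipping_docs", kind="datastore", engine="MongoDB")

graph.add_edge("q1", "orders_sql", relation="reads", rows_returned=1200)
graph.add_edge("q1", "shipping_docs", relation="reads", rows_returned=87)

# Later, a non-expert request can be routed via prior expert queries that
# already span the relevant stores.
for _, store, data in graph.out_edges("q1", data=True):
    print(f"q1 {data['relation']} {store} ({graph.nodes[store]['engine']})")
```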

2 citations

References
More filters
Journal ArticleDOI
TL;DR: A new method for automatic indexing and retrieval to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.
Abstract: A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100 item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
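The LSI procedure described here (truncated SVD over a term-document matrix, with queries folded in as pseudo-documents and ranked by cosine similarity) can be reproduced in miniature. This sketch assumes scikit-learn and uses far fewer than the ca. 100 factors mentioned above because the toy collection is tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "human machine interface for computer applications",
    "a survey of user opinion of computer system response time",
    "relation of user perceived response time to error measurement",
    "the generation of random binary unordered trees",
]

# Term-document counts, then a low-rank approximation via truncated SVD.
vectorizer = CountVectorizer()
term_doc = vectorizer.fit_transform(docs)             # documents x terms
lsi = TruncatedSVD(n_components=2, random_state=0)    # "ca. 100" in the paper
doc_vectors = lsi.fit_transform(term_doc)

# A query is folded in as a pseudo-document and ranked by cosine similarity.
query_vector = lsi.transform(vectorizer.transform(["user response time"]))
print(cosine_similarity(query_vector, doc_vectors)[0])
```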

12,443 citations


"Pushing the boundaries of crowd-ena..." refers methods in this paper

  • ...Furthermore, we can show that approaches based on classification using metadata and LSI lead to surprisingly bad results (g-mean between 0.41 and 0.50), and show even worse accuracy than randomly applying labels....


  • ...This is implemented by using Latent Semantic Indexing (LSI) [21] to generate a 100-dimensional “metadata space” from movie attributes like title, plot, main actors, directors, year, runtime, and country as recorded in IMDb....


Journal ArticleDOI
TL;DR: This tutorial gives an overview of the basic ideas underlying Support Vector (SV) machines for function estimation, and includes a summary of currently used algorithms for training SV machines, covering both the quadratic programming part and advanced methods for dealing with large datasets.
Abstract: In this tutorial we give an overview of the basic ideas underlying Support Vector (SV) machines for function estimation. Furthermore, we include a summary of currently used algorithms for training SV machines, covering both the quadratic (or convex) programming part and advanced methods for dealing with large datasets. Finally, we mention some modifications and extensions that have been applied to the standard SV algorithm, and discuss the aspect of regularization from a SV perspective.
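A minimal function-estimation example in the spirit of the tutorial: fit an epsilon-insensitive support vector regressor to noisy samples of an unknown function. The data and hyperparameters below are illustrative, assuming scikit-learn.

```python
import numpy as np
from sklearn.svm import SVR

# Noisy samples of an unknown function; SVR estimates it from the data.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(60, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=60)

# Epsilon-insensitive loss with an RBF kernel; hyperparameters are illustrative.
model = SVR(kernel="rbf", C=10.0, epsilon=0.05).fit(X, y)
print("estimate at x = 1.5:", model.predict([[1.5]])[0])
print("number of support vectors:", len(model.support_))
```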

10,696 citations


"Pushing the boundaries of crowd-ena..." refers methods in this paper

  • ...Instead of relying on non-linear regression, we can use an SVM classifier [19]....


Journal ArticleDOI
TL;DR: A critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario is provided.
Abstract: With the continuous expansion of data availability in many large-scale, complex, and networked systems, such as surveillance, security, Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data to support decision-making processes. Although existing knowledge discovery and data engineering techniques have shown great success in many real-world applications, the problem of learning from imbalanced data (the imbalanced learning problem) is a relatively new challenge that has attracted growing attention from both academia and industry. The imbalanced learning problem is concerned with the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews. Due to the inherent complex characteristics of imbalanced data sets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation. In this paper, we provide a comprehensive review of the development of research in learning from imbalanced data. Our focus is to provide a critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario. Furthermore, in order to stimulate future research in this field, we also highlight the major opportunities and challenges, as well as potential important research directions for learning from imbalanced data.

6,320 citations


"Pushing the boundaries of crowd-ena..." refers background in this paper

  • ...A popular measure of classification performance in the presence of class imbalance is the g-mean measure [20], which is the geometric mean of sensitivity (accuracy on all movies truly belonging to the genre) and specificity (accuracy on all movies truly not belonging to the genre). As the g-mean punishes significant differences between sensitivity and specificity, the above naïve classifier would achieve 0% g-mean....

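The g-mean described in the excerpt above is straightforward to compute from a confusion matrix; the labels below are hypothetical and the sketch assumes scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = movie belongs to the genre, 0 = it does not.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0, 0, 1, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # accuracy on items truly in the genre
specificity = tn / (tn + fp)   # accuracy on items truly not in the genre
g_mean = np.sqrt(sensitivity * specificity)

# A classifier that always answers "not in genre" looks accurate on
# imbalanced data but has sensitivity 0, hence a g-mean of 0.
print(round(float(g_mean), 3))
```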

Proceedings Article
03 Dec 1996
TL;DR: This work compares support vector regression (SVR) with a committee regression technique (bagging) based on regression trees and ridge regression done in feature space and expects that SVR will have advantages in high dimensionality space because SVR optimization does not depend on the dimensionality of the input space.
Abstract: A new regression technique based on Vapnik's concept of support vectors is introduced. We compare support vector regression (SVR) with a committee regression technique (bagging) based on regression trees and ridge regression done in feature space. On the basis of these experiments, it is expected that SVR will have advantages in high dimensionality space because SVR optimization does not depend on the dimensionality of the input space.

4,009 citations


"Pushing the boundaries of crowd-ena..." refers methods in this paper

  • ...perceptual space, we suggest to use Support Vector Regression Machines (SVMs) [14], which are a highly flexible technique to perform non-linear regression and classification, and have been proven to be effective when dealing with perceptual data [15]....


BookDOI
31 Mar 2010
TL;DR: Semi-supervised learning (SSL) as discussed by the authors is the middle ground between supervised learning (in which all training examples are labeled) and unsupervised learning (in which no label data are given).
Abstract: In the field of machine learning, semi-supervised learning (SSL) occupies the middle ground, between supervised learning (in which all training examples are labeled) and unsupervised learning (in which no label data are given). Interest in SSL has increased in recent years, particularly because of application domains in which unlabeled data are plentiful, such as images, text, and bioinformatics. This first comprehensive overview of SSL presents state-of-the-art algorithms, a taxonomy of the field, selected applications, benchmark experiments, and perspectives on ongoing and future research. Semi-Supervised Learning first presents the key assumptions and ideas underlying the field: smoothness, cluster or low-density separation, manifold structure, and transduction. The core of the book is the presentation of SSL methods, organized according to algorithmic strategies. After an examination of generative models, the book describes algorithms that implement the low-density separation assumption, graph-based methods, and algorithms that perform two-step learning. The book then discusses SSL applications and offers guidelines for SSL practitioners by analyzing the results of extensive benchmark experiments. Finally, the book looks at interesting directions for SSL research. The book closes with a discussion of the relationship between semi-supervised learning and transduction. Adaptive Computation and Machine Learning series
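As a small illustration of the graph-based family of SSL algorithms surveyed in the book (not tied to any particular chapter), label spreading propagates a handful of known labels across unlabeled points along the data manifold. The dataset and parameters below are illustrative, assuming scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

# Mostly-unlabeled data: only six points carry a label (-1 means unlabeled).
X, y_true = make_moons(n_samples=200, shuffle=False, noise=0.05, random_state=0)
y = np.full(200, -1)
y[:3] = y_true[:3]            # a few labeled points from class 0
y[100:103] = y_true[100:103]  # a few labeled points from class 1

# Graph-based SSL: labels spread along the data manifold to unlabeled points.
model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y)
accuracy = (model.transduction_ == y_true).mean()
print(f"transductive accuracy: {accuracy:.2f}")
```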

3,773 citations