
Showing papers on "Crowdsourcing published in 2017"


Journal ArticleDOI
TL;DR: How TurkPrime saves time and resources, improves data quality, and allows researchers to design and implement studies that were previously very difficult or impossible to carry out on MTurk is described.
Abstract: In recent years, Mechanical Turk (MTurk) has revolutionized social science by providing a way to collect behavioral data with unprecedented speed and efficiency. However, MTurk was not intended to be a research tool, and many common research tasks are difficult and time-consuming to implement as a result. TurkPrime was designed as a research platform that integrates with MTurk and supports tasks that are common to the social and behavioral sciences. Like MTurk, TurkPrime is an Internet-based platform that runs on any browser and does not require any downloads or installation. Tasks that can be implemented with TurkPrime include: excluding participants on the basis of previous participation, longitudinal studies, making changes to a study while it is running, automating the approval process, increasing the speed of data collection, sending bulk e-mails and bonuses, enhancing communication with participants, monitoring dropout and engagement rates, providing enhanced sampling options, and many others. This article describes how TurkPrime saves time and resources, improves data quality, and allows researchers to design and implement studies that were previously very difficult or impossible to carry out on MTurk. TurkPrime is designed as a research tool whose aim is to improve the quality of the crowdsourcing data collection process. Various features have been and continue to be implemented on the basis of feedback from the research community. TurkPrime is a free research platform.

1,241 citations


Proceedings ArticleDOI
01 Jul 2017
TL;DR: A new DLP-CNN (Deep Locality-Preserving CNN) method, which aims to enhance the discriminative power of deep features by preserving the locality closeness while maximizing the inter-class scatters, is proposed.
Abstract: Past research on facial expressions has used relatively limited datasets, which makes it unclear whether current methods can be employed in the real world. In this paper, we present a novel database, RAF-DB, which contains about 30000 facial images from thousands of individuals. Each image has been individually labeled about 40 times, and an EM algorithm was then used to filter out unreliable labels. Crowdsourcing reveals that real-world faces often express compound emotions, or even a mixture of them. To the best of our knowledge, RAF-DB is the first database that contains compound expressions in the wild. Our cross-database study shows that the action units of basic emotions in RAF-DB are much more diverse than, or even deviate from, those of lab-controlled ones. To address this problem, we propose a new DLP-CNN (Deep Locality-Preserving CNN) method, which aims to enhance the discriminative power of deep features by preserving the locality closeness while maximizing the inter-class scatters. The benchmark experiments on the 7-class basic expressions and 11-class compound expressions, as well as the additional experiments on the SFEW and CK+ databases, show that the proposed DLP-CNN outperforms the state-of-the-art handcrafted features and deep learning based methods for expression recognition in the wild.

746 citations


Proceedings ArticleDOI
03 Apr 2017
TL;DR: A method that combines crowdsourcing and machine learning to analyze personal attacks at scale is developed and illustrated, and an evaluation method for a classifier in terms of the aggregated number of crowd-workers it can approximate is shown.
Abstract: The damage personal attacks cause to online discourse motivates many platforms to try to curb the phenomenon. However, understanding the prevalence and impact of personal attacks in online platforms at scale remains surprisingly difficult. The contribution of this paper is to develop and illustrate a method that combines crowdsourcing and machine learning to analyze personal attacks at scale. We show an evaluation method for a classifier in terms of the aggregated number of crowd-workers it can approximate. We apply our methodology to English Wikipedia, generating a corpus of over 100k high-quality human-labeled comments and 63M machine-labeled ones from a classifier that is as good as the aggregate of 3 crowd-workers, as measured by the area under the ROC curve and Spearman correlation. Using this corpus of machine-labeled scores, our methodology allows us to explore some of the open questions about the nature of online personal attacks. This reveals that the majority of personal attacks on Wikipedia are not the result of a few malicious users, nor primarily the consequence of allowing anonymous contributions from unregistered users.

472 citations
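
As a rough illustration of the evaluation idea above (scoring a classifier against the aggregated judgments of crowd-workers), the two reported metrics are easy to compute with standard libraries. A minimal sketch on synthetic data; the variable names and numbers are illustrative, not from the paper:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Stand-ins: the fraction of crowd-workers calling each comment an attack,
# and a machine score that tracks it with some noise.
human_frac = rng.random(1000)
machine_score = np.clip(human_frac + rng.normal(0, 0.15, 1000), 0, 1)

# Area under the ROC curve against the binarized human aggregate.
y_true = (human_frac > 0.5).astype(int)
print("ROC AUC:", roc_auc_score(y_true, machine_score))

# Rank agreement between machine scores and the human aggregate.
rho, _ = spearmanr(machine_score, human_frac)
print("Spearman rho:", rho)
```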


Journal ArticleDOI
TL;DR: This tutorial assesses the evidence on the reliability of crowdsourced populations and the conditions under which crowdsourcing is a valid strategy for data collection, and proposes specific guidelines for researchers to conduct high-quality research via crowdsourcing.
Abstract: Data collection in consumer research has progressively moved away from traditional samples (e.g., university undergraduates) and toward Internet samples. In the last complete volume of the Journal of Consumer Research (June 2015-April 2016), 43% of behavioral studies were conducted on the crowdsourcing website Amazon Mechanical Turk (MTurk). The option to crowdsource empirical investigations has great efficiency benefits for both individual researchers and the field, but it also poses new challenges and questions for how research should be designed, conducted, analyzed, and evaluated. We assess the evidence on the reliability of crowdsourced populations and the conditions under which crowdsourcing is a valid strategy for data collection. Based on this evidence, we propose specific guidelines for researchers to conduct high-quality research via crowdsourcing. We hope this tutorial will strengthen the community's scrutiny on data collection practices and move the field toward better and more valid crowdsourcing of consumer research.

384 citations


Journal ArticleDOI
01 Jan 2017
TL;DR: It is believed that the truth inference problem is not fully solved; the limitations of existing algorithms are identified and promising research directions are pointed out.
Abstract: Crowdsourcing has emerged as a novel problem-solving paradigm, which facilitates addressing problems that are hard for computers, e.g., entity resolution and sentiment analysis. However, due to the openness of crowdsourcing, workers may yield low-quality answers, and a redundancy-based method is widely employed, which first assigns each task to multiple workers and then infers the correct answer (called truth) for the task based on the answers of the assigned workers. A fundamental problem in this method is Truth Inference, which decides how to effectively infer the truth. Recently, the database community and data mining community independently study this problem and propose various algorithms. However, these algorithms are not compared extensively under the same framework and it is hard for practitioners to select appropriate algorithms. To alleviate this problem, we provide a detailed survey on 17 existing algorithms and perform a comprehensive evaluation using 5 real datasets. We make all codes and datasets public for future research. Through experiments we find that existing algorithms are not stable across different datasets and there is no algorithm that outperforms others consistently. We believe that the truth inference problem is not fully solved, and identify the limitations of existing algorithms and point out promising research directions.

376 citations
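
To make the redundancy-based setup concrete, below is a minimal sketch of two canonical truth-inference baselines that surveys of this kind cover: plain majority voting and Dawid-Skene-style EM, which jointly estimates worker reliability and task truths. This is a generic textbook formulation, not code from the paper:

```python
import numpy as np

def majority_vote(labels):
    """labels: (tasks, workers) int matrix; -1 marks an unanswered task."""
    n_classes = labels.max() + 1
    votes = np.stack([(labels == c).sum(axis=1) for c in range(n_classes)], axis=1)
    return votes.argmax(axis=1)

def dawid_skene(labels, n_classes, n_iter=50, eps=1e-12):
    """Dawid-Skene EM: estimates per-worker confusion matrices and a
    posterior over each task's true label; returns the MAP labels."""
    n_tasks, n_workers = labels.shape
    # Initialize soft labels from per-task vote fractions.
    T = np.stack([(labels == c).sum(axis=1) for c in range(n_classes)],
                 axis=1).astype(float)
    T /= np.maximum(T.sum(axis=1, keepdims=True), eps)
    for _ in range(n_iter):
        # M-step: class priors and confusion matrices pi[w, true, observed].
        prior = T.mean(axis=0)
        pi = np.zeros((n_workers, n_classes, n_classes))
        for w in range(n_workers):
            for c in range(n_classes):
                pi[w, :, c] = T[labels[:, w] == c].sum(axis=0)
            pi[w] /= np.maximum(pi[w].sum(axis=1, keepdims=True), eps)
        # E-step: recompute the posterior over true labels.
        logT = np.tile(np.log(prior + eps), (n_tasks, 1))
        for w in range(n_workers):
            seen = labels[:, w] >= 0
            logT[seen] += np.log(pi[w][:, labels[seen, w]].T + eps)
        T = np.exp(logT - logT.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)
    return T.argmax(axis=1)

# Toy run: three workers answer four binary tasks.
answers = np.array([[1, 1, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1]])
print(majority_vote(answers), dawid_skene(answers, n_classes=2))
```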


Journal ArticleDOI
TL;DR: A comprehensive survey of the use of crowdsourcing in software engineering, seeking to cover all literature on this topic, and exposing trends, open issues and opportunities for future research on Crowdsourced Software Engineering.

360 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present an evaluation of MTurk as a data source and provide a set of practical recommendations for researchers using it.
Abstract: Purpose: Amazon Mechanical Turk is an increasingly popular data source in the organizational psychology research community. This paper presents an evaluation of MTurk and provides a set of practical recommendations for researchers using MTurk.

330 citations


Proceedings ArticleDOI
20 Oct 2017
TL;DR: Rico is presented, the largest repository of mobile app designs to date, created to support five classes of data-driven applications: design search, UI layout generation, UI code generation, user interaction modeling, and user perception prediction.
Abstract: Data-driven models help mobile app designers understand best practices and trends, and can be used to make predictions about design performance and support the creation of adaptive UIs. This paper presents Rico, the largest repository of mobile app designs to date, created to support five classes of data-driven applications: design search, UI layout generation, UI code generation, user interaction modeling, and user perception prediction. To create Rico, we built a system that combines crowdsourcing and automation to scalably mine design and interaction data from Android apps at runtime. The Rico dataset contains design data from more than 9.7k Android apps spanning 27 categories. It exposes visual, textual, structural, and interactive design properties of more than 72k unique UI screens. To demonstrate the kinds of applications that Rico enables, we present results from training an autoencoder for UI layout similarity, which supports query-by-example search over UIs.

309 citations
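
The query-by-example demo described above reduces to a familiar pattern: train an autoencoder on screen representations, then rank screens by distance in the bottleneck embedding. A minimal sketch of that pattern in PyTorch, with random vectors standing in for Rico's UI encodings (the paper's actual model and input representation differ):

```python
import torch
import torch.nn as nn

# Stand-in for Rico screen encodings: 2048 screens as 1024-d vectors.
screens = torch.rand(2048, 1024)

# A small autoencoder; the bottleneck becomes the similarity embedding.
encoder = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 64))
decoder = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 1024))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()),
                       lr=1e-3)

# Train on reconstruction loss only.
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(decoder(encoder(screens)), screens)
    loss.backward()
    opt.step()

# Query-by-example: rank all screens by distance to the query's embedding
# (the query itself ranks first).
with torch.no_grad():
    emb = encoder(screens)
    dists = torch.cdist(emb[:1], emb).squeeze(0)
    print("nearest neighbors of screen 0:", dists.argsort()[:5].tolist())
```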


Proceedings ArticleDOI
01 Jul 2017
TL;DR: A method of crowdsourcing linguistically diverse data is described, and an analysis of the data demonstrates a broad set of linguistic phenomena requiring visual and set-theoretic reasoning.
Abstract: We present a new visual reasoning language dataset, containing 92,244 pairs of examples of natural statements grounded in synthetic images with 3,962 unique sentences. We describe a method of crowdsourcing linguistically-diverse data, and present an analysis of our data. The data demonstrates a broad set of linguistic phenomena, requiring visual and set-theoretic reasoning. We experiment with various models, and show the data presents a strong challenge for future research.

222 citations


Journal ArticleDOI
TL;DR: This work provides a conceptual framework for gamified crowdsourcing systems in order to understand and conceptualize the key aspects of the phenomenon and indicates that gamification has been an effective approach for increasing crowdsourcing participation and the quality of the crowdsourced work.
Abstract: Two parallel phenomena are gaining attention in human–computer interaction research: gamification and crowdsourcing. Because crowdsourcing's success depends on a mass of motivated crowdsourcees, crowdsourcing platforms have increasingly been imbued with motivational design features borrowed from games, a practice often called gamification. While the body of literature and knowledge of the phenomenon have begun to accumulate, we still lack a comprehensive and systematic understanding of conceptual foundations, knowledge of how gamification is used in crowdsourcing, and whether it is effective. We first provide a conceptual framework for gamified crowdsourcing systems in order to understand and conceptualize the key aspects of the phenomenon. The paper's main contributions are derived through a systematic literature review that investigates how gamification has been examined in different types of crowdsourcing in a variety of domains. This meticulous mapping, which focuses on all aspects in our framework, enables us to infer what kinds of gamification efforts are effective in different crowdsourcing approaches as well as to point to a number of research gaps and lay out future research directions for gamified crowdsourcing systems. Overall, the results indicate that gamification has been an effective approach for increasing crowdsourcing participation and the quality of the crowdsourced work; however, differences exist between different types of crowdsourcing: the research conducted in the context of crowdsourcing of homogenous tasks has most commonly used simple gamification implementations, such as points and leaderboards, whereas crowdsourcing implementations that seek diverse and creative contributions employ gamification with a richer set of mechanics.

212 citations


Proceedings ArticleDOI
02 May 2017
TL;DR: Revolt eliminates the burden of creating detailed label guidelines by harnessing crowd disagreements to identify ambiguous concepts and create rich structures (groups of semantically related items) for post-hoc label decisions.
Abstract: Crowdsourcing provides a scalable and efficient way to construct labeled datasets for training machine learning systems. However, creating comprehensive label guidelines for crowdworkers is often prohibitive even for seemingly simple concepts. Incomplete or ambiguous label guidelines can then result in differing interpretations of concepts and inconsistent labels. Existing approaches for improving label quality, such as worker screening or detection of poor work, are ineffective for this problem and can lead to rejection of honest work and a missed opportunity to capture rich interpretations about data. We introduce Revolt, a collaborative approach that brings ideas from expert annotation workflows to crowd-based labeling. Revolt eliminates the burden of creating detailed label guidelines by harnessing crowd disagreements to identify ambiguous concepts and create rich structures (groups of semantically related items) for post-hoc label decisions. Experiments comparing Revolt to traditional crowdsourced labeling show that Revolt produces high-quality labels without requiring label guidelines, in exchange for an increase in monetary cost. This up-front cost, however, is mitigated by Revolt's ability to produce reusable structures that can accommodate a variety of label boundaries without requiring new data to be collected. Further comparisons of Revolt's collaborative and non-collaborative variants show that collaboration reaches higher label accuracy with lower monetary cost.

Posted Content
TL;DR: This work formulates a crowdsourcing typology and shows how its four categories—crowd voting, micro-task, idea, and solution crowdsourcing—can help firms develop ‘crowd capital,’ an organizational-level resource harnessed from the crowd.
Abstract: Traditionally, the term crowd was used almost exclusively in the context of people who self-organized around a common purpose, emotion or experience. Today, however, firms often refer to crowds in discussions of how collections of individuals can be engaged for organizational purposes. Crowdsourcing, the use of information technologies to outsource business responsibilities to crowds, can now significantly influence a firm's ability to leverage previously unattainable resources to build competitive advantage. Nonetheless, many managers are hesitant to consider crowdsourcing because they do not understand how its various types can add value to the firm. In response, we explain what crowdsourcing is, the advantages it offers, and how firms can pursue crowdsourcing. We begin by formulating a crowdsourcing typology and show how its four categories (crowd-voting, micro-task, idea, and solution crowdsourcing) can help firms develop crowd capital, an organizational-level resource harnessed from the crowd. We then present a three-step process model for generating crowd capital. Step one includes important considerations that shape how a crowd is to be constructed. Step two outlines the capabilities firms need to develop to acquire and assimilate resources (knowledge, labor, funds) from the crowd. Step three addresses key decision-areas that executives need to address to effectively engage crowds.

Journal ArticleDOI
TL;DR: In this article, the authors demonstrate the potential benefits of crowdsourcing last mile delivery by exploiting a social network of the customers and show that using friends in social networks to assist in last-mile delivery greatly reduces delivery costs and total emissions while ensuring speedy and reliable delivery.
Abstract: This paper demonstrates the potential benefits of crowdsourcing last mile delivery by exploiting a social network of the customers. The presented models and analysis are informed by the results of a survey to gauge people’s attitudes toward engaging in social network-reliant package delivery to and by friends or acquaintances. It is found that using friends in a social network to assist in last mile delivery greatly reduces delivery costs and total emissions while ensuring speedy and reliable delivery. The proposed new delivery method also mitigates the privacy concerns and not-at-home syndrome that widely exist in last mile delivery.

Journal ArticleDOI
TL;DR: In this article, the authors proposed a conceptualization of the growing phenomenon of crowd logistics, which is a novel way of providing logistics services that taps into the dormant logistics resources and capabilities of individuals, using mobile applications and web-based platforms.
Abstract: Patterned on crowdsourcing and crowdfunding, a new crowd practice has emerged in recent years: crowd logistics. In this paper, we propose a first conceptualization of this growing phenomenon. Crowd logistics is a novel way of providing logistics services that taps into the dormant logistics resources and capabilities of individuals, using mobile applications and web-based platforms. Although crowd logistics has been widely discussed in the business world, it has not yet been the subject of any academic publication. Following an exploratory case study approach, we review the websites of 57 crowd logistics initiatives around the world and highlight the main distinctive characteristics of crowd logistics, as compared to traditional business logistics. We introduce a segmented analysis in which crowd logistics solutions are classified according to four types of service offered. Finally, we introduce six theoretical propositions on the future development of crowd logistics. At a theoretical level, our findings contribute to enriching the service-dominant logic perspective in the logistics field by conceptualizing the crowd as a co-creator of logistics value. At a managerial level, our findings contribute to identifying which types of crowd logistics services are more likely to threaten or disrupt traditional business.

Proceedings Article
12 Feb 2017
TL;DR: This work presents a general solution towards building task-oriented dialogue systems for online shopping, aiming to assist online customers in completing various purchase-related tasks, such as searching for products and answering questions, through natural language conversation.
Abstract: We present a general solution towards building task-oriented dialogue systems for online shopping, aiming to assist online customers in completing various purchase-related tasks, such as searching for products and answering questions, through natural language conversation. As a pioneering work, we show what & how existing NLP techniques, data resources, and crowdsourcing can be leveraged to build such task-oriented dialogue systems for E-commerce usage. To demonstrate its effectiveness, we integrate our system into a mobile online shopping app. To the best of our knowledge, this is the first time that a Chinese-language AI bot has been practically used in an online shopping scenario with millions of real consumers. Interesting and insightful observations are presented in the experimental section, based on an analysis of human-bot conversation logs. Several current challenges are also pointed out as our future directions.

Journal ArticleDOI
TL;DR: In this paper, the authors describe the outsourcing of an organizational function via crowdsourcing, which serves as an alternative funding source and offers non-monetary resources through organizational learning.

Journal ArticleDOI
TL;DR: Crowdsourcing data collection from research participants recruited from online labor markets is now common in cognitive science; this review examines who is in the crowd and who can be reached by the average laboratory.

Journal ArticleDOI
TL;DR: It is concluded that platforms such as MTurk have much to offer PD researchers, especially for certain kinds of research (e.g., where large samples are required and there is a need for iterative sampling).
Abstract: The use of crowdsourcing platforms such as Amazon's Mechanical Turk (MTurk) for data collection in the behavioral sciences has increased substantially in the past several years, due in large part to (a) the ability to recruit large samples, (b) the inexpensiveness of data collection, (c) the speed of data collection, and (d) evidence that the data collected are, for the most part, of equal or better quality than those collected in undergraduate research pools. In this review, we first evaluate the strengths and potential limitations of this approach to data collection. Second, we examine how MTurk has been used to date in personality disorder (PD) research and compare the characteristics of such research to PD research conducted in other settings. Third, we compare PD trait data from the Section III trait model of the DSM-5 collected via MTurk to data collected using undergraduate and clinical samples with regard to internal consistency, mean-level differences, and factor structure. Overall, we conclude that platforms such as MTurk have much to offer PD researchers, especially for certain kinds of research (e.g., where large samples are required and there is a need for iterative sampling). Whether MTurk itself remains the predominant model of such platforms is unclear, however, and will largely depend on decisions related to cost effectiveness and the development of alternatives that offer even greater flexibility.

Journal ArticleDOI
TL;DR: Current research topics in CrowdRE are presented; the benefits, challenges, and lessons learned from projects and experiments are discussed; and how to apply the methods and tools in industrial contexts are assessed.
Abstract: Crowd-based requirements engineering (CrowdRE) could significantly change RE. Performing RE activities such as elicitation with the crowd of stakeholders turns RE into a participatory effort, leads to more accurate requirements, and ultimately boosts software quality. Although any stakeholder in the crowd can contribute, CrowdRE emphasizes one stakeholder group whose role is often trivialized: users. CrowdRE empowers the management of requirements, such as their prioritization and segmentation, in a dynamic, evolved style through collecting and harnessing a continuous flow of user feedback and monitoring data on the usage context. To analyze the large amount of data obtained from the crowd, automated approaches are key. This article presents current research topics in CrowdRE; discusses the benefits, challenges, and lessons learned from projects and experiments; and assesses how to apply the methods and tools in industrial contexts. This article is part of a special issue on Crowdsourcing for Software Engineering.

Proceedings ArticleDOI
01 Sep 2017
TL;DR: The authors presented a method for obtaining high-quality, domain-targeted multiple choice questions from crowd workers, which produces model suggestions for document selection and answer distractor choice which aid the human question generation process.
Abstract: We present a novel method for obtaining high-quality, domain-targeted multiple choice questions from crowd workers. Generating these questions can be difficult without trading away originality, relevance or diversity in the answer options. Our method addresses these problems by leveraging a large corpus of domain-specific text and a small set of existing questions. It produces model suggestions for document selection and answer distractor choice which aid the human question generation process. With this method we have assembled SciQ, a dataset of 13.7K multiple choice science exam questions. We demonstrate that the method produces in-domain questions by providing an analysis of this new dataset and by showing that humans cannot distinguish the crowdsourced questions from original questions. When using SciQ as additional training data to existing questions, we observe accuracy improvements on real science exams.

Proceedings ArticleDOI
02 May 2017
TL;DR: A deployment is reported in which flash organizations successfully carried out open-ended and complex goals previously out of reach for crowdsourcing, including product design, software development, and game production.
Abstract: This paper introduces flash organizations: crowds structured like organizations to achieve complex and open-ended goals. Microtask workflows, the dominant crowdsourcing structures today, only enable goals that are so simple and modular that their path can be entirely pre-defined. We present a system that organizes crowd workers into computationally-represented structures inspired by those used in organizations - roles, teams, and hierarchies - which support emergent and adaptive coordination toward open-ended goals. Our system introduces two technical contributions: 1) encoding the crowd's division of labor into de-individualized roles, much as movie crews or disaster response teams use roles to support coordination between on-demand workers who have not worked together before; and 2) reconfiguring these structures through a model inspired by version control, enabling continuous adaptation of the work and the division of labor. We report a deployment in which flash organizations successfully carried out open-ended and complex goals previously out of reach for crowdsourcing, including product design, software development, and game production. This research demonstrates digitally networked organizations that flexibly assemble and reassemble themselves from a globally distributed online workforce to accomplish complex work.
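
As a loose illustration of the two contributions above (de-individualized roles and version-control-style reconfiguration), one can picture a role/team tree plus a snapshot history. The sketch below is a hypothetical reading of the abstract, not the paper's actual system:

```python
from dataclasses import dataclass, field, replace
from typing import List, Optional
import copy

@dataclass(frozen=True)
class Role:
    name: str                       # de-individualized: a slot, not a person
    assignee: Optional[str] = None  # filled on demand from the crowd

@dataclass
class Team:
    name: str
    roles: List[Role] = field(default_factory=list)
    subteams: List["Team"] = field(default_factory=list)

history: List[Team] = []  # version-control-style snapshots of the org

def commit(org: Team) -> None:
    history.append(copy.deepcopy(org))

org = Team("app project", roles=[Role("product lead")],
           subteams=[Team("design", roles=[Role("UI designer")])])
commit(org)

# Adapt the division of labor mid-project: add a role, staff another.
org.subteams[0].roles.append(Role("UX researcher"))
org.roles[0] = replace(org.roles[0], assignee="worker_42")
commit(org)

print(len(history), "snapshots;", [r.name for r in org.subteams[0].roles])
```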

Journal ArticleDOI
TL;DR: A model to explain the impacts of benefit and cost factors as well as trust on solver participation behavior in crowdsourcing was developed; the study found that monetary reward positively affects trust (trust partially mediates its effect on participation behavior), while loss of knowledge power negatively affects trust.
Abstract: Organizations are increasingly crowdsourcing their tasks to unknown individual workers, i.e., solvers. Solvers' participation is critical to the success of crowdsourcing activities. However, challenges exist in attracting solvers to participate in crowdsourcing. In this regard, prior research has mainly investigated the influences of benefit factors on solvers’ intention to participate in crowdsourcing. Thus, there is a lack of understanding of the cost factors that influence actual participation behavior, in conjunction with the benefits. Additionally, the role of trust in the cost-benefit analysis remains to be explored. Motivated thus, based on social exchange theory and context-related literature, we develop a model to explain the impacts of benefit and cost factors as well as trust on solver participation behavior in crowdsourcing. The model was tested using survey and archival data from 156 solvers on a large crowdsourcing platform. As hypothesized, monetary reward, skill enhancement, work autonomy, enjoyment, and trust were found to positively affect solvers’ participation in crowdsourcing, while cognitive effort negatively affects their participation. In addition, it was found that monetary reward positively affects trust (trust partially mediates its effect on participation behavior), while loss of knowledge power negatively affects trust. The theoretical contributions and practical implications of the study are discussed.

Journal ArticleDOI
TL;DR: This paper proposes a mechanism based on differential privacy and geocasting that achieves effective SC services while offering privacy guarantees to workers, and addresses scenarios with both static and dynamic datasets of workers.
Abstract: Spatial Crowdsourcing (SC) is a transformative platform that engages individuals in collecting and analyzing environmental, social, and other spatio-temporal information. SC outsources spatio-temporal tasks to a set of workers , i.e., individuals with mobile devices that perform the tasks by physically traveling to specified locations. However, current solutions require the workers to disclose their locations to untrusted parties. In this paper, we introduce a framework for protecting location privacy of workers participating in SC tasks. We propose a mechanism based on differential privacy and geocasting that achieves effective SC services while offering privacy guarantees to workers. We address scenarios with both static and dynamic (i.e., moving) datasets of workers. Experimental results on real-world data show that the proposed technique protects location privacy without incurring significant performance overhead.
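
One widely used mechanism in this space is the planar Laplace noise of geo-indistinguishability (Andres et al., 2013): a worker perturbs their coordinates before the server geocasts tasks to a region around the noisy point. The sketch below shows that generic mechanism; it is not necessarily the paper's exact construction:

```python
import numpy as np
from scipy.special import lambertw

def planar_laplace(x, y, epsilon, rng=np.random.default_rng()):
    """Sample a point from the planar Laplace distribution centered at
    (x, y), the core mechanism of geo-indistinguishability."""
    theta = rng.uniform(0, 2 * np.pi)
    p = rng.uniform(0, 1)
    # Inverse CDF of the radial component uses the Lambert W function.
    r = -(lambertw((p - 1) / np.e, k=-1).real + 1) / epsilon
    return x + r * np.cos(theta), y + r * np.sin(theta)

# A worker at (0.0, 0.0) reports a noisy location; the server would then
# geocast the task to a region around the noisy point.
print(planar_laplace(0.0, 0.0, epsilon=0.1))
```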

Proceedings ArticleDOI
01 Apr 2017
TL;DR: This paper surveys and synthesizes a wide spectrum of existing studies on crowdsourced data management and outlines key factors that need to be considered to improve crowdsourcing data management.
Abstract: Many important data management and analytics tasks cannot be completely addressed by automated processes. These tasks, such as entity resolution, sentiment analysis, and image recognition, can be enhanced through the use of human cognitive ability. Crowdsourcing is an effective way to harness the capabilities of people (i.e., the crowd) to apply human computation to such tasks. Thus, crowdsourced data management has become an area of increasing interest in research and industry. We identify three important problems in crowdsourced data management. (1) Quality Control: Workers may return noisy or incorrect results, so effective techniques are required to achieve high quality. (2) Cost Control: The crowd is not free, and cost control aims to reduce the monetary cost. (3) Latency Control: Human workers can be slow, particularly compared to automated computing time scales, so latency-control techniques are required. There has been significant work addressing these three factors for designing crowdsourced tasks, developing crowdsourced data manipulation operators, and optimizing plans consisting of multiple operators. We survey and synthesize a wide spectrum of existing studies on crowdsourced data management.

Proceedings ArticleDOI
Yuanshun Yao, Bimal Viswanath, Jenna Cryan, Haitao Zheng, Ben Y. Zhao
30 Oct 2017
TL;DR: In this paper, the authors identify a new class of attacks that leverage deep learning language models (Recurrent Neural Networks or RNNs) to automate the generation of fake online reviews for products and services.
Abstract: Malicious crowdsourcing forums are gaining traction as sources of spreading misinformation online, but are limited by the costs of hiring and managing human workers. In this paper, we identify a new class of attacks that leverage deep learning language models (Recurrent Neural Networks, or RNNs) to automate the generation of fake online reviews for products and services. Not only are these attacks cheap and therefore more scalable, but they can control the rate of content output to eliminate the signature burstiness that makes crowdsourced campaigns easy to detect. Using Yelp reviews as an example platform, we show how a two-phase review generation and customization attack can produce reviews that are indistinguishable by state-of-the-art statistical detectors. We conduct a survey-based user study to show these reviews not only evade human detection, but also score high on "usefulness" metrics by users. Finally, we develop novel automated defenses against these attacks, by leveraging the lossy transformation introduced by the RNN training and generation cycle. We consider countermeasures against our mechanisms, show that they produce unattractive cost-benefit tradeoffs for attackers, and that they can be further curtailed by simple constraints imposed by online service providers.

25 Apr 2017
TL;DR: In this paper, the authors show that Facebook posts can be classified with high accuracy as hoaxes or non-hoaxes on the basis of the users who "like" them.
Abstract: In recent years, the reliability of information on the Internet has emerged as a crucial issue of modern society. Social network sites (SNSs) have revolutionized the way in which information is spread by allowing users to freely share content. As a consequence, SNSs are also increasingly used as vectors for the diffusion of misinformation and hoaxes. The amount of disseminated information and the rapidity of its diffusion make it practically impossible to assess reliability in a timely manner, highlighting the need for automatic hoax detection systems. As a contribution towards this objective, we show that Facebook posts can be classified with high accuracy as hoaxes or non-hoaxes on the basis of the users who "liked" them. We present two classification techniques, one based on logistic regression, the other on a novel adaptation of boolean crowdsourcing algorithms. On a dataset consisting of 15,500 Facebook posts and 909,236 users, we obtain classification accuracies exceeding 99% even when the training set contains less than 1% of the posts. We further show that our techniques are robust: they work even when we restrict our attention to the users who like both hoax and non-hoax posts. These results suggest that mapping the diffusion pattern of information can be a useful component of automatic hoax detection systems.
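
The logistic-regression variant of the above is straightforward to prototype: build a sparse posts-by-users "like" matrix and classify. A minimal sketch on synthetic data; it illustrates the setup only and will not reproduce the paper's accuracy figures:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n_posts, n_users = 2000, 5000
is_hoax = rng.integers(0, 2, n_posts)
user_pref = rng.integers(0, 2, n_users)  # each user mostly likes one kind

# Build the posts-by-users "like" matrix.
rows, cols = [], []
for p in range(n_posts):
    for u in rng.choice(n_users, size=30, replace=False):
        if rng.random() < (0.9 if user_pref[u] == is_hoax[p] else 0.1):
            rows.append(p)
            cols.append(u)
likes = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n_posts, n_users))

# Classify posts from the identities of the users who liked them.
X_tr, X_te, y_tr, y_te = train_test_split(likes, is_hoax, random_state=1)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```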

Proceedings ArticleDOI
04 Aug 2017
TL;DR: Crowdsourcing is used to evaluate TrioVecEvent, a method that leverages multimodal embeddings to achieve accurate online local event detection and introduces discriminative features that can well characterize local events.
Abstract: Detecting local events (e.g., protest, disaster) at their onsets is an important task for a wide spectrum of applications, ranging from disaster control to crime monitoring and place recommendation. Recent years have witnessed growing interest in leveraging geo-tagged tweet streams for online local event detection. Nevertheless, the accuracies of existing methods still remain unsatisfactory for building reliable local event detection systems. We propose TrioVecEvent, a method that leverages multimodal embeddings to achieve accurate online local event detection. The effectiveness of TrioVecEvent is underpinned by its two-step detection scheme. First, it ensures a high coverage of the underlying local events by dividing the tweets in the query window into coherent geo-topic clusters. To generate quality geo-topic clusters, we capture short-text semantics by learning multimodal embeddings of the location, time, and text, and then perform online clustering with a novel Bayesian mixture model. Second, TrioVecEvent considers the geo-topic clusters as candidate events and extracts a set of features for classifying the candidates. Leveraging the multimodal embeddings as background knowledge, we introduce discriminative features that can well characterize local events, which enables pinpointing true local events from the candidate pool with a small amount of training data. We have used crowdsourcing to evaluate TrioVecEvent, and found that it improves the performance of the state-of-the-art method by a large margin.

Journal ArticleDOI
TL;DR: This article provides a background on the use of MTurk as a mechanism for collecting research data, and reviews what is currently known about the advantages and issues associated with using M Turk and highlights important areas for future research.
Abstract: The advent of online platforms such as Amazon’s Mechanical Turk (MTurk) has expanded considerably researchers’ options for collecting research data. Many researchers, however, express understandabl...

Journal ArticleDOI
TL;DR: In this article, the authors used air temperature data from the prolific, low-cost, Netatmo weather station to quantify the urban heat island of London over the summer of 2015.
Abstract: Crowdsourcing techniques are frequently used across science to supplement traditional means of data collection. Although atmospheric science has so far been slow to harness the technology, developments have now reached the point where the benefits of the approaches simply cannot be ignored: crowdsourcing has potentially far-reaching consequences for the way in which measurements are collected and used in the discipline. To illustrate this point, this paper uses air temperature data from the prolific, low-cost, Netatmo weather station to quantify the urban heat island of London over the summer of 2015. The results are broadly comparable with previous studies, and indeed standard observations (albeit with a warm bias, a likely consequence of non-standard site exposure), showing a range of magnitudes of between 1 and 6 °C across the city depending on atmospheric stability. However, not all the results can be easily explained by physical processes and therefore highlight quality issues with crowdsourced data that need to be resolved. This paper aims to kickstart a step-change in the use of crowdsourcing in urban meteorology by encouraging atmospheric scientists to more positively engage with the new generation of manufacturers producing mass market sensors.
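
Once stations are classified as urban or rural, the core computation behind such studies is simple: the urban heat island intensity is the difference in quality-controlled mean air temperature between the two groups. A sketch with pandas over a hypothetical table of crowdsourced readings (column names are illustrative):

```python
import pandas as pd

# Hypothetical crowdsourced readings; real Netatmo data would also carry
# station metadata used to flag poor siting (e.g., indoor units).
obs = pd.DataFrame({
    "station": ["a", "a", "b", "b", "c", "c", "d", "d"],
    "setting": ["urban", "urban", "urban", "urban",
                "rural", "rural", "rural", "rural"],
    "temp_c":  [24.1, 25.3, 23.8, 26.0, 21.5, 22.0, 55.0, 21.2],
})

# Crude quality control: discard physically implausible readings, one of
# the data-quality issues the paper highlights with mass-market sensors.
obs = obs[obs["temp_c"].between(-20, 45)]

# Urban heat island intensity = mean urban minus mean rural temperature.
means = obs.groupby("setting")["temp_c"].mean()
print("UHI intensity (degC):", means["urban"] - means["rural"])
```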

Journal ArticleDOI
TL;DR: Combining ML and crowdsourcing provides a highly sensitive RCT identification strategy with substantially less effort than relying on manual screening alone.