
Showing papers by "Michael S. Bernstein published in 2017"


Journal ArticleDOI
TL;DR: The Visual Genome dataset as mentioned in this paper contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects.
Abstract: Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked "What vehicle is the person riding?", computers will need to identify the objects in an image as well as the relationships riding(man, carriage) and pulling(horse, carriage) to answer correctly that "the person is riding a horse-drawn carriage." In this paper, we present the Visual Genome dataset to enable the modeling of such relationships. We collect dense annotations of objects, attributes, and relationships within each image to learn these models. Specifically, our dataset contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects. We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and question answer pairs to WordNet synsets. Together, these annotations represent the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.
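The relationship annotations the abstract describes can be pictured as subject-predicate-object triples. Below is a minimal illustrative sketch (the `Relationship` type and helper are hypothetical, not the dataset's actual API) showing how such triples answer the carriage question from the abstract:

```python
# Hypothetical sketch of Visual Genome-style relationship annotations:
# each relationship is a (subject, predicate, object) triple.
from collections import namedtuple

Relationship = namedtuple("Relationship", ["subject", "predicate", "object"])

# Annotations for the horse-drawn carriage example from the abstract.
scene = [
    Relationship("man", "riding", "carriage"),
    Relationship("horse", "pulling", "carriage"),
]

def objects_related_to(scene, subject, predicate):
    """Return the objects linked to `subject` by `predicate`."""
    return [r.object for r in scene
            if r.subject == subject and r.predicate == predicate]

# "What vehicle is the person riding?" -> look up what the man is riding.
print(objects_related_to(scene, "man", "riding"))  # ['carriage']
```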

3,842 citations


Proceedings ArticleDOI
25 Feb 2017
TL;DR: This article found that both negative mood and seeing troll posts by others significantly increase the probability of a user trolling, and that together they double this probability; it also explored long-range patterns of repeated exposure to trolling.
Abstract: In online communities, antisocial behavior such as trolling disrupts constructive discussion. While prior work suggests that trolling behavior is confined to a vocal and antisocial minority, we demonstrate that ordinary people can engage in such behavior as well. We propose two primary trigger mechanisms: the individual's mood, and the surrounding context of a discussion (e.g., exposure to prior trolling behavior). Through an experiment simulating an online discussion, we find that both negative mood and seeing troll posts by others significantly increase the probability of a user trolling, and together double this probability. To support and extend these results, we study how these same mechanisms play out in the wild via a data-driven, longitudinal analysis of a large online news discussion community. This analysis exposes temporal mood effects, and explores long-range patterns of repeated exposure to trolling. A predictive model of trolling behavior reveals that mood and discussion context together can explain trolling behavior better than an individual's history of trolling. These results combine to suggest that ordinary people can, under the right circumstances, behave like trolls.
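The combined effect of the two triggers can be illustrated with a toy logistic model. The coefficients below are assumptions chosen for illustration, not the paper's fitted values; the point is only that two modest log-odds effects compound multiplicatively:

```python
import math

def troll_probability(negative_mood, saw_trolling,
                      base_log_odds=-2.2, mood_coef=0.4, context_coef=0.5):
    """Toy logistic model of the probability a post is trolling, given two
    binary triggers. Coefficients are illustrative, not the paper's values."""
    z = base_log_odds + mood_coef * negative_mood + context_coef * saw_trolling
    return 1 / (1 + math.exp(-z))

baseline = troll_probability(0, 0)
both = troll_probability(1, 1)
# With these assumed coefficients, the two triggers together roughly
# double the baseline probability.
print(round(both / baseline, 2))
```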

368 citations


Proceedings ArticleDOI
02 May 2017
TL;DR: A deployment is reported in which flash organizations successfully carried out open-ended and complex goals previously out of reach for crowdsourcing, including product design, software development, and game production.
Abstract: This paper introduces flash organizations: crowds structured like organizations to achieve complex and open-ended goals. Microtask workflows, the dominant crowdsourcing structures today, only enable goals that are so simple and modular that their path can be entirely pre-defined. We present a system that organizes crowd workers into computationally-represented structures inspired by those used in organizations - roles, teams, and hierarchies - which support emergent and adaptive coordination toward open-ended goals. Our system introduces two technical contributions: 1) encoding the crowd's division of labor into de-individualized roles, much as movie crews or disaster response teams use roles to support coordination between on-demand workers who have not worked together before; and 2) reconfiguring these structures through a model inspired by version control, enabling continuous adaptation of the work and the division of labor. We report a deployment in which flash organizations successfully carried out open-ended and complex goals previously out of reach for crowdsourcing, including product design, software development, and game production. This research demonstrates digitally networked organizations that flexibly assemble and reassemble themselves from a globally distributed online workforce to accomplish complex work.

137 citations


Proceedings ArticleDOI
02 May 2017
TL;DR: This approach introduces theoretical grounding that can help address some of the most persistent questions in crowd work, and suggests design interventions that learn from history rather than repeat it.
Abstract: The internet is empowering the rise of crowd work, gig work, and other forms of on-demand labor. A large and growing body of scholarship has attempted to predict the socio-technical outcomes of this shift, especially addressing three questions: 1) What are the complexity limits of on-demand work?, 2) How far can work be decomposed into smaller microtasks?, and 3) What will work and the place of work look like for workers? In this paper, we look to the historical scholarship on piecework — a similar trend of work decomposition, distribution, and payment that was popular at the turn of the 20th century — to understand how these questions might play out with modern on-demand work. We identify the mechanisms that enabled and limited piecework historically, and identify whether on-demand work faces the same pitfalls or might differentiate itself. This approach introduces theoretical grounding that can help address some of the most persistent questions in crowd work, and suggests design interventions that learn from history rather than repeat it.

90 citations


Proceedings ArticleDOI
25 Feb 2017
TL;DR: This paper proposed a technique for achieving interdependent complex goals with crowds, where the crowd loops between reflection, to select a high-level goal and revision, to decompose that goal into low-level, actionable tasks.
Abstract: Crowdsourcing systems accomplish large tasks with scale and speed by breaking work down into independent parts. However, many types of complex creative work, such as fiction writing, have remained out of reach for crowds because work is tightly interdependent: changing one part of a story may trigger changes to the overall plot and vice versa. Taking inspiration from how expert authors write, we propose a technique for achieving interdependent complex goals with crowds. With this technique, the crowd loops between reflection, to select a high-level goal, and revision, to decompose that goal into low-level, actionable tasks. We embody this approach in Mechanical Novel, a system that crowdsources short fiction stories on Amazon Mechanical Turk. In a field experiment, Mechanical Novel resulted in higher-quality stories than an iterative crowdsourcing workflow. Our findings suggest that orienting crowd work around high-level goals may enable workers to coordinate their effort to accomplish complex work.

68 citations


Proceedings ArticleDOI
TL;DR: A predictive model of trolling behavior reveals that mood and discussion context together can explain trolling behavior better than an individual's history of trolling, and suggests that ordinary people can, under the right circumstances, behave like trolls.
Abstract: In online communities, antisocial behavior such as trolling disrupts constructive discussion. While prior work suggests that trolling behavior is confined to a vocal and antisocial minority, we demonstrate that ordinary people can engage in such behavior as well. We propose two primary trigger mechanisms: the individual's mood, and the surrounding context of a discussion (e.g., exposure to prior trolling behavior). Through an experiment simulating an online discussion, we find that both negative mood and seeing troll posts by others significantly increase the probability of a user trolling, and together double this probability. To support and extend these results, we study how these same mechanisms play out in the wild via a data-driven, longitudinal analysis of a large online news discussion community. This analysis reveals temporal mood effects, and explores long-range patterns of repeated exposure to trolling. A predictive model of trolling behavior shows that mood and discussion context together can explain trolling behavior better than an individual's history of trolling. These results combine to suggest that ordinary people can, under the right circumstances, behave like trolls.

67 citations


Journal ArticleDOI
06 Dec 2017
TL;DR: This paper uses an inductive mixed method approach to analyze behavior trace data, chat logs, survey responses and work artifacts to understand how workers enacted and adapted the crowdsourcing workflows, and indicates that complex work may remain beyond the reach of workflow-based crowdsourcing infrastructures.
Abstract: The dominant crowdsourcing infrastructure today is the workflow, which decomposes goals into small independent tasks. However, complex goals such as design and engineering have remained stubbornly difficult to achieve with crowdsourcing workflows. Is this due to a lack of imagination, or a more fundamental limit? This paper explores this question through in-depth case studies of 22 workers across six workflow-based crowd teams, each pursuing a complex and interdependent web development goal. We used an inductive mixed method approach to analyze behavior trace data, chat logs, survey responses and work artifacts to understand how workers enacted and adapted the crowdsourcing workflows. Our results indicate that workflows served as useful coordination artifacts, but in many cases critically inhibited crowd workers from pursuing real-time adaptations to their work plans. However, the CSCW and organizational behavior literature argues that all sufficiently complex goals require open-ended adaptation. If complex work requires adaptation but traditional static crowdsourcing workflows can't support it, our results suggest that complex work may remain beyond the reach of workflow-based crowdsourcing infrastructures.

59 citations


Proceedings ArticleDOI
25 Feb 2017
TL;DR: This research enables effective crowd teams with Huddler, a system for workers to assemble familiar teams even under unpredictable availability and strict time constraints, using a dynamic programming algorithm to optimize for highly familiar teammates when individual availability is unknown.
Abstract: Distributed, parallel crowd workers can accomplish simple tasks through workflows, but teams of collaborating crowd workers are necessary for complex goals. Unfortunately, a fundamental condition for effective teams -- familiarity with other members -- stands in contrast to crowd work's flexible, on-demand nature. We enable effective crowd teams with Huddler, a system for workers to assemble familiar teams even under unpredictable availability and strict time constraints. Huddler utilizes a dynamic programming algorithm to optimize for highly familiar teammates when individual availability is unknown. We first present a field experiment that demonstrates the value of familiarity for crowd teams: familiar crowd teams doubled the performance of ad-hoc (unfamiliar) teams on a collaborative task. We then report a two-week field deployment wherein Huddler enabled crowd workers to convene highly familiar teams in 18 minutes on average. This research advances the goal of supporting long-term, team-based collaborations without sacrificing the flexibility of crowd work.
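The team-assembly objective can be illustrated in simplified form. Huddler's actual algorithm is dynamic programming under unknown availability; the sketch below instead exhaustively maximizes total pairwise familiarity over a small pool, with toy familiarity counts as assumed data:

```python
from itertools import combinations

# Toy data: familiarity[(a, b)] = number of prior tasks the pair shared.
familiarity = {
    ("ann", "bob"): 5, ("ann", "cat"): 1, ("ann", "dan"): 0,
    ("bob", "cat"): 4, ("bob", "dan"): 2, ("cat", "dan"): 3,
}

def pair_score(a, b):
    return familiarity.get((a, b), familiarity.get((b, a), 0))

def most_familiar_team(available, size):
    """Exhaustively pick the team with the highest total pairwise familiarity.
    A simplification: Huddler itself must also handle unknown availability."""
    return max(combinations(sorted(available), size),
               key=lambda team: sum(pair_score(a, b)
                                    for a, b in combinations(team, 2)))

print(most_familiar_team(["ann", "bob", "cat", "dan"], 3))  # ('ann', 'bob', 'cat')
```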

55 citations


Proceedings ArticleDOI
25 Feb 2017
TL;DR: In this paper, the authors draw inspiration from historical worker guilds (e.g., in the silk trade) to design and implement crowd guilds: centralized groups of crowd workers who collectively certify each other's quality through double-blind peer assessment.
Abstract: Crowd workers are distributed and decentralized. While decentralization is designed to utilize independent judgment to promote high-quality results, it paradoxically undercuts behaviors and institutions that are critical to high-quality work. Reputation is one central example: crowdsourcing systems depend on reputation scores from decentralized workers and requesters, but these scores are notoriously inflated and uninformative. In this paper, we draw inspiration from historical worker guilds (e.g., in the silk trade) to design and implement crowd guilds: centralized groups of crowd workers who collectively certify each other's quality through double-blind peer assessment. A two-week field experiment compared crowd guilds to a traditional decentralized crowd work model. Crowd guilds produced reputation signals more strongly correlated with ground-truth worker quality than signals available on current crowd working platforms, and more accurate than in the traditional model.

52 citations


Proceedings ArticleDOI
20 Oct 2017
TL;DR: Crowd Research is presented, a crowdsourcing technique that coordinates open-ended research through an iterative cycle of open contribution, synchronous collaboration, and peer assessment, and introduces a decentralized credit system.
Abstract: Research experiences today are limited to a privileged few at select universities. Providing open access to research experiences would enable global upward mobility and increased diversity in the scientific workforce. How can we coordinate a crowd of diverse volunteers on open-ended research? How could a PI have enough visibility into each person's contributions to recommend them for further study? We present Crowd Research, a crowdsourcing technique that coordinates open-ended research through an iterative cycle of open contribution, synchronous collaboration, and peer assessment. To aid upward mobility and recognize contributions in publications, we introduce a decentralized credit system: participants allocate credits to each other, which a graph centrality algorithm translates into a collectively-created author order. Over 1,500 people from 62 countries have participated, 74% from institutions with low access to research. Over two years and three projects, this crowd has produced articles at top-tier Computer Science venues, and participants have gone on to leading graduate programs.
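The decentralized credit system can be sketched as a graph computation: participants allocate credits to one another, and a centrality score over the resulting graph yields an author order. The sketch below uses a simple PageRank-style iteration on toy data; the paper does not specify this exact algorithm, so treat both the data and the choice of centrality as assumptions:

```python
# Toy credit-allocation graph: giver -> {receiver: credits allocated}.
credits = {
    "alice": {"bob": 3, "carol": 1},
    "bob": {"alice": 2, "carol": 2},
    "carol": {"alice": 4},
}

def author_order(credits, damping=0.85, iterations=50):
    """Rank participants by a PageRank-style centrality over credit flows."""
    people = sorted(set(credits) | {r for d in credits.values() for r in d})
    rank = {p: 1 / len(people) for p in people}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(people) for p in people}
        for giver, given in credits.items():
            total = sum(given.values())
            for receiver, amount in given.items():
                # A giver's rank flows to receivers in proportion to credits.
                new[receiver] += damping * rank[giver] * amount / total
        rank = new
    return sorted(people, key=lambda p: -rank[p])

print(author_order(credits))  # ['alice', 'bob', 'carol']
```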

48 citations


Posted Content
TL;DR: Iris, as presented in this paper, is a conversational agent that draws on human conversational strategies to combine simple standalone commands, allowing it to perform more complex tasks that it has not been explicitly designed to support: for example, composing one command to "plot a histogram" with another to first "log-transform the data".
Abstract: Today's conversational agents are restricted to simple standalone commands. In this paper, we present Iris, an agent that draws on human conversational strategies to combine commands, allowing it to perform more complex tasks that it has not been explicitly designed to support: for example, composing one command to "plot a histogram" with another to first "log-transform the data". To enable this complexity, we introduce a domain specific language that transforms commands into automata that Iris can compose, sequence, and execute dynamically by interacting with a user through natural language, as well as a conversational type system that manages what kinds of commands can be combined. We have designed Iris to help users with data science tasks, a domain that requires support for command combination. In evaluation, we find that data scientists complete a predictive modeling task significantly faster (2.6 times speedup) with Iris than a modern non-conversational programming environment. Iris supports the same kinds of commands as today's agents, but empowers users to weave together these commands to accomplish complex goals.
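The idea of a conversational type system gating command composition can be sketched as follows. This is a hypothetical miniature, not Iris's actual DSL: commands carry input and output types, and two commands compose only when the types line up, echoing the histogram example from the abstract:

```python
import math

class Command:
    """A hypothetical typed command: composes only when types line up."""
    def __init__(self, name, in_type, out_type, fn):
        self.name, self.in_type, self.out_type, self.fn = name, in_type, out_type, fn

    def then(self, other):
        if self.out_type != other.in_type:
            raise TypeError(f"cannot pipe {self.name} ({self.out_type}) into "
                            f"{other.name} ({other.in_type})")
        return Command(f"{self.name} | {other.name}", self.in_type,
                       other.out_type, lambda x: other.fn(self.fn(x)))

log_transform = Command("log-transform the data", "numbers", "numbers",
                        lambda xs: [math.log(x) for x in xs])
histogram = Command("plot a histogram", "numbers", "counts",
                    lambda xs: {round(x): sum(1 for y in xs if round(y) == round(x))
                                for x in xs})

# Compose "log-transform the data" into "plot a histogram".
pipeline = log_transform.then(histogram)
print(pipeline.name)
print(pipeline.fn([1, 1, math.e]))  # counts per rounded log-value
```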

Proceedings ArticleDOI
25 Feb 2017
TL;DR: Mosaic, as presented in this paper, is an online community where illustrators share work-in-progress snapshots showing how an artwork was completed from start to finish, making the sharing of process, rather than the showcasing of outcomes, the main method of sharing creative work.
Abstract: Online creative communities allow creators to share their work with a large audience, maximizing opportunities to showcase their work and connect with fans and peers. However, sharing in-progress work can be technically and socially challenging in environments designed for sharing completed pieces. We propose an online creative community where sharing process, rather than showcasing outcomes, is the main method of sharing creative work. Based on this, we present Mosaic---an online community where illustrators share work-in-progress snapshots showing how an artwork was completed from start to finish. In an online deployment and observational study, artists used Mosaic as a vehicle for reflecting on how they can improve their own creative process, developed a social norm of detailed feedback, and became less apprehensive of sharing early versions of artwork. Through Mosaic, we argue that communities oriented around sharing creative process can create a collaborative environment that is beneficial for creative growth.


Proceedings ArticleDOI
25 Feb 2017
TL;DR: It is found that, contrary to these claims, workers are extremely stable in their quality over the entire period, and it is demonstrated that it is possible to predict workers' long-term quality using just a glimpse of their quality on the first five tasks.
Abstract: Microtask crowdsourcing is increasingly critical to the creation of extremely large datasets. As a result, crowd workers spend weeks or months repeating the exact same tasks, making it necessary to understand their behavior over these long periods of time. We utilize three large, longitudinal datasets of nine million annotations collected from Amazon Mechanical Turk to examine claims that workers fatigue or satisfice over these long periods, producing lower quality work. We find that, contrary to these claims, workers are extremely stable in their quality over the entire period. To understand whether workers set their quality based on the task's requirements for acceptance, we then perform an experiment where we vary the required quality for a large crowdsourcing task. Workers did not adjust their quality based on the acceptance threshold: workers who were above the threshold continued working at their usual quality level, and workers below the threshold self-selected themselves out of the task. Capitalizing on this consistency, we demonstrate that it is possible to predict workers' long-term quality using just a glimpse of their quality on the first five tasks.
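The "glimpse" prediction exploits the stability the paper reports: if quality barely drifts, early performance is already a good long-term estimate. A minimal sketch, with an assumed worker history and the simplest possible estimator (the mean of the first five task scores):

```python
# Toy sketch: predict long-term annotation quality from the first five tasks,
# relying on the stability of worker quality reported in the paper.
def predicted_quality(task_scores, glimpse=5):
    """Use the mean of the first `glimpse` task scores as the estimate."""
    first = task_scores[:glimpse]
    return sum(first) / len(first)

# Assumed history for a stable worker: 5 observed tasks, then 95 more.
history = [0.9, 1.0, 0.8, 0.9, 0.9] + [0.9] * 95

glimpse_estimate = predicted_quality(history)
long_term = sum(history) / len(history)
print(round(glimpse_estimate, 3), round(long_term, 3))  # the two agree closely
```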

Proceedings Article
03 May 2017
TL;DR: Analysis of comments from 10 Reddit subcommunities following an exogenous shock when each subcommunity was added to the default set for all Reddit users supports a narrative that the communities remain high-quality and similar to their previous selves even post-growth.
Abstract: Online communities have a love-hate relationship with membership growth: new members bring fresh perspectives, but old-timers worry that growth interrupts the community’s social dynamic and lowers content quality. To arbitrate these two theories, we analyze over 45 million comments from 10 Reddit subcommunities following an exogenous shock when each subcommunity was added to the default set for all Reddit users. Capitalizing on these natural experiments, we test for changes to the content vote patterns, linguistic patterns, and community network patterns before and after being defaulted. Results support a narrative that the communities remain high-quality and similar to their previous selves even post-growth. There is a temporary dip in upvote scores right after the communities were defaulted, but the communities quickly recover to pre-default or even higher levels. Likewise, complaints about low-quality posts do not rise in frequency after getting defaulted. Strong moderation also helps keep upvotes common and complaint levels low. Communities’ language use does not become more like the rest of Reddit after getting defaulted. However, growth does have some impact on attention: community members cluster their activity around a smaller proportion of posts after the community is defaulted.

Proceedings ArticleDOI
25 Feb 2017
TL;DR: This demo presents how Boomerang and Prototype Tasks, the fundamental building blocks of the Daemo crowdsourcing marketplace, help restore trust between workers and requesters.
Abstract: The success of crowdsourcing markets is dependent on a strong foundation of trust between workers and requesters. In current marketplaces, workers and requesters are often unable to trust each other's quality, and their mental models of tasks are misaligned due to ambiguous instructions or confusing edge cases. This breakdown of trust typically arises from (1) flawed reputation systems which do not accurately reflect worker and requester quality, and from (2) poorly designed tasks. In this demo, we present how Boomerang and Prototype Tasks, the fundamental building blocks of the Daemo crowdsourcing marketplace, help restore trust between workers and requesters. Daemo's Boomerang reputation system incentivizes alignment between opinion and ratings by determining the likelihood that workers and requesters will work together in the future based on how they rate each other. Daemo's Prototype tasks require that new tasks go through a feedback iteration phase with a small number of workers so that requesters can revise their instructions and task designs before launch.
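One way to picture Boomerang's incentive is that a worker's own ratings of requesters feed back into what that worker sees next. The sketch below is a hypothetical simplification (the function, fields, and ratings are invented for illustration): tasks from requesters the worker rated highly are ranked first in their feed:

```python
# Hypothetical sketch of a Boomerang-style feedback loop: how a worker rated
# requesters in the past determines how prominently their new tasks appear.
def ranked_task_feed(tasks, worker_ratings, default_rating=3):
    """Sort tasks so requesters this worker rated highly come first."""
    return sorted(tasks,
                  key=lambda t: -worker_ratings.get(t["requester"], default_rating))

tasks = [
    {"id": 1, "requester": "acme"},
    {"id": 2, "requester": "globex"},
    {"id": 3, "requester": "initech"},
]
ratings = {"acme": 2, "globex": 5}  # this worker rated globex highly, acme poorly

print([t["id"] for t in ranked_task_feed(tasks, ratings)])  # [2, 3, 1]
```

Because honest ratings directly shape future matches, a worker gains nothing by inflating a rating for a requester they would rather avoid.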

Proceedings ArticleDOI
01 Aug 2017
TL;DR: Empath is a tool that can generate and validate new lexical categories on demand from a small set of seed terms (like “bleed” and “punch” to generate the category violence) and is highly correlated with similar categories in LIWC.
Abstract: Human language is colored by a broad range of topics, but existing text analysis tools only focus on a small number of them. We present Empath, a tool that can generate and validate new lexical categories on demand from a small set of seed terms (like “bleed” and “punch” to generate the category violence). Empath draws connotations between words and phrases by learning a neural embedding across billions of words on the web. Given a small set of seed words that characterize a category, Empath uses its neural embedding to discover new related terms, then validates the category with a crowd-powered filter. Empath also analyzes text across 200 built-in, pre-validated categories we have generated such as neglect, government, and social media. We show that Empath’s data-driven, human validated categories are highly correlated (r=0.906) with similar categories in LIWC.
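Empath's category generation step, ranking candidate words by embedding similarity to the seed terms, can be sketched with toy vectors. The embeddings below are invented for illustration (the real tool learns them from billions of words on the web), and cosine similarity stands in for its similarity measure:

```python
import math

# Toy 3-d word embeddings; invented for illustration only.
embeddings = {
    "bleed": [0.9, 0.1, 0.0], "punch": [0.8, 0.2, 0.1],
    "fight": [0.85, 0.15, 0.05], "hug": [0.1, 0.9, 0.2],
    "vote": [0.0, 0.2, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def expand_category(seeds, top_n=1):
    """Rank non-seed words by mean cosine similarity to the seed terms."""
    candidates = [w for w in embeddings if w not in seeds]
    scored = sorted(candidates,
                    key=lambda w: -sum(cosine(embeddings[w], embeddings[s])
                                       for s in seeds) / len(seeds))
    return scored[:top_n]

# Seeds "bleed" and "punch" pull in the nearest neighbor for "violence".
print(expand_category({"bleed", "punch"}))  # ['fight']
```

In the full system, the discovered terms would then pass through the crowd-powered filter before joining the category.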

Proceedings ArticleDOI
02 May 2017
TL;DR: This work introduces MyriadHub, a mail client where users start conversations and then crowd workers extract underlying conversational patterns and rules to accelerate responses to future similar emails, and introduces techniques that exploit similarities across conversations to recycle relevant parts of previous conversations.
Abstract: Email has scaled our ability to communicate with large groups, but has not equivalently scaled our ability to listen and respond. For example, emailing many people for feedback requires either impersonal surveys or manual effort to hold many similar conversations. To scale personalized conversations, we introduce techniques that exploit similarities across conversations to recycle relevant parts of previous conversations. These techniques reduce the authoring burden, save senders' time, and maintain recipient engagement through personalized responses. We introduce MyriadHub, a mail client where users start conversations and then crowd workers extract underlying conversational patterns and rules to accelerate responses to future similar emails. In a within-subjects experiment comparing MyriadHub to existing mass email techniques, senders spent significantly less time planning events with MyriadHub. In a second experiment comparing MyriadHub to a standard email survey, MyriadHub doubled the recipients' response rate and tripled the number of words in their responses.

Posted Content
TL;DR: It is suggested that a simple and rapid iteration cycle can improve crowd work, and empirical evidence that requester "quality" directly impacts result quality is provided.
Abstract: Low-quality results have been a long-standing problem on microtask crowdsourcing platforms, driving away requesters and justifying low wages for workers. To date, workers have been blamed for low-quality results: they are said to make as little effort as possible, do not pay attention to detail, and lack expertise. In this paper, we hypothesize that requesters may also be responsible for low-quality work: they launch unclear task designs that confuse even earnest workers, under-specify edge cases, and neglect to include examples. We introduce prototype tasks, a crowdsourcing strategy requiring all new task designs to launch a small number of sample tasks. Workers attempt these tasks and leave feedback, enabling the requester to iterate on the design before publishing it. We report a field experiment in which tasks that underwent prototype task iteration produced higher-quality work results than the original task designs. With this research, we suggest that a simple and rapid iteration cycle can improve crowd work, and we provide empirical evidence that requester “quality” directly impacts result quality.

Proceedings ArticleDOI
06 May 2017
TL;DR: A User Benefit Scale is designed and validated to complement the User Burden Scale, and suggests that benefit is more predictive of mobile app usage than burden, and the model of app usage includes constructs from both the benefit and burden scales.
Abstract: How do mobile apps keep users coming back? Suh et al. proposed that the level of burden placed on a user has a negative effect on user retention. They developed the User Burden Scale, and showed that computing systems still in use had lower burdens than those that were abandoned. What is not captured is how the added benefits a system provides increase user retention. We hypothesize that both benefits and burdens of a mobile app predict usage. To show this, we design and validate a User Benefit Scale to complement the User Burden Scale, for the evaluation of benefits of mobile apps. Our scale consists of four constructs: if an app is 1) useful and informational, 2) enjoyable and enables pursuit of interests, 3) enables social interaction, and 4) has good usability and visual/interaction design. We administered the benefit and burden scales to 347 participants. Our results suggest that benefit is more predictive of mobile app usage than burden, and our model of app usage includes constructs from both the benefit and burden scales.

Proceedings ArticleDOI
25 Feb 2017
TL;DR: Methods for mobilizing collective social capital in sociotechnical systems, enabling an individual to ask a trusted group whether it is willing to invest its reputation in doing them a favor, are designed.
Abstract: Social costs and limited reach inhibit our use of social capital to solicit help. However, individuals are not the only holders of social capital: groups also possess reputations and social capital, and are often prepared to vouch for their own members. In this paper, we design methods for mobilizing this collective social capital in sociotechnical systems, enabling an individual to ask a trusted group whether it is willing to invest its reputation in doing them a favor. We instantiate this concept with Founder Center, a web platform in which members of a local entrepreneurship accelerator ask the accelerator community to collectively make them introductions to potential funders. In a field experiment, enabling access to collective social capital in this community nearly doubled the odds of members making a social capital request. Requests fulfilled utilizing collective social capital were at least as effective as ones utilizing traditional interpersonal social capital.