
Showing papers by "Michael S. Bernstein" published in 2019


Proceedings ArticleDOI
02 May 2019
TL;DR: It is argued that unlike street-level bureaucrats, who reflexively refine their decision criteria as they reason through a novel situation, street-level algorithms at best refine their criteria only after the decision is made, which results in illogical decisions when handling new or extenuating circumstances.
Abstract: Errors and biases are earning algorithms increasingly malignant reputations in society. A central challenge is that algorithms must bridge the gap between high-level policy and on-the-ground decisions, making inferences in novel situations where the policy or training data do not readily apply. In this paper, we draw on the theory of street-level bureaucracies, which describes how human bureaucrats such as police and judges interpret policy to make on-the-ground decisions. We present by analogy a theory of street-level algorithms, the algorithms that bridge the gaps between policy and decisions about people in a socio-technical system. We argue that unlike street-level bureaucrats, who reflexively refine their decision criteria as they reason through a novel situation, street-level algorithms at best refine their criteria only after the decision is made. This loop-and-a-half delay results in illogical decisions when handling new or extenuating circumstances. This theory suggests designs for street-level algorithms that draw on historical design patterns for street-level bureaucracies, including mechanisms for self-policing and recourse in the case of error.

97 citations


Proceedings Article
06 Sep 2019
TL;DR: This work establishes a gold standard human benchmark for generative realism by constructing Human eYe Perceptual Evaluation (HYPE), a human benchmark that is grounded in psychophysics research in perception, reliable across different sets of randomly sampled outputs from a model, able to produce separable model performances, and efficient in cost and time.
Abstract: Generative models often use human evaluations to measure the perceived quality of their outputs. Automated metrics are noisy indirect proxies, because they rely on heuristics or pretrained embeddings. However, up until now, direct human evaluation strategies have been ad-hoc, neither standardized nor validated. Our work establishes a gold standard human benchmark for generative realism. We construct Human eYe Perceptual Evaluation (HYPE), a human benchmark that is (1) grounded in psychophysics research in perception, (2) reliable across different sets of randomly sampled outputs from a model, (3) able to produce separable model performances, and (4) efficient in cost and time. We introduce two variants: one that measures visual perception under adaptive time constraints to determine the threshold at which a model's outputs appear real (e.g. 250ms), and the other a less expensive variant that measures human error rate on fake and real images sans time constraints. We test HYPE across six state-of-the-art generative adversarial networks and two sampling techniques on conditional and unconditional image generation using four datasets: CelebA, FFHQ, CIFAR-10, and ImageNet. We find that HYPE can track model improvements across training epochs, and we confirm via bootstrap sampling that HYPE rankings are consistent and replicable.

72 citations
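
The second HYPE variant described above (no time constraints) reduces to a simple human error-rate computation over a mix of real and generated images. Below is a minimal illustrative sketch of that score; the data structure and function names are ours, not the authors' implementation.

```python
# Minimal sketch of a HYPE_infinity-style score: the human error rate over
# a mix of real and generated images, with no time limit per judgment.
# Field and function names here are illustrative, not the authors' code.
from dataclasses import dataclass
from typing import List

@dataclass
class Judgment:
    is_real: bool        # ground truth: was the shown image real?
    judged_real: bool    # evaluator's response

def hype_infinity(judgments: List[Judgment]) -> float:
    """Fraction of judgments that were wrong, on a 0-100 scale.

    50 means evaluators are at chance (fakes are indistinguishable from
    real photos); higher values mean fakes fool people more often than
    reality does.
    """
    errors = sum(j.is_real != j.judged_real for j in judgments)
    return 100.0 * errors / len(judgments)

# Example: 3 of 8 responses were wrong -> score of 37.5
demo = [Judgment(True, True), Judgment(True, False), Judgment(False, False),
        Judgment(False, True), Judgment(True, True), Judgment(False, False),
        Judgment(False, True), Judgment(True, True)]
print(hype_infinity(demo))
```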


Proceedings Article
28 Oct 2019
TL;DR: Fair Work is introduced, enabling requesters to automatically pay their workers minimum wage by adding a one-line script tag to their task HTML on Amazon Mechanical Turk, and aims to lower the threshold for pro-social work practices in microtask marketplaces.
Abstract: Accurate task pricing in microtask marketplaces requires substantial effort via trial and error, contributing to a pattern of worker underpayment. In response, we introduce Fair Work, enabling requesters to automatically pay their workers minimum wage by adding a one-line script tag to their task HTML on Amazon Mechanical Turk. Fair Work automatically surveys workers to find out how long the task takes, then aggregates those self-reports and auto-bonuses workers up to a minimum wage if needed. Evaluations demonstrate that the system estimates payments more accurately than requesters and that worker time surveys are close to behaviorally observed time measurements. With this work, we aim to lower the threshold for pro-social work practices in microtask marketplaces.

71 citations
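
A minimal sketch of the payment rule described above, assuming a median aggregator over worker self-reports; the function name, the choice of median, and the $15/hour default are illustrative and not necessarily the deployed system's exact behavior.

```python
# Illustrative Fair Work-style bonus rule: aggregate worker self-reported
# task durations, then bonus each assignment up to an effective minimum
# wage. The median aggregator and defaults are our assumptions.
from statistics import median

def fair_work_bonus(self_reported_minutes, base_pay_usd,
                    min_wage_usd_per_hour=15.0):
    """Return the per-assignment bonus needed to reach the minimum wage."""
    est_minutes = median(self_reported_minutes)      # robust to outliers
    target_pay = min_wage_usd_per_hour * est_minutes / 60.0
    return max(0.0, round(target_pay - base_pay_usd, 2))

# Example: workers report ~6 minutes, the task paid $0.80 up front.
print(fair_work_bonus([5, 6, 7, 6, 30], base_pay_usd=0.80))  # -> 0.70
```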


Posted Content
TL;DR: This paper introduces a semi-supervised method that assigns probabilistic relationship labels to a large number of unlabeled images using few labeled examples and defines a complexity metric for relationships that serves as an indicator for conditions under which the method succeeds over transfer learning, the de-facto approach for training with limited labels.
Abstract: Visual knowledge bases such as Visual Genome power numerous applications in computer vision, including visual question answering and captioning, but suffer from sparse, incomplete relationships. All scene graph models to date are limited to training on a small set of visual relationships that have thousands of training labels each. Hiring human annotators is expensive, and textual knowledge base completion methods are incompatible with visual data. In this paper, we introduce a semi-supervised method that assigns probabilistic relationship labels to a large number of unlabeled images using few labeled examples. We analyze visual relationships to suggest two types of image-agnostic features that are used to generate noisy heuristics, whose outputs are aggregated using a factor graph-based generative model. With as few as 10 labeled examples per relationship, the generative model creates enough training data to train any existing state-of-the-art scene graph model. We demonstrate that our method outperforms all baseline approaches on scene graph prediction by 5.16 recall@100 for PREDCLS. In our limited label setting, we define a complexity metric for relationships that serves as an indicator (R^2 = 0.778) for conditions under which our method succeeds over transfer learning, the de-facto approach for training with limited labels.

65 citations
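
The aggregation step above, where noisy heuristic outputs are combined into probabilistic labels, can be approximated with a simple accuracy-weighted vote; the paper's actual system fits a factor-graph generative model over the heuristic outputs, so the sketch below is only a simplified stand-in.

```python
# Simplified stand-in for the aggregation step: several noisy heuristics
# (e.g. spatial and categorical rules) each vote on whether an unlabeled
# image contains a relationship, and the votes are combined into a
# probabilistic label. The accuracy-weighted vote below approximates, but
# does not reproduce, the paper's factor-graph generative model.
import math

def probabilistic_label(votes, accuracies):
    """votes: +1 (present), -1 (absent), 0 (abstain) from each heuristic.
    accuracies: estimated accuracy of each heuristic in (0.5, 1.0).
    Returns P(relationship present)."""
    log_odds = 0.0
    for v, a in zip(votes, accuracies):
        if v != 0:
            log_odds += v * math.log(a / (1.0 - a))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Two heuristics say "present", one abstains, one says "absent".
print(probabilistic_label([+1, +1, 0, -1], [0.8, 0.7, 0.6, 0.55]))
```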


Proceedings ArticleDOI
27 Mar 2019
TL;DR: It is argued that a good question is one that has a tightly focused purpose --- one that is aimed at expecting a specific type of response, and a model is built that maximizes mutual information between the image, the expected answer and the generated question.
Abstract: Though image-to-sequence generation models have become overwhelmingly popular in human-computer communications, they suffer from strongly favoring safe generic questions ("What is in this picture?"). Generating uninformative but relevant questions is not sufficient or useful. We argue that a good question is one that has a tightly focused purpose --- one that is aimed at expecting a specific type of response. We build a model that maximizes mutual information between the image, the expected answer and the generated question. To overcome the non-differentiability of discrete natural language tokens, we introduce a variational continuous latent space onto which the expected answers project. We regularize this latent space with a second latent space that ensures clustering of similar answers. Even when we don't know the expected answer, this second latent space can generate goal-driven questions specifically aimed at extracting objects ("what is the person throwing"), attributes ("What kind of shirt is the person wearing?"), color ("what color is the frisbee?"), material ("What material is the frisbee?"), etc. We quantitatively show that our model is able to retain information about an expected answer category, resulting in more diverse, goal-driven questions. We launch our model on a set of real world images and extract previously unseen visual concepts.

48 citations
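
A structural sketch, under our reading of the abstract, of how the expected answer might project onto a variational latent that conditions the question decoder; the module choices, dimensions, and GRU decoder are illustrative assumptions, not the paper's architecture, and the second (clustering) latent space is omitted.

```python
# Structural sketch (not the authors' code): the expected answer is
# projected onto a continuous variational latent z, and the question
# decoder conditions on both z and image features.
import torch
import torch.nn as nn

class GoalDrivenQuestionGenerator(nn.Module):
    def __init__(self, img_dim=2048, ans_vocab=1000, q_vocab=1000,
                 latent_dim=64, hidden=256):
        super().__init__()
        self.answer_enc = nn.Sequential(nn.Embedding(ans_vocab, hidden),
                                        nn.Linear(hidden, 2 * latent_dim))
        self.decoder_init = nn.Linear(img_dim + latent_dim, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, q_vocab)

    def forward(self, img_feat, answer_ids, question_emb):
        # Project the expected answer onto a variational latent z.
        stats = self.answer_enc(answer_ids).mean(dim=1)   # (B, 2*latent_dim)
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        # Decode a question conditioned on the image and the answer latent.
        h0 = torch.tanh(self.decoder_init(torch.cat([img_feat, z], dim=-1)))
        out, _ = self.decoder(question_emb, h0.unsqueeze(0))
        return self.out(out), mu, logvar
```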


Proceedings ArticleDOI
01 Oct 2019
TL;DR: This paper proposed a semi-supervised method that assigns probabilistic relationship labels to a large number of unlabeled images using few labeled examples, whose outputs are aggregated using a factor graph-based generative model.
Abstract: Visual knowledge bases such as Visual Genome power numerous applications in computer vision, including visual question answering and captioning, but suffer from sparse, incomplete relationships. All scene graph models to date are limited to training on a small set of visual relationships that have thousands of training labels each. Hiring human annotators is expensive, and textual knowledge base completion methods are incompatible with visual data. In this paper, we introduce a semi-supervised method that assigns probabilistic relationship labels to a large number of unlabeled images using few labeled examples. We analyze visual relationships to suggest two types of image-agnostic features that are used to generate noisy heuristics, whose outputs are aggregated using a factor graph-based generative model. With as few as 10 labeled examples per relationship, the generative model creates enough training data to train any existing state-of-the-art scene graph model. We demonstrate that our method outperforms all baseline approaches on scene graph prediction by 5.16 recall@100 for PREDCLS. In our limited label setting, we define a complexity metric for relationships that serves as an indicator (R^2 = 0.778) for conditions under which our method succeeds over transfer learning, the de-facto approach for training with limited labels.

47 citations


Proceedings ArticleDOI
01 Oct 2019
TL;DR: This work introduces the first scene graph prediction model that supports few-shot learning of predicates, enabling scene graph approaches to generalize to a set of new predicates.
Abstract: Scene graph prediction — classifying the set of objects and predicates in a visual scene — requires substantial training data. The long-tailed distribution of relationships can be an obstacle for such approaches, however, as they can only be trained on the small set of predicates that carry sufficient labels. We introduce the first scene graph prediction model that supports few-shot learning of predicates, enabling scene graph approaches to generalize to a set of new predicates. First, we introduce a new model of predicates as functions that operate on object features or image locations. Next, we define a scene graph model where these functions are trained as message passing protocols within a new graph convolution framework. We train the framework with a frequently occurring set of predicates and show that our approach outperforms those that use the same amount of supervision by 1.78 at recall@50 and performs on par with other scene graph models. Next, we extract object representations generated by the trained predicate functions to train few-shot predicate classifiers on rare predicates with as few as 1 labeled example. When compared to strong baselines like transfer learning from existing state-of-the-art representations, we show improved 5-shot performance by 4.16 recall@1. Finally, we show that our predicate functions generate interpretable visualizations, enabling the first interpretable scene graph model.

39 citations
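
A simplified sketch of the "predicates as functions" idea above: each predicate is a small learned transformation over object features, and a triple is scored by how close the transformed subject lands to the candidate object. The paper trains these functions as message-passing protocols within a graph convolution framework; the standalone scorer below is only an illustration of the core idea.

```python
# Illustrative "predicates as functions" sketch: applying a learned
# `riding` transformation to a subject embedding should land near
# embeddings of objects that afford riding. This simplifies the paper's
# message-passing graph convolution; it is not its actual architecture.
import torch
import torch.nn as nn

class PredicateFunction(nn.Module):
    def __init__(self, obj_dim=512):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Linear(obj_dim, obj_dim), nn.ReLU(), nn.Linear(obj_dim, obj_dim))

    def forward(self, subject_feat):
        return self.transform(subject_feat)

def predicate_score(pred_fn, subject_feat, object_feat):
    """Score a (subject, predicate, object) triple by how close the
    transformed subject lands to the candidate object embedding."""
    return torch.cosine_similarity(pred_fn(subject_feat), object_feat, dim=-1)

pred_riding = PredicateFunction()
subj, obj = torch.randn(1, 512), torch.randn(1, 512)
print(predicate_score(pred_riding, subj, obj))
```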


Posted Content
TL;DR: This paper proposed a model that maximizes mutual information between the image, the expected answer, and the generated question, regularizing the model's variational latent space with a second latent space that clusters similar answers in order to generate more diverse, goal-driven questions.
Abstract: Though image-to-sequence generation models have become overwhelmingly popular in human-computer communications, they suffer from strongly favoring safe generic questions ("What is in this picture?"). Generating uninformative but relevant questions is not sufficient or useful. We argue that a good question is one that has a tightly focused purpose --- one that is aimed at expecting a specific type of response. We build a model that maximizes mutual information between the image, the expected answer and the generated question. To overcome the non-differentiability of discrete natural language tokens, we introduce a variational continuous latent space onto which the expected answers project. We regularize this latent space with a second latent space that ensures clustering of similar answers. Even when we don't know the expected answer, this second latent space can generate goal-driven questions specifically aimed at extracting objects ("what is the person throwing"), attributes ("What kind of shirt is the person wearing?"), color ("what color is the frisbee?"), material ("What material is the frisbee?"), etc. We quantitatively show that our model is able to retain information about an expected answer category, resulting in more diverse, goal-driven questions. We launch our model on a set of real world images and extract previously unseen visual concepts.

31 citations


Journal ArticleDOI
07 Nov 2019
TL;DR: It is found that, for some tasks, team fracture can be strongly influenced by interactions in the first moments of a team's collaboration, and that interventions targeting these initial moments may be critical to scaffolding long-lasting teams.
Abstract: Was a problematic team always doomed to frustration, or could it have ended another way? In this paper, we study the consistency of team fracture: a loss of team viability so severe that the team no longer wants to work together. Understanding whether team fracture is driven by the membership of the team, or by how their collaboration unfolded, motivates the design of interventions that either identify compatible teammates or ensure effective early interactions. We introduce an online experiment that reconvenes the same team without members realizing that they have worked together before, enabling us to temporarily erase previous team dynamics. Participants in our study completed a series of tasks across multiple teams, including one reconvened team, and privately blacklisted any teams that they would not want to work with again. We identify fractured teams as those blacklisted by half the members. We find that reconvened teams are strikingly polarized by task in the consistency of their fracture outcomes. On a creative task, teams might as well have been a completely different set of people: the same teams changed their fracture outcomes at a random chance rate. On a cognitive conflict and on an intellective task, the team instead replayed the same dynamics without realizing it, rarely changing their fracture outcomes. These results indicate that, for some tasks, team fracture can be strongly influenced by interactions in the first moments of a team's collaboration, and that interventions targeting these initial moments may be critical to scaffolding long-lasting teams.

24 citations
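
A minimal sketch of the fracture definitions above: a team fractures when at least half its members blacklist it, and consistency is whether the reconvened team reproduces that outcome. The function names are ours, not the study's analysis code.

```python
# Illustrative encoding of the fracture outcome and its consistency
# across the original and reconvened sessions of the same team.
def fractured(blacklist_votes, team_size):
    """A team fractures when at least half its members blacklist it."""
    return blacklist_votes >= team_size / 2

def outcome_changed(votes_first, votes_reconvened, team_size):
    """Did the reconvened team flip its fracture outcome?"""
    return fractured(votes_first, team_size) != fractured(votes_reconvened, team_size)

# A 4-person team: 2 blacklists the first time, 0 when reconvened.
print(outcome_changed(2, 0, team_size=4))  # True: the outcome flipped
```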


Proceedings ArticleDOI
TL;DR: Boomerang as mentioned in this paper is a reputation system for crowdsourcing that elicits more accurate feedback by rebounding the consequences of feedback directly back onto the person who gave it, inspired by a game-theoretic notion of incentive-compatibility.
Abstract: Paid crowdsourcing platforms suffer from low-quality work and unfair rejections, but paradoxically, most workers and requesters have high reputation scores. These inflated scores, which make high-quality work and workers difficult to find, stem from social pressure to avoid giving negative feedback. We introduce Boomerang, a reputation system for crowdsourcing that elicits more accurate feedback by rebounding the consequences of feedback directly back onto the person who gave it. With Boomerang, requesters find that their highly-rated workers gain earliest access to their future tasks, and workers find tasks from their highly-rated requesters at the top of their task feed. Field experiments verify that Boomerang causes both workers and requesters to provide feedback that is more closely aligned with their private opinions. Inspired by a game-theoretic notion of incentive-compatibility, Boomerang opens opportunities for interaction design to incentivize honest reporting over strategic dishonesty.

22 citations
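
A toy sketch of the feedback-rebounding idea described above: the rating a worker gives a requester reorders that worker's own task feed, so an inflated rating comes back as a worse-ranked feed. Data shapes and defaults are illustrative, not the deployed system's.

```python
# Illustrative Boomerang-style feed ordering: each worker's task feed is
# sorted by the rating that worker previously gave each requester.
def rank_task_feed(tasks, my_ratings, default_rating=3.0):
    """tasks: list of (task_id, requester_id); my_ratings: requester -> 1..5."""
    return sorted(tasks,
                  key=lambda t: my_ratings.get(t[1], default_rating),
                  reverse=True)

feed = rank_task_feed([("t1", "reqA"), ("t2", "reqB"), ("t3", "reqC")],
                      my_ratings={"reqA": 2.0, "reqB": 5.0})
print(feed)  # tasks from reqB (rated 5) appear first
```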


Proceedings ArticleDOI
02 May 2019
TL;DR: The results of an experiment using HabitLab suggest that any conservation of procrastination effect is minimal, and that behavior change designers may target individual productivity goals without causing substantial negative second-order effects.
Abstract: Productivity behavior change systems help us reduce our time on unproductive activities. However, is that time actually saved, or is it just redirected to other unproductive activities? We report an experiment using HabitLab, a behavior change browser extension and phone application, that manipulated the frequency of interventions on a focal goal and measured the effects on time spent on other applications and platforms. We find that, when intervention frequency increases on the focal goal, time spent on other applications is held constant or even reduced. Likewise, we find that time is not redistributed across platforms from browser to mobile phone or vice versa. These results suggest that any conservation of procrastination effect is minimal, and that behavior change designers may target individual productivity goals without causing substantial negative second-order effects.

Posted Content
01 Apr 2019
TL;DR: The authors' human evaluation metric, HYPE, consistently distinguishes models from each other; they use it to compare StyleGAN, ProGAN, BEGAN, and WGAN-GP on CelebA, and StyleGAN with and without truncation trick sampling on FFHQ.
Abstract: Generative models often use human evaluations to measure the perceived quality of their outputs. Automated metrics are noisy indirect proxies, because they rely on heuristics or pretrained embeddings. However, up until now, direct human evaluation strategies have been ad-hoc, neither standardized nor validated. Our work establishes a gold standard human benchmark for generative realism. We construct Human eYe Perceptual Evaluation (HYPE), a human benchmark that is (1) grounded in psychophysics research in perception, (2) reliable across different sets of randomly sampled outputs from a model, (3) able to produce separable model performances, and (4) efficient in cost and time. We introduce two variants: one that measures visual perception under adaptive time constraints to determine the threshold at which a model's outputs appear real (e.g. 250ms), and the other a less expensive variant that measures human error rate on fake and real images sans time constraints. We test HYPE across six state-of-the-art generative adversarial networks and two sampling techniques on conditional and unconditional image generation using four datasets: CelebA, FFHQ, CIFAR-10, and ImageNet. We find that HYPE can track model improvements across training epochs, and we confirm via bootstrap sampling that HYPE rankings are consistent and replicable.

Posted Content
12 Jun 2019
TL;DR: In this article, a scene graph prediction model is proposed that supports few-shot learning of predicates, enabling scene graph approaches to generalize to a set of new predicates.
Abstract: Scene graph prediction --- classifying the set of objects and predicates in a visual scene --- requires substantial training data. The long-tailed distribution of relationships can be an obstacle for such approaches, however, as they can only be trained on the small set of predicates that carry sufficient labels. We introduce the first scene graph prediction model that supports few-shot learning of predicates, enabling scene graph approaches to generalize to a set of new predicates. First, we introduce a new model of predicates as functions that operate on object features or image locations. Next, we define a scene graph model where these functions are trained as message passing protocols within a new graph convolution framework. We train the framework with a frequently occurring set of predicates and show that our approach outperforms those that use the same amount of supervision by 1.78 at recall@50 and performs on par with other scene graph models. Next, we extract object representations generated by the trained predicate functions to train few-shot predicate classifiers on rare predicates with as few as 1 labeled example. When compared to strong baselines like transfer learning from existing state-of-the-art representations, we show improved 5-shot performance by 4.16 recall@1. Finally, we show that our predicate functions generate interpretable visualizations, enabling the first interpretable scene graph model.

Proceedings Article
28 Oct 2019
TL;DR: A new technique is introduced that augments questions with ML-based request strategies drawn from social psychology, along with a contextual bandit algorithm that selects which strategy to apply for a given task and contributor.
Abstract: To support the massive data requirements of modern supervised machine learning (ML) algorithms, crowdsourcing systems match volunteer contributors to appropriate tasks. Such systems learn what types of tasks contributors are interested in completing. In this paper, instead of focusing on what to ask, we focus on learning how to ask: how to make relevant and interesting requests to encourage crowdsourcing participation. We introduce a new technique that augments questions with ML-based request strategies drawn from social psychology. We also introduce a contextual bandit algorithm to select which strategy to apply for a given task and contributor. We deploy our approach to collect volunteer data from Instagram for the task of visual question answering (VQA), an important task in computer vision and natural language processing that has enabled numerous human-computer interaction applications. For example, when encountering a user’s Instagram post that contains the ornate Trevi Fountain in Rome, our approach learns to augment its original raw question “Where is this place?” with image-relevant compliments such as “What a great statue!” or with travel-relevant justifications such as “I would like to visit this place”, increasing the user’s likelihood of answering the question and thus providing a label. We deploy our agent on Instagram to ask questions about social media images, finding that the response rate improves from 15.8% with unaugmented questions to 30.54% with baseline rule-based strategies and to 58.1% with ML-based strategies.
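
A simplified epsilon-greedy stand-in for the strategy-selection loop described above; the real agent uses a contextual bandit over richer task and contributor features, so the strategy names, context key, and update rule here are illustrative only.

```python
# Simplified epsilon-greedy loop: pick a request strategy (plain question,
# compliment, justification, ...) per context and update its estimated
# response rate from whether the user answered.
import random
from collections import defaultdict

class StrategyBandit:
    def __init__(self, strategies, epsilon=0.1):
        self.strategies = strategies
        self.epsilon = epsilon
        self.counts = defaultdict(int)
        self.value = defaultdict(float)   # running mean response rate per (context, strategy)

    def select(self, context):
        if random.random() < self.epsilon:
            return random.choice(self.strategies)
        return max(self.strategies, key=lambda s: self.value[(context, s)])

    def update(self, context, strategy, answered):
        key = (context, strategy)
        self.counts[key] += 1
        self.value[key] += (float(answered) - self.value[key]) / self.counts[key]

bandit = StrategyBandit(["plain", "compliment", "justification"])
arm = bandit.select(context="travel_photo")
bandit.update("travel_photo", arm, answered=True)
```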

Posted Content
TL;DR: The Human eYe Perceptual Evaluation (HYPE) as mentioned in this paper is a human benchmark for generative realism that is grounded in psychophysics research in perception, reliable across different sets of randomly sampled outputs from a model and efficient in cost and time.
Abstract: Generative models often use human evaluations to measure the perceived quality of their outputs. Automated metrics are noisy indirect proxies, because they rely on heuristics or pretrained embeddings. However, up until now, direct human evaluation strategies have been ad-hoc, neither standardized nor validated. Our work establishes a gold standard human benchmark for generative realism. We construct Human eYe Perceptual Evaluation (HYPE), a human benchmark that is (1) grounded in psychophysics research in perception, (2) reliable across different sets of randomly sampled outputs from a model, (3) able to produce separable model performances, and (4) efficient in cost and time. We introduce two variants: one that measures visual perception under adaptive time constraints to determine the threshold at which a model's outputs appear real (e.g. 250ms), and the other a less expensive variant that measures human error rate on fake and real images sans time constraints. We test HYPE across six state-of-the-art generative adversarial networks and two sampling techniques on conditional and unconditional image generation using four datasets: CelebA, FFHQ, CIFAR-10, and ImageNet. We find that HYPE can track model improvements across training epochs, and we confirm via bootstrap sampling that HYPE rankings are consistent and replicable.

Posted Content
TL;DR: This paper proposes a new paradigm that estimates uncertainty in the model's internal hidden space instead of the model's output space, and builds a visual-semantic space that embeds paraphrases close together for any existing VQA model.
Abstract: Typical active learning strategies are designed for tasks, such as classification, with the assumption that the output space is mutually exclusive. The assumption that these tasks always have exactly one correct answer has resulted in the creation of numerous uncertainty-based measurements, such as entropy and least confidence, which operate over a model's outputs. Unfortunately, many real-world vision tasks, like visual question answering and image captioning, have multiple correct answers, causing these measurements to overestimate uncertainty and sometimes perform worse than a random sampling baseline. In this paper, we propose a new paradigm that estimates uncertainty in the model's internal hidden space instead of the model's output space. We specifically study a manifestation of this problem for visual question answer generation (VQA), where the aim is not to classify the correct answer but to produce a natural language answer, given an image and a question. Our method overcomes the paraphrastic nature of language. It requires a semantic space that structures the model's output concepts and that enables the usage of techniques like dropout-based Bayesian uncertainty. We build a visual-semantic space that embeds paraphrases close together for any existing VQA model. We empirically show state-of-the-art active learning results on the task of VQA on two datasets, being 5 times more cost-efficient on Visual Genome and 3 times more cost-efficient on VQA 2.0.
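
A minimal sketch of estimating uncertainty in the hidden space via Monte Carlo dropout: run several stochastic forward passes per (image, question), embed each sampled answer in a visual-semantic space, and use the spread of those embeddings as the acquisition score. The model, embedding function, and toy demo below are placeholders, not the paper's implementation.

```python
# Illustrative hidden-space uncertainty: dropout stays active at inference
# (MC dropout), and the spread of sampled answer embeddings in a semantic
# space is the acquisition score. Paraphrases of the same answer embed
# close together, so only genuine uncertainty inflates the score.
import torch

def hidden_space_uncertainty(embed_answer_fn, vqa_model, image, question,
                             n_samples=20):
    vqa_model.train()  # keep dropout active at inference time
    with torch.no_grad():
        embs = torch.stack([embed_answer_fn(vqa_model(image, question))
                            for _ in range(n_samples)])
    # Mean distance to the centroid of the sampled embeddings.
    return (embs - embs.mean(dim=0)).norm(dim=-1).mean().item()

# Toy demo with a stand-in model (identity embedding, dropout inside).
class _ToyVQA(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(8, 8)
        self.drop = torch.nn.Dropout(0.5)
    def forward(self, image, question):
        return self.drop(self.fc(image))

print(hidden_space_uncertainty(lambda a: a, _ToyVQA(),
                               torch.randn(1, 8), question=None))
```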

Posted Content
TL;DR: This paper focuses on the evaluation of a conditional generative model that illustrates the consequences of climate change-induced flooding to encourage public interest and awareness on the issue, and proposes several automated and human-based methods for evaluation.
Abstract: With success on controlled tasks, generative models are being increasingly applied to humanitarian applications [1,2]. In this paper, we focus on the evaluation of a conditional generative model that illustrates the consequences of climate change-induced flooding to encourage public interest and awareness on the issue. Because metrics for comparing the realism of different modes in a conditional generative model do not exist, we propose several automated and human-based methods for evaluation. To do this, we adapt several existing metrics, and assess the automated metrics against gold standard human evaluation. We find that using Frechet Inception Distance (FID) with embeddings from an intermediary Inception-V3 layer that precedes the auxiliary classifier produces results most correlated with human realism. While insufficient alone to establish a human-correlated automatic evaluation metric, we believe this work begins to bridge the gap between human and automated generative evaluation procedures.
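
For reference, the standard Frechet Inception Distance computation over feature activations; per the finding above, the activations would be taken from the intermediate Inception-V3 layer preceding the auxiliary classifier rather than the usual pooling features. The feature extraction itself is assumed to happen upstream.

```python
# Standard FID between two sets of feature activations (real vs. generated).
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """feats_*: (N, D) arrays of activations for real and generated images."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Toy demo with random "activations".
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(256, 64)),
                       rng.normal(loc=0.5, size=(256, 64))))
```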

Posted Content
TL;DR: This work introduces a framework that induces object representations structured according to their visual relationships, supporting few-shot learning of predicates and achieving a 5-shot performance increase over strong transfer learning baselines.
Abstract: Scene graph prediction --- classifying the set of objects and predicates in a visual scene --- requires substantial training data. However, most predicates only occur a handful of times making them difficult to learn. We introduce the first scene graph prediction model that supports few-shot learning of predicates. Existing scene graph generation models represent objects using pretrained object detectors or word embeddings that capture semantic object information at the cost of encoding information about which relationships they afford. So, these object representations are unable to generalize to new few-shot relationships. We introduce a framework that induces object representations that are structured according to their visual relationships. Unlike past methods, our framework embeds objects that afford similar relationships closer together. This property allows our model to perform well in the few-shot setting. For example, applying the 'riding' predicate transformation to 'person' modifies the representation towards objects like 'skateboard' and 'horse' that enable riding. We generate object representations by learning predicates trained as message passing functions within a new graph convolution framework. The object representations are used to build few-shot predicate classifiers for rare predicates with as few as 1 labeled example. We achieve a 5-shot performance of 22.70 recall@50, a 3.7 increase when compared to strong transfer learning baselines.

Journal ArticleDOI
TL;DR: The authors developed a conceptual framework and a novel web-based mechanism for observing consideration, and used them to study consideration among an entire cohort of students at a private university between 2016 and 2018.
Abstract: In elective curriculums, undergraduates are encouraged to consider a range of academic courses for possible enrollment each term, yet course consideration has not been explicitly theorized and is difficult to observe. We develop a conceptual framework and a novel web-based mechanism for observing consideration, and use them to study consideration among an entire cohort of students at a private university between 2016 and 2018. Our findings reveal (1) substantial winnowing from available to considered courses; (2) homogeneous consideration set sizes regardless of students’ subsequent majors; and (3) heterogeneous consideration set compositions correlated with subsequent majors. Our work demonstrates that course consideration is an empirically demonstrable component of course selection and suggests mechanisms for intervening in consideration to support informed choice and efficient academic progress.

Proceedings Article
27 Mar 2019
TL;DR: The Human eYe Perceptual Evaluation (HYPE) as discussed by the authors is a human benchmark for generative realism that is grounded in psychophysics research in perception, reliable across different sets of randomly sampled outputs from a model and efficient in cost and time.
Abstract: Generative models often use human evaluations to measure the perceived quality of their outputs. Automated metrics are noisy indirect proxies, because they rely on heuristics or pretrained embeddings. However, up until now, direct human evaluation strategies have been ad-hoc, neither standardized nor validated. Our work establishes a gold standard human benchmark for generative realism. We construct Human eYe Perceptual Evaluation (HYPE), a human benchmark that is (1) grounded in psychophysics research in perception, (2) reliable across different sets of randomly sampled outputs from a model, (3) able to produce separable model performances, and (4) efficient in cost and time. We introduce two variants: one that measures visual perception under adaptive time constraints to determine the threshold at which a model's outputs appear real (e.g. 250ms), and the other a less expensive variant that measures human error rate on fake and real images sans time constraints. We test HYPE across six state-of-the-art generative adversarial networks and two sampling techniques on conditional and unconditional image generation using four datasets: CelebA, FFHQ, CIFAR-10, and ImageNet. We find that HYPE can track model improvements across training epochs, and we confirm via bootstrap sampling that HYPE rankings are consistent and replicable.

Proceedings ArticleDOI
02 May 2019
TL;DR: Eevee is presented, an image-editing system that empowers users to transform images by specifying intents in terms of high-level themes and introduces an optimization function that balances semantic plausibility, visual plausibility and theme relevance to surface possible image edits.
Abstract: There is a significant gap between the high-level, semantic manner in which we reason about image edits and the low-level, pixel-oriented way in which we execute these edits. While existing image-editing tools provide a great deal of flexibility for professionals, they can be disorienting to novice editors because of the gap between a user's goals and the unfamiliar operations needed to actualize them. We present Eevee, an image-editing system that empowers users to transform images by specifying intents in terms of high-level themes. Based on a provided theme and an understanding of the objects and relationships in the original image, we introduce an optimization function that balances semantic plausibility, visual plausibility, and theme relevance to surface possible image edits. A formative evaluation finds that we are able to guide users to meet their goals while helping them to explore novel, creative ideas for their image edit.
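
A schematic sketch of the kind of scoring described above: each candidate edit is ranked by a weighted combination of semantic plausibility, visual plausibility, and theme relevance. The weights and scorer signatures are illustrative; the paper's actual optimization function may combine terms differently.

```python
# Illustrative ranking of candidate image edits by a weighted sum of the
# three terms named above; the scorers are supplied by the caller.
def score_edit(edit, theme, scorers, weights=(1.0, 1.0, 1.0)):
    """scorers: (semantic_plausibility, visual_plausibility, theme_relevance)
    callables supplied by the caller; each returns a value in [0, 1]."""
    w_sem, w_vis, w_theme = weights
    semantic, visual, relevance = scorers
    return (w_sem * semantic(edit) + w_vis * visual(edit)
            + w_theme * relevance(edit, theme))

def best_edits(candidates, theme, scorers, k=5):
    return sorted(candidates,
                  key=lambda e: score_edit(e, theme, scorers),
                  reverse=True)[:k]

# Toy usage with stand-in scorers.
scorers = (lambda e: 0.9, lambda e: 0.7, lambda e, t: 0.8)
print(score_edit({"object": "pumpkin"}, "halloween", scorers))  # 2.4
```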

Posted Content
TL;DR: Boomerang as discussed by the authors is a reputation system for crowdsourcing that elicits more accurate feedback by rebounding the consequences of feedback directly back onto the person who gave it, inspired by a game-theoretic notion of incentive-compatibility.
Abstract: Paid crowdsourcing platforms suffer from low-quality work and unfair rejections, but paradoxically, most workers and requesters have high reputation scores. These inflated scores, which make high-quality work and workers difficult to find, stem from social pressure to avoid giving negative feedback. We introduce Boomerang, a reputation system for crowdsourcing that elicits more accurate feedback by rebounding the consequences of feedback directly back onto the person who gave it. With Boomerang, requesters find that their highly-rated workers gain earliest access to their future tasks, and workers find tasks from their highly-rated requesters at the top of their task feed. Field experiments verify that Boomerang causes both workers and requesters to provide feedback that is more closely aligned with their private opinions. Inspired by a game-theoretic notion of incentive-compatibility, Boomerang opens opportunities for interaction design to incentivize honest reporting over strategic dishonesty.