
Showing papers by "Yahoo!" published in 2010


Proceedings ArticleDOI
03 May 2010
TL;DR: The architecture of HDFS is described and experience using HDFS to manage 25 petabytes of enterprise data at Yahoo! is reported on.
Abstract: The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 25 petabytes of enterprise data at Yahoo!.

5,005 citations
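
A hedged sketch of the rack-aware replica placement the paper describes (first replica on the writer's node, second and third on two different nodes of one remote rack). The dictionary layout and node names are illustrative, not HDFS's actual interfaces:

import random

def place_replicas(writer_node, nodes_by_rack, n_replicas=3):
    # Toy default policy: replica 1 on the writer's node, replicas 2 and 3
    # on two different nodes of a single remote rack (assumes every rack
    # contains at least two nodes).
    writer_rack = next(r for r, ns in nodes_by_rack.items() if writer_node in ns)
    remote_rack = random.choice([r for r in nodes_by_rack if r != writer_rack])
    second, third = random.sample(nodes_by_rack[remote_rack], 2)
    return [writer_node, second, third][:n_replicas]

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", racks))  # e.g. ['n1', 'n4', 'n3']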


Proceedings ArticleDOI
Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, Russell Sears
10 Jun 2010
TL;DR: This work presents the "Yahoo! Cloud Serving Benchmark" (YCSB) framework, with the goal of facilitating performance comparisons of the new generation of cloud data serving systems, and defines a core set of benchmarks and reports results for four widely used systems.
Abstract: While the use of MapReduce systems (such as Hadoop) for large scale data analysis has been widely recognized and studied, we have recently seen an explosion in the number of systems developed for cloud data serving. These newer systems address "cloud OLTP" applications, though they typically do not support ACID transactions. Examples of systems proposed for cloud serving use include BigTable, PNUTS, Cassandra, HBase, Azure, CouchDB, SimpleDB, Voldemort, and many others. Further, they are being applied to a diverse range of applications that differ considerably from traditional (e.g., TPC-C like) serving workloads. The number of emerging cloud serving systems and the wide range of proposed applications, coupled with a lack of apples-to-apples performance comparisons, make it difficult to understand the tradeoffs between systems and the workloads for which they are suited. We present the "Yahoo! Cloud Serving Benchmark" (YCSB) framework, with the goal of facilitating performance comparisons of the new generation of cloud data serving systems. We define a core set of benchmarks and report results for four widely used systems: Cassandra, HBase, Yahoo!'s PNUTS, and a simple sharded MySQL implementation. We also hope to foster the development of additional cloud benchmark suites that represent other classes of applications by making our benchmark tool available via open source. In this regard, a key feature of the YCSB framework/tool is that it is extensible--it supports easy definition of new workloads, in addition to making it easy to benchmark new systems.

3,276 citations
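
A minimal YCSB-style driver, sketched under assumptions: db is any client exposing get(key) and put(key, value), the Pareto-based key skew stands in for YCSB's Zipfian generator, and the 95/5 read/update mix mirrors one of the core workloads:

import random
import time

def run_workload(db, n_ops=10_000, read_fraction=0.95, n_keys=1_000):
    latencies = []
    for _ in range(n_ops):
        key = "user%d" % (int(random.paretovariate(1.2)) % n_keys)  # skewed keys
        start = time.perf_counter()
        if random.random() < read_fraction:
            db.get(key)
        else:
            db.put(key, b"x" * 100)
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    # report median and 99th-percentile latency
    return latencies[len(latencies) // 2], latencies[int(len(latencies) * 0.99)]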


Journal ArticleDOI
TL;DR: The goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs, and to provide pointers into the literature for those who are less familiar with the field.
Abstract: Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term-document, word-context, and pair-pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.

2,843 citations
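
The first of the survey's three matrix classes, the term-document matrix, can be sketched in a few lines: each document becomes a term-frequency vector and documents are compared by cosine similarity. The corpus here is a toy placeholder:

from collections import Counter
import math

docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]
vocab = sorted({w for d in docs for w in d.split()})
vectors = [Counter(d.split()) for d in docs]  # one term-frequency vector per document

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in vocab)
    na = math.sqrt(sum(a[w] ** 2 for w in vocab))
    nb = math.sqrt(sum(b[w] ** 2 for w in vocab))
    return dot / (na * nb) if na and nb else 0.0

print(cosine(vectors[0], vectors[1]))  # documents sharing terms score higher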


Posted Content
TL;DR: In this article, the authors demonstrate how to use Mechanical Turk for conducting behavioral research and lower the barrier to entry for researchers who could benefit from this platform, and illustrate the mechanics of putting a task on Mechanical Turk including recruiting subjects, executing the task, and reviewing the work submitted.
Abstract: Amazon’s Mechanical Turk is an online labor market where requesters post jobs and workers choose which jobs to do for pay. The central purpose of this paper is to demonstrate how to use this website for conducting behavioral research and lower the barrier to entry for researchers who could benefit from this platform. We describe general techniques that apply to a variety of types of research and experiments across disciplines. We begin by discussing some of the advantages of doing experiments on Mechanical Turk, such as easy access to a large, stable, and diverse subject pool, the low cost of doing experiments and faster iteration between developing theory and executing experiments. We will discuss how the behavior of workers compares to experts and to laboratory subjects. Then, we illustrate the mechanics of putting a task on Mechanical Turk including recruiting subjects, executing the task, and reviewing the work that was submitted. We also provide solutions to common problems that a researcher might face when executing their research on this platform including techniques for conducting synchronous experiments, methods to ensure high quality work, how to keep data private, and how to maintain code security.

2,755 citations
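
For the mechanics of posting a task, a hypothetical sketch using today's boto3 MTurk client (which postdates the paper); the question XML file, reward, and counts are placeholders:

import boto3

mturk = boto3.client("mturk", region_name="us-east-1")
question_xml = open("survey_question.xml").read()  # ExternalQuestion/QuestionForm XML

hit = mturk.create_hit(
    Title="Short decision-making survey",
    Description="Answer a few questions (about 5 minutes).",
    Reward="0.50",                        # dollars, passed as a string
    MaxAssignments=100,                   # number of distinct workers
    LifetimeInSeconds=3 * 24 * 3600,      # how long the HIT stays visible
    AssignmentDurationInSeconds=15 * 60,  # time allotted per worker
    Question=question_xml,
)
print(hit["HIT"]["HITId"])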


Proceedings ArticleDOI
26 Apr 2010
TL;DR: This work model personalized recommendation of news articles as a contextual bandit problem, a principled approach in which a learning algorithm sequentially selects articles to serve users based on contextual information about the users and articles, while simultaneously adapting its article-selection strategy based on user-click feedback to maximize total user clicks.
Abstract: Personalized web services strive to adapt their services (advertisements, news articles, etc.) to individual users by making use of both content and user information. Despite a few recent advances, this problem remains challenging for at least two reasons. First, web services feature dynamically changing pools of content, rendering traditional collaborative filtering methods inapplicable. Second, the scale of most web services of practical interest calls for solutions that are both fast in learning and computation. In this work, we model personalized recommendation of news articles as a contextual bandit problem, a principled approach in which a learning algorithm sequentially selects articles to serve users based on contextual information about the users and articles, while simultaneously adapting its article-selection strategy based on user-click feedback to maximize total user clicks. The contributions of this work are three-fold. First, we propose a new, general contextual bandit algorithm that is computationally efficient and well motivated from learning theory. Second, we argue that any bandit algorithm can be reliably evaluated offline using previously recorded random traffic. Finally, using this offline evaluation method, we successfully applied our new algorithm to a Yahoo! Front Page Today Module dataset containing over 33 million events. Results showed a 12.5% click lift compared to a standard context-free bandit algorithm, and the advantage becomes even greater when data gets more scarce.

2,467 citations
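
A minimal sketch of a disjoint LinUCB-style policy in the spirit of the paper's algorithm: one ridge-regression model per article plus an upper-confidence bonus on each score. Class and parameter names are illustrative:

import numpy as np

class LinUCB:
    def __init__(self, n_features, alpha=0.5):
        self.d, self.alpha = n_features, alpha
        self.A, self.b = {}, {}  # per-article sufficient statistics

    def select(self, article_ids, x):
        scores = {}
        for a in article_ids:
            if a not in self.A:
                self.A[a] = np.eye(self.d)
                self.b[a] = np.zeros(self.d)
            theta = np.linalg.solve(self.A[a], self.b[a])   # ridge estimate
            bonus = self.alpha * np.sqrt(x @ np.linalg.solve(self.A[a], x))
            scores[a] = theta @ x + bonus                   # mean + confidence
        return max(scores, key=scores.get)

    def update(self, a, x, clicked):
        self.A[a] += np.outer(x, x)
        self.b[a] += clicked * x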


Proceedings Article
23 Jun 2010
TL;DR: ZooKeeper provides a per client guarantee of FIFO execution of requests and linearizability for all requests that change the ZooKeeper state to enable the implementation of a high performance processing pipeline with read requests being satisfied by local servers.
Abstract: In this paper, we describe ZooKeeper, a service for coordinating processes of distributed applications. Since ZooKeeper is part of critical infrastructure, ZooKeeper aims to provide a simple and high performance kernel for building more complex coordination primitives at the client. It incorporates elements from group messaging, shared registers, and distributed lock services in a replicated, centralized service. The interface exposed by ZooKeeper has the wait-free aspects of shared registers with an event-driven mechanism similar to cache invalidations of distributed file systems to provide a simple, yet powerful coordination service. The ZooKeeper interface enables a high-performance service implementation. In addition to the wait-free property, ZooKeeper provides a per client guarantee of FIFO execution of requests and linearizability for all requests that change the ZooKeeper state. These design decisions enable the implementation of a high performance processing pipeline with read requests being satisfied by local servers. We show for the target workloads, 2:1 to 100:1 read to write ratio, that ZooKeeper can handle tens to hundreds of thousands of transactions per second. This performance allows ZooKeeper to be used extensively by client applications.

1,637 citations
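
A small coordination primitive of the kind the abstract has in mind, sketched with the third-party kazoo Python client (the ensemble address is a placeholder): group membership via ephemeral sequential znodes, which vanish when a member's session dies:

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

zk.ensure_path("/workers")
# Each worker registers itself; the znode disappears if the worker crashes.
zk.create("/workers/worker-", value=b"host-a", ephemeral=True, sequence=True)
print(zk.get_children("/workers"))  # live members, readable from any local replica
zk.stop()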


Proceedings ArticleDOI
13 Apr 2010
TL;DR: This work proposes a simple algorithm called delay scheduling, which achieves nearly optimal data locality in a variety of workloads and can increase throughput by up to 2x while preserving fairness.
Abstract: As organizations start to use data-intensive cluster computing systems like Hadoop and Dryad for more applications, there is a growing need to share clusters between users. However, there is a conflict between fairness in scheduling and data locality (placing tasks on nodes that contain their input data). We illustrate this problem through our experience designing a fair scheduler for a 600-node Hadoop cluster at Facebook. To address the conflict between locality and fairness, we propose a simple algorithm called delay scheduling: when the job that should be scheduled next according to fairness cannot launch a local task, it waits for a small amount of time, letting other jobs launch tasks instead. We find that delay scheduling achieves nearly optimal data locality in a variety of workloads and can increase throughput by up to 2x while preserving fairness. In addition, the simplicity of delay scheduling makes it applicable under a wide variety of scheduling policies beyond fair sharing.

1,514 citations
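
The algorithm itself fits in a few lines. A sketch assuming a hypothetical job interface (pending_tasks, has_local_task, pop_local_task, pop_any_task, and a skip_since timestamp):

def assign_task(node, jobs, now, max_delay):
    # jobs arrive sorted by fairness priority (most underserved first)
    for job in jobs:
        if not job.pending_tasks:
            continue
        if job.has_local_task(node):
            job.skip_since = None
            return job.pop_local_task(node)   # locality preserved
        if job.skip_since is None:
            job.skip_since = now              # start this job's delay clock
        elif now - job.skip_since >= max_delay:
            job.skip_since = None
            return job.pop_any_task()         # waive locality after the delay
    return None  # leave the slot idle until the next heartbeat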


Proceedings ArticleDOI
26 Sep 2010
TL;DR: An extensive evaluation of several state-of-the-art recommender algorithms suggests that algorithms optimized for minimizing RMSE do not necessarily perform as expected on the top-N recommendation task, and new variants of two collaborative filtering algorithms are offered.
Abstract: In many commercial systems, the 'best bet' recommendations are shown, but the predicted rating values are not. This is usually referred to as a top-N recommendation task, where the goal of the recommender system is to find a few specific items which are supposed to be most appealing to the user. Common methodologies based on error metrics (such as RMSE) are not a natural fit for evaluating the top-N recommendation task. Rather, top-N performance can be directly measured by alternative methodologies based on accuracy metrics (such as precision/recall). An extensive evaluation of several state-of-the-art recommender algorithms suggests that algorithms optimized for minimizing RMSE do not necessarily perform as expected on the top-N recommendation task. Results show that improvements in RMSE often do not translate into accuracy improvements. In particular, a naive non-personalized algorithm can outperform some common recommendation approaches and almost match the accuracy of sophisticated algorithms. Another finding is that the few most popular items can skew the top-N performance. The analysis points out that when evaluating a recommender algorithm on the top-N recommendation task, the test set should be chosen carefully in order to not bias accuracy metrics towards non-personalized solutions. Finally, we offer practitioners new variants of two collaborative filtering algorithms that, regardless of their RMSE, significantly outperform other recommender algorithms in pursuing the top-N recommendation task, while offering additional practical advantages. This comes as a surprise given the simplicity of these two methods.

1,398 citations
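
The alternative methodology the paper advocates is direct: rank items for each user, then score the top N against the held-out relevant items. A minimal sketch:

def precision_recall_at_n(recommended, relevant, n=10):
    # recommended: ranked list of item ids; relevant: held-out items
    top = recommended[:n]
    hits = len(set(top) & set(relevant))
    precision = hits / n
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

print(precision_recall_at_n(["a", "b", "c", "d"], {"b", "x"}, n=3))  # (0.33..., 0.5)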


01 Jan 2010
TL;DR: In this paper, the evolution of structure within large online social networks is studied. The authors present a series of measurements of two such networks, together comprising in excess of five million people and ten million friendship links, annotated with metadata capturing the time of every event.
Abstract: In this paper, we consider the evolution of structure within large online social networks. We present a series of measurements of two such networks, together comprising in excess of five million people and ten million friendship links, annotated with metadata capturing the time of every event in the life of the network. Our measurements expose a surprising segmentation of these networks into three regions: singletons who do not participate in the network; isolated communities which overwhelmingly display star structure; and a giant component anchored by a well-connected core region which persists even in the absence of stars. We present a simple model of network growth which captures these aspects of component structure. The model follows our experimental results, characterizing users as passive members of the network; inviters who encourage offline friends and acquaintances to migrate online; and linkers who fully participate in the social evolution of the network.

1,391 citations
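
A toy simulation of the paper's three-role growth model (passives, inviters, linkers); the role probabilities are illustrative, not values fitted in the paper:

import random

def grow_network(n_steps, p_inviter=0.1, p_linker=0.2):
    nodes, edges = [0], []
    for _ in range(n_steps):
        r = random.random()
        if r < p_inviter:                  # inviter: pulls a new member in
            new = len(nodes)
            edges.append((random.choice(nodes), new))
            nodes.append(new)
        elif r < p_inviter + p_linker:     # linker: wires existing members
            if len(nodes) >= 2:
                edges.append(tuple(random.sample(nodes, 2)))
        else:                              # passive: joins as a singleton
            nodes.append(len(nodes))
    return nodes, edges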


Proceedings Article
06 Dec 2010
TL;DR: This paper presents the first parallel stochastic gradient descent algorithm including a detailed analysis and experimental evidence and introduces a novel proof technique — contractive mappings to quantify the speed of convergence of parameter distributions to their asymptotic limits.
Abstract: With the increase in available data, parallel machine learning has become an increasingly pressing problem. In this paper we present the first parallel stochastic gradient descent algorithm including a detailed analysis and experimental evidence. Unlike prior work on parallel optimization algorithms [5, 7] our variant comes with parallel acceleration guarantees and it poses no overly tight latency constraints, which might only be available in the multicore setting. Our analysis introduces a novel proof technique, contractive mappings, to quantify the speed of convergence of parameter distributions to their asymptotic limits. As a side effect this answers the question of how quickly stochastic gradient descent algorithms reach the asymptotically normal regime [1, 8].

1,269 citations
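
The scheme analyzed in the paper is essentially: shard the data, run independent SGD on each shard, and average the resulting parameters. A serial sketch (each loop iteration stands in for one machine; grad is a user-supplied per-example gradient):

import numpy as np

def parallel_sgd(grad, data, k=4, lr=0.01, dim=10):
    shards = np.array_split(np.asarray(data), k)
    params = []
    for shard in shards:                      # conceptually one machine per shard
        w = np.zeros(dim)
        for x in np.random.permutation(shard):
            w -= lr * grad(w, x)
        params.append(w)
    return np.mean(params, axis=0)            # final model: average of the workers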


Proceedings ArticleDOI
04 Feb 2010
TL;DR: This paper proposes models and algorithms for learning the model parameters and for testing the learned models to make predictions, and develops techniques for predicting the time by which a user may be expected to perform an action.
Abstract: Recently, there has been tremendous interest in the phenomenon of influence propagation in social networks. The studies in this area assume they have as input to their problems a social graph with edges labeled with probabilities of influence between users. However, the question of where these probabilities come from or how they can be computed from real social network data has been largely ignored until now. Thus it is interesting to ask whether from a social graph and a log of actions by its users, one can build models of influence. This is the main problem attacked in this paper. In addition to proposing models and algorithms for learning the model parameters and for testing the learned models to make predictions, we also develop techniques for predicting the time by which a user may be expected to perform an action. We validate our ideas and techniques using the Flickr data set consisting of a social graph with 1.3M nodes, 40M edges, and an action log consisting of 35M tuples referring to 300K distinct actions. Beyond showing that there is genuine influence happening in a real social network, we show that our techniques have excellent prediction performance.
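
One simple instance of the kind of model the paper learns is a Bernoulli estimate from the action log: p(v -> u) is the fraction of v's actions that neighbor u later repeated. A sketch, with edges as a set of frozenset pairs and a quadratic scan that a real implementation would replace with a time-sorted pass:

from collections import defaultdict

def influence_probabilities(edges, action_log):
    first_time = {}                            # (user, action) -> earliest time
    for user, action, t in action_log:
        key = (user, action)
        if key not in first_time or t < first_time[key]:
            first_time[key] = t
    actions_by = defaultdict(int)              # distinct actions per user
    for user, _ in first_time:
        actions_by[user] += 1
    prop = defaultdict(int)                    # (v, u) -> actions that flowed v -> u
    for (u, action), tu in first_time.items():
        for (v, a), tv in first_time.items():
            if a == action and v != u and tv < tu and frozenset((u, v)) in edges:
                prop[(v, u)] += 1
    return {(v, u): c / actions_by[v] for (v, u), c in prop.items()}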

Proceedings ArticleDOI
25 Jul 2010
TL;DR: The behavior of Twitter users under an emergency situation is explored and it is shown that it is possible to detect rumors by using aggregate analysis on tweets, and that the propagation of tweets that correspond to rumors differs from tweets that spread news.
Abstract: In this article we explore the behavior of Twitter users under an emergency situation. In particular, we analyze the activity related to the 2010 earthquake in Chile and characterize Twitter in the hours and days following this disaster. Furthermore, we perform a preliminary study of certain social phenomena, such as the dissemination of false rumors and confirmed news. We analyze how this information propagated through the Twitter network, with the purpose of assessing the reliability of Twitter as an information source under extreme circumstances. Our analysis shows that the propagation of tweets that correspond to rumors differs from tweets that spread news because rumors tend to be questioned more than news by the Twitter community. This result shows that it is possible to detect rumors by using aggregate analysis on tweets.
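
A crude stand-in for the aggregate signal the study reports (rumors attract more questioning tweets than confirmed news). The doubt phrases and the use of '?' are illustrative heuristics, not the paper's method:

def question_ratio(tweets):
    doubt_words = ("is this true", "really?", "rumor", "not confirmed")
    questioning = sum(
        1 for t in tweets
        if "?" in t or any(w in t.lower() for w in doubt_words)
    )
    return questioning / len(tweets) if tweets else 0.0

# A topic whose tweets show a high question_ratio is a rumor candidate.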

Proceedings ArticleDOI
13 Dec 2010
TL;DR: The architecture resembles the Actors model, providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers.
Abstract: S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. Keyed data events are routed with affinity to Processing Elements (PEs), which consume the events and do one or both of the following: (1) emit one or more events which may be consumed by other PEs, (2) publish results. The architecture resembles the Actors model, providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers. In this paper, we outline the S4 architecture in detail, describe various applications, including real-life deployments. Our design is primarily driven by large scale applications for data mining and machine learning in a production environment. We show that the S4 design is surprisingly flexible and lends itself to run in large clusters built with commodity hardware.
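
The PE abstraction can be sketched as keyed, stateful objects that the platform instantiates once per key value; class and method names here are illustrative, not S4's actual Java API:

class ProcessingElement:
    # One instance per key: state is encapsulated, location-transparent.
    def __init__(self, key):
        self.key, self.count = key, 0

    def process(self, event, emit):
        self.count += 1
        if self.count % 100 == 0:   # periodically emit a downstream event
            emit({"key": self.key, "count": self.count})

class Router:
    # Keyed routing: events with the same key reach the same PE.
    def __init__(self, pe_cls):
        self.pe_cls, self.pes = pe_cls, {}

    def route(self, event, emit):
        pe = self.pes.setdefault(event["key"], self.pe_cls(event["key"]))
        pe.process(event, emit)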

Proceedings ArticleDOI
26 Apr 2010
TL;DR: Considering community quality as a function of its size provides a much finer lens with which to examine community detection algorithms, since objective functions and approximation algorithms often have non-obvious size-dependent behavior.
Abstract: Detecting clusters or communities in large real-world graphs such as large social or information networks is a problem of considerable interest. In practice, one typically chooses an objective function that captures the intuition of a network cluster as a set of nodes with better internal connectivity than external connectivity, and then one applies approximation algorithms or heuristics to extract sets of nodes that are related to the objective function and that "look like" good communities for the application of interest. In this paper, we explore a range of network community detection methods in order to compare them and to understand their relative performance and the systematic biases in the clusters they identify. We evaluate several common objective functions that are used to formalize the notion of a network community, and we examine several different classes of approximation algorithms that aim to optimize such objective functions. In addition, rather than simply fixing an objective and asking for an approximation to the best cluster of any size, we consider a size-resolved version of the optimization problem. Considering community quality as a function of its size provides a much finer lens with which to examine community detection algorithms, since objective functions and approximation algorithms often have non-obvious size-dependent behavior.
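
One of the objective functions the paper evaluates is conductance: cut edges divided by the smaller side's edge volume. A sketch using networkx; scanning for the best score at each community size gives the size-resolved profile described above:

import networkx as nx

def conductance(G, S):
    cut = nx.cut_size(G, S)                    # edges leaving S
    vol = sum(d for _, d in G.degree(S))       # edge volume of S
    vol_rest = 2 * G.number_of_edges() - vol
    return cut / min(vol, vol_rest)

G = nx.karate_club_graph()
print(conductance(G, [0, 1, 2, 3, 7, 13]))     # lower = "better" community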

Proceedings ArticleDOI
28 Apr 2010
TL;DR: A modified version of the Hadoop MapReduce framework that supports online aggregation, which allows users to see "early returns" from a job as it is being computed, and can reduce completion times and improve system utilization for batch jobs as well.
Abstract: MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, many implementations of MapReduce materialize the entire output of each map and reduce task before it can be consumed. In this paper, we propose a modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing, and can reduce completion times and improve system utilization for batch jobs as well. We present a modified version of the Hadoop MapReduce framework that supports online aggregation, which allows users to see "early returns" from a job as it is being computed. Our Hadoop Online Prototype (HOP) also supports continuous queries, which enable MapReduce programs to be written for applications such as event monitoring and stream processing. HOP retains the fault tolerance properties of Hadoop and can run unmodified user-defined MapReduce programs.
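
A toy rendering of the online-aggregation idea: pipeline map output straight into reducer state and yield periodic snapshots as "early returns". The mapper/reducer signatures are illustrative, not Hadoop's API:

from collections import defaultdict

def mapreduce_online(records, mapper, reducer, snapshot_every=1000):
    state = defaultdict(list)
    for i, rec in enumerate(records, 1):
        for k, v in mapper(rec):
            state[k].append(v)                 # pipelined, never materialized
        if i % snapshot_every == 0:
            yield {k: reducer(k, vs) for k, vs in state.items()}  # early return
    yield {k: reducer(k, vs) for k, vs in state.items()}          # final answer

# e.g. word count: mapper = lambda line: [(w, 1) for w in line.split()]
#                  reducer = lambda k, vs: sum(vs)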

Journal ArticleDOI
Yehuda Koren1
TL;DR: A more sensitive approach is required, one that can make better distinctions between transient effects and long-term patterns; accordingly, two leading collaborative filtering recommendation approaches are revamped.
Abstract: Customer preferences for products are drifting over time. Product perception and popularity are constantly changing as new selection emerges. Similarly, customer inclinations are evolving, leading them to ever redefine their taste. Thus, modeling temporal dynamics is essential for designing recommender systems or general customer preference models. However, this raises unique challenges. Within the ecosystem intersecting multiple products and customers, many different characteristics are shifting simultaneously, while many of them influence each other and often those shifts are delicate and associated with a few data instances. This distinguishes the problem from concept drift explorations, where mostly a single concept is tracked. Classical time-window or instance decay approaches cannot work, as they lose too many signals when discarding data instances. A more sensitive approach is required, which can make better distinctions between transient effects and long-term patterns. We show how to model the time changing behavior throughout the life span of the data. Such a model allows us to exploit the relevant components of all data instances, while discarding only what is modeled as being irrelevant. Accordingly, we revamp two leading collaborative filtering recommendation approaches. Evaluation is made on a large movie-rating dataset underlying the Netflix Prize contest. Results are encouraging and better than those previously reported on this dataset. In particular, methods described in this paper play a significant role in the solution that won the Netflix contest.
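
One ingredient of the paper's temporal model is a user bias that drifts smoothly with time, b_u(t) = b_u + alpha_u * dev_u(t) with dev_u(t) = sign(t - t_u) * |t - t_u|^beta, where t_u is the user's mean rating date. A direct transcription, with beta set to an illustrative value:

import numpy as np

def user_bias(b_u, alpha_u, t, t_u_mean, beta=0.4):
    dev = np.sign(t - t_u_mean) * np.abs(t - t_u_mean) ** beta  # drift in days
    return b_u + alpha_u * dev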

Proceedings ArticleDOI
17 Jan 2010
TL;DR: A simulation lemma is proved showing that a large class of PRAM algorithms can be efficiently simulated via MapReduce, and it is demonstrated how algorithms can take advantage of this fact to compute an MST of a dense graph in only two rounds.
Abstract: In recent years the MapReduce framework has emerged as one of the most widely used parallel computing platforms for processing data on terabyte and petabyte scales. Used daily at companies such as Yahoo!, Google, Amazon, and Facebook, and adopted more recently by several universities, it allows for easy parallelization of data intensive computations over many machines. One key feature of MapReduce that differentiates it from previous models of parallel computation is that it interleaves sequential and parallel computation. We propose a model of efficient computation using the MapReduce paradigm. Since MapReduce is designed for computations over massive data sets, our model limits the number of machines and the memory per machine to be substantially sublinear in the size of the input. On the other hand, we place very loose restrictions on the computational power of any individual machine---our model allows each machine to perform sequential computations in time polynomial in the size of the original input. We compare MapReduce to the PRAM model of computation. We prove a simulation lemma showing that a large class of PRAM algorithms can be efficiently simulated via MapReduce. The strength of MapReduce, however, lies in the fact that it uses both sequential and parallel computation. We demonstrate how algorithms can take advantage of this fact to compute an MST of a dense graph in only two rounds, as opposed to Ω(log(n)) rounds needed in the standard PRAM model. We show how to evaluate a wide class of functions using the MapReduce framework. We conclude by applying this result to show how to compute some basic algorithmic problems such as undirected s-t connectivity in the MapReduce framework.
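
The two-round MST result can be sketched directly: round one partitions edges across machines and keeps only each machine's local minimum spanning forest; round two computes the MST of the survivors. A serial stand-in using networkx, with edges as (u, v, weight) triples:

import random
import networkx as nx

def two_round_mst(edges, n_machines=4):
    parts = [[] for _ in range(n_machines)]
    for e in edges:
        parts[random.randrange(n_machines)].append(e)   # round 1 partition
    survivors = []
    for part in parts:                                  # run in parallel in reality
        g = nx.Graph()
        g.add_weighted_edges_from(part)
        survivors += list(nx.minimum_spanning_edges(g, data=True))
    g = nx.Graph()
    g.add_edges_from(survivors)                         # round 2 on one machine
    return list(nx.minimum_spanning_edges(g))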

Journal ArticleDOI
Sharad Goel, Jake M. Hofman, Sébastien Lahaie, David M. Pennock, Duncan J. Watts
TL;DR: This work uses search query volume to forecast the opening weekend box-office revenue for feature films, first-month sales of video games, and the rank of songs on the Billboard Hot 100 chart, finding in all cases that search counts are highly predictive of future outcomes.
Abstract: Recent work has demonstrated that Web search volume can “predict the present,” meaning that it can be used to accurately track outcomes such as unemployment levels, auto and home sales, and disease prevalence in near real time. Here we show that what consumers are searching for online can also predict their collective future behavior days or even weeks in advance. Specifically we use search query volume to forecast the opening weekend box-office revenue for feature films, first-month sales of video games, and the rank of songs on the Billboard Hot 100 chart, finding in all cases that search counts are highly predictive of future outcomes. We also find that search counts generally boost the performance of baseline models fit on other publicly available data, where the boost varies from modest to dramatic, depending on the application in question. Finally, we reexamine previous work on tracking flu trends and show that, perhaps surprisingly, the utility of search data relative to a simple autoregressive model is modest. We conclude that in the absence of other data sources, or where small improvements in predictive performance are material, search queries provide a useful guide to the near future.
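
The comparison amounts to fitting a baseline model on public covariates, then the same model with search volume appended, and comparing fit. A sketch with synthetic placeholder arrays standing in for the real features and outcomes:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_base = rng.random((200, 3))        # placeholder covariates (budget, screens, ...)
search = rng.random((200, 1))        # placeholder search-volume feature
y = X_base @ [1.0, 2.0, 0.5] + 3.0 * search[:, 0]   # synthetic outcome

baseline = LinearRegression().fit(X_base, y)
augmented = LinearRegression().fit(np.hstack([X_base, search]), y)
print(baseline.score(X_base, y))                        # R^2 without search
print(augmented.score(np.hstack([X_base, search]), y))  # R^2 with search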

Proceedings Article
15 Jul 2010
TL;DR: This paper defines the task, describes the training and test data and the process of their creation, lists the participating systems (10 teams, 28 runs), and discusses their results.
Abstract: SemEval-2 Task 8 focuses on Multi-way classification of semantic relations between pairs of nominals. The task was designed to compare different approaches to semantic relation classification and to provide a standard testbed for future research. This paper defines the task, describes the training and test data and the process of their creation, lists the participating systems (10 teams, 28 runs), and discusses their results.

Journal ArticleDOI
01 Sep 2010
TL;DR: This paper describes a high performance sampling architecture for inference of latent topic models on a cluster of workstations and shows that this architecture is entirely general and that it can be extended easily to more sophisticated latent variable models such as n-grams and hierarchies.
Abstract: This paper describes a high performance sampling architecture for inference of latent topic models on a cluster of workstations. Our system is faster than previous work by over an order of magnitude and it is capable of dealing with hundreds of millions of documents and thousands of topics. The algorithm relies on a novel communication structure, namely the use of a distributed (key, value) storage for synchronizing the sampler state between computers. Our architecture entirely obviates the need for separate computation and synchronization phases. Instead, disk, CPU, and network are used simultaneously to achieve high performance. We show that this architecture is entirely general and that it can be extended easily to more sophisticated latent variable models such as n-grams and hierarchies.
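
At the core of such a system is collapsed Gibbs sampling over topic assignments; in the paper the word-topic counts live in a distributed (key, value) store updated concurrently, for which a local dict stands in below:

import random
from collections import defaultdict

def gibbs_pass(docs, z, n_topics, V, alpha=0.1, beta=0.01):
    # docs: lists of word ids; z: current topic assignment per token; V: vocab size
    n_wt, n_t, n_dt = defaultdict(int), defaultdict(int), defaultdict(int)
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            n_wt[w, z[d][i]] += 1; n_t[z[d][i]] += 1; n_dt[d, z[d][i]] += 1
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t_old = z[d][i]
            n_wt[w, t_old] -= 1; n_t[t_old] -= 1; n_dt[d, t_old] -= 1
            weights = [(n_dt[d, t] + alpha) * (n_wt[w, t] + beta) / (n_t[t] + V * beta)
                       for t in range(n_topics)]
            t_new = random.choices(range(n_topics), weights=weights)[0]
            z[d][i] = t_new
            n_wt[w, t_new] += 1; n_t[t_new] += 1; n_dt[d, t_new] += 1
    return z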

Proceedings ArticleDOI
10 Apr 2010
TL;DR: An analytical methodology and visual representations are developed that could help a journalist or public affairs person better understand the temporal dynamics of sentiment in reaction to the debate video.
Abstract: Television broadcasters are beginning to combine social micro-blogging systems such as Twitter with television to create social video experiences around events. We looked at one such event, the first U.S. presidential debate in 2008, in conjunction with aggregated ratings of message sentiment from Twitter. We begin to develop an analytical methodology and visual representations that could help a journalist or public affairs person better understand the temporal dynamics of sentiment in reaction to the debate video. We demonstrate visuals and metrics that can be used to detect sentiment pulse, anomalies in that pulse, and indications of controversial topics that can be used to inform the design of visual analytic systems for social media events.
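
A minimal version of the "sentiment pulse" metric: bucket per-tweet sentiment scores into time windows and flag windows far from the overall mean. The two-sigma threshold is illustrative:

import statistics

def sentiment_pulse(timestamped_scores, window=60):
    # timestamped_scores: (seconds, score in [-1, 1]) pairs
    buckets = {}
    for t, score in timestamped_scores:
        buckets.setdefault(int(t // window), []).append(score)
    means = {w: statistics.mean(v) for w, v in buckets.items()}
    mu = statistics.mean(means.values())
    sd = statistics.pstdev(means.values())
    anomalies = [w for w, m in means.items() if sd and abs(m - mu) > 2 * sd]
    return means, anomalies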

Proceedings ArticleDOI
25 Jul 2010
TL;DR: This paper studies a query-dependent variant of the community-detection problem, called the community-search problem: given a graph G and a set of query nodes in the graph, find a subgraph of G that contains the query nodes and is densely connected; an optimum greedy algorithm is developed for this measure.
Abstract: A lot of research in graph mining has been devoted to the discovery of communities. Most of the work has focused on the scenario where communities need to be discovered with only reference to the input graph. However, for many interesting applications one is interested in finding the community formed by a given set of nodes. In this paper we study a query-dependent variant of the community-detection problem, which we call the community-search problem: given a graph G, and a set of query nodes in the graph, we seek to find a subgraph of G that contains the query nodes and is densely connected. We motivate a measure of density based on minimum degree and distance constraints, and we develop an optimum greedy algorithm for this measure. We proceed by characterizing a class of monotone constraints and we generalize our algorithm to compute optimum solutions satisfying any set of monotone constraints. Finally we modify the greedy algorithm and we present two heuristic algorithms that find communities of size no greater than a specified upper bound. Our experimental evaluation on real datasets demonstrates the efficiency of the proposed algorithms and the quality of the solutions we obtain.
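
The greedy algorithm is a peeling procedure. A sketch of a variant that never deletes query nodes: repeatedly remove a minimum-degree non-query node, stop if the query nodes would disconnect, and return the intermediate subgraph whose minimum degree was largest:

import networkx as nx

def community_search(G, query_nodes):
    H = G.copy()
    best, best_min_deg = H.copy(), min(d for _, d in H.degree())
    while True:
        v, _ = min(((v, d) for v, d in H.degree() if v not in query_nodes),
                   key=lambda vd: vd[1], default=(None, None))
        if v is None:
            break                                  # only query nodes remain
        H.remove_node(v)
        if not all(nx.has_path(H, a, b)
                   for a in query_nodes for b in query_nodes):
            break                                  # query nodes disconnected
        md = min(d for _, d in H.degree())
        if md > best_min_deg:
            best, best_min_deg = H.copy(), md      # densest subgraph so far
    return best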

Journal ArticleDOI
Winter Mason, Duncan J. Watts
TL;DR: It is found that increased financial incentives increase the quantity, but not the quality, of work performed by participants, where the difference appears to be due to an "anchoring" effect.
Abstract: The relationship between financial incentives and performance, long of interest to social scientists, has gained new relevance with the advent of web-based "crowd-sourcing" models of production. Here we investigate the effect of compensation on performance in the context of two experiments, conducted on Amazon's Mechanical Turk (AMT). We find that increased financial incentives increase the quantity, but not the quality, of work performed by participants, where the difference appears to be due to an "anchoring" effect: workers who were paid more also perceived the value of their work to be greater, and thus were no more motivated than workers paid less. In contrast with compensation levels, we find the details of the compensation scheme do matter--specifically, a "quota" system results in better work for less pay than an equivalent "piece rate" system. Although counterintuitive, these findings are consistent with previous laboratory studies, and may have real-world analogs as well.

Olivier Chapelle, Yi Chang
25 Jun 2010
TL;DR: The Yahoo! Learning to Rank Challenge was organized in spring 2010 to promote the development of state-of-the-art learning-to-rank algorithms for web search ranking; this paper provides an overview and analysis of the challenge along with a detailed description of the released datasets.
Abstract: Learning to rank for information retrieval has gained a lot of interest in recent years, but there is a lack of large real-world datasets against which to benchmark algorithms. That led us to publicly release two datasets used internally at Yahoo! for learning the web search ranking function. To promote these datasets and foster the development of state-of-the-art learning to rank algorithms, we organized the Yahoo! Learning to Rank Challenge in spring 2010. This paper provides an overview and an analysis of this challenge, along with a detailed description of the released datasets.
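
Challenges of this kind are scored with graded-relevance ranking metrics; a minimal NDCG@k, one standard learning-to-rank measure, looks like this:

import math

def dcg(labels):
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(labels))

def ndcg(ranked_labels, k=10):
    # ranked_labels: relevance grades (e.g. 0-4) in the system's ranked order
    ideal = dcg(sorted(ranked_labels, reverse=True)[:k])
    return dcg(ranked_labels[:k]) / ideal if ideal else 0.0

print(ndcg([3, 2, 3, 0, 1, 2]))  # 1.0 only if the ranking matches the ideal order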

Book
05 May 2010
TL;DR: This book discusses graph-based community detection techniques and many important extensions that handle dynamic, heterogeneous networks in social media, and demonstrates how discovered patterns of communities can be used for social media mining.
Abstract: This book, from a data mining perspective, introduces characteristics of social media, reviews representative tasks of computing with social media, and illustrates associated challenges. It introduces basic concepts, presents state-of-the-art algorithms with easy-to-understand examples, and recommends effective evaluation methods. In particular, we discuss graph-based community detection techniques and many important extensions that handle dynamic, heterogeneous networks in social media. We also demonstrate how discovered patterns of communities can be used for social media mining. The concepts, algorithms, and methods presented in this lecture can help harness the power of social media and support building socially-intelligent systems. This book is an accessible introduction to the study of community detection and mining in social media. It is essential reading for students, researchers, and practitioners in disciplines and applications where social media is a key source of data that piques our curiosity to understand, manage, innovate, and excel. This book is supported by additional materials, including lecture slides, the complete set of figures, key references, some toy data sets used in the book, and the source code of representative algorithms. The readers are encouraged to visit the book website http://dmml.asu.edu/cdm/ for the latest information.

Journal ArticleDOI
TL;DR: Investigating the performance of RDS by simulating sampling from 85 known, network populations finds that RDS is substantially less accurate than generally acknowledged and that reported RDS confidence intervals are misleadingly narrow.
Abstract: Respondent-driven sampling (RDS) is a network-based technique for estimating traits in hard-to-reach populations, for example, the prevalence of HIV among drug injectors. In recent years RDS has been used in more than 120 studies in more than 20 countries and by leading public health organizations, including the Centers for Disease Control and Prevention in the United States. Despite the widespread use and growing popularity of RDS, there has been little empirical validation of the methodology. Here we investigate the performance of RDS by simulating sampling from 85 known, network populations. Across a variety of traits we find that RDS is substantially less accurate than generally acknowledged and that reported RDS confidence intervals are misleadingly narrow. Moreover, because we model a best-case scenario in which the theoretical RDS sampling assumptions hold exactly, it is unlikely that RDS performs any better in practice than in our simulations. Notably, the poor performance of RDS is driven not by the bias but by the high variance of estimates, a possibility that had been largely overlooked in the RDS literature. Given the consistency of our results across networks and our generous sampling conditions, we conclude that RDS as currently practiced may not be suitable for key aspects of public health surveillance where it is now extensively applied.
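
The estimators whose variance this study probes weight respondents by inverse degree, since high-degree people are more likely to be recruited; the standard Volz-Heckathorn estimator, for instance:

def vh_estimate(samples):
    # samples: (trait_value, reported_network_degree) pairs from the RDS chain
    num = sum(y / d for y, d in samples)
    den = sum(1 / d for y, d in samples)
    return num / den

print(vh_estimate([(1, 10), (0, 2), (1, 5)]))  # estimated trait prevalence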

Journal ArticleDOI
TL;DR: The barrier, mechanical, and other properties of biopolymer-coated paper are reviewed and potential applications for bioactive coatings on paper packaging materials are discussed with examples.
Abstract: Increased environmental concerns over the use of certain synthetic packaging and coatings, in combination with consumer demands for both higher quality and longer shelf life, have led to increased interest in alternative packaging materials research. Naturally renewable biopolymers can be used as barrier coatings on paper packaging materials. These biopolymer coatings may retard unwanted moisture transfer in food products, are good oxygen and oil barriers, are biodegradable, and have potential to replace current synthetic paper and paperboard coatings. Incorporation of antimicrobial agents in coatings to produce active paper packaging materials provides an attractive option for protecting food from microorganism development and spread. The barrier, mechanical, and other properties of biopolymer-coated paper are reviewed. Existing and potential applications for bioactive coatings on paper packaging materials are discussed with examples.

Proceedings ArticleDOI
13 Jun 2010
TL;DR: It is indicated that high quality itineraries can be automatically constructed from Flickr data, by constructing intra-city travel itineraries automatically by tapping a latent source reflecting geo-temporal breadcrumbs left by millions of tourists.
Abstract: Vacation planning is one of the frequent---but nonetheless laborious---tasks that people engage themselves with online; requiring skilled interaction with a multitude of resources. This paper constructs intra-city travel itineraries automatically by tapping a latent source reflecting geo-temporal breadcrumbs left by millions of tourists. For example, the popular rich media sharing site, Flickr, allows photos to be stamped by the time of when they were taken and be mapped to Points Of Interests (POIs) by geographical (i.e. latitude-longitude) and semantic (e.g., tags) metadata.Leveraging this information, we construct itineraries following a two-step approach. Given a city, we first extract photo streams of individual users. Each photo stream provides estimates on where the user was, how long he stayed at each place, and what was the transit time between places. In the second step, we aggregate all user photo streams into a POI graph. Itineraries are then automatically constructed from the graph based on the popularity of the POIs and subject to the user's time and destination constraints.We evaluate our approach by constructing itineraries for several major cities and comparing them, through a "crowd-sourcing" marketplace (Amazon Mechanical Turk), against itineraries constructed from popular bus tours that are professionally generated. Our extensive survey-based user studies over about 450 workers on AMT indicate that high quality itineraries can be automatically constructed from Flickr data.