
Showing papers by "Facebook" published in 2013


Journal ArticleDOI
TL;DR: The authors develop a novel 3D CNN model for action recognition that extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames.
Abstract: We consider the automated recognition of human actions in surveillance videos. Most current methods build classifiers based on complex handcrafted features computed from the raw inputs. Convolutional neural networks (CNNs) are a type of deep model that can act directly on the raw inputs. However, such models are currently limited to handling 2D inputs. In this paper, we develop a novel 3D CNN model for action recognition. This model extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. The developed model generates multiple channels of information from the input frames, and the final feature representation combines information from all channels. To further boost the performance, we propose regularizing the outputs with high-level features and combining the predictions of a variety of different models. We apply the developed models to recognize human actions in the real-world environment of airport surveillance videos, and they achieve superior performance in comparison to baseline methods.
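
To make the central operation concrete, here is a minimal NumPy sketch of a 3D convolution over a stack of adjacent frames; the input shape and kernel size are arbitrary assumptions, not the paper's architecture.

    # Minimal sketch of a 3D convolution over a spatio-temporal volume (assumed shapes).
    import numpy as np

    def conv3d_valid(volume, kernel):
        """Valid-mode 3D convolution: volume (T, H, W), kernel (kt, kh, kw)."""
        T, H, W = volume.shape
        kt, kh, kw = kernel.shape
        out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
        for t in range(out.shape[0]):
            for i in range(out.shape[1]):
                for j in range(out.shape[2]):
                    patch = volume[t:t + kt, i:i + kh, j:j + kw]
                    out[t, i, j] = np.sum(patch * kernel)
        return out

    frames = np.random.rand(7, 60, 40)   # 7 adjacent frames, 60x40 pixels each
    kernel = np.random.rand(3, 7, 7)     # kernel spans 3 frames: captures motion across time
    features = conv3d_valid(frames, kernel)
    print(features.shape)                # (5, 54, 34)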

4,545 citations


Proceedings ArticleDOI
01 Oct 2013
TL;DR: This paper summarizes the design, development, and current state of deployment of the next generation of Hadoop's compute platform, YARN, which decouples the programming model from the resource management infrastructure and delegates many scheduling functions to per-application components.
Abstract: The initial design of Apache Hadoop [1] was tightly focused on running massive MapReduce jobs to process a web crawl. For increasingly diverse companies, Hadoop has become the data and computational agora---the de facto place where data and computational resources are shared and accessed. This broad adoption and ubiquitous usage has stretched the initial design well beyond its intended target, exposing two key shortcomings: 1) tight coupling of a specific programming model with the resource management infrastructure, forcing developers to abuse the MapReduce programming model, and 2) centralized handling of jobs' control flow, which resulted in endless scalability concerns for the scheduler. In this paper, we summarize the design, development, and current state of deployment of the next generation of Hadoop's compute platform: YARN. The new architecture we introduced decouples the programming model from the resource management infrastructure, and delegates many scheduling functions (e.g., task fault-tolerance) to per-application components. We provide experimental evidence demonstrating the improvements we made, confirm improved efficiency by reporting the experience of running YARN on production environments (including 100% of Yahoo! grids), and confirm the flexibility claims by discussing the porting of several programming frameworks onto YARN, viz. Dryad, Giraph, Hoya, Hadoop MapReduce, REEF, Spark, Storm, Tez.
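
As a schematic illustration of the decoupling the abstract describes, the toy sketch below separates a cluster-wide resource manager (which only grants containers) from a per-application master (which decides what runs in them). Class and method names are illustrative assumptions, not Hadoop's actual API.

    # Toy sketch of the YARN-style split: cluster-wide scheduling vs. per-application logic.
    class ResourceManager:
        def __init__(self, total_containers):
            self.free = total_containers

        def allocate(self, requested):
            granted = min(requested, self.free)   # scheduling concern only: no job logic here
            self.free -= granted
            return granted

    class ApplicationMaster:
        def __init__(self, rm, tasks):
            self.rm, self.tasks = rm, tasks

        def run(self):
            done = 0
            while done < len(self.tasks):
                granted = self.rm.allocate(len(self.tasks) - done)
                for task in self.tasks[done:done + granted]:
                    task()                        # app-specific logic (incl. retries) lives here
                done += granted
                self.rm.free += granted           # return containers when tasks finish

    rm = ResourceManager(total_containers=2)
    ApplicationMaster(rm, [lambda: None] * 5).run()   # the app schedules its own 5 tasks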

2,006 citations


Posted Content
TL;DR: This article showed that deep neural networks learn input-output mappings that are fairly discontinuous to a significant extent, which suggests that it is the space, rather than individual units, that contains the semantic information in the high layers of neural networks.
Abstract: Deep neural networks are highly expressive models that have recently achieved state of the art performance on speech and visual recognition tasks. While their expressiveness is the reason they succeed, it also causes them to learn uninterpretable solutions that could have counter-intuitive properties. In this paper we report two such properties. First, we find that there is no distinction between individual high level units and random linear combinations of high level units, according to various methods of unit analysis. It suggests that it is the space, rather than the individual units, that contains the semantic information in the high layers of neural networks. Second, we find that deep neural networks learn input-output mappings that are fairly discontinuous to a significant extent. We can cause the network to misclassify an image by applying a certain imperceptible perturbation, which is found by maximizing the network's prediction error. In addition, the specific nature of these perturbations is not a random artifact of learning: the same perturbation can cause a different network, that was trained on a different subset of the dataset, to misclassify the same input.
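
The perturbation idea can be caricatured on a toy linear classifier: ascend the loss gradient with respect to the input until the model's confidence collapses. The paper itself uses box-constrained L-BFGS on deep networks; everything below (model, data, step sizes) is an assumption for illustration.

    # Toy adversarial perturbation: increase a fixed classifier's loss by perturbing the input.
    import numpy as np

    rng = np.random.default_rng(0)
    w, b = rng.normal(size=20), 0.0          # "trained" binary classifier weights
    x = 0.1 * w                              # an input the classifier assigns to class 1
    y = 1.0

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict(x):
        return sigmoid(w @ x + b)

    # Gradient of the logistic loss w.r.t. the input x is (p - y) * w.
    eps, steps = 0.01, 25
    x_adv = x.copy()
    for _ in range(steps):
        grad_x = (predict(x_adv) - y) * w
        x_adv += eps * np.sign(grad_x)       # small step that increases the loss

    print(predict(x), predict(x_adv))        # confidence in class 1 drops sharply
    print(np.max(np.abs(x_adv - x)))         # while each coordinate moved by at most 0.25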

1,313 citations


Posted Content
TL;DR: This paper proposed a new benchmark corpus for measuring progress in statistical language modeling, which consists of almost one billion words of training data and can be used to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques.
Abstract: We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. We show performance of several well-known types of language models, with the best results achieved with a recurrent neural network based language model. The baseline unpruned Kneser-Ney 5-gram model achieves perplexity 67.6; a combination of techniques leads to 35% reduction in perplexity, or 10% reduction in cross-entropy (bits), over that baseline. The benchmark is available as a public open-source project; besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the baseline n-gram models.
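
The quoted numbers are easy to sanity-check, since perplexity is 2 raised to the cross-entropy in bits: a 35% drop from the 67.6 baseline corresponds to roughly a 10% drop in bits.

    # Check the abstract's arithmetic: perplexity = 2 ** cross_entropy_bits.
    import math

    baseline_ppl = 67.6
    improved_ppl = baseline_ppl * (1 - 0.35)            # 35% perplexity reduction -> ~43.9

    baseline_bits = math.log2(baseline_ppl)             # ~6.08 bits per word
    improved_bits = math.log2(improved_ppl)             # ~5.46 bits per word

    print(round(improved_ppl, 1))                       # 43.9
    print(round(1 - improved_bits / baseline_bits, 3))  # ~0.102, i.e. ~10% fewer bits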

865 citations


Posted Content
TL;DR: In this paper, the redundancy in the parameterization of deep learning models is demonstrated and it is shown that given only a few weight values for each feature it is possible to accurately predict the remaining values.
Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.
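
A toy version of the premise: observe a fraction of a smooth 2-D filter's weights and predict the rest with kernel ridge regression over pixel coordinates. This only illustrates the idea of exploiting smoothness and redundancy; it is not the paper's factorization method, and the filter is synthetic.

    # Predict unobserved weights of a smooth filter from a 25% sample (illustrative sketch).
    import numpy as np

    rng = np.random.default_rng(1)
    side = 16
    coords = np.array([(i, j) for i in range(side) for j in range(side)], dtype=float)

    # A smooth synthetic "filter" standing in for learned weights.
    true_w = np.sin(coords[:, 0] / 4.0) * np.cos(coords[:, 1] / 5.0)

    def rbf(a, b, gamma=0.05):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    observed = rng.choice(side * side, size=64, replace=False)      # 25% of the weights
    K = rbf(coords[observed], coords[observed])
    alpha = np.linalg.solve(K + 1e-3 * np.eye(len(observed)), true_w[observed])
    pred = rbf(coords, coords[observed]) @ alpha                    # predict all weights

    unseen = ~np.isin(np.arange(side * side), observed)
    print(np.abs(pred - true_w)[unseen].mean())   # small mean error on the unobserved weights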

855 citations


Proceedings Article
02 Apr 2013
TL;DR: This paper describes how Facebook leverages memcached as a building block to construct and scale a distributed key-value store that supports the world's largest social network.
Abstract: Memcached is a well known, simple, in-memory caching solution. This paper describes how Facebook leverages memcached as a building block to construct and scale a distributed key-value store that supports the world's largest social network. Our system handles billions of requests per second and holds trillions of items to deliver a rich experience for over a billion users around the world.
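
A minimal sketch of the demand-filled look-aside pattern the paper builds on, with plain dictionaries standing in for memcached and the backing database; the key names are made up.

    # Look-aside caching sketch: read-through on miss, delete (not update) on write.
    cache, database = {}, {"user:42": {"name": "alice"}}

    def read(key):
        value = cache.get(key)
        if value is None:                 # cache miss: fetch from the database...
            value = database.get(key)
            cache[key] = value            # ...and populate the cache on demand
        return value

    def write(key, value):
        database[key] = value             # update the authoritative store first
        cache.pop(key, None)              # then invalidate the cached copy

    print(read("user:42"))                # miss -> fills cache
    write("user:42", {"name": "bob"})     # invalidates the cached copy
    print(read("user:42"))                # miss again -> sees the new value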

784 citations


Proceedings ArticleDOI
05 Dec 2013
TL;DR: It is demonstrated that there is significant redundancy in the parameterization of several deep learning models and not only can the parameter values be predicted, but many of them need not be learned at all.
Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy.

752 citations


Journal ArticleDOI
01 Mar 2013
TL;DR: In this article, the authors present a family of erasure codes that are efficiently repairable and offer higher reliability compared to Reed-Solomon codes, which are the standard design choice and whose high repair cost is often considered an unavoidable price to pay for high storage efficiency and high reliability.
Abstract: Distributed storage systems for large clusters typically use replication to provide reliability. Recently, erasure codes have been used to reduce the large storage overhead of three-replicated systems. Reed-Solomon codes are the standard design choice and their high repair cost is often considered an unavoidable price to pay for high storage efficiency and high reliability. This paper shows how to overcome this limitation. We present a novel family of erasure codes that are efficiently repairable and offer higher reliability compared to Reed-Solomon codes. We show analytically that our codes are optimal on a recently identified tradeoff between locality and minimum distance. We implement our new codes in Hadoop HDFS and compare to a currently deployed HDFS module that uses Reed-Solomon codes. Our modified HDFS implementation shows a reduction of approximately 2× on the repair disk I/O and repair network traffic. The disadvantage of the new coding scheme is that it requires 14% more storage compared to Reed-Solomon codes, an overhead shown to be information theoretically optimal to obtain locality. Because the new codes repair failures faster, this provides higher reliability, which is orders of magnitude higher compared to replication.
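
The locality idea can be shown with a toy XOR parity: a single lost block is rebuilt from its small local group rather than from an entire Reed-Solomon stripe. Real codes of this kind operate over larger finite fields; the XOR version below is only an illustration.

    # A local parity lets one lost block be rebuilt from its group (simplified to XOR over bytes).
    def xor_blocks(blocks):
        out = bytearray(len(blocks[0]))
        for blk in blocks:
            for i, byte in enumerate(blk):
                out[i] ^= byte
        return bytes(out)

    group = [bytes([b] * 4) for b in (10, 20, 30, 40, 50)]   # 5 data blocks in one local group
    local_parity = xor_blocks(group)

    lost_index = 2                                           # block 2 disappears
    survivors = [blk for i, blk in enumerate(group) if i != lost_index]
    repaired = xor_blocks(survivors + [local_parity])

    print(repaired == group[lost_index])   # True: repair reads 5 blocks, not the full stripe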

742 citations


Posted Content
TL;DR: In this paper, pose-normalized CNNs are used to estimate human attributes from images of people under large variation of viewpoint, pose, appearance, articulation, and occlusion.
Abstract: We propose a method for inferring human attributes (such as gender, hair style, clothes style, expression, action) from images of people under large variation of viewpoint, pose, appearance, articulation and occlusion. Convolutional Neural Nets (CNN) have been shown to perform very well on large scale object recognition problems. In the context of attribute classification, however, the signal is often subtle and it may cover only a small part of the image, while the image is dominated by the effects of pose and viewpoint. Discounting for pose variation would require training on very large labeled datasets which are not presently available. Part-based models, such as poselets and DPM have been shown to perform well for this problem but they are limited by shallow low-level features. We propose a new method which combines part-based models and deep learning by training pose-normalized CNNs. We show substantial improvement vs. state-of-the-art methods on challenging attribute classification tasks in unconstrained settings. Experiments confirm that our method outperforms both the best part-based methods on this problem and conventional CNNs trained on the full bounding box of the person.

427 citations


Proceedings Article
26 Jun 2013
TL;DR: TAO is a geographically distributed data store that provides efficient and timely access to the social graph for Facebook's demanding workload using a fixed set of queries.
Abstract: We introduce a simple data model and API tailored for serving the social graph, and TAO, an implementation of this model. TAO is a geographically distributed data store that provides efficient and timely access to the social graph for Facebook's demanding workload using a fixed set of queries. It is deployed at Facebook, replacing memcache for many data types that fit its model. The system runs on thousands of machines, is widely distributed, and provides access to many petabytes of data. TAO can process a billion reads and millions of writes each second.
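
A toy in-memory mock of the objects-and-associations model the abstract refers to, with simplified call names; the real system is a distributed, cache-backed store with a much richer API.

    # Typed nodes plus typed, time-ordered directed edges queried by (id, association type).
    import time
    from collections import defaultdict

    objects = {}                                  # id -> (otype, data)
    assocs = defaultdict(list)                    # (id1, atype) -> [(time, id2, data), ...]

    def object_add(oid, otype, data):
        objects[oid] = (otype, data)

    def assoc_add(id1, atype, id2, data=None, ts=None):
        assocs[(id1, atype)].append((ts or time.time(), id2, data))

    def assoc_range(id1, atype, pos=0, limit=10):
        """Return edges newest-first, the access pattern that dominates the workload."""
        edges = sorted(assocs[(id1, atype)], key=lambda e: e[0], reverse=True)
        return edges[pos:pos + limit]

    object_add(1, "user", {"name": "alice"})
    object_add(2, "post", {"text": "hello"})
    assoc_add(1, "likes", 2)
    print(assoc_range(1, "likes"))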

426 citations


Proceedings ArticleDOI
27 Apr 2013
TL;DR: This paper combines survey and large-scale log data to examine how well users' perceptions of their audience match their actual audience on Facebook, and finds that social media users consistently underestimate their audience size for their posts.
Abstract: When you share content in an online social network, who is listening? Users have scarce information about who actually sees their content, making their audience seem invisible and difficult to estimate. However, understanding this invisible audience can impact both science and design, since perceived audiences influence content production and self-presentation online. In this paper, we combine survey and large-scale log data to examine how well users' perceptions of their audience match their actual audience on Facebook. We find that social media users consistently underestimate their audience size for their posts, guessing that their audience is just 27% of its true size. Qualitative coding of survey responses reveals folk theories that attempt to reverse-engineer audience size using feedback and friend count, though none of these approaches are particularly accurate. We analyze audience logs for 222,000 Facebook users' posts over the course of one month and find that publicly visible signals --- friend count, likes, and comments --- vary widely and do not strongly indicate the audience of a single post. Despite the variation, users typically reach 61% of their friends each month. Together, our results begin to reveal the invisible undercurrents of audience attention and behavior in online social networks.

Proceedings ArticleDOI
13 May 2013
TL;DR: The method, which is referred to as CopyCatch, detects lockstep Page Like patterns on Facebook by analyzing only the social graph between users and Pages and the times at which the edges in the graph were created.
Abstract: How can web services that depend on user generated content discern fraudulent input by spammers from legitimate input? In this paper we focus on the social network Facebook and the problem of discerning ill-gotten Page Likes, made by spammers hoping to turn a profit, from legitimate Page Likes. Our method, which we refer to as CopyCatch, detects lockstep Page Like patterns on Facebook by analyzing only the social graph between users and Pages and the times at which the edges in the graph (the Likes) were created. We offer the following contributions: (1) We give a novel problem formulation, with a simple concrete definition of suspicious behavior in terms of graph structure and edge constraints. (2) We offer two algorithms to find such suspicious lockstep behavior - one provably-convergent iterative algorithm and one approximate, scalable MapReduce implementation. (3) We show that our method severely limits "greedy attacks" and analyze the bounds from the application of the Zarankiewicz problem to our setting. Finally, we demonstrate and discuss the effectiveness of CopyCatch at Facebook and on synthetic data, as well as potential extensions to anomaly detection problems in other domains. CopyCatch is actively in use at Facebook, searching for attacks on Facebook's social graph of over a billion users, many millions of Pages, and billions of Page Likes.
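
A brute-force caricature of the lockstep definition: flag user groups that Like the same Pages within a narrow time window. The paper's contribution is a provably convergent iterative algorithm and a MapReduce implementation; the toy data and thresholds below are assumptions.

    # Flag pairs of users who Like the same Pages within WINDOW seconds of each other.
    from collections import defaultdict
    from itertools import combinations

    likes = [  # (user, page, timestamp) -- toy data
        ("u1", "pA", 100), ("u2", "pA", 101), ("u3", "pA", 103),
        ("u1", "pB", 200), ("u2", "pB", 202), ("u3", "pB", 201),
        ("u4", "pA", 500), ("u4", "pB", 900),
    ]
    WINDOW, MIN_PAGES = 5, 2

    by_page = defaultdict(list)
    for user, page, ts in likes:
        by_page[page].append((ts, user))

    pair_hits = defaultdict(set)
    for page, events in by_page.items():
        for (t1, u1), (t2, u2) in combinations(sorted(events), 2):
            if u1 != u2 and abs(t1 - t2) <= WINDOW:
                pair_hits[frozenset((u1, u2))].add(page)

    suspicious = {pair: pages for pair, pages in pair_hits.items() if len(pages) >= MIN_PAGES}
    print(suspicious)   # {u1,u2}, {u1,u3}, {u2,u3} liked pA and pB in lockstep; u4 did not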

Proceedings ArticleDOI
22 Jun 2013
TL;DR: LinkBench provides a realistic and challenging test for persistent storage of social and web service data, filling a gap in the available tools for researchers, developers and administrators.
Abstract: Database benchmarks are an important tool for database researchers and practitioners that ease the process of making informed comparisons between different database hardware, software and configurations. Large scale web services such as social networks are a major and growing database application area, but currently there are few benchmarks that accurately model web service workloads. In this paper we present a new synthetic benchmark called LinkBench. LinkBench is based on traces from production databases that store "social graph" data at Facebook, a major social network. We characterize the data and query workload in many dimensions, and use the insights gained to construct a realistic synthetic benchmark. LinkBench provides a realistic and challenging test for persistent storage of social and web service data, filling a gap in the available tools for researchers, developers and administrators.

Journal ArticleDOI
TL;DR: Internet companies such as Facebook operate in a "perpetual development" mindset, which means that the website continues to undergo development with no predefined final objective, and that new developments are deployed so that users can enjoy them as soon as they're ready.
Abstract: Internet companies such as Facebook operate in a "perpetual development" mindset. This means that the website continues to undergo development with no predefined final objective, and that new developments are deployed so that users can enjoy them as soon as they're ready. To support this, Facebook uses both technical approaches such as peer review and extensive automated testing, and a culture of personal responsibility.

Journal ArticleDOI
02 Jan 2013-PLOS ONE
TL;DR: It was found that the frequency of online interaction was diagnostic of strong ties, and interaction frequency was much more useful diagnostically than were attributes of the user or the user's friends.
Abstract: Some social connections are stronger than others. People have not only friends, but also best friends. Social scientists have long recognized this characteristic of social connections and researchers frequently use the term tie strength to refer to this concept. We used online interaction data (specifically, Facebook interactions) to successfully identify real-world strong ties. Ground truth was established by asking users themselves to name their closest friends in real life. We found the frequency of online interaction was diagnostic of strong ties, and interaction frequency was much more useful diagnostically than were attributes of the user or the user's friends. More private communications (messages) were not necessarily more informative than public communications (comments, wall posts, and other interactions).

Proceedings ArticleDOI
13 May 2013
TL;DR: This paper obtains bounds on the error rate of the algorithm and shows it is governed by the expansion of the graph, and demonstrates, using several synthetic and real datasets, that the algorithm outperforms the state of the art.
Abstract: In this paper we analyze a crowdsourcing system consisting of a set of users and a set of binary choice questions. Each user has an unknown, fixed, reliability that determines the user's error rate in answering questions. The problem is to determine the truth values of the questions solely based on the user answers. Although this problem has been studied extensively, theoretical error bounds have been shown only for restricted settings: when the graph between users and questions is either random or complete. In this paper we consider a general setting of the problem where the user--question graph can be arbitrary. We obtain bounds on the error rate of our algorithm and show it is governed by the expansion of the graph. We demonstrate, using several synthetic and real datasets, that our algorithm outperforms the state of the art.
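
For intuition about the setup, here is a generic iterative weighted-majority scheme over an arbitrary user-question graph; it is not the paper's algorithm and carries none of its expansion-based guarantees.

    # Alternate between reliability-weighted answer estimates and agreement-based reliabilities.
    answers = {  # (user, question) -> answer in {+1, -1}; an arbitrary bipartite graph
        ("u1", "q1"): +1, ("u1", "q2"): +1,
        ("u2", "q1"): +1, ("u2", "q3"): -1,
        ("u3", "q2"): -1, ("u3", "q3"): -1,
        ("u4", "q1"): -1, ("u4", "q2"): +1, ("u4", "q3"): -1,
    }
    users = {u for u, _ in answers}
    questions = {q for _, q in answers}
    reliability = {u: 1.0 for u in users}

    for _ in range(10):
        truth = {}
        for q in questions:  # weighted majority vote per question
            score = sum(reliability[u] * a for (u, qq), a in answers.items() if qq == q)
            truth[q] = 1 if score >= 0 else -1
        for u in users:      # reliability = how often the user agrees with the current truth
            votes = [(q, a) for (uu, q), a in answers.items() if uu == u]
            agree = sum(1 for q, a in votes if a == truth[q]) / len(votes)
            reliability[u] = 2 * agree - 1        # map [0, 1] agreement to a [-1, 1] weight

    print(truth, reliability)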

Proceedings ArticleDOI
03 Nov 2013
TL;DR: This paper instrumented every Facebook-controlled layer of the stack and sampled the resulting event stream to obtain traces covering over 77 million requests for more than 1 million unique photos to study traffic patterns, cache access patterns, geolocation of clients and servers, and to explore correlation between properties of the content and accesses.
Abstract: This paper examines the workload of Facebook's photo-serving stack and the effectiveness of the many layers of caching it employs. Facebook's image-management infrastructure is complex and geographically distributed. It includes browser caches on end-user systems, Edge Caches at ~20 PoPs, an Origin Cache, and for some kinds of images, additional caching via Akamai. The underlying image storage layer is widely distributed, and includes multiple data centers. We instrumented every Facebook-controlled layer of the stack and sampled the resulting event stream to obtain traces covering over 77 million requests for more than 1 million unique photos. This permits us to study traffic patterns, cache access patterns, geolocation of clients and servers, and to explore correlation between properties of the content and accesses. Our results (1) quantify the overall traffic percentages served by different layers: 65.5% browser cache, 20.0% Edge Cache, 4.6% Origin Cache, and 9.9% Backend storage, (2) reveal that a significant portion of photo requests are routed to remote PoPs and data centers as a consequence both of load-balancing and peering policy, (3) demonstrate the potential performance benefits of coordinating Edge Caches and adopting S4LRU eviction algorithms at both Edge and Origin layers, and (4) show that the popularity of photos is highly dependent on content age and conditionally dependent on the social-networking metrics we considered.
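
A compact sketch of S4LRU, the segmented-LRU eviction policy named above: four LRU queues, with new keys admitted at the lowest level, hits promoting a key one level, and overflow demoting a level's least-recent key downward. Details of the deployed implementation may differ.

    # Segmented LRU with four levels; level 0 is the lowest, level 3 the most protected.
    from collections import OrderedDict

    class S4LRU:
        def __init__(self, per_level_capacity):
            self.cap = per_level_capacity
            self.levels = [OrderedDict() for _ in range(4)]

        def _find(self, key):
            for lvl, d in enumerate(self.levels):
                if key in d:
                    return lvl
            return None

        def _insert(self, level, key):
            self.levels[level][key] = True
            if len(self.levels[level]) > self.cap:
                victim, _ = self.levels[level].popitem(last=False)   # least recent in this level
                if level > 0:
                    self._insert(level - 1, victim)                  # demote one level down
                # at level 0 the victim is simply evicted from the cache

        def access(self, key):
            lvl = self._find(key)
            if lvl is None:
                self._insert(0, key)                 # miss: admit at the lowest level
                return False
            del self.levels[lvl][key]
            self._insert(min(lvl + 1, 3), key)       # hit: promote one level
            return True

    cache = S4LRU(per_level_capacity=2)
    for key in ["a", "b", "a", "c", "a", "d", "e", "a"]:
        cache.access(key)
    print([list(d) for d in cache.levels])   # "a" has been promoted by repeated hits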

Posted Content
TL;DR: In this paper, the authors present a study on the impact of recovery operations of erasure-coded data on the data-center network, based on measurements from Facebook's warehouse cluster in production.
Abstract: Erasure codes, such as Reed-Solomon (RS) codes, are being increasingly employed in data centers to combat the cost of reliably storing large amounts of data. Although these codes provide optimal storage efficiency, they require significantly high network and disk usage during recovery of missing data. In this paper, we first present a study on the impact of recovery operations of erasure-coded data on the data-center network, based on measurements from Facebook's warehouse cluster in production. To the best of our knowledge, this is the first study of its kind available in the literature. Our study reveals that recovery of RS-coded data results in a significant increase in network traffic, more than a hundred terabytes per day, in a cluster storing multiple petabytes of RS-coded data. To address this issue, we present a new storage code using our recently proposed "Piggybacking" framework, that reduces the network and disk usage during recovery by 30% in theory, while also being storage optimal and supporting arbitrary design parameters. The implementation of the proposed code in the Hadoop Distributed File System (HDFS) is underway. We use the measurements from the warehouse cluster to show that the proposed code would lead to a reduction of close to fifty terabytes of cross-rack traffic per day.

Proceedings ArticleDOI
23 Jun 2013
TL;DR: This work argues for an alternate architecture---Thin Servers with Smart Pipes (TSSP)---for cost-effective high-performance memcached deployment, and demonstrates the potential benefits of the TSSP architecture through an FPGA prototyping platform, and shows the potential for a 6X-16X power-performance improvement over conventional server baselines.
Abstract: Distributed in-memory key-value stores, such as memcached, are central to the scalability of modern internet services. Current deployments use commodity servers with high-end processors. However, given the cost-sensitivity of internet services and the recent proliferation of volume low-power System-on-Chip (SoC) designs, we see an opportunity for alternative architectures. We undertake a detailed characterization of memcached to reveal performance and power inefficiencies. Our study considers both high-performance and low-power CPUs and NICs across a variety of carefully-designed benchmarks that exercise the range of memcached behavior. We discover that, regardless of CPU microarchitecture, memcached execution is remarkably inefficient, saturating neither network links nor available memory bandwidth. Instead, we find performance is typically limited by the per-packet processing overheads in the NIC and OS kernel---long code paths limit CPU performance due to poor branch predictability and instruction fetch bottlenecks. Our insights suggest that neither high-performance nor low-power cores provide a satisfactory power-performance trade-off, and point to a need for tighter integration of the network interface. Hence, we argue for an alternate architecture---Thin Servers with Smart Pipes (TSSP)---for cost-effective high-performance memcached deployment. TSSP couples an embedded-class low-power core to a memcached accelerator that can process GET requests entirely in hardware, offloading both network handling and data lookup. We demonstrate the potential benefits of our TSSP architecture through an FPGA prototyping platform, and show the potential for a 6X-16X power-performance improvement over conventional server baselines.

Proceedings ArticleDOI
23 Feb 2013
TL;DR: Surprisingly, individuals who have lost a job feel greater stress after talking with strong ties, and communication with strong ties is more predictive of finding employment within three months.
Abstract: Among those who have recently lost a job, social networks in general and online ones in particular may be useful to cope with stress and find new employment. This study focuses on the psychological and practical consequences of Facebook use following job loss. By pairing longitudinal surveys of Facebook users with logs of their online behavior, we examine how communication with different kinds of ties predicts improvements in stress, social support, bridging social capital, and whether they find new jobs. Losing a job is associated with increases in stress, while talking with strong ties is generally associated with improvements in stress and social support. Weak ties do not provide these benefits. Bridging social capital comes from both strong and weak ties. Surprisingly, individuals who have lost a job feel greater stress after talking with strong ties. Contrary to the "strength of weak ties" hypothesis, communication with strong ties is more predictive of finding employment within three months.

Proceedings ArticleDOI
04 Feb 2013
TL;DR: This work introduces an efficient algorithm, balanced label propagation, for precisely partitioning massive graphs while greedily maximizing edge locality, the number of edges that are assigned to the same shard of a partition.
Abstract: Partitioning graphs at scale is a key challenge for any application that involves distributing a graph across disks, machines, or data centers. Graph partitioning is a very well studied problem with a rich literature, but existing algorithms typically cannot scale to billions of edges, or cannot provide guarantees about partition sizes. In this work we introduce an efficient algorithm, balanced label propagation, for precisely partitioning massive graphs while greedily maximizing edge locality, the number of edges that are assigned to the same shard of a partition. By combining the computational efficiency of label propagation --- where nodes are iteratively relabeled to the same 'label' as the plurality of their graph neighbors --- with the guarantees of constrained optimization --- guiding the propagation by a linear program constraining the partition sizes --- our algorithm makes it practically possible to partition graphs with billions of edges. Our algorithm is motivated by the challenge of performing graph predictions in a distributed system. Because this requires assigning each node in a graph to a physical machine with memory limitations, it is critically necessary to ensure the resulting partition shards do not overload any single machine. We evaluate our algorithm for its partitioning performance on the Facebook social graph, and also study its performance when partitioning Facebook's 'People You May Know' service (PYMK), the distributed system responsible for the feature extraction and ranking of the friends-of-friends of all active Facebook users. In a live deployment, we observed average query times and average network traffic levels that were 50.5% and 37.1% (respectively) of those under the previous naive random sharding.
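
A greedy, capacity-capped caricature of the idea: relabel each node to the shard holding the plurality of its neighbors, but only while the target shard has spare capacity. The paper instead solves a linear program to decide how many nodes move between each pair of shards; the toy graph and capacities below are assumptions.

    # Label propagation with a hard per-shard size limit (illustrative sketch only).
    from collections import Counter

    edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    shard = {n: n % 2 for n in neighbors}            # arbitrary initial 2-way partition
    capacity = {0: 4, 1: 4}                          # per-shard size limits

    for _ in range(5):                               # a few label-propagation sweeps
        for node in sorted(neighbors):
            counts = Counter(shard[nb] for nb in neighbors[node])
            target = counts.most_common(1)[0][0]
            size = Counter(shard.values())
            if target != shard[node] and size[target] < capacity[target]:
                shard[node] = target                 # move only if balance is preserved

    locality = sum(shard[a] == shard[b] for a, b in edges)
    print(shard, f"{locality}/{len(edges)} edges internal to a shard")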

Posted Content
TL;DR: It is shown that proper cluster randomization can lead to exponentially lower estimator variance when experimentally measuring average treatment effects under interference, and if a graph satisfies a restricted-growth condition on the growth rate of neighborhoods, then there exists a natural clustering algorithm, based on vertex neighborhoods, for which the variance of the estimator can be upper bounded by a linear function of the degrees.
Abstract: A/B testing is a standard approach for evaluating the effect of online experiments; the goal is to estimate the `average treatment effect' of a new feature or condition by exposing a sample of the overall population to it. A drawback with A/B testing is that it is poorly suited for experiments involving social interference, when the treatment of individuals spills over to neighboring individuals along an underlying social network. In this work, we propose a novel methodology using graph clustering to analyze average treatment effects under social interference. To begin, we characterize graph-theoretic conditions under which individuals can be considered to be `network exposed' to an experiment. We then show how graph cluster randomization admits an efficient exact algorithm to compute the probabilities for each vertex being network exposed under several of these exposure conditions. Using these probabilities as inverse weights, a Horvitz-Thompson estimator can then provide an effect estimate that is unbiased, provided that the exposure model has been properly specified. Given an estimator that is unbiased, we focus on minimizing the variance. First, we develop simple sufficient conditions for the variance of the estimator to be asymptotically small in n, the size of the graph. However, for general randomization schemes, this variance can be lower bounded by an exponential function of the degrees of a graph. In contrast, we show that if a graph satisfies a restricted-growth condition on the growth rate of neighborhoods, then there exists a natural clustering algorithm, based on vertex neighborhoods, for which the variance of the estimator can be upper bounded by a linear function of the degrees. Thus we show that proper cluster randomization can lead to exponentially lower estimator variance when experimentally measuring average treatment effects under interference.
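
The estimator can be sketched end to end on a toy graph: clusters are randomized as units, a vertex counts as network exposed when it and all its neighbors are treated, and exact exposure probabilities (computed here by brute-force enumeration rather than the paper's efficient algorithm) become inverse weights in a Horvitz-Thompson estimate.

    # Graph cluster randomization with a Horvitz-Thompson estimate (toy graph, brute force).
    from itertools import product

    edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]   # a 6-cycle
    cluster = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}             # 3 clusters of 2 vertices
    nbrs = {v: {a for e in edges for a in e if v in e and a != v} for v in cluster}

    def exposed(v, treated_clusters):
        return all(cluster[u] in treated_clusters for u in nbrs[v] | {v})

    # Exact probability of full-neighborhood exposure under iid cluster assignment (p = 0.5).
    p_exposed = {v: 0.0 for v in cluster}
    assignments = list(product([0, 1], repeat=3))
    for bits in assignments:
        treated = {c for c, b in enumerate(bits) if b}
        for v in cluster:
            p_exposed[v] += exposed(v, treated) / len(assignments)

    # One realized experiment: inverse-probability-weighted average among exposed vertices.
    treated = {0, 1}                                  # clusters 0 and 1 happen to be treated
    outcome = {v: 1.0 + 0.5 * exposed(v, treated) for v in cluster}   # toy outcomes
    n = len(cluster)
    ht_estimate = sum(outcome[v] / p_exposed[v] for v in cluster if exposed(v, treated)) / n
    print(p_exposed, ht_estimate)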

Proceedings ArticleDOI
13 May 2013
TL;DR: This work finds that the space of subgraph frequencies is governed both by its combinatorial properties --- based on extremal results that constrain all graphs --- as well as by its empirical properties, manifested in the way that real social graphs appear to lie near a simple one-dimensional curve through this space.
Abstract: A growing set of on-line applications are generating data that can be viewed as very large collections of small, dense social graphs --- these range from sets of social groups, events, or collaboration projects to the vast collection of graph neighborhoods in large social networks. A natural question is how to usefully define a domain-independent 'coordinate system' for such a collection of graphs, so that the set of possible structures can be compactly represented and understood within a common space. In this work, we draw on the theory of graph homomorphisms to formulate and analyze such a representation, based on computing the frequencies of small induced subgraphs within each graph. We find that the space of subgraph frequencies is governed both by its combinatorial properties --- based on extremal results that constrain all graphs --- as well as by its empirical properties --- manifested in the way that real social graphs appear to lie near a simple one-dimensional curve through this space. We develop flexible frameworks for studying each of these aspects. For capturing empirical properties, we characterize a simple stochastic generative model, a single-parameter extension of Erdos-Renyi random graphs, whose stationary distribution over subgraphs closely tracks the one-dimensional concentration of the real social graph families. For the extremal properties, we develop a tractable linear program for bounding the feasible space of subgraph frequencies by harnessing a toolkit of known extremal graph theory. Together, these two complementary frameworks shed light on a fundamental question pertaining to social graphs: what properties of social graphs are 'social' properties and what properties are 'graph' properties? We conclude with a brief demonstration of how the coordinate system we examine can also be used to perform classification tasks, distinguishing between structures arising from different types of social graphs.
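
The 'coordinates' in question are easy to compute for a small graph: enumerate all vertex triples and tally which of the four induced 3-vertex subgraphs each one forms. The toy graph below is an assumption.

    # Frequencies of the four induced subgraphs on 3 vertices (0, 1, 2, or 3 edges).
    from itertools import combinations

    edges = {(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)}
    vertices = {v for e in edges for v in e}

    def connected(a, b):
        return (a, b) in edges or (b, a) in edges

    counts = [0, 0, 0, 0]                      # indexed by number of edges in the triple
    for trio in combinations(sorted(vertices), 3):
        k = sum(connected(a, b) for a, b in combinations(trio, 2))
        counts[k] += 1

    total = sum(counts)
    print([c / total for c in counts])         # the graph's position in "subgraph space"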

Proceedings Article
28 Jun 2013
TL;DR: There is specific evidence supporting the theory that a user’s “perceived audience” lies at the heart of the issue: posts are censored more frequently than comments, with status updates and posts directed at groups censored most frequently of all sharing use cases investigated.
Abstract: We report results from an exploratory analysis examining “last-minute” self-censorship, or content that is filtered after being written, on Facebook. We collected data from 3.9 million users over 17 days and associate self-censorship behavior with features describing users, their social graph, and the interactions between them. Our results indicate that 71% of users exhibited some level of last-minute self-censorship in the time period, and provide specific evidence supporting the theory that a user’s “perceived audience” lies at the heart of the issue: posts are censored more frequently than comments, with status updates and posts directed at groups censored most frequently of all sharing use cases investigated. Furthermore, we find that: people with more boundaries to regulate censor more; males censor more posts than females and censor even more posts with mostly male friends than do females, but censor no more comments than females; people who exercise more control over their audience censor more content; and, users with more politically and age diverse friends censor less, in general.

Proceedings Article
28 Jun 2013
TL;DR: It is found that a small fraction of photos account for a significant proportion of reshare activity and generate cascades of non-trivial size and depth, suggesting not only that cascades can achieve considerable size but that they can do so in distinct ways.
Abstract: When users post photos on Facebook, they have the option of allowing their friends, followers, or anyone at all to subsequently reshare the photo. A portion of the billions of photos posted to Facebook generates cascades of reshares, enabling many additional users to see, like, comment, and reshare the photos. In this paper we present characteristics of such cascades in aggregate, finding that a small fraction of photos account for a significant proportion of reshare activity and generate cascades of non-trivial size and depth. We also show that the true influence chains in such cascades can be much deeper than what is visible through direct attribution. To illuminate how large cascades can form, we study the diffusion trees of two widely distributed photos: one posted on President Barack Obama’s page following his reelection victory, and another posted by an individual Facebook user hoping to garner enough likes for a cause. We show that the two cascades, despite achieving comparable total sizes, are markedly different in their time evolution, reshare depth distribution, predictability of subcascade sizes, and the demographics of users who propagate them. The findings suggest not only that cascades can achieve considerable size but that they can do so in distinct ways.

Proceedings Article
27 Jun 2013
TL;DR: A new storage code is presented, using the recently proposed "Piggybacking" framework, that reduces the network and disk usage during recovery by 30% in theory, while also being storage optimal and supporting arbitrary design parameters.
Abstract: Erasure codes, such as Reed-Solomon (RS) codes, are being increasingly employed in data centers to combat the cost of reliably storing large amounts of data. Although these codes provide optimal storage efficiency, they require significantly high network and disk usage during recovery of missing data. In this paper, we first present a study on the impact of recovery operations of erasure-coded data on the data-center network, based on measurements from Facebook's warehouse cluster in production. To the best of our knowledge, this is the first study of its kind available in the literature. Our study reveals that recovery of RS-coded data results in a significant increase in network traffic, more than a hundred terabytes per day, in a cluster storing multiple petabytes of RS-coded data. To address this issue, we present a new storage code using our recently proposed Piggybacking framework, that reduces the network and disk usage during recovery by 30% in theory, while also being storage optimal and supporting arbitrary design parameters. The implementation of the proposed code in the Hadoop Distributed File System (HDFS) is underway. We use the measurements from the warehouse cluster to show that the proposed code would lead to a reduction of close to fifty terabytes of cross-rack traffic per day.

Proceedings ArticleDOI
13 May 2013
TL;DR: A new approach to information credibility, Latent Credibility Analysis (LCA), is introduced, constructing strongly principled, probabilistic models where the truth of each claim is a latent variable and the credibility of a source is captured by a set of model parameters.
Abstract: A frequent problem when dealing with data gathered from multiple sources on the web (ranging from booksellers to Wikipedia pages to stock analyst predictions) is that these sources disagree, and we must decide which of their (often mutually exclusive) claims we should accept. Current state-of-the-art information credibility algorithms known as "fact-finders" are transitive voting systems with rules specifying how votes iteratively flow from sources to claims and then back to sources. While this is quite tractable and often effective, fact-finders also suffer from substantial limitations; in particular, a lack of transparency obfuscates their credibility decisions and makes them difficult to adapt and analyze: knowing the mechanics of how votes are calculated does not readily tell us what those votes mean, and finding, for example, that a source has a score of 6 is not informative. We introduce a new approach to information credibility, Latent Credibility Analysis (LCA), constructing strongly principled, probabilistic models where the truth of each claim is a latent variable and the credibility of a source is captured by a set of model parameters. This gives LCA models clear semantics and modularity that make extending them to capture additional observed and latent credibility factors straightforward. Experiments over four real-world datasets demonstrate that LCA models can outperform the best fact-finders in both unsupervised and semi-supervised settings.
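
A stripped-down latent-variable model in the spirit of the approach: each claim's truth is a latent binary variable, each source has an honesty parameter, and EM alternates between posterior truth estimates and honesty updates. This is not one of the paper's exact models; the assertions are toy data.

    # EM for a minimal source-credibility model over binary claims.
    assertions = {  # (source, claim) -> asserted value in {True, False}
        ("s1", "c1"): True,  ("s1", "c2"): True,
        ("s2", "c1"): True,  ("s2", "c2"): False,
        ("s3", "c1"): False, ("s3", "c2"): False,
    }
    sources = {s for s, _ in assertions}
    claims = {c for _, c in assertions}
    honesty = {s: 0.7 for s in sources}          # initial guess

    for _ in range(20):
        # E-step: posterior that each claim is true, assuming a 0.5 prior.
        posterior = {}
        for c in claims:
            p_true, p_false = 0.5, 0.5
            for (s, cc), value in assertions.items():
                if cc != c:
                    continue
                p_true *= honesty[s] if value else (1 - honesty[s])
                p_false *= (1 - honesty[s]) if value else honesty[s]
            posterior[c] = p_true / (p_true + p_false)
        # M-step: a source's honesty is its expected rate of agreement with the truth.
        for s in sources:
            votes = [(c, v) for (ss, c), v in assertions.items() if ss == s]
            expected_correct = sum(posterior[c] if v else 1 - posterior[c] for c, v in votes)
            honesty[s] = expected_correct / len(votes)

    print(posterior, honesty)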

Proceedings ArticleDOI
TL;DR: This work uses data from a large sample of Facebook users to investigate a particular category of strong ties, those involving spouses or romantic partners, and offers methods for identifying types of structurally significant people in on-line applications and suggests a potential expansion of existing theories of tie strength.
Abstract: A crucial task in the analysis of on-line social-networking systems is to identify important people --- those linked by strong social ties --- within an individual's network neighborhood. Here we investigate this question for a particular category of strong ties, those involving spouses or romantic partners. We organize our analysis around a basic question: given all the connections among a person's friends, can you recognize his or her romantic partner from the network structure alone? Using data from a large sample of Facebook users, we find that this task can be accomplished with high accuracy, but doing so requires the development of a new measure of tie strength that we term `dispersion' --- the extent to which two people's mutual friends are not themselves well-connected. The results offer methods for identifying types of structurally significant people in on-line applications, and suggest a potential expansion of existing theories of tie strength.
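
One simplified reading of the dispersion measure: for a tie (u, v), count pairs of their mutual friends that are neither directly connected nor linked through a common friend other than u and v. The paper also studies normalized and recursive variants; the toy ego network below is an assumption.

    # Simplified 'dispersion' of the tie (u, v).
    from itertools import combinations

    def neighbors(edges):
        nbrs = {}
        for a, b in edges:
            nbrs.setdefault(a, set()).add(b)
            nbrs.setdefault(b, set()).add(a)
        return nbrs

    def dispersion(u, v, nbrs):
        common = nbrs[u] & nbrs[v]
        score = 0
        for s, t in combinations(common, 2):
            directly_linked = t in nbrs[s]
            other_common = (nbrs[s] & nbrs[t]) - {u, v}
            if not directly_linked and not other_common:
                score += 1      # s and t are 'dispersed': only u and v tie them together
        return score

    edges = [("u", "v"), ("u", "a"), ("v", "a"), ("u", "b"), ("v", "b"),
             ("u", "c"), ("v", "c"), ("a", "b")]        # toy ego network
    nbrs = neighbors(edges)
    print(dispersion("u", "v", nbrs))   # 2: pairs (a, c) and (b, c) are dispersed; (a, b) is not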

Patent
Raghotham Murthy, Rajat Goel
19 Dec 2013
TL;DR: In this article, the authors present a low-latency database query processing system consisting of a gateway server and a plurality of worker nodes, where the gateway server is configured to divide a database query, for a database containing data stored in a distributed storage cluster having a plurality-of-data nodes, into partial queries and construct a query result based on the plurality of intermediate results.
Abstract: Techniques for a system capable of performing low-latency database query processing are disclosed herein. The system includes a gateway server and a plurality of worker nodes. The gateway server is configured to divide a database query, for a database containing data stored in a distributed storage cluster having a plurality of data nodes, into a plurality of partial queries and construct a query result based on a plurality of intermediate results. Each worker node of the plurality of worker nodes is configured to process a respective partial query of the plurality of partial queries by scanning data related to the respective partial query that is stored on at least one data node of the distributed storage cluster, and to generate an intermediate result of the plurality of intermediate results that is stored in a memory of that worker node.
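
A toy scatter-gather flow matching the claim's structure: a gateway splits a query into partial queries, workers scan their local partitions in parallel, and the gateway combines the in-memory intermediate results. The data layout and the aggregate chosen here are illustrative assumptions.

    # Gateway scatters partial queries to workers, then gathers and combines intermediates.
    from concurrent.futures import ThreadPoolExecutor

    partitions = [                       # data spread across three "data nodes"
        [("us", 3), ("fr", 1), ("us", 2)],
        [("fr", 4), ("de", 5)],
        [("us", 1), ("de", 2)],
    ]

    def partial_query(rows, country):    # each worker scans only its local rows
        return sum(value for c, value in rows if c == country)

    def gateway(country):
        with ThreadPoolExecutor() as pool:
            intermediates = list(pool.map(lambda rows: partial_query(rows, country), partitions))
        return sum(intermediates)        # combine intermediate results into the final answer

    print(gateway("us"))                 # 6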

Journal ArticleDOI
TL;DR: This work considers unions of conjunctive queries (UCQ), which are equivalent to positive, existential First Order Logic sentences and also to nonrecursive datalog programs, and proves a dichotomy theorem: for every UCQ query, either its probability can be computed in polynomial time in the size of the database, or it is #P-hard.
Abstract: We study the complexity of computing a query on a probabilistic database. We consider unions of conjunctive queries, UCQ, which are equivalent to positive, existential First Order Logic sentences, and also to nonrecursive datalog programs. The tuples in the database are independent random events. We prove the following dichotomy theorem. For every UCQ query, either its probability can be computed in polynomial time in the size of the database, or it is #P-hard. Our result also has applications to the problem of computing the probability of positive, Boolean expressions, and establishes a dichotomy for such classes based on their structure. For the tractable case, we give a very simple algorithm that alternates between two steps: applying the inclusion/exclusion formula, and removing one existential variable. A key and novel feature of this algorithm is that it avoids computing terms that cancel out in the inclusion/exclusion formula, in other words it only computes those terms whose Möbius function in an appropriate lattice is nonzero. We show that this simple feature is a key ingredient needed to ensure completeness. For the hardness proof, we give a reduction from the counting problem for positive, partitioned 2CNF, which is known to be #P-complete. The hardness proof is nontrivial, and combines techniques from logic, classical algebra, and analysis.
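
On the tractable side, the probability of a simple safe UCQ over a tuple-independent database follows from independence plus one inclusion/exclusion step; the relations and tuple probabilities below are assumptions for illustration.

    # Toy tuple-independent database: each tuple is present independently with its probability.
    R = {"a": 0.5, "b": 0.4}          # tuples of relation R with marginal probabilities
    S = {"c": 0.3}                    # tuples of relation S

    def prob_exists(rel):
        """P(relation is non-empty) = 1 - prod(1 - p_t), since tuples are independent."""
        p_empty = 1.0
        for p in rel.values():
            p_empty *= (1.0 - p)
        return 1.0 - p_empty

    # Safe UCQ: Q = (exists x R(x)) OR (exists y S(y)).
    p_r, p_s = prob_exists(R), prob_exists(S)
    # Inclusion/exclusion; the conjunction factorizes because R and S share no tuples.
    p_q = p_r + p_s - p_r * p_s
    print(round(p_r, 3), round(p_s, 3), round(p_q, 3))   # 0.7, 0.3, 0.79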