Showing papers by "Google published in 2010"

PDF

Open Access

Proceedings Article•

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.

[...]

John C. Duchi¹, Elad Hazan², Yoram Singer³•Institutions (3)

University of California, Berkeley¹, IBM², Google³

01 Jan 2010

TL;DR: Adaptive subgradient methods as discussed by the authors dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning, which allows us to find needles in haystacks in the form of very predictive but rarely seen features.

...read moreread less

Abstract: We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms.

...read moreread less

7,244 citations

Proceedings Article•DOI•

Pregel: a system for large-scale graph processing

[...]

Grzegorz Malewicz, Matthew H. Austern¹, Aart J. C. Bik¹, James C. Dehnert¹, Ilan Horn¹, Naty Leiser¹, Grzegorz Czajkowski¹ - Show less +3 more•Institutions (1)

Google¹

06 Jun 2010

TL;DR: A model for processing large graphs that has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier.

...read moreread less

Abstract: Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs - in some cases billions of vertices, trillions of edges - poses challenges to their efficient processing. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms. The model has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.

...read moreread less

3,840 citations

Journal Article•DOI•

A theory of learning from different domains

[...]

Shai Ben-David¹, John Blitzer², Koby Crammer³, Alex Kulesza⁴, Fernando Pereira⁵, Jennifer Wortman Vaughan⁶ - Show less +2 more•Institutions (6)

University of Waterloo¹, University of California, Berkeley², Technion – Israel Institute of Technology³, University of Pennsylvania⁴, Google⁵, Harvard University⁶

01 May 2010-Machine Learning

TL;DR: A classifier-induced divergence measure that can be estimated from finite, unlabeled samples from the domains and shows how to choose the optimal combination of source and target error as a function of the divergence, the sample sizes of both domains, and the complexity of the hypothesis class.

...read moreread less

Abstract: Discriminative learning methods for classification perform well when training and test data are drawn from the same distribution. Often, however, we have plentiful labeled training data from a source domain but wish to learn a classifier which performs well on a target domain with a different distribution and little or no labeled training data. In this work we investigate two questions. First, under what conditions can a classifier trained from source data be expected to perform well on target data? Second, given a small amount of labeled target data, how should we combine it during training with the large amount of labeled source data to achieve the lowest target error at test time? We address the first question by bounding a classifier's target error in terms of its source error and the divergence between the two domains. We give a classifier-induced divergence measure that can be estimated from finite, unlabeled samples from the domains. Under the assumption that there exists some hypothesis that performs well in both domains, we show that this quantity together with the empirical source error characterize the target error of a source-trained classifier. We answer the second question by bounding the target error of a model which minimizes a convex combination of the empirical source and target errors. Previous theoretical work has considered minimizing just the source error, just the target error, or weighting instances from the two domains equally. We show how to choose the optimal combination of source and target error as a function of the divergence, the sample sizes of both domains, and the complexity of the hypothesis class. The resulting bound generalizes the previously studied cases and is always at least as tight as a bound which considers minimizing only the target error or an equal weighting of source and target errors.

...read moreread less

2,921 citations

Journal Article•DOI•

Accurate, Dense, and Robust Multiview Stereopsis

[...]

Yasutaka Furukawa¹, Jean Ponce²•Institutions (2)

Google¹, École Normale Supérieure²

01 Aug 2010-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: A novel algorithm for multiview stereopsis that outputs a dense set of small rectangular patches covering the surfaces visible in the images, which outperforms all others submitted so far for four out of the six data sets.

...read moreread less

Abstract: This paper proposes a novel algorithm for multiview stereopsis that outputs a dense set of small rectangular patches covering the surfaces visible in the images. Stereopsis is implemented as a match, expand, and filter procedure, starting from a sparse set of matched keypoints, and repeatedly expanding these before using visibility constraints to filter away false matches. The keys to the performance of the proposed algorithm are effective techniques for enforcing local photometric consistency and global visibility constraints. Simple but effective methods are also proposed to turn the resulting patch model into a mesh which can be further refined by an algorithm that enforces both photometric consistency and regularization constraints. The proposed approach automatically detects and discards outliers and obstacles and does not require any initialization in the form of a visual hull, a bounding box, or valid depth ranges. We have tested our algorithm on various data sets including objects with fine surface details, deep concavities, and thin structures, outdoor scenes observed from a restricted set of viewpoints, and "crowded" scenes where moving obstacles appear in front of a static structure of interest. A quantitative evaluation on the Middlebury benchmark [1] shows that the proposed method outperforms all others submitted so far for four out of the six data sets.

...read moreread less

2,863 citations

Journal Article•

Why Does Unsupervised Pre-training Help Deep Learning?

[...]

Dumitru Erhan¹, Yoshua Bengio¹, Aaron Courville¹, Pierre-Antoine Manzagol¹, Pascal Vincent¹, Samy Bengio² - Show less +2 more•Institutions (2)

Université de Montréal¹, Google²

01 Mar 2010-Journal of Machine Learning Research

TL;DR: In this paper, the authors empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples, and they suggest that unsupervised pretraining guides the learning towards basins of attraction of minima that support better generalization.

...read moreread less

Abstract: Much recent research has been devoted to learning algorithms for deep architectures such as Deep Belief Networks and stacks of auto-encoder variants, with impressive results obtained in several areas, mostly on vision and language data sets. The best results obtained on supervised learning tasks involve an unsupervised learning component, usually in an unsupervised pre-training phase. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. The main question investigated here is the following: how does unsupervised pre-training work? Answering this questions is important if learning in deep architectures is to be further improved. We propose several explanatory hypotheses and test them through extensive simulations. We empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples. The experiments confirm and clarify the advantage of unsupervised pre-training. The results suggest that unsupervised pre-training guides the learning towards basins of attraction of minima that support better generalization from the training data set; the evidence from these results supports a regularization explanation for the effect of pre-training.

...read moreread less

2,036 citations

Proceedings Article•DOI•

Onix: a distributed control platform for large-scale production networks

[...]

Teemu Koponen, Martin Casado, Natasha Gude, Jeremy Stribling, Leon Poutievski¹, Min Zhu¹, Rajiv Ramanathan¹, Yuichiro Iwata², Hiroaki Inoue², Takayuki Hama², Scott Shenker³ - Show less +7 more•Institutions (3)

Google¹, NEC², International Computer Science Institute³

04 Oct 2010

TL;DR: Onix provides a general API for control plane implementations, while allowing them to make their own trade-offs among consistency, durability, and scalability.

...read moreread less

Abstract: Computer networks lack a general control paradigm, as traditional networks do not provide any network-wide management abstractions. As a result, each new function (such as routing) must provide its own state distribution, element discovery, and failure recovery mechanisms. We believe this lack of a common control platform has significantly hindered the development of flexible, reliable and feature-rich network control planes.To address this, we present Onix, a platform on top of which a network control plane can be implemented as a distributed system. Control planes written within Onix operate on a global view of the network, and use basic state distribution primitives provided by the platform. Thus Onix provides a general API for control plane implementations, while allowing them to make their own trade-offs among consistency, durability, and scalability.

...read moreread less

1,463 citations

Journal Article•DOI•

On the meaning of work: A theoretical integration and review

[...]

Brent D. Rosso¹, Kathryn Dekas², Amy Wrzesniewski³•Institutions (3)

University of Michigan¹, Google², Yale University³

01 Jan 2010-Research in Organizational Behavior

TL;DR: The meaning of work literature is the product of a long tradition of rich inquiry spanning many disciplines as discussed by the authors, and the field lacks overarching structures that would facilitate greater integration, consistency, and understanding of this body of research.

...read moreread less

1,409 citations

Journal Article•DOI•

MapReduce: a flexible data processing tool

[...]

Jeffrey Dean¹, Sanjay Ghemawat¹•Institutions (1)

Google¹

01 Jan 2010-Communications of The ACM

TL;DR: MapReduce advantages over parallel databases include storage-system independence and fine-grain fault tolerance for large jobs.

...read moreread less

Abstract: MapReduce advantages over parallel databases include storage-system independence and fine-grain fault tolerance for large jobs.

...read moreread less

1,293 citations

Proceedings Article•DOI•

Accurate online power estimation and automatic battery behavior based power model generation for smartphones

[...]

Lide Zhang¹, Birjodh Tiwana¹, Robert P. Dick¹, Zhiyun Qian¹, Z. Morley Mao¹, Zhaoguang Wang¹, Lei Yang² - Show less +3 more•Institutions (2)

University of Michigan¹, Google²

24 Oct 2010

TL;DR: PowerBooter is an automated power model construction technique that uses built-in battery voltage sensors and knowledge of battery discharge behavior to monitor power consumption while explicitly controlling the power management and activity states of individual components.

...read moreread less

Abstract: This paper describes PowerBooter, an automated power model construction technique that uses built-in battery voltage sensors and knowledge of battery discharge behavior to monitor power consumption while explicitly controlling the power management and activity states of individual components. It requires no external measurement equipment. We also describe PowerTutor, a component power management and activity state introspection based tool that uses the model generated by PowerBooter for online power estimation. PowerBooter is intended to make it quick and easy for application developers and end users to generate power models for new smartphone variants, which each have different power consumption properties and therefore require different power models. PowerTutor is intended to ease the design and selection of power efficient software for embedded systems. Combined, PowerBooter and PowerTutor have the goal of opening power modeling and analysis for more smartphone variants and their users.

...read moreread less

1,225 citations

Proceedings Article•DOI•

The YouTube video recommendation system

[...]

James Davidson¹, Benjamin Liebald¹, Junning Liu¹, Palash Nandy¹, Taylor Van Vleet¹, Ullas Gargi¹, Sujoy Gupta¹, Yu He¹, Mike Lambert¹, Blake Livingston¹, Dasarathi Sampath¹ - Show less +7 more•Institutions (1)

Google¹

26 Sep 2010

TL;DR: The video recommendation system in use at YouTube, the world's most popular online video community, is discussed, with details on the experimentation and evaluation framework used to test and tune new algorithms.

...read moreread less

Abstract: We discuss the video recommendation system in use at YouTube, the world's most popular online video community. The system recommends personalized sets of videos to users based on their activity on the site. We discuss some of the unique challenges that the system faces and how we address them. In addition, we provide details on the experimentation and evaluation framework used to test and tune new algorithms. We also present some of the findings from these experiments.

...read moreread less

1,069 citations

Proceedings Article•DOI•

Web-scale k-means clustering

[...]

D. Sculley¹•Institutions (1)

Google¹

26 Apr 2010

TL;DR: This work proposes the use of mini-batch optimization for k-means clustering, which reduces computation cost by orders of magnitude compared to the classic batch algorithm while yielding significantly better solutions than online stochastic gradient descent.

...read moreread less

Abstract: We present two modifications to the popular k-means clustering algorithm to address the extreme requirements for latency, scalability, and sparsity encountered in user-facing web applications. First, we propose the use of mini-batch optimization for k-means clustering. This reduces computation cost by orders of magnitude compared to the classic batch algorithm while yielding significantly better solutions than online stochastic gradient descent. Second, we achieve sparsity with projected gradient descent, and give a fast e-accurate projection onto the L1-ball. Source code is freely available: http://code.google.com/p/sofia-ml

...read moreread less

Journal Article•DOI•

Astrometry.net: Blind Astrometric Calibration of Arbitrary Astronomical Images

[...]

Dustin Lang¹, Dustin Lang², David W. Hogg³, David W. Hogg⁴, Keir Mierle¹, Keir Mierle⁵, Michael R. Blanton³, Sam T. Roweis³, Sam T. Roweis¹, Sam T. Roweis⁵ - Show less +6 more•Institutions (5)

University of Toronto¹, Princeton University², New York University³, Max Planck Society⁴, Google⁵

22 Mar 2010-The Astronomical Journal

TL;DR: In this paper, the authors present a system that takes as input an astronomical image, and returns as output the pointing, scale, and orientation of that image (the astrometric calibration or World Coordinate System information).

...read moreread less

Abstract: We have built a reliable and robust system that takes as input an astronomical image, and returns as output the pointing, scale, and orientation of that image (the astrometric calibration or World Coordinate System information). The system requires no first guess, and works with the information in the image pixels alone; that is, the problem is a generalization of the "lost in space" problem in which nothing—not even the image scale—is known. After robust source detection is performed in the input image, asterisms (sets of four or five stars) are geometrically hashed and compared to pre-indexed hashes to generate hypotheses about the astrometric calibration. A hypothesis is only accepted as true if it passes a Bayesian decision theory test against a null hypothesis. With indices built from the USNO-B catalog and designed for uniformity of coverage and redundancy, the success rate is >99.9% for contemporary near-ultraviolet and visual imaging survey data, with no false positives. The failure rate is consistent with the incompleteness of the USNO-B catalog; augmentation with indices built from the Two Micron All Sky Survey catalog brings the completeness to 100% with no false positives. We are using this system to generate consistent and standards-compliant meta-data for digital and digitized imaging from plate repositories, automated observatories, individual scientific investigators, and hobbyists. This is the first step in a program of making it possible to trust calibration meta-data for astronomical data of arbitrary provenance.

...read moreread less

Proceedings Article•DOI•

Towards Internet-scale multi-view stereo

[...]

Yasutaka Furukawa¹, Brian Curless², Steven M. Seitz¹, Richard Szeliski³•Institutions (3)

Google¹, University of Washington², Microsoft³

13 Jun 2010

TL;DR: An approach for enabling existing multi-view stereo methods to operate on extremely large unstructured photo collections to decompose the collection into a set of overlapping sets of photos that can be processed in parallel, and to merge the resulting reconstructions.

...read moreread less

Abstract: This paper introduces an approach for enabling existing multi-view stereo methods to operate on extremely large unstructured photo collections. The main idea is to decompose the collection into a set of overlapping sets of photos that can be processed in parallel, and to merge the resulting reconstructions. This overlapping clustering problem is formulated as a constrained optimization and solved iteratively. The merging algorithm, designed to be parallel and out-of-core, incorporates robust filtering steps to eliminate low-quality reconstructions and enforce global visibility constraints. The approach has been tested on several large datasets downloaded from Flickr.com, including one with over ten thousand images, yielding a 3D reconstruction with nearly thirty million points.

...read moreread less

Proceedings Article•DOI•

Efficient hierarchical graph-based video segmentation

[...]

Matthias Grundmann¹, Vivek Kwatra², Mei Han², Irfan Essa¹•Institutions (2)

Georgia Institute of Technology¹, Google²

13 Jun 2010

TL;DR: An efficient and scalable technique for spatiotemporal segmentation of long video sequences using a hierarchical graph-based algorithm that generates high quality segmentations, which are temporally coherent with stable region boundaries, and allows subsequent applications to choose from varying levels of granularity.

...read moreread less

Abstract: We present an efficient and scalable technique for spatiotemporal segmentation of long video sequences using a hierarchical graph-based algorithm. We begin by over-segmenting a volumetric video graph into space-time regions grouped by appearance. We then construct a “region graph” over the obtained segmentation and iteratively repeat this process over multiple levels to create a tree of spatio-temporal segmentations. This hierarchical approach generates high quality segmentations, which are temporally coherent with stable region boundaries, and allows subsequent applications to choose from varying levels of granularity. We further improve segmentation quality by using dense optical flow to guide temporal connections in the initial graph. We also propose two novel approaches to improve the scalability of our technique: (a) a parallel out-of-core algorithm that can process volumes much larger than an in-core algorithm, and (b) a clip-based processing algorithm that divides the video into overlapping clips in time, and segments them successively while enforcing consistency. We demonstrate hierarchical segmentations on video shots as long as 40 seconds, and even support a streaming mode for arbitrarily long videos, albeit without the ability to process them hierarchically.

...read moreread less

Proceedings Article•DOI•

Personalized news recommendation based on click behavior

[...]

Jiahui Liu¹, Peter Dolan¹, Elin Rønby Pedersen¹•Institutions (1)

Google¹

07 Feb 2010

TL;DR: This research presents the content-based recommendation mechanism which uses learned user profiles with an existing collaborative filtering mechanism to generate personalized news recommendations in Google News and demonstrates that the hybrid method improves the quality of news recommendation and increases traffic to the site.

...read moreread less

Abstract: Online news reading has become very popular as the web provides access to news articles from millions of sources around the world. A key challenge of news websites is to help users find the articles that are interesting to read. In this paper, we present our research on developing personalized news recommendation system in Google News. For users who are logged in and have explicitly enabled web history, the recommendation system builds profiles of users' news interests based on their past click behavior. To understand how users' news interests change over time, we first conducted a large-scale analysis of anonymized Google News users click logs. Based on the log analysis, we developed a Bayesian framework for predicting users' current news interests from the activities of that particular user and the news trends demonstrated in the activity of all users. We combine the content-based recommendation mechanism which uses learned user profiles with an existing collaborative filtering mechanism to generate personalized news recommendations. The hybrid recommender system was deployed in Google News. Experiments on the live traffic of Google News website demonstrated that the hybrid method improves the quality of news recommendation and increases traffic to the site.

...read moreread less

Journal Article•DOI•

Google Street View: Capturing the World at Street Level

[...]

Dragomir Anguelov¹, Carole Dulong¹, Daniel Joseph Filip¹, Christian Frueh¹, Stephane Lafon¹, Richard F. Lyon¹, Abhijit Ogale¹, Luc Vincent¹, Josh Weaver¹ - Show less +5 more•Institutions (1)

Google¹

01 Jun 2010-IEEE Computer

TL;DR: A team of Google researchers describes the technical challenges involved in capturing, processing, and serving street-level imagery on a global scale.

...read moreread less

Abstract: Street View serves millions of Google users daily with panoramic imagery captured in hundreds of cities in 20 countries across four continents. A team of Google researchers describes the technical challenges involved in capturing, processing, and serving street-level imagery on a global scale.

...read moreread less

Proceedings Article•DOI•

Availability in globally distributed storage systems

[...]

Daniel Ford¹, François Labelle¹, Florentina Popovici¹, Murray Stokely¹, Van-Anh Truong², Luiz Andre Barroso¹, Carrie Grimes¹, Sean Quinlan¹ - Show less +4 more•Institutions (2)

Google¹, Columbia University²

04 Oct 2010

TL;DR: This work characterize the availability properties of cloud storage systems based on an extensive one year study of Google's main storage infrastructure and presents statistical models that enable further insight into the impact of multiple design choices, such as data placement and replication strategies.

...read moreread less

Abstract: Highly available cloud storage is often implemented with complex, multi-tiered distributed systems built on top of clusters of commodity servers and disk drives. Sophisticated management, load balancing and recovery techniques are needed to achieve high performance and availability amidst an abundance of failure sources that include software, hardware, network connectivity, and power issues. While there is a relative wealth of failure studies of individual components of storage systems, such as disk drives, relatively little has been reported so far on the overall availability behavior of large cloudbased storage services.We characterize the availability properties of cloud storage systems based on an extensive one year study of Google's main storage infrastructure and present statistical models that enable further insight into the impact of multiple design choices, such as data placement and replication strategies. With these models we compare data availability under a variety of system parameters given the real patterns of failures observed in our fleet.

...read moreread less

Journal Article•DOI•

Dremel: interactive analysis of web-scale datasets

[...]

Sergey Melnik¹, Andrey Gubarev¹, Jing Jing Long¹, Geoffrey M. Romer¹, Shiva Shivakumar¹, Matthew B. Tolton¹, Theodore Vassilakis¹ - Show less +3 more•Institutions (1)

Google¹

01 Sep 2010

TL;DR: The architecture and implementation of Dremel are described, and how it complements MapReduce-based computing is explained, and a novel columnar storage representation for nested records is presented.

...read moreread less

Abstract: Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand node instances of the system.

...read moreread less

Proceedings Article•DOI•

Semi-supervised hashing for scalable image retrieval

[...]

Jun Wang¹, Sanjiv Kumar², Shih-Fu Chang¹•Institutions (2)

Columbia University¹, Google²

13 Jun 2010

TL;DR: This work proposes a semi-supervised hashing method that is formulated as minimizing empirical error on the labeled data while maximizing variance and independence of hash bits over the labeled and unlabeled data.

...read moreread less

Abstract: Large scale image search has recently attracted considerable attention due to easy availability of huge amounts of data. Several hashing methods have been proposed to allow approximate but highly efficient search. Unsupervised hashing methods show good performance with metric distances but, in image search, semantic similarity is usually given in terms of labeled pairs of images. There exist supervised hashing methods that can handle such semantic similarity but they are prone to overfitting when labeled data is small or noisy. Moreover, these methods are usually very slow to train. In this work, we propose a semi-supervised hashing method that is formulated as minimizing empirical error on the labeled data while maximizing variance and independence of hash bits over the labeled and unlabeled data. The proposed method can handle both metric as well as semantic similarity. The experimental results on two large datasets (up to one million samples) demonstrate its superior performance over state-of-the-art supervised and unsupervised methods.

...read moreread less

Proceedings Article•DOI•

PRESS: PRedictive Elastic ReSource Scaling for cloud systems

[...]

Zhenhuan Gong¹, Xiaohui Gu¹, John Wilkes²•Institutions (2)

North Carolina State University¹, Google²

01 Oct 2010

TL;DR: This paper presents a novel PRedictive Elastic reSource Scaling (PRESS) scheme for cloud systems that unobtrusively extracts fine-grained dynamic patterns in application resource demands and adjust their resource allocations automatically.

...read moreread less

Abstract: Cloud systems require elastic resource allocation to minimize resource provisioning costs while meeting service level objectives (SLOs). In this paper, we present a novel PRedictive Elastic reSource Scaling (PRESS) scheme for cloud systems. PRESS unobtrusively extracts fine-grained dynamic patterns in application resource demands and adjust their resource allocations automatically. Our approach leverages light-weight signal processing and statistical learning algorithms to achieve online predictions of dynamic application resource requirements. We have implemented the PRESS system on Xen and tested it using RUBiS and an application load trace from Google. Our experiments show that we can achieve good resource prediction accuracy with less than 5% over-estimation error and near zero under-estimation error, and elastic resource scaling can both significantly reduce resource waste and SLO violations.

...read moreread less

Proceedings Article•DOI•

Large-scale incremental processing using distributed transactions and notifications

[...]

Daniel Peng¹, Frank Dabek¹•Institutions (1)

Google¹

04 Oct 2010

TL;DR: Percolator is built, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index, which processes the same number of documents per day while reducing the average age of documents in Google search results by 50%.

...read moreread less

Abstract: Updating an index of the web as documents are crawled requires continuously transforming a large repository of existing documents as new documents arrive. This task is one example of a class of data processing tasks that transform a large repository of data via small, independent mutations. These tasks lie in a gap between the capabilities of existing infrastructure. Databases do not meet the storage or throughput requirements of these tasks: Google's indexing system stores tens of petabytes of data and processes billions of updates per day on thousands of machines. MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency.We have built Percolator, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index. By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, we process the same number of documents per day, while reducing the average age of documents in Google search results by 50%.

...read moreread less

Journal Article•DOI•

Large scale image annotation: learning to rank with joint word-image embeddings

[...]

Jason Weston¹, Samy Bengio¹, Nicolas Usunier²•Institutions (2)

Google¹, University of Paris²

01 Oct 2010

TL;DR: This work proposes a strongly performing method that scales to image annotation datasets by simultaneously learning to optimize precision at k of the ranked list of annotations for a given image and learning a low-dimensional joint embedding space for both images and annotations.

...read moreread less

Abstract: Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at k of the ranked list of annotations for a given image and learning a low-dimensional joint embedding space for both images and annotations. Our method both outperforms several baseline methods and, in comparison to them, is faster and consumes less memory. We also demonstrate how our method learns an interpretable model, where annotations with alternate spellings or even languages are close in the embedding space. Hence, even when our model does not predict the exact annotation given by a human labeler, it often predicts similar annotations, a fact that we try to quantify by measuring the newly introduced "sibling" precision metric, where our method also obtains excellent results.

...read moreread less

Journal Article•

Training and Testing Low-degree Polynomial Data Mappings via Linear SVM

[...]

Yin-Wen Chang¹, Cho-Jui Hsieh¹, Kai-Wei Chang¹, Michael Ringgaard², Chih-Jen Lin¹ - Show less +1 more•Institutions (2)

National Taiwan University¹, Google²

01 Mar 2010-Journal of Machine Learning Research

TL;DR: The proposed fast linear-SVM methods are applied to the explicit form of polynomially mapped data and successfully applied to a natural language processing (NLP) application by improving the testing accuracy under some training/testing speed requirements.

...read moreread less

Abstract: Kernel techniques have long been used in SVM to handle linearly inseparable problems by transforming data to a high dimensional space, but training and testing large data sets is often time consuming. In contrast, we can efficiently train and test much larger data sets using linear SVM without kernels. In this work, we apply fast linear-SVM methods to the explicit form of polynomially mapped data and investigate implementation issues. The approach enjoys fast training and testing, but may sometimes achieve accuracy close to that of using highly nonlinear kernels. Empirical experiments show that the proposed method is useful for certain large-scale data sets. We successfully apply the proposed method to a natural language processing (NLP) application by improving the testing accuracy under some training/testing speed requirements.

...read moreread less

Proceedings Article•DOI•

Energy proportional datacenter networks

[...]

Dennis Abts¹, Michael R. Marty¹, Philip M. Wells¹, Peter Michael Klausler¹, Hong Liu¹ - Show less +1 more•Institutions (1)

Google¹

19 Jun 2010

TL;DR: It is demonstrated that energy proportional datacenter communication is indeed possible and that there is a significant power advantage to having independent control of each unidirectional channel comprising a network link.

...read moreread less

Abstract: Numerous studies have shown that datacenter computers rarely operate at full utilization, leading to a number of proposals for creating servers that are energy proportional with respect to the computation that they are performing. In this paper, we show that as servers themselves become more energy proportional, the datacenter network can become a significant fraction (up to 50%) of cluster power. In this paper we propose several ways to design a high-performance datacenter network whose power consumption is more proportional to the amount of traffic it is moving -- that is, we propose energy proportional datacenter networks. We first show that a flattened butterfly topology itself is inherently more power efficient than the other commonly proposed topology for high-performance datacenter networks. We then exploit the characteristics of modern plesiochronous links to adjust their power and performance envelopes dynamically. Using a network simulator, driven by both synthetic workloads and production datacenter traces, we characterize and understand design tradeoffs, and demonstrate an 85% reduction in power --- which approaches the ideal energy-proportionality of the network. Our results also demonstrate two challenges for the designers of future network switches: 1) We show that there is a significant power advantage to having independent control of each unidirectional channel comprising a network link, since many traffic patterns show very asymmetric use, and 2) system designers should work to optimize the high-speed channel designs to be more energy efficient by choosing optimal data rate and equalization technology. Given these assumptions, we demonstrate that energy proportional datacenter communication is indeed possible.

...read moreread less

Book Chapter•DOI•

Bundle adjustment in the large

[...]

Sameer Agarwal¹, Noah Snavely², Steven M. Seitz³, Richard Szeliski⁴•Institutions (4)

Google¹, Cornell University², University of Washington³, Microsoft⁴

05 Sep 2010

TL;DR: The experiments show that truncated Newton methods, when paired with relatively simple preconditioners, offer state of the art performance for large-scale bundle adjustment.

...read moreread less

Abstract: We present the design and implementation of a new inexact Newton type algorithm for solving large-scale bundle adjustment problems with tens of thousands of images. We explore the use of Conjugate Gradients for calculating the Newton step and its performance as a function of some simple and computationally efficient preconditioners. We show that the common Schur complement trick is not limited to factorization-based methods and that it can be interpreted as a form of preconditioning. Using photos from a street-side dataset and several community photo collections, we generate a variety of bundle adjustment problems and use them to evaluate the performance of six different bundle adjustment algorithms. Our experiments show that truncated Newton methods, when paired with relatively simple preconditioners, offer state of the art performance for large-scale bundle adjustment. The code, test problems and detailed performance data are available at http://grail.cs.washington.edu/projects/bal.

...read moreread less

Journal Article•DOI•

Trustworthy Hardware: Identifying and Classifying Hardware Trojans

[...]

Ramesh Karri¹, Jeyavijayan Rajendran¹, Kurt Rosenfeld², Mohammad Tehranipoor³•Institutions (3)

New York University¹, Google², University of Connecticut³

01 Oct 2010-IEEE Computer

TL;DR: A proposed new hardware Trojan taxonomy provides a first step in better understanding existing and potential threats.

...read moreread less

Abstract: For reasons of economy, critical systems will inevitably depend on electronics made in untrusted factories. A proposed new hardware Trojan taxonomy provides a first step in better understanding existing and potential threats.

...read moreread less

Journal Article•DOI•

A modern Bayesian look at the multi-armed bandit

[...]

Steven L. Scott¹•Institutions (1)

Google¹

01 Nov 2010-Applied Stochastic Models in Business and Industry

TL;DR: A heuristic for managing multi-armed bandits called randomized probability matching is described, which randomly allocates observations to arms according the Bayesian posterior probability that each arm is optimal.

...read moreread less

Abstract: A multi-armed bandit is an experiment with the goal of accumulating rewards from a payoff distribution with unknown parameters that are to be learned sequentially. This article describes a heuristic for managing multi-armed bandits called randomized probability matching, which randomly allocates observations to arms according the Bayesian posterior probability that each arm is optimal. Advances in Bayesian computation have made randomized probability matching easy to apply to virtually any payoff distribution. This flexibility frees the experimenter to work with payoff distributions that correspond to certain classical experimental designs that have the potential to outperform methods that are ‘optimal’ in simpler contexts. I summarize the relationships between randomized probability matching and several related heuristics that have been used in the reinforcement learning literature. Copyright © 2010 John Wiley & Sons, Ltd.

...read moreread less

Clouds, big data, and smart assets: Ten tech-enabled business trends to watch

[...]

William H. Dutton¹, Kris Pister², Hal R. Varian³, Rob Bernard⁴, Rob Salkowitz, Jacques Bughin⁵, Michael Chui⁵, James Manyika⁵ - Show less +4 more•Institutions (5)

University of Oxford¹, University of California, Berkeley², Google³, Microsoft⁴, McKinsey & Company⁵

01 Jan 2010

Patent•

Automatically providing content associated with captured information, such as information captured in real-time

[...]

Martin T. King¹, Redwood Stephens¹, Claes-Fredrik Mannby¹, Jesse Peterson¹, Mark Sanvitale¹, Michael John Sebastian Smith¹, Christopher J. Daley-Watson¹ - Show less +3 more•Institutions (1)

Google¹

12 Mar 2010

TL;DR: In this paper, a system and method for automatically providing content associated with captured information is described, in which the system receives input by a user, and automatically provides content or links to the information associated with the input.

...read moreread less

Abstract: A system and method for automatically providing content associated with captured information is described. In some examples, the system receives input by a user, and automatically provides content or links to content associated with the input. In some examples, the system receives input via text entry or by capturing text from a rendered document, such as a printed document, an object, an audio stream, and so on.

...read moreread less

Journal Article•DOI•

Native Client: a sandbox for portable, untrusted x86 native code

[...]

Bennet S. Yee¹, David C. Sehr¹, Gregory Dardyk¹, J. Bradley Chen¹, Robert Muth¹, Tavis Ormandy¹, Shiki Okasaka¹, Neha Narula¹, Nicholas Fullagar¹ - Show less +5 more•Institutions (1)

Google¹

01 Jan 2010-Communications of The ACM

TL;DR: This paper describes the design, implementation and evaluation of Native Client, a sandbox for untrusted x86 native code that combines software fault isolation and a secure runtime to direct system interaction and side effects through interfaces managed by Native Client.

...read moreread less

Abstract: This paper describes the design, implementation and evaluation of Native Client, a sandbox for untrusted x86 native code. Native Client aims to give browser-based applications the computational performance of native applications without compromising safety. Native Client uses software fault isolation and a secure runtime to direct system interaction and side effects through interfaces managed by Native Client. Native Client provides operating system portability for binary code while supporting performance-oriented features generally absent from web application programming environments, such as thread support, instruction set extensions such as SSE, and use of compiler intrinsics and hand-coded assembler. We combine these properties in an open architecture that encourages community review and 3rd-party tools.

...read moreread less

Collapse