Showing papers by "Google published in 2009"

PDF

Open Access

Journal Article•DOI•

Detecting influenza epidemics using search engine query data

[...]

Jeremy Ginsberg¹, Matthew H. Mohebbi¹, Rajan Patel¹, Lynnette Brammer², Mark S. Smolinski¹, Lawrence B. Brilliant¹ - Show less +2 more•Institutions (2)

Google¹, Centers for Disease Control and Prevention²

19 Feb 2009-Nature

TL;DR: A method of analysing large numbers of Google search queries to track influenza-like illness in a population and accurately estimate the current level of weekly influenza activity in each region of the United States with a reporting lag of about one day is presented.

...read moreread less

Abstract: This paper - first published on-line in November 2008 - draws on data from an early version of the Google Flu Trends search engine to estimate the levels of flu in a population. It introduces a computational model that converts raw search query data into a region-by-region real-time surveillance system that accurately estimates influenza activity with a lag of about one day - one to two weeks faster than the conventional reports published by the Centers for Disease Prevention and Control. This report introduces a computational model based on internet search queries for real-time surveillance of influenza-like illness (ILI), which reproduces the patterns observed in ILI data from the Centers for Disease Control and Prevention. Seasonal influenza epidemics are a major public health concern, causing tens of millions of respiratory illnesses and 250,000 to 500,000 deaths worldwide each year1. In addition to seasonal influenza, a new strain of influenza virus against which no previous immunity exists and that demonstrates human-to-human transmission could result in a pandemic with millions of fatalities2. Early detection of disease activity, when followed by a rapid response, can reduce the impact of both seasonal and pandemic influenza3,4. One way to improve early detection is to monitor health-seeking behaviour in the form of queries to online search engines, which are submitted by millions of users around the world each day. Here we present a method of analysing large numbers of Google search queries to track influenza-like illness in a population. Because the relative frequency of certain queries is highly correlated with the percentage of physician visits in which a patient presents with influenza-like symptoms, we can accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day. This approach may make it possible to use search queries to detect influenza epidemics in areas with a large population of web search users.

...read moreread less

3,984 citations

Journal Article•DOI•

Solution Hybrid Selection with Ultra-long Oligonucleotides for Massively Parallel Targeted Sequencing

[...]

Andreas Gnirke¹, Alexandre Melnikov¹, Jared Maguire¹, Peter Rogov¹, Emily M LeProust², William Brockman³, William Brockman¹, Timothy Fennell¹, Georgia Giannoukos¹, Sheila Fisher¹, Carsten Russ¹, Stacey Gabriel¹, David B. Jaffe¹, Eric S. Lander¹, Eric S. Lander⁴, Eric S. Lander⁵, Chad Nusbaum¹ - Show less +13 more•Institutions (5)

Broad Institute¹, Agilent Technologies², Google³, Massachusetts Institute of Technology⁴, Harvard University⁵

01 Feb 2009-Nature Biotechnology

TL;DR: A capture method that uses biotinylated RNA 'baits' to fish targets out of a 'pond' of DNA fragments that uniformity was such that ∼60% of target bases in the exonic 'catch', and ∼80% in the regional catch, had at least half the mean coverage.

...read moreread less

Abstract: Targeting genomic loci by massively parallel sequencing requires new methods to enrich templates to be sequenced. We developed a capture method that uses biotinylated RNA 'baits' to fish targets out of a 'pond' of DNA fragments. The RNA is transcribed from PCR-amplified oligodeoxynucleotides originally synthesized on a microarray, generating sufficient bait for multiple captures at concentrations high enough to drive the hybridization. We tested this method with 170-mer baits that target >15,000 coding exons (2.5 Mb) and four regions (1.7 Mb total) using Illumina sequencing as read-out. About 90% of uniquely aligning bases fell on or near bait sequence; up to 50% lay on exons proper. The uniformity was such that approximately 60% of target bases in the exonic 'catch', and approximately 80% in the regional catch, had at least half the mean coverage. One lane of Illumina sequence was sufficient to call high-confidence genotypes for 89% of the targeted exon space.

...read moreread less

1,444 citations

Journal Article•DOI•

The Unreasonable Effectiveness of Data

[...]

Alon Halevy¹, Peter Norvig¹, Fernando Pereira¹•Institutions (1)

Google¹

01 Mar 2009-IEEE Intelligent Systems

TL;DR: A trillion-word corpus - along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions - captures even very rare aspects of human behavior.

...read moreread less

Abstract: At Brown University, there is excitement of having access to the Brown Corpus, containing one million English words. Since then, we have seen several notable corpora that are about 100 times larger, and in 2006, Google released a trillion-word corpus with frequency counts for all sequences up to five words long. In some ways this corpus is a step backwards from the Brown Corpus: it's taken from unfiltered Web pages and thus contains incomplete sentences, spelling errors, grammatical errors, and all sorts of other errors. It's not annotated with carefully hand-corrected part-of-speech tags. But the fact that it's a million times larger than the Brown Corpus outweighs these drawbacks. A trillion-word corpus - along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions - captures even very rare aspects of human behavior. So, this corpus could serve as the basis of a complete model for certain tasks - if only we knew how to extract the model from the data.

...read moreread less

1,404 citations

Journal Article•DOI•

Head Pose Estimation in Computer Vision: A Survey

[...]

Erik Murphy-Chutorian¹, Mohan M. Trivedi²•Institutions (2)

Google¹, University of California, Los Angeles²

01 Apr 2009-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This paper discusses the inherent difficulties in head pose estimation and presents an organized survey describing the evolution of the field, comparing systems by focusing on their ability to estimate coarse and fine head pose and highlighting approaches well suited for unconstrained environments.

...read moreread less

Abstract: The capacity to estimate the head pose of another person is a common human ability that presents a unique challenge for computer vision systems. Compared to face detection and recognition, which have been the primary foci of face-related vision research, identity-invariant head pose estimation has fewer rigorously evaluated systems or generic solutions. In this paper, we discuss the inherent difficulties in head pose estimation and present an organized survey describing the evolution of the field. Our discussion focuses on the advantages and disadvantages of each approach and spans 90 of the most innovative and characteristic papers that have been published on this topic. We compare these systems by focusing on their ability to estimate coarse and fine head pose, highlighting approaches that are well suited for unconstrained environments.

...read moreread less

1,402 citations

Proceedings Article•DOI•

[...]

Eneko Agirre¹, Enrique Alfonseca², Keith Hall², Jana Kravalova³, Marius Pasca², Aitor Soroa¹ - Show less +2 more•Institutions (3)

University of the Basque Country¹, Google², Charles University in Prague³

31 May 2009

TL;DR: This paper presents and compares WordNet-based and distributional similarity approaches, and pioneer cross-lingual similarity, showing that the methods are easily adapted for a cross-lingsual task with minor losses.

...read moreread less

Abstract: This paper presents and compares WordNet-based and distributional similarity approaches. The strengths and weaknesses of each approach regarding similarity and relatedness tasks are discussed, and a combination is presented. Each of our methods independently provide the best results in their class on the RG and WordSim353 datasets, and a supervised combination of them yields the best published results on all datasets. Finally, we pioneer cross-lingual similarity, showing that our methods are easily adapted for a cross-lingual task with minor losses.

...read moreread less

936 citations

Journal Article•DOI•

A brief history of the internet

[...]

Barry M. Leiner, Vinton G. Cerf¹, David D. Clark², Robert E. Kahn, Leonard Kleinrock³, Daniel C. Lynch, Jon Postel⁴, Larry G. Roberts, Stephen Wolff⁵ - Show less +5 more•Institutions (5)

Google¹, Massachusetts Institute of Technology², University of California, Los Angeles³, Information Sciences Institute⁴, Cisco Systems, Inc.⁵

07 Oct 2009

TL;DR: This paper was first published online by the Internet Society in December 20031 and is being re-published in ACM SIGCOMM Computer Communication Review because of its historic import.

...read moreread less

Abstract: This paper was first published online by the Internet Society in December 20031 and is being re-published in ACM SIGCOMM Computer Communication Review because of its historic import. It was written at the urging of its primary editor, the late Barry Leiner. He felt that a factual rendering of the events and activities associated with the development of the early Internet would be a valuable contribution. The contributing authors did their best to incorporate only factual material into this document. There are sure to be many details that have not been captured in the body of the document but it remains one of the most accurate renderings of the early period of development available.

...read moreread less

926 citations

Proceedings Article•DOI•

Expected reciprocal rank for graded relevance

[...]

Olivier Chapelle¹, Donald Metlzer¹, Ya Zhang¹, Pierre Grinspan²•Institutions (2)

Yahoo!¹, Google²

02 Nov 2009

TL;DR: This work presents a new editorial metric for graded relevance which overcomes this difficulty and implicitly discounts documents which are shown below very relevant documents and calls it Expected Reciprocal Rank (ERR).

...read moreread less

Abstract: While numerous metrics for information retrieval are available in the case of binary relevance, there is only one commonly used metric for graded relevance, namely the Discounted Cumulative Gain (DCG). A drawback of DCG is its additive nature and the underlying independence assumption: a document in a given position has always the same gain and discount independently of the documents shown above it. Inspired by the "cascade" user model, we present a new editorial metric for graded relevance which overcomes this difficulty and implicitly discounts documents which are shown below very relevant documents. More precisely, this new metric is defined as the expected reciprocal length of time that the user will take to find a relevant document. This can be seen as an extension of the classical reciprocal rank to the graded relevance case and we call this metric Expected Reciprocal Rank (ERR). We conduct an extensive evaluation on the query logs of a commercial search engine and show that ERR correlates better with clicks metrics than other editorial metrics.

...read moreread less

831 citations

Proceedings Article•DOI•

DRAM errors in the wild: a large-scale field study

[...]

Bianca Schroeder¹, Eduardo Pinheiro², Wolf-Dietrich Weber²•Institutions (2)

University of Toronto¹, Google²

15 Jun 2009

TL;DR: Measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which previous work suspects to be the dominant error mode.

...read moreread less

Abstract: Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern compute clusters. Failures are costly both in terms of hardware replacement costs and service disruption. While a large body of work exists on DRAM in laboratory conditions, little has been reported on real DRAM failures in large production clusters. In this paper, we analyze measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of DIMM days.The goal of this paper is to answer questions such as the following: How common are memory errors in practice? What are their statistical properties? How are they affected by external factors, such as temperature and utilization, and by chip-specific factors, such as chip density, memory technology and DIMM age?We find that DRAM error behavior in the field differs in many key aspects from commonly held assumptions. For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year. We provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which previous work suspects to be the dominant error mode. We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking all other factors into account. Finally, unlike commonly feared, we don't observe any indication that newer generations of DIMMs have worse error behavior.

...read moreread less

724 citations

Journal Article•DOI•

Stereo Matching with Color-Weighted Correlation, Hierarchical Belief Propagation, and Occlusion Handling

[...]

Qingxiong Yang¹, Wang Liang², Ruigang Yang², Henrik Stewenius³, David Nister⁴ - Show less +1 more•Institutions (4)

University of Illinois at Urbana–Champaign¹, University of Kentucky², Google³, Microsoft⁴

01 Mar 2009-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: A stereo matching algorithm with careful handling of disparity, discontinuity, and occlusion based on an energy-minimization framework that is evaluated on the Middlebury data sets, showing that the algorithm is the top performer among all the algorithms listed there.

...read moreread less

Abstract: In this paper, we formulate a stereo matching algorithm with careful handling of disparity, discontinuity, and occlusion. The algorithm works with a global matching stereo model based on an energy-minimization framework. The global energy contains two terms, the data term and the smoothness term. The data term is first approximated by a color-weighted correlation, then refined in occluded and low-texture areas in a repeated application of a hierarchical loopy belief propagation algorithm. The experimental results are evaluated on the Middlebury data sets, showing that our algorithm is the top performer among all the algorithms listed there.

...read moreread less

623 citations

Journal Article•DOI•

Development and implementation of high-throughput SNP genotyping in barley

[...]

Timothy J. Close¹, Prasanna R. Bhat², Prasanna R. Bhat¹, Stefano Lonardi¹, Yonghui Wu³, Yonghui Wu¹, Nils Rostoks⁴, Nils Rostoks⁵, Luke Ramsay⁵, Arnis Druka⁵, Nils Stein⁶, Jan T. Svensson¹, Jan T. Svensson⁷, Steve Wanamaker¹, Serdar Bozdag¹, Mikeal L. Roose¹, Matthew J. Moscou¹, Matthew J. Moscou⁸, Shiaoman Chao⁹, Rajeev K. Varshney⁶, Rajeev K. Varshney¹⁰, Péter Szűcs¹¹, Kazuhiro Sato¹², Patrick M. Hayes¹¹, David E. Matthews¹³, Andris Kleinhofs¹⁴, Gary J. Muehlbauer¹⁵, Joseph DeYoung¹⁶, David Marshall⁵, Kavitha Madishetty¹, Raymond D. Fenton¹, Pascal Condamine¹, Andreas Graner⁶, Robbie Waugh⁵ - Show less +30 more•Institutions (16)

University of California, Riverside¹, Monsanto², Google³, University of Latvia⁴, Scottish Crop Research Institute⁵, Leibniz Association⁶, University of Copenhagen⁷, Iowa State University⁸, Agricultural Research Service⁹, International Crops Research Institute for the Semi-Arid Tropics¹⁰, Oregon State University¹¹, Okayama University¹², Cornell University¹³, Washington State University¹⁴, University of Minnesota¹⁵, University of California, Los Angeles¹⁶

04 Dec 2009-BMC Genomics

TL;DR: A high-density consensus genetic map of barley based only on complete and error-free datasets and genic markers, represented accurately by graphs and approximately by a best-fit linear order, and supported by a readily available SNP genotyping resource is presented in this paper.

...read moreread less

Abstract: High density genetic maps of plants have, nearly without exception, made use of marker datasets containing missing or questionable genotype calls derived from a variety of genic and non-genic or anonymous markers, and been presented as a single linear order of genetic loci for each linkage group. The consequences of missing or erroneous data include falsely separated markers, expansion of cM distances and incorrect marker order. These imperfections are amplified in consensus maps and problematic when fine resolution is critical including comparative genome analyses and map-based cloning. Here we provide a new paradigm, a high-density consensus genetic map of barley based only on complete and error-free datasets and genic markers, represented accurately by graphs and approximately by a best-fit linear order, and supported by a readily available SNP genotyping resource. Approximately 22,000 SNPs were identified from barley ESTs and sequenced amplicons; 4,596 of them were tested for performance in three pilot phase Illumina GoldenGate assays. Data from three barley doubled haploid mapping populations supported the production of an initial consensus map. Over 200 germplasm selections, principally European and US breeding material, were used to estimate minor allele frequency (MAF) for each SNP. We selected 3,072 of these tested SNPs based on technical performance, map location, MAF and biological interest to fill two 1536-SNP "production" assays (BOPA1 and BOPA2), which were made available to the barley genetics community. Data were added using BOPA1 from a fourth mapping population to yield a consensus map containing 2,943 SNP loci in 975 marker bins covering a genetic distance of 1099 cM. The unprecedented density of genic markers and marker bins enabled a high resolution comparison of the genomes of barley and rice. Low recombination in pericentric regions is evident from bins containing many more than the average number of markers, meaning that a large number of genes are recombinationally locked into the genetic centromeric regions of several barley chromosomes. Examination of US breeding germplasm illustrated the usefulness of BOPA1 and BOPA2 in that they provide excellent marker density and sensitivity for detection of minor alleles in this genetically narrow material.

...read moreread less

564 citations

Proceedings Article•DOI•

Native Client: A Sandbox for Portable, Untrusted x86 Native Code

[...]

Bennet S. Yee¹, David C. Sehr¹, Gregory Dardyk¹, J. Bradley Chen¹, Robert Muth¹, Tavis Ormandy¹, Shiki Okasaka¹, Neha Narula¹, Nicholas Fullagar¹ - Show less +5 more•Institutions (1)

Google¹

17 May 2009

TL;DR: The Native Client project as mentioned in this paper is a sandbox for untrusted x86 native code that uses software fault isolation and a secure runtime to direct system interaction and side effects through interfaces managed by Native Client.

...read moreread less

Abstract: This paper describes the design, implementation and evaluation of Native Client, a sandbox for untrusted x86 native code. Native Client aims to give browser-based applications the computational performance of native applications without compromising safety. Native Client uses software fault isolation and a secure runtime to direct system interaction and side effects through interfaces managed by Native Client. Native Client provides operating system portability for binary code while supporting performance-oriented features generally absent from web application programming environments, such as thread support, instruction set extensions such as SSE, and use of compiler intrinsics and hand-coded assembler. We combine these properties in an open architecture that encourages community review and 3rd-party tools.

...read moreread less

Proceedings Article•DOI•

The CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages

[...]

Jan Hajiċ¹, Massimiliano Ciaramita², Richard Johansson³, Daisuke Kawahara⁴, Maria Antònia Martí⁵, Lluís Màrquez⁶, Adam Meyers⁷, Joakim Nivre⁸, Sebastian Padó⁹, Jan Štėpánek¹, Pavel Straňák¹, Mihai Surdeanu¹⁰, Nianwen Xue¹¹, Yi Zhang¹² - Show less +10 more•Institutions (12)

Charles University in Prague¹, Google², University of Trento³, National Institute of Information and Communications Technology⁴, University of Barcelona⁵, Polytechnic University of Catalonia⁶, New York University⁷, Uppsala University⁸, University of Stuttgart⁹, Stanford University¹⁰, Brandeis University¹¹, Saarland University¹²

04 Jun 2009

TL;DR: This shared task combines the shared tasks of the previous five years under a unique dependency-based formalism similar to the 2008 task and describes how the data sets were created and show their quantitative properties.

...read moreread less

Abstract: For the 11th straight year, the Conference on Computational Natural Language Learning has been accompanied by a shared task whose purpose is to promote natural language processing applications and evaluate them in a standard setting. In 2009, the shared task was dedicated to the joint parsing of syntactic and semantic dependencies in multiple languages. This shared task combines the shared tasks of the previous five years under a unique dependency-based formalism similar to the 2008 task. In this paper, we define the shared task, describe how the data sets were created and show their quantitative properties, report the results and summarize the approaches of the participating systems.

...read moreread less

Journal Article•DOI•

Swoosh: a generic approach to entity resolution

[...]

Omar Benjelloun¹, Hector Garcia-Molina², David Menestrina², Qi Su², Steven Euijong Whang², Jennifer Widom² - Show less +2 more•Institutions (2)

Google¹, Stanford University²

01 Jan 2009

TL;DR: This work formalizes the generic ER problem, treating the functions for comparing and merging records as black-boxes, and identifies four important properties that, if satisfied by the match and merge functions, enable much more efficient ER algorithms.

...read moreread less

Abstract: We consider the entity resolution (ER) problem (also known as deduplication, or merge---purge), in which records determined to represent the same real-world entity are successively located and merged. We formalize the generic ER problem, treating the functions for comparing and merging records as black-boxes, which permits expressive and extensible ER solutions. We identify four important properties that, if satisfied by the match and merge functions, enable much more efficient ER algorithms. We develop three efficient ER algorithms: G-Swoosh for the case where the four properties do not hold, and R-Swoosh and F-Swoosh that exploit the four properties. F-Swoosh in addition assumes knowledge of the "features" (e.g., attributes) used by the match function. We experimentally evaluate the algorithms using comparison shopping data from Yahoo! Shopping and hotel information data from Yahoo! Travel. We also show that R-Swoosh (and F-Swoosh) can be used even when the four match and merge properties do not hold, if an "approximate" result is acceptable.

...read moreread less

Patent•

Notification of mobile device events

[...]

Erick Tseng¹, Dianne K. Hackborn¹, Daniel Johansson¹, Per Claes Olof Grimberg¹, Joseph M. Onorato¹, German W. Bauer¹, Jeffrey D. Yaksick¹, Christopher J. Desalvo¹ - Show less +4 more•Institutions (1)

Google¹

30 Jan 2009

TL;DR: In this paper, a computer-implemented user notification method includes displaying, in a status area near a perimeter of a graphical interface, a notification of a recent alert event for a mobile device, receiving a user selection in the status area, and in response to the receipt of the user selection, displaying detail regarding a plurality of recent messaging events.

...read moreread less

Abstract: A computer-implemented user notification method includes displaying, in a status area near a perimeter of a graphical interface, a notification of a recent alert event for a mobile device, receiving a user selection in the status area, and in response to the receipt of the user selection, displaying, in a central zone of the graphical interface, detail regarding a plurality of recent messaging events for the mobile device.

...read moreread less

Patent•

Securing, monitoring and tracking shipping containers

[...]

Thomas R. Berger¹, Joseph E. Denny¹, David S. Robins¹, LaMonte Peter Koop¹, Edward Allen Payne¹, Robert W. Twitchell¹ - Show less +2 more•Institutions (1)

Google¹

16 May 2009

TL;DR: A method of securing a container includes inserting, into a seal device at a container, an electronic bolt, reading, by the seal device, a serial number stored in the electronic bolt; communicating, from the sealed device, to a user application, insertion of the bolt; scanning, by a user via a handheld device, an identification of the sealing device; inputting, by user at the container via the handheld devices, information associated with the container; associating, in a database by the user application as mentioned in this paper.

...read moreread less

Abstract: A method of securing a container includes inserting, into a seal device at a container, an electronic bolt; reading, by the seal device, a serial number stored in the electronic bolt; communicating, from the seal device, to a user application, insertion of the bolt; scanning, by the user via a handheld device, a barcode on the seal device representative of an identification of the seal device; communicating, from the handheld device to the user application, the identification of the seal device; inputting, by a user at the container via the handheld device, information associated with the container; communicating, from the handheld device to the user application, the information associated with the container; associating, in a database by the user application, the information associated with the container with the bolt serial number and the identification of the seal device; communicating, by the user application, a confirmation to the seal device.

...read moreread less

Proceedings Article•

The Difficulty of Training Deep Architectures and the Effect of Unsupervised Pre-Training

[...]

Dumitru Erhan¹, Pierre-Antoine Manzagol¹, Yoshua Bengio¹, Samy Bengio², Pascal Vincent¹ - Show less +1 more•Institutions (2)

Université de Montréal¹, Google²

15 Apr 2009

TL;DR: The experiments confirm and clarify the advantage of unsupervised pre- training, and empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples.

...read moreread less

Abstract: Whereas theoretical work suggests that deep architectures might be more e cient at representing highly-varying functions, training deep architectures was unsuccessful until the recent advent of algorithms based on unsupervised pretraining. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this di cult learning problem. Answering these questions is important if learning in deep architectures is to be further improved. We attempt to shed some light on these questions through extensive simulations. The experiments confirm and clarify the advantage of unsupervised pre-training. They demonstrate the robustness of the training procedure with respect to the random initialization, the positive e ect of pre-training in terms of optimization and its role as a regularizer. We empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples.

...read moreread less

Journal Article•DOI•

Efficient Subwindow Search: A Branch and Bound Framework for Object Localization

[...]

Christoph H. Lampert¹, Matthew B. Blaschko², Thomas Hofmann³•Institutions (3)

Max Planck Society¹, University of Oxford², Google³

01 Dec 2009-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: A simple yet powerful branch and bound scheme that allows efficient maximization of a large class of quality functions over all possible subimages and converges to a globally optimal solution typically in linear or even sublinear time, in contrast to the quadratic scaling of exhaustive or sliding window search.

...read moreread less

Abstract: Most successful object recognition systems rely on binary classification, deciding only if an object is present or not, but not providing information on the actual object location. To estimate the object's location, one can take a sliding window approach, but this strongly increases the computational cost because the classifier or similarity function has to be evaluated over a large set of candidate subwindows. In this paper, we propose a simple yet powerful branch and bound scheme that allows efficient maximization of a large class of quality functions over all possible subimages. It converges to a globally optimal solution typically in linear or even sublinear time, in contrast to the quadratic scaling of exhaustive or sliding window search. We show how our method is applicable to different object detection and image retrieval scenarios. The achieved speedup allows the use of classifiers for localization that formerly were considered too slow for this task, such as SVMs with a spatial pyramid kernel or nearest-neighbor classifiers based on the lambda2 distance. We demonstrate state-of-the-art localization performance of the resulting systems on the UIUC Cars data set, the PASCAL VOC 2006 data set, and in the PASCAL VOC 2007 competition.

...read moreread less

Proceedings Article•DOI•

Reciprocal rank fusion outperforms condorcet and individual rank learning methods

[...]

Gordon V. Cormack¹, Charles L. A. Clarke¹, Stefan Buettcher²•Institutions (2)

University of Waterloo¹, Google²

19 Jul 2009

TL;DR: Reciprocal Rank Fusion is demonstrated by using RRF to combine the results of several TREC experiments, and to build a meta-learner that ranks the LETOR 3 dataset better than any previously reported method.

...read moreread less

Abstract: Reciprocal Rank Fusion (RRF), a simple method for combining the document rankings from multiple IR systems, consistently yields better results than any individual system, and better results than the standard method Condorcet Fuse. This result is demonstrated by using RRF to combine the results of several TREC experiments, and to build a meta-learner that ranks the LETOR 3 dataset better than any previously reported method

...read moreread less

Journal Article•DOI•

Smartphones: An Emerging Tool for Social Scientists

[...]

Mika Raento¹, Antti Oulasvirta², Nathan Eagle³•Institutions (3)

Google¹, Helsinki Institute for Information Technology², Massachusetts Institute of Technology³

01 Feb 2009-Sociological Methods & Research

TL;DR: The authors argue that the technological and social transformations of the smartphone have produced a new kind of device: a programmable mobile phone, the smartphone.

...read moreread less

Abstract: Recent developments in mobile technologies have produced a new kind of device: a programmable mobile phone, the smartphone. In this article, the authors argue that the technological and social char...

...read moreread less

Patent•

Secured Electronic Transaction System

[...]

Philipp Frank Hermann Udo Hertel¹, Alexander Wolfgang Karl Kurt Hertel¹, John David Trevor Graham¹, Mark Braverman¹•Institutions (1)

Google¹

18 May 2009

TL;DR: In this article, a configuration (a system and/or a method) is disclosed that includes a unified and integrated configuration that is composed of a payment system, an advertising system, and an identity management system as well as their associated methods.

...read moreread less

Abstract: A configuration (a system and/or a method) are disclosed that includes a unified and integrated configuration that is composed of a payment system, an advertising system, and an identity management system as well as their associated methods such that the unified system has all of the benefits of the individual systems as well as several additional synergistic benefits. Also described are specific configurations (subsystems and/or methods) including the system's access point architecture, a user interface that acts as a visual wallet simulator, a security architecture, coupon handling as well as the system's structure and means for delivering them as targeted advertising, business card handling, membership card handling for the purposes of login management, receipt handling, and the editors and grammars provided for customizing the different types of objects in the system as well as the creation of new custom objects with custom behaviors. The configurations are operable on-line as well as through physical presence transactions, e.g., mobile transaction through a mobile phone or dedicated device at a physical site for a transaction.

...read moreread less

Proceedings Article•DOI•

State of the Art in Example-based Texture Synthesis

[...]

Li-Yi Wei¹, Sylvain Lefebvre, Vivek Kwatra², Greg Turk³•Institutions (3)

Microsoft¹, Google², Georgia Institute of Technology³

30 Mar 2009

TL;DR: This report aims to provide a tutorial that is easy to follow for readers who are not already familiar with the subject, make a comprehensive survey and comparisons of different methods, and sketch a vision for future work that can help motivate and guide readers that are interested in texture synthesis research.

...read moreread less

Abstract: Recent years have witnessed significant progress in example-based texture synthesis algorithms. Given an example texture, these methods produce a larger texture that is tailored to the user's needs. In this state-of-the-art report, we aim to achieve three goals: (1) provide a tutorial that is easy to follow for readers who are not already familiar with the subject, (2) make a comprehensive survey and comparisons of different methods, and (3) sketch a vision for future work that can help motivate and guide readers that are interested in texture synthesis research. We cover fundamental algorithms as well as extensions and applications of texture synthesis.

...read moreread less

Journal Article•DOI•

Semi-supervised graph clustering: a kernel approach

[...]

Brian Kulis¹, Sugato Basu², Inderjit S. Dhillon¹, Raymond J. Mooney¹•Institutions (2)

University of Texas at Austin¹, Google²

01 Jan 2009-Machine Learning

TL;DR: The proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the weighted kernel k-means objective.

...read moreread less

Abstract: Semi-supervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally given as pairwise constraints; such constraints are natural for graphs, yet most semi-supervised clustering algorithms are designed for data represented as vectors. In this paper, we unify vector-based and graph-based approaches. We first show that a recently-proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the weighted kernel k-means objective (Dhillon et al., in Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining, 2004a). A recent theoretical connection between weighted kernel k-means and several graph clustering objectives enables us to perform semi-supervised clustering of data given either as vectors or as a graph. For graph data, this result leads to algorithms for optimizing several new semi-supervised graph clustering objectives. For vector data, the kernel approach also enables us to find clusters with non-linear boundaries in the input data space. Furthermore, we show that recent work on spectral learning (Kamvar et al., in Proceedings of the 17th International Joint Conference on Artificial Intelligence, 2003) may be viewed as a special case of our formulation. We empirically show that our algorithm is able to outperform current state-of-the-art semi-supervised algorithms on both vector-based and graph-based data sets.

...read moreread less

Proceedings Article•DOI•

Tour the world: Building a web-scale landmark recognition engine

[...]

Yan-Tao Zheng¹, Ming Zhao², Yang Song², Hartwig Adam², Ulrich Buddemeier², Alessandro Bissacco², Fernando Brucher², Tat-Seng Chua¹, Hartmut Neven² - Show less +5 more•Institutions (2)

National University of Singapore¹, Google²

20 Jun 2009

TL;DR: This paper leverages the vast amount of multimedia data on the Web, the availability of an Internet image search engine, and advances in object recognition and clustering techniques, to address issues of modeling and recognizing landmarks at world-scale.

...read moreread less

Abstract: Modeling and recognizing landmarks at world-scale is a useful yet challenging task There exists no readily available list of worldwide landmarks Obtaining reliable visual models for each landmark can also pose problems, and efficiency is another challenge for such a large scale system This paper leverages the vast amount of multimedia data on the Web, the availability of an Internet image search engine, and advances in object recognition and clustering techniques, to address these issues First, a comprehensive list of landmarks is mined from two sources: (1) ~20 million GPS-tagged photos and (2) online tour guide Web pages Candidate images for each landmark are then obtained from photo sharing Websites or by querying an image search engine Second, landmark visual models are built by pruning candidate images using efficient image matching and unsupervised clustering techniques Finally, the landmarks and their visual models are validated by checking authorship of their member images The resulting landmark recognition engine incorporates 5312 landmarks from 1259 cities in 144 countries The experiments demonstrate that the engine can deliver satisfactory recognition performance with high efficiency

...read moreread less

Proceedings Article•DOI•

ThreadSanitizer: data race detection in practice

[...]

Konstantin Serebryany¹, Timur Iskhodzhanov²•Institutions (2)

Google¹, Moscow Institute of Physics and Technology²

12 Dec 2009

TL;DR: This paper presents ThreadSanitizer -- a dynamic detector of data races, and introduces what is called dynamic annotations -- a sort of race detection API that allows a user to inform the detector about any tricky synchronization in the user program.

...read moreread less

Abstract: Data races are a particularly unpleasant kind of threading bugs. They are hard to find and reproduce -- you may not observe a bug during the entire testing cycle and will only see it in production as rare unexplainable failures. This paper presents ThreadSanitizer -- a dynamic detector of data races. We describe the hybrid algorithm (based on happens-before and locksets) used in the detector. We introduce what we call dynamic annotations -- a sort of race detection API that allows a user to inform the detector about any tricky synchronization in the user program. Various practical aspects of using ThreadSanitizer for testing multithreaded C++ code at Google are also discussed.

...read moreread less

Proceedings Article•DOI•

Semi-Supervised Polarity Lexicon Induction

[...]

Delip Rao¹, Deepak Ravichandran²•Institutions (2)

Johns Hopkins University¹, Google²

30 Mar 2009

TL;DR: The results indicate that label propagation improves significantly over the baseline and other semi-supervised learning methods like Mincuts and Randomized Mincuts for this task.

...read moreread less

Abstract: We present an extensive study on the problem of detecting polarity of words. We consider the polarity of a word to be either positive or negative. For example, words such as good, beautiful, and wonderful are considered as positive words; whereas words such as bad, ugly, and sad are considered negative words. We treat polarity detection as a semi-supervised label propagation problem in a graph. In the graph, each node represents a word whose polarity is to be determined. Each weighted edge encodes a relation that exists between two words. Each node (word) can have two labels: positive or negative. We study this framework in two different resource availability scenarios using WordNet and OpenOffice thesaurus when WordNet is not available. We report our results on three different languages: English, French, and Hindi. Our results indicate that label propagation improves significantly over the baseline and other semi-supervised learning methods like Mincuts and Randomized Mincuts for this task.

...read moreread less

Proceedings Article•DOI•

Online Stochastic Matching: Beating 1-1/e

[...]

Jon Feldman¹, Aranyak Mehta¹, Vahab Mirrokni¹, S. Muthukrishnan¹•Institutions (1)

Google¹

25 Oct 2009

TL;DR: In this article, the authors study the online stochastic bipartite matching problem, in a form motivated by display ad allocation on the Internet, and show that no online algorithm can achieve an approximation ratio better than 0.632.

...read moreread less

Abstract: We study the online stochastic bipartite matching problem, in a form motivated by display ad allocation on the Internet. In the online, but adversarial case, the celebrated result of Karp, Vazirani and Vazirani gives an approximation ratio of $1-{1\over e} \simeq 0.632$, a very familiar bound that holds for many online problems; further, the bound is tight in this case. In the online, stochastic case when nodes are drawn repeatedly from a known distribution, the greedy algorithm matches this approximation ratio, but still, no algorithm is known that beats the $1 - {1\over e}$ bound.Our main result is a $0.67$-approximation online algorithm for stochastic bipartite matching, breaking this $1 - {1\over e}$ barrier. Furthermore, we show that no online algorithm can produce a $1-\epsilon$ approximation for an arbitrarily small $\epsilon$ for this problem. Our algorithms are based on computing an optimal offline solution to the expected instance, and using this solution as a guideline in the process of online allocation. We employ a novel application of the idea of the power of two choices from load balancing: we compute two disjoint solutions to the expected instance, and use both of them in the online algorithm in a prescribed preference order. To identify these two disjoint solutions, we solve a max flow problem in a boosted flow graph, and then carefully decompose this maximum flow to two edge-disjoint (near-) matchings. In addition to guiding the online decision making, these two offline solutions are used to characterize an upper bound for the optimum in any scenario. This is done by identifying a cut whose value we can bound under the arrival distribution. At the end, we discuss extensions of our results to more general bipartite allocations that are important in a display ad application.

...read moreread less

Proceedings Article•

Learning Non-Linear Combinations of Kernels

[...]

Corinna Cortes¹, Mehryar Mohri², Afshin Rostamizadeh²•Institutions (2)

Google¹, Courant Institute of Mathematical Sciences²

07 Dec 2009

TL;DR: A projection-based gradient descent algorithm is given for solving the optimization problem of learning kernels based on a polynomial combination of base kernels and it is proved that the global solution of this problem always lies on the boundary.

...read moreread less

Abstract: This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze this problem in the case of regression and the kernel ridge regression algorithm. We examine the corresponding learning kernel optimization problem, show how that minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in few iterations. Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.

...read moreread less

Journal Article•DOI•

SOAR: Simple Opportunistic Adaptive Routing Protocol for Wireless Mesh Networks

[...]

Eric J. Rozner¹, J. Seshadri², Y. Mehta³, Lili Qiu¹•Institutions (3)

University of Texas at Austin¹, VMware², Google³

01 Dec 2009-IEEE Transactions on Mobile Computing

TL;DR: This paper proposes a simple opportunistic adaptive routing protocol (SOAR) to explicitly support multiple simultaneous flows in wireless mesh networks and shows that SOAR significantly outperforms traditional routing and a seminal opportunistic routing protocol, ExOR, under a wide range of scenarios.

...read moreread less

Abstract: Multihop wireless mesh networks are becoming a new attractive communication paradigm owing to their low cost and ease of deployment. Routing protocols are critical to the performance and reliability of wireless mesh networks. Traditional routing protocols send traffic along predetermined paths and face difficulties in coping with unreliable and unpredictable wireless medium. In this paper, we propose a simple opportunistic adaptive routing protocol (SOAR) to explicitly support multiple simultaneous flows in wireless mesh networks. SOAR incorporates the following four major components to achieve high throughput and fairness: 1) adaptive forwarding path selection to leverage path diversity while minimizing duplicate transmissions, 2) priority timer-based forwarding to let only the best forwarding node forward the packet, 3) local loss recovery to efficiently detect and retransmit lost packets, and 4) adaptive rate control to determine an appropriate sending rate according to the current network conditions. We implement SOAR in both NS-2 simulation and an 18-node wireless mesh testbed. Our extensive evaluation shows that SOAR significantly outperforms traditional routing and a seminal opportunistic routing protocol, ExOR, under a wide range of scenarios.

...read moreread less

Journal Article•DOI•

PLANET: massively parallel learning of tree ensembles with MapReduce

[...]

Biswanath Panda¹, Joshua Seth Herbach¹, Sugato Basu¹, Roberto J. Bayardo¹•Institutions (1)

Google¹

01 Aug 2009

TL;DR: This paper describes PLANET: a scalable distributed framework for learning tree models over large datasets, and shows how this framework supports scalable construction of classification and regression trees, as well as ensembles of such models.

...read moreread less

Abstract: Classification and regression tree learning on massive datasets is a common data mining task at Google, yet many state of the art tree learning algorithms require training data to reside in memory on a single machine. While more scalable implementations of tree learning have been proposed, they typically require specialized parallel computing architectures. In contrast, the majority of Google's computing infrastructure is based on commodity hardware.In this paper, we describe PLANET: a scalable distributed framework for learning tree models over large datasets. PLANET defines tree learning as a series of distributed computations, and implements each one using the MapReduce model of distributed computation. We show how this framework supports scalable construction of classification and regression trees, as well as ensembles of such models. We discuss the benefits and challenges of using a MapReduce compute cluster for tree learning, and demonstrate the scalability of this approach by applying it to a real world learning task from the domain of computational advertising.

...read moreread less

Patent•

Textual Disambiguation Using Social Connections

[...]

David P. Conway¹, Andrew E. Rubin¹•Institutions (1)

Google¹

16 Oct 2009

Collapse