
Showing papers by "Google" published in 2009


Journal Article•DOI•
19 Feb 2009-Nature
TL;DR: A method of analysing large numbers of Google search queries to track influenza-like illness in a population and accurately estimate the current level of weekly influenza activity in each region of the United States with a reporting lag of about one day is presented.
Abstract: This paper - first published on-line in November 2008 - draws on data from an early version of the Google Flu Trends search engine to estimate the levels of flu in a population. It introduces a computational model that converts raw search query data into a region-by-region real-time surveillance system that accurately estimates influenza activity with a lag of about one day - one to two weeks faster than the conventional reports published by the Centers for Disease Control and Prevention. This report introduces a computational model based on internet search queries for real-time surveillance of influenza-like illness (ILI), which reproduces the patterns observed in ILI data from the Centers for Disease Control and Prevention. Seasonal influenza epidemics are a major public health concern, causing tens of millions of respiratory illnesses and 250,000 to 500,000 deaths worldwide each year1. In addition to seasonal influenza, a new strain of influenza virus against which no previous immunity exists and that demonstrates human-to-human transmission could result in a pandemic with millions of fatalities2. Early detection of disease activity, when followed by a rapid response, can reduce the impact of both seasonal and pandemic influenza3,4. One way to improve early detection is to monitor health-seeking behaviour in the form of queries to online search engines, which are submitted by millions of users around the world each day. Here we present a method of analysing large numbers of Google search queries to track influenza-like illness in a population. Because the relative frequency of certain queries is highly correlated with the percentage of physician visits in which a patient presents with influenza-like symptoms, we can accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day. This approach may make it possible to use search queries to detect influenza epidemics in areas with a large population of web search users.
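
As a rough illustration of the approach described in the abstract, the sketch below fits a simple log-odds linear model relating an ILI-related query fraction to the ILI physician-visit percentage. It is not the production Flu Trends system; the function names and the synthetic weekly data are assumptions for illustration only.

```python
import numpy as np

def logit(p):
    """Log-odds transform for fractions in (0, 1)."""
    return np.log(p / (1.0 - p))

def fit_flu_model(query_fraction, ili_percentage):
    """Fit logit(ILI%) = beta0 + beta1 * logit(query fraction) by least squares.

    query_fraction: weekly fraction of searches that are ILI-related
    ili_percentage: weekly fraction of physician visits that are ILI
    """
    x = logit(np.asarray(query_fraction))
    y = logit(np.asarray(ili_percentage))
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta  # [beta0, beta1]

def predict_ili(beta, query_fraction):
    """Map a new week's query fraction to an estimated ILI percentage."""
    z = beta[0] + beta[1] * logit(np.asarray(query_fraction))
    return 1.0 / (1.0 + np.exp(-z))  # inverse logit

# Illustrative synthetic data: 10 weeks of query fractions and ILI rates.
q = np.array([0.002, 0.003, 0.005, 0.008, 0.012, 0.015, 0.011, 0.007, 0.004, 0.003])
p = np.array([0.010, 0.013, 0.020, 0.031, 0.045, 0.055, 0.042, 0.028, 0.017, 0.012])
beta = fit_flu_model(q, p)
print(predict_ili(beta, 0.009))
```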

3,984 citations


Journal Article•DOI•
TL;DR: A capture method that uses biotinylated RNA 'baits' to fish targets out of a 'pond' of DNA fragments; uniformity was such that ∼60% of target bases in the exonic 'catch', and ∼80% in the regional catch, had at least half the mean coverage.
Abstract: Targeting genomic loci by massively parallel sequencing requires new methods to enrich templates to be sequenced. We developed a capture method that uses biotinylated RNA 'baits' to fish targets out of a 'pond' of DNA fragments. The RNA is transcribed from PCR-amplified oligodeoxynucleotides originally synthesized on a microarray, generating sufficient bait for multiple captures at concentrations high enough to drive the hybridization. We tested this method with 170-mer baits that target >15,000 coding exons (2.5 Mb) and four regions (1.7 Mb total) using Illumina sequencing as read-out. About 90% of uniquely aligning bases fell on or near bait sequence; up to 50% lay on exons proper. The uniformity was such that approximately 60% of target bases in the exonic 'catch', and approximately 80% in the regional catch, had at least half the mean coverage. One lane of Illumina sequence was sufficient to call high-confidence genotypes for 89% of the targeted exon space.

1,444 citations


Journal Article•DOI•
Alon Halevy1, Peter Norvig1, Fernando Pereira1•
TL;DR: A trillion-word corpus - along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions - captures even very rare aspects of human behavior.
Abstract: Researchers at Brown University were once excited to have access to the Brown Corpus, containing one million English words. Since then, we have seen several notable corpora that are about 100 times larger, and in 2006, Google released a trillion-word corpus with frequency counts for all sequences up to five words long. In some ways this corpus is a step backwards from the Brown Corpus: it's taken from unfiltered Web pages and thus contains incomplete sentences, spelling errors, grammatical errors, and all sorts of other errors. It's not annotated with carefully hand-corrected part-of-speech tags. But the fact that it's a million times larger than the Brown Corpus outweighs these drawbacks. A trillion-word corpus - along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions - captures even very rare aspects of human behavior. So, this corpus could serve as the basis of a complete model for certain tasks - if only we knew how to extract the model from the data.

1,404 citations


Journal Article•DOI•
TL;DR: This paper discusses the inherent difficulties in head pose estimation and presents an organized survey describing the evolution of the field, comparing systems by focusing on their ability to estimate coarse and fine head pose and highlighting approaches well suited for unconstrained environments.
Abstract: The capacity to estimate the head pose of another person is a common human ability that presents a unique challenge for computer vision systems. Compared to face detection and recognition, which have been the primary foci of face-related vision research, identity-invariant head pose estimation has fewer rigorously evaluated systems or generic solutions. In this paper, we discuss the inherent difficulties in head pose estimation and present an organized survey describing the evolution of the field. Our discussion focuses on the advantages and disadvantages of each approach and spans 90 of the most innovative and characteristic papers that have been published on this topic. We compare these systems by focusing on their ability to estimate coarse and fine head pose, highlighting approaches that are well suited for unconstrained environments.

1,402 citations


Proceedings Article•DOI•
31 May 2009
TL;DR: This paper presents and compares WordNet-based and distributional similarity approaches, and pioneers cross-lingual similarity, showing that the methods are easily adapted for a cross-lingual task with minor losses.
Abstract: This paper presents and compares WordNet-based and distributional similarity approaches. The strengths and weaknesses of each approach regarding similarity and relatedness tasks are discussed, and a combination is presented. Each of our methods independently provides the best results in its class on the RG and WordSim353 datasets, and a supervised combination of them yields the best published results on all datasets. Finally, we pioneer cross-lingual similarity, showing that our methods are easily adapted for a cross-lingual task with minor losses.

936 citations


Journal Article•DOI•
07 Oct 2009
TL;DR: This paper was first published online by the Internet Society in December 2003 and is being re-published in ACM SIGCOMM Computer Communication Review because of its historic import.
Abstract: This paper was first published online by the Internet Society in December 2003 and is being re-published in ACM SIGCOMM Computer Communication Review because of its historic import. It was written at the urging of its primary editor, the late Barry Leiner. He felt that a factual rendering of the events and activities associated with the development of the early Internet would be a valuable contribution. The contributing authors did their best to incorporate only factual material into this document. There are sure to be many details that have not been captured in the body of the document but it remains one of the most accurate renderings of the early period of development available.

926 citations


Proceedings Article•DOI•
02 Nov 2009
TL;DR: This work presents a new editorial metric for graded relevance which overcomes this difficulty and implicitly discounts documents which are shown below very relevant documents and calls it Expected Reciprocal Rank (ERR).
Abstract: While numerous metrics for information retrieval are available in the case of binary relevance, there is only one commonly used metric for graded relevance, namely the Discounted Cumulative Gain (DCG). A drawback of DCG is its additive nature and the underlying independence assumption: a document in a given position always has the same gain and discount, independently of the documents shown above it. Inspired by the "cascade" user model, we present a new editorial metric for graded relevance which overcomes this difficulty and implicitly discounts documents which are shown below very relevant documents. More precisely, this new metric is defined as the expected reciprocal length of time that the user will take to find a relevant document. This can be seen as an extension of the classical reciprocal rank to the graded relevance case and we call this metric Expected Reciprocal Rank (ERR). We conduct an extensive evaluation on the query logs of a commercial search engine and show that ERR correlates better with click metrics than other editorial metrics.
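
The metric itself is easy to state; the sketch below computes ERR under the cascade model, assuming the commonly used gain mapping R_i = (2^g_i - 1) / 2^g_max for a grade g_i with maximum grade g_max. Variable names are illustrative.

```python
def expected_reciprocal_rank(grades, g_max):
    """Compute ERR for a ranked list of graded relevance labels.

    grades: relevance grades for documents at ranks 1..n (e.g. 0..4)
    g_max: maximum possible grade
    Under the cascade model the user stops at rank r with probability
    R_r * prod_{i<r} (1 - R_i), and ERR is the expectation of 1/r.
    """
    err = 0.0
    p_continue = 1.0  # probability the user reaches the current rank
    for rank, g in enumerate(grades, start=1):
        r = (2 ** g - 1) / (2 ** g_max)  # probability the document satisfies the user
        err += p_continue * r / rank
        p_continue *= (1.0 - r)
    return err

# Example: a highly relevant document at rank 1 dominates the metric.
print(expected_reciprocal_rank([4, 2, 0, 1], g_max=4))
```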

831 citations


Proceedings Article•DOI•
15 Jun 2009
TL;DR: Measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which previous work suspects to be the dominant error mode.
Abstract: Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern compute clusters. Failures are costly both in terms of hardware replacement costs and service disruption. While a large body of work exists on DRAM in laboratory conditions, little has been reported on real DRAM failures in large production clusters. In this paper, we analyze measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of DIMM days. The goal of this paper is to answer questions such as the following: How common are memory errors in practice? What are their statistical properties? How are they affected by external factors, such as temperature and utilization, and by chip-specific factors, such as chip density, memory technology and DIMM age? We find that DRAM error behavior in the field differs in many key aspects from commonly held assumptions. For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year. We provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which previous work suspects to be the dominant error mode. We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking all other factors into account. Finally, contrary to common fears, we do not observe any indication that newer generations of DIMMs have worse error behavior.

724 citations


Journal Article•DOI•
TL;DR: A stereo matching algorithm with careful handling of disparity, discontinuity, and occlusion, based on an energy-minimization framework; evaluation on the Middlebury data sets shows that the algorithm is the top performer among all the algorithms listed there.
Abstract: In this paper, we formulate a stereo matching algorithm with careful handling of disparity, discontinuity, and occlusion. The algorithm works with a global matching stereo model based on an energy-minimization framework. The global energy contains two terms, the data term and the smoothness term. The data term is first approximated by a color-weighted correlation, then refined in occluded and low-texture areas in a repeated application of a hierarchical loopy belief propagation algorithm. The experimental results are evaluated on the Middlebury data sets, showing that our algorithm is the top performer among all the algorithms listed there.
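
The two-term global energy has the usual pairwise MRF form; a generic version, with notation assumed here rather than taken from the paper, is

```latex
E(d) = \sum_{p \in \mathcal{P}} D_p(d_p) \;+\; \sum_{(p,q) \in \mathcal{N}} V_{pq}(d_p, d_q)
```

where D_p is the color-weighted data cost of assigning disparity d_p to pixel p, and V_pq is the smoothness cost on neighboring pixels that encodes discontinuity handling; hierarchical loopy belief propagation is used to approximately minimize such an energy.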

623 citations


Journal Article•DOI•
TL;DR: A high-density consensus genetic map of barley based only on complete and error-free datasets and genic markers, represented accurately by graphs and approximately by a best-fit linear order, and supported by a readily available SNP genotyping resource is presented in this paper.
Abstract: High density genetic maps of plants have, nearly without exception, made use of marker datasets containing missing or questionable genotype calls derived from a variety of genic and non-genic or anonymous markers, and been presented as a single linear order of genetic loci for each linkage group. The consequences of missing or erroneous data include falsely separated markers, expansion of cM distances and incorrect marker order. These imperfections are amplified in consensus maps and problematic when fine resolution is critical including comparative genome analyses and map-based cloning. Here we provide a new paradigm, a high-density consensus genetic map of barley based only on complete and error-free datasets and genic markers, represented accurately by graphs and approximately by a best-fit linear order, and supported by a readily available SNP genotyping resource. Approximately 22,000 SNPs were identified from barley ESTs and sequenced amplicons; 4,596 of them were tested for performance in three pilot phase Illumina GoldenGate assays. Data from three barley doubled haploid mapping populations supported the production of an initial consensus map. Over 200 germplasm selections, principally European and US breeding material, were used to estimate minor allele frequency (MAF) for each SNP. We selected 3,072 of these tested SNPs based on technical performance, map location, MAF and biological interest to fill two 1536-SNP "production" assays (BOPA1 and BOPA2), which were made available to the barley genetics community. Data were added using BOPA1 from a fourth mapping population to yield a consensus map containing 2,943 SNP loci in 975 marker bins covering a genetic distance of 1099 cM. The unprecedented density of genic markers and marker bins enabled a high resolution comparison of the genomes of barley and rice. Low recombination in pericentric regions is evident from bins containing many more than the average number of markers, meaning that a large number of genes are recombinationally locked into the genetic centromeric regions of several barley chromosomes. Examination of US breeding germplasm illustrated the usefulness of BOPA1 and BOPA2 in that they provide excellent marker density and sensitivity for detection of minor alleles in this genetically narrow material.

564 citations


Proceedings Article•DOI•
17 May 2009
TL;DR: The Native Client project as mentioned in this paper is a sandbox for untrusted x86 native code that uses software fault isolation and a secure runtime to direct system interaction and side effects through interfaces managed by Native Client.
Abstract: This paper describes the design, implementation and evaluation of Native Client, a sandbox for untrusted x86 native code. Native Client aims to give browser-based applications the computational performance of native applications without compromising safety. Native Client uses software fault isolation and a secure runtime to direct system interaction and side effects through interfaces managed by Native Client. Native Client provides operating system portability for binary code while supporting performance-oriented features generally absent from web application programming environments, such as thread support, instruction set extensions such as SSE, and use of compiler intrinsics and hand-coded assembler. We combine these properties in an open architecture that encourages community review and 3rd-party tools.

Proceedings Article•DOI•
04 Jun 2009
TL;DR: This shared task combines the shared tasks of the previous five years under a unique dependency-based formalism similar to that of the 2008 task; the paper describes how the data sets were created and reports their quantitative properties.
Abstract: For the 11th straight year, the Conference on Computational Natural Language Learning has been accompanied by a shared task whose purpose is to promote natural language processing applications and evaluate them in a standard setting. In 2009, the shared task was dedicated to the joint parsing of syntactic and semantic dependencies in multiple languages. This shared task combines the shared tasks of the previous five years under a unique dependency-based formalism similar to the 2008 task. In this paper, we define the shared task, describe how the data sets were created and show their quantitative properties, report the results and summarize the approaches of the participating systems.

Journal Article•DOI•
01 Jan 2009
TL;DR: This work formalizes the generic ER problem, treating the functions for comparing and merging records as black-boxes, and identifies four important properties that, if satisfied by the match and merge functions, enable much more efficient ER algorithms.
Abstract: We consider the entity resolution (ER) problem (also known as deduplication, or merge/purge), in which records determined to represent the same real-world entity are successively located and merged. We formalize the generic ER problem, treating the functions for comparing and merging records as black-boxes, which permits expressive and extensible ER solutions. We identify four important properties that, if satisfied by the match and merge functions, enable much more efficient ER algorithms. We develop three efficient ER algorithms: G-Swoosh for the case where the four properties do not hold, and R-Swoosh and F-Swoosh that exploit the four properties. F-Swoosh in addition assumes knowledge of the "features" (e.g., attributes) used by the match function. We experimentally evaluate the algorithms using comparison shopping data from Yahoo! Shopping and hotel information data from Yahoo! Travel. We also show that R-Swoosh (and F-Swoosh) can be used even when the four match and merge properties do not hold, if an "approximate" result is acceptable.
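
To make the black-box formulation concrete, here is a minimal, in-memory sketch in the spirit of R-Swoosh: records are repeatedly compared against an already-resolved set, and matching pairs are merged and re-queued. The match and merge functions below are illustrative stand-ins, not the paper's.

```python
def resolve(records, match, merge):
    """Generic entity resolution with black-box match/merge (R-Swoosh style).

    records: list of records (opaque to the algorithm)
    match(r1, r2) -> bool, merge(r1, r2) -> merged record
    Returns a list of records in which no two records match.
    """
    pending = list(records)   # records not yet compared against the output set
    resolved = []             # mutually non-matching records
    while pending:
        r = pending.pop()
        partner = next((s for s in resolved if match(r, s)), None)
        if partner is None:
            resolved.append(r)
        else:
            resolved.remove(partner)
            pending.append(merge(r, partner))  # merged record may match others
    return resolved

# Illustrative black boxes: records are dicts; match on shared name or phone.
def match(r1, r2):
    return r1.get("name") == r2.get("name") or r1.get("phone") == r2.get("phone")

def merge(r1, r2):
    return {k: r1.get(k) or r2.get(k) for k in set(r1) | set(r2)}

print(resolve([{"name": "A", "phone": "1"}, {"name": "A"}, {"name": "B"}], match, merge))
```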

Patent•
30 Jan 2009
TL;DR: In this paper, a computer-implemented user notification method includes displaying, in a status area near a perimeter of a graphical interface, a notification of a recent alert event for a mobile device, receiving a user selection in the status area, and in response to the receipt of the user selection, displaying detail regarding a plurality of recent messaging events.
Abstract: A computer-implemented user notification method includes displaying, in a status area near a perimeter of a graphical interface, a notification of a recent alert event for a mobile device, receiving a user selection in the status area, and in response to the receipt of the user selection, displaying, in a central zone of the graphical interface, detail regarding a plurality of recent messaging events for the mobile device.

Patent•
16 May 2009
TL;DR: A method of securing a container includes inserting an electronic bolt into a seal device at the container; reading, by the seal device, a serial number stored in the electronic bolt; communicating insertion of the bolt from the seal device to a user application; scanning, by a user via a handheld device, an identification of the seal device; inputting, by the user at the container via the handheld device, information associated with the container; and associating, in a database by the user application, that information with the bolt serial number and the seal device identification.
Abstract: A method of securing a container includes inserting, into a seal device at a container, an electronic bolt; reading, by the seal device, a serial number stored in the electronic bolt; communicating, from the seal device, to a user application, insertion of the bolt; scanning, by the user via a handheld device, a barcode on the seal device representative of an identification of the seal device; communicating, from the handheld device to the user application, the identification of the seal device; inputting, by a user at the container via the handheld device, information associated with the container; communicating, from the handheld device to the user application, the information associated with the container; associating, in a database by the user application, the information associated with the container with the bolt serial number and the identification of the seal device; communicating, by the user application, a confirmation to the seal device.

Proceedings Article•
15 Apr 2009
TL;DR: The experiments confirm and clarify the advantage of unsupervised pre-training, and empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples.
Abstract: Whereas theoretical work suggests that deep architectures might be more efficient at representing highly-varying functions, training deep architectures was unsuccessful until the recent advent of algorithms based on unsupervised pre-training. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. Answering these questions is important if learning in deep architectures is to be further improved. We attempt to shed some light on these questions through extensive simulations. The experiments confirm and clarify the advantage of unsupervised pre-training. They demonstrate the robustness of the training procedure with respect to the random initialization, the positive effect of pre-training in terms of optimization and its role as a regularizer. We empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples.

Journal Article•DOI•
TL;DR: A simple yet powerful branch and bound scheme that allows efficient maximization of a large class of quality functions over all possible subimages and converges to a globally optimal solution typically in linear or even sublinear time, in contrast to the quadratic scaling of exhaustive or sliding window search.
Abstract: Most successful object recognition systems rely on binary classification, deciding only if an object is present or not, but not providing information on the actual object location. To estimate the object's location, one can take a sliding window approach, but this strongly increases the computational cost because the classifier or similarity function has to be evaluated over a large set of candidate subwindows. In this paper, we propose a simple yet powerful branch and bound scheme that allows efficient maximization of a large class of quality functions over all possible subimages. It converges to a globally optimal solution typically in linear or even sublinear time, in contrast to the quadratic scaling of exhaustive or sliding window search. We show how our method is applicable to different object detection and image retrieval scenarios. The achieved speedup allows the use of classifiers for localization that formerly were considered too slow for this task, such as SVMs with a spatial pyramid kernel or nearest-neighbor classifiers based on the χ2 distance. We demonstrate state-of-the-art localization performance of the resulting systems on the UIUC Cars data set, the PASCAL VOC 2006 data set, and in the PASCAL VOC 2007 competition.
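
The branch and bound scheme can be sketched generically: candidate rectangle sets are kept in a priority queue ordered by an upper bound on the quality function, and the most promising set is repeatedly split until a single rectangle remains. The rectangle-set representation and the toy area-based bound below are assumptions for illustration; the paper's bounds are built from classifier-specific quantities.

```python
import heapq

def branch_and_bound_search(img_w, img_h, quality_bound):
    """Find the subwindow maximizing a quality function via best-first branch and bound.

    A rectangle set is (T, B, L, R): intervals of candidate top/bottom/left/right
    coordinates. quality_bound(rect_set) must return an upper bound on the quality
    of every rectangle in the set, and the exact quality when all intervals are single values.
    """
    full = ((0, img_h - 1), (0, img_h - 1), (0, img_w - 1), (0, img_w - 1))
    heap = [(-quality_bound(full), full)]
    while heap:
        bound, rect_set = heapq.heappop(heap)
        sizes = [hi - lo for lo, hi in rect_set]
        if max(sizes) == 0:
            return -bound, tuple(lo for lo, _ in rect_set)  # (quality, (t, b, l, r))
        # Split the largest interval in half and push both halves.
        i = sizes.index(max(sizes))
        lo, hi = rect_set[i]
        mid = (lo + hi) // 2
        for half in ((lo, mid), (mid + 1, hi)):
            child = rect_set[:i] + (half,) + rect_set[i + 1:]
            heapq.heappush(heap, (-quality_bound(child), child))

# Toy bound: quality is the rectangle area; the bound is the largest achievable area.
def area_bound(rect_set):
    (t_lo, _), (_, b_hi), (l_lo, _), (_, r_hi) = rect_set
    return max(0, b_hi - t_lo + 1) * max(0, r_hi - l_lo + 1)

print(branch_and_bound_search(8, 6, area_bound))
```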

Proceedings Article•DOI•
19 Jul 2009
TL;DR: Reciprocal Rank Fusion is demonstrated by using RRF to combine the results of several TREC experiments, and to build a meta-learner that ranks the LETOR 3 dataset better than any previously reported method.
Abstract: Reciprocal Rank Fusion (RRF), a simple method for combining the document rankings from multiple IR systems, consistently yields better results than any individual system, and better results than the standard method Condorcet Fuse. This result is demonstrated by using RRF to combine the results of several TREC experiments, and to build a meta-learner that ranks the LETOR 3 dataset better than any previously reported method.
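
RRF itself is a one-line scoring rule; the sketch below implements it with the constant k = 60 used in the paper. The document identifiers and rankings are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings of documents into one.

    rankings: list of ranked lists of document ids (best first)
    Each document scores sum over systems of 1 / (k + rank), with ranks starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: three systems disagree, but d2 is ranked well by all of them.
print(reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d1", "d4"], ["d3", "d2", "d1"]]))
```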

Journal Article•DOI•
TL;DR: The authors argue that recent technological and social developments have produced a new kind of device: a programmable mobile phone, the smartphone.
Abstract: Recent developments in mobile technologies have produced a new kind of device: a programmable mobile phone, the smartphone. In this article, the authors argue that the technological and social char...

Patent•
18 May 2009
TL;DR: In this article, a configuration (a system and/or a method) is disclosed that includes a unified and integrated configuration that is composed of a payment system, an advertising system, and an identity management system as well as their associated methods.
Abstract: A configuration (a system and/or a method) is disclosed that includes a unified and integrated configuration that is composed of a payment system, an advertising system, and an identity management system as well as their associated methods, such that the unified system has all of the benefits of the individual systems as well as several additional synergistic benefits. Also described are specific configurations (subsystems and/or methods) including the system's access point architecture, a user interface that acts as a visual wallet simulator, a security architecture, coupon handling as well as the system's structure and means for delivering them as targeted advertising, business card handling, membership card handling for the purposes of login management, receipt handling, and the editors and grammars provided for customizing the different types of objects in the system as well as the creation of new custom objects with custom behaviors. The configurations are operable on-line as well as through physical presence transactions, e.g., mobile transaction through a mobile phone or dedicated device at a physical site for a transaction.

Proceedings Article•DOI•
30 Mar 2009
TL;DR: This report aims to provide a tutorial that is easy to follow for readers who are not already familiar with the subject, make a comprehensive survey and comparisons of different methods, and sketch a vision for future work that can help motivate and guide readers that are interested in texture synthesis research.
Abstract: Recent years have witnessed significant progress in example-based texture synthesis algorithms. Given an example texture, these methods produce a larger texture that is tailored to the user's needs. In this state-of-the-art report, we aim to achieve three goals: (1) provide a tutorial that is easy to follow for readers who are not already familiar with the subject, (2) make a comprehensive survey and comparisons of different methods, and (3) sketch a vision for future work that can help motivate and guide readers that are interested in texture synthesis research. We cover fundamental algorithms as well as extensions and applications of texture synthesis.

Journal Article•DOI•
TL;DR: The proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the weighted kernel k-means objective.
Abstract: Semi-supervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally given as pairwise constraints; such constraints are natural for graphs, yet most semi-supervised clustering algorithms are designed for data represented as vectors. In this paper, we unify vector-based and graph-based approaches. We first show that a recently-proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the weighted kernel k-means objective (Dhillon et al., in Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining, 2004a). A recent theoretical connection between weighted kernel k-means and several graph clustering objectives enables us to perform semi-supervised clustering of data given either as vectors or as a graph. For graph data, this result leads to algorithms for optimizing several new semi-supervised graph clustering objectives. For vector data, the kernel approach also enables us to find clusters with non-linear boundaries in the input data space. Furthermore, we show that recent work on spectral learning (Kamvar et al., in Proceedings of the 17th International Joint Conference on Artificial Intelligence, 2003) may be viewed as a special case of our formulation. We empirically show that our algorithm is able to outperform current state-of-the-art semi-supervised algorithms on both vector-based and graph-based data sets.
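
One way to picture the reduction is to fold pairwise constraints into a kernel matrix and then run ordinary kernel k-means on it. The sketch below does exactly that with a linear kernel, a single constraint weight, and an unweighted kernel k-means loop; these simplifications are assumptions for illustration and do not reproduce the paper's exact construction.

```python
import numpy as np

def constrained_kernel(X, must_link, cannot_link, w=1.0):
    """Linear kernel plus constraint penalties: +w for must-link, -w for cannot-link pairs."""
    K = X @ X.T
    for i, j in must_link:
        K[i, j] += w; K[j, i] += w
    for i, j in cannot_link:
        K[i, j] -= w; K[j, i] -= w
    return K

def kernel_kmeans(K, n_clusters, n_iter=50, seed=0):
    """Kernel k-means: assign each point to the cluster minimizing
    ||phi(x_i) - mean_c||^2 = K_ii - 2*mean_j K_ij + mean_jl K_jl over cluster members."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(n_clusters, size=n)
    for _ in range(n_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            members = labels == c
            if not members.any():
                continue  # empty clusters are skipped
            dist[:, c] = (np.diag(K) - 2 * K[:, members].mean(axis=1)
                          + K[np.ix_(members, members)].mean())
        new_labels = dist.argmin(axis=1)
        if (new_labels == labels).all():
            break
        labels = new_labels
    return labels

# Toy data: two tight groups, plus one must-link and one cannot-link constraint.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
print(kernel_kmeans(constrained_kernel(X, must_link=[(0, 1)], cannot_link=[(1, 2)]), 2))
```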

Proceedings Article•DOI•
20 Jun 2009
TL;DR: This paper leverages the vast amount of multimedia data on the Web, the availability of an Internet image search engine, and advances in object recognition and clustering techniques, to address issues of modeling and recognizing landmarks at world-scale.
Abstract: Modeling and recognizing landmarks at world-scale is a useful yet challenging task. There exists no readily available list of worldwide landmarks. Obtaining reliable visual models for each landmark can also pose problems, and efficiency is another challenge for such a large scale system. This paper leverages the vast amount of multimedia data on the Web, the availability of an Internet image search engine, and advances in object recognition and clustering techniques, to address these issues. First, a comprehensive list of landmarks is mined from two sources: (1) ~20 million GPS-tagged photos and (2) online tour guide Web pages. Candidate images for each landmark are then obtained from photo sharing Websites or by querying an image search engine. Second, landmark visual models are built by pruning candidate images using efficient image matching and unsupervised clustering techniques. Finally, the landmarks and their visual models are validated by checking authorship of their member images. The resulting landmark recognition engine incorporates 5312 landmarks from 1259 cities in 144 countries. The experiments demonstrate that the engine can deliver satisfactory recognition performance with high efficiency.

Proceedings Article•DOI•
12 Dec 2009
TL;DR: This paper presents ThreadSanitizer -- a dynamic detector of data races, and introduces what is called dynamic annotations -- a sort of race detection API that allows a user to inform the detector about any tricky synchronization in the user program.
Abstract: Data races are a particularly unpleasant kind of threading bugs. They are hard to find and reproduce -- you may not observe a bug during the entire testing cycle and will only see it in production as rare unexplainable failures. This paper presents ThreadSanitizer -- a dynamic detector of data races. We describe the hybrid algorithm (based on happens-before and locksets) used in the detector. We introduce what we call dynamic annotations -- a sort of race detection API that allows a user to inform the detector about any tricky synchronization in the user program. Various practical aspects of using ThreadSanitizer for testing multithreaded C++ code at Google are also discussed.

Proceedings Article•DOI•
30 Mar 2009
TL;DR: The results indicate that label propagation improves significantly over the baseline and other semi-supervised learning methods like Mincuts and Randomized Mincuts for this task.
Abstract: We present an extensive study on the problem of detecting polarity of words. We consider the polarity of a word to be either positive or negative. For example, words such as good, beautiful, and wonderful are considered as positive words; whereas words such as bad, ugly, and sad are considered negative words. We treat polarity detection as a semi-supervised label propagation problem in a graph. In the graph, each node represents a word whose polarity is to be determined. Each weighted edge encodes a relation that exists between two words. Each node (word) can have two labels: positive or negative. We study this framework in two different resource availability scenarios using WordNet and OpenOffice thesaurus when WordNet is not available. We report our results on three different languages: English, French, and Hindi. Our results indicate that label propagation improves significantly over the baseline and other semi-supervised learning methods like Mincuts and Randomized Mincuts for this task.
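
A minimal sketch of graph-based label propagation for polarity: seed words are clamped to +1/-1 and scores are repeatedly averaged over neighbors via a row-normalized adjacency matrix. The toy graph, seed words, and weights below are assumptions for illustration; the paper's graphs are built from WordNet or OpenOffice thesaurus relations.

```python
import numpy as np

def propagate_labels(W, seeds, n_iter=100):
    """Propagate polarity scores over a weighted word graph.

    W: (n, n) symmetric non-negative adjacency matrix (e.g. thesaurus relations)
    seeds: dict mapping node index -> +1.0 (positive) or -1.0 (negative)
    Seed nodes are clamped to their labels after every propagation step.
    """
    n = W.shape[0]
    P = W / W.sum(axis=1, keepdims=True)  # row-normalize transition weights
    f = np.zeros(n)
    for i, y in seeds.items():
        f[i] = y
    for _ in range(n_iter):
        f = P @ f
        for i, y in seeds.items():   # clamp the labeled words
            f[i] = y
    return f  # sign(f[i]) is the predicted polarity of word i

# Toy graph: 0 "good" (seed +), 1 "nice", 2 "sad" (seed -), 3 "gloomy"
W = np.array([[0, 1, 0, 0],
              [1, 0, 0.2, 0],
              [0, 0.2, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(propagate_labels(W, {0: 1.0, 2: -1.0}))
```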

Proceedings Article•DOI•
25 Oct 2009
TL;DR: In this article, the authors study the online stochastic bipartite matching problem, in a form motivated by display ad allocation on the Internet, and show that no online algorithm can achieve an approximation ratio better than 0.632.
Abstract: We study the online stochastic bipartite matching problem, in a form motivated by display ad allocation on the Internet. In the online, but adversarial case, the celebrated result of Karp, Vazirani and Vazirani gives an approximation ratio of 1 - 1/e ≈ 0.632, a very familiar bound that holds for many online problems; further, the bound is tight in this case. In the online, stochastic case when nodes are drawn repeatedly from a known distribution, the greedy algorithm matches this approximation ratio, but still, no algorithm is known that beats the 1 - 1/e bound. Our main result is a 0.67-approximation online algorithm for stochastic bipartite matching, breaking this 1 - 1/e barrier. Furthermore, we show that no online algorithm can produce a 1 - ε approximation for an arbitrarily small ε for this problem. Our algorithms are based on computing an optimal offline solution to the expected instance, and using this solution as a guideline in the process of online allocation. We employ a novel application of the idea of the power of two choices from load balancing: we compute two disjoint solutions to the expected instance, and use both of them in the online algorithm in a prescribed preference order. To identify these two disjoint solutions, we solve a max flow problem in a boosted flow graph, and then carefully decompose this maximum flow to two edge-disjoint (near-) matchings. In addition to guiding the online decision making, these two offline solutions are used to characterize an upper bound for the optimum in any scenario. This is done by identifying a cut whose value we can bound under the arrival distribution. At the end, we discuss extensions of our results to more general bipartite allocations that are important in a display ad application.
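
The "power of two choices" allocation step can be sketched as follows, assuming the two edge-disjoint offline solutions to the expected instance have already been computed (the paper obtains them from a max-flow decomposition, which is omitted here). All names and the toy instance are illustrative.

```python
def online_allocate(arrivals, first_choice, second_choice, capacity):
    """Allocate arriving query impressions to ads using two offline solutions.

    arrivals: iterable of query types drawn from the (known) distribution
    first_choice / second_choice: dicts mapping query type -> ad, taken from two
    edge-disjoint solutions to the expected instance (assumed given here)
    capacity: dict mapping ad -> remaining number of impressions it can take
    """
    matched = []
    for q in arrivals:
        # Try the first-choice ad, then fall back to the second choice.
        for choice in (first_choice.get(q), second_choice.get(q)):
            if choice is not None and capacity.get(choice, 0) > 0:
                capacity[choice] -= 1
                matched.append((q, choice))
                break
    return matched

# Illustrative instance: two query types, two ads with one impression slot each.
print(online_allocate(["q1", "q1", "q2"],
                      first_choice={"q1": "adA", "q2": "adB"},
                      second_choice={"q1": "adB"},
                      capacity={"adA": 1, "adB": 1}))
```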

Proceedings Article•
07 Dec 2009
TL;DR: A projection-based gradient descent algorithm is given for solving the optimization problem of learning kernels based on a polynomial combination of base kernels and it is proved that the global solution of this problem always lies on the boundary.
Abstract: This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze this problem in the case of regression and the kernel ridge regression algorithm. We examine the corresponding learning kernel optimization problem, show how that minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in few iterations. Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.
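
For intuition, the sketch below runs projection-based gradient descent on the reduced kernel ridge regression objective F(mu) = y^T (K_mu + lam*I)^{-1} y, using a degree-2 (quadratic) combination K_mu = M ∘ M with M = sum_k mu_k K_k and a simple projection onto the nonnegative orthant. The combination form, the projection set, and the step size are simplifying assumptions, not the paper's exact formulation.

```python
import numpy as np

def learn_quadratic_kernel(base_kernels, y, lam=0.1, eta=0.01, n_iter=200):
    """Projected gradient descent for learning a quadratic combination of base kernels.

    base_kernels: list of (n, n) PSD matrices K_1..K_p
    The combined kernel is K_mu = M * M (elementwise), with M = sum_k mu_k K_k.
    We minimize F(mu) = y^T (K_mu + lam I)^{-1} y, projecting mu onto mu >= 0.
    """
    p, n = len(base_kernels), len(y)
    mu = np.ones(p) / p
    I = np.eye(n)
    for _ in range(n_iter):
        M = sum(m * K for m, K in zip(mu, base_kernels))
        K_mu = M * M
        alpha = np.linalg.solve(K_mu + lam * I, y)
        # dF/dmu_k = -alpha^T (2 M * K_k) alpha  (elementwise products)
        grad = np.array([-alpha @ ((2 * M * K) @ alpha) for K in base_kernels])
        mu = np.maximum(mu - eta * grad, 0.0)  # projection onto the nonnegative orthant
    return mu

# Toy example with two base kernels on four points.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])
K1 = X @ X.T                            # linear kernel
K2 = np.exp(-0.5 * (X - X.T) ** 2)      # RBF kernel
print(learn_quadratic_kernel([K1, K2], y))
```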

Journal Article•DOI•
TL;DR: This paper proposes a simple opportunistic adaptive routing protocol (SOAR) to explicitly support multiple simultaneous flows in wireless mesh networks and shows that SOAR significantly outperforms traditional routing and a seminal opportunistic routing protocol, ExOR, under a wide range of scenarios.
Abstract: Multihop wireless mesh networks are becoming a new attractive communication paradigm owing to their low cost and ease of deployment. Routing protocols are critical to the performance and reliability of wireless mesh networks. Traditional routing protocols send traffic along predetermined paths and face difficulties in coping with unreliable and unpredictable wireless medium. In this paper, we propose a simple opportunistic adaptive routing protocol (SOAR) to explicitly support multiple simultaneous flows in wireless mesh networks. SOAR incorporates the following four major components to achieve high throughput and fairness: 1) adaptive forwarding path selection to leverage path diversity while minimizing duplicate transmissions, 2) priority timer-based forwarding to let only the best forwarding node forward the packet, 3) local loss recovery to efficiently detect and retransmit lost packets, and 4) adaptive rate control to determine an appropriate sending rate according to the current network conditions. We implement SOAR in both NS-2 simulation and an 18-node wireless mesh testbed. Our extensive evaluation shows that SOAR significantly outperforms traditional routing and a seminal opportunistic routing protocol, ExOR, under a wide range of scenarios.

Journal Article•DOI•
01 Aug 2009
TL;DR: This paper describes PLANET: a scalable distributed framework for learning tree models over large datasets, and shows how this framework supports scalable construction of classification and regression trees, as well as ensembles of such models.
Abstract: Classification and regression tree learning on massive datasets is a common data mining task at Google, yet many state-of-the-art tree learning algorithms require training data to reside in memory on a single machine. While more scalable implementations of tree learning have been proposed, they typically require specialized parallel computing architectures. In contrast, the majority of Google's computing infrastructure is based on commodity hardware. In this paper, we describe PLANET: a scalable distributed framework for learning tree models over large datasets. PLANET defines tree learning as a series of distributed computations, and implements each one using the MapReduce model of distributed computation. We show how this framework supports scalable construction of classification and regression trees, as well as ensembles of such models. We discuss the benefits and challenges of using a MapReduce compute cluster for tree learning, and demonstrate the scalability of this approach by applying it to a real world learning task from the domain of computational advertising.
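
The key distributed step can be pictured as a single MapReduce round that finds the best split for one tree node: each mapper emits per-shard sufficient statistics for every candidate split of its data shard, and the reducer aggregates them and picks the split with the lowest impurity. The toy, in-memory sketch below stands in for that round; the shard data, candidate splits, and impurity choice are assumptions for illustration.

```python
from collections import defaultdict

def map_shard(rows, candidate_splits):
    """Mapper: for each candidate (feature, threshold), emit sufficient statistics
    (count, sum, sum of squares of the label) for the left and right partitions."""
    stats = defaultdict(lambda: [0, 0.0, 0.0, 0, 0.0, 0.0])  # left n/s/ss, right n/s/ss
    for x, y in rows:
        for key, (feature, threshold) in enumerate(candidate_splits):
            s = stats[key]
            if x[feature] < threshold:
                s[0] += 1; s[1] += y; s[2] += y * y
            else:
                s[3] += 1; s[4] += y; s[5] += y * y
    return stats

def reduce_best_split(shard_stats, candidate_splits):
    """Reducer: aggregate statistics across shards and pick the split that
    minimizes total squared error (variance-based impurity for regression)."""
    total = defaultdict(lambda: [0, 0.0, 0.0, 0, 0.0, 0.0])
    for stats in shard_stats:
        for key, s in stats.items():
            for i in range(6):
                total[key][i] += s[i]
    def sse(n, s, ss):
        return ss - s * s / n if n else 0.0
    best = min(total, key=lambda k: sse(*total[k][:3]) + sse(*total[k][3:]))
    return candidate_splits[best]

# Two "shards" of (features, label) rows and two candidate splits on feature 0.
shard1 = [([1.0], 1.0), ([2.0], 1.1), ([8.0], 5.0)]
shard2 = [([7.5], 4.8), ([1.5], 0.9)]
splits = [(0, 2.5), (0, 5.0)]
print(reduce_best_split([map_shard(shard1, splits), map_shard(shard2, splits)], splits))
```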

Patent•
David P. Conway1, Andrew E. Rubin1•
16 Oct 2009