
Showing papers by "Google" published in 2006


Proceedings ArticleDOI
06 Nov 2006
TL;DR: Bigtable as discussed by the authors is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google, including web indexing, Google Earth, and Google Finance, store data in Bigtable.
Abstract: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
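
The data model the abstract refers to is, per the full paper, a sparse, sorted, multi-dimensional map from (row key, column, timestamp) to an uninterpreted value. A minimal single-machine sketch in Python, purely for illustration; the real system shards sorted row ranges into tablets across servers:

```python
# Minimal sketch of a Bigtable-style data model: a sparse, sorted map from
# (row key, column, timestamp) to an uninterpreted value. Illustrative only.
from collections import defaultdict

class TinyTable:
    def __init__(self) -> None:
        # row -> column -> list of (timestamp, value), kept newest-first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        cells = self._rows[row][column]
        cells.append((timestamp, value))
        cells.sort(key=lambda c: c[0], reverse=True)

    def get(self, row: str, column: str) -> bytes | None:
        cells = self._rows[row].get(column)
        return cells[0][1] if cells else None  # most recent version

table = TinyTable()
# Row keys such as reversed URLs keep pages from one domain adjacent in sort order.
table.put("com.cnn.www/index.html", "contents:", 3, b"<html>...</html>")
print(table.get("com.cnn.www/index.html", "contents:"))
```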

1,523 citations


Proceedings ArticleDOI
Michael Burrows
06 Nov 2006
TL;DR: The paper describes the initial design and expected use, compares it with actual use, and explains how the design had to be modified to accommodate the differences.
Abstract: We describe our experiences with the Chubby lock service, which is intended to provide coarse-grained locking as well as reliable (though low-volume) storage for a loosely-coupled distributed system. Chubby provides an interface much like a distributed file system with advisory locks, but the design emphasis is on availability and reliability, as opposed to high performance. Many instances of the service have been used for over a year, with several of them each handling a few tens of thousands of clients concurrently. The paper describes the initial design and expected use, compares it with actual use, and explains how the design had to be modified to accommodate the differences.

905 citations


Proceedings ArticleDOI
27 Jun 2006
TL;DR: This paper proposes a complex event language that significantly extends existing event languages to meet the needs of a range of RFID-enabled monitoring applications and describes a query plan-based approach to efficiently implementing this language.
Abstract: In this paper, we present the design, implementation, and evaluation of a system that executes complex event queries over real-time streams of RFID readings encoded as events. These complex event queries filter and correlate events to match specific patterns, and transform the relevant events into new composite events for the use of external monitoring applications. Stream-based execution of these queries enables time-critical actions to be taken in environments such as supply chain management, surveillance and facility management, healthcare, etc. We first propose a complex event language that significantly extends existing event languages to meet the needs of a range of RFID-enabled monitoring applications. We then describe a query plan-based approach to efficiently implementing this language. Our approach uses native operators to efficiently handle query-defined sequences, which are a key component of complex event processing, and pipeline such sequences to subsequent operators that are built by leveraging relational techniques. We also develop a large suite of optimization techniques to address challenges such as large sliding windows and intermediate result sizes. We demonstrate the effectiveness of our approach through a detailed performance analysis of our prototype implementation under a range of data and query workloads as well as through a comparison to a state-of-the-art stream processor.
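
To make the flavor of such queries concrete, here is a toy Python sketch of one canonical pattern from this line of work: items read at a shelf and then at an exit within a time window, with no register reading in between. It is illustrative only, not the paper's query language or its query-plan-based execution engine:

```python
# Toy SEQ-style complex event detection over an RFID event stream, flagging a
# shoplifting-style pattern. Illustrative only; the paper compiles declarative
# pattern queries into optimized operator pipelines.
from dataclasses import dataclass

@dataclass
class Event:
    kind: str      # "SHELF", "REGISTER", or "EXIT"
    tag_id: str
    time: float

def detect_theft(stream: list[Event], window: float) -> list[str]:
    shelf_seen: dict[str, float] = {}
    cleared: set[str] = set()
    alerts: list[str] = []
    for e in sorted(stream, key=lambda ev: ev.time):
        if e.kind == "SHELF":
            shelf_seen[e.tag_id] = e.time
            cleared.discard(e.tag_id)
        elif e.kind == "REGISTER":
            cleared.add(e.tag_id)
        elif e.kind == "EXIT":
            t0 = shelf_seen.get(e.tag_id)
            if t0 is not None and e.tag_id not in cleared and e.time - t0 <= window:
                alerts.append(e.tag_id)
    return alerts

events = [Event("SHELF", "tag1", 0.0), Event("EXIT", "tag1", 5.0),
          Event("SHELF", "tag2", 1.0), Event("REGISTER", "tag2", 2.0),
          Event("EXIT", "tag2", 6.0)]
print(detect_theft(events, window=10.0))  # ['tag1']
```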

902 citations


Proceedings ArticleDOI
23 May 2006
TL;DR: This paper defines a similarity kernel function, mathematically analyzes some of its properties, provides examples of its efficacy, and shows the use of this kernel function in a large-scale system for suggesting related queries to search engine users.
Abstract: Determining the similarity of short text snippets, such as search queries, works poorly with traditional document similarity measures (e.g., cosine), since there are often few, if any, terms in common between two short text snippets. We address this problem by introducing a novel method for measuring the similarity between short text snippets (even those without any overlapping terms) by leveraging web search results to provide greater context for the short texts. In this paper, we define such a similarity kernel function, mathematically analyze some of its properties, and provide examples of its efficacy. We also show the use of this kernel function in a large-scale system for suggesting related queries to search engine users.
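
A rough sketch of the expansion idea follows. The `search_snippets` call is a hypothetical stand-in for a real web search API, and raw term counts stand in for the paper's TF-IDF-weighted vectors:

```python
# Sketch of the expansion idea: represent each short snippet by the centroid
# of vectors built from web search results for it, then compare centroids by
# cosine. `search_snippets` is a hypothetical stand-in for a real search API.
import math
from collections import Counter

def search_snippets(query: str) -> list[str]:
    raise NotImplementedError("stand-in for a web search API call")

def centroid(texts: list[str]) -> Counter:
    c: Counter = Counter()
    for t in texts:
        c.update(t.lower().split())
    return c

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def kernel(q1: str, q2: str) -> float:
    # Two snippets with no terms in common can still score high if their
    # search results share vocabulary.
    return cosine(centroid(search_snippets(q1)), centroid(search_snippets(q2)))
```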

797 citations


Journal ArticleDOI
TL;DR: It is shown that the quantization distortion in diffusion space bounds the error of compression of the operator, thus giving a rigorous justification for k-means clustering in diffusion space and a precise measure of the performance of general clustering algorithms.
Abstract: We provide evidence that nonlinear dimensionality reduction, clustering, and data set parameterization can be solved within one and the same framework. The main idea is to define a system of coordinates with an explicit metric that reflects the connectivity of a given data set and that is robust to noise. Our construction, which is based on a Markov random walk on the data, offers a general scheme of simultaneously reorganizing and subsampling graphs and arbitrarily shaped data sets in high dimensions using intrinsic geometry. We show that clustering in embedding spaces is equivalent to compressing operators. The objective of data partitioning and clustering is to coarse-grain the random walk on the data while at the same time preserving a diffusion operator for the intrinsic geometry or connectivity of the data set up to some accuracy. We show that the quantization distortion in diffusion space bounds the error of compression of the operator, thus giving a rigorous justification for k-means clustering in diffusion space and a precise measure of the performance of general clustering algorithms.
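
For readers unfamiliar with the notation, the standard diffusion-maps background (assumed here, not quoted from the abstract) defines the embedding and the metric it preserves:

```latex
% Eigenpairs (\lambda_i, \psi_i) of the Markov random walk on the data give
% the diffusion embedding at scale t,
\Psi_t(x) = \bigl(\lambda_1^t \psi_1(x),\; \lambda_2^t \psi_2(x),\; \dots,\; \lambda_k^t \psi_k(x)\bigr),
% in which Euclidean distance equals the diffusion distance
D_t^2(x, y) = \sum_{i \ge 1} \lambda_i^{2t} \bigl(\psi_i(x) - \psi_i(y)\bigr)^2 .
```

The abstract's claim is then that the k-means distortion measured in this embedding bounds the error of compressing the diffusion operator.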

711 citations


Proceedings ArticleDOI
01 Sep 2006

650 citations


Proceedings ArticleDOI
03 Dec 2006
TL;DR: In this article, the authors proposed regular expression rewrite techniques that can effectively reduce memory usage and developed a grouping scheme that can strategically compile a set of regular expressions into several engines, resulting in remarkable improvement of regular expression matching speed without much increase in memory usage.
Abstract: Packet content scanning at high speed has become extremely important due to its applications in network security, network monitoring, HTTP load balancing, etc. In content scanning, the packet payload is compared against a set of patterns specified as regular expressions. In this paper, we first show that memory requirements using traditional methods are prohibitively high for many patterns used in packet scanning applications. We then propose regular expression rewrite techniques that can effectively reduce memory usage. Further, we develop a grouping scheme that can strategically compile a set of regular expressions into several engines, resulting in remarkable improvement of regular expression matching speed without much increase in memory usage. We implement a new DFA-based packet scanner using the above techniques. Our experimental results using real-world traffic and patterns show that our implementation achieves a factor of 12 to 42 performance improvement over a commonly used DFA-based scanner. Compared to the state-of-the-art NFA-based implementation, our DFA-based packet scanner achieves 50 to 700 times speedup.
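
A toy rendering of the grouping step follows; `estimate_dfa_states` is a hypothetical stand-in for a real combined-DFA state counter, and the paper's grouping is more strategic than this greedy first-fit sketch:

```python
# Toy sketch of grouping regexes into several DFAs: start a new group whenever
# the combined automaton would exceed a state budget. `estimate_dfa_states` is
# a hypothetical stand-in (e.g., subset construction on the combined NFA).
def estimate_dfa_states(patterns: list[str]) -> int:
    raise NotImplementedError("stand-in for a combined-DFA state counter")

def group_patterns(patterns: list[str], budget: int) -> list[list[str]]:
    groups: list[list[str]] = []
    current: list[str] = []
    for p in patterns:
        if current and estimate_dfa_states(current + [p]) > budget:
            groups.append(current)
            current = []
        current.append(p)
    if current:
        groups.append(current)
    return groups
```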

527 citations


Proceedings ArticleDOI
01 Sep 2006
TL;DR: It is shown that the ULDB representation is complete, and that it permits straightforward implementation of many relational operations, and how ULDBs enable a new approach to query processing in probabilistic databases.
Abstract: This paper introduces ULDBs, an extension of relational databases with simple yet expressive constructs for representing and manipulating both lineage and uncertainty. Uncertain data and data lineage are two important areas of data management that have been considered extensively in isolation; however, many applications require the features in tandem. Fundamentally, lineage enables simple and consistent representation of uncertain data, it correlates uncertainty in query results with uncertainty in the input data, and query processing with lineage and uncertainty together presents computational benefits over treating them separately. We show that the ULDB representation is complete, and that it permits straightforward implementation of many relational operations. We define two notions of ULDB minimality (data-minimal and lineage-minimal) and study minimization of ULDB representations under both notions. With lineage, derived relations are no longer self-contained: their uncertainty depends on uncertainty in the base data. We provide an algorithm for the new operation of extracting a database subset in the presence of interconnected uncertainty. Finally, we show how ULDBs enable a new approach to query processing in probabilistic databases. ULDBs form the basis of the Trio system under development at Stanford.
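
A tiny illustration of the lineage-plus-uncertainty coupling, loosely in the spirit of ULDB x-tuples; these are plain Python dicts, not Trio's actual representation:

```python
# Each x-tuple holds mutually exclusive alternatives, and a derived tuple
# records which base alternatives it came from, so its uncertainty is fully
# determined by the base data. Illustrative data, not Trio's implementation.
saw = {  # x-tuple id -> list of alternatives (exactly one is true)
    "s1": ["witness saw a Honda", "witness saw a Toyota"],
}
drives = {
    "d1": ["Jimmy drives a Toyota"],
}
# Derived tuple with lineage pointing at specific base alternatives:
suspects = {
    "t1": {"value": "Jimmy", "lineage": [("s1", 1), ("d1", 0)]},
}
# t1 is true exactly when alternative 1 of s1 and alternative 0 of d1 are true.
```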

523 citations


Proceedings ArticleDOI
Monika Henzinger
06 Aug 2006
TL;DR: Since Charikar's algorithm finds more near-duplicate pairs on different sites, it achieves a better overall precision than Broder et al.'s algorithm; a combined algorithm is presented which achieves precision 0.79 with 79% of the recall of the other algorithms.
Abstract: Broder et al.'s [3] shingling algorithm and Charikar's [4] random projection based approach are considered "state-of-the-art" algorithms for finding near-duplicate web pages. Both algorithms were either developed at or used by popular web search engines. We compare the two algorithms on a very large scale, namely on a set of 1.6B distinct web pages. The results show that neither of the algorithms works well for finding near-duplicate pairs on the same site, while both achieve high precision for near-duplicate pairs on different sites. Since Charikar's algorithm finds more near-duplicate pairs on different sites, it achieves a better precision overall, namely 0.50 versus 0.38 for Broder et al.'s algorithm. We present a combined algorithm which achieves precision 0.79 with 79% of the recall of the other algorithms.
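
For reference, Charikar's random-projection approach reduces each page to a short fingerprint; near-duplicates then have fingerprints at small Hamming distance. A minimal unweighted Python sketch (the real algorithm weights features):

```python
# Minimal sketch of Charikar's random-projection ("simhash") fingerprint:
# hash each feature to 64 bits, sum +1/-1 per bit position, keep the signs.
import hashlib

def simhash(features: list[str], bits: int = 64) -> int:
    v = [0] * bits
    for f in features:
        h = int.from_bytes(hashlib.md5(f.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

fp1 = simhash("the quick brown fox jumps over the dog".split())
fp2 = simhash("the quick brown fox jumped over the dog".split())
print(hamming(fp1, fp2))  # small for near-duplicate texts
```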

506 citations


Patent
09 May 2006
TL;DR: In this patent, each captured image in a collection that forms at least a portion of a library of images is analyzed to recognize information from its image data, and an index based on the recognized information is generated to enable retrieval of the captured images.
Abstract: An embodiment provides for enabling retrieval of a collection of captured images that form at least a portion of a library of images. For each image in the collection, a captured image may be analyzed to recognize information from image data contained in the captured image, and an index may be generated, where the index data is based on the recognized information. Using the index, functionality such as search and retrieval is enabled. Various recognition techniques, including those that use the face, clothing, apparel, and combinations of characteristics may be utilized. Recognition may be performed on, among other things, persons and text carried on objects.

463 citations


Patent
12 Oct 2006
TL;DR: In this article, a system for ranking geospatial entities is described, which consists of an interface for receiving ranking data about a plurality of entities and an entity ranking module that uses a ranking mechanism to generate place ranks for the entities based on the ranking data.
Abstract: A system for ranking geospatial entities is described. In one embodiment, the system comprises an interface for receiving ranking data about a plurality of geospatial entities and an entity ranking module. The module uses a ranking mechanism to generate place ranks for the geospatial entities based on the ranking data. Ranked entity data generated by the entity ranking module is stored in a database. The entity ranking module may be configured to evaluate a plurality of diverse attributes to determine a total score for a geospatial entity. The entity ranking module may be configured to organize ranked entity data into placemark layers.

Journal ArticleDOI
TL;DR: A generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space-time tradeoff and is asymptotically faster than all previous suffix tree or array construction algorithms.
Abstract: Suffix trees and suffix arrays are widely used and largely interchangeable index structures on strings and sequences. Practitioners prefer suffix arrays due to their simplicity and space efficiency while theoreticians use suffix trees due to linear-time construction algorithms and more explicit structure. We narrow this gap between theory and practice with a simple linear-time construction algorithm for suffix arrays. The simplicity is demonstrated with a C++ implementation of 50 effective lines of code. The algorithm is called DC3, which stems from the central underlying concept of difference cover. This view leads to a generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space-time tradeoff. For any v ∈ [1,√n], it runs in O(vn) time using O(n/√v) space in addition to the input string and the suffix array. We also present variants of the algorithm for several parallel and hierarchical memory models of computation. The algorithms for BSP and EREW-PRAM models are asymptotically faster than all previous suffix tree or array construction algorithms.
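
For orientation, the sketch below builds a suffix array by simple prefix doubling in O(n log² n); it shows what the output structure is, but it is deliberately not the paper's linear-time DC3 algorithm:

```python
# Prefix-doubling suffix array construction, O(n log^2 n). For orientation
# only; NOT the paper's linear-time DC3 algorithm.
def suffix_array(s: str) -> list[int]:
    n = len(s)
    if n < 2:
        return list(range(n))
    rank = [ord(c) for c in s]
    sa = list(range(n))
    k = 1
    while True:
        # Sort suffixes by their first k chars, then the next k chars.
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: done
            break
        k *= 2
    return sa

print(suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```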

Patent
29 Jun 2006
TL;DR: In this patent, a method and apparatus for pilot signal transmission is described, where the set of occupied pilot block sub-carriers changes at least once in a burst.
Abstract: A method and apparatus for pilot signal transmission is disclosed herein. In particular, a pilot transmission scheme is utilized where pilot sub-carrier bandwidth differs from data sub-carrier bandwidth. Because some user's data sub-carriers will no longer have the user's pilot sub-carriers adjacent to them, the set, or pattern of sub-carriers used by the pilot blocks changes at least once in a burst. Changing the pilot block pattern (the set of occupied pilot block sub-carriers) at least once in the burst serves to increase the frequency proximity of occupied data sub-carriers to occupied pilot sub-carriers in the burst.

Patent
27 Mar 2006
TL;DR: In this patent, radio frames are divided into a plurality of subframes, and data is transmitted within the subframes with a frame duration selected from two or more possible frame durations.
Abstract: During operation, radio frames are divided into a plurality of subframes. Data is transmitted over the radio frames within a plurality of subframes, with a frame duration selected from two or more possible frame durations.

Book
31 Dec 2006
TL;DR: A survey of recent methods for creating piecewise linear mappings between triangulations in 3D and simpler domains such as planar regions, simplicial complexes, and spheres is presented in this article.
Abstract: We present a survey of recent methods for creating piecewise linear mappings between triangulations in 3D and simpler domains such as planar regions, simplicial complexes, and spheres. We also discuss emerging tools such as global parameterization, inter-surface mapping, and parameterization with constraints. We start by describing the wide range of applications where parameterization tools have been used in recent years. We then briefly review the pertinent mathematical background and terminology, before proceeding to survey the existing parameterization techniques. Our survey summarizes the main ideas of each technique and discusses its main properties, comparing it to other methods available. Thus it aims to provide guidance to researchers and developers when assessing the suitability of different methods for various applications. This survey focuses on the practical aspects of the methods available, such as time complexity and robustness and shows multiple examples of parameterizations generated using different methods, allowing the reader to visually evaluate and compare the results.

Proceedings ArticleDOI
11 Jun 2006
TL;DR: This work presents a truthful auction for pricing advertising slots on a web-page assuming that advertisements for different merchants must be ranked in decreasing order of their (weighted) bids.
Abstract: We present a truthful auction for pricing advertising slots on a web-page assuming that advertisements for different merchants must be ranked in decreasing order of their (weighted) bids. This captures both the Overture model where bidders are ranked in order of the submitted bids, and the Google model where bidders are ranked in order of the expected revenue (or utility) that their advertisement generates. Assuming separable click-through rates, we prove revenue-equivalence between our auction and the non-truthful next-price auctions currently in use.
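
Under separable click-through rates, the truthful "laddered" pricing from this line of work takes roughly the following form; the notation is mine, so see the paper for the exact statement:

```latex
% Laddered pricing under separable click-through rates
% \theta_1 \ge \dots \ge \theta_K (with \theta_{K+1} = 0), where b_j denotes
% the j-th highest (weighted) bid; the bidder in slot i pays per click
p_i \;=\; \sum_{j=i}^{K} \frac{\theta_j - \theta_{j+1}}{\theta_i}\, b_{j+1} .
```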

Proceedings ArticleDOI
25 Oct 2006
TL;DR: Topology-based Geolocation (TBG) is a novel approach to estimating the geographic location of arbitrary Internet hosts that leverages network topology, along with measurements of network delay, to constrain host position, improving the consistency of location estimates.
Abstract: We present Topology-based Geolocation (TBG), a novel approach to estimating the geographic location of arbitrary Internet hosts. We motivate our work by showing that 1) existing approaches, based on end-to-end delay measurements from a set of landmarks, fail to outperform much simpler techniques, and 2) the error of these approaches is strongly determined by the distance to the nearest landmark, even when triangulation is used to combine estimates from different landmarks. Our approach improves on these earlier techniques by leveraging network topology, along with measurements of network delay, to constrain host position. We convert topology and delay data into a set of constraints, then solve for router and host locations simultaneously. This approach improves the consistency of location estimates, reducing the error substantially for structured networks in our experiments on Abilene and Sprint. For networks with insufficient structural constraints, our techniques integrate external hints that are validated using measurements before being trusted. Together, these techniques lower the median estimation error for our university-based dataset to 67 km vs. 228 km for the best previous approach.
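
One ingredient is easy to state concretely: a measured delay bounds distance, since light in fiber travels at roughly two-thirds of its vacuum speed. A back-of-the-envelope sketch of that single constraint (the paper combines many such constraints with topology and solves for all positions jointly):

```python
# Toy delay-based position constraint: an RTT caps how far apart two hosts
# can be, given signal propagation at ~2/3 the vacuum speed of light in fiber.
C_FIBER_KM_PER_MS = 299_792.458 / 1000 * (2 / 3)  # ~200 km per millisecond

def max_distance_km(rtt_ms: float) -> float:
    return (rtt_ms / 2) * C_FIBER_KM_PER_MS  # one-way delay times fiber speed

print(round(max_distance_km(20.0)))  # a 20 ms RTT caps separation near 2000 km
```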

Proceedings Article
16 Jul 2006
TL;DR: A series of experiments with different machine learning algorithms is discussed in order to experimentally evaluate various trade-offs, using approximately 100K product reviews from the web.
Abstract: Evaluating text fragments for positive and negative subjective expressions and their strength can be important in applications such as single- or multi- document summarization, document ranking, data mining, etc. This paper looks at a simplified version of the problem: classifying online product reviews into positive and negative classes. We discuss a series of experiments with different machine learning algorithms in order to experimentally evaluate various trade-offs, using approximately 100K product reviews from the web.
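
A minimal sketch of one experimental setup in this vein, using scikit-learn with inline dummy reviews; the paper's actual experiments compare several learners over roughly 100K web reviews:

```python
# Bag-of-words sentiment classifier for product reviews. Dummy inline data;
# illustrative of the task, not the paper's exact feature set or learners.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = ["great product, works perfectly", "terrible, broke after a day",
           "love it, highly recommend", "awful quality, do not buy"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(reviews, labels)
print(clf.predict(["works great, recommend it"]))  # likely [1]
```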

Proceedings ArticleDOI
26 Jun 2006
TL;DR: This is the first set of algorithms for the anonymization problem where the performance is independent of the anonymity parameter k, and extends the algorithms to allow an ε fraction of points to remain unclustered, i.e., deleted from the anonymized publication.
Abstract: Publishing data for analysis from a table containing personal records, while maintaining individual privacy, is a problem of increasing importance today. The traditional approach of de-identifying records is to remove identifying fields such as social security number, name etc. However, recent research has shown that a large fraction of the US population can be identified using non-key attributes (called quasi-identifiers) such as date of birth, gender, and zip code [15]. Sweeney [16] proposed the k-anonymity model for privacy where non-key attributes that leak information are suppressed or generalized so that, for every record in the modified table, there are at least k−1 other records having exactly the same values for quasi-identifiers. We propose a new method for anonymizing data records, where quasi-identifiers of data records are first clustered and then cluster centers are published. To ensure privacy of the data records, we impose the constraint that each cluster must contain no fewer than a pre-specified number of data records. This technique is more general since we have a much larger choice for cluster centers than k-Anonymity. In many cases, it lets us release a lot more information without compromising privacy. We also provide constant-factor approximation algorithms to come up with such a clustering. This is the first set of algorithms for the anonymization problem where the performance is independent of the anonymity parameter k. We further observe that a few outlier points can significantly increase the cost of anonymization. Hence, we extend our algorithms to allow an ε fraction of points to remain unclustered, i.e., deleted from the anonymized publication. Thus, by not releasing a small fraction of the database records, we can ensure that the data published for analysis has less distortion and hence is more useful. Our approximation algorithms for new clustering objectives are of independent interest and could be applicable in other clustering scenarios as well.
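
A toy rendering of the core idea, publishing per-cluster centers with a minimum cluster size r; the greedy chunking below is only illustrative, not the paper's constant-factor approximation algorithms:

```python
# Publish cluster centers instead of raw records, with every cluster forced
# to hold at least r records. Greedy chunking for illustration only.
def anonymize(points: list[tuple[float, ...]], r: int) -> list[tuple[float, ...]]:
    pts = sorted(points)  # crude ordering so nearby quasi-identifiers group together
    groups = [pts[i:i + r] for i in range(0, len(pts), r)]
    if len(groups) > 1 and len(groups[-1]) < r:
        groups[-2].extend(groups.pop())  # fold a short remainder into the previous group
    return [tuple(sum(col) / len(g) for col in zip(*g)) for g in groups]

# Records reduced to quasi-identifiers (birth year, zip code):
quasi_ids = [(1960.0, 94040.0), (1961.0, 94041.0), (1980.0, 10001.0), (1981.0, 10002.0)]
print(anonymize(quasi_ids, r=2))  # two centers, each standing in for >= 2 records
```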

Journal ArticleDOI
TL;DR: This paper presents the Laplace-Beltrami approach for computing density invariant embeddings which are essential for integrating different sources of data and describes a refinement of the Nyström extension algorithm called "geometric harmonics."
Abstract: Data fusion and multicue data matching are fundamental tasks of high-dimensional data analysis. In this paper, we apply the recently introduced diffusion framework to address these tasks. Our contribution is three-fold: first, we present the Laplace-Beltrami approach for computing density invariant embeddings which are essential for integrating different sources of data. Second, we describe a refinement of the Nyström extension algorithm called "geometric harmonics." We also explain how to use this tool for data assimilation. Finally, we introduce a multicue data matching scheme based on nonlinear spectral graph alignment. The effectiveness of the presented schemes is validated by applying them to the problems of lipreading and image sequence alignment.
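
The density-invariant normalization referred to here is standard in the diffusion framework (background notation, not a formula quoted from the paper): estimate the sampling density from the kernel, divide it out, and only then build the Markov kernel, so the embedding reflects geometry rather than how the data were sampled:

```latex
q(x) = \sum_{y} k(x, y), \qquad
\tilde{k}(x, y) = \frac{k(x, y)}{q(x)\, q(y)}, \qquad
p(x, y) = \frac{\tilde{k}(x, y)}{\sum_{z} \tilde{k}(x, z)} .
```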

Patent
Hartmut Neven
12 May 2006
TL;DR: In this patent, an image-based information retrieval system is presented, including a mobile telephone with a built-in camera, a remote recognition server that matches transmitted images against object representations in a database, and a remote media server that returns mobile media content based on the match.
Abstract: An image-based information retrieval system, including a mobile telephone, a remote recognition server, and a remote media server, the mobile telephone having a built-in camera and a communication link for transmitting an image from the built-in camera to the remote recognition server and for receiving mobile media content from the remote media server, the remote recognition server for matching an image from the mobile telephone with an object representation in a database and forwarding an associated text identifier to the remote media server, and the remote media server for forwarding mobile media content to the mobile telephone based on the associated text identifier.

Proceedings ArticleDOI
26 Jun 2006
TL;DR: This paper lays out specific technical challenges to realizing DSSPs, focusing on query answering, the DSSP's ability to introspect on its content, and the use of human attention to enhance the semantic relationships in a dataspace.
Abstract: The most acute information management challenges today stem from organizations relying on a large number of diverse, interrelated data sources, but having no means of managing them in a convenient, integrated, or principled fashion. These challenges arise in enterprise and government data management, digital libraries, "smart" homes and personal information management. We have proposed dataspaces as a data management abstraction for these diverse applications and DataSpace Support Platforms (DSSPs) as systems that should be built to provide the required services over dataspaces. Unlike data integration systems, DSSPs do not require full semantic integration of the sources in order to provide useful services. This paper lays out specific technical challenges to realizing DSSPs and ties them to existing work in our field. We focus on query answering in DSSPs, the DSSP's ability to introspect on its content, and the use of human attention to enhance the semantic relationships in a dataspace.

Journal ArticleDOI
11 Aug 2006
TL;DR: This paper designs and analyzes techniques to increase "persistence" of sensed data, so that data is more likely to reach a data sink even as network nodes fail, by replicating data compactly at neighboring nodes using novel "Growth Codes" that increase in efficiency as data accumulates at the sink.
Abstract: Sensor networks are especially useful in catastrophic or emergency scenarios such as floods, fires, terrorist attacks or earthquakes where human participation may be too dangerous. However, such disaster scenarios pose an interesting design challenge since the sensor nodes used to collect and communicate data may themselves fail suddenly and unpredictably, resulting in the loss of valuable data. Furthermore, because these networks are often expected to be deployed in response to a disaster, or because of sudden configuration changes due to failure, these networks are often expected to operate in a "zero-configuration" paradigm, where data collection and transmission must be initiated immediately, before the nodes have a chance to assess the current network topology. In this paper, we design and analyze techniques to increase "persistence" of sensed data, so that data is more likely to reach a data sink, even as network nodes fail. This is done by replicating data compactly at neighboring nodes using novel "Growth Codes" that increase in efficiency as data accumulates at the sink. We show that Growth Codes preserve more data in the presence of node failures than previously proposed erasure resilient techniques.
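
A toy sketch of the degree-growth idea: codewords are XORs of d data symbols, and d grows over time, because degree-1 codewords are most useful to a sink that knows little, while higher degrees help once most symbols are recovered. The two-step degree schedule below is made up for illustration; the paper derives the optimal switch points:

```python
# Toy Growth Codes: XOR-based codewords whose degree grows as the sink
# recovers more symbols. Degree schedule is illustrative only.
import random

def encode(symbols: list[int], recovered_so_far: int) -> tuple[set[int], int]:
    n = len(symbols)
    degree = 1 if recovered_so_far < n // 2 else 2  # made-up schedule
    idx = set(random.sample(range(n), degree))
    word = 0
    for i in idx:
        word ^= symbols[i]
    return idx, word

def try_decode(known: dict[int, int], idx: set[int], word: int) -> None:
    unknown = [i for i in idx if i not in known]
    if len(unknown) == 1:  # peel: all but one constituent already known
        for i in idx - {unknown[0]}:
            word ^= known[i]
        known[unknown[0]] = word

data = [7, 42, 99, 13]
known: dict[int, int] = {}
while len(known) < len(data):
    idx, word = encode(data, len(known))
    try_decode(known, idx, word)
print(known)  # all symbols recovered
```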

Proceedings ArticleDOI
11 Dec 2006
TL;DR: This work proposes Address Space Layout Permutation (ASLP), which introduces a high degree of randomness (or high entropy) with minimal performance overhead; the Linux operating system kernel is also modified to permute stack, heap, and memory-mapped regions.
Abstract: Address space randomization is an emerging and promising method for stopping a broad range of memory corruption attacks. By randomly shifting critical memory regions at process initialization time, address space randomization converts an otherwise successful malicious attack into a benign process crash. However, existing approaches either introduce insufficient randomness, or require source code modification. While insufficient randomness allows successful brute-force attacks, as shown in recent studies, the required source code modification prevents this effective method from being used for commodity software, which is the major source of exploited vulnerabilities on the Internet. We propose Address Space Layout Permutation (ASLP), which introduces a high degree of randomness (or high entropy) with minimal performance overhead. Essential to ASLP is a novel binary rewriting tool that can place the static code and data segments of a compiled executable at a randomly specified location and perform fine-grained permutation of procedure bodies in the code segment as well as static data objects in the data segment. We have also modified the Linux operating system kernel to permute stack, heap, and memory-mapped regions. Together, ASLP completely permutes memory regions in an application. Our security and performance evaluation shows minimal performance overhead with orders of magnitude improvement in randomness (e.g., up to 29 bits of randomness on a 32-bit architecture).

Proceedings ArticleDOI
Maryam Kamvar, Shumeet Baluja
22 Apr 2006
TL;DR: The goal is to understand the current state of wireless search by analyzing over 1 Million hits to Google's mobile search sites, which includes the examination of search queries and the general categories under which they fall.
Abstract: We present a large scale study of search patterns on Google's mobile search interface. Our goal is to understand the current state of wireless search by analyzing over 1 Million hits to Google's mobile search sites. Our study also includes the examination of search queries and the general categories under which they fall. We follow users throughout multiple interactions to determine search behavior; we estimate how long they spend inputting a query, viewing the search results, and how often they click on a search result. We also compare and contrast search patterns between 12-key keypad phones (cellphones), phones with QWERTY keyboards (PDAs) and conventional computers.

Patent
13 Oct 2006
TL;DR: In this article, a method of operating a voice-enabled business directory search system includes receiving category-business pairs, each category business pair including a business category and a specific business, and establishing a data structure having nodes based on the category business pairs Each node of the data structure is associated with one or more business categories and a speech recognition language model for recognizing specific businesses associated with the one or multiple businesses categories.
Abstract: A method of operating a voice-enabled business directory search system includes receiving category-business pairs, each category-business pair including a business category and a specific business, and establishing a data structure having nodes based on the category-business pairs. Each node of the data structure is associated with one or more business categories and a speech recognition language model for recognizing specific businesses associated with the one or more business categories.

Patent
Chikusa Takashi, Hori Masanori, Tachibana Toshio, Maki Takehiro, Honma Hirotaka
23 Feb 2006
TL;DR: In this patent, for a disk array device that cools its sections, including the HDDs, with a fan, a drop in cooling efficiency caused by differences between HDD-mounted and unmounted sections is prevented, and the use of dummy HDDs is eliminated.
Abstract: In a disk array device that cools respective sections, including the HDDs, by a fan, a drop in cooling efficiency caused by the differing conditions of HDD-mounted and unmounted sections is prevented, and the use of dummy HDDs is eliminated. A housing contains HDAs carrying the HDDs, a power controller board for controlling the HDDs, a power unit for supplying power to each section, a fan for cooling the interior of the housing, and a backboard for connecting all the sections. The HDAs are mounted on one backboard surface, which provides a cooling function by drawing cooling air through the housing and exhausting it via the region on which the HDAs are mounted and via a vent hole in the backboard. The vent hole is fitted with a shutter having a mechanism that adjusts the vent hole's open area rate, opening when an HDA is mounted and closing when it is removed.

Patent
Grossman Steven, Joshy Joseph, Bill Kilday, Nguyen Giao, Dominic Preuss, Sridhar Ramaswamy
08 Dec 2006
TL;DR: In response to a query for information in a geographic region or at a location, ranked ads may be plotted on, or in association with, a map (e.g., as a list beside the map), satellite photo, or any other form of visual representation of geographic information.
Abstract: In response to a query for information in a geographic region or at a location, ranked ads may be plotted on, or in association with, a map (e.g., as a list beside the map), satellite photo, or any other form of visual representation of geographic information (referred to generally as "maps"). Sponsored ads might be shown in a dedicated place and/or might be elevated above other non-sponsored search results (e.g., Yellow Page listings). The number of ads shown in the list and/or plotted on the map could vary as a function of the resolution of the map or geographic image. The ads could be ranked or scored, and attributes or features of various ads may be a function of such a score or ranking. The plots on the map might be selectable to provide a pop-up with further information and possibly sponsored information (such as images, further ads, etc.).

Patent
27 Jun 2006
TL;DR: In this paper, a computer system may include a connecting hub having a plurality of docking regions and be configured to provide to each docking region electrical power, a data network interface, cooling fluid supply and a cooling fluid return.
Abstract: A computer system may include a connecting hub having a plurality of docking regions and be configured to provide to each docking region electrical power, a data network interface, a cooling fluid supply and a cooling fluid return; and a plurality of shipping containers that each enclose a modular computing environment that incrementally adds computing power to the system. Each shipping container may include a) a plurality of processing units coupled to the data network interface, each of which include a microprocessor; b) a heat exchanger configured to remove heat generated by the plurality of processing units by circulating cooling fluid from the supply through the heat exchanger and discharging it into the return; and c) docking members configured to releasably couple to the connecting hub at one of the docking regions to receive electrical power, connect to the data network interface, and receive and discharge cooling fluid.

Proceedings ArticleDOI
18 Dec 2006
TL;DR: This paper introduces an adaptive framework for automatically learning blocking functions that are efficient and accurate, and describes two predicate-based formulations of learnable blocking functions and provides learning algorithms for training them.
Abstract: Many data mining tasks require computing similarity between pairs of objects. Pairwise similarity computations are particularly important in record linkage systems, as well as in clustering and schema mapping algorithms. Because the number of object pairs grows quadratically with the size of the dataset, computing similarity between all pairs is impractical and becomes prohibitive for large datasets and complex similarity functions. Blocking methods alleviate this problem by efficiently selecting approximately similar object pairs for subsequent distance computations, leaving out the remaining pairs as dissimilar. Previously proposed blocking methods require manually constructing an index-based similarity function or selecting a set of predicates, followed by hand-tuning of parameters. In this paper, we introduce an adaptive framework for automatically learning blocking functions that are efficient and accurate. We describe two predicate-based formulations of learnable blocking functions and provide learning algorithms for training them. The effectiveness of the proposed techniques is demonstrated on real and simulated datasets, on which they prove to be more accurate than non-adaptive blocking methods.
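
A minimal sketch of predicate-based blocking, with hand-picked illustrative predicates; the paper's contribution is learning which predicate combination maximizes coverage of true matches while minimizing candidate pairs:

```python
# Predicate-based blocking: each predicate maps a record to a key, and only
# records sharing a key under some predicate become candidate pairs.
# The predicates here are illustrative, not learned as in the paper.
from itertools import combinations

records = [
    {"name": "Jon Smith", "zip": "94040"},
    {"name": "John Smith", "zip": "94040"},
    {"name": "Ann Lee", "zip": "10001"},
]

predicates = [
    lambda r: r["zip"],                  # same zip code
    lambda r: r["name"].split()[-1][:4]  # same last-name prefix
]

candidates = set()
for pred in predicates:
    blocks: dict[str, list[int]] = {}
    for i, r in enumerate(records):
        blocks.setdefault(pred(r), []).append(i)
    for ids in blocks.values():
        candidates.update(combinations(ids, 2))

print(candidates)  # {(0, 1)}: only the plausible match gets compared
```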