
Showing papers by "Google" published in 2006


Proceedings ArticleDOI
06 Nov 2006
TL;DR: Bigtable as discussed by the authors is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google, including web indexing, Google Earth, and Google Finance, store data in Bigtable.
Abstract: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
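
The data model the abstract refers to is, per the full paper, a sparse, sorted, multi-dimensional map from (row key, column, timestamp) to an uninterpreted value. A minimal single-machine sketch in Python, purely for illustration; the real system shards sorted row ranges into tablets across servers:

```python
# Minimal sketch of a Bigtable-style data model: a sparse, sorted map from
# (row key, column, timestamp) to an uninterpreted value. Illustrative only.
from collections import defaultdict

class TinyTable:
    def __init__(self) -> None:
        # row -> column -> list of (timestamp, value), kept newest-first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        cells = self._rows[row][column]
        cells.append((timestamp, value))
        cells.sort(key=lambda c: c[0], reverse=True)

    def get(self, row: str, column: str) -> bytes | None:
        cells = self._rows[row].get(column)
        return cells[0][1] if cells else None  # most recent version

table = TinyTable()
# Row keys such as reversed URLs keep pages from one domain adjacent in sort order.
table.put("com.cnn.www/index.html", "contents:", 3, b"<html>...</html>")
print(table.get("com.cnn.www/index.html", "contents:"))
```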

1,523 citations


Proceedings ArticleDOI
Michael Burrows
06 Nov 2006
TL;DR: The paper describes the initial design and expected use, compares it with actual use, and explains how the design had to be modified to accommodate the differences.
Abstract: We describe our experiences with the Chubby lock service, which is intended to provide coarse-grained locking as well as reliable (though low-volume) storage for a loosely-coupled distributed system. Chubby provides an interface much like a distributed file system with advisory locks, but the design emphasis is on availability and reliability, as opposed to high performance. Many instances of the service have been used for over a year, with several of them each handling a few tens of thousands of clients concurrently. The paper describes the initial design and expected use, compares it with actual use, and explains how the design had to be modified to accommodate the differences.

905 citations


Proceedings ArticleDOI
27 Jun 2006
TL;DR: This paper proposes a complex event language that significantly extends existing event languages to meet the needs of a range of RFID-enabled monitoring applications and describes a query plan-based approach to efficiently implementing this language.
Abstract: In this paper, we present the design, implementation, and evaluation of a system that executes complex event queries over real-time streams of RFID readings encoded as events. These complex event queries filter and correlate events to match specific patterns, and transform the relevant events into new composite events for the use of external monitoring applications. Stream-based execution of these queries enables time-critical actions to be taken in environments such as supply chain management, surveillance and facility management, healthcare, etc. We first propose a complex event language that significantly extends existing event languages to meet the needs of a range of RFID-enabled monitoring applications. We then describe a query plan-based approach to efficiently implementing this language. Our approach uses native operators to efficiently handle query-defined sequences, which are a key component of complex event processing, and pipeline such sequences to subsequent operators that are built by leveraging relational techniques. We also develop a large suite of optimization techniques to address challenges such as large sliding windows and intermediate result sizes. We demonstrate the effectiveness of our approach through a detailed performance analysis of our prototype implementation under a range of data and query workloads as well as through a comparison to a state-of-the-art stream processor.
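
To make the flavor of such queries concrete, here is a toy Python sketch of one canonical pattern from this line of work: items read at a shelf and then at an exit within a time window, with no register reading in between. It is illustrative only, not the paper's query language or its query-plan-based execution engine:

```python
# Toy SEQ-style complex event detection over an RFID event stream, flagging a
# shoplifting-style pattern. Illustrative only; the paper compiles declarative
# pattern queries into optimized operator pipelines.
from dataclasses import dataclass

@dataclass
class Event:
    kind: str      # "SHELF", "REGISTER", or "EXIT"
    tag_id: str
    time: float

def detect_theft(stream: list[Event], window: float) -> list[str]:
    shelf_seen: dict[str, float] = {}
    cleared: set[str] = set()
    alerts: list[str] = []
    for e in sorted(stream, key=lambda ev: ev.time):
        if e.kind == "SHELF":
            shelf_seen[e.tag_id] = e.time
            cleared.discard(e.tag_id)
        elif e.kind == "REGISTER":
            cleared.add(e.tag_id)
        elif e.kind == "EXIT":
            t0 = shelf_seen.get(e.tag_id)
            if t0 is not None and e.tag_id not in cleared and e.time - t0 <= window:
                alerts.append(e.tag_id)
    return alerts

events = [Event("SHELF", "tag1", 0.0), Event("EXIT", "tag1", 5.0),
          Event("SHELF", "tag2", 1.0), Event("REGISTER", "tag2", 2.0),
          Event("EXIT", "tag2", 6.0)]
print(detect_theft(events, window=10.0))  # ['tag1']
```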

902 citations


Proceedings ArticleDOI
23 May 2006
TL;DR: This paper defines a similarity kernel function, mathematically analyzes some of its properties, provides examples of its efficacy, and shows the use of this kernel function in a large-scale system for suggesting related queries to search engine users.
Abstract: Determining the similarity of short text snippets, such as search queries, works poorly with traditional document similarity measures (e.g., cosine), since there are often few, if any, terms in common between two short text snippets. We address this problem by introducing a novel method for measuring the similarity between short text snippets (even those without any overlapping terms) by leveraging web search results to provide greater context for the short texts. In this paper, we define such a similarity kernel function, mathematically analyze some of its properties, and provide examples of its efficacy. We also show the use of this kernel function in a large-scale system for suggesting related queries to search engine users.
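
A rough sketch of the expansion idea follows. The `search_snippets` call is a hypothetical stand-in for a real web search API, and raw term counts stand in for the paper's TF-IDF-weighted vectors:

```python
# Sketch of the expansion idea: represent each short snippet by the centroid
# of vectors built from web search results for it, then compare centroids by
# cosine. `search_snippets` is a hypothetical stand-in for a real search API.
import math
from collections import Counter

def search_snippets(query: str) -> list[str]:
    raise NotImplementedError("stand-in for a web search API call")

def centroid(texts: list[str]) -> Counter:
    c: Counter = Counter()
    for t in texts:
        c.update(t.lower().split())
    return c

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def kernel(q1: str, q2: str) -> float:
    # Two snippets with no terms in common can still score high if their
    # search results share vocabulary.
    return cosine(centroid(search_snippets(q1)), centroid(search_snippets(q2)))
```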

797 citations


Journal ArticleDOI
TL;DR: It is shown that the quantization distortion in diffusion space bounds the error of compression of the operator, thus giving a rigorous justification for k-means clustering in diffusion space and a precise measure of the performance of general clustering algorithms.
Abstract: We provide evidence that nonlinear dimensionality reduction, clustering, and data set parameterization can be solved within one and the same framework. The main idea is to define a system of coordinates with an explicit metric that reflects the connectivity of a given data set and that is robust to noise. Our construction, which is based on a Markov random walk on the data, offers a general scheme of simultaneously reorganizing and subsampling graphs and arbitrarily shaped data sets in high dimensions using intrinsic geometry. We show that clustering in embedding spaces is equivalent to compressing operators. The objective of data partitioning and clustering is to coarse-grain the random walk on the data while at the same time preserving a diffusion operator for the intrinsic geometry or connectivity of the data set up to some accuracy. We show that the quantization distortion in diffusion space bounds the error of compression of the operator, thus giving a rigorous justification for k-means clustering in diffusion space and a precise measure of the performance of general clustering algorithms.
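
For readers unfamiliar with the notation, the standard diffusion-maps background (assumed here, not quoted from the abstract) defines the embedding and the metric it preserves:

```latex
% Eigenpairs (\lambda_i, \psi_i) of the Markov random walk on the data give
% the diffusion embedding at scale t,
\Psi_t(x) = \bigl(\lambda_1^t \psi_1(x),\; \lambda_2^t \psi_2(x),\; \dots,\; \lambda_k^t \psi_k(x)\bigr),
% in which Euclidean distance equals the diffusion distance
D_t^2(x, y) = \sum_{i \ge 1} \lambda_i^{2t} \bigl(\psi_i(x) - \psi_i(y)\bigr)^2 .
```

The abstract's claim is then that the k-means distortion measured in this embedding bounds the error of compressing the diffusion operator.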

711 citations


Proceedings ArticleDOI
01 Sep 2006

650 citations


Proceedings ArticleDOI
03 Dec 2006
TL;DR: In this article, the authors proposed regular expression rewrite techniques that can effectively reduce memory usage and developed a grouping scheme that can strategically compile a set of regular expressions into several engines, resulting in remarkable improvement of regular expression matching speed without much increase in memory usage.
Abstract: Packet content scanning at high speed has become extremely important due to its applications in network security, network monitoring, HTTP load balancing, etc. In content scanning, the packet payload is compared against a set of patterns specified as regular expressions. In this paper, we first show that memory requirements using traditional methods are prohibitively high for many patterns used in packet scanning applications. We then propose regular expression rewrite techniques that can effectively reduce memory usage. Further, we develop a grouping scheme that can strategically compile a set of regular expressions into several engines, resulting in remarkable improvement of regular expression matching speed without much increase in memory usage. We implement a new DFA-based packet scanner using the above techniques. Our experimental results using real-world traffic and patterns show that our implementation achieves a factor of 12 to 42 performance improvement over a commonly used DFA-based scanner. Compared to the state-of-the-art NFA-based implementation, our DFA-based packet scanner achieves 50 to 700 times speedup.
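
A toy rendering of the grouping step follows; `estimate_dfa_states` is a hypothetical stand-in for a real combined-DFA state counter, and the paper's grouping is more strategic than this greedy first-fit sketch:

```python
# Toy sketch of grouping regexes into several DFAs: start a new group whenever
# the combined automaton would exceed a state budget. `estimate_dfa_states` is
# a hypothetical stand-in (e.g., subset construction on the combined NFA).
def estimate_dfa_states(patterns: list[str]) -> int:
    raise NotImplementedError("stand-in for a combined-DFA state counter")

def group_patterns(patterns: list[str], budget: int) -> list[list[str]]:
    groups: list[list[str]] = []
    current: list[str] = []
    for p in patterns:
        if current and estimate_dfa_states(current + [p]) > budget:
            groups.append(current)
            current = []
        current.append(p)
    if current:
        groups.append(current)
    return groups
```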

527 citations


Proceedings ArticleDOI
01 Sep 2006
TL;DR: It is shown that the ULDB representation is complete, and that it permits straightforward implementation of many relational operations, and how ULDBs enable a new approach to query processing in probabilistic databases.
Abstract: This paper introduces ULDBs, an extension of relational databases with simple yet expressive constructs for representing and manipulating both lineage and uncertainty. Uncertain data and data lineage are two important areas of data management that have been considered extensively in isolation; however, many applications require the features in tandem. Fundamentally, lineage enables simple and consistent representation of uncertain data, it correlates uncertainty in query results with uncertainty in the input data, and query processing with lineage and uncertainty together presents computational benefits over treating them separately. We show that the ULDB representation is complete, and that it permits straightforward implementation of many relational operations. We define two notions of ULDB minimality (data-minimal and lineage-minimal) and study minimization of ULDB representations under both notions. With lineage, derived relations are no longer self-contained: their uncertainty depends on uncertainty in the base data. We provide an algorithm for the new operation of extracting a database subset in the presence of interconnected uncertainty. Finally, we show how ULDBs enable a new approach to query processing in probabilistic databases. ULDBs form the basis of the Trio system under development at Stanford.
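
A tiny illustration of the lineage-plus-uncertainty coupling, loosely in the spirit of ULDB x-tuples; these are plain Python dicts, not Trio's actual representation:

```python
# Each x-tuple holds mutually exclusive alternatives, and a derived tuple
# records which base alternatives it came from, so its uncertainty is fully
# determined by the base data. Illustrative data, not Trio's implementation.
saw = {  # x-tuple id -> list of alternatives (exactly one is true)
    "s1": ["witness saw a Honda", "witness saw a Toyota"],
}
drives = {
    "d1": ["Jimmy drives a Toyota"],
}
# Derived tuple with lineage pointing at specific base alternatives:
suspects = {
    "t1": {"value": "Jimmy", "lineage": [("s1", 1), ("d1", 0)]},
}
# t1 is true exactly when alternative 1 of s1 and alternative 0 of d1 are true.
```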

523 citations


Proceedings ArticleDOI
Monika Henzinger
06 Aug 2006
TL;DR: Since Charikar's algorithm finds more near-duplicate pairs on different sites, it achieves a better overall precision than Broder et al.'s algorithm; a combined algorithm is presented which achieves precision 0.79 with 79% of the recall of the other algorithms.
Abstract: Broder et al.'s [3] shingling algorithm and Charikar's [4] random projection based approach are considered "state-of-the-art" algorithms for finding near-duplicate web pages. Both algorithms were either developed at or used by popular web search engines. We compare the two algorithms on a very large scale, namely on a set of 1.6B distinct web pages. The results show that neither of the algorithms works well for finding near-duplicate pairs on the same site, while both achieve high precision for near-duplicate pairs on different sites. Since Charikar's algorithm finds more near-duplicate pairs on different sites, it achieves a better precision overall, namely 0.50 versus 0.38 for Broder et al.'s algorithm. We present a combined algorithm which achieves precision 0.79 with 79% of the recall of the other algorithms.
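
For reference, Charikar's random-projection approach reduces each page to a short fingerprint; near-duplicates then have fingerprints at small Hamming distance. A minimal unweighted Python sketch (the real algorithm weights features):

```python
# Minimal sketch of Charikar's random-projection ("simhash") fingerprint:
# hash each feature to 64 bits, sum +1/-1 per bit position, keep the signs.
import hashlib

def simhash(features: list[str], bits: int = 64) -> int:
    v = [0] * bits
    for f in features:
        h = int.from_bytes(hashlib.md5(f.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

fp1 = simhash("the quick brown fox jumps over the dog".split())
fp2 = simhash("the quick brown fox jumped over the dog".split())
print(hamming(fp1, fp2))  # small for near-duplicate texts
```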

506 citations


Patent
09 May 2006
TL;DR: In this patent, each captured image in a collection that forms at least a portion of a library of images is analyzed to recognize information from its image data, and an index based on the recognized information is generated to enable retrieval of the captured images.
Abstract: An embodiment provides for enabling retrieval of a collection of captured images that form at least a portion of a library of images. For each image in the collection, a captured image may be analyzed to recognize information from image data contained in the captured image, and an index may be generated, where the index data is based on the recognized information. Using the index, functionality such as search and retrieval is enabled. Various recognition techniques, including those that use the face, clothing, apparel, and combinations of characteristics may be utilized. Recognition may be performed on, among other things, persons and text carried on objects.

463 citations


Patent
12 Oct 2006
TL;DR: In this article, a system for ranking geospatial entities is described, which consists of an interface for receiving ranking data about a plurality of entities and an entity ranking module that uses a ranking mechanism to generate place ranks for the entities based on the ranking data.
Abstract: A system for ranking geospatial entities is described. In one embodiment, the system comprises an interface for receiving ranking data about a plurality of geospatial entities and an entity ranking module. The module uses a ranking mechanism to generate place ranks for the geospatial entities based on the ranking data. Ranked entity data generated by the entity ranking module is stored in a database. The entity ranking module may be configured to evaluate a plurality of diverse attributes to determine a total score for a geospatial entity. The entity ranking module may be configured to organize ranked entity data into placemark layers.

Journal ArticleDOI
TL;DR: A generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space-time tradeoff and is asymptotically faster than all previous suffix tree or array construction algorithms.
Abstract: Suffix trees and suffix arrays are widely used and largely interchangeable index structures on strings and sequences. Practitioners prefer suffix arrays due to their simplicity and space efficiency while theoreticians use suffix trees due to linear-time construction algorithms and more explicit structure. We narrow this gap between theory and practice with a simple linear-time construction algorithm for suffix arrays. The simplicity is demonstrated with a C++ implementation of 50 effective lines of code. The algorithm is called DC3, which stems from the central underlying concept of difference cover. This view leads to a generalized algorithm, DC, that allows a space-efficient implementation and, moreover, supports the choice of a space-time tradeoff. For any v ∈ [1,√n], it runs in O(vn) time using O(n/√v) space in addition to the input string and the suffix array. We also present variants of the algorithm for several parallel and hierarchical memory models of computation. The algorithms for BSP and EREW-PRAM models are asymptotically faster than all previous suffix tree or array construction algorithms.
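
For orientation, the sketch below builds a suffix array by simple prefix doubling in O(n log² n); it shows what the output structure is, but it is deliberately not the paper's linear-time DC3 algorithm:

```python
# Prefix-doubling suffix array construction, O(n log^2 n). For orientation
# only; NOT the paper's linear-time DC3 algorithm.
def suffix_array(s: str) -> list[int]:
    n = len(s)
    if n < 2:
        return list(range(n))
    rank = [ord(c) for c in s]
    sa = list(range(n))
    k = 1
    while True:
        # Sort suffixes by their first k chars, then the next k chars.
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: done
            break
        k *= 2
    return sa

print(suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```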

Patent
29 Jun 2006
TL;DR: In this patent, a method and apparatus for pilot signal transmission is described, where the set of occupied pilot block sub-carriers changes at least once in a burst.
Abstract: A method and apparatus for pilot signal transmission is disclosed herein. In particular, a pilot transmission scheme is utilized where pilot sub-carrier bandwidth differs from data sub-carrier bandwidth. Because some user's data sub-carriers will no longer have the user's pilot sub-carriers adjacent to them, the set, or pattern of sub-carriers used by the pilot blocks changes at least once in a burst. Changing the pilot block pattern (the set of occupied pilot block sub-carriers) at least once in the burst serves to increase the frequency proximity of occupied data sub-carriers to occupied pilot sub-carriers in the burst.

Patent
27 Mar 2006
TL;DR: In this patent, radio frames are divided into a plurality of subframes, and data is transmitted within the subframes with a frame duration selected from two or more possible frame durations.
Abstract: During operation, radio frames are divided into a plurality of subframes. Data is transmitted over the radio frames within a plurality of subframes, with a frame duration selected from two or more possible frame durations.

Book
31 Dec 2006
TL;DR: A survey of recent methods for creating piecewise linear mappings between triangulations in 3D and simpler domains such as planar regions, simplicial complexes, and spheres is presented in this article.
Abstract: We present a survey of recent methods for creating piecewise linear mappings between triangulations in 3D and simpler domains such as planar regions, simplicial complexes, and spheres. We also discuss emerging tools such as global parameterization, inter-surface mapping, and parameterization with constraints. We start by describing the wide range of applications where parameterization tools have been used in recent years. We then briefly review the pertinent mathematical background and terminology, before proceeding to survey the existing parameterization techniques. Our survey summarizes the main ideas of each technique and discusses its main properties, comparing it to other methods available. Thus it aims to provide guidance to researchers and developers when assessing the suitability of different methods for various applications. This survey focuses on the practical aspects of the methods available, such as time complexity and robustness and shows multiple examples of parameterizations generated using different methods, allowing the reader to visually evaluate and compare the results.

Proceedings ArticleDOI
11 Jun 2006
TL;DR: This work presents a truthful auction for pricing advertising slots on a web-page assuming that advertisements for different merchants must be ranked in decreasing order of their (weighted) bids.
Abstract: We present a truthful auction for pricing advertising slots on a web-page assuming that advertisements for different merchants must be ranked in decreasing order of their (weighted) bids. This captures both the Overture model where bidders are ranked in order of the submitted bids, and the Google model where bidders are ranked in order of the expected revenue (or utility) that their advertisement generates. Assuming separable click-through rates, we prove revenue-equivalence between our auction and the non-truthful next-price auctions currently in use.
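
Under separable click-through rates, the truthful "laddered" pricing from this line of work takes roughly the following form; the notation is mine, so see the paper for the exact statement:

```latex
% Laddered pricing under separable click-through rates
% \theta_1 \ge \dots \ge \theta_K (with \theta_{K+1} = 0), where b_j denotes
% the j-th highest (weighted) bid; the bidder in slot i pays per click
p_i \;=\; \sum_{j=i}^{K} \frac{\theta_j - \theta_{j+1}}{\theta_i}\, b_{j+1} .
```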

Proceedings ArticleDOI
25 Oct 2006
TL;DR: Topology-based Geolocation (TBG) is a novel approach to estimating the geographic location of arbitrary Internet hosts that leverages network topology, along with measurements of network delay, to constrain host position, improving the consistency of location estimates.
Abstract: We present Topology-based Geolocation (TBG), a novel approach to estimating the geographic location of arbitrary Internet hosts. We motivate our work by showing that 1) existing approaches, based on end-to-end delay measurements from a set of landmarks, fail to outperform much simpler techniques, and 2) the error of these approaches is strongly determined by the distance to the nearest landmark, even when triangulation is used to combine estimates from different landmarks. Our approach improves on these earlier techniques by leveraging network topology, along with measurements of network delay, to constrain host position. We convert topology and delay data into a set of constraints, then solve for router and host locations simultaneously. This approach improves the consistency of location estimates, reducing the error substantially for structured networks in our experiments on Abilene and Sprint. For networks with insufficient structural constraints, our techniques integrate external hints that are validated using measurements before being trusted. Together, these techniques lower the median estimation error for our university-based dataset to 67 km vs. 228 km for the best previous approach.
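
One ingredient is easy to state concretely: a measured delay bounds distance, since light in fiber travels at roughly two-thirds of its vacuum speed. A back-of-the-envelope sketch of that single constraint (the paper combines many such constraints with topology and solves for all positions jointly):

```python
# Toy delay-based position constraint: an RTT caps how far apart two hosts
# can be, given signal propagation at ~2/3 the vacuum speed of light in fiber.
C_FIBER_KM_PER_MS = 299_792.458 / 1000 * (2 / 3)  # ~200 km per millisecond

def max_distance_km(rtt_ms: float) -> float:
    return (rtt_ms / 2) * C_FIBER_KM_PER_MS  # one-way delay times fiber speed

print(round(max_distance_km(20.0)))  # a 20 ms RTT caps separation near 2000 km
```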

Proceedings Article
16 Jul 2006
TL;DR: A series of experiments with different machine learning algorithms is discussed in order to experimentally evaluate various trade-offs, using approximately 100K product reviews from the web.
Abstract: Evaluating text fragments for positive and negative subjective expressions and their strength can be important in applications such as single- or multi- document summarization, document ranking, data mining, etc. This paper looks at a simplified version of the problem: classifying online product reviews into positive and negative classes. We discuss a series of experiments with different machine learning algorithms in order to experimentally evaluate various trade-offs, using approximately 100K product reviews from the web.
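
A minimal sketch of one experimental setup in this vein, using scikit-learn with inline dummy reviews; the paper's actual experiments compare several learners over roughly 100K web reviews:

```python
# Bag-of-words sentiment classifier for product reviews. Dummy inline data;
# illustrative of the task, not the paper's exact feature set or learners.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = ["great product, works perfectly", "terrible, broke after a day",
           "love it, highly recommend", "awful quality, do not buy"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(reviews, labels)
print(clf.predict(["works great, recommend it"]))  # likely [1]
```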

Proceedings ArticleDOI
26 Jun 2006
TL;DR: This is the first set of algorithms for the anonymization problem where the performance is independent of the anonymity parameter k, and extends the algorithms to allow an ε fraction of points to remain unclustered, i.e., deleted from the anonymized publication.
Abstract: Publishing data for analysis from a table containing personal records, while maintaining individual privacy, is a problem of increasing importance today. The traditional approach of de-identifying records is to remove identifying fields such as social security number, name etc. However, recent research has shown that a large fraction of the US population can be identified using non-key attributes (called quasi-identifiers) such as date of birth, gender, and zip code [15]. Sweeney [16] proposed the k-anonymity model for privacy where non-key attributes that leak information are suppressed or generalized so that, for every record in the modified table, there are at least k−1 other records having exactly the same values for quasi-identifiers. We propose a new method for anonymizing data records, where quasi-identifiers of data records are first clustered and then cluster centers are published. To ensure privacy of the data records, we impose the constraint that each cluster must contain no fewer than a pre-specified number of data records. This technique is more general since we have a much larger choice for cluster centers than k-Anonymity. In many cases, it lets us release a lot more information without compromising privacy. We also provide constant-factor approximation algorithms to come up with such a clustering. This is the first set of algorithms for the anonymization problem where the performance is independent of the anonymity parameter k. We further observe that a few outlier points can significantly increase the cost of anonymization. Hence, we extend our algorithms to allow an ε fraction of points to remain unclustered, i.e., deleted from the anonymized publication. Thus, by not releasing a small fraction of the database records, we can ensure that the data published for analysis has less distortion and hence is more useful. Our approximation algorithms for new clustering objectives are of independent interest and could be applicable in other clustering scenarios as well.
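
A toy rendering of the core idea, publishing per-cluster centers with a minimum cluster size r; the greedy chunking below is only illustrative, not the paper's constant-factor approximation algorithms:

```python
# Publish cluster centers instead of raw records, with every cluster forced
# to hold at least r records. Greedy chunking for illustration only.
def anonymize(points: list[tuple[float, ...]], r: int) -> list[tuple[float, ...]]:
    pts = sorted(points)  # crude ordering so nearby quasi-identifiers group together
    groups = [pts[i:i + r] for i in range(0, len(pts), r)]
    if len(groups) > 1 and len(groups[-1]) < r:
        groups[-2].extend(groups.pop())  # fold a short remainder into the previous group
    return [tuple(sum(col) / len(g) for col in zip(*g)) for g in groups]

# Records reduced to quasi-identifiers (birth year, zip code):
quasi_ids = [(1960.0, 94040.0), (1961.0, 94041.0), (1980.0, 10001.0), (1981.0, 10002.0)]
print(anonymize(quasi_ids, r=2))  # two centers, each standing in for >= 2 records
```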

Journal ArticleDOI
TL;DR: This paper presents the Laplace-Beltrami approach for computing density invariant embeddings which are essential for integrating different sources of data and describes a refinement of the Nyström extension algorithm called "geometric harmonics."
Abstract: Data fusion and multicue data matching are fundamental tasks of high-dimensional data analysis. In this paper, we apply the recently introduced diffusion framework to address these tasks. Our contribution is three-fold: first, we present the Laplace-Beltrami approach for computing density invariant embeddings which are essential for integrating different sources of data. Second, we describe a refinement of the Nyström extension algorithm called "geometric harmonics." We also explain how to use this tool for data assimilation. Finally, we introduce a multicue data matching scheme based on nonlinear spectral graph alignment. The effectiveness of the presented schemes is validated by applying them to the problems of lipreading and image sequence alignment.
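
The density-invariant normalization referred to here is standard in the diffusion framework (background notation, not a formula quoted from the paper): estimate the sampling density from the kernel, divide it out, and only then build the Markov kernel, so the embedding reflects geometry rather than how the data were sampled:

```latex
q(x) = \sum_{y} k(x, y), \qquad
\tilde{k}(x, y) = \frac{k(x, y)}{q(x)\, q(y)}, \qquad
p(x, y) = \frac{\tilde{k}(x, y)}{\sum_{z} \tilde{k}(x, z)} .
```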

Patent
Hartmut Neven
12 May 2006
TL;DR: In this patent, an image-based information retrieval system is presented, including a mobile telephone with a built-in camera, a remote recognition server that matches transmitted images against object representations in a database, and a remote media server that returns mobile media content based on the match.
Abstract: An image-based information retrieval system, including a mobile telephone, a remote recognition server, and a remote media server, the mobile telephone having a built-in camera and a communication link for transmitting an image from the built-in camera to the remote recognition server and for receiving mobile media content from the remote media server, the remote recognition server for matching an image from the mobile telephone with an object representation in a database and forwarding an associated text identifier to the remote media server, and the remote media server for forwarding mobile media content to the mobile telephone based on the associated text identifier.

Proceedings ArticleDOI
26 Jun 2006
TL;DR: This paper lays out specific technical challenges to realizing DSSPs, focusing on query answering, the DSSP's ability to introspect on its content, and the use of human attention to enhance the semantic relationships in a dataspace.
Abstract: The most acute information management challenges today stem from organizations relying on a large number of diverse, interrelated data sources, but having no means of managing them in a convenient, integrated, or principled fashion. These challenges arise in enterprise and government data management, digital libraries, "smart" homes and personal information management. We have proposed dataspaces as a data management abstraction for these diverse applications and DataSpace Support Platforms (DSSPs) as systems that should be built to provide the required services over dataspaces. Unlike data integration systems, DSSPs do not require full semantic integration of the sources in order to provide useful services. This paper lays out specific technical challenges to realizing DSSPs and ties them to existing work in our field. We focus on query answering in DSSPs, the DSSP's ability to introspect on its content, and the use of human attention to enhance the semantic relationships in a dataspace.

Journal ArticleDOI
11 Aug 2006
TL;DR: This paper designs and analyzes techniques to increase "persistence" of sensed data, so that data is more likely to reach a data sink even as network nodes fail, by replicating data compactly at neighboring nodes using novel "Growth Codes" that increase in efficiency as data accumulates at the sink.
Abstract: Sensor networks are especially useful in catastrophic or emergency scenarios such as floods, fires, terrorist attacks or earthquakes where human participation may be too dangerous. However, such disaster scenarios pose an interesting design challenge since the sensor nodes used to collect and communicate data may themselves fail suddenly and unpredictably, resulting in the loss of valuable data. Furthermore, because these networks are often expected to be deployed in response to a disaster, or because of sudden configuration changes due to failure, these networks are often expected to operate in a "zero-configuration" paradigm, where data collection and transmission must be initiated immediately, before the nodes have a chance to assess the current network topology. In this paper, we design and analyze techniques to increase "persistence" of sensed data, so that data is more likely to reach a data sink, even as network nodes fail. This is done by replicating data compactly at neighboring nodes using novel "Growth Codes" that increase in efficiency as data accumulates at the sink. We show that Growth Codes preserve more data in the presence of node failures than previously proposed erasure resilient techniques.
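
A toy sketch of the degree-growth idea: codewords are XORs of d data symbols, and d grows over time, because degree-1 codewords are most useful to a sink that knows little, while higher degrees help once most symbols are recovered. The two-step degree schedule below is made up for illustration; the paper derives the optimal switch points:

```python
# Toy Growth Codes: XOR-based codewords whose degree grows as the sink
# recovers more symbols. Degree schedule is illustrative only.
import random

def encode(symbols: list[int], recovered_so_far: int) -> tuple[set[int], int]:
    n = len(symbols)
    degree = 1 if recovered_so_far < n // 2 else 2  # made-up schedule
    idx = set(random.sample(range(n), degree))
    word = 0
    for i in idx:
        word ^= symbols[i]
    return idx, word

def try_decode(known: dict[int, int], idx: set[int], word: int) -> None:
    unknown = [i for i in idx if i not in known]
    if len(unknown) == 1:  # peel: all but one constituent already known
        for i in idx - {unknown[0]}:
            word ^= known[i]
        known[unknown[0]] = word

data = [7, 42, 99, 13]
known: dict[int, int] = {}
while len(known) < len(data):
    idx, word = encode(data, len(known))
    try_decode(known, idx, word)
print(known)  # all symbols recovered
```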

Proceedings ArticleDOI
11 Dec 2006
TL;DR: This work proposes Address Space Layout Permutation (ASLP), which introduces a high degree of randomness (or high entropy) with minimal performance overhead; the Linux operating system kernel is also modified to permute stack, heap, and memory-mapped regions.
Abstract: Address space randomization is an emerging and promising method for stopping a broad range of memory corruption attacks. By randomly shifting critical memory regions at process initialization time, address space randomization converts an otherwise successful malicious attack into a benign process crash. However, existing approaches either introduce insufficient randomness, or require source code modification. While insufficient randomness allows successful brute-force attacks, as shown in recent studies, the required source code modification prevents this effective method from being used for commodity software, which is the major source of exploited vulnerabilities on the Internet. We propose Address Space Layout Permutation (ASLP), which introduces a high degree of randomness (or high entropy) with minimal performance overhead. Essential to ASLP is a novel binary rewriting tool that can place the static code and data segments of a compiled executable at a randomly specified location and perform fine-grained permutation of procedure bodies in the code segment as well as static data objects in the data segment. We have also modified the Linux operating system kernel to permute stack, heap, and memory-mapped regions. Together, ASLP completely permutes memory regions in an application. Our security and performance evaluation shows minimal performance overhead with orders of magnitude improvement in randomness (e.g., up to 29 bits of randomness on a 32-bit architecture).

Proceedings ArticleDOI
Maryam Kamvar, Shumeet Baluja
22 Apr 2006
TL;DR: The goal is to understand the current state of wireless search by analyzing over 1 Million hits to Google's mobile search sites, which includes the examination of search queries and the general categories under which they fall.
Abstract: We present a large scale study of search patterns on Google's mobile search interface. Our goal is to understand the current state of wireless search by analyzing over 1 Million hits to Google's mobile search sites. Our study also includes the examination of search queries and the general categories under which they fall. We follow users throughout multiple interactions to determine search behavior; we estimate how long they spend inputting a query, viewing the search results, and how often they click on a search result. We also compare and contrast search patterns between 12-key keypad phones (cellphones), phones with QWERTY keyboards (PDAs) and conventional computers.

Patent
13 Oct 2006
TL;DR: In this article, a method of operating a voice-enabled business directory search system includes receiving category-business pairs, each category business pair including a business category and a specific business, and establishing a data structure having nodes based on the category business pairs Each node of the data structure is associated with one or more business categories and a speech recognition language model for recognizing specific businesses associated with the one or multiple businesses categories.
Abstract: A method of operating a voice-enabled business directory search system includes receiving category-business pairs, each category-business pair including a business category and a specific business, and establishing a data structure having nodes based on the category-business pairs. Each node of the data structure is associated with one or more business categories and a speech recognition language model for recognizing specific businesses associated with the one or more business categories.

Patent
Chikusa Takashi, Hori Masanori, Tachibana Toshio, Maki Takehiro, Honma Hirotaka
23 Feb 2006
TL;DR: In this patent, for a disk array device that cools its sections, including the HDDs, with a fan, a drop in cooling efficiency caused by differences between HDD-mounted and unmounted sections is prevented, and the use of dummy HDDs is eliminated.
Abstract: In a disk array device that cools respective sections, including the HDDs, by a fan, a drop in cooling efficiency caused by the differing conditions of HDD-mounted and unmounted sections is prevented, and the use of dummy HDDs is eliminated. A housing contains HDAs carrying the HDDs, a power controller board for controlling the HDDs, a power unit for supplying power to each section, a fan for cooling the interior of the housing, and a backboard for connecting all the sections. The HDAs are mounted on one backboard surface, which provides a cooling function by drawing cooling air through the housing and exhausting it via the region on which the HDAs are mounted and via a vent hole in the backboard. The vent hole is fitted with a shutter having a mechanism that adjusts the vent hole's open area rate, opening when an HDA is mounted and closing when it is removed.

Patent
Grossman Steven, Joshy Joseph, Bill Kilday, Nguyen Giao, Dominic Preuss, Sridhar Ramaswamy
08 Dec 2006
TL;DR: In response to a query for information in a geographic region or at a location, ranked ads may be plotted on, or in association with, a map (e.g., as a list beside the map), satellite photo, or any other form of visual representation of geographic information.
Abstract: In response to a query for information in a geographic region or at a location, ranked ads may be plotted on, or in association with, a map (e.g., as a list beside the map), satellite photo, or any other form of visual representation of geographic information (referred to generally as "maps"). Sponsored ads might be shown in a dedicated place and/or might be elevated above other non-sponsored search results (e.g., Yellow Page listings). The number of ads shown in the list and/or plotted on the map could vary as a function of the resolution of the map or geographic image. The ads could be ranked or scored, and attributes or features of various ads may be a function of such a score or ranking. The plots on the map might be selectable to provide a pop-up with further information and possibly sponsored information (such as images, further ads, etc.).

Patent
27 Jun 2006
TL;DR: In this paper, a computer system may include a connecting hub having a plurality of docking regions and be configured to provide to each docking region electrical power, a data network interface, cooling fluid supply and a cooling fluid return.
Abstract: A computer system may include a connecting hub having a plurality of docking regions and be configured to provide to each docking region electrical power, a data network interface, a cooling fluid supply and a cooling fluid return; and a plurality of shipping containers that each enclose a modular computing environment that incrementally adds computing power to the system. Each shipping container may include a) a plurality of processing units coupled to the data network interface, each of which include a microprocessor; b) a heat exchanger configured to remove heat generated by the plurality of processing units by circulating cooling fluid from the supply through the heat exchanger and discharging it into the return; and c) docking members configured to releasably couple to the connecting hub at one of the docking regions to receive electrical power, connect to the data network interface, and receive and discharge cooling fluid.

Proceedings ArticleDOI
18 Dec 2006
TL;DR: This paper introduces an adaptive framework for automatically learning blocking functions that are efficient and accurate, and describes two predicate-based formulations of learnable blocking functions and provides learning algorithms for training them.
Abstract: Many data mining tasks require computing similarity between pairs of objects. Pairwise similarity computations are particularly important in record linkage systems, as well as in clustering and schema mapping algorithms. Because the number of object pairs grows quadratically with the size of the dataset, computing similarity between all pairs is impractical and becomes prohibitive for large datasets and complex similarity functions. Blocking methods alleviate this problem by efficiently selecting approximately similar object pairs for subsequent distance computations, leaving out the remaining pairs as dissimilar. Previously proposed blocking methods require manually constructing an index-based similarity function or selecting a set of predicates, followed by hand-tuning of parameters. In this paper, we introduce an adaptive framework for automatically learning blocking functions that are efficient and accurate. We describe two predicate-based formulations of learnable blocking functions and provide learning algorithms for training them. The effectiveness of the proposed techniques is demonstrated on real and simulated datasets, on which they prove to be more accurate than non-adaptive blocking methods.
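
A minimal sketch of predicate-based blocking, with hand-picked illustrative predicates; the paper's contribution is learning which predicate combination maximizes coverage of true matches while minimizing candidate pairs:

```python
# Predicate-based blocking: each predicate maps a record to a key, and only
# records sharing a key under some predicate become candidate pairs.
# The predicates here are illustrative, not learned as in the paper.
from itertools import combinations

records = [
    {"name": "Jon Smith", "zip": "94040"},
    {"name": "John Smith", "zip": "94040"},
    {"name": "Ann Lee", "zip": "10001"},
]

predicates = [
    lambda r: r["zip"],                  # same zip code
    lambda r: r["name"].split()[-1][:4]  # same last-name prefix
]

candidates = set()
for pred in predicates:
    blocks: dict[str, list[int]] = {}
    for i, r in enumerate(records):
        blocks.setdefault(pred(r), []).append(i)
    for ids in blocks.values():
        candidates.update(combinations(ids, 2))

print(candidates)  # {(0, 1)}: only the plausible match gets compared
```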