Author

Sean Dorward

Bio: Sean Dorward is an academic researcher from Google. The author has contributed to research in topics such as web query classification and procedural programming. The author has an h-index of 3 and has co-authored 5 publications receiving 883 citations.

Papers
Journal ArticleDOI
TL;DR: The design -- including the separation into two phases, the form of the programming language, and the properties of the aggregators -- exploits the parallelism inherent in having data and computation distributed across many machines.
Abstract: Very large data sets often have a flat but regular structure and span multiple disks and machines. Examples include telephone call records, network logs, and web document repositories. These large data sets are not amenable to study using traditional database techniques, if only because they can be too large to fit in a single relational database. On the other hand, many of the analyses done on them can be expressed using simple, easily distributed computations: filtering, aggregation, extraction of statistics, and so on. We present a system for automating such analyses. A filtering phase, in which a query is expressed using a new procedural programming language, emits data to an aggregation phase. Both phases are distributed over hundreds or even thousands of computers. The results are then collated and saved to a file. The design -- including the separation into two phases, the form of the programming language, and the properties of the aggregators -- exploits the parallelism inherent in having data and computation distributed across many machines.
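
The two-phase shape described above is easy to sketch. The following Go program is a minimal, illustrative analogue (not the paper's actual query language): parallel workers run the filtering query over their shard of the records and emit values, and a single aggregation phase folds the emitted values into a table. The record type and the toy query are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Hypothetical record type standing in for one log entry.
type record struct{ domain string }

// Filtering phase: each worker applies the query to its shard of
// records and emits values to the aggregation phase.
func filter(shard []record, emit chan<- string, wg *sync.WaitGroup) {
	defer wg.Done()
	for _, r := range shard {
		if strings.HasSuffix(r.domain, ".com") { // the toy "query"
			emit <- r.domain
		}
	}
}

func main() {
	shards := [][]record{
		{{"a.com"}, {"b.org"}},
		{{"a.com"}, {"c.com"}},
	}
	emit := make(chan string)
	var wg sync.WaitGroup
	for _, s := range shards {
		wg.Add(1)
		go filter(s, emit, &wg)
	}
	go func() { wg.Wait(); close(emit) }()

	// Aggregation phase: a sum table. Because addition is commutative
	// and associative, the order in which shards emit cannot change
	// the result -- the property the paper's aggregators rely on.
	counts := map[string]int{}
	for d := range emit {
		counts[d]++
	}
	fmt.Println(counts) // map[a.com:2 c.com:1]
}
```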

718 citations

Patent
Rob Pike, Sean Quinlan, Sean Dorward, Jeffrey Dean, Sanjay Ghemawat
28 Feb 2012
TL;DR: In this paper, a method and system for analyzing data records allocates groups of records to respective processes of a first plurality of processes executing in parallel; for each record in the group of records allocated to a respective process, a query is applied to the record so as to produce zero or more values.
Abstract: A method and system for analyzing data records includes allocating groups of records to respective processes of a first plurality of processes executing in parallel. In each respective process of the first plurality of processes, for each record in the group of records allocated to the respective process, a query is applied to the record so as to produce zero or more values. Zero or more emit operators are applied to each of the zero or more produced values so as to add corresponding information to an intermediate data structure. Information from a plurality of the intermediate data structures is aggregated to produce output data.
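
As a rough sketch of the claim's structure (hypothetical names throughout, not the patented implementation), the Go program below applies a query that yields zero or more values per record and passes each value through a set of emit operators that update an intermediate data structure:

```go
package main

import "fmt"

// An emit operator folds a produced value into an intermediate
// data structure (here, a running sum).
type emitter interface{ Emit(v int) }

type sumTable struct{ total int }

func (s *sumTable) Emit(v int) { s.total += v }

// Apply the query to each record; it may produce zero or more
// values, and each value passes through every emit operator.
func process(records []string, query func(string) []int, ops []emitter) {
	for _, r := range records {
		for _, v := range query(r) {
			for _, op := range ops {
				op.Emit(v)
			}
		}
	}
}

func main() {
	lengths := func(r string) []int {
		if r == "" {
			return nil // zero values for an empty record
		}
		return []int{len(r)}
	}
	agg := &sumTable{}
	process([]string{"abc", "", "de"}, lengths, []emitter{agg})
	fmt.Println(agg.total) // 5
}
```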

115 citations

Patent
31 Jan 2007
TL;DR: In this article, a method for generating search results based on a user's search query is proposed, in which the search results, together with information identifying at least one of a telephone number or an address associated with a first one of the search results, are provided to the user.
Abstract: A method includes receiving a search query from a user and generating search results based on the search query. The method may also include providing the search results and information identifying at least one of a telephone number or an address associated with a first one of the search results to the user. The method may further include providing a link to a map associated with at least the first search result to the user.
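
A minimal sketch of what one such enriched result might carry, with entirely hypothetical field names and values (the patent does not specify a format):

```go
package main

import "fmt"

// Hypothetical shape of one enriched search result as the claim
// describes it: the result itself, an associated phone number or
// address, and a link to a map.
type searchResult struct {
	Title   string
	URL     string
	Phone   string // at least one of Phone or Address is present
	Address string
	MapLink string
}

func main() {
	r := searchResult{
		Title:   "Example Pizza",
		URL:     "https://example.com",
		Phone:   "+1-555-0100",
		Address: "1 Main St",
		MapLink: "https://maps.example.com?q=1+Main+St",
	}
	fmt.Printf("%+v\n", r)
}
```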

54 citations

Patent
15 Jul 2004
TL;DR: In this article, the computational efficiency of decoding block-sorted compressed data is improved by ensuring that more than one set of operations, corresponding to a plurality of paths through a mapping array T, is being handled by the processor.
Abstract: In an embodiment of the present invention, the computational efficiency of decoding block-sorted compressed data is improved by ensuring that more than one set of operations, corresponding to a plurality of paths through a mapping array T, is being handled by a processor. This sequence of operations, including instructions from the plurality of sets of operations, ensures that there is another operation in the pipeline if a cache miss on any given lookup operation in the mapping array results in a slower main memory access. In this way, processor utilization is improved. While the sets of operations in the sequence are independent of one another, a plurality of the main memory access operations will overlap because of the long time required for main memory access.
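
The core idea, interleaving several independent walks through the mapping array so that memory latency on one path overlaps with work on the others, can be sketched as below. This is an illustrative Go analogue with made-up array contents, not the patented decoder:

```go
package main

import "fmt"

// Follow k independent paths through the mapping array T in
// lockstep. If one T lookup misses the cache, the processor can
// still issue the loads for the other paths, overlapping the
// main-memory latency instead of stalling on a single chain of
// dependent lookups.
func decodeInterleaved(T []int, L []byte, starts []int, n int) [][]byte {
	k := len(starts)
	out := make([][]byte, k)
	pos := append([]int(nil), starts...)
	for i := 0; i < n; i++ {
		for j := 0; j < k; j++ { // round-robin over the k paths
			out[j] = append(out[j], L[pos[j]])
			pos[j] = T[pos[j]]
		}
	}
	return out
}

func main() {
	// Toy inputs, chosen only to exercise the traversal.
	T := []int{2, 0, 3, 1}
	L := []byte("abcd")
	for _, p := range decodeInterleaved(T, L, []int{0, 1}, 4) {
		fmt.Println(string(p))
	}
}
```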

2 citations

Patent
31 Jan 2007
TL;DR: The authors describe a method comprising receiving a search query from a user and generating search results based on that query.
Abstract: The invention relates to a method comprising receiving a search query from a user and generating search results based on that query. The method may also comprise providing the user with the search results and with information identifying at least one of a telephone number or an address associated with a first one of the search results. The method may further comprise providing the user with a link to a map associated with at least the first search result.

Cited by
Proceedings Article
01 Jan 2006
TL;DR: Bigtable, as described in this paper, is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Projects that store data in it include web indexing, Google Earth, and Google Finance.
Abstract: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this article, we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
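
The data model the abstract refers to is, per the paper, a sparse, distributed, persistent multidimensional sorted map indexed by row key, column key, and timestamp. A minimal Go sketch of that key structure follows (a hash map, so the sorted layout control the paper describes is not captured):

```go
package main

import "fmt"

// One cell is addressed by (row, column, timestamp); the value is
// an uninterpreted byte string. Row and column names below follow
// the paper's running Webtable example.
type cellKey struct {
	Row, Column string
	Timestamp   int64
}

func main() {
	t := map[cellKey][]byte{
		{"com.cnn.www", "contents:", 3}:        []byte("<html>..."),
		{"com.cnn.www", "anchor:cnnsi.com", 9}: []byte("CNN"),
	}
	fmt.Println(string(t[cellKey{"com.cnn.www", "contents:", 3}]))
}
```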

4,843 citations

Proceedings ArticleDOI
06 Jun 2010
TL;DR: A model for processing large graphs, designed for efficient, scalable, and fault-tolerant implementation on clusters of thousands of commodity computers, whose implied synchronicity makes reasoning about programs easier.
Abstract: Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs - in some cases billions of vertices, trillions of edges - poses challenges to their efficient processing. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms. The model has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
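
A minimal single-machine sketch of this vertex-centric model in Go, using maximum-value propagation as the example algorithm; voting to halt, combiners, and distribution across machines are all omitted:

```go
package main

import "fmt"

// In each superstep a vertex reads the messages sent to it in the
// previous superstep, updates its state, and sends messages along
// its out-edges.
type vertex struct {
	value int
	out   []int // indices of out-neighbors
}

func superstep(vs []vertex, inbox [][]int) [][]int {
	next := make([][]int, len(vs))
	for i := range vs {
		for _, m := range inbox[i] {
			if m > vs[i].value {
				vs[i].value = m
			}
		}
		for _, n := range vs[i].out {
			next[n] = append(next[n], vs[i].value)
		}
	}
	return next
}

func main() {
	// A 3-vertex ring: 0 -> 1 -> 2 -> 0.
	vs := []vertex{{3, []int{1}}, {6, []int{2}}, {1, []int{0}}}
	inbox := make([][]int, len(vs))
	for step := 0; step < 3; step++ { // enough supersteps for this ring
		inbox = superstep(vs, inbox)
	}
	for i, v := range vs {
		fmt.Printf("vertex %d: %d\n", i, v.value) // all converge to 6
	}
}
```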

3,840 citations

Journal ArticleDOI
TL;DR: The simple data model provided by Bigtable is described, which gives clients dynamic control over data layout and format, and the design and implementation of Bigtable are described.
Abstract: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this article, we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.

3,259 citations

Proceedings ArticleDOI
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly
21 Mar 2007
TL;DR: The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.
Abstract: Dryad is a general-purpose distributed execution engine for coarse-grain data-parallel applications. A Dryad application combines computational "vertices" with communication "channels" to form a dataflow graph. Dryad runs the application by executing the vertices of this graph on a set of available computers, communicating as appropriate through files, TCP pipes, and shared-memory FIFOs. The vertices provided by the application developer are quite simple and are usually written as sequential programs with no thread creation or locking. Concurrency arises from Dryad scheduling vertices to run simultaneously on multiple computers, or on multiple CPU cores within a computer. The application can discover the size and placement of data at run time, and modify the graph as the computation progresses to make efficient use of the available resources. Dryad is designed to scale from powerful multi-core single computers, through small clusters of computers, to data centers with thousands of computers. The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.
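
As a toy analogue of the vertex-and-channel structure (not Dryad's actual API), the Go sketch below wires two sequential "vertices" together with a channel standing in for a shared-memory FIFO; real Dryad channels may also be files or TCP pipes, and vertices run on many machines:

```go
package main

import "fmt"

// Upstream vertex: a plain sequential program that writes to its
// output channel and closes it when done.
func producer(out chan<- int) {
	for i := 1; i <= 3; i++ {
		out <- i * i
	}
	close(out)
}

// Downstream vertex: reads everything from its input channel and
// reports a single aggregate.
func consumer(in <-chan int, done chan<- int) {
	sum := 0
	for v := range in {
		sum += v
	}
	done <- sum
}

func main() {
	ch := make(chan int)  // the "channel" edge of the dataflow graph
	done := make(chan int)
	go producer(ch)       // vertex 1
	go consumer(ch, done) // vertex 2
	fmt.Println(<-done)   // 14
}
```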

2,867 citations

Journal ArticleDOI
TL;DR: The background and state of the art of big data are reviewed, including related technologies and representative applications such as enterprise management, Internet of Things, online social networks, medical applications, collective intelligence, and smart grid.
Abstract: In this paper, we review the background and state of the art of big data. We first introduce the general background of big data and review related technologies, such as cloud computing, the Internet of Things, data centers, and Hadoop. We then focus on the four phases of the big data value chain, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally examine several representative applications of big data, including enterprise management, the Internet of Things, online social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide readers with a comprehensive overview and big picture of this exciting area. The survey concludes with a discussion of open problems and future directions.

2,303 citations