Showing papers by "Jeffrey Dean" published in 2004

Open Access

Journal Article

MapReduce: simplified data processing on large clusters

Jeffrey Dean¹, Sanjay Ghemawat¹

06 Dec 2004

TL;DR: This paper presents MapReduce, a programming model and associated implementation for processing and generating large data sets; the implementation runs on large clusters of commodity machines and is highly scalable.

Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
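
The programming model described in this abstract can be sketched in a few lines of single-process Python. This is a minimal illustration only, not the paper's implementation: the function names (`map_word_count`, `reduce_word_count`, `run_mapreduce`) are hypothetical, and the sketch omits everything the run-time system is said to handle (partitioning, scheduling, failure handling, inter-machine communication).

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_word_count(doc_name: str, contents: str) -> Iterator[Tuple[str, int]]:
    """User-supplied map: emit an intermediate (word, 1) pair per word."""
    for word in contents.split():
        yield word.lower(), 1


def reduce_word_count(word: str, counts: List[int]) -> int:
    """User-supplied reduce: merge all values that share one intermediate key."""
    return sum(counts)


def run_mapreduce(
    inputs: Iterable[Tuple[str, str]],
    map_fn: Callable[[str, str], Iterator[Tuple[str, int]]],
    reduce_fn: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Sequential stand-in for the run-time system (assumption: one process)."""
    # Group intermediate values by intermediate key.
    intermediate: Dict[str, List[int]] = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    # Apply reduce once per distinct intermediate key.
    return {ikey: reduce_fn(ikey, values) for ikey, values in sorted(intermediate.items())}
```

Calling `run_mapreduce([("a.txt", "the cat sat"), ("b.txt", "The cat ran")], map_word_count, reduce_word_count)` counts each word across both inputs. The user writes only the two small functions; the grouping step is generic.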

20,309 citations

Patent

Information retrieval based on historical data

Anurag Acharya, Matt Cutts, Jeffrey Dean¹, Paul Haahr¹, Monika Henzinger¹, Urs Hoelzle, Steve Lawrence¹, Karl Pfleger¹, Olcan Sercinoglu¹, Simon Tong¹

Google¹

15 Sep 2004

TL;DR: A system (125) identifies a document, obtains one or more types of history data associated with it, and generates a score for the document based at least in part on that history data.

Abstract: A system (125) identifies a document and obtains one or more types of history data associated with the document. The system (125) may generate a score for the document based, at least in part, on the one or more types of history data.

418 citations

Patent

System and method of accessing a document efficiently through multi-tier web caching

Eric Russell Fredricksen¹, Fritz Schneider¹, Jeffrey Dean¹, Sanjay Ghemawat¹, Niels Provos¹, Georges R. Harik¹

Google¹

30 Jun 2004

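TL;DR: A document request is served through successive tiers: the client assistant's cache, then the server's cache, then the document's web host, and finally a document repository. The server also holds index entries for fresh or stable repository documents, directing it to fetch them from the repository rather than from the document's web host.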

Abstract: Upon receipt of a document request, a client assistant examines its cache for the document. If not successful, a server searches for the requested document in its cache. If the server copy is still not fresh or not found, the server seeks the document from its host. If the host cannot provide the copy, the server seeks it from a document repository. Certain documents are identified from the document repository as being fresh or stable. Information about each of these identified documents is transmitted to the server, which inserts entries into an index if the index does not already contain an entry for the document. If and when this particular document is requested, the document will not be present in the server; however, the server will contain an entry directing the server to obtain the document from the document repository rather than the document's web host.
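
The lookup order in this abstract can be illustrated with a short Python sketch. It is an assumption-laden reading of the abstract, not the patented system: the function `fetch_document`, its parameters, and the returned tier labels are all hypothetical, and caches are modelled as plain dictionaries.

```python
from typing import Callable, Dict, Optional, Set, Tuple

Fetcher = Callable[[str], Optional[str]]  # returns None when the source has no copy


def fetch_document(
    url: str,
    client_cache: Dict[str, str],
    server_cache: Dict[str, str],
    is_fresh: Callable[[str], bool],
    repository_index: Set[str],
    fetch_from_host: Fetcher,
    fetch_from_repository: Fetcher,
) -> Tuple[str, Optional[str]]:
    """Return (tier that answered, document), trying tiers in the abstract's order."""
    # Tier 1: the client assistant's own cache.
    if url in client_cache:
        return "client-cache", client_cache[url]
    # Tier 2: the server's cache, used only if the copy is still fresh.
    if url in server_cache and is_fresh(url):
        return "server-cache", server_cache[url]
    # Index entry for a fresh/stable repository document: skip the web host.
    if url in repository_index:
        doc = fetch_from_repository(url)
        if doc is not None:
            return "repository", doc
    # Tier 3: the document's web host.
    doc = fetch_from_host(url)
    if doc is not None:
        return "web-host", doc
    # Tier 4: fall back to the document repository.
    doc = fetch_from_repository(url)
    return ("repository", doc) if doc is not None else ("not-found", None)
```

The sketch does not write fetched documents back into either cache; a fuller version would presumably do so, but the abstract does not say how.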

282 citations

Patent

System and method for efficient large-scale data processing

Jeffrey Dean¹, Sanjay Ghemawat¹

Google¹

18 Jun 2004

TL;DR: A large-scale data processing system in which application-independent map modules apply an application-specific map operation to input data, automatically parallelized across multiple processors, and application-independent reduce modules apply an application-specific reduce operation to the stored intermediate data values to produce output data.

Abstract: A large-scale data processing system and method includes one or more application-independent map modules configured to read input data and to apply at least one application-specific map operation to the input data to produce intermediate data values, wherein the map operation is automatically parallelized across multiple processors in the parallel processing environment. A plurality of intermediate data structures are used to store the intermediate data values. One or more application-independent reduce modules are configured to retrieve the intermediate data values and to apply at least one application-specific reduce operation to the intermediate data values to provide output data.
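
The split between application-independent modules and application-specific operations can be sketched as follows. This is a hedged illustration, not the claimed system: every name here (`map_module`, `reduce_module`, `run_parallel`, `index_map`, `index_reduce`, `partition`) is hypothetical, Python threads stand in for the "multiple processors", and per-partition dictionaries stand in for the "intermediate data structures". The example application builds a small inverted index.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition(key: str, num_partitions: int) -> int:
    """Deterministically assign an intermediate key to one partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def map_module(split, map_op, num_partitions):
    """Application-independent: read one input split, apply the map operation."""
    buckets = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in split:
        for ikey, ivalue in map_op(key, value):
            buckets[partition(ikey, num_partitions)][ikey].append(ivalue)
    return buckets  # one intermediate data structure per partition


def reduce_module(bucket_parts, reduce_op):
    """Application-independent: merge one partition's data, apply the reduce operation."""
    merged = defaultdict(list)
    for part in bucket_parts:
        for ikey, ivalues in part.items():
            merged[ikey].extend(ivalues)
    return {ikey: reduce_op(ikey, ivalues) for ikey, ivalues in merged.items()}


def run_parallel(splits, map_op, reduce_op, num_partitions=2, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map phase: one task per input split, run concurrently.
        map_outputs = list(pool.map(lambda s: map_module(s, map_op, num_partitions), splits))
        # Regroup so each reduce task sees every map task's bucket for its partition.
        reduce_inputs = [[out[p] for out in map_outputs] for p in range(num_partitions)]
        reduce_outputs = list(pool.map(lambda parts: reduce_module(parts, reduce_op), reduce_inputs))
    result = {}
    for out in reduce_outputs:
        result.update(out)
    return result


# Application-specific operations (example: inverted index).
def index_map(doc_id, text):
    for word in set(text.lower().split()):
        yield word, doc_id


def index_reduce(word, doc_ids):
    return sorted(doc_ids)
```

Swapping in a different pair of map and reduce operations changes the application without touching the modules or the driver.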

193 citations

Patent

Ranking documents based on user behavior and/or feature data

Jeffrey Dean¹, Corin Anderson¹, Alexis Battle¹

Google¹

17 Jun 2004

TL;DR: A system generates a model from feature data describing a link from a linking document to a linked document, together with user behavior data on navigational actions associated with the link, and assigns a rank to a document based on that model.

Abstract: A system generates a model based on feature data relating to different features of a link from a linking document to a linked document and user behavior data relating to navigational actions associated with the link. The system also assigns a rank to a document based on the model.

110 citations

Patent

Jeffrey Dean¹, Krishna Bharat¹, Paul T. Buchheit¹

Google¹

27 Feb 2004

TL;DR: The usefulness of content (target content), such as advertisements, may be increased by determining additional content and providing it in association with the target content, which may be text, a Web page, a URL, a search query, etc.

Abstract: The usefulness of content (target content), such as advertisements, may be increased by determining additional content and providing such additional content in association with the content. The target content (310) may be text, a Web page, a URL, a search query, etc. The additional content might be related suggested queries (e.g., "Try a search for…"), news articles (or excerpts or summaries thereof), reviews (or excerpts or summaries thereof), advertisements, user group messages, etc.

59 citations

Patent

Document compression system and method for use with tokenspace repository

Jeffrey Dean¹, Gautham Thambidorai¹, Sanjay Ghemawat¹, Benedict A. Gomes¹, Olcan Sercinoglu¹

Google¹

13 Aug 2004

TL;DR: A multi-tiered mapping scheme between tokens, global token identifiers, and fixed-length local token identifiers enables multi-stage query scoring, including snippet generation, through incremental document reconstruction.

Abstract: The disclosed embodiments enable multi-stage query scoring, including “snippet” generation, through incremental document reconstruction facilitated by a multi-tiered mapping scheme. The mapping scheme includes a first mapping between unique tokens contained in a set of documents and unique global token identifiers (e.g., 32-bit integers) contained in a global-lexicon (i.e., dictionary). The mapping scheme also includes a second mapping between the global token identifiers and a set of fixed-length local token identifiers (e.g., 8-bit integers) contained in one or more mini-lexicons (i.e., sub-dictionaries). Each mini-lexicon is associated with a range of token positions in the tokenized documents. The first and second mappings are used to encode/decode documents into local token identifiers having fixed widths which can be compactly stored in the tokenspace repository. The use of fixed-length local token identifiers allows for fast and efficient decoding of tokenized documents.
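
The two mappings in this abstract can be sketched in Python. This is a toy illustration under stated assumptions, not the disclosed encoding: the names `build_global_lexicon`, `encode`, `decode_range`, and `BLOCK` are hypothetical, the position range per mini-lexicon is set unrealistically small so the example is readable, and local identifiers are single bytes (the abstract's 8-bit example).

```python
from typing import Dict, List, Tuple

BLOCK = 4  # hypothetical: token positions covered by one mini-lexicon


def build_global_lexicon(tokens: List[str]) -> Dict[str, int]:
    """First mapping: unique token -> global token identifier."""
    lexicon: Dict[str, int] = {}
    for token in tokens:
        lexicon.setdefault(token, len(lexicon))
    return lexicon


def encode(tokens: List[str], lexicon: Dict[str, int]) -> List[Tuple[List[int], bytes]]:
    """Second mapping: per position range, global id -> one-byte local id."""
    blocks = []
    for start in range(0, len(tokens), BLOCK):
        mini: List[int] = []            # index = local id, value = global id
        local_of: Dict[int, int] = {}   # global id -> local id within this range
        local_ids = bytearray()
        for token in tokens[start:start + BLOCK]:
            gid = lexicon[token]
            if gid not in local_of:
                if len(mini) == 256:
                    raise ValueError("more than 256 distinct tokens in one range")
                local_of[gid] = len(mini)
                mini.append(gid)
            local_ids.append(local_of[gid])
        blocks.append((mini, bytes(local_ids)))
    return blocks


def decode_range(blocks, inverse_lexicon: Dict[int, str], block_index: int) -> List[str]:
    """Reconstruct only one position range, e.g. to build a snippet."""
    mini, local_ids = blocks[block_index]
    return [inverse_lexicon[mini[local_id]] for local_id in local_ids]
```

Because every local identifier has the same width, the position of the n-th token within a range is known without scanning, and a single range can be decoded without touching the rest of the document.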

37 citations

Patent

Large-scale data processing in a distributed and parallel processing environment

Jeffrey Dean¹, Sanjay Ghemawat¹

Google¹

18 Jun 2004

Abstract: A large-scale data processing system and method includes one or more application-independent map modules configured to read input data and to apply at least one application-specific map operation to the input data to produce intermediate data values, wherein the map operation is automatically parallelized across multiple processors in the parallel processing environment. A plurality of intermediate data structures are used to store the intermediate data values. One or more application-independent reduce modules are configured to retrieve the intermediate data values and to apply at least one application-specific reduce operation to the intermediate data values to provide output data.

23 citations

Patent

System and method for encoding and decoding variable-length data

Jeffrey Dean¹, Michael Burrows¹, Gautham Thambidorai¹, Olcan Sercinoglu¹

Google¹

13 Aug 2004

TL;DR: Variable-length data values are stored in a data structure with a data field and a tag field, where each tag subfield contains one or more tag bits indicating the length of the data stored in the corresponding data subfield.

Abstract: A system and method for encoding and decoding variable-length data includes storing data values in a data structure including a data field and a tag field. The data field includes one or more variable-length data subfields capable of storing variable-length data (e.g., 1 to N bytes of data). In some embodiments, the data subfields and the tag field of the data structure each start on a byte boundary which simplifies decoding. The tag field includes one or more tag subfields, each corresponding to the one or more data subfields. Each tag subfield includes one or more tag bits which indicate the length of the data stored in the corresponding data subfield. Unpacking or decompressing data values from the data structure can be achieved by using a look-up table of offsets and masks, thus reducing the number of bit operations needed to unpack data values from the data structure.
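
One possible layout consistent with this abstract is sketched below. The specific choices are assumptions made for illustration, not details taken from the patent: four unsigned 32-bit values per group, a one-byte tag field with two tag bits per value (encoding lengths of 1 to 4 bytes), little-endian data subfields, and hypothetical function names `encode_group` and `decode_group`.

```python
from typing import List, Tuple


def encode_group(values: List[int]) -> bytes:
    """Pack four unsigned 32-bit ints as one tag byte plus 4 to 16 data bytes."""
    if len(values) != 4:
        raise ValueError("expected exactly four values")
    tag = 0
    data = bytearray()
    for i, value in enumerate(values):
        if not 0 <= value < 2**32:
            raise ValueError("value out of 32-bit range")
        nbytes = max(1, (value.bit_length() + 7) // 8)  # 1..4 bytes
        tag |= (nbytes - 1) << (2 * i)                  # two tag bits per value
        data += value.to_bytes(nbytes, "little")
    return bytes([tag]) + bytes(data)


def _build_table():
    """For each possible tag byte: (offset, mask) per value and the group length."""
    table = []
    for tag in range(256):
        entry = []
        offset = 1  # data subfields start right after the tag byte
        for i in range(4):
            nbytes = ((tag >> (2 * i)) & 0b11) + 1
            entry.append((offset, (1 << (8 * nbytes)) - 1))
            offset += nbytes
        table.append((tuple(entry), offset))
    return table


DECODE_TABLE = _build_table()


def decode_group(buf: bytes, pos: int = 0) -> Tuple[List[int], int]:
    """Unpack one group starting at pos; return (values, position after the group)."""
    entry, group_len = DECODE_TABLE[buf[pos]]
    values = []
    for offset, mask in entry:
        start = pos + offset
        # Read up to four bytes, then mask off bytes belonging to later subfields.
        word = int.from_bytes(buf[start:start + 4], "little")
        values.append(word & mask)
    return values, pos + group_len
```

Decoding needs no per-bit branching: the tag byte indexes a precomputed table, and each value is recovered with one read and one mask, which is the kind of saving the abstract attributes to a look-up table of offsets and masks.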

9 citations

Patent

Information retrieval based on historical data (original French title: Récupération d'information basée sur des données historiques)

Anurag Acharya, Simon Tong, Matt Cutts, Jeffrey Dean, Paul Haahr, Monika H. Henzinger, Urs Hoelzle, Steve Lawrence, Karl Pfleger, Olcan Sercinoglu

15 Sep 2004

TL;DR: A system (125) identifies a document and obtains one or more types of history data associated with the document.

Abstract: The present invention relates to a system (125) for identifying a document and obtaining one or more types of history data associated with the document. The system (125) may generate a score for the document based, at least in part, on the one or more types of history data.

Patent

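Identifying related information based on content and/or presenting related information in association with content-related advertisements (original French title: Identification d'informations connexes en fonction d'un contenu et/ou presentation d'informations connexes en association avec des annonces publicitaires liees au contenu)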

Jeffrey Dean, Krishna Bharat, Paul T. Buchheit

27 Feb 2004

TL;DR: The usefulness of content (target content), such as advertisements, can be increased by determining additional content and providing it in association with the target content.

Abstract: According to the invention, the usefulness of content (target content), such as advertisements, can be increased by determining additional content and providing that additional content in association with the target content. The target content can be text, a Web page, a URL, a search query, etc. The additional content can take the form of related suggested queries (such as "Try a search for …"), news articles (or excerpts or summaries thereof), reviews (or excerpts or summaries thereof), advertisements, user group messages, etc.

Patent

Information retrieval based on historical data (original German title: Informationsabruf auf der Basis von historischen Daten)

Anurag Acharya, Simon Tong, Matt Cutts, Jeffrey Dean, Paul Haahr, Monika H. Henzinger, Urs Hoelzle, Steve Lawrence, Karl Pfleger, Olcan Sercinoglu

15 Sep 2004

Patent

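Identifying information related to specific content and/or presenting related information in association with content-related advertisements (original German title: Identifizierung von auf einen bestimmten inhalt bezogenen informationen und/oder präsentation von bezugsinformationen in zusammenhang mit inhaltsbezogenen werbungen)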

Jeffrey Dean, Krishna Bharat, Paul T. Buchheit

27 Feb 2004