
Showing papers by "Jeffrey Dean" published in 2004


Journal Article
Jeffrey Dean1, Sanjay Ghemawat1
06 Dec 2004
TL;DR: MapReduce is a programming model and an associated implementation for processing and generating large data sets; the implementation runs on a large cluster of commodity machines and is highly scalable.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

20,309 citations
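
The map/reduce split described in the abstract above can be illustrated with a word count. This is a minimal single-process sketch, not the paper's implementation: `run_mapreduce`, `map_word_count`, and `reduce_word_count` are hypothetical names, and the partitioning, scheduling, and failure handling that the run-time system provides are left out entirely.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_word_count(doc_name: str, text: str) -> Iterator[Tuple[str, int]]:
    """User-supplied map function: emit an intermediate (word, 1) pair per word."""
    for word in text.lower().split():
        yield word, 1


def reduce_word_count(word: str, counts: Iterable[int]) -> int:
    """User-supplied reduce function: merge all values that share one key."""
    return sum(counts)


def run_mapreduce(
    inputs: Dict[str, str],
    map_fn: Callable[[str, str], Iterator[Tuple[str, int]]],
    reduce_fn: Callable[[str, Iterable[int]], int],
) -> Dict[str, int]:
    """Hypothetical sequential driver standing in for the distributed run-time."""
    # Map phase: apply map_fn to every input key/value pair and group the
    # intermediate values by intermediate key.
    intermediate: Dict[str, List[int]] = defaultdict(list)
    for key, value in inputs.items():
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    # Reduce phase: call reduce_fn once per distinct intermediate key.
    return {ikey: reduce_fn(ikey, values) for ikey, values in intermediate.items()}


documents = {
    "a.txt": "the quick brown fox",
    "b.txt": "the lazy dog and the fox",
}
word_counts = run_mapreduce(documents, map_word_count, reduce_word_count)
```

Because the user code only touches one key/value pair (map) or one key's values (reduce) at a time, a real run-time is free to spread those calls across machines without changing the two functions.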


Patent
15 Sep 2004
TL;DR: A system (125) identifies a document, obtains one or more types of history data associated with it, and may generate a score for the document based, at least in part, on that history data.
Abstract: A system (125) identifies a document and obtains one or more types of history data associated with the document. The system (125) may generate a score for the document based, at least in part, on the one or more types of history data.

418 citations


Patent
30 Jun 2004
TL;DR: A document request is tried against the client assistant's cache, then the server's cache, then the document's host, and finally a document repository; index entries can direct the server to obtain documents identified as fresh or stable from the repository rather than from the document's web host.
Abstract: Upon receipt of a document request, a client assistant examines its cache for the document. If not successful, a server searches for the requested document in its cache. If the server copy is still not fresh or not found, the server seeks the document from its host. If the host cannot provide the copy, the server seeks it from a document repository. Certain documents are identified from the document repository as being fresh or stable. Information about each of these identified documents is transmitted to the server which inserts entries into an index if the index does not already contain an entry for the document. If and when this particular document is requested, the document will not be present in the server, however the server will contain an entry directing the server to obtain the document from the document repository rather than the document's web host.

282 citations
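
The lookup order in the abstract above can be sketched as a chain of fallbacks. This is an illustrative sketch under simplifying assumptions: every cached copy is treated as fresh (the freshness check is omitted), and `fetch_document`, the dictionaries, and the `.example` URLs are all hypothetical names, not taken from the patent.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Fetcher = Callable[[str], Optional[str]]


def fetch_document(
    url: str,
    client_cache: Dict[str, str],
    server_cache: Dict[str, str],
    repository_index: Set[str],
    fetch_from_host: Fetcher,
    fetch_from_repository: Fetcher,
) -> Tuple[Optional[str], str]:
    """Return (document, source), trying each tier in the order the abstract lists."""
    if url in client_cache:                  # 1. client assistant's cache
        return client_cache[url], "client cache"
    if url in server_cache:                  # 2. server's cache
        return server_cache[url], "server cache"
    if url in repository_index:              # index entry: skip the web host
        doc = fetch_from_repository(url)
        if doc is not None:
            return doc, "repository"
    doc = fetch_from_host(url)               # 3. the document's host
    if doc is not None:
        return doc, "host"
    doc = fetch_from_repository(url)         # 4. repository as last resort
    if doc is not None:
        return doc, "repository"
    return None, "not found"


# Toy data for the demo (all hypothetical).
client_cache = {"http://a.example/": "A from client cache"}
server_cache = {"http://b.example/": "B from server cache"}
host_pages = {"http://e.example/": "E from host"}
repository_pages = {
    "http://c.example/": "C from repository",
    "http://d.example/": "D from repository",
}
repository_index = {"http://c.example/"}     # identified as fresh or stable
host_requests: List[str] = []                # records which URLs hit the host


def fetch_from_host(url: str) -> Optional[str]:
    host_requests.append(url)
    return host_pages.get(url)


def fetch_from_repository(url: str) -> Optional[str]:
    return repository_pages.get(url)


def lookup(url: str) -> Tuple[Optional[str], str]:
    return fetch_document(url, client_cache, server_cache, repository_index,
                          fetch_from_host, fetch_from_repository)
```

In this sketch the index entry changes only the order of the last two tiers: an indexed URL goes to the repository without contacting the web host, while an unindexed one reaches the repository only after the host fails.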


Patent
Jeffrey Dean1, Sanjay Ghemawat1
18 Jun 2004
TL;DR: A large-scale data processing system in which application-independent map modules apply an application-specific map operation to input data, automatically parallelized across multiple processors, and application-independent reduce modules apply an application-specific reduce operation to the stored intermediate data values to provide output data.
Abstract: A large-scale data processing system and method includes one or more application-independent map modules configured to read input data and to apply at least one application-specific map operation to the input data to produce intermediate data values, wherein the map operation is automatically parallelized across multiple processors in the parallel processing environment. A plurality of intermediate data structures are used to store the intermediate data values. One or more application-independent reduce modules are configured to retrieve the intermediate data values and to apply at least one application-specific reduce operation to the intermediate data values to provide output data.

193 citations


Patent
17 Jun 2004
TL;DR: A system generates a model based on feature data relating to different features of a link from a linking document to a linked document and on user behavior data relating to navigational actions associated with the link, and assigns a rank to a document based on the model.
Abstract: A system generates a model based on feature data relating to different features of a link from a linking document to a linked document and user behavior data relating to navigational actions associated with the link. The system also assigns a rank to a document based on the model.

110 citations


Patent
27 Feb 2004
TL;DR: The usefulness of content (target content), such as advertisements, may be increased by determining additional content and providing it in association with the target content, which may be text, a Web page, a URL, a search query, etc.
Abstract: The usefulness of content (target content), such as advertisements, may be increased by determining additional content and providing such additional content in association with the content. The target content (310) may be text, a Web page, a URL, a search query, etc. The additional content might be related suggested queries (e.g., "Try a search for…"), news articles (or excerpts or summaries thereof), reviews (or excerpts or summaries thereof), advertisements, user group messages, etc.

59 citations


Patent
13 Aug 2004
TL;DR: The disclosed embodiments enable multi-stage query scoring, including snippet generation, through incremental document reconstruction facilitated by a multi-tiered mapping scheme.
Abstract: The disclosed embodiments enable multi-stage query scoring, including “snippet” generation, through incremental document reconstruction facilitated by a multi-tiered mapping scheme. The mapping scheme includes a first mapping between unique tokens contained in a set of documents and unique global token identifiers (e.g., 32-bit integers) contained in a global-lexicon (i.e., dictionary). The mapping scheme also includes a second mapping between the global token identifiers and a set of fixed-length local token identifiers (e.g., 8-bit integers) contained in one or more mini-lexicons (i.e., sub-dictionaries). Each mini-lexicon is associated with a range of token positions in the tokenized documents. The first and second mappings are used to encode/decode documents into local token identifiers having fixed widths which can be compactly stored in the tokenspace repository. The use of fixed-length local token identifiers allows for fast and efficient decoding of tokenized documents.

37 citations
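
The two mappings in the abstract above (token to global identifier, then global identifier to a fixed-width local identifier in a per-range mini-lexicon) can be sketched as follows. This is a simplified illustration, not the patented encoding: it assumes each mini-lexicon covers a fixed block of `BLOCK_SIZE` token positions, and `encode`, `decode`, and `BLOCK_SIZE` are hypothetical names.

```python
from typing import Dict, List, Tuple

# Token positions covered by one mini-lexicon. Hypothetical and tiny for the
# demo; it must be at most 256 so every local identifier fits in 8 bits.
BLOCK_SIZE = 4

Block = Tuple[List[int], bytes]  # (mini-lexicon, fixed-width local ids)


def encode(tokens: List[str], global_lexicon: Dict[str, int]) -> List[Block]:
    """Encode tokens as 8-bit local ids, one mini-lexicon per block of positions."""
    blocks: List[Block] = []
    for start in range(0, len(tokens), BLOCK_SIZE):
        mini_lexicon: List[int] = []      # second mapping: local id -> global id
        local_of: Dict[int, int] = {}     # inverse, used only while encoding
        local_ids = bytearray()
        for token in tokens[start:start + BLOCK_SIZE]:
            # First mapping: unique token -> unique global identifier.
            global_id = global_lexicon.setdefault(token, len(global_lexicon))
            if global_id not in local_of:
                local_of[global_id] = len(mini_lexicon)
                mini_lexicon.append(global_id)
            local_ids.append(local_of[global_id])  # exactly one byte per token
        blocks.append((mini_lexicon, bytes(local_ids)))
    return blocks


def decode(blocks: List[Block], global_lexicon: Dict[str, int]) -> List[str]:
    """Reverse both mappings: local id -> global id -> token."""
    token_of = {gid: tok for tok, gid in global_lexicon.items()}
    return [token_of[mini[local]] for mini, local_ids in blocks for local in local_ids]


lexicon: Dict[str, int] = {}
tokens = "to be or not to be that is the question".split()
blocks = encode(tokens, lexicon)
decoded = decode(blocks, lexicon)
```

Since every local identifier occupies one byte, the token at position `p` lives at byte `p % BLOCK_SIZE` of block `p // BLOCK_SIZE`, which is the kind of direct addressing that fixed-width identifiers make possible.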


Patent
Jeffrey Dean1, Sanjay Ghemawat1
18 Jun 2004
TL;DR: A large-scale data processing system in which application-independent map modules apply an application-specific map operation to input data, automatically parallelized across multiple processors, and application-independent reduce modules apply an application-specific reduce operation to the stored intermediate data values to provide output data.
Abstract: A large-scale data processing system and method includes one or more application-independent map modules configured to read input data and to apply at least one application-specific map operation to the input data to produce intermediate data values, wherein the map operation is automatically parallelized across multiple processors in the parallel processing environment. A plurality of intermediate data structures are used to store the intermediate data values. One or more application-independent reduce modules are configured to retrieve the intermediate data values and to apply at least one application-specific reduce operation to the intermediate data values to provide output data.

23 citations


Patent
13 Aug 2004
TL;DR: Variable-length data values are stored in a data structure with a data field and a tag field; each tag subfield includes one or more tag bits which indicate the length of the data stored in the corresponding data subfield.
Abstract: A system and method for encoding and decoding variable-length data includes storing data values in a data structure including a data field and a tag field. The data field includes one or more variable-length data subfields capable of storing variable-length data (e.g., 1 to N bytes of data). In some embodiments, the data subfields and the tag field of the data structure each start on a byte boundary, which simplifies decoding. The tag field includes one or more tag subfields, each corresponding to one of the data subfields. Each tag subfield includes one or more tag bits which indicate the length of the data stored in the corresponding data subfield. Unpacking or decompressing data values from the data structure can be achieved by using a look-up table of offsets and masks, thus reducing the number of bit operations needed to unpack data values from the data structure.

9 citations
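
One possible layout matching the abstract above is a one-byte tag field followed by four byte-aligned data subfields of 1 to 4 bytes each. The sketch below assumes that specific layout (four values per group, two tag bits per value, little-endian data); those choices and the names `encode_group`, `decode_group`, and `DECODE_TABLE` are illustrative assumptions, not details taken from the patent.

```python
from typing import List, Tuple


def _byte_length(value: int) -> int:
    """Number of bytes (1 to 4) needed to store an unsigned 32-bit value."""
    return max(1, (value.bit_length() + 7) // 8)


def encode_group(values: List[int]) -> bytes:
    """Pack four values as one tag byte followed by the data subfields."""
    if len(values) != 4 or any(not 0 <= v < 2 ** 32 for v in values):
        raise ValueError("expected four unsigned 32-bit values")
    tag = 0
    data = bytearray()
    for i, value in enumerate(values):
        n = _byte_length(value)
        tag |= (n - 1) << (2 * i)          # 2-bit tag subfield stores length - 1
        data += value.to_bytes(n, "little")
    return bytes([tag]) + bytes(data)


def _build_table() -> List[Tuple[Tuple[Tuple[int, int], ...], int]]:
    """For each possible tag byte, precompute (offset, mask) pairs and group size."""
    table = []
    for tag in range(256):
        offset = 1                          # data starts right after the tag byte
        entries = []
        for i in range(4):
            n = ((tag >> (2 * i)) & 0b11) + 1
            entries.append((offset, (1 << (8 * n)) - 1))
            offset += n
        table.append((tuple(entries), offset))
    return table


DECODE_TABLE = _build_table()


def decode_group(buf: bytes, pos: int = 0) -> Tuple[List[int], int]:
    """Unpack one group starting at pos; return (values, position of next group)."""
    entries, size = DECODE_TABLE[buf[pos]]
    values = []
    for offset, mask in entries:
        # Read up to four bytes at the precomputed offset, then mask off the
        # bytes that belong to the following subfields.
        word = int.from_bytes(buf[pos + offset:pos + offset + 4], "little")
        values.append(word & mask)
    return values, pos + size


sample = [1, 300, 70000, 2 ** 32 - 1]
encoded = encode_group(sample)
decoded_values, next_pos = decode_group(encoded)
```

The decoder never inspects individual tag bits: a single table lookup on the tag byte yields every offset and mask, which is the reduction in bit operations the abstract describes.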


Patent
15 Sep 2004
TL;DR: A system (125) identifies a data document and obtains one or more types of history data associated with the document.
Abstract: The present invention relates to a system (125) for identifying a data document and obtaining one or more types of history data associated with the document. The system (125) may generate a score for the document based, at least in part, on the one or more types of history data.

Patent
27 Feb 2004
TL;DR: The usefulness of content (target content), such as advertisements, may be increased by determining additional content and providing it in association with the target content.
Abstract: According to the invention, the usefulness of content (target content), such as advertisements, may be increased by determining additional content and providing this additional content in association with said content. The target content may be text, a Web page, a URL, a search query, etc. The additional content may take the form of related suggested queries (such as 'Try searching for '), news articles (or excerpts or summaries thereof), reviews (or excerpts or summaries thereof), advertisements, user group messages, etc.