Showing papers by "Google" published in 2004


Journal ArticleDOI
Jeffrey Dean1, Sanjay Ghemawat1
06 Dec 2004
TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

20,309 citations
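
The programming model in the MapReduce abstract above can be illustrated with a word count. This is a minimal single-process sketch, not the paper's implementation: `run_mapreduce` is a hypothetical stand-in for the run-time system, and it performs no partitioning, scheduling, or failure handling.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def map_word_count(doc_name: str, text: str) -> Iterator[Tuple[str, int]]:
    """Map: emit an intermediate (word, 1) pair for every word in the document."""
    for word in text.lower().split():
        yield word, 1


def reduce_word_count(word: str, counts: Iterable[int]) -> int:
    """Reduce: merge all intermediate values that share the same key."""
    return sum(counts)


def run_mapreduce(inputs: Dict[str, str], map_fn, reduce_fn) -> Dict[str, int]:
    """Hypothetical stand-in for the runtime: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs.items():
        for intermediate_key, intermediate_value in map_fn(key, value):
            groups[intermediate_key].append(intermediate_value)
    return {k: reduce_fn(k, values) for k, values in groups.items()}


docs = {"a.txt": "the quick fox", "b.txt": "the lazy dog the end"}
word_counts = run_mapreduce(docs, map_word_count, reduce_word_count)
```

In a real deployment the grouping step is where the intermediate data would be shuffled between machines; here it is an in-memory dictionary.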


Patent
12 Mar 2004
TL;DR: In this paper, the authors present methods and apparatuses for targeting the delivery of advertisements over a network such as the Internet, and the use of advertisements is tracked to permit targeting of the advertisements of individual users.
Abstract: Methods and apparatuses for targeting the delivery of advertisements over a network such as the Internet are disclosed. Statistics are compiled on individual users and networks and the use of the advertisements is tracked to permit targeting of the advertisements of individual users. In response to requests from affiliated sites, an advertising server transmits to people accessing the page of a site an appropriate one of the advertisements based upon profiling of users and networks.

2,131 citations


Journal ArticleDOI
TL;DR: A phrase-based statistical machine translation approach, the alignment template approach, is described, which allows for general many-to-many relations between words and is easier to extend than classical statistical machine translation systems.
Abstract: A phrase-based statistical machine translation approach — the alignment template approach — is described. This translation approach allows for general many-to-many relations between words. Thereby, the context of words is taken into account in the translation model, and local changes in word order from source to target language can be learned explicitly. The model is described using a log-linear modeling approach, which is a generalization of the often used source–channel approach. Thereby, the model is easier to extend than classical statistical machine translation systems. We describe in detail the process for learning phrasal translations, the feature functions used, and the search algorithm. The evaluation of this approach is performed on three different tasks. For the German–English speech VERBMOBIL task, we analyze the effect of various system components. On the French–English Canadian HANSARDS task, the alignment template system obtains significantly better results than a single-word-based translation model. In the Chinese–English 2002 National Institute of Standards and Technology (NIST) machine translation evaluation it yields statistically significantly better NIST scores than all competing research and commercial translation systems.

1,031 citations


Proceedings Article
Niels Provos1
13 Aug 2004
TL;DR: Honeyd is presented, a framework for virtual honeypots that simulates virtual computer systems at the network level and shows how the Honeyd framework helps in many areas of system security, e.g. detecting and disabling worms, distracting adversaries, or preventing the spread of spam email.
Abstract: A honeypot is a closely monitored network decoy serving several purposes: it can distract adversaries from more valuable machines on a network, provide early warning about new attack and exploitation trends, or allow in-depth examination of adversaries during and after exploitation of a honeypot. Deploying a physical honeypot is often time intensive and expensive as different operating systems require specialized hardware and every honeypot requires its own physical system. This paper presents Honeyd, a framework for virtual honeypots that simulates virtual computer systems at the network level. The simulated computer systems appear to run on unallocated network addresses. To deceive network fingerprinting tools, Honeyd simulates the networking stack of different operating systems and can provide arbitrary routing topologies and services for an arbitrary number of virtual systems. This paper discusses Honeyd's design and shows how the Honeyd framework helps in many areas of system security, e.g. detecting and disabling worms, distracting adversaries, or preventing the spread of spam email.

729 citations


Patent
22 Jun 2004
TL;DR: In this paper, a search system monitors the input of a search query by a user and, before the user finishes entering it, sends a portion of the query as a partial query to the search engine, which returns predicted queries for possible selection.
Abstract: A search system monitors the input of a search query by a user. Before the user finishes entering the search query, the search system identifies and sends a portion of the query as a partial query to the search engine. Based on the partial query, the search engine creates a set of predicted queries. This process may take into account prior queries submitted by a community of users, and may take into account a user profile. The predicted queries are sent back to the user for possible selection. The search system may also cache search results corresponding to one or more of the predicted queries in anticipation of the user selecting one of the predicted queries. The search engine may also return at least a portion of the search results corresponding to one or more of the predicted queries.

545 citations


Patent
23 Aug 2004
TL;DR: In this article, location information is determined (or simply accepted) and used, for example, in a relevancy (350) determination of an ad, or is associated (380) with price information such as a maximum price bid.
Abstract: The usefulness, and consequently the performance, of advertisements (330) are improved by allowing businesses to better target their ads to a responsive audience. Location information is determined (or simply accepted) and used. For example, location information may be used in a relevancy (350) determination of an ad. As another example, location information (380) may be associated with price information, such as a maximum price bid. Such location information may be associated with ad performance information. Ad performance information may be tracked on the basis of location information. The content of an ad creative, and/or of a landing page may be selected and/or modified using location information. Finally, tools, such as user interfaces, may be provided to allow a business to enter and/or modify location information such as location information used for targeting and location-dependent price information. The location information (380) used to target and/or score (350) ads may include or define an area. The area may be defined by at least one geographic reference point (e.g., defined by latitude and longitude coordinates) and perhaps additional information. Thus, the area may be a circle defined by a geographic reference point and a radius, an ellipse defined by two geographic reference points and a distance sum, or a polygon defined by three or more geographic reference points.

466 citations
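
The three area shapes named in the location-targeting abstract above (circle, ellipse, polygon) each reduce to a simple containment test. The sketch below is illustrative only and assumes points have already been projected to planar (x, y) coordinates; the function names are hypothetical, and a real system working with latitude and longitude would need a geodesic distance instead of `math.dist`.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # planar (x, y); an assumption of this sketch


def in_circle(p: Point, center: Point, radius: float) -> bool:
    """Circle: one reference point plus a radius."""
    return math.dist(p, center) <= radius


def in_ellipse(p: Point, focus_a: Point, focus_b: Point, distance_sum: float) -> bool:
    """Ellipse: two reference points (foci) plus a bound on the summed distances."""
    return math.dist(p, focus_a) + math.dist(p, focus_b) <= distance_sum


def in_polygon(p: Point, vertices: List[Point]) -> bool:
    """Polygon: three or more reference points, tested by ray casting."""
    x, y = p
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Only edges that straddle the horizontal line through p can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
```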


Journal ArticleDOI
TL;DR: It is pointed out that Kleinberg's "hub and authority" method to identify web-pages relevant to a given query can be viewed as a special case of the definition in the case where one of the graphs has two vertices and a unique directed edge between them.
Abstract: We introduce a concept of similarity between vertices of directed graphs. Let $G_A$ and $G_B$ be two directed graphs with, respectively, $n_A$ and $n_B$ vertices. We define an $n_B \times n_A$ similarity matrix $S$ whose real entry $s_{ij}$ expresses how similar vertex $j$ (in $G_A$) is to vertex $i$ (in $G_B$): we say that $s_{ij}$ is their similarity score. The similarity matrix can be obtained as the limit of the normalized even iterates of $S_{k+1} = B S_k A^T + B^T S_k A$, where $A$ and $B$ are adjacency matrices of the graphs and $S_0$ is a matrix whose entries are all equal to 1. In the special case where $G_A = G_B = G$, the matrix $S$ is square and the score $s_{ij}$ is the similarity score between the vertices $i$ and $j$ of $G$. We point out that Kleinberg's "hub and authority" method to identify web-pages relevant to a given query can be viewed as a special case of our definition in the case where one of the graphs has two vertices and a unique directed edge between them. In analogy to Kleinberg, we show that our similarity scores are given by the components of a dominant eigenvector of a nonnegative matrix. Potential applications of our similarity concept are numerous. We illustrate an application for the automatic extraction of synonyms in a monolingual dictionary.

436 citations
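
The iteration in the graph-similarity abstract above, S_{k+1} = B S_k A^T + B^T S_k A starting from an all-ones S_0, can be sketched in plain Python. This is an illustration under assumptions, not the authors' code: the Frobenius norm is assumed for the normalization, a fixed iteration count replaces a convergence test, and all function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def transpose(x: Matrix) -> Matrix:
    return [list(row) for row in zip(*x)]


def add(x: Matrix, y: Matrix) -> Matrix:
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def normalize(x: Matrix) -> Matrix:
    """Scale to unit Frobenius norm (assumed choice of norm)."""
    norm = math.sqrt(sum(v * v for row in x for v in row))
    return x if norm == 0 else [[v / norm for v in row] for row in x]


def similarity_matrix(a: Matrix, b: Matrix, iterations: int = 50) -> Matrix:
    """Return an n_B x n_A matrix after an even number of update steps."""
    s = [[1.0] * len(a) for _ in range(len(b))]
    a_t, b_t = transpose(a), transpose(b)
    for _ in range(2 * iterations):  # 2 * iterations keeps the step count even
        s = add(matmul(matmul(b, s), a_t), matmul(matmul(b_t, s), a))
        s = normalize(s)
    return s


# Special case from the abstract: one graph has two vertices and one edge.
hub_authority = [[0.0, 1.0], [0.0, 0.0]]
path = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]  # 1 -> 2 -> 3
scores = similarity_matrix(hub_authority, path)
```

With the two-vertex graph as `a`, the first column of `scores` behaves like hub scores and the second like authority scores for the vertices of `path`.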


Patent
15 Sep 2004
TL;DR: In this paper, a system (125) identifies a document and obtains one or more types of history data associated with the document, and generates a score for the document based on at least part of the history data.
Abstract: A system (125) identifies a document and obtains one or more types of history data associated with the document. The system (125) may generate a score for the document based, at least in part, on the one or more types of history data.

418 citations


Journal ArticleDOI
08 Jun 2004
TL;DR: A metric is studied between labelled Markov processes that has the property that processes are at zero distance if and only if they are bisimilar and is related, in spirit, to the Hutchinson metric.
Abstract: The notion of process equivalence of probabilistic processes is sensitive to the exact probabilities of transitions. Thus, a slight change in the transition probabilities will result in two equivalent processes being deemed no longer equivalent. This instability is due to the quantitative nature of probabilistic processes. In a situation where the process behavior has a quantitative aspect there should be a more robust approach to process equivalence. This paper studies a metric between labelled Markov processes. This metric has the property that processes are at zero distance if and only if they are bisimilar. The metric is inspired by earlier work on logics for characterizing bisimulation and is related, in spirit, to the Hutchinson metric.

364 citations


Patent
Kevin A. Gibbs1
11 Nov 2004
TL;DR: In this paper, a set of ordered predicted completion strings are presented to a user as the user enters text in a text entry box (e.g., a browser or a toolbar).
Abstract: A set of ordered predicted completion strings is presented to a user as the user enters text in a text entry box (e.g., a browser or a toolbar). The predicted completion strings can be in the form of URLs or query strings. The ordering may be based on any number of factors (e.g., a query's frequency of submission from a community of users). URLs can be ranked based on an importance value of the URL. Privacy is taken into account in a number of ways, such as using a previously submitted query only when more than a certain number of unique requestors have made the query. The set of ordered predicted completion strings is obtained by matching a fingerprint value of the user's entry string to a fingerprint-to-table map that contains the set of ordered predicted completion strings.

303 citations
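
The lookup described in the completion-string abstract above can be sketched as a table keyed by a fingerprint of each prefix. Everything specific here is an assumption made for illustration: the truncated SHA-1 fingerprint, the threshold value `MIN_UNIQUE_REQUESTORS`, the frequency-only ordering, and the function names are not taken from the patent.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

MIN_UNIQUE_REQUESTORS = 2  # hypothetical privacy threshold


def fingerprint(entry: str) -> str:
    """Hypothetical fingerprint: a truncated SHA-1 digest of the entry string."""
    return hashlib.sha1(entry.encode("utf-8")).hexdigest()[:16]


def build_completion_table(
    query_log: Iterable[Tuple[str, str]], max_results: int = 3
) -> Dict[str, List[str]]:
    """Build a fingerprint -> ordered completions map from (requestor, query) pairs."""
    requestors = defaultdict(set)
    frequency = defaultdict(int)
    for requestor_id, query in query_log:
        requestors[query].add(requestor_id)
        frequency[query] += 1

    table = defaultdict(list)
    for query, users in requestors.items():
        # Privacy: use a query only when more than the threshold of unique
        # requestors have submitted it.
        if len(users) <= MIN_UNIQUE_REQUESTORS:
            continue
        for end in range(1, len(query) + 1):
            table[fingerprint(query[:end])].append(query)

    for completions in table.values():
        completions.sort(key=lambda q: (-frequency[q], q))  # most frequent first
        del completions[max_results:]
    return dict(table)


def predict(table: Dict[str, List[str]], entry: str) -> List[str]:
    return table.get(fingerprint(entry), [])


log = (
    [(u, "maps") for u in ("u1", "u2", "u3")]
    + [(u, "mail") for u in ("u1", "u2", "u3", "u4")]
    + [("u1", "my secret")] * 5
)
completion_table = build_completion_table(log)
```

The query submitted five times by a single requestor never appears in the table, which is the effect the unique-requestor rule is meant to have.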


Patent
30 Jun 2004
TL;DR: In this article, a client assistant examines its cache for the requested document, then a server searches its own cache and the document's host, and if the host cannot provide the copy, the server seeks it from a document repository.
Abstract: Upon receipt of a document request, a client assistant examines its cache for the document. If not successful, a server searches for the requested document in its cache. If the server copy is still not fresh or not found, the server seeks the document from its host. If the host cannot provide the copy, the server seeks it from a document repository. Certain documents are identified from the document repository as being fresh or stable. Information about each of these identified documents is transmitted to the server which inserts entries into an index if the index does not already contain an entry for the document. If and when this particular document is requested, the document will not be present in the server, however the server will contain an entry directing the server to obtain the document from the document repository rather than the document's web host.

Patent
Jason B. Liebman1, Krishna Bharat1
30 Jun 2004
TL;DR: In this paper, the authors present methods and systems for requesting and providing information in a social network, which can include outputting an information request interface, which provides a user with the ability to request information from at least one member of a network associated with the user.
Abstract: The present invention relates to methods and systems for requesting and providing information in a social network. A method can comprise outputting an information request interface, which can provide a user with the ability to request information from at least one member of a social network associated with the user. One or more members of the social network can be notified of the user's information request and can provide, or assist in providing, the requested information to the user.

Patent
27 Jul 2004
TL;DR: In this paper, the authors propose to use relay resources to increase the quality of service (QoS) for the facilitated communication of a remote unit that is already within reception range of a base station.
Abstract: Communications sourced by a remote unit (14) that is already within reception range of a base site (10) can nevertheless be further facilitated through allocation of one or more relay resources (15, 16). Such relay resources, properly employed, then serve to effectively increase the quality of service for the facilitated communication. This, in turn, can permit the use of, for example, increased data rates for communications from a relatively low power remote unit.

Patent
Simon Tong1
08 Jun 2004
TL;DR: A search engine's decision component determines whether documents returned in response to a user search query are likely to be very relevant to the query; links to such documents may be displayed with visual cues that assist the user in browsing the links.
Abstract: A search engine includes a decision component that determines whether documents that are returned in response to a user search query are likely to be very relevant to the search query. Links that refer to documents that the search engine determines to likely be very relevant may be displayed with visual cues that assist the user in browsing the links. The decision component may base its decision on a number of parameters, including: (1) the position of the document in a ranked list of search results, (2) the click through rate of the document, (3) relevance scores for the document and other documents that are returned as hits in response to the search query, and (4) whether the document is classified as a pornographic document (the search engine may refrain from showing visual cues for potentially pornographic documents).

Patent
05 Aug 2004
TL;DR: In this paper, a system, method and computer program product for presenting an advertisement is described, where a request to access a web page may be received from a requester via a network.
Abstract: A system, method and computer program product for presenting an advertisement is described. A request to access a web page may be received from a requester via a network. The request may be generated in response to selection of a link to the web page on another web page. A response may be transmitted back to the requester. The response may include the requested web page as well as an ad script that may be executed after receipt of the response by the requester. The ad script may generate an ad request that includes one or more ad parameters extracted from the response. These ad parameters may include information about a network address of the other page. The generated ad request may then be received via the network. One or more advertisements may then be selected for presentment to the requester utilizing the ad parameters of the ad request.

Patent
David Bau1
30 Sep 2004
TL;DR: In this paper, the authors use navigation history information, and perhaps information about a current document, to determine content-relevant and personalized ads, which are more interesting and relevant to a current user interest inferred from their (recent) navigation.
Abstract: Ads better targeted to individual users can be determined by using (recent) navigation history. User navigation (e.g., Web browsing) may be tracked, recorded and maintained. The navigation history information, and perhaps information about a current document, may be used to determine content-relevant and personalized ads. By doing so the ads seen by the user are more interesting and relevant to a current user interest inferred from their (recent) navigation.

Patent
Ross Koningstein1
03 Nov 2004
TL;DR: A computer-implemented method and system for advertising that performs the steps of delivering an electronic advertisement comprising one or more menu options and a reference to a network location for retrieving specified content associated with each menu option for inclusion in a first electronic document is described in this paper.
Abstract: A computer-implemented method and system for advertising that performs the steps of delivering an electronic advertisement comprising one or more menu options and a reference to a network location for retrieving specified content associated with each menu option for inclusion in a first electronic document, receiving a selection of one or more menu options from the electronic advertisement and delivering a subsequent accessible document including content from the referenced network location associated with the menu option selected, the subsequent accessible document including the electronic advertisement.

Patent
Jeffrey Dean1, Sanjay Ghemawat1
18 Jun 2004
TL;DR: In this paper, a large-scale data processing system and method includes one or more application-independent map modules configured to read input data and to apply at least one application-specific map operation to the input data to produce intermediate data values, wherein the map operation is automatically parallelized across multiple processors in the parallel processing environment.
Abstract: A large-scale data processing system and method includes one or more application-independent map modules configured to read input data and to apply at least one application-specific map operation to the input data to produce intermediate data values, wherein the map operation is automatically parallelized across multiple processors in the parallel processing environment. A plurality of intermediate data structures are used to store the intermediate data values. One or more application-independent reduce modules are configured to retrieve the intermediate data values and to apply at least one application-specific reduce operation to the intermediate data values to provide output data.

Journal ArticleDOI
TL;DR: Improved the efficiency of the combinatorial search, and extended the range of the graph database to nT = 7 for some high mirror symmetry space groups, and implemented a more sophisticated Monte Carlo strategy for imbedding graphs in real space as an SiO2 composition.

Patent
Jan Matthias Ruhl1, Mayur Datar1
14 Dec 2004
TL;DR: In this article, a graphical user interface on a computer that includes a plurality of portions of reviews for a product and a search input area for entering search terms to search for reviews of the product that contain the search terms is presented.
Abstract: The embodiments disclosed herein include new, more efficient ways to collect product reviews from the Internet, aggregate reviews for the same product, and provide an aggregated review to end users in a searchable format. One aspect of the invention is a graphical user interface on a computer that includes a plurality of portions of reviews for a product and a search input area for entering search terms to search for reviews of the product that contain the search terms.

Proceedings Article
01 Dec 2004
TL;DR: This paper provides confidence intervals for the AUC based on a statistical and combinatorial analysis using only simple parameters such as the error rate and the number of positive and negative examples, which can be viewed as the equivalent for AUC of the standard confidence intervals given in the case of the error rate.
Abstract: In many applications, good ranking is a highly desirable performance for a classifier. The criterion commonly used to measure the ranking quality of a classification algorithm is the area under the ROC curve (AUC). To report it properly, it is crucial to determine an interval of confidence for its value. This paper provides confidence intervals for the AUC based on a statistical and combinatorial analysis using only simple parameters such as the error rate and the number of positive and negative examples. The analysis is distribution-independent: it makes no assumption about the distribution of the scores of negative or positive examples. The results are of practical use and can be viewed as the equivalent for AUC of the standard confidence intervals given in the case of the error rate. They are compared with previous approaches in several standard classification tasks demonstrating the benefits of our analysis.
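
For readers unfamiliar with the quantity being bounded, the AUC itself equals the fraction of (positive, negative) example pairs that the classifier ranks correctly, with ties counted as half. The sketch below computes only that statistic; it does not reproduce the paper's confidence intervals, and the function name is hypothetical.

```python
from typing import Sequence


def auc(positive_scores: Sequence[float], negative_scores: Sequence[float]) -> float:
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pairs = len(positive_scores) * len(negative_scores)
    if pairs == 0:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / pairs
```

This quadratic loop is the clearest form; sorting the scores once gives the same value in O(n log n) for large test sets.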

Patent
Krishna Bharat1
30 Jun 2004
TL;DR: In this paper, the authors propose to resolve ambiguities with respect to a user's topic interest by monitoring user behavior, determining a user topic interest (e.g., from a plurality of different candidate topics) based on the monitored behavior, and serving ads relevant to the determined topic interest.
Abstract: Ambiguities with respect to a user topic interest may be resolved so that useful topic-relevant ads can be presented. Such ambiguities may be resolved by monitoring user behavior, determining a user topic interest (e.g., from a plurality of different candidate topics) based on the monitored behavior, and serving ads relevant to the determined user topic interest.

Patent
31 Mar 2004
TL;DR: In this article, techniques are disclosed that locate implicitly defined semantic structures in a document, such as, for example, implicitly defined lists in an HTML document, which can be used in the calculation of distance values between terms in the documents.
Abstract: Techniques are disclosed that locate implicitly defined semantic structures in a document, such as, for example, implicitly defined lists in an HTML document. The semantic structures can be used in the calculation of distance values between terms in the documents. The distance values may be used, for example, in the generation of ranking scores that indicate a relevance level of the document to a search query.

Patent
11 Mar 2004
TL;DR: In this article, a method and apparatus for multi-antenna transmission in a multiple-input, multiple-output (MIMO) communication system is presented that reduces the number of transmit weight matrices fed back to the transmitter by applying each transmit weight matrix to a plurality of subcarriers.
Abstract: A method and apparatus for multi-antenna transmission in a multiple-input, multiple-output (MIMO) communication system. In accordance with the preferred embodiment of the present invention, a reduced number of transmit weight matrices are fed back to the transmitter. Each transmit weight matrix is then applied to a plurality of subcarriers. Because each transmit weight matrix is applied to more than one subcarrier, the number of weight matrices being fed back to the transmitter is greatly reduced.

Patent
29 Dec 2004
TL;DR: In this article, a method for direct navigation to and/or highlighting a specific portion of a target document such as a query-relevant portion of the document is presented, where the client browser can have an artificial anchor module installed to execute the instruction to navigate directly to and optionally highlight the intra-document portion within the target document.
Abstract: Systems and methods for direct navigation to and/or highlighting a specific portion of a target document such as a query-relevant portion of the document are disclosed. The method may include generating a search result link to a search result document and generating an instruction to a client document browser to navigate directly to an intra-document portion related to the query within the search result document. The search result may include a snippet extracted from the search result document such that the instruction causes navigation directly to at least a portion of the snippet. The instruction may be an artificial anchor undefined in the search result document, e.g., designated by a preassigned artificial anchor designator. The client browser may have an artificial anchor module installed to execute the instruction to navigate directly to and optionally highlight the intra-document portion within the target document in response to the document link being selected.

Patent
Simon Tong1, Mark Pearson1, Sergey Brin1
10 Sep 2004
TL;DR: In this paper, a search query is received, a related query related to the search query is determined, an article (such as a web page) associated with the search query is determined, and a ranking score for the article based at least in part on data associated with the related query is determined.
Abstract: Systems and methods that improve search rankings for a search query by using data associated with queries related to the search query are described. In one aspect, a search query is received, a related query related to the search query is determined, an article (such as a web page) associated with the search query is determined, and a ranking score for the article based at least in part on data associated with the related query is determined. Several algorithms and types of data associated with related queries useful in carrying out such systems and methods are described.

Patent
22 Nov 2004
TL;DR: In this paper, a search engine implements a method comprising receiving a search query, determining a personalized result by searching a personalized search object using the search query, determining a general result by searching a general search object using the search query, and providing a search result for the query based at least in part on the personalized result and the general result.
Abstract: Systems and methods for personalized network searching are described. A search engine implements a method comprising receiving a search query, determining a personalized result by searching a personalized search object using the search query, determining a general result by searching a general search object using the search query, and providing a search result for the search query based at least in part on the personalized result and the general result. The search engine may utilize ratings or annotations associated with the previously identified uniform resource locator to locate and sort results.

Patent
26 Aug 2004
TL;DR: In this paper, the authors present a method comprising outputting a ratings interface for rating at least one member of a social network associated with a user, wherein the rating interface provides the user with the ability to rate the member in one or more categories, receiving ratings for the member from the user, associating the ratings with the member, and connecting the ratings of the member with the user.
Abstract: Systems and methods for rating associated members in a social network are set forth. According to one embodiment a method comprising outputting a ratings interface for rating at least one member of a social network associated with a user, wherein the rating interface provides the user with the ability to rate the member in one or more categories, receiving ratings for the member from the user, associating the ratings with the member, and connecting the ratings for the member with the user is set forth.

Proceedings ArticleDOI
Marius Pasca1
13 Nov 2004
TL;DR: The method applies lightweight lexico-syntactic extraction patterns to the unstructured text of Web documents and does not impose any a-priori restriction on the categories of extracted names.
Abstract: The recognition of names and their associated categories within unstructured text traditionally relies on semantic lexicons and gazetteers. The amount of effort required to assemble large lexicons confines the recognition to either a limited domain (e.g., medical imaging), or a small set of pre-defined, broader categories of interest (e.g., persons, countries, organizations, products). This constitutes a serious limitation in an information seeking context. In this case, the categories of potential interest to users are more diverse (universities, agencies, retailers, celebrities), often refined (e.g., SLR digital cameras, programming languages, multinational oil companies), and usually overlapping (e.g., the same entity may be concurrently a brand name, a technology company, and an industry leader). We present a lightly supervised method for acquiring named entities in arbitrary categories. The method applies lightweight lexico-syntactic extraction patterns to the unstructured text of Web documents. The method is a departure from traditional approaches to named entity recognition in that: 1) it does not require any start-up seed names or training; 2) it does not encode any domain knowledge in its extraction patterns; 3) it is only lightly supervised, and data-driven; 4) it does not impose any a-priori restriction on the categories of extracted names. We illustrate applications of the method in Web search, and describe experiments on 500 million Web documents and news articles.
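
To make "lightweight lexico-syntactic extraction patterns" in the abstract above concrete, here is a sketch built on a single "CATEGORY such as NAME, NAME and NAME" pattern. The abstract does not list the paper's actual patterns, so this pattern, the regular expressions, and the function name are assumptions chosen for illustration; the category is crudely taken to be the one or two lowercase words directly before "such as".

```python
import re
from typing import List, Tuple

# A capitalized name of one or more words, e.g. "Python" or "New York".
NAME = r"[A-Z][\w&-]*(?: [A-Z][\w&-]*)*"

# Hypothetical pattern: "<category> such as <Name>, <Name> and <Name>".
PATTERN = re.compile(
    r"\b(?P<category>[a-z]+(?: [a-z]+)?) such as "
    r"(?P<names>" + NAME + r"(?:, " + NAME + r")*(?:,? (?:and|or) " + NAME + r")?)"
)


def extract_named_entities(text: str) -> List[Tuple[str, str]]:
    """Return (name, category) pairs found by the pattern, in text order."""
    pairs = []
    for match in PATTERN.finditer(text):
        category = match.group("category")
        for name in re.split(r",? (?:and|or) |, ", match.group("names")):
            pairs.append((name, category))
    return pairs


sample = (
    "She compared programming languages such as Python, Haskell and Rust. "
    "Large retailers such as Walmart or Target compete."
)
entities = extract_named_entities(sample)
```

Note that the categories ("programming languages", "retailers") come from the text itself rather than from a fixed list, which is the property the abstract emphasizes.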

Patent
31 Mar 2004
TL;DR: In this paper, a query system receives a user context attribute and generates a plurality of implicit search queries based at least in part on the user context attributes, and then combines the results for display to a user.
Abstract: Systems and methods for generating multiple implicit search queries are described. In one described system, a query system receives a user context attribute and generates a plurality of implicit search queries based at least in part on the user context attribute. The query system then receives result sets associated with each of the plurality of implicit search queries and combines the results for display to a user.