
Showing papers by "Google" published in 2007


Journal ArticleDOI
Luiz Andre Barroso1, Urs Hölzle1
TL;DR: Energy-proportional designs would enable large energy savings in servers, potentially doubling their efficiency in real-life use; achieving this will require significant improvements in the energy usage profile of every system component, particularly the memory and disk subsystems.
Abstract: Energy-proportional designs would enable large energy savings in servers, potentially doubling their efficiency in real-life use. Achieving energy proportionality will require significant improvements in the energy usage profile of every system component, particularly the memory and disk subsystems.

2,499 citations


Proceedings ArticleDOI
28 Oct 2007
TL;DR: The provable data possession (PDP) model as discussed by the authors allows a client that has stored data at an untrusted server to verify that the server possesses the original data without retrieving it.
Abstract: We introduce a model for provable data possession (PDP) that allows a client that has stored data at an untrusted server to verify that the server possesses the original data without retrieving it. The model generates probabilistic proofs of possession by sampling random sets of blocks from the server, which drastically reduces I/O costs. The client maintains a constant amount of metadata to verify the proof. The challenge/response protocol transmits a small, constant amount of data, which minimizes network communication. Thus, the PDP model for remote data checking supports large data sets in widely-distributed storage systems. We present two provably-secure PDP schemes that are more efficient than previous solutions, even when compared with schemes that achieve weaker guarantees. In particular, the overhead at the server is low (or even constant), as opposed to linear in the size of the data. Experiments using our implementation verify the practicality of PDP and reveal that the performance of PDP is bounded by disk I/O and not by cryptographic computation.

2,238 citations
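
The sampling argument above admits a simple back-of-the-envelope check. The sketch below (plain Python, not the paper's cryptographic scheme) computes the probability that spot-checking c randomly chosen blocks catches at least one of t corrupted blocks among n; the 10,000 / 1% / 460 figures are illustrative.

```python
# Sketch (not the paper's scheme): probability that spot-checking c randomly
# chosen blocks out of n detects at least one of t corrupted blocks.
def detection_probability(n: int, t: int, c: int) -> float:
    """P(at least one sampled block is corrupted) when sampling without replacement."""
    p_miss = 1.0
    for i in range(c):
        p_miss *= (n - t - i) / (n - i)   # all c samples land on intact blocks
    return 1.0 - p_miss

# Example: 10,000 blocks, 1% corrupted, 460 random challenges -> ~99% detection.
print(round(detection_probability(10_000, 100, 460), 3))
```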


Proceedings ArticleDOI
09 Jun 2007
TL;DR: This paper presents the aggregate power usage characteristics of large collections of servers for different classes of applications over a period of approximately six months, and uses the modelling framework to estimate the potential of power management schemes to reduce peak power and energy usage.
Abstract: Large-scale Internet services require a computing infrastructure that can be appropriately described as a warehouse-sized computing system. The cost of building datacenter facilities capable of delivering a given power capacity to such a computer can rival the recurring energy consumption costs themselves. Therefore, there are strong economic incentives to operate facilities as close as possible to maximum capacity, so that the non-recurring facility costs can be best amortized. That is difficult to achieve in practice because of uncertainties in equipment power ratings and because power consumption tends to vary significantly with the actual computing activity. Effective power provisioning strategies are needed to determine how much computing equipment can be safely and efficiently hosted within a given power budget. In this paper we present the aggregate power usage characteristics of large collections of servers (up to 15 thousand) for different classes of applications over a period of approximately six months. Those observations allow us to evaluate opportunities for maximizing the use of the deployed power capacity of datacenters, and assess the risks of over-subscribing it. We find that even in well-tuned applications there is a noticeable gap (7 - 16%) between achieved and theoretical aggregate peak power usage at the cluster level (thousands of servers). The gap grows to almost 40% in whole datacenters. This headroom can be used to deploy additional compute equipment within the same power budget with minimal risk of exceeding it. We use our modeling framework to estimate the potential of power management schemes to reduce peak power and energy usage. We find that the opportunities for power and energy savings are significant, but greater at the cluster-level (thousands of servers) than at the rack-level (tens). Finally we argue that systems need to be power efficient across the activity range, and not only at peak performance levels.

2,047 citations
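
As a rough illustration of the headroom the paper measures, the sketch below (synthetic traces and an assumed linear power model, not the paper's data) compares the sum of nameplate peaks against the observed aggregate peak of servers that rarely peak simultaneously.

```python
# Illustrative only: gap between theoretical and observed aggregate peak power.
import random

NUM_SERVERS = 1000
NAMEPLATE_PEAK_W = 250.0      # assumed per-server rated peak (illustrative)
SAMPLES = 288                 # one day at 5-minute intervals

# Synthetic utilization traces; real servers rarely hit 100% at the same time.
random.seed(0)
traces = [[random.uniform(0.3, 0.95) for _ in range(SAMPLES)] for _ in range(NUM_SERVERS)]

def power(util, idle_w=120.0, peak_w=NAMEPLATE_PEAK_W):
    return idle_w + (peak_w - idle_w) * util   # simple linear power model

aggregate = [sum(power(traces[s][t]) for s in range(NUM_SERVERS)) for t in range(SAMPLES)]
theoretical_peak = NUM_SERVERS * NAMEPLATE_PEAK_W
observed_peak = max(aggregate)
print(f"gap between observed and theoretical aggregate peak: "
      f"{100 * (1 - observed_peak / theoretical_peak):.1f}%")
```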


Proceedings ArticleDOI
08 May 2007
TL;DR: This paper describes the approach to collaborative filtering for generating personalized recommendations for users of Google News using MinHash clustering, Probabilistic Latent Semantic Indexing, and covisitation counts, and combines recommendations from different algorithms using a linear model.
Abstract: Several approaches to collaborative filtering have been studied but seldom have studies been reported for large (several million users and items) and dynamic (the underlying item set is continually changing) settings. In this paper we describe our approach to collaborative filtering for generating personalized recommendations for users of Google News. We generate recommendations using three approaches: collaborative filtering using MinHash clustering, Probabilistic Latent Semantic Indexing (PLSI), and covisitation counts. We combine recommendations from different algorithms using a linear model. Our approach is content agnostic and consequently domain independent, making it easily adaptable for other applications and languages with minimal effort. This paper will describe our algorithms and system setup in detail, and report results of running the recommendations engine on Google News.

1,710 citations
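
Two of the ingredients named in the abstract lend themselves to a compact sketch: MinHash signatures that tend to collide for users with overlapping click histories, and a linear blend of per-algorithm scores. Everything below (the hash construction, the blend weights, the toy histories) is a made-up illustration rather than the paper's system.

```python
# Sketch of (1) MinHash signatures over click histories and (2) a linear blend
# of scores from several recommenders. Weights and data are illustrative only.
import hashlib

def minhash_signature(item_ids, num_hashes=4):
    """One MinHash value per hash function; users with similar histories collide often."""
    sig = []
    for h in range(num_hashes):
        sig.append(min(int(hashlib.md5(f"{h}:{item}".encode()).hexdigest(), 16)
                       for item in item_ids))
    return tuple(sig)

def blended_score(scores, weights):
    """Linear combination of per-algorithm scores."""
    return sum(weights[name] * s for name, s in scores.items())

user_a = minhash_signature({"story1", "story2", "story3"})
user_b = minhash_signature({"story2", "story3", "story4"})
print("shared minhash values:", sum(a == b for a, b in zip(user_a, user_b)))

weights = {"minhash": 0.4, "plsi": 0.4, "covisit": 0.2}   # made-up weights
print("blended score:", blended_score({"minhash": 0.7, "plsi": 0.5, "covisit": 0.9}, weights))
```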


Proceedings Article
03 Dec 2007
TL;DR: This contribution develops a theoretical framework that takes into account the effect of approximate optimization on learning algorithms and shows distinct tradeoffs for the case of small-scale and large-scale learning problems.
Abstract: This contribution develops a theoretical framework that takes into account the effect of approximate optimization on learning algorithms. The analysis shows distinct tradeoffs for the case of small-scale and large-scale learning problems. Small-scale learning problems are subject to the usual approximation-estimation tradeoff. Large-scale learning problems are subject to a qualitatively different tradeoff involving the computational complexity of the underlying optimization algorithms in non-trivial ways.

1,599 citations


Proceedings ArticleDOI
Ray Smith1
23 Sep 2007
TL;DR: The Tesseract OCR engine, which appeared as the HP Research Prototype in the UNLV Fourth Annual Test of OCR Accuracy, is described in a comprehensive overview.
Abstract: The Tesseract OCR engine, as was the HP Research Prototype in the UNLV Fourth Annual Test of OCR Accuracy, is described in a comprehensive overview. Emphasis is placed on aspects that are novel or at least unusual in an OCR engine, including in particular the line finding, features/classification methods, and the adaptive classifier.

1,530 citations


Journal ArticleDOI
TL;DR: The supervised formulation is shown to achieve higher accuracy than various previously published methods at a fraction of their computational cost and to be fairly robust to parameter tuning.
Abstract: A probabilistic formulation for semantic image annotation and retrieval is proposed. Annotation and retrieval are posed as classification problems where each class is defined as the group of database images labeled with a common semantic label. It is shown that, by establishing this one-to-one correspondence between semantic labels and semantic classes, a minimum probability of error annotation and retrieval are feasible with algorithms that are 1) conceptually simple, 2) computationally efficient, and 3) do not require prior semantic segmentation of training images. In particular, images are represented as bags of localized feature vectors, a mixture density estimated for each image, and the mixtures associated with all images annotated with a common semantic label pooled into a density estimate for the corresponding semantic class. This pooling is justified by a multiple instance learning argument and performed efficiently with a hierarchical extension of expectation-maximization. The benefits of the supervised formulation over the more complex, and currently popular, joint modeling of semantic label and visual feature distributions are illustrated through theoretical arguments and extensive experiments. The supervised formulation is shown to achieve higher accuracy than various previously published methods at a fraction of their computational cost. Finally, the proposed method is shown to be fairly robust to parameter tuning.

962 citations


Journal ArticleDOI
TL;DR: A path following algorithm for L1‐regularized generalized linear models that efficiently computes solutions along the entire regularization path by using the predictor–corrector method of convex optimization.
Abstract: Summary. We introduce a path following algorithm for L1-regularized generalized linear models. The L1-regularization procedure is useful especially because it, in effect, selects variables according to the amount of penalization on the L1-norm of the coefficients, in a manner that is less greedy than forward selection–backward deletion. The generalized linear model path algorithm efficiently computes solutions along the entire regularization path by using the predictor–corrector method of convex optimization. Selecting the step length of the regularization parameter is critical in controlling the overall accuracy of the paths; we suggest intuitive and flexible strategies for choosing appropriate values. We demonstrate the implementation with several simulated and real data sets.

887 citations
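
For orientation only, the snippet below traces coefficients along an L1 regularization path with scikit-learn's lasso_path. This uses coordinate descent on the Gaussian case with synthetic data, not the paper's predictor-corrector algorithm for general GLMs.

```python
# Illustration only: trace coefficients along an L1 regularization path.
# Not the paper's predictor-corrector method; Gaussian case rather than general GLMs.
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
true_coef = np.array([3.0, -2.0, 0, 0, 1.5, 0, 0, 0, 0, 0])
y = X @ true_coef + 0.5 * rng.standard_normal(100)

alphas, coefs, _ = lasso_path(X, y, n_alphas=50)
for a, c in zip(alphas[::10], coefs.T[::10]):
    print(f"alpha={a:.3f}  nonzero coefficients={np.sum(np.abs(c) > 1e-8)}")
```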


Proceedings Article
13 Feb 2007
TL;DR: In this article, the authors present data collected from detailed observations of a large disk drive population in a production Internet services deployment, and analyze the correlation between failures and several parameters generally believed to impact longevity.
Abstract: It is estimated that over 90% of all new information produced in the world is being stored on magnetic media, most of it on hard disk drives. Despite their importance, there is relatively little published work on the failure patterns of disk drives, and the key factors that affect their lifetime. Most available data are either based on extrapolation from accelerated aging experiments or from relatively modest sized field studies. Moreover, larger population studies rarely have the infrastructure in place to collect health signals from components in operation, which is critical information for detailed failure analysis. We present data collected from detailed observations of a large disk drive population in a production Internet services deployment. The population observed is many times larger than that of previous studies. In addition to presenting failure statistics, we analyze the correlation between failures and several parameters generally believed to impact longevity. Our analysis identifies several parameters from the drive's self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.

866 citations
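
The core analysis step, correlating drive attributes with failure labels across a population, can be sketched as below. The data frame is synthetic and the column names are illustrative stand-ins for SMART signals, not the paper's dataset.

```python
# Sketch of the kind of analysis described: correlate drive attributes with
# failure labels. All data below is synthetic; columns are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 5000
drives = pd.DataFrame({
    "reallocated_sectors": rng.poisson(0.5, n),
    "scan_errors": rng.poisson(0.2, n),
    "temperature_c": rng.normal(38, 5, n),
})
# Failures made (weakly) more likely when reallocation/scan-error counts are high.
risk = 0.01 + 0.03 * (drives["reallocated_sectors"] > 0) + 0.04 * (drives["scan_errors"] > 0)
drives["failed"] = (rng.random(n) < risk).astype(int)

print(drives.corr()["failed"].drop("failed").sort_values(ascending=False))
```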


Proceedings ArticleDOI
08 May 2007
TL;DR: This work proposes a simple algorithm based on novel indexing and optimization strategies that solves the problem of finding all pairs of vectors whose similarity score is above a given threshold without relying on approximation methods or extensive parameter tuning.
Abstract: Given a large collection of sparse vector data in a high dimensional space, we investigate the problem of finding all pairs of vectors whose similarity score (as determined by a function such as cosine distance) is above a given threshold. We propose a simple algorithm based on novel indexing and optimization strategies that solves this problem without relying on approximation methods or extensive parameter tuning. We show the approach efficiently handles a variety of datasets across a wide setting of similarity thresholds, with large speedups over previous state-of-the-art approaches.

768 citations
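
A stripped-down version of the inverted-index idea is sketched below: scores between unit-normalized sparse vectors are accumulated only through shared features, and a pair is kept when its cosine similarity clears the threshold. The paper's score-bound and ordering optimizations are omitted.

```python
# Simplified sketch: accumulate dot products between unit-normalized sparse
# vectors via an inverted index and keep pairs above a similarity threshold.
from collections import defaultdict
import math

def all_pairs(vectors, threshold):
    """vectors: list of {feature: weight} dicts. Returns {(i, j): similarity}."""
    normed = []
    for v in vectors:
        norm = math.sqrt(sum(w * w for w in v.values()))
        normed.append({f: w / norm for f, w in v.items()})

    index = defaultdict(list)              # feature -> [(vector id, weight)]
    results = {}
    for i, v in enumerate(normed):
        scores = defaultdict(float)
        for f, w in v.items():
            for j, wj in index[f]:         # only vectors sharing a feature contribute
                scores[j] += w * wj
        for j, s in scores.items():
            if s >= threshold:
                results[(j, i)] = s
        for f, w in v.items():
            index[f].append((i, w))
    return results

docs = [{"a": 1, "b": 2}, {"a": 1, "b": 2, "c": 1}, {"c": 3}]
print(all_pairs(docs, threshold=0.9))
```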


Journal ArticleDOI
TL;DR: The notion of a tradeoff revealing LP is introduced and used to derive two optimal algorithms achieving competitive ratios of 1-1/e for this problem of online bipartite matching.
Abstract: How does a search engine company decide what ads to display with each query so as to maximize its revenue? This turns out to be a generalization of the online bipartite matching problem. We introduce the notion of a trade-off revealing LP and use it to derive an optimal algorithm achieving a competitive ratio of 1−1/e for this problem.
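
As a sketch of the kind of budget-aware allocation rule analyzed in this line of work, the snippet below discounts each bid by a function of the advertiser's spent budget fraction, using psi(x) = 1 - e^(x-1). The bids and budgets are toy values, and this is an illustration rather than a faithful reproduction of the paper's algorithm or analysis.

```python
# Sketch of a budget-discounted allocation rule: each query goes to the bidder
# maximizing bid * psi(fraction of budget spent), with psi(x) = 1 - e^(x - 1).
import math

def psi(spent_fraction):
    return 1.0 - math.exp(spent_fraction - 1.0)

def allocate(query, bids, budgets, spent):
    """Pick the advertiser with the best discounted bid who can still pay."""
    best, best_score = None, 0.0
    for adv, bid in bids.get(query, {}).items():
        if spent[adv] + bid > budgets[adv]:
            continue
        score = bid * psi(spent[adv] / budgets[adv])
        if score > best_score:
            best, best_score = adv, score
    if best is not None:
        spent[best] += bids[query][best]
    return best

budgets = {"adv1": 10.0, "adv2": 10.0}
bids = {"shoes": {"adv1": 2.0, "adv2": 1.0}, "books": {"adv2": 2.0}}
spent = {"adv1": 0.0, "adv2": 0.0}
for q in ["shoes", "shoes", "books", "shoes"]:
    print(q, "->", allocate(q, bids, budgets, spent))
```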

Proceedings ArticleDOI
12 Aug 2007
TL;DR: The experience in building a fault-tolerant database using the Paxos consensus algorithm is described, and the measurements indicate that the team has built a competitive system.

Abstract: We describe our experience in building a fault-tolerant database using the Paxos consensus algorithm. Despite the existing literature in the field, building such a database proved to be non-trivial. We describe selected algorithmic and engineering problems encountered, and the solutions we found for them. Our measurements indicate that we have built a competitive system.
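
For readers unfamiliar with the consensus core the paper builds on, here is a bare-bones sketch of a single-decree Paxos acceptor. It deliberately omits everything the paper identifies as the hard part of a production system: the replicated log, master leases, disk-corruption handling, snapshots, and group membership.

```python
# Bare-bones sketch of a single-decree Paxos acceptor, for orientation only.
class Acceptor:
    def __init__(self):
        self.promised = -1          # highest proposal number promised
        self.accepted_n = -1        # highest proposal number accepted
        self.accepted_value = None

    def on_prepare(self, n):
        """Phase 1: promise not to accept proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return ("promise", n, self.accepted_n, self.accepted_value)
        return ("reject", n)

    def on_accept(self, n, value):
        """Phase 2: accept if no higher-numbered promise has been made."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n, value)
        return ("reject", n)

a = Acceptor()
print(a.on_prepare(1))        # ('promise', 1, -1, None)
print(a.on_accept(1, "x=5"))  # ('accepted', 1, 'x=5')
print(a.on_prepare(0))        # ('reject', 0) -- stale proposal
```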

Proceedings Article
Thorsten Brants1, Ashok C. Popat1, Peng Xu1, Franz Josef Och1, Jeffrey Dean1 
22 Jun 2007
TL;DR: Systems, methods, and computer program products for machine translation are provided, in which a backoff score is determined as a function of a backoff factor and the relative frequency of the corresponding backoff n-gram in the corpus.
Abstract: Systems, methods, and computer program products for machine translation are provided. In some implementations a system is provided. The system includes a language model including a collection of n-grams from a corpus, each n-gram having a corresponding relative frequency in the corpus and an order n corresponding to a number of tokens in the n-gram, each n-gram corresponding to a backoff n-gram having an order of n-1 and a collection of backoff scores, each backoff score associated with an n-gram, the backoff score determined as a function of a backoff factor and a relative frequency of a corresponding backoff n-gram in the corpus.
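
The scoring rule the abstract describes can be sketched as follows: a seen n-gram is scored by its relative frequency, and an unseen one backs off to the shorter n-gram's score multiplied by a backoff factor. The 0.4 factor and the toy corpus are assumptions for illustration.

```python
# Sketch of a backoff score of the kind described: relative frequency if the
# n-gram was seen, else backoff factor times the (n-1)-gram score.
from collections import Counter

def ngram_counts(tokens, max_n=3):
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def backoff_score(ngram, counts, total_tokens, alpha=0.4):
    context = ngram[:-1]
    if counts[ngram] > 0 and (not context or counts[context] > 0):
        denom = counts[context] if context else total_tokens
        return counts[ngram] / denom          # relative frequency
    if len(ngram) == 1:
        return counts[ngram] / total_tokens   # unseen unigram -> 0 here
    return alpha * backoff_score(ngram[1:], counts, total_tokens, alpha)

corpus = "the cat sat on the mat the cat ran".split()
counts = ngram_counts(corpus)
print(backoff_score(("the", "cat", "sat"), counts, len(corpus)))  # seen trigram
print(backoff_score(("the", "dog", "sat"), counts, len(corpus)))  # backs off
```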

Book ChapterDOI
16 Jul 2007
TL;DR: OpenFst as mentioned in this paper is an open-source library for weighted finite-state transducers (WFSTs), which is designed to be both very efficient in time and space and to scale to very large problems.
Abstract: We describe OpenFst, an open-source library for weighted finite-state transducers (WFSTs). OpenFst consists of a C++ template library with efficient WFST representations and over twenty-five operations for constructing, combining, optimizing, and searching them. At the shell-command level, there are corresponding transducer file representations and programs that operate on them. OpenFst is designed to be both very efficient in time and space and to scale to very large problems. This library has key applications in speech, image, and natural language processing, pattern and string matching, and machine learning. We give an overview of the library, examples of its use, and details of its design that allow customizing the labels, states, and weights, as well as the lazy evaluation of many of its operations. Further information and a download of the OpenFst library can be obtained from http://www.openfst.org.
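
To illustrate the kind of operation such a library provides, the sketch below composes two small weighted transducers over the tropical semiring (arc weights add along a path) in plain Python. It is a conceptual toy, not the OpenFst C++ or shell API, and epsilon handling is omitted.

```python
# Conceptual sketch only (not the OpenFst API): compose two weighted transducers
# over the tropical semiring, where arc weights add along a path.
from collections import defaultdict

# Transducer: {state: [(input_label, output_label, weight, next_state)]},
# start state 0, finals = {state: final_weight}.
T1 = {0: [("a", "x", 1.0, 1), ("a", "y", 2.0, 1)], 1: []}
T2 = {0: [("x", "X", 0.5, 1), ("y", "Y", 0.1, 1)], 1: []}
finals1, finals2 = {1: 0.0}, {1: 0.0}

def compose(t1, f1, t2, f2):
    arcs = defaultdict(list)
    finals = {}
    stack, seen = [(0, 0)], {(0, 0)}
    while stack:
        q1, q2 = stack.pop()
        if q1 in f1 and q2 in f2:
            finals[(q1, q2)] = f1[q1] + f2[q2]
        for il, ml, w1, n1 in t1.get(q1, []):
            for ml2, ol, w2, n2 in t2.get(q2, []):
                if ml == ml2:                       # match T1 output with T2 input
                    arcs[(q1, q2)].append((il, ol, w1 + w2, (n1, n2)))
                    if (n1, n2) not in seen:
                        seen.add((n1, n2))
                        stack.append((n1, n2))
    return arcs, finals

arcs, finals = compose(T1, finals1, T2, finals2)
for state, out in arcs.items():
    for il, ol, w, nxt in out:
        print(f"{state} --{il}:{ol}/{w}--> {nxt}")
```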

Journal ArticleDOI
TL;DR: It is found that relative preferences derived from clicks are reasonably accurate on average, and not only between results from an individual query, but across multiple sets of results within chains of query reformulations.
Abstract: This article examines the reliability of implicit feedback generated from clickthrough data and query reformulations in World Wide Web (WWW) search. Analyzing the users' decision process using eyetracking and comparing implicit feedback against manual relevance judgments, we conclude that clicks are informative but biased. While this makes the interpretation of clicks as absolute relevance judgments difficult, we show that relative preferences derived from clicks are reasonably accurate on average. We find that such relative preferences are accurate not only between results from an individual query, but across multiple sets of results within chains of query reformulations.
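
One common way to turn clicks into the relative preferences the article studies is the "click > skip above" rule: a clicked result is preferred over higher-ranked results that were skipped. The helper below sketches that extraction; it is an illustration of the general strategy, not the article's exact procedure.

```python
# Sketch: derive relative preference pairs from clicks ("click > skip above").
def preference_pairs(ranked_results, clicked):
    """ranked_results: list of doc ids in rank order; clicked: set of clicked ids."""
    pairs = []
    for i, doc in enumerate(ranked_results):
        if doc in clicked:
            for skipped in ranked_results[:i]:
                if skipped not in clicked:
                    pairs.append((doc, skipped))   # doc preferred over skipped
    return pairs

ranking = ["d1", "d2", "d3", "d4"]
print(preference_pairs(ranking, clicked={"d3"}))
# [('d3', 'd1'), ('d3', 'd2')]
```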

Proceedings ArticleDOI
08 May 2007
TL;DR: This work demonstrates that Charikar's fingerprinting technique is appropriate for near-duplicate detection and presents an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k.
Abstract: Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar's fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and all batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.
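
Charikar-style fingerprinting can be sketched compactly: hash each feature to f bits, accumulate a per-bit tally of +1/-1, and take the sign of each tally as one fingerprint bit; near-duplicates are then fingerprints within Hamming distance k. The MD5-based hash and 64-bit width below are arbitrary choices for illustration, and the paper's contribution of finding all fingerprints within k bits efficiently is not reproduced here.

```python
# Sketch of Charikar-style fingerprinting plus a Hamming-distance check.
import hashlib

F = 64  # fingerprint width in bits

def simhash(features):
    tally = [0] * F
    for feat in features:
        h = int(hashlib.md5(feat.encode()).hexdigest(), 16)
        for bit in range(F):
            tally[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(F) if tally[bit] > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

doc1 = "the quick brown fox jumps over the lazy dog".split()
doc2 = "the quick brown fox jumps over the lazy cat".split()
fp1, fp2 = simhash(doc1), simhash(doc2)
print("hamming distance:", hamming_distance(fp1, fp2), "(near-duplicate if <= k)")
```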

Proceedings Article
Niels Provos1, Dean McNamee1, Panayiotis Mavrommatis1, Ke Wang1, Nagendra Modadugu1 
10 Apr 2007
TL;DR: This work identifies the four prevalent mechanisms used to inject malicious content on popular web sites: web server security, user contributed content, advertising and third-party widgets, and presents examples of abuse found on the Internet.
Abstract: As more users are connected to the Internet and conduct their daily activities electronically, computer users have become the target of an underground economy that infects hosts with malware or adware for financial gain. Unfortunately, even a single visit to an infected web site enables the attacker to detect vulnerabilities in the user's applications and force the download of a multitude of malware binaries. Frequently, this malware allows the adversary to gain full control of the compromised systems, leading to the ex-filtration of sensitive information or installation of utilities that facilitate remote control of the host. We believe that such behavior is similar to our traditional understanding of botnets. However, the main difference is that web-based malware infections are pull-based and that the resulting command feedback loop is looser. To characterize the nature of this rising threat, we identify the four prevalent mechanisms used to inject malicious content on popular web sites: web server security, user contributed content, advertising and third-party widgets. For each of these areas, we present examples of abuse found on the Internet. Our aim is to present the state of malware on the Web and emphasize the importance of this rising threat.

Patent
Monika H. Henzinger1
03 Aug 2007
TL;DR: Improved duplicate and near-duplicate detection techniques may assign a number of fingerprints to a given document by extracting parts from the document, assigning the extracted parts to one or more of a predetermined number of lists, and generating a fingerprint from each of the populated lists as mentioned in this paper.
Abstract: Improved duplicate and near-duplicate detection techniques may assign a number of fingerprints to a given document by (i) extracting parts from the document, (ii) assigning the extracted parts to one or more of a predetermined number of lists, and (iii) generating a fingerprint from each of the populated lists. Two documents may be considered to be near-duplicates if any one of their fingerprints match.

Journal ArticleDOI
TL;DR: This paper presents a multi-agent based framework for simulating human and social behavior during emergency evacuation, which is able to demonstrate some emergent behaviors, such as competitive, queuing, and herding behaviors.
Abstract: Many computational tools for the simulation and design of emergency evacuation and egress are now available. However, due to the scarcity of human and social behavioral data, these computational tools rely on assumptions that have been found inconsistent or unrealistic. This paper presents a multi-agent based framework for simulating human and social behavior during emergency evacuation. A prototype system has been developed, which is able to demonstrate some emergent behaviors, such as competitive, queuing, and herding behaviors. For illustration, an example application of the system for safe egress design is provided.

Proceedings ArticleDOI
02 Nov 2007
TL;DR: It is found that it is often possible to tell whether or not a URL belongs to a phishing attack without requiring any knowledge of the corresponding page data.
Abstract: Phishing is a form of identity theft that combines social engineering techniques and sophisticated attack vectors to harvest financial information from unsuspecting consumers. Often a phisher tries to lure her victim into clicking a URL pointing to a rogue page. In this paper, we focus on studying the structure of URLs employed in various phishing attacks. We find that it is often possible to tell whether or not a URL belongs to a phishing attack without requiring any knowledge of the corresponding page data. We describe several features that can be used to distinguish a phishing URL from a benign one. These features are used to model a logistic regression filter that is efficient and has a high accuracy. We use this filter to perform thorough measurements on several million URLs and quantify the prevalence of phishing on the Internet today.
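
The modeling step described, classifying a URL from its own structure without fetching the page, can be sketched as below. The features, the tiny labeled set, and the scikit-learn classifier are illustrative assumptions, not the paper's feature set or training data.

```python
# Sketch: classify URLs from lexical/host features alone with logistic regression.
# Features and labeled examples below are made up for illustration.
import re
from sklearn.linear_model import LogisticRegression

def url_features(url):
    host = re.sub(r"^https?://", "", url).split("/")[0]
    return [
        len(url),                                             # long URLs are suspicious
        url.count("."),                                       # many dots / subdomains
        url.count("-"),
        1 if re.match(r"^\d+\.\d+\.\d+\.\d+$", host) else 0,  # raw IP host
        1 if "@" in url else 0,
    ]

urls = [
    ("http://example.com/about", 0),
    ("http://192.0.2.7/secure-login-update.bank.example/verify", 1),
    ("http://paypal.example.com.account-verify.test/login", 1),
    ("https://en.wikipedia.org/wiki/Phishing", 0),
]
X = [url_features(u) for u, _ in urls]
y = [label for _, label in urls]
clf = LogisticRegression().fit(X, y)
print(clf.predict([url_features("http://203.0.113.5/update-account-login")]))
```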

Journal ArticleDOI
TL;DR: In this paper, the authors developed and exploited a novel database on patent coauthorship to investigate the effects of collaboration networks on innovation and found that both shorter path lengths and larger connected components correlate with increased innovation.
Abstract: Small-world networks have attracted much theoretical attention and are widely thought to enhance creativity. Yet empirical studies of their evolution and evidence of their benefits remain scarce. We develop and exploit a novel database on patent coauthorship to investigate the effects of collaboration networks on innovation. Our analysis reveals the existence of regional small-world structures and the emergence and disappearance of giant components in patent collaboration networks. Using statistical models, we test and fail to find evidence that small-world structure (cohesive clusters connected by occasional nonlocal ties) enhances innovative productivity within geographic regions. We do find that both shorter path lengths and larger connected components correlate with increased innovation. We discuss the implications of our findings for future social network research and theory as well as regional innovation policies.

Proceedings ArticleDOI
26 Dec 2007
TL;DR: This work addresses the problem of visual category recognition by learning an image-to-image distance function that attempts to satisfy the following property: the distance between images from the same category should be less than the distance between images from different categories.
Abstract: We address the problem of visual category recognition by learning an image-to-image distance function that attempts to satisfy the following property: the distance between images from the same category should be less than the distance between images from different categories. We use patch-based feature vectors common in object recognition work as a basis for our image-to-image distance functions. Our large-margin formulation for learning the distance functions is similar to formulations used in the machine learning literature on distance metric learning; however, we differ in that we learn local distance functions (a different parameterized function for every image of our training set), whereas typically a single global distance function is learned. This was a novel approach first introduced in Frome, Singer, & Malik, NIPS 2006. In that work we learned the local distance functions independently, and the outputs of these functions could not be compared at test time without the use of additional heuristics or training. Here we introduce a different approach that has the advantage that it learns distance functions that are globally consistent in that they can be directly compared for purposes of retrieval and classification. The outputs of the learning algorithm are weights assigned to the image features, which is intuitively appealing in the computer vision setting: some features are more salient than others, and which are more salient depends on the category, or image, being considered. We train and test using the Caltech 101 object recognition benchmark.

Book ChapterDOI
Ryan McDonald1
02 Apr 2007
TL;DR: This work defines a general framework for inference in summarization and presents three algorithms: a greedy approximate method, a dynamic programming approach based on solutions to the knapsack problem, and an exact algorithm that uses an Integer Linear Programming formulation of the problem.
Abstract: In this work we study the theoretical and empirical properties of various global inference algorithms for multi-document summarization. We start by defining a general framework for inference in summarization. We then present three algorithms: The first is a greedy approximate method, the second a dynamic programming approach based on solutions to the knapsack problem, and the third is an exact algorithm that uses an Integer Linear Programming formulation of the problem. We empirically evaluate all three algorithms and show that, relative to the exact solution, the dynamic programming algorithm provides near optimal results with preferable scaling properties.
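
The dynamic programming variant maps naturally onto a 0/1 knapsack: maximize total sentence relevance subject to a summary length budget. The sketch below shows that recurrence with toy scores and lengths; the paper's redundancy terms and the greedy and ILP alternatives are omitted.

```python
# Sketch of knapsack-style sentence selection: maximize relevance under a length budget.
def knapsack_summary(scores, lengths, budget):
    n = len(scores)
    # best[b] = (total score, chosen indices) using at most b length units
    best = [(0.0, [])] * (budget + 1)
    for i in range(n):
        new_best = list(best)
        for b in range(lengths[i], budget + 1):
            cand_score = best[b - lengths[i]][0] + scores[i]
            if cand_score > new_best[b][0]:
                new_best[b] = (cand_score, best[b - lengths[i]][1] + [i])
        best = new_best
    return best[budget]

scores = [3.0, 2.0, 4.5, 1.0]     # per-sentence relevance (toy values)
lengths = [20, 15, 30, 10]        # per-sentence word counts (toy values)
print(knapsack_summary(scores, lengths, budget=45))
```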

Journal ArticleDOI
23 Sep 2007
TL;DR: The concept of probabilistic schema mappings is introduced and it is shown that there are two possible semantics for such mappings: by-table semantics assumes that there exists a correct mapping but we do not know what it is; by-tuple semantics assumes that the correct mapping may depend on the particular tuple in the source data.
Abstract: This paper reports our first set of results on managing uncertainty in data integration. We posit that data-integration systems need to handle uncertainty at three levels, and do so in a principled fashion. First, the semantic mappings between the data sources and the mediated schema may be approximate because there may be too many of them to be created and maintained or because in some domains (e.g., bioinformatics) it is not clear what the mappings should be. Second, queries to the system may be posed with keywords rather than in a structured form. Third, the data from the sources may be extracted using information extraction techniques and so may yield imprecise data. As a first step to building such a system, we introduce the concept of probabilistic schema mappings and analyze their formal foundations. We show that there are two possible semantics for such mappings: by-table semantics assumes that there exists a correct mapping but we don't know what it is; by-tuple semantics assumes that the correct mapping may depend on the particular tuple in the source data. We present the query complexity and algorithms for answering queries in the presence of approximate schema mappings, and we describe an algorithm for efficiently computing the top-k answers to queries in such a setting.

Posted Content
TL;DR: These two algorithms are the first polynomial time algorithms for such low-rank matrix approximations that come with relative-error guarantees; previously, in some cases, it was not even known whether such matrix decompositions exist.
Abstract: Many data analysis applications deal with large matrices and involve approximating the matrix using a small number of ``components.'' Typically, these components are linear combinations of the rows and columns of the matrix, and are thus difficult to interpret in terms of the original features of the input data. In this paper, we propose and study matrix approximations that are explicitly expressed in terms of a small number of columns and/or rows of the data matrix, and thereby more amenable to interpretation in terms of the original data. Our main algorithmic results are two randomized algorithms which take as input an $m \times n$ matrix $A$ and a rank parameter $k$. In our first algorithm, $C$ is chosen, and we let $A'=CC^+A$, where $C^+$ is the Moore-Penrose generalized inverse of $C$. In our second algorithm $C$, $U$, $R$ are chosen, and we let $A'=CUR$. ($C$ and $R$ are matrices that consist of actual columns and rows, respectively, of $A$, and $U$ is a generalized inverse of their intersection.) For each algorithm, we show that with probability at least $1-\delta$: $$ ||A-A'||_F \leq (1+\epsilon) ||A-A_k||_F, $$ where $A_k$ is the ``best'' rank-$k$ approximation provided by truncating the singular value decomposition (SVD) of $A$. The number of columns of $C$ and rows of $R$ is a low-degree polynomial in $k$, $1/\epsilon$, and $\log(1/\delta)$. Our two algorithms are the first polynomial time algorithms for such low-rank matrix approximations that come with relative-error guarantees; previously, in some cases, it was not even known whether such matrix decompositions exist. Both of our algorithms are simple, they take time of the order needed to approximately compute the top $k$ singular vectors of $A$, and they use a novel, intuitive sampling method called ``subspace sampling.''
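
The A' = CUR construction itself is easy to sketch with NumPy, as below. Note one major simplification: columns and rows here are sampled by squared norm rather than the paper's subspace sampling, so the relative-error guarantee quoted in the abstract does not apply to this toy version; it only illustrates the decomposition's shape.

```python
# Sketch of A' = C U R with a simplified (norm-based) sampling scheme.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 8)) @ rng.standard_normal((8, 40))  # roughly rank 8
k, c, r = 5, 12, 12

col_p = (A ** 2).sum(axis=0); col_p /= col_p.sum()
row_p = (A ** 2).sum(axis=1); row_p /= row_p.sum()
cols = rng.choice(A.shape[1], size=c, replace=False, p=col_p)
rows = rng.choice(A.shape[0], size=r, replace=False, p=row_p)

C, R = A[:, cols], A[rows, :]
U = np.linalg.pinv(A[np.ix_(rows, cols)])     # generalized inverse of the intersection
A_cur = C @ U @ R

# Compare against the best rank-k approximation from the SVD.
u, s, vt = np.linalg.svd(A, full_matrices=False)
A_k = (u[:, :k] * s[:k]) @ vt[:k, :]
print("||A - CUR||_F =", np.linalg.norm(A - A_cur))
print("||A - A_k||_F =", np.linalg.norm(A - A_k))
```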

Proceedings Article
01 Jan 2007
TL;DR: This paper proposes a new data integration architecture, PAYGO, which is inspired by the concept of dataspaces and emphasizes pay-as-you-go data management as means for achieving web-scale data integration.
Abstract: The World Wide Web is witnessing an increase in the amount of structured content – vast heterogeneous collections of structured data are on the rise due to the Deep Web, annotation schemes like Flickr, and sites like Google Base. While this phenomenon is creating an opportunity for structured data management, dealing with heterogeneity on the web-scale presents many new challenges. In this paper, we highlight these challenges in two scenarios – the Deep Web and Google Base. We contend that traditional data integration techniques are no longer valid in the face of such heterogeneity and scale. We propose a new data integration architecture, PAYGO, which is inspired by the concept of dataspaces and emphasizes pay-as-you-go data management as means for achieving web-scale data integration.

Patent
Steven T. Kirsch1
16 Mar 2007
TL;DR: In this paper, the origin address of an e-mail message is validated to enable blocking of e-mails from spam sources, by preparing, in response to the receipt of a predetermined e -mail message from an unverified source address, a data key encoding information reflective of the message.
Abstract: The origin address of an e-mail message is validated to enable blocking of e-mail from spam e-mail sources, by preparing, in response to the receipt of a predetermined e-mail message from an unverified source address, a data key encoding information reflective of the predetermined e-mail message. This message, including the data key, is then issued to the unverified source address. The computer system then operates to detect whether a response e-mail message, responsive to the challenge e-mail message, is received and whether the response e-mail message includes a response key encoding predetermined information reflective of a predetermined aspect of the challenge e-mail message. The unverified source address may be recorded in a verified source address list. Thus, when an e-mail message is received, the computer may operate to accept receipt of a predetermined e-mail message on condition that the source address of the predetermined e-mail message is recorded in the verified source address list and alternatively on condition that the predetermined e-mail message includes the response key.
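
One way to realize the data key / response key flow described would be an HMAC over attributes of the incoming message, as sketched below. The HMAC construction, secret, and field layout are assumptions for illustration; the patent does not prescribe this particular encoding.

```python
# Sketch of a challenge/response flow with an assumed HMAC-based data key.
import hmac, hashlib

SERVER_SECRET = b"assumed-server-secret"

def make_data_key(message_id: str, source_address: str) -> str:
    msg = f"{message_id}|{source_address}".encode()
    return hmac.new(SERVER_SECRET, msg, hashlib.sha256).hexdigest()

def verify_response(message_id: str, source_address: str, response_key: str) -> bool:
    expected = make_data_key(message_id, source_address)
    return hmac.compare_digest(expected, response_key)

# Challenge: embed the data key in a message sent back to the unverified sender.
key = make_data_key("msg-123", "unknown@example.net")
# Response: accept (and whitelist the sender) only if the echoed key verifies.
print(verify_response("msg-123", "unknown@example.net", key))        # True
print(verify_response("msg-123", "spoofed@example.net", key))        # False
```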

Journal ArticleDOI
TL;DR: Skip graphs, as discussed in this paper, are a distributed data structure based on skip lists that provides the full functionality of a balanced tree in a distributed system where resources are stored in separate nodes that may fail at any time.
Abstract: Skip graphs are a novel distributed data structure, based on skip lists, that provide the full functionality of a balanced tree in a distributed system where resources are stored in separate nodes that may fail at any time. They are designed for use in searching peer-to-peer systems, and by providing the ability to perform queries based on key ordering, they improve on existing search tools that provide only hash table functionality. Unlike skip lists or other tree data structures, skip graphs are highly resilient, tolerating a large fraction of failed nodes without losing connectivity. In addition, simple and straightforward algorithms can be used to construct a skip graph, insert new nodes into it, search it, and detect and repair errors within it introduced due to node failures.

Journal ArticleDOI
01 Sep 2007
TL;DR: This model represents the environment using hierarchical data structures, which efficiently support the perceptual queries that influence the behavioral responses of the autonomous pedestrians and sustain their ability to plan their actions on local and global scales.
Abstract: We address the challenging problem of emulating the rich complexity of real pedestrians in urban environments. Our artificial life approach integrates motor, perceptual, behavioral, and cognitive components within a comprehensive model of pedestrians as individuals. Featuring innovations in these components, as well as in their combination, our model yields results of unprecedented fidelity and complexity for fully autonomous multihuman simulation in a large urban environment. We represent the environment using hierarchical data structures, which efficiently support the perceptual queries that influence the behavioral responses of the autonomous pedestrians and sustain their ability to plan their actions on local and global scales.

Proceedings Article
Ryan McDonald1, Joakim Nivre
01 Jun 2007
TL;DR: A comparative error analysis of the two dominant approaches in data-driven dependency parsing: global, exhaustive, graph-based models, and local, greedy, transition-based models shows that, in spite of similar performance, the two models produce different types of errors.
Abstract: We present a comparative error analysis of the two dominant approaches in data-driven dependency parsing: global, exhaustive, graph-based models, and local, greedy, transition-based models. We show that, in spite of similar performance overall, the two models produce different types of errors, in a way that can be explained by theoretical properties of the two models. This analysis leads to new directions for parser development.