
Showing papers by "AT&T Labs" published in 2006


Journal ArticleDOI
TL;DR: This paper introduced the use of the maximum entropy method (Maxent), a general-purpose machine learning method with a simple and precise mathematical formulation, for modeling species geographic distributions with presence-only data.

13,120 citations


Proceedings ArticleDOI
27 Jun 2006
TL;DR: This work defines a variety of essential and practical cost metrics associated with ODB systems, analytically evaluates query-authentication approaches against them, and presents solutions that handle dynamic scenarios in which owners periodically update the data residing at the servers; the proposed solutions improve performance substantially over existing approaches in both static and dynamic environments.
Abstract: In outsourced database (ODB) systems the database owner publishes its data through a number of remote servers, with the goal of enabling clients at the edge of the network to access and query the data more efficiently. As servers might be untrusted or can be compromised, query authentication becomes an essential component of ODB systems. Existing solutions for this problem concentrate mostly on static scenarios and are based on idealistic properties for certain cryptographic primitives. In this work, first we define a variety of essential and practical cost metrics associated with ODB systems. Then, we analytically evaluate a number of different approaches, in search of a solution that best leverages all metrics. Most importantly, we look at solutions that can handle dynamic scenarios, where owners periodically update the data residing at the servers. Finally, we discuss query freshness, a new dimension in data authentication that has not been explored before. A comprehensive experimental evaluation of the proposed and existing approaches is used to validate the analytical models and verify our claims. Our findings show that the proposed solutions improve performance substantially over existing approaches, both for static and dynamic environments.
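
The authentication structures such systems rely on can be illustrated with a minimal Merkle-hash-tree sketch (ours, for background only; the paper analyzes and evaluates more elaborate, updatable constructions and their costs). The owner signs only the root digest; the server returns query results together with the sibling hashes that let a client recompute and check that root.

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def build_levels(leaves):
        """All tree levels bottom-up; an odd-sized level duplicates its last node."""
        levels = [[h(x) for x in leaves]]
        while len(levels[-1]) > 1:
            cur = levels[-1]
            if len(cur) % 2:
                cur = cur + [cur[-1]]
            levels.append([h(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
        return levels

    def prove(levels, idx):
        """Sibling hashes from leaf idx up to the root (the verification object)."""
        proof = []
        for level in levels[:-1]:
            if len(level) % 2:
                level = level + [level[-1]]
            proof.append((idx % 2, level[idx ^ 1]))
            idx //= 2
        return proof

    def verify(signed_root, leaf, proof):
        cur = h(leaf)
        for i_am_right_child, sibling in proof:
            cur = h(sibling + cur) if i_am_right_child else h(cur + sibling)
        return cur == signed_root

    records = [b"alice|42", b"bob|17", b"carol|99", b"dave|3"]   # hypothetical tuples
    levels = build_levels(records)
    root = levels[-1][0]                  # this digest is what the owner signs once
    print(verify(root, records[2], prove(levels, 2)))            # True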

434 citations


Proceedings ArticleDOI
27 Jun 2006
TL;DR: This tutorial provides a comprehensive and cohesive overview of the key research results in the area of record linkage methodologies and algorithms for identifying approximate duplicate records, and available tools for this purpose.
Abstract: This tutorial provides a comprehensive and cohesive overview of the key research results in the area of record linkage methodologies and algorithms for identifying approximate duplicate records, and available tools for this purpose. It encompasses techniques introduced in several communities including databases, information retrieval, statistics and machine learning. It aims to identify similarities and differences across the techniques as well as their merits and limitations.
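
As a flavor of one building block covered in this literature, the sketch below (our toy example, not taken from the tutorial) flags approximate duplicates using token-based Jaccard similarity; real systems combine such measures with blocking, richer string distances, and learned matching models.

    import re

    def tokens(s):
        return set(re.findall(r"\w+", s.lower()))

    def jaccard(a, b):
        return len(a & b) / len(a | b) if (a or b) else 1.0

    def approximate_duplicates(records, threshold=0.6):
        # Naive all-pairs comparison; real systems add blocking to avoid O(n^2) work.
        pairs = []
        for i in range(len(records)):
            for j in range(i + 1, len(records)):
                if jaccard(tokens(records[i]), tokens(records[j])) >= threshold:
                    pairs.append((i, j))
        return pairs

    print(approximate_duplicates([
        "AT&T Labs Research Florham Park NJ",
        "ATT Labs - Research, Florham Park, NJ",
        "Bell Labs Murray Hill NJ",
    ]))   # -> [(0, 1)]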

330 citations


Journal ArticleDOI
TL;DR: A P3P user agent called Privacy Bird is developed, which can fetch P3P privacy policies automatically, compare them with a user's privacy preferences, and alert and advise the user.
Abstract: Most people do not often read privacy policies because they tend to be long and difficult to understand. The Platform for Privacy Preferences (P3P) addresses this problem by providing a standard machine-readable format for website privacy policies. P3P user agents can fetch P3P privacy policies automatically, compare them with a user's privacy preferences, and alert and advise the user. Developing user interfaces for P3P user agents is challenging for several reasons: privacy policies are complex, user privacy preferences are often complex and nuanced, users tend to have little experience articulating their privacy preferences, users are generally unfamiliar with much of the terminology used by privacy experts, users often do not understand the privacy-related consequences of their behavior, and users have differing expectations about the type and extent of privacy policy information they would like to see. We developed a P3P user agent called Privacy Bird. Our design was informed by privacy surveys and our previous experience with prototype P3P user agents. We describe our design approach, compare it with the approach used in other P3P user agents, evaluate our design, and make recommendations to designers of other privacy agents.

278 citations


Journal ArticleDOI
Nick Duffield1
TL;DR: This paper abstracts the properties of network performance that allow this to be done and exploits them with a quick and simple inference algorithm that, with high likelihood, identifies the worst performing links.
Abstract: In network performance tomography, characteristics of the network interior, such as link loss and packet latency, are inferred from correlated end-to-end measurements. Most work to date is based on exploiting packet level correlations, e.g., of multicast packets or unicast emulations of them. However, these methods are often limited in scope (multicast is not widely deployed) or require deployment of additional hardware or software infrastructure. Some recent work has been successful in reaching a less detailed goal: identifying the lossiest network links using only uncorrelated end-to-end measurements. In this paper, we abstract the properties of network performance that allow this to be done and exploit them with a quick and simple inference algorithm that, with high likelihood, identifies the worst performing links. We give several examples of real network performance measures that exhibit the required properties. Moreover, the algorithm is sufficiently simple that we can analyze its performance explicitly.
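
The general flavor of such threshold-based inference can be shown with a toy sketch (a generic heuristic in the same spirit, not the paper's algorithm or its analysis): classify each measured path as good or bad, exonerate every link carried by at least one good path, and rank the remaining links by how many bad paths traverse them.

    from collections import Counter

    def suspect_links(paths, bad):
        """paths: {path_id: [link, ...]}; bad: set of path_ids that performed poorly."""
        exonerated = {l for p, links in paths.items() if p not in bad for l in links}
        counts = Counter(l for p in bad for l in paths[p] if l not in exonerated)
        return counts.most_common()            # most suspect links first

    paths = {"p1": ["a", "b"], "p2": ["b", "c"], "p3": ["a", "d"], "p4": ["d", "c"]}
    print(suspect_links(paths, bad={"p2", "p4"}))   # [('c', 2)]: link c is the prime suspect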

262 citations


Proceedings Article
31 Jul 2006
TL;DR: This paper presents a new Prefix Hijack Alert System (PHAS), a real-time notification system that alerts prefix owners when their BGP origin changes, and illustrates the effectiveness of PHAS and evaluates its overhead using BGP logs collected from RouteViews.
Abstract: In a BGP prefix hijacking event, a router originates a route to a prefix, but does not provide data delivery to the actual prefix. Prefix hijacking events have been widely reported and are a serious problem in the Internet. This paper presents a new Prefix Hijack Alert System (PHAS). PHAS is a real-time notification system that alerts prefix owners when their BGP origin changes. By providing reliable and timely notification of origin AS changes, PHAS allows prefix owners to quickly and easily detect prefix hijacking events and take prompt action to address the problem. We illustrate the effectiveness of PHAS and evaluate its overhead using BGP logs collected from RouteViews. PHAS is light-weight, easy to implement, and readily deployable. In addition to protecting against false BGP origins, the PHAS concept can be extended to detect prefix hijacking events that involve announcing more specific prefixes or modifying the last hop in the path.
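
The core monitoring loop can be sketched as follows (our simplification; PHAS itself consumes multiple public BGP feeds, applies time windows to damp transient changes, and delivers notifications reliably). The prefix, AS numbers, and contact address below are made up.

    from collections import defaultdict

    class OriginMonitor:
        def __init__(self, owners):
            self.owners = owners                  # prefix -> registered contact address
            self.origins = defaultdict(set)       # prefix -> origin ASes seen so far

        def process_update(self, prefix, origin_as):
            known = self.origins[prefix]
            if known and origin_as not in known and prefix in self.owners:
                print(f"ALERT to {self.owners[prefix]}: {prefix} now originated by "
                      f"AS{origin_as}, previously AS{sorted(known)}")
            known.add(origin_as)

    mon = OriginMonitor({"192.0.2.0/24": "noc@example.net"})
    mon.process_update("192.0.2.0/24", 64500)   # first observation, no alert
    mon.process_update("192.0.2.0/24", 64666)   # new origin -> possible hijack alert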

256 citations


Journal ArticleDOI
11 Aug 2006
TL;DR: This paper proposes COPE, a class of traffic engineering algorithms that optimize for the expected scenarios while providing a worst-case guarantee for unexpected scenarios and shows that COPE can achieve efficient resource utilization and avoid network congestion in a wide variety of scenarios.
Abstract: Traffic engineering plays a critical role in determining the performance and reliability of a network. A major challenge in traffic engineering is how to cope with dynamic and unpredictable changes in traffic demand. In this paper, we propose COPE, a class of traffic engineering algorithms that optimize for the expected scenarios while providing a worst-case guarantee for unexpected scenarios. Using extensive evaluations based on real topologies and traffic traces, we show that COPE can achieve efficient resource utilization and avoid network congestion in a wide variety of scenarios.
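
In our own simplified notation (a schematic reading of the approach, not the paper's exact formulation), a COPE-style routing f is chosen to perform well on a set of predicted demands while capping the penalty on any other demand:

    \min_{f} \; \max_{d \in D_{\mathrm{pred}}} \frac{U(f, d)}{U^{*}(d)}
    \quad \text{subject to} \quad
    \max_{d \in D_{\mathrm{all}}} \frac{U(f, d)}{U^{*}(d)} \le \bar{r},

where U(f, d) is the maximum link utilization routing f induces under traffic matrix d, U*(d) is the best utilization achievable for d, D_pred collects expected demands, D_all covers unexpected ones, and r̄ bounds the acceptable worst-case penalty.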

245 citations


Proceedings ArticleDOI
22 Jan 2006
TL;DR: These are the first such results with additive error terms that are sublinear in the distance being approximated, for undirected and unweighted graphs.
Abstract: Let k ≥ 2 be an integer. We show that any undirected and unweighted graph G = (V, E) on n vertices has a subgraph G' = (V, E') with O(kn^{1+1/k}) edges such that for any two vertices u, v ∈ V, if δ_G(u, v) = d, then δ_{G'}(u, v) = d + O(d^{1-1/(k-1)}). Furthermore, we show that such subgraphs can be constructed in O(mn^{1/k}) time, where m and n are the number of edges and vertices in the original graph. We also show that it is possible to construct a weighted graph G* = (V, E*) with O(kn^{1+1/(2^k-1)}) edges such that for every u, v ∈ V, if δ_G(u, v) = d, then d ≤ δ_{G*}(u, v) = d + O(d^{1-1/(k-1)}). These are the first such results with additive error terms of the form o(d), i.e., additive error terms that are sublinear in the distance being approximated.

175 citations


Proceedings ArticleDOI
25 Oct 2006
TL;DR: The Metropolized Random Walk with Backtracking (MRWB) is proposed as a viable and promising technique for collecting nearly unbiased samples, and an extensive simulation study demonstrates that the technique works well for a wide variety of commonly encountered peer-to-peer network conditions.
Abstract: This paper addresses the difficult problem of selecting representative samples of peer properties (e.g., degree, link bandwidth, number of files shared) in unstructured peer-to-peer systems. Due to the large size and dynamic nature of these systems, measuring the quantities of interest on every peer is often prohibitively expensive, while sampling provides a natural means for estimating system-wide behavior efficiently. However, commonly-used sampling techniques for measuring peer-to-peer systems tend to introduce considerable bias for two reasons. First, the dynamic nature of peers can bias results towards short-lived peers, much as naively sampling flows in a router can lead to bias towards short-lived flows. Second, the heterogeneous nature of the overlay topology can lead to bias towards high-degree peers. We present a detailed examination of the ways that the behavior of peer-to-peer systems can introduce bias and suggest the Metropolized Random Walk with Backtracking (MRWB) as a viable and promising technique for collecting nearly unbiased samples. We conduct an extensive simulation study to demonstrate that the proposed technique works well for a wide variety of common peer-to-peer network conditions. Using the Gnutella network, we empirically show that our implementation of the MRWB technique yields more accurate samples than relying on commonly-used sampling techniques. Furthermore, we provide insights into the causes of the observed differences. The tool we have developed, ion-sampler, selects peer addresses uniformly at random using the MRWB technique. These addresses may then be used as input to another measurement tool to collect data on a particular property.
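
The degree-correcting walk underlying MRWB can be sketched in a few lines (our simplification; the backtracking logic for departed peers and ion-sampler's convergence checks are omitted): a proposed move from peer u to a uniformly chosen neighbor v is accepted with probability min(1, deg(u)/deg(v)), which makes the walk's stationary distribution uniform over peers rather than proportional to degree.

    import random

    def metropolis_walk_sample(neighbors, start, steps=1000, rng=random):
        """neighbors: dict node -> list of adjacent nodes (the overlay topology)."""
        u = start
        for _ in range(steps):
            v = rng.choice(neighbors[u])
            if rng.random() < min(1.0, len(neighbors[u]) / len(neighbors[v])):
                u = v                       # accept the move
            # otherwise stay at u; the rejected step still counts
        return u

    graph = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b", "d"], "d": ["b", "c"]}
    samples = [metropolis_walk_sample(graph, "a", steps=200) for _ in range(5)]
    print(samples)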

168 citations


Journal ArticleDOI
11 Aug 2006
TL;DR: This work conducts extensive measurements involving both controlled routing updates through two tier-1 ISPs and active probes of a diverse set of end-to-end paths on the Internet, and finds that routing changes contribute significantly to end-to-end packet loss.
Abstract: Extensive measurement studies have shown that end-to-end Internet path performance degradation is correlated with routing dynamics. However, the root cause of the correlation between routing dynamics and such performance degradation is poorly understood. In particular, how do routing changes result in degraded end-to-end path performance in the first place? How do factors such as topological properties, routing policies, and iBGP configurations affect the extent to which such routing events can cause performance degradation? Answers to these questions are critical for improving network performance. In this paper, we conduct extensive measurements that involve both controlled routing updates through two tier-1 ISPs and active probes of a diverse set of end-to-end paths on the Internet. We find that routing changes contribute significantly to end-to-end packet loss. Specifically, we study failover events in which a link failure leads to a routing change and recovery events in which a link repair causes a routing change. In both cases, it is possible to experience data plane performance degradation in terms of increased long loss bursts as well as forwarding loops. Furthermore, we find that common routing policies and iBGP configurations of ISPs can directly affect the end-to-end path performance during routing changes. Our work provides new insights into potential measures that network operators can undertake to enhance network performance.

166 citations


Journal ArticleDOI
TL;DR: This paper proposes methods for a tighter integration of automatic speech recognition (ASR) and spoken language understanding (SLU) using word confusion networks (WCNs), which provide a compact representation of multiple aligned ASR hypotheses along with word confidence scores, without compromising recognition accuracy.

Proceedings ArticleDOI
20 Aug 2006
TL;DR: A new way of measuring and extracting proximity in networks called "cycle free effective conductance" (CFEC) is proposed, which can handle more than two endpoints and directed edges, is statistically well-behaved, and produces an effectiveness score for the computed subgraphs.
Abstract: Measuring distance or some other form of proximity between objects is a standard data mining tool. Connection subgraphs were recently proposed as a way to demonstrate proximity between nodes in networks. We propose a new way of measuring and extracting proximity in networks called "cycle free effective conductance" (CFEC). Our proximity measure can handle more than two endpoints and directed edges, is statistically well-behaved, and produces an effectiveness score for the computed subgraphs. We provide an efficient algorithm. Also, we report experimental results and show examples for three large network data sets: a telecommunications calling graph, the IMDB actors graph, and an academic co-authorship network.
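
For background on the underlying quantity, plain effective conductance between two nodes (the reciprocal of effective resistance) can be computed from the graph Laplacian as below; this sketch shows only that baseline notion, not the paper's cycle-free correction, its multi-endpoint extension, or its subgraph extraction.

    import numpy as np

    def effective_conductance(edges, n, s, t):
        """edges: list of (u, v, weight); nodes are 0..n-1."""
        L = np.zeros((n, n))
        for u, v, w in edges:
            L[u, u] += w; L[v, v] += w
            L[u, v] -= w; L[v, u] -= w
        e = np.zeros(n); e[s], e[t] = 1.0, -1.0
        resistance = e @ np.linalg.pinv(L) @ e     # effective resistance between s and t
        return 1.0 / resistance

    edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0), (2, 3, 1.0)]
    print(effective_conductance(edges, 4, 0, 2))   # two parallel routes -> conductance 1.5
    print(effective_conductance(edges, 4, 0, 3))   # more distant pair -> lower value (0.6)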

Journal ArticleDOI
TL;DR: The focus of this work is to exploit data and to use machine learning techniques to create scalable SLU systems which can be quickly deployed for new domains with minimal human intervention.
Abstract: Spoken language understanding (SLU) aims at extracting meaning from natural language speech. Over the past decade, a variety of practical goal-oriented spoken dialog systems have been built for limited domains. SLU in these systems ranges from understanding predetermined phrases through fixed grammars, extracting some predefined named entities, extracting users' intents for call classification, to combinations of users' intents and named entities. In this paper, we present the SLU system of VoiceTone® (a service provided by AT&T). Our approach includes, among other components, extending statistical classifiers to seamlessly integrate hand-crafted classification rules with the rules learned from data, and developing an active learning framework to minimize the human labeling effort for quickly building the classifier models and adapting them to changes. We present an evaluation of this system using two deployed applications of VoiceTone®.

Proceedings Article
30 May 2006
TL;DR: This work investigates the design space for in-network DDoS detection and proposes a triggered, multi-stage approach that addresses both scalability and accuracy; the resulting system, LADS, is used to detect DDoS attacks in a tier-1 ISP.
Abstract: Many Denial of Service attacks use brute-force bandwidth flooding of intended victims. Such volume-based attacks aggregate at a target's access router, suggesting that (i) detection and mitigation are best done by providers in their networks; and (ii) attacks are most readily detectable at access routers, where their impact is strongest. In-network detection presents a tension between scalability and accuracy. Specifically, accuracy of detection dictates fine grained traffic monitoring, but performing such monitoring for the tens or hundreds of thousands of access interfaces in a large provider network presents serious scalability issues. We investigate the design space for in-network DDoS detection and propose a triggered, multi-stage approach that addresses both scalability and accuracy. Our contribution is the design and implementation of LADS (Large-scale Automated DDoS detection System). The attractiveness of this system lies in the fact that it makes use of data that is readily available to an ISP, namely, SNMP and Netflow feeds from routers, without dependence on proprietary hardware solutions. We report our experiences using LADS to detect DDoS attacks in a tier-1 ISP.
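
A toy rendering of the triggered, multi-stage idea (thresholds, field names, and numbers are ours, not LADS internals): a cheap check over SNMP byte counters runs on every interface, and only interfaces that trip it pay the cost of per-flow analysis to identify likely victims.

    from collections import Counter

    def snmp_trigger(byte_counts, mean, std, k=3.0):
        """Stage 1: flag interfaces whose byte count exceeds its mean by k std devs."""
        return [i for i, b in byte_counts.items() if b > mean[i] + k * std[i]]

    def flow_drilldown(flows, top_n=3):
        """Stage 2: on flagged interfaces only, rank destination IPs by received bytes."""
        per_dst = Counter()
        for dst, nbytes in flows:
            per_dst[dst] += nbytes
        return per_dst.most_common(top_n)

    counts = {"if1": 9.0e9, "if2": 1.1e9}
    mean   = {"if1": 2.0e9, "if2": 1.0e9}
    std    = {"if1": 0.5e9, "if2": 0.2e9}
    for iface in snmp_trigger(counts, mean, std):
        print(iface, flow_drilldown([("203.0.113.7", 7e9), ("198.51.100.2", 1e9)]))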

Proceedings ArticleDOI
25 Oct 2006
TL;DR: The study defines a privacy footprint for assessing and comparing the diffusion of privacy information across a wide variety of sites, examines the effectiveness of existing and new techniques to reduce this diffusion, and shows that the size of the privacy footprint is a legitimate cause for concern across the sets of sites studied.
Abstract: As a follow-up to characterizing traffic deemed unwanted by Web clients, such as advertisements, we examine how information related to individual users is aggregated as a result of browsing seemingly unrelated Web sites. We examine the privacy diffusion on the Internet, hidden transactions, and the potential for a few sites to be able to construct a profile of individual users. We define and generate a privacy footprint allowing us to assess and compare the diffusion of privacy information across a wide variety of sites. We examine the effectiveness of existing and new techniques to reduce this diffusion. Our results show that the size of the privacy footprint is a legitimate cause for concern across the sets of sites that we study.

Book ChapterDOI
18 Sep 2006
TL;DR: Three independent, complementary techniques for lowering the density and improving the readability of circular layouts are suggested; together they are able to reduce clutter, density, and crossings compared with existing methods.
Abstract: Circular graph layout is a drawing scheme where all nodes are placed on the perimeter of a circle. An inherent issue with circular layouts is that the rigid restriction on node placement often gives rise to long edges and an overall dense drawing. We suggest here three independent, complementary techniques for lowering the density and improving the readability of circular layouts. First, a new algorithm is given for placing the nodes on the circle such that edge lengths are reduced. Second, we enhance the circular drawing style by allowing some of the edges to be routed around the exterior of the circle. This is accomplished with an algorithm for optimally selecting such a set of externally routed edges. The third technique reduces density by coupling groups of edges as bundled splines that share part of their route. Together, these techniques are able to reduce clutter, density and crossings compared with existing methods.

Journal ArticleDOI
TL;DR: This work presents a multistart heuristic for the uncapacitated facility location problem, based on a very successful method originally developed for the p-median problem; the heuristic consistently outperforms other heuristics in the literature.
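
As an illustration of the heuristic family only (a generic multistart local search over facility open/close flips with made-up costs; the paper's method, adapted from the authors' p-median work, is considerably more sophisticated):

    import random

    def cost(open_facs, fixed, assign):
        """Opening costs plus cheapest assignment of every client to an open facility."""
        return sum(fixed[f] for f in open_facs) + \
               sum(min(assign[c][f] for f in open_facs) for c in range(len(assign)))

    def local_search(sol, fixed, assign):
        improved = True
        while improved:
            improved = False
            for f in range(len(fixed)):                  # try flipping each facility
                cand = sol ^ {f}
                if cand and cost(cand, fixed, assign) < cost(sol, fixed, assign):
                    sol, improved = cand, True
        return sol

    def multistart(fixed, assign, starts=20, rng=random):
        best = None
        for _ in range(starts):
            init = {f for f in range(len(fixed)) if rng.random() < 0.5} or {0}
            sol = local_search(init, fixed, assign)
            if best is None or cost(sol, fixed, assign) < cost(best, fixed, assign):
                best = sol
        return best, cost(best, fixed, assign)

    fixed = [4.0, 3.0, 5.0]                                  # facility opening costs
    assign = [[1, 4, 6], [5, 2, 7], [6, 3, 1], [2, 5, 4]]    # client-to-facility costs
    print(multistart(fixed, assign))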

Journal Article
TL;DR: In this paper, the notion of resource-fair protocols is introduced; informally, this property states that if one party learns the output of the protocol, then so can all other parties, as long as they expend roughly the same amount of resources.
Abstract: We introduce the notion of resource-fair protocols. Informally, this property states that if one party learns the output of the protocol, then so can all other parties, as long as they expend roughly the same amount of resources. As opposed to similar previously proposed definitions, our definition follows the standard simulation paradigm and enjoys strong composability properties. In particular, our definition is similar to the security definition in the universal composability (UC) framework, but works in a model that allows any party to request additional resources from the environment to deal with dishonest parties that may prematurely abort. In this model we specify the ideally fair functionality as allowing parties to invest resources in return for outputs, but in such an event offering all other parties a fair deal. (The formulation of fair dealings is kept independent of any particular functionality, by defining it using a wrapper.) Thus, by relaxing the notion of fairness, we avoid a well-known impossibility result for fair multi-party computation with corrupted majority; in particular, our definition admits constructions that tolerate an arbitrary number of corruptions. We also show that, as in the UC framework, protocols in our framework may be arbitrarily and concurrently composed. Turning to constructions, we define a commit-prove-fair-open functionality and design an efficient resource-fair protocol that securely realizes it, using a new variant of a cryptographic primitive known as time-lines. With (the fairly wrapped version of) this functionality we show that some of the existing secure multi-party computation protocols can be easily transformed into resource-fair protocols while preserving their security.

Journal ArticleDOI
26 Jun 2006
TL;DR: This paper optimizes packet classifier configurations by identifying semantically equivalent rule sets that lead to a reduced number of TCAM entries when represented in hardware, develops a number of effective techniques for doing so, and shows that previously proposed techniques can still be applied on the rule sets optimized by this scheme.
Abstract: Serving as the core component in many packet forwarding, differentiating and filtering schemes, packet classification continues to grow in importance in today's IP networks. Currently, most vendors use Ternary CAMs (TCAMs) for packet classification. TCAMs usually use brute-force parallel hardware to check all rules simultaneously. One of the fundamental problems with TCAMs is that they handle range specifications poorly, because rules with range specifications need to be translated into multiple TCAM entries. Hence, the cost of packet classification will increase substantially as the number of TCAM entries grows. As a result, network operators hesitate to configure packet classifiers using range specifications. In this paper, we optimize packet classifier configurations by identifying semantically equivalent rule sets that lead to a reduced number of TCAM entries when represented in hardware. In particular, we develop a number of effective techniques, which include: trimming rules, expanding rules, merging rules, and adding rules. Compared with previously proposed techniques, which typically require modifications to the packet processor hardware, our scheme does not require any hardware modification, which is highly preferred by ISPs. Moreover, our scheme is complementary to previous techniques in that those techniques can be applied on the rule sets optimized by our scheme. We evaluate the effectiveness and potential of the proposed techniques using extensive experiments based on both real packet classifiers managed by a large tier-1 ISP and synthetic data generated randomly. We observe a significant reduction in the number of TCAM entries needed to represent the optimized packet classifier configurations.
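
To see where the blow-up comes from, the standard prefix expansion of a single 16-bit port range into ternary entries is sketched below (a textbook illustration of the problem that rule trimming, expanding, and merging attack, not part of the paper itself). A rule with ranges on two fields multiplies the two expansions together.

    def range_to_prefixes(lo, hi, bits=16):
        """Cover [lo, hi] with aligned power-of-two blocks, each one ternary entry."""
        prefixes = []
        while lo <= hi:
            # largest aligned block starting at lo that stays within [lo, hi]
            size = lo & -lo if lo else 1 << bits
            while size > hi - lo + 1:
                size >>= 1
            prefixes.append((lo, bits - size.bit_length() + 1))   # (value, prefix length)
            lo += size
        return prefixes

    entries = range_to_prefixes(1024, 65535)      # e.g. a "port >= 1024" condition
    print(len(entries), entries)                  # 6 ternary entries for this one field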

Proceedings ArticleDOI
21 Jul 2006
TL;DR: Three versions of the negative binomial regression model, as well as a simple lines-of-code-based model, are explored to make predictions for this system, and the differences observed from the earlier studies are discussed.
Abstract: We continue investigating the use of a negative binomial regression model to predict which files in a large industrial software system are most likely to contain many faults in the next release. A new empirical study is described whose subject is an automated voice response system. Not only is this system's functionality substantially different from that of the earlier systems we studied (an inventory system and a service provisioning system), it also uses a significantly different software development process. Instead of having regularly scheduled releases as both of the earlier systems did, this system has what are referred to as "continuous releases." We explore the use of three versions of the negative binomial regression model, as well as a simple lines-of-code based model, to make predictions for this system and discuss the differences observed from the earlier studies. Despite the different development process, the best version of the prediction model was able to identify, over the lifetime of the project, 20% of the system's files that contained, on average, nearly three quarters of the faults that were detected in the system's next releases.
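
A hedged sketch of the modeling family on synthetic data (predictors, coefficients, and the dispersion value are illustrative assumptions; the paper's models, predictor sets, and evaluation protocol differ):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 500
    log_kloc = rng.normal(0.0, 1.0, n)               # log of file size in KLOC
    is_new = rng.integers(0, 2, n)                   # 1 if the file is new in this release
    mu = np.exp(0.3 + 0.9 * log_kloc + 0.7 * is_new)
    faults = rng.negative_binomial(n=2.0, p=2.0 / (2.0 + mu))   # overdispersed counts

    X = sm.add_constant(np.column_stack([log_kloc, is_new]))
    model = sm.GLM(faults, X, family=sm.families.NegativeBinomial(alpha=0.5)).fit()

    # Rank files by predicted fault count and take the top 20%, as in such evaluations.
    pred = model.predict(X)
    top20 = np.argsort(pred)[::-1][: n // 5]
    print(f"top 20% of files account for {faults[top20].sum() / faults.sum():.0%} of faults")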

Proceedings ArticleDOI
20 Aug 2006
TL;DR: This paper analyzes the trajectory segmentation problem from a global perspective, utilizing data-aware, distance-based optimization techniques that optimize pairwise distance estimates, leading to more efficient object pruning.
Abstract: This work introduces distance-based criteria for segmentation of object trajectories. Segmentation leads to simplification of the original objects into smaller, less complex primitives that are better suited for storage and retrieval purposes. Previous work on trajectory segmentation attacked the problem locally, segmenting separately each trajectory of the database. Therefore, it did not directly optimize the inter-object separability, which is necessary for mining operations such as searching, clustering, and classification on large databases. In this paper we analyze the trajectory segmentation problem from a global perspective, utilizing data-aware, distance-based optimization techniques, which optimize pairwise distance estimates and hence lead to more efficient object pruning. We first derive exact solutions of the distance-based formulation. Due to the intractable complexity of the exact solution, we present an approximate, greedy solution that exploits forward searching of locally optimal solutions. Since the greedy solution also imposes a prohibitive computational cost, we also put forward more lightweight variance-based segmentation techniques, which intelligently "relax" the pairwise distances only in the areas that least affect the mining operation.

Journal ArticleDOI
26 Jun 2006
TL;DR: This work designs novel inference techniques that, by statistically correlating SNMP link loads and sampled NetFlow records, allow for much more accurate estimation of traffic matrices than obtainable from either information source alone, even when sampled NetFlow records are available at only a subset of ingress points.
Abstract: Estimation of traffic matrices, which provide critical input for network capacity planning and traffic engineering, has recently been recognized as an important research problem. Most of the previous approaches infer traffic matrix from either SNMP link loads or sampled NetFlow records. In this work, we design novel inference techniques that, by statistically correlating SNMP link loads and sampled NetFlow records, allow for much more accurate estimation of traffic matrices than obtainable from either information source alone, even when sampled NetFlow records are available at only a subset of ingress points. Our techniques are practically important and useful since both SNMP and NetFlow are now widely supported by vendors and deployed in most of the operational IP networks. More importantly, this research leads us to a new insight that SNMP link loads and sampled NetFlow records can serve as "error correction codes" to each other. This insight helps us to solve a challenging open problem in traffic matrix estimation, "How to deal with dirty data (SNMP and NetFlow measurement errors due to hardware/software/transmission problems)?" We design techniques that, by comparing notes between the above two information sources, identify and remove dirty data, and therefore allow for accurate estimation of the traffic matrices with the cleaned data. We conducted experiments on real measurement data obtained from a large tier-1 ISP backbone network. We show that, when full deployment of NetFlow is not available, our algorithm can improve estimation accuracy significantly even with a small fraction of NetFlow data. More importantly, we show that dirty data can contaminate a traffic matrix, and identifying and removing them can reduce errors in traffic matrix estimation by up to an order of magnitude. Routing changes are another key factor that affects estimation accuracy. We show that using them as a priori information, the traffic matrices can be estimated much more accurately than when routing changes are ignored. To the best of our knowledge, this work is the first to offer a comprehensive solution which fully takes advantage of using multiple readily available data sources. Our results provide valuable insights on the effectiveness of combining flow measurement and link load measurement.
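
The gist of fusing the two sources can be seen in a toy weighted least-squares sketch (our simplification with a made-up 3-link, 4-flow topology; the paper's estimator, its error model, and its dirty-data detection are considerably more involved):

    import numpy as np

    A = np.array([[1, 1, 0, 0],      # link 1 carries OD flows 1 and 2
                  [0, 1, 1, 0],      # link 2 carries OD flows 2 and 3
                  [0, 0, 1, 1]])     # link 3 carries OD flows 3 and 4
    x_true = np.array([10.0, 5.0, 8.0, 2.0])
    y = A @ x_true                                   # SNMP link loads

    measured = {0: 9.6, 2: 8.3}                      # sampled NetFlow estimates of OD flows 1 and 3
    D = np.zeros((len(measured), 4))
    for row, od in enumerate(measured):
        D[row, od] = 1.0
    z = np.array(list(measured.values()))

    w_snmp, w_flow = 1.0, 0.5                        # relative trust in each source
    M = np.vstack([w_snmp * A, w_flow * D])
    b = np.concatenate([w_snmp * y, w_flow * z])
    x_hat, *_ = np.linalg.lstsq(M, b, rcond=None)    # jointly consistent traffic matrix
    print(np.round(x_hat, 2))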

Proceedings ArticleDOI
25 Oct 2006
TL;DR: This work has collected a large streaming media workload from thousands of broadband home users and business users hosted by a major ISP, and analyzed the most commonly used streaming techniques such as automatic protocol switch, Fast Streaming, MBR encoding and rate adaptation.
Abstract: Modern Internet streaming services have utilized various techniques to improve the quality of streaming media delivery. Despite the characterization of media access patterns and user behaviors in many measurement studies, few studies have focused on the streaming techniques themselves, particularly on the quality of streaming experiences they offer end users and on the resources of the media systems that they consume. In order to gain insights into current streaming service techniques and thus provide guidance on designing resource-efficient and high quality streaming media systems, we have collected a large streaming media workload from thousands of broadband home users and business users hosted by a major ISP, and analyzed the most commonly used streaming techniques such as automatic protocol switch, Fast Streaming, MBR encoding and rate adaptation. Our measurement and analysis results show that with these techniques, current streaming systems tend to over-utilize CPU and bandwidth resources to provide better services to end users, which is not necessarily a desirable or effective way to improve the quality of streaming media delivery. Motivated by these results, we propose and evaluate a coordination mechanism that effectively takes advantage of both Fast Streaming and rate adaptation to better utilize the server and Internet resources for streaming quality improvement.

Journal ArticleDOI
TL;DR: The authors discuss the proposition that most of the temporal PMD changes that are observed in installed routes arise primarily from a relatively small number of "hot spots" along the route that are exposed to the ambient environment, whereas the buried shielded sections remain largely stable for month-long time periods.
Abstract: Polarization mode dispersion (PMD), a potentially limiting impairment in high-speed long-distance fiber-optic communication systems, refers to the distortion of propagating optical pulses due to random birefringences in an optical system. Because these perturbations (which can be introduced through manufacturing imperfections, cabling stresses, installation procedures, and environmental sensitivities of fiber and other in-line components) are unknowable and continually changing, PMD is unique among optical impairments. This makes PMD both a fascinating research subject and potentially one of the most challenging technical obstacles for future optoelectronic transmission. Mitigation and compensation techniques, proper emulation, and accurate prediction of PMD-induced outage probabilities critically depend on the understanding and modeling of the statistics of PMD in installed links. Using extensive data on buried fibers used in long-haul high-speed links, the authors discuss the proposition that most of the temporal PMD changes that are observed in installed routes arise primarily from a relatively small number of "hot spots" along the route that are exposed to the ambient environment, whereas the buried shielded sections remain largely stable for month-long time periods. It follows that the temporal variations of the differential group delay for any given channel constitute a distinct statistical distribution with its own channel-specific mean value. The impact of these observations on outage statistics is analyzed, and the implications for future optoelectronic fiber-based transmission are discussed.

Proceedings ArticleDOI
20 Aug 2006
TL;DR: A simple exact algorithm for finding the largest discrepancy region in a domain and a new approximation algorithm for a large class of discrepancy functions (including the Kulldorff scan statistic) that improves the approximation versus run time trade-off of prior methods are described.
Abstract: Spatial scan statistics are used to determine hotspots in spatial data, and are widely used in epidemiology and biosurveillance. In recent years, there has been much effort invested in designing efficient algorithms for finding such "high discrepancy" regions, with methods ranging from fast heuristics for special cases, to general grid-based methods, and to efficient approximation algorithms with provable guarantees on performance and quality. In this paper, we make a number of contributions to the computational study of spatial scan statistics. First, we describe a simple exact algorithm for finding the largest discrepancy region in a domain. Second, we propose a new approximation algorithm for a large class of discrepancy functions (including the Kulldorff scan statistic) that improves the approximation versus run time trade-off of prior methods. Third, we extend our simple exact and our approximation algorithms to data sets which lie naturally on a grid or are accumulated onto a grid. Fourth, we conduct a detailed experimental comparison of these methods with a number of known methods, demonstrating that our approximation algorithm has far superior performance in practice to prior methods, and exhibits a good performance-accuracy trade-off. All extant methods (including those in this paper) are suitable for data sets that are modestly sized; if data sets are of the order of millions of data points, none of these methods scale well. For such massive data settings, it is natural to examine whether small-space streaming algorithms might yield accurate answers. Here, we provide some negative results, showing that any streaming algorithms that even provide approximately optimal answers to the discrepancy maximization problem must use space linear in the input.
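
For reference, the simple exact baseline on gridded data follows directly from the definition of the Kulldorff discrepancy; the sketch below is our illustration of that exhaustive scan, not the paper's faster exact or approximate algorithms.

    import math
    import numpy as np

    def kulldorff(m, b):
        # d(m, b) = m ln(m/b) + (1 - m) ln((1 - m)/(1 - b)) when m > b, else 0
        if m <= b or b <= 0.0 or m >= 1.0:
            return 0.0
        return m * math.log(m / b) + (1 - m) * math.log((1 - m) / (1 - b))

    def max_discrepancy_region(measured, baseline):
        # Exhaustive scan over all axis-aligned sub-grids; fine only for small grids.
        M, B = measured.sum(), baseline.sum()
        best, best_region = 0.0, None
        rows, cols = measured.shape
        for r1 in range(rows):
            for r2 in range(r1, rows):
                for c1 in range(cols):
                    for c2 in range(c1, cols):
                        m = measured[r1:r2 + 1, c1:c2 + 1].sum() / M
                        b = baseline[r1:r2 + 1, c1:c2 + 1].sum() / B
                        d = kulldorff(m, b)
                        if d > best:
                            best, best_region = d, (r1, r2, c1, c2)
        return best, best_region

    baseline = np.ones((8, 8))
    measured = np.ones((8, 8))
    measured[2:4, 5:7] += 6.0                            # injected hotspot
    print(max_discrepancy_region(measured, baseline))    # should locate the hotspot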

Proceedings ArticleDOI
26 Jun 2006
TL;DR: This work presents the first deterministic algorithms for answering biased quantile queries accurately with small (sublinear in the input size) space and time bounds in one pass, and shows that they use less space than existing methods in many practical settings and are fast to maintain.
Abstract: Skew is prevalent in data streams, and should be taken into account by algorithms that analyze the data. The problem of finding "biased quantiles"—that is, approximate quantiles which must be more accurate for more extreme values—is a framework for summarizing such skewed data on data streams. We present the first deterministic algorithms for answering biased quantiles queries accurately with small—sublinear in the input size—space and time bounds in one pass. The space bound is near-optimal, and the amortized update cost is close to constant, making it practical for handling high speed network data streams. We not only demonstrate theoretical properties of the algorithm, but also show it uses less space than existing methods in many practical settings, and is fast to maintain.
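
Restating the accuracy target in our notation: a biased-quantile summary queried at rank r (say r = 0.001 n for a far tail) must return an element whose true rank r' satisfies

    |r' - r| \le \varepsilon \, r,

so the permitted error shrinks in proportion to how extreme the queried value is, instead of the uniform ±εn slack of standard approximate quantile summaries.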

Journal ArticleDOI
TL;DR: A reduction is shown from the complexity of one-round, information-theoretic Private Information Retrieval systems (with two servers) to Locally Decodable Codes, and it is proved that if all the servers' answers are linear combinations of the database content, then t = Ω(n/2^a), where t is the length of the user's query and a is the length of the servers' answers.
Abstract: We prove that if a linear error-correcting code C: {0, 1}^n → {0, 1}^m is such that a bit of the message can be probabilistically reconstructed by looking at two entries of a corrupted codeword, then m = 2^{Ω(n)}. We also present several extensions of this result. We show a reduction from the complexity of one-round, information-theoretic Private Information Retrieval Systems (with two servers) to Locally Decodable Codes, and conclude that if all the servers' answers are linear combinations of the database content, then t = Ω(n/2^a), where t is the length of the user's query and a is the length of the servers' answers. Actually, 2^a can be replaced by O(a^k), where k is the number of bit locations in the answer that are actually inspected in the reconstruction.

Journal ArticleDOI
David Applegate1, Edith Cohen1
TL;DR: It is possible to obtain a robust routing that guarantees a nearly optimal utilization with a fairly limited knowledge of the applicable traffic demands, and novel algorithms for constructing optimal robust routings are developed.
Abstract: Intra-domain traffic engineering can significantly enhance the performance of large IP backbone networks. Two important components of traffic engineering are understanding the traffic demands and configuring the routing protocols. These two components are inter-linked, as it is widely believed that an accurate view of traffic is important for optimizing the configuration of routing protocols, and through that, the utilization of the network. This basic premise, however, seems never to have been quantified. How important is accurate knowledge of traffic demands for obtaining good utilization of the network? Since traffic demand values are dynamic and elusive, is it possible to obtain a routing that is "robust" to variations in demands? We develop novel algorithms for constructing optimal robust routings and for evaluating the performance of any given routing on loosely constrained rich sets of traffic demands. Armed with these algorithms we explore these questions on a diverse collection of ISP networks. We arrive at a surprising conclusion: it is possible to obtain a robust routing that guarantees a nearly optimal utilization with a fairly limited knowledge of the applicable traffic demands.
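
In our notation (a schematic of the objective only, omitting the paper's linear-programming machinery and demand models), the robust routing sought is one that minimizes the worst-case ratio to the best achievable utilization over a demand set D:

    \min_{f} \; \max_{d \in \mathcal{D}} \; \frac{\max_{e} \, \mathrm{load}_{f}(e, d) / c(e)}{\mathrm{OPTU}(d)},

where f ranges over routings, load_f(e, d) is the traffic that routing f places on link e under demand matrix d, c(e) is the capacity of e, and OPTU(d) is the minimum achievable maximum utilization for demands d.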

Proceedings ArticleDOI
03 Apr 2006
TL;DR: The compilation rules for the complete language into that algebra are described, and novel optimization techniques that address the needs of complex queries are presented, including query unnesting rewritings and specialized join algorithms that account for XQuery's complex predicate semantics.
Abstract: As XQuery nears standardization, more sophisticated XQuery applications are emerging, which often exploit the entire language and are applied to non-trivial XML sources. We propose an algebra and optimization techniques that are suitable for building an XQuery compiler that is complete, correct, and efficient. We describe the compilation rules for the complete language into that algebra and present novel optimization techniques that address the needs of complex queries. These techniques include new query unnesting rewritings and specialized join algorithms that account for XQuery’s complex predicate semantics. The algebra and optimizations are implemented in the Galax XQuery engine, and yield execution plans that are up to three orders of magnitude faster than earlier versions of Galax.

Journal ArticleDOI
01 Dec 2006
TL;DR: Support for a combination of "structured" and full-text search for effectively querying XML documents was unanimous in a recent panel at SIGMOD 2005, and is being widely studied in the IR community.
Abstract: The development of approaches to access XML content has generated a wealth of issues in information retrieval (IR) and databases (DB) (e.g., [2, 15, 17, 20, 19, 47, 26, 32, 24]). While the IR community has traditionally focused on searching unstructured content, and has developed various techniques for ranking query results and evaluating their effectiveness, the DB community has focused on developing query languages and efficient evaluation algorithms for highly structured content. Recent trends in DB and IR research demonstrate a growing interest in merging IR and DB techniques for accessing XML content. Support for a combination of "structured" and full-text search for effectively querying XML documents was unanimous in a recent panel at SIGMOD 2005 [3], and is being widely studied in the IR community [20].