
Showing papers from AT&T Labs published in 2004


Proceedings ArticleDOI
04 Jul 2004
TL;DR: This work proposes the use of maximum-entropy techniques for this problem, specifically, sequential-update algorithms that can handle a very large number of features, and investigates the interpretability of models constructed using maxent.
Abstract: We study the problem of modeling species geographic distributions, a critical problem in conservation biology. We propose the use of maximum-entropy techniques for this problem, specifically, sequential-update algorithms that can handle a very large number of features. We describe experiments comparing maxent with a standard distribution-modeling tool, called GARP, on a dataset containing observation data for North American breeding birds. We also study how well maxent performs as a function of the number of training examples and training time, analyze the use of regularization to avoid overfitting when the number of examples is small, and explore the interpretability of models constructed using maxent.
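
The fitted maxent model takes the familiar Gibbs form: the probability assigned to a map cell is proportional to the exponential of a weighted sum of that cell's environmental features. The Python sketch below evaluates such a model on a toy grid; the feature values, weights, and presence records are illustrative placeholders, and the sequential-update fitting procedure from the paper is not shown.

```python
import numpy as np

# Illustrative only: rows are grid cells, columns are environmental features
# (e.g., elevation, annual precipitation), rescaled to [0, 1].
features = np.array([
    [0.2, 0.9],
    [0.5, 0.4],
    [0.8, 0.1],
    [0.3, 0.7],
])
weights = np.array([1.3, -0.4])          # hypothetical fitted feature weights

def gibbs_distribution(features, weights):
    """Maxent / Gibbs form: p(cell) proportional to exp(weights . f(cell))."""
    scores = features @ weights
    scores -= scores.max()               # for numerical stability
    p = np.exp(scores)
    return p / p.sum()

p = gibbs_distribution(features, weights)
presence_cells = [0, 3]                  # hypothetical observation records
log_loss = -np.log(p[presence_cells]).mean()
print(p, log_loss)
```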

1,956 citations


Journal ArticleDOI
Gary M. Weiss
TL;DR: It is demonstrated that rare classes and rare cases are very similar phenomena---both forms of rarity are shown to cause similar problems during data mining and benefit from the same remediation methods.
Abstract: Rare objects are often of great interest and great value. Until recently, however, rarity has not received much attention in the context of data mining. Now, as increasingly complex real-world problems are addressed, rarity, and the related problem of imbalanced data, are taking center stage. This article discusses the role that rare classes and rare cases play in data mining. The problems that can result from these two forms of rarity are described in detail, as are methods for addressing these problems. These descriptions utilize examples from existing research, so that this article provides a good survey of the literature on rarity in data mining. This article also demonstrates that rare classes and rare cases are very similar phenomena---both forms of rarity are shown to cause similar problems during data mining and benefit from the same remediation methods.

1,409 citations


Proceedings ArticleDOI
17 May 2004
TL;DR: In this article, the authors identify the application-level signatures by examining available documentation and packet-level traces, and then utilize the identified signatures to develop online filters that can efficiently and accurately track the P2P traffic even on high-speed network links.
Abstract: The ability to accurately identify the network traffic associated with different P2P applications is important to a broad range of network operations including application-specific traffic engineering, capacity planning, provisioning, service differentiation, etc. However, traditional traffic-to-application mapping techniques, such as disambiguation based on default server TCP or UDP ports, are highly inaccurate for some P2P applications. In this paper, we provide an efficient approach for identifying P2P application traffic through application-level signatures. We first identify the application-level signatures by examining available documentation and packet-level traces. We then utilize the identified signatures to develop online filters that can efficiently and accurately track P2P traffic even on high-speed network links. We examine the performance of our application-level identification approach using five popular P2P protocols. Our measurements show that our technique achieves less than 5% false positive and false negative ratios in most cases. We also show that our approach only requires examining the very first few packets (fewer than 10 packets) to identify a P2P connection, which makes it highly scalable. Our technique can significantly improve P2P traffic volume estimates over what pure network-port-based approaches provide. For instance, we were able to identify 3 times as much traffic for the popular Kazaa P2P protocol, compared to the traditional port-based approach.
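
As a rough illustration of the kind of online filter described above, the sketch below inspects only the first few payload-carrying packets of a connection for known byte patterns. The signature strings and the packet limit are placeholders chosen for the example, not the signatures derived in the paper.

```python
# Sketch of an application-level signature filter: inspect only the first few
# payload-carrying packets of a connection. The byte patterns below are
# illustrative placeholders, not the signatures identified in the paper.
SIGNATURES = {
    b"GNUTELLA CONNECT": "Gnutella",
    b"GET /uri-res/":    "Gnutella",
    b"GIVE ":            "Kazaa/FastTrack (hypothetical marker)",
}
MAX_PACKETS = 10   # the paper reports fewer than 10 packets usually suffice

def classify_connection(payloads):
    """payloads: list of bytes objects, one per packet, in arrival order."""
    for payload in payloads[:MAX_PACKETS]:
        for pattern, proto in SIGNATURES.items():
            if pattern in payload:
                return proto
    return "unknown"

print(classify_connection([b"GNUTELLA CONNECT/0.6\r\n", b"..."]))
```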

856 citations


Journal ArticleDOI
Subhabrata Sen, Jia Wang
TL;DR: The high volume and good stability properties of P2P traffic suggest that the P2P workload is a good candidate for being managed via application-specific layer-3 traffic engineering in an ISP's network.
Abstract: The use of peer-to-peer (P2P) applications is growing dramatically, particularly for sharing large video/audio files and software. In this paper, we analyze P2P traffic by measuring flow-level information collected at multiple border routers across a large ISP network, and report our investigation of three popular P2P systems--FastTrack, Gnutella, and Direct-Connect. We characterize the P2P traffic observed at a single ISP and its impact on the underlying network. We observe a very skewed distribution in the traffic across the network at different levels of spatial aggregation (IP, prefix, AS). All three P2P systems exhibit significant dynamics at short time scales, particularly at the IP address level. Still, the fraction of P2P traffic contributed by each prefix is more stable than the corresponding distribution of either Web traffic or overall traffic. The high volume and good stability properties of P2P traffic suggest that the P2P workload is a good candidate for being managed via application-specific layer-3 traffic engineering in an ISP's network.

691 citations


Proceedings ArticleDOI
25 Oct 2004
TL;DR: It is argued that measurement based automated Class of Service (CoS) mapping is an important practical problem that needs to be studied, and a solution framework for measurement based classification of traffic for QoS based on statistical application signatures is outlined.
Abstract: The ability to provide different Quality of Service (QoS) guarantees to traffic from different applications is a highly desired feature for many IP network operators, particularly for enterprise networks. Although various mechanisms exist for providing QoS in the network, QoS is yet to be widely deployed. We believe that a key factor holding back widespread QoS adoption is the absence of suitable methodologies/processes for appropriately mapping the traffic from different applications to different QoS classes. This is a challenging task, because many enterprise network operators who are interested in QoS do not know all the applications running on their network, and furthermore, over recent years port-based application classification has become problematic. We argue that measurement-based automated Class of Service (CoS) mapping is an important practical problem that needs to be studied. In this paper we describe the requirements and associated challenges, and outline a solution framework for measurement-based classification of traffic for QoS based on statistical application signatures. In our approach the signatures are chosen in such a way as to make them insensitive to the particular application-layer protocol, but rather to determine the way in which an application is used -- for instance, is it used interactively, or for bulk-data transport. The resulting application signature can then be used to derive the network-layer signatures required to determine the CoS class for individual IP datagrams. Our evaluations using traffic traces from a variety of network locations demonstrate the feasibility and potential of the approach.
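
A statistical application signature in this sense describes how a flow behaves (interactive versus bulk transfer) rather than which protocol it speaks. The toy classifier below illustrates the idea with two hypothetical flow features and hand-picked thresholds; the paper's actual feature set and mapping rules are not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class FlowStats:
    mean_packet_size: float      # bytes
    mean_interarrival: float     # seconds between packets
    total_bytes: int

def cos_class(flow: FlowStats) -> str:
    """Toy CoS mapping: thresholds are illustrative, not from the paper."""
    if flow.mean_packet_size < 300 and flow.mean_interarrival < 0.5:
        return "interactive"             # e.g., terminal-session-like usage
    if flow.total_bytes > 1_000_000:
        return "bulk-data"               # e.g., file-transfer-like usage
    return "default"

print(cos_class(FlowStats(mean_packet_size=120.0,
                          mean_interarrival=0.05,
                          total_bytes=40_000)))
```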

552 citations


Book ChapterDOI
01 Jan 2004
TL;DR: Graphviz is a collection of software for viewing and manipulating abstract graphs that provides graph visualization for tools and web sites in domains such as software engineering, networking, databases, knowledge representation, and bioinformatics.
Abstract: Graphviz is a collection of software for viewing and manipulating abstract graphs. It provides graph visualization for tools and web sites in domains such as software engineering, networking, databases, knowledge representation, and bioinformatics. Hundreds of thousands of copies have been distributed under an open source license.

469 citations


Proceedings ArticleDOI
21 Jul 2004
TL;DR: It is demonstrated that training a perceptron model to combine with the generative model during search provides a 2.1 percent F-measure improvement over the generative model alone, to 88.8 percent.
Abstract: This paper describes an incremental parsing approach where parameters are estimated using a variant of the perceptron algorithm. A beam-search algorithm is used during both training and decoding phases of the method. The perceptron approach was implemented with the same feature set as that of an existing generative model (Roark, 2001a), and experimental results show that it gives competitive performance to the generative model on parsing the Penn treebank. We demonstrate that training a perceptron model to combine with the generative model during search provides a 2.1 percent F-measure improvement over the generative model alone, to 88.8 percent.
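
The estimation step is a structured perceptron: decode with the current weights and, whenever the highest-scoring candidate differs from the gold parse, add the gold feature vector and subtract the predicted one. The sketch below shows that update in isolation; the beam-search decoder and the parsing feature set of the paper are abstracted behind a candidate-generation function, so treat it as a schematic rather than the authors' implementation.

```python
from collections import defaultdict

def score(weights, features):
    """Dot product of sparse feature counts with the weight vector."""
    return sum(weights[f] * v for f, v in features.items())

def perceptron_epoch(weights, training_data, candidates):
    """One pass of a structured perceptron.

    training_data: list of (sentence, gold_features) pairs.
    candidates(sentence, weights): stand-in for the beam-search decoder;
    returns a list of (parse, features) pairs for the sentence.
    """
    for sentence, gold_features in training_data:
        _, best_features = max(
            candidates(sentence, weights),
            key=lambda c: score(weights, c[1]),
        )
        if best_features != gold_features:
            for f, v in gold_features.items():
                weights[f] += v            # promote gold-parse features
            for f, v in best_features.items():
                weights[f] -= v            # demote predicted-parse features
    return weights

weights = defaultdict(float)   # feature name -> weight
```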

457 citations


Proceedings ArticleDOI
30 Aug 2004
TL;DR: It is claimed that very simple models that incorporate hard technological constraints on router and link bandwidth and connectivity, together with abstract models of user demand and network performance, can successfully address this challenge and further resolve much of the confusion and controversy that has surrounded topology generation and evaluation.
Abstract: A detailed understanding of the many facets of the Internet's topological structure is critical for evaluating the performance of networking protocols, for assessing the effectiveness of proposed techniques to protect the network from nefarious intrusions and attacks, or for developing improved designs for resource provisioning. Previous studies of topology have focused on interpreting measurements or on phenomenological descriptions and evaluation of graph-theoretic properties of topology generators. We propose a complementary approach of combining a more subtle use of statistics and graph theory with a first-principles theory of router-level topology that reflects practical constraints and tradeoffs. While there is an inevitable tradeoff between model complexity and fidelity, a challenge is to distill from the seemingly endless list of potentially relevant technological and economic issues the features that are most essential to a solid understanding of the intrinsic fundamentals of network topology. We claim that very simple models that incorporate hard technological constraints on router and link bandwidth and connectivity, together with abstract models of user demand and network performance, can successfully address this challenge and further resolve much of the confusion and controversy that has surrounded topology generation and evaluation.

422 citations


Book ChapterDOI
29 Sep 2004
TL;DR: This work shows how to draw graphs by stress majorization, adapting a technique known in the MDS community for more than two decades; majorization appears to have advantages over the technique of Kamada and Kawai in running time and stability.
Abstract: One of the most popular graph drawing methods is based on achieving graph-theoretic target distances. This method was used by Kamada and Kawai [15], who formulated it as an energy optimization problem. Their energy is known in the multidimensional scaling (MDS) community as the stress function. In this work, we show how to draw graphs by stress majorization, adapting a technique known in the MDS community for more than two decades. It appears that majorization has advantages over the technique of Kamada and Kawai in running time and stability. We also found the majorization-based optimization to be essential for a few extensions of the basic energy model. These extensions can improve layout quality and computation speed in practice.
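
For reference, stress majorization iterates a Guttman transform that is guaranteed not to increase the stress. The numpy sketch below implements the unweighted (unit-weight) variant for brevity; graph-drawing stress in the Kamada-Kawai style normally weights each pair by the inverse square of its graph-theoretic distance, and that weighting is omitted here.

```python
import numpy as np

def stress(X, D):
    """Stress of layout X (n x 2) w.r.t. target distances D (n x n), unit weights."""
    diff = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    iu = np.triu_indices(len(X), k=1)
    return np.sum((diff[iu] - D[iu]) ** 2)

def majorize(D, dim=2, iters=100, seed=0):
    """SMACOF-style stress majorization with unit weights (a simplified sketch)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    X = rng.standard_normal((n, dim))
    for _ in range(iters):
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(dist, 1.0)            # avoid division by zero
        B = -D / dist
        np.fill_diagonal(B, 0.0)
        np.fill_diagonal(B, -B.sum(axis=1))    # make row sums zero
        X = B @ X / n                          # Guttman transform (unit weights)
    return X
```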

402 citations


Journal ArticleDOI
S.S. Ghassemzadeh, Rittwik Jana, Christopher W. Rice, W. Turin, Vahid Tarokh
TL;DR: A path loss model as well as a second-order autoregressive model is proposed for frequency response generation of the UWB indoor channel and results of frequency-domain channel sounding in residential environments are described.
Abstract: This paper describes the results of frequency-domain channel sounding in residential environments. It consists of detailed characterization of complex frequency responses of ultra-wideband (UWB) signals having a nominal center frequency of 5 GHz. A path loss model as well as a second-order autoregressive model is proposed for frequency response generation of the UWB indoor channel. Probability distributions of the model parameters for different locations are presented. Also, time-domain results such as root mean square delay spread and percent of captured power are presented.
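
To make the modeling idea concrete, a second-order autoregressive generator produces a complex frequency response by recursing across frequency bins. The coefficients and noise level in the sketch below are arbitrary stable placeholders, not the measured parameter distributions reported in the paper.

```python
import numpy as np

def ar2_frequency_response(a1, a2, n_bins, noise_std=1.0, seed=0):
    """Generate a complex channel frequency response H[k] from an AR(2)
    recursion across frequency bins: H[k] = a1*H[k-1] + a2*H[k-2] + n[k].
    The coefficients and noise level are placeholders, not the paper's
    measured parameters."""
    rng = np.random.default_rng(seed)
    noise = (rng.standard_normal(n_bins) + 1j * rng.standard_normal(n_bins))
    noise *= noise_std / np.sqrt(2)
    H = np.zeros(n_bins, dtype=complex)
    for k in range(n_bins):
        H[k] = noise[k]
        if k >= 1:
            H[k] += a1 * H[k - 1]
        if k >= 2:
            H[k] += a2 * H[k - 2]
    return H

H = ar2_frequency_response(a1=0.6, a2=0.2, n_bins=256)   # stable toy values
```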

336 citations


Journal ArticleDOI
Mikkel Thorup
TL;DR: It is shown that a planar digraph can be preprocessed in near-linear time, producing a near-linear space oracle that can answer reachability queries in constant time.
Abstract: It is shown that a planar digraph can be preprocessed in near-linear time, producing a near-linear space oracle that can answer reachability queries in constant time. The oracle can be distributed as an O(log n) space label for each vertex and then we can determine if one vertex can reach another considering their two labels only. The approach generalizes to give a near-linear space approximate distances oracle for a weighted planar digraph. With weights drawn from {0, …, N}, it approximates distances within a factor (1 + ε) in O(log log(nN) + 1/ε) time. Our scheme can be extended to find and route along correspondingly short dipaths.

Book ChapterDOI
19 Feb 2004
TL;DR: Starting with the seminal paper of Impagliazzo and Rudich [17], there has been a large body of work showing that various cryptographic primitives cannot be reduced to each other via “black-box” reductions.
Abstract: Starting with the seminal paper of Impagliazzo and Rudich [17], there has been a large body of work showing that various cryptographic primitives cannot be reduced to each other via “black-box” reductions. The common interpretation of these results is that there are inherent limitations in using a primitive as a black box, and that these impossibility results can be overcome only by explicitly using the code of the primitive in the construction.

Proceedings ArticleDOI
30 Aug 2004
TL;DR: This work presents a design overview of RCP based on three architectural principles: path computation based on a consistent view of network state, controlled interactions between routing protocol layers, and expressive specification of routing policies, and discusses the architectural strengths and weaknesses of the proposal.
Abstract: Over the past decade, the complexity of the Internet's routing infrastructure has increased dramatically. This complexity and the problems it causes stem not just from various new demands made of the routing infrastructure, but also from fundamental limitations in the ability of today's distributed infrastructure to scalably cope with new requirements. The limitations in today's routing system arise in large part from the fully distributed path-selection computation that the IP routers in an autonomous system (AS) must perform. To overcome this weakness, interdomain routing should be separated from today's IP routers, which should simply forward packets (for the most part). Instead, a separate Routing Control Platform (RCP) should select routes on behalf of the IP routers in each AS and exchange reachability information with other domains. Our position is that an approach like RCP is a good way of coping with complexity while being responsive to new demands, and can lead to a routing system that is substantially easier to manage than today's. We present a design overview of RCP based on three architectural principles: path computation based on a consistent view of network state, controlled interactions between routing protocol layers, and expressive specification of routing policies, and we discuss the architectural strengths and weaknesses of our proposal.

Proceedings ArticleDOI
01 Jul 2004
TL;DR: A negative binomial regression model using information from previous releases has been developed and used to predict the numbers of faults for a large industrial inventory system, and was extremely accurate.
Abstract: The ability to predict which files in a large software system are most likely to contain the largest numbers of faults in the next release can be a very valuable asset. To accomplish this, a negative binomial regression model using information from previous releases has been developed and used to predict the numbers of faults for a large industrial inventory system. The files of each release were sorted in descending order based on the predicted number of faults and then the first 20% of the files were selected. This was done for each of fifteen consecutive releases, representing more than four years of field usage. The predictions were extremely accurate, correctly selecting files that contained between 71% and 92% of the faults, with the overall average being 83%. In addition, the same model was used on data for the same system's releases, but with all fault data prior to integration testing removed. The prediction was again very accurate, ranging from 71% to 93%, with the average being 84%. Predictions were made for a second system, and again the first 20% of files accounted for 83% of the identified faults. Finally, a highly simplified predictor was considered which correctly predicted 73% and 74% of the faults for the two systems.
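
Fitting such a model is routine with standard statistical tooling. The sketch below fits a negative binomial regression with statsmodels on synthetic data and then ranks files by predicted fault count, selecting the top 20%; the predictor names and data are invented for the example and do not reflect the paper's actual covariates.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical per-file predictors; the paper's own covariates differ.
rng = np.random.default_rng(0)
n_files = 200
kloc         = rng.gamma(2.0, 2.0, n_files)     # thousands of lines of code
prior_faults = rng.poisson(3.0, n_files)        # faults in previous release
is_new       = rng.integers(0, 2, n_files)      # newly added file?
faults       = rng.poisson(0.5 + 0.8 * kloc + 0.4 * prior_faults)  # synthetic

X = sm.add_constant(np.column_stack([kloc, prior_faults, is_new]))
model = sm.GLM(faults, X, family=sm.families.NegativeBinomial(alpha=1.0))
result = model.fit()

# Rank files by predicted fault count and select the top 20%.
predicted = result.predict(X)
top20 = np.argsort(predicted)[::-1][: n_files // 5]
print(f"top 20% of files predicted to hold "
      f"{faults[top20].sum() / faults.sum():.0%} of the faults")
```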

Proceedings ArticleDOI
01 Jun 2004
TL;DR: A novel data streaming algorithm to provide much more accurate estimates of flow distribution, using a "lossy data structure" which consists of an array of counters fitted well into SRAM, which not only dramatically improves the accuracy of flow distribution measurement, but also contributes to the field of data streaming.
Abstract: Knowing the distribution of the sizes of traffic flows passing through a network link helps a network operator to characterize network resource usage, infer traffic demands, detect traffic anomalies, and accommodate new traffic demands through better traffic engineering. Previous work on estimating the flow size distribution has been focused on making inferences from sampled network traffic. Its accuracy is limited by the (typically) low sampling rate required to make the sampling operation affordable. In this paper we present a novel data streaming algorithm to provide much more accurate estimates of flow distribution, using a "lossy data structure" which consists of an array of counters fitted well into SRAM. For each incoming packet, our algorithm only needs to increment one underlying counter, making the algorithm fast enough even for 40 Gbps (OC-768) links. The data structure is lossy in the sense that sizes of multiple flows may collide into the same counter. Our algorithm uses Bayesian statistical methods such as Expectation Maximization to infer the most likely flow size distribution that results in the observed counter values after collision. Evaluations of this algorithm on large Internet traces obtained from several sources (including a tier-1 ISP) demonstrate that it has very high measurement accuracy (within 2%). Our algorithm not only dramatically improves the accuracy of flow distribution measurement, but also contributes to the field of data streaming by formalizing an existing methodology and applying it to the context of estimating the flow-distribution.
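
The per-packet work is the point of the design: hash the flow identifier into a fixed array of counters and increment one counter, accepting that several flows may collide in the same slot. A minimal sketch of that update path follows; the Bayesian/EM inference that recovers the flow size distribution from the counter values is not shown, and the hash function here is just a portable stand-in.

```python
import hashlib

class LossyCounterArray:
    """Array-of-counters sketch: one increment per packet, collisions allowed."""

    def __init__(self, size: int = 1 << 20):
        self.counters = [0] * size
        self.size = size

    def _index(self, flow_id: bytes) -> int:
        # Any fast hash works in practice; hashlib keeps the sketch portable.
        digest = hashlib.blake2b(flow_id, digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.size

    def add_packet(self, flow_id: bytes) -> None:
        self.counters[self._index(flow_id)] += 1

counters = LossyCounterArray(size=1 << 16)
counters.add_packet(b"10.0.0.1:1234->10.0.0.2:80/TCP")
```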

Journal ArticleDOI
Andreas F. Molisch
TL;DR: A geometry-based model is proposed that includes the propagation effects that are critical for MIMO performance: i) single scattering around the BS and MS, ii) scattering by far clusters, iii) double-scattering, iv) waveguiding, and v) diffraction by roof edges.
Abstract: This paper derives a generic model for the multiple-input multiple-output (MIMO) wireless channel. The model incorporates important effects, including i) interdependency of directions-of-arrival and directions-of-departure, ii) large delay and angle dispersion by propagation via far clusters, and iii) rank reduction of the transfer function matrix. We propose a geometry-based model that includes the propagation effects that are critical for MIMO performance: i) single scattering around the BS and MS, ii) scattering by far clusters, iii) double-scattering, iv) waveguiding, and v) diffraction by roof edges. The required parameters for the complete definition of the model are enumerated, and typical parameter values in macro and microcellular environments are discussed.

Journal ArticleDOI
TL;DR: The problem of optimizing OSPF weights for a given set of projected demands so as to avoid congestion is shown to be NP-hard, even for approximation, and a local search heuristic is proposed to solve it.
Abstract: Open Shortest Path First (OSPF) is one of the most commonly used intra-domain Internet routing protocols. Traffic flow is routed along shortest paths, splitting flow evenly at nodes where several outgoing links are on shortest paths to the destination. The weights of the links, and thereby the shortest path routes, can be changed by the network operator. The weights could be set proportional to the physical lengths of the links, but often the main goal is to avoid congestion, i.e. overloading of links, and the standard heuristic recommended by Cisco (a major router vendor) is to make the weight of a link inversely proportional to its capacity. We study the problem of optimizing OSPF weights for a given set of projected demands so as to avoid congestion. We show that this problem is NP-hard, even for approximation, and propose a local search heuristic to solve it. We also provide worst-case results about the performance of OSPF routing vs. an optimal multi-commodity flow routing. Our numerical experiments compare the results obtained with our local search heuristic to the optimal multi-commodity flow routing, as well as to simple and commonly used heuristics for setting the weights. Experiments were done with a proposed next-generation AT&T WorldNet backbone as well as synthetic internetworks.
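
A drastically simplified version of the weight search is easy to state: perturb one link weight at a time and keep the change if the maximum link utilization drops. The networkx sketch below does exactly that, but routes each demand on a single shortest path; real OSPF splits traffic evenly over all equal-cost shortest paths, and the paper's heuristic is considerably more sophisticated. It assumes a directed graph whose edges carry "weight" and "capacity" attributes.

```python
import networkx as nx

def max_utilization(G, demands):
    """Route each demand on one shortest path under the current weights and
    return the maximum link utilization (even splitting over equal-cost
    paths, as in real OSPF, is omitted for brevity)."""
    load = {e: 0.0 for e in G.edges}
    for (s, t), volume in demands.items():
        path = nx.shortest_path(G, s, t, weight="weight")
        for u, v in zip(path, path[1:]):
            load[(u, v)] += volume
    return max(load[e] / G.edges[e]["capacity"] for e in G.edges)

def local_search(G, demands, weight_choices=range(1, 21), passes=3):
    """Naive single-weight-change local search over integer link weights."""
    best = max_utilization(G, demands)
    for _ in range(passes):
        for e in G.edges:
            current = G.edges[e]["weight"]
            for w in weight_choices:
                G.edges[e]["weight"] = w
                u = max_utilization(G, demands)
                if u < best:
                    best, current = u, w
            G.edges[e]["weight"] = current      # keep the best weight found
    return best
```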

Journal ArticleDOI
TL;DR: A new approach for obtaining machine cells and product families is presented that combines a local search heuristic with a genetic algorithm and produces solutions with a grouping efficacy at least as good as any results previously reported in the literature.

Book ChapterDOI
01 Jul 2004
TL;DR: Non-asymptotic bounds are proved showing that, with respect to the true underlying distribution, this relaxed version of maxent produces density estimates that are almost as good as the best possible.
Abstract: We consider the problem of estimating an unknown probability distribution from samples using the principle of maximum entropy (maxent). To alleviate overfitting with a very large number of features, we propose applying the maxent principle with relaxed constraints on the expectations of the features. By convex duality, this turns out to be equivalent to finding the Gibbs distribution minimizing a regularized version of the empirical log loss. We prove non-asymptotic bounds showing that, with respect to the true underlying distribution, this relaxed version of maxent produces density estimates that are almost as good as the best possible. These bounds are in terms of the deviation of the feature empirical averages relative to their true expectations, a number that can be bounded using standard uniform-convergence techniques. In particular, this leads to bounds that drop quickly with the number of samples, and that depend very moderately on the number or complexity of the features. We also derive and prove convergence for both sequential-update and parallel-update algorithms. Finally, we briefly describe experiments on data relevant to the modeling of species geographical distributions.
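
In symbols, the relaxation and its dual take the following form, writing the empirical average of feature f_j with a hat and beta_j for the allowed deviation; the notation is chosen here for exposition rather than copied from the paper.

```latex
% Relaxed maxent primal: maximize entropy subject to box constraints
% on the feature expectations.
\max_{q \in \Delta} \; H(q)
\quad \text{s.t.} \quad
\bigl| \mathbb{E}_{q}[f_j] - \hat{\mathbb{E}}[f_j] \bigr| \le \beta_j
\quad \text{for all } j.

% By convex duality this is equivalent to minimizing an l1-regularized
% empirical log loss over Gibbs distributions
% q_\lambda(x) \propto \exp(\lambda \cdot f(x)):
\min_{\lambda} \;
  -\frac{1}{m} \sum_{i=1}^{m} \log q_{\lambda}(x_i)
  \; + \; \sum_{j} \beta_j \, |\lambda_j| .
```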

Proceedings ArticleDOI
01 Jun 2004
TL;DR: It is shown that hot-potato routing changes lead to longer delays in forwarding-plane convergence, shifts in the flow of traffic to neighboring domains, extra externally-visible BGP update messages, and inaccuracies in Internet performance measurements.
Abstract: Despite the architectural separation between intradomain and interdomain routing in the Internet, intradomain protocols do influence the path-selection process in the Border Gateway Protocol (BGP). When choosing between multiple equally-good BGP routes, a router selects the one with the closest egress point, based on the intradomain path cost. Under such hot-potato routing, an intradomain event can trigger BGP routing changes. To characterize the influence of hot-potato routing, we conduct controlled experiments with a commercial router. Then, we propose a technique for associating BGP routing changes with events visible in the intradomain protocol, and apply our algorithm to AT&T's backbone network. We show that (i) hot-potato routing can be a significant source of BGP updates, (ii) BGP updates can lag 60 seconds or more behind the intradomain event, (iii) the number of BGP path changes triggered by hot-potato routing has a nearly uniform distribution across destination prefixes, and (iv) the fraction of BGP messages triggered by intradomain changes varies significantly across time and router locations. We show that hot-potato routing changes lead to longer delays in forwarding-plane convergence, shifts in the flow of traffic to neighboring domains, extra externally-visible BGP update messages, and inaccuracies in Internet performance measurements.
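
The hot-potato step itself is just a tie-break: among BGP routes that are equally good under the earlier decision criteria, choose the one whose egress point is closest in IGP cost. A minimal sketch with made-up route records:

```python
# Each candidate route is assumed to carry the egress router it exits through;
# igp_cost maps egress routers to the local router's IGP path cost to them.
def hot_potato_choice(equally_good_routes, igp_cost):
    """Pick the route with the closest egress point (lowest IGP cost)."""
    return min(equally_good_routes, key=lambda r: igp_cost[r["egress"]])

routes = [{"prefix": "192.0.2.0/24", "egress": "nyc"},
          {"prefix": "192.0.2.0/24", "egress": "chi"}]
igp_cost = {"nyc": 120, "chi": 80}           # an intradomain event can change these
print(hot_potato_choice(routes, igp_cost))   # -> the route via "chi"
```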

Proceedings ArticleDOI
25 Oct 2004
TL;DR: It is shown that the problem of HHH detection can be transformed to one of dynamic packet classification by taking a top-down approach and adaptively creating new rules to match HHHs; the resulting algorithms have much lower worst-case update costs than existing algorithms and can provide tunable deterministic accuracy guarantees.
Abstract: In traffic monitoring, accounting, and network anomaly detection, it is often important to be able to detect high-volume traffic clusters in near real-time. Such heavy-hitter traffic clusters are often hierarchical (i.e., they may occur at different aggregation levels like ranges of IP addresses) and possibly multidimensional (i.e., they may involve the combination of different IP header fields like IP addresses, port numbers, and protocol). Without prior knowledge about the precise structures of such traffic clusters, a naive approach would require the monitoring system to examine all possible combinations of aggregates in order to detect the heavy hitters, which can be prohibitive in terms of computation resources. In this paper, we focus on online identification of 1-dimensional and 2-dimensional hierarchical heavy hitters (HHHs), arguably the two most important scenarios in traffic analysis. We show that the problem of HHH detection can be transformed to one of dynamic packet classification by taking a top-down approach and adaptively creating new rules to match HHHs. We then adapt several existing static packet classification algorithms to support dynamic packet classification. The resulting HHH detection algorithms have much lower worst-case update costs than existing algorithms and can provide tunable deterministic accuracy guarantees. As an application of these algorithms, we also propose robust techniques to detect changes among heavy-hitter traffic clusters. Our techniques can accommodate variability due to sampling that is increasingly used in network measurement. Evaluation based on real Internet traces collected at a Tier-1 ISP suggests that these techniques are remarkably accurate and efficient.
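
For intuition, a prefix is a hierarchical heavy hitter when its traffic, after discounting traffic already attributed to reported descendant prefixes, still exceeds a threshold fraction of the total. The sketch below computes 1-dimensional HHHs in exactly this naive offline way, which is the expensive baseline the paper's online algorithms are designed to avoid.

```python
from collections import defaultdict
import ipaddress

def hhh_1d(byte_counts, phi=0.05, levels=(32, 24, 16, 8, 0)):
    """Naive offline 1-dimensional HHH over IPv4 addresses (illustrative only).

    byte_counts: dict of ip-address string -> byte count. A prefix is reported
    when the traffic of its not-yet-claimed addresses exceeds phi * total.
    """
    total = sum(byte_counts.values())
    threshold = phi * total
    remaining = dict(byte_counts)          # addresses not yet claimed by an HHH
    result = []
    for bits in levels:                    # most specific level first
        buckets = defaultdict(list)
        for ip in remaining:
            net = ipaddress.ip_network(f"{ip}/{bits}", strict=False)
            buckets[net].append(ip)
        for net, ips in buckets.items():
            volume = sum(remaining[ip] for ip in ips)
            if volume > threshold:
                result.append((str(net), volume))
                for ip in ips:             # claim these addresses
                    del remaining[ip]
    return result

counts = {"10.1.1.1": 900, "10.1.1.2": 50, "10.1.2.3": 80, "192.0.2.9": 30}
print(hhh_1d(counts, phi=0.4))             # -> [('10.1.1.1/32', 900)]
```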

Proceedings ArticleDOI
13 Jun 2004
TL;DR: This paper provides an elegant definition of relaxation on structure and defines primitive operators to span the space of relaxations for ranking schemes and proposes natural ranking schemes that adhere to these principles.
Abstract: Querying XML data is a well-explored topic with powerful database-style query languages such as XPath and XQuery set to become W3C standards. An equally compelling paradigm for querying XML documents is full-text search on textual content. In this paper, we study fundamental challenges that arise when we try to integrate these two querying paradigms. While keyword search is based on approximate matching, XPath has exact match semantics. We address this mismatch by considering queries on structure as a "template", and looking for answers that best match this template and the full-text search. To achieve this, we provide an elegant definition of relaxation on structure and define primitive operators to span the space of relaxations. Query answering is now based on ranking potential answers on structural and full-text search conditions. We set out certain desirable principles for ranking schemes and propose natural ranking schemes that adhere to these principles. We develop efficient algorithms for answering top-K queries and discuss results from a comprehensive set of experiments that demonstrate the utility and scalability of the proposed framework and algorithms.

Proceedings ArticleDOI
30 Aug 2004
TL;DR: This paper designs a series of novel smart routing algorithms to optimize cost and performance for multihomed users and suggests that these algorithms are very effective in minimizing cost and at the same time improving performance.
Abstract: Multihoming is often used by large enterprises and stub ISPs to connect to the Internet. In this paper, we design a series of novel smart routing algorithms to optimize cost and performance for multihomed users. We evaluate our algorithms through both analysis and extensive simulations based on realistic charging models, traffic demands, performance data, and network topologies. Our results suggest that these algorithms are very effective in minimizing cost and at the same time improving performance. We further examine the equilibrium performance of smart routing in a global setting and show that a smart routing user can improve its performance without adversely affecting other users.

Journal ArticleDOI
TL;DR: The findings point out the need for continuously questioning the applicability and completeness of data sets at hand when establishing the generality of any particular Internet-specific observation and for assessing its (in)sensitivity to deficiencies in the measurements.

Proceedings ArticleDOI
30 Aug 2004
TL;DR: This paper presents Pathneck, a tool that allows end users to efficiently and accurately locate the bottleneck link on an Internet path based on a novel probing technique called Recursive Packet Train (RPT) and does not require access to the destination.
Abstract: The ability to locate network bottlenecks along end-to-end paths on the Internet is of great interest to both network operators and researchers. For example, knowing where bottleneck links are, network operators can apply traffic engineering either at the interdomain or intradomain level to improve routing. Existing tools either fail to identify the location of bottlenecks, or generate a large amount of probing packets. In addition, they often require access to both end points. In this paper we present Pathneck, a tool that allows end users to efficiently and accurately locate the bottleneck link on an Internet path. Pathneck is based on a novel probing technique called Recursive Packet Train (RPT) and does not require access to the destination. We evaluate Pathneck using wide area Internet experiments and trace-driven emulation. In addition, we present the results of an extensive study on bottlenecks in the Internet using carefully selected, geographically diverse probing sources and destinations. We found that Pathneck can successfully detect bottlenecks for almost 80% of the Internet paths we probed. We also report our success in using the bottleneck location and bandwidth bounds provided by Pathneck to infer bottlenecks and to avoid bottlenecks in multihoming and overlay routing.

Proceedings ArticleDOI
Mikkel Thorup, Yin Zhang
11 Jan 2004
TL;DR: It is shown that 4-universal hashing can be implemented efficiently using tabulated 4-universal hashing for characters, gaining a factor of 5 in speed over the fastest existing methods.
Abstract: We show that 4-universal hashing can be implemented efficiently using tabulated 4-universal hashing for characters, gaining a factor of 5 in speed over the fastest existing methods. We also consider generalization to k-universal hashing, and as a prime application, we consider the approximation of the second moment of a data stream.
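
The flavor of tabulated hashing is easy to show: split the key into characters and XOR together per-character lookups in precomputed random tables. Note that this plain form (simple tabulation) is only 3-independent; the 4-universal schemes in the paper derive additional characters before the table lookups, and that step is omitted from the sketch below.

```python
import random

random.seed(1)
# One table of random 32-bit values per 8-bit character of a 32-bit key.
TABLES = [[random.getrandbits(32) for _ in range(256)] for _ in range(4)]

def tabulation_hash(key32: int) -> int:
    """Simple tabulation: XOR of per-byte table lookups (3-independent only)."""
    h = 0
    for i in range(4):
        byte = (key32 >> (8 * i)) & 0xFF
        h ^= TABLES[i][byte]
    return h

print(hex(tabulation_hash(0xDEADBEEF)))
```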

Proceedings ArticleDOI
30 Aug 2004
TL;DR: This paper has developed a methodology for reverse engineering a coherent global view of a network's routing design from the static analysis of dumps of the local configuration state of each router.
Abstract: In any IP network, routing protocols provide the intelligence that takes a collection of physical links and transforms them into a network that enables packets to travel from one host to another. Though routing design is arguably the single most important design task for large IP networks, there has been very little systematic investigation into how routing protocols are actually used in production networks to implement the goals of network architects. We have developed a methodology for reverse engineering a coherent global view of a network's routing design from the static analysis of dumps of the local configuration state of each router. Starting with a set of 8,035 configuration files, we have applied this method to 31 production networks. In this paper we present a detailed examination of how routing protocols are used in operational networks. In particular, the results show the conventional model of "interior" and "exterior" gateway protocols is insufficient to describe the diverse set of mechanisms used by architects, and we provide examples of the more unusual designs and examine their trade-offs. We discuss the strengths and weaknesses of our methodology, and argue that it opens paths towards new understandings of network behavior and design.

Proceedings ArticleDOI
13 Jun 2004
TL;DR: This paper proposes a new technique for compressing multiple streams containing historical data from each sensor that exploits correlation and redundancy among multiple measurements on the same sensor and achieves a high degree of data reduction while managing to capture even the smallest details of the recorded measurements.
Abstract: We are inevitably moving into a realm where small and inexpensive wireless devices would be seamlessly embedded in the physical world and form a wireless sensor network in order to perform complex monitoring and computational tasks. Such networks pose new challenges in data processing and dissemination because of the limited resources (processing, bandwidth, energy) that such devices possess. In this paper we propose a new technique for compressing multiple streams containing historical data from each sensor. Our method exploits correlation and redundancy among multiple measurements on the same sensor and achieves a high degree of data reduction while managing to capture even the smallest details of the recorded measurements. The key to our technique is the base signal, a series of values extracted from the real measurements, used for encoding piece-wise linear correlations among the collected data values. We provide efficient algorithms for extracting the base signal features from the data and for encoding the measurements using these features. Our experiments demonstrate that our method by far outperforms standard approximation techniques like Wavelets, Histograms, and the Discrete Cosine Transform, on a variety of error metrics and for real datasets from different domains.
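
The encoding idea can be illustrated with ordinary least squares: approximate a stretch of a measured signal as an affine function of the corresponding stretch of the base signal and store only the two coefficients. The sketch below encodes and decodes one segment this way; how the base signal itself is extracted from the data is the substance of the paper and is not reproduced.

```python
import numpy as np

def encode_segment(segment, base_segment):
    """Fit segment ~ a * base_segment + b by least squares; keep only (a, b)."""
    A = np.column_stack([base_segment, np.ones_like(base_segment)])
    (a, b), *_ = np.linalg.lstsq(A, segment, rcond=None)
    return a, b

def decode_segment(a, b, base_segment):
    return a * base_segment + b

base = np.array([1.0, 2.0, 3.0, 4.0])          # hypothetical base-signal values
measured = np.array([2.1, 4.0, 5.9, 8.2])      # one sensor's segment
a, b = encode_segment(measured, base)
print(a, b, np.abs(decode_segment(a, b, base) - measured).max())
```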

Proceedings ArticleDOI
21 Jul 2004
TL;DR: This paper compares two parameter estimation methods: the perceptron algorithm, and a method based on conditional random fields (CRFs), which have the benefit of automatically selecting a relatively small feature set in just a couple of passes over the training data.
Abstract: This paper describes discriminative language modeling for a large vocabulary speech recognition task. We contrast two parameter estimation methods: the perceptron algorithm, and a method based on conditional random fields (CRFs). The models are encoded as deterministic weighted finite state automata, and are applied by intersecting the automata with word-lattices that are the output from a baseline recognizer. The perceptron algorithm has the benefit of automatically selecting a relatively small feature set in just a couple of passes over the training data. However, using the feature set output from the perceptron algorithm (initialized with their weights), CRF training provides an additional 0.5% reduction in word error rate, for a total 1.8% absolute reduction from the baseline of 39.2%.

Book ChapterDOI
14 Mar 2004
TL;DR: A new algorithm is introduced, based on potential gains, which adaptively redistributes the error thresholds to those nodes that benefit the most and tries to minimize the total number of transmitted messages in the network.
Abstract: Earlier work has demonstrated the effectiveness of in-network data aggregation in order to minimize the amount of messages exchanged during continuous queries in large sensor networks. The key idea is to build an aggregation tree, in which parent nodes aggregate the values received from their children. Nevertheless, for large sensor networks with severe energy constraints the reduction obtained through the aggregation tree might not be sufficient. In this paper we extend prior work on in-network data aggregation to support approximate evaluation of queries to further reduce the number of exchanged messages among the nodes and extend the longevity of the network. A key ingredient to our framework is the notion of the residual mode of operation that is used to eliminate messages from sibling nodes when their cumulative change is small. We introduce a new algorithm, based on potential gains, which adaptively redistributes the error thresholds to those nodes that benefit the most and tries to minimize the total number of transmitted messages in the network. Our experiments demonstrate that our techniques significantly outperform previous approaches and reduce the network traffic by exploiting the super-imposed tree hierarchy.
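
The approximate evaluation rests on per-node error thresholds: a node reports a new reading to its parent only when it has drifted from the last reported value by more than the node's threshold. A minimal sketch of that filter follows; the potential-gain heuristic that redistributes thresholds between nodes is not shown.

```python
class ThresholdFilter:
    """Suppress updates whose change since the last report is within the
    node's error threshold (the budget the paper redistributes adaptively)."""

    def __init__(self, threshold: float):
        self.threshold = threshold
        self.last_reported = None

    def maybe_report(self, reading: float):
        """Return the reading if it must be sent to the parent, else None."""
        if (self.last_reported is None
                or abs(reading - self.last_reported) > self.threshold):
            self.last_reported = reading
            return reading
        return None

node = ThresholdFilter(threshold=0.5)
print([node.maybe_report(v) for v in [20.0, 20.2, 20.9, 21.1]])
# -> [20.0, None, 20.9, None]
```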