
Showing papers from AT&T Labs published in 2004


Proceedings ArticleDOI
04 Jul 2004
TL;DR: This work proposes the use of maximum-entropy techniques for this problem, specifically, sequential-update algorithms that can handle a very large number of features, and investigates the interpretability of models constructed using maxent.
Abstract: We study the problem of modeling species geographic distributions, a critical problem in conservation biology. We propose the use of maximum-entropy techniques for this problem, specifically, sequential-update algorithms that can handle a very large number of features. We describe experiments comparing maxent with a standard distribution-modeling tool, called GARP, on a dataset containing observation data for North American breeding birds. We also study how well maxent performs as a function of the number of training examples and training time, analyze the use of regularization to avoid overfitting when the number of examples is small, and explore the interpretability of models constructed using maxent.
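
The fitted maxent model takes the familiar Gibbs form: the probability assigned to a map cell is proportional to the exponential of a weighted sum of that cell's environmental features. The Python sketch below evaluates such a model on a toy grid; the feature values, weights, and presence records are illustrative placeholders, and the sequential-update fitting procedure from the paper is not shown.

```python
import numpy as np

# Illustrative only: rows are grid cells, columns are environmental features
# (e.g., elevation, annual precipitation), rescaled to [0, 1].
features = np.array([
    [0.2, 0.9],
    [0.5, 0.4],
    [0.8, 0.1],
    [0.3, 0.7],
])
weights = np.array([1.3, -0.4])          # hypothetical fitted feature weights

def gibbs_distribution(features, weights):
    """Maxent / Gibbs form: p(cell) proportional to exp(weights . f(cell))."""
    scores = features @ weights
    scores -= scores.max()               # for numerical stability
    p = np.exp(scores)
    return p / p.sum()

p = gibbs_distribution(features, weights)
presence_cells = [0, 3]                  # hypothetical observation records
log_loss = -np.log(p[presence_cells]).mean()
print(p, log_loss)
```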

1,956 citations


Journal ArticleDOI
Gary M. Weiss
TL;DR: It is demonstrated that rare classes and rare cases are very similar phenomena---both forms of rarity are shown to cause similar problems during data mining and benefit from the same remediation methods.
Abstract: Rare objects are often of great interest and great value. Until recently, however, rarity has not received much attention in the context of data mining. Now, as increasingly complex real-world problems are addressed, rarity, and the related problem of imbalanced data, are taking center stage. This article discusses the role that rare classes and rare cases play in data mining. The problems that can result from these two forms of rarity are described in detail, as are methods for addressing these problems. These descriptions utilize examples from existing research, so that this article provides a good survey of the literature on rarity in data mining. This article also demonstrates that rare classes and rare cases are very similar phenomena---both forms of rarity are shown to cause similar problems during data mining and benefit from the same remediation methods.

1,409 citations


Proceedings ArticleDOI
17 May 2004
TL;DR: In this article, the authors identify the application-level signatures by examining available documentation and packet-level traces, and then utilize the identified signatures to develop online filters that can efficiently and accurately track the P2P traffic even on high-speed network links.
Abstract: The ability to accurately identify the network traffic associated with different P2P applications is important to a broad range of network operations including application-specific traffic engineering, capacity planning, provisioning, service differentiation, etc. However, traditional traffic-to-application mapping techniques, such as disambiguation based on default server TCP or UDP ports, are highly inaccurate for some P2P applications. In this paper, we provide an efficient approach for identifying P2P application traffic through application-level signatures. We first identify the application-level signatures by examining available documentation and packet-level traces. We then utilize the identified signatures to develop online filters that can efficiently and accurately track P2P traffic even on high-speed network links. We examine the performance of our application-level identification approach using five popular P2P protocols. Our measurements show that our technique achieves less than 5% false positive and false negative ratios in most cases. We also show that our approach only requires examining the very first few packets (fewer than 10 packets) to identify a P2P connection, which makes it highly scalable. Our technique can significantly improve P2P traffic volume estimates over what pure network-port-based approaches provide. For instance, we were able to identify 3 times as much traffic for the popular Kazaa P2P protocol, compared to the traditional port-based approach.
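
As a rough illustration of the kind of online filter described above, the sketch below inspects only the first few payload-carrying packets of a connection for known byte patterns. The signature strings and the packet limit are placeholders chosen for the example, not the signatures derived in the paper.

```python
# Sketch of an application-level signature filter: inspect only the first few
# payload-carrying packets of a connection. The byte patterns below are
# illustrative placeholders, not the signatures identified in the paper.
SIGNATURES = {
    b"GNUTELLA CONNECT": "Gnutella",
    b"GET /uri-res/":    "Gnutella",
    b"GIVE ":            "Kazaa/FastTrack (hypothetical marker)",
}
MAX_PACKETS = 10   # the paper reports fewer than 10 packets usually suffice

def classify_connection(payloads):
    """payloads: list of bytes objects, one per packet, in arrival order."""
    for payload in payloads[:MAX_PACKETS]:
        for pattern, proto in SIGNATURES.items():
            if pattern in payload:
                return proto
    return "unknown"

print(classify_connection([b"GNUTELLA CONNECT/0.6\r\n", b"..."]))
```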

856 citations


Journal ArticleDOI
Subhabrata Sen, Jia Wang
TL;DR: The high volume and good stability properties of P2P traffic suggest that the P2P workload is a good candidate for being managed via application-specific layer-3 traffic engineering in an ISP's network.
Abstract: The use of peer-to-peer (P2P) applications is growing dramatically, particularly for sharing large video/audio files and software. In this paper, we analyze P2P traffic by measuring flow-level information collected at multiple border routers across a large ISP network, and report our investigation of three popular P2P systems--FastTrack, Gnutella, and Direct-Connect. We characterize the P2P traffic observed at a single ISP and its impact on the underlying network. We observe a very skewed distribution in the traffic across the network at different levels of spatial aggregation (IP, prefix, AS). All three P2P systems exhibit significant dynamics at short time scales, particularly at the IP address level. Still, the fraction of P2P traffic contributed by each prefix is more stable than the corresponding distribution of either Web traffic or overall traffic. The high volume and good stability properties of P2P traffic suggest that the P2P workload is a good candidate for being managed via application-specific layer-3 traffic engineering in an ISP's network.

691 citations


Proceedings ArticleDOI
25 Oct 2004
TL;DR: It is argued that measurement based automated Class of Service (CoS) mapping is an important practical problem that needs to be studied, and a solution framework for measurement based classification of traffic for QoS based on statistical application signatures is outlined.
Abstract: The ability to provide different Quality of Service (QoS) guarantees to traffic from different applications is a highly desired feature for many IP network operators, particularly for enterprise networks. Although various mechanisms exist for providing QoS in the network, QoS is yet to be widely deployed. We believe that a key factor holding back widespread QoS adoption is the absence of suitable methodologies/processes for appropriately mapping the traffic from different applications to different QoS classes. This is a challenging task, because many enterprise network operators who are interested in QoS do not know all the applications running on their network, and furthermore, over recent years port-based application classification has become problematic. We argue that measurement-based automated Class of Service (CoS) mapping is an important practical problem that needs to be studied. In this paper we describe the requirements and associated challenges, and outline a solution framework for measurement-based classification of traffic for QoS based on statistical application signatures. In our approach the signatures are chosen in such a way as to make them insensitive to the particular application-layer protocol, but rather to determine the way in which an application is used -- for instance, is it used interactively, or for bulk-data transport. The resulting application signature can then be used to derive the network-layer signatures required to determine the CoS class for individual IP datagrams. Our evaluations using traffic traces from a variety of network locations demonstrate the feasibility and potential of the approach.
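
A statistical application signature in this sense describes how a flow behaves (interactive versus bulk transfer) rather than which protocol it speaks. The toy classifier below illustrates the idea with two hypothetical flow features and hand-picked thresholds; the paper's actual feature set and mapping rules are not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class FlowStats:
    mean_packet_size: float      # bytes
    mean_interarrival: float     # seconds between packets
    total_bytes: int

def cos_class(flow: FlowStats) -> str:
    """Toy CoS mapping: thresholds are illustrative, not from the paper."""
    if flow.mean_packet_size < 300 and flow.mean_interarrival < 0.5:
        return "interactive"             # e.g., terminal-session-like usage
    if flow.total_bytes > 1_000_000:
        return "bulk-data"               # e.g., file-transfer-like usage
    return "default"

print(cos_class(FlowStats(mean_packet_size=120.0,
                          mean_interarrival=0.05,
                          total_bytes=40_000)))
```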

552 citations


Book ChapterDOI
01 Jan 2004
TL;DR: Graphviz is a collection of software for viewing and manipulating abstract graphs that provides graph visualization for tools and web sites in domains such as software engineering, networking, databases, knowledge representation, and bioinformatics.
Abstract: Graphviz is a collection of software for viewing and manipulating abstract graphs. It provides graph visualization for tools and web sites in domains such as software engineering, networking, databases, knowledge representation, and bioinformatics. Hundreds of thousands of copies have been distributed under an open source license.

469 citations


Proceedings ArticleDOI
21 Jul 2004
TL;DR: It is demonstrated that training a perceptron model to combine with the generative model during search provides a 2.1 percent F-measure improvement over the generative model alone, to 88.8 percent.
Abstract: This paper describes an incremental parsing approach where parameters are estimated using a variant of the perceptron algorithm. A beam-search algorithm is used during both training and decoding phases of the method. The perceptron approach was implemented with the same feature set as that of an existing generative model (Roark, 2001a), and experimental results show that it gives competitive performance to the generative model on parsing the Penn treebank. We demonstrate that training a perceptron model to combine with the generative model during search provides a 2.1 percent F-measure improvement over the generative model alone, to 88.8 percent.
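
The estimation step is a structured perceptron: decode with the current weights and, whenever the highest-scoring candidate differs from the gold parse, add the gold feature vector and subtract the predicted one. The sketch below shows that update in isolation; the beam-search decoder and the parsing feature set of the paper are abstracted behind a candidate-generation function, so treat it as a schematic rather than the authors' implementation.

```python
from collections import defaultdict

def score(weights, features):
    """Dot product of sparse feature counts with the weight vector."""
    return sum(weights[f] * v for f, v in features.items())

def perceptron_epoch(weights, training_data, candidates):
    """One pass of a structured perceptron.

    training_data: list of (sentence, gold_features) pairs.
    candidates(sentence, weights): stand-in for the beam-search decoder;
    returns a list of (parse, features) pairs for the sentence.
    """
    for sentence, gold_features in training_data:
        _, best_features = max(
            candidates(sentence, weights),
            key=lambda c: score(weights, c[1]),
        )
        if best_features != gold_features:
            for f, v in gold_features.items():
                weights[f] += v            # promote gold-parse features
            for f, v in best_features.items():
                weights[f] -= v            # demote predicted-parse features
    return weights

weights = defaultdict(float)   # feature name -> weight
```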

457 citations


Proceedings ArticleDOI
30 Aug 2004
TL;DR: It is claimed that very simple models that incorporate hard technological constraints on router and link bandwidth and connectivity, together with abstract models of user demand and network performance, can successfully address this challenge and further resolve much of the confusion and controversy that has surrounded topology generation and evaluation.
Abstract: A detailed understanding of the many facets of the Internet's topological structure is critical for evaluating the performance of networking protocols, for assessing the effectiveness of proposed techniques to protect the network from nefarious intrusions and attacks, or for developing improved designs for resource provisioning. Previous studies of topology have focused on interpreting measurements or on phenomenological descriptions and evaluation of graph-theoretic properties of topology generators. We propose a complementary approach of combining a more subtle use of statistics and graph theory with a first-principles theory of router-level topology that reflects practical constraints and tradeoffs. While there is an inevitable tradeoff between model complexity and fidelity, a challenge is to distill from the seemingly endless list of potentially relevant technological and economic issues the features that are most essential to a solid understanding of the intrinsic fundamentals of network topology. We claim that very simple models that incorporate hard technological constraints on router and link bandwidth and connectivity, together with abstract models of user demand and network performance, can successfully address this challenge and further resolve much of the confusion and controversy that has surrounded topology generation and evaluation.

422 citations


Book ChapterDOI
29 Sep 2004
TL;DR: This work shows how to draw graphs by stress majorization, adapting a technique known in the MDS community for more than two decades; majorization appears to have advantages over the technique of Kamada and Kawai in running time and stability.
Abstract: One of the most popular graph drawing methods is based on achieving graph-theoretic target distances. This method was used by Kamada and Kawai [15], who formulated it as an energy optimization problem. Their energy is known in the multidimensional scaling (MDS) community as the stress function. In this work, we show how to draw graphs by stress majorization, adapting a technique known in the MDS community for more than two decades. It appears that majorization has advantages over the technique of Kamada and Kawai in running time and stability. We also found the majorization-based optimization to be essential for a few extensions of the basic energy model. These extensions can improve layout quality and computation speed in practice.
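
For reference, stress majorization iterates a Guttman transform that is guaranteed not to increase the stress. The numpy sketch below implements the unweighted (unit-weight) variant for brevity; graph-drawing stress in the Kamada-Kawai style normally weights each pair by the inverse square of its graph-theoretic distance, and that weighting is omitted here.

```python
import numpy as np

def stress(X, D):
    """Stress of layout X (n x 2) w.r.t. target distances D (n x n), unit weights."""
    diff = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    iu = np.triu_indices(len(X), k=1)
    return np.sum((diff[iu] - D[iu]) ** 2)

def majorize(D, dim=2, iters=100, seed=0):
    """SMACOF-style stress majorization with unit weights (a simplified sketch)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    X = rng.standard_normal((n, dim))
    for _ in range(iters):
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(dist, 1.0)            # avoid division by zero
        B = -D / dist
        np.fill_diagonal(B, 0.0)
        np.fill_diagonal(B, -B.sum(axis=1))    # make row sums zero
        X = B @ X / n                          # Guttman transform (unit weights)
    return X
```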

402 citations


Journal ArticleDOI
S.S. Ghassemzadeh, Rittwik Jana, Christopher W. Rice, W. Turin, Vahid Tarokh
TL;DR: A path loss model as well as a second-order autoregressive model is proposed for frequency response generation of the UWB indoor channel and results of frequency-domain channel sounding in residential environments are described.
Abstract: This paper describes the results of frequency-domain channel sounding in residential environments. It consists of detailed characterization of complex frequency responses of ultra-wideband (UWB) signals having a nominal center frequency of 5 GHz. A path loss model as well as a second-order autoregressive model is proposed for frequency response generation of the UWB indoor channel. Probability distributions of the model parameters for different locations are presented. Also, time-domain results such as root mean square delay spread and percent of captured power are presented.
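
To make the modeling idea concrete, a second-order autoregressive generator produces a complex frequency response by recursing across frequency bins. The coefficients and noise level in the sketch below are arbitrary stable placeholders, not the measured parameter distributions reported in the paper.

```python
import numpy as np

def ar2_frequency_response(a1, a2, n_bins, noise_std=1.0, seed=0):
    """Generate a complex channel frequency response H[k] from an AR(2)
    recursion across frequency bins: H[k] = a1*H[k-1] + a2*H[k-2] + n[k].
    The coefficients and noise level are placeholders, not the paper's
    measured parameters."""
    rng = np.random.default_rng(seed)
    noise = (rng.standard_normal(n_bins) + 1j * rng.standard_normal(n_bins))
    noise *= noise_std / np.sqrt(2)
    H = np.zeros(n_bins, dtype=complex)
    for k in range(n_bins):
        H[k] = noise[k]
        if k >= 1:
            H[k] += a1 * H[k - 1]
        if k >= 2:
            H[k] += a2 * H[k - 2]
    return H

H = ar2_frequency_response(a1=0.6, a2=0.2, n_bins=256)   # stable toy values
```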

336 citations


Journal ArticleDOI
Mikkel Thorup
TL;DR: It is shown that a planar digraph can be preprocessed in near-linear time, producing a near-linear space oracle that can answer reachability queries in constant time.
Abstract: It is shown that a planar digraph can be preprocessed in near-linear time, producing a near-linear space oracle that can answer reachability queries in constant time. The oracle can be distributed as an O(log n) space label for each vertex and then we can determine if one vertex can reach another considering their two labels only. The approach generalizes to give a near-linear space approximate distances oracle for a weighted planar digraph. With weights drawn from {0, …, N}, it approximates distances within a factor (1 + ε) in O(log log(nN) + 1/ε) time. Our scheme can be extended to find and route along correspondingly short dipaths.

Book ChapterDOI
19 Feb 2004
TL;DR: Starting with the seminal paper of Impagliazzo and Rudich [17], there has been a large body of work showing that various cryptographic primitives cannot be reduced to each other via “black-box” reductions.
Abstract: Starting with the seminal paper of Impagliazzo and Rudich [17], there has been a large body of work showing that various cryptographic primitives cannot be reduced to each other via “black-box” reductions. The common interpretation of these results is that there are inherent limitations in using a primitive as a black box, and that these impossibility results can be overcome only by explicitly using the code of the primitive in the construction.

Proceedings ArticleDOI
30 Aug 2004
TL;DR: This work presents a design overview of RCP based on three architectural principles: path computation based on a consistent view of network state, controlled interactions between routing protocol layers, and expressive specification of routing policies, and discusses the architectural strengths and weaknesses of the proposal.
Abstract: Over the past decade, the complexity of the Internet's routing infrastructure has increased dramatically. This complexity and the problems it causes stem not just from various new demands made of the routing infrastructure, but also from fundamental limitations in the ability of today's distributed infrastructure to scalably cope with new requirements. The limitations in today's routing system arise in large part from the fully distributed path-selection computation that the IP routers in an autonomous system (AS) must perform. To overcome this weakness, interdomain routing should be separated from today's IP routers, which should simply forward packets (for the most part). Instead, a separate Routing Control Platform (RCP) should select routes on behalf of the IP routers in each AS and exchange reachability information with other domains. Our position is that an approach like RCP is a good way of coping with complexity while being responsive to new demands, and can lead to a routing system that is substantially easier to manage than today's. We present a design overview of RCP based on three architectural principles: path computation based on a consistent view of network state, controlled interactions between routing protocol layers, and expressive specification of routing policies, and we discuss the architectural strengths and weaknesses of our proposal.

Proceedings ArticleDOI
01 Jul 2004
TL;DR: A negative binomial regression model using information from previous releases has been developed and used to predict the numbers of faults for a large industrial inventory system, and was extremely accurate.
Abstract: The ability to predict which files in a large software system are most likely to contain the largest numbers of faults in the next release can be a very valuable asset. To accomplish this, a negative binomial regression model using information from previous releases has been developed and used to predict the numbers of faults for a large industrial inventory system. The files of each release were sorted in descending order based on the predicted number of faults and then the first 20% of the files were selected. This was done for each of fifteen consecutive releases, representing more than four years of field usage. The predictions were extremely accurate, correctly selecting files that contained between 71% and 92% of the faults, with the overall average being 83%. In addition, the same model was used on data for the same system's releases, but with all fault data prior to integration testing removed. The prediction was again very accurate, ranging from 71% to 93%, with the average being 84%. Predictions were made for a second system, and again the first 20% of files accounted for 83% of the identified faults. Finally, a highly simplified predictor was considered which correctly predicted 73% and 74% of the faults for the two systems.
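
Fitting such a model is routine with standard statistical tooling. The sketch below fits a negative binomial regression with statsmodels on synthetic data and then ranks files by predicted fault count, selecting the top 20%; the predictor names and data are invented for the example and do not reflect the paper's actual covariates.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical per-file predictors; the paper's own covariates differ.
rng = np.random.default_rng(0)
n_files = 200
kloc         = rng.gamma(2.0, 2.0, n_files)     # thousands of lines of code
prior_faults = rng.poisson(3.0, n_files)        # faults in previous release
is_new       = rng.integers(0, 2, n_files)      # newly added file?
faults       = rng.poisson(0.5 + 0.8 * kloc + 0.4 * prior_faults)  # synthetic

X = sm.add_constant(np.column_stack([kloc, prior_faults, is_new]))
model = sm.GLM(faults, X, family=sm.families.NegativeBinomial(alpha=1.0))
result = model.fit()

# Rank files by predicted fault count and select the top 20%.
predicted = result.predict(X)
top20 = np.argsort(predicted)[::-1][: n_files // 5]
print(f"top 20% of files predicted to hold "
      f"{faults[top20].sum() / faults.sum():.0%} of the faults")
```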

Proceedings ArticleDOI
01 Jun 2004
TL;DR: A novel data streaming algorithm to provide much more accurate estimates of flow distribution, using a "lossy data structure" which consists of an array of counters fitted well into SRAM, which not only dramatically improves the accuracy of flow distribution measurement, but also contributes to the field of data streaming.
Abstract: Knowing the distribution of the sizes of traffic flows passing through a network link helps a network operator to characterize network resource usage, infer traffic demands, detect traffic anomalies, and accommodate new traffic demands through better traffic engineering. Previous work on estimating the flow size distribution has been focused on making inferences from sampled network traffic. Its accuracy is limited by the (typically) low sampling rate required to make the sampling operation affordable. In this paper we present a novel data streaming algorithm to provide much more accurate estimates of flow distribution, using a "lossy data structure" which consists of an array of counters fitted well into SRAM. For each incoming packet, our algorithm only needs to increment one underlying counter, making the algorithm fast enough even for 40 Gbps (OC-768) links. The data structure is lossy in the sense that sizes of multiple flows may collide into the same counter. Our algorithm uses Bayesian statistical methods such as Expectation Maximization to infer the most likely flow size distribution that results in the observed counter values after collision. Evaluations of this algorithm on large Internet traces obtained from several sources (including a tier-1 ISP) demonstrate that it has very high measurement accuracy (within 2%). Our algorithm not only dramatically improves the accuracy of flow distribution measurement, but also contributes to the field of data streaming by formalizing an existing methodology and applying it to the context of estimating the flow-distribution.
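
The per-packet work is the point of the design: hash the flow identifier into a fixed array of counters and increment one counter, accepting that several flows may collide in the same slot. A minimal sketch of that update path follows; the Bayesian/EM inference that recovers the flow size distribution from the counter values is not shown, and the hash function here is just a portable stand-in.

```python
import hashlib

class LossyCounterArray:
    """Array-of-counters sketch: one increment per packet, collisions allowed."""

    def __init__(self, size: int = 1 << 20):
        self.counters = [0] * size
        self.size = size

    def _index(self, flow_id: bytes) -> int:
        # Any fast hash works in practice; hashlib keeps the sketch portable.
        digest = hashlib.blake2b(flow_id, digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.size

    def add_packet(self, flow_id: bytes) -> None:
        self.counters[self._index(flow_id)] += 1

counters = LossyCounterArray(size=1 << 16)
counters.add_packet(b"10.0.0.1:1234->10.0.0.2:80/TCP")
```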

Journal ArticleDOI
Andreas F. Molisch
TL;DR: A geometry-based model is proposed that includes the propagation effects that are critical for MIMO performance: i) single scattering around the BS and MS, ii) scattering by far clusters, iii) double-scattering, iv) waveguiding, and v) diffraction by roof edges.
Abstract: This paper derives a generic model for the multiple-input multiple-output (MIMO) wireless channel. The model incorporates important effects, including i) interdependency of directions-of-arrival and directions-of-departure, ii) large delay and angle dispersion by propagation via far clusters, and iii) rank reduction of the transfer function matrix. We propose a geometry-based model that includes the propagation effects that are critical for MIMO performance: i) single scattering around the BS and MS, ii) scattering by far clusters, iii) double-scattering, iv) waveguiding, and v) diffraction by roof edges. The required parameters for the complete definition of the model are enumerated, and typical parameter values in macro and microcellular environments are discussed.

Journal ArticleDOI
TL;DR: The problem of optimizing OSPF weights for a given set of projected demands so as to avoid congestion is shown to be NP-hard, even for approximation, and a local search heuristic is proposed to solve it.
Abstract: Open Shortest Path First (OSPF) is one of the most commonly used intra-domain Internet routing protocols. Traffic flow is routed along shortest paths, splitting flow evenly at nodes where several outgoing links are on shortest paths to the destination. The weights of the links, and thereby the shortest path routes, can be changed by the network operator. The weights could be set proportional to the physical lengths of the links, but often the main goal is to avoid congestion, i.e. overloading of links, and the standard heuristic recommended by Cisco (a major router vendor) is to make the weight of a link inversely proportional to its capacity. We study the problem of optimizing OSPF weights for a given set of projected demands so as to avoid congestion. We show that this problem is NP-hard, even for approximation, and propose a local search heuristic to solve it. We also provide worst-case results about the performance of OSPF routing vs. an optimal multi-commodity flow routing. Our numerical experiments compare the results obtained with our local search heuristic to the optimal multi-commodity flow routing, as well as to simple and commonly used heuristics for setting the weights. Experiments were done with a proposed next-generation AT&T WorldNet backbone as well as synthetic internetworks.
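
A drastically simplified version of the weight search is easy to state: perturb one link weight at a time and keep the change if the maximum link utilization drops. The networkx sketch below does exactly that, but routes each demand on a single shortest path; real OSPF splits traffic evenly over all equal-cost shortest paths, and the paper's heuristic is considerably more sophisticated. It assumes a directed graph whose edges carry "weight" and "capacity" attributes.

```python
import networkx as nx

def max_utilization(G, demands):
    """Route each demand on one shortest path under the current weights and
    return the maximum link utilization (even splitting over equal-cost
    paths, as in real OSPF, is omitted for brevity)."""
    load = {e: 0.0 for e in G.edges}
    for (s, t), volume in demands.items():
        path = nx.shortest_path(G, s, t, weight="weight")
        for u, v in zip(path, path[1:]):
            load[(u, v)] += volume
    return max(load[e] / G.edges[e]["capacity"] for e in G.edges)

def local_search(G, demands, weight_choices=range(1, 21), passes=3):
    """Naive single-weight-change local search over integer link weights."""
    best = max_utilization(G, demands)
    for _ in range(passes):
        for e in G.edges:
            current = G.edges[e]["weight"]
            for w in weight_choices:
                G.edges[e]["weight"] = w
                u = max_utilization(G, demands)
                if u < best:
                    best, current = u, w
            G.edges[e]["weight"] = current      # keep the best weight found
    return best
```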

Journal ArticleDOI
TL;DR: A new approach for obtaining machine cells and product families is presented that combines a local search heuristic with a genetic algorithm and produces solutions with a grouping efficacy at least as good as any results previously reported in the literature.

Book ChapterDOI
01 Jul 2004
TL;DR: Non-asymptotic bounds are proved showing that, with respect to the true underlying distribution, this relaxed version of maxent produces density estimates that are almost as good as the best possible.
Abstract: We consider the problem of estimating an unknown probability distribution from samples using the principle of maximum entropy (maxent). To alleviate overfitting with a very large number of features, we propose applying the maxent principle with relaxed constraints on the expectations of the features. By convex duality, this turns out to be equivalent to finding the Gibbs distribution minimizing a regularized version of the empirical log loss. We prove non-asymptotic bounds showing that, with respect to the true underlying distribution, this relaxed version of maxent produces density estimates that are almost as good as the best possible. These bounds are in terms of the deviation of the feature empirical averages relative to their true expectations, a number that can be bounded using standard uniform-convergence techniques. In particular, this leads to bounds that drop quickly with the number of samples, and that depend very moderately on the number or complexity of the features. We also derive and prove convergence for both sequential-update and parallel-update algorithms. Finally, we briefly describe experiments on data relevant to the modeling of species geographical distributions.
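
In symbols, the relaxation and its dual take the following form, writing the empirical average of feature f_j with a hat and beta_j for the allowed deviation; the notation is chosen here for exposition rather than copied from the paper.

```latex
% Relaxed maxent primal: maximize entropy subject to box constraints
% on the feature expectations.
\max_{q \in \Delta} \; H(q)
\quad \text{s.t.} \quad
\bigl| \mathbb{E}_{q}[f_j] - \hat{\mathbb{E}}[f_j] \bigr| \le \beta_j
\quad \text{for all } j.

% By convex duality this is equivalent to minimizing an l1-regularized
% empirical log loss over Gibbs distributions
% q_\lambda(x) \propto \exp(\lambda \cdot f(x)):
\min_{\lambda} \;
  -\frac{1}{m} \sum_{i=1}^{m} \log q_{\lambda}(x_i)
  \; + \; \sum_{j} \beta_j \, |\lambda_j| .
```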

Proceedings ArticleDOI
01 Jun 2004
TL;DR: It is shown that hot-potato routing changes lead to longer delays in forwarding-plane convergence, shifts in the flow of traffic to neighboring domains, extra externally-visible BGP update messages, and inaccuracies in Internet performance measurements.
Abstract: Despite the architectural separation between intradomain and interdomain routing in the Internet, intradomain protocols do influence the path-selection process in the Border Gateway Protocol (BGP). When choosing between multiple equally-good BGP routes, a router selects the one with the closest egress point, based on the intradomain path cost. Under such hot-potato routing, an intradomain event can trigger BGP routing changes. To characterize the influence of hot-potato routing, we conduct controlled experiments with a commercial router. Then, we propose a technique for associating BGP routing changes with events visible in the intradomain protocol, and apply our algorithm to AT&T's backbone network. We show that (i) hot-potato routing can be a significant source of BGP updates, (ii) BGP updates can lag 60 seconds or more behind the intradomain event, (iii) the number of BGP path changes triggered by hot-potato routing has a nearly uniform distribution across destination prefixes, and (iv) the fraction of BGP messages triggered by intradomain changes varies significantly across time and router locations. We show that hot-potato routing changes lead to longer delays in forwarding-plane convergence, shifts in the flow of traffic to neighboring domains, extra externally-visible BGP update messages, and inaccuracies in Internet performance measurements.
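
The hot-potato step itself is just a tie-break: among BGP routes that are equally good under the earlier decision criteria, choose the one whose egress point is closest in IGP cost. A minimal sketch with made-up route records:

```python
# Each candidate route is assumed to carry the egress router it exits through;
# igp_cost maps egress routers to the local router's IGP path cost to them.
def hot_potato_choice(equally_good_routes, igp_cost):
    """Pick the route with the closest egress point (lowest IGP cost)."""
    return min(equally_good_routes, key=lambda r: igp_cost[r["egress"]])

routes = [{"prefix": "192.0.2.0/24", "egress": "nyc"},
          {"prefix": "192.0.2.0/24", "egress": "chi"}]
igp_cost = {"nyc": 120, "chi": 80}           # an intradomain event can change these
print(hot_potato_choice(routes, igp_cost))   # -> the route via "chi"
```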

Proceedings ArticleDOI
25 Oct 2004
TL;DR: It is shown that the problem of HHH detection can be transformed to one of dynamic packet classification by taking a top-down approach and adaptively creating new rules to match HHHs; the resulting algorithms have much lower worst-case update costs than existing algorithms and can provide tunable deterministic accuracy guarantees.
Abstract: In traffic monitoring, accounting, and network anomaly detection, it is often important to be able to detect high-volume traffic clusters in near real-time. Such heavy-hitter traffic clusters are often hierarchical (i.e., they may occur at different aggregation levels like ranges of IP addresses) and possibly multidimensional (i.e., they may involve the combination of different IP header fields like IP addresses, port numbers, and protocol). Without prior knowledge about the precise structures of such traffic clusters, a naive approach would require the monitoring system to examine all possible combinations of aggregates in order to detect the heavy hitters, which can be prohibitive in terms of computation resources. In this paper, we focus on online identification of 1-dimensional and 2-dimensional hierarchical heavy hitters (HHHs), arguably the two most important scenarios in traffic analysis. We show that the problem of HHH detection can be transformed to one of dynamic packet classification by taking a top-down approach and adaptively creating new rules to match HHHs. We then adapt several existing static packet classification algorithms to support dynamic packet classification. The resulting HHH detection algorithms have much lower worst-case update costs than existing algorithms and can provide tunable deterministic accuracy guarantees. As an application of these algorithms, we also propose robust techniques to detect changes among heavy-hitter traffic clusters. Our techniques can accommodate variability due to sampling that is increasingly used in network measurement. Evaluation based on real Internet traces collected at a Tier-1 ISP suggests that these techniques are remarkably accurate and efficient.
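
For intuition, a prefix is a hierarchical heavy hitter when its traffic, after discounting traffic already attributed to reported descendant prefixes, still exceeds a threshold fraction of the total. The sketch below computes 1-dimensional HHHs in exactly this naive offline way, which is the expensive baseline the paper's online algorithms are designed to avoid.

```python
from collections import defaultdict
import ipaddress

def hhh_1d(byte_counts, phi=0.05, levels=(32, 24, 16, 8, 0)):
    """Naive offline 1-dimensional HHH over IPv4 addresses (illustrative only).

    byte_counts: dict of ip-address string -> byte count. A prefix is reported
    when the traffic of its not-yet-claimed addresses exceeds phi * total.
    """
    total = sum(byte_counts.values())
    threshold = phi * total
    remaining = dict(byte_counts)          # addresses not yet claimed by an HHH
    result = []
    for bits in levels:                    # most specific level first
        buckets = defaultdict(list)
        for ip in remaining:
            net = ipaddress.ip_network(f"{ip}/{bits}", strict=False)
            buckets[net].append(ip)
        for net, ips in buckets.items():
            volume = sum(remaining[ip] for ip in ips)
            if volume > threshold:
                result.append((str(net), volume))
                for ip in ips:             # claim these addresses
                    del remaining[ip]
    return result

counts = {"10.1.1.1": 900, "10.1.1.2": 50, "10.1.2.3": 80, "192.0.2.9": 30}
print(hhh_1d(counts, phi=0.4))             # -> [('10.1.1.1/32', 900)]
```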

Proceedings ArticleDOI
13 Jun 2004
TL;DR: This paper provides an elegant definition of relaxation on structure and defines primitive operators to span the space of relaxations for ranking schemes and proposes natural ranking schemes that adhere to these principles.
Abstract: Querying XML data is a well-explored topic with powerful database-style query languages such as XPath and XQuery set to become W3C standards. An equally compelling paradigm for querying XML documents is full-text search on textual content. In this paper, we study fundamental challenges that arise when we try to integrate these two querying paradigms. While keyword search is based on approximate matching, XPath has exact match semantics. We address this mismatch by considering queries on structure as a "template", and looking for answers that best match this template and the full-text search. To achieve this, we provide an elegant definition of relaxation on structure and define primitive operators to span the space of relaxations. Query answering is now based on ranking potential answers on structural and full-text search conditions. We set out certain desirable principles for ranking schemes and propose natural ranking schemes that adhere to these principles. We develop efficient algorithms for answering top-K queries and discuss results from a comprehensive set of experiments that demonstrate the utility and scalability of the proposed framework and algorithms.

Proceedings ArticleDOI
30 Aug 2004
TL;DR: This paper designs a series of novel smart routing algorithms to optimize cost and performance for multihomed users and suggests that these algorithms are very effective in minimizing cost and at the same time improving performance.
Abstract: Multihoming is often used by large enterprises and stub ISPs to connect to the Internet. In this paper, we design a series of novel smart routing algorithms to optimize cost and performance for multihomed users. We evaluate our algorithms through both analysis and extensive simulations based on realistic charging models, traffic demands, performance data, and network topologies. Our results suggest that these algorithms are very effective in minimizing cost and at the same time improving performance. We further examine the equilibrium performance of smart routing in a global setting and show that a smart routing user can improve its performance without adversely affecting other users.

Journal ArticleDOI
TL;DR: The findings point out the need for continuously questioning the applicability and completeness of data sets at hand when establishing the generality of any particular Internet-specific observation and for assessing its (in)sensitivity to deficiencies in the measurements.

Proceedings ArticleDOI
30 Aug 2004
TL;DR: This paper presents Pathneck, a tool that allows end users to efficiently and accurately locate the bottleneck link on an Internet path based on a novel probing technique called Recursive Packet Train (RPT) and does not require access to the destination.
Abstract: The ability to locate network bottlenecks along end-to-end paths on the Internet is of great interest to both network operators and researchers. For example, knowing where bottleneck links are, network operators can apply traffic engineering either at the interdomain or intradomain level to improve routing. Existing tools either fail to identify the location of bottlenecks, or generate a large amount of probing packets. In addition, they often require access to both end points. In this paper we present Pathneck, a tool that allows end users to efficiently and accurately locate the bottleneck link on an Internet path. Pathneck is based on a novel probing technique called Recursive Packet Train (RPT) and does not require access to the destination. We evaluate Pathneck using wide area Internet experiments and trace-driven emulation. In addition, we present the results of an extensive study on bottlenecks in the Internet using carefully selected, geographically diverse probing sources and destinations. We found that Pathneck can successfully detect bottlenecks for almost 80% of the Internet paths we probed. We also report our success in using the bottleneck location and bandwidth bounds provided by Pathneck to infer bottlenecks and to avoid bottlenecks in multihoming and overlay routing.

Proceedings ArticleDOI
Mikkel Thorup, Yin Zhang
11 Jan 2004
TL;DR: It is shown that 4-universal hashing can be implemented efficiently using tabulated 4-universal hashing for characters, gaining a factor of 5 in speed over the fastest existing methods.
Abstract: We show that 4-universal hashing can be implemented efficiently using tabulated 4-universal hashing for characters, gaining a factor of 5 in speed over the fastest existing methods. We also consider generalization to k-universal hashing, and as a prime application, we consider the approximation of the second moment of a data stream.
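
The flavor of tabulated hashing is easy to show: split the key into characters and XOR together per-character lookups in precomputed random tables. Note that this plain form (simple tabulation) is only 3-independent; the 4-universal schemes in the paper derive additional characters before the table lookups, and that step is omitted from the sketch below.

```python
import random

random.seed(1)
# One table of random 32-bit values per 8-bit character of a 32-bit key.
TABLES = [[random.getrandbits(32) for _ in range(256)] for _ in range(4)]

def tabulation_hash(key32: int) -> int:
    """Simple tabulation: XOR of per-byte table lookups (3-independent only)."""
    h = 0
    for i in range(4):
        byte = (key32 >> (8 * i)) & 0xFF
        h ^= TABLES[i][byte]
    return h

print(hex(tabulation_hash(0xDEADBEEF)))
```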

Proceedings ArticleDOI
30 Aug 2004
TL;DR: This paper has developed a methodology for reverse engineering a coherent global view of a network's routing design from the static analysis of dumps of the local configuration state of each router.
Abstract: In any IP network, routing protocols provide the intelligence that takes a collection of physical links and transforms them into a network that enables packets to travel from one host to another. Though routing design is arguably the single most important design task for large IP networks, there has been very little systematic investigation into how routing protocols are actually used in production networks to implement the goals of network architects. We have developed a methodology for reverse engineering a coherent global view of a network's routing design from the static analysis of dumps of the local configuration state of each router. Starting with a set of 8,035 configuration files, we have applied this method to 31 production networks. In this paper we present a detailed examination of how routing protocols are used in operational networks. In particular, the results show the conventional model of "interior" and "exterior" gateway protocols is insufficient to describe the diverse set of mechanisms used by architects, and we provide examples of the more unusual designs and examine their trade-offs. We discuss the strengths and weaknesses of our methodology, and argue that it opens paths towards new understandings of network behavior and design.

Proceedings ArticleDOI
13 Jun 2004
TL;DR: This paper proposes a new technique for compressing multiple streams containing historical data from each sensor that exploits correlation and redundancy among multiple measurements on the same sensor and achieves a high degree of data reduction while managing to capture even the smallest details of the recorded measurements.
Abstract: We are inevitably moving into a realm where small and inexpensive wireless devices would be seamlessly embedded in the physical world and form a wireless sensor network in order to perform complex monitoring and computational tasks. Such networks pose new challenges in data processing and dissemination because of the limited resources (processing, bandwidth, energy) that such devices possess. In this paper we propose a new technique for compressing multiple streams containing historical data from each sensor. Our method exploits correlation and redundancy among multiple measurements on the same sensor and achieves a high degree of data reduction while managing to capture even the smallest details of the recorded measurements. The key to our technique is the base signal, a series of values extracted from the real measurements, used for encoding piece-wise linear correlations among the collected data values. We provide efficient algorithms for extracting the base signal features from the data and for encoding the measurements using these features. Our experiments demonstrate that our method by far outperforms standard approximation techniques like Wavelets, Histograms, and the Discrete Cosine Transform, on a variety of error metrics and for real datasets from different domains.
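
The encoding idea can be illustrated with ordinary least squares: approximate a stretch of a measured signal as an affine function of the corresponding stretch of the base signal and store only the two coefficients. The sketch below encodes and decodes one segment this way; how the base signal itself is extracted from the data is the substance of the paper and is not reproduced.

```python
import numpy as np

def encode_segment(segment, base_segment):
    """Fit segment ~ a * base_segment + b by least squares; keep only (a, b)."""
    A = np.column_stack([base_segment, np.ones_like(base_segment)])
    (a, b), *_ = np.linalg.lstsq(A, segment, rcond=None)
    return a, b

def decode_segment(a, b, base_segment):
    return a * base_segment + b

base = np.array([1.0, 2.0, 3.0, 4.0])          # hypothetical base-signal values
measured = np.array([2.1, 4.0, 5.9, 8.2])      # one sensor's segment
a, b = encode_segment(measured, base)
print(a, b, np.abs(decode_segment(a, b, base) - measured).max())
```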

Proceedings ArticleDOI
21 Jul 2004
TL;DR: This paper compares two parameter estimation methods: the perceptron algorithm, and a method based on conditional random fields (CRFs), which have the benefit of automatically selecting a relatively small feature set in just a couple of passes over the training data.
Abstract: This paper describes discriminative language modeling for a large vocabulary speech recognition task. We contrast two parameter estimation methods: the perceptron algorithm, and a method based on conditional random fields (CRFs). The models are encoded as deterministic weighted finite state automata, and are applied by intersecting the automata with word-lattices that are the output from a baseline recognizer. The perceptron algorithm has the benefit of automatically selecting a relatively small feature set in just a couple of passes over the training data. However, using the feature set output from the perceptron algorithm (initialized with their weights), CRF training provides an additional 0.5% reduction in word error rate, for a total 1.8% absolute reduction from the baseline of 39.2%.

Book ChapterDOI
14 Mar 2004
TL;DR: A new algorithm is introduced, based on potential gains, which adaptively redistributes the error thresholds to those nodes that benefit the most and tries to minimize the total number of transmitted messages in the network.
Abstract: Earlier work has demonstrated the effectiveness of in-network data aggregation in order to minimize the amount of messages exchanged during continuous queries in large sensor networks. The key idea is to build an aggregation tree, in which parent nodes aggregate the values received from their children. Nevertheless, for large sensor networks with severe energy constraints the reduction obtained through the aggregation tree might not be sufficient. In this paper we extend prior work on in-network data aggregation to support approximate evaluation of queries to further reduce the number of exchanged messages among the nodes and extend the longevity of the network. A key ingredient to our framework is the notion of the residual mode of operation that is used to eliminate messages from sibling nodes when their cumulative change is small. We introduce a new algorithm, based on potential gains, which adaptively redistributes the error thresholds to those nodes that benefit the most and tries to minimize the total number of transmitted messages in the network. Our experiments demonstrate that our techniques significantly outperform previous approaches and reduce the network traffic by exploiting the super-imposed tree hierarchy.
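
The approximate evaluation rests on per-node error thresholds: a node reports a new reading to its parent only when it has drifted from the last reported value by more than the node's threshold. A minimal sketch of that filter follows; the potential-gain heuristic that redistributes thresholds between nodes is not shown.

```python
class ThresholdFilter:
    """Suppress updates whose change since the last report is within the
    node's error threshold (the budget the paper redistributes adaptively)."""

    def __init__(self, threshold: float):
        self.threshold = threshold
        self.last_reported = None

    def maybe_report(self, reading: float):
        """Return the reading if it must be sent to the parent, else None."""
        if (self.last_reported is None
                or abs(reading - self.last_reported) > self.threshold):
            self.last_reported = reading
            return reading
        return None

node = ThresholdFilter(threshold=0.5)
print([node.maybe_report(v) for v in [20.0, 20.2, 20.9, 21.1]])
# -> [20.0, None, 20.9, None]
```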