
Showing papers by AT&T Labs, published in 2005


Journal ArticleDOI
TL;DR: A negative binomial regression model has been developed and used to predict the expected number of faults in each file of the next release of a system, based on the code of the file in the current release and on the fault and modification history of the file from previous releases.
Abstract: Advance knowledge of which files in the next release of a large software system are most likely to contain the largest numbers of faults can be a very valuable asset. To accomplish this, a negative binomial regression model has been developed and used to predict the expected number of faults in each file of the next release of a system. The predictions are based on the code of the file in the current release, and fault and modification history of the file from previous releases. The model has been applied to two large industrial systems, one with a history of 17 consecutive quarterly releases over 4 years, and the other with nine releases over 2 years. The predictions were quite accurate: for each release of the two systems, the 20 percent of the files with the highest predicted number of faults contained between 71 percent and 92 percent of the faults that were actually detected, with the overall average being 83 percent. The same model was also used to predict which files of the first system were likely to have the highest fault densities (faults per KLOC). In this case, the 20 percent of the files with the highest predicted fault densities contained an average of 62 percent of the system's detected faults. However, the identified files contained a much smaller percentage of the code mass than the files selected to maximize the numbers of faults. The model was also used to make predictions from a much smaller input set that only contained fault data from integration testing and later. The prediction was again very accurate, identifying files that contained from 71 percent to 93 percent of the faults, with the average being 84 percent. Finally, a highly simplified version of the predictor selected files containing, on average, 73 percent and 74 percent of the faults for the two systems.
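
To make the modeling step concrete, here is a minimal sketch (not the authors' code) of fitting a negative binomial regression to per-file fault counts with Python's statsmodels; the predictors, column names, and synthetic data are all hypothetical stand-ins for the file metrics described above.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
files = pd.DataFrame({
    "kloc": rng.gamma(2.0, 2.0, 500),       # file size in KLOC (hypothetical)
    "prior_faults": rng.poisson(1.5, 500),  # faults observed in prior releases
    "changes": rng.poisson(3.0, 500),       # recent modifications to the file
})
# Synthetic, overdispersed fault counts so the example is self-contained.
mu = np.exp(-1.0 + 0.3 * np.log1p(files["kloc"])
            + 0.25 * files["prior_faults"] + 0.1 * files["changes"])
files["faults"] = rng.negative_binomial(2, 2.0 / (2.0 + np.asarray(mu)))

X = sm.add_constant(files[["kloc", "prior_faults", "changes"]])
fit = sm.GLM(files["faults"], X,
             family=sm.families.NegativeBinomial(alpha=0.5)).fit()

# Rank files by predicted fault count; the paper's headline statistic is the
# share of actual faults falling in the top 20 percent of files.
files["pred"] = fit.predict(X)
top20 = files.nlargest(len(files) // 5, "pred")
print("faults captured by top 20%:", top20["faults"].sum() / files["faults"].sum())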

704 citations


Journal ArticleDOI
TL;DR: The most impressive feature of the data structure is its constant query time, hence the name "oracle"; it also provides faster constructions of sparse spanners of weighted graphs, and improved tree covers and distance labelings of weighted or unweighted graphs.
Abstract: Let G = (V,E) be an undirected weighted graph with |V| = n and |E| = m. Let k ≥ 1 be an integer. We show that G = (V,E) can be preprocessed in O(kmn^{1/k}) expected time, constructing a data structure of size O(kn^{1+1/k}), such that any subsequent distance query can be answered, approximately, in O(k) time. The approximate distance returned is of stretch at most 2k−1, that is, the quotient obtained by dividing the estimated distance by the actual distance lies between 1 and 2k−1. A 1963 girth conjecture of Erdős implies that Ω(n^{1+1/k}) space is needed in the worst case for any real stretch strictly smaller than 2k+1. The space requirement of our algorithm is, therefore, essentially optimal. The most impressive feature of our data structure is its constant query time, hence the name "oracle". Previously, data structures that used only O(n^{1+1/k}) space had a query time of Ω(n^{1/k}). Our algorithms are extremely simple and easy to implement efficiently. They also provide faster constructions of sparse spanners of weighted graphs, and improved tree covers and distance labelings of weighted or unweighted graphs.
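
For intuition, here is an illustrative Python/networkx sketch of the k = 2 case (stretch at most 3). To stay short it precomputes exact distances everywhere, so it demonstrates only the oracle's structure and constant-time query, not the stated preprocessing and space bounds.

import math
import random
import networkx as nx

def build_oracle(G):
    n = G.number_of_nodes()
    # Sample a set of centers of expected size sqrt(n).
    centers = {v for v in G if random.random() < 1 / math.sqrt(n)} or {next(iter(G))}
    # Nearest center p(v) and its distance, via one multi-source Dijkstra.
    dist_c, paths = nx.multi_source_dijkstra(G, centers)
    witness = {v: paths[v][0] for v in G}
    # Bunch B(v): nodes strictly closer to v than v's nearest center.
    # (Here taken from full APSP; the real oracle truncates each search.)
    apsp = dict(nx.all_pairs_dijkstra_path_length(G))
    bunch = {v: {w: d for w, d in apsp[v].items() if d < dist_c[v]} for v in G}
    from_center = {c: apsp[c] for c in centers}
    return witness, dist_c, bunch, from_center

def query(oracle, u, v):
    witness, dist_c, bunch, from_center = oracle
    if v in bunch[u]:
        return bunch[u][v]                         # exact distance
    return dist_c[u] + from_center[witness[u]][v]  # at most 3x the true distance

G = nx.connected_watts_strogatz_graph(60, 4, 0.3, seed=1)
oracle = build_oracle(G)
print(query(oracle, 0, 42), nx.shortest_path_length(G, 0, 42))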

618 citations


Journal ArticleDOI
TL;DR: This paper presents a hybrid genetic algorithm for the job shop scheduling problem based on random keys; the algorithm is tested on a set of standard instances taken from the literature and compared with other approaches.
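
The random-keys idea is what makes such a hybrid workable: a chromosome is a vector of floats, and sorting the keys decodes it into a job priority order, so ordinary crossover always produces a feasible schedule. Below is a toy sketch under a deliberately simplified single-machine objective, not the paper's full hybrid GA.

import numpy as np

rng = np.random.default_rng(0)
n_jobs = 6
proc_time = rng.integers(1, 10, n_jobs)    # hypothetical processing times

def decode(keys):
    return np.argsort(keys)                # job sequence implied by the keys

def cost(seq):
    t = total = 0
    for j in seq:                          # total flow time as a toy objective
        t += proc_time[j]
        total += t
    return total

pop = rng.random((20, n_jobs))             # each row is a chromosome of keys
for gen in range(50):
    pop = pop[np.argsort([cost(decode(c)) for c in pop])]
    # Biased uniform crossover: genes come from an elite parent with prob 0.7.
    elite = pop[rng.integers(0, 5, 10)]
    other = pop[rng.integers(0, 10, 10)]
    pop[10:] = np.where(rng.random((10, n_jobs)) < 0.7, elite, other)

print("best sequence:", decode(pop[0]), "cost:", cost(decode(pop[0])))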

577 citations


Proceedings ArticleDOI
02 May 2005
TL;DR: It is shown that RCP assigns routes correctly, even when the functionality is replicated and distributed, and that networks using RCP can expect comparable convergence delays to those using today's iBGP architectures.
Abstract: The routers in an Autonomous System (AS) must distribute the information they learn about how to reach external destinations. Unfortunately, today's internal Border Gateway Protocol (iBGP) architectures have serious problems: a "full mesh" iBGP configuration does not scale to large networks and "route reflection" can introduce problems such as protocol oscillations and persistent loops. Instead, we argue that a Routing Control Platform (RCP) should collect information about external destinations and internal topology and select the BGP routes for each router in an AS. RCP is a logically-centralized platform, separate from the IP forwarding plane, that performs route selection on behalf of routers and communicates selected routes to the routers using the unmodified iBGP protocol. RCP provides scalability without sacrificing correctness. In this paper, we present the design and implementation of an RCP prototype on commodity hardware. Using traces of BGP and internal routing data from a Tier-1 backbone, we demonstrate that RCP is fast and reliable enough to drive the BGP routing decisions for a large network. We show that RCP assigns routes correctly, even when the functionality is replicated and distributed, and that networks using RCP can expect comparable convergence delays to those using today's iBGP architectures.

556 citations


Journal ArticleDOI
TL;DR: An upper bound on the capacity that can be expressed as the sum of the logarithms of ordered chi-square-distributed variables is derived and evaluated analytically and compared to the results obtained by Monte Carlo simulations.
Abstract: We consider the capacity of multiple-input multiple-output systems with reduced complexity. One link-end uses all available antennas, while the other chooses the L out of N antennas that maximize capacity. We derive an upper bound on the capacity that can be expressed as the sum of the logarithms of ordered chi-square-distributed variables. This bound is then evaluated analytically and compared to the results obtained by Monte Carlo simulations. Our results show that the achieved capacity is close to the capacity of a full-complexity system provided that L is at least as large as the number of antennas at the other link-end. For example, for L = 3, N = 8 antennas at the receiver and three antennas at the transmitter, the capacity of the reduced-complexity scheme is 20 bits/s/Hz compared to 23 bits/s/Hz of a full-complexity scheme. We also present a suboptimum antenna subset selection algorithm that has a complexity of N^2, compared to the optimum algorithm with a complexity of (N choose L).
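
The selection step is easy to experiment with in a few lines of numpy. The sketch below contrasts exhaustive search over all (N choose L) subsets with a simple greedy heuristic; the greedy variant is only illustrative and is not the paper's N^2 suboptimum algorithm.

from itertools import combinations
import numpy as np

rng = np.random.default_rng(1)
N, L, Nt = 8, 3, 3                       # rx antennas, selected subset, tx antennas
snr = 10 ** (20 / 10)                    # 20 dB

H = (rng.standard_normal((N, Nt)) + 1j * rng.standard_normal((N, Nt))) / np.sqrt(2)

def capacity(Hs):
    s = np.linalg.svd(Hs, compute_uv=False)
    return np.sum(np.log2(1 + snr / Nt * s ** 2))   # bits/s/Hz

best = max(combinations(range(N), L), key=lambda idx: capacity(H[list(idx)]))

chosen = []                              # greedy: add the antenna that helps most
for _ in range(L):
    chosen.append(max((i for i in range(N) if i not in chosen),
                      key=lambda i: capacity(H[chosen + [i]])))

print("optimal subset:", best, round(capacity(H[list(best)]), 2))
print("greedy subset: ", tuple(sorted(chosen)), round(capacity(H[chosen]), 2))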

494 citations


Journal ArticleDOI
TL;DR: In this paper, the authors introduce a structural metric that differentiates between simple, connected graphs having an identical degree sequence, which is of particular interest when that sequence satisfies a power-law relationship.
Abstract: There is a large, popular, and growing literature on "scale-free" networks with the Internet along with metabolic networks representing perhaps the canonical examples. While this has in many ways reinvigorated graph theory, there is unfortunately no consistent, precise definition of scale-free graphs and few rigorous proofs of many of their claimed properties. In fact, it is easily shown that the existing theory has many inherent contradictions and that the most celebrated claims regarding the Internet and biology are verifiably false. In this paper, we introduce a structural metric that allows us to differentiate between all simple, connected graphs having an identical degree sequence, which is of particular interest when that sequence satisfies a power-law relationship. We demonstrate that the proposed structural metric yields considerable insight into the claimed properties of SF graphs and provides one possible measure of the extent to which a graph is scale-free. This structural view can be related to ...
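
In the published version of this work, the structural metric is s(g), the sum over all edges of the product of the endpoint degrees, normalized by its maximum over graphs with the same degree sequence. A short sketch follows; the normalization below uses a crude optimistic bound as a stand-in for the exact maximum s_max.

import networkx as nx

def s_metric(G):
    return sum(G.degree(u) * G.degree(v) for u, v in G.edges())

def s_upper_bound(G):
    # Sum the m largest degree products over distinct vertex pairs: an
    # optimistic stand-in for s_max over graphs with this degree sequence.
    degs = sorted((d for _, d in G.degree()), reverse=True)
    prods = sorted((degs[i] * degs[j] for i in range(len(degs))
                    for j in range(i + 1, len(degs))), reverse=True)
    return sum(prods[:G.number_of_edges()])

G = nx.barabasi_albert_graph(200, 2, seed=0)
print("normalized s(g):", s_metric(G) / s_upper_bound(G))  # nearer 1 = more "scale-free"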

469 citations


Proceedings ArticleDOI
19 Oct 2005
TL;DR: An analysis of representative BitTorrent traffic provides several new findings regarding the limitations of BitTorrent systems: due to the exponentially decreasing peer arrival rate in reality, service availability in such systems becomes poor quickly, after which it is difficult for the file to be located and downloaded.
Abstract: Existing studies on BitTorrent systems are single-torrent based, while more than 85% of all peers participate in multiple torrents according to our trace analysis. In addition, these studies are not sufficiently insightful and accurate even for single-torrent models, due to some unrealistic assumptions. Our analysis of representative BitTorrent traffic provides several new findings regarding the limitations of BitTorrent systems: (1) Due to the exponentially decreasing peer arrival rate in reality, service availability in such systems becomes poor quickly, after which it is difficult for the file to be located and downloaded. (2) Client performance in BitTorrent-like systems is unstable, and fluctuates widely with the peer population. (3) Existing systems could provide unfair services to peers, where peers with high downloading speed tend to download more and upload less. In this paper, we study these limitations on torrent evolution in realistic environments. Motivated by the analysis and modeling results, we further build a graph-based multi-torrent model to study inter-torrent collaboration. Our model quantitatively provides strong motivation for inter-torrent collaboration instead of directly stimulating seeds to stay longer. We also discuss a system design to show the feasibility of multi-torrent collaboration.

432 citations


Proceedings ArticleDOI
22 Aug 2005
TL;DR: This paper applies three statistical machine learning algorithms to automatically identify signatures for a range of applications and finds that this approach is highly accurate and scales to allow online application identification on high speed links.
Abstract: An accurate mapping of traffic to applications is important for a broad range of network management and measurement tasks. Internet applications have traditionally been identified using well-known default server network-port numbers in the TCP or UDP headers. However, this approach has become increasingly inaccurate. An alternate, more accurate technique is to use specific application-level features in the protocol exchange to guide the identification. Unfortunately, deriving the signatures manually is very time consuming and difficult. In this paper, we explore automatically extracting application signatures from IP traffic payload content. In particular, we apply three statistical machine learning algorithms to automatically identify signatures for a range of applications. The results indicate that this approach is highly accurate and scales to allow online application identification on high speed links. We also discovered that content signatures still work in the presence of encryption. In these cases we were able to derive content signatures for unencrypted handshakes negotiating the encryption parameters of a particular connection.
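
The flavor of the approach is easy to demonstrate: treat the first payload bytes of a flow as features and train a classifier. The sketch below uses synthetic flows with hypothetical handshake prefixes and a single stand-in learner, whereas the paper evaluates three algorithms on real traffic.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
PREFIX = {"http": b"GET / HTTP/1.1", "smtp": b"220 mail ESMTP",
          "ssh": b"SSH-2.0-OpenSSH"}               # hypothetical signatures

def fake_flow(app):
    head = np.frombuffer(PREFIX[app], dtype=np.uint8)
    tail = rng.integers(32, 127, 64 - len(head))   # random printable filler
    return np.concatenate([head, tail])            # first 64 payload bytes

apps = ["http", "smtp", "ssh"]
X = np.array([fake_flow(a) for a in apps for _ in range(200)], dtype=float)
y = np.repeat(apps, 200)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print("holdout accuracy:", clf.score(Xte, yte))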

420 citations


Journal ArticleDOI
TL;DR: This paper provides methods that use flow statistics formed from a sampled packet stream to infer the frequencies of the number of packets per flow in the unsampled stream, in part by exploiting protocol-level detail reported in flow records.
Abstract: Passive traffic measurement increasingly employs sampling at the packet level. Many high-end routers form flow statistics from a sampled substream of packets. Sampling controls the consumption of resources by the measurement operations. However, knowledge of the statistics of flows in the unsampled stream remains useful, for understanding both characteristics of source traffic, and consumption of resources in the network. This paper provides methods that use flow statistics formed from a sampled packet stream to infer the frequencies of the number of packets per flow in the unsampled stream. A key task is to infer the properties of flows of original traffic that evaded sampling altogether. We achieve this through statistical inference, and by exploiting protocol level detail reported in flow records. We investigate the impact on our results of different versions of packet sampling.

285 citations


Proceedings ArticleDOI
19 Oct 2005
TL;DR: A new dynamic anomography algorithm is introduced, which effectively tracks routing and traffic change so as to alert with high fidelity on intrinsic changes in network-level traffic, yet not on internal routing changes; an additional benefit of dynamic anomography is that it is robust to missing data, an important operational reality.
Abstract: Anomaly detection is a first and important step needed to respond to unexpected problems and to assure high performance and security in IP networks. We introduce a framework and a powerful class of algorithms for network anomography, the problem of inferring network-level anomalies from widely available data aggregates. The framework contains novel algorithms, as well as a recently published approach based on Principal Component Analysis (PCA). Moreover, owing to its clear separation of inference and anomaly detection, the framework opens the door to the creation of whole families of new algorithms. We introduce several such algorithms here, based on ARIMA modeling, the Fourier transform, Wavelets, and Principal Component Analysis. We introduce a new dynamic anomography algorithm, which effectively tracks routing and traffic change, so as to alert with high fidelity on intrinsic changes in network-level traffic, yet not on internal routing changes. An additional benefit of dynamic anomography is that it is robust to missing data, an important operational reality. To the best of our knowledge, this is the first anomography algorithm that can handle routing changes and missing data. To evaluate these algorithms, we used several months of traffic data collected from the Abilene network and from a large Tier-1 ISP network. To compare performance, we use the methodology put forward earlier for the Abilene data set. The findings are encouraging. Among the new algorithms introduced here, we see: high accuracy in detection (few false negatives and few false positives), and high robustness (little performance degradation in the presence of measurement noise, missing data and routing changes).

265 citations


Book ChapterDOI
01 Jan 2005
TL;DR: Recent advances and applications of GRASP with path-relinking are reviewed, along with extensions of this strategy, concerning in particular parallel implementations and applications of path-relinking with other metaheuristics.
Abstract: Path-relinking is a major enhancement to the basic greedy randomized adaptive search procedure (GRASP), leading to significant improvements in solution time and quality. Path-relinking adds a memory mechanism to GRASP by providing an intensification strategy that explores trajectories connecting GRASP solutions and the best elite solutions previously produced during the search. This paper reviews recent advances and applications of GRASP with path-relinking. A brief review of GRASP is given. This is followed by a description of path-relinking and how it is incorporated into GRASP. Several recent applications of GRASP with path-relinking are reviewed. The paper concludes with a discussion of extensions to this strategy, concerning in particular parallel implementations and applications of path-relinking with other metaheuristics.
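
The relinking step itself is compact. A generic sketch on binary solution vectors: walk from the initiating solution toward the guiding (elite) solution one differing attribute at a time, always taking the locally best move and remembering the best solution seen. The objective below is a toy stand-in.

import random

def objective(x):                       # hypothetical maximization objective
    return sum(i * xi for i, xi in enumerate(x)) - 3 * sum(x)

def path_relink(start, guide):
    current, best = list(start), list(start)
    moves = [i for i in range(len(start)) if start[i] != guide[i]]
    while moves:
        i = max(moves, key=lambda j: objective(
            current[:j] + [guide[j]] + current[j + 1:]))
        current[i] = guide[i]           # adopt one attribute of the guide
        moves.remove(i)
        if objective(current) > objective(best):
            best = list(current)
    return best

random.seed(0)
s1 = [random.randint(0, 1) for _ in range(12)]   # a fresh GRASP solution
s2 = [random.randint(0, 1) for _ in range(12)]   # an elite-set solution
best = path_relink(s1, s2)
print(best, objective(best))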

Proceedings ArticleDOI
02 May 2005
TL;DR: This work introduces a fault-localization methodology based on the use of risk models and an associated troubleshooting system, SCORE (Spatial Correlation Engine), which automatically identifies likely root causes across layers in IP and optical networks.
Abstract: Automated, rapid, and effective fault management is a central goal of large operational IP networks. Today's networks suffer from a wide and volatile set of failure modes, where the underlying fault proves difficult to detect and localize, thereby delaying repair. One of the main challenges stems from operational reality: IP routing and the underlying optical fiber plant are typically described by disparate data models and housed in distinct network management systems. We introduce a fault-localization methodology based on the use of risk models and an associated troubleshooting system, SCORE (Spatial Correlation Engine), which automatically identifies likely root causes across layers. In particular, we apply SCORE to the problem of localizing link failures in IP and optical networks. In experiments conducted on a tier-1 ISP backbone, SCORE proved remarkably effective at localizing optical link failures using only IP-layer event logs. Moreover, SCORE was often able to automatically uncover inconsistencies in the databases that maintain the critical associations between the IP and optical networks.
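
At its core, this kind of risk-model inference resembles greedy set cover: each candidate root cause (a fiber span, say) is a risk group of IP links expected to fail together, and we look for a few groups that jointly explain the observed failures. A sketch with hypothetical data:

def localize(observed, risk_groups):
    remaining, hypothesis = set(observed), []
    while remaining:
        # Pick the group explaining the most still-unexplained failures.
        cause, links = max(risk_groups.items(),
                           key=lambda kv: len(kv[1] & remaining))
        if not links & remaining:
            break                # leftovers unexplained (noise / missing groups)
        hypothesis.append(cause)
        remaining -= links
    return hypothesis

risk_groups = {
    "fiber-span-A": {"link1", "link2", "link3"},
    "fiber-span-B": {"link3", "link4"},
    "router-R":     {"link5", "link6"},
}
print(localize({"link1", "link2", "link3", "link4"}, risk_groups))
# -> ['fiber-span-A', 'fiber-span-B']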

Journal ArticleDOI
TL;DR: For the Internet, an improved understanding of its physical infrastructure is possible by viewing the physical connectivity as an annotated graph that delivers raw connectivity and bandwidth to the upper layers in the TCP/IP protocol stack, subject to practical constraints and economic considerations.
Abstract: Building on a recent effort that combines a first-principles approach to modeling router-level connectivity with a more pragmatic use of statistics and graph theory, we show in this paper that for the Internet, an improved understanding of its physical infrastructure is possible by viewing the physical connectivity as an annotated graph that delivers raw connectivity and bandwidth to the upper layers in the TCP/IP protocol stack, subject to practical constraints (e.g., router technology) and economic considerations (e.g., link costs). More importantly, by relying on data from Abilene, a Tier-1 ISP, and the Rocketfuel project, we provide empirical evidence in support of the proposed approach and its consistency with networking reality. To illustrate its utility, we: 1) show that our approach provides insight into the origin of high variability in measured or inferred router-level maps; 2) demonstrate that it easily accommodates the incorporation of additional objectives of network design (e.g., robustness to router failure); and 3) discuss how it complements ongoing community efforts to reverse-engineer the Internet.

Journal ArticleDOI
TL;DR: This paper presents a new approach to traffic matrix estimation using a regularization based on "entropy penalization", and chooses the traffic matrix consistent with the measured data that is information-theoretically closest to a model in which source/destination pairs are stochastically independent.
Abstract: Traffic matrices are required inputs for many IP network management tasks, such as capacity planning, traffic engineering, and network reliability analysis. However, it is difficult to measure these matrices directly in large operational IP networks, so there has been recent interest in inferring traffic matrices from link measurements and other more easily measured data. Typically, this inference problem is ill-posed, as it involves significantly more unknowns than data. Experience in many scientific and engineering fields has shown that it is essential to approach such ill-posed problems via "regularization". This paper presents a new approach to traffic matrix estimation using a regularization based on "entropy penalization". Our solution chooses the traffic matrix consistent with the measured data that is information-theoretically closest to a model in which source/destination pairs are stochastically independent. It applies to both point-to-point and point-to-multipoint traffic matrix estimation. We use fast algorithms based on modern convex optimization theory to solve for our traffic matrices. We evaluate our algorithm with real backbone traffic and routing data, and demonstrate that it is fast, accurate, robust, and flexible.
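
A toy version of the optimization, with three links, four origin-destination pairs, a uniform gravity prior, and made-up loads; the paper uses specialized convex solvers, but at this scale scipy's general-purpose L-BFGS-B illustrates the idea.

import numpy as np
from scipy.optimize import minimize

A = np.array([[1, 1, 0, 0],          # routing matrix: links x OD pairs
              [0, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)
x_true = np.array([4.0, 1.0, 3.0, 2.0])
y = A @ x_true                       # measured link loads

g = np.full(4, x_true.sum() / 4)     # gravity-model prior (uniform for brevity)
lam = 0.1                            # weight of the entropy penalty

def cost(x):
    x = np.maximum(x, 1e-9)
    # squared data misfit + KL divergence from the gravity model
    return np.sum((A @ x - y) ** 2) + lam * np.sum(x * np.log(x / g))

res = minimize(cost, x0=g, bounds=[(1e-9, None)] * 4, method="L-BFGS-B")
print("estimate:", res.x.round(2), " true:", x_true)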

Proceedings ArticleDOI
Yannis Kotidis
05 Apr 2005
TL;DR: This paper introduces the idea of snapshot queries for energy-efficient data acquisition in sensor networks, and presents a detailed experimental study of the framework and algorithms, varying multiple parameters such as the available memory of the sensor nodes, their transmission range, and network message loss.

Abstract: In this paper we introduce the idea of snapshot queries for energy efficient data acquisition in sensor networks. Network nodes generate models of their surrounding environment that are used for electing, using a localized algorithm, a small set of representative nodes in the network. These representative nodes constitute a network snapshot and can be used to provide quick approximate answers to user queries while reducing substantially the energy consumption in the network. We present a detailed experimental study of our framework and algorithms, varying multiple parameters such as the available memory of the sensor nodes, their transmission range, and network message loss. Depending on the configuration, snapshot queries provide a reduction of up to 90% in the number of nodes that need to participate in a user query.

Journal ArticleDOI
TL;DR: In this paper, threshold sampling is introduced as a sampling scheme that optimally controls the expected volume of samples and the variance of estimators over any classification of flows in a large stream.

Abstract: This paper deals with sampling objects from a large stream. Each object possesses a size, and the aim is to be able to estimate the total size of an arbitrary subset of objects whose composition is not known at the time of sampling. This problem is motivated from network measurements in which the objects are flow records exported by routers and the sizes are the number of packets or bytes reported in the record. Subsets of interest could be flows from a certain customer or flows from a worm attack. This paper introduces threshold sampling as a sampling scheme that optimally controls the expected volume of samples and the variance of estimators over any classification of flows. It provides algorithms for dynamic control of sample volumes and evaluates them on flow data gathered from a commercial Internet Protocol (IP) network. The algorithms are simple to implement and robust to variation in network conditions. The work reported here has been applied in the measurement infrastructure of the commercial IP network. Not to have employed sampling would have entailed an order of magnitude greater capital expenditure to accommodate the measurement traffic and its processing.
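
Threshold sampling is short enough to state in full: a flow of size x is sampled with probability min(1, x/z), and a sampled flow is reported with estimated size max(x, z), which makes the estimated total size of any subset unbiased. A sketch with synthetic flow sizes and hypothetical customer labels:

import random

def threshold_sample(flows, z):
    sampled = []
    for size, label in flows:
        if random.random() < min(1.0, size / z):
            sampled.append((max(size, z), label))   # unbiased size estimate
    return sampled

random.seed(0)
flows = [(random.paretovariate(1.5),
          "cust-A" if random.random() < 0.5 else "cust-B")
         for _ in range(100_000)]
z = 20.0
sampled = threshold_sample(flows, z)
true_a = sum(s for s, l in flows if l == "cust-A")
est_a = sum(s for s, l in sampled if l == "cust-A")
print(f"{len(sampled)} samples kept; cust-A true {true_a:.0f} vs estimate {est_a:.0f}")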

Journal ArticleDOI
TL;DR: This paper describes how to estimate the confidence score for each utterance through an on-line algorithm using the lattice output of a speech recognizer and shows that the amount of labeled data needed for a given word accuracy can be reduced by more than 60% with respect to random sampling.
Abstract: We are interested in the problem of adaptive learning in the context of automatic speech recognition (ASR). In this paper, we propose an active learning algorithm for ASR. Automatic speech recognition systems are trained using human supervision to provide transcriptions of speech utterances. The goal of Active Learning is to minimize the human supervision for training acoustic and language models and to maximize the performance given the transcribed and untranscribed data. Active learning aims at reducing the number of training examples to be labeled by automatically processing the unlabeled examples, and then selecting the most informative ones with respect to a given cost function for a human to label. In this paper we describe how to estimate the confidence score for each utterance through an on-line algorithm using the lattice output of a speech recognizer. The utterance scores are filtered through the informativeness function and an optimal subset of training samples is selected. The active learning algorithm has been applied to both batch and on-line learning schemes and we have experimented with different selective sampling algorithms. Our experiments show that by using active learning the amount of labeled data needed for a given word accuracy can be reduced by more than 60% with respect to random sampling.
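
A schematic of the selective-sampling loop: score each untranscribed utterance, then send only the least-confident ones to human labelers. The confidence values below are random stand-ins for the lattice-based scores the paper computes.

import random

random.seed(0)
# (utterance id, confidence), e.g. a mean word posterior from the lattice
pool = [(f"utt{i:04d}", random.betavariate(5, 2)) for i in range(1000)]

def select_for_labeling(pool, budget):
    # Informativeness here is simply low confidence.
    return sorted(pool, key=lambda item: item[1])[:budget]

batch = select_for_labeling(pool, budget=32)
print("send to transcribers:", [utt for utt, _ in batch[:5]], "...")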

Proceedings ArticleDOI
22 Aug 2005
TL;DR: This work introduces a new algorithm for packet loss measurement that is designed to overcome the deficiencies in standard Poisson-based tools and develops and implements a prototype tool, called BADABING, which reports loss characteristics far more accurately than traditional loss measurement tools.
Abstract: Measurement and estimation of packet loss characteristics are challenging due to the relatively rare occurrence and typically short duration of packet loss episodes. While active probe tools are commonly used to measure packet loss on end-to-end paths, there has been little analysis of the accuracy of these tools or their impact on the network. The objective of our study is to understand how to measure packet loss episodes accurately with end-to-end probes. We begin by testing the capability of standard Poisson-modulated end-to-end measurements of loss in a controlled laboratory environment using IP routers and commodity end hosts. Our tests show that loss characteristics reported from such Poisson-modulated probe tools can be quite inaccurate over a range of traffic conditions. Motivated by these observations, we introduce a new algorithm for packet loss measurement that is designed to overcome the deficiencies in standard Poisson-based tools. Specifically, our method creates a probe process that (1) enables an explicit trade-off between accuracy and impact on the network, and (2) enables more accurate measurements than standard Poisson probing at the same rate. We evaluate the capabilities of our methodology experimentally by developing and implementing a prototype tool, called BADABING. The experiments demonstrate the trade-offs between impact on the network and measurement accuracy. We show that BADABING reports loss characteristics far more accurately than traditional loss measurement tools.

Journal ArticleDOI
12 Jun 2005
TL;DR: From such descriptions, the PADS compiler generates libraries and tools for manipulating the data, including parsing routines, statistical profiling tools, translation programs to produce well-behaved formats such as XML or those required for loading relational databases, and tools for running XQueries over raw PADS data sources.
Abstract: PADS is a declarative data description language that allows data analysts to describe both the physical layout of ad hoc data sources and semantic properties of that data. From such descriptions, the PADS compiler generates libraries and tools for manipulating the data, including parsing routines, statistical profiling tools, translation programs to produce well-behaved formats such as XML or those required for loading relational databases, and tools for running XQueries over raw PADS data sources. The descriptions are concise enough to serve as "living" documentation while flexible enough to describe most of the ASCII, binary, and Cobol formats that we have seen in practice. The generated parsing library provides for robust, application-specific error handling.

Proceedings Article
30 Aug 2005
TL;DR: This work proposes novel XML scoring methods that are inspired by tf*idf and that account for both structure and content while considering query relaxations, and proposes efficient data structures to speed up ranked query processing.

Abstract: XML repositories are usually queried both on structure and content. Due to structural heterogeneity of XML, queries are often interpreted approximately and their answers are returned ranked by scores. Computing answer scores in XML is an active area of research that oscillates between pure content scoring such as the well-known tf*idf and taking structure into account. However, none of the existing proposals fully accounts for structure and combines it with content to score query answers. We propose novel XML scoring methods that are inspired by tf*idf and that account for both structure and content while considering query relaxations. Twig scoring accounts for the most structure and content and is thus used as our reference method. Path scoring is an approximation that loosens correlations between query nodes, hence reducing the amount of time required to manipulate scores during top-k query processing. We propose efficient data structures in order to speed up ranked query processing. We run extensive experiments that validate our scoring methods and that show that path scoring provides very high precision while improving score computation time.

Journal ArticleDOI
TL;DR: Top trees are designed as a new, simpler interface for data structures maintaining information in a fully dynamic forest; the forest can be updated by insertion and deletion of edges and by changes to vertex and edge weights, with each update supported in O(log n) time.
Abstract: We design top trees as a new simpler interface for data structures maintaining information in a fully dynamic forest. We demonstrate how easy and versatile they are to use on a host of different applications. For example, we show how to maintain the diameter, center, and median of each tree in the forest. The forest can be updated by insertion and deletion of edges and by changes to vertex and edge weights. Each update is supported in O(log n) time, where n is the size of the tree(s) involved in the update. Also, we show how to support nearest common ancestor queries and level ancestor queries with respect to arbitrary roots in O(log n) time. Finally, with marked and unmarked vertices, we show how to compute distances to a nearest marked vertex. The latter has applications to approximate nearest marked vertex in general graphs, and thereby to static optimization problems over shortest path metrics.Technically speaking, top trees are easily implemented either with Frederickson's [1997a] topology trees or with Sleator and Tarjan's [1983] dynamic trees. However, we claim that the interface is simpler for many applications, and indeed our new bounds are quadratic improvements over previous bounds where they exist.

Journal ArticleDOI
01 Aug 2005-Networks
TL;DR: A genetic algorithm with a local improvement procedure for the OSPF weight-setting problem makes use of an efficient dynamic shortest path algorithm to recompute shortest paths after the modification of link weights.
Abstract: Intradomain traffic engineering aims to make more efficient use of network resources within an autonomous system. Interior Gateway Protocols such as OSPF (Open Shortest Path First) and IS-IS (Intermediate System-Intermediate System) are commonly used to select the paths along which traffic is routed within an autonomous system. These routing protocols direct traffic based on link weights assigned by the network operator. Each router in the autonomous system computes shortest paths and creates destination tables used to direct each packet to the next router on the path to its final destination. Given a set of traffic demands between origin-destination pairs, the OSPF weight setting problem consists of determining weights to be assigned to the links so as to optimize a cost function, typically associated with a network congestion measure. In this article, we propose a genetic algorithm with a local improvement procedure for the OSPF weight-setting problem. The local improvement procedure makes use of an efficient dynamic shortest path algorithm to recompute shortest paths after the modification of link weights. We test the algorithm on a set of real and synthetic test problems, and show that it produces near-optimal solutions. We compare the hybrid algorithm with other algorithms for this problem, illustrating its efficiency and robustness. (This research was done while the first author was a visiting scholar at the Internet and Network Systems Research Center at AT&T Labs Research; AT&T Labs Research Technical Report TD-5NTN5G.)

Proceedings ArticleDOI
06 Jun 2005
TL;DR: RouteScope, a tool for inferring AS-level paths by finding the shortest policy paths in an AS graph obtained from BGP tables collected from multiple vantage points, and a novel scheme to infer the first AS hop by exploiting the TTL information in IP packets are described.
Abstract: The ability to discover the AS-level path between two end-points is valuable for network diagnosis, performance optimization, and reliability enhancement. Virtually all existing techniques and tools for path discovery require direct access to the source. However, the uncooperative nature of the Internet makes it difficult to get direct access to any remote end-point. Path inference becomes challenging when we have no access to the source or the destination. Moreover, even when we have access to the source and know the forward path, it is nontrivial to infer the reverse path, since Internet routing is often asymmetric. In this paper, we explore the feasibility of AS-level path inference without direct access to either end-point. We describe RouteScope, a tool for inferring AS-level paths by finding the shortest policy paths in an AS graph obtained from BGP tables collected from multiple vantage points. We identify two main factors that affect the path inference accuracy: the accuracy of AS relationship inference and the ability to determine the first AS hop. To address these issues, we propose two novel techniques: a new AS relationship inference algorithm, and a novel scheme to infer the first AS hop by exploiting the TTL information in IP packets. We evaluate the effectiveness of RouteScope using both BGP tables and the AS paths collected from public BGP gateways. Our results show that it achieves 70%-88% accuracy in path inference.

Proceedings ArticleDOI
02 May 2005
TL;DR: The design and evaluation of an online system that converts millions of BGP update messages a day into a few dozen actionable reports about significant routing disruptions are presented; validation using other data sources confirms the accuracy of the algorithms and the tool's additional value in detecting routing disruptions.
Abstract: The performance of a backbone network is vulnerable to interdomain routing changes that affect how traffic travels to destinations in other Autonomous Systems (ASes). Despite having poor visibility into these routing changes, operators often need to react quickly by tuning the network configuration to alleviate congestion or by notifying other ASes about serious reachability problems. Fortunately, operators can improve their visibility by monitoring the Border Gateway Protocol (BGP) decisions of the routers at the periphery of their AS. However, the volume of measurement data is very large and extracting the important information is challenging. In this paper, we present the design and evaluation of an online system that converts millions of BGP update messages a day into a few dozen actionable reports about significant routing disruptions. We apply our tool to two months of BGP and traffic data collected from a Tier-1 ISP backbone and discover several network problems previously unknown to the operators. Validation using other data sources confirms the accuracy of our algorithms and the tool's additional value in detecting routing disruptions.

Journal ArticleDOI
Yehuda Koren
TL;DR: This paper explores spectral visualization techniques, studies their properties from different points of view, and suggests a novel algorithm for calculating spectral layouts, resulting in an extremely fast computation by optimizing the layout within a small vector space.
Abstract: The spectral approach for graph visualization computes the layout of a graph using certain eigenvectors of related matrices. Two important advantages of this approach are an ability to compute optimal layouts (according to specific requirements) and a very rapid computation time. In this paper, we explore spectral visualization techniques and study their properties from different points of view. We also suggest a novel algorithm for calculating spectral layouts resulting in an extremely fast computation by optimizing the layout within a small vector space.
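
The basic recipe behind spectral layout is brief: use the eigenvectors of the graph Laplacian belonging to the two smallest nonzero eigenvalues as x/y coordinates. A dense-solver sketch that is fine for small graphs; the paper's contribution lies in computing such layouts very fast at scale.

import numpy as np
import networkx as nx

G = nx.grid_2d_graph(10, 10)
L = nx.laplacian_matrix(G).toarray().astype(float)
vals, vecs = np.linalg.eigh(L)        # eigenvalues in ascending order
xy = vecs[:, 1:3]                     # skip the constant (eigenvalue-0) vector
pos = {v: xy[i] for i, v in enumerate(G.nodes())}
# pos recovers the grid's geometry; nx.draw(G, pos=pos) would render it.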

Journal ArticleDOI
TL;DR: Architecture reviews are used to identify project problems before they become costly to fix and to provide timely information to upper management so that they can make better-informed decisions.
Abstract: Architecture reviews have evolved over the past decade to become a critical part of our continuing efforts to improve the state of affairs. We use them to identify project problems before they become costly to fix and to provide timely information to upper management so that they can make better-informed decisions. They provide the foundation for reuse, for using commercially available software, and for getting to the marketplace fast. The reviews also help identify best practices for projects and socialize such practices across the organization, thereby improving the organization's quality and operations.

Proceedings ArticleDOI
19 Oct 2005
TL;DR: This paper characterizes the graph-related properties of individual overlay snapshots and overlay dynamics across hundreds of back-to-back snapshots, and shows how inaccuracy in snapshots can lead to erroneous conclusions, such as a power-law degree distribution.
Abstract: During recent years, peer-to-peer (P2P) file-sharing systems have evolved in many ways to accommodate growing numbers of participating peers. In particular, new features have changed the properties of the unstructured overlay topology formed by these peers. Despite their importance, little is known about the characteristics of these topologies and their dynamics in modern file-sharing applications. This paper presents a detailed characterization of P2P overlay topologies and their dynamics, focusing on the modern Gnutella network. Using our fast and accurate P2P crawler, we capture a complete snapshot of the Gnutella network with more than one million peers in just a few minutes. Leveraging more than 18,000 recent overlay snapshots, we characterize the graph-related properties of individual overlay snapshots and overlay dynamics across hundreds of back-to-back snapshots. We show how inaccuracy in snapshots can lead to erroneous conclusions, such as a power-law degree distribution. Our results reveal that while the Gnutella network has dramatically grown and changed in many ways, it still exhibits the clustering and short path lengths of a small world network. Furthermore, its overlay topology is highly resilient to random peer departure and even systematic attacks. More interestingly, overlay dynamics lead to an "onion-like" biased connectivity among peers where each peer is more likely connected to peers with higher uptime. Therefore, long-lived peers form a stable core that ensures reachability among peers despite overlay dynamics.

Proceedings ArticleDOI
04 Jul 2005
TL;DR: A new analytic method based on singular value decompositions that yields a closed-form solution for simultaneous multiview registration in the noise-free scenario and an iterative scheme based on Newton's method on SO(3) that has locally quadratic convergence are presented.
Abstract: We propose a novel algorithm to register multiple 3D point sets within a common reference frame using a manifold optimization approach. The point sets are obtained with multiple laser scanners or a mobile scanner. Unlike most prior algorithms, our approach performs an explicit optimization on the manifold of rotations, allowing us to formulate the registration problem as an unconstrained minimization on a constrained manifold. This approach exploits the Lie group structure of SO(3) and the simple representation of its associated Lie algebra so(3) in terms of R^3. Our contributions are threefold. We present a new analytic method based on singular value decompositions that yields a closed-form solution for simultaneous multiview registration in the noise-free scenario. Secondly, we use this method to derive a good initial estimate of a solution in the noise-free case. This initialization step may be of use in any general iterative scheme. Finally, we present an iterative scheme based on Newton's method on SO(3) that has locally quadratic convergence. We demonstrate the efficacy of our scheme on scan data taken both from the Digital Michelangelo project and from scans extracted from models, and compare it to some of the other well-known schemes for multiview registration. In all cases, our algorithm converges much faster than the other approaches (in some cases orders of magnitude faster), and generates consistently higher quality registrations.
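
The two-view building block is the classical closed-form SVD alignment (orthogonal Procrustes / Kabsch); the paper's analytic contribution generalizes such SVD reasoning to simultaneous multiview registration. A sketch of the two-view case with synthetic points:

import numpy as np

def align(P, Q):
    """Rotation R and translation t minimizing ||(R @ P + t) - Q||."""
    cp = P.mean(axis=1, keepdims=True)
    cq = Q.mean(axis=1, keepdims=True)
    U, _, Vt = np.linalg.svd((Q - cq) @ (P - cp).T)    # cross-covariance SVD
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])     # guard against reflection
    R = U @ D @ Vt
    return R, cq - R @ cp

rng = np.random.default_rng(0)
P = rng.standard_normal((3, 100))                      # source scan points
th = 0.4
R_true = np.array([[np.cos(th), -np.sin(th), 0],
                   [np.sin(th),  np.cos(th), 0],
                   [0, 0, 1]])
Q = R_true @ P + np.array([[1.0], [2.0], [0.5]])       # rotated + translated copy
R, t = align(P, Q)
print(np.allclose(R, R_true), t.ravel().round(3))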

Proceedings ArticleDOI
14 Jun 2005
TL;DR: This paper addresses the problem of efficiently computing multiple aggregations over high speed data streams, based on a two-level LFTA/HFTA DSMS architecture, inspired by Gigascope, and formally shows the hardness of determining the optimal configuration.
Abstract: Monitoring aggregates on IP traffic data streams is a compelling application for data stream management systems. The need for exploratory IP traffic data analysis naturally leads to posing related aggregation queries on data streams that differ only in the choice of grouping attributes. In this paper, we address this problem of efficiently computing multiple aggregations over high speed data streams, based on a two-level LFTA/HFTA DSMS architecture, inspired by Gigascope. Our first contribution is the insight that in such a scenario, additionally computing and maintaining fine-granularity aggregation queries (phantoms) at the LFTA has the benefit of supporting shared computation. Our second contribution is an investigation into the problem of identifying beneficial LFTA configurations of phantoms and user-queries. We formulate this problem as a cost optimization problem, which consists of two sub-optimization problems: how to choose phantoms and how to allocate space for them in the LFTA. We formally show the hardness of determining the optimal configuration, and propose cost greedy heuristics for these independent sub-problems based on detailed analyses. Our final contribution is a thorough experimental study, based on real IP traffic data, as well as synthetic data, to demonstrate the effectiveness of our techniques for identifying beneficial configurations.

Proceedings ArticleDOI
Mikkel Thorup
22 May 2005
TL;DR: The first solution to the fully-dynamic all pairs shortest path problem where every update is faster than a recomputation from scratch in Ω(n^3 log log n / log n) time is presented, for a directed graph with arbitrary non-negative edge weights.
Abstract: We present here the first solution to the fully-dynamic all pairs shortest path problem where every update is faster than a recomputation from scratch in Ω(n^3 log log n / log n) time. This is for a directed graph with arbitrary non-negative edge weights. An update inserts or deletes a vertex with all incident edges. After each such vertex update, we update a complete distance matrix in O(n^{2.75}) time.