We will review some of the major results in random graphs and some of the more challenging open problems. We will cover algorithmic and structural questions. We will touch on newer models, including those related to the WWW.

Random graphs

Most statistical and machine-learning algorithms assume that the data is a random sample drawn from a stationary distribution. Unfortunately, most of the large databases available for mining today violate this assumption. They were gathered over months or years, and the underlying processes generating them changed during this time, sometimes radically. Although a number of algorithms have been proposed for learning time-changing concepts, they generally do not scale well to very large databases. In this paper we propose an efficient algorithm for mining decision trees from continuously-changing data streams, based on the ultra-fast VFDT decision tree learner. This algorithm, called CVFDT, stays current while making the most of old data by growing an alternative subtree whenever an old one becomes questionable, and replacing the old with the new when the new becomes more accurate. CVFDT learns a model which is similar in accuracy to the one that would be learned by reapplying VFDT to a moving window of examples every time a new example arrives, but with O(1) complexity per example, as opposed to O(w), where w is the size of the window. Experiments on a set of large time-changing data streams demonstrate the utility of this approach.

/pdf/mining-time-changing-data-streams-a9kghvc13p.pdf

Mining time-changing data streams

The dynamic and lossy nature of wireless communication poses major challenges to reliable, self-organizing multihop networks. These non-ideal characteristics are more problematic with the primitive, low-power radio transceivers found in sensor networks, and raise new issues that routing protocols must address. Link connectivity statistics should be captured dynamically through an efficient yet adaptive link estimator and routing decisions should exploit such connectivity statistics to achieve reliability. Link status and routing information must be maintained in a neighborhood table with constant space regardless of cell density. We study and evaluate link estimator, neighborhood table management, and reliable routing protocol techniques. We focus on a many-to-one, periodic data collection workload. We narrow the design space through evaluations on large-scale, high-level simulations to 50-node, in-depth empirical experiments. The most effective solution uses a simple time averaged EWMA estimator, frequency based table management, and cost-based routing.

/pdf/taming-the-underlying-challenges-of-reliable-multihop-2e2c4yefix.pdf

Taming the underlying challenges of reliable multihop routing in sensor networks

In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudo-random computations, sparse approximation theory and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [1].

/pdf/data-streams-algorithms-and-applications-5fo85u5tcy.pdf

Data streams: algorithms and applications

Constraint programming is a powerful paradigm for solving combinatorial search problems that draws on a wide range of techniques from artificial intelligence, computer science, databases, programming languages, and operations research. Constraint programming is currently applied with success to many domains, such as scheduling, planning, vehicle routing, configuration, networks, and bioinformatics.

The aim of this handbook is to capture the full breadth and depth of the constraint programming field and to be encyclopedic in its scope and coverage. While there are several excellent books on constraint programming, such books necessarily focus on the main notions and techniques and cannot cover also extensions, applications, and languages. The handbook gives a reasonably complete coverage of all these lines of work, based on constraint programming, so that a reader can have a rather precise idea of the whole field and its potential. Of course each line of work is dealt with in a survey-like style, where some details may be neglected in favor of coverage. However, the extensive bibliography of each chapter will help the interested readers to find suitable sources for the missing details. Each chapter of the handbook is intended to be a self-contained survey of a topic, and is written by one or more authors who are leading researchers in the area.

The intended audience of the handbook is researchers, graduate students, higher-year undergraduates and practitioners who wish to learn about the state-of-the-art in constraint programming. No prior knowledge about the field is necessary to be able to read the chapters and gather useful knowledge. Researchers from other fields should find in this handbook an effective way to learn about constraint programming and to possibly use some of the constraint programming concepts and techniques in their work, thus providing a means for a fruitful cross-fertilization among different research areas.

The handbook is organized in two parts. The first part covers the basic foundations of constraint programming, including the history, the notion of constraint propagation, basic search methods, global constraints, tractability and computational complexity, and important issues in modeling a problem as a constraint problem. The second part covers constraint languages and solver, several useful extensions to the basic framework (such as interval constraints, structured domains, and distributed CSPs), and successful application areas for constraint programming.

- Covers the whole field of constraint programming
- Survey-style chapters
- Five chapters on applications

Table of Contents

Foreword (Ugo Montanari) 
Part I : Foundations
Chapter 1. Introduction (Francesca Rossi, Peter van Beek, Toby Walsh) 
Chapter 2. Constraint Satisfaction: An Emerging Paradigm (Eugene C. Freuder, Alan K. Mackworth) 
Chapter 3. Constraint Propagation (Christian Bessiere) 
Chapter 4. Backtracking Search Algorithms (Peter van Beek) 
Chapter 5. Local Search Methods (Holger H. Hoos, Edward Tsang) 
Chapter 6. Global Constraints (Willem-Jan van Hoeve, Irit Katriel)
Chapter 7. Tractable Structures for CSPs (Rina Dechter)
Chapter 8. The Complexity of Constraint Languages
(David Cohen, Peter Jeavons) 
Chapter 9. Soft Constraints (Pedro Meseguer, Francesca Rossi, Thomas Schiex) 
Chapter 10. Symmetry in Constraint Programming
(Ian P. Gent, Karen E. Petrie, Jean-Francois Puget)
Chapter 11. Modelling (Barbara M. Smith) 
Part II : Extensions, Languages, and Applications

Chapter 12. Constraint Logic Programming (Kim Marriott, Peter J. Stuckey, Mark Wallace) 
Chapter 13. Constraints in Procedural and Concurrent Languages (Thom Fruehwirth, Laurent Michel, Christian Schulte)
Chapter 14. Finite Domain Constraint Programming Systems (Christian Schulte, Mats Carlsson) 
Chapter 15. Operations Research Methods in Constraint Programming (John Hooker) 
Chapter 16. Continuous and Interval Constraints(Frederic Benhamou, Laurent Granvilliers) 
Chapter 17. Constraints over Structured Domains
(Carmen Gervet) 
Chapter 18. Randomness and Structure (Carla Gomes, Toby Walsh)
Chapter 19. Temporal CSPs (Manolis Koubarakis)
Chapter 20. Distributed Constraint Programming
(Boi Faltings) 
Chapter 21. Uncertainty and Change (Kenneth N. Brown, Ian Miguel)
Chapter 22. Constraint-Based Scheduling and Planning
(Philippe Baptiste, Philippe Laborie, Claude Le Pape, Wim Nuijten)
Chapter 23. Vehicle Routing (Philip Kilby, Paul Shaw) 
Chapter 24. Configuration (Ulrich Junker)
Chapter 25. Constraint Applications in Networks
(Helmut Simonis)
Chapter 26. Bioinformatics and Constraints (Rolf Backofen, David Gilbert)

/pdf/handbook-of-constraint-programming-53srutguo2.pdf

Handbook of Constraint Programming

We consider a router on the Internet analyzing the statistical properties of a TCP/IP packet stream. A fundamental difficulty with measuring traffic behavior on the Internet is that there is simply too much data to be recorded for later analysis, on the order of gigabytes a second. As a result, network routers can collect only relatively few statistics about the data. The central problem addressed here is to use the limited memory of routers to determine essential features of the network traffic stream. A particularly difficult and representative subproblem is to determine the top k categories to which the most packets belong, for a desired value of k and for a given notion of categorization such as the destination IP address.We present an algorithm that deterministically finds (in particular) all categories having a frequency above 1/(m+1) using m counters, which we prove is best possible in the worst case. We also present a sampling-based algorithm for the case that packet categories follow an arbitrary distribution, but their order over time is permuted uniformly at random. Under this model, our algorithm identifies flows above a frequency threshold of roughly 1/?nm with high probability, where m is the number of counters and n is the number of packets observed. This guarantee is not far off from the ideal of identifying all flows (probability 1/n), and we prove that it is best possible up to a logarithmic factor. We show that the algorithm ranks the identified flows according to frequency within any desired constant factor of accuracy.

/pdf/frequency-estimation-of-internet-packet-streams-with-limited-2kxt70rj4v.pdf

Frequency Estimation of Internet Packet Streams with Limited Space

Motivated by boolean queries in text database systems, we consider the problems of finding the intersection, union, or difference of a collection of sorted sets. While the worst-case complexity of these problems is straightforward , we consider a notion of complexity that depends on the particular instance. We develop the idea of a proof that a given set is indeed the correct answer. Proofs, and in particular shortest proofs, are characterized. We present adaptive algorithms that make no a priori assumptions about the problem instance, and show that their running times are within a constant factor of optimal with respect to a natural measure of the difficulty of an instance. In the process, we develop a framework for designing and evaluating adaptive algorithms in the comparison model. 1 Introduction and Overview Our work can be seen in the general context of performing searches quickly in a database or data warehousing environment. The broad issue is that of characterizing what type of join operations can be performed without scanning the relations involved or actually materializing intermediate relations. The specific problem addressed here can be seen in that context or in the context of performing a web search, or a search in another large text database, for documents containing some or all of a set of keywords. For each keyword we are given the set of references to documents in which it occurs [2, 6, 9]. These sets are stored in some natural order, such as document date. In practice, the sets are large. For example, the average word from user query logs matches approximately a million documents on the AltaVista web search engine. Of course, one would hope that the answer to the query is small, particularly if the query is an intersection. It may also be expected that the elements of such an intersection are not spread uniformly through the initial

Adaptive set intersections, unions, and differences

Internet traffic patterns are believed to obey the power law, implying that most of the bandwidth is consumed by a small set of heavy users. Hence, queries that return a list of frequently occurring items are important in the analysis of real-time Internet packet streams. While several results exist for computing frequent item queries using limited memory in the infinite stream model, in this paper we consider the limited-memory sliding window model. This model maintains the last $N$ items that have arrived at any given time and forbids the storage of the entire window in memory. We present a deterministic algorithm for identifying frequent items in sliding windows defined over real-time packet streams. The algorithm uses limited memory, requires constant processing time per packet (amortized), makes only one pass over the data, and is shown to work well when tested on TCP traffic logs.

/pdf/identifying-frequent-items-in-sliding-windows-over-on-line-5536sl8xag.pdf

Identifying frequent items in sliding windows over on-line packet streams

Internet topology information is only made available in aggregate form by standard routing protocols. Connectivity information and latency characteristics must therefore be inferred using indirect techniques. In this paper we consider measurements using a distributed set of measurement points or beacons. We show that computing the minimum number of required beacons on a network under a BGP-like routing policy is NP-hard and at best Ω(log n)-approximable. In the worst case at least (n-1)/3 and at most (n+1)/3 beacons are required for a network with n nodes. We then introduce some observations that allow us to propose a relatively small candidate set of beacons for the current Internet topology. The set proposed has properties with relevant applications for all-paths routing on the public Internet and performance based routing.

/pdf/on-the-number-of-distributed-measurement-points-for-network-149zvjns9q.pdf

On the number of distributed measurement points for network tomography

The intersection of large ordered sets is a common problem in the context of the evaluation of boolean queries to a search engine. In this paper we engineer a better algorithm for this task, which improves over those proposed by Demaine, Munro and Lopez-Ortiz [SODA 2000/ALENEX 2001], by using a variant of interpolation search. More specifically, our contributions are threefold. First, we corroborate and complete the practical study from Demaine et al. on comparison based intersection algorithms. Second, we show that in practice replacing binary search and galloping (one-sided binary) search [4] by interpolation search improves the performance of each main intersection algorithms. Third, we introduce and test variants of interpolation search: this results in an even better intersection algorithm.

/pdf/faster-adaptive-set-intersections-for-text-searching-5a406z13wh.pdf

Alejandro López-Ortiz

Papers

Frequency Estimation of Internet Packet Streams with Limited Space

Adaptive set intersections, unions, and differences

Identifying frequent items in sliding windows over on-line packet streams

On the number of distributed measurement points for network tomography

Faster adaptive set intersections for text searching