
Showing papers in "ACM Transactions on Knowledge Discovery From Data in 2007"


Journal ArticleDOI
TL;DR: This paper shows with two simple attacks that a k-anonymized dataset has some subtle, but severe privacy problems, and proposes a novel and powerful privacy definition called ℓ-diversity, which is practical and can be implemented efficiently.
Abstract: Publishing data about individuals without revealing sensitive information about them is an important problem. In recent years, a new definition of privacy called k-anonymity has gained popularity. In a k-anonymized dataset, each record is indistinguishable from at least k − 1 other records with respect to certain identifying attributes. In this article, we show using two simple attacks that a k-anonymized dataset has some subtle but severe privacy problems. First, an attacker can discover the values of sensitive attributes when there is little diversity in those sensitive attributes. This is a known problem. Second, attackers often have background knowledge, and we show that k-anonymity does not guarantee privacy against attackers using background knowledge. We give a detailed analysis of these two attacks, and we propose a novel and powerful privacy criterion called ℓ-diversity that can defend against such attacks. In addition to building a formal foundation for ℓ-diversity, we show in an experimental evaluation that ℓ-diversity is practical and can be implemented efficiently.
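To make the definitions concrete, here is a minimal Python sketch (not from the article) that groups a table by its quasi-identifiers and checks k-anonymity together with the simplest "distinct values" reading of ℓ-diversity; the article also develops stronger instantiations such as entropy ℓ-diversity.

```python
from collections import defaultdict

def check_k_anonymity_and_distinct_l_diversity(records, quasi_ids, sensitive, k, l):
    """Group records by their quasi-identifier values, then check that every
    equivalence class has at least k records (k-anonymity) and at least
    l distinct sensitive values (distinct l-diversity)."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[a] for a in quasi_ids)
        groups[key].append(rec[sensitive])
    k_anonymous = all(len(vals) >= k for vals in groups.values())
    l_diverse = all(len(set(vals)) >= l for vals in groups.values())
    return k_anonymous, l_diverse

# Toy table: zip and age are (generalized) quasi-identifiers, disease is sensitive.
data = [
    {"zip": "130**", "age": "<30", "disease": "flu"},
    {"zip": "130**", "age": "<30", "disease": "cancer"},
    {"zip": "130**", "age": "<30", "disease": "hepatitis"},
    {"zip": "148**", "age": ">=40", "disease": "flu"},
    {"zip": "148**", "age": ">=40", "disease": "flu"},
    {"zip": "148**", "age": ">=40", "disease": "flu"},
]
print(check_k_anonymity_and_distinct_l_diversity(data, ["zip", "age"], "disease", k=3, l=3))
# (True, False): both classes are 3-anonymous, but the second one has no diversity,
# so an attacker learns the disease of everyone in it.
```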

3,780 citations


Journal ArticleDOI
TL;DR: In this paper, a new graph generator based on a forest fire spreading process was proposed, which has a simple, intuitive justification, requires very few parameters (like the flammability of nodes), and produces graphs exhibiting the full range of properties observed both in prior work and in the present study.
Abstract: How do real graphs evolve over time? What are normal growth patterns in social, technological, and information networks? Many studies have discovered patterns in static graphs, identifying properties in a single snapshot of a large network or in a very small number of snapshots; these include heavy tails for in- and out-degree distributions, communities, small-world phenomena, and others. However, given the lack of information about network evolution over long periods, it has been hard to convert these findings into statements about trends over time. Here we study a wide range of real graphs, and we observe some surprising phenomena. First, most of these graphs densify over time, with the number of edges growing superlinearly in the number of nodes. Second, the average distance between nodes often shrinks over time, in contrast to the conventional wisdom that such distance parameters should increase slowly as a function of the number of nodes (like O(log n) or O(log log n)). Existing graph generation models do not exhibit these types of behavior even at a qualitative level. We provide a new graph generator, based on a forest fire spreading process, that has a simple, intuitive justification, requires very few parameters (like the flammability of nodes), and produces graphs exhibiting the full range of properties observed both in prior work and in the present study. We also notice that the forest fire model exhibits a sharp transition between sparse graphs and graphs that are densifying. Graphs with decreasing distance between the nodes are generated around this transition point. Last, we analyze the connection between the temporal evolution of the degree distribution and densification of a graph. We find that the two are fundamentally related. We also observe that real networks exhibit this type of relation between densification and the degree distribution.
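As a rough illustration of how a forest fire generator can be set up, the sketch below (Python; not the article's reference implementation, and the geometric burning sizes and default parameter values are assumptions) grows a directed graph by letting each new node link to a random ambassador and then recursively burn through its neighborhood.

```python
import random

def geometric(p):
    """Number of successes before the first failure (mean p / (1 - p))."""
    count = 0
    while random.random() < p:
        count += 1
    return count

def forest_fire_graph(n, p_forward=0.37, p_backward=0.32, seed=0):
    """Simplified forest-fire sketch: each new node picks a random 'ambassador',
    links to it, then recursively 'burns' along the ambassador's out- and
    in-edges, linking to every burned node."""
    random.seed(seed)
    out_edges = {0: set()}
    in_edges = {0: set()}
    for v in range(1, n):
        out_edges[v], in_edges[v] = set(), set()
        ambassador = random.randrange(v)
        visited, frontier = {v, ambassador}, [ambassador]
        while frontier:
            w = frontier.pop()
            out_edges[v].add(w)                     # burn w: v links to it
            in_edges[w].add(v)
            fwd = list(out_edges[w] - visited)
            bwd = list(in_edges[w] - visited)
            burned = random.sample(fwd, min(geometric(p_forward), len(fwd)))
            burned += random.sample(bwd, min(geometric(p_backward), len(bwd)))
            for u in burned:
                visited.add(u)
                frontier.append(u)
    return out_edges

g = forest_fire_graph(200)
print(sum(len(vs) for vs in g.values()), "edges over", len(g), "nodes")
```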

2,414 citations


Journal ArticleDOI
TL;DR: In this article, a relational clustering algorithm that uses both attribute and relational information for determining the underlying domain entities is proposed, which improves entity resolution performance over both attribute-based baselines and over algorithms that consider relational information but do not resolve entities collectively.
Abstract: Many databases contain uncertain and imprecise references to real-world entities. The absence of identifiers for the underlying entities often results in a database which contains multiple references to the same entity. This can lead not only to data redundancy, but also inaccuracies in query processing and knowledge extraction. These problems can be alleviated through the use of entity resolution. Entity resolution involves discovering the underlying entities and mapping each database reference to these entities. Traditionally, entities are resolved using pairwise similarity over the attributes of references. However, there is often additional relational information in the data. Specifically, references to different entities may cooccur. In these cases, collective entity resolution, in which entities for cooccurring references are determined jointly rather than independently, can improve entity resolution accuracy. We propose a novel relational clustering algorithm that uses both attribute and relational information for determining the underlying domain entities, and we give an efficient implementation. We investigate the impact that different relational similarity measures have on entity resolution quality. We evaluate our collective entity resolution algorithm on multiple real-world databases. We show that it improves entity resolution performance over both attribute-based baselines and over algorithms that consider relational information but do not resolve entities collectively. In addition, we perform detailed experiments on synthetically generated data to identify data characteristics that favor collective relational resolution over purely attribute-based algorithms.
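The following Python sketch illustrates the general idea of collective resolution with a greedy agglomerative loop that combines attribute similarity with relational (shared-neighbor) similarity; the similarity measures, weights, and thresholds are placeholders, not the article's exact algorithm.

```python
from difflib import SequenceMatcher

def attribute_sim(a, b):
    """String similarity between reference names (a stand-in for richer attribute measures)."""
    return SequenceMatcher(None, a, b).ratio()

def relational_sim(neighbors_i, neighbors_j):
    """Jaccard overlap between the sets of neighboring clusters."""
    if not neighbors_i and not neighbors_j:
        return 0.0
    return len(neighbors_i & neighbors_j) / len(neighbors_i | neighbors_j)

def collective_resolution(references, cooccurs, alpha=0.4, threshold=0.55):
    """Greedy agglomerative sketch of collective entity resolution.
    references: list of attribute strings; cooccurs: (i, j) index pairs of
    references that co-occur (e.g., coauthors listed on the same paper)."""
    clusters = {i: {i} for i in range(len(references))}   # cluster id -> member refs
    assign = {i: i for i in range(len(references))}       # reference -> cluster id

    def cluster_neighbors(c):
        members = clusters[c]
        return ({assign[j] for (i, j) in cooccurs if i in members} |
                {assign[i] for (i, j) in cooccurs if j in members})

    merged = True
    while merged:
        merged = False
        ids = list(clusters)
        best_score, best_pair = threshold, None
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                ci, cj = ids[x], ids[y]
                attr = max(attribute_sim(references[i], references[j])
                           for i in clusters[ci] for j in clusters[cj])
                rel = relational_sim(cluster_neighbors(ci) - {ci, cj},
                                     cluster_neighbors(cj) - {ci, cj})
                score = (1 - alpha) * attr + alpha * rel
                if score > best_score:
                    best_score, best_pair = score, (ci, cj)
        if best_pair:
            ci, cj = best_pair
            clusters[ci] |= clusters[cj]
            for ref in clusters[cj]:
                assign[ref] = ci
            del clusters[cj]
            merged = True
    return list(clusters.values())

refs = ["A. Smith", "Alice Smith", "A. Smith", "Bob Jones"]
cooccurs = [(0, 3), (1, 3)]                       # references 0 and 1 co-occur with 3
print(collective_resolution(refs, cooccurs))      # the three Smith references end up together
```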

635 citations



Journal ArticleDOI
TL;DR: For some datasets the structure discovered by the data mining algorithms is expected, given the row and column margins of the datasets, while for other datasets the discovered structure conveys information that is not captured by the margin counts.
Abstract: The problem of assessing the significance of data mining results on high-dimensional 0--1 datasets has been studied extensively in the literature. For problems such as mining frequent sets and finding correlations, significance testing can be done by standard statistical tests such as the chi-square test, or by other methods. However, the results of such tests depend only on the specific attributes and not on the dataset as a whole. Moreover, the tests are difficult to apply to sets of patterns or other complex results of data mining algorithms. In this article, we consider a simple randomization technique that deals with this shortcoming. The approach consists of producing random datasets that have the same row and column margins as the given dataset, computing the results of interest on the randomized instances and comparing them to the results on the actual data. This randomization technique can be used to assess the results of many different types of data mining algorithms, such as frequent sets, clustering, and spectral analysis. To generate random datasets with given margins, we use variations of a Markov chain approach which is based on a simple swap operation. We give theoretical results on the efficiency of different randomization methods, and apply the swap randomization method to several well-known datasets. Our results indicate that for some datasets the structure discovered by the data mining algorithms is expected, given the row and column margins of the datasets, while for other datasets the discovered structure conveys information that is not captured by the margin counts.
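A minimal Python sketch of the swap operation and the resulting empirical significance test is shown below; the number of swaps per sample and the one-sided p-value convention are assumptions rather than the article's exact protocol.

```python
import random
import numpy as np

def swap_randomize(M, num_swaps):
    """Return a 0-1 matrix with the same row and column sums as M. Repeatedly pick
    two rows and two columns whose 2x2 submatrix is [[1,0],[0,1]] or [[0,1],[1,0]]
    and flip it; every such swap preserves all margins."""
    M = M.copy()
    n_rows, n_cols = M.shape
    done = 0
    while done < num_swaps:                        # failed attempts are simply retried
        r1, r2 = random.sample(range(n_rows), 2)
        c1, c2 = random.sample(range(n_cols), 2)
        if M[r1, c1] == M[r2, c2] == 1 and M[r1, c2] == M[r2, c1] == 0:
            M[r1, c1] = M[r2, c2] = 0
            M[r1, c2] = M[r2, c1] = 1
            done += 1
    return M

def empirical_p_value(M, statistic, num_samples=100, swaps_per_sample=None):
    """One-sided test: how often is the statistic on a swap-randomized dataset
    at least as large as on the real data?"""
    if swaps_per_sample is None:
        swaps_per_sample = 4 * int(M.sum())        # rule-of-thumb burn-in, not the article's
    observed = statistic(M)
    randomized = [statistic(swap_randomize(M, swaps_per_sample))
                  for _ in range(num_samples)]
    return (1 + sum(s >= observed for s in randomized)) / (1 + num_samples)

M = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]])
stat = lambda X: int((X[:, 0] * X[:, 1]).sum())    # co-occurrences of the first two columns
print(empirical_p_value(M, stat, num_samples=50))
```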

284 citations


Journal ArticleDOI
TL;DR: The complexity of the mining problem is shown, the reasons traditional mining algorithms are computationally infeasible are discussed, and practical algorithms for solving the problem are proposed.
Abstract: We study a problem of mining frequently occurring periodic patterns with a gap requirement from sequences. Given a character sequence S of length L and a pattern P of length l, we consider P a frequently occurring pattern in S if the probability of observing P given a randomly picked length-l subsequence of S exceeds a certain threshold. In many applications, particularly those related to bioinformatics, interesting patterns are periodic with a gap requirement. That is to say, the characters in P should match subsequences of S in such a way that the matching characters in S are separated by gaps of more or less the same size. We show the complexity of the mining problem and discuss why traditional mining algorithms are computationally infeasible. We propose practical algorithms for solving the problem and study their characteristics. We also present a case study in which we apply our algorithms on some DNA sequences. We discuss some interesting patterns obtained from the case study.
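As a simplified illustration, the Python sketch below counts the matchings of a short pattern in a sequence subject to a gap requirement; note that the article's support measure is probability-based, so this raw count is only a stand-in.

```python
from functools import lru_cache

def count_gapped_occurrences(S, P, min_gap, max_gap):
    """Count matchings of pattern P in sequence S in which consecutive matched
    characters are separated by a gap of min_gap..max_gap positions."""
    @lru_cache(maxsize=None)
    def matches_from(pos, k):
        # Number of ways to match P[k:] with P[k] placed at position pos of S.
        if S[pos] != P[k]:
            return 0
        if k == len(P) - 1:
            return 1
        total = 0
        for gap in range(min_gap, max_gap + 1):
            nxt = pos + gap + 1
            if nxt < len(S):
                total += matches_from(nxt, k + 1)
        return total

    return sum(matches_from(i, 0) for i in range(len(S)))

# Toy DNA-like example: 'a' followed by 'g' with a gap of 1 or 2 characters.
print(count_gapped_occurrences("acgtaag", "ag", min_gap=1, max_gap=2))   # -> 2
```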

110 citations


Journal ArticleDOI
TL;DR: This work proposes a new way of measuring and extracting proximity in networks called “cycle-free effective conductance” (CFEC), which can handle more than two endpoints and directed edges, is statistically well behaved, and produces an effectiveness score for the computed subgraphs.
Abstract: Measuring distance or some other form of proximity between objects is a standard data mining tool. Connection subgraphs were recently proposed as a way to demonstrate proximity between nodes in networks. We propose a new way of measuring and extracting proximity in networks called “cycle-free effective conductance” (CFEC). Importantly, the measured proximity is accompanied by a proximity subgraph which allows assessing and understanding measured values. Our proximity calculation can handle more than two endpoints and directed edges, is statistically well behaved, and produces an effectiveness score for the computed subgraphs. We provide an efficient algorithm to measure and extract proximity. Also, we report experimental results and show examples for four large network datasets: a telecommunications calling graph, the IMDB actors graph, an academic coauthorship network, and a movie recommendation system.
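The sketch below is not the article's CFEC definition; it only illustrates the path-based flavor of such proximity measures by summing random-walk probabilities over short cycle-free paths between two endpoints (Python, undirected graph for simplicity).

```python
from collections import defaultdict

def simple_path_proximity(edges, s, t, max_len=6):
    """Illustrative only: sum, over simple (cycle-free) paths from s to t of
    length <= max_len, the product of random-walk transition probabilities."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def walk(node, visited, prob, depth):
        if node == t:
            return prob
        if depth == max_len:
            return 0.0
        total = 0.0
        step = 1.0 / len(adj[node])
        for nxt in adj[node]:
            if nxt not in visited:
                total += walk(nxt, visited | {nxt}, prob * step, depth + 1)
        return total

    return walk(s, {s}, 1.0, 0)

# Nodes joined by several short paths score higher than a pair joined by one longer path.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
print(simple_path_proximity(edges, "a", "d"), simple_path_proximity(edges, "a", "e"))
```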

55 citations


Journal ArticleDOI
TL;DR: The results indicate that the Markov-modulated Poisson framework provides a robust and accurate framework for adaptively and autonomously learning how to separate unusual bursty events from traces of normal human activity.
Abstract: Time-series of count data occur in many different contexts, including Internet navigation logs, freeway traffic monitoring, and security logs associated with buildings. In this article we describe a framework for detecting anomalous events in such data using an unsupervised learning approach. Normal periodic behavior is modeled via a time-varying Poisson process model, which in turn is modulated by a hidden Markov process that accounts for bursty events. We outline a Bayesian framework for learning the parameters of this model from count time-series. Two large real-world datasets of time-series counts are used as testbeds to validate the approach, consisting of freeway traffic data and logs of people entering and exiting a building. We show that the proposed model is significantly more accurate at detecting known events than a more traditional threshold-based technique. We also describe how the model can be used to investigate different degrees of periodicity in the data, including systematic day-of-week and time-of-day effects, and to make inferences about different aspects of events such as number of vehicles or people involved. The results indicate that the Markov-modulated Poisson framework provides a robust and accurate framework for adaptively and autonomously learning how to separate unusual bursty events from traces of normal human activity.
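A generative sketch of this kind of model is given below (Python with NumPy): a baseline Poisson rate modulated by day-of-week and time-of-day effects, plus a hidden two-state Markov chain that injects extra "event" counts. Parameter names and values are illustrative, and the article's contribution, learning these parameters by Bayesian inference, is not shown.

```python
import numpy as np

def simulate_mmpp(days=7, bins_per_day=48, lam0=10.0, gamma=20.0,
                  p_start=0.02, p_stay=0.8, seed=0):
    """Generative sketch of a Markov-modulated Poisson model: normal counts follow
    a time-varying Poisson rate lam0 * day_effect * time_effect, and a hidden
    two-state Markov chain occasionally switches into an 'event' state that adds
    Poisson(gamma) extra counts."""
    rng = np.random.default_rng(seed)
    day_effect = rng.gamma(10.0, 0.1, size=7)                # day-of-week multipliers
    time_effect = rng.gamma(10.0, 0.1, size=bins_per_day)    # time-of-day multipliers

    counts, states = [], []
    z = 0                                                    # 0 = normal, 1 = event
    for d in range(days):
        for b in range(bins_per_day):
            # Hidden Markov switching between normal and event states.
            z = rng.random() < (p_stay if z == 1 else p_start)
            rate = lam0 * day_effect[d % 7] * time_effect[b]
            n = rng.poisson(rate) + (rng.poisson(gamma) if z else 0)
            counts.append(n)
            states.append(int(z))
    return np.array(counts), np.array(states)

counts, states = simulate_mmpp()
print(counts[:10], states[:10])
```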

55 citations


Journal ArticleDOI
TL;DR: An efficient algorithm, Twain, is devised that outperforms prior works in the quality of frequent patterns, execution time, I/O cost, CPU overhead, and scalability, and discovers more interesting frequent patterns.
Abstract: We investigate the general model of mining associations in a temporal database, where the exhibition periods of items are allowed to differ from one another. The database is divided into partitions according to the time granularity imposed. Such temporal association rules allow us to observe short-term but interesting patterns that are absent when the whole range of the database is evaluated altogether. Prior work may omit some temporal association rules and thus has limited practicability. To remedy this, and to give more precise frequent exhibition periods of frequent temporal itemsets, we devise an efficient algorithm Twain (standing for TWo end AssocIation miNer). Twain not only generates frequent patterns with more precise frequent exhibition periods, but also discovers more interesting frequent patterns. Twain employs the Start time and End time of each item to provide precise frequent exhibition periods while progressively handling itemsets from one partition to another. With one scan of the database, Twain can generate frequent 2-itemsets directly according to the cumulative filtering threshold. Then, Twain adopts the scan-reduction technique to generate all frequent k-itemsets (k > 2) from the generated frequent 2-itemsets. Theoretical properties of Twain are derived in this article as well. The experimental results show that Twain outperforms prior works in the quality of frequent patterns, execution time, I/O cost, CPU overhead, and scalability.
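The sketch below is only a strongly simplified illustration of partition-based counting with exhibition periods, tracking where each 2-itemset first appears and measuring support over the remainder of the database; it omits Twain's cumulative filtering threshold, scan reduction, and end-time handling.

```python
from collections import defaultdict
from itertools import combinations

def mine_temporal_2_itemsets(partitions, min_support=0.4):
    """Scan time-ordered partitions once, start counting a 2-itemset at the first
    partition where it appears, and report itemsets whose support over their
    exhibition period (start partition .. end of database) meets min_support."""
    counts = {}        # itemset -> [count, start_partition, transactions_before_start]
    total_seen = 0
    for p_idx, transactions in enumerate(partitions):
        for txn in transactions:
            for pair in combinations(sorted(set(txn)), 2):
                if pair not in counts:
                    counts[pair] = [0, p_idx, total_seen]
                counts[pair][0] += 1
        total_seen += len(transactions)

    results = {}
    for pair, (cnt, start, seen_before) in counts.items():
        period_size = total_seen - seen_before
        if period_size and cnt / period_size >= min_support:
            results[pair] = (start, cnt / period_size)
    return results

# Toy example: three time partitions of transactions.
partitions = [
    [["milk", "bread"], ["milk", "bread", "eggs"]],
    [["milk", "bread"], ["eggs"]],
    [["milk", "bread", "eggs"], ["milk", "bread"]],
]
print(mine_temporal_2_itemsets(partitions))
```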

34 citations



Journal ArticleDOI
TL;DR: This article proposes several ways to curtail the size of the errors, and uses a large collection of real datasets to demonstrate that the solutions are effective in reducing the average mean squared prediction error.
Abstract: Prediction errors from a linear model tend to be larger when extrapolation is involved, particularly when the model is wrong. This article considers the problem of extrapolation and interpolation errors when a linear model tree is used for prediction. It proposes several ways to curtail the size of the errors, and uses a large collection of real datasets to demonstrate that the solutions are effective in reducing the average mean squared prediction error. The article also provides a proof that, if a linear model is correct, the proposed solutions have no undesirable effects as the training sample size tends to infinity.
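One simple curtailment in this spirit, though not necessarily one of the article's exact proposals, is to clamp a leaf model's predictions to the range of responses seen in its training data, as in the Python sketch below (scikit-learn is used only for the leaf regression).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

class ClampedLeafModel:
    """One leaf of a linear model tree with a simple safeguard: fit ordinary
    least squares on the leaf's training data, but clamp predictions to the
    range of responses seen in training so extrapolation cannot explode."""
    def fit(self, X, y):
        self.model = LinearRegression().fit(X, y)
        self.y_min, self.y_max = float(np.min(y)), float(np.max(y))
        return self

    def predict(self, X):
        return np.clip(self.model.predict(X), self.y_min, self.y_max)

# The clamp only matters outside the training region.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(50, 1))
y_train = 3 * X_train[:, 0] + rng.normal(0, 0.1, size=50)
leaf = ClampedLeafModel().fit(X_train, y_train)
print(leaf.predict(np.array([[0.5], [10.0]])))   # the second, extrapolated point is clamped
```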

Journal ArticleDOI
TL;DR: A robust framework for determining a natural clustering of a given dataset, based on the minimum description length (MDL) principle, and can be combined with any clustering technique ranging from K-means and K-medoids to advanced methods such as spectral clustering.
Abstract: How do we find a natural clustering of a real-world point set which contains an unknown number of clusters with different shapes, and which may be contaminated by noise? As most clustering algorithms were designed with certain assumptions (e.g., Gaussianity), they often require the user to give input parameters, and are sensitive to noise. In this article, we propose a robust framework for determining a natural clustering of a given dataset, based on the minimum description length (MDL) principle. The proposed framework, robust information-theoretic clustering (RIC), is orthogonal to any known clustering algorithm: given a preliminary clustering, RIC purifies these clusters from noise, and adjusts the clustering such that it simultaneously determines the most natural amount and shape (subspace) of the clusters. Our RIC method can be combined with any clustering technique ranging from K-means and K-medoids to advanced methods such as spectral clustering. In fact, RIC is even able to purify and improve an initial coarse clustering, even if we start with very simple methods. In an extension, we propose a fully automatic stand-alone clustering method and efficiency improvements. RIC scales well with the dataset size. Extensive experiments on synthetic and real-world datasets validate the proposed RIC framework.
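The toy Python sketch below conveys the MDL idea behind such a framework: score a candidate clustering by the cost of encoding each cluster's model plus its points, and prefer the clustering with the smaller description length. It is only the flavor of the approach; RIC's actual coding scheme also selects per-cluster distributions and subspaces.

```python
import numpy as np

def description_length(clusters, bits_per_parameter=32):
    """Toy MDL score: for each cluster, pay a fixed cost for the model parameters
    (diagonal Gaussian mean and variance per dimension) plus the negative
    log-likelihood, in bits, of its points under that Gaussian. Lower is better."""
    total = 0.0
    for points in clusters:
        X = np.asarray(points, dtype=float)
        n, d = X.shape
        mu = X.mean(axis=0)
        var = X.var(axis=0) + 1e-9
        total += 2 * d * bits_per_parameter                     # model cost
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var)
        total += -log_lik / np.log(2)                           # data cost in bits
    return total

# Compare a 2-cluster split against lumping everything into one cluster:
# for well-separated data the split has the smaller description length.
rng = np.random.default_rng(1)
a = rng.normal(0, 1, size=(100, 2))
b = rng.normal(8, 1, size=(100, 2))
print(description_length([a, b]), description_length([np.vstack([a, b])]))
```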

Journal ArticleDOI
TL;DR: The goal is to discover the hidden meanings of a frequent pattern by annotating it with in-depth, concise, and structured information by constructing its context model, selecting informative context indicators, and extracting representative transactions and semantically similar patterns.
Abstract: Using frequent patterns to analyze data has been one of the fundamental approaches in many data mining applications. Research in frequent pattern mining has so far mostly focused on developing efficient algorithms to discover various kinds of frequent patterns, but little attention has been paid to the important next step—interpreting the discovered frequent patterns. Although the compression and summarization of frequent patterns has been studied in some recent work, the proposed techniques there can only annotate a frequent pattern with nonsemantical information (e.g., support), which provides only limited help for a user to understand the patterns. In this article, we study the novel problem of generating semantic annotations for frequent patterns. The goal is to discover the hidden meanings of a frequent pattern by annotating it with in-depth, concise, and structured information. We propose a general approach to generate such an annotation for a frequent pattern by constructing its context model, selecting informative context indicators, and extracting representative transactions and semantically similar patterns. This general approach can well incorporate the user's prior knowledge, and has potentially many applications, such as generating a dictionary-like description for a pattern, finding synonym patterns, discovering semantic relations, and summarizing semantic classes of a set of frequent patterns. Experiments on different datasets show that our approach is effective in generating semantic pattern annotations.
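A minimal Python sketch of the context-model idea is shown below: represent each pattern by the items co-occurring with it, then use cosine similarity between context vectors to find semantically similar patterns. The raw co-occurrence weighting and the toy data are assumptions; the article uses weighted context indicators and richer context units.

```python
from collections import Counter
from math import sqrt

def context_vector(pattern, transactions):
    """Context-model sketch: describe a pattern by the items that co-occur with it
    in the transactions containing it (raw counts here, no indicator weighting)."""
    vec = Counter()
    for txn in transactions:
        if set(pattern) <= set(txn):
            for item in set(txn) - set(pattern):
                vec[item] += 1
    return vec

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Patterns whose context vectors are close are candidate 'synonym' patterns.
transactions = [
    ["svm", "kernel", "classification"],
    ["svm", "margin", "classification"],
    ["support_vector_machine", "kernel", "classification"],
    ["decision_tree", "pruning", "classification"],
]
ctx_a = context_vector(["svm"], transactions)
ctx_b = context_vector(["support_vector_machine"], transactions)
ctx_c = context_vector(["decision_tree"], transactions)
print(cosine(ctx_a, ctx_b), cosine(ctx_a, ctx_c))   # svm is closer to support_vector_machine
```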

Journal ArticleDOI
TL;DR: This special issue of TKDD collects some of the best papers first presented at the ACM SIGKDD 2006 conference; the invited articles went through the standard TKDD review process and contain substantial new content beyond the original conference versions.
Abstract: For this special issue of TKDD, we selected some of the best papers that were first presented at the ACM SIGKDD 2006 conference. These invited papers went through the standard TKDD review process, and the accepted versions have substantial new content beyond the original SIGKDD 2006 papers. We thank all the authors who responded to our invitation as well as all the reviewers whose aid in evaluating the papers was invaluable. The articles cover several emerging areas of data mining research as well as provide interesting new insight into fundamental problems in the field. The article Parameter-Free Noise-Robust Clustering by Böhm et al. sheds new light on the classical problem of data clustering and develops robust information-theoretic clustering techniques that are less reliant on input parameters. The article Semantic Annotation of Frequent Patterns by Mei et al. investigates the novel problem of associating semantic annotations with frequent patterns discovered by mining algorithms. The proposed techniques have the potential to impact many applications such as discovering semantic relations, summarizing semantic classes of a set of frequent patterns, and others. The article Measuring and Extracting Proximity Graphs in Networks by Koren et al. proposes an interesting new measure of proximity in networks that can be extracted and visualized in the form of a proximity graph. The proximity measure is statistically well-behaved and can be generalized to find proximities between an arbitrary number of objects. The article Learning to Detect Events with Markov-Modulated Poisson Processes by Ihler et al. describes techniques from which abnormal events can be detected and removed from time-series data representing normal rhythmic activities. The problem is challenging in an unsupervised context when events are not labeled in the data, and the proposed techniques are expected to have significant applications in the analysis of user logs on the internet, RFID traces, security video archives, and so on. The article Assessing Data Mining Results via Swap Randomization by Gionis et al. proposes a new approach to the important problem of determining the significance of data mining results, a problem that has been extensively investigated in recent years. This article proposes interesting swap-randomization-based methods for binary data that are expected to work for a broad category of data mining algorithms. We hope that you will enjoy reading this issue as much as we did.