Showing papers in "ACM Transactions on Knowledge Discovery From Data in 2012"


Journal Article
TL;DR: This article proposes a method called Isolation Forest (iForest), which detects anomalies purely based on the concept of isolation, without employing any distance or density measure, an approach fundamentally different from all existing methods.
Abstract: Anomalies are data points that are few and different. As a result of these properties, we show that anomalies are susceptible to a mechanism called isolation. This article proposes a method called Isolation Forest (iForest), which detects anomalies purely based on the concept of isolation, without employing any distance or density measure, an approach fundamentally different from all existing methods. As a result, iForest is able to exploit subsampling (i) to achieve a low linear time-complexity and a small memory-requirement and (ii) to deal with the effects of swamping and masking effectively. Our empirical evaluation shows that iForest outperforms ORCA, one-class SVM, LOF and Random Forests in terms of AUC and processing time, and that it is robust against masking and swamping effects. iForest also works well in high-dimensional problems containing a large number of irrelevant attributes, and when anomalies are not available in the training sample.

1,266 citations
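
The isolation idea described in the iForest abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the depth cap of 12, the default subsample size of 64, and the use of the raw average depth as a score are all assumptions made for illustration. The point it shows is that a point that is "few and different" tends to be separated from the rest after fewer random splits.

```python
import random


def isolation_depth(point, data, rng, depth=0, max_depth=12):
    """Count random axis-aligned splits until at most one row shares the
    region of `point`, or until the depth cap is reached."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))          # pick a random attribute
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:                             # cannot split on this attribute
        return depth
    split = rng.uniform(lo, hi)              # pick a random split value
    # Keep only the rows that fall on the same side of the split as `point`.
    same_side = [row for row in data
                 if (row[dim] < split) == (point[dim] < split)]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)


def mean_isolation_depth(point, data, n_trees=100, sample_size=64, seed=0):
    """Average isolation depth over random subsamples of `data`.
    A lower value suggests the point is easier to isolate."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, min(sample_size, len(data)))
        total += isolation_depth(point, sample, rng)
    return total / n_trees


def make_demo_data(seed=1, n=200):
    """Hypothetical data: a dense cluster in the unit square plus one
    far-away point at (10, 10)."""
    rng = random.Random(seed)
    cluster = [(rng.random(), rng.random()) for _ in range(n)]
    return cluster + [(10.0, 10.0)]
```

Calling `mean_isolation_depth((10.0, 10.0), make_demo_data())` and comparing it with the value for `(0.5, 0.5)` should show a noticeably smaller depth for the far-away point. Each tree only looks at a small subsample, which is what keeps time and memory low in this sketch.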


Journal Article
TL;DR: This work develops an efficient approximation algorithm that scales to large datasets and finds provably near-optimal networks for tracing paths of diffusion and influence through networks and inferring the networks over which contagions propagate.
Abstract: Information diffusion and virus propagation are fundamental processes taking place in networks. While it is often possible to directly observe when nodes become infected with a virus or publish the information, observing individual transmissions (who infects whom, or who influences whom) is typically very difficult. Furthermore, in many applications, the underlying network over which the diffusions and propagations spread is actually unobserved. We tackle these challenges by developing a method for tracing paths of diffusion and influence through networks and inferring the networks over which contagions propagate. Given the times when nodes adopt pieces of information or become infected, we identify the optimal network that best explains the observed infection times. Since the optimization problem is NP-hard to solve exactly, we develop an efficient approximation algorithm that scales to large datasets and finds provably near-optimal networks. We demonstrate the effectiveness of our approach by tracing information diffusion in a set of 170 million blogs and news articles over a one-year period to infer how information flows through the online media space. We find that the diffusion network of news for the top 1,000 media sites and blogs tends to have a core-periphery structure with a small set of core media sites that diffuse information to the rest of the Web. These sites tend to have stable circles of influence with more general news media sites acting as connectors between them.

337 citations


Journal Article
TL;DR: It is shown that it is possible to avoid leakage with a simple specific approach to data management followed by what is called a learn-predict separation, and several ways of detecting leakage when the modeler has no control over how the data have been collected are presented.
Abstract: Deemed “one of the top ten data mining mistakes”, leakage is the introduction of information about the data mining target that should not be legitimately available to mine from. In addition to our own industry experience with real-life projects, controversies around several major public data mining competitions held recently such as the INFORMS 2010 Data Mining Challenge and the IJCNN 2011 Social Network Challenge are evidence that this issue is as relevant today as it has ever been. While acknowledging the importance and prevalence of leakage in both synthetic competitions and real-life data mining projects, existing literature has largely left this idea unexplored. What little has been said turns out not to be broad enough to cover more complex cases of leakage, such as those where the classical independently and identically distributed (i.i.d.) assumption is violated, that have been recently documented. In our new approach, these cases and others are explained by explicitly defining modeling goals and analyzing the broader framework of the data mining problem. The resulting definition enables us to derive general methodology for dealing with the issue. We show that it is possible to avoid leakage with a simple specific approach to data management followed by what we call a learn-predict separation, and present several ways of detecting leakage when the modeler has no control over how the data have been collected. We also offer an alternative point of view on leakage that is based on causal graph modeling concepts.

259 citations


Journal Article
TL;DR: This work proposes and analyzes a block minimization framework for data larger than the memory size, and investigates two implementations of the proposed framework for primal and dual SVMs, respectively.
Abstract: Recent advances in linear classification have shown that for applications such as document classification, the training process can be extremely efficient. However, most of the existing training methods are designed by assuming that data can be stored in the computer memory. These methods cannot be easily applied to data larger than the memory capacity due to the random access to the disk. We propose and analyze a block minimization framework for data larger than the memory size. At each step a block of data is loaded from the disk and handled by certain learning methods. We investigate two implementations of the proposed framework for primal and dual SVMs, respectively. Because data cannot fit in memory, many design considerations are very different from those for traditional algorithms. We discuss and compare with existing approaches that are able to handle data larger than memory. Experiments using data sets 20 times larger than the memory demonstrate the effectiveness of the proposed method.

145 citations
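
The block-by-block training loop described in the abstract above can be sketched as follows. This is a sketch under stated assumptions, not the authors' code: the per-block solver shown here is a standard dual coordinate descent update for a linear SVM with hinge loss and no bias term (the abstract does not name the solver), iterating over an in-memory list of blocks stands in for loading each block from disk, and all function names and the synthetic data are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_blockwise(blocks, n_features, C=1.0, outer_iters=5, inner_iters=2):
    """Block-wise dual coordinate descent for a linear SVM (no bias term).

    `blocks` is a list of blocks, each a list of (x, y) pairs with y in
    {-1, +1}. Only the weight vector `w` and the dual variables `alpha`
    persist between blocks; one block is processed at a time.
    """
    w = [0.0] * n_features
    alpha = [[0.0] * len(block) for block in blocks]
    for _ in range(outer_iters):
        for b, block in enumerate(blocks):       # stand-in for a disk load
            for _ in range(inner_iters):         # reuse the block while loaded
                for i, (x, y) in enumerate(block):
                    q = dot(x, x)
                    if q == 0.0:
                        continue
                    grad = y * dot(w, x) - 1.0   # gradient of the dual in alpha_i
                    # Newton step on one coordinate, clipped to the box [0, C].
                    new_alpha = min(max(alpha[b][i] - grad / q, 0.0), C)
                    delta = new_alpha - alpha[b][i]
                    if delta != 0.0:
                        for j in range(n_features):
                            w[j] += delta * y * x[j]
                        alpha[b][i] = new_alpha
    return w


def make_blocks(n_blocks=4, block_size=25, seed=0):
    """Hypothetical, linearly separable 2-D data split into blocks."""
    rng = random.Random(seed)
    blocks = []
    for _ in range(n_blocks):
        block = []
        for _ in range(block_size):
            y = rng.choice([-1, 1])
            x = [y * (2.0 + rng.random()), rng.random()]
            block.append((x, y))
        blocks.append(block)
    return blocks


def accuracy(w, blocks):
    pairs = [pair for block in blocks for pair in block]
    return sum(1 for x, y in pairs if y * dot(w, x) > 0) / len(pairs)
```

Running several inner passes per loaded block trades extra computation for fewer disk reads, which is the kind of design consideration the abstract says differs from in-memory training.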


Journal Article
TL;DR: This work considers the problem of learning incoherent sparse and low-rank patterns from multiple tasks and proposes employing the general projected gradient scheme to efficiently solve such a convex surrogate.
Abstract: We consider the problem of learning incoherent sparse and low-rank patterns from multiple tasks. Our approach is based on a linear multitask learning formulation, in which the sparse and low-rank patterns are induced by a cardinality regularization term and a low-rank constraint, respectively. This formulation is nonconvex; we convert it into its convex surrogate, which can be routinely solved via semidefinite programming for small-size problems. We propose employing the general projected gradient scheme to efficiently solve such a convex surrogate; however, in the optimization formulation, the objective function is nondifferentiable and the feasible domain is nontrivial. We present the procedures for computing the projected gradient and ensuring the global convergence of the projected gradient scheme. The computation of the projected gradient involves a constrained optimization problem; we show that the optimal solution to such a problem can be obtained via solving an unconstrained optimization subproblem and a Euclidean projection subproblem. We also present two projected gradient algorithms and analyze their rates of convergence in detail. In addition, we illustrate the use of the presented projected gradient algorithms for the proposed multitask learning formulation using the least squares loss. Experimental results on a collection of real-world data sets demonstrate the effectiveness of the proposed multitask learning formulation and the efficiency of the proposed projected gradient algorithms.

142 citations


Journal Article
TL;DR: Two transfer learning frameworks based on the multisource domain adaptation methodology are proposed for detecting different stages of fatigue using SEMG signals; both address the distribution differences across subjects.
Abstract: We consider the characterization of muscle fatigue through a noninvasive sensing mechanism such as Surface ElectroMyoGraphy (SEMG). While changes in the properties of SEMG signals with respect to muscle fatigue have been reported in the literature, the large variation in these signals across different individuals makes the task of modeling and classification of SEMG signals challenging. Indeed, the variation in SEMG parameters from subject to subject creates differences in the data distribution. In this article, we propose two transfer learning frameworks, based on the multisource domain adaptation methodology, for detecting different stages of fatigue using SEMG signals; both address the distribution differences. In the proposed frameworks, the SEMG data of a subject represent a domain; data from multiple subjects in the training set form the multiple source domains and the test subject data form the target domain. SEMG signals are predominantly different in conditional probability distribution across subjects. The key feature of the first framework is a novel weighting scheme that addresses the conditional probability distribution differences across multiple domains (subjects). The key feature of the second framework is a two-stage domain adaptation methodology which combines weighted data from multiple sources, based on marginal probability differences (first stage) as well as conditional probability differences (second stage), with the target domain data. The weights for minimizing the marginal probability differences are estimated independently, while the weights for minimizing conditional probability differences are computed simultaneously by exploiting the potential interaction among multiple sources. We also provide a theoretical analysis on the generalization performance of the proposed multisource domain adaptation formulation using the weighted Rademacher complexity measure.
We have validated the proposed frameworks on Surface ElectroMyoGram signals collected from 8 people during a fatigue-causing repetitive gripping activity. Comprehensive experiments on the SEMG dataset demonstrate that the proposed method improves the classification accuracy by 20% to 30% over the cases without any domain adaptation method and by 13% to 30% over existing state-of-the-art domain adaptation methods.

141 citations


Journal Article
TL;DR: This article proposes a framework to learn a context-based distance for categorical attributes, and shows that the method is competitive with respect to the state of the art of categorical data clustering approaches and has a low impact on the overall computational time of a clustering task.
Abstract: Clustering data described by categorical attributes is a challenging task in data mining applications. Unlike numerical attributes, it is difficult to define a distance between pairs of values of a categorical attribute, since the values are not ordered. In this article, we propose a framework to learn a context-based distance for categorical attributes. The key intuition of this work is that the distance between two values of a categorical attribute Ai can be determined by the way in which the values of the other attributes Aj are distributed in the dataset objects: if they are similarly distributed in the groups of objects corresponding to the distinct values of Ai, a low distance is obtained. We also propose a solution to the critical problem of choosing the attributes Aj. We validate our approach by embedding our distance learning framework in a hierarchical clustering algorithm. We applied it to various real-world and synthetic datasets, both low- and high-dimensional. Experimental results show that our method is competitive with respect to the state of the art of categorical data clustering approaches. We also show that our approach is scalable and has a low impact on the overall computational time of a clustering task.

129 citations
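
The key intuition in the abstract above, that two values of a categorical attribute Ai are close when another attribute Aj is distributed similarly given each of them, can be sketched in a few lines. The choice of Euclidean distance between the conditional distributions, the use of a single context attribute, and all names and data below are assumptions for illustration; the abstract does not specify the exact metric or how the context attributes are selected.

```python
from collections import Counter, defaultdict
from math import sqrt


def conditional_distributions(rows, target, context):
    """Estimate P(context value | target value) from a list of dict rows."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[target]][row[context]] += 1
    return {
        value: {ctx: n / sum(ctr.values()) for ctx, n in ctr.items()}
        for value, ctr in counts.items()
    }


def context_distance(rows, target, context, a, b):
    """Euclidean distance between P(context | target=a) and
    P(context | target=b); an assumed stand-in for the learned distance."""
    dists = conditional_distributions(rows, target, context)
    pa, pb = dists[a], dists[b]
    support = set(pa) | set(pb)
    return sqrt(sum((pa.get(c, 0.0) - pb.get(c, 0.0)) ** 2 for c in support))


def demo_rows():
    """Hypothetical toy dataset with two categorical attributes."""
    return [
        {"color": "red", "shape": "round"},
        {"color": "red", "shape": "round"},
        {"color": "crimson", "shape": "round"},
        {"color": "crimson", "shape": "round"},
        {"color": "blue", "shape": "square"},
        {"color": "blue", "shape": "square"},
    ]
```

In this toy data, "red" and "crimson" always co-occur with the same shape, so their distance is zero, while "red" and "blue" never share a shape and end up far apart.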


Journal Article
TL;DR: A novel two-step model is proposed that seamlessly integrates collective traffic statistics into an existing traffic classification system; it improves performance on all traffic classes, reduces the overall error rate, and scales to classify traffic on 10 Gbps links.
Abstract: The ability to accurately and scalably classify network traffic is of critical importance to a wide range of management tasks of large networks, such as tier-1 ISP networks and global enterprise networks. Guided by the practical constraints and requirements of traffic classification in large networks, in this article, we explore the design of an accurate and scalable machine learning based flow-level traffic classification system, which is trained on a dataset of flow-level data that has been annotated with application protocol labels by a packet-level classifier. Our system employs a lightweight modular architecture, which combines a series of simple linear binary classifiers, each of which can be efficiently implemented and trained on vast amounts of flow data in parallel, and embraces three key innovative mechanisms, weighted threshold sampling, logistic calibration, and intelligent data partitioning, to achieve scalability while attaining high accuracy. Evaluations using real traffic data from multiple locations in a large ISP show that our system accurately reproduces the labels of the packet-level classifier when run on (unlabeled) flow records, while meeting the scalability and stability requirements of large ISP networks. Using training and test datasets that are two months apart and collected from two different locations, the flow error rates are only 3% for TCP flows and 0.4% for UDP flows. We further show that such error rates can be reduced by combining the information of spatial distributions of flows, or collective traffic statistics, during classification. We propose a novel two-step model, which seamlessly integrates these collective traffic statistics into the existing traffic classification system. Experimental results display performance improvement on all traffic classes and an overall error rate reduction of 15%. In addition to high accuracy, at runtime, our implementation easily scales to classify traffic on 10 Gbps links.

90 citations


Journal Article
TL;DR: This work proposes an I/O-efficient algorithm for triangle listing that is exact, scalable and outperforms the state-of-the-art in-memory and local triangle estimation algorithms.
Abstract: Triangle listing is one of the fundamental algorithmic problems whose solution has numerous applications especially in the analysis of complex networks, such as the computation of clustering coefficients, transitivity, triangular connectivity, trusses, etc. Existing algorithms for triangle listing are mainly in-memory algorithms, whose performance cannot scale with the massive volume of today's fast growing networks. When the input graph cannot fit in main memory, triangle listing requires random disk accesses that can incur prohibitively huge I/O cost. Some streaming, semistreaming, and sampling algorithms have been proposed but these are approximation algorithms. We propose an I/O-efficient algorithm for triangle listing. Our algorithm is exact and avoids random disk access. Our results show that our algorithm is scalable and outperforms the state-of-the-art in-memory and local triangle estimation algorithms.

80 citations
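
For context on the triangle listing problem in the abstract above, here is a minimal sketch of the kind of in-memory algorithm the abstract contrasts itself with. It is not the I/O-efficient algorithm the authors propose, whose details the abstract does not give. The function name and the degree-based edge orientation are choices made for this illustration, and node ids are assumed to be mutually comparable (for example, integers).

```python
def list_triangles(edges):
    """List each triangle of an undirected simple graph exactly once.

    In-memory baseline: orient every edge from the lower-ranked endpoint
    to the higher-ranked one (rank = degree, ties broken by node id),
    then intersect the higher-ranked neighbor sets of both endpoints.
    """
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue                      # ignore self-loops
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    rank = {node: (len(adj), node) for node, adj in neighbors.items()}
    higher = {
        node: {nbr for nbr in adj if rank[nbr] > rank[node]}
        for node, adj in neighbors.items()
    }

    triangles = []
    for u in neighbors:
        for v in higher[u]:
            # Any common higher-ranked neighbor closes a triangle u-v-w.
            for w in higher[u] & higher[v]:
                triangles.append(tuple(sorted((u, v, w))))
    return triangles
```

The set intersections touch adjacency lists in an arbitrary order, which is harmless in memory but would translate into random disk accesses once the graph no longer fits, the situation the abstract targets.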


Journal Article
TL;DR: The Gap-BIDE algorithm adopts an efficient approach to finding the complete set of closed sequential patterns with gap constraints, while the Gap-Connect algorithm efficiently mines an approximate set of long patterns by connecting short patterns.
Abstract: Mining frequent subsequence patterns is a typical data-mining problem and various efficient sequential pattern mining algorithms have been proposed. In many application domains (e.g., biology), the frequent subsequences confined by the predefined gap requirements are more meaningful than the general sequential patterns. In this article, we propose two algorithms, Gap-BIDE for mining closed gap-constrained subsequences from a set of input sequences, and Gap-Connect for mining repetitive gap-constrained subsequences from a single input sequence. Inspired by some state-of-the-art closed or constrained sequential pattern mining algorithms, the Gap-BIDE algorithm adopts an efficient approach to finding the complete set of closed sequential patterns with gap constraints, while the Gap-Connect algorithm efficiently mines an approximate set of long patterns by connecting short patterns. We also present several methods for feature selection from the set of gap-constrained patterns for the purpose of classification and clustering. Our extensive performance study shows that our approaches are very efficient in mining frequent subsequences with gap constraints, and the gap-constrained pattern based classification/clustering approaches can achieve high-quality results.

74 citations
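
To make the notion of a gap-constrained subsequence in the abstract above concrete, the sketch below checks whether a pattern occurs in a sequence when the number of skipped items between consecutive matched items must lie in a given range, and counts support over a set of sequences. It does not reproduce Gap-BIDE or Gap-Connect. The gap definition (items strictly between two matched positions), the function names, and the example sequences are assumptions.

```python
def occurs_with_gaps(sequence, pattern, min_gap, max_gap):
    """Return True if `pattern` occurs in `sequence` such that the number
    of items skipped between consecutive matches is in [min_gap, max_gap]."""
    if not pattern:
        return True
    # Positions where a partial match of the pattern can currently end.
    ends = {i for i, item in enumerate(sequence) if item == pattern[0]}
    for symbol in pattern[1:]:
        next_ends = set()
        for i in ends:
            start = i + 1 + min_gap
            stop = min(i + 1 + max_gap, len(sequence) - 1)
            for j in range(start, stop + 1):
                if sequence[j] == symbol:
                    next_ends.add(j)
        ends = next_ends
        if not ends:
            return False
    return bool(ends)


def support(sequences, pattern, min_gap, max_gap):
    """Number of input sequences that contain the gap-constrained pattern."""
    return sum(occurs_with_gaps(s, pattern, min_gap, max_gap)
               for s in sequences)
```

For example, with a required gap of exactly 2, the pattern "AB" occurs in "AXXB", whereas with a gap range of 0 to 1 it does not.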


Journal Article
TL;DR: In this article, a probabilistic maximum entropy model is used to find the itemset that provides us the most novel information, that is, for which the frequency in the data can surprise us most, and in turn, update the model accordingly.
Abstract: Knowledge discovery from data is an inherently iterative process. That is, what we know about the data greatly determines our expectations, and therefore, what results we would find interesting and/or surprising. Given new knowledge about the data, our expectations will change. Hence, in order to avoid redundant results, knowledge discovery algorithms ideally should follow such an iterative updating procedure. With this in mind, we introduce a well-founded approach for succinctly summarizing data with the most informative itemsets; using a probabilistic maximum entropy model, we iteratively find the itemset that provides us the most novel information (that is, for which the frequency in the data surprises us the most) and in turn we update our model accordingly. As we use the maximum entropy principle to obtain unbiased probabilistic models, and only include those itemsets that are most informative with regard to the current model, the summaries we construct are guaranteed to be both descriptive and nonredundant. The algorithm that we present, called mtv, can either discover the top-k most informative itemsets, or we can employ either the Bayesian Information Criterion (bic) or the Minimum Description Length (mdl) principle to automatically identify the set of itemsets that together summarize the data well. In other words, our method will “tell you what you need to know” about the data. Importantly, it is a one-phase algorithm: rather than picking itemsets from a user-provided candidate set, itemsets and their supports are mined on-the-fly. To further its applicability, we provide an efficient method to compute the maximum entropy distribution using Quick Inclusion-Exclusion. Experiments on our method, using synthetic, benchmark, and real data, show that the discovered summaries are succinct, and correctly identify the key patterns in the data.
The models they form attain high likelihoods, and inspection shows that they summarize the data well with increasingly specific, yet nonredundant itemsets.

Journal Article
TL;DR: A discriminative sentence selection method is proposed to extract the most discriminative sentences, which represent the specific characteristics of each document group.
Abstract: Given a collection of document groups, a natural question is to identify the differences among these groups. Although traditional document summarization techniques can summarize the content of the document groups one by one, there exists a great necessity to generate a summary of the differences among the document groups. In this article, we study a novel problem of summarizing the differences between document groups. A discriminative sentence selection method is proposed to extract the most discriminative sentences that represent the specific characteristics of each document group. Experiments and case studies on real-world data sets demonstrate the effectiveness of our proposed method.

Journal Article
TL;DR: This article focuses on the news domain: given two news articles, a system automatically finds a coherent chain linking them together, and provides a fast search-driven algorithm to connect two fixed endpoints.
Abstract: Finding information is becoming a major part of our daily life. Entire sectors, from Web users to scientists and intelligence analysts, are increasingly struggling to keep up with the larger and larger amounts of content published every day. With this much data, it is often easy to miss the big picture. In this article, we investigate methods for automatically connecting the dots: providing a structured, easy way to navigate within a new topic and discover hidden connections. We focus on the news domain: given two news articles, our system automatically finds a coherent chain linking them together. For example, it can recover the chain of events starting with the decline of home prices (January 2007), and ending with the health care debate (2009). We formalize the characteristics of a good chain and provide a fast search-driven algorithm to connect two fixed endpoints. We incorporate user feedback into our framework, allowing the stories to be refined and personalized. We also provide a method to handle partially-specified endpoints, for users who do not know both ends of a story. Finally, we evaluate our algorithm over real news data. Our user studies demonstrate that the objective we propose captures the users’ intuitive notion of coherence, and that our algorithm effectively helps users understand the news.

Journal Article
TL;DR: The principle of latent maximum entropy proposed is different from both Jaynes’ maximum entropy principle and maximum likelihood estimation, but can yield better estimates in the presence of hidden variables and limited training data.
Abstract: We present an extension to Jaynes’ maximum entropy principle that incorporates latent variables. The principle of latent maximum entropy we propose is different from both Jaynes’ maximum entropy principle and maximum likelihood estimation, but can yield better estimates in the presence of hidden variables and limited training data. We first show that solving for a latent maximum entropy model poses a hard nonlinear constrained optimization problem in general. However, we then show that feasible solutions to this problem can be obtained efficiently for the special case of log-linear models---which forms the basis for an efficient approximation to the latent maximum entropy principle. We derive an algorithm that combines expectation-maximization with iterative scaling to produce feasible log-linear solutions. This algorithm can be interpreted as an alternating minimization algorithm in the information divergence, and reveals an intimate connection between the latent maximum entropy and maximum likelihood principles. To select a final model, we generate a series of feasible candidates, calculate the entropy of each, and choose the model that attains the highest entropy. Our experimental results show that estimation based on the latent maximum entropy principle generally gives better results than maximum likelihood when estimating latent variable models on small observed data samples.

Journal Article
TL;DR: A new approach is proposed, Discriminative Topic Model (DTM), which separates non-neighboring pairs from each other in addition to bringing neighboring pairs closer together, thereby preserving the global manifold structure as well as improving local consistency.
Abstract: Topic modeling has become a popular method used for data analysis in various domains including text documents. Previous topic model approaches, such as probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA), have shown impressive success in discovering low-rank hidden structures for modeling text documents. These approaches, however, do not take into account the manifold structure of the data, which is generally informative for nonlinear dimensionality reduction mapping. More recent topic model approaches, Laplacian PLSI (LapPLSI) and Locally-consistent Topic Model (LTM), have incorporated the local manifold structure into topic models and have shown resulting benefits. But they fall short of achieving the full discriminating power of manifold learning, as they only enhance the proximity between the low-rank representations of neighboring pairs without any consideration for non-neighboring pairs. In this article, we propose a new approach, Discriminative Topic Model (DTM), which separates non-neighboring pairs from each other in addition to bringing neighboring pairs closer together, thereby preserving the global manifold structure as well as improving local consistency. We also present a novel model-fitting algorithm based on the generalized EM algorithm and the concept of Pareto improvement. We empirically demonstrate the success of DTM in terms of unsupervised clustering and semisupervised classification accuracies on text corpora and robustness to parameters compared to state-of-the-art techniques.

Journal Article
TL;DR: This work proposes two network-based models to predict the behavior of teams in sports leagues that are parameter-free, that is, they do not have a single parameter, and moreover are sport-agnostic: they can be applied directly to any team sports league.
Abstract: The multi-million sports-betting market is based on the fact that the task of predicting the outcome of a sports event is very hard. Even with the aid of an uncountable number of descriptive statistics and background information, only a few can correctly guess the outcome of a game or a league. In this work, our approach is to move away from the traditional way of predicting sports events, and instead to model sports leagues as networks of players and teams where the only information available is the work relationships among them. We propose two network-based models to predict the behavior of teams in sports leagues. These models are parameter-free, that is, they do not have a single parameter, and moreover are sport-agnostic: they can be applied directly to any team sports league. First, we view a sports league as a network in evolution, and we infer the implicit feedback behind network changes and properties over the years. Then, we use this knowledge to construct the network-based prediction models, which can, with a significantly high probability, indicate how well a team will perform over a season. We compare our proposed models with other prediction models in two of the most popular sports leagues: the National Basketball Association (NBA) and Major League Baseball (MLB). Our model shows consistently good results in comparison with the other models and, relying upon the network properties of the teams, we achieved a 14% rank prediction accuracy improvement over our best competitor.

Journal Article
TL;DR: An online topic model for sequentially analyzing the time evolution of topics in document collections is proposed based on a stochastic EM algorithm, in which the model is sequentially updated using newly obtained data; this means that past data are not required to make the inference.
Abstract: We propose an online topic model for sequentially analyzing the time evolution of topics in document collections. Topics naturally evolve with multiple timescales. For example, some words may be used consistently over one hundred years, while other words emerge and disappear over periods of a few days. Thus, in the proposed model, current topic-specific distributions over words are assumed to be generated based on the multiscale word distributions of the previous epoch. Considering both the long- and short-timescale dependency yields a more robust model. We derive efficient online inference procedures based on a stochastic EM algorithm, in which the model is sequentially updated using newly obtained data; this means that past data are not required to make the inference. We demonstrate the effectiveness of the proposed method in terms of predictive performance and computational efficiency by examining collections of real documents with timestamps.

Journal Article
TL;DR: Two generative models are developed that substantially advance the literature on evolutionary clustering, in the sense that not only do they both perform better than those in the existing literature, but more importantly, they are capable of automatically learning the cluster numbers and explicitly addressing the corresponding issues.
Abstract: This article studies evolutionary clustering, a recently emerged hot topic with many important applications, notably in dynamic social network analysis. In this article, based on the recent literature on nonparametric Bayesian models, we have developed two generative models: DPChain and HDP-HTM. DPChain is derived from the Dirichlet process mixture (DPM) model, with an exponentially decaying component over time. HDP-HTM combines the hierarchical Dirichlet process (HDP) with a hierarchical transition matrix (HTM) based on the proposed Infinite hierarchical Markov state model (iHMS). Both models substantially advance the literature on evolutionary clustering, in the sense that not only do they both perform better than those in the existing literature, but more importantly, they are capable of automatically learning the cluster numbers and explicitly addressing the corresponding issues. Extensive evaluations have demonstrated the effectiveness and the promise of these two solutions compared to the state-of-the-art literature.

Journal Article
TL;DR: This article presents a two-phase algorithm based on the observation that the structures of Markov random fields are similar to those of Bayesian networks; it uses Markov blanket information to reduce the sizes of conditioning sets and the number of CI tests without sacrificing accuracy.
Abstract: Dependency analysis is a typical approach for Bayesian network learning, which infers the structures of Bayesian networks from the results of a series of conditional independence (CI) tests. In practice, testing independence conditioning on large sets hampers the performance of dependency analysis algorithms in terms of accuracy and running time for the following reasons. First, testing independence on large sets of variables with limited samples is not stable. Second, for most dependency analysis algorithms, the number of CI tests grows at an exponential rate with the sizes of conditioning sets, and the running time grows at the same rate. Therefore, determining how to reduce the number of CI tests and the sizes of conditioning sets becomes a critical step in dependency analysis algorithms. In this article, we present a two-phase algorithm based on the observation that the structures of Markov random fields are similar to those of Bayesian networks. The first phase of the algorithm constructs a Markov random field from data, which provides a close approximation to the structure of the true Bayesian network; the second phase of the algorithm removes redundant edges according to CI tests to get the true Bayesian network. Both phases use Markov blanket information to reduce the sizes of conditioning sets and the number of CI tests without sacrificing accuracy. An empirical study shows that the two-phase algorithm performs well in terms of accuracy and efficiency.

Journal ArticleDOI
TL;DR: A new algorithm called ciForager is introduced that addresses this scalability challenge, making it feasible for the first time to discover regions of correlated change in extremely large graphs, such as the entire BGP routing topology of the Internet.
Abstract: Data mining techniques for understanding how graphs evolve over time have become increasingly important. Evolving graphs arise naturally in diverse applications such as computer network topologies, multiplayer games and medical imaging. A natural and interesting problem in evolving graph analysis is the discovery of compact subgraphs that change in a similar manner. Such subgraphs are known as regions of correlated change and they can both summarise change patterns in graphs and help identify the underlying events causing these changes. However, previous techniques for discovering regions of correlated change suffer from limited scalability, making them unsuitable for analysing the evolution of very large graphs. In this paper, we introduce a new algorithm called ciForager, that addresses this scalability challenge and offers considerable improvements. The efficiency of ciForager is based on the use of new incremental techniques for detecting change, as well as the use of Voronoi representations for efficiently determining distance. We experimentally show that ciForager can achieve speedups of up to 1000 times over previous approaches. As a result, it becomes feasible for the first time to discover regions of correlated change in extremely large graphs, such as the entire BGP routing topology of the Internet.

Journal ArticleDOI
TL;DR: A sequential sampling framework that iteratively enlarges the sample size until the k-means results become indistinguishable from the asymptotic (infinite-data) output is proposed, and it is demonstrated that the proposed framework can be generalized to handle spectral clustering.
Abstract: The scalability of learning algorithms has always been a central concern for data mining researchers, and nowadays, with the rapid increase in data storage capacities and availability, its importance has increased. To this end, sampling has been studied by several researchers in an effort to derive sufficiently accurate models using only small data fractions. In this article we focus on spectral k-means, that is, the k-means approximation as derived by the spectral relaxation, and propose a sequential sampling framework that iteratively enlarges the sample size until the k-means results (objective function and cluster structure) become indistinguishable from the asymptotic (infinite-data) output. In the proposed framework we adopt a commonly applied principle in data mining research that considers the use of minimal assumptions concerning the data generating distribution. This restriction imposes several challenges, mainly related to the efficiency of the sequential sampling procedure. These challenges are addressed using elements of matrix perturbation theory and statistics. Moreover, although the main focus is on spectral k-means, we also demonstrate that the proposed framework can be generalized to handle spectral clustering. The proposed sequential sampling framework is subsequently employed for addressing the distributed clustering problem, where the task is to construct a global model for data that resides in distributed network nodes. The main challenge in this context is related to the bandwidth constraints that are commonly imposed, thus requiring that the distributed clustering algorithm consumes a minimal amount of network load. This illustrates the applicability of the proposed approach, as it enables the determination of a minimal sample size that can be used for constructing an accurate clustering model that entails the distributional characteristics of the data.
As opposed to the relevant distributed k-means approaches, our framework takes into account the fact that the choice of the number of clusters has a crucial effect on the required amount of communication. More precisely, the proposed algorithm is able to derive a statistical estimation of the required relative sizes for all possible values of k. This unique feature of our distributed clustering framework enables a network administrator to choose an economic solution that identifies the crude cluster structure of a dataset and not devote excessive network resources for identifying all the “correct” detailed clusters.
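The general idea of sequential sampling can be sketched as follows. This is a simplified illustration under loud assumptions, not the authors' method: it uses plain Lloyd's k-means on 1-D data instead of spectral k-means, and it stops when the per-point objective changes by less than a relative tolerance between consecutive sample sizes, a stand-in for the paper's test based on matrix perturbation theory. The names `kmeans_objective`, `sequential_sample_size`, `start`, and `tol` are hypothetical.

```python
import random

def kmeans_objective(points, k, rng, iters=20):
    """Run Lloyd's k-means on 1-D points and return the mean squared
    distance from each point to its nearest centre."""
    centres = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p - centres[c]) ** 2)
            clusters[nearest].append(p)
        # Keep the old centre when a cluster receives no points.
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sum(min((p - c) ** 2 for c in centres) for p in points) / len(points)

def sequential_sample_size(data, k, start=50, tol=0.05, seed=0):
    """Double the sample size until the k-means objective stabilises
    (assumed stopping rule) or the whole dataset has been used."""
    rng = random.Random(seed)
    n = min(start, len(data))
    previous = None
    while True:
        objective = kmeans_objective(rng.sample(data, n), k, rng)
        stable = (previous is not None
                  and abs(objective - previous) <= tol * max(previous, 1e-12))
        if stable or n == len(data):
            return n, objective
        previous = objective
        n = min(2 * n, len(data))

# Usage on synthetic data: two well-separated 1-D clusters.
gen = random.Random(1)
data = [gen.gauss(0, 1) for _ in range(1000)] + [gen.gauss(10, 1) for _ in range(1000)]
n, objective = sequential_sample_size(data, k=2)
print(n, objective)
```

In a distributed setting, the returned sample size would bound how many points each node needs to transmit, which is where the bandwidth saving described in the abstract would come from.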

Journal ArticleDOI
TL;DR: This work proposes a cross-guided clustering algorithm that builds on traditional k-means by aligning the target clusters with source partitions and proposes a projection approach using pivot vocabularies for the cross-domain similarity measure.
Abstract: Lack of supervision in clustering algorithms often leads to clusters that are not useful or interesting to human reviewers. We investigate if supervision can be automatically transferred for clustering a target task, by providing a relevant supervised partitioning of a dataset from a different source task. The target clustering is made more meaningful for the human user by trading-off intrinsic clustering goodness on the target task for alignment with relevant supervised partitions in the source task, wherever possible. We propose a cross-guided clustering algorithm that builds on traditional k-means by aligning the target clusters with source partitions. The alignment process makes use of a cross-task similarity measure that discovers hidden relationships across tasks. When the source and target tasks correspond to different domains with potentially different vocabularies, we propose a projection approach using pivot vocabularies for the cross-domain similarity measure. Using multiple real-world and synthetic datasets, we show that our approach improves clustering accuracy significantly over traditional k-means and state-of-the-art semi-supervised clustering baselines, over a wide range of data characteristics and parameter settings.

Journal ArticleDOI
TL;DR: This article presents substantial changes to CHIRP, a set-covering random projection classifier, that improve robustness and reduce training and testing time by almost an order of magnitude without jeopardizing its error performance.
Abstract: In Wilkinson et al. [2011] we introduced a new set-covering random projection classifier that achieved average error lower than that of other classifiers in the Weka platform. This classifier was based on an L∞ norm distance function and exploited an iterative sequence of three stages (projecting, binning, and covering) to deal with the curse of dimensionality, computational complexity, and nonlinear separability. We now present substantial changes that improve robustness and reduce training and testing time by almost an order of magnitude without jeopardizing CHIRP's outstanding error performance.

Journal ArticleDOI
TL;DR: This work proposes a model for information growth which relies on two main observations: as an article’s quality improves, it attracts visitors at a faster rate (a rich-get-richer phenomenon) and the chances that a new visitor will improve the article drop (there is only so much that can be said about a particular topic).
Abstract: Collaborative media such as wikis have become enormously successful venues for information creation. Articles accrue information through the asynchronous editing of users who arrive both seeking information and possibly able to contribute information. Most articles stabilize to high-quality, trusted sources of information representing the collective wisdom of all the users who edited the article. We propose a model for information growth which relies on two main observations: (i) as an article’s quality improves, it attracts visitors at a faster rate (a rich-get-richer phenomenon); and, simultaneously, (ii) the chances that a new visitor will improve the article drop (there is only so much that can be said about a particular topic). Our model is able to reproduce many features of the edit dynamics observed on Wikipedia; in particular, it captures the observed rise in the edit rate, followed by 1/t decay. Despite differences in the media, we also document similar features in the comment rates for a segment of the LiveJournal blogosphere.

Journal ArticleDOI
TL;DR: This special issue of the ACM Transactions on Knowledge Discovery from Data includes extended versions of six papers selected from the works presented at the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) held in Washington, DC from July 25 to 28 in 2010.
Abstract: This special issue of the ACM Transactions on Knowledge Discovery from Data (TKDD) includes extended versions of six papers selected from the works presented at the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) held in Washington, DC from July 25 to 28 in 2010. Only a few papers among the most highly ranked ones were invited to appear in this special issue. They provide a view of the depth and breadth of many of the most interesting areas in data mining research. The invited articles have gone through the standard refereeing process of TKDD to ensure high publication standards. Each one contains important new material in comparison with its conference predecessor. This special issue is a sample of recent advances and research trends in predictive and descriptive analytics, including fundamental topics such as scalable classification, recently growing topics such as topic modeling, network analysis and multitask learning, and a novel application in mining news articles. Topic models and latent factor models have emerged as a vital tool for modeling text and dyadic data. Two articles presented in this issue enhance topic models. Iwata et al. in "Sequential Modeling of Topic Dynamics with Multiple Timescales" propose a model for analyzing the evolution over time of themes in document collections. Topic-specific distributions over words are assumed to be generated based on word distributions from multiple timescales. The model is trained by a randomized expectation-maximization (EM) algorithm, in which parameters are sequentially updated using newly obtained data. Huh and Fienberg in "Discriminative Topic Modeling based on Manifold Learning" present a predictive topic modeling method based on learning spaces of reduced dimension, which incorporates both similarities and dissimilarities among the modeled documents through a regularization term.
The resulting method outperforms previous state-of-the-art methodologies in a document classification task. The increasing publication of digital information on the Web has led to many studies of information diffusion and propagation in networks. Gomez-Rodriguez, Leskovec, and Krause in "Inferring Networks of Diffusion and Influence" investigate the way that information flows across the online media space. They develop a method for tracing paths of propagation through networks and for inferring the underlying networks over which contagion occurs. An efficient approximation algorithm for the task scales to large datasets and in practice gives near-optimal performance. In the past decade, there has been a growing interest in the problem …