
Showing papers in "SIGKDD Explorations in 2009"


Journal ArticleDOI
TL;DR: This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
Abstract: More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, has evolved substantially, and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on SourceForge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
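For a concrete taste of the workbench, here is a minimal sketch against the WEKA 3.x Java API: load an ARFF dataset, train a J48 decision tree, and cross-validate it. The file name iris.arff is an assumption; any ARFF file whose class is the last attribute works.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaDemo {
    public static void main(String[] args) throws Exception {
        // Load a dataset (file name is an assumption; any ARFF works).
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1); // class = last attribute

        // Train a C4.5-style decision tree.
        J48 tree = new J48();
        tree.buildClassifier(data);

        // Evaluate with 10-fold cross-validation.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```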

19,603 citations


Journal ArticleDOI
TL;DR: This paper describes some of the design aspects of the underlying architecture, briefly sketches how new nodes can be incorporated, and highlights some of the new features of version 2.0.
Abstract: The Konstanz Information Miner is a modular environment, which enables easy visual assembly and interactive execution of a data pipeline. It is designed as a teaching, research and collaboration platform, which enables simple integration of new algorithms and tools as well as data manipulation or visualization methods in the form of new modules or nodes. In this paper we describe some of the design aspects of the underlying architecture, briefly sketch how new nodes can be incorporated, and highlight some of the new features of version 2.0.
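The modular pipeline idea is easy to picture in code. The sketch below is a deliberately simplified, hypothetical illustration, not the real KNIME extension API (which also involves node factories, dialogs, and table specs): each node transforms a data table, and a workflow is an ordered wiring of nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch of the node/pipeline idea (not the actual KNIME API):
// each node consumes a table of rows and produces a new table.
public class PipelineDemo {
    interface Node extends UnaryOperator<List<double[]>> {}

    static List<double[]> execute(List<double[]> input, List<Node> pipeline) {
        List<double[]> table = input;
        for (Node node : pipeline) table = node.apply(table); // run nodes in wiring order
        return table;
    }

    public static void main(String[] args) {
        // Filter node: keep rows whose first column is positive.
        Node filter = rows -> {
            List<double[]> out = new ArrayList<>();
            for (double[] r : rows) if (r[0] > 0) out.add(r);
            return out;
        };
        // Scale node: multiply every value by 10.
        Node scale = rows -> {
            List<double[]> out = new ArrayList<>();
            for (double[] r : rows) {
                double[] s = r.clone();
                for (int i = 0; i < s.length; i++) s[i] *= 10;
                out.add(s);
            }
            return out;
        };
        List<double[]> result = execute(
                List.of(new double[]{1, 2}, new double[]{-1, 3}),
                List.of(filter, scale));
        result.forEach(r -> System.out.println(java.util.Arrays.toString(r)));
    }
}
```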

812 citations


Journal ArticleDOI
TL;DR: This thesis proposes and illustrates a framework for developing algorithms that can adaptively learn from data streams that change over time, and introduces a general methodology to identify closed patterns in a data stream, using Galois Lattice Theory.
Abstract: This thesis is devoted to the design of data mining algorithms for evolving data streams and for the extraction of closed frequent trees. First, we deal with each of these tasks separately, and then we deal with them together, developing classification methods for data streams containing items that are trees. In the data stream model, data arrive at high speed, and the algorithms that must process them have very strict constraints of space and time. In the first part of this thesis we propose and illustrate a framework for developing algorithms that can adaptively learn from data streams that change over time. Our methods are based on using change detectors and estimator modules at the right places. We propose an adaptive sliding window algorithm ADWIN for detecting change and keeping updated statistics from a data stream, and use it as a black box in place of counters or accumulators in algorithms not initially designed for drifting data. Since ADWIN has rigorous performance guarantees, this opens the possibility of extending such guarantees to learning and mining algorithms. We test our methodology with several learning methods such as Naive Bayes, clustering, decision trees and ensemble methods. We build an experimental framework for data stream mining with concept drift, based on the MOA framework, similar to WEKA, so that it will be easy for researchers to run experimental data stream benchmarks. Trees are connected acyclic graphs and they are studied as link-based structures in many cases. In the second part of this thesis, we describe a rather formal study of trees from the point of view of closure-based mining. Moreover, we present efficient algorithms for subtree testing and for mining ordered and unordered frequent closed trees. We include an analysis of the extraction of association rules of full confidence out of the closed sets of trees, and we have found there an interesting phenomenon: rules whose propositional counterpart is nontrivial are, however, always implicitly true in trees due to the peculiar combinatorics of the structures. Finally, using these results on evolving data stream mining and closed frequent tree mining, we present high-performance algorithms for mining closed unlabeled rooted trees adaptively from data streams that change over time. We introduce a general methodology to identify closed patterns in a data stream, using Galois Lattice Theory. Using this methodology, we then develop an incremental miner, a sliding-window based one, and finally one that mines closed trees adaptively from data streams. We use these methods to develop classification methods for tree data streams.
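The ADWIN idea can be sketched compactly. The following is a simplified, quadratic-time illustration rather than the thesis's algorithm: real ADWIN stores the window in an exponential histogram to get logarithmic memory and update time. It assumes inputs in [0, 1] and uses a Hoeffding-style cut threshold; whenever some split of the window into an old part and a recent part shows a significant difference in means, the oldest data is dropped.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified ADWIN-style change detector (illustrative only; assumes x in [0,1]).
public class SimpleAdwin {
    private final Deque<Double> window = new ArrayDeque<>();
    private final double delta; // confidence parameter

    public SimpleAdwin(double delta) { this.delta = delta; }

    // Add one stream element; returns true if a change was detected.
    public boolean add(double x) {
        window.addLast(x);
        boolean changed = false;
        while (window.size() > 1 && cutFound()) {
            window.removeFirst(); // drop stale data until the two halves agree
            changed = true;
        }
        return changed;
    }

    // Test every split of the window into old | recent subwindows.
    private boolean cutFound() {
        int n = window.size();
        double[] a = window.stream().mapToDouble(Double::doubleValue).toArray();
        double total = 0;
        for (double v : a) total += v;
        double sum0 = 0;
        for (int n0 = 1; n0 < n; n0++) {
            sum0 += a[n0 - 1];
            int n1 = n - n0;
            double mean0 = sum0 / n0, mean1 = (total - sum0) / n1;
            double m = 1.0 / (1.0 / n0 + 1.0 / n1);
            // Hoeffding-style threshold (constant taken from memory; treat as a sketch)
            double eps = Math.sqrt((1.0 / (2 * m)) * Math.log(4.0 * n / delta));
            if (Math.abs(mean0 - mean1) > eps) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        SimpleAdwin detector = new SimpleAdwin(0.01);
        for (int i = 0; i < 200; i++) {
            double x = (i < 100) ? 0.2 : 0.8; // abrupt concept drift at i = 100
            if (detector.add(x)) System.out.println("change detected at item " + i);
        }
    }
}
```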

98 citations


Journal ArticleDOI
TL;DR: This article describes a predictive analytics scoring engine platform that leverages the benefits of open standards and cloud computing to deliver an efficient deployment process for statistical models, and discusses emerging trends in cloud computing and Software as a Service.
Abstract: Over the past decade, we have seen tremendous interest in the application of data mining and statistical algorithms, first in research and science and, more recently, across various industries. This has translated into the development of a myriad of solutions by the data mining community that today impact scientific and business applications alike. However, even in this scenario, interoperability and open standards still lack broader adoption among data miners and modelers. In this article we highlight the use of the Predictive Model Markup Language (PMML) standard, which allows for models to be easily exchanged between analytic applications. With a focus on interoperability and PMML, we also discuss here emerging trends in cloud computing and Software as a Service, which have already started to play a critical role in promoting a more effective implementation and widespread application of predictive models. As an illustration of how the benefits of open standards and cloud computing can be combined, we describe a predictive analytics scoring engine platform that leverages these elements to deliver an efficient deployment process for statistical models.
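As a sketch of the scoring-as-a-service pattern the article describes, the snippet below posts a record to a cloud scoring endpoint using Java 11's standard HTTP client. The URL, payload layout, and response shown are hypothetical placeholders, not a real service API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical scoring-as-a-service client; endpoint and JSON shape are invented.
public class ScoringClient {
    public static void main(String[] args) throws Exception {
        String record = "{\"model\": \"churn.pmml\", \"fields\": {\"age\": 42, \"plan\": \"gold\"}}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://scoring.example.com/score")) // placeholder service
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(record))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // e.g. {"score": 0.87} (illustrative)
    }
}
```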

53 citations


Journal ArticleDOI
TL;DR: This is a short summary of the author's thesis on "Correlation Clustering" (Ludwig-Maximilians-Universität München, Germany, 2008).
Abstract: This is a short summary of the author's thesis on "Correlation Clustering" (Ludwig-Maximilians-Universität München, Germany, 2008). The complete thesis is available at http://edoc.ub.uni-muenchen.de/8736/.

40 citations


Journal ArticleDOI
Rick Pechter1
TL;DR: This paper provides a primer on the PMML standard and its applications, along with a description of the new features in PMML 4.0, which was released in May 2009.
Abstract: The Predictive Model Markup Language (PMML) data mining standard has arguably become one of the most widely adopted data mining standards in use today. Two years in the making, the latest release of PMML contains several new features and many enhancements to existing ones. This paper provides a primer on the PMML standard and its applications, along with a description of the new features in PMML 4.0, which was released in May 2009.
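For orientation, here is a skeletal PMML 4.0 document embedded in a small Java program. It is trimmed for illustration (a valid model would also need a MiningSchema and model-specific content, and the field names are invented), but it shows the outline every PMML file shares: a version attribute, a Header, a DataDictionary declaring fields, and a model element.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Skeletal PMML 4.0 document; not schema-complete, for illustration only.
public class PmmlSkeleton {
    static final String PMML =
        "<PMML version=\"4.0\" xmlns=\"http://www.dmg.org/PMML-4_0\">" +
        "  <Header description=\"toy example\"/>" +
        "  <DataDictionary numberOfFields=\"2\">" +
        "    <DataField name=\"age\" optype=\"continuous\" dataType=\"double\"/>" +
        "    <DataField name=\"risk\" optype=\"categorical\" dataType=\"string\"/>" +
        "  </DataDictionary>" +
        "  <!-- a model element (TreeModel, RegressionModel, ...) goes here -->" +
        "</PMML>";

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(PMML.getBytes("UTF-8")));
        // Any PMML consumer can read the declared version and fields before scoring.
        System.out.println("PMML version: " + doc.getDocumentElement().getAttribute("version"));
    }
}
```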

28 citations


Journal ArticleDOI
TL;DR: This article discusses the importance of analytic infrastructure and some of the standards that can be used to support it, as well as specialized applications and services, including applications that can manage very large datasets and build models over them, and cloud-based analytic services.
Abstract: We define analytic infrastructure to be the services, applications, utilities and systems that are used for preparing data for modeling, estimating models, validating models, scoring data, or related activities. For example, analytic infrastructure includes databases and data warehouses, statistical and data mining systems, scoring engines, grids and clouds. Note that, with this definition, analytic infrastructure does not need to be used exclusively for modeling; it simply needs to be useful as part of the modeling process. In this article, we discuss the importance of analytic infrastructure and some of the standards that can be used to support it. We also discuss some specialized analytic infrastructure applications and services, including applications that can manage very large datasets and build models over them, and cloud-based analytic services.

16 citations


Journal ArticleDOI
TL;DR: This article provides an overview of the dissertation problems presented at a Ph.D. workshop at the ACM Conference on Information and Knowledge Management and serves as a motivation for researchers to delve deeper into the innovative dissertation problems summarized here and the related work in these areas.
Abstract: Data mining research, along with related fields such as databases and information retrieval, poses challenging problems, especially for doctoral students. The research spreads over a variety of topics such as text mining, semantic web, multilingual information analysis, heterogeneous data management, database learning, digital libraries and more. Much of this research cuts across multiple fields and presents interesting issues for discussion at conferences with a confluence of several tracks. The ACM Conference on Information and Knowledge Management provides an excellent environment for presenting such research problems spanning the three tracks of database systems, information retrieval and knowledge management. This article provides an overview of the dissertation problems presented at a Ph.D. workshop at the conference. The goal of such workshops is to allow students to showcase their creative ideas at an early stage. This enables experts to critique their work and also gives the students an opportunity to exchange their thoughts with each other, besides providing excellent networking opportunities with industry and academia. The article serves as a motivation for researchers to delve deeper into the innovative dissertation problems summarized here and the related work in these areas.

13 citations


Journal ArticleDOI
TL;DR: An in-depth study aiming at retaining only irreducible minimal generators in each equivalence class and pruning the remaining ones; it proposes lossless reductions of the minimal generator set thanks to a new substitution-based process.
Abstract: The last years witnessed an explosive progress in networking, storage, and processing technologies, resulting in an unprecedented amount of digitalization of data. There is hence a considerable need for tools or techniques to delve into and efficiently discover valuable, non-obvious information from large databases. In this situation, Knowledge Discovery in Databases offers a complete process for the non-trivial extraction of implicit, previously unknown, and potentially useful knowledge from data. Amongst its steps, data mining offers tools and techniques for such an extraction. Much research in data mining from large databases has focused on the discovery of association rules, which are used to identify relationships between sets of items in a database. The discovered association rules can be used in various tasks, such as depicting purchase dependencies, classification, medical data analysis, etc. In practice, however, the number of frequently occurring itemsets, used as a basis for rule derivation, is very large, hampering their effective exploitation by the end-users. In this situation, a determined effort has focused on defining manageably-sized sets of patterns, called concise representations, from which redundant patterns can be regenerated. The purpose of such representations is to reduce the number of mined patterns to make them manageable by the end-users while preserving as much as possible the hidden and interesting information about the data. Many concise representations for frequent patterns have so far been proposed in the literature, mainly exploring the conjunctive search space. In this space, itemsets are characterized by the frequency of their co-occurrence. A detailed study proposed in this thesis shows that closed itemsets and minimal generators play a key role in concisely representing both frequent itemsets and association rules. These itemsets structure the search space into equivalence classes such that each class gathers the itemsets appearing in the same subset (aka objects or transactions) of the given data. A closed itemset includes the most specific expression describing the associated transactions, while a minimal generator includes one of the most general expressions. However, an intra-class combinatorial redundancy would logically result from the inherent absence of a unique minimal generator associated to a given closed itemset. This motivated us to carry out an in-depth study aiming at retaining only irreducible minimal generators in each equivalence class, and pruning the remaining ones. In this respect, we propose lossless reductions of the minimal generator set thanks to a new substitution-based process. We then carry out a thorough study of the associated properties of the obtained families. Our theoretical results are then extended to the association rule framework in order to reduce as much as possible the number of retained rules without information loss. We then give a thorough formal study of the related inference mechanism allowing the derivation of all redundant association rules, starting from the retained ones. In order to validate our approach, computing means for the new pattern families are presented together with empirical evidence about their relative sizes w.r.t. the entire sets of patterns. We also lead a thorough exploration of the disjunctive search space, where itemsets are characterized by their respective disjunctive supports, instead of the conjunctive ones. Thus, an itemset verifies a portion of data if at least one of its items belongs to it. Disjunctive itemsets thus convey knowledge about complementary occurrences of items in a dataset. This exploration is motivated by the fact that, in some applications, such information, conveyed through disjunctive support, brings richer knowledge to the end-users. In order to obtain a redundancy-free representation of the disjunctive search space, an interesting solution consists in selecting a unique element to represent itemsets covering the same set of data. Two itemsets are equivalent if their respective items cover the same set of data. In this regard, we introduce a new operator dedicated to this task. In each induced equivalence class, minimal elements are called essential itemsets, while the largest one is called the disjunctive closed itemset. The introduced operator is then at the root of new concise representations of frequent itemsets. We also exploit the disjunctive search space to derive generalized association rules. These latter rules generalize classic ones to also offer disjunction and negation connectors between items, in addition to the conjunctive one. Dedicated tools were then designed and implemented for extracting disjunctive itemsets and generalized association rules. Our experiments showed the usefulness of our exploration and highlighted interesting compactness rates.
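The central notions (cover, closure, minimal generator) admit a compact illustration. The Java sketch below uses an invented toy database and is didactic rather than the thesis's algorithms: the closure of an itemset X is the intersection of all transactions containing X, and X is a minimal generator when no immediate proper subset has the same cover.

```java
import java.util.*;

// Didactic sketch of closed itemsets and minimal generators on a toy database.
public class ClosureDemo {
    // Toy transaction database: each transaction is a set of items (invented data).
    static final List<Set<String>> DB = List.of(
            Set.of("a", "b", "c"), Set.of("a", "b"),
            Set.of("b", "c"), Set.of("a", "b", "c"));

    // Cover of X: ids of the transactions containing every item of X.
    static Set<Integer> cover(Set<String> x) {
        Set<Integer> tids = new HashSet<>();
        for (int i = 0; i < DB.size(); i++)
            if (DB.get(i).containsAll(x)) tids.add(i);
        return tids;
    }

    // Closure of X: items common to all transactions in X's cover.
    static Set<String> closure(Set<String> x) {
        Set<String> c = null;
        for (int tid : cover(x)) {
            if (c == null) c = new HashSet<>(DB.get(tid));
            else c.retainAll(DB.get(tid));
        }
        return c == null ? new HashSet<>(x) : c;
    }

    // X is a minimal generator iff no immediate proper subset has the same cover.
    static boolean isMinimalGenerator(Set<String> x) {
        Set<Integer> cov = cover(x);
        for (String item : x) {
            Set<String> sub = new HashSet<>(x);
            sub.remove(item);
            if (cover(sub).equals(cov)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(closure(Set.of("a")));               // [a, b]: every a-transaction has b
        System.out.println(isMinimalGenerator(Set.of("a")));    // true
        System.out.println(isMinimalGenerator(Set.of("a","b"))); // false: same cover as {a}
    }
}
```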

4 citations


Journal ArticleDOI
TL;DR: This special issue contains six articles on open source analytics, including an article describing the Weka data mining system, an article on the PMML standard for statistical and data mining models, and an article about an open source tool for cleaning data.
Abstract: This special issue contains six articles on open source analytics. It includes an article describing the Weka data mining system, two articles on infrastructure to support analytics, an article on the PMML standard for statistical and data mining models, an article on how clouds are being used in analytics, and an article about an open source tool for cleaning data.