scispace - formally typeset
Search or ask a question
Journal ArticleDOI

The WEKA data mining software: an update

16 Nov 2009-Sigkdd Explorations (ACM)-Vol. 11, Iss: 1, pp 10-18
TL;DR: This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
Abstract: More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, evolved substantially and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on Source-Forge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.

Content maybe subject to copyright    Report

Citations
More filters
Journal ArticleDOI
TL;DR: Fiji is a distribution of the popular open-source software ImageJ focused on biological-image analysis that facilitates the transformation of new algorithms into ImageJ plugins that can be shared with end users through an integrated update system.
Abstract: Fiji is a distribution of the popular open-source software ImageJ focused on biological-image analysis. Fiji uses modern software engineering practices to combine powerful software libraries with a broad range of scripting languages to enable rapid prototyping of image-processing algorithms. Fiji facilitates the transformation of new algorithms into ImageJ plugins that can be shared with end users through an integrated update system. We propose Fiji as a platform for productive collaboration between computer science and biology research communities.

43,540 citations

Journal ArticleDOI
TL;DR: This paper aims to provide a timely review on this area with emphasis on state-of-the-art multi-label learning algorithms with relevant analyses and discussions.
Abstract: Multi-label learning studies the problem where each example is represented by a single instance while associated with a set of labels simultaneously. During the past decade, significant amount of progresses have been made toward this emerging machine learning paradigm. This paper aims to provide a timely review on this area with emphasis on state-of-the-art multi-label learning algorithms. Firstly, fundamentals on multi-label learning including formal definition and evaluation metrics are given. Secondly and primarily, eight representative multi-label learning algorithms are scrutinized under common notations with relevant analyses and discussions. Thirdly, several related learning settings are briefly summarized. As a conclusion, online resources and open research problems on multi-label learning are outlined for reference purposes.

2,495 citations

Journal ArticleDOI
04 Mar 2015-PLOS ONE
TL;DR: It is shown that the visual interpretability of ROC plots in the context of imbalanced datasets can be deceptive with respect to conclusions about the reliability of classification performance, owing to an intuitive but wrong interpretation of specificity.
Abstract: Binary classifiers are routinely evaluated with performance measures such as sensitivity and specificity, and performance is frequently illustrated with Receiver Operating Characteristics (ROC) plots. Alternative measures such as positive predictive value (PPV) and the associated Precision/Recall (PRC) plots are used less frequently. Many bioinformatics studies develop and evaluate classifiers that are to be applied to strongly imbalanced datasets in which the number of negatives outweighs the number of positives significantly. While ROC plots are visually appealing and provide an overview of a classifier's performance across a wide range of specificities, one can ask whether ROC plots could be misleading when applied in imbalanced classification scenarios. We show here that the visual interpretability of ROC plots in the context of imbalanced datasets can be deceptive with respect to conclusions about the reliability of classification performance, owing to an intuitive but wrong interpretation of specificity. PRC plots, on the other hand, can provide the viewer with an accurate prediction of future classification performance due to the fact that they evaluate the fraction of true positives among positive predictions. Our findings have potential implications for the interpretation of a large number of studies that use ROC plots on imbalanced datasets.

2,451 citations

Journal ArticleDOI
TL;DR: The survey covers the different facets of concept drift in an integrated way to reflect on the existing scattered state of the art and aims at providing a comprehensive introduction to the concept drift adaptation for researchers, industry analysts, and practitioners.
Abstract: Concept drift primarily refers to an online supervised learning scenario when the relation between the input data and the target variable changes over time. Assuming a general knowledge of supervised learning in this article, we characterize adaptive learning processes; categorize existing strategies for handling concept drift; overview the most representative, distinct, and popular techniques and algorithms; discuss evaluation methodology of adaptive algorithms; and present a set of illustrative applications. The survey covers the different facets of concept drift in an integrated way to reflect on the existing scattered state of the art. Thus, it aims at providing a comprehensive introduction to the concept drift adaptation for researchers, industry analysts, and practitioners.

2,374 citations

Journal ArticleDOI
TL;DR: This paper presents a novel classifier chains method that can model label correlations while maintaining acceptable computational complexity, and illustrates the competitiveness of the chaining method against related and state-of-the-art methods, both in terms of predictive performance and time complexity.
Abstract: The widely known binary relevance method for multi-label classification, which considers each label as an independent binary problem, has often been overlooked in the literature due to the perceived inadequacy of not directly modelling label correlations. Most current methods invest considerable complexity to model interdependencies between labels. This paper shows that binary relevance-based methods have much to offer, and that high predictive performance can be obtained without impeding scalability to large datasets. We exemplify this with a novel classifier chains method that can model label correlations while maintaining acceptable computational complexity. We extend this approach further in an ensemble framework. An extensive empirical evaluation covers a broad range of multi-label datasets with a variety of evaluation metrics. The results illustrate the competitiveness of the chaining method against related and state-of-the-art methods, both in terms of predictive performance and time complexity.

2,046 citations


Cites methods from "The WEKA data mining software: an u..."

  • ...We evaluate all algorithms using our open-source WEKA-based (Hall et al. 2009) software,2 which also provides a wrapper around the MULAN software3 that contains additional methods....

    [...]

  • ...Improving these algorithms, including threshold selection, was a focus of the work in Kiritchenko (2005). AdaBoost-based methods have mainly been used in bioinformatics applications (where boosting and decision trees are particularly popular (Kiritchenko 2005))....

    [...]

  • ...We evaluate all algorithms under a WEKA-based [17] framework running under Java JDK 1.6 with the following settings....

    [...]

References
More filters
Journal ArticleDOI
TL;DR: Issues such as solving SVM optimization problems theoretical convergence multiclass classification probability estimates and parameter selection are discussed in detail.
Abstract: LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems theoretical convergence multiclass classification probability estimates and parameter selection are discussed in detail.

40,826 citations


"The WEKA data mining software: an u..." refers methods in this paper

  • ...• Wrapper classifiers: allow the well known algorithms provided by the LibSVM [5] and LibLINEAR [9] thirdparty libraries to be used in WEKA....

    [...]

  • ...Supported .le formats include WEKA s own ARFF format, CSV, LibSVM s format, and C4.5 s format....

    [...]

  • ...6 is the ability to read and write data in the format used by the well known LibSVM and SVM-Light support vector machine implementations [5]....

    [...]

  • ...This complements the new LibSVM and LibLIN-EAR wrapper classi.ers....

    [...]

  • ...Wrapper classi.ers: allow the well known algorithms provided by the LibSVM [5] and LibLINEAR [9] third­party libraries to be used in WEKA....

    [...]

Book
25 Oct 1999
TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research. *Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects *Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods *Includes downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks-in an updated, interactive interface. Algorithms in toolkit cover: data pre-processing, classification, regression, clustering, association rules, visualization

20,196 citations

Book
01 Jan 1983
TL;DR: The methodology used to construct tree structured rules is the focus of a monograph as mentioned in this paper, covering the use of trees as a data analysis method, and in a more mathematical framework, proving some of their fundamental properties.
Abstract: The methodology used to construct tree structured rules is the focus of this monograph. Unlike many other statistical procedures, which moved from pencil and paper to calculators, this text's use of trees was unthinkable before computers. Both the practical and theoretical sides have been developed in the authors' study of tree methods. Classification and Regression Trees reflects these two sides, covering the use of trees as a data analysis method, and in a more mathematical framework, proving some of their fundamental properties.

14,825 citations

Book
01 Jan 2008
TL;DR: In this paper, generalized estimating equations (GEE) with computing using PROC GENMOD in SAS and multilevel analysis of clustered binary data using generalized linear mixed-effects models with PROC LOGISTIC are discussed.
Abstract: tic regression, and it concerns studying the effect of covariates on the risk of disease. The chapter includes generalized estimating equations (GEE’s) with computing using PROC GENMOD in SAS and multilevel analysis of clustered binary data using generalized linear mixed-effects models with PROC LOGISTIC. As a prelude to the following chapter on repeated-measures data, Chapter 5 presents time series analysis. The material on repeated-measures analysis uses linear additive models with GEE’s and PROC MIXED in SAS for linear mixed-effects models. Chapter 7 is about survival data analysis. All computing throughout the book is done using SAS procedures.

9,995 citations