Book

Data Mining: Practical Machine Learning Tools and Techniques

TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, and Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques and methods at the leading edge of contemporary research.
* Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects
* Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods
* Includes the downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks in an updated, interactive interface; the algorithms in the toolkit cover data pre-processing, classification, regression, clustering, association rules, and visualization
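
For orientation, the sketch below shows the kind of Java code that drives the Weka toolkit described above: it loads an ARFF file, trains a decision tree, and cross-validates it. This is a minimal sketch against the standard Weka 3 API, not material from the book; the file name iris.arff is a placeholder.

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    import java.util.Random;

    public class WekaQuickStart {
        public static void main(String[] args) throws Exception {
            // Load a dataset in Weka's ARFF format (file name is a placeholder).
            Instances data = new DataSource("iris.arff").getDataSet();
            // By convention the last attribute is the class to predict.
            data.setClassIndex(data.numAttributes() - 1);

            // J48 is Weka's C4.5 decision tree learner.
            J48 tree = new J48();
            tree.buildClassifier(data);
            System.out.println(tree);                       // print the learned tree

            // Estimate accuracy with 10-fold cross-validation.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }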
Citations
Book
08 Sep 2000
TL;DR: This book presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects, and provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.
Abstract: The increasing volume of data in modern business and science calls for more complex and sophisticated tools. Although advances in data mining technology have made extensive data collection much easier, the field is still evolving and there is a constant need for new techniques and tools that can help us transform this data into useful information and knowledge. Since the previous edition's publication, great advances have been made in the field of data mining. Not only does the third edition of Data Mining: Concepts and Techniques continue the tradition of equipping you with an understanding and application of the theory and practice of discovering patterns hidden in large data sets, it also focuses on new, important topics in the field: data warehouses and data cube technology, mining data streams, mining social networks, and mining spatial, multimedia and other complex data. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. This is the resource you need if you want to apply today's most powerful data mining techniques to meet real business challenges.
* Presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects
* Addresses advanced topics such as mining object-relational databases, spatial databases, multimedia databases, time-series databases, text databases, the World Wide Web, and applications in several fields
* Provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data

23,600 citations

Journal ArticleDOI
TL;DR: This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
Abstract: More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, evolved substantially and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on SourceForge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.

19,603 citations

Journal ArticleDOI
TL;DR: This article gives an introduction to the subject of classification and regression trees by reviewing some widely available algorithms and comparing their capabilities, strengths, and weaknesses in two examples.
Abstract: Classification and regression trees are machine-learning methods for constructing prediction models from data. The models are obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. As a result, the partitioning can be represented graphically as a decision tree. Classification trees are designed for dependent variables that take a finite number of unordered values, with prediction error measured in terms of misclassification cost. Regression trees are for dependent variables that take continuous or ordered discrete values, with prediction error typically measured by the squared difference between the observed and predicted values. This article gives an introduction to the subject by reviewing some widely available algorithms and comparing their capabilities, strengths, and weaknesses in two examples. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011, 1: 14-23. DOI: 10.1002/widm.8
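
To make the recursive-partitioning idea concrete, here is a minimal, self-contained sketch of a one-feature regression tree that greedily picks the split threshold minimizing the summed squared error and predicts the partition mean in each leaf. It illustrates the general idea only, not any particular published algorithm, and all names and parameter values in it are illustrative.

    import java.util.Arrays;
    import java.util.stream.IntStream;

    // Minimal regression tree on one numeric feature: recursively split at the
    // threshold that most reduces squared error, and predict the mean in each leaf.
    public class TinyRegressionTree {
        Double threshold;               // null for a leaf node
        double prediction;              // mean of the targets in this partition
        TinyRegressionTree left, right;

        static TinyRegressionTree fit(double[] x, double[] y, int minLeaf) {
            TinyRegressionTree node = new TinyRegressionTree();
            node.prediction = mean(y);
            if (x.length < 2 * minLeaf) return node;                 // too small to split

            double bestErr = sse(y, node.prediction);
            for (double t : x) {                                     // candidate thresholds
                double[] yl = pick(x, y, t, true), yr = pick(x, y, t, false);
                if (yl.length < minLeaf || yr.length < minLeaf) continue;
                double err = sse(yl, mean(yl)) + sse(yr, mean(yr));
                if (err < bestErr) { bestErr = err; node.threshold = t; }
            }
            if (node.threshold == null) return node;                 // no split improves the error
            double t = node.threshold;
            node.left  = fit(pick(x, x, t, true),  pick(x, y, t, true),  minLeaf);
            node.right = fit(pick(x, x, t, false), pick(x, y, t, false), minLeaf);
            return node;
        }

        double predict(double v) {
            if (threshold == null) return prediction;
            return v <= threshold ? left.predict(v) : right.predict(v);
        }

        static double mean(double[] a) { return Arrays.stream(a).average().orElse(0); }

        static double sse(double[] a, double m) {
            return Arrays.stream(a).map(v -> (v - m) * (v - m)).sum();
        }

        // Entries of 'vals' whose matching x lies on the requested side of threshold t.
        static double[] pick(double[] x, double[] vals, double t, boolean leftSide) {
            return IntStream.range(0, x.length)
                    .filter(i -> (x[i] <= t) == leftSide)
                    .mapToDouble(i -> vals[i]).toArray();
        }

        public static void main(String[] args) {
            double[] x = {1, 2, 3, 4, 5, 6, 7, 8};
            double[] y = {1.1, 0.9, 1.0, 1.2, 5.0, 5.1, 4.9, 5.2};
            TinyRegressionTree tree = fit(x, y, 2);
            System.out.println(tree.predict(2.5) + " vs " + tree.predict(7.5));
        }
    }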

16,974 citations

Journal ArticleDOI
TL;DR: A basic taxonomy of feature selection techniques is provided, discussing their use, variety and potential in a number of both common and upcoming bioinformatics applications.
Abstract: Feature selection techniques have become an apparent need in many bioinformatics applications. In addition to the large pool of techniques that have already been developed in the machine learning and data mining fields, specific applications in bioinformatics have led to a wealth of newly proposed techniques. In this article, we make the interested reader aware of the possibilities of feature selection, providing a basic taxonomy of feature selection techniques, and discussing their use, variety and potential in a number of both common and upcoming bioinformatics applications.
Contact: yvan.saeys@psb.ugent.be
Supplementary information: http://bioinformatics.psb.ugent.be/supplementary_data/yvsae/fsreview
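
Since WEKA appears in the review's own list of general-purpose feature-selection software (see the excerpts below), a minimal sketch of a filter-style selection run through Weka's attribute-selection API may help orient readers; the dataset name expression.arff and the choice of information gain with a top-20 ranker are illustrative assumptions, not taken from the article.

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class FilterSelectionSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder dataset; the class attribute is assumed to be last.
            Instances data = new DataSource("expression.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            // Univariate filter: score each attribute by information gain with the
            // class, then rank them and keep the top 20 (a typical filter pipeline).
            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(new InfoGainAttributeEval());
            Ranker ranker = new Ranker();
            ranker.setNumToSelect(20);
            selector.setSearch(ranker);
            selector.SelectAttributes(data);

            // Indices of the kept attributes (Weka appends the class index at the end).
            for (int idx : selector.selectedAttributes()) {
                System.out.println(data.attribute(idx).name());
            }
        }
    }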

4,706 citations


Cites methods from "Data Mining: Practical Machine Learning Tools and Techniques"

  • ...On this website, the publications are indexed according to the FS technique used, a number of keywords accompanying each reference to understand its FS methodological contributions....



  • ...Software for feature selection:

    General purpose FS software
      WEKA | Java | Witten and Frank (2005) | http://www.cs.waikato.ac.nz/ml/weka
      Fast Correlation Based Filter | Java | Yu and Liu (2004) | http://www.public.asu.edu/~huanliu/FCBF/FCBFsoftware.html
      Feature Selection Book | Ansi C | Liu and Motoda (1998) | http://www.public.asu.edu/~huanliu/Fsbook
      MLC++ | C++ | Kohavi et al. (1996) | http://www.sgi.com/tech/mlc
      Spider | Matlab | – | http://www.kyb.tuebingen.mpg.de/bs/people/spider
      SVM and Kernel Methods Matlab Toolbox | Matlab | Canu et al. (2003) | http://asi.insa-rouen.fr/~arakotom/toolbox/index

    Microarray analysis FS software
      SAM | R, Excel | Tusher et al. (2001) | http://www-stat.stanford.edu/~tibs/SAM/
      GALGO | R | Trevino and Falciani (2006) | http://www.bip.bham.ac.uk/bioinf/galgo.html
      PCP | C, C++ | Buturovic (2005) | http://pcp.sourceforge.net
      GA-KNN | C | Li et al. (2001) | http://dir.niehs.nih.gov/microarray/datamining/
      Rankgene | C | Su et al. (2003) | http://genomics10.bu.edu/yangsu/rankgene/
      EDGE | R | Leek et al. (2006) | http://www.biostat.washington.edu/software/jstorey/edge/
      GEPAS-Prophet | Perl, C | Medina et al. (2007) | http://prophet.bioinfo.cipf.es/
      DEDS (Bioconductor) | R | Yang et al. (2005) | http://www.bioconductor.org/
      RankProd (Bioconductor) | R | Breitling et al. (2004) | http://www.bioconductor.org/
      Limma (Bioconductor) | R | Smyth (2004) | http://www.bioconductor.org/
      Multtest (Bioconductor) | R | Dudoit et al. (2003) | http://www.bioconductor.org/
      Nudge (Bioconductor) | R | Dean and Raftery (2005) | http://www.bioconductor.org/
      Qvalue (Bioconductor) | R | Storey (2002) | http://www.bioconductor.org/
      twilight (Bioconductor) | R | Scheid and Spang (2005) | http://www.bioconductor.org/
      ComparativeMarkerSelection (GenePattern) | JAVA, R | Gould et al. (2006) | http://www.broad.mit.edu/genepattern

    Mass spectra analysis FS software
      GA-KNN | C | Li et al. (2004) | http://dir.niehs.nih.gov/microarray/datamining/
      R-SVM | R, C, C++ | Zhang et al. (2006) | http://www.hsph.harvard.edu/bioinfocore/RSVMhome/R-SVM.html

    SNP analysis FS software
      CHOISS | C++, Perl | Lee and Kang (2004) | http://biochem.kaist.ac.kr/choiss.htm
      MLR-tagging | C | He and Zelikovsky (2006) | http://alla.cs.gsu.ed/software/tagging/tagging.html
      WCLUSTAG | JAVA | Sham et al. (2007) | http://bioinfo.hku.hk/wclustag

    ...discusses the use of feature selection for a document classification task....


Book ChapterDOI
21 Apr 2004
TL;DR: This is the first work to investigate performance of recognition algorithms with multiple, wire-free accelerometers on 20 activities using datasets annotated by the subjects themselves, and suggests that multiple accelerometers aid in recognition.
Abstract: In this work, algorithms are developed and evaluated to detect physical activities from data acquired using five small biaxial accelerometers worn simultaneously on different parts of the body. Acceleration data was collected from 20 subjects without researcher supervision or observation. Subjects were asked to perform a sequence of everyday tasks but not told specifically where or how to do them. Mean, energy, frequency-domain entropy, and correlation of acceleration data were calculated and several classifiers using these features were tested. Decision tree classifiers showed the best performance, recognizing everyday activities with an overall accuracy rate of 84%. The results show that although some activities are recognized well with subject-independent training data, others appear to require subject-specific training data. The results suggest that multiple accelerometers aid in recognition because conjunctions in acceleration feature values can effectively discriminate many activities. With just two biaxial accelerometers (thigh and wrist) the recognition performance dropped only slightly. This is the first work to investigate performance of recognition algorithms with multiple, wire-free accelerometers on 20 activities using datasets annotated by the subjects themselves.
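
As a rough illustration of the window-level features named above (mean, energy, frequency-domain entropy, and inter-axis correlation), here is a self-contained sketch; the paper's exact window length, normalization, and FFT conventions may differ, so treat the definitions in the comments as assumptions.

    // Window-level features for one accelerometer axis plus inter-axis correlation.
    // A rough sketch: the paper's exact definitions (e.g. FFT normalization) may differ.
    public class AccelFeatures {

        static double mean(double[] w) {
            double s = 0;
            for (double v : w) s += v;
            return s / w.length;
        }

        // Magnitudes of the discrete Fourier transform, skipping the DC component.
        static double[] dftMagnitudes(double[] w) {
            int n = w.length;
            double[] mag = new double[n - 1];
            for (int k = 1; k < n; k++) {
                double re = 0, im = 0;
                for (int t = 0; t < n; t++) {
                    double ang = -2 * Math.PI * k * t / n;
                    re += w[t] * Math.cos(ang);
                    im += w[t] * Math.sin(ang);
                }
                mag[k - 1] = Math.hypot(re, im);
            }
            return mag;
        }

        // Energy: average squared spectral magnitude over the window (assumed definition).
        static double energy(double[] w) {
            double s = 0;
            for (double m : dftMagnitudes(w)) s += m * m;
            return s / w.length;
        }

        // Frequency-domain entropy: entropy of the normalized magnitude spectrum.
        static double spectralEntropy(double[] w) {
            double[] mag = dftMagnitudes(w);
            double total = 0;
            for (double m : mag) total += m;
            double h = 0;
            for (double m : mag) {
                if (m <= 0) continue;
                double p = m / total;
                h -= p * Math.log(p) / Math.log(2);
            }
            return h;
        }

        // Pearson correlation between two axes over the same window.
        static double correlation(double[] a, double[] b) {
            double ma = mean(a), mb = mean(b), cov = 0, va = 0, vb = 0;
            for (int i = 0; i < a.length; i++) {
                cov += (a[i] - ma) * (b[i] - mb);
                va += (a[i] - ma) * (a[i] - ma);
                vb += (b[i] - mb) * (b[i] - mb);
            }
            return cov / Math.sqrt(va * vb);
        }

        public static void main(String[] args) {
            double[] x = new double[64], y = new double[64];
            for (int i = 0; i < 64; i++) {
                x[i] = Math.sin(2 * Math.PI * i / 16.0);    // synthetic periodic axis
                y[i] = 0.5 * x[i] + 0.1;                    // correlated second axis
            }
            System.out.printf("mean=%.3f energy=%.3f entropy=%.3f corr=%.3f%n",
                    mean(x), energy(x), spectralEntropy(x), correlation(x, y));
        }
    }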

3,223 citations

References
Journal ArticleDOI
01 Apr 1998
TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and looks at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
Abstract: In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype, with a full-text and hyperlink database of at least 24 million pages, is available at http://google.stanford.edu/. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advances in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses the question of how to build a practical large-scale system which can exploit the additional information present in hypertext. We also look at the problem of how to deal effectively with uncontrolled hypertext collections, where anyone can publish anything they want.
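
The "structure present in hypertext" that the system exploits is the link graph between pages. As a toy illustration (not the production system described in the paper), the sketch below scores a tiny link graph by power iteration in the spirit of PageRank; the damping factor 0.85 and the example graph are conventional assumptions.

    // Link-based page scoring by power iteration, in the spirit of PageRank.
    // A toy sketch, not the production system described in the paper.
    public class LinkScoreSketch {
        public static void main(String[] args) {
            // Adjacency list: links[i] holds the pages that page i links to.
            int[][] links = { {1, 2}, {2}, {0}, {2} };
            int n = links.length;
            double damping = 0.85;                 // conventional damping factor
            double[] rank = new double[n];
            java.util.Arrays.fill(rank, 1.0 / n);

            for (int iter = 0; iter < 50; iter++) {
                double[] next = new double[n];
                java.util.Arrays.fill(next, (1 - damping) / n);
                for (int i = 0; i < n; i++) {
                    if (links[i].length == 0) {
                        // Dangling page: spread its rank uniformly.
                        for (int j = 0; j < n; j++) next[j] += damping * rank[i] / n;
                    } else {
                        for (int j : links[i]) next[j] += damping * rank[i] / links[i].length;
                    }
                }
                rank = next;
            }
            for (int i = 0; i < n; i++) System.out.printf("page %d: %.4f%n", i, rank[i]);
        }
    }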

14,696 citations

Proceedings Article
01 Jan 1996
TL;DR: DBSCAN, a new clustering algorithm relying on a density-based notion of clusters and designed to discover clusters of arbitrary shape, is presented; it requires only one input parameter and supports the user in determining an appropriate value for it.
Abstract: Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases raises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data of the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that (1) DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and that (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.
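
For orientation, here is a compact, brute-force sketch of the density-based idea: a neighborhood of radius eps must contain at least minPts points to seed a cluster, which then grows through density-reachable points. The published algorithm pairs this with spatial index structures for efficiency, which this toy version omits; the parameter values in main are illustrative.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.List;

    // Unoptimized DBSCAN-style clustering on 2-D points (brute-force neighborhoods).
    public class DbscanSketch {
        static final int NOISE = -1, UNVISITED = 0;

        static int[] cluster(double[][] pts, double eps, int minPts) {
            int[] label = new int[pts.length];      // 0 = unvisited, -1 = noise, >0 = cluster id
            int clusterId = 0;
            for (int p = 0; p < pts.length; p++) {
                if (label[p] != UNVISITED) continue;
                List<Integer> seeds = neighbors(pts, p, eps);
                if (seeds.size() < minPts) { label[p] = NOISE; continue; }
                clusterId++;
                label[p] = clusterId;
                ArrayDeque<Integer> queue = new ArrayDeque<>(seeds);
                while (!queue.isEmpty()) {
                    int q = queue.poll();
                    if (label[q] == NOISE) label[q] = clusterId;   // border point
                    if (label[q] != UNVISITED) continue;
                    label[q] = clusterId;
                    List<Integer> qNeighbors = neighbors(pts, q, eps);
                    if (qNeighbors.size() >= minPts) queue.addAll(qNeighbors);  // q is a core point
                }
            }
            return label;
        }

        static List<Integer> neighbors(double[][] pts, int i, double eps) {
            List<Integer> out = new ArrayList<>();
            for (int j = 0; j < pts.length; j++) {
                double dx = pts[i][0] - pts[j][0], dy = pts[i][1] - pts[j][1];
                if (Math.sqrt(dx * dx + dy * dy) <= eps) out.add(j);
            }
            return out;
        }

        public static void main(String[] args) {
            double[][] pts = { {0, 0}, {0, 1}, {1, 0}, {10, 10}, {10, 11}, {11, 10}, {50, 50} };
            int[] labels = cluster(pts, 2.0, 3);
            System.out.println(java.util.Arrays.toString(labels));  // two clusters and one noise point
        }
    }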

14,297 citations

Book
01 Jan 2000
TL;DR: This is the first comprehensive introduction to Support Vector Machines (SVMs), a new generation learning system based on recent advances in statistical learning theory, and will guide practitioners to updated literature, new applications, and on-line software.
Abstract: From the publisher: This is the first comprehensive introduction to Support Vector Machines (SVMs), a new generation learning system based on recent advances in statistical learning theory. SVMs deliver state-of-the-art performance in real-world applications such as text categorisation, hand-written character recognition, image classification, biosequence analysis, etc., and are now established as one of the standard tools for machine learning and data mining. Students will find the book both stimulating and accessible, while practitioners will be guided smoothly through the material required for a good grasp of the theory and its applications. The concepts are introduced gradually in accessible and self-contained stages, while the presentation is rigorous and thorough. Pointers to relevant literature and web sites containing software ensure that it forms an ideal starting point for further study. Equally, the book and its associated web site will guide practitioners to updated literature, new applications, and on-line software.

13,736 citations

Book
01 Jan 1973
TL;DR: In this book, a unified, comprehensive and up-to-date treatment of both statistical and descriptive methods for pattern recognition is provided, including Bayesian decision theory, supervised and unsupervised learning, nonparametric techniques, discriminant analysis, clustering, preprocessing of pictorial data, spatial filtering, shape description techniques, perspective transformations, projective invariants, linguistic procedures, and artificial intelligence techniques for scene analysis.
Abstract: Provides a unified, comprehensive and up-to-date treatment of both statistical and descriptive methods for pattern recognition. The topics treated include Bayesian decision theory, supervised and unsupervised learning, nonparametric techniques, discriminant analysis, clustering, preprocessing of pictorial data, spatial filtering, shape description techniques, perspective transformations, projective invariants, linguistic procedures, and artificial intelligence techniques for scene analysis.

13,647 citations