Book

Data Mining: Practical Machine Learning Tools and Techniques

TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, and Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques and methods at the leading edge of contemporary research.
* Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects
* Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods
* Includes the downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks in an updated, interactive interface; the algorithms in the toolkit cover data pre-processing, classification, regression, clustering, association rules, and visualization
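
For orientation, the sketch below shows the kind of Java code that drives the Weka toolkit described above: it loads an ARFF file, trains a decision tree, and cross-validates it. This is a minimal sketch against the standard Weka 3 API, not material from the book; the file name iris.arff is a placeholder.

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    import java.util.Random;

    public class WekaQuickStart {
        public static void main(String[] args) throws Exception {
            // Load a dataset in Weka's ARFF format (file name is a placeholder).
            Instances data = new DataSource("iris.arff").getDataSet();
            // By convention the last attribute is the class to predict.
            data.setClassIndex(data.numAttributes() - 1);

            // J48 is Weka's C4.5 decision tree learner.
            J48 tree = new J48();
            tree.buildClassifier(data);
            System.out.println(tree);                       // print the learned tree

            // Estimate accuracy with 10-fold cross-validation.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }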
Citations
Book
08 Sep 2000
TL;DR: This book presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects, and provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.
Abstract: The increasing volume of data in modern business and science calls for more complex and sophisticated tools. Although advances in data mining technology have made extensive data collection much easier, the field is still evolving and there is a constant need for new techniques and tools that can help us transform this data into useful information and knowledge. Since the previous edition's publication, great advances have been made in the field of data mining. Not only does the third edition of Data Mining: Concepts and Techniques continue the tradition of equipping you with an understanding and application of the theory and practice of discovering patterns hidden in large data sets, it also focuses on new, important topics in the field: data warehouses and data cube technology, mining data streams, mining social networks, and mining spatial, multimedia and other complex data. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. This is the resource you need if you want to apply today's most powerful data mining techniques to meet real business challenges.
* Presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects
* Addresses advanced topics such as mining object-relational databases, spatial databases, multimedia databases, time-series databases, text databases, the World Wide Web, and applications in several fields
* Provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data

23,600 citations

Journal ArticleDOI
TL;DR: This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
Abstract: More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, evolved substantially and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on SourceForge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.

19,603 citations

Journal ArticleDOI
TL;DR: This article gives an introduction to the subject of classification and regression trees by reviewing some widely available algorithms and comparing their capabilities, strengths, and weaknesses in two examples.
Abstract: Classification and regression trees are machine-learning methods for constructing prediction models from data. The models are obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. As a result, the partitioning can be represented graphically as a decision tree. Classification trees are designed for dependent variables that take a finite number of unordered values, with prediction error measured in terms of misclassification cost. Regression trees are for dependent variables that take continuous or ordered discrete values, with prediction error typically measured by the squared difference between the observed and predicted values. This article gives an introduction to the subject by reviewing some widely available algorithms and comparing their capabilities, strengths, and weaknesses in two examples. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011, 1: 14-23. DOI: 10.1002/widm.8
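
To make the recursive-partitioning idea concrete, here is a minimal, self-contained sketch of a one-feature regression tree that greedily picks the split threshold minimizing the summed squared error and predicts the partition mean in each leaf. It illustrates the general idea only, not any particular published algorithm, and all names and parameter values in it are illustrative.

    import java.util.Arrays;
    import java.util.stream.IntStream;

    // Minimal regression tree on one numeric feature: recursively split at the
    // threshold that most reduces squared error, and predict the mean in each leaf.
    public class TinyRegressionTree {
        Double threshold;               // null for a leaf node
        double prediction;              // mean of the targets in this partition
        TinyRegressionTree left, right;

        static TinyRegressionTree fit(double[] x, double[] y, int minLeaf) {
            TinyRegressionTree node = new TinyRegressionTree();
            node.prediction = mean(y);
            if (x.length < 2 * minLeaf) return node;                 // too small to split

            double bestErr = sse(y, node.prediction);
            for (double t : x) {                                     // candidate thresholds
                double[] yl = pick(x, y, t, true), yr = pick(x, y, t, false);
                if (yl.length < minLeaf || yr.length < minLeaf) continue;
                double err = sse(yl, mean(yl)) + sse(yr, mean(yr));
                if (err < bestErr) { bestErr = err; node.threshold = t; }
            }
            if (node.threshold == null) return node;                 // no split improves the error
            double t = node.threshold;
            node.left  = fit(pick(x, x, t, true),  pick(x, y, t, true),  minLeaf);
            node.right = fit(pick(x, x, t, false), pick(x, y, t, false), minLeaf);
            return node;
        }

        double predict(double v) {
            if (threshold == null) return prediction;
            return v <= threshold ? left.predict(v) : right.predict(v);
        }

        static double mean(double[] a) { return Arrays.stream(a).average().orElse(0); }

        static double sse(double[] a, double m) {
            return Arrays.stream(a).map(v -> (v - m) * (v - m)).sum();
        }

        // Entries of 'vals' whose matching x lies on the requested side of threshold t.
        static double[] pick(double[] x, double[] vals, double t, boolean leftSide) {
            return IntStream.range(0, x.length)
                    .filter(i -> (x[i] <= t) == leftSide)
                    .mapToDouble(i -> vals[i]).toArray();
        }

        public static void main(String[] args) {
            double[] x = {1, 2, 3, 4, 5, 6, 7, 8};
            double[] y = {1.1, 0.9, 1.0, 1.2, 5.0, 5.1, 4.9, 5.2};
            TinyRegressionTree tree = fit(x, y, 2);
            System.out.println(tree.predict(2.5) + " vs " + tree.predict(7.5));
        }
    }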

16,974 citations

Journal ArticleDOI
TL;DR: A basic taxonomy of feature selection techniques is provided, discussing their use, variety and potential in a number of both common and upcoming bioinformatics applications.
Abstract: Feature selection techniques have become an apparent need in many bioinformatics applications. In addition to the large pool of techniques that have already been developed in the machine learning and data mining fields, specific applications in bioinformatics have led to a wealth of newly proposed techniques. In this article, we make the interested reader aware of the possibilities of feature selection, providing a basic taxonomy of feature selection techniques, and discussing their use, variety and potential in a number of both common and upcoming bioinformatics applications.
Contact: yvan.saeys@psb.ugent.be
Supplementary information: http://bioinformatics.psb.ugent.be/supplementary_data/yvsae/fsreview
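
Since WEKA appears in the review's own list of general-purpose feature-selection software (see the excerpts below), a minimal sketch of a filter-style selection run through Weka's attribute-selection API may help orient readers; the dataset name expression.arff and the choice of information gain with a top-20 ranker are illustrative assumptions, not taken from the article.

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class FilterSelectionSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder dataset; the class attribute is assumed to be last.
            Instances data = new DataSource("expression.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            // Univariate filter: score each attribute by information gain with the
            // class, then rank them and keep the top 20 (a typical filter pipeline).
            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(new InfoGainAttributeEval());
            Ranker ranker = new Ranker();
            ranker.setNumToSelect(20);
            selector.setSearch(ranker);
            selector.SelectAttributes(data);

            // Indices of the kept attributes (Weka appends the class index at the end).
            for (int idx : selector.selectedAttributes()) {
                System.out.println(data.attribute(idx).name());
            }
        }
    }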

4,706 citations


Cites methods from "Data Mining: Practical Machine Learning Tools and Techniques"

  • ...On this website, the publications are indexed according to the FS technique used, a number of keywords accompanying each reference to understand its FS methodological contributions....



  • ...Software for feature selection:

    General purpose FS software
      WEKA | Java | Witten and Frank (2005) | http://www.cs.waikato.ac.nz/ml/weka
      Fast Correlation Based Filter | Java | Yu and Liu (2004) | http://www.public.asu.edu/~huanliu/FCBF/FCBFsoftware.html
      Feature Selection Book | Ansi C | Liu and Motoda (1998) | http://www.public.asu.edu/~huanliu/Fsbook
      MLC++ | C++ | Kohavi et al. (1996) | http://www.sgi.com/tech/mlc
      Spider | Matlab | – | http://www.kyb.tuebingen.mpg.de/bs/people/spider
      SVM and Kernel Methods Matlab Toolbox | Matlab | Canu et al. (2003) | http://asi.insa-rouen.fr/~arakotom/toolbox/index

    Microarray analysis FS software
      SAM | R, Excel | Tusher et al. (2001) | http://www-stat.stanford.edu/~tibs/SAM/
      GALGO | R | Trevino and Falciani (2006) | http://www.bip.bham.ac.uk/bioinf/galgo.html
      PCP | C, C++ | Buturovic (2005) | http://pcp.sourceforge.net
      GA-KNN | C | Li et al. (2001) | http://dir.niehs.nih.gov/microarray/datamining/
      Rankgene | C | Su et al. (2003) | http://genomics10.bu.edu/yangsu/rankgene/
      EDGE | R | Leek et al. (2006) | http://www.biostat.washington.edu/software/jstorey/edge/
      GEPAS-Prophet | Perl, C | Medina et al. (2007) | http://prophet.bioinfo.cipf.es/
      DEDS (Bioconductor) | R | Yang et al. (2005) | http://www.bioconductor.org/
      RankProd (Bioconductor) | R | Breitling et al. (2004) | http://www.bioconductor.org/
      Limma (Bioconductor) | R | Smyth (2004) | http://www.bioconductor.org/
      Multtest (Bioconductor) | R | Dudoit et al. (2003) | http://www.bioconductor.org/
      Nudge (Bioconductor) | R | Dean and Raftery (2005) | http://www.bioconductor.org/
      Qvalue (Bioconductor) | R | Storey (2002) | http://www.bioconductor.org/
      twilight (Bioconductor) | R | Scheid and Spang (2005) | http://www.bioconductor.org/
      ComparativeMarkerSelection (GenePattern) | JAVA, R | Gould et al. (2006) | http://www.broad.mit.edu/genepattern

    Mass spectra analysis FS software
      GA-KNN | C | Li et al. (2004) | http://dir.niehs.nih.gov/microarray/datamining/
      R-SVM | R, C, C++ | Zhang et al. (2006) | http://www.hsph.harvard.edu/bioinfocore/RSVMhome/R-SVM.html

    SNP analysis FS software
      CHOISS | C++, Perl | Lee and Kang (2004) | http://biochem.kaist.ac.kr/choiss.htm
      MLR-tagging | C | He and Zelikovsky (2006) | http://alla.cs.gsu.ed/software/tagging/tagging.html
      WCLUSTAG | JAVA | Sham et al. (2007) | http://bioinfo.hku.hk/wclustag

    ...discusses the use of feature selection for a document classification task....


Book ChapterDOI
21 Apr 2004
TL;DR: This is the first work to investigate performance of recognition algorithms with multiple, wire-free accelerometers on 20 activities using datasets annotated by the subjects themselves, and suggests that multiple accelerometers aid in recognition.
Abstract: In this work, algorithms are developed and evaluated to detect physical activities from data acquired using five small biaxial accelerometers worn simultaneously on different parts of the body. Acceleration data was collected from 20 subjects without researcher supervision or observation. Subjects were asked to perform a sequence of everyday tasks but not told specifically where or how to do them. Mean, energy, frequency-domain entropy, and correlation of acceleration data were calculated and several classifiers using these features were tested. Decision tree classifiers showed the best performance, recognizing everyday activities with an overall accuracy rate of 84%. The results show that although some activities are recognized well with subject-independent training data, others appear to require subject-specific training data. The results suggest that multiple accelerometers aid in recognition because conjunctions in acceleration feature values can effectively discriminate many activities. With just two biaxial accelerometers (thigh and wrist) the recognition performance dropped only slightly. This is the first work to investigate performance of recognition algorithms with multiple, wire-free accelerometers on 20 activities using datasets annotated by the subjects themselves.
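
As a rough illustration of the window-level features named above (mean, energy, frequency-domain entropy, and inter-axis correlation), here is a self-contained sketch; the paper's exact window length, normalization, and FFT conventions may differ, so treat the definitions in the comments as assumptions.

    // Window-level features for one accelerometer axis plus inter-axis correlation.
    // A rough sketch: the paper's exact definitions (e.g. FFT normalization) may differ.
    public class AccelFeatures {

        static double mean(double[] w) {
            double s = 0;
            for (double v : w) s += v;
            return s / w.length;
        }

        // Magnitudes of the discrete Fourier transform, skipping the DC component.
        static double[] dftMagnitudes(double[] w) {
            int n = w.length;
            double[] mag = new double[n - 1];
            for (int k = 1; k < n; k++) {
                double re = 0, im = 0;
                for (int t = 0; t < n; t++) {
                    double ang = -2 * Math.PI * k * t / n;
                    re += w[t] * Math.cos(ang);
                    im += w[t] * Math.sin(ang);
                }
                mag[k - 1] = Math.hypot(re, im);
            }
            return mag;
        }

        // Energy: average squared spectral magnitude over the window (assumed definition).
        static double energy(double[] w) {
            double s = 0;
            for (double m : dftMagnitudes(w)) s += m * m;
            return s / w.length;
        }

        // Frequency-domain entropy: entropy of the normalized magnitude spectrum.
        static double spectralEntropy(double[] w) {
            double[] mag = dftMagnitudes(w);
            double total = 0;
            for (double m : mag) total += m;
            double h = 0;
            for (double m : mag) {
                if (m <= 0) continue;
                double p = m / total;
                h -= p * Math.log(p) / Math.log(2);
            }
            return h;
        }

        // Pearson correlation between two axes over the same window.
        static double correlation(double[] a, double[] b) {
            double ma = mean(a), mb = mean(b), cov = 0, va = 0, vb = 0;
            for (int i = 0; i < a.length; i++) {
                cov += (a[i] - ma) * (b[i] - mb);
                va += (a[i] - ma) * (a[i] - ma);
                vb += (b[i] - mb) * (b[i] - mb);
            }
            return cov / Math.sqrt(va * vb);
        }

        public static void main(String[] args) {
            double[] x = new double[64], y = new double[64];
            for (int i = 0; i < 64; i++) {
                x[i] = Math.sin(2 * Math.PI * i / 16.0);    // synthetic periodic axis
                y[i] = 0.5 * x[i] + 0.1;                    // correlated second axis
            }
            System.out.printf("mean=%.3f energy=%.3f entropy=%.3f corr=%.3f%n",
                    mean(x), energy(x), spectralEntropy(x), correlation(x, y));
        }
    }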

3,223 citations

References
Journal ArticleDOI
01 Apr 1998
TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and looks at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
Abstract: In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype, with a full-text and hyperlink database of at least 24 million pages, is available at http://google.stanford.edu/. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advances in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses the question of how to build a practical large-scale system which can exploit the additional information present in hypertext. We also look at the problem of how to deal effectively with uncontrolled hypertext collections, where anyone can publish anything they want.
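
The "structure present in hypertext" that the system exploits is the link graph between pages. As a toy illustration (not the production system described in the paper), the sketch below scores a tiny link graph by power iteration in the spirit of PageRank; the damping factor 0.85 and the example graph are conventional assumptions.

    // Link-based page scoring by power iteration, in the spirit of PageRank.
    // A toy sketch, not the production system described in the paper.
    public class LinkScoreSketch {
        public static void main(String[] args) {
            // Adjacency list: links[i] holds the pages that page i links to.
            int[][] links = { {1, 2}, {2}, {0}, {2} };
            int n = links.length;
            double damping = 0.85;                 // conventional damping factor
            double[] rank = new double[n];
            java.util.Arrays.fill(rank, 1.0 / n);

            for (int iter = 0; iter < 50; iter++) {
                double[] next = new double[n];
                java.util.Arrays.fill(next, (1 - damping) / n);
                for (int i = 0; i < n; i++) {
                    if (links[i].length == 0) {
                        // Dangling page: spread its rank uniformly.
                        for (int j = 0; j < n; j++) next[j] += damping * rank[i] / n;
                    } else {
                        for (int j : links[i]) next[j] += damping * rank[i] / links[i].length;
                    }
                }
                rank = next;
            }
            for (int i = 0; i < n; i++) System.out.printf("page %d: %.4f%n", i, rank[i]);
        }
    }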

14,696 citations

Proceedings Article
01 Jan 1996
TL;DR: DBSCAN, a new clustering algorithm relying on a density-based notion of clusters and designed to discover clusters of arbitrary shape, is presented; it requires only one input parameter and supports the user in determining an appropriate value for it.
Abstract: Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases raises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data of the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that (1) DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and that (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.
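
For orientation, here is a compact, brute-force sketch of the density-based idea: a neighborhood of radius eps must contain at least minPts points to seed a cluster, which then grows through density-reachable points. The published algorithm pairs this with spatial index structures for efficiency, which this toy version omits; the parameter values in main are illustrative.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.List;

    // Unoptimized DBSCAN-style clustering on 2-D points (brute-force neighborhoods).
    public class DbscanSketch {
        static final int NOISE = -1, UNVISITED = 0;

        static int[] cluster(double[][] pts, double eps, int minPts) {
            int[] label = new int[pts.length];      // 0 = unvisited, -1 = noise, >0 = cluster id
            int clusterId = 0;
            for (int p = 0; p < pts.length; p++) {
                if (label[p] != UNVISITED) continue;
                List<Integer> seeds = neighbors(pts, p, eps);
                if (seeds.size() < minPts) { label[p] = NOISE; continue; }
                clusterId++;
                label[p] = clusterId;
                ArrayDeque<Integer> queue = new ArrayDeque<>(seeds);
                while (!queue.isEmpty()) {
                    int q = queue.poll();
                    if (label[q] == NOISE) label[q] = clusterId;   // border point
                    if (label[q] != UNVISITED) continue;
                    label[q] = clusterId;
                    List<Integer> qNeighbors = neighbors(pts, q, eps);
                    if (qNeighbors.size() >= minPts) queue.addAll(qNeighbors);  // q is a core point
                }
            }
            return label;
        }

        static List<Integer> neighbors(double[][] pts, int i, double eps) {
            List<Integer> out = new ArrayList<>();
            for (int j = 0; j < pts.length; j++) {
                double dx = pts[i][0] - pts[j][0], dy = pts[i][1] - pts[j][1];
                if (Math.sqrt(dx * dx + dy * dy) <= eps) out.add(j);
            }
            return out;
        }

        public static void main(String[] args) {
            double[][] pts = { {0, 0}, {0, 1}, {1, 0}, {10, 10}, {10, 11}, {11, 10}, {50, 50} };
            int[] labels = cluster(pts, 2.0, 3);
            System.out.println(java.util.Arrays.toString(labels));  // two clusters and one noise point
        }
    }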

14,297 citations

Book
01 Jan 2000
TL;DR: This is the first comprehensive introduction to Support Vector Machines (SVMs), a new generation learning system based on recent advances in statistical learning theory, and will guide practitioners to updated literature, new applications, and on-line software.
Abstract: From the publisher: This is the first comprehensive introduction to Support Vector Machines (SVMs), a new generation learning system based on recent advances in statistical learning theory. SVMs deliver state-of-the-art performance in real-world applications such as text categorisation, hand-written character recognition, image classification, biosequence analysis, etc., and are now established as one of the standard tools for machine learning and data mining. Students will find the book both stimulating and accessible, while practitioners will be guided smoothly through the material required for a good grasp of the theory and its applications. The concepts are introduced gradually in accessible and self-contained stages, while the presentation is rigorous and thorough. Pointers to relevant literature and web sites containing software ensure that it forms an ideal starting point for further study. Equally, the book and its associated web site will guide practitioners to updated literature, new applications, and on-line software.

13,736 citations

Book
01 Jan 1973
TL;DR: In this book, a unified, comprehensive and up-to-date treatment of both statistical and descriptive methods for pattern recognition is provided, including Bayesian decision theory, supervised and unsupervised learning, nonparametric techniques, discriminant analysis, clustering, preprocessing of pictorial data, spatial filtering, shape description techniques, perspective transformations, projective invariants, linguistic procedures, and artificial intelligence techniques for scene analysis.
Abstract: Provides a unified, comprehensive and up-to-date treatment of both statistical and descriptive methods for pattern recognition. The topics treated include Bayesian decision theory, supervised and unsupervised learning, nonparametric techniques, discriminant analysis, clustering, preprocessing of pictorial data, spatial filtering, shape description techniques, perspective transformations, projective invariants, linguistic procedures, and artificial intelligence techniques for scene analysis.

13,647 citations