Home
/
Authors
/
S. Inglis

Author

S. Inglis

Bio: S. Inglis is an academic researcher from University of Waikato. The author has contributed to research in topics: Image compression & Data compression. The author has an hindex of 10, co-authored 18 publications receiving 918 citations.

Papers

PDF

Open Access

More filters

Journal Article•DOI•

Using Model Trees for Classification

[...]

Eibe Frank¹, Yong Wang¹, S. Inglis¹, Geoffrey Holmes¹, Ian H. Witten¹ - Show less +1 more•Institutions (1)

University of Waikato¹

01 Jul 1998-Machine Learning

TL;DR: Surprisingly, using this simple transformation the model tree inducer M5′, based on Quinlan's M5, generates more accurate classifiers than the state-of-the-art decision tree learner C5.0, particularly when most of the attributes are numeric.

...read moreread less

Abstract: Model trees, which are a type of decision tree with linear regression functions at the leaves, form the basis of a recent successful technique for predicting continuous numeric values. They can be applied to classification problems by employing a standard method of transforming a classification problem into a problem of function approximation. Surprisingly, using this simple transformation the model tree inducerM5 ′, based on Quinlan‘s M5, generates more accurate classifiers than the state-of-the-art decision tree learner C5.0, particularly when most of the attributes are numeric.

...read moreread less

396 citations

Posted Content•DOI•

Comparing Variant Call Files for Performance Benchmarking of Next-Generation Sequencing Variant Calling Pipelines

[...]

John G. Cleary, Ross Braithwaite, Kurt Gaastra, Brian S Hilbush, S. Inglis, Sean A. Irvine, Alan Jackson, Richard Littin, Mehul Kamlesh Rathod, David Ware, Justin M. Zook¹, Len Trigg, Francisco M. De La Vega² - Show less +9 more•Institutions (2)

National Institute of Standards and Technology¹, Stanford University²

03 Aug 2015-bioRxiv

TL;DR: A novel algorithm for comparing variant call sets that deals with complex call representation discrepancies and through a dynamic programing method that minimizes false positives and negatives globally across the entire call sets for accurate performance evaluation of VCFs is developed.

...read moreread less

Abstract: To evaluate and compare the performance of variant calling methods and their confidence scores, comparisons between a test call set and a ?gold standard? need to be carried out. Unfortunately, these comparisons are not straightforward with the current Variant Call Files (VCF), which are the standard output of most variant calling algorithms for high-throughput sequencing data. Comparisons of VCFs are often confounded by the different representations of indels, MNPs, and combinations thereof with SNVs in complex regions of the genome, resulting in misleading results. A variant caller is inherently a classification method designed to score putative variants with confidence scores that could permit controlling the rate of false positives (FP) or false negatives (FN) for a given application. Receiver operator curves (ROC) and the area under the ROC (AUC) are efficient metrics to evaluate a test call set versus a gold standard. However, in the case of VCF data this also requires a special accounting to deal with discrepant representations. We developed a novel algorithm for comparing variant call sets that deals with complex call representation discrepancies and through a dynamic programing method that minimizes false positives and negatives globally across the entire call sets for accurate performance evaluation of VCFs.

...read moreread less

176 citations

Journal Article•DOI•

Joint Variant and De Novo Mutation Identification on Pedigrees from High-Throughput Sequencing Data

[...]

John G Cleary, Ross Braithwaite, Kurt Gaastra, Brian S Hilbush, S. Inglis, Sean A. Irvine, Alan Jackson, Richard Littin, Sahar Nohzadeh-Malakshah, Mehul Kamlesh Rathod, David Ware, Len Trigg, Francisco M. De La Vega - Show less +9 more

29 May 2014-Journal of Computational Biology

TL;DR: A Bayesian network framework is presented that jointly analyzes data from all members of a pedigree simultaneously using Mendelian segregation priors, yet providing the ability to detect de novo mutations in offspring, and is scalable to large pedigrees.

...read moreread less

Abstract: The analysis of whole-genome or exome sequencing data from trios and pedigrees has been successfully applied to the identification of disease-causing mutations. However, most methods used to identify and genotype genetic variants from next-generation sequencing data ignore the relationships between samples, resulting in significant Mendelian errors, false positives and negatives. Here we present a Bayesian network framework that jointly analyzes data from all members of a pedigree simultaneously using Mendelian segregation priors, yet providing the ability to detect de novo mutations in offspring, and is scalable to large pedigrees. We evaluated our method by simulations and analysis of whole-genome sequencing (WGS) data from a 17-individual, 3-generation CEPH pedigree sequenced to 50× average depth. Compared with singleton calling, our family caller produced more high-quality variants and eliminated spurious calls as judged by common quality metrics such as Ti/Tv, Het/Hom ratios, and dbSNP/SNP a...

...read moreread less

89 citations

Proceedings Article•DOI•

Jumble Java Byte Code to Measure the Effectiveness of Unit Tests

[...]

Sean A. Irvine, T. Pavlinic, Leonard Eric Trigg, John G. Cleary, S. Inglis, Mark Utting¹ - Show less +2 more•Institutions (1)

University of Waikato¹

10 Sep 2007

TL;DR: Jumble is a byte code level mutation testing tool for Java which inter-operates with JUnit and significant effort has been put into ensuring that it can test code which uses custom class loading and reflection.

...read moreread less

Abstract: Jumble is a byte code level mutation testing tool for Java which inter-operates with JUnit. It has been designed to operate in an industrial setting with large projects. Heuristics have been included to speed the checking of mutations, for example, noting which test fails for each mutation and running this first in subsequent mutation checks. Significant effort has been put into ensuring that it can test code which uses custom class loading and reflection. This requires careful attention to class path handling and coexistence with foreign class-loaders. Jumble is currently used on a continuous basis within an agile programming environment with approximately 370,000 lines of Java code under source control. This checks out project code every fifteen minutes and runs an incremental set of unit tests and mutation tests for modified classes. Jumble is being made available as open source.

...read moreread less

70 citations

Proceedings Article•DOI•

Compression-based template matching

[...]

S. Inglis¹, Ian H. Witten¹•Institutions (1)

University of Waikato¹

29 Mar 1994

TL;DR: The authors use the amount of uncertainty or entropy between marks as the criterion for the matching process and present a novel method of screening which uses a quad-tree decomposition and finds local centroids at each tree level.

...read moreread less

Abstract: Textual image compression is a method of both lossy and lossless image compression that is particularly effective for images containing repeated sub-images, notably pages of text. This paper addresses the problem of pattern comparison by using an information or compression based approach. Following Mohiuddin et al. ( 1984), the authors use the amount of uncertainty or entropy between marks as the criterion for the matching process. The entropy model they use is the context-based compression model proposed by Langdon and Rissanen (1981) and further developed by Moffat (1991). There are two principal issues to investigate when studying template matching methods: their susceptibility to different kinds of noise, and how they respond to errors in the initial registration. Because of the computation-intensive nature of the comparison operation, many schemes have been devised to pre-filter or screen the marks in advance to determine those that will surely fail the match. They present a novel method of screening which uses a quad-tree decomposition and finds local centroids at each tree level. >

...read moreread less

70 citations

1
2
3
4
…

Cited by

PDF

Open Access

More filters

Book•

Data Mining: Practical Machine Learning Tools and Techniques

[...]

Ian H. Witten, Eibe Frank, Mark Hall

25 Oct 1999

TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.

...read moreread less

Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research. *Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects *Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods *Includes downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks-in an updated, interactive interface. Algorithms in toolkit cover: data pre-processing, classification, regression, clustering, association rules, visualization

...read moreread less

20,196 citations

Journal Article•DOI•

A global reference for human genetic variation.

[...]

Adam Auton¹, Gonçalo R. Abecasis², David Altshuler³, Richard Durbin⁴ +514 more•Institutions (90)

01 Oct 2015-Nature

TL;DR: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations, and has reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-generation sequencing, deep exome sequencing, and dense microarray genotyping.

...read moreread less

Abstract: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

...read moreread less

12,661 citations

Book•

Evolutionary algorithms for solving multi-objective problems

[...]

Gary B. Lamont, David A. Van Veldhuizen

30 Jun 2002

TL;DR: This paper presents a meta-anatomy of the multi-Criteria Decision Making process, which aims to provide a scaffolding for the future development of multi-criteria decision-making systems.

...read moreread less

Abstract: List of Figures. List of Tables. Preface. Foreword. 1. Basic Concepts. 2. Evolutionary Algorithm MOP Approaches. 3. MOEA Test Suites. 4. MOEA Testing and Analysis. 5. MOEA Theory and Issues. 3. MOEA Theoretical Issues. 6. Applications. 7. MOEA Parallelization. 8. Multi-Criteria Decision Making. 9. Special Topics. 10. Epilog. Appendix A: MOEA Classification and Technique Analysis. Appendix B: MOPs in the Literature. Appendix C: Ptrue & PFtrue for Selected Numeric MOPs. Appendix D: Ptrue & PFtrue for Side-Constrained MOPs. Appendix E: MOEA Software Availability. Appendix F: MOEA-Related Information. Index. References.

...read moreread less

5,994 citations

Book•

Applied Predictive Modeling

[...]

Max Kuhn, Kjell Johnson

17 May 2013

TL;DR: This research presents a novel and scalable approach called “Smartfitting” that automates the very labor-intensive and therefore time-heavy and therefore expensive and expensive process of designing and implementing statistical models for regression models.

...read moreread less

Abstract: General Strategies.- Regression Models.- Classification Models.- Other Considerations.- Appendix.- References.- Indices.

...read moreread less

3,672 citations

Correlation-based Feature Selection for Machine Learning

[...]

Mark Hall

01 Jan 1998

TL;DR: This thesis addresses the problem of feature selection for machine learning through a correlation based approach with CFS (Correlation based Feature Selection), an algorithm that couples this evaluation formula with an appropriate correlation measure and a heuristic search strategy.

...read moreread less

Abstract: A central problem in machine learning is identifying a representative set of features from which to construct a classification model for a particular task. This thesis addresses the problem of feature selection for machine learning through a correlation based approach. The central hypothesis is that good feature sets contain features that are highly correlated with the class, yet uncorrelated with each other. A feature evaluation formula, based on ideas from test theory, provides an operational definition of this hypothesis. CFS (Correlation based Feature Selection) is an algorithm that couples this evaluation formula with an appropriate correlation measure and a heuristic search strategy. CFS was evaluated by experiments on artificial and natural datasets. Three machine learning algorithms were used: C4.5 (a decision tree learner), IB1 (an instance based learner), and naive Bayes. Experiments on artificial datasets showed that CFS quickly identifies and screens irrelevant, redundant, and noisy features, and identifies relevant features as long as their relevance does not strongly depend on other features. On natural domains, CFS typically eliminated well over half the features. In most cases, classification accuracy using the reduced feature set equaled or bettered accuracy using the complete feature set. Feature selection degraded machine learning performance in cases where some features were eliminated which were highly predictive of very small areas of the instance space. Further experiments compared CFS with a wrapper—a well known approach to feature selection that employs the target learning algorithm to evaluate feature sets. In many cases CFS gave comparable results to the wrapper, and in general, outperformed the wrapper on small datasets. CFS executes many times faster than the wrapper, which allows it to scale to larger datasets. Two methods of extending CFS to handle feature interaction are presented and experimentally evaluated. The first considers pairs of features and the second incorporates iii feature weights calculated by the RELIEF algorithm. Experiments on artificial domains showed that both methods were able to identify interacting features. On natural domains, the pairwise method gave more reliable results than using weights provided by RELIEF.

...read moreread less

3,533 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189

Collapse