A Comparative Analysis on Improving Covid-19 Prediction by Using Ensemble Learning Methods

Elif Kartal
01 Jan 2022, pp. 3-14
TL;DR: In this article, the performance of base machine learning models was compared with the ensemble ones (majority voting, C5.0, stochastic gradient boosting, bagged CART, random forest, and stacking models) on a public Covid-19 dataset.
Abstract: This study aims to improve Covid-19 prediction, specifically the distinction between Covid-19 and Flu, by using several well-known ensemble learning methods, namely majority voting, bagging, boosting, and stacking. To this end, the performance of base machine learning models was compared with that of ensemble models (majority voting, C5.0, stochastic gradient boosting, bagged CART, random forest, and stacking) on a public Covid-19 dataset in which observations are labelled as Covid-19 or Flu. Since the task is a classification problem, supervised machine learning algorithms (logistic regression via a generalized linear model, classification and regression trees, artificial neural networks, and support vector machines) are used as base learners. The Cross-Industry Standard Process Model for Data Mining (CRISP-DM), which consists of six stages (business understanding, data understanding, data preparation, modeling, evaluation, and deployment), is used as the study method. In the model performance evaluation stage, an additional metric was proposed that considers both the accuracy and its change interval (max minus min). The performance of the models is discussed in terms of accuracy and the proposed metric. A Shiny application was developed using the best-performing model; it enables users to predict Covid-19 status interactively through a web interface. Analyses were performed with R and RStudio.
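A minimal sketch of the comparison workflow described above, assuming caret (and randomForest) are installed; the public Covid-19/Flu dataset is not reproduced here, so caret's simulated two-class data stands in for it, and the max-min "change interval" computation reflects our reading of the proposed metric:

    library(caret)

    set.seed(42)
    dat  <- twoClassSim(500)                 # stand-in for the Covid-19/Flu data
    ctrl <- trainControl(method = "cv", number = 10)

    # A base learner (CART) versus an ensemble (random forest), as in the study
    fit_cart <- train(Class ~ ., data = dat, method = "rpart", trControl = ctrl)
    fit_rf   <- train(Class ~ ., data = dat, method = "rf",    trControl = ctrl)

    # Mean accuracy plus its change interval (max - min) across resamples
    summarise_fit <- function(fit) {
      acc <- fit$resample$Accuracy
      c(mean_accuracy = mean(acc), change_interval = max(acc) - min(acc))
    }
    sapply(list(CART = fit_cart, RandomForest = fit_rf), summarise_fit)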
References
Book
13 Aug 2009
TL;DR: This book describes ggplot2, a new data visualization package for R that uses the insights from Leland Wilkinson's Grammar of Graphics to create a powerful and flexible system for creating data graphics.
Abstract: This book describes ggplot2, a new data visualization package for R that uses the insights from Leland Wilkinson's Grammar of Graphics to create a powerful and flexible system for creating data graphics. With ggplot2, it's easy to: produce handsome, publication-quality plots with automatic legends created from the plot specification; superpose multiple layers (points, lines, maps, tiles, box plots, to name a few) from different data sources, with automatically adjusted common scales; add customisable smoothers that use the powerful modelling capabilities of R, such as loess, linear models, generalised additive models, and robust regression; save any ggplot2 plot (or part thereof) for later modification or reuse; create custom themes that capture in-house or journal style requirements and that can easily be applied to multiple plots; and approach your graph from a visual perspective, thinking about how each component of the data is represented on the final plot. This book will be useful to everyone who has struggled with displaying their data in an informative and attractive way. You will need some basic knowledge of R (i.e. you should be able to get your data into R), but ggplot2 is a mini-language specifically tailored for producing graphics, and you'll learn everything you need in the book. After reading this book you'll be able to produce graphics customized precisely for your problems, and you'll find it easy to get graphics out of your head and on to the screen or page.
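A small illustration of the layered grammar the blurb describes, using R's built-in mtcars data; raw points and a loess smoother are superposed as separate layers on shared scales, with the legend generated automatically from the specification:

    library(ggplot2)

    ggplot(mtcars, aes(x = wt, y = mpg)) +
      geom_point(aes(colour = factor(cyl))) +  # one layer: the raw observations
      geom_smooth(method = "loess") +          # a second layer: a loess smoother
      labs(colour = "Cylinders", x = "Weight (1000 lbs)", y = "Miles per gallon")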

29,504 citations

Book
28 Jul 2013
TL;DR: In this book, the authors describe the important ideas in these areas in a common conceptual framework; the emphasis is on concepts rather than mathematics, with a liberal use of color graphics.
Abstract: During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It is a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees, and boosting, the first comprehensive treatment of this topic in any book. This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression and path algorithms for the lasso, non-negative matrix factorization, and spectral clustering. There is also a chapter on methods for "wide" data (p bigger than n), including multiple testing and false discovery rates. Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie co-developed much of the statistical modeling software and environment in R/S-PLUS and invented principal curves and surfaces. Tibshirani proposed the lasso and is co-author of the very successful An Introduction to the Bootstrap. Friedman is the co-inventor of many data-mining tools including CART, MARS, projection pursuit, and gradient boosting.

19,261 citations

01 Jan 2007
TL;DR: Random forests are proposed, which add an additional layer of randomness to bagging and are robust against overfitting; the randomForest package provides an R interface to the Fortran programs by Breiman and Cutler.
Abstract: Recently there has been a lot of interest in "ensemble learning": methods that generate many classifiers and aggregate their results. Two well-known methods are boosting (see, e.g., Schapire et al., 1998) and bagging (Breiman, 1996) of classification trees. In boosting, successive trees give extra weight to points incorrectly predicted by earlier predictors. In the end, a weighted vote is taken for prediction. In bagging, successive trees do not depend on earlier trees; each is independently constructed using a bootstrap sample of the data set. In the end, a simple majority vote is taken for prediction. Breiman (2001) proposed random forests, which add an additional layer of randomness to bagging. In addition to constructing each tree using a different bootstrap sample of the data, random forests change how the classification or regression trees are constructed. In standard trees, each node is split using the best split among all variables. In a random forest, each node is split using the best among a subset of predictors randomly chosen at that node. This somewhat counterintuitive strategy turns out to perform very well compared to many other classifiers, including discriminant analysis, support vector machines, and neural networks, and is robust against overfitting (Breiman, 2001). In addition, it is very user-friendly in the sense that it has only two parameters (the number of variables in the random subset at each node and the number of trees in the forest), and is usually not very sensitive to their values. The randomForest package provides an R interface to the Fortran programs by Breiman and Cutler (available at http://www.stat.berkeley.edu/users/breiman/). This article provides a brief introduction to the usage and features of the R functions.
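A short usage sketch of the randomForest interface the article introduces, shown here on R's built-in iris data; ntree and mtry are exactly the two parameters discussed above:

    library(randomForest)

    set.seed(1)
    fit <- randomForest(Species ~ ., data = iris,
                        ntree = 500,  # number of trees in the forest
                        mtry  = 2)    # predictors tried at each split

    print(fit)        # OOB error estimate and confusion matrix
    importance(fit)   # variable importance scores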

14,830 citations

Journal Article
TL;DR: In this comparative study, missForest outperforms other imputation methods, especially in data settings where complex interactions and non-linear relations are suspected, and the out-of-bag imputation error estimates of missForest prove to be adequate in all settings.
Abstract:
Motivation: Modern data acquisition based on high-throughput technology is often facing the problem of missing data. Algorithms commonly used in the analysis of such large-scale data often depend on a complete set. Missing value imputation offers a solution to this problem. However, the majority of available imputation methods are restricted to one type of variable only: continuous or categorical. For mixed-type data, the different types are usually handled separately. Therefore, these methods ignore possible relations between variable types. We propose a non-parametric method which can cope with different types of variables simultaneously.
Results: We compare several state-of-the-art methods for the imputation of missing values. We propose and evaluate an iterative imputation method (missForest) based on a random forest. By averaging over many unpruned classification or regression trees, random forest intrinsically constitutes a multiple imputation scheme. Using the built-in out-of-bag error estimates of random forest, we are able to estimate the imputation error without the need for a test set. Evaluation is performed on multiple datasets coming from a diverse selection of biological fields with artificially introduced missing values ranging from 10% to 30%. We show that missForest can successfully handle missing values, particularly in datasets including different types of variables. In our comparative study, missForest outperforms other methods of imputation, especially in data settings where complex interactions and non-linear relations are suspected. The out-of-bag imputation error estimates of missForest prove to be adequate in all settings. Additionally, missForest exhibits attractive computational efficiency and can cope with high-dimensional data.
Availability: The package missForest is freely available from http://stat.ethz.ch/CRAN/.
Contact: stekhoven@stat.math.ethz.ch; buhlmann@stat.math.ethz.ch
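A minimal usage sketch of the missForest workflow described above, assuming the CRAN package; following the paper's evaluation setup, missing values are injected artificially (here with the package's prodNA helper) and the out-of-bag imputation error is read off alongside the completed data:

    library(missForest)

    set.seed(7)
    iris_mis <- prodNA(iris, noNA = 0.1)  # inject 10% missing values at random

    imp <- missForest(iris_mis)
    head(imp$ximp)   # the imputed (completed) dataset
    imp$OOBerror     # OOB imputation error: NRMSE (continuous), PFC (categorical)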

2,928 citations

Book
06 Jun 2012
TL;DR: An up-to-date, self-contained introduction to a state-of-the-art machine learning approach, Ensemble Methods: Foundations and Algorithms shows how ensemble methods are used in real-world tasks and gives the necessary groundwork to carry out further research in this evolving field.
Abstract: An up-to-date, self-contained introduction to a state-of-the-art machine learning approach, Ensemble Methods: Foundations and Algorithms shows how these accurate methods are used in real-world tasks. It gives you the necessary groundwork to carry out further research in this evolving field. After presenting background and terminology, the book covers the main algorithms and theories, including Boosting, Bagging, Random Forest, averaging and voting schemes, the Stacking method, mixture of experts, and diversity measures. It also discusses multiclass extension, noise tolerance, error-ambiguity and bias-variance decompositions, and recent progress in information theoretic diversity. Moving on to more advanced topics, the author explains how to achieve better performance through ensemble pruning and how to generate better clustering results by combining multiple clusterings. In addition, he describes developments of ensemble methods in semi-supervised learning, active learning, cost-sensitive learning, class-imbalance learning, and comprehensibility enhancement.
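A toy illustration of the voting schemes the book covers (and of the majority-voting ensemble used in the study above): three diverse base classifiers are combined by simple majority vote on the iris data. The choice of rpart, e1071, and nnet as base learners is our own assumption, not the book's:

    library(rpart)
    library(e1071)
    library(nnet)

    set.seed(3)
    idx   <- sample(nrow(iris), 100)
    train <- iris[idx, ]
    test  <- iris[-idx, ]

    # Three different base classifiers, each predicting the test labels
    p1 <- predict(rpart(Species ~ ., train), test, type = "class")
    p2 <- predict(svm(Species ~ ., train), test)
    p3 <- predict(multinom(Species ~ ., train, trace = FALSE), test)

    # Simple majority vote across the three predictions
    votes    <- cbind(as.character(p1), as.character(p2), as.character(p3))
    majority <- apply(votes, 1, function(v) names(which.max(table(v))))
    mean(majority == test$Species)   # accuracy of the voted ensemble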

1,834 citations