A comprehensive data level analysis for cancer diagnosis on imbalanced data.

Characteristics extracted from the training datasets of classification problems have proven to be effective predictors in a number of meta-analyses. Among them, measures of classification complexity can be used to estimate the difficulty in separating the data points into their expected classes. Descriptors of the spatial distribution of the data and estimates of the shape and size of the decision boundary are among the known measures for this characterization. This information can support the formulation of new data-driven pre-processing and pattern recognition techniques, which can in turn be focused on challenges highlighted by such characteristics of the problems. This article surveys and analyzes measures that can be extracted from the training datasets to characterize the complexity of the respective classification problems. Their use in recent literature is also reviewed and discussed, making it possible to identify opportunities for future work in the area. Finally, an R package named Extended Complexity Library (ECoL), which implements a set of complexity measures and is publicly available, is described.

How Complex Is Your Classification Problem?: A Survey on Measuring Classification Complexity

Background: Amyotrophic Lateral Sclerosis (ALS) is a relentlessly progressive neurodegenerative condition with limited therapeutic options at present. Survival from symptom onset ranges from 3 to 5 years depending on genetic, demographic, and phenotypic factors. Despite tireless research efforts, the core etiology of the disease remains elusive and drug development efforts are confounded by the lack of accurate monitoring markers. Disease heterogeneity, late-stage recruitment into pharmaceutical trials, and inclusion of phenotypically admixed patient cohorts are some of the key barriers to successful clinical trials. Machine Learning (ML) models and large international data sets offer unprecedented opportunities to appraise candidate diagnostic, monitoring, and prognostic markers. Accurate patient stratification into well-defined prognostic categories is another aspiration of emerging classification and staging systems.
Methods: The objective of this paper is the comprehensive, systematic, and critical review of ML initiatives in ALS to date and their potential in research, clinical, and pharmacological applications. The focus of this review is to provide a dual, clinical-mathematical perspective on recent advances and future directions of the field. Another objective of the paper is the frank discussion of the pitfalls and drawbacks of specific models, highlighting the shortcomings of existing studies and providing methodological recommendations for future study designs.
Results: Despite considerable sample size limitations, ML techniques have already been successfully applied to ALS data sets and a number of promising diagnosis models have been proposed. Prognostic models have been tested using core clinical variables, biological, and neuroimaging data. These models also offer patient stratification opportunities for future clinical trials. Despite the enormous potential of ML in ALS research, statistical assumptions are often violated, the choice of specific statistical models is seldom justified, and the constraints of ML models are rarely enunciated.
Conclusions: From a mathematical perspective, the main barrier to the development of validated diagnostic, prognostic, and monitoring indicators stems from limited sample sizes. The combination of multiple clinical, biofluid, and imaging biomarkers is likely to increase the accuracy of mathematical modeling and contribute to optimized clinical trial designs.

Machine Learning in Amyotrophic Lateral Sclerosis: Achievements, Pitfalls, and Future Directions.

As predicting the trajectory of COVID-19 is challenging, machine learning models could assist physicians in identifying high-risk individuals. This study compares the performance of 18 machine learning algorithms for predicting ICU admission and mortality among COVID-19 patients. Using COVID-19 patient data from the Mass General Brigham (MGB) Healthcare database, we developed and internally validated models using patients presenting to the Emergency Department (ED) between March-April 2020 (n = 3597) and further validated them using temporally distinct individuals who presented to the ED between May-August 2020 (n = 1711). We show that ensemble-based models perform better than other model types at predicting both 5-day ICU admission and 28-day mortality from COVID-19. CRP, LDH, and O2 saturation were important for ICU admission models whereas eGFR <60 ml/min/1.73 m², and neutrophil and lymphocyte percentages were the most important variables for predicting mortality. Implementing such models could help in clinical decision-making for future infectious disease outbreaks including COVID-19.

Comparing machine learning algorithms for predicting ICU admission and mortality in COVID-19.

COVID-19, otherwise known as the coronavirus, has precipitated the world into a pandemic that has infected, as of the time of writing, more than 10 million persons worldwide and caused the death of more than 500,000 persons. Early symptoms of the virus include trouble breathing, fever and fatigue and over 60% of people experience a dry cough. Due to the devastating impact of COVID-19 and the tragic loss of lives, it is of the utmost urgency to develop methods for the early detection of the disease that may help limit its spread as well as aid in the development of targeted solutions. Coughs and other vocal sounds contain pulmonary health information that can be used for diagnostic purposes, and recent studies in chaotic dynamics have shown that nonlinear phenomena exist in vocal signals. The present work investigates the use of symbolic recurrence quantification measures with MFCC features for the automatic detection of COVID-19 in cough sounds of healthy and sick individuals. Our performance evaluation reveals that our symbolic dynamics measures capture the complex dynamics in the vocal sounds and are highly effective at discriminating sick and healthy coughs. We apply our method to sustained vowel ’ah’ recordings, and show that our model is robust for the detection of the disease in sustained vowel utterances as well. Furthermore, we introduce a robust novel method of informative undersampling using information rate to deal with the imbalance in our dataset, due to the unavailability of an equal number of sick and healthy recordings. The proposed model achieves a mean classification performance of 97% and 99%, and a mean $F_1$-score of 91% and 89% after optimization, for coughs and sustained vowels, respectively.

Robust Detection of COVID-19 in Cough Sounds: Using Recurrence Dynamics and Variable Markov Model

Although cross-validation is a standard procedure for performance evaluation, its joint application with oversampling remains an open question for researchers less familiar with the imbalanced data topic. A frequent experimental flaw is the application of oversampling algorithms to the entire dataset before it is split, resulting in biased models and overly optimistic estimates.

Cross-Validation for Imbalanced Datasets: Avoiding Overoptimistic and Overfitting Approaches [Research Frontier]
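
The flaw described in the abstract above can be avoided by oversampling only the training portion of each fold, so that no copy of a test sample ever reaches the model during fitting. The following is a minimal sketch of that idea in plain Python, not the procedure used in the paper: the function names, the random-duplication oversampler, the nearest-centroid classifier on a single numeric feature, and the use of accuracy as the score are all assumptions chosen to keep the example self-contained.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out

def fit_centroids(X, y):
    """Toy classifier (assumption): store the mean feature value of each class."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        sums[yi] = sums.get(yi, 0.0) + xi
        counts[yi] = counts.get(yi, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict_centroid(model, x):
    """Predict the class whose stored mean is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))

def cross_validate_with_oversampling(X, y, k=5, seed=0):
    """Return per-fold accuracy, oversampling the training split only."""
    scores = []
    for fold in k_fold_indices(len(X), k, seed):
        test = set(fold)
        X_tr = [X[i] for i in range(len(X)) if i not in test]
        y_tr = [y[i] for i in range(len(X)) if i not in test]
        # Oversampling happens AFTER the split, so test samples stay unseen.
        X_tr, y_tr = random_oversample(X_tr, y_tr, seed)
        model = fit_centroids(X_tr, y_tr)
        correct = sum(predict_centroid(model, X[i]) == y[i] for i in fold)
        scores.append(correct / len(fold))
    return scores

if __name__ == "__main__":
    # Hypothetical imbalanced data: 90 majority samples, 10 minority samples.
    X = [i / 100 for i in range(90)] + [5 + i / 10 for i in range(10)]
    y = [0] * 90 + [1] * 10
    print(cross_validate_with_oversampling(X, y))
```

Calling `random_oversample` on the full dataset before `k_fold_indices` would reproduce the flawed setup: duplicates of the same minority sample could then land in both the training and the test split. Accuracy is used here only for brevity; on imbalanced data a class-sensitive metric is usually more informative.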

The performance evaluation of imputation algorithms often involves the generation of missing values. Missing values can be inserted in only one feature (univariate configuration) or in several features (multivariate configuration) at different percentages (missing rates) and according to distinct missing mechanisms, namely, missing completely at random, missing at random, and missing not at random. Since the missing data generation process defines the basis for the imputation experiments (configuration, missing rate, and missing mechanism), it is essential that it is appropriately applied; otherwise, conclusions derived from ill-defined setups may be invalid. The goal of this paper is to review the different approaches to synthetic missing data generation found in the literature and discuss their practical details, elaborating on their strengths and weaknesses. Our analysis revealed that creating missing at random and missing not at random scenarios in datasets comprising qualitative features is the most challenging issue in the related work and, therefore, should be the focus of future work in the field.

Generating Synthetic Missing Data: A Review by Missing Mechanism
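
The three missing mechanisms named in the abstract above can be illustrated for the univariate configuration (missing values inserted in a single feature). The sketch below is one possible implementation under stated assumptions, not a method taken from the review: the function names are hypothetical, rows are lists of numeric values, `None` marks a missing value, and the MAR and MNAR variants simply remove the values in the rows that rank lowest on the driving feature.

```python
import random

def _blank(rows, target, chosen):
    """Return a copy of rows with column `target` set to None in the chosen rows."""
    return [
        [None if (i in chosen and j == target) else v for j, v in enumerate(row)]
        for i, row in enumerate(rows)
    ]

def _lowest(rows, column, n):
    """Indices of the n rows with the smallest values in `column`."""
    order = sorted(range(len(rows)), key=lambda i: rows[i][column])
    return set(order[:n])

def mcar(rows, target, rate, seed=0):
    """Missing completely at random: rows are chosen uniformly, ignoring all values."""
    rng = random.Random(seed)
    n_missing = round(rate * len(rows))
    chosen = set(rng.sample(range(len(rows)), n_missing))
    return _blank(rows, target, chosen)

def mar(rows, target, rate, observed):
    """Missing at random: missingness in `target` depends on another, observed column."""
    n_missing = round(rate * len(rows))
    return _blank(rows, target, _lowest(rows, observed, n_missing))

def mnar(rows, target, rate):
    """Missing not at random: missingness in `target` depends on its own values."""
    n_missing = round(rate * len(rows))
    return _blank(rows, target, _lowest(rows, target, n_missing))

if __name__ == "__main__":
    # Hypothetical complete dataset with two numeric features.
    data = [[i, 10 - i] for i in range(10)]
    for name, result in [
        ("MCAR", mcar(data, target=1, rate=0.3)),
        ("MAR", mar(data, target=1, rate=0.3, observed=0)),
        ("MNAR", mnar(data, target=1, rate=0.3)),
    ]:
        print(name, [row[1] for row in result])
```

All three functions take the same missing rate, so the configuration and rate stay fixed while only the mechanism changes, which is the comparison the abstract describes as the basis of an imputation experiment.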

Missing data is the absence of information in a dataset, and since it directly influences classification performance, neglecting it is not a valid option. Over the years, several studies presented alternative imputation strategies to deal with the three missing data mechanisms, Missing Completely At Random, Missing At Random and Missing Not At Random. However, there are no studies regarding the influence of all these three mechanisms on the latest high-performance Artificial Intelligence techniques, such as Deep Learning. The goal of this work is to perform a comparison study between state-of-the-art imputation techniques and a Stacked Denoising Autoencoders approach. To that end, the missing data mechanisms were synthetically generated in 6 different ways; 8 different imputation techniques were implemented; and finally, 33 complete datasets from different open source repositories were selected. The obtained results showed that Support Vector Machines imputation ensures the best classification performance while Multiple Imputation by Chained Equations performs better in terms of imputation quality.

Missing Data Imputation via Denoising Autoencoders: The Untold Story

Dealing with missing data is a crucial step in the preprocessing stage of most data mining projects. Especially in healthcare contexts, addressing this issue is fundamental, since it may result in keeping or losing critical patient information that can help physicians in their daily clinical practice. Over the years, many researchers have addressed this problem, basing their approach on the implementation of a set of imputation techniques and evaluating their performance in classification tasks. These classic approaches, however, do not consider some intrinsic data information that could be related to the performance of those algorithms, such as features’ distribution. Establishing a correspondence between data distribution and the most suitable imputation method avoids the need to repeatedly test a large set of methods, since it provides a heuristic on the best choice for each feature in the study. The goal of this work is to understand the relationship between data distribution and the performance of well-known imputation techniques, such as Mean, Decision Trees, k-Nearest Neighbours, Self-Organizing Maps and Support Vector Machines imputation. Several publicly available datasets, all complete, were selected according to several characteristics, such as number of distributions, features and instances. Missing values were artificially generated at different percentages and the imputation methods were evaluated in terms of Predictive and Distributional Accuracy. Our findings show that there is a relationship between features’ distribution and algorithms’ performance, although some factors must be taken into account, such as the number of features per distribution and the missing rate at stake.

Influence of Data Distribution in Missing Data Imputation
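
The evaluation described in the abstract above scores an imputation method on two axes: how close the imputed values are to the true ones, and how well the imputed feature preserves the original distribution. The abstract does not define either metric, so the sketch below makes explicit assumptions: root mean squared error over the originally missing positions stands in for predictive accuracy, and the two-sample Kolmogorov-Smirnov statistic stands in for distributional accuracy. Mean imputation is used because it is the simplest of the techniques the abstract lists; all function names are hypothetical.

```python
import math
from bisect import bisect_right

def mean_impute(values):
    """Replace every None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def rmse_on_missing(true, with_missing, imputed):
    """RMSE between true and imputed values, over originally missing positions only."""
    errors = [
        (t - p) ** 2
        for t, m, p in zip(true, with_missing, imputed)
        if m is None
    ]
    return math.sqrt(sum(errors) / len(errors))

def ks_distance(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for x in sorted(set(a + b)):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

if __name__ == "__main__":
    # Hypothetical complete feature and a copy with one value removed.
    true = [1.0, 2.0, 3.0, 4.0, 5.0]
    with_missing = [1.0, 2.0, 3.0, 4.0, None]
    imputed = mean_impute(with_missing)
    print("RMSE on missing positions:", rmse_on_missing(true, with_missing, imputed))
    print("KS distance to the true feature:", ks_distance(true, imputed))
```

Lower values are better on both metrics. The two can disagree: mean imputation may give a small error on average while still distorting the shape of the feature's distribution, which is why the two are reported separately.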

In data imputation problems, researchers typically use several techniques, individually or in combination, in order to find the one that presents the best performance over all the features comprised in the dataset. This strategy, however, neglects the nature of data (data distribution) and makes generalising the findings impractical, since for new datasets, a huge number of new, time-consuming experiments need to be performed. To overcome this issue, this work aims to understand the relationship between data distribution and the performance of standard imputation techniques, providing a heuristic on the choice of proper imputation methods and avoiding the need to test a large set of methods. To this end, several datasets were selected considering different sample sizes, number of features, distributions and contexts, and missing values were inserted at different percentages and scenarios. Then, different imputation methods were evaluated in terms of predictive and distributional accuracy. Our findings show that there is a relationship between features’ distribution and algorithms’ performance, and that their performance seems to be affected by the combination of missing rate and scenario at stake and also other less obvious factors such as sample size, goodness-of-fit of features and the ratio between the number of features and the different distributions comprised in the dataset.

Jastin Pompeu Soares

Papers

Cross-Validation for Imbalanced Datasets: Avoiding Overoptimistic and Overfitting Approaches [Research Frontier]

Generating Synthetic Missing Data: A Review by Missing Mechanism

Missing Data Imputation via Denoising Autoencoders: The Untold Story

Influence of Data Distribution in Missing Data Imputation

Exploring the Effects of Data Distribution in Missing Data Imputation