Book Chapter

Partially Synthesised Dataset to Improve Prediction Accuracy

TL;DR: Introduces the use of partially synthesised data to improve the prediction of heart disease from risk factors, with the synthetic part generated by a rule-based method in accordance with World Health Organisation criteria.
Abstract: Real-world data sources, such as statistical agencies, library databanks and research institutes, are the major data sources for researchers. Using this type of data has several advantages: it improves the credibility and validity of the experiment and, more importantly, it relates to real-world problems and is typically unbiased. However, such data is often unavailable or inaccessible for several reasons. First, privacy and confidentiality concerns, since the data must be protected on legal and ethical grounds. Second, collecting real-world data is costly and time consuming. Third, the data may simply not exist, particularly for newly arising research subjects. Many studies have therefore turned to fully and/or partially synthesised data instead of real-world data, because it is simple to create, requires relatively little time, and can be generated in whatever quantity the experiment requires. In this context, this study introduces the use of partially synthesised data to improve the prediction of heart disease from risk factors. We propose generating partially synthetic data from agreed principles using a rule-based method, in which an extra risk factor is added to the real-world data. In the conducted experiment, more than 85% of the data was derived from observed values (i.e., real-world data), while the remaining data was synthetically generated using a rule-based method in accordance with World Health Organisation criteria. The analysis revealed an improvement in the variance captured by the first two principal components of the partially synthesised data. A further evaluation was conducted using five popular supervised machine-learning classifiers, for which the partially synthesised data considerably improved the prediction of heart disease: the majority of classifiers approximately doubled their predictive performance using the extra risk factor.

Summary (2 min read)

1 Introduction

  • There is growing interest from external researchers for access to data records collected by statistical agencies, organisations and research institutes.
  • More importantly, this type of data is typically unbiased [4].
  • The available quantities of the real-world data may not be sufficient for the purposes of the experiment.
  • This study introduces a new method of creating synthetic data.
  • Researchers have given considerable attention to the prediction of heart disease.

2 Synthesised Data in Real-world Applications

  • The use of synthetic data has become an appealing alternative in many diverse scientific disciplines, including performance analysis, software testing, privacy protection and synthetic oversampling.
  • These methods have been used as an alternative to identify the direct interactions between brain regions.
  • As they have reported, this method may provide a ground truth data for researchers to validate their proposed medical image processing methods.
  • The experiment and evaluation were conducted using different medical-related datasets, with promising results [14].
  • Using different data mining and statistical analysis methods, they concluded that the synthetic dataset delivers results largely similar to those of the original dataset [15].

3.1 Dataset

  • The experiment was conducted using partially synthesised data, which consists of two parts.
  • The real-world part includes six risk factors extracted from the Cleveland Clinic Foundation heart disease dataset, which is available online at [16].
  • IF-THEN statements modelling the WHO classification are implemented to generate the synthetic part of the data (a minimal sketch follows this list).
  • As noted in WHO's global atlas on cardiovascular disease prevention and control, the risks of coronary heart disease, raised blood pressure, type 2 diabetes and ischaemic stroke increase steadily with increasing BMI [21].
  • A BMI of 25 is shared between the no-risk and low-risk class labels, and so on for the remaining class labels.
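The rule set itself is not reproduced in this summary, so the following is only a minimal sketch of the IF-THEN labelling style it describes, using the published WHO adult BMI cut-off points (the Python rendering and function name are ours, not the authors'):

```python
def who_bmi_category(bmi: float) -> str:
    """Label a BMI value using the principal WHO adult cut-off points."""
    if bmi < 18.5:
        return "underweight"
    elif bmi < 25.0:
        return "normal"
    elif bmi < 30.0:
        return "overweight"
    else:
        return "obese"
```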

3.2 Statistical Analysis

  • This section inspects the feasibility of involving the BMI as another risk factor for improving the prediction of heart diseases.
  • A principal component analysis (PCA) is employed and the data is normalised.
  • Therefore, the z-score normalisation method has been used to transform and unify the ranges of all attributes within the dataset (a sketch of this step follows the list).
  • Figure 3 shows a score plot of the first principal component versus the second principal component of both real and partially synthesised data.
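As a rough illustration of this step (not the authors' code), z-score normalisation followed by projection onto the first two principal components could be done with scikit-learn as follows; the placeholder matrix X merely stands in for the seven risk-factor attributes:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# X: n_samples x n_features matrix of risk factors
# (random placeholder here; in the study this would be the real or partially synthesised set)
rng = np.random.default_rng(0)
X = rng.normal(size=(297, 7))

# z-score normalisation: zero mean, unit variance per attribute
X_std = StandardScaler().fit_transform(X)

# project onto the first two principal components for the score plot
pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)
print("variance explained by PC1 and PC2:", pca.explained_variance_ratio_)
```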

3.3 Results

  • This section utilises five popular supervised machine-learning classifiers to assess both the real and the partially synthesised datasets, in particular evaluating the value of an extra risk factor for improving the prediction of heart disease.
  • The cross-validation is repeated k times, so that each subset is used exactly once for testing (a sketch of this evaluation follows the list).
  • Results indicate a considerably low overall predictive performance, where the prediction accuracy ranges between 51% and 56%.
  • Table 2 introduces the overall predictive performance using partially synthesised data.
  • The majority of classifiers achieved impressive overall results, with more than 90% sensitivity, specificity and prediction accuracy.
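The excerpt does not name the five classifiers or the value of k, so the snippet below is only an illustrative k-fold cross-validation loop over five common supervised classifiers; the particular models, k = 10, and the placeholder data are our assumptions:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Placeholder feature matrix and four-class risk labels standing in for the dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(297, 7))
y = rng.integers(0, 4, size=297)

classifiers = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "k-NN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Neural network": MLPClassifier(max_iter=1000, random_state=0),
}

# k-fold cross-validation: each subset is used exactly once for testing
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, clf in classifiers.items():
    acc = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy = {acc.mean():.3f}")
```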

3.4 Discussion

  • The experiment combined an additional risk factor, synthetically generated using a rule-based method according to WHO standards, with a set of risk factors extracted from real-world data to predict heart disease.
  • It seems that synthesising less than 15% of the data will not have a serious impact on the quality of statistical inference.
  • A rule-based method has been used for this purpose.
  • This strategy can be generalised into many other research areas.
  • In contrast to the first contribution, the method of using an extra risk factor to improve prediction accuracy cannot be generalised as a new way of improving the prediction of other diseases.

4 Conclusion

  • This paper presents the idea of generating synthetic data from agreed principles.
  • The main aim was the improvement of the prediction of heart diseases from risk factors.
  • Partially synthesised data have been used, in which more than 85% of the data was extracted from real-world observations, while the remainder was synthetically generated using a rule-based method in accordance with World Health Organisation criteria.
  • A statistical analysis has shown an improvement in the variance of the data after adding the extra risk factor.
  • The classifiers have approximately doubled their predictive performance using an extra risk factor, which confirms the statistical analysis result.



Partially Synthesised Dataset to Improve Prediction Accuracy
(Case Study: Prediction of Heart Diseases)

Ahmed J. Aljaaf¹, Dhiya Al-Jumeily¹, Abir J. Hussain¹, Paul Fergus¹, Mohammed Al-Jumaily² and Hani Hamdan³

¹ Applied Computing Research Group, Liverpool John Moores University, Byrom Street, Liverpool, L3 3AF, UK
² Dept. of Neurosurgery, Dr. Sulaiman Al Habib Hospital, Dubai Healthcare City, UAE
³ CentraleSupélec, Département Signal & Statistiques, France

A.J.Kaky@2013.ljmu.ac.uk; {d.aljumeily, a.hussain, p.fergus}@ljmu.ac.uk; Hani.Hamdan@centralesupelec.fr
Abstract. Real-world data sources, such as statistical agencies, library databanks and research institutes, are the major data sources for researchers. Using this type of data has several advantages: it improves the credibility and validity of the experiment and, more importantly, it relates to real-world problems and is typically unbiased. However, such data is often unavailable or inaccessible for several reasons. First, privacy and confidentiality concerns, since the data must be protected on legal and ethical grounds. Second, collecting real-world data is costly and time consuming. Third, the data may simply not exist, particularly for newly arising research subjects. Many studies have therefore turned to fully and/or partially synthesised data instead of real-world data, because it is simple to create, requires relatively little time, and can be generated in whatever quantity the experiment requires. In this context, this study introduces the use of partially synthesised data to improve the prediction of heart disease from risk factors. We propose the generation of partially synthetic data from agreed principles using a rule-based method, in which an extra risk factor is added to the real-world data. In the conducted experiment, more than 85% of the data was derived from observed values (i.e., real-world data), while the remaining data was synthetically generated using a rule-based method in accordance with World Health Organisation criteria. The analysis revealed an improvement in the variance captured by the first two principal components of the partially synthesised data. A further evaluation was conducted using five popular supervised machine-learning classifiers, for which the partially synthesised data considerably improved the prediction of heart disease: the majority of classifiers approximately doubled their predictive performance using the extra risk factor.

Keywords: partially synthesised data, prediction, heart diseases, machine learning, rule-based method.

1 Introduction
There is growing interest from external researchers for access to data records collected by statistical agencies, organisations and research institutes. However, the privacy of individuals and confidentiality of data must be protected on legal and ethical grounds. Meanwhile, there is a demand to release sufficiently detailed data to maintain the reality and validity of statistical inference on the target population. To satisfy these desires, one method is to restrict data to approved analyses by authorised individuals. A second method is to release synthetic data rather than observed values, which is typically done using a statistical disclosure control (SDC) technique [1]. The term synthetic data was introduced in 1993 by Rubin [2]. The main aim was to protect the privacy and confidentiality of personal information by releasing synthetically produced data rather than actual data [2]. In general, synthetic data can be created by a computer program using a random number generator or a formula derived from real-world data [4]. There are two approaches to generating synthetic data: fully synthetic and partially synthetic. Under the first approach, all data attributes are synthesised and no real data are released, while under the partially synthetic approach only a subset of the data attributes is synthesised [3, 5].
Real-world data sources, such as statistical agencies, library databanks, research institutes and random generation procedures, are the major sources for researchers. Using this type of data offers a range of advantages. First, the data is relevant to real-world problems, which enables a reliable estimate of the usefulness of the results. Second, it improves the credibility and validity of the experiment. More importantly, this type of data is typically unbiased [4]. However, many studies have shown that synthetically generated data is used instead of real-world data for several reasons, including: (a) the difficulty of using real-world data because of privacy policies; (b) the available quantities of real-world data may not be sufficient for the purposes of the experiment; (c) the collection of real-world data might be impractical, costly or time consuming; and (d) the real-world data might be unavailable, particularly for newly arising research subjects [3, 4]. Moreover, synthetic data has considerable advantages: it is simple to generate, requires relatively little time in comparison with real-world data collection, and a sufficient quantity can be generated to fit the requirements, with diversity and relevance that can mimic real-world data [4].
This study introduces a new method of creating synthetic data. We propose generating synthetic data from agreed principles using a rule-based method, with the aim of improving prediction accuracy, in particular the prediction of heart disease. We target this improvement by adding an extra risk factor, synthetically generated in accordance with the World Health Organisation (WHO) criteria for the classification of underweight, overweight and obesity in adults according to BMI [20]. The experiment is conducted using partially synthesised data, where more than 85% of the data has been extracted from real-world data, while less than 15% has been synthetically generated using the rule-based method. The real-world part of the data consists of six risk factors extracted from the Cleveland Clinic Foundation heart disease dataset, which is available online at [16]. The synthesised part consists of one additional risk factor, synthetically generated from agreed principles. The Cleveland Clinic Foundation heart disease dataset has been used intensively in the majority of studies addressing the early prediction of heart disease; these studies have used the full range of data attributes. In contrast, we extract only the risk factors, which represent the real-world part of the data in this study. A review of studies targeting the prediction of heart disease can be found in [6].
Researchers have given considerable attention to the prediction of heart disease, since early prediction has a significant influence on patient safety: it can contribute to effective and successful treatment before any severe degradation of cardiac output [6]. Heart disease is a public health problem with high societal and economic burdens. It is considered the main cause of frequent hospitalisations in individuals 65 years of age or older, and slightly fewer than 5 million Americans suffer from heart disease [8]. Heart disease has many potential causes; some are illnesses in their own right, while others are secondary to other underlying diseases [7, 8]. The commonest cause of heart failure is coronary disease, accounting for 62% of cases, compared with other risk factors such as hypertension, valvular disease, myocarditis, diabetes, alcohol excess, obesity and smoking [8, 9]. In general, the term heart disease describes a condition in which the heart is unable to pump a sufficient amount of blood around the body [7].
In this paper, we aim to: a) review the latest studies addressing synthetic data generation, b) describe our proposed method of synthetic data generation using a rule-based method in accordance with agreed principles, c) inspect the feasibility of our method and of adding an extra risk factor using principal component analysis, d) evaluate the use of an extra risk factor to improve the prediction of heart disease using five popular supervised machine-learning classifiers, and finally, e) highlight the results and study contributions.
2 Synthesised Data in Real-world Applications
Although the focus and the requirements are quite different in each field, the use of synthetic data has become an appealing alternative in many diverse scientific disciplines, including performance analysis, software testing, privacy protection and synthetic oversampling. Macia et al. [10] proposed the use of fully synthesised data to investigate the performance of machine learning classifiers. As they state, synthetically generated datasets can offer a controlled environment for analysing the performance of machine learning classifiers and therefore provide a better understanding of their behaviour. In the same context, Sojoudi and Doyle [11] used synthetic data generated by an electrical circuit model to investigate the performance of three methods, namely thresholding the correlation matrix, the graphical lasso and the Chow-Liu algorithm, which have been used as an alternative way to identify the direct interactions between brain regions. They observed that the first two methods (i.e., thresholding the correlation matrix and the graphical lasso) are susceptible to errors.
In the area of software evaluation, Whiting et al. [12] contributed to creating fully synthesised data to test visual analytics applications. Their main aim was to enable tool developers to determine the effectiveness of their software within an acceptable time frame. Similarly, Babaee and her colleague [13] introduced the use of synthetic 2D X-ray images to validate medical image processing applications. Initially, a model of an organ is created using modelling software, then the model is converted to a computerised tomography (CT) image by assigning a proper Hounsfield unit to each voxel. As they report, this method may provide ground truth data for researchers to validate their proposed medical image processing methods.
In another study, researchers targeting imbalanced learning problems used a kernel density estimation method to construct a partially synthesised oversampling approach that addresses imbalanced class distributions in a given data set. The experiment and evaluation were conducted using different medical-related datasets, with promising results [14]. Finally, Park and his partners introduced a non-parametric synthetic data generation method for privacy protection of healthcare data. They claim that their proposed method synthesises artificial records while maintaining the statistical features of the original records to the maximum extent possible. Using different data mining and statistical analysis methods, they concluded that the synthetic dataset delivers results largely similar to those of the original dataset [15].
3 Materials and Methods
3.1 Dataset
The experiment was conducted using partially synthesised data, which consists of two parts. The first part is real-world observations (i.e., risk factors), which represent 85.72% of the data. This part includes six risk factors extracted from the Cleveland Clinic Foundation heart disease dataset, which is available online at [16]. These risk factors are the patient's age, gender, resting blood pressure, serum cholesterol, fasting blood sugar and maximum heart rate. This study selects these risk factors on the basis of established research in cardiovascular disease. As presented in the Framingham heart study, which predicts the risk of cardiovascular disease over 30 years, the patient's age, gender, blood pressure, serum cholesterol and blood sugar are considered standard risk factors [17]. Another long-term follow-up study of healthy individuals aged 25-74 years found that a high resting heart rate is an independent risk factor for coronary artery disease incidence or mortality among white and black individuals [18]. The real-world part of the data consists of 297 consistent instances with no missing values. The output class has four labels: no risk, low risk, moderate risk and high risk of developing heart disease.
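For orientation only, a sketch of how this real-world part might be assembled from the publicly available Cleveland file is shown below. The column names follow the standard ordering of the UCI processed.cleveland.data file; the mapping of its 0-4 target onto the four risk labels is not spelled out in the excerpt, so the raw target is kept as-is here:

```python
import pandas as pd

# Standard column ordering of the UCI "processed.cleveland.data" file
cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach",
        "exang", "oldpeak", "slope", "ca", "thal", "num"]
df = pd.read_csv("processed.cleveland.data", header=None, names=cols, na_values="?")

# Dropping rows with missing values leaves the 297 consistent instances
df = df.dropna()

# The six risk factors used as the real-world part of the data
risk_factors = ["age", "sex", "trestbps", "chol", "fbs", "thalach"]
X = df[risk_factors]
y = df["num"]  # 0-4 in the raw file; the paper's four risk labels are derived from this
```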

Figure 1. Distribution of age
Around two-thirds of the data are for male individuals (201 instances). The mean age of individuals in the data set is 54 years. Figure 1 presents the age distribution, which clearly shows that the risk of developing heart disease starts approximately in the fourth decade of life. The risk then roughly doubles every ten years, reaching its peak in the sixth decade of life. As shown by many studies, aging poses the largest risk factor for cardiovascular diseases: aging is associated with changes in cardiovascular tissues that lead to the loss of arterial elasticity and increased arterial thickening and stiffness. These changes may subsequently contribute to hypertension, stroke and atrial fibrillation [19].
The second part is an additional risk factor (i.e., body mass index, BMI), which represents 14.28% of the data and has been synthetically generated in accordance with the World Health Organisation (WHO) criteria for the classification of underweight, overweight and obesity in adults according to BMI [20]. This study considers adding BMI as an additional risk factor because: a) it was neither collected with the original data nor used with the same data for inference; b) an increase in BMI can dramatically increase the prospect of heart disease. A study conducted in the USA showed that a 30% reduction in the proportion of obese people would prevent approximately 44 thousand cases of heart disease each year [7], and being obese has been shown to double the risk of heart disease [7, 8]. Finally, c) adding an additional risk factor would potentially improve the early prediction of heart disease. The second part (i.e., the synthetic data) has been generated using a rule-based method, in which we model the WHO criteria in the form of IF-THEN statements. These statements are implemented to generate the synthetic part of the data. The WHO classification identifies principal cut-off points to categorise individuals according to their BMI, which is a manner of labelling someone as underweight, normal, overweight or obese.
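The excerpt ends before the full rule set is listed. Purely as an illustration of the rule-based style described above, a synthetic BMI value consistent with a chosen WHO category could be drawn as follows; the WHO cut-off points are factual, but the outer sampling bounds, the function name and the idea of sampling uniformly within a range are our assumptions, not the authors' procedure:

```python
import random

rng = random.Random(0)

# WHO principal cut-off points expressed as ranges
# (the outermost lower/upper bounds are chosen here only to make sampling possible)
WHO_BMI_RANGES = {
    "underweight": (15.0, 18.4),
    "normal":      (18.5, 24.9),
    "overweight":  (25.0, 29.9),
    "obese":       (30.0, 40.0),
}

def synthesise_bmi(category: str) -> float:
    """Draw a synthetic BMI value that falls inside the given WHO category."""
    low, high = WHO_BMI_RANGES[category]
    return round(rng.uniform(low, high), 1)

# e.g. synthesise_bmi("overweight") -> a value in [25.0, 29.9]
```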

Citations
Journal ArticleDOI
TL;DR: In this paper, the authors examined existing literature to bridge the gap and highlight the utility of synthetic data in health care, and identified seven use cases of synthetic datasets in health care: simulation and prediction research; hypothesis, methods, and algorithm testing; epidemiology/public health research; health IT development; education and training; public release of datasets; and linking data.
Abstract: Data are central to research, public health, and in developing health information technology (IT) systems. Nevertheless, access to most data in health care is tightly controlled, which may limit innovation, development, and efficient implementation of new research, products, services, or systems. Using synthetic data is one of the many innovative ways that can allow organizations to share datasets with broader users. However, only a limited set of literature is available that explores its potentials and applications in health care. In this review paper, we examined existing literature to bridge the gap and highlight the utility of synthetic data in health care. We searched PubMed, Scopus, and Google Scholar to identify peer-reviewed articles, conference papers, reports, and thesis/dissertations articles related to the generation and use of synthetic datasets in health care. The review identified seven use cases of synthetic data in health care: a) simulation and prediction research, b) hypothesis, methods, and algorithm testing, c) epidemiology/public health research, d) health IT development, e) education and training, f) public release of datasets, and g) linking data. The review also identified readily and publicly accessible health care datasets, databases, and sandboxes containing synthetic data with varying degrees of utility for research, education, and software development. The review provided evidence that synthetic data are helpful in different aspects of health care and research. While the original real data remains the preferred choice, synthetic data hold possibilities in bridging data access gaps in research and evidence-based policymaking.

2 citations

References
Journal ArticleDOI
TL;DR: Some of the key genes involved in regulating lifespan and health span, including sirtuins, AMP-activated protein kinase, mammalian target of rapamycin, and insulin-like growth factor 1, are reviewed, along with their roles in regulating cardiovascular health.
Abstract: The average lifespan of humans is increasing, and with it the percentage of people entering the 65 and older age group is growing rapidly and will continue to do so in the next 20 years. Within this age group, cardiovascular disease will remain the leading cause of death, and the cost associated with treatment will continue to increase. Aging is an inevitable part of life and unfortunately poses the largest risk factor for cardiovascular disease. Although numerous studies in the cardiovascular field have considered both young and aged humans, there are still many unanswered questions as to how the genetic pathways that regulate aging in model organisms influence cardiovascular aging. Likewise, in the molecular biology of aging field, few studies fully assess the role of these aging pathways in cardiovascular health. Fortunately, this gap is beginning to close, and these two fields are merging together. We provide an overview of some of the key genes involved in regulating lifespan and health span, including sirtuins, AMP-activated protein kinase, mammalian target of rapamycin, and insulin-like growth factor 1 and their roles regulating cardiovascular health. We then discuss a series of review articles that will appear in succession and provide a more comprehensive analysis of studies carried out linking genes of aging and cardiovascular health, and perspectives of future directions of these two intimately linked fields.

909 citations

Journal ArticleDOI
TL;DR: In this paper, a modified Cox model that allows adjustment for competing risk of noncardiovascular death was used to construct a prediction algorithm for 30-year risk of hard CVD events (coronary death, myocardial infarction, stroke).
Abstract: Background— Present cardiovascular disease (CVD) risk prediction algorithms were developed for a ≤10-year follow up period. Clustering of risk factors at younger ages and increasing life expectancy suggest the need for longer-term risk prediction tools. Methods and Results— We prospectively followed 4506 participants (2333 women) of the Framingham Offspring cohort aged 20 to 59 years and free of CVD and cancer at baseline examination in 1971–1974 for the development of “hard” CVD events (coronary death, myocardial infarction, stroke). We used a modified Cox model that allows adjustment for competing risk of noncardiovascular death to construct a prediction algorithm for 30-year risk of hard CVD. Cross-validated survival C statistic and calibration χ2 were used to assess model performance. The 30-year hard CVD event rates adjusted for the competing risk of death were 7.6% for women and 18.3% for men. Standard risk factors (male sex, systolic blood pressure, antihypertensive treatment, total and high-densit...

724 citations

Journal ArticleDOI
TL;DR: Risks of death from all causes, cardiovascular diseases, and noncardiovascular diseases were also elevated for white men with elevated pulse rate independent of other risk factors, and the association with cardiovascular death was particularly striking in black women, even after adjusting for baseline risk factors.

418 citations

Journal ArticleDOI
Véronique L. Roger
TL;DR: In this paper, the authors have shown that, over time, the incidence of heart failure remained overall stable, while survival improved, and therefore, the heart failure epidemic is chiefly one of hospitalizations.
Abstract: Heart failure has been singled out as an emerging epidemic, which could be the result of increased incidence and/or increased survival leading to increased prevalence. Knowledge of the responsibility of each factor in the genesis of the epidemic is crucial for prevention. Population-based studies have shown that, over time, the incidence of heart failure remained overall stable, while survival improved. Therefore, the heart failure epidemic is chiefly one of hospitalizations. Data on temporal trends in the incidence and prevalence of heart failure according to ejection fraction and how it may have changed over time are needed while interventions should focus on reducing the burden of hospitalizations in heart failure.

296 citations

Proceedings ArticleDOI
25 May 2006
TL;DR: This study focused on different types of normalization, each of which was tested against the ID3 methodology using the HSV data set, and recommended normalization methods based on the factors and their priorities.
Abstract: This study is emphasized on different types of normalization. Each of which was tested against the ID3 methodology using the HSV data set. Number of leaf nodes, accuracy, and tree growing time are three factors that were taken into account. Comparisons between different learning methods were accomplished as they were applied to each normalization method. A simple matrix was designed to check for the best normalization method based on the factors and their priorities. Recommendations were concluded.

181 citations

Frequently Asked Questions (1)
Q1. What contributions have the authors mentioned in the paper "Partially synthesised dataset to improve prediction accuracy (case study: prediction of heart diseases)"?

However, this type of data is often unavailable or inaccessible. In this context, the study introduces the use of partially synthesised data to improve the prediction of heart disease from risk factors, and evaluates it using five popular supervised machine-learning classifiers.