A clustering approach for autistic trait classification.
Summary
1: Introduction
- Autism Spectrum Disorder (ASD) is a neurodevelopmental condition characterized by delays in individuals' social and communication behaviors.
- The official ASD diagnosis process involves multiple examinations, which in turn makes the waiting time for patients lengthy [40].
- Most of these screening methods have been developed from existing clinical autism diagnosis methods and take the form of questionnaires in which each question has a few possible answers in a multiple-choice fashion [6, 7, 24, 38].
- The screening of ASD traits can be treated as a classification problem in which historical data, already labeled as with or without ASD traits, is used as input to construct a classification system.
- Thus, applying clustering in the pre-processing phase enhances the predictability of the classification algorithm and improves the classifier's accuracy, sensitivity, specificity, and error rates, among other measures.
2: Literature Review
- Crane et al. [17] highlighted some of the challenges to a timely and adequate ASD diagnosis, including the inadequacy of the tools used to aid ASD screening.
- Thabtah et al. [41] improved the efficiency of the screening process by reducing the number of items in the self-assessment screening tool AQ-10 [3].
- The authors applied their datasets to Random Forest classifiers.
- The authors also pointed out that while the studies showed promising results, none were embedded in a screening tool.
- The authors showed that only ten items are needed to screen for first-level ASD traits.
3: The Proposed Clustering based Autistic Trait Classification (CATC)
- The authors discuss the proposed CATC method based on the architecture shown in Figure 1 below.
- Three datasets (adult, adolescent, and child) are collected via a mobile screening app called ASDTests [37, 38].
- The data is then cleaned for their experiments and run through an unsupervised machine learning clustering algorithm.
- The result of this process forms the initial model that is loaded into a classifier for the predictive phase.
- Further details for each of the steps are outlined in the subsections that follow.
3.1: Data Collection
- Initially, data is collected using a mobile screening tool called ASDTests [37, 38].
- The child, adolescent, and adult datasets contain instances for individuals aged 4-11 years, 12-16 years, and above 16 years, respectively.
- A score of 6 or above, based on [3], indicates that the individual has some ASD traits, and the class is labeled YES.
- Otherwise, the class is given a value of NO.
- The size of the datasets varies between the three groups.
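The scoring rule above can be sketched as a small function. This is a hypothetical illustration of the thresholding described (the actual app logic is not shown in the summary):

```python
def assign_class_label(screening_score: int, threshold: int = 6) -> str:
    """Map a total screening score to the class label used in the datasets:
    a score of 6 or above indicates some ASD traits (YES), otherwise NO."""
    return "YES" if screening_score >= threshold else "NO"

print(assign_class_label(7))  # YES
print(assign_class_label(4))  # NO
```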
Ethical Considerations
- The data was published and made public [25] by its original author, Thabtah et al. [40].
- The authors of the datasets had obtained ethical approval from the University of Huddersfield, Huddersfield, UK.
3.2: The initial Dataset and Data Transformation
- The initial datasets are of multivariable nature with categorical, continuous and binary attributes that contain a total of 23 features (see Table 2 ).
- An answer of "slightly disagree" or "definitely disagree" scored "1" on all remaining questions.
- The authors modified the datasets to include only 18 attributes by removing the features marked 16-22 in Table 2 below from the three datasets.
- The removed features are general questions regarding the user and the app.
- The "Screening Score" (Feature #19 in Table 2) is among the removed features.
3.3: Clustering Phase
- The datasets are pre-processed by applying an unsupervised machine learning clustering method.
- The authors employ the OMCOKE algorithm which groups all items into two clusters.
- The centroids are recomputed and the process is repeated until there is no movement or change in the assignment of data points to their closest centroid.
- Algorithm 1 below summarizes the OMCOKE clustering.
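OMCOKE itself is not detailed here, but the centroid-reassignment loop that Algorithm 1 summarizes is the familiar k-means skeleton: assign each point to its closest centroid, recompute centroids, and repeat until assignments stop changing. A minimal two-cluster sketch under that assumption (OMCOKE's own refinements are not reproduced):

```python
import numpy as np

def two_cluster_kmeans(X, max_iter=100, seed=0):
    """Generic two-cluster centroid-reassignment loop, as in Algorithm 1."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # initialize the two centroids from two distinct data points
    centroids = X[rng.choice(len(X), size=2, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # assign each point to its closest of the two centroids
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no change in assignments: converged
        labels = new_labels
        # recompute each centroid as the mean of its assigned points
        for k in range(2):
            mask = labels == k
            if mask.any():
                centroids[k] = X[mask].mean(axis=0)
    return labels, centroids

labels, centroids = two_cluster_kmeans([[0, 0], [0, 1], [10, 10], [10, 11]])
```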
3.4: Classification Phase
- The datasets contain a Boolean attribute named "Class" that takes a value of YES/NO based on the screening score.
- The Class attribute indicates whether the user has been screened as having ASD or not and is used in the supervised learning algorithm for prediction.
- The cluster assignments are then compared to the Class attribute to see if they match.
- Where there is a match, the authors keep that instance; otherwise, they discard it and remove it from the dataset.
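The match-and-filter step described above can be sketched as follows. Names and the label mapping are hypothetical; in practice the cluster assignments must first be mapped onto the YES/NO labels before comparison:

```python
def filter_by_agreement(instances, cluster_labels, class_labels):
    """Keep only the instances whose (mapped) cluster assignment agrees
    with the existing Class label; discard the rest."""
    return [row for row, clu, cls in zip(instances, cluster_labels, class_labels)
            if clu == cls]

# Toy example: the second instance disagrees with its Class label, so it is dropped.
data = [["a"], ["b"], ["c"], ["d"]]
clusters = ["YES", "NO", "YES", "NO"]
classes = ["YES", "YES", "YES", "NO"]
kept = filter_by_agreement(data, clusters, classes)
print(kept)  # [['a'], ['c'], ['d']]
```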
Key features of applying the CATC process include:
- (1) Grouping the data items into two clusters based on their strong attributes.
- The clustering algorithm assisted in identifying the relevant, strong features that alone were used in the supervised learning models.
- (2) Reducing data dimensionality by eliminating redundancy.
- The authors adopt the clustering-based autistic traits dataset, which has been streamlined and enhanced for use in the learning phase of the machine learning process.
- Assume the simple dataset represented in Figure 4 below as the original data.
4.1: Experimental Settings
- The authors' experiments are conducted on real-life ASD screening datasets to measure the effectiveness of the enhanced screening data in identifying and predicting a diagnosis.
- The three datasets (adult, adolescent, and child) have wide diversity in ethnicity, language, and age group and all fall within the application domain of the study, making them suitable for use as benchmarks.
- The authors utilized a number of evaluation measures to show the strengths and weaknesses of the proposed algorithm when compared with other ML classification algorithms.
- For ML predictive models, a matrix called the error table, or the confusion matrix, has been adopted.
- Once this data has been pre-processed, then it is run using the classification algorithms above.
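The measures derived from the confusion matrix (error table) can be sketched as below. The counts are made up for illustration; they are not results from the paper:

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute the standard measures from confusion-matrix counts:
    accuracy, sensitivity (true positive rate), specificity (true negative rate)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return accuracy, sensitivity, specificity

acc, sens, spec = confusion_metrics(tp=40, fp=5, tn=45, fn=10)
print(acc, sens, spec)  # 0.85 0.8 0.9
```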
4.2: Results and Analysis
- The experiments were conducted for the three datasets i.e. adult, adolescent, and child.
- No significant change is noted in the ANN method.
- This shows overall better accuracy and lower error rates for all datasets including those that have large numbers of instances, i.e. adult dataset, and those with a lower number of instances, i.e. the adolescent dataset.
- These cases tend to confuse the learning algorithm in the classification process hence causing large false positives and false negatives.
- The specificity rates shown in Figure 7 improved by 2.2%, 0.8%, 4.7%, and 12% for the adult dataset on the RIPPER, PART, Random Forest, and Random Tree classifiers, respectively, when CATC was applied.
Figure 9. ROC Area of the classifiers
- The authors also note that the number of rules generated when running the three datasets on RIPPER and PART decreases when CATC is applied, as shown in Figure 10.
- This can be attributed to redundant rules being removed in building the classifier, owing to the pre-processing of the dataset and clustering based on strong attributes.
- Thus, pre-processing with the clustering algorithm helped identify the relevant, strong features that alone were used in the supervised learning models.
- This is useful for diagnosticians as fewer rules could mean a reduced amount of time needed in the screening of autism patients.
5: Conclusion
- The utilization of clustering and classification together as a semi-supervised learning is rare in autism screening research.
- The authors proposed a method that utilizes both clustering and classification in autism screening, a first that they are aware of.
- Clustering the data before the learning phase streamlined the data based only on strong features, resulting in a reduced number of rules generated by the classifiers.
- The datasets were limited in size and the adult dataset was slightly imbalanced.
- In conclusion, the paper shows that employing CATC in the screening phase significantly improved the performance of the classifiers on all measures, especially accuracy and sensitivity.
Frequently Asked Questions (14)
Q2. What have the authors stated as future work in "A clustering approach for autism based autistic trait classification"?
Their future work will be to build a mobile screening app that will embed their clustering algorithm to assist clinicians in the diagnosis process of ASD in a clinical setting by considering wider options of diagnosis methods.
Q3. By what percentages did the classifiers improve on the adolescent dataset?
On the adolescent dataset, when CATC was applied, the percentage increments of the RIPPER, PART, Random Forest, and Random Tree classifiers are 21.2%, 0.5%, 2.4%, and 11.5%, respectively.
Q4. How is the dataset divided in the two-step classification process?
Classification algorithms generally follow a two-step process in which the dataset is divided into training data and testing data.
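That two-step division can be sketched as a simple hold-out split. The ratio and seed here are illustrative, not the paper's settings:

```python
import random

def train_test_split(rows, test_ratio=0.3, seed=42):
    """Shuffle the labeled instances and split them into a training
    portion (to build the classifier) and a testing portion (to evaluate it)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(range(10))
print(len(train), len(test))  # 7 3
```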
Q5. What algorithms are used to process the considered autism datasets?
The authors adopted RIPPER [17], PART [21], Random Forest [13], Random Trees [19], and Artificial Neural Network (ANN) [45] algorithms to process the considered autism datasets with and without clustering.
Q6. By how much did the PART classifier's sensitivity rate increase?
In addition, the PART classifier's sensitivity rate went up by 0.9%, 6.9%, and 7.5% on the adult, adolescent, and child datasets, respectively, when CATC was integrated.
Q7. What is the definition of a predictive analysis problem in ML?
Since ASD screening involves forecasting whether individuals possibly have ASD traits based on a predefined characterized variable, this is a predictive analysis problem in ML.
Q8. Which models did they conclude performed best?
They concluded that SVM and logistic regression performed best, with ROCs of 93% and 92% respectively, and that logistic regression and Lasso performed best on module 3 with a ROC of 93%.
Q9. Which type of algorithms performed better with high classification accuracy?
They concluded that function based algorithms such as regression models performed better with high classification accuracy compared to the decision tree based algorithms such as Random Forest.
Q10. What was the classification accuracy for the ASD diagnosis?
The researchers used the support vector machine algorithm and could predict the ASD diagnosis with a classification accuracy of 79%.
Q11. By how much did PART's accuracy improve when CATC was applied?
In addition, PART's predictive accuracy improved by 0.8%, 3.8%, and 5.5% on the three datasets, respectively, when CATC was applied.
Q12. What is the main conclusion of the paper?
In conclusion, the paper shows that employing CATC in the screening phase significantly improved the performance of the classifiers on all measures, especially accuracy and sensitivity.
Q13. What is a good predictor of model performance?
A good predictor of model performance would be the true positive rate (sensitivity) and the true negative rate (specificity).
Q14. What did the authors find out about the combining of the questionnaire and video assessment?
Their results suggest that combining the video and questionnaire into a single assessment boosted the sensitivity and specificity rates and overall performance of the study sample.