A clustering approach for autistic trait classification.
Summary
1: Introduction
- Autism Spectrum Disorder (ASD) is a neurodevelopmental condition characterized by delays in individuals' social and communication behaviors.
- The official ASD diagnosis process involves multiple examinations, which in turn makes the waiting time for patients lengthy [40].
- Most of these screening methods have been developed from existing clinical autism diagnosis methods and take the form of questionnaires in which each question has a few possible answers in a multiple-choice fashion [6, 7, 24, 38].
- The screening of ASD traits can be treated as a classification problem in which historical data, already labeled as with or without ASD traits, is used as input to construct a classification system.
- Thus, applying clustering in the pre-processing phase enhances the predictability of the classification algorithm and improves the classifier's accuracy, sensitivity, specificity, and error rates, among other measures.
2: Literature Review
- Crane et al. [17] highlighted some of the challenges to a timely and adequate ASD diagnosis, including the inadequacy of the tools used to aid ASD screening.
- Thabtah et al. [41] improved the efficiency of the screening process by reducing the number of items in the self-assessment screening tool AQ-10 [3].
- The authors applied their datasets to Random Forest classifiers.
- The authors also pointed out that while the studies showed promising results, none were embedded in a screening tool.
- The authors showed that only ten items are needed to screen for first-level ASD traits.
3: The Proposed Clustering based Autistic Trait Classification (CATC)
- The authors discuss the proposed CATC method based on the architecture shown in Figure 1 below.
- Three datasets (adult, adolescent, and child) are collected via a mobile screening app called ASDTests [37, 38].
- The data is then cleaned for their experiments and run through an unsupervised machine learning clustering algorithm.
- The result of this process forms the initial model that is loaded into a classifier for the predictive phase.
- Further details for each of the steps are outlined in the subsections that follow.
3.1: Data Collection
- Initially, data is collected using a mobile screening tool called ASDTests [37, 38].
- The child, adolescent, and adult datasets contain instances for individuals aged 4-11 years, 12-16 years, and above 16 years, respectively.
- A score of 6 or above, based on [3], indicates that the individual has some ASD traits, and the class is labeled YES.
- Otherwise, the class is given a value of NO.
- The size of the datasets varies between the three groups.
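The scoring rule above can be sketched as a small function. This is a hypothetical illustration of the thresholding described (the actual app logic is not shown in the summary):

```python
def assign_class_label(screening_score: int, threshold: int = 6) -> str:
    """Map a total screening score to the class label used in the datasets:
    a score of 6 or above indicates some ASD traits (YES), otherwise NO."""
    return "YES" if screening_score >= threshold else "NO"

print(assign_class_label(7))  # YES
print(assign_class_label(4))  # NO
```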
Ethical Considerations
- The data was published and made public [25] by its original author, Thabtah et al. [40].
- The authors of the datasets had obtained ethical approval from the University of Huddersfield, Huddersfield, UK.
3.2: The initial Dataset and Data Transformation
- The initial datasets are of multivariable nature with categorical, continuous and binary attributes that contain a total of 23 features (see Table 2 ).
- An answer of "slightly disagree" or "definitely disagree" scored "1" on all remaining questions.
- The authors modified the datasets to include only 18 attributes by removing the features marked 16-22 in Table 2 below from the three datasets.
- The removed features are general questions regarding the user and the app.
- The "Screening Score" (Feature #19 in Table 2) is among the removed features.
3.3: Clustering Phase
- The datasets are pre-processed by applying an unsupervised machine learning clustering method.
- The authors employ the OMCOKE algorithm which groups all items into two clusters.
- The centroids are recomputed and the process is repeated until there is no movement or change in the assignment of data points to their closest centroid.
- Algorithm 1 below summarizes the OMCOKE clustering.
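OMCOKE itself is not detailed here, but the centroid-reassignment loop that Algorithm 1 summarizes is the familiar k-means skeleton: assign each point to its closest centroid, recompute centroids, and repeat until assignments stop changing. A minimal two-cluster sketch under that assumption (OMCOKE's own refinements are not reproduced):

```python
import numpy as np

def two_cluster_kmeans(X, max_iter=100, seed=0):
    """Generic two-cluster centroid-reassignment loop, as in Algorithm 1."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # initialize the two centroids from two distinct data points
    centroids = X[rng.choice(len(X), size=2, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # assign each point to its closest of the two centroids
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no change in assignments: converged
        labels = new_labels
        # recompute each centroid as the mean of its assigned points
        for k in range(2):
            mask = labels == k
            if mask.any():
                centroids[k] = X[mask].mean(axis=0)
    return labels, centroids

labels, centroids = two_cluster_kmeans([[0, 0], [0, 1], [10, 10], [10, 11]])
```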
3.4: Classification Phase
- The datasets contain a Boolean attribute named "Class" that takes a value of YES/NO based on the screening score.
- The Class attribute indicates whether the user has been screened as having ASD or not and is used in the supervised learning algorithm for prediction.
- The cluster assignments are then compared to the Class attribute to see if they match.
- Where there is a match, the authors keep that instance; otherwise, they discard it and remove it from the dataset.
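The match-and-filter step described above can be sketched as follows. Names and the label mapping are hypothetical; in practice the cluster assignments must first be mapped onto the YES/NO labels before comparison:

```python
def filter_by_agreement(instances, cluster_labels, class_labels):
    """Keep only the instances whose (mapped) cluster assignment agrees
    with the existing Class label; discard the rest."""
    return [row for row, clu, cls in zip(instances, cluster_labels, class_labels)
            if clu == cls]

# Toy example: the second instance disagrees with its Class label, so it is dropped.
data = [["a"], ["b"], ["c"], ["d"]]
clusters = ["YES", "NO", "YES", "NO"]
classes = ["YES", "YES", "YES", "NO"]
kept = filter_by_agreement(data, clusters, classes)
print(kept)  # [['a'], ['c'], ['d']]
```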
Key features of applying the CATC process include:
- (1) Grouping the data items into two clusters based on their strong attributes.
- The clustering algorithm assisted in identifying the relevant, strong features that alone were used in the supervised learning models.
- (2) Reducing data dimensionality by eliminating redundancy.
- The authors adopt the clustering-based autistic traits dataset, which has been streamlined and enhanced for use in the learning phase of the machine learning process.
- Assume the simple dataset represented in Figure 4 below as the original data.
4.1: Experimental Settings
- The authors' experiments are conducted on real-life ASD screening datasets to measure the effectiveness of the enhanced screening data in identifying and predicting a diagnosis.
- The three datasets (adult, adolescent, and child) have wide diversity in ethnicity, language, and age group and all fall within the application domain of the study, making them suitable for use as benchmarks.
- The authors utilized a number of evaluation measures to show the strengths and weaknesses of the proposed algorithm when compared with other ML classification algorithms.
- For ML predictive models, a matrix called the error table, or the confusion matrix, has been adopted.
- Once this data has been pre-processed, then it is run using the classification algorithms above.
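The measures derived from the confusion matrix (error table) can be sketched as below. The counts are made up for illustration; they are not results from the paper:

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute the standard measures from confusion-matrix counts:
    accuracy, sensitivity (true positive rate), specificity (true negative rate)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return accuracy, sensitivity, specificity

acc, sens, spec = confusion_metrics(tp=40, fp=5, tn=45, fn=10)
print(acc, sens, spec)  # 0.85 0.8 0.9
```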
4.2: Results and Analysis
- The experiments were conducted for the three datasets i.e. adult, adolescent, and child.
- No significant change is noted in the ANN method.
- This shows overall better accuracy and lower error rates for all datasets including those that have large numbers of instances, i.e. adult dataset, and those with a lower number of instances, i.e. the adolescent dataset.
- These cases tend to confuse the learning algorithm in the classification process hence causing large false positives and false negatives.
- The specificity rates shown in Figure 7 improved by 2.2%, 0.8%, 4.7%, and 12% for the adult dataset on the RIPPER, PART, Random Forest, and Random Tree classifiers, respectively, when CATC was applied.
Figure 9. ROC Area of the classifiers
- The authors also note that the number of rules generated when running the three datasets on RIPPER and PART decreases when CATC is applied, as shown in Figure 10.
- This can be attributed to redundant rules being removed in building the classifier, owing to the pre-processing of the dataset and clustering based on strong attributes.
- Thus, pre-processing with the clustering algorithm helped identify the relevant, strong features that alone were used in the supervised learning models.
- This is useful for diagnosticians as fewer rules could mean a reduced amount of time needed in the screening of autism patients.
5: Conclusion
- The utilization of clustering and classification together as a semi-supervised learning is rare in autism screening research.
- The authors proposed a method that utilizes both clustering and classification in autism screening, a first that they are aware of.
- Clustering the data before the learning phase streamlined the data based only on strong features, resulting in a reduced number of rules generated by the classifiers.
- The datasets were limited in size and the adult dataset was slightly imbalanced.
- In conclusion, the paper shows that employing CATC in the screening phase significantly improved the performance of the classifiers on all measures, especially accuracy and sensitivity.
Frequently Asked Questions (14)
Q2. What have the authors stated as future work in "A clustering approach for autism based autistic trait classification"?
Their future work will be to build a mobile screening app that will embed their clustering algorithm to assist clinicians in the diagnosis process of ASD in a clinical setting by considering wider options of diagnosis methods.
Q3. By what percentages did the classifiers improve on the adolescent dataset?
On the adolescent dataset, when CATC was applied, the percentage increments of the RIPPER, PART, Random Forest, and Random Tree classifiers are 21.2%, 0.5%, 2.4%, and 11.5%, respectively.
Q4. How is the dataset divided in the two-step classification process?
Classification algorithms generally follow a two-step process in which the dataset is divided into training data and testing data.
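That two-step division can be sketched as a simple hold-out split. The ratio and seed here are illustrative, not the paper's settings:

```python
import random

def train_test_split(rows, test_ratio=0.3, seed=42):
    """Shuffle the labeled instances and split them into a training
    portion (to build the classifier) and a testing portion (to evaluate it)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(range(10))
print(len(train), len(test))  # 7 3
```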
Q5. What algorithms are used to process the considered autism datasets?
The authors adopted RIPPER [17], PART [21], Random Forest [13], Random Trees [19], and Artificial Neural Network (ANN) [45] algorithms to process the considered autism datasets with and without clustering.
Q6. By how much did the PART classifier's sensitivity rate increase?
In addition, the PART classifier's sensitivity rate went up by 0.9%, 6.9%, and 7.5% on the adult, adolescent, and child datasets, respectively, when CATC was integrated.
Q7. What is the definition of a predictive analysis problem in ML?
Since ASD screening involves forecasting whether individuals possibly have ASD traits based on a predefined characterized variable, this is a predictive analysis problem in ML.
Q8. Which models did they conclude performed best?
They concluded that SVM and logistic regression performed best, with ROCs of 93% and 92% respectively, and that logistic regression and Lasso performed best on module 3 with a ROC of 93%.
Q9. Which type of algorithms performed better with high classification accuracy?
They concluded that function based algorithms such as regression models performed better with high classification accuracy compared to the decision tree based algorithms such as Random Forest.
Q10. What was the classification accuracy for the ASD diagnosis?
The researchers used the support vector machine algorithm and could predict the ASD diagnosis with a classification accuracy of 79%.
Q11. By how much did PART's accuracy improve when CATC was applied?
In addition, PART's predictive accuracy improved by 0.8%, 3.8%, and 5.5% on the three datasets, respectively, when CATC was applied.
Q12. What is the main conclusion of the paper?
In conclusion, the paper shows that employing CATC in the screening phase significantly improved the performance of the classifiers on all measures, especially accuracy and sensitivity.
Q13. What is a good predictor of model performance?
A good predictor of model performance would be the true positive rate (sensitivity) and the true negative rate (specificity).
Q14. What did the authors find out about the combining of the questionnaire and video assessment?
Their results suggest that combining the video and questionnaire into a single assessment boosted the sensitivity and specificity rates and overall performance of the study sample.