Author

Wei Zhong

Other affiliations: Georgia State University
Bio: Wei Zhong is an academic researcher from University of South Carolina Upstate. The author has contributed to research in topics: Cluster analysis & Support vector machine. The author has an h-index of 7 and has co-authored 28 publications receiving 295 citations. Previous affiliations of Wei Zhong include Georgia State University.

Papers
Journal ArticleDOI
Wei Zhong, Gulsah Altun, Robert W. Harrison, Phang C. Tai, Yi Pan
TL;DR: Experimental results indicate that the improved K-means algorithm generates more detailed sequence motifs representing common structures than previous research, and may be applied to other areas of bioinformatics research in order to explore the underlying relationships between data samples more effectively.
Abstract: Information about local protein sequence motifs is very important to the analysis of biologically significant conserved regions of protein sequences. These conserved regions can potentially determine the diverse conformation and activities of proteins. In this work, recurring sequence motifs of proteins are explored with an improved K-means clustering algorithm on a new dataset. The structural similarity of these recurring sequence clusters to produce sequence motifs is studied in order to evaluate the relationship between sequence motifs and their structures. To the best of our knowledge, the dataset used by our research is the most up-to-date dataset among similar studies for sequence motifs. A new greedy initialization method for the K-means algorithm is proposed to improve traditional K-means clustering techniques. The new initialization method tries to choose suitable initial points, which are well separated and have the potential to form high-quality clusters. Our experiments indicate that the improved K-means algorithm satisfactorily increases the percentage of sequence segments belonging to clusters with high structural similarity. Careful comparison of sequence motifs obtained by the improved and traditional algorithms also suggests that the improved K-means clustering algorithm may discover some relatively weak and subtle sequence motifs, which are undetectable by the traditional K-means algorithms. Many biochemical tests reported in the literature show that these sequence motifs are biologically meaningful. Experimental results also indicate that the improved K-means algorithm generates more detailed sequence motifs representing common structures than previous research. Furthermore, these motifs are universally conserved sequence patterns across protein families, overcoming some weak points of other popular sequence motifs. The satisfactory result of the experiment suggests that this new K-means algorithm may be applied to other areas of bioinformatics research in order to explore the underlying relationships between data samples more effectively.
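The idea of choosing well-separated initial points can be illustrated with a minimal sketch. This is not the authors' exact procedure: the abstract does not give its details, so the acceptance rule below (a candidate is kept only if it lies at least `min_dist` from every center already chosen) and the names `greedy_init`, `min_dist` and `max_tries` are assumptions made for illustration.

```python
import math
import random

def greedy_init(points, k, min_dist, seed=0, max_tries=1000):
    """Greedily pick up to k initial centers that are pairwise >= min_dist apart.

    Hypothetical helper: candidates are drawn at random and accepted only if
    they are well separated from every center accepted so far.
    """
    rng = random.Random(seed)
    centers = []
    tries = 0
    while len(centers) < k and tries < max_tries:
        tries += 1
        candidate = rng.choice(points)
        # Accept the candidate only if it is far enough from all chosen centers.
        if all(math.dist(candidate, c) >= min_dist for c in centers):
            centers.append(candidate)
    return centers

# Toy data with three visually separate groups.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
centers = greedy_init(points, k=3, min_dist=2.0)
print(centers)
```

The returned centers would then be passed to an ordinary K-means loop in place of purely random seeds; the separation threshold trades off coverage of the data against the chance of finding k acceptable points.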

113 citations

Journal ArticleDOI
TL;DR: A Multi-Level Deep Learning System (MLDLS) that organizes multiple deep learning models using a tree structure is proposed, so that the learning effectiveness of each deep learning model built for one cluster can be improved.
Abstract: To defend against an increasing number of sophisticated malware attacks, deep-learning based Malware Detection Systems (MDSs) have become a vital component of our economic and national security. Traditionally, researchers build a single deep learning model using the entire dataset. However, a single deep learning model may not handle the increasingly complex malware data distributions effectively, since different sample subspaces representing a group of similar malware may have a unique data distribution. In order to further improve the performance of deep learning based MDSs, we propose a Multi-Level Deep Learning System (MLDLS) that organizes multiple deep learning models using a tree structure. Each model in the tree structure of MLDLS is not built on the whole dataset. Instead, each deep learning model focuses on learning a specific data distribution for a particular group of malware, and all deep learning models in the tree work together to make a final decision. Consequently, the learning effectiveness of each deep learning model built for one cluster can be improved. Experimental results show that our proposed system performs better than the traditional approach.

54 citations

Journal ArticleDOI
TL;DR: Experimental results show that accuracy for local structure prediction has been improved noticeably when CSVMs are applied, which indicates that the generalization power for CSVMs is strong enough to recognize the complicated pattern of sequence-to-structure relationships.
Abstract: Understanding the sequence-to-structure relationship is a central task in bioinformatics research. Adequate knowledge about this relationship can potentially improve accuracy for local protein structure prediction. One of the approaches for protein local structure prediction uses conventional clustering algorithms to capture the sequence-to-structure relationship. The cluster membership function defined by conventional clustering algorithms may not reveal the complex nonlinear relationship adequately. Compared with the conventional clustering algorithms, Support Vector Machine (SVM) can capture the nonlinear sequence-to-structure relationship by mapping the input space into another higher dimensional feature space. However, SVM is not favorable for huge datasets including millions of samples. Therefore, we propose a novel computational model called Clustering Support Vector Machines (CSVMs). Taking advantage of both the theory of granular computing and advanced statistical learning methodology, CSVMs are built specifically for each information granule partitioned intelligently by the clustering algorithm. This feature makes learning tasks for each CSVM more specific and simpler. CSVMs modeled for each granule can be easily parallelized so that CSVMs can be used to handle complex classification problems for huge datasets. Average accuracy for CSVMs is over 80%, which indicates that the generalization power for CSVMs is strong enough to recognize the complicated pattern of sequence-to-structure relationships. Compared with the conventional clustering algorithm, our experimental results show that accuracy for local structure prediction has been improved noticeably when CSVMs are applied.
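The cluster-then-classify structure described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the cluster centers are taken as given rather than learned, and each local model is a plain linear SVM trained by sub-gradient descent on the hinge loss instead of a kernel SVM. All function names (`nearest`, `train_linear_svm`, `train_csvm`, `predict_csvm`) are hypothetical.

```python
import math

def nearest(centers, x):
    """Index of the cluster center closest to sample x."""
    return min(range(len(centers)), key=lambda i: math.dist(centers[i], x))

def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Stand-in local model: linear SVM via sub-gradient descent on hinge loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:   # margin violated: hinge term and regularizer both act
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:            # margin satisfied: only the regularizer acts
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def train_csvm(centers, X, y):
    """Train one local SVM per cluster, using coordinates relative to its center.

    Assumes every cluster receives at least one training sample.
    """
    models = []
    for i, c in enumerate(centers):
        idx = [j for j, x in enumerate(X) if nearest(centers, x) == i]
        local_X = [[a - m for a, m in zip(X[j], c)] for j in idx]
        models.append(train_linear_svm(local_X, [y[j] for j in idx]))
    return models

def predict_csvm(centers, models, x):
    """Route x to its nearest cluster and apply that cluster's local SVM."""
    i = nearest(centers, x)
    w, b = models[i]
    score = sum(wj * (a - m) for wj, a, m in zip(w, x, centers[i])) + b
    return 1 if score >= 0 else -1

# Toy data: the labeling rule is opposite in the two clusters, so no single
# linear boundary fits both, but each cluster is easy to separate locally.
centers = [(0.0, 0.0), (10.0, 0.0)]
X = [(0.0, 1.0), (0.0, 2.0), (0.0, -1.0), (0.0, -2.0),
     (10.0, 1.0), (10.0, 2.0), (10.0, -1.0), (10.0, -2.0)]
y = [1, 1, -1, -1, -1, -1, 1, 1]
models = train_csvm(centers, X, y)
preds = [predict_csvm(centers, models, x) for x in X]
print(preds)
```

Because each local model depends only on its own cluster's samples, the loop in `train_csvm` could be distributed across workers, which is the property the abstract points to when it says CSVMs can be easily parallelized.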

50 citations

Journal ArticleDOI
TL;DR: The accuracy of the tertiary classifier with PSSM encoding scheme reaches 72.01%, which is almost 10% better than the previous results obtained in 2003, and Hyper-Threading technology for Intel architecture is efficient for parallel biological algorithms.
Abstract: Protein secondary structure prediction has a fundamental influence on today's bioinformatics research. In this work, tertiary classifiers for the protein secondary structure prediction are implemented on Denoeux Belief Neural Network (DBNN) architecture. Hydrophobicity matrix, orthogonal matrix, BLOSUM62 matrix and PSSM matrix are tested separately as the encoding schemes for DBNN. Hydrophobicity matrix, BLOSUM62 matrix and PSSM matrix are applied to DBNN architecture for the first time. The experimental results contribute to the design of new encoding schemes. Our accuracy of the tertiary classifier with PSSM encoding scheme reaches 72.01%, which is almost 10% better than the previous results obtained in 2003. Because training the neural networks is time-consuming, Pthread and OpenMP are employed to parallelize DBNN on the Hyper-Threading enabled Intel architecture. Speedup for 16 Pthreads is 4.9 and speedup for 16 OpenMP threads is 4 on the 4-processor shared-memory architecture. The speedups of both OpenMP and Pthread are superior to those reported in other research. With the new parallel training algorithm, thousands of amino acids can be processed in a reasonable amount of time. Our research also shows that Hyper-Threading technology for Intel architecture is efficient for parallel biological algorithms.
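To put the reported speedups in context, the short calculation below converts them to parallel efficiency (speedup divided by the number of execution units). The speedups and the processor count come from the abstract; the figure of two logical processors per physical processor is an assumption about the Hyper-Threading configuration, and `parallel_efficiency` is a hypothetical helper.

```python
def parallel_efficiency(speedup, units):
    """Fraction of ideal linear speedup achieved on `units` execution units."""
    return speedup / units

reported = {"Pthread": 4.9, "OpenMP": 4.0}  # speedups for 16 threads, from the abstract
physical = 4              # processors stated in the abstract
logical = physical * 2    # assumption: two Hyper-Threading contexts per processor

for name, s in reported.items():
    print(name,
          "vs physical:", round(parallel_efficiency(s, physical), 3),
          "vs logical:", round(parallel_efficiency(s, logical), 3))
```

Under these assumptions, a Pthread speedup above 4 on 4 physical processors is what one would expect if the extra logical processors contribute useful work, which is in line with the abstract's conclusion about Hyper-Threading.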

22 citations

Journal ArticleDOI
TL;DR: The Multi-level Support Vector Machine (MLSVM) that organizes the dataset as clusters in a tree to produce better partitions for more effective SVM classification is proposed and running time analysis shows that MLSVM can accelerate SVM's training process noticeably when the parallel algorithm is employed.
Abstract: This research utilizes the national Healthcare Cost & Utilization Project (HCUP-3) databases to construct Support Vector Machine (SVM) classifiers to predict clinical charge profiles, including hospital charges and length of stay (LOS), for patients diagnosed with heart and circulatory disease, diabetes and cancer, respectively. Clinical charge profile predictions can provide relevant clinical knowledge for healthcare policy makers to effectively manage healthcare services and costs at the national, state, and local levels. Despite its solid mathematical foundation and promising experimental results, SVM is not favorable for large-scale data mining tasks since its training time complexity is at least quadratic in the number of samples. Furthermore, traditional SVM classification algorithms cannot build an effective SVM when different data distribution patterns are intermingled in a large dataset. In order to enhance SVM training for large, complex and noisy healthcare datasets, we propose the Multi-level Support Vector Machine (MLSVM) that organizes the dataset as clusters in a tree to produce better partitions for more effective SVM classification. The MLSVM model utilizes multiple SVMs, each of which learns the local data distribution patterns in a cluster efficiently. A decision fusion algorithm is used to generate an effective global decision that incorporates local SVM decisions at different levels of the tree. Consequently, MLSVM can handle complex and often conflicting data distributions in large datasets more effectively than the single-SVM based approaches and the multiple SVM systems. Both the combined 5x2-fold cross validation F test and the independent test show that classification performance of MLSVM is much superior to that of CVM, ACSVM and CSVM based on three popular performance evaluation metrics. In this work, CSVM and MLSVM are parallelized to speed up the slow SVM training process for very large and complex datasets. Running time analysis shows that MLSVM can accelerate SVM's training process noticeably when the parallel algorithm is employed.
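A back-of-the-envelope estimate shows why partitioning helps when training cost is quadratic. The derivation below is a sketch under simplifying assumptions that are not stated in the abstract: training cost is exactly $c\,n^2$ for $n$ samples, the data splits into $k$ equal clusters, the $p \le k$ parallel workers are perfectly load balanced, and the costs of clustering and decision fusion are ignored.

```latex
\begin{align*}
T_{\text{single}}(n)   &\approx c\,n^{2} \\
T_{\text{clustered}}(n,k) &\approx k \cdot c\left(\frac{n}{k}\right)^{2} = \frac{c\,n^{2}}{k} \\
T_{\text{parallel}}(n,k,p) &\approx \frac{1}{p}\cdot\frac{c\,n^{2}}{k} = \frac{c\,n^{2}}{k\,p},
\qquad p \le k
\end{align*}
```

Under these assumptions, splitting into $k$ clusters already cuts serial training cost by a factor of $k$, and running the local SVMs on $p$ workers gives a further factor of $p$.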

14 citations


Cited by
Journal ArticleDOI
TL;DR: Approaches, models and methods from the graph theory universe are demonstrated, and ways in which they can be used to reveal hidden properties and features of a network are discussed to better understand the biological significance of the system.
Abstract: Understanding complex systems often requires a bottom-up analysis towards a systems biology approach. The need to investigate a system, not only as individual components but as a whole, emerges. This can be done by examining the elementary constituents individually and then how these are connected. The myriad components of a system and their interactions are best characterized as networks and they are mainly represented as graphs where thousands of nodes are connected with thousands of edges. In this article we demonstrate approaches, models and methods from the graph theory universe and we discuss ways in which they can be used to reveal hidden properties and features of a network. This network profiling combined with knowledge extraction will help us to better understand the biological significance of the system.

595 citations

Journal ArticleDOI
01 Jan 2010
TL;DR: This paper creates an algorithm to automatically analyze the emotional polarity of a text and to obtain a value for each piece of text, which is combined with K-means clustering and support vector machine (SVM) to develop an unsupervised text mining approach.
Abstract: Text sentiment analysis, also referred to as emotional polarity computation, has become a flourishing frontier in the text mining community. This paper studies online forum hotspot detection and forecasting using sentiment analysis and text mining approaches. First, we create an algorithm to automatically analyze the emotional polarity of a text and to obtain a value for each piece of text. Second, this algorithm is combined with K-means clustering and support vector machine (SVM) to develop an unsupervised text mining approach. We use the proposed text mining approach to group the forums into various clusters, with the center of each representing a hotspot forum within the current time span. The data sets used in our empirical studies are acquired and formatted from Sina sports forums, which span a range of 31 different topic forums and 220,053 posts. Experimental results demonstrate that SVM forecasting achieves highly consistent results with K-means clustering. The top 10 hotspot forums listed by SVM forecasting resemble 80% of K-means clustering results. Both SVM and K-means achieve the same results for the top 4 hotspot forums of the year.

452 citations

Posted Content
TL;DR: It is shown that only when the optimal parameter-selection procedure is applied, support vector machines outperform traditional logistic regression, whereas random forests outperform both kinds of support vector machine models.
Abstract: CRM gains increasing importance due to intensive competition and saturated markets. With the purpose of retaining customers, academics as well as practitioners find it crucial to build a churn prediction model that is as accurate as possible. This study applies support vector machines in a newspaper subscription context in order to construct a churn model with a higher predictive performance. Moreover, a comparison is made between two parameter-selection techniques, needed to implement support vector machines. Both techniques are based on grid search and cross-validation. Afterwards, the predictive performance of both kinds of support vector machine models is benchmarked against logistic regression and random forests. Our study shows that support vector machines show good generalization performance when applied to noisy marketing data. Nevertheless, the parameter optimization procedure plays an important role in the predictive performance. We show that only when the optimal parameter selection procedure is applied, support vector machines outperform traditional logistic regression, whereas random forests outperform both kinds of support vector machines. As a substantive contribution, an overview of the most important churn drivers is given. Unlike ample research, monetary value and frequency do not play an important role in explaining churn in this subscription-services application. Even though the most important churn predictors belong to the category of variables describing the subscription, the influence of several client/company-interaction variables cannot be neglected.

371 citations

Journal ArticleDOI
TL;DR: The main idea in this paper is to describe key papers and provide some guidelines to help medical practitioners to explore previous works and identify interesting areas for future research.
Abstract: Data mining is a powerful method to extract knowledge from data. Raw data poses various challenges that make traditional methods unsuitable for knowledge extraction. Data mining is supposed to be able to handle various data types in all formats. The relevance of this paper is emphasized by the fact that data mining is an object of research in different areas. In this paper, we review previous works in the context of knowledge extraction from medical data. The main idea in this paper is to describe key papers and provide some guidelines to help medical practitioners. Medical data mining is a multidisciplinary field with contributions from medicine and data mining. Due to this fact, previous works should be classified to cover all users' requirements from various fields. Because of this, we have studied papers with the aim of extracting knowledge from structural medical data published between 1999 and 2013. We clarify medical data mining and its main goals. Each paper is then studied based on the six medical tasks: screening, diagnosis, treatment, prognosis, monitoring and management. In each task, five data mining approaches are considered: classification, regression, clustering, association and hybrid. At the end of each task, a brief summary and discussion are given. A standard framework according to CRISP-DM is additionally adapted to manage all activities. As a discussion, current issues and future trends are mentioned. The number of works published in this scope is substantial and it is impossible to discuss all of them in a single work. We hope this paper will make it possible to explore previous works and identify interesting areas for future research.

220 citations

Journal ArticleDOI
TL;DR: This article compares k-means to fuzzy c-means and rough k-means as important representatives of soft clustering, and surveys important extensions and derivatives of these algorithms.

157 citations