Charles X. Ling
Bio: Charles X. Ling is an academic researcher from University of Western Ontario. The author has contributed to research in topics: Genetic algorithm & k-nearest neighbors algorithm. The author has an hindex of 14, co-authored 37 publications receiving 1761 citations.
01 Jan 2011
TL;DR: Compared with other state-of-the-art DE approaches, this approach performs better, or at least comparably, in terms of the quality of the final solutions and the reduction of the number of fitness function evaluations (NFFEs).
Abstract: Hybridization with other different algorithms is an interesting direction for the improvement of differential evolution (DE). In this paper, a hybrid DE based on the one-step k-means clustering, called clustering-based DE (CDE), is presented for the unconstrained global optimization problems. The one-step k-means clustering acts as several multi-parent crossover operators to utilize the information of the population efficiently, and hence it can enhance the performance of DE. To validate the performance of our approach, 30 benchmark functions of a wide range of dimensions and diversity complexities are employed. Experimental results indicate that our approach is effective and efficient. Compared with other state-of-the-art DE approaches, our approach performs better, or at least comparably, in terms of the quality of the final solutions and the reduction of the number of fitness function evaluations (NFFEs).
••29 Sep 2014
TL;DR: An online system is implemented to show how the reviewer recommender helps project managers to find potential reviewers from crowds, and can reach a precision of 74% for top-1 recommendation, and a recall of 71% forTop-10 recommendation.
Abstract: Pull-Request (PR) is the primary method for code contributions from thousands of developers in GitHub. To maintain the quality of software projects, PR review is an essential part of distributed software development. Assigning new PRs to appropriate reviewers will make the review process more effective which can reduce the time between the submission of a PR and the actual review of it. However, reviewer assignment is now organized manually in GitHub. To reduce this cost, we propose a reviewer recommender to predict highly relevant reviewers of incoming PRs. Combining information retrieval with social network analyzing, our approach takes full advantage of the textual semantic of PRs and the social relations of developers. We implement an online system to show how the reviewer recommender helps project managers to find potential reviewers from crowds. Our approach can reach a precision of 74% for top-1 recommendation, and a recall of 71% for top-10 recommendation.
TL;DR: A novel approach to empirically verify the two assumptions of cotraining given two views is proposed, and several methods to split single view data sets into two views are designed in order to make cot training work reliably well.
Abstract: Cotraining, a paradigm of semisupervised learning, is promised to alleviate effectively the shortage of labeled examples in supervised learning. The standard two-view cotraining requires the data set to be described by two views of features, and previous studies have shown that cotraining works well if the two views satisfy the sufficiency and independence assumptions. In practice, however, these two assumptions are often not known or ensured (even when the two views are given). More commonly, most supervised data sets are described by one set of attributes (one view). Thus, they need be split into two views in order to apply the standard two-view cotraining. In this paper, we first propose a novel approach to empirically verify the two assumptions of cotraining given two views. Then, we design several methods to split single view data sets into two views, in order to make cotraining work reliably well. Our empirical results show that, given a whole or a large labeled training set, our view verification and splitting methods are quite effective. Unfortunately, cotraining is called for precisely when the labeled training set is small. However, given small labeled training sets, we show that the two cotraining assumptions are difficult to verify, and view splitting is unreliable. Our conclusions for cotraining's effectiveness are mixed. If two views are given, and known to satisfy the two assumptions, cotraining works well. Otherwise, based on small labeled training sets, verifying the assumptions or splitting single view into two views are unreliable; thus, it is uncertain whether the standard cotraining would work or not.
••18 Dec 2006
TL;DR: An automatic keyphrase extraction algorithm, which can be used in both supervised and unsupervised tasks, which treats each document as a semantic network and develops a classifier with an overall accuracy up to 80%.
Abstract: Keyphrases play a key role in text indexing, summarization and categorization. However, most of the existing keyphrase extraction approaches require human-labeled training sets. In this paper, we propose an automatic keyphrase extraction algorithm, which can be used in both supervised and unsupervised tasks. This algorithm treats each document as a semantic network. Structural dynamics of the network are used to extract keyphrases (key nodes) unsupervised. Experiments demonstrate the proposed algorithm averagely improves 50% in effectiveness and 30% in efficiency in unsupervised tasks and performs comparatively with supervised extractors. Moreover, by applying this algorithm to supervised tasks, we develop a classifier with an overall accuracy up to 80%.
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).
TL;DR: This article shows how MCC produces a more informative and truthful score in evaluating binary classifications than accuracy and F1 score, by first explaining the mathematical properties, and then the asset of MCC in six synthetic use cases and in a real genomics scenario.
Abstract: To evaluate binary classifications and their confusion matrices, scientific researchers can employ several statistical rates, accordingly to the goal of the experiment they are investigating. Despite being a crucial issue in machine learning, no widespread consensus has been reached on a unified elective chosen measure yet. Accuracy and F1 score computed on confusion matrices have been (and still are) among the most popular adopted metrics in binary classification tasks. However, these statistical measures can dangerously show overoptimistic inflated results, especially on imbalanced datasets. The Matthews correlation coefficient (MCC), instead, is a more reliable statistical rate which produces a high score only if the prediction obtained good results in all of the four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally both to the size of positive elements and the size of negative elements in the dataset. In this article, we show how MCC produces a more informative and truthful score in evaluating binary classifications than accuracy and F1 score, by first explaining the mathematical properties, and then the asset of MCC in six synthetic use cases and in a real genomics scenario. We believe that the Matthews correlation coefficient should be preferred to accuracy and F1 score in evaluating binary classification tasks by all scientific communities.
01 Jul 2012
TL;DR: A taxonomy for ensemble-based methods to address the class imbalance where each proposal can be categorized depending on the inner ensemble methodology in which it is based is proposed and a thorough empirical comparison is developed by the consideration of the most significant published approaches to show whether any of them makes a difference.
Abstract: Classifier learning with data-sets that suffer from imbalanced class distributions is a challenging problem in data mining community. This issue occurs when the number of examples that represent one class is much lower than the ones of the other classes. Its presence in many real-world applications has brought along a growth of attention from researchers. In machine learning, the ensemble of classifiers are known to increase the accuracy of single classifiers by combining several of them, but neither of these learning techniques alone solve the class imbalance problem, to deal with this issue the ensemble learning algorithms have to be designed specifically. In this paper, our aim is to review the state of the art on ensemble techniques in the framework of imbalanced data-sets, with focus on two-class problems. We propose a taxonomy for ensemble-based methods to address the class imbalance where each proposal can be categorized depending on the inner ensemble methodology in which it is based. In addition, we develop a thorough empirical comparison by the consideration of the most significant published approaches, within the families of the taxonomy proposed, to show whether any of them makes a difference. This comparison has shown the good behavior of the simplest approaches which combine random undersampling techniques with bagging or boosting ensembles. In addition, the positive synergy between sampling techniques and bagging has stood out. Furthermore, our results show empirically that ensemble-based algorithms are worthwhile since they outperform the mere use of preprocessing techniques before learning the classifier, therefore justifying the increase of complexity by means of a significant enhancement of the results.
TL;DR: In this paper, label noise consists of mislabeled instances: no additional information is assumed to be available like e.g., confidences on labels.
Abstract: Label noise is an important issue in classification, with many potential negative consequences. For example, the accuracy of predictions may decrease, whereas the complexity of inferred models and the number of necessary training samples may increase. Many works in the literature have been devoted to the study of label noise and the development of techniques to deal with label noise. However, the field lacks a comprehensive survey on the different types of label noise, their consequences and the algorithms that consider label noise. This paper proposes to fill this gap. First, the definitions and sources of label noise are considered and a taxonomy of the types of label noise is proposed. Second, the potential consequences of label noise are discussed. Third, label noise-robust, label noise cleansing, and label noise-tolerant algorithms are reviewed. For each category of approaches, a short discussion is proposed to help the practitioner to choose the most suitable technique in its own particular field of application. Eventually, the design of experiments is also discussed, what may interest the researchers who would like to test their own algorithms. In this paper, label noise consists of mislabeled instances: no additional information is assumed to be available like e.g., confidences on labels.
TL;DR: This work carries out a thorough discussion on the main issues related to using data intrinsic characteristics in this classification problem, and introduces several approaches and recommendations to address these problems in conjunction with imbalanced data.
Abstract: Training classifiers with datasets which suffer of imbalanced class distributions is an important problem in data mining. This issue occurs when the number of examples representing the class of interest is much lower than the ones of the other classes. Its presence in many real-world applications has brought along a growth of attention from researchers. We shortly review the many issues in machine learning and applications of this problem, by introducing the characteristics of the imbalanced dataset scenario in classification, presenting the specific metrics for evaluating performance in class imbalanced learning and enumerating the proposed solutions. In particular, we will describe preprocessing, costsensitive learning and ensemble techniques, carrying out an experimental study to contrast these approaches in an intra and inter-family comparison. We will carry out a thorough discussion on the main issues related to using data intrinsic characteristics in this classification problem. This will help to improve the current models with respect to: the presence of small disjuncts, the lack of density in the training data, the overlapping between classes, the identification of noisy data, the significance of the borderline instances, and the dataset shift between the training and the test distributions. Finally, we introduce several approaches and recommendations to address these problems in conjunction with imbalanced data, and we will show some experimental examples on the behavior of the learning algorithms on data with such intrinsic characteristics.