Showing papers on "Classifier chains" published in 2019


Proceedings Article
01 Aug 2019
TL;DR: This research discusses multi-label text classification for abusive language and hate speech detection, including detecting the target, category, and level of hate speech on Indonesian Twitter, using a machine learning approach with Support Vector Machine, Naive Bayes, and Random Forest Decision Tree methods.
Abstract: Hate speech and abusive language spreading on social media need to be detected automatically to avoid conflict between citizens. Moreover, hate speech has a target, category, and level that also need to be detected to help the authorities prioritize which hate speech must be addressed immediately. This research discusses multi-label text classification for abusive language and hate speech detection, including detecting the target, category, and level of hate speech on Indonesian Twitter, using a machine learning approach with Support Vector Machine (SVM), Naive Bayes (NB), and Random Forest Decision Tree (RFDT) classifiers, and Binary Relevance (BR), Label Power-set (LP), and Classifier Chains (CC) as the data transformation methods. We used several kinds of feature extraction: term frequency, orthography, and lexicon features. Our experimental results show that, in general, the RFDT classifier using LP as the transformation method gives the best accuracy with fast computational time.
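The difference between two of the transformation methods named above can be sketched in a few lines. This is a minimal illustration on a made-up label matrix, not the paper's pipeline; the function names and the toy data are assumptions introduced here. Binary Relevance yields one binary target per label, while Label Power-set turns each distinct label combination into a single class.

```python
from typing import Dict, List, Tuple


def binary_relevance_targets(Y: List[List[int]]) -> List[List[int]]:
    """Binary Relevance: one independent binary target vector per label."""
    n_labels = len(Y[0])
    return [[row[j] for row in Y] for j in range(n_labels)]


def label_powerset_targets(Y: List[List[int]]) -> Tuple[List[int], Dict[Tuple[int, ...], int]]:
    """Label Power-set: each distinct label combination becomes one class id."""
    class_of: Dict[Tuple[int, ...], int] = {}
    targets: List[int] = []
    for row in Y:
        key = tuple(row)
        if key not in class_of:
            class_of[key] = len(class_of)  # ids assigned in order of first appearance
        targets.append(class_of[key])
    return targets, class_of


# Hypothetical toy label matrix: 4 instances, 3 labels.
Y = [[1, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 0]]
br = binary_relevance_targets(Y)       # three binary problems
lp, mapping = label_powerset_targets(Y)  # one multi-class problem
```

With Label Power-set, the number of classes grows with the number of distinct label combinations seen in training, which is the usual trade-off against Binary Relevance.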

109 citations


Posted Content
TL;DR: The goal of this work is to provide a review of classifier chains, a survey of the techniques and extensions provided in the literature, as well as perspectives for this approach in the domain of multi-label classification in the future.
Abstract: The family of methods collectively known as classifier chains has become a popular approach to multi-label learning problems. This approach involves linking together off-the-shelf binary classifiers in a chain structure, such that class label predictions become features for other classifiers. Such methods have proved flexible and effective and have obtained state-of-the-art empirical performance across many datasets and multi-label evaluation metrics. This performance led to further studies of how exactly it works, and how it could be improved, and in the recent decade numerous studies have explored classifier chains mechanisms on a theoretical level, and many improvements have been made to the training and inference procedures, such that this method remains among the state-of-the-art options for multi-label learning. Given this past and ongoing interest, which covers a broad range of applications and research themes, the goal of this work is to provide a review of classifier chains, a survey of the techniques and extensions provided in the literature, as well as perspectives for this approach in the domain of multi-label classification in the future. We conclude positively, with a number of recommendations for researchers and practitioners, as well as outlining a number of areas for future research.
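The chaining idea described in this abstract, where label predictions become features for later classifiers, can be sketched as follows. This is a minimal sketch under stated assumptions: the `OneNN` base learner and the toy data are hypothetical stand-ins chosen to keep the example dependency-free, and the label order is assumed to be a permutation of `0..L-1`. It is not the implementation from any surveyed paper.

```python
from typing import List, Sequence


class OneNN:
    """Tiny 1-nearest-neighbour binary classifier (illustrative base learner)."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[int]) -> "OneNN":
        self.X = [list(x) for x in X]
        self.y = list(y)
        return self

    def predict_one(self, x: Sequence[float]) -> int:
        dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in self.X]
        return self.y[dists.index(min(dists))]


class ClassifierChain:
    """One binary model per label; earlier labels are appended as features."""

    def __init__(self, order: Sequence[int], make_base=OneNN):
        self.order = list(order)  # assumed to be a permutation of 0..L-1
        self.make_base = make_base

    def fit(self, X, Y) -> "ClassifierChain":
        self.models = []
        for pos, label in enumerate(self.order):
            prev = self.order[:pos]
            # Training: augment features with the TRUE values of earlier labels.
            X_aug = [list(x) + [y[j] for j in prev] for x, y in zip(X, Y)]
            target = [y[label] for y in Y]
            self.models.append(self.make_base().fit(X_aug, target))
        return self

    def predict_one(self, x) -> List[int]:
        preds = {}
        for pos, label in enumerate(self.order):
            # Prediction: earlier labels are only available as PREDICTIONS.
            x_aug = list(x) + [preds[j] for j in self.order[:pos]]
            preds[label] = self.models[pos].predict_one(x_aug)
        return [preds[j] for j in range(len(self.order))]


# Hypothetical toy data: one feature, two labels.
X = [[0.0], [1.0], [2.0], [3.0]]
Y = [[0, 0], [0, 1], [1, 0], [1, 1]]
chain = ClassifierChain(order=[0, 1]).fit(X, Y)
prediction = chain.predict_one([1.0])
```

Any binary classifier exposing the same `fit` / `predict_one` interface could be swapped in for `OneNN`.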

40 citations


Journal Article
TL;DR: “RTAnews” is introduced, a new benchmark dataset of multi-label Arabic news articles for text categorization and other supervised learning tasks, and an extensive comparison of most of the well-known multi-label learning algorithms for Arabic text categorization is conducted.
Abstract: Multi-label text categorization refers to the problem of assigning each document to a subset of categories by means of multi-label learning algorithms. Unlike English and most other languages, the unavailability of Arabic benchmark datasets prevents evaluating multi-label learning algorithms for Arabic text categorization. As a result, only a few recent studies have dealt with multi-label Arabic text categorization on non-benchmark and inaccessible datasets. Therefore, this work aims to promote multi-label Arabic text categorization through (a) introducing “RTAnews”, a new benchmark dataset of multi-label Arabic news articles for text categorization and other supervised learning tasks. The benchmark is publicly available in several formats compatible with the existing multi-label learning tools, such as MEKA and Mulan. (b) Conducting an extensive comparison of most of the well-known multi-label learning algorithms for Arabic text categorization in order to have baseline results and show the effectiveness of these algorithms for Arabic text categorization on RTAnews. The evaluation involves four multi-label transformation-based algorithms: Binary Relevance, Classifier Chains, Calibrated Ranking by Pairwise Comparison and Label Powerset, with three base learners (Support Vector Machine, k-Nearest-Neighbors and Random Forest); and four adaptation-based algorithms (Multi-label kNN, Instance-Based Learning by Logistic Regression Multi-label, Binary Relevance kNN and RFBoost). The reported baseline results show that both RFBoost and Label Powerset with Support Vector Machine as base learner outperformed other compared algorithms. Results also demonstrated that adaptation-based algorithms are faster than transformation-based algorithms.

36 citations


Journal Article
TL;DR: This paper proposes ordering methods based on the conditional entropy of labels that generate a single order instead of multiple orders and shows that the proposed methods achieve good performance.

31 citations


Journal Article
TL;DR: This study is the first to compare both multi-label classification techniques and recommender systems for cross-sell purposes in the financial services sector, and identifies user-based collaborative filtering as the top performing recommender system.

26 citations


Posted Content
TL;DR: This work analyzes the influence of a potential pitfall of the learning process, namely the discrepancy between the feature spaces used in training and testing, and proposes two modifications of classifier chains that are meant to overcome this problem.
Abstract: Classifier chains have recently been proposed as an appealing method for tackling the multi-label classification task. In addition to several empirical studies showing its state-of-the-art performance, especially when being used in its ensemble variant, there are also some first results on theoretical properties of classifier chains. Continuing along this line, we analyze the influence of a potential pitfall of the learning process, namely the discrepancy between the feature spaces used in training and testing: While true class labels are used as supplementary attributes for training the binary models along the chain, the same models need to rely on estimations of these labels at prediction time. We elucidate under which circumstances the attribute noise thus created can affect the overall prediction performance. As a result of our findings, we propose two modifications of classifier chains that are meant to overcome this problem. Experimentally, we show that our variants are indeed able to produce better results in cases where the original chaining process is likely to fail.

25 citations


Journal Article
TL;DR: It is shown that local outperforms global feature selection in terms of classification accuracy, without drawbacks in runtime performance.
Abstract: Multilabel classification has become increasingly important for various use cases. Amongst the existing multilabel classification methods, problem transformation approaches, such as Binary Relevance, Pruned Problem Transformation, and Classifier Chains, are some of the most popular, since they break a global multilabel classification problem into a set of smaller binary or multiclass classification problems. Transformation methods enable the use of two different feature selection approaches: local, where the selection is performed independently for each of the transformed problems, and global, where the selection is performed on the original dataset, meaning that all local classifiers work on the same set of features. While global methods have been widely researched, local methods have received little attention so far. In this paper, we compare those two strategies on one of the most straightforward transformation approaches, i.e., Binary Relevance. We empirically compare their performance on various flat and hierarchical multilabel datasets of different application domains. We show that local outperforms global feature selection in terms of classification accuracy, without drawbacks in runtime performance.
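The local versus global distinction can be illustrated with a small sketch. The scoring rule used here (absolute gap between the mean feature value of positive and negative examples) is an assumption chosen for brevity, not the criterion used in the paper, and the function names and data are hypothetical.

```python
from typing import List


def mean_gap_scores(X: List[List[float]], y: List[int]) -> List[float]:
    """Score each feature by |mean over positives - mean over negatives|.

    Illustrative scoring rule only; any per-feature relevance score would do.
    """
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    scores = []
    for f in range(len(X[0])):
        mean_pos = sum(x[f] for x in pos) / len(pos) if pos else 0.0
        mean_neg = sum(x[f] for x in neg) / len(neg) if neg else 0.0
        scores.append(abs(mean_pos - mean_neg))
    return scores


def top_k(scores: List[float], k: int) -> List[int]:
    """Indices of the k best-scoring features (ties broken by lower index)."""
    return sorted(sorted(range(len(scores)), key=lambda f: -scores[f])[:k])


def local_selection(X, Y, k: int) -> List[List[int]]:
    """Local: select features separately for each binary (per-label) problem."""
    n_labels = len(Y[0])
    return [top_k(mean_gap_scores(X, [row[j] for row in Y]), k) for j in range(n_labels)]


def global_selection(X, Y, k: int) -> List[int]:
    """Global: one shared feature set, here from scores summed over labels."""
    totals = [0.0] * len(X[0])
    for j in range(len(Y[0])):
        for f, s in enumerate(mean_gap_scores(X, [row[j] for row in Y])):
            totals[f] += s
    return top_k(totals, k)


# Hypothetical data: label 0 follows feature 0, label 1 follows feature 1.
X = [[1.0, 0.0, 0.5], [1.0, 1.0, 0.5], [0.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
Y = [[1, 0], [1, 1], [0, 0], [0, 1]]
local = local_selection(X, Y, k=1)
shared = global_selection(X, Y, k=1)
```

In this toy case each per-label problem keeps the one feature it needs, while a single shared feature set of the same size has to drop a feature that one of the labels depends on.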

23 citations


Journal Article
TL;DR: An experimental framework in which the features are observed with measurement errors and the costs depend on the quality of the features, which can be recommended in a situation when one wants to balance low costs and high prediction performance.

20 citations


Journal Article
TL;DR: An extensive experimental analysis with several multi-label datasets, different noise levels and a large number of evaluation metrics for MLC has shown that the ensemble of classifier chains (ECC) algorithm has better performance with CC4.5 as base classifier than with C4.5.
Abstract: In this work, we have considered the ensemble of classifier chains (ECC) algorithm in order to solve the multi-label classification (MLC) task. It starts from the binary relevance algorithm (BR), a simple and direct approach to MLC that has been shown to provide good results in practice. Nevertheless, unlike BR, ECC aims to exploit the correlations between labels. ECC uses a traditional supervised classification algorithm to approach the binary problems. Within this field, Credal C4.5 (CC4.5) is a new version of the well-known C4.5 algorithm that uses imprecise probabilities in order to estimate the probability distribution of the class variable. This new version of the C4.5 algorithm has been shown to provide better performance when noisy datasets are classified. In MLC, the intrinsic noise might be higher than in traditional supervised classification. The reason is very simple: in MLC, there are multiple labels, whereas in traditional classification there is just one class variable. Thus, there is a higher probability of error for an instance. For these reasons, the performance of ECC with CC4.5 as base classifier is studied in this work. We have carried out an extensive experimental analysis with several multi-label datasets, different noise levels and a large number of evaluation metrics for MLC. This experimental study has shown that, generally, ECC has better performance with CC4.5 as base classifier than with C4.5. The higher the label noise level introduced in the data, the more significant this improvement is. Therefore, it is probably suitable to use imprecise probabilities in Decision Trees within MLC.
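ECC is commonly described as training several chains with different label orders and combining their per-label votes. The combination step alone can be sketched as below. The function name, the 0.5 default threshold, and the vote matrix are assumptions for illustration, and the chains themselves (with C4.5 or CC4.5 as base classifier) are outside the scope of this sketch.

```python
from typing import List


def ecc_vote(chain_predictions: List[List[int]], threshold: float = 0.5) -> List[int]:
    """Combine 0/1 predictions from several chains by per-label vote share.

    chain_predictions[c][j] is the prediction of chain c for label j.
    A label is switched on when its share of positive votes reaches the threshold.
    """
    n_chains = len(chain_predictions)
    n_labels = len(chain_predictions[0])
    combined = []
    for j in range(n_labels):
        share = sum(pred[j] for pred in chain_predictions) / n_chains
        combined.append(1 if share >= threshold else 0)
    return combined


# Hypothetical votes from three chains over three labels.
votes = [[1, 0, 1], [1, 1, 0], [0, 0, 1]]
combined = ecc_vote(votes)
```

Lowering the threshold trades precision for recall on each label, so it is usually treated as a tunable parameter.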

16 citations


Journal Article
Xiaodong Yang, Jie Song, Xin Wu, Lin Xie, Xuwen Liu, Guanglin Li
TL;DR: A novel multi-label classification (MLC) method is introduced to identify healthy and unhealthy P. notoginseng powders from three different geographical origins and ECC exhibits superior performance in particular.

13 citations


Proceedings Article
01 Sep 2019
TL;DR: A computational model is proposed that incorporates the dependencies between four states (tiredness, anxiety, pain, and engagement) known to appear in virtual rehabilitation sessions of post-stroke patients, to improve the automatic recognition of the patients' states.
Abstract: The automatic recognition of multiple affective states can be enhanced if the underpinning computational models explicitly consider the interactions between the states. This work proposes a computational model that incorporates the dependencies between four states (tiredness, anxiety, pain, and engagement) known to appear in virtual rehabilitation sessions of post-stroke patients, to improve the automatic recognition of the patients' states. A dataset of five stroke patients, which includes their finger pressure (PRE), hand movements (MOV), and facial expressions (FAE) during ten sessions of virtual rehabilitation, was used. Our computational proposal uses the Semi-Naive Bayesian classifier (SNBC) as the base classifier in a multiresolution approach to create a multimodal model with the three sensors (PRE, MOV, and FAE) with late fusion using SNBC (FSNB classifier). There is an FSNB classifier for each state, and they are linked in a circular classifier chain (CCC) to exploit the dependency relationships between the states. The CCC achieves over 90% ROC AUC for the four states. Relationships of mutual exclusion between engagement and all the other states, and some co-occurrences between pain and anxiety, were detected for the five patients. Virtual rehabilitation platforms that incorporate the automatic recognition of multiple patient states could leverage intelligent and empathic interactions to promote adherence to rehabilitation exercises.

Posted Content
Yi Zhang, Cheng Zeng, Hao Cheng, Chongjun Wang, Lei Zhang
TL;DR: This work proposes a novel instance-oriented Multi-modal Classifier Chains (MCC) algorithm, which can make convincing predictions with partial modalities for the MMML problem, and reveals that it may be better to extract many instead of all of the modalities at hand.
Abstract: With the emergence of diverse data collection techniques, objects in real applications can be represented as multi-modal features. What's more, objects may have multiple semantic meanings. The Multi-modal and Multi-label (MMML) problem has become a universal phenomenon. The quality of data collected from different channels is inconsistent, and some of it may not benefit prediction. In real life, not all the modalities are needed for prediction. As a result, we propose a novel instance-oriented Multi-modal Classifier Chains (MCC) algorithm for the MMML problem, which can make convincing predictions with partial modalities. MCC extracts different modalities for different instances in the testing phase. Extensive experiments are performed on one real-world herbs dataset and two public datasets to validate our proposed algorithm, which reveals that it may be better to extract many instead of all of the modalities at hand.

Book Chapter
17 Dec 2019
TL;DR: This paper proposes a neural network algorithm, CascadeML, to train multi-label neural networks based on cascade neural networks, which requires minimal or no hyperparameter tuning and also considers pairwise label associations.
Abstract: In multi-label classification a datapoint can be labelled with more than one class at the same time. A common but trivial approach to multi-label classification is to train individual binary classifiers per label, but the performance can be improved by considering associations between the labels, and algorithms like classifier chains and RAKEL do this effectively. Like most machine learning algorithms, however, these approaches require accurate hyperparameter tuning, a computationally expensive optimisation problem. Tuning is important to train a good multi-label classifier model. There is a scarcity in the literature of effective multi-label classification approaches that do not require extensive hyperparameter tuning. This paper addresses this scarcity by proposing CascadeML, a multi-label classification approach based on a cascade neural network that takes label associations into account and requires minimal hyperparameter tuning. The performance of the CascadeML approach is evaluated using 10 multi-label datasets and compared with other leading multi-label classification algorithms. Results show that CascadeML performs comparably to the leading approaches but without a need for hyperparameter tuning.

Proceedings Article
Yi Zhang, Cheng Zeng, Hao Cheng, Chongjun Wang, Lei Zhang
08 Jul 2019
TL;DR: This paper proposes an instance-oriented Multi-modal Classifier Chains (MCC) algorithm for the MMML problem, which can make convincing predictions with partial modalities.
Abstract: With the emergence of diverse data collection techniques, objects in real applications can be represented as multi-modal features. What's more, objects may have multiple semantic meanings. The Multi-modal and Multi-label [1] (MMML) problem has become a universal phenomenon. The quality of data collected from different channels is inconsistent, and some of it may not benefit prediction. In real life, not all the modalities are needed for prediction. As a result, we propose a novel instance-oriented Multi-modal Classifier Chains (MCC) algorithm for the MMML problem, which can make convincing predictions with partial modalities. MCC extracts different modalities for different instances in the testing phase. Extensive experiments are performed on one real-world herbs dataset and two public datasets to validate our proposed algorithm, which reveals that it may be better to extract many instead of all of the modalities at hand.

Proceedings Article
01 Oct 2019
TL;DR: This paper uses machine learning techniques, especially multi-label classification methods, to classify whether the given source code is affected by more than one code smell, and shows that the Random Forest algorithm performs better than the Decision Tree, Naive Bayes, Support Vector Machine and Neural Network algorithms.
Abstract: Code smells in source code indicate weaknesses of design or implementation. To detect code smells, several detection tools have been developed. However, these tools generally produce different results, since code smells are subjectively interpreted, informally defined and configured by the developers, domain-dependent, and based on opinions and experiences. To cope with these issues, in this paper, we have used machine learning techniques, especially multi-label classification methods, to classify whether the given source code is affected by more than one code smell. We have conducted experiments on four code smell datasets and transformed them into two multi-label datasets (one for method level and the other for class level). Two multi-label classification methods (Classifier Chains and Label Combination) and their ensemble models were applied to the converted datasets using five different base classifiers. The results show that, as a base classifier, the Random Forest algorithm performs better than the Decision Tree, Naive Bayes, Support Vector Machine and Neural Network algorithms.

Posted Content
TL;DR: A student performance prediction model that predicts the performance of high school students for the next semester for five courses is developed and achieved better performance in terms of different evaluation metrics when compared to other multi-label learning tasks such as binary relevance and classifier chains.
Abstract: One of the important measures of quality of education is the performance of students in academic settings. Nowadays, educational institutions store abundant data about students, which can help to discover insights into how students are learning and how to improve their performance ahead of time using data mining techniques. In this paper, we developed a student performance prediction model that predicts the performance of high school students for the next semester for five courses. We modeled our prediction system as a multi-label classification task and used support vector machine (SVM), Random Forest (RF), K-nearest Neighbors (KNN), and Multi-layer perceptron (MLP) as base classifiers to train our model. We further improved the performance of the prediction model using state-of-the-art partitioning schemes to divide the label space into smaller spaces and used the Label Powerset (LP) transformation method to transform each labelset into a multi-class classification task. The proposed model achieved better performance in terms of different evaluation metrics when compared to other multi-label learning approaches such as binary relevance and classifier chains.

Journal Article
TL;DR: By incorporating the proposed two-stage feature selection approach, the multi-label classifiers with label-dependent features achieve on average 9.4% performance improvement in Exact-Match compared with the original classifiers.
Abstract: Multi-label classification faces several critical challenges, including modeling label correlations, mitigating label imbalance, removing irrelevant and redundant features, and reducing the complexity for large-scale problems. To address these issues, in this paper, we propose a novel method—polytree-augmented classifier chains with label-dependent features—that models label correlations through flexible polytree structures based on low-dimensional label-dependent feature spaces learned by a two-stage feature selection approach. First, a feature weighting approach is applied to efficiently remove irrelevant features for each label and mitigate the effect of label imbalance. Second, a polytree structure is built in the label space using estimated conditional mutual information. Third, an appropriate label-dependent feature subset is found by taking account of label correlations in the polytree. Extensive empirical studies on six synthetic datasets and 12 real-world datasets demonstrate the superior performance of the proposed method. In addition, by incorporating the proposed two-stage feature selection approach, the multi-label classifiers with label-dependent features achieve on average 9.4% performance improvement in Exact-Match compared with the original classifiers.

Proceedings Article
01 Dec 2019
TL;DR: This paper proposes a multilabel aspect-based sentiment classification model for Abilify drug user reviews and studies the problem transformation approaches binary relevance, classifier chains, and label powerset to classify Abilify user reviews into a set of aspect term sentiments.
Abstract: Multilabel text classification plays an important role in text mining applications such as sentiment analysis and health informatics. In this paper, we propose a multilabel aspect-based sentiment classification model for Abilify drug user reviews. First, we employ preprocessing techniques to improve the quality of the data. Second, the term frequency-inverse document frequency (TF-IDF) features are extracted with Bag of Words (BoW). Third, a joint feature selection (JFS) method with Information Gain (IG) is applied to select label-specific features and label-sharing features. Moreover, the multilabel classification task can be solved using problem transformation approaches, adapted algorithm approaches, and ensemble approaches. Finally, we study the problem transformation approaches binary relevance (BR), classifier chains (CC), and label powerset (LP) to classify Abilify user reviews into a set of aspect term sentiments (ATS). The baseline classifiers Naive Bayes (NB), decision tree (DT), and support vector machine (SVM) are employed on both feature sets. The proposed method is evaluated on multilabel metrics such as accuracy, Hamming Loss, micro-averaged F1, and accuracy per label. The empirical results show that the support vector machine outperforms the other classifiers.
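Two of the multilabel metrics named above have standard definitions that fit in a few lines. The sketch below uses made-up predictions and hypothetical function names; it makes no claim about how the paper computed its scores, and it leaves out "accuracy", whose multilabel definition varies between papers.

```python
from typing import List


def hamming_loss(Y_true: List[List[int]], Y_pred: List[List[int]]) -> float:
    """Fraction of individual label assignments that are wrong."""
    total = sum(len(row) for row in Y_true)
    wrong = sum(
        int(t != p)
        for row_t, row_p in zip(Y_true, Y_pred)
        for t, p in zip(row_t, row_p)
    )
    return wrong / total


def micro_f1(Y_true: List[List[int]], Y_pred: List[List[int]]) -> float:
    """F1 computed from true/false positives pooled over all labels."""
    tp = fp = fn = 0
    for row_t, row_p in zip(Y_true, Y_pred):
        for t, p in zip(row_t, row_p):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical ground truth and predictions: 2 instances, 3 labels.
Y_true = [[1, 0, 1], [0, 1, 0]]
Y_pred = [[1, 0, 0], [0, 1, 1]]
hl = hamming_loss(Y_true, Y_pred)  # 2 wrong cells out of 6
f1 = micro_f1(Y_true, Y_pred)      # tp=2, fp=1, fn=1
```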

Proceedings Article
01 Feb 2019
TL;DR: This paper proposes an approach for using correlation among labels based on structure of CC by defining a large-margin model between two predicted labels, directly exploiting the correlation between them in a more interpretable way.
Abstract: Multi-label classification is a challenging task in machine learning concerned with assigning a sample to a subset of the available label set; that is, a sample can belong to multiple labels. Furthermore, the high dimensionality of data and complex correlation between labels make it even more interesting. For this reason, it has attracted many researchers in recent years. Classifier chains (CC), one of the well-known methods for multi-label classification, is based on the binary relevance (BR) method. It incorporates label correlation by assuming an order for labels and inserting previous label outputs into the feature space, and achieves higher performance while still retaining relatively low time complexity. But using predicted labels as features might not be very interpretable with regard to integrating label correlation into the model, especially considering there could be different types of features in a dataset. In this paper, we propose an approach for using correlation among labels based on the structure of CC by defining a large-margin model between two predicted labels, thus directly exploiting the correlation between them in a more interpretable way. The proposed approach is evaluated using 9 multi-label datasets and 2 evaluation metrics. Empirical experiments show promising results and demonstrate the effectiveness of the proposed method against the classifier chains algorithm.

Proceedings Article
18 Oct 2019
TL;DR: In this paper, the authors propose another approach by building stacking MLC with model selection, which has three steps: (1) building an MLC model; (2) applying a stacking model to the output of the first step; and (3) utilizing a feature selection technique to select the proper models for final prediction.
Abstract: The objective of this study was to automate job performance prediction based on the DISC personality test. We transformed this problem into Multi-Label Classification (MLC) by using employees' job performances as labels. In this study, three widely used MLC techniques, Binary Relevance (BR), Label Powerset (LP) and Classifier Chains (CC), were employed for the prediction of job performance. However, these traditional techniques didn't show promising results. Therefore, we proposed another approach by building stacking MLC with model selection. The proposed method has three steps: (1) building an MLC model; (2) applying a stacking model to the output of the first step; and (3) utilizing a feature selection technique to select the proper models for final prediction. Using surveys from a big financial company in Thailand, we found that the proposed approach shows better performance compared to the traditional MLC techniques.

Proceedings Article
02 May 2019
TL;DR: This paper proposes a method called Parallel Classifier Chains which enables the parallelization of Classifier Chains and builds k binary classifiers in parallel, where each of them includes as extra input features the predictions of those labels that have been previously built.
Abstract: Multi-label classification has attracted increasing attention from the scientific community in recent years, given its ability to solve problems where each of the examples simultaneously belongs to multiple labels. Of all the techniques developed to solve multi-label classification problems, Classifier Chains has been demonstrated to be one of the best performing. However, one of its main drawbacks is its inherently sequential definition. Although many research works have aimed to reduce the runtime of multi-label classification algorithms, to the best of our knowledge, there are no proposals to specifically reduce the runtime of Classifier Chains. Therefore, in this paper we propose a method called Parallel Classifier Chains which enables the parallelization of Classifier Chains. In this way, Parallel Classifier Chains builds k binary classifiers in parallel, where each of them includes as extra input features the predictions of those labels that have been previously built. We performed an experimental evaluation over 20 datasets using 5 metrics to analyze both the runtime and the predictive performance of our proposal. The results of the experiments confirmed that our proposal was able to significantly reduce the runtime of Classifier Chains while the predictive performance was not statistically significantly harmed.

Patent
19 Mar 2019
TL;DR: In this paper, a dynamic classifier chain adjusting method for multi-label classification is proposed, which reduces the probability that the output is wrongly divided in eyes near a threshold value, the randomness of a mark prediction sequence existing in a classifier sequence is relieved; uncertainty and instability of classification results are realized.
Abstract: The invention discloses a dynamic classifier chain adjusting method for multi-label classification, belonging to the field of machine learning. The technical problem to be solved by the invention is how to reduce the probability that the output is wrongly divided in eyes near a threshold value, the randomness of a mark prediction sequence existing in a classifier chain is relieved; uncertainty and instability of classification results are realized; the adopted technical scheme is as follows: the device comprises a base, the method is characterized in that training data are concentrated; respectively counting the co-occurrence frequency of each mark and the marks except the mark, and progressively decreasing and sorting; the method comprises the following steps: selecting a classifier from a training data set to complete the sorting of classifier chains in the training data set, randomly selecting one classifier from the classifier chains to complete the classification of unknown samples, setting two thresholds in advance during the classification of the unknown samples, and completing the classification of the unknown samples according to the output values of the randomly selected classifiers and the sizes of the two thresholds.

Book Chapter
10 Sep 2019
TL;DR: This paper proposed two concepts of classifier chains algorithms that are able to change label order of the chain without rebuilding the entire model and developed a simple heuristic that allows the system to find relatively good label order.
Abstract: In this paper, we deal with the task of building a dynamic ensemble of chain classifiers for multi-label classification. To do so, we proposed two concepts of the classifier chain algorithms that are able to change the label order of the chain without rebuilding the entire model. Such models allow anticipating the instance-specific chain order without the significant increase in the computational burden. The proposed chain models are built using the Naive Bayes classifier and nearest neighbour approaches. To take the benefits of the proposed algorithms, we developed a simple heuristic that allows the system to find relatively good label order. The experimental results showed that the proposed models and the heuristic are efficient tools for building dynamic chain classifiers.


Proceedings Article
01 Oct 2019
TL;DR: This paper aims to create a prototype model that is capable of detecting various types of toxicity like neutral, toxic, severe toxic, threats, obscenity, insults and identity hate using Genetic Algorithms over a Partial CC (PartCC) model, which is a modification over CC.
Abstract: Multi-label classification (MLC) can be defined as the objective of learning a classification model that can infer the accurate labels of new, previously unseen objects, where each object of the dataset may rightfully belong to multiple class labels. While single-label classification problems have been thoroughly researched, the same cannot be said for MLC. A gradually increasing number of problems are now being tackled as multi-label, allowing for richer and more accurate knowledge mining in real-world domains, such as medical diagnoses, social media, text classification, etc. Currently, there are two ways of solving MLC problems: the Problem Transformation Approach and the Algorithm Adaptation Method. Of the two, the former includes Classifier Chains (CC), which is the most effective and popular method of solving MLC problems because of its simplicity of implementation. Unfortunately, CC has two drawbacks: (1) the ordering of the labels for classification is decided randomly, without a fixed logic or algorithm, which results in varying accuracy; (2) all the labels, even those which may be redundant for a particular dataset, are put into the chain despite the probability that some may be carrying irrelevant details. In this study, both challenges, along with others detailed further on, are tackled simultaneously using Genetic Algorithms (GA) over a Partial CC (PartCC) model, which is a modification over CC. A toxic comments dataset is used since its classification is a multi-label text classification problem with a highly imbalanced dataset. This paper aims to create a prototype model that is capable of detecting various types of toxicity like neutral, toxic, severe toxic, threats, obscenity, insults and identity hate. With the explosion of social media in the modern world and the resulting increase in social media hatred and bullying, there is a need for an advanced prototype model to predict the toxicity of each class of comments.

Dissertation
01 Jan 2019
TL;DR: A comparatively simple NN architecture that uses a loss function which ignores label dependencies is proposed, and it is demonstrated that simpler NNs using cross-entropy per label work better than more complex NNs, particularly in terms of rank loss.
Abstract: Multi-label classification (MLC) is the task of predicting a set of labels for a given input instance. A key challenge in MLC is how to capture underlying structures in label spaces. Due to the computational cost of learning from all possible label combinations, it is crucial to take into account scalability as well as predictive performance when we deal with large-scale MLC problems. Another problem that arises when building MLC systems is which evaluation measures should be used for performance comparison. Unlike traditional multi-class classification, several evaluation measures are often used together in MLC because each measure prefers a different MLC system. In other words, we need to understand the properties of MLC evaluation measures and build a system which performs well in terms of those evaluation measures in which we are particularly interested. In this thesis, we develop neural network architectures that efficiently and effectively utilize underlying label structures in large-scale MLC problems. In the literature, neural networks (NNs) that learn from pairwise relationships between labels have been used, but they do not scale well to large label spaces. Thus, we propose a comparatively simple NN architecture that uses a loss function which ignores label dependencies. We demonstrate that simpler NNs using cross-entropy per label work better than more complex NNs, particularly in terms of rank loss, an evaluation measure that takes into account the number of incorrectly ranked label pairs. Another commonly considered evaluation measure is subset 0/1 loss. Classifier chains (CCs) have shown state-of-the-art performance in terms of that measure because the joint probability of labels is optimized explicitly. CCs essentially convert the problem of learning the joint probability into a sequential prediction problem. Then, the task is to predict a sequence of binary values for labels.
In contrast to the aforementioned NN architecture, which ignores label structures, we study recurrent neural networks (RNNs) so as to make use of sequential structures on label chains. The proposed RNNs are advantageous over CC approaches when dealing with a large number of labels, due to parameter sharing effects in RNNs and their ability to learn from long sequences. Our experimental results also confirm their superior performance on very large label spaces. In addition to NNs that learn from label sequences, we present two novel NN-based methods that learn a joint space of instances and labels efficiently while exploiting label structures. The proposed joint space learning methods project both instances and labels into a lower-dimensional space in a way that minimizes the distance between an instance and its relevant labels in that space. While the goal of both joint space learning methods is the same, they use different additional information on label spaces during training: one approach makes use of hierarchical structures of labels and can be useful when such label structures are given by human experts. The other uses latent label spaces learned from textual label descriptions so that we can apply it to more general MLC problems where no explicit label structures are available. Notwithstanding the difference between the two approaches, both allow us to make predictions with respect to labels that have not been seen during training.
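
Illustration: the abstract above describes classifier chains as turning the joint probability of the labels into a sequential prediction problem. The sketch below is a minimal, hypothetical example of that idea and is not the thesis's implementation. The three conditional models (`p_label1`, `p_label2`, `p_label3`) and their probabilities are made up for illustration; in practice each would be a trained binary classifier. The sketch shows the chain-rule factorization, greedy label-by-label inference, exhaustive inference over all label vectors, and subset 0/1 loss.

```python
from itertools import product

# Hypothetical conditional models for a 3-label chain. Each returns
# P(y_j = 1 | x, y_1, ..., y_{j-1}). The numbers are illustrative only.
def p_label1(x):
    return 0.6 if x > 0 else 0.2

def p_label2(x, y1):
    return 0.9 if y1 == 1 else 0.3

def p_label3(x, y1, y2):
    return 0.8 if y1 == y2 else 0.1

CHAIN = [p_label1, p_label2, p_label3]

def joint_probability(x, labels):
    """Chain rule: P(y | x) = prod_j P(y_j | x, y_1, ..., y_{j-1})."""
    prob = 1.0
    for j, model in enumerate(CHAIN):
        p_one = model(x, *labels[:j])
        prob *= p_one if labels[j] == 1 else 1.0 - p_one
    return prob

def greedy_predict(x):
    """Predict labels one at a time, feeding earlier predictions forward."""
    labels = []
    for model in CHAIN:
        p_one = model(x, *labels)
        labels.append(1 if p_one >= 0.5 else 0)
    return tuple(labels)

def exhaustive_predict(x):
    """Pick the label vector with the highest joint probability.

    This enumerates all 2^L label vectors, so it is only feasible
    for short chains.
    """
    candidates = product([0, 1], repeat=len(CHAIN))
    return max(candidates, key=lambda y: joint_probability(x, y))

def subset_zero_one_loss(y_true, y_pred):
    """Return 0 only if every label matches, otherwise 1."""
    return 0 if tuple(y_true) == tuple(y_pred) else 1

if __name__ == "__main__":
    x = 1.0
    print("greedy:", greedy_predict(x))
    print("exhaustive:", exhaustive_predict(x))
    print("P(greedy | x):", joint_probability(x, greedy_predict(x)))
```

Greedy inference costs one model call per label, while exhaustive search grows exponentially with the number of labels. In this toy chain both return the same label vector, which need not hold in general.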

Posted Content
TL;DR: In this paper, the authors identify and discuss the main limitations of regressor chains, including an analysis of different base models, loss functions, explainability, and other desiderata of real-world applications.
Abstract: A large number and diversity of techniques have been offered in the literature in recent years for solving multi-label classification tasks, including classifier chains, where predictions are cascaded to other models as additional features. The idea of extending this chaining methodology to multi-output regression has already been suggested and trialed: regressor chains. However, this has so far been limited to greedy inference, has provided relatively poor results compared to individual models, and has been of limited applicability. In this paper we identify and discuss the main limitations, including an analysis of different base models, loss functions, explainability, and other desiderata of real-world applications. To overcome the identified limitations we study and develop methods for regressor chains. In particular, we present a sequential Monte Carlo scheme in the framework of a probabilistic regressor chain, and we show it can be effective, flexible and useful on several types of data. We place regressor chains in the general context of multi-output learning with continuous outputs, and in doing so shed additional light on classifier chains.