
Showing papers in "Machine Learning in 2004"


Journal ArticleDOI
TL;DR: The Support Vector Data Description (SVDD) is presented which obtains a spherically shaped boundary around a dataset and analogous to the Support Vector Classifier it can be made flexible by using other kernel functions.
Abstract: Data domain description concerns the characterization of a data set. A good description covers all target data but includes no superfluous space. The boundary of a dataset can be used to detect novel data or outliers. We will present the Support Vector Data Description (SVDD) which is inspired by the Support Vector Classifier. It obtains a spherically shaped boundary around a dataset and analogous to the Support Vector Classifier it can be made flexible by using other kernel functions. The method is made robust against outliers in the training set and is capable of tightening the description by using negative examples. We show characteristics of the Support Vector Data Descriptions using artificial and real data.
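
The spherical description above can be reproduced with standard tooling. Below is a minimal, illustrative sketch using scikit-learn's OneClassSVM, which with an RBF kernel (constant k(x, x)) yields the same kind of kernelized boundary as the SVDD; the data and parameter values are made up for the example rather than taken from the paper.

```python
# Minimal sketch of an SVDD-style one-class data description.
# With an RBF kernel the one-class SVM boundary is of the same kernelized
# spherical form as the SVDD; all values below are illustrative.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))       # target data
X_test = np.vstack([rng.normal(0, 1, (20, 2)),                 # more targets
                    rng.uniform(-6, 6, (20, 2))])              # possible outliers

# nu plays the role of the SVDD trade-off (fraction of rejected targets),
# gamma controls how flexible the kernel boundary is.
svdd = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X_train)

pred = svdd.predict(X_test)            # +1 = inside the description, -1 = outlier
print("rejected as outliers:", np.sum(pred == -1))
```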

2,789 citations


Journal ArticleDOI
TL;DR: An algorithmic framework to classify a partially labeled data set in a principled manner and models the manifold using the adjacency graph for the data and approximates the Laplace-Beltrami operator by the graph Laplacian.
Abstract: We consider the general problem of utilizing both labeled and unlabeled data to improve classification accuracy. Under the assumption that the data lie on a submanifold in a high dimensional space, we develop an algorithmic framework to classify a partially labeled data set in a principled manner. The central idea of our approach is that classification functions are naturally defined only on the submanifold in question rather than the total ambient space. Using the Laplace-Beltrami operator one produces a basis (the Laplacian Eigenmaps) for a Hilbert space of square integrable functions on the submanifold. To recover such a basis, only unlabeled examples are required. Once such a basis is obtained, training can be performed using the labeled data set. Our algorithm models the manifold using the adjacency graph for the data and approximates the Laplace-Beltrami operator by the graph Laplacian. We provide details of the algorithm, its theoretical justification, and several practical applications for image, speech, and text classification.
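
As a rough illustration of the approach described above (not the authors' implementation), the following sketch builds a k-nearest-neighbour adjacency graph over labeled and unlabeled points, takes the smallest eigenvectors of the graph Laplacian as an eigenmap basis, and fits a classifier on the labeled rows only; logistic regression stands in for the least-squares fit used in the paper, and all parameter values are illustrative.

```python
# Sketch: graph-Laplacian basis from all points, classifier trained on the
# few labeled points, evaluated on the unlabeled remainder.
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.neighbors import kneighbors_graph
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=400, noise=0.08, random_state=0)
labeled = np.zeros(len(X), dtype=bool)
labeled[np.random.default_rng(0).choice(len(X), size=20, replace=False)] = True

W = kneighbors_graph(X, n_neighbors=10, mode="connectivity", include_self=False)
W = 0.5 * (W + W.T)                               # symmetrize the adjacency graph
L = laplacian(W, normed=True).toarray()

eigvals, eigvecs = np.linalg.eigh(L)              # eigenmaps = smallest eigenvectors
basis = eigvecs[:, :10]                           # basis functions on the manifold

clf = LogisticRegression().fit(basis[labeled], y[labeled])
print("accuracy on unlabeled points:", clf.score(basis[~labeled], y[~labeled]))
```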

886 citations


Journal ArticleDOI
TL;DR: This work empirically evaluates several state-of-the-art methods for constructing ensembles of heterogeneous classifiers with stacking and shows that they perform (at best) comparably to selecting the best classifier from the ensemble by cross validation and proposes two extensions of this method using an extended set of meta-level features and multi-response model trees to learn at the meta- level.
Abstract: We empirically evaluate several state-of-the-art methods for constructing ensembles of heterogeneous classifiers with stacking and show that they perform (at best) comparably to selecting the best classifier from the ensemble by cross validation. Among state-of-the-art stacking methods, stacking with probability distributions and multi-response linear regression performs best. We propose two extensions of this method, one using an extended set of meta-level features and the other using multi-response model trees to learn at the meta-level. We show that the latter extension performs better than existing stacking approaches and better than selecting the best classifier by cross validation.
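
The best-performing variant discussed above, stacking with class-probability distributions and multi-response linear regression at the meta-level, can be sketched as follows; the base learners, dataset, and fold counts are arbitrary stand-ins, not the paper's experimental setup.

```python
# Sketch of stacking with probability distributions and multi-response
# linear regression (MLR): base-level class probabilities come from
# cross-validation, and one linear model per class is fit on 0/1 indicators.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = [DecisionTreeClassifier(random_state=0), GaussianNB(), KNeighborsClassifier()]

# Meta-level features: concatenated class-probability distributions.
meta_tr = np.hstack([cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")
                     for m in base])
for m in base:
    m.fit(X_tr, y_tr)
meta_te = np.hstack([m.predict_proba(X_te) for m in base])

# Multi-response linear regression: one response per class.
mlr = LinearRegression().fit(meta_tr, np.eye(3)[y_tr])
pred = mlr.predict(meta_te).argmax(axis=1)
print("stacking accuracy:", (pred == y_te).mean())
```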

768 citations


Journal ArticleDOI
TL;DR: This paper describes recent research applying machine learning methods to the problem of classifying the cognitive state of a human subject based on fMRI data observed over a single time interval, and presents case studies in which classifiers are successfully trained to distinguish cognitive states.
Abstract: Over the past decade, functional Magnetic Resonance Imaging (fMRI) has emerged as a powerful new instrument to collect vast quantities of data about activity in the human brain. A typical fMRI experiment can produce a three-dimensional image related to the human subject's brain activity every half second, at a spatial resolution of a few millimeters. As in other modern empirical sciences, this new instrumentation has led to a flood of new data, and a corresponding need for new data analysis methods. We describe recent research applying machine learning methods to the problem of classifying the cognitive state of a human subject based on fMRI data observed over a single time interval. In particular, we present case studies in which we have successfully trained classifiers to distinguish cognitive states such as (1) whether the human subject is looking at a picture or a sentence, (2) whether the subject is reading an ambiguous or non-ambiguous sentence, and (3) whether the word the subject is viewing is a word describing food, people, buildings, etc. This learning problem provides an interesting case study of classifier learning from extremely high dimensional (10^5 features), extremely sparse (tens of training examples), noisy data. This paper summarizes the results obtained in these three case studies, as well as lessons learned about how to successfully apply machine learning methods to train classifiers in such settings.

733 citations


Journal ArticleDOI
TL;DR: Both the SVM and LS-SVM classifier with RBF kernel in combination with standard cross-validation procedures for hyperparameter selection achieve comparable test set performances, consistently very good when compared to a variety of methods described in the literature.
Abstract: In Support Vector Machines (SVMs), the solution of the classification problem is characterized by a (convex) quadratic programming (QP) problem. In a modified version of SVMs, called Least Squares SVM classifiers (LS-SVMs), a least squares cost function is proposed so as to obtain a linear set of equations in the dual space. While the SVM classifier has a large margin interpretation, the LS-SVM formulation is related in this paper to a ridge regression approach for classification with binary targets and to Fisher's linear discriminant analysis in the feature space. Multiclass categorization problems are represented by a set of binary classifiers using different output coding schemes. While regularization is used to control the effective number of parameters of the LS-SVM classifier, the sparseness property of SVMs is lost due to the choice of the 2-norm. Sparseness can be imposed in a second stage by gradually pruning the support value spectrum and optimizing the hyperparameters during the sparse approximation procedure. In this paper, twenty public domain benchmark datasets are used to evaluate the test set performance of LS-SVM classifiers with linear, polynomial and radial basis function (RBF) kernels. Both the SVM and LS-SVM classifier with RBF kernel in combination with standard cross-validation procedures for hyperparameter selection achieve comparable test set performances. These SVM and LS-SVM performances are consistently very good when compared to a variety of methods described in the literature including decision tree based algorithms, statistical algorithms and instance based learning methods. We show on ten UCI datasets that the LS-SVM sparse approximation procedure can be successfully applied.
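
A minimal sketch of the LS-SVM classifier's training step follows: instead of a QP, a single linear system in the bias b and support values alpha is solved, using the standard dual formulation with an RBF kernel. The regularization constant gamma and kernel width sigma below are illustrative values, not tuned as in the paper.

```python
# LS-SVM classifier: solve one linear system in (b, alpha) instead of a QP.
import numpy as np

def rbf(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    n = len(y)
    Omega = (y[:, None] * y[None, :]) * rbf(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(n) / gamma       # ridge-like regularization term
    rhs = np.concatenate([[0.0], np.ones(n)])
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]                      # bias b, support values alpha

def lssvm_predict(X_new, X, y, b, alpha, sigma=1.0):
    return np.sign(rbf(X_new, X, sigma) @ (alpha * y) + b)

# toy usage with labels in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(+1, 1, (30, 2))])
y = np.array([-1.0] * 30 + [1.0] * 30)
b, alpha = lssvm_fit(X, y)
print("training accuracy:", (lssvm_predict(X, X, y, b, alpha) == y).mean())
```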

698 citations


Journal ArticleDOI
TL;DR: This paper evaluates the performance of different criterion functions in the context of partitional clustering algorithms for document datasets, and shows that there is a set of criterion functions that consistently outperform the rest.
Abstract: This paper evaluates the performance of different criterion functions in the context of partitional clustering algorithms for document datasets. Our study involves a total of seven different criterion functions, three of which are introduced in this paper and four that have been proposed in the past. We present a comprehensive experimental evaluation involving 15 different datasets, as well as an analysis of the characteristics of the various criterion functions and their effect on the clusters they produce. Our experimental results show that there is a set of criterion functions that consistently outperform the rest, and that some of the newly proposed criterion functions lead to the best overall results. Our theoretical analysis shows that the relative performance of the criterion functions depends on (i) the degree to which they can correctly operate when the clusters are of different tightness, and (ii) the degree to which they can lead to reasonably balanced clusters.
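
As a concrete example of the kind of criterion function compared above, the sketch below computes one widely used cosine-based criterion (the I2 criterion in this line of work): the score of a clustering is the sum, over clusters, of the norm of the composite vector of unit-normalized documents. The random data is only for illustration.

```python
# Illustrative cosine-based partitional clustering criterion (I2-style).
import numpy as np

def i2_criterion(doc_vectors, labels):
    """doc_vectors: (n_docs, n_terms) matrix; labels: cluster id per document."""
    X = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    return sum(np.linalg.norm(X[labels == c].sum(axis=0))
               for c in np.unique(labels))

# usage: higher is better, so compare candidate clusterings of the same documents
rng = np.random.default_rng(0)
docs = rng.random((100, 50))
print(i2_criterion(docs, rng.integers(0, 5, 100)))
```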

660 citations


Journal ArticleDOI
TL;DR: This paper considers the problem of partitioning a set of m points in the n-dimensional Euclidean space into k clusters, and considers a continuous relaxation of this discrete problem: find the k-dimensional subspace V that minimizes the sum of squared distances to V of the m points, and argues that the relaxation provides a generalized clustering which is useful in its own right.
Abstract: We consider the problem of partitioning a set of m points in the n-dimensional Euclidean space into k clusters (usually m and n are variable, while k is fixed), so as to minimize the sum of squared distances between each point and its cluster center. This formulation is usually the objective of the k-means clustering algorithm (Kanungo et al. (2000)). We prove that this problem is NP-hard even for k = 2, and we consider a continuous relaxation of this discrete problem: find the k-dimensional subspace V that minimizes the sum of squared distances to V of the m points. This relaxation can be solved by computing the Singular Value Decomposition (SVD) of the m × n matrix A that represents the m points; this solution can be used to get a 2-approximation algorithm for the original problem. We then argue that in fact the relaxation provides a generalized clustering which is useful in its own right. Finally, we show that the SVD of a random submatrix—chosen according to a suitable probability distribution—of a given matrix provides an approximation to the SVD of the whole matrix, thus yielding a very fast randomized algorithm. We expect this algorithm to be the main contribution of this paper, since it can be applied to problems of very large size which typically arise in modern applications.
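
The continuous relaxation described above can be sketched in a few lines: compute the SVD, project the points onto the span of the top-k right singular vectors, and cluster in that subspace. The 2-approximation argument concerns the optimal discrete clustering in the projected space; plain k-means below is just a practical stand-in, and the random matrix is illustrative.

```python
# SVD relaxation of the k-means objective: project onto the best rank-k
# subspace, then cluster the projected points.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 40))                 # m points in n-dimensional space
k = 3

# Rank-k SVD: the top right singular vectors span the subspace minimizing
# the total squared distance of the points to it.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_proj = A @ Vt[:k].T @ Vt[:k]                 # projection onto span(V_k)

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(A_proj)
print(np.bincount(labels))
```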

523 citations


Journal ArticleDOI
João Gama1
TL;DR: In this article, the authors study the effects of using combinations of attributes at decision nodes, leaf nodes, or both nodes and leaves in regression and classification tree learning, and propose a framework combining a univariate decision tree with a linear function by means of constructive induction.
Abstract: In the context of classification problems, algorithms that generate multivariate trees are able to explore multiple representation languages by using decision tests based on a combination of attributes. In the regression setting, model tree algorithms explore multiple representation languages but use linear models at leaf nodes. In this work we study the effects of using combinations of attributes at decision nodes, leaf nodes, or both nodes and leaves in regression and classification tree learning. In order to study the use of functional nodes at different places and for different types of modeling, we introduce a simple unifying framework for multivariate tree learning. This framework combines a univariate decision tree with a linear function by means of constructive induction. Decision trees derived from the framework are able to use decision nodes with multivariate tests, and leaf nodes that make predictions using linear functions. Multivariate decision nodes are built when growing the tree, while functional leaves are built when pruning the tree. We experimentally evaluate a univariate tree, a multivariate tree using linear combinations at inner and leaf nodes, and two simplified versions restricting linear combinations to inner nodes and leaves. The experimental evaluation shows that all functional tree variants exhibit similar performance, with advantages in different datasets. In this study there is a marginal advantage of the full model. These results lead us to study the role of functional leaves and nodes. We use the bias-variance decomposition of the error, cluster analysis, and learning curves as tools for analysis. We observe that in the datasets under study and for classification and regression, the use of multivariate decision nodes has more impact in the bias component of the error, while the use of multivariate decision leaves has more impact in the variance component.

262 citations


Journal ArticleDOI
TL;DR: In this article, a look-ahead algorithm for selective sampling of examples for nearest neighbor classifiers is proposed, where the algorithm is looking for the example with the highest utility, taking its effect on the resulting classifier into account.
Abstract: Most existing inductive learning algorithms work under the assumption that their training examples are already tagged. There are domains, however, where the tagging procedure requires significant computation resources or manual labor. In such cases, it may be beneficial for the learner to be active, intelligently selecting the examples for labeling with the goal of reducing the labeling cost. In this paper we present LSS—a lookahead algorithm for selective sampling of examples for nearest neighbor classifiers. The algorithm is looking for the example with the highest utility, taking its effect on the resulting classifier into account. Computing the expected utility of an example requires estimating the probability of its possible labels. We propose to use the random field model for this estimation. The LSS algorithm was evaluated empirically on seven real and artificial data sets, and its performance was compared to other selective sampling algorithms. The experiments show that the proposed algorithm outperforms other methods in terms of average error rate and stability.

224 citations


Journal ArticleDOI
TL;DR: A general method for constructing a kernel following the syntactic structure of the data, as defined by its type signature in a higher-order logic, and the main theoretical result is the positive definiteness of any kernel thus defined.
Abstract: This paper brings together two strands of machine learning of increasing importance: kernel methods and highly structured data. We propose a general method for constructing a kernel following the syntactic structure of the data, as defined by its type signature in a higher-order logic. Our main theoretical result is the positive definiteness of any kernel thus defined. We report encouraging experimental results on a range of real-world data sets. By converting our kernel to a distance pseudo-metric for 1-nearest neighbour, we were able to improve the best accuracy from the literature on the Diterpene data set by more than 10%.

179 citations


Journal ArticleDOI
TL;DR: A meta-learning methodology that can select settings with low error while providing significant savings in time is proposed and applied to set the width of the Gaussian kernel.
Abstract: The Support Vector Machine algorithm is sensitive to the choice of parameter settings. If these are not set correctly, the algorithm may have a substandard performance. Suggesting a good setting is thus an important problem. We propose a meta-learning methodology for this purpose and exploit information about the past performance of different settings. The methodology is applied to set the width of the Gaussian kernel. We carry out an extensive empirical evaluation, including comparisons with other methods (fixed default ranking; selection based on cross-validation and a heuristic method commonly used to set the width of the SVM kernel). We show that our methodology can select settings with low error while providing significant savings in time. Further work should be carried out to see how the methodology could be adapted to different parameter setting tasks.
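
A hedged sketch of the meta-learning idea follows: each past dataset is described by a few simple meta-features together with the kernel width that performed best on it, and a new dataset is assigned the width favoured by its nearest neighbours in meta-feature space. The meta-features, stored widths, and neighbour count are all illustrative assumptions, not the methodology's exact ingredients.

```python
# Sketch of kNN-style meta-learning for choosing the RBF kernel width.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def meta_features(X, y):
    # simple dataset descriptors: log sample size, attribute count, class count
    return np.array([np.log(X.shape[0]), X.shape[1], len(np.unique(y))])

# past experience: meta-features of previously seen datasets and the gamma
# (RBF width parameter) that achieved the lowest validation error there
past_meta = np.array([[5.0, 10, 2], [7.2, 4, 3], [6.1, 30, 2], [8.0, 8, 5]])
best_gamma = np.array([0.1, 1.0, 0.01, 0.5])

def recommend_gamma(X_new, y_new, n_neighbors=2):
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(past_meta)
    _, idx = nn.kneighbors(meta_features(X_new, y_new).reshape(1, -1))
    return float(np.median(best_gamma[idx[0]]))   # aggregate the neighbours' winners

# usage (with a hypothetical new dataset X_new, y_new):
# gamma = recommend_gamma(X_new, y_new)
```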

Journal ArticleDOI
TL;DR: It is shown how meta-learning can be simply defined as the process of exploiting knowledge about learning that enables us to understand and improve the performance of learning algorithms.
Abstract: Recent advances in meta-learning are providing the foundations to construct meta-learning assistants and task-adaptive learners. The goal of this special issue is to foster an interest in meta-learning by compiling representative work in the field. The contributions to this special issue provide strong insights into the construction of future meta-learning tools. In this introduction we present a common frame of reference to address work in meta-learning through the concept of meta-knowledge. We show how meta-learning can be simply defined as the process of exploiting knowledge about learning that enables us to understand and improve the performance of learning algorithms.

Journal ArticleDOI
TL;DR: A unifying view on both systems in which 1BC works in language space, and 1BC2 works in individual space is presented, and a new, efficient recursive algorithm improving upon the original propositionalisation approach of 1BC is presented.
Abstract: In this paper we present 1BC and 1BC2, two systems that perform naive Bayesian classification of structured individuals. The approach of 1BC is to project the individuals along first-order features. These features are built from the individual using structural predicates referring to related objects (e.g., atoms within molecules), and properties applying to the individual or one or several of its related objects (e.g., a bond between two atoms). We describe an individual in terms of elementary features consisting of zero or more structural predicates and one property; these features are treated as conditionally independent in the spirit of the naive Bayes assumption. 1BC2 represents an alternative first-order upgrade to the naive Bayesian classifier by considering probability distributions over structured objects (e.g., a molecule as a set of atoms), and estimating those distributions from the probabilities of its elements (which are assumed to be independent). We present a unifying view on both systems in which 1BC works in language space, and 1BC2 works in individual space. We also present a new, efficient recursive algorithm improving upon the original propositionalisation approach of 1BC. Both systems have been implemented in the context of the first-order descriptive learner Tertius, and we investigate the differences between the two systems both in computational terms and on artificially generated data. Finally, we describe a range of experiments on ILP benchmark data sets demonstrating the viability of our approach.

Journal ArticleDOI
TL;DR: Novel bounds on the stability of combinations of any classifiers are derived that can be used to formally show that, for example, bagging increases the stability of unstable learning machines.
Abstract: We study the leave-one-out and generalization errors of voting combinations of learning machines. A special case considered is a variant of bagging. We analyze in detail combinations of kernel machines, such as support vector machines, and present theoretical estimates of their leave-one-out error. We also derive novel bounds on the stability of combinations of any classifiers. These bounds can be used to formally show that, for example, bagging increases the stability of unstable learning machines. We report experiments supporting the theoretical findings.

Journal ArticleDOI
TL;DR: The architecture of Blue Martini Software's e-commerce suite has supported data collection, data transformation, and data mining since its inception, and many lessons learned over the last four years and the challenges that still need to be addressed are discussed.
Abstract: The architecture of Blue Martini Software's e-commerce suite has supported data collection, data transformation, and data mining since its inception. With clickstreams being collected at the application-server layer, high-level events being logged, and data automatically transformed into a data warehouse using meta-data, common problems plaguing data mining using weblogs (e.g., sessionization and conflating multi-sourced data) were obviated, thus allowing us to concentrate on actual data mining goals. The paper briefly reviews the architecture and discusses many lessons learned over the last four years and the challenges that still need to be addressed. The lessons and challenges are presented across two dimensions: business-level vs. technical, and throughout the data mining lifecycle stages of data collection, data warehouse construction, business intelligence, and deployment. The lessons and challenges are also widely applicable to data mining domains outside retail e-commerce.

Journal ArticleDOI
TL;DR: Experimental evidence is provided supporting the hypothesis that bagging stabilizes prediction by equalizing the influence of training examples, and support that other resampling strategies such as half-sampling should provide qualitatively identical effects while being computationally less demanding than bootstrap sampling.
Abstract: Bagging constructs an estimator by averaging predictors trained on bootstrap samples. Bagged estimates almost consistently improve on the original predictor. It is thus important to understand the reasons for this success, and also for the occasional failures. It is widely believed that bagging is effective thanks to the variance reduction stemming from averaging predictors. However, seven years from its introduction, bagging is still not fully understood. This paper provides experimental evidence supporting the hypothesis that bagging stabilizes prediction by equalizing the influence of training examples. This effect is detailed in two different frameworks: estimation on the real line and regression. Bagging's improvements/deteriorations are explained by the goodness/badness of highly influential examples, in situations where the usual variance reduction argument is at best questionable. Finally, reasons for the equalization effect are advanced. They support that other resampling strategies such as half-sampling should provide qualitatively identical effects while being computationally less demanding than bootstrap sampling.
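
The closing remark about half-sampling suggests a simple experiment, sketched below under assumed settings: bagging a regression tree with bootstrap resampling versus with random half-samples drawn without replacement. Under the equalization-of-influence explanation the two schemes should behave qualitatively alike; dataset and ensemble sizes are illustrative.

```python
# Compare bootstrap bagging with half-sampling bagging on a regression tree.
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_friedman1

X, y = make_friedman1(n_samples=400, noise=1.0, random_state=0)

bootstrap_bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                                 max_samples=1.0, bootstrap=True, random_state=0)
half_sampling = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                                 max_samples=0.5, bootstrap=False, random_state=0)

for name, model in [("bootstrap", bootstrap_bag), ("half-sampling", half_sampling)]:
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(name, "CV mean squared error:", round(mse, 2))
```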

Journal ArticleDOI
TL;DR: This paper presents a novel approach to association rule mining which deals with multiple levels of description granularity and relies on the hybrid language AL-log which allows a unified treatment of both the relational and structural features of data.
Abstract: Recently there has been growing interest both to extend ILP to description logics and to apply it to knowledge discovery in databases. In this paper we present a novel approach to association rule mining which deals with multiple levels of description granularity. It relies on the hybrid language AL-log which allows a unified treatment of both the relational and structural features of data. A generality order and a downward refinement operator for AL-log pattern spaces are defined on the basis of query subsumption. This framework has been implemented in SPADA, an ILP system for mining multi-level association rules from spatial data. As an illustrative example, we report experimental results obtained by running the new version of SPADA on geo-referenced census data of Manchester Stockport.

Journal ArticleDOI
Marc Boullé1
TL;DR: This method optimizes the chi-square criterion in a global manner on the whole discretization domain and does not require any stopping criterion, in contrast with related methods ChiMerge and ChiSplit.
Abstract: In supervised machine learning, some algorithms are restricted to discrete data and have to discretize continuous attributes. Many discretization methods, based on statistical criteria, information content, or other specialized criteria, have been studied in the past. In this paper, we propose the discretization method Khiops, based on the chi-square statistic. In contrast with related methods ChiMerge and ChiSplit, this method optimizes the chi-square criterion in a global manner on the whole discretization domain and does not require any stopping criterion. A theoretical study followed by experiments demonstrates the robustness and the good predictive performance of the method.
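
A simplified sketch of the global chi-square idea follows (an illustration, not the authors' exact algorithm): starting from elementary equal-frequency intervals, the merge of adjacent intervals that most increases the confidence level of the chi-square test on the whole interval-by-class table is applied greedily, and merging stops when no merge helps. The initial bin count is an assumption.

```python
# Greedy global chi-square discretization sketch in the spirit of Khiops.
import numpy as np
from scipy.stats import chi2_contingency

def global_pvalue(counts):
    """p-value of the chi-square test on the full interval-by-class table."""
    return chi2_contingency(counts, correction=False)[1]

def khiops_like_discretize(x, y, n_init=20):
    order = np.argsort(x)
    x, y = x[order], y[order]
    classes = np.unique(y)
    bins = list(np.array_split(np.arange(len(x)), n_init))   # elementary intervals
    counts = np.array([[np.sum(y[b] == c) for c in classes] for b in bins])
    while len(counts) > 2:
        current = global_pvalue(counts)
        candidates = []
        for i in range(len(counts) - 1):
            merged = np.vstack([counts[:i], counts[i] + counts[i + 1], counts[i + 2:]])
            candidates.append((global_pvalue(merged), i))
        best_p, best_i = min(candidates)
        if best_p >= current:              # no merge raises the confidence level
            break
        counts = np.vstack([counts[:best_i],
                            counts[best_i] + counts[best_i + 1],
                            counts[best_i + 2:]])
        bins[best_i:best_i + 2] = [np.concatenate(bins[best_i:best_i + 2])]
    return [x[b[-1]] for b in bins[:-1]]   # interval upper boundaries

# usage on a toy attribute
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = (x + rng.normal(scale=0.5, size=300) > 0).astype(int)
print("cut points:", np.round(khiops_like_discretize(x, y), 2))
```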

Journal ArticleDOI
TL;DR: The framework of bias-variance decomposition of error is used to analyze what caused the wide range of prediction performance in the CoIL Challenge 2000 data mining competition and finds that variance is the key component of error for this problem.
Abstract: The CoIL Challenge 2000 data mining competition attracted a wide variety of solutions, both in terms of approaches and performance. The goal of the competition was to predict who would be interested in buying a specific insurance product and to explain why people would buy. Unlike in most other competitions, the majority of participants provided a report describing the path to their solution. In this article we use the framework of bias-variance decomposition of error to analyze what caused the wide range of prediction performance. We characterize the challenge problem to make it comparable to other problems and evaluate why certain methods work or not. We also include an evaluation of the submitted explanations by a marketing expert. We find that variance is the key component of error for this problem. Participants use various strategies in data preparation and model development that reduce variance error, such as feature selection and the use of simple, robust and low variance learners like Naive Bayes. Adding constructed features, modeling with complex, weak bias learners and extensive fine tuning by the participants often increase the variance error.
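
For readers unfamiliar with the framework, the sketch below shows one common way such a bias-variance decomposition of 0/1 error can be estimated empirically (in the spirit of Kohavi and Wolpert, not necessarily the estimator used in the article): the learner is retrained on bootstrap replicates, variance is measured by how much predictions disagree across replicates, and bias by the error of the majority prediction.

```python
# Empirical bias/variance estimate for a classifier via bootstrap retraining.
import numpy as np
from sklearn.base import clone

def bias_variance(learner, X_tr, y_tr, X_te, y_te, n_rounds=50, seed=0):
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_rounds):
        idx = rng.integers(0, len(X_tr), len(X_tr))      # bootstrap replicate
        preds.append(clone(learner).fit(X_tr[idx], y_tr[idx]).predict(X_te))
    preds = np.array(preds)
    classes = np.unique(y_tr)
    freq = np.stack([(preds == c).mean(axis=0) for c in classes], axis=1)
    main_pred = classes[freq.argmax(axis=1)]
    bias = np.mean(main_pred != y_te)                    # error of the main prediction
    variance = np.mean(0.5 * (1.0 - (freq ** 2).sum(axis=1)))
    return bias, variance

# usage with any scikit-learn classifier and numpy arrays:
# b, v = bias_variance(DecisionTreeClassifier(), X_train, y_train, X_test, y_test)
```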

Journal ArticleDOI
TL;DR: This work makes use of the frequency of an instance's subsets of features and the frequency-change rate of the subsets among training classes to perform both knowledge discovery and classification.
Abstract: Distance is widely used in most lazy classification systems. Rather than using distance, we make use of the frequency of an instance's subsets of features and the frequency-change rate of the subsets among training classes to perform both knowledge discovery and classification. We name the system DeEPs. Whenever an instance is considered, DeEPs can efficiently discover those patterns contained in the instance which sharply differentiate the training classes from one to another. DeEPs can also predict a class label for the instance by compactly summarizing the frequencies of the discovered patterns based on a view to collectively maximize the discriminating power of the patterns. Many experimental results are used to evaluate the system, showing that the patterns are comprehensible and that DeEPs is accurate and scalable.
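
A toy sketch of the frequency and frequency-change-rate idea follows: for patterns (attribute-value subsets) contained in the query instance, their relative frequencies in the two training classes are compared, and patterns whose frequency changes sharply between classes carry the discriminating power. Enumerating all subsets is exponential, so the sketch caps the pattern length; DeEPs itself relies on far more efficient border-based representations.

```python
# Frequency-change (growth) rates of patterns contained in a query instance.
from itertools import combinations
import numpy as np

def growth_rates(instance, X_pos, X_neg, max_len=2):
    """instance: dict attribute -> value; X_pos/X_neg: lists of such dicts."""
    items = list(instance.items())
    rates = {}
    for k in range(1, max_len + 1):
        for pattern in combinations(items, k):
            def freq(rows):
                return np.mean([all(r.get(a) == v for a, v in pattern) for r in rows])
            f_pos, f_neg = freq(X_pos), freq(X_neg)
            rates[pattern] = (f_pos + 1e-9) / (f_neg + 1e-9)   # frequency-change rate
    return rates    # large values -> the pattern favours the positive class

# usage with hypothetical symbolic records
pos = [{"colour": "red", "size": "big"}, {"colour": "red", "size": "small"}]
neg = [{"colour": "blue", "size": "big"}]
query = {"colour": "red", "size": "big"}
print(growth_rates(query, pos, neg))
```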

Journal ArticleDOI
TL;DR: This paper presents a solution based on the use of “reasonable policies” to provide guidance in Relational reinforcement learning, which makes Q-learning feasible in structural domains by incorporating a relational learner into Q- learning.
Abstract: Reinforcement learning, and Q-learning in particular, encounter two major problems when dealing with large state spaces. First, learning the Q-function in tabular form may be infeasible because of the excessive amount of memory needed to store the table, and because the Q-function only converges after each state has been visited multiple times. Second, rewards in the state space may be so sparse that with random exploration they will only be discovered extremely slowly. The first problem is often solved by learning a generalization of the encountered examples (e.g., using a neural net or decision tree). Relational reinforcement learning (RRL) is such an approach; it makes Q-learning feasible in structural domains by incorporating a relational learner into Q-learning. The problem of sparse rewards has not been addressed for RRL. This paper presents a solution based on the use of “reasonable policies” to provide guidance. Different types of policies and different strategies to supply guidance through these policies are discussed and evaluated experimentally in several relational domains to show the merits of the approach.

Journal ArticleDOI
TL;DR: Three case studies are used to present the lessons learned in solving problems requiring actionable knowledge generation for decision support, and different subgroup discovery approaches are outlined.
Abstract: This paper presents ways to use subgroup discovery to generate actionable knowledge for decision support. Actionable knowledge is explicit symbolic knowledge, typically presented in the form of rules, that allows the decision maker to recognize some important relations and to perform an appropriate action, such as targeting a direct marketing campaign, or planning a population screening campaign aimed at detecting individuals with high disease risk. Different subgroup discovery approaches are outlined, and their advantages over using standard classification rule learning are discussed. Three case studies, a medical and two marketing ones, are used to present the lessons learned in solving problems requiring actionable knowledge generation for decision support.

Journal ArticleDOI
TL;DR: This paper addresses two symmetrical issues on the basis of error measures: the discovery of similarities among classification algorithms, and the discovery of similarities among datasets.
Abstract: In this paper we address two symmetrical issues: the discovery of similarities among classification algorithms, and among datasets. Both are based on error measures, which we use to define the error correlation between two algorithms and to determine the relative performance of a list of algorithms. We use the first to discover similarities between learners, and both of them to discover similarities between datasets. The latter sketch maps of the dataset space. Regions within each map exhibit specific patterns of error correlation or relative performance. To acquire an understanding of the factors determining these regions we describe them using simple characteristics of the datasets. Descriptions of each region are given in terms of the distributions of dataset characteristics within it.
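
One of the basic quantities above, the error correlation between two algorithms, can be illustrated as follows: compute per-example 0/1 errors for two classifiers on the same cross-validated predictions and correlate them. The learners, dataset, and the use of Pearson correlation are illustrative choices, not necessarily the paper's exact definition.

```python
# Error correlation between two classification algorithms on one dataset.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

err_a = (cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=10) != y)
err_b = (cross_val_predict(GaussianNB(), X, y, cv=10) != y)

# correlation of the two algorithms' per-example error indicators
print("error correlation:", np.corrcoef(err_a, err_b)[0, 1])
```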

Journal ArticleDOI
TL;DR: A Reinforcement Learning algorithm based on policy iteration for solving average reward Markov and semi-Markov decision problems and focuses on yield management, which has been hailed as the key factor for generating profits in the airline industry.
Abstract: We present a Reinforcement Learning (RL) algorithm based on policy iteration for solving average reward Markov and semi-Markov decision problems. In the literature on discounted reward RL, algorithms based on policy iteration and actor-critic algorithms have appeared. Our algorithm is an asynchronous, model-free algorithm (which can be used on large-scale problems) that hinges on the idea of computing the value function of a given policy and searching over policy space. In the applied operations research community, RL has been used to derive good solutions to problems previously considered intractable. Hence in this paper, we have tested the proposed algorithm on a commercially significant case study related to a real-world problem from the airline industry. It focuses on yield management, which has been hailed as the key factor for generating profits in the airline industry. In the experiments conducted, we use our algorithm with a nearest-neighbor approach to tackle a large state space. We also present a convergence analysis of the algorithm via an ordinary differential equation method.

Journal ArticleDOI
TL;DR: This paper presents a general procedure for inverse entailment which constructs inductive hypotheses in inductive logic programming and proposes a method called CF-induction which is sound and complete for finding hypotheses from full clausal theories and can be used for inducing not only definite clauses but also non-Horn clauses and integrity constraints.
Abstract: This paper presents a general procedure for inverse entailment which constructs inductive hypotheses in inductive logic programming. Based on inverse entailment, not only unit clauses but also characteristic clauses are deduced from a background theory together with the negation of positive examples. Such clauses can be computed by a resolution method for consequence finding. Unlike previous work on inverse entailment, our proposed method called CF-induction is sound and complete for finding hypotheses from full clausal theories, and can be used for inducing not only definite clauses but also non-Horn clauses and integrity constraints. We also show that CF-induction can be used to compute abductive explanations, and then compare induction and abduction from the viewpoint of inverse entailment and consequence finding.

Journal ArticleDOI
TL;DR: The goal is to study the effectiveness of approaches that utilize all data sources that are available in this problem setting, including relational data, abstracts of research papers, and unlabeled data, and a propositionalization approach which uses relational gene interaction data.
Abstract: We focus on the problem of predicting functional properties of the proteins corresponding to genes in the yeast genome. Our goal is to study the effectiveness of approaches that utilize all data sources that are available in this problem setting, including relational data, abstracts of research papers, and unlabeled data. We investigate a propositionalization approach which uses relational gene interaction data. We study the benefit of text classification and information extraction for utilizing a collection of scientific abstracts. We study transduction and co-training for using unlabeled data. We report both positive and negative results on the investigated approaches. The studied tasks are KDD Cup tasks of 2001 and 2002. The solutions which we describe achieved the highest score for task 2 in 2001, the fourth rank for task 3 in 2001, the highest score for one of the two subtasks and the third place for the overall task 2 in 2002.

Journal ArticleDOI
TL;DR: Simple, randomized algorithms are given that discover a collection of approximate conjunctive cluster descriptions in sublinear time and connections between this conceptual clustering problem and the maximum edge biclique problem are made.
Abstract: We propose a new formulation of the conceptual clustering problem where the goal is to explicitly output a collection of simple and meaningful conjunctions of attributes that define the clusters. The formulation differs from previous approaches since the clusters discovered may overlap and also may not cover all the points. In addition, a point may be assigned to a cluster description even if it only satisfies most, and not necessarily all, of the attributes in the conjunction. Connections between this conceptual clustering problem and the maximum edge biclique problem are made. Simple, randomized algorithms are given that discover a collection of approximate conjunctive cluster descriptions in sublinear time.

Journal ArticleDOI
TL;DR: This is the first k-Median algorithm with fully polynomial running time that is independent of n, the size of the data set, and gives a solution that is, with high probability, an O(1)-approximation, if each cluster in some optimal solution has Ω(nε/k) points.
Abstract: We give a sampling-based algorithm for the k-Median problem, with running time O(k((k^2/ε) log k)^2 log((k/ε) log k)), where k is the desired number of clusters and ε is a confidence parameter. This is the first k-Median algorithm with fully polynomial running time that is independent of n, the size of the data set. It gives a solution that is, with high probability, an O(1)-approximation, if each cluster in some optimal solution has Ω(nε/k) points. We also give weakly polynomial-time algorithms for this problem and a relaxed version of k-Median in which a small fraction of outliers can be excluded. We give near-matching lower bounds showing that this assumption about cluster size is necessary. We also present a related algorithm for finding a clustering that excludes a small number of outliers.
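
The sampling idea can be illustrated as follows: draw a small uniform sample, solve k-Median approximately on the sample only, and reuse the resulting medians for the full data set. The simple local search and the coordinate-wise median used below are convenient stand-ins, not the algorithm analyzed in the paper, and the guarantee requires every optimal cluster to contain a large enough fraction of the points.

```python
# Sampling-based k-Median sketch: cluster a uniform sample, keep its medians.
import numpy as np

def k_median_on_sample(X, k, sample_size=200, seed=0):
    rng = np.random.default_rng(seed)
    S = X[rng.choice(len(X), size=min(sample_size, len(X)), replace=False)]
    medians = S[rng.choice(len(S), size=k, replace=False)]       # initial medians
    for _ in range(20):                                          # simple local search
        d = np.linalg.norm(S[:, None, :] - medians[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            pts = S[assign == j]
            if len(pts):
                # coordinate-wise median as a cheap 1-median surrogate
                medians[j] = np.median(pts, axis=0)
    return medians

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, (2000, 2)) for c in [(0, 0), (4, 4), (0, 5)]])
print(k_median_on_sample(X, k=3).round(2))
```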

Journal ArticleDOI
TL;DR: The performance gains and good scalability of Django and Meta-Django are finally demonstrated on a real-world ILP task (emulating the search for frequent clauses in the mutagenesis domain) though the smaller size of the problems results in smaller gain factors.
Abstract: Relational learning and Inductive Logic Programming (ILP) commonly use as covering test the θ-subsumption test defined by Plotkin. Based on a reformulation of θ-subsumption as a binary constraint satisfaction problem, this paper describes a novel θ-subsumption algorithm named Django, which combines well-known CSP procedures and θ-subsumption-specific data structures. Django is validated using the stochastic complexity framework developed in CSPs, and imported in ILP by Giordana and Saitta. Principled and extensive experiments within this framework show that Django improves on earlier θ-subsumption algorithms by several orders of magnitude, and that different procedures are better at different regions of the stochastic complexity landscape. These experiments allow for building a control layer over Django, termed Meta-Django, which determines the best procedures to use depending on the order parameters of the θ-subsumption problem instance. The performance gains and good scalability of Django and Meta-Django are finally demonstrated on a real-world ILP task (emulating the search for frequent clauses in the mutagenesis domain) though the smaller size of the problems results in smaller gain factors (ranging from 2.5 to 30).

Journal ArticleDOI
TL;DR: HyPAM (Hybrid Poisson Aspect Modelling) is derived, a novel probabilistic graphical model for personalized shopping recommendation that outperforms GroupLens and the IBM method by generating much more accurate predictions of what items a customer will actually purchase in the unseen test data.
Abstract: A good shopping recommender system can boost sales in a retailer store. To provide accurate recommendation, the recommender needs to accurately predict a customer's preference, an ability difficult to acquire. Conventional data mining techniques, such as association rule mining and collaborative filtering, can generally be applied to this problem, but rarely produce satisfying results due to the skewness and sparsity of transaction data. In this paper, we report the lessons that we learned in two real-world data mining applications for personalized shopping recommendation. We learned that extending a collaborative filtering method based on ratings (e.g., GroupLens) to perform personalized shopping recommendation is not trivial and that it is not appropriate to apply association-rule based methods (e.g., the IBM SmartPad system) for large scale prediction of customers' shopping preferences. Instead, a probabilistic graphical model can be more effective in handling skewed and sparse data. By casting collaborative filtering algorithms in a probabilistic framework, we derived HyPAM (Hybrid Poisson Aspect Modelling), a novel probabilistic graphical model for personalized shopping recommendation. Experimental results show that HyPAM outperforms GroupLens and the IBM method by generating much more accurate predictions of what items a customer will actually purchase in the unseen test data. The data sets and the results are made available for download at http://chunnan.iis.sinica.edu.tw/hypam/HyPAM.html.