
Showing papers in "Machine Learning in 2004"


Journal ArticleDOI
TL;DR: The Support Vector Data Description (SVDD) is presented which obtains a spherically shaped boundary around a dataset and analogous to the Support Vector Classifier it can be made flexible by using other kernel functions.
Abstract: Data domain description concerns the characterization of a data set. A good description covers all target data but includes no superfluous space. The boundary of a dataset can be used to detect novel data or outliers. We will present the Support Vector Data Description (SVDD) which is inspired by the Support Vector Classifier. It obtains a spherically shaped boundary around a dataset and analogous to the Support Vector Classifier it can be made flexible by using other kernel functions. The method is made robust against outliers in the training set and is capable of tightening the description by using negative examples. We show characteristics of the Support Vector Data Descriptions using artificial and real data.
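
The spherical description above can be reproduced with standard tooling. Below is a minimal, illustrative sketch using scikit-learn's OneClassSVM, which with an RBF kernel (constant k(x, x)) yields the same kind of kernelized boundary as the SVDD; the data and parameter values are made up for the example rather than taken from the paper.

```python
# Minimal sketch of an SVDD-style one-class data description.
# With an RBF kernel the one-class SVM boundary is of the same kernelized
# spherical form as the SVDD; all values below are illustrative.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))       # target data
X_test = np.vstack([rng.normal(0, 1, (20, 2)),                 # more targets
                    rng.uniform(-6, 6, (20, 2))])              # possible outliers

# nu plays the role of the SVDD trade-off (fraction of rejected targets),
# gamma controls how flexible the kernel boundary is.
svdd = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X_train)

pred = svdd.predict(X_test)            # +1 = inside the description, -1 = outlier
print("rejected as outliers:", np.sum(pred == -1))
```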

2,789 citations


Journal ArticleDOI
TL;DR: An algorithmic framework to classify a partially labeled data set in a principled manner and models the manifold using the adjacency graph for the data and approximates the Laplace-Beltrami operator by the graph Laplacian.
Abstract: We consider the general problem of utilizing both labeled and unlabeled data to improve classification accuracy. Under the assumption that the data lie on a submanifold in a high dimensional space, we develop an algorithmic framework to classify a partially labeled data set in a principled manner. The central idea of our approach is that classification functions are naturally defined only on the submanifold in question rather than the total ambient space. Using the Laplace-Beltrami operator one produces a basis (the Laplacian Eigenmaps) for a Hilbert space of square integrable functions on the submanifold. To recover such a basis, only unlabeled examples are required. Once such a basis is obtained, training can be performed using the labeled data set. Our algorithm models the manifold using the adjacency graph for the data and approximates the Laplace-Beltrami operator by the graph Laplacian. We provide details of the algorithm, its theoretical justification, and several practical applications for image, speech, and text classification.
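
As a rough illustration of the approach described above (not the authors' implementation), the following sketch builds a k-nearest-neighbour adjacency graph over labeled and unlabeled points, takes the smallest eigenvectors of the graph Laplacian as an eigenmap basis, and fits a classifier on the labeled rows only; logistic regression stands in for the least-squares fit used in the paper, and all parameter values are illustrative.

```python
# Sketch: graph-Laplacian basis from all points, classifier trained on the
# few labeled points, evaluated on the unlabeled remainder.
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.neighbors import kneighbors_graph
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=400, noise=0.08, random_state=0)
labeled = np.zeros(len(X), dtype=bool)
labeled[np.random.default_rng(0).choice(len(X), size=20, replace=False)] = True

W = kneighbors_graph(X, n_neighbors=10, mode="connectivity", include_self=False)
W = 0.5 * (W + W.T)                               # symmetrize the adjacency graph
L = laplacian(W, normed=True).toarray()

eigvals, eigvecs = np.linalg.eigh(L)              # eigenmaps = smallest eigenvectors
basis = eigvecs[:, :10]                           # basis functions on the manifold

clf = LogisticRegression().fit(basis[labeled], y[labeled])
print("accuracy on unlabeled points:", clf.score(basis[~labeled], y[~labeled]))
```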

886 citations


Journal ArticleDOI
TL;DR: This work empirically evaluates several state-of-the-art methods for constructing ensembles of heterogeneous classifiers with stacking and shows that they perform (at best) comparably to selecting the best classifier from the ensemble by cross validation and proposes two extensions of this method using an extended set of meta-level features and multi-response model trees to learn at the meta- level.
Abstract: We empirically evaluate several state-of-the-art methods for constructing ensembles of heterogeneous classifiers with stacking and show that they perform (at best) comparably to selecting the best classifier from the ensemble by cross validation. Among state-of-the-art stacking methods, stacking with probability distributions and multi-response linear regression performs best. We propose two extensions of this method, one using an extended set of meta-level features and the other using multi-response model trees to learn at the meta-level. We show that the latter extension performs better than existing stacking approaches and better than selecting the best classifier by cross validation.
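
The best-performing variant discussed above, stacking with class-probability distributions and multi-response linear regression at the meta-level, can be sketched as follows; the base learners, dataset, and fold counts are arbitrary stand-ins, not the paper's experimental setup.

```python
# Sketch of stacking with probability distributions and multi-response
# linear regression (MLR): base-level class probabilities come from
# cross-validation, and one linear model per class is fit on 0/1 indicators.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = [DecisionTreeClassifier(random_state=0), GaussianNB(), KNeighborsClassifier()]

# Meta-level features: concatenated class-probability distributions.
meta_tr = np.hstack([cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")
                     for m in base])
for m in base:
    m.fit(X_tr, y_tr)
meta_te = np.hstack([m.predict_proba(X_te) for m in base])

# Multi-response linear regression: one response per class.
mlr = LinearRegression().fit(meta_tr, np.eye(3)[y_tr])
pred = mlr.predict(meta_te).argmax(axis=1)
print("stacking accuracy:", (pred == y_te).mean())
```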

768 citations


Journal ArticleDOI
TL;DR: This paper describes recent research applying machine learning methods to the problem of classifying the cognitive state of a human subject based on fMRI data observed over a single time interval, and presents case studies in which classifiers are successfully trained to distinguish cognitive states.
Abstract: Over the past decade, functional Magnetic Resonance Imaging (fMRI) has emerged as a powerful new instrument to collect vast quantities of data about activity in the human brain. A typical fMRI experiment can produce a three-dimensional image related to the human subject's brain activity every half second, at a spatial resolution of a few millimeters. As in other modern empirical sciences, this new instrumentation has led to a flood of new data, and a corresponding need for new data analysis methods. We describe recent research applying machine learning methods to the problem of classifying the cognitive state of a human subject based on fMRI data observed over a single time interval. In particular, we present case studies in which we have successfully trained classifiers to distinguish cognitive states such as (1) whether the human subject is looking at a picture or a sentence, (2) whether the subject is reading an ambiguous or non-ambiguous sentence, and (3) whether the word the subject is viewing is a word describing food, people, buildings, etc. This learning problem provides an interesting case study of classifier learning from extremely high dimensional (10^5 features), extremely sparse (tens of training examples), noisy data. This paper summarizes the results obtained in these three case studies, as well as lessons learned about how to successfully apply machine learning methods to train classifiers in such settings.

733 citations


Journal ArticleDOI
TL;DR: Both the SVM and LS-SVM classifier with RBF kernel in combination with standard cross-validation procedures for hyperparameter selection achieve comparable test set performances, consistently very good when compared to a variety of methods described in the literature.
Abstract: In Support Vector Machines (SVMs), the solution of the classification problem is characterized by a (convex) quadratic programming (QP) problem. In a modified version of SVMs, called Least Squares SVM classifiers (LS-SVMs), a least squares cost function is proposed so as to obtain a linear set of equations in the dual space. While the SVM classifier has a large margin interpretation, the LS-SVM formulation is related in this paper to a ridge regression approach for classification with binary targets and to Fisher's linear discriminant analysis in the feature space. Multiclass categorization problems are represented by a set of binary classifiers using different output coding schemes. While regularization is used to control the effective number of parameters of the LS-SVM classifier, the sparseness property of SVMs is lost due to the choice of the 2-norm. Sparseness can be imposed in a second stage by gradually pruning the support value spectrum and optimizing the hyperparameters during the sparse approximation procedure. In this paper, twenty public domain benchmark datasets are used to evaluate the test set performance of LS-SVM classifiers with linear, polynomial and radial basis function (RBF) kernels. Both the SVM and LS-SVM classifier with RBF kernel in combination with standard cross-validation procedures for hyperparameter selection achieve comparable test set performances. These SVM and LS-SVM performances are consistently very good when compared to a variety of methods described in the literature including decision tree based algorithms, statistical algorithms and instance based learning methods. We show on ten UCI datasets that the LS-SVM sparse approximation procedure can be successfully applied.
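
A minimal sketch of the LS-SVM classifier's training step follows: instead of a QP, a single linear system in the bias b and support values alpha is solved, using the standard dual formulation with an RBF kernel. The regularization constant gamma and kernel width sigma below are illustrative values, not tuned as in the paper.

```python
# LS-SVM classifier: solve one linear system in (b, alpha) instead of a QP.
import numpy as np

def rbf(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    n = len(y)
    Omega = (y[:, None] * y[None, :]) * rbf(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(n) / gamma       # ridge-like regularization term
    rhs = np.concatenate([[0.0], np.ones(n)])
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]                      # bias b, support values alpha

def lssvm_predict(X_new, X, y, b, alpha, sigma=1.0):
    return np.sign(rbf(X_new, X, sigma) @ (alpha * y) + b)

# toy usage with labels in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(+1, 1, (30, 2))])
y = np.array([-1.0] * 30 + [1.0] * 30)
b, alpha = lssvm_fit(X, y)
print("training accuracy:", (lssvm_predict(X, X, y, b, alpha) == y).mean())
```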

698 citations


Journal ArticleDOI
TL;DR: This paper evaluates the performance of different criterion functions in the context of partitional clustering algorithms for document datasets, and shows that there is a set of criterion functions that consistently outperform the rest.
Abstract: This paper evaluates the performance of different criterion functions in the context of partitional clustering algorithms for document datasets. Our study involves a total of seven different criterion functions, three of which are introduced in this paper and four that have been proposed in the past. We present a comprehensive experimental evaluation involving 15 different datasets, as well as an analysis of the characteristics of the various criterion functions and their effect on the clusters they produce. Our experimental results show that there is a set of criterion functions that consistently outperform the rest, and that some of the newly proposed criterion functions lead to the best overall results. Our theoretical analysis shows that the relative performance of the criterion functions depends on (i) the degree to which they can correctly operate when the clusters are of different tightness, and (ii) the degree to which they can lead to reasonably balanced clusters.
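
As a concrete example of the kind of criterion function compared above, the sketch below computes one widely used cosine-based criterion (the I2 criterion in this line of work): the score of a clustering is the sum, over clusters, of the norm of the composite vector of unit-normalized documents. The random data is only for illustration.

```python
# Illustrative cosine-based partitional clustering criterion (I2-style).
import numpy as np

def i2_criterion(doc_vectors, labels):
    """doc_vectors: (n_docs, n_terms) matrix; labels: cluster id per document."""
    X = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    return sum(np.linalg.norm(X[labels == c].sum(axis=0))
               for c in np.unique(labels))

# usage: higher is better, so compare candidate clusterings of the same documents
rng = np.random.default_rng(0)
docs = rng.random((100, 50))
print(i2_criterion(docs, rng.integers(0, 5, 100)))
```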

660 citations


Journal ArticleDOI
TL;DR: This paper considers the problem of partitioning a set of m points in the n-dimensional Euclidean space into k clusters, and considers a continuous relaxation of this discrete problem: find the k-dimensional subspace V that minimizes the sum of squared distances to V of the m points, and argues that the relaxation provides a generalized clustering which is useful in its own right.
Abstract: We consider the problem of partitioning a set of m points in the n-dimensional Euclidean space into k clusters (usually m and n are variable, while k is fixed), so as to minimize the sum of squared distances between each point and its cluster center. This formulation is usually the objective of the k-means clustering algorithm (Kanungo et al. (2000)). We prove that this problem is NP-hard even for k = 2, and we consider a continuous relaxation of this discrete problem: find the k-dimensional subspace V that minimizes the sum of squared distances to V of the m points. This relaxation can be solved by computing the Singular Value Decomposition (SVD) of the m × n matrix A that represents the m points; this solution can be used to get a 2-approximation algorithm for the original problem. We then argue that in fact the relaxation provides a generalized clustering which is useful in its own right. Finally, we show that the SVD of a random submatrix—chosen according to a suitable probability distribution—of a given matrix provides an approximation to the SVD of the whole matrix, thus yielding a very fast randomized algorithm. We expect this algorithm to be the main contribution of this paper, since it can be applied to problems of very large size which typically arise in modern applications.
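
The continuous relaxation described above can be sketched in a few lines: compute the SVD, project the points onto the span of the top-k right singular vectors, and cluster in that subspace. The 2-approximation argument concerns the optimal discrete clustering in the projected space; plain k-means below is just a practical stand-in, and the random matrix is illustrative.

```python
# SVD relaxation of the k-means objective: project onto the best rank-k
# subspace, then cluster the projected points.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 40))                 # m points in n-dimensional space
k = 3

# Rank-k SVD: the top right singular vectors span the subspace minimizing
# the total squared distance of the points to it.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_proj = A @ Vt[:k].T @ Vt[:k]                 # projection onto span(V_k)

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(A_proj)
print(np.bincount(labels))
```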

523 citations


Journal ArticleDOI
João Gama1
TL;DR: In this article, the authors study the effects of using combinations of attributes at decision nodes, leaf nodes, or both nodes and leaves in regression and classification tree learning, and propose a framework combining a univariate decision tree with a linear function by means of constructive induction.
Abstract: In the context of classification problems, algorithms that generate multivariate trees are able to explore multiple representation languages by using decision tests based on a combination of attributes. In the regression setting, model tree algorithms explore multiple representation languages but use linear models at leaf nodes. In this work we study the effects of using combinations of attributes at decision nodes, leaf nodes, or both nodes and leaves in regression and classification tree learning. In order to study the use of functional nodes at different places and for different types of modeling, we introduce a simple unifying framework for multivariate tree learning. This framework combines a univariate decision tree with a linear function by means of constructive induction. Decision trees derived from the framework are able to use decision nodes with multivariate tests, and leaf nodes that make predictions using linear functions. Multivariate decision nodes are built when growing the tree, while functional leaves are built when pruning the tree. We experimentally evaluate a univariate tree, a multivariate tree using linear combinations at inner and leaf nodes, and two simplified versions restricting linear combinations to inner nodes and leaves. The experimental evaluation shows that all functional tree variants exhibit similar performance, with advantages in different datasets. In this study there is a marginal advantage of the full model. These results lead us to study the role of functional leaves and nodes. We use the bias-variance decomposition of the error, cluster analysis, and learning curves as tools for analysis. We observe that in the datasets under study and for classification and regression, the use of multivariate decision nodes has more impact in the bias component of the error, while the use of multivariate decision leaves has more impact in the variance component.

262 citations


Journal ArticleDOI
TL;DR: In this article, a look-ahead algorithm for selective sampling of examples for nearest neighbor classifiers is proposed, where the algorithm is looking for the example with the highest utility, taking its effect on the resulting classifier into account.
Abstract: Most existing inductive learning algorithms work under the assumption that their training examples are already tagged. There are domains, however, where the tagging procedure requires significant computation resources or manual labor. In such cases, it may be beneficial for the learner to be active, intelligently selecting the examples for labeling with the goal of reducing the labeling cost. In this paper we present LSS—a lookahead algorithm for selective sampling of examples for nearest neighbor classifiers. The algorithm is looking for the example with the highest utility, taking its effect on the resulting classifier into account. Computing the expected utility of an example requires estimating the probability of its possible labels. We propose to use the random field model for this estimation. The LSS algorithm was evaluated empirically on seven real and artificial data sets, and its performance was compared to other selective sampling algorithms. The experiments show that the proposed algorithm outperforms other methods in terms of average error rate and stability.

224 citations


Journal ArticleDOI
TL;DR: A general method for constructing a kernel following the syntactic structure of the data, as defined by its type signature in a higher-order logic, and the main theoretical result is the positive definiteness of any kernel thus defined.
Abstract: This paper brings together two strands of machine learning of increasing importance: kernel methods and highly structured data. We propose a general method for constructing a kernel following the syntactic structure of the data, as defined by its type signature in a higher-order logic. Our main theoretical result is the positive definiteness of any kernel thus defined. We report encouraging experimental results on a range of real-world data sets. By converting our kernel to a distance pseudo-metric for 1-nearest neighbour, we were able to improve the best accuracy from the literature on the Diterpene data set by more than 10%.

179 citations


Journal ArticleDOI
TL;DR: A meta-learning methodology that can select settings with low error while providing significant savings in time is proposed and applied to set the width of the Gaussian kernel.
Abstract: The Support Vector Machine algorithm is sensitive to the choice of parameter settings. If these are not set correctly, the algorithm may have a substandard performance. Suggesting a good setting is thus an important problem. We propose a meta-learning methodology for this purpose and exploit information about the past performance of different settings. The methodology is applied to set the width of the Gaussian kernel. We carry out an extensive empirical evaluation, including comparisons with other methods (fixed default ranking; selection based on cross-validation and a heuristic method commonly used to set the width of the SVM kernel). We show that our methodology can select settings with low error while providing significant savings in time. Further work should be carried out to see how the methodology could be adapted to different parameter setting tasks.
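
A hedged sketch of the meta-learning idea follows: each past dataset is described by a few simple meta-features together with the kernel width that performed best on it, and a new dataset is assigned the width favoured by its nearest neighbours in meta-feature space. The meta-features, stored widths, and neighbour count are all illustrative assumptions, not the methodology's exact ingredients.

```python
# Sketch of kNN-style meta-learning for choosing the RBF kernel width.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def meta_features(X, y):
    # simple dataset descriptors: log sample size, attribute count, class count
    return np.array([np.log(X.shape[0]), X.shape[1], len(np.unique(y))])

# past experience: meta-features of previously seen datasets and the gamma
# (RBF width parameter) that achieved the lowest validation error there
past_meta = np.array([[5.0, 10, 2], [7.2, 4, 3], [6.1, 30, 2], [8.0, 8, 5]])
best_gamma = np.array([0.1, 1.0, 0.01, 0.5])

def recommend_gamma(X_new, y_new, n_neighbors=2):
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(past_meta)
    _, idx = nn.kneighbors(meta_features(X_new, y_new).reshape(1, -1))
    return float(np.median(best_gamma[idx[0]]))   # aggregate the neighbours' winners

# usage (with a hypothetical new dataset X_new, y_new):
# gamma = recommend_gamma(X_new, y_new)
```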

Journal ArticleDOI
TL;DR: It is shown how meta-learning can be simply defined as the process of exploiting knowledge about learning that enables us to understand and improve the performance of learning algorithms.
Abstract: Recent advances in meta-learning are providing the foundations to construct meta-learning assistants and task-adaptive learners. The goal of this special issue is to foster an interest in meta-learning by compiling representative work in the field. The contributions to this special issue provide strong insights into the construction of future meta-learning tools. In this introduction we present a common frame of reference to address work in meta-learning through the concept of meta-knowledge. We show how meta-learning can be simply defined as the process of exploiting knowledge about learning that enables us to understand and improve the performance of learning algorithms.

Journal ArticleDOI
TL;DR: A unifying view on both systems in which 1BC works in language space, and 1BC2 works in individual space is presented, and a new, efficient recursive algorithm improving upon the original propositionalisation approach of 1BC is presented.
Abstract: In this paper we present 1BC and 1BC2, two systems that perform naive Bayesian classification of structured individuals. The approach of 1BC is to project the individuals along first-order features. These features are built from the individual using structural predicates referring to related objects (e.g., atoms within molecules), and properties applying to the individual or one or several of its related objects (e.g., a bond between two atoms). We describe an individual in terms of elementary features consisting of zero or more structural predicates and one property; these features are treated as conditionally independent in the spirit of the naive Bayes assumption. 1BC2 represents an alternative first-order upgrade to the naive Bayesian classifier by considering probability distributions over structured objects (e.g., a molecule as a set of atoms), and estimating those distributions from the probabilities of its elements (which are assumed to be independent). We present a unifying view on both systems in which 1BC works in language space, and 1BC2 works in individual space. We also present a new, efficient recursive algorithm improving upon the original propositionalisation approach of 1BC. Both systems have been implemented in the context of the first-order descriptive learner Tertius, and we investigate the differences between the two systems both in computational terms and on artificially generated data. Finally, we describe a range of experiments on ILP benchmark data sets demonstrating the viability of our approach.

Journal ArticleDOI
TL;DR: Novel bounds on the stability of combinations of any classifiers are derived that can be used to formally show that, for example, bagging increases the stability of unstable learning machines.
Abstract: We study the leave-one-out and generalization errors of voting combinations of learning machines. A special case considered is a variant of bagging. We analyze in detail combinations of kernel machines, such as support vector machines, and present theoretical estimates of their leave-one-out error. We also derive novel bounds on the stability of combinations of any classifiers. These bounds can be used to formally show that, for example, bagging increases the stability of unstable learning machines. We report experiments supporting the theoretical findings.

Journal ArticleDOI
TL;DR: The architecture of Blue Martini Software's e-commerce suite has supported data collection, data transformation, and data mining since its inception, and many lessons learned over the last four years and the challenges that still need to be addressed are discussed.
Abstract: The architecture of Blue Martini Software's e-commerce suite has supported data collection, data transformation, and data mining since its inception. With clickstreams being collected at the application-server layer, high-level events being logged, and data automatically transformed into a data warehouse using meta-data, common problems plaguing data mining using weblogs (e.g., sessionization and conflating multi-sourced data) were obviated, thus allowing us to concentrate on actual data mining goals. The paper briefly reviews the architecture and discusses many lessons learned over the last four years and the challenges that still need to be addressed. The lessons and challenges are presented across two dimensions: business-level vs. technical, and throughout the data mining lifecycle stages of data collection, data warehouse construction, business intelligence, and deployment. The lessons and challenges are also widely applicable to data mining domains outside retail e-commerce.

Journal ArticleDOI
TL;DR: Experimental evidence is provided supporting the hypothesis that bagging stabilizes prediction by equalizing the influence of training examples, and support that other resampling strategies such as half-sampling should provide qualitatively identical effects while being computationally less demanding than bootstrap sampling.
Abstract: Bagging constructs an estimator by averaging predictors trained on bootstrap samples. Bagged estimates almost consistently improve on the original predictor. It is thus important to understand the reasons for this success, and also for the occasional failures. It is widely believed that bagging is effective thanks to the variance reduction stemming from averaging predictors. However, seven years from its introduction, bagging is still not fully understood. This paper provides experimental evidence supporting the hypothesis that bagging stabilizes prediction by equalizing the influence of training examples. This effect is detailed in two different frameworks: estimation on the real line and regression. Bagging's improvements/deteriorations are explained by the goodness/badness of highly influential examples, in situations where the usual variance reduction argument is at best questionable. Finally, reasons for the equalization effect are advanced. They support that other resampling strategies such as half-sampling should provide qualitatively identical effects while being computationally less demanding than bootstrap sampling.
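
The closing remark about half-sampling suggests a simple experiment, sketched below under assumed settings: bagging a regression tree with bootstrap resampling versus with random half-samples drawn without replacement. Under the equalization-of-influence explanation the two schemes should behave qualitatively alike; dataset and ensemble sizes are illustrative.

```python
# Compare bootstrap bagging with half-sampling bagging on a regression tree.
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_friedman1

X, y = make_friedman1(n_samples=400, noise=1.0, random_state=0)

bootstrap_bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                                 max_samples=1.0, bootstrap=True, random_state=0)
half_sampling = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                                 max_samples=0.5, bootstrap=False, random_state=0)

for name, model in [("bootstrap", bootstrap_bag), ("half-sampling", half_sampling)]:
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(name, "CV mean squared error:", round(mse, 2))
```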

Journal ArticleDOI
TL;DR: This paper presents a novel approach to association rule mining which deals with multiple levels of description granularity and relies on the hybrid language AL-log which allows a unified treatment of both the relational and structural features of data.
Abstract: Recently there has been growing interest both to extend ILP to description logics and to apply it to knowledge discovery in databases. In this paper we present a novel approach to association rule mining which deals with multiple levels of description granularity. It relies on the hybrid language AL-log which allows a unified treatment of both the relational and structural features of data. A generality order and a downward refinement operator for AL-log pattern spaces are defined on the basis of query subsumption. This framework has been implemented in SPADA, an ILP system for mining multi-level association rules from spatial data. As an illustrative example, we report experimental results obtained by running the new version of SPADA on geo-referenced census data of Manchester Stockport.

Journal ArticleDOI
Marc Boullé1
TL;DR: This method optimizes the chi-square criterion in a global manner on the whole discretization domain and does not require any stopping criterion, in contrast with related methods ChiMerge and ChiSplit.
Abstract: In supervised machine learning, some algorithms are restricted to discrete data and have to discretize continuous attributes. Many discretization methods, based on statistical criteria, information content, or other specialized criteria, have been studied in the past. In this paper, we propose the discretization method Khiops, based on the chi-square statistic. In contrast with related methods ChiMerge and ChiSplit, this method optimizes the chi-square criterion in a global manner on the whole discretization domain and does not require any stopping criterion. A theoretical study followed by experiments demonstrates the robustness and the good predictive performance of the method.
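
A simplified sketch of the global chi-square idea follows (an illustration, not the authors' exact algorithm): starting from elementary equal-frequency intervals, the merge of adjacent intervals that most increases the confidence level of the chi-square test on the whole interval-by-class table is applied greedily, and merging stops when no merge helps. The initial bin count is an assumption.

```python
# Greedy global chi-square discretization sketch in the spirit of Khiops.
import numpy as np
from scipy.stats import chi2_contingency

def global_pvalue(counts):
    """p-value of the chi-square test on the full interval-by-class table."""
    return chi2_contingency(counts, correction=False)[1]

def khiops_like_discretize(x, y, n_init=20):
    order = np.argsort(x)
    x, y = x[order], y[order]
    classes = np.unique(y)
    bins = list(np.array_split(np.arange(len(x)), n_init))   # elementary intervals
    counts = np.array([[np.sum(y[b] == c) for c in classes] for b in bins])
    while len(counts) > 2:
        current = global_pvalue(counts)
        candidates = []
        for i in range(len(counts) - 1):
            merged = np.vstack([counts[:i], counts[i] + counts[i + 1], counts[i + 2:]])
            candidates.append((global_pvalue(merged), i))
        best_p, best_i = min(candidates)
        if best_p >= current:              # no merge raises the confidence level
            break
        counts = np.vstack([counts[:best_i],
                            counts[best_i] + counts[best_i + 1],
                            counts[best_i + 2:]])
        bins[best_i:best_i + 2] = [np.concatenate(bins[best_i:best_i + 2])]
    return [x[b[-1]] for b in bins[:-1]]   # interval upper boundaries

# usage on a toy attribute
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = (x + rng.normal(scale=0.5, size=300) > 0).astype(int)
print("cut points:", np.round(khiops_like_discretize(x, y), 2))
```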

Journal ArticleDOI
TL;DR: The framework of bias-variance decomposition of error is used to analyze what caused the wide range of prediction performance in the CoIL Challenge 2000 data mining competition and finds that variance is the key component of error for this problem.
Abstract: The CoIL Challenge 2000 data mining competition attracted a wide variety of solutions, both in terms of approaches and performance. The goal of the competition was to predict who would be interested in buying a specific insurance product and to explain why people would buy. Unlike in most other competitions, the majority of participants provided a report describing the path to their solution. In this article we use the framework of bias-variance decomposition of error to analyze what caused the wide range of prediction performance. We characterize the challenge problem to make it comparable to other problems and evaluate why certain methods work or not. We also include an evaluation of the submitted explanations by a marketing expert. We find that variance is the key component of error for this problem. Participants use various strategies in data preparation and model development that reduce variance error, such as feature selection and the use of simple, robust and low variance learners like Naive Bayes. Adding constructed features, modeling with complex, weak bias learners and extensive fine tuning by the participants often increase the variance error.
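
For readers unfamiliar with the framework, the sketch below shows one common way such a bias-variance decomposition of 0/1 error can be estimated empirically (in the spirit of Kohavi and Wolpert, not necessarily the estimator used in the article): the learner is retrained on bootstrap replicates, variance is measured by how much predictions disagree across replicates, and bias by the error of the majority prediction.

```python
# Empirical bias/variance estimate for a classifier via bootstrap retraining.
import numpy as np
from sklearn.base import clone

def bias_variance(learner, X_tr, y_tr, X_te, y_te, n_rounds=50, seed=0):
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_rounds):
        idx = rng.integers(0, len(X_tr), len(X_tr))      # bootstrap replicate
        preds.append(clone(learner).fit(X_tr[idx], y_tr[idx]).predict(X_te))
    preds = np.array(preds)
    classes = np.unique(y_tr)
    freq = np.stack([(preds == c).mean(axis=0) for c in classes], axis=1)
    main_pred = classes[freq.argmax(axis=1)]
    bias = np.mean(main_pred != y_te)                    # error of the main prediction
    variance = np.mean(0.5 * (1.0 - (freq ** 2).sum(axis=1)))
    return bias, variance

# usage with any scikit-learn classifier and numpy arrays:
# b, v = bias_variance(DecisionTreeClassifier(), X_train, y_train, X_test, y_test)
```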

Journal ArticleDOI
TL;DR: This work makes use of the frequency of an instance's subsets of features and the frequency-change rate of the subsets among training classes to perform both knowledge discovery and classification.
Abstract: Distance is widely used in most lazy classification systems. Rather than using distance, we make use of the frequency of an instance's subsets of features and the frequency-change rate of the subsets among training classes to perform both knowledge discovery and classification. We name the system DeEPs. Whenever an instance is considered, DeEPs can efficiently discover those patterns contained in the instance which sharply differentiate the training classes from one to another. DeEPs can also predict a class label for the instance by compactly summarizing the frequencies of the discovered patterns based on a view to collectively maximize the discriminating power of the patterns. Many experimental results are used to evaluate the system, showing that the patterns are comprehensible and that DeEPs is accurate and scalable.
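
A toy sketch of the frequency and frequency-change-rate idea follows: for patterns (attribute-value subsets) contained in the query instance, their relative frequencies in the two training classes are compared, and patterns whose frequency changes sharply between classes carry the discriminating power. Enumerating all subsets is exponential, so the sketch caps the pattern length; DeEPs itself relies on far more efficient border-based representations.

```python
# Frequency-change (growth) rates of patterns contained in a query instance.
from itertools import combinations
import numpy as np

def growth_rates(instance, X_pos, X_neg, max_len=2):
    """instance: dict attribute -> value; X_pos/X_neg: lists of such dicts."""
    items = list(instance.items())
    rates = {}
    for k in range(1, max_len + 1):
        for pattern in combinations(items, k):
            def freq(rows):
                return np.mean([all(r.get(a) == v for a, v in pattern) for r in rows])
            f_pos, f_neg = freq(X_pos), freq(X_neg)
            rates[pattern] = (f_pos + 1e-9) / (f_neg + 1e-9)   # frequency-change rate
    return rates    # large values -> the pattern favours the positive class

# usage with hypothetical symbolic records
pos = [{"colour": "red", "size": "big"}, {"colour": "red", "size": "small"}]
neg = [{"colour": "blue", "size": "big"}]
query = {"colour": "red", "size": "big"}
print(growth_rates(query, pos, neg))
```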

Journal ArticleDOI
TL;DR: This paper presents a solution based on the use of “reasonable policies” to provide guidance in Relational reinforcement learning, which makes Q-learning feasible in structural domains by incorporating a relational learner into Q- learning.
Abstract: Reinforcement learning, and Q-learning in particular, encounter two major problems when dealing with large state spaces. First, learning the Q-function in tabular form may be infeasible because of the excessive amount of memory needed to store the table, and because the Q-function only converges after each state has been visited multiple times. Second, rewards in the state space may be so sparse that with random exploration they will only be discovered extremely slowly. The first problem is often solved by learning a generalization of the encountered examples (e.g., using a neural net or decision tree). Relational reinforcement learning (RRL) is such an approach; it makes Q-learning feasible in structural domains by incorporating a relational learner into Q-learning. The problem of sparse rewards has not been addressed for RRL. This paper presents a solution based on the use of “reasonable policies” to provide guidance. Different types of policies and different strategies to supply guidance through these policies are discussed and evaluated experimentally in several relational domains to show the merits of the approach.

Journal ArticleDOI
TL;DR: Three case studies are used to present the lessons learned in solving problems requiring actionable knowledge generation for decision support, and different subgroup discovery approaches are outlined.
Abstract: This paper presents ways to use subgroup discovery to generate actionable knowledge for decision support. Actionable knowledge is explicit symbolic knowledge, typically presented in the form of rules, that allows the decision maker to recognize some important relations and to perform an appropriate action, such as targeting a direct marketing campaign, or planning a population screening campaign aimed at detecting individuals with high disease risk. Different subgroup discovery approaches are outlined, and their advantages over using standard classification rule learning are discussed. Three case studies, a medical and two marketing ones, are used to present the lessons learned in solving problems requiring actionable knowledge generation for decision support.

Journal ArticleDOI
TL;DR: This paper addresses two symmetrical issues on the basis of error measures: the discovery of similarities among classification algorithms, and the discovery of similarities among datasets.
Abstract: In this paper we address two symmetrical issues: the discovery of similarities among classification algorithms, and among datasets. Both are based on error measures, which we use to define the error correlation between two algorithms and to determine the relative performance of a list of algorithms. We use the first to discover similarities between learners, and both of them to discover similarities between datasets. The latter sketch maps of the dataset space. Regions within each map exhibit specific patterns of error correlation or relative performance. To acquire an understanding of the factors determining these regions we describe them using simple characteristics of the datasets. Descriptions of each region are given in terms of the distributions of dataset characteristics within it.
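
One of the basic quantities above, the error correlation between two algorithms, can be illustrated as follows: compute per-example 0/1 errors for two classifiers on the same cross-validated predictions and correlate them. The learners, dataset, and the use of Pearson correlation are illustrative choices, not necessarily the paper's exact definition.

```python
# Error correlation between two classification algorithms on one dataset.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

err_a = (cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=10) != y)
err_b = (cross_val_predict(GaussianNB(), X, y, cv=10) != y)

# correlation of the two algorithms' per-example error indicators
print("error correlation:", np.corrcoef(err_a, err_b)[0, 1])
```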

Journal ArticleDOI
TL;DR: A Reinforcement Learning algorithm based on policy iteration for solving average reward Markov and semi-Markov decision problems and focuses on yield management, which has been hailed as the key factor for generating profits in the airline industry.
Abstract: We present a Reinforcement Learning (RL) algorithm based on policy iteration for solving average reward Markov and semi-Markov decision problems. In the literature on discounted reward RL, algorithms based on policy iteration and actor-critic algorithms have appeared. Our algorithm is an asynchronous, model-free algorithm (which can be used on large-scale problems) that hinges on the idea of computing the value function of a given policy and searching over policy space. In the applied operations research community, RL has been used to derive good solutions to problems previously considered intractable. Hence in this paper, we have tested the proposed algorithm on a commercially significant case study related to a real-world problem from the airline industry. It focuses on yield management, which has been hailed as the key factor for generating profits in the airline industry. In the experiments conducted, we use our algorithm with a nearest-neighbor approach to tackle a large state space. We also present a convergence analysis of the algorithm via an ordinary differential equation method.

Journal ArticleDOI
TL;DR: This paper presents a general procedure for inverse entailment which constructs inductive hypotheses in inductive logic programming and proposes a method called CF-induction which is sound and complete for finding hypotheses from full clausal theories and can be used for inducing not only definite clauses but also non-Horn clauses and integrity constraints.
Abstract: This paper presents a general procedure for inverse entailment which constructs inductive hypotheses in inductive logic programming. Based on inverse entailment, not only unit clauses but also characteristic clauses are deduced from a background theory together with the negation of positive examples. Such clauses can be computed by a resolution method for consequence finding. Unlike previous work on inverse entailment, our proposed method called CF-induction is sound and complete for finding hypotheses from full clausal theories, and can be used for inducing not only definite clauses but also non-Horn clauses and integrity constraints. We also show that CF-induction can be used to compute abductive explanations, and then compare induction and abduction from the viewpoint of inverse entailment and consequence finding.

Journal ArticleDOI
TL;DR: The goal is to study the effectiveness of approaches that utilize all data sources that are available in this problem setting, including relational data, abstracts of research papers, and unlabeled data, and a propositionalization approach which uses relational gene interaction data.
Abstract: We focus on the problem of predicting functional properties of the proteins corresponding to genes in the yeast genome. Our goal is to study the effectiveness of approaches that utilize all data sources that are available in this problem setting, including relational data, abstracts of research papers, and unlabeled data. We investigate a propositionalization approach which uses relational gene interaction data. We study the benefit of text classification and information extraction for utilizing a collection of scientific abstracts. We study transduction and co-training for using unlabeled data. We report both positive and negative results on the investigated approaches. The studied tasks are KDD Cup tasks of 2001 and 2002. The solutions which we describe achieved the highest score for task 2 in 2001, the fourth rank for task 3 in 2001, the highest score for one of the two subtasks and the third place for the overall task 2 in 2002.

Journal ArticleDOI
TL;DR: Simple, randomized algorithms are given that discover a collection of approximate conjunctive cluster descriptions in sublinear time and connections between this conceptual clustering problem and the maximum edge biclique problem are made.
Abstract: We propose a new formulation of the conceptual clustering problem where the goal is to explicitly output a collection of simple and meaningful conjunctions of attributes that define the clusters. The formulation differs from previous approaches since the clusters discovered may overlap and also may not cover all the points. In addition, a point may be assigned to a cluster description even if it only satisfies most, and not necessarily all, of the attributes in the conjunction. Connections between this conceptual clustering problem and the maximum edge biclique problem are made. Simple, randomized algorithms are given that discover a collection of approximate conjunctive cluster descriptions in sublinear time.

Journal ArticleDOI
TL;DR: This is the first k-Median algorithm with fully polynomial running time that is independent of n, the size of the data set, and gives a solution that is, with high probability, an O(1)-approximation, if each cluster in some optimal solution has Ω(nε/k) points.
Abstract: We give a sampling-based algorithm for the k-Median problem, with running time O(k((k^2/ε) log k)^2 log((k/ε) log k)), where k is the desired number of clusters and ε is a confidence parameter. This is the first k-Median algorithm with fully polynomial running time that is independent of n, the size of the data set. It gives a solution that is, with high probability, an O(1)-approximation, if each cluster in some optimal solution has Ω(nε/k) points. We also give weakly polynomial-time algorithms for this problem and a relaxed version of k-Median in which a small fraction of outliers can be excluded. We give near-matching lower bounds showing that this assumption about cluster size is necessary. We also present a related algorithm for finding a clustering that excludes a small number of outliers.
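
The sampling idea can be illustrated as follows: draw a small uniform sample, solve k-Median approximately on the sample only, and reuse the resulting medians for the full data set. The simple local search and the coordinate-wise median used below are convenient stand-ins, not the algorithm analyzed in the paper, and the guarantee requires every optimal cluster to contain a large enough fraction of the points.

```python
# Sampling-based k-Median sketch: cluster a uniform sample, keep its medians.
import numpy as np

def k_median_on_sample(X, k, sample_size=200, seed=0):
    rng = np.random.default_rng(seed)
    S = X[rng.choice(len(X), size=min(sample_size, len(X)), replace=False)]
    medians = S[rng.choice(len(S), size=k, replace=False)]       # initial medians
    for _ in range(20):                                          # simple local search
        d = np.linalg.norm(S[:, None, :] - medians[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            pts = S[assign == j]
            if len(pts):
                # coordinate-wise median as a cheap 1-median surrogate
                medians[j] = np.median(pts, axis=0)
    return medians

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, (2000, 2)) for c in [(0, 0), (4, 4), (0, 5)]])
print(k_median_on_sample(X, k=3).round(2))
```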

Journal ArticleDOI
TL;DR: The performance gains and good scalability of Django and Meta-Django are finally demonstrated on a real-world ILP task (emulating the search for frequent clauses in the mutagenesis domain) though the smaller size of the problems results in smaller gain factors.
Abstract: Relational learning and Inductive Logic Programming (ILP) commonly use as covering test the θ-subsumption test defined by Plotkin. Based on a reformulation of θ-subsumption as a binary constraint satisfaction problem, this paper describes a novel θ-subsumption algorithm named Django, which combines well-known CSP procedures and θ-subsumption-specific data structures. Django is validated using the stochastic complexity framework developed in CSPs, and imported in ILP by Giordana and Saitta. Principled and extensive experiments within this framework show that Django improves on earlier θ-subsumption algorithms by several orders of magnitude, and that different procedures are better at different regions of the stochastic complexity landscape. These experiments allow for building a control layer over Django, termed Meta-Django, which determines the best procedures to use depending on the order parameters of the θ-subsumption problem instance. The performance gains and good scalability of Django and Meta-Django are finally demonstrated on a real-world ILP task (emulating the search for frequent clauses in the mutagenesis domain) though the smaller size of the problems results in smaller gain factors (ranging from 2.5 to 30).

Journal ArticleDOI
TL;DR: HyPAM (Hybrid Poisson Aspect Modelling) is derived, a novel probabilistic graphical model for personalized shopping recommendation that outperforms GroupLens and the IBM method by generating much more accurate predictions of what items a customer will actually purchase in the unseen test data.
Abstract: A good shopping recommender system can boost sales in a retailer store. To provide accurate recommendation, the recommender needs to accurately predict a customer's preference, an ability difficult to acquire. Conventional data mining techniques, such as association rule mining and collaborative filtering, can generally be applied to this problem, but rarely produce satisfying results due to the skewness and sparsity of transaction data. In this paper, we report the lessons that we learned in two real-world data mining applications for personalized shopping recommendation. We learned that extending a collaborative filtering method based on ratings (e.g., GroupLens) to perform personalized shopping recommendation is not trivial and that it is not appropriate to apply association-rule based methods (e.g., the IBM SmartPad system) for large scale prediction of customers' shopping preferences. Instead, a probabilistic graphical model can be more effective in handling skewed and sparse data. By casting collaborative filtering algorithms in a probabilistic framework, we derived HyPAM (Hybrid Poisson Aspect Modelling), a novel probabilistic graphical model for personalized shopping recommendation. Experimental results show that HyPAM outperforms GroupLens and the IBM method by generating much more accurate predictions of what items a customer will actually purchase in the unseen test data. The data sets and the results are made available for download at http://chunnan.iis.sinica.edu.tw/hypam/HyPAM.html.