
Showing papers on "Unsupervised learning" published in 1997


Journal ArticleDOI
TL;DR: Within its category, The Nature of Statistical Learning Theory remains one of the most sought-after books.
Abstract: Reading is one of many ways to become smarter: people who read widely gain knowledge and experience across economics, politics, science, fiction, literature, religion, and other fields. Within its category, The Nature of Statistical Learning Theory remains one of the most sought-after books, and many readers actively search for it.

2,716 citations


Journal ArticleDOI
TL;DR: An unsupervised technique for visual learning is presented, which is based on density estimation in high-dimensional spaces using an eigenspace decomposition and is applied to the probabilistic visual modeling, detection, recognition, and coding of human faces and nonrigid objects.
Abstract: We present an unsupervised technique for visual learning, which is based on density estimation in high-dimensional spaces using an eigenspace decomposition. Two types of density estimates are derived for modeling the training data: a multivariate Gaussian (for unimodal distributions) and a mixture-of-Gaussians model (for multimodal distributions). Those probability densities are then used to formulate a maximum-likelihood estimation framework for visual search and target detection for automatic object recognition and coding. Our learning technique is applied to the probabilistic visual modeling, detection, recognition, and coding of human faces and nonrigid objects, such as hands.
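To make the eigenspace density model concrete, here is a minimal numpy sketch of the unimodal (single-Gaussian) case: fit an eigenbasis to the training vectors, then score a new input by its Gaussian log-likelihood in the principal subspace. Function names are illustrative, and the paper's complementary distance-from-feature-space residual term is omitted.

```python
import numpy as np

def fit_eigenspace_gaussian(X, k):
    """Fit a k-dimensional eigenspace Gaussian to training vectors X
    (rows are vectorized images). Unimodal case only; the paper also
    fits mixtures of Gaussians for multimodal data."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # Eigenvectors/eigenvalues of the sample covariance via SVD.
    U, s, _ = np.linalg.svd(Xc.T @ Xc / len(X))
    return mu, U[:, :k], s[:k]

def subspace_log_likelihood(x, mu, V, lam):
    """Gaussian log-likelihood of x in the principal subspace."""
    y = V.T @ (x - mu)                       # eigenspace coordinates
    return -0.5 * (np.sum(y ** 2 / lam)
                   + np.sum(np.log(lam))
                   + len(lam) * np.log(2 * np.pi))
```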

1,624 citations


Proceedings Article
01 Dec 1997
TL;DR: A new general framework, called Diverse Density, is described, which is applied to learn a simple description of a person from a series of images containing that person, to a stock selection problem, and to the drug activity prediction problem.
Abstract: Multiple-instance learning is a variation on supervised learning, where the task is to learn a concept given positive and negative bags of instances. Each bag may contain many instances, but a bag is labeled positive even if only one of the instances in it falls within the concept. A bag is labeled negative only if all the instances in it are negative. We describe a new general framework, called Diverse Density, for solving multiple-instance learning problems. We apply this framework to learn a simple description of a person from a series of images (bags) containing that person, to a stock selection problem, and to the drug activity prediction problem.
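A hedged sketch of the Diverse Density objective under the commonly used noisy-or instance model: a candidate concept point scores well when every positive bag has at least one nearby instance and no instance of a negative bag is nearby. The function and scaling below are illustrative, not the paper's exact optimization code.

```python
import numpy as np

def neg_log_dd(t, pos_bags, neg_bags):
    """Negative log Diverse Density of a candidate concept point t.
    Each bag is an (n_i, d) array of instances; a Gaussian-like kernel
    turns instance-to-concept distance into a probability."""
    nll = 0.0
    for bag in pos_bags:
        p_inst = np.exp(-np.sum((bag - t) ** 2, axis=1))
        # Noisy-or: the bag is positive if at least one instance fits t.
        nll -= np.log(1.0 - np.prod(1.0 - p_inst) + 1e-12)
    for bag in neg_bags:
        p_inst = np.exp(-np.sum((bag - t) ** 2, axis=1))
        # Every instance of a negative bag must fail to fit t.
        nll -= np.sum(np.log(1.0 - p_inst + 1e-12))
    return nll
```

In practice one would minimize neg_log_dd over t with a generic optimizer, restarting from instances drawn from the positive bags.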

1,314 citations


Journal ArticleDOI
TL;DR: This article summarizes four directions of machine-learning research, the improvement of classification accuracy by learning ensembles of classifiers, methods for scaling up supervised learning algorithms, reinforcement learning, and the learning of complex stochastic models.
Abstract: Machine-learning research has been making great progress in many directions. This article summarizes four of these directions and discusses some current open problems. The four directions are (1) the improvement of classification accuracy by learning ensembles of classifiers, (2) methods for scaling up supervised learning algorithms, (3) reinforcement learning, and (4) the learning of complex stochastic models.

1,250 citations


Proceedings Article
01 Dec 1997
TL;DR: This work presents provably convergent algorithms for problem-solving and learning with hierarchical machines and demonstrates their effectiveness on a problem with several thousand states.
Abstract: We present a new approach to reinforcement learning in which the policies considered by the learning process are constrained by hierarchies of partially specified machines. This allows for the use of prior knowledge to reduce the search space and provides a framework in which knowledge can be transferred across problems and in which component solutions can be recombined to solve larger and more complicated problems. Our approach can be seen as providing a link between reinforcement learning and "behavior-based" or "teleo-reactive" approaches to control. We present provably convergent algorithms for problem-solving and learning with hierarchical machines and demonstrate their effectiveness on a problem with several thousand states.
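Learning with hierarchies of partially specified machines ultimately reduces to Q-learning over joint (environment state, machine state) pairs at the machine's choice points. The fragment below sketches that backup; the state encodings and the machine-execution loop around it are assumptions, not the paper's code.

```python
from collections import defaultdict

def hamq_backup(Q, s, m, a, r, s2, m2, next_choices, alpha=0.1, gamma=0.99):
    """One Q-learning backup over joint (environment, machine) states at
    choice points, in the style of hierarchical-machine RL. next_choices
    is the action set the partially specified machine permits in (s2, m2),
    which is how prior knowledge constrains the search space.
    Usage: Q = defaultdict(float)."""
    best = max((Q[(s2, m2, a2)] for a2 in next_choices), default=0.0)
    Q[(s, m, a)] += alpha * (r + gamma * best - Q[(s, m, a)])
```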

746 citations


Journal ArticleDOI
TL;DR: A formulation of reinforcement learning that enables learning in noisy, dynamic environments, such as the complex concurrent multi-robot learning domain, experimentally validated on a group of four mobile robots learning a foraging task.
Abstract: This paper describes a formulation of reinforcement learning that enables learning in noisy, dynamic environments such as in the complex concurrent multi-robot learning domain. The methodology involves minimizing the learning space through the use of behaviors and conditions, and dealing with the credit assignment problem through shaped reinforcement in the form of heterogeneous reinforcement functions and progress estimators. We experimentally validate the approach on a group of four mobile robots learning a foraging task.

488 citations


Journal ArticleDOI
01 Oct 1997
TL;DR: This paper presents a problem of fuzzy clustering with partial supervision, i.e., unsupervised learning completed in the presence of some labeled patterns, and proposes two specific learning scenarios of complete and incomplete class assignment of the labeled patterns.
Abstract: Presented here is a problem of fuzzy clustering with partial supervision, i.e., unsupervised learning completed in the presence of some labeled patterns. The classification information is incorporated additively as a part of an objective function utilized in the standard FUZZY ISODATA. The algorithms proposed in the paper embrace two specific learning scenarios of complete and incomplete class assignment of the labeled patterns. Numerical examples including both synthetic and real-world data arising in the realm of software engineering are also provided.
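For concreteness, here is a sketch of the kind of additively supervised objective the paper describes, shown for fuzziness exponent m = 2; the update rules derived from it are in the paper, and the array layout below is an assumption.

```python
import numpy as np

def partially_supervised_objective(U, F, b, D2, alpha):
    """Objective for fuzzy clustering with partial supervision (m = 2):
        J = sum_ik u_ik^2 d_ik^2 + alpha * sum_ik (u_ik - f_ik b_k)^2 d_ik^2
    U, F: (c, N) membership and reference-membership matrices,
    b: (N,) 0/1 indicator of labeled patterns,
    D2: (c, N) squared distances from patterns to prototypes.
    The alpha term pulls memberships of labeled patterns toward F."""
    return float(np.sum(U ** 2 * D2) + alpha * np.sum((U - F * b) ** 2 * D2))
```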

327 citations


Journal ArticleDOI
TL;DR: The proposed hybrid learning scheme provides a framework for incorporating existing algorithms in the training of GRBF networks, which include unsupervised algorithms for clustering and learning vector quantization, as well as learning algorithms for training single-layer linear neural networks.
Abstract: This paper proposes a framework for constructing and training radial basis function (RBF) neural networks. The proposed growing radial basis function (GRBF) network begins with a small number of prototypes, which determine the locations of radial basis functions. In the process of training, the GRBF network grows by splitting one of the prototypes at each growing cycle. Two splitting criteria are proposed to determine which prototype to split in each growing cycle. The proposed hybrid learning scheme provides a framework for incorporating existing algorithms in the training of GRBF networks. These include unsupervised algorithms for clustering and learning vector quantization, as well as learning algorithms for training single-layer linear neural networks. A supervised learning scheme based on the minimization of the localized class-conditional variance is also proposed and tested. GRBF neural networks are evaluated and tested on a variety of data sets with very satisfactory results.
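A minimal sketch of one growing cycle, assuming a variance-based splitting criterion (the paper proposes two specific criteria of its own): the prototype that represents its assigned inputs worst is replaced by two perturbed copies.

```python
import numpy as np

def grow_by_splitting(X, prototypes, assign, eps=1e-3):
    """One growing cycle of a GRBF-style network. X: (N, d) inputs,
    prototypes: (k, d), assign: (N,) index of each input's prototype.
    Splits the prototype with the largest total squared deviation."""
    scores = np.array([np.sum((X[assign == j] - p) ** 2)
                       for j, p in enumerate(prototypes)])
    j = int(np.argmax(scores))
    delta = eps * np.random.randn(prototypes.shape[1])
    # Replace prototype j by two slightly perturbed copies.
    return np.vstack([np.delete(prototypes, j, axis=0),
                      prototypes[j] + delta,
                      prototypes[j] - delta])
```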

322 citations


Journal ArticleDOI
TL;DR: Results indicate that neural systems perform a rich repertoire of computations based on action potential timing, which could provide a substrate for unambiguous representations of complex stimuli and be used to solve difficult cognitive tasks, such as the binding problem.
Abstract: Computational neuroscience has contributed significantly to our understanding of higher brain function by combining experimental neurobiology, psychophysics, modeling, and mathematical analysis. This article reviews recent advances in a key area: neural coding and information processing. It is shown that synapses are capable of supporting computations based on highly structured temporal codes. Such codes could provide a substrate for unambiguous representations of complex stimuli and be used to solve difficult cognitive tasks, such as the binding problem. Unsupervised learning rules could generate the circuitry required for precise temporal codes. Together, these results indicate that neural systems perform a rich repertoire of computations based on action potential timing.

259 citations


Journal ArticleDOI
TL;DR: It has been verified experimentally that when nonlinear Principal Component Analysis (PCA) learning rules are used for the weights of a neural layer, the neurons have signal separation capabilities and can be used for image and speech signal separation.

237 citations


Journal ArticleDOI
TL;DR: Improvement through training in vernier acuity under different feedback conditions is compared to clarify the role of feedback during learning of a perceptual task and to test different (neural network) models of perceptual learning.

Journal ArticleDOI
TL;DR: In this paper, the authors discuss two classes of approaches for data mining with neural networks: rule extraction and directly learning simple, easy-to-understand networks, and argue that, given the current state-of-the-art, neural-network methods deserve a place in the tool boxes of data-mining specialists.

Journal ArticleDOI
TL;DR: A filtering model is proposed that decomposes the overall task into subsystem functionalities and highlights the need for multiple adaptation techniques to cope with uncertainties.
Abstract: In information-filtering environments, uncertainties associated with changing interests of the user and the dynamic document stream must be handled efficiently. In this article, a filtering model is proposed that decomposes the overall task into subsystem functionalities and highlights the need for multiple adaptation techniques to cope with uncertainties. A filtering system, SIFTER, has been implemented based on the model, using established techniques in information retrieval and artificial intelligence. These techniques include document representation by a vector-space model, document classification by unsupervised learning, and user modeling by reinforcement learning. The system can filter information based on content and a user's specific interests. The user's interests are automatically learned with only limited user intervention in the form of optional relevance feedback for documents. We also describe experimental studies conducted with SIFTER to filter computer and information science documents collected from the Internet and commercial database services. The experimental results demonstrate that the system performs very well in filtering documents in a realistic problem setting.
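As an illustration of the reinforcement-style user modeling the article describes, here is a hedged sketch of a relevance-feedback update to a vector-space interest profile; the update form and parameter are assumptions, not SIFTER's actual code.

```python
import numpy as np

def update_profile(profile, doc_vec, feedback, beta=0.1):
    """Relevance-feedback update of a user-interest vector in a
    vector-space filtering model: nudge the profile toward documents
    the user liked (feedback = +1) and away from those they did not
    (feedback = -1), then renormalize."""
    profile = profile + beta * feedback * doc_vec
    return profile / (np.linalg.norm(profile) + 1e-12)
```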

Proceedings ArticleDOI
03 Nov 1997
TL;DR: This paper proposes an entropy measure for ranking features and conducts extensive experiments showing that the method finds the important features and compares well with a similar feature-ranking method (Relief) that, unlike the proposed method, requires class information.
Abstract: Dimensionality reduction is an important problem for efficient handling of large databases. Many feature selection methods exist for supervised data having class information. Little work has been done for dimensionality reduction of unsupervised data in which class information is not available. Principal component analysis (PCA) is often used. However, PCA creates new features. It is difficult to obtain an intuitive understanding of the data using the new features only. We are concerned with the problem of determining and choosing the important original features for unsupervised data. Our method is based on the observation that removing an irrelevant feature from the feature set may not change the underlying concept of the data, but not so otherwise. We propose an entropy measure for ranking features, and conduct extensive experiments to show that our method is able to find the important features. It also compares well with a similar feature-ranking method (Relief) that, unlike our method, requires class information.
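A plausible reading of the entropy-based ranking in code: measure how ordered the pairwise similarity structure of the data is, then rank each feature by how much its removal perturbs that entropy. The similarity definition and normalization below are assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist

def entropy_score(X):
    """Entropy of the pairwise similarity structure of X."""
    d = pdist(X)
    alpha = -np.log(0.5) / d.mean()      # similarity 0.5 at mean distance
    S = np.clip(np.exp(-alpha * d), 1e-12, 1 - 1e-12)
    return -np.sum(S * np.log(S) + (1 - S) * np.log(1 - S))

def rank_features(X):
    """Rank features by how much removing each one changes the entropy
    (most important, i.e. largest change, first)."""
    base = entropy_score(X)
    deltas = [abs(base - entropy_score(np.delete(X, j, axis=1)))
              for j in range(X.shape[1])]
    return np.argsort(deltas)[::-1]
```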

Journal ArticleDOI
01 Jan 1997
TL;DR: This paper focuses primarily on issues related to the accuracy and efficacy of meta-learning as a general strategy, demonstrating that meta-learning is technically feasible in wide-area, network computing environments.
Abstract: In this paper, we describe a general approach to scaling data mining applications that we have come to call meta-learning. Meta-learning refers to a general strategy that seeks to learn how to combine a number of separate learning processes in an intelligent fashion. We desire a meta-learning architecture that exhibits two key behaviors. First, the meta-learning strategy must produce an accurate final classification system. This means that a meta-learning architecture must produce a final outcome that is at least as accurate as a conventional learning algorithm applied to all available data. Second, it must be fast, relative to an individual sequential learning algorithm when applied to massive databases of examples, and operate in a reasonable amount of time. This paper focuses primarily on issues related to the accuracy and efficacy of meta-learning as a general strategy. A number of empirical results are presented demonstrating that meta-learning is technically feasible in wide-area, network computing environments.
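A minimal sketch of the "combiner" style of meta-learning: base learners are trained on separate partitions of the data, and a meta-level learner is trained on their predictions. The choice of scikit-learn base and meta learners here is illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

def combiner_meta_learn(partitions, X_meta, y_meta):
    """partitions: list of (X_part, y_part) training subsets, e.g. data
    held at different sites; (X_meta, y_meta): a validation set for
    training the combiner. Assumes numeric class labels."""
    bases = [DecisionTreeClassifier().fit(Xp, yp) for Xp, yp in partitions]
    # Meta-level features are the base learners' predictions.
    Z = np.column_stack([b.predict(X_meta) for b in bases])
    combiner = LogisticRegression().fit(Z, y_meta)
    return bases, combiner
```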

Journal ArticleDOI
TL;DR: A novel approach is proposed that purposely tolerates a small error in the training process in order to avoid overfitting data that may contain errors and is utilized to discover very useful survival curves for breast cancer patients from a medical database.
Abstract: Mathematical programming approaches to three fundamental problems will be described: feature selection, clustering and robust representation. The feature selection problem considered is that of discriminating between two sets while recognizing irrelevant and redundant features and suppressing them. This creates a lean model that often generalizes better to new unseen data. Computational results on real data confirm improved generalization of leaner models. Clustering is exemplified by the unsupervised learning of patterns and clusters that may exist in a given database and is a useful tool for knowledge discovery in databases (KDD). A mathematical programming formulation of this problem is proposed that is theoretically justifiable and computationally implementable in a finite number of steps. A resulting k-Median Algorithm is utilized to discover very useful survival curves for breast cancer patients from a medical database. Robust representation is concerned with minimizing trained model degradation when applied to new problems. A novel approach is proposed that purposely tolerates a small error in the training process in order to avoid overfitting data that may contain errors. Examples of applications of these concepts are given.
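A hedged sketch of a k-Median iteration of the kind the abstract refers to: alternate 1-norm (Manhattan) assignment with coordinate-wise median updates. Initialization and stopping rules are simplified.

```python
import numpy as np

def k_median(X, k, iters=100, seed=0):
    """Simple k-Median clustering: L1 assignment, median centers."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.abs(X[:, None, :] - C[None]).sum(axis=2)   # L1 distances
        assign = d.argmin(axis=1)
        C = np.array([np.median(X[assign == j], axis=0)
                      if np.any(assign == j) else C[j]
                      for j in range(k)])
    return C, assign
```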

Posted Content
TL;DR: An experimental comparison of three unsupervised learning algorithms that distinguish the sense of an ambiguous word in untagged text using McQuitty's similarity analysis, Ward's minimum-variance method, and the EM algorithm.
Abstract: This paper describes an experimental comparison of three unsupervised learning algorithms that distinguish the sense of an ambiguous word in untagged text. The methods described in this paper, McQuitty's similarity analysis, Ward's minimum-variance method, and the EM algorithm, assign each instance of an ambiguous word to a known sense definition based solely on the values of automatically identifiable features in text. These methods and feature sets are found to be more successful in disambiguating nouns than adjectives or verbs. Overall, the most accurate of these procedures is McQuitty's similarity analysis in combination with a high-dimensional feature set.

Book ChapterDOI
01 Aug 1997
TL;DR: A simple decomposition of the expected distortion shows that K-means (and its extension for inferring general parametric densities from unlabeled sample data) must implicitly manage a trade-off between how similar the data assigned to each cluster are and how the data are balanced among the clusters.
Abstract: Assignment methods are at the heart of many algorithms for unsupervised learning and clustering -- in particular, the well-known K-means and Expectation-Maximization (EM) algorithms. In this work, we study several different methods of assignment, including the "hard" assignments used by K-means and the "soft" assignments used by EM. While it is known that K-means minimizes the distortion on the data and EM maximizes the likelihood, little is known about the systematic differences of behavior between the two algorithms. Here we shed light on these differences via an information-theoretic analysis. The cornerstone of our results is a simple decomposition of the expected distortion, showing that K-means (and its extension for inferring general parametric densities from unlabeled sample data) must implicitly manage a trade-off between how similar the data assigned to each cluster are, and how the data are balanced among the clusters. How well the data are balanced is measured by the entropy of the partition defined by the hard assignments. In addition to letting us predict and verify systematic differences between K-means and EM on specific examples, the decomposition allows us to give a rather general argument showing that K-means will consistently find densities with less "overlap" than EM. We also study a third natural assignment method that we call posterior assignment, that is close in spirit to the soft assignments of EM, but leads to a surprisingly different algorithm.
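The hard/soft contrast is easy to state in code. Under spherical Gaussians with shared variance, EM's soft assignment is the posterior over clusters, K-means' hard assignment is its argmax, and the entropy of the induced partition is the balance term in the paper's decomposition. A didactic sketch, not the authors' notation:

```python
import numpy as np

def assignments(X, mu, sigma2, pi):
    """X: (N, d) data, mu: (k, d) centers, sigma2: shared variance,
    pi: (k,) mixing weights. Returns soft posteriors, hard labels,
    and the entropy of the hard partition."""
    d2 = ((X[:, None, :] - mu[None]) ** 2).sum(axis=2)
    logp = np.log(pi) - 0.5 * d2 / sigma2
    logp -= logp.max(axis=1, keepdims=True)          # stabilize
    soft = np.exp(logp)
    soft /= soft.sum(axis=1, keepdims=True)          # EM posteriors
    hard = soft.argmax(axis=1)                       # K-means-style labels
    freq = np.bincount(hard, minlength=len(mu)) / len(X)
    H = -np.sum(freq[freq > 0] * np.log(freq[freq > 0]))
    return soft, hard, H
```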

Proceedings Article
01 Jan 1997
TL;DR: This article presented three unsupervised learning algorithms that are able to distinguish among the known senses (i.e., as defined in some dictionary) of a word, based only on features that can be automatically extracted from untagged text.
Abstract: This paper describes an experimental comparison of three unsupervised learning algorithms that distinguish the sense of an ambiguous word in untagged text. The methods described in this paper, McQuitty's similarity analysis, Ward's minimum-variance method, and the EM algorithm, assign each instance of an ambiguous word to a known sense definition based solely on the values of automatically identifiable features in text. These methods and feature sets are found to be more successful in disambiguating nouns than adjectives or verbs. Overall, the most accurate of these procedures is McQuitty's similarity analysis in combination with a high-dimensional feature set.

1 Introduction
Statistical methods for natural language processing are often dependent on the availability of costly knowledge sources such as manually annotated text or semantic networks. This limits the applicability of such approaches to domains where this hard-to-acquire knowledge is already available. This paper presents three unsupervised learning algorithms that are able to distinguish among the known senses (i.e., as defined in some dictionary) of a word, based only on features that can be automatically extracted from untagged text. The object of unsupervised learning is to determine the class membership of each observation (i.e., each object to be classified) in a sample without using training examples of correct classifications. We discuss three algorithms, McQuitty's similarity analysis (McQuitty, 1966), Ward's minimum-variance method (Ward, 1963), and the EM algorithm (Dempster, Laird, and Rubin, 1977), that can be used to distinguish among the known senses of an ambiguous word without the aid of disambiguated examples. The EM algorithm produces maximum likelihood estimates of the parameters of a probabilistic model, where that model has been specified in advance. Both Ward's and McQuitty's methods are agglomerative clustering algorithms that form classes of unlabeled observations that minimize their respective distance measures between class members. The rest of this paper is organized as follows. First, we present introductions to Ward's and McQuitty's methods (Section 2) and the EM algorithm (Section 3). We discuss the thirteen words (Section 4) and the three feature sets (Section 5) used in our experiments. We present our experimental results (Section 6) and close with a discussion of related work (Section 7).

2 Agglomerative Clustering
In general, clustering methods rely on the assumption that classes occupy distinct regions in the feature space. The distance between two points in a multi-dimensional space can be measured using any of a wide variety of metrics (see, e.g., Devijver and Kittler, 1982). Observations are grouped in the manner that minimizes the distance between the members of each class. Ward's and McQuitty's methods are agglomerative clustering algorithms that differ primarily in how they compute the distance between clusters. All such algorithms begin by placing each observation in a unique cluster, i.e., a cluster of one. The two closest clusters are merged to form a new cluster that replaces the two merged clusters. Merging of the two closest clusters continues until only some specified number of clusters remains. However, our data does not immediately lend itself to a distance-based interpretation. Our features represent part-of-speech (POS) tags, morphological characteristics, and word co-occurrence; such features are nominal and their values do not have scale.
Given a POS feature, for example, we could choose noun = 1, verb = 2, adjective = 3, and adverb = 4. That adverb is represented by a larger number than noun is purely coincidental and implies nothing about the relationship between nouns and adverbs. Thus, before we employ either clustering algorithm...
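The clustering methods compared here are standard and easy to run; for example, SciPy's 'weighted' linkage is WPGMA, which is commonly identified with McQuitty's similarity analysis, while 'ward' gives Ward's minimum-variance method. A sketch, assuming the nominal features have already been one-hot encoded as discussed above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_word_contexts(feature_matrix, n_senses):
    """Group occurrences of an ambiguous word by clustering their
    (one-hot encoded) context feature vectors into n_senses classes."""
    d = pdist(feature_matrix, metric='hamming')   # suits nominal features
    Z = linkage(d, method='weighted')             # WPGMA / McQuitty
    return fcluster(Z, t=n_senses, criterion='maxclust')
```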

Journal ArticleDOI
TL;DR: A general two-level learning model is presented that effectively adjusts to changing contexts by trying to detect (via ‘meta-learning’) contextual clues and using this information to focus the learning process.
Abstract: The article deals with the problem of learning incrementally (‘on-line’) in domains where the target concepts are context-dependent, so that changes in context can produce more or less radical changes in the associated concepts. In particular, we concentrate on a class of learning tasks where the domain provides explicit clues as to the current context (e.g., attributes with characteristic values). A general two-level learning model is presented that effectively adjusts to changing contexts by trying to detect (via ‘meta-learning’) contextual clues and using this information to focus the learning process. Context learning and detection occur during regular on-line learning, without separate training phases for context recognition. Two operational systems based on this model are presented that differ in the underlying learning algorithm and in the way they use contextual information: METAL(B) combines meta-learning with a Bayesian classifier, while METAL(IB) is based on an instance-based learning algorithm. Experiments with synthetic domains as well as a number of ‘real-world’ problems show that the algorithms are robust in a variety of dimensions, and that meta-learning can produce substantial increases in accuracy over simple object-level learning in situations with changing contexts.

Proceedings Article
08 Jul 1997
TL;DR: An implementation of a sequential feature selection algorithm based on an existing conceptual clustering system is described and an implementation which employs a technique for improving the efficiency of the search for an optimal description is presented.
Abstract: Feature selection has proven to be a valuable technique in supervised learning for improving predictive accuracy while reducing the number of attributes considered in a task. We investigate the potential for similar benefits in an unsupervised learning task, conceptual clustering. The issues raised in feature selection by the absence of class labels are discussed and an implementation of a sequential feature selection algorithm based on an existing conceptual clustering system is described. Additionally, we present a second implementation which employs a technique for improving the efficiency of the search for an optimal description and compare the performance of both algorithms.
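A sketch of sequential forward selection wrapped around a clustering criterion. The paper wraps an existing conceptual clustering system; K-means with a silhouette score is substituted here purely as an illustrative stand-in.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def forward_select(X, k, n_features):
    """Greedily add the feature that most improves the clustering
    criterion, until n_features are selected."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_features:
        def score(j):
            cols = selected + [j]
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(X[:, cols])
            return silhouette_score(X[:, cols], labels)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```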

Book
01 Jan 1997
TL;DR: It is argued that local models have the potential to help solve problems in high-dimensional spaces, whereas global models do not, and that a linear approximation of the system dynamics together with a quadratic function describing the long-term reward constitutes a suitable local model.
Abstract: Reinforcement learning is a general and powerful way to formulate complex learning problems and acquire good system behaviour. The goal of a reinforcement learning system is to maximize a long term ...

Proceedings Article
01 Dec 1997
TL;DR: If an object has a continuous family of instantiations, it should be represented by a continuous attractor, and this idea is illustrated with a network that learns to complete patterns.
Abstract: One approach to invariant object recognition employs a recurrent neural network as an associative memory. In the standard depiction of the network's state space, memories of objects are stored as attractive fixed points of the dynamics. I argue for a modification of this picture: if an object has a continuous family of instantiations, it should be represented by a continuous attractor. This idea is illustrated with a network that learns to complete patterns. To perform the task of filling in missing information, the network develops a continuous attractor that models the manifold from which the patterns are drawn. From a statistical viewpoint, the pattern completion task allows a formulation of unsupervised learning in terms of regression rather than density estimation.

Proceedings ArticleDOI
21 Apr 1997
TL;DR: Neural blind separation techniques developed in the laboratory are applied to the extraction of features from natural images and to the separation of medical EEG signals, yielding features that describe the underlying data better than, for example, classical principal component analysis.
Abstract: In blind source separation one tries to separate statistically independent unknown source signals from their linear mixtures without knowing the mixing coefficients. Such techniques are currently studied actively both in statistical signal processing and unsupervised neural learning. We apply neural blind separation techniques developed in our laboratory to the extraction of features from natural images and to the separation of medical EEG signals. The new analysis method yields features that describe the underlying data better than, for example, classical principal component analysis. We also discuss difficulties related to real-world applications of blind signal processing.
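A schematic version of one member of this family of neural separation rules: the nonlinear PCA subspace rule, dW = lr * (g(y) x^T - g(y) g(y)^T W) with g = tanh, applied to whitened data. Step sizes, epochs, and whitening details are simplifications, not the authors' exact algorithm.

```python
import numpy as np

def nonlinear_pca_bss(X, k, lr=0.01, epochs=10, seed=0):
    """Blind source separation sketch: whiten the (N, d) mixtures X,
    then adapt a (k, d) weight matrix with the nonlinear PCA
    subspace rule. Returns the estimated source signals."""
    # Whiten: zero mean, identity covariance.
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(np.cov(Xc.T))
    Xw = Xc @ U / np.sqrt(s + 1e-12)
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((k, X.shape[1]))
    for _ in range(epochs):
        for x in Xw:
            y = np.tanh(W @ x)                       # g(y) nonlinearity
            W += lr * (np.outer(y, x) - np.outer(y, y) @ W)
    return Xw @ W.T
```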

Journal ArticleDOI
TL;DR: This article compares three approaches to machine learning that have developed largely independently (classical statistics, Vapnik's statistical learning theory, and computational learning theory) and concludes that statisticians and data miners can profit by studying each other's methods and using a judiciously chosen combination of them.

Journal ArticleDOI
TL;DR: Two quantitative measures are introduced which establish a relationship between the formulation that led to FALVQ algorithms and the competition between the prototypes during the learning process and are tested and evaluated using the IRIS data set.
Abstract: This paper presents a general methodology for the development of fuzzy algorithms for learning vector quantization (FALVQ). The design of specific FALVQ algorithms according to existing approaches reduces to the selection of the membership function assigned to the weight vectors of an LVQ competitive neural network, which represent the prototypes. The development of a broad variety of FALVQ algorithms can be accomplished by selecting the form of the interference function that determines the effect of the nonwinning prototypes on the attraction between the winning prototype and the input of the network. The proposed methodology provides the basis for extending the existing FALVQ 1, FALVQ 2, and FALVQ 3 families of algorithms. This paper also introduces two quantitative measures which establish a relationship between the formulation that led to FALVQ algorithms and the competition between the prototypes during the learning process. The proposed algorithms and competition measures are tested and evaluated using the IRIS data set. The significance of the proposed competition measure is illustrated using FALVQ algorithms to perform segmentation of magnetic resonance images of the brain.
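A hedged sketch in the spirit of FALVQ: the winning prototype is attracted to the input, while nonwinning prototypes move with a weight derived from an interference function of the distance ratio. The specific membership and interference forms below are illustrative, not the paper's FALVQ 1, FALVQ 2, or FALVQ 3 definitions.

```python
import numpy as np

def fuzzy_lvq_step(x, W, lr=0.05):
    """One unsupervised update of the (k, d) prototype matrix W for
    input x. Nonwinners are weighted by an illustrative membership
    based on the ratio of squared distances."""
    d = np.sum((W - x) ** 2, axis=1)
    i = int(np.argmin(d))                      # winning prototype
    for j in range(len(W)):
        if j == i:
            W[i] += lr * (x - W[i])
        else:
            u = 1.0 / (1.0 + d[j] / (d[i] + 1e-12))   # illustrative
            W[j] += lr * u ** 2 * (x - W[j])
    return W
```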

Journal ArticleDOI
TL;DR: This paper shows how to develop dynamic programming versions of EBL, which it is called region-based dynamic programming or Explanation-Based Reinforcement Learning (EBRL), and compares batch and online versions of EBRL to batch andOnline versions of point-basedynamic programming and to standard EBL.
Abstract: In speedup-learning problems, where full descriptions of operators are known, both explanation-based learning (EBL) and reinforcement learning (RL) methods can be applied. This paper shows that both methods involve fundamentally the same process of propagating information backward from the goal toward the starting state. Most RL methods perform this propagation on a state-by-state basis, while EBL methods compute the weakest preconditions of operators, and hence, perform this propagation on a region-by-region basis. Barto, Bradtke, and Singh (1995) have observed that many algorithms for reinforcement learning can be viewed as asynchronous dynamic programming. Based on this observation, this paper shows how to develop dynamic programming versions of EBL, which we call region-based dynamic programming or Explanation-Based Reinforcement Learning (EBRL). The paper compares batch and online versions of EBRL to batch and online versions of point-based dynamic programming and to standard EBL. The results show that region-based dynamic programming combines the strengths of EBL (fast learning and the ability to scale to large state spaces) with the strengths of reinforcement learning algorithms (learning of optimal policies). Results are shown in chess endgames and in synthetic maze tasks.

Journal ArticleDOI
TL;DR: A new learning concept and paradigm for neural networks, called multiresolution learning, is presented, based onMultiresolution analysis in wavelet theory, which can significantly improve the generalization performance of neural networks.
Abstract: Current neural network learning processes, regardless of the learning algorithm and preprocessing used, are sometimes inadequate for difficult problems. We present a new learning concept and paradigm for neural networks, called multiresolution learning, based on multiresolution analysis in wavelet theory. The multiresolution learning paradigm can significantly improve the generalization performance of neural networks.

Journal ArticleDOI
TL;DR: This study investigates the emerging possibility of combining unsupervised and supervised learning in neural-network ensembles: an efficient partition of a noisy input data set focuses the training of the networks on the most complex and informative domains of the data set and accelerates the learning phase.

Journal ArticleDOI
01 Aug 1997
TL;DR: It is proved that classes whose VC-dimension is finite are learnable in a very strong sense, while, on the other hand, ε-covering numbers of a concept class impose lower bounds on the sample size needed for learning in the authors' models.
Abstract: We propose a mathematical model for learning the high-density areas of an unknown distribution from (unlabeled) random points drawn according to this distribution. While this type of learning task has not been previously addressed in the computational learnability literature, we believe that this is a rather basic problem that appears in many practical learning scenarios. From a statistical theory standpoint, our model may be viewed as a restricted instance of the fundamental issue of inferring information about a probability distribution from the random samples it generates. From a computational learning angle, what we propose is a new framework of unsupervised concept learning. The examples provided to the learner in our model are not labeled (and are not necessarily all positive or all negative). The only information about their membership is indirectly disclosed to the student through the sampling distribution. We investigate the basic features of the proposed model and provide lower and upper bounds on the sample complexity of such learning tasks. We prove that classes whose VC-dimension is finite are learnable in a very strong sense, while, on the other hand, ε-covering numbers of a concept class impose lower bounds on the sample size needed for learning in our models. One direction of the proof involves a reduction of the density-level learnability to PAC learning with respect to fixed distributions (as well as some fundamental statistical lower bounds), while the sufficiency condition is proved through the introduction of a generic learning algorithm.