
Showing papers on "Unsupervised learning" published in 1995


Book
Vladimir Vapnik1
01 Jan 1995
TL;DR: Setting of the learning problem; consistency of learning processes; bounds on the rate of convergence of learning processes; controlling the generalization ability of learning processes; constructing learning algorithms; what is important in learning theory?
Abstract: Setting of the learning problem; consistency of learning processes; bounds on the rate of convergence of learning processes; controlling the generalization ability of learning processes; constructing learning algorithms; what is important in learning theory?

40,147 citations


Proceedings ArticleDOI
26 Jun 1995
TL;DR: An unsupervised learning algorithm for sense disambiguation that, when trained on unannotated English text, rivals the performance of supervised techniques that require time-consuming hand annotations.
Abstract: This paper presents an unsupervised learning algorithm for sense disambiguation that, when trained on unannotated English text, rivals the performance of supervised techniques that require time-consuming hand annotations. The algorithm is based on two powerful constraints---that words tend to have one sense per discourse and one sense per collocation---exploited in an iterative bootstrapping procedure. Tested accuracy exceeds 96%.

2,594 citations
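A minimal sketch of the iterative bootstrapping idea behind this result, assuming a toy representation in which each context is a set of collocation features and a few seed collocations are labeled per sense; the log-likelihood scoring, smoothing, and one-sense-per-discourse pass of the actual algorithm are omitted.

```python
from collections import defaultdict

def bootstrap(contexts, seeds, n_iters=10, threshold=0.7):
    """contexts: list of sets of collocation features; seeds: {feature: sense}."""
    labels = [None] * len(contexts)
    rules = dict(seeds)  # decision list: collocation feature -> sense
    for _ in range(n_iters):
        # Label every context that contains at least one known collocation.
        for i, ctx in enumerate(contexts):
            senses = [rules[f] for f in ctx if f in rules]
            if senses:
                labels[i] = max(set(senses), key=senses.count)
        # Re-learn the rule list from the currently labeled contexts.
        counts = defaultdict(lambda: defaultdict(int))
        for ctx, lab in zip(contexts, labels):
            if lab is not None:
                for f in ctx:
                    counts[f][lab] += 1
        rules = {}
        for f, by_sense in counts.items():
            sense, c = max(by_sense.items(), key=lambda kv: kv[1])
            if c / sum(by_sense.values()) >= threshold:  # keep reliable collocations only
                rules[f] = sense
    return labels, rules
```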


Book
01 Jan 1995
TL;DR: In this article, the authors provide a systematic account of artificial neural network paradigms by identifying the fundamental concepts and major methodologies underlying most of the current theory and practice employed by neural network researchers.
Abstract: From the Publisher: As book review editor of the IEEE Transactions on Neural Networks, Mohamad Hassoun has had the opportunity to assess the multitude of books on artificial neural networks that have appeared in recent years. Now, in Fundamentals of Artificial Neural Networks, he provides the first systematic account of artificial neural network paradigms by identifying clearly the fundamental concepts and major methodologies underlying most of the current theory and practice employed by neural network researchers. Such a systematic and unified treatment, although sadly lacking in most recent texts on neural networks, makes the subject more accessible to students and practitioners. Here, important results are integrated in order to more fully explain a wide range of existing empirical observations and commonly used heuristics. There are numerous illustrative examples, over 200 end-of-chapter analytical and computer-based problems that will aid in the development of neural network analysis and design skills, and a bibliography of nearly 700 references. Proceeding in a clear and logical fashion, the first two chapters present the basic building blocks and concepts of artificial neural networks and analyze the computational capabilities of the basic network architectures involved. Supervised, reinforcement, and unsupervised learning rules in simple nets are brought together in a common framework in chapter three. The convergence and solution properties of these learning rules are then treated mathematically in chapter four, using the "average learning equation" analysis approach. This organization of material makes it natural to switch into learning multilayer nets using backprop and its variants, described in chapter five. Chapter six covers most of the major neural network paradigms, while associative memories and energy minimizing nets are given detailed coverage in the next chapter. The final chapter takes up Boltzmann machines and Boltzmann learning along with other global search/optimization algorithms such as stochastic gradient search, simulated annealing, and genetic algorithms.

2,118 citations


BookDOI
01 Jan 1995
TL;DR: This volume includes chapters on Backpropagation and Unsupervised Learning in Linear Networks, Spatial Coherence as an Internal Teacher for a Neural Network, and Gradient-Based Learning Algorithms for Recurrent Networks and Their Computational Complexity, among others.
Abstract: Contents: D.E. Rumelhart, R. Durbin, R. Golden, Y. Chauvin, Backpropagation: The Basic Theory. A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, K.J. Lang, Phoneme Recognition Using Time-Delay Neural Networks. C. Schley, Y. Chauvin, V. Henkle, Automated Aircraft Flare and Touchdown Control Using Neural Networks. F.J. Pineda, Recurrent Backpropagation Networks. M.C. Mozer, A Focused Backpropagation Algorithm for Temporal Pattern Recognition. D.H. Nguyen, B. Widrow, Nonlinear Control with Neural Networks. M.I. Jordan, D.E. Rumelhart, Forward Models: Supervised Learning with a Distal Teacher. S.J. Hanson, Backpropagation: Some Comments and Variations. A. Cleeremans, D. Servan-Schreiber, J.L. McClelland, Graded State Machines: The Representation of Temporal Contingencies in Feedback Networks. S. Becker, G.E. Hinton, Spatial Coherence as an Internal Teacher for a Neural Network. J.R. Bachrach, M.C. Mozer, Connectionist Modeling and Control of Finite State Systems Given Partial State Information. P. Baldi, Y. Chauvin, K. Hornik, Backpropagation and Unsupervised Learning in Linear Networks. R.J. Williams, D. Zipser, Gradient-Based Learning Algorithms for Recurrent Networks and Their Computational Complexity. P. Baldi, Y. Chauvin, When Neural Networks Play Sherlock Holmes. P. Baldi, Gradient Descent Learning Algorithms: A Unified Perspective.

538 citations


Proceedings ArticleDOI
20 Jun 1995
TL;DR: An unsupervised technique for visual learning which is based on density estimation in high-dimensional spaces using an eigenspace decomposition and a multivariate Mixture-of-Gaussians model is presented.
Abstract: We present an unsupervised technique for visual learning which is based on density estimation in high-dimensional spaces using an eigenspace decomposition. Two types of density estimates are derived for modeling the training data: a multivariate Gaussian (for unimodal distributions) and a multivariate Mixture-of-Gaussians model (for multimodal distributions). These probability densities are then used to formulate a maximum-likelihood estimation framework for visual search and target detection for automatic object recognition. This learning technique is tested in experiments with modeling and subsequent detection of human faces and non-rigid objects such as hands.

423 citations
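A simplified sketch of the unimodal case described above: fit a Gaussian in a PCA eigenspace and score new vectors by log-likelihood. The paper additionally models the out-of-subspace residual, which is omitted here.

```python
import numpy as np

def fit_eigenspace_gaussian(X, k):
    """X: (n_samples, dim) training vectors; k: subspace dimensionality."""
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = Xc.T @ Xc / len(X)
    evals, evecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:k]         # keep the top-k principal directions
    return mu, evecs[:, idx], evals[idx]

def log_likelihood(x, mu, U, lam):
    y = U.T @ (x - mu)                        # project into the eigenspace
    return -0.5 * (np.sum(y ** 2 / lam)       # Mahalanobis term
                   + np.sum(np.log(lam))
                   + len(lam) * np.log(2 * np.pi))
```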


Journal ArticleDOI
TL;DR: A reinforcement learning algorithm is proposed that can construct a neural fuzzy control network automatically and dynamically through a reward-penalty signal; it combines a proposed on-line supervised structure-parameter learning technique, the temporal difference prediction method, and a stochastic exploratory algorithm.

327 citations


Journal ArticleDOI
TL;DR: Within this framework, generalizations of the problems of variance maximization and mean-square-error minimization are studied more closely, and gradient-type neural learning algorithms are derived for both symmetric and hierarchic PCA-type networks.

295 citations
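As a concrete instance of the family of gradient-type PCA learning rules this work studies, here is Oja's classic single-unit rule; the paper's symmetric and hierarchic algorithms generalize rules of this kind, and this sketch is not the paper's own derivation.

```python
import numpy as np

def oja_first_component(X, lr=0.01, n_epochs=50, seed=0):
    """X: (n_samples, dim) zero-mean data; returns the leading principal direction."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(n_epochs):
        for x in X:
            y = w @ x
            w += lr * y * (x - y * w)  # Hebbian growth with a self-normalizing decay
    return w
```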


Journal ArticleDOI
TL;DR: Most of the known results on linear networks, including backpropagation learning and the structure of the error function landscape, the temporal evolution of generalization, and unsupervised learning algorithms and their properties are surveyed.
Abstract: Networks of linear units are the simplest kind of networks, where the basic questions related to learning, generalization, and self-organization can sometimes be answered analytically. We survey most of the known results on linear networks, including: 1) backpropagation learning and the structure of the error function landscape, 2) the temporal evolution of generalization, and 3) unsupervised learning algorithms and their properties. The connections to classical statistical ideas, such as principal component analysis (PCA), are emphasized as well as several simple but challenging open questions. A few new results are also spread across the paper, including an analysis of the effect of noise on backpropagation networks and a unified view of all unsupervised algorithms.

258 citations


Journal ArticleDOI
TL;DR: In this paper, a fully connected committee machine with K hidden units is trained by gradient descent to perform a task defined by a teacher committee machine, with M hidden units acting on randomly drawn inputs.
Abstract: The problem of on-line learning in two-layer neural networks is studied within the framework of statistical mechanics. A fully connected committee machine with K hidden units is trained by gradient descent to perform a task defined by a teacher committee machine with M hidden units acting on randomly drawn inputs. The approach, based on a direct averaging over the activation of the hidden units, results in a set of first-order differential equations that describes the dynamical evolution of the overlaps among the various hidden units and allows for a computation of the generalization error. The equations of motion are obtained analytically for general K and M and provide a powerful tool used here to study a variety of realizable, over-realizable, and unrealizable learning scenarios and to analyze the role of the learning rate in controlling the evolution and convergence of the learning process.

206 citations


Journal ArticleDOI
TL;DR: An enhancement of the traditional k-means algorithm that approximates an optimal clustering solution with an efficient adaptive learning rate, rendering it usable even in situations where the statistics of the problem task vary slowly with time.
Abstract: Adaptive k-means clustering algorithms have been used in several artificial neural network architectures, such as radial basis function networks or feature-map classifiers, for a competitive partitioning of the input domain. This paper presents an enhancement of the traditional k-means algorithm. It approximates an optimal clustering solution with an efficient adaptive learning rate, which renders it usable even in situations where the statistics of the problem task vary slowly with time. This modification is based on the optimality criterion for the k-means partition stating that all the regions in an optimal k-means partition have the same variations if the number of regions in the partition is large and the underlying distribution for generating input patterns is smooth. The goal of equalizing these variations is introduced in the competitive function that assigns each new pattern vector to the "appropriate" region. To evaluate the optimal k-means algorithm, the authors first compare it to other k-means variants on several simple tutorial examples, then evaluate it on a practical application: vector quantization of image data.

204 citations
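A rough sketch of the variance-equalizing idea, under the assumption that the competitive function weights squared distance by an estimate of each region's variation; the paper's actual competitive function and adaptive learning-rate schedule are more elaborate, and the constant learning rate here is illustrative.

```python
import numpy as np

def adaptive_kmeans_step(x, centers, variances, counts, lr=0.05):
    """One online step; variances should be initialized to ones, counts to zeros."""
    d2 = np.sum((centers - x) ** 2, axis=1)
    j = int(np.argmin(variances * d2))          # variance-weighted competition
    centers[j] += lr * (x - centers[j])         # move the winner toward the pattern
    counts[j] += 1
    variances[j] += (d2[j] - variances[j]) / counts[j]  # running variation estimate
    return j
```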


Book ChapterDOI
09 Jul 1995
TL;DR: This paper evaluates different techniques for learning from partitioned data and the meta-learning approach is empirically compared with techniques in the literature that aim to combine multiple evidence to arrive at one prediction.
Abstract: Much of the research in inductive learning concentrates on problems with relatively small amounts of data. With the coming age of very large network computing, it is likely that orders of magnitude more data in databases will be available for various learning problems of real world importance. Some learning algorithms assume that the entire data set fits into main memory, which is not feasible for massive amounts of data. One approach to handling a large data set is to partition the data set into subsets, run the learning algorithm on each of the subsets, and combine the results. In this paper we evaluate different techniques for learning from partitioned data. Our meta-learning approach is empirically compared with techniques in the literature that aim to combine multiple evidence to arrive at one prediction.
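A minimal sketch of the partition-then-combine scheme, with simple voting standing in for the combining step; the paper's meta-learning approach instead trains a second-level learner on the base classifiers' predictions. `make_learner` is a hypothetical factory for any scikit-learn-style classifier.

```python
import numpy as np
from collections import Counter

def train_partitioned(X, y, make_learner, n_parts=4, seed=0):
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), n_parts)
    return [make_learner().fit(X[idx], y[idx]) for idx in parts]

def vote(models, x):
    preds = [m.predict([x])[0] for m in models]  # one prediction per partition model
    return Counter(preds).most_common(1)[0][0]
```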

Journal ArticleDOI
TL;DR: A new neural network architecture is introduced for the recognition of pattern classes after supervised and unsupervised learning, which achieves a synthesis of adaptive resonance theory (ART) and spatial and temporal evidence integration for dynamic predictive mapping (EMAP).
Abstract: A new neural network architecture is introduced for the recognition of pattern classes after supervised and unsupervised learning. Applications include spatio-temporal image understanding and prediction and 3D object recognition from a series of ambiguous 2D views. The architecture, called ART-EMAP, achieves a synthesis of adaptive resonance theory (ART) and spatial and temporal evidence integration for dynamic predictive mapping (EMAP). ART-EMAP extends the capabilities of fuzzy ARTMAP in four incremental stages. Stage 1 introduces distributed pattern representation at a view category field. Stage 2 adds a decision criterion to the mapping between view and object categories, delaying identification of ambiguous objects when faced with a low confidence prediction. Stage 3 augments the system with a field where evidence accumulates in medium-term memory. Stage 4 adds an unsupervised learning process to fine-tune performance after the limited initial period of supervised network training. Each ART-EMAP stage is illustrated with a benchmark simulation example, using both noisy and noise-free data.

Patent
David Dolan Lewis1
07 Jun 1995
TL;DR: In this paper, a supervised learning system and an annotation system are operated cooperatively to produce a classification vector which can be used to classify documents with respect to a defined class, where the degree of relevance annotation represents the degree to which the document belongs to the defined class.
Abstract: A method and apparatus for training a text classifier is disclosed. A supervised learning system and an annotation system are operated cooperatively to produce a classification vector which can be used to classify documents with respect to a defined class. The annotation system automatically annotates documents with a degree of relevance annotation to produce machine annotated data. The degree of relevance annotation represents the degree to which the document belongs to the defined class. This machine annotated data is used as input to the supervised learning system. In addition to the machine annotated data, the supervised learning system can also receive manually annotated data and/or a user request. The machine annotated data, along with the manually annotated data and/or the user request, are used by the supervised learning system to produce a classification vector. In one embodiment, the supervised learning system comprises a relevance feedback mechanism. The relevance feedback mechanism is operated cooperatively with the annotation system for multiple iterations until a classification vector of acceptable accuracy is produced. The classification vector produced by the invention is the result of a combination of supervised and unsupervised learning.
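The patent describes a relevance feedback mechanism without committing to a specific formula; the classic Rocchio update below is one standard way such a classification vector can be produced from degree-of-relevance annotations, shown purely as an illustration.

```python
import numpy as np

def rocchio(query, doc_vectors, relevance, alpha=1.0, beta=0.75, gamma=0.15):
    """doc_vectors: (n_docs, n_terms); relevance: per-document degree in [0, 1]."""
    rel = relevance[:, None]
    pos = (rel * doc_vectors).sum(axis=0) / max(relevance.sum(), 1e-9)
    neg = ((1 - rel) * doc_vectors).sum(axis=0) / max((1 - relevance).sum(), 1e-9)
    return alpha * query + beta * pos - gamma * neg  # the classification vector
```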

Journal ArticleDOI
TL;DR: This paper concentrates on Doppelgänger's learning techniques and their implementation in an application-independent, sensor-independent environment.
Abstract: Doppelgänger is a generalized user modeling system that gathers data about users, performs inferences upon the data, and makes the resulting information available to applications. Doppelgänger's learning is called heterogeneous for two reasons: first, multiple learning techniques are used to interpret the data, and second, the learning techniques must often grapple with disparate data types. These computations take place at geographically distributed sites, and make use of portable user models carried by individuals. This paper concentrates on Doppelgänger's learning techniques and their implementation in an application-independent, sensor-independent environment.

Journal ArticleDOI
TL;DR: Attribute-based learning is limited to non-relational descriptions of objects in the sense that the learned descriptions do not specify relations among the objects' parts, and the lack of relations makes the concept description language inappropriate for some domains.
Abstract: Techniques of machine learning have been successfully applied to various problems [1, 12]. Most of these applications rely on attribute-based learning, exemplified by the induction of decision trees as in the program C4.5 [20]. Broadly speaking, attribute-based learning also includes such approaches to learning as neural networks and nearest neighbor techniques. The advantages of attribute-based learning are relative simplicity, efficiency, and the existence of effective techniques for handling noisy data. However, attribute-based learning is limited to non-relational descriptions of objects, in the sense that the learned descriptions do not specify relations among the objects' parts. Attribute-based learning thus has two strong limitations: the background knowledge can be expressed in rather limited form, and the lack of relations makes the concept description language inappropriate for some domains.

Journal ArticleDOI
TL;DR: The parametric pattern recognition (PPR) algorithm, which facilitates automatic MUAP feature extraction, and Artificial Neural Network (ANN) models are combined to provide an integrated system for the diagnosis of neuromuscular disorders.
Abstract: In previous years, several computer-aided quantitative motor unit action potential (MUAP) techniques were reported. It is now possible to add to these techniques the capability of automated medical diagnosis so that all data can be processed in an integrated environment. In this study, the parametric pattern recognition (PPR) algorithm that facilitates automatic MUAP feature extraction and Artificial Neural Network (ANN) models are combined to provide an integrated system for the diagnosis of neuromuscular disorders. Two paradigms of learning for training ANN models were investigated: supervised and unsupervised. For supervised learning, the back-propagation algorithm was used, and for unsupervised learning, Kohonen's self-organizing feature maps algorithm. The diagnostic yield for models trained with both procedures was similar, on the order of 80%. However, back-propagation models required considerably more computational effort compared to the Kohonen self-organizing feature map models. Poorer diagnostic performance was obtained when the K-means nearest neighbor clustering algorithm was applied to the same set of data.

Journal ArticleDOI
Gustavo Deco1, Wilfried Brauer1
TL;DR: A model of factorial learning for general nonlinear transformations of an arbitrary non-Gaussian (or Gaussian) environment with statistically nonlinearly correlated input is presented.

Journal ArticleDOI
Eric Saund1
TL;DR: A formulation for unsupervised learning of clusters reflecting multiple causal structure in binary data, which employs an objective function and iterative gradient descent learning algorithm resembling the conventional mixture model and demonstrates its ability to discover coherent multiple causal representations in several experimental data sets.
Abstract: This paper presents a formulation for unsupervised learning of clusters reflecting multiple causal structure in binary data. Unlike the "hard" k-means clustering algorithm and the "soft" mixture model, each of which assumes that a single hidden event generates each data point, a multiple cause model accounts for observed data by combining assertions from many hidden causes, each of which can pertain to varying degree to any subset of the observable dimensions. We employ an objective function and iterative gradient descent learning algorithm resembling the conventional mixture model. A crucial issue is the mixing function for combining beliefs from different cluster centers in order to generate data predictions whose errors are minimized both during recognition and learning. The mixing function constitutes a prior assumption about underlying structural regularities of the data domain; we demonstrate a weakness inherent to the popular weighted sum followed by sigmoid squashing, and offer alternative forms of the nonlinearity for two types of data domain. Results are presented demonstrating the algorithm's ability to successfully discover coherent multiple causal representations in several experimental data sets.
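A sketch of a multiple-cause prediction with a noisy-OR style mixing nonlinearity, one well-known alternative to the weighted-sum-plus-sigmoid form the paper criticizes; the paper's own mixing functions may differ in detail.

```python
import numpy as np

def noisy_or_prediction(m, C):
    """m: (n_causes,) cause activities in [0,1]; C: (n_causes, n_dims) weights in [0,1]."""
    return 1.0 - np.prod(1.0 - m[:, None] * C, axis=0)

def reconstruction_error(x, m, C):
    r = noisy_or_prediction(m, C)
    return np.sum((x - r) ** 2)  # minimized over m (recognition) and C (learning)
```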

Journal ArticleDOI
TL;DR: These studies contrasted the cue and category validity of features with people's prior knowledge about the relevance of features to the functions of novel artifacts, suggesting that the influences of knowledge and experience are more tightly integrated than some models would predict.
Abstract: Empirical learning models have typically focused on statistical aspects of features (e.g., cue and category validity). In general, these models do not address the contact between people's prior knowledge that lies outside the category and their experiences of the category. A variety of extensions to these models are examined, which combine prior knowledge with empirical learning. Predictions of these models were compared in 4 experiments. These studies contrasted the cue and category validity of features with people's prior knowledge about the relevance of features to the functions of novel artifacts. The findings suggest that the influences of knowledge and experience are more tightly integrated than some models would predict. Furthermore, relatively straightforward ways of incorporating knowledge into an empirical learning algorithm appear insufficient (e.g., use of knowledge to weight features by general relevance or to individually weight features). Other extensions to these models are suggested that focus on the importance of intermediary features, coherence, and conceptual roles.

Journal ArticleDOI
TL;DR: An implementation of an artificial neural network (ANN) which performs unsupervised detection of recognition categories from arbitrary sequences of multivalued input patterns called SARTNN, which gives good results in terms of ease of use, parameter robustness and computation time.
Abstract: This article presents an implementation of an artificial neural network (ANN) which performs unsupervised detection of recognition categories from arbitrary sequences of multivalued input patterns. The proposed ANN is called Simplified Adaptive Resonance Theory Neural Network (SARTNN). First, an Improved Adaptive Resonance Theory 1 (IART1)-based neural network for binary pattern analysis is discussed and a Simplified ART1 (SART1) model is proposed. Second, the SART1 model is extended to multivalued input pattern clustering and SARTNN is presented. A normalized coefficient which measures the degree of match between two multivalued vectors, the Vector Degree of Match (VDM), provides SARTNN with the metric needed to perform clustering. Every ART architecture guarantees both plasticity and stability to the unsupervised learning stage. The SARTNN plasticity requirement is satisfied by implementing its attentional subsystem as a self-organized, feed-forward, flat Kohonen's ANN (KANN). The SARTNN stability requirement is properly driven by its orienting subsystem. SARTNN processes multivalued input vectors while featuring a simplified architectural and mathematical model with respect to both the ART1 and the ART2 models, the latter being the ART model fitted to multivalued input pattern categorization. While the ART2 model exploits ten user-defined parameters, SARTNN requires only two user-defined parameters to be run: the first parameter is the vigilance threshold, ρ, which affects the network's sensitivity in detecting new output categories, whereas the second parameter, τ, is related to the network's learning rate. Both parameters have an intuitive physical meaning and allow the user to easily choose the proper discriminating power of the category extraction algorithm. The SARTNN performance is tested as a satellite image clustering algorithm. A chromatic component extractor is recommended in a satellite image preprocessing stage, in order to pursue SARTNN invariant pattern recognition. In comparison with classical clustering algorithms like ISODATA, the implemented system gives good results in terms of ease of use, parameter robustness and computation time. SARTNN should improve the performance of a Constraint Satisfaction Neural Network (CSNN) for image segmentation. SARTNN, exploited as a self-organizing first layer, should also improve the performance of both the Counter Propagation Neural Network (CPNN) and the Reduced connectivity Coulomb Energy Neural Network (RCENN).
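A schematic of the SART-style clustering loop with its two user parameters. The paper defines a specific Vector Degree of Match (VDM) coefficient; the cosine similarity below is only a stand-in to make the control flow concrete.

```python
import numpy as np

def vdm_stand_in(x, w):
    # Placeholder similarity; the paper's VDM is a specific normalized coefficient.
    return (x @ w) / (np.linalg.norm(x) * np.linalg.norm(w) + 1e-12)

def sart_cluster(patterns, rho=0.9, tau=0.5):
    """rho: vigilance threshold; tau: learning rate (the two user parameters)."""
    templates, labels = [], []
    for x in patterns:
        scores = [vdm_stand_in(x, w) for w in templates]
        if scores and max(scores) >= rho:        # resonance: refine the winner
            j = int(np.argmax(scores))
            templates[j] += tau * (x - templates[j])
        else:                                    # mismatch: open a new category
            templates.append(np.array(x, dtype=float))
            j = len(templates) - 1
        labels.append(j)
    return labels, templates
```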

Journal ArticleDOI
TL;DR: A learning rule for neural networks based on simultaneous perturbation, together with an analog feedforward neural network circuit using the rule; the rule requires only forward operations of the neural network and is therefore suitable for hardware implementation.
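A minimal sketch of a simultaneous-perturbation update: all weights are perturbed at once with random signs and the gradient is estimated from loss evaluations alone, which is why only forward operations of the network are needed. A two-sided difference is shown; the paper's rule may use a one-sided variant.

```python
import numpy as np

def sp_update(w, loss, lr=0.01, c=0.01, rng=None):
    """w: weight vector; loss: callable evaluated by running the network forward."""
    rng = rng or np.random.default_rng()
    delta = rng.choice([-1.0, 1.0], size=w.shape)   # perturb all weights at once
    g_hat = (loss(w + c * delta) - loss(w - c * delta)) / (2 * c) * delta
    return w - lr * g_hat                           # since delta is +/-1, 1/delta == delta
```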

Journal ArticleDOI
TL;DR: A generic, modular, neural network-based feature extraction and pattern classification system is proposed for finding essentially two-dimensional objects or object parts from digital images in a distortion tolerant manner, and the feature space has sufficient resolution power for a moderate number of classes with rather strong distortions.
Abstract: A generic, modular, neural network-based feature extraction and pattern classification system is proposed for finding essentially two-dimensional objects or object parts from digital images in a distortion tolerant manner, The distortion tolerance is built up gradually by successive blocks in a pipeline architecture. The system consists of only feedforward neural networks, allowing efficient parallel implementation. The most time and data-consuming stage, learning the relevant features, is wholly unsupervised and can be made off-line. The consequent supervised stage where the object classes are learned is simple and fast. The feature extraction is based on distortion tolerant Gabor transformations, followed by minimum distortion clustering by multilayer self-organizing maps. Due to the unsupervised learning strategy, there is no need for preclassified training samples or other explicit selection for training patterns during the training, which allows a large amount of training material to be used at the early stages, A supervised, one-layer subspace network classifier on top of the feature extractor is used for object labeling. The system has been trained with natural images giving the relevant features, and human faces and their parts have been used as the object classes for testing. The current experiments indicate that the feature space has sufficient resolution power for a moderate number of classes with rather strong distortions. >

Patent
28 Nov 1995
TL;DR: In this article, a neural network system and unsupervised learning process for separating unknown source signals from their received mixtures by solving the Independent Components Analysis (ICA) problem was proposed, which can be easily adapted to solve the related blind deconvolution problem that extracts an unknown source signal from the output of an unknown reverberating channel.
Abstract: A neural network system and unsupervised learning process for separating unknown source signals from their received mixtures by solving the Independent Components Analysis (ICA) problem. The unsupervised learning procedure solves the general blind signal processing problem by maximizing joint output entropy through gradient ascent to minimize mutual information in the outputs. The neural network system can separate a multiplicity of unknown source signals from measured mixture signals where the mixture characteristics and the original source signals are both unknown. The system can be easily adapted to solve the related blind deconvolution problem that extracts an unknown source signal from the output of an unknown reverberating channel.
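A compact sketch of the entropy-maximization separation rule for the square, noiseless ICA case with sigmoidal outputs, written in its natural-gradient form; learning rate and iteration counts are illustrative.

```python
import numpy as np

def infomax_ica(X, lr=0.001, n_epochs=100, seed=0):
    """X: (n_sources, n_samples) mixed, zero-mean signals; returns unmixing matrix W."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    W = np.eye(n) + 0.01 * rng.standard_normal((n, n))
    for _ in range(n_epochs):
        U = W @ X                        # candidate unmixed signals
        Y = 1.0 / (1.0 + np.exp(-U))     # sigmoidal outputs
        # Natural-gradient ascent on the joint entropy of the outputs.
        W += lr * (np.eye(n) + (1.0 - 2.0 * Y) @ U.T / X.shape[1]) @ W
    return W
```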

Posted Content
TL;DR: An unsupervised learning algorithm is presented that acquires a natural-language lexicon from raw speech, based on the optimal encoding of symbol sequences in an MDL framework, and uses a hierarchical representation of language that overcomes many of the problems that have stymied previous grammar-induction procedures.
Abstract: We present an unsupervised learning algorithm that acquires a natural-language lexicon from raw speech. The algorithm is based on the optimal encoding of symbol sequences in an MDL framework, and uses a hierarchical representation of language that overcomes many of the problems that have stymied previous grammar-induction procedures. The forward mapping from symbol sequences to the speech stream is modeled using features based on articulatory gestures. We present results on the acquisition of lexicons and language models from raw speech, text, and phonetic transcripts, and demonstrate that our algorithm compares very favorably to other reported results with respect to segmentation performance and statistical efficiency.

Journal ArticleDOI
TL;DR: The feasibility of using learning classifier systems as a tool for building adaptive control systems for real robots is investigated and it is shown that with this approach it is possible to let the AutonoMouse, a small real robot, learn to approach a light source under a number of different noise and lesion conditions.
Abstract: In this article we investigate the feasibility of using learning classifier systems as a tool for building adaptive control systems for real robots. Their use on real robots imposes efficiency constraints which are addressed by three main tools: parallelism, distributed architecture, and training. Parallelism is useful to speed up computation and to increase the flexibility of the learning system design. Distributed architecture helps in making it possible to decompose the overall task into a set of simpler learning tasks. Finally, training provides guidance to the system while learning, shortening the number of cycles required to learn. These tools and the issues they raise are first studied in simulation, and then the experience gained with simulations is used to implement the learning system on the real robot. Results have shown that with this approach it is possible to let the AutonoMouse, a small real robot, learn to approach a light source under a number of different noise and lesion conditions.

Journal ArticleDOI
TL;DR: This approach joins two forms of learning, neural networks and rough sets, aiming to improve the overall classification effectiveness of the learned descriptions of objects and to refine the dependency factors of the rules.

Proceedings ArticleDOI
21 May 1995
TL;DR: A method of vision-based reinforcement learning by which a robot learns to shoot a ball into a goal is presented, and several issues in applying the reinforcement learning method to a real robot with vision sensor are discussed.
Abstract: This paper presents a method of vision-based reinforcement learning by which a robot learns to shoot a ball into a goal, and discusses several issues in applying the reinforcement learning method to a real robot with a vision sensor. First, a "state-action deviation" problem is found as a form of perceptual aliasing in constructing the state and action spaces that reflect the outputs from physical sensors and actuators, respectively. To cope with this, an action set is constructed in such a way that one action consists of a series of the same action primitive which is successively executed until the current state changes. Next, to shorten the learning time, a mechanism of learning from easy missions (LEM), a technique similar to "shaping" in animal learning, is implemented. LEM reduces the learning time from exponential to roughly linear in the size of the state space. The results of computer simulations and real robot experiments are given.
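A schematic of the two ideas above, assuming a hypothetical tabular environment API: an "action" repeats one primitive until the discretized state changes, and LEM schedules episode start states from easy (near the goal) to hard.

```python
import numpy as np

def run_episode(env, Q, start, eps=0.1, alpha=0.25, gamma=0.9):
    """env is a hypothetical API: reset, done, sample_action, step -> (reward, state)."""
    s = env.reset(start)
    while not env.done():
        a = env.sample_action() if np.random.rand() < eps else int(np.argmax(Q[s]))
        r_total, s2 = 0.0, s
        while s2 == s and not env.done():   # repeat the primitive until the state changes
            r, s2 = env.step(a)
            r_total += r
        Q[s, a] += alpha * (r_total + gamma * Q[s2].max() - Q[s, a])
        s = s2

def train_with_lem(env, Q, start_schedule, episodes_per_stage=200):
    for start in start_schedule:            # easy starting positions first, harder later
        for _ in range(episodes_per_stage):
            run_episode(env, Q, start)
```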

Proceedings ArticleDOI
26 Jun 1995
TL;DR: The paper points out problems with global learning methods in local model networks and illustrates that local learning has a regularizing effect that can make it favorable compared to global learning in some cases.
Abstract: Local model networks are hybrid models which allow the easy integration of a priori knowledge, as well as the ability to learn from data, in order to represent complex, multidimensional dynamic systems. The paper points out problems with global learning methods in local model networks. The bias/variance trade-offs for local and global learning are examined, and it is illustrated that local learning has a regularizing effect that can make it favorable compared to global learning in some cases.


Journal ArticleDOI
TL;DR: A network architecture designed for use with a cost function that includes a novel complexity penalty term that effectively describes the network complexity with respect to the given data in an unsupervised fashion is presented.
Abstract: Controlling the network complexity in order to prevent overfitting is one of the major problems encountered when using neural network models to extract the structure from small data sets. In this paper we present a network architecture designed for use with a cost function that includes a novel complexity penalty term. In this architecture the outputs of the hidden units are strictly positive and sum to one, and their outputs are defined as the probability that the actual input belongs to a certain class formed during learning. The penalty term expresses the mutual information between the inputs and the extracted classes. This measure effectively describes the network complexity with respect to the given data in an unsupervised fashion. The efficiency of this architecture/penalty-term when combined with backpropagation training, is demonstrated on a real world economic time series forecasting problem. The model was also applied to the benchmark sunspot data and to a synthetic data set from the statistics community.
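A sketch of how such a penalty can be estimated over a batch, assuming the hidden outputs form a posterior p(class | x) as described: the mutual information between inputs and classes is the entropy of the mean posterior minus the mean entropy of the posteriors. The paper's exact formulation may differ.

```python
import numpy as np

def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=-1)

def mutual_information_penalty(P):
    """P: (n_samples, n_classes) hidden-layer outputs, positive and summing to one."""
    h_marginal = entropy(P.mean(axis=0))   # H(C), entropy of overall class usage
    h_conditional = entropy(P).mean()      # estimate of H(C | X)
    return h_marginal - h_conditional      # I(X; C) >= 0, added to the cost as a penalty
```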