
Showing papers on "Entropy (information theory) published in 1996"


Book
01 Jan 1996
TL;DR: The Bayes error, Vapnik-Chervonenkis theory, and the maximum likelihood principle are developed as guides for empirical classifier selection and error estimation.
Abstract: Preface * Introduction * The Bayes Error * Inequalities and alternate distance measures * Linear discrimination * Nearest neighbor rules * Consistency * Slow rates of convergence * Error estimation * The regular histogram rule * Kernel rules * Consistency of the k-nearest neighbor rule * Vapnik-Chervonenkis theory * Combinatorial aspects of Vapnik-Chervonenkis theory * Lower bounds for empirical classifier selection * The maximum likelihood principle * Parametric classification * Generalized linear discrimination * Complexity regularization * Condensed and edited nearest neighbor rules * Tree classifiers * Data-dependent partitioning * Splitting the data * The resubstitution estimate * Deleted estimates of the error probability * Automatic kernel rules * Automatic nearest neighbor rules * Hypercubes and discrete spaces * Epsilon entropy and totally bounded sets * Uniform laws of large numbers * Neural networks * Other error estimates * Feature extraction * Appendix * Notation * References * Index

3,598 citations


Journal ArticleDOI
TL;DR: In this article, an entropy criterion is proposed to estimate the number of clusters arising from a mixture model, which is derived from a relation linking the likelihood and the classification likelihood of a mixture.
Abstract: In this paper, we consider an entropy criterion to estimate the number of clusters arising from a mixture model. This criterion is derived from a relation linking the likelihood and the classification likelihood of a mixture. Its performance is investigated through Monte Carlo experiments, and it shows favorable results compared to other classical criteria.

1,689 citations
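As a rough illustration of the idea summarized above, the sketch below fits Gaussian mixtures for several numbers of components and scores each fit by the entropy of its posterior classification probabilities relative to the log-likelihood gain over a single component. The function name, the normalization by L(K) − L(1), and the use of scikit-learn's GaussianMixture are choices made here for illustration and may differ from the published criterion.

```python
# Illustrative entropy-based criterion for the number of mixture components,
# in the spirit of the abstract above (not the paper's exact definition).
import numpy as np
from sklearn.mixture import GaussianMixture

def entropy_criterion(X, k_max=6):
    n = X.shape[0]
    loglik, entropy = {}, {}
    for k in range(1, k_max + 1):
        gm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
        tau = gm.predict_proba(X)                        # posterior responsibilities
        loglik[k] = n * gm.score(X)                      # total log-likelihood L(K)
        entropy[k] = -np.sum(tau * np.log(tau + 1e-12))  # classification entropy E(K)
    scores = {k: entropy[k] / (loglik[k] - loglik[1])
              for k in range(2, k_max + 1) if loglik[k] > loglik[1]}
    return min(scores, key=scores.get), scores           # smaller score is better

# Example with two well-separated Gaussian clusters:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
best_k, scores = entropy_criterion(X)
print(best_k, scores)
```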


Journal ArticleDOI
TL;DR: An adaptive statistical language model is described, which successfully integrates long distance linguistic information with other knowledge sources, and shows the feasibility of incorporating many diverse knowledge sources in a single, unified statistical framework.

771 citations


Journal Article
TL;DR: An extended EM algorithm is used to minimize the information divergence (maximize the relative entropy) in the density approximation case and fits to Weibull, log normal, and Erlang distributions are used as illustrations of the latter.
Abstract: Estimation from sample data and density approximation with phase-type distribu- tions are considered. Maximum likelihood estimation via the EM algorithm is discussed and performed for some data sets. An extended EM algorithm is used to minimize the information divergence (maximize the relative entropy) in the density approximation case. Fits to Weibull, log normal, and Erlang distributions are used as illustrations of the latter.

690 citations


Book
28 Aug 1996
TL;DR: The abstract background of function spaces is developed and used to obtain entropy and approximation numbers of embeddings, with applications to weighted function spaces and elliptic operators.
Abstract: 1. The abstract background 2. Function spaces 3. Entropy and approximation numbers of embeddings 4. Weighted function spaces and entropy numbers 5. Elliptic operators Bibliography.

428 citations


Book
09 Jul 1996
TL;DR: This book develops the basic concepts and entropy-related properties of discrete sample paths, treats entropy for restricted classes of processes, and studies B-processes.
Abstract: Basic concepts Entropy-related properties Entropy for restricted classes B-processes Bibliography Index.

397 citations


Journal ArticleDOI
TL;DR: This note first points out a certain conceptual problem and then proposes two algorithms that are free from it; the algorithms are tested on several data sets and the results are encouraging.

315 citations


Journal ArticleDOI
01 Sep 1996-Chaos
TL;DR: Algorithms for estimating the Shannon entropy h of finite symbol sequences with long range correlations are considered, and a scaling law is proposed for extrapolation from finite sample lengths.
Abstract: We discuss algorithms for estimating the Shannon entropy h of finite symbol sequences with long range correlations. In particular, we consider algorithms which estimate h from the code lengths produced by some compression algorithm. Our interest is in describing their convergence with sequence length, assuming no limits for the space and time complexities of the compression algorithms. A scaling law is proposed for extrapolation from finite sample lengths. This is applied to sequences of dynamical systems in non‐trivial chaotic regimes, a 1‐D cellular automaton, and to written English texts.

304 citations
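The abstract above estimates h from the code lengths produced by compression algorithms; the minimal sketch below shows the basic code-length estimator using off-the-shelf compressors (zlib and bz2). It is not the authors' algorithm and omits their finite-sample scaling law and extrapolation.

```python
# Code-length estimate (an upper bound) of the entropy per symbol of a sequence.
import bz2, zlib

def entropy_rate_estimate(text: str) -> dict:
    data = text.encode("utf-8")
    n = len(data)
    return {
        "zlib_bits_per_char": 8 * len(zlib.compress(data, 9)) / n,
        "bz2_bits_per_char": 8 * len(bz2.compress(data, 9)) / n,
    }

sample = "the quick brown fox jumps over the lazy dog " * 200
print(entropy_rate_estimate(sample))   # repetitive text -> well below 8 bits/char
```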


Proceedings Article
02 Aug 1996
TL;DR: The study includes both an extensive empirical comparison and an analysis of scenarios where error minimization may be an inappropriate discretization criterion, and it analyzes the shortcomings of error-based approaches relative to entropy-based methods.
Abstract: We present a comparison of error-based and entropy-based methods for discretization of continuous features. Our study includes both an extensive empirical comparison as well as an analysis of scenarios where error minimization may be an inappropriate discretization criterion. We present a discretization method based on the C4.5 decision tree algorithm and compare it to an existing entropy-based discretization algorithm, which employs the Minimum Description Length Principle, and a recently proposed error-based technique. We evaluate these discretization methods with respect to C4.5 and Naive-Bayesian classifiers on datasets from the UCI repository and analyze the computational complexity of each method. Our results indicate that the entropy-based MDL heuristic outperforms error minimization on average. We then analyze the shortcomings of error-based approaches in comparison to entropy-based methods.

303 citations
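As a small illustration of the entropy-based side of the comparison above, the sketch below picks a single cut point on a continuous feature by minimizing the weighted class entropy of the two resulting intervals (equivalently, maximizing information gain). The MDL stopping criterion and the error-based alternative studied in the paper are omitted; function names are chosen here for illustration.

```python
# One step of entropy-based discretization: find the best binary cut point.
import numpy as np

def class_entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_cut(x, y):
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (None, np.inf)
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue                       # only cut between distinct values
        cut = (x[i] + x[i - 1]) / 2
        left, right = y[:i], y[i:]
        h = (len(left) * class_entropy(left) +
             len(right) * class_entropy(right)) / len(y)
        if h < best[1]:
            best = (cut, h)
    return best                            # (threshold, weighted class entropy)

x = np.array([1.0, 1.2, 1.4, 3.0, 3.2, 3.4])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_cut(x, y))                      # cuts at 2.2 with entropy 0.0
```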


Journal ArticleDOI
TL;DR: This article defines the concept of an information measure and shows how common information measures such as entropy, Shannon information, and algorithmic information content can be combined to solve problems of characterization, inference, and learning for complex systems.
Abstract: This article defines the concept of an information measure and shows how common information measures such as entropy, Shannon information, and algorithmic information content can be combined to solve problems of characterization, inference, and learning for complex systems. Particularly useful quantities are the effective complexity, which is roughly the length of a compact description of the identified regularities of an entity, and total information, which is effective complexity plus an entropy term that measures the information required to describe the random aspects of the entity. Mathematical definitions are given for both quantities and some applications are discussed. In particular, it is pointed out that if one compares different sets of identified regularities of an entity, the ‘best’ set minimizes the total information, and then, subject to that constraint, minimizes the effective complexity; the resulting effective complexity is then in many respects independent of the observer. © 1996 John Wiley & Sons, Inc.

300 citations
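A compact restatement of the relation described in the abstract, with symbols chosen here for illustration (they are not the authors' notation):

```latex
% Symbols introduced here for illustration only:
%   E      = effective complexity (length of a compact description of the
%            identified regularities of the entity)
%   S      = entropy term measuring the information needed to describe the
%            random aspects of the entity
%   \Sigma = total information
\Sigma = E + S
% The "best" set of regularities first minimizes \Sigma and then, subject to
% that constraint, minimizes E.
```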


Journal ArticleDOI
TL;DR: It is shown that this measure, the cross-entropy, is related to other commonly used measures of distance or similarity under special conditions, although it is in some senses more general.

Journal ArticleDOI
TL;DR: As the authors discuss, the idea of recognizing special metrics in terms of this invariant looks at first glance very optimistic: the entropy is sensitive to changes of scale, which makes it a bad invariant; however, this is circumvented by looking at the behaviour of the entropy functional on the space of metrics with fixed volume.
Abstract: Let $(Y, g)$ be a compact connected n-dimensional Riemannian manifold and let $(\tilde{Y}, \tilde{g})$ be its universal cover endowed with the pulled-back metric. If $y \in \tilde{Y}$, we define $h(g) = \lim_{R \to \infty} \frac{1}{R} \log \operatorname{Vol} B(y, R)$, where $B(y, R)$ denotes the ball of radius $R$ around $y$ in $\tilde{Y}$. It is a well known fact that this limit exists and does not depend on $y$ ([Man]). The invariant $h(g)$ is called the volume entropy of the metric $g$ but, for the sake of simplicity, we shall use the term entropy. The idea of recognizing special metrics in terms of this invariant looks at first glance very optimistic. First, the entropy, which behaves like the inverse of a distance, is sensitive to changes of scale, which makes it a bad invariant; however, this is a minor drawback that can be circumvented by looking at the behaviour of the entropy functional on the space of metrics with fixed volume (equal to one, for example). Nevertheless, it seems very unlikely that two numbers, the entropy and the volume, might characterize any metric. The very first person to consider such a possibility was Katok ([Kat1]). In that article the entropy is thought of as a dynamical invariant, as its name suggests. More precisely, this dynamical invariant, called the topological entropy, is defined for a flow $\psi_t$ on a compact metric space $(M, d)$.

Journal ArticleDOI
TL;DR: In the case of Ornstein, Prohorov and other distances of the Kantorovich-Vasershtein type, it is shown that the finite-precision resolvability is equal to the rate-distortion function with a fidelity criterion derived from the accuracy measure, which leads to new results on nonstationary rate-distortion theory.
Abstract: We study the randomness necessary for the simulation of a random process with given distributions, in terms of the finite-precision resolvability of the process. Finite-precision resolvability is defined as the minimal random-bit rate required by the simulator as a function of the accuracy with which the distributions are replicated. The accuracy is quantified by means of various measures: variational distance, divergence, Ornstein (1973), Prohorov (1956), and related measures of distance between the distributions of random processes. In the case of Ornstein, Prohorov, and other distances of the Kantorovich-Vasershtein type, we show that the finite-precision resolvability is equal to the rate-distortion function with a fidelity criterion derived from the accuracy measure. This connection leads to new results on nonstationary rate-distortion theory. In the case of variational distance, the resolvability of stationary ergodic processes is shown to equal the entropy rate regardless of the allowed accuracy. In the case of normalized divergence, explicit expressions for finite-precision resolvability are obtained in many cases of interest, and connections with data compression with minimum probability of block error are shown.

Journal ArticleDOI
TL;DR: The gambler algorithm, an algorithm based on the Chou-Fasman rules for protein structure, gives significantly lower entropies than the k-tuplet analysis, and the number of most probable protein sequences can be calculated.

Journal ArticleDOI
TL;DR: It is found that most speech signals in the form of phoneme articulations are low dimensional, and the second-order dynamical entropy (a lower bound of the metric entropy) of speech time series is estimated.
Abstract: This paper reports results of the estimation of dynamical invariants, namely Lyapunov exponents, dimension, and metric entropy for speech signals. Two optimality criteria from dynamical systems literature, namely singular value decomposition method and the redundancy method, are used to reconstruct state space trajectories of speech and make observations. The positive values of the largest Lyapunov exponent of speech signals in the form of phoneme articulations show the average exponential divergence of nearby trajectories in the reconstructed state space. The dimension of a time series is a measure of its complexity and gives bounds on the number of state space variables needed to model it. It is found that most speech signals in the form of phoneme articulations are low dimensional. For comparison, a statistical model of a speech time series is also used to estimate the correlation dimension. The second‐order dynamical entropy (which is a lower bound of metric entropy) of speech time series is found to ...
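The sketch below illustrates the general kind of analysis described above: a scalar time series is embedded by time delays and a correlation-dimension estimate is read off from the slope of log C(r) versus log r (Grassberger-Procaccia). The embedding parameters, the SVD and redundancy criteria, and the Lyapunov-exponent and entropy estimators used by the authors are not reproduced here.

```python
# Delay embedding and a crude correlation-dimension estimate for a time series.
import numpy as np

def delay_embed(x, dim, tau):
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau: i * tau + n] for i in range(dim)])

def correlation_sum(points, r):
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    iu = np.triu_indices(len(points), k=1)
    return np.mean(d[iu] < r)              # fraction of point pairs within r

# Example: a slightly noisy sine wave embedded in 3 dimensions (a limit cycle).
t = np.linspace(0, 40 * np.pi, 2000)
x = np.sin(t) + 0.01 * np.random.default_rng(0).normal(size=t.size)
pts = delay_embed(x, dim=3, tau=10)[::4]   # subsample to keep the O(n^2) sums small
radii = np.logspace(-1.5, 0, 8)
C = np.array([correlation_sum(pts, r) for r in radii])
slope = np.polyfit(np.log(radii), np.log(C + 1e-12), 1)[0]
print(f"estimated correlation dimension ~ {slope:.2f}")  # close to 1 for a limit cycle
```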

Journal ArticleDOI
TL;DR: In this article, the molecular pair correlation function in water is separated into the radial distribution function (RDF), a function of distance only, and an orientational distribution function (ODF), a function of the five angles at each distance between the molecules, and two approaches for obtaining an approximation to the ODF are introduced.
Abstract: The molecular pair correlation function in water is a function of a distance and five angles. It is here separated into the radial distribution function (RDF), which is only a function of distance, and an orientational distribution function (ODF), which is a function of the five angles for each distance between the molecules. While the RDF can be obtained from computer simulations, this is not practical for the ODF due to its high dimensionality. Two approaches for obtaining an approximation to the ODF are introduced. The first uses a product of one‐ and two‐dimensional marginal distributions from computer simulations. The second uses the gas‐phase low‐density limit as a reference and applies corrections based on (a) the orientationally averaged interactions in the liquid calculated by simulations, and (b) the observed differences in the one‐ and two‐dimensional marginal distributions in the gas and in the liquid. The site superposition approximation was also tested and found to be inadequate for reproducing the orientationally averaged interaction energy and the angular distributions obtained from the simulations. The two approximations to the pair correlation function are employed to estimate the contribution of two‐particle correlations to the excess entropy of TIP4P water. The calculated value is comparable to the excess entropy of TIP4P water estimated by other methods and to the experimental excess entropy of liquid water. More than 90% of the orientational part of the excess entropy is due to correlations between first neighbors. The change in excess entropy with temperature gives a value for the heat capacity that agrees within statistical uncertainty with that obtained from the change in energy with temperature and is reasonably close to the experimental value for water. The effect of pressure on the entropy was examined and it was found that increase in the pressure (density) causes a reduction of orientational correlations, in agreement with the idea of pressure as a ‘‘structure breaker’’ in water. The approach described here provides insight concerning the nature of the contributions to the excess entropy of water and should be applicable to other simple molecular fluids.

Journal ArticleDOI
TL;DR: This paper investigates the fundamental operation of the block sorting algorithm, presents some improvements based on that analysis, and develops a simple model which relates the compression to the proportion of zeros after the MTF stage.
Abstract: A recent development in text compression is a 'block sorting' algorithm which permutes the input text according to a special sort procedure and then processes the permuted text with Move-To-Front (MTF) and a final statistical compressor. The technique combines good speed with excellent compression performance. This paper investigates the fundamental operation of the algorithm and presents some improvements based on that analysis. Although block sorting is clearly related to previous compression techniques, it appears that it is best described by techniques derived from work by Shannon on the prediction and entropy of English text. A simple model is developed which relates the compression to the proportion of zeros after the MTF stage.
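A quick sketch of the Move-To-Front stage mentioned above: after the block-sorting permutation, runs of identical symbols become small MTF codes, and the proportion of zeros in the MTF output is a simple indicator of compressibility in the spirit of the model described in the abstract. The encoder below is a generic MTF over a byte alphabet, not the paper's implementation.

```python
# Generic Move-To-Front encoder over the byte alphabet 0..255.
def mtf_encode(data: bytes) -> list[int]:
    alphabet = list(range(256))
    out = []
    for b in data:
        i = alphabet.index(b)                 # position of the symbol in the list
        out.append(i)
        alphabet.insert(0, alphabet.pop(i))   # move the symbol to the front
    return out

codes = mtf_encode(b"aaaabbbbaaaacccc")
print(codes)                                  # [97, 0, 0, 0, 98, 0, 0, 0, 1, 0, 0, 0, 99, 0, 0, 0]
print(sum(c == 0 for c in codes) / len(codes))  # proportion of zeros: 0.75
```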

Journal Article
TL;DR: In this paper, the authors consider a truncated version of the entropy estimator and prove the mean square √n-consistency of this estimator for a class of densities with unbounded support, including the Gaussian density.
Abstract: We consider a truncated version of the entropy estimator and prove the mean square √n-consistency of this estimator for a class of densities with unbounded support, including the Gaussian density.

Proceedings Article
01 Aug 1996
TL;DR: The sample complexity of MDL based learning procedures for Bayesian networks is examined, and the number of samples needed to learn an ε-close approximation with confidence δ is shown to be a low-order polynomial in the error threshold and sub-linear in the confidence bound.
Abstract: In recent years there has been an increasing interest in learning Bayesian networks from data. One of the most effective methods for learning such networks is based on the minimum description length (MDL) principle. Previous work has shown that this learning procedure is asymptotically successful: with probability one, it will converge to the target distribution, given a sufficient number of samples. However, the rate of this convergence has been hitherto unknown. In this work we examine the sample complexity of MDL based learning procedures for Bayesian networks. We show that the number of samples needed to learn an ε-close approximation (in terms of entropy distance) with confidence δ is O((1/ε)^{4/3} log(1/ε) log(1/δ) log log(1/δ)). This means that the sample complexity is a low-order polynomial in the error threshold and sub-linear in the confidence bound. We also discuss how the constants in this term depend on the complexity of the target distribution. Finally, we address questions of asymptotic minimality and propose a method for using the sample complexity results to speed up the learning process.

Journal ArticleDOI
TL;DR: A method for estimating coarse-grained entropy rates (CERs) from time series, based on information-theoretic functionals (redundancies), is presented, and a potential application of the CERs to the analysis of electrophysiological signals or other complex time series is shown.

Journal ArticleDOI
TL;DR: The theoretical basis for the annealing method is derived, a novel design algorithm is developed on that basis, and its effectiveness and superior performance in the design of practical classifiers for some of the most popular structures currently in use are demonstrated.
Abstract: A global optimization method is introduced that minimizes the rate of misclassification. We first derive the theoretical basis for the method, on which we base the development of a novel design algorithm and demonstrate its effectiveness and superior performance in the design of practical classifiers for some of the most popular structures currently in use. The method, grounded in ideas from statistical physics and information theory, extends the deterministic annealing approach for optimization, both to incorporate structural constraints on data assignments to classes and to minimize the probability of error as the cost objective. During the design, data are assigned to classes in probability so as to minimize the expected classification error given a specified level of randomness, as measured by Shannon's entropy. The constrained optimization is equivalent to a free-energy minimization, motivating a deterministic annealing approach in which the entropy and expected misclassification cost are reduced with the temperature while enforcing the classifier's structure. In the limit, a hard classifier is obtained. This approach is applicable to a variety of classifier structures, including the widely used prototype-based, radial basis function, and multilayer perceptron classifiers. The method is compared with learning vector quantization, back propagation (BP), several radial basis function design techniques, as well as with paradigms for more directly optimizing all these structures to minimize probability of error. The annealing method achieves significant performance gains over other design methods on a number of benchmark examples from the literature, while often retaining design complexity comparable with or only moderately greater than that of strict descent methods. Substantial gains, both inside and outside the training set, are achieved for complicated examples involving high-dimensional data and large class overlap.
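The sketch below illustrates only the probabilistic assignment rule at the heart of the approach described above: class membership probabilities follow a Gibbs distribution over per-class costs, the temperature controls the Shannon entropy of the assignment, and lowering the temperature hardens it. The full design algorithm (misclassification-cost objective, structural constraints, re-estimation steps) is not reproduced; the function names and example costs are illustrative.

```python
# Annealed soft assignment of a data point to classes via a Gibbs distribution.
import numpy as np

def gibbs_assignment(costs, T):
    """costs: per-class costs d_j(x); returns p(j | x) proportional to exp(-d_j / T)."""
    z = np.exp(-(costs - costs.min()) / T)      # subtract the min for numerical stability
    return z / z.sum()

def shannon_entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

costs = np.array([1.0, 1.3, 2.5])               # e.g. squared distances to class prototypes
for T in [5.0, 1.0, 0.2, 0.01]:
    p = gibbs_assignment(costs, T)
    print(f"T={T:5.2f}  p={np.round(p, 3)}  entropy={shannon_entropy(p):.3f} bits")
# At high T the assignment is nearly uniform (high entropy); as T is lowered the
# probability mass concentrates on the lowest-cost class, i.e. a hard classifier.
```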

Journal ArticleDOI
TL;DR: An unsupervised neural network which exhibits competition between units via inhibitory feedback is presented, and it is shown how the assignment of prior probabilities to network outputs can help to reduce entropy.
Abstract: In this paper we present an unsupervised neural network which exhibits competition between units via inhibitory feedback. The operation is such as to minimize reconstruction error, both for individual patterns, and over the entire training set. A key difference from networks which perform principal components analysis, or one of its variants, is the ability to converge to non-orthogonal weight values. We discuss the network's operation in relation to the twin goals of maximizing information transfer and minimizing code entropy, and show how the assignment of prior probabilities to network outputs can help to reduce entropy. We present results from two binary coding problems, and from experiments with image coding.

Proceedings ArticleDOI
18 Jun 1996
TL;DR: Experiments demonstrate that many textures previously considered as different categories can be modeled and synthesized in a common framework, and the proposed theory interprets and clarifies many previous concepts and methods for texture analysis and synthesis from a unified point of view.
Abstract: In this paper, a minimax entropy principle is studied, based on which a novel theory, called FRAME (Filters, Random fields And Minimax Entropy) is proposed for texture modeling. FRAME combines attractive aspects of two important themes in texture modeling: multi-channel filtering and Markov random field (MRF) modeling. It incorporates the responses of a set of well selected filters into the distribution over a random field and hence has a much stronger descriptive ability than the traditional MRF models. Furthermore, it interprets and clarifies many previous concepts and methods for texture analysis and synthesis from a unified point of view. Algorithms are proposed for probability inference, stochastic simulation and filter selection. Experiments on a variety of textures are described to illustrate our theory and to show the performance of our algorithms. These experiments demonstrate that many textures previously considered as different categories can be modeled and synthesized in a common framework.

Journal ArticleDOI
TL;DR: The generalized maximum entropy (GME) model discussed by the authors includes noise terms in the multinomial information constraints; each noise term is modeled as the mean of a finite set of a priori known points in the interval [−1,1] with unknown probabilities, and no parametric assumptions about the error distribution are made.
Abstract: The classical maximum entropy (ME) approach to estimating the unknown parameters of a multinomial discrete choice problem, which is equivalent to the maximum likelihood multinomial logit (ML) estimator, is generalized. The generalized maximum entropy (GME) model includes noise terms in the multinomial information constraints. Each noise term is modeled as the mean of a finite set of a priori known points in the interval [−1,1] with unknown probabilities where no parametric assumptions about the error distribution are made. A GME model for the multinomial probabilities and for the distributions associated with the noise terms is derived by maximizing the joint entropy of multinomial and noise distributions, under the assumption of independence. The GME formulation reduces to the ME in the limit as the sample grows large or when no noise is included in the entropy maximization. Further, even though the GME and the logit estimators are conceptually different, the dual GME model is related to a gener...
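A compact formalization of the noise modeling described in the abstract, with symbols chosen here for illustration (not the authors' exact notation):

```latex
% For observation i and alternative j, the observed choice indicator is
% written as a probability plus a noise term, and the noise term is the mean
% of M known support points v_1 < ... < v_M in [-1, 1] with unknown weights:
y_{ij} = p_{ij} + e_{ij},
\qquad
e_{ij} = \sum_{m=1}^{M} w_{ijm}\, v_m ,
\qquad
\sum_{m=1}^{M} w_{ijm} = 1 .
% The GME estimator maximizes the joint entropy of the choice probabilities
% and the noise weights subject to the data (moment) constraints:
\max_{p,\,w}\ \Big\{ -\sum_{i,j} p_{ij}\ln p_{ij} \;-\; \sum_{i,j,m} w_{ijm}\ln w_{ijm} \Big\}.
```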

Journal ArticleDOI
01 Jan 1996
TL;DR: It is demonstrated how Shannon information theory (Shannon and Weaver 1964) can be used to compute an evaluation index of a map, i.e., a parameter which measures the efficiency of the map.
Abstract: The automation of map design is a challenging task for both researchers and designers of spatial information systems. A main problem in automation is the quantification and formalization of the properties of the process to be automated. This article contributes to the formalization of some steps in the processes involved in map design and demonstrates how the Shannon information theory (Shannon and Weaver 1964) can be used to compute an evaluation index of a map, i.e., a parameter which measures the efficiency of the map. Throughout this article, the term "information" is mostly used in a narrow sense and the application of information theory is restricted to the syntactic level of cartographic communication. Information sources for map entropy computations are identified and elaborated on. A special class of map information sources are defined and termed "orthogonal map information sources". Further, a strategy to consider spatial properties of a map in entropy computations is presented. At the end of th...
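As a minimal illustration of the syntactic entropy computation described above, the sketch below treats the symbols drawn on a map as an information source and computes the Shannon entropy of their relative frequencies. The article's specific map information sources, the "orthogonal" sources, and the treatment of spatial properties are not reproduced; the symbol inventory is hypothetical.

```python
# Shannon entropy of the relative frequencies of map symbols.
import math
from collections import Counter

def map_symbol_entropy(symbols):
    counts = Counter(symbols)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical inventory of symbols appearing on a map sheet:
symbols = ["road"] * 120 + ["river"] * 30 + ["building"] * 200 + ["contour"] * 650
h = map_symbol_entropy(symbols)
print(f"entropy = {h:.2f} bits per symbol "
      f"(maximum for 4 symbol types = {math.log2(4):.2f} bits)")
```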

Journal ArticleDOI
TL;DR: An estimate of the probability density function of a random vector is obtained by maximizing the output entropy of a feedforward network of sigmoidal units with respect to the input weights.
Abstract: An estimate of the probability density function of a random vector is obtained by maximizing the output entropy of a feedforward network of sigmoidal units with respect to the input weights. Classification problems can be solved by selecting the class associated with the maximal estimated density. Newton's optimization method, applied to the estimated density, yields a recursive estimator for a random variable or a random sequence. A constrained connectivity structure yields a linear estimator, which is particularly suitable for "real time" prediction. A Gaussian nonlinearity yields a closed-form solution for the network's parameters, which may also be used for initializing the optimization algorithm when other nonlinearities are employed. A triangular connectivity between the neurons and the input, which is naturally suggested by the statistical setting, reduces the number of parameters. Applications to classification and forecasting problems are demonstrated.
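A one-dimensional, single-unit sketch of the idea described above: a sigmoid y = σ(wx + b) is fitted by gradient ascent on the sample output entropy H(y) = H(x) + E[log|dy/dx|]; driving the output toward uniformity makes the sigmoid approximate the CDF of x, so |dy/dx| serves as a density estimate. The paper's multivariate network, connectivity constraints, Gaussian-nonlinearity closed form, and Newton-based recursive estimator are not reproduced here.

```python
# Density estimation with one sigmoid unit by maximizing the output entropy.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.7, size=2000)     # samples from an "unknown" density

# Standardize for a well-conditioned fit; undo the scaling in the estimate.
mu, sd = x.mean(), x.std()
z = (x - mu) / sd

w, b, lr = 1.0, 0.0, 0.05
for _ in range(2000):
    y = 1.0 / (1.0 + np.exp(-(w * z + b)))
    gw = 1.0 / w + np.mean((1.0 - 2.0 * y) * z)   # d/dw of log|w| + mean log sigma'(wz+b)
    gb = np.mean(1.0 - 2.0 * y)                   # d/db of the same objective
    w, b = w + lr * gw, b + lr * gb

def density_estimate(t):
    zt = (t - mu) / sd
    y = 1.0 / (1.0 + np.exp(-(w * zt + b)))
    return np.abs(w) * y * (1.0 - y) / sd          # |dy/dx|, chain rule for the scaling

print(density_estimate(np.array([1.0, 2.0, 3.0])))  # largest value near the mode at 2.0
```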

Journal ArticleDOI
TL;DR: A new result limiting the amount of accessible information in a quantum channel is proved, which generalizes Kholevo's theorem and implies it as a simple corollary.
Abstract: We prove a new result limiting the amount of accessible information in a quantum channel. This generalizes Kholevo's theorem and implies it as a simple corollary. Our proof uses the strong subadditivity of the von Neumann entropy functional $S(\rho)$ and a specific physical analysis of the measurement process. The result presented here has application in information obtained from "weak" measurements, such as those sometimes considered in quantum cryptography.
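For reference, the bound of Kholevo's theorem that the result generalizes, in standard notation: a sender prepares state ρ_x with probability p_x, the receiver measures and obtains outcome Y, and the accessible information is bounded by

```latex
I(X;Y) \;\le\; S(\rho) \;-\; \sum_x p_x\, S(\rho_x),
\qquad
\rho = \sum_x p_x\, \rho_x ,
```

where $S$ is the von Neumann entropy appearing in the abstract; the strengthened inequality proved in the paper is not reproduced here.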

Book
26 Nov 1996
TL;DR: This book is intended to introduce coding theory and information theory to undergraduate students of mathematics and computer science.
Abstract: This book is intended to introduce coding theory and information theory to undergraduate students of mathematics and computer science. It begins with a review of probability theory as applied to finite sample spaces and a general introduction to the nature and types of codes. The two subsequent chapters discuss information theory: efficiency of codes, the entropy of information sources, and Shannon's Noiseless Coding Theorem. The remaining three chapters deal with coding theory: communication channels, decoding in the presence of errors, the general theory of linear codes, and such specific codes as Hamming codes, the simplex codes, and many others.

Book ChapterDOI
01 Jan 1996
TL;DR: This is a mathematically oriented survey about the method of maximum entropy or minimum I-divergence, with a critical treatment of its various justifications and relation to Bayesian statistics.
Abstract: This is a mathematically oriented survey about the method of maximum entropy or minimum I-divergence, with a critical treatment of its various justifications and relation to Bayesian statistics. Information theoretic ideas are given substantial attention, including “information geometry”. The axiomatic approach is considered as the best justification of maxent, as well as of alternate methods of minimizing some Bregman distance or f-divergence other than I-divergence. The possible interpretation of such alternate methods within the original maxent paradigm is also considered.
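For orientation, the basic problem surveyed above in standard form, with notation chosen here for illustration: among distributions satisfying given moment constraints, select the one of minimum I-divergence from a reference distribution q (maximum entropy when q is uniform).

```latex
\min_{p}\; D(p\,\|\,q) = \sum_x p(x)\,\ln\frac{p(x)}{q(x)}
\quad\text{subject to}\quad
\sum_x p(x)\,f_i(x) = a_i,\ \ i = 1,\dots,k,
\qquad
\sum_x p(x) = 1 .

% When a solution exists it has the exponential form
p^{*}(x) = q(x)\,\exp\!\Big(\lambda_0 + \sum_{i=1}^{k}\lambda_i f_i(x)\Big),
% with the multipliers \lambda_i determined by the constraints; the classical
% maximum-entropy method is the special case of a uniform reference q.
```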

Journal ArticleDOI
Eric Pantin, Jean-Luc Starck
TL;DR: This work uses the wavelet transform, a mathematical tool to decompose a signal into different frequency bands, to introduce the concept of multi-scale entropy of an image, leading to a better restoration at all spatial frequencies.
Abstract: Following the ideas of Bontekoe et al., who noticed that the classical Maximum Entropy Method (MEM) had difficulties to efficiently restore high and low spatial frequency structure in an image at the same time, we use the wavelet transform, a mathematical tool to decompose a signal into different frequency bands. We introduce the concept of multi-scale entropy of an image, leading to a better restoration at all spatial frequencies. This deconvolution method is flux conservative and the use of a multiresolution support solves the problem of MEM to choose the parameter, i.e. the relative weight between the goodness-of-fit and the entropy. We show that our algorithm is efficient too for filtering astronomical images. A range of practical examples illustrate this approach.