Journal ArticleDOI

A Formal Theory of Inductive Inference. Part II

01 Jun 1964-Information and Control (Academic Press)-Vol. 7, Iss: 2, pp 224-254
TL;DR: Four ostensibly different theoretical models of induction are presented, in which the problem dealt with is the extrapolation of a very long sequence of symbols—presumably containing all of the information to be used in the induction.
Abstract: 1. Summary. In Part I, four ostensibly different theoretical models of induction are presented, in which the problem dealt with is the extrapolation of a very long sequence of symbols—presumably containing all of the information to be used in the induction. Almost all, if not all, problems in induction can be put in this form. Some strong heuristic arguments have been obtained for the equivalence of the last three models. One of these models is equivalent to a Bayes formulation, in which a priori probabilities are assigned to sequences of symbols on the basis of the lengths of inputs to a universal Turing machine that are required to produce the sequence of interest as output. Though it seems likely, it is not certain whether the first of the four models is equivalent to the other three. Few rigorous results are presented. Informal investigations are made of the properties of these models. There are discussions of their consistency and meaningfulness, of their degree of independence of the exact nature of the Turing machine used, and of the accuracy of their predictions in comparison to those of other induction methods. In Part II these models are applied to the solution of three problems—prediction of the Bernoulli sequence, extrapolation of a certain kind of Markov chain, and the use of phrase structure grammars for induction. Though some approximations are used, the first of these problems is treated most rigorously. The result is Laplace's rule of succession. The solution to the second problem uses less certain approximations, but the properties of the solution that are discussed are fairly independent of these approximations. The third application, using phrase structure grammars, is the least exact of the three. First a formal solution is presented. Though it appears to have certain deficiencies, it is hoped that presentation of this admittedly inadequate model will suggest acceptable improvements in it. This formal solution is then applied in an approximate way to the determination of the "optimum" phrase structure grammar for a given set of strings. The results that are obtained are plausible, but subject to the uncertainties of the approximation used.
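The two quantitative ideas in the abstract lend themselves to a short illustration. Below is a minimal Python sketch, not taken from the paper: a toy prior that weights a candidate description by two to the power of minus its length in bits, standing in for the lengths of inputs to a universal Turing machine, and Laplace's rule of succession, the result the paper derives for the Bernoulli sequence. The function names and the eight-bits-per-character convention are illustrative assumptions.

def length_prior(description: str) -> float:
    # Toy stand-in for an a priori probability assigned via input lengths to a
    # universal Turing machine: weight a description by 2^-(length in bits),
    # here simply 8 bits per character (an illustrative convention).
    return 2.0 ** (-8 * len(description))

def laplace_rule(successes: int, trials: int) -> float:
    # Laplace's rule of succession: probability that the next Bernoulli trial
    # succeeds, given `successes` successes observed in `trials` trials.
    return (successes + 1) / (trials + 2)

# Example: after 7 heads in 10 flips, the predicted probability of heads is 8/12.
print(length_prior("0101"), laplace_rule(7, 10))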


Citations
Journal ArticleDOI
TL;DR: This historical survey compactly summarizes relevant work, much of it from the previous millennium, reviewing deep supervised learning, unsupervised learning, reinforcement learning and evolutionary computation, and indirect search for short programs encoding deep and large networks.

14,635 citations


Cites methods from "A Formal Theory of Inductive Infere..."

  • ...ome programming language, the principle of Minimum Description Length (MDL) can be used to measure the complexity of a solution candidate by the length of the shortest program that computes it (e.g., Solomonoff, 1964; Kolmogorov, 1965b; Chaitin, 1966; Wallace and Boulton, 1968; Levin, 1973a; Solomonoff, 1978; Rissanen, 1986; Blumer et al., 1987; Li and Vitányi, 1997; Grünwald et al., 2005). Some methods explicitly ...

    [...]

  • ...…of a solution candidate by the length of the shortest program that computes it (e.g., Blumer, Ehrenfeucht, Haussler, & Warmuth, 1987; Chaitin, 1966; Grünwald,Myung, & Pitt, 2005; Kolmogorov, 1965b; Levin, 1973a; Li & Vitányi, 1997; Rissanen, 1986; Solomonoff, 1964, 1978;Wallace & Boulton, 1968)....

    [...]

Book
01 Jan 2009
TL;DR: The motivations and principles regarding learning algorithms for deep architectures are discussed, in particular those exploiting as building blocks unsupervised learning of single-layer models, such as Restricted Boltzmann Machines, used to construct deeper models such as Deep Belief Networks.
Abstract: Can machine learning deliver AI? Theoretical results, inspiration from the brain and cognition, as well as machine learning experiments suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one would need deep architectures. Deep architectures are composed of multiple levels of non-linear operations, such as in neural nets with many hidden layers, graphical models with many levels of latent variables, or in complicated propositional formulae re-using many sub-formulae. Each level of the architecture represents features at a different level of abstraction, defined as a composition of lower-level features. Searching the parameter space of deep architectures is a difficult task, but new algorithms have been discovered and a new sub-area has emerged in the machine learning community since 2006, following these discoveries. Learning algorithms such as those for Deep Belief Networks and other related unsupervised learning algorithms have recently been proposed to train deep architectures, yielding exciting results and beating the state-of-the-art in certain areas. Learning Deep Architectures for AI discusses the motivations for and principles of learning algorithms for deep architectures. By analyzing and comparing recent results with different learning algorithms for deep architectures, explanations for their success are proposed and discussed, highlighting challenges and suggesting avenues for future explorations in this area.

7,767 citations

Book
01 Jan 2006
TL;DR: The authors provide a comprehensive treatment of the problem of predicting individual sequences using expert advice, a general framework within which many related problems can be cast and discussed, such as repeated game playing, adaptive data compression, sequential investment in the stock market, and sequential pattern analysis.
Abstract: This important text and reference for researchers and students in machine learning, game theory, statistics and information theory offers a comprehensive treatment of the problem of predicting individual sequences. Unlike standard statistical approaches to forecasting, prediction of individual sequences does not impose any probabilistic assumption on the data-generating mechanism. Yet, prediction algorithms can be constructed that work well for all possible sequences, in the sense that their performance is always nearly as good as the best forecasting strategy in a given reference class. The central theme is the model of prediction using expert advice, a general framework within which many related problems can be cast and discussed. Repeated game playing, adaptive data compression, sequential investment in the stock market, sequential pattern analysis, and several other problems are viewed as instances of the experts' framework and analyzed from a common nonstochastic standpoint that often reveals new and intriguing connections.

3,615 citations
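As a concrete illustration of the experts framework described above, here is a minimal Python sketch of an exponentially weighted average forecaster, one standard algorithm of that kind; the squared-loss update and the learning rate eta are illustrative choices, not details taken from this page.

import math

def exp_weighted_forecaster(expert_preds, outcomes, eta=0.5):
    # expert_preds: one list of predictions in [0, 1] per expert.
    # outcomes: the observed binary outcomes, revealed one round at a time.
    weights = [1.0] * len(expert_preds)
    forecasts = []
    for t, y in enumerate(outcomes):
        total = sum(weights)
        # Predict with the weight-averaged expert advice.
        p = sum(w * preds[t] for w, preds in zip(weights, expert_preds)) / total
        forecasts.append(p)
        # Downweight each expert exponentially in its squared loss on this round.
        weights = [w * math.exp(-eta * (preds[t] - y) ** 2)
                   for w, preds in zip(weights, expert_preds)]
    return forecasts

# Example: two experts, one always predicting 1 and one always predicting 0.
print(exp_weighted_forecaster([[1, 1, 1], [0, 0, 0]], [1, 1, 0]))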

Journal ArticleDOI
TL;DR: In this paper, a constructive theory of randomness for functions, based on computational complexity, is developed, and a pseudorandom function generator is presented, which is a deterministic polynomial-time algorithm that transforms pairs (g, r), where g is any one-way function and r is a random k-bit string, to polynomial-time computable functions.
Abstract: A constructive theory of randomness for functions, based on computational complexity, is developed, and a pseudorandom function generator is presented. This generator is a deterministic polynomial-time algorithm that transforms pairs (g, r), where g is any one-way function and r is a random k-bit string, to polynomial-time computable functions f_r: {1, …, 2^k} → {1, …, 2^k}. These f_r's cannot be distinguished from random functions by any probabilistic polynomial-time algorithm that asks and receives the value of a function at arguments of its choice. The result has applications in cryptography, random constructions, and complexity theory.

2,043 citations
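To make the construction sketched in the abstract concrete, here is a minimal Python sketch of a tree-style pseudorandom function in the same spirit: a seed is expanded repeatedly, and each bit of the input selects the left or right half of the expansion. SHA-256 is used only as a stand-in for a length-doubling pseudorandom generator; the actual paper builds its generator from an arbitrary one-way function, which this toy code does not attempt.

import hashlib

def expand(seed: bytes):
    # Stand-in for a length-doubling generator G(s) = G0(s) || G1(s).
    # SHA-256 is only an illustrative placeholder, not a proof-backed PRG.
    return hashlib.sha256(b"0" + seed).digest(), hashlib.sha256(b"1" + seed).digest()

def prf(seed: bytes, x: int, k: int = 16) -> bytes:
    # Evaluate f_seed(x) for x in {0, ..., 2^k - 1} by walking a depth-k binary
    # tree: each bit of x picks the left or right half of the expanded seed.
    s = seed
    for i in reversed(range(k)):
        s = expand(s)[(x >> i) & 1]
    return s

# Example: nearby inputs map to unrelated-looking 256-bit outputs under one key.
key = b"illustrative key"
print(prf(key, 3).hex()[:16], prf(key, 4).hex()[:16])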

Journal ArticleDOI
TL;DR: In this article, estimation is based on minimizing the total number of binary digits required to rewrite the observed data, when each observation is given with some precision; this minimum description length (MDL) criterion extends maximum likelihood by also estimating the number of parameters and the model structure.
Abstract: An estimation principle based on minimizing the number of bits required to write down the observed data has been reformulated to extend the classical maximum likelihood principle. The principle permits estimation of the number of the parameters in statistical models in addition to their values, and even of the way the parameters appear in the models; i.e., of the model structures. The principle rests on a new way to interpret and construct a universal prior distribution for the integers, which makes sense even when the parameter is an individual object. Truncated real-valued parameters are converted to integers by dividing them by their precision, and their prior is determined from the universal prior for the integers by optimizing the precision. 1. Introduction. In this paper we study estimation based upon the principle of minimizing the total number of binary digits required to rewrite the observed data, when each observation is given with some precision. Instead of attempting an absolutely shortest description, which would be futile, we look for the optimum relative to a class of parametrically given distributions. This Minimum Description Length (MDL) principle, which we introduced in a less comprehensive form in [25], turns out to degenerate to the more familiar Maximum Likelihood (ML) principle in case the number of parameters in the models is fixed, so that the description length of the parameters themselves can be ignored. In another extreme case, where the parameters determine the data, it similarly degenerates to Jaynes's principle of maximum entropy [14]. But the main power of the new criterion is that it permits estimates of the entire model: its parameters, their number, and even the way the parameters appear in the model; i.e., the model structure. Hence, there will be no need to supplement the estimated parameters with a separate hypothesis test to decide whether a model is adequately parameterized or, perhaps, over-parameterized.

1,762 citations
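A toy two-part version of the criterion described above may help fix ideas: the total description length is the number of bits spent writing down a truncated parameter plus the ideal code length of the data under the resulting model, and the model with the smaller total is preferred. The Bernoulli setting, the six-bit precision, and the function names below are illustrative assumptions, not Rissanen's actual coding scheme.

import math

def data_bits(xs, p):
    # Ideal code length, in bits, of a binary sequence under a Bernoulli(p) model.
    p = min(max(p, 1e-12), 1 - 1e-12)
    return -sum(math.log2(p) if x else math.log2(1 - p) for x in xs)

def mdl_compare(xs, precision_bits=6):
    # Model 0: a fair coin, no parameters to encode.
    fixed_total = data_bits(xs, 0.5)
    # Model 1: estimate the bias, truncate it to `precision_bits` bits, and pay
    # for writing the parameter down before encoding the data with it.
    grid = 2 ** precision_bits
    p_hat = round(grid * sum(xs) / len(xs)) / grid
    fitted_total = precision_bits + data_bits(xs, p_hat)
    return {"fair_coin_bits": fixed_total, "fitted_coin_bits": fitted_total}

# Example: 18 ones out of 20 observations; the extra parameter pays for itself.
print(mdl_compare([1] * 18 + [0] * 2))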

References
Journal ArticleDOI
TL;DR: This paper defines computable sequences and numbers, constructs the universal computing machine, and applies the diagonal process, with an application to the Entscheidungsproblem.
Abstract: 1. Computing machines. 2. Definitions: automatic machines, computing machines, circle and circle-free numbers, computable sequences and numbers. 3. Examples of computing machines. 4. Abbreviated tables; further examples. 5. Enumeration of computable sequences. 6. The universal computing machine. 7. Detailed description of the universal machine. 8. Application of the diagonal process. (Section contents of "On Computable Numbers, with an Application to the Entscheidungsproblem", A. M. Turing.)

7,642 citations

Book
01 Jan 1957

3,909 citations

Book
01 Jan 1921
TL;DR: The author presents a constructive theory of probability, covering its fundamental ideas and theorems, induction and analogy, philosophical applications of probability, and the foundations of statistical inference.
Abstract: Part 1 Fundamental ideas: the meaning of probability - probability in relation to the theory of knowledge - the measurement of probabilities - the principle of indifference - other methods of determining probabilities - the weight of arguments - historical retrospect - the frequency theory of probability - the constructive theory of part 1 summarized. Part 2 Fundamental theorems: introductory - the theory of groups, with special reference to logical consistence, inference, and logical priority - the definitions and axioms of inference and probability - the fundamental theorems of probable inference - numerical measurement and approximation of probabilities - observations on the theorems of chapter 14 and their developments, including testimony - some problems in inverse probability, including averages. Part 3 Induction and analogy: introduction - the nature of argument by analogy - the value of multiplication of instances, or pure induction - the nature of inductive argument continued - the justification of these methods - some historical notes on induction - notes on part 3. Part 4 Some philosophical applications of probability: the meanings of objective chance, and of randomness - some problems arising out of the discussion of chance - the application of probability to conduct. Part 5 The foundations of statistical inference: the nature of statistical inference - the law of great numbers - the use of a priori probabilities for the prediction of statistical frequency - the mathematical use of statistical frequencies for the determination of probability a posteriori - the inversion of Bernoulli's theorem - the inductive use of statistical frequencies for the determination of probability a posteriori - outline of a constructive theory.

2,633 citations


"A Formal Theory of Inductive Infere..." refers background or methods in this paper

  • ...Later, these code numbers will be used to compute a-priori probabilities of various sequences, and from these, in turn, an expression for conditional probabilities of successive symbols of the sequence will be obtained....

    [...]

  • ...This is about the simplest kind of induction problem that exists, and it has been the subject of much discussion (Keynes, 1921)....

    [...]

Journal ArticleDOI
TL;DR: It is found that no finite-state Markov process that produces symbols with transition from state to state can serve as an English grammar, and the particular subclass of such processes that produce n-order statistical approximations to English does not come closer, with increasing n, to matching the output of an English grammar.
Abstract: We investigate several conceptions of linguistic structure to determine whether or not they can provide simple and "revealing" grammars that generate all of the sentences of English and only these. We find that no finite-state Markov process that produces symbols with transition from state to state can serve as an English grammar. Furthermore, the particular subclass of such processes that produce n-order statistical approximations to English do not come closer, with increasing n, to matching the output of an English grammar. We formalize the notions of "phrase structure" and show that this gives us a method for describing language which is essentially more powerful, though still representable as a rather elementary type of finite-state process. Nevertheless, it is successful only when limited to a small subset of simple sentences. We study the formal properties of a set of grammatical transformations that carry sentences with phrase structure into new sentences with derived phrase structure, showing that transformational grammars are processes of the same elementary type as phrase-structure grammars; that the grammar of English is materially simplified if phrase structure description is limited to a kernel of simple sentences from which all other sentences are constructed by repeated transformations; and that this view of linguistic structure gives a certain insight into the use and understanding of language.

2,140 citations
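The contrast drawn in this abstract between finite-state sources and phrase-structure description can be illustrated with a minimal Python sketch of a context-free grammar; the toy grammar below generates the language a^n b^n, which no finite-state Markov source can produce. The rule table and the depth cap are illustrative choices only.

import random

# Toy phrase-structure (context-free) grammar, written as rewrite rules:
# S -> a S b | (empty). Its language {a^n b^n} is beyond any finite-state source.
GRAMMAR = {"S": [["a", "S", "b"], []]}

def generate(symbol="S", depth=0, max_depth=8):
    # Expand nonterminals recursively; terminal symbols are returned as-is.
    if symbol not in GRAMMAR:
        return [symbol]
    # Force the terminating production once the recursion gets deep.
    options = GRAMMAR[symbol] if depth < max_depth else [GRAMMAR[symbol][-1]]
    production = random.choice(options)
    return [tok for part in production for tok in generate(part, depth + 1, max_depth)]

# Example: prints strings such as "ab", "aaabbb", or the empty string.
print("".join(generate()))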


"A Formal Theory of Inductive Infere..." refers methods in this paper

  • ...The method that will be used is equivalent to finding a PSL (context free phrase structure language, Chomsky (1956)) that in some sense best “fits” the set [α1]....

    [...]

  • ...The method that will be used is equivalent to finding a PSL (context free phrase structure language, Chomsky (1956)) that in some sense best “fits” the set [α1]....

    [...]

Book
01 Jan 1950

1,960 citations