
Showing papers on "Generalization published in 1991"


Journal ArticleDOI
TL;DR: In this article, a generalization of the coefficient of determination R2 to general regression models is discussed, and a modification of an earlier definition to allow for discrete models is proposed.
Abstract: A generalization of the coefficient of determination R2 to general regression models is discussed. A modification of an earlier definition to allow for discrete models is proposed.
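
As a concrete illustration of the kind of likelihood-based generalization of R2 discussed here, the sketch below computes a Cox-Snell style ratio rescaled by its maximum attainable value for a logistic regression fit. The data, model and function names are assumptions made for the example, not taken from the paper.

```python
# Hedged sketch: a likelihood-based generalization of R^2 computed from the
# null-model and fitted-model log-likelihoods (Cox-Snell style ratio, rescaled
# by its maximum attainable value so a perfect model approaches 1).
import numpy as np
from scipy.optimize import minimize

def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood for a logistic regression model."""
    eta = X @ beta
    return np.sum(y * eta - np.log1p(np.exp(eta)))

def generalized_r2(X, y):
    n = len(y)
    X0 = np.ones((n, 1))                      # null model: intercept only
    b0 = minimize(lambda b: -log_likelihood(b, X0, y), np.zeros(1)).x
    ll_null = log_likelihood(b0, X0, y)
    Xf = np.column_stack([np.ones(n), X])     # full model with intercept
    bf = minimize(lambda b: -log_likelihood(b, Xf, y), np.zeros(Xf.shape[1])).x
    ll_full = log_likelihood(bf, Xf, y)
    r2 = 1.0 - np.exp((2.0 / n) * (ll_null - ll_full))   # Cox-Snell form
    r2_max = 1.0 - np.exp((2.0 / n) * ll_null)           # its upper bound
    return r2 / r2_max                                   # rescaled to [0, 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)
print(f"generalized R^2: {generalized_r2(X, y):.3f}")
```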

5,085 citations


Proceedings Article
02 Dec 1991
TL;DR: It is proven that a weight decay has two effects in a linear network, and it is shown how to extend these results to networks with hidden layers and non-linear units.
Abstract: It has been observed in numerical simulations that a weight decay can improve generalization in a feed-forward neural network. This paper explains why. It is proven that a weight decay has two effects in a linear network. First, it suppresses any irrelevant components of the weight vector by choosing the smallest vector that solves the learning problem. Second, if the size is chosen right, a weight decay can suppress some of the effects of static noise on the targets, which improves generalization quite a lot. It is then shown how to extend these results to networks with hidden layers and non-linear units. Finally the theory is confirmed by some numerical simulations using the data from NetTalk.
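
For the linear-network case the abstract describes, training with weight decay has the familiar ridge-regression closed form, which the following numpy sketch uses to show how the decay parameter shrinks the weight vector and suppresses the effect of target noise. Data dimensions, noise level and the lambda values are illustrative assumptions.

```python
# Minimal sketch of the linear case: with weight decay, the trained linear map
# is the ridge solution w(lambda) = (X^T X + lambda I)^{-1} X^T y, which
# shrinks components of w that carry little signal.
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.0, 0.5]               # only a few relevant components
y = X @ w_true + 0.3 * rng.normal(size=n)   # static noise on the targets

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for lam in [0.0, 1.0, 10.0]:
    w = ridge(X, y, lam)
    X_test = rng.normal(size=(1000, d))
    test_err = np.mean((X_test @ w - X_test @ w_true) ** 2)
    print(f"lambda={lam:5.1f}  |w|={np.linalg.norm(w):.3f}  test MSE={test_err:.4f}")
```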

1,569 citations


Journal ArticleDOI
TL;DR: Fractional statistics is reformulated as a generalization of the Pauli exclusion principle, and a definition independent of the dimension of space is obtained, which is used to classify spinons in gapless spin-1/2 antiferromagnetic chains as semions.
Abstract: The concept of ``fractional statistics'' is reformulated as a generalization of the Pauli exclusion principle, and a definition independent of the dimension of space is obtained. When applied to the vortexlike quasiparticles of the fractional quantum Hall effect, it gives the same result as that based on the braid-group. It is also used to classify spinons in gapless spin-1/2 antiferromagnetic chains as semions. An extensive one-particle Hilbert-space dimension is essential, limiting fractional statistics of this type to topological excitations confined to the interior of condensed matter. The new definition does not apply to ``anyon gas'' models as currently formulated: A possible resolution of this difficulty is proposed.
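
A hedged restatement of the generalized exclusion idea in its commonly quoted form (an assumption about the standard presentation, not a quotation from this abstract): the dimension of the one-particle Hilbert space available to a species shrinks linearly as particles are added, with the statistics parameter interpolating between bosons and fermions.

```latex
% Generalized (fractional) exclusion statistics in its commonly quoted form:
% adding \Delta N_\beta particles of species \beta reduces the dimension
% d_\alpha of the one-particle Hilbert space available to species \alpha.
\Delta d_\alpha \;=\; -\sum_\beta g_{\alpha\beta}\,\Delta N_\beta,
\qquad
g_{\alpha\beta} = 0 \ \text{(bosons)}, \qquad
g_{\alpha\beta} = \delta_{\alpha\beta} \ \text{(fermions)}.
```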

830 citations


Proceedings Article
14 Jul 1991
TL;DR: It is shown that any learning algorithm implementing the MIN-FEATURES bias requires Θ(1/ε ln 1/δ + 1/ε [2^p + p ln n]) training examples to guarantee PAC-learning a concept having p relevant features out of n available features, and suggests that training data should be preprocessed to remove irrelevant features before being given to ID3 or FRINGE.
Abstract: In many domains, an appropriate inductive bias is the MIN-FEATURES bias, which prefers consistent hypotheses definable over as few features as possible. This paper defines and studies this bias. First, it is shown that any learning algorithm implementing the MIN-FEATURES bias requires Θ(1/ε ln 1/δ + 1/ε [2^p + p ln n]) training examples to guarantee PAC-learning a concept having p relevant features out of n available features. This bound is only logarithmic in the number of irrelevant features. The paper also presents a quasi-polynomial time algorithm, FOCUS, which implements MIN-FEATURES. Experimental studies are presented that compare FOCUS to the ID3 and FRINGE algorithms. These experiments show that, contrary to expectations, these algorithms do not implement good approximations of MIN-FEATURES. The coverage, sample complexity, and generalization performance of FOCUS are substantially better than those of either ID3 or FRINGE on learning problems where the MIN-FEATURES bias is appropriate. This suggests that, in practical applications, training data should be preprocessed to remove irrelevant features before being given to ID3 or FRINGE.
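
The following toy sketch illustrates the MIN-FEATURES bias itself: search feature subsets in order of increasing size and keep the smallest one on which the training sample remains consistent. It is a brute-force illustration of the bias, not a faithful reimplementation of FOCUS, and the example concept is an assumption.

```python
# Toy sketch of the MIN-FEATURES bias: return the smallest feature subset on
# which no two training examples agree on the subset yet disagree on the label.
# Brute force and exponential in the subset size; for illustration only.
from itertools import combinations

def consistent(examples, subset):
    """True if no two examples collide on `subset` with different labels."""
    seen = {}
    for x, label in examples:
        key = tuple(x[i] for i in subset)
        if seen.setdefault(key, label) != label:
            return False
    return True

def min_features(examples, n_features):
    for k in range(n_features + 1):
        for subset in combinations(range(n_features), k):
            if consistent(examples, subset):
                return subset
    return tuple(range(n_features))

# Target concept: x0 AND x1, with three irrelevant features x2..x4.
examples = [((a, b, c, d, e), a & b)
            for a in (0, 1) for b in (0, 1)
            for c in (0, 1) for d in (0, 1) for e in (0, 1)]
print(min_features(examples, 5))   # expected: (0, 1)
```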

716 citations


Journal ArticleDOI
TL;DR: It is shown that a modification to the error functional allows smoothing to be introduced explicitly without significantly affecting the speed of training.
Abstract: An important feature of radial basis function neural networks is the existence of a fast, linear learning algorithm in a network capable of representing complex nonlinear mappings. Satisfactory generalization in these networks requires that the network mapping be sufficiently smooth. We show that a modification to the error functional allows smoothing to be introduced explicitly without significantly affecting the speed of training. A simple example is used to demonstrate the resulting improvement in the generalization properties of the network.
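
The sketch below shows the general setup the abstract relies on, with a plain weight-norm penalty standing in for the smoothing functional: the RBF output weights still come from a single regularized linear solve, so the extra term does not change the cost of training. Centres, widths and penalty values are illustrative assumptions.

```python
# Sketch: RBF regression whose output weights come from one regularized linear
# solve. A simple ridge penalty stands in here for the smoothing term added to
# the error functional; centres, width and lambda are arbitrary choices.
import numpy as np

def design_matrix(x, centres, width):
    return np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * width ** 2))

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.size)

centres = np.linspace(0, 1, 15)
Phi = design_matrix(x, centres, width=0.08)

for lam in [0.0, 1e-3, 1e-1]:
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(len(centres)), Phi.T @ y)
    x_test = np.linspace(0, 1, 200)
    y_hat = design_matrix(x_test, centres, 0.08) @ w
    err = np.mean((y_hat - np.sin(2 * np.pi * x_test)) ** 2)
    print(f"lambda={lam:g}  test MSE vs true curve = {err:.4f}")
```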

325 citations


Book
01 Jan 1991
TL;DR: Part 1 Rule base organization: design considerations for a rule based system; conceptual frameworks for geographical knowledge; knowledge engineering for generalization. Part 2 Data modelling issues: suitable representation schema for geographic information; knowledge classification and organization; object modelling and phenomenon-based generalization.
Abstract: Part 1 Rule base organization: design considerations for a rule based system; conceptual frameworks for geographical knowledge; knowledge engineering for generalization. Part 2 Data modelling issues: suitable representation schema for geographic information; knowledge classification and organization; object modelling and phenomenon-based generalization. Part 3 Formulation of rules: constraints on rule formation; rule selection for small scale map generalizations; a rule for describing feature geometry; amplified intelligence and rule based systems. Part 4 Computational and representational issues: role of interpolation in feature displacement; parallel software and computation; integration and evaluation of map generalization.

289 citations


Proceedings Article
24 Aug 1991
TL;DR: This paper describes the input generalization problem (whereby the system must generalize to produce similar actions in similar situations) and an implemented solution, the G algorithm, which is based on recursive splitting of the state space based on statistical measures of differences in reinforcements received.
Abstract: Delayed reinforcement learning is an attractive framework for the unsupervised learning of action policies for autonomous agents. Some existing delayed reinforcement learning techniques have shown promise in simple domains. However, a number of hurdles must be passed before they are applicable to realistic problems. This paper describes one such difficulty, the input generalization problem (whereby the system must generalize to produce similar actions in similar situations), and an implemented solution, the G algorithm. This algorithm is based on recursive splitting of the state space based on statistical measures of differences in reinforcements received. Connectionist backpropagation has previously been used for input generalization in reinforcement learning. We compare the two techniques analytically and empirically. The G algorithm's sound statistical basis makes it easy to predict when it should and should not work, whereas the behavior of back-propagation is unpredictable. We found that a previous successful use of backpropagation can be explained by the linearity of the application domain. We found that in another domain, G reliably found the optimal policy, whereas none of a set of runs of backpropagation with many combinations of parameters did.
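
A heavily simplified sketch of the splitting idea is given below: experience is grouped by input bits and a bit is chosen for splitting when the reinforcements observed for its two values differ by a statistically significant margin (a two-sample t statistic here). Immediate rewards replace delayed returns to keep the example short; the threshold and names are assumptions rather than details from the paper.

```python
# Heavily simplified sketch: pick the input bit whose two values lead to the
# most statistically different mean reinforcement (two-sample t statistic).
# Immediate rewards stand in for delayed returns; threshold is an assumption.
import math
import random

def t_statistic(a, b):
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        return 0.0
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    denom = math.sqrt(va / na + vb / nb) or 1e-12
    return abs(ma - mb) / denom

def split_bit(experience, n_bits, threshold=3.0):
    """Return the bit most worth splitting on, or None if none is significant."""
    best_bit, best_t = None, threshold
    for bit in range(n_bits):
        zeros = [r for s, r in experience if s[bit] == 0]
        ones = [r for s, r in experience if s[bit] == 1]
        t = t_statistic(zeros, ones)
        if t > best_t:
            best_bit, best_t = bit, t
    return best_bit

# Reinforcement depends only on bit 0; bits 1 and 2 are irrelevant.
random.seed(0)
experience = []
for _ in range(200):
    s = tuple(random.randint(0, 1) for _ in range(3))
    r = random.gauss(1.0 if s[0] else 0.0, 0.1)
    experience.append((s, r))
print(split_bit(experience, 3))   # expected: 0
```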

272 citations


Journal ArticleDOI
TL;DR: In this article, a path is a sequence of points P_0 P_1 ... P_m, m ≥ 0, where each P_i is a lattice point (that is, a point with integer coordinates) and P_(i+1), i ≥ 0, is obtained by stepping one unit east or one unit north of P_i.
Abstract: Probably the most prominent among the special integers that arise in combinatorial contexts are the binomial coefficients C(n, k). These have many uses and, often, fascinating interpretations [9]. We would like to stress one particular interpretation in terms of paths on the integral lattice in the coordinate plane, and discuss the celebrated ballot problem using this interpretation. A path is a sequence of points P_0 P_1 ... P_m, m ≥ 0, where each P_i is a lattice point (that is, a point with integer coordinates) and P_(i+1), i ≥ 0, is obtained by stepping one unit east or one unit north of P_i. We say that this is a path from P to Q if P_0 = P and P_m = Q. It is now easy to count the number of paths.
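
The path-counting fact alluded to at the end of the abstract is easy to check numerically: the number of east/north paths from (0, 0) to (a, b) is C(a + b, a). The grid sizes below are arbitrary choices for the check.

```python
# Quick check of the path-counting interpretation: the number of east/north
# lattice paths from (0, 0) to (a, b) equals the binomial coefficient
# C(a + b, a). The dynamic-programming count is compared against the
# closed form.
from math import comb

def count_paths(a, b):
    """Count monotone lattice paths from (0, 0) to (a, b) by recurrence."""
    grid = [[1] * (b + 1) for _ in range(a + 1)]
    for i in range(1, a + 1):
        for j in range(1, b + 1):
            grid[i][j] = grid[i - 1][j] + grid[i][j - 1]
    return grid[a][b]

for a, b in [(3, 2), (5, 5), (10, 4)]:
    assert count_paths(a, b) == comb(a + b, a)
    print(f"paths to ({a}, {b}) = {count_paths(a, b)}")
```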

262 citations


Journal ArticleDOI
Marlon Núñez1
TL;DR: The algorithm presented in this paper tries to generate more logical and understandable decision trees than those generated by ID3-like algorithms; it executes various types of generalization and at the same time reduces the classification cost by means of background knowledge.
Abstract: At present, algorithms of the ID3 family are not based on background knowledge. For that reason, most of the time they are neither logical nor understandable to experts. These algorithms cannot perform different types of generalization as others can do (Michalski, 1983; Kodratoff, 1983), nor can they reduce the cost of classifications. The algorithm presented in this paper tries to generate more logical and understandable decision trees than those generated by ID3-like algorithms; it executes various types of generalization and at the same time reduces the classification cost by means of background knowledge. The background knowledge contains the ISA hierarchy and the measurement cost associated with each attribute. The user can define the degrees of economy and generalization. These data will influence directly the quantity of search that the algorithm must undertake. This algorithm, which is an attribute version of the EG2 method (Nunez, 1988a, 1988b), has been implemented and the results appear in this paper comparing them with other methods.
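
The sketch below illustrates cost-sensitive attribute selection in the spirit usually attributed to EG2, trading information gain against measurement cost with an exponent w that sets the degree of economy. The exact criterion, the data and the parameter values are assumptions for illustration, not a verified reproduction of the paper's algorithm.

```python
# Hedged sketch of cost-sensitive attribute selection: the information gain of
# each attribute is traded off against its measurement cost via a criterion of
# the form (2^gain - 1) / (cost + 1)^w, with w in [0, 1] setting the economy.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    base, n, by_value = entropy(labels), len(labels), {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(label)
    return base - sum(len(sub) / n * entropy(sub) for sub in by_value.values())

def select_attribute(rows, labels, costs, w=0.5):
    def criterion(attr):
        gain = info_gain(rows, labels, attr)
        return (2 ** gain - 1) / ((costs[attr] + 1) ** w)
    return max(range(len(costs)), key=criterion)

# Attribute 0 is highly informative but expensive; attribute 1 is cheaper.
rows = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 0), (0, 1)]
labels = [0, 0, 1, 1, 1, 0]
costs = [50.0, 1.0]
print(select_attribute(rows, labels, costs, w=0.0))  # cost ignored: picks 0
print(select_attribute(rows, labels, costs, w=1.0))  # full economy: picks 1
```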

208 citations


Journal ArticleDOI
TL;DR: It is argued that the previous method of solving for x, based on the extension principle and regular fuzzy arithmetic, should be abandoned since it too often fails to produce a solution.

206 citations


Journal ArticleDOI
TL;DR: It is demonstrated that the two methods for bounding the overall properties of nonlinear composites generate precisely the same information, and hence that differences noted by Ponte Castaneda arise from comparing optimal bounds obtained from the new procedure with sub-optimal bounds obtaining from the original one.
Abstract: A new method for bounding the overall properties of nonlinear composites, proposed by Ponte Castaneda (J. Mech. Phys. Solids 39, 45, 1991), is compared with an older prescription based on a generalization to nonlinear behaviour of the Hashin-Shtrikman procedure. It is demonstrated that the two methods generate precisely the same information, and hence that differences noted by Ponte Castaneda arise from comparing optimal bounds obtained from the new procedure with sub-optimal bounds obtained from the original one. The relative advantages of either procedure are discussed.

Journal ArticleDOI
TL;DR: In this paper, the dipole intensity function and the time-constant density of RC one-port networks are introduced for the identification and synthesis of distributed RC networks, and the results can also be applied directly for inductance-resistance networks.
Abstract: Representations of infinite distributed RC one-ports are described. Two functions are introduced: the dipole intensity function (as the generalization of pole-zero pattern) and the time-constant density (as the generalization of the discrete time-constant set of a lumped network). Relations between these representations and the complex impedance are presented. These representations can be regarded as the generalization of the descriptions commonly used in the theory of lumped networks. The representations offer possibilities for the identification and for the synthesis of distributed RC networks. Although the representations were introduced for the case of RC networks, the results can also be applied directly for inductance-resistance networks. The use of the new representations is demonstrated by some examples.
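
As a rough sketch of what a time-constant density representation looks like (stated as the standard form for RC impedances, an assumption rather than a quotation from the paper), the discrete set of time constants of a lumped network is replaced by a density under an integral:

```latex
% A lumped RC one-port impedance is a finite sum over time constants; the
% distributed generalization replaces the discrete set {tau_i} by a
% time-constant density r(tau) (standard form, assumed):
Z(s) \;=\; R_\infty + \sum_i \frac{R_i}{1 + s\tau_i}
\quad\longrightarrow\quad
Z(s) \;=\; R_\infty + \int_0^\infty \frac{r(\tau)}{1 + s\tau}\, d\tau .
```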

01 Jan 1991
TL;DR: Weighted caching is a generalization of paging in which the cost to evict an item depends on the item as mentioned in this paper, and it is studied as a restriction of the well-known k-server problem.
Abstract: Weighted caching is a generalization of paging in which the cost to evict an item depends on the item. We study both of these problems as restrictions of the well-known k-server problem, which involves moving servers in a graph in response to requests so as to minimize the distance traveled.
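
To make the cost model concrete, the toy simulation below charges the weight of the evicted item on every miss. The eviction rule used (evict the cheapest cached item) is only a naive baseline for illustration; it is not the competitive algorithm analyzed in this line of work.

```python
# Toy simulation of the weighted-caching cost model: on a miss, some cached
# item is evicted and the cost charged is that item's weight. The evict-the-
# cheapest rule is a naive baseline, not the paper's algorithm.
def weighted_caching_cost(requests, weights, k):
    cache, total = set(), 0.0
    for item in requests:
        if item in cache:
            continue                      # hit: no cost
        if len(cache) >= k:
            victim = min(cache, key=lambda x: weights[x])
            cache.remove(victim)
            total += weights[victim]      # pay the victim's eviction cost
        cache.add(item)
    return total

weights = {"a": 1.0, "b": 1.0, "c": 10.0}
requests = ["a", "b", "c", "a", "b", "c", "a", "b"]
print(weighted_caching_cost(requests, weights, k=2))
```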

Proceedings ArticleDOI
John Moody1
30 Sep 1991
TL;DR: The author proposes a new estimate of generalization performance for nonlinear learning systems called the generalized prediction error (GPE) which is based upon the notion of the effective number of parameters p/sub eff/( lambda ).
Abstract: The author proposes a new estimate of generalization performance for nonlinear learning systems called the generalized prediction error (GPE), which is based upon the notion of the effective number of parameters p_eff(λ). GPE does not require the use of a test set or computationally intensive cross validation and generalizes previously proposed model selection criteria (such as GCV, FPE, AIC, and PSE) in that it is formulated to include biased, nonlinear models (such as back propagation networks) which may incorporate weight decay or other regularizers. The effective number of parameters p_eff(λ) depends upon the amount of bias and smoothness (as determined by the regularization parameter λ) in the model, but generally differs from the number of weights p. Construction of an optimal architecture thus requires not just finding the weights w*_λ which minimize the training function U(λ, w) but also the λ which minimizes GPE(λ).
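
For the special case of linear models with weight decay, the effective number of parameters reduces to the trace of the smoother matrix, which the sketch below computes together with a GPE-style estimate of the form training error + 2 sigma^2 p_eff / n. This is the form such criteria commonly take in the linear case and is meant as an illustration, not the paper's general nonlinear formula.

```python
# Hedged sketch for ridge (weight-decay) regression, where
# p_eff(lambda) = tr[X (X^T X + lambda I)^{-1} X^T] and a GPE-style estimate is
# training error + 2 * sigma^2 * p_eff / n. Data and lambdas are illustrative.
import numpy as np

rng = np.random.default_rng(3)
n, d = 60, 15
X = rng.normal(size=(n, d))
w_true = np.concatenate([rng.normal(size=3), np.zeros(d - 3)])
sigma = 0.5
y = X @ w_true + sigma * rng.normal(size=n)

for lam in [0.01, 1.0, 10.0, 100.0]:
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)   # smoother matrix
    y_hat = S @ y
    p_eff = np.trace(S)                                       # effective parameters
    train_mse = np.mean((y - y_hat) ** 2)
    gpe = train_mse + 2 * sigma**2 * p_eff / n
    print(f"lambda={lam:7.2f}  p_eff={p_eff:5.2f}  train={train_mse:.3f}  GPE~{gpe:.3f}")
```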

Journal ArticleDOI
TL;DR: The concept of schema and the Schema Theorem are interpreted from a new perspective, which allows GAs to be regarded as a constrained random walk, and offers a view which is amenable to generalization.

Journal Article
TL;DR: The process of formal definition in advanced mathematics actually consists of two distinct complementary processes: the first is the abstraction of specific properties of one or more mathematical objects to form the basis of the definition of the new abstract mathematical object and the second is the process of construction of the abstract concept through logical deduction from the definition as discussed by the authors.
Abstract: An abstraction process occurs when the subject focuses attention on specific properties of a given object and then considers these properties in isolation from the original. This might be done, for example, to understand the essence of a certain phenomenon, perhaps later to be able to apply the same theory in other cases to which it applies. Such application of an abstract theory would be a case of reconstructive generalization, because the abstracted properties are reconstructions of the original properties, now applied to a broader domain. However, note that once the reconstructive generalization has occurred, it may then be possible to extend the range of examples to which the arguments apply through the simpler process of expansive generalization. For instance, when the group properties are extracted from various contexts to give the axioms for a group, this must be followed by the reconstruction of other properties (such as uniqueness of identity and of inverses) from the axioms. This leads to the construction of an abstract group concept which is a reconstructive generalization of various familiar examples of groups. When this abstract construction has been made, further applications of group theory to other contexts (usually performed by specialization from the abstract concept) are now expansive generalizations of the original ideas.
The case of definition: the process of formal definition in advanced mathematics actually consists of two distinct complementary processes. One is the abstraction of specific properties of one or more mathematical objects to form the basis of the definition of the new abstract mathematical object. The other is the process of construction of the abstract concept through logical deduction from the definition. The first of these processes we will call formal abstraction, in that it abstracts the form of the new concept through the selection of generative properties of one or more specific situations; for example, abstracting the vector-space axioms from the space of directed-line segments alone, or from what is noticed to be common to this space and the space of polynomials. This formal abstraction historically took many generations, but is now a preferred method of progress in building mathematical theories. The student rarely sees this part of the process. Instead (s)he is presented with the definition in terms of carefully selected properties as a fait accompli. When presented with the definition, the student is faced with the naming of the concept and the statement of a small number of properties or axioms. But the definition is more than a naming. It is the selection of generative properties suitable for deductive construction of the abstract concept. The abstract concept which satisfies only those properties that may be deduced from the definition and no others requires a massive reconstruction. Its construction is guided by the properties which hold in the original mathematical concepts from which it was abstracted, but judgement of the truth of these properties must be suspended until they are deduced from the definition. For the novice this is liable to cause great confusion at the time. The newly constructed abstract object will then generalize the properties embodied in the definition, because any properties that may be deduced from them will be part of it. Because of the difficulties involved in the construction process, this is a reconstructive generalization. Occasionally the process leads to a newly constructed abstract object whose properties apply only to the original domain, and not to a more general domain: for instance, the formal abstraction of the notion of a complete ordered field from the real numbers, or the abstraction of the group concept from groups of transformations. Up to isomorphism there is only one complete ordered field, and Cayley's theorem shows that every abstract group is isomorphic to a group of transformations. In these cases the process leads to an abstract concept which does not extend the class of possible embodiments. We include these instances within the same theoretical framework for, though they fail to generalize the notion to a broader class of examples, they very much change the nature of the concept in question. The formal abstraction process, coupled with the construction of the formal concept, when achieved, leads to a mental object that is easier for the expert to manipulate mentally, because the precise properties of the concept have been abstracted and can lead to precise general proofs based on these properties. Formal abstraction leading to mathematical definitions usually serves two purposes which are particularly attractive to the expert mathematician: (a) any arguments valid for the abstracted properties apply to all other instances where the abstracted properties hold, so (provided that there are other instances) the arguments are more general; (b) once the abstraction is made, by concentrating on the abstracted properties and ignoring all others, the abstraction should involve less cognitive strain. These two factors make a formal abstraction a powerful tool for the expert, yet, because of the cognitive reconstruction involved, they may cause great difficulty for the learner.

Journal ArticleDOI
TL;DR: In this article, the risk-sensitive maximum principle for optimal stochastic control derived by the author in an earlier work (Systems & Control Letters, vol. 15, 1990) is restated.
Abstract: The risk-sensitive maximum principle for optimal stochastic control derived by the author in an earlier work (Systems & Control Letters, vol. 15, 1990) is restated. This is an immediate generalization of the classic Pontryagin principle, to which it reduces in the deterministic case, and is expressed immediately in terms of observables. It is derived on the assumption that the criterion function is the exponential of an additive cost function, and is exact under linear-quadratic Gaussian assumptions, but is otherwise valid as a large deviation approximation. The principle is extended to the case of imperfect state observation after preliminary establishment of a certainty-equivalence principle. The derivation yields as byproduct a large-deviation version of the updating equation for nonlinear filtering. The development is heuristic. It is believed that the mathematical arguments given are the essential ones, and provide a self-contained treatment at this level.

Proceedings ArticleDOI
J. Utans1, J. Moody1
09 Oct 1991
TL;DR: The authors propose the prediction risk as a measure of the generalization ability of multi-layer perceptron networks and use it to select the optimal network architecture.
Abstract: The notion of generalization can be defined precisely as the prediction risk, the expected performance of an estimator on new observations. The authors propose the prediction risk as a measure of the generalization ability of multi-layer perceptron networks and use it to select the optimal network architecture. The prediction risk must be estimated from the available data. The authors approximate the prediction risk by v-fold cross-validation and asymptotic estimates of generalized cross-validation or H. Akaike's (1970) final prediction error. They apply the technique to the problem of predicting corporate bond ratings. This problem is very attractive as a case study, since it is characterized by the limited availability of the data and by the lack of complete a priori information that could be used to impose a structure to the network architecture.
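
The v-fold cross-validation estimate of the prediction risk has a simple generic form, sketched below for an ordinary least-squares estimator: average the squared prediction error over held-out folds. The model, data and choice of v = 5 are illustrative assumptions.

```python
# Sketch of estimating the prediction risk by v-fold cross-validation for a
# simple least-squares estimator: average squared error over held-out folds.
import numpy as np

def cv_prediction_risk(X, y, v=5, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, v)
    errors = []
    for k in range(v):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(v) if j != k])
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errors.append(np.mean((X[test] @ w - y[test]) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.5]) + 0.3 * rng.normal(size=100)
print(f"estimated prediction risk: {cv_prediction_risk(X, y):.4f}")
```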

Book
08 Nov 1991
TL;DR: New bounds; the initial generalization; to be a logical term; semantics from the ground up; ways of branching quantifiers; a new conception of logic.
Abstract: New bounds; the initial generalization; to be a logical term; semantics from the ground up; ways of branching quantifiers; a new conception of logic.

Journal ArticleDOI
TL;DR: It is shown that approximations to the generalization error of the Bayes optimal algorithm can be achieved by learning algorithms that use a two-layer neural net to learn a perceptron.
Abstract: The generalization error of the Bayes optimal classification algorithm when learning a perceptron from noise-free random training examples is calculated exactly using methods of statistical mechanics. It is shown that if an assumption of replica symmetry is made, then, in the thermodynamic limit, the error of the Bayes optimal algorithm is less than the error of a canonical stochastic learning algorithm, by a factor approaching √2 as the ratio of the number of training examples to perceptron weights grows. In addition, it is shown that approximations to the generalization error of the Bayes optimal algorithm can be achieved by learning algorithms that use a two-layer neural net to learn a perceptron.

Proceedings Article
02 Dec 1991
TL;DR: The prediction risk is proposed as a measure of the generalization ability of multi-layer perceptron networks and used to select an optimal network architecture from a set of possible architectures and a heuristic search strategy is proposed to explore the space of possible architecture.
Abstract: The notion of generalization ability can be defined precisely as the prediction risk, the expected performance of an estimator in predicting new observations. In this paper, we propose the prediction risk as a measure of the generalization ability of multi-layer perceptron networks and use it to select an optimal network architecture from a set of possible architectures. We also propose a heuristic search strategy to explore the space of possible architectures. The prediction risk is estimated from the available data; here we estimate the prediction risk by v-fold cross-validation and by asymptotic approximations of generalized cross-validation or Akaike's final prediction error. We apply the technique to the problem of predicting corporate bond ratings. This problem is very attractive as a case study, since it is characterized by the limited availability of the data and by the lack of a complete a priori model which could be used to impose a structure to the network architecture.

Journal ArticleDOI
TL;DR: In this article, it was shown that the R-matrix which intertwines two n-by-N^(n−1)-state cyclic L-operators related with a generalization of the U_q(sl(n)) algebra can be considered as a Boltzmann weight of a four-spin box for a lattice model with two-spin interaction, just as the R-matrix of the checkerboard chiral Potts model.
Abstract: We show that the R-matrix which intertwines two n-by-N^(n−1)-state cyclic L-operators related with a generalization of the U_q(sl(n)) algebra can be considered as a Boltzmann weight of a four-spin box for a lattice model with two-spin interaction, just as the R-matrix of the checkerboard chiral Potts model. The rapidity variables lie on the algebraic curve of genus g = N^(2(n−1))((n−1)N − n) + 1 defined by 2n−3 independent moduli. This curve is a natural generalization of the curve which appeared in the chiral Potts model. Factorization properties of the L-operator and its connection to the SOS models are also discussed.

Journal ArticleDOI
TL;DR: Results suggest that a large and representative training sample may be the single, most important factor in achieving high recognition accuracy in hand-printed character recognition systems, and benefits of reducing the number of net connections are discussed.
Abstract: We report on results of training backpropagation nets with samples of hand-printed digits scanned off of bank checks and hand-printed letters interactively entered into a computer through a stylus digitizer. Generalization results are reported as a function of training set size and network capacity. Given a large training set, and a net with sufficient capacity to achieve high performance on the training set, nets typically achieved error rates of 4-5% at a 0% reject rate and 1-2% at a 10% reject rate. The topology and capacity of the system, as measured by the number of connections in the net, have surprisingly little effect on generalization. For those developing hand-printed character recognition systems, these results suggest that a large and representative training sample may be the single, most important factor in achieving high recognition accuracy. Benefits of reducing the number of net connections, other than improving generalization, are discussed.

Book
01 Jun 1991
TL;DR: The results show that a network architecture evolved by the genetic algorithm performs better than a large network using backpropagation learning alone when the criterion is correct generalization from a set of examples.
Abstract: Neural networks are known to exhibit emergent behaviors, but it is often far from easy to exploit these properties for desired ends such as effective machine learning. We demonstrate that a genetic algorithm is capable of discovering how to exploit the abilities of one type of network learning, backpropagation in feedforward networks. Our results show that a network architecture evolved by the genetic algorithm performs better than a large network using backpropagation learning alone when the criterion is correct generalization from a set of examples. This is potentially a powerful method for design of neural networks-design by evolution.

Journal ArticleDOI
TL;DR: A technique for constructing neural network architectures with better ability to generalize is presented under the name Ockham's Razor: several networks are trained and then pruned by removing connections one by one and retraining, resulting in perfect generalization.
Abstract: A technique for constructing neural network architectures with better ability to generalize is presented under the name Ockham's Razor: several networks are trained and then pruned by removing connections one by one and retraining. The networks which achieve fewest connections generalize best. The method is tested on a classification of bit strings (the contiguity problem): the optimal architecture emerges, resulting in perfect generalization. The internal representation of the network changes substantially during the retraining, and this distinguishes the method from previous pruning studies.
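
A compressed sketch of the prune-and-retrain loop is given below, applied to a single-layer model rather than a multilayer network so that it stays short: train, remove one connection, retrain under the resulting mask, and stop when pruning starts to hurt performance on the training data. All hyperparameters and the choice of which connection to remove are assumptions for illustration.

```python
# Simplified prune-and-retrain loop on a single-layer logistic model: remove
# the smallest-magnitude active weight, retrain under the mask, and stop when
# training accuracy drops. Hyperparameters are illustrative assumptions.
import numpy as np

def train(X, y, mask, epochs=2000, lr=0.1):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * (X.T @ (p - y)) / len(y)
        w *= mask                      # pruned connections stay at zero
    return w

def accuracy(X, y, w):
    return np.mean(((X @ w) > 0).astype(float) == y)

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
y = (X[:, 0] - X[:, 1] > 0).astype(float)   # only two inputs matter

mask = np.ones(10)
w = train(X, y, mask)
while mask.sum() > 1:
    active = np.where(mask == 1)[0]
    candidate = active[np.argmin(np.abs(w[active]))]   # weakest connection
    trial_mask = mask.copy()
    trial_mask[candidate] = 0
    trial_w = train(X, y, trial_mask)
    if accuracy(X, y, trial_w) < accuracy(X, y, w):
        break                           # pruning now hurts: stop
    mask, w = trial_mask, trial_w
print("connections kept:", int(mask.sum()))
```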

Journal ArticleDOI
TL;DR: In this paper, the authors generalize the α-cuts of two-place functions defined by Zadeh's extension principle to the case of extended two-place functions via a sup-t-norm convolution.


Journal ArticleDOI
Xiao-Qiang Zhao1
TL;DR: In this paper, the authors considered an n-species Lotka-Volterra periodic competition system and obtained sufficient conditions for the ultimate boundedness of solutions and the existence and global attractivity of a positive periodic solution.
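
For reference, a standard form of the n-species periodic Lotka-Volterra competition system is sketched below (an assumption about the usual setting of such results, not a quotation from the paper), with growth rates and interaction coefficients that are continuous, positive and share a common period.

```latex
% Standard n-species periodic Lotka-Volterra competition system (assumed
% standard form): b_i and a_ij are continuous, positive and omega-periodic.
\frac{dx_i}{dt} = x_i(t)\Bigl(b_i(t) - \sum_{j=1}^{n} a_{ij}(t)\,x_j(t)\Bigr),
\qquad i = 1,\dots,n,
\qquad b_i(t+\omega)=b_i(t),\quad a_{ij}(t+\omega)=a_{ij}(t).
```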

Journal ArticleDOI
TL;DR: This paper gives several characterizations of the solution set of convex programs; the subgradients attaining the minimum principle are explicitly characterized, and this characterization is shown to be independent of any solution.

Proceedings ArticleDOI
H. Drucker1, Y. Le Cun1
08 Jul 1991
TL;DR: It is shown that a training algorithm termed double back-propagation improves generalization by simultaneously minimizing the normal energy term found in back-propagation and an additional energy term that is related to the sum of the squares of the input derivatives (gradients).
Abstract: One test of a training algorithm is how well the algorithm generalizes from the training data to the test data. It is shown that a training algorithm termed double back-propagation improves generalization by simultaneously minimizing the normal energy term found in back-propagation and an additional energy term that is related to the sum of the squares of the input derivatives (gradients). In normal back-propagation training, minimizing the energy function tends to push the input gradient to zero. However, this is not always possible. Double back-propagation explicitly pushes the input gradients to zero, making the minimum broader, and increases the generalization on the test data. The authors show the improvement over normal back-propagation on four candidate architectures with a training set of 320 handwritten numbers and a test set of size 180.
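
For a linear model the input gradient of the per-example error has a closed form, which makes the combined objective easy to write down; the sketch below minimizes it with a numerical gradient to stay dependency-free. Network shape, data and the penalty weight are illustrative assumptions, and the real method applies the idea inside back-propagation rather than by numerical differentiation.

```python
# Hedged sketch of the double back-propagation objective for a linear model:
# for E = 0.5 * ||W x - t||^2 the input gradient is W^T (W x - t), and the
# extra penalty is the sum of its squared entries. The combined objective is
# minimized with a numerical gradient to keep the sketch short.
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(64, 8))
T = X @ rng.normal(size=(8, 3)) + 0.1 * rng.normal(size=(64, 3))

def objective(W_flat, lam=0.1):
    W = W_flat.reshape(3, 8)
    R = X @ W.T - T                        # residuals, shape (64, 3)
    data_term = 0.5 * np.mean(np.sum(R ** 2, axis=1))
    input_grads = R @ W                    # dE/dx for each example, shape (64, 8)
    penalty = 0.5 * np.mean(np.sum(input_grads ** 2, axis=1))
    return data_term + lam * penalty

def numerical_gradient(f, w, eps=1e-5):
    g = np.zeros_like(w)
    for i in range(w.size):
        d = np.zeros_like(w)
        d[i] = eps
        g[i] = (f(w + d) - f(w - d)) / (2 * eps)
    return g

W = rng.normal(size=24) * 0.1
for step in range(300):
    W -= 0.3 * numerical_gradient(objective, W)
print(f"final objective: {objective(W):.4f}")
```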