
Showing papers on "Reinforcement learning" published in 1991


Journal ArticleDOI
TL;DR: One of these variants, called REINFORCE/MENT, represents a novel but principled approach to reinforcement learning in nontrivial networks which incorporates an entropy maximization strategy.
Abstract: Any non-associative reinforcement learning algorithm can be viewed as a method for performing function optimization through (possibly noise-corrupted) sampling of function values. We describe the results of simulations in which the optima of several deterministic functions studied by Ackley were sought using variants of REINFORCE algorithms. Some of the algorithms used here incorporated additional heuristic features resembling certain aspects of some of the algorithms used in Ackley's studies. Differing levels of performance were achieved by the various algorithms investigated, but a number of them performed at a level comparable to the best found in Ackley's studies on a number of the tasks, in spite of their simplicity. One of these variants, called REINFORCE/MENT, represents a novel but principled approach to reinforcement learning in nontrivial networks which incorporates an entropy maximization strategy. This was found to perform especially well on more hierarchically organized tasks.

367 citations
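To make the REINFORCE-style update concrete, the following is a minimal Python sketch of a team of Bernoulli-logistic units optimizing a toy bit-string objective, with an added entropy-maximization term in the spirit of REINFORCE/MENT. The objective, constants, and baseline scheme are illustrative assumptions, not the exact algorithms or Ackley test functions used in the paper.

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def toy_objective(bits):
    # Illustrative stand-in for Ackley's test functions: reward = number of 1-bits.
    return float(sum(bits))

N, ALPHA, ENTROPY_WEIGHT = 20, 0.1, 0.01
w = [0.0] * N            # logit parameter of each Bernoulli unit
baseline = 0.0           # running reinforcement-comparison term

for trial in range(2000):
    p = [sigmoid(wi) for wi in w]
    bits = [1 if random.random() < pi else 0 for pi in p]
    r = toy_objective(bits)
    for i in range(N):
        grad_logp = bits[i] - p[i]                    # REINFORCE eligibility
        grad_entropy = -w[i] * p[i] * (1.0 - p[i])    # pushes p[i] back toward 0.5
        w[i] += ALPHA * ((r - baseline) * grad_logp + ENTROPY_WEIGHT * grad_entropy)
    baseline += 0.05 * (r - baseline)
```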


Journal ArticleDOI
TL;DR: This article considers adaptive control architectures that integrate active sensory-motor systems with decision systems based on reinforcement learning and shows that perceptual aliasing destabilizes existing reinforcement learning algorithms with respect to the optimal decision policy.
Abstract: This article considers adaptive control architectures that integrate active sensory-motor systems with decision systems based on reinforcement learning. One unavoidable consequence of active perception is that the agent's internal representation often confounds external world states. We call this phenomenon perceptual aliasing and show that it destabilizes existing reinforcement learning algorithms with respect to the optimal decision policy. We then describe a new decision system that overcomes these difficulties for a restricted class of decision problems. The system incorporates a perceptual subcycle within the overall decision cycle and uses a modified learning algorithm to suppress the effects of perceptual aliasing. The result is a control architecture that learns not only how to solve a task but also where to focus its visual attention in order to collect necessary sensory information.

310 citations


Proceedings Article
24 Aug 1991
TL;DR: This paper describes the input generalization problem (whereby the system must generalize to produce similar actions in similar situations) and an implemented solution, the G algorithm, which recursively splits the state space using statistical measures of differences in the reinforcements received.
Abstract: Delayed reinforcement learning is an attractive framework for the unsupervised learning of action policies for autonomous agents. Some existing delayed reinforcement learning techniques have shown promise in simple domains. However, a number of hurdles must be passed before they are applicable to realistic problems. This paper describes one such difficulty, the input generalization problem (whereby the system must generalize to produce similar actions in similar situations), and an implemented solution, the G algorithm. This algorithm is based on recursive splitting of the state space based on statistical measures of differences in reinforcements received. Connectionist backpropagation has previously been used for input generalization in reinforcement learning. We compare the two techniques analytically and empirically. The G algorithm's sound statistical basis makes it easy to predict when it should and should not work, whereas the behavior of backpropagation is unpredictable. We found that a previous successful use of backpropagation can be explained by the linearity of the application domain. We found that in another domain, G reliably found the optimal policy, whereas none of a set of runs of backpropagation with many combinations of parameters did.

272 citations
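The recursive-splitting idea can be sketched as follows: a leaf collects reward statistics for each input bit it has not yet tested and splits on a bit when a crude two-sample test finds a significant difference. The test, threshold, and class names below are my simplifications, not the G algorithm's exact statistics.

```python
import statistics

def rewards_differ(a, b, threshold=2.0):
    # Crude two-sample t-like test (an assumption; the paper's statistics differ).
    if len(a) < 5 or len(b) < 5:
        return False
    va = statistics.pvariance(a) + 1e-6
    vb = statistics.pvariance(b) + 1e-6
    t = abs(statistics.mean(a) - statistics.mean(b)) / (va / len(a) + vb / len(b)) ** 0.5
    return t > threshold

class GNode:
    """A leaf groups all input vectors that agree on the bits tested so far."""
    def __init__(self, tested_bits=()):
        self.tested_bits = tested_bits
        self.stats = {}        # candidate bit -> ([rewards when bit=0], [rewards when bit=1])
        self.split_bit = None
        self.children = None   # (child for bit=0, child for bit=1) once split

    def record(self, bits, reward):
        if self.children is not None:                 # descend to the relevant leaf
            return self.children[bits[self.split_bit]].record(bits, reward)
        for b in range(len(bits)):
            if b in self.tested_bits:
                continue
            zeros, ones = self.stats.setdefault(b, ([], []))
            (ones if bits[b] else zeros).append(reward)
            if rewards_differ(zeros, ones):
                # Reinforcement depends on bit b: split this leaf on it.
                self.split_bit = b
                self.children = (GNode(self.tested_bits + (b,)),
                                 GNode(self.tested_bits + (b,)))
                return
```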


Journal ArticleDOI
TL;DR: A more biologically plausible learning rule based on reinforcement learning is described and applied to the problem of how area 7a in the posterior parietal cortex of monkeys might represent visual space in head-centered coordinates, showing that a neural network does not require back-propagation to acquire biologically interesting properties.
Abstract: Many recent studies have used artificial neural network algorithms to model how the brain might process information. However, back-propagation learning, the method that is generally used to train these networks, is distinctly "unbiological." We describe here a more biologically plausible learning rule, using reinforcement learning, which we have applied to the problem of how area 7a in the posterior parietal cortex of monkeys might represent visual space in head-centered coordinates. The network behaves similarly to networks trained by using back-propagation and to neurons recorded in area 7a. These results show that a neural network does not require back propagation to acquire biologically interesting properties.

228 citations
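As an illustration of what a "more biologically plausible" update can look like, here is a minimal reward-modulated, perturbation-based sketch for a single linear unit that combines a retinal signal and an eye-position signal into a head-centred estimate: the weight change uses only local activity, exploratory noise, and a scalar reinforcement, with nothing backpropagated. This is my own simplification, not the network or learning rule actually used in the study.

```python
import random

ALPHA = 0.05
w = [random.uniform(-0.1, 0.1) for _ in range(2)]
r_baseline = 0.0

for trial in range(5000):
    retina_x = random.uniform(-1, 1)       # retinal position of the stimulus
    eye_x = random.uniform(-1, 1)          # eye-position signal
    target = retina_x + eye_x              # head-centred coordinate to be represented
    x = [retina_x, eye_x]
    noise = random.gauss(0.0, 0.1)         # exploratory perturbation of the unit
    y = sum(wi * xi for wi, xi in zip(w, x)) + noise
    r = -(y - target) ** 2                 # scalar reinforcement; no error vector
    for i in range(2):
        # Local, reward-modulated update: presynaptic activity x[i], the
        # perturbation, and the global reinforcement (minus a running baseline).
        w[i] += ALPHA * (r - r_baseline) * noise * x[i]
    r_baseline += 0.1 * (r - r_baseline)
```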


Proceedings ArticleDOI
26 Jun 1991
TL;DR: An emerging deeper understanding of neural network reinforcement learning methods is summarized that is obtained by viewing them as a synthesis of dynamic programming and stochastic approximation methods.
Abstract: Control problems can be divided into two classes: 1) regulation and tracking problems, in which the objective is to follow a reference trajectory, and 2) optimal control problems, in which the objective is to extremize a functional of the controlled system's behavior that is not necessarily defined in terms of a reference trajectory. Adaptive methods for problems of the first kind are well known, and include self-tuning regulators and model-reference methods, whereas adaptive methods for optimal-control problems have received relatively little attention. Moreover, the adaptive optimal-control methods that have been studied are almost all indirect methods, in which controls are recomputed from an estimated system model at each step. This computation is inherently complex, making adaptive methods in which the optimal controls are estimated directly more attractive. Here we present reinforcement learning methods as a computationally simple, direct approach to the adaptive optimal control of nonlinear systems.

197 citations
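The "synthesis of dynamic programming and stochastic approximation" view can be made concrete with the Q-learning update, which applies a Bellman-style backup to a single sampled transition instead of a full model. The toy chain environment below is mine, added only to make the sketch runnable.

```python
import random

GAMMA, ALPHA = 0.9, 0.1
Q = {}                                    # (state, action) -> estimated optimal return

def q(s, a):
    return Q.get((s, a), 0.0)

def q_update(s, a, r, s_next, actions):
    # Dynamic-programming backup target, but evaluated on one sampled transition...
    target = r + GAMMA * max(q(s_next, b) for b in actions)
    # ...blended in with a stochastic-approximation (Robbins-Monro) step.
    Q[(s, a)] = q(s, a) + ALPHA * (target - q(s, a))

# Toy usage: a 5-state chain where only reaching state 4 is rewarded.
ACTIONS = (-1, +1)
for episode in range(500):
    s = 0
    while s != 4:
        a = random.choice(ACTIONS)        # pure exploration, for the sketch only
        s_next = min(4, max(0, s + a))
        r = 1.0 if s_next == 4 else 0.0
        q_update(s, a, r, s_next, ACTIONS)
        s = s_next
```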


Proceedings Article
14 Jul 1991
TL;DR: This paper presents a general approach to making robots which can improve their performance from experiences as well as from being taught, and develops a simulated learning robot which could learn three moderately complex behaviors and use what was learned in the simulator to operate in the real world quite successfully.
Abstract: Programming robots is a tedious task, so there is growing interest in building robots which can learn by themselves. Self-improvement, however, which involves trial and error, is often a slow process and can be hazardous in a hostile environment. By teaching robots how tasks can be achieved, learning time can be shortened and hazards can be minimized. This paper presents a general approach to making robots which can improve their performance from experiences as well as from being taught. Based on this proposed approach and other learning speedup techniques, a simulated learning robot was developed that could learn three moderately complex behaviors, which were then integrated in a subsumption style so that the robot could navigate and recharge itself. Interestingly, a real robot could actually use what was learned in the simulator to operate in the real world quite successfully.

188 citations


Proceedings Article
14 Jul 1991
TL;DR: The search time complexity of two cooperative reinforcement learning algorithms and of unbiased Q-learning is analyzed for problem-solving tasks on a restricted class of state spaces, shedding light on the complexity of search in reinforcement learning in general and on the utility of cooperative mechanisms for reducing search.
Abstract: Reinforcement learning algorithms, when used to solve multi-stage decision problems, perform a kind of online (incremental) search to find an optimal decision policy. The time complexity of this search strongly depends upon the size and structure of the state space and upon a priori knowledge encoded in the learner's initial parameter values. When a priori knowledge is not available, search is unbiased and can be excessive. Cooperative mechanisms help reduce search by providing the learner with shorter latency feedback and auxiliary sources of experience. These mechanisms are based on the observation that in nature, intelligent agents exist in a cooperative social environment that helps structure and guide learning. Within this context, learning involves information transfer as much as it does discovery by trial-and-error. Two cooperative mechanisms are described: Learning with an External Critic (or LEC) and Learning By Watching (or LBW). The search time complexity of these algorithms, along with that of unbiased Q-learning, is analyzed for problem solving tasks on a restricted class of state spaces. The results indicate that while unbiased search can be expected to require time moderately exponential in the size of the state space, the LEC and LBW algorithms require at most time linear in the size of the state space and, under appropriate conditions, are independent of the state space size altogether, requiring time proportional to the length of the optimal solution path. While these analytic results apply only to a restricted class of tasks, they shed light on the complexity of search in reinforcement learning in general and on the utility of cooperative mechanisms for reducing search.

187 citations
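A rough sketch of the Learning-with-an-External-Critic idea as described above: ordinary Q-learning on the environment's delayed reward, plus a short-lived bias term fed by the critic's immediate feedback on recent actions. The way the bias is added and decayed here is my own guess at a reasonable mechanism, not the paper's exact rule.

```python
import random

GAMMA, ALPHA, BIAS_DECAY, EPS = 0.9, 0.2, 0.5, 0.1
Q = {}        # long-term action values learned from the environment's reward
bias = {}     # short-lived preferences induced by the external critic

def choose(state, actions):
    if random.random() < EPS:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0) + bias.get((state, a), 0.0))

def learn(state, action, env_reward, critic_reward, next_state, actions):
    # Standard Q-learning step on the (possibly delayed) environment reward.
    old = Q.get((state, action), 0.0)
    best_next = max(Q.get((next_state, b), 0.0) for b in actions)
    Q[(state, action)] = old + ALPHA * (env_reward + GAMMA * best_next - old)
    # The critic's immediate reward only biases behaviour temporarily: it is
    # added to the action preference and decays, rather than being folded
    # permanently into the value estimates.
    bias[(state, action)] = bias.get((state, action), 0.0) + critic_reward
    for key in list(bias):
        bias[key] *= BIAS_DECAY
```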


Proceedings Article
14 Feb 1991

181 citations


Proceedings Article
14 Jul 1991
TL;DR: Two algorithms for behavior learning are described that combine techniques for propagating reinforcement values temporally across actions and spatially across states; an experimental study suggests that a behavior-based architecture is better than a monolithic architecture for learning the box-pushing task.
Abstract: This paper describes a general approach for automatically programming a behavior-based robot. New behaviors are learned by trial and error using a performance feedback function as reinforcement. Two algorithms for behavior learning are described that combine techniques for propagating reinforcement values temporally across actions and spatially across states. A behavior-based robot called OBELIX is described that learns several component behaviors in an example task involving pushing boxes. An experimental study using the robot suggests two conclusions. One, the learning techniques are able to learn the individual behaviors, sometimes outperforming a hand-coded program. Two, using a behavior-based architecture is better than using a monolithic architecture for learning the box pushing task.

95 citations


Book ChapterDOI
01 Jun 1991
TL;DR: This chapter describes two cooperative learning algorithms that can reduce search and decouple the learning rate from state-space size; the first is based on the idea of a mentor who watches the learner and generates immediate rewards in response to its most recent actions.
Abstract: This chapter describes two cooperative learning algorithms that can reduce search and decouple the learning rate from state-space size. The first algorithm, called Learning with an External Critic (LEC), is based on the idea of a mentor who watches the learner and generates immediate rewards in response to its most recent actions. This reward is then used temporarily to bias the learner's control strategy. The second algorithm, called Learning By Watching (LBW), is based on the idea that an agent can gain experience vicariously by relating the observed behavior of others to its own. While LEC algorithms require interaction with knowledgeable agents, LBW algorithms can be effective even when interaction is with equally naive peers. The search time complexity is analyzed for pure unbiased Q-learning, LEC, and LBW algorithms for an important class of state spaces. Generally, the results indicate that unbiased Q-learning can have a search time that is exponential in the depth of the state space, while the LEC and LBW algorithms require at most time linear in the state space size and, under appropriate conditions, time independent of the state space size and proportional to the length of the optimal solution path. Homogeneous state spaces are useful for studying the scaling properties of reinforcement learning algorithms because they are analytically tractable.

87 citations



Book ChapterDOI
01 Jun 1991
TL;DR: This paper shows how these problems (infrequent rewards and the difficulty of encoding state history) are overcome by using a subsumption architecture: each module can be given its own simple reward function, and state history information can be easily encoded in a module's applicability predicate.
Abstract: Making robots learn complex tasks from reinforcement seems attractive, but a number of problems are encountered in practice. The learning converges slowly because rewards are infrequent, and it is difficult to find effective ways of encoding state history. This paper shows how these problems are overcome by using a subsumption architecture: each module can be given its own simple reward function, and state history information can be easily encoded in a module's applicability predicate. A real robot called OBELIX is described that learns several component behaviors in an example task involving pushing boxes. An experimental study demonstrates the feasibility of the subsumption-based approach, and its superiority to a monolithic architecture.
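A minimal sketch of the architecture described above: each behaviour module owns an applicability predicate, its own simple reward function, and its own Q-table, and a fixed priority ordering (subsumption style) decides which applicable module controls the robot on each step. The module granularity, constants, and arbitration rule below are illustrative assumptions, not OBELIX's exact design.

```python
import random

class BehaviourModule:
    """One behaviour in the decomposition, e.g. 'find a box' or 'push it'."""
    def __init__(self, name, applicable, reward_fn, actions,
                 alpha=0.2, gamma=0.9, eps=0.1):
        self.name, self.applicable, self.reward_fn = name, applicable, reward_fn
        self.actions, self.alpha, self.gamma, self.eps = actions, alpha, gamma, eps
        self.Q = {}

    def act(self, percept):
        if random.random() < self.eps:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q.get((percept, a), 0.0))

    def learn(self, percept, action, next_percept):
        r = self.reward_fn(percept, action, next_percept)   # module-local reward
        old = self.Q.get((percept, action), 0.0)
        best_next = max(self.Q.get((next_percept, a), 0.0) for a in self.actions)
        self.Q[(percept, action)] = old + self.alpha * (r + self.gamma * best_next - old)

def subsumption_step(modules, percept):
    # Modules are listed from highest to lowest priority; the first whose
    # applicability predicate holds gets control of the robot for this step.
    for m in modules:
        if m.applicable(percept):
            return m, m.act(percept)
    return None, None
```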

Proceedings Article
14 Feb 1991
TL;DR: This paper describes the learning agents and their performance, and summarizes the learning algorithms and the lessons I learned from this study.
Abstract: The purpose of this work is to investigate and evaluate different reinforcement learning frameworks using connectionist networks. I study four frameworks, which are adopted from the ideas developed by Rich Sutton and his colleagues. The four frameworks are based on two learning procedures: the Temporal Difference methods for solving the credit assignment problem, and the backpropagation algorithm for developing appropriate internal representations. Two of them also involve learning a world model and using it to speed learning. To evaluate their performance, I design a dynamic environment and implement different learning agents, using the different frameworks, to survive in it. The environment is nontrivial and nondeterministic. Surprisingly, all of the agents can learn to survive fairly well in a reasonable time frame. This paper describes the learning agents and their performance, and summarizes the learning algorithms and the lessons I learned from this study. This research was supported by NASA under Contract NAGW-1175. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of NASA.

Book ChapterDOI
01 Jan 1991
TL;DR: It is suggested that given a fixed amount of computational power available per control action, it may be better to use a direct reinforcement learning method augmented with indirect techniques than to devote all available resources to a computationally costly indirect method.
Abstract: Following terminology used in adaptive control, we distinguish between indirect learning methods, which learn explicit models of the dynamic structure of the system to be controlled, and direct learning methods, which do not. We compare an existing indirect method, which uses a conventional dynamic programming algorithm, with a closely related direct reinforcement learning method by applying both methods to an infinite horizon Markov decision problem with unknown state-transition probabilities. The simulations show that although the direct method requires much less space and dramatically less computation per control action, its learning ability in this task is superior to, or compares favorably with, that of the more complex indirect method. Although these results do not address how the methods’ performances compare as problems become more difficult, they suggest that given a fixed amount of computational power available per control action, it may be better to use a direct reinforcement learning method augmented with indirect techniques than to devote all available resources to a computationally costly indirect method. Comprehensive answers to the questions raised by this study depend on many factors making up the economic context of the computation.
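The contrast the chapter draws can be illustrated by the per-step computation of each approach: the direct method does one sampled Q-learning backup, while the indirect method updates an estimated transition and reward model and then runs dynamic-programming sweeps over it. This is a simplified sketch under my own choice of data structures, not the chapter's exact algorithms.

```python
from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.1

# --- Direct method: update the value estimate from the sample itself ---
Q = defaultdict(float)

def direct_update(s, a, r, s2, actions):
    target = r + GAMMA * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

# --- Indirect method: estimate a model, then run DP sweeps on the estimate ---
counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s2: visit count}
reward_sum = defaultdict(float)
V = defaultdict(float)

def indirect_update(s, a, r, s2, states, actions, sweeps=5):
    counts[(s, a)][s2] += 1
    reward_sum[(s, a)] += r
    for _ in range(sweeps):                      # value iteration on the learned model
        for x in states:
            backups = []
            for b in actions:
                n = sum(counts[(x, b)].values())
                if n == 0:
                    continue
                rbar = reward_sum[(x, b)] / n
                exp_next = sum(c / n * V[y] for y, c in counts[(x, b)].items())
                backups.append(rbar + GAMMA * exp_next)
            if backups:
                V[x] = max(backups)
```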

Book ChapterDOI
01 Jun 1991
TL;DR: A new method for learning to refine the control rules of approximate reasoning-based controllers that can use the control knowledge of an experienced operator and fine-tune it through the process of learning.
Abstract: Previous reinforcement learning models for learning control do not use existing knowledge of a physical system's behavior, but rather train the network from scratch. The learning process is usually long, and even after the learning is completed, the resulting network can not be easily explained. On the other hand, approximate reasoning-based controllers provide a clear understanding of the control strategy but can not learn from experience. In this paper, we introduce a new method for learning to refine the control rules of approximate reasoning-based controllers. A reinforcement learning technique is used in conjunction with a multi-layer neural network model of an approximate reasoning-based controller. The model learns by updating its prediction of the physical system's behavior. Unlike previous models, our model can use the control knowledge of an experienced operator and fine-tune it through the process of learning. We demonstrate the application of the new approach to a small but challenging real-world control problem.

Proceedings ArticleDOI
08 Jul 1991
TL;DR: It is pointed out that the genetic algorithms which have been shown to yield good performance for neural network weight optimization are really genetic hill-climbers, with a strong reliance on mutation rather than hyperplane sampling.
Abstract: It is pointed out that the genetic algorithms which have been shown to yield good performance for neural network weight optimization are really genetic hill-climbers, with a strong reliance on mutation rather than hyperplane sampling. Neural control problems are more appropriate for these genetic hill-climbers than supervised learning applications because in reinforcement learning applications gradient information is not directly available. Genetic reinforcement learning produces competitive results with the adaptive heuristic critic method, another reinforcement learning paradigm for neural networks that employs temporal difference methods. The genetic hill-climbing algorithm appears to be robust over a wide range of learning conditions.
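The "genetic hill-climber" characterization amounts to a mutation-driven search over weight vectors scored by episode return, with little or no crossover. A minimal sketch of that reading follows; the fitness function is a placeholder, since the real experiments score a network by running it as a controller and accumulating reinforcement.

```python
import random

def mutate(weights, rate=0.1, scale=0.5):
    """Mutation-dominated variation: perturb a few weights, no crossover."""
    return [w + random.gauss(0.0, scale) if random.random() < rate else w
            for w in weights]

def episode_return(weights):
    # Placeholder fitness: a real experiment would run the network as a
    # controller and return the accumulated reinforcement.
    return -sum((w - 1.0) ** 2 for w in weights)

def genetic_hill_climb(n_weights=30, generations=2000):
    best = [random.uniform(-1, 1) for _ in range(n_weights)]
    best_fit = episode_return(best)
    for _ in range(generations):
        child = mutate(best)
        fit = episode_return(child)
        if fit >= best_fit:                 # keep the child only if it is no worse
            best, best_fit = child, fit
    return best, best_fit
```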

Proceedings Article
02 Dec 1991
TL;DR: Experiments show that systems based on these principles can require less computation per time step and many fewer training sequences than conventional training algorithms for recurrent nets.
Abstract: Do you want your neural net algorithm to learn sequences? Do not limit yourself to conventional gradient descent (or approximations thereof). Instead, use your sequence learning algorithm (any will do) to implement the following method for history compression. No matter what your final goals are, train a network to predict its next input from the previous ones. Since only unpredictable inputs convey new information, ignore all predictable inputs but let all unexpected inputs (plus information about the time step at which they occurred) become inputs to a higher-level network of the same kind (working on a slower, self-adjusting time scale). Go on building a hierarchy of such networks. This principle reduces the descriptions of event sequences without loss of information, thus easing supervised or reinforcement learning tasks. Alternatively, you may use two recurrent networks to collapse a multi-level predictor hierarchy into a single recurrent net. Experiments show that systems based on these principles can require less computation per time step and many fewer training sequences than conventional training algorithms for recurrent nets. Finally you can modify the above method such that predictability is not defined in a yes-or-no fashion but in a continuous fashion.
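The filtering step at the heart of the method can be sketched as follows: a lower-level predictor guesses the next input, and only the inputs it gets wrong (paired with their time steps) are passed up to the next level. A frequency table stands in for the recurrent predictor network here, which is a deliberate simplification.

```python
from collections import defaultdict

class TablePredictor:
    """Stand-in for the lower-level recurrent net: predicts the next symbol
    from the previous one by simple frequency counting."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def predict(self, prev):
        options = self.counts[prev]
        return max(options, key=options.get) if options else None

    def update(self, prev, nxt):
        self.counts[prev][nxt] += 1

def compress_history(sequence):
    """Return the reduced sequence handed to the higher level: only the
    unpredicted symbols, each tagged with the time step it occurred at."""
    predictor, reduced, prev = TablePredictor(), [], None
    for t, symbol in enumerate(sequence):
        if predictor.predict(prev) != symbol:     # unexpected -> new information
            reduced.append((t, symbol))
        predictor.update(prev, symbol)
        prev = symbol
    return reduced

# e.g. compress_history("abababababcabab") keeps only the early steps and the 'c'.
```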

Proceedings Article
14 Jul 1991
TL;DR: The approach integrates cost-sensitive learning with reinforcement learning to learn an efficient internal state representation and a decision policy simultaneously in a finite, deterministic environment, maximizing the long-term discounted reward per action while reducing the average sensing cost per state.
Abstract: Standard reinforcement learning methods assume they can identify each state distinctly before making an action decision. In reality, a robot agent only has a limited sensing capability, and identifying each state by extensive sensing can be time consuming. This paper describes an approach that learns active perception strategies in reinforcement learning and considers sensing costs explicitly. The approach integrates cost-sensitive learning with reinforcement learning to learn an efficient internal state representation and a decision policy simultaneously in a finite, deterministic environment. It not only maximizes the long-term discounted reward per action but also reduces the average sensing cost per state. The initial experimental results in a simulated robot navigation domain are encouraging.
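One simple way to make sensing costs explicit, in the spirit of the abstract, is to charge each sensing action its cost as an immediate negative reward inside an otherwise ordinary Q-learning loop, so the learned policy senses only when the information gained pays for itself. The specific sensing actions, costs, and constants below are illustrative assumptions, not the paper's algorithm.

```python
import random

GAMMA, ALPHA, EPS = 0.9, 0.2, 0.1
SENSING_COST = {"sonar": 0.1, "camera": 0.5}      # hypothetical per-use costs
Q = {}

def choose(internal_state, actions):
    if random.random() < EPS:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((internal_state, a), 0.0))

def step_reward(action, task_reward):
    # Sensing actions pay their cost immediately; physical actions earn the
    # ordinary task reward. The trade-off is then handled by Q-learning itself.
    if action in SENSING_COST:
        return -SENSING_COST[action]
    return task_reward

def update(s, a, r, s2, actions):
    old = Q.get((s, a), 0.0)
    best = max(Q.get((s2, b), 0.0) for b in actions)
    Q[(s, a)] = old + ALPHA * (r + GAMMA * best - old)
```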

Book ChapterDOI
01 Jun 1991
TL;DR: Three extensions to the two basic learning algorithms are investigated and it is shown that the extensions can effectively improve the learning rate and in many cases even the asymptotic performance.
Abstract: AHC-learning and Q-learning are slow learning methods. This paper investigates three extensions to the two basic learning algorithms. The three extensions are 1) experience replay, 2) learning action models for planning, and 3) teaching. The basic algorithms and their extensions were evaluated using a dynamic environment as a testbed. The environment is nontrivial and nondeterministic. The results show that the extensions can effectively improve the learning rate and in many cases even the asymptotic performance.
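Of the three extensions, experience replay is the easiest to sketch: store past transitions (taught sequences can go into the same buffer) and repeatedly push them back through the ordinary Q-learning update, so each real interaction is learned from many times. The buffer size and replay count are placeholders, and the chapter's actual algorithms are richer (they also learn action models for planning).

```python
import random
from collections import deque

GAMMA, ALPHA = 0.9, 0.1
Q = {}
replay_buffer = deque(maxlen=10000)    # experiences from acting *and* from a teacher

def q_update(s, a, r, s2, actions):
    old = Q.get((s, a), 0.0)
    best = max(Q.get((s2, b), 0.0) for b in actions)
    Q[(s, a)] = old + ALPHA * (r + GAMMA * best - old)

def record_and_replay(experience, actions, n_replays=30):
    replay_buffer.append(experience)           # experience = (s, a, r, s2)
    q_update(*experience, actions)
    # Replay stored experiences, squeezing more learning out of each interaction.
    for _ in range(min(n_replays, len(replay_buffer))):
        q_update(*random.choice(list(replay_buffer)), actions)
```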

Book ChapterDOI
01 Jun 1991
TL;DR: A new learning algorithm and an architecture that allows transfer of learning by the "sharing" of solutions to the common parts of multiple tasks is presented.
Abstract: Most "weak" learning algorithms, including reinforcement learning methods, have been applied on tasks with single goals. The effort to build more sophisticated learning systems that operate in complex environments will require the ability to handle multiple goals. Methods that allow transfer of learning will play a crucial role in learning systems that support multiple goals. In this paper I describe a class of multiple tasks that represents a subset of routine animal activity. I present a new learning algorithm and an architecture that allows transfer of learning by the "sharing" of solutions to the common parts of multiple tasks. A proof of the algorithm is also provided.


Proceedings ArticleDOI
13 Oct 1991
TL;DR: A strong convergence theorem is presented that, under certain general conditions, implies a form of optimal performance of the SRV algorithm on ARL tasks; the algorithm builds on the pioneering work of A.G. Barto and P. Anandan (1985).
Abstract: The author describes an algorithm, called the stochastic real-valued (SRV) algorithm, that uses evaluative performance feedback to learn associative maps from input vectors to real-valued actions. This algorithm is based on the pioneering work of A.G. Barto and P. Anandan (1985) in synthesizing associative reinforcement learning (ARL) algorithms using techniques from pattern classification and automata theory. A strong convergence theorem is presented that, under certain general conditions, implies a form of optimal performance of the SRV algorithm on ARL tasks. Simulation results are presented to illustrate the convergence behavior of the algorithm under the conditions of the theorem. The robustness of the algorithm is also demonstrated by simulations in which some of the conditions of the theorem are violated.
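A condensed sketch of the SRV idea: the real-valued action is drawn from a Gaussian whose mean is a learned linear function of the input, the mean is nudged toward actions that earned more reinforcement than expected, and the exploration width shrinks as the expected reinforcement (assumed here to lie in [0, 1]) grows. The constants and the exact form of the width schedule are my simplifications.

```python
import random

ALPHA, BETA = 0.1, 0.1

class SRVUnit:
    """Stochastic real-valued unit: Gaussian action around a learned mean,
    with exploration width tied to how much reinforcement it expects."""
    def __init__(self, n_inputs):
        self.w = [0.0] * n_inputs       # parameters of the action mean
        self.v = [0.0] * n_inputs       # parameters of the reinforcement estimate

    def act(self, x):
        mu = sum(wi * xi for wi, xi in zip(self.w, x))
        r_hat = sum(vi * xi for vi, xi in zip(self.v, x))
        sigma = max(0.01, 1.0 - r_hat)  # explore less as expected reward rises
        action = mu + sigma * random.gauss(0.0, 1.0)
        return action, mu, r_hat, sigma

    def learn(self, x, action, mu, r_hat, sigma, r):
        # Move the mean toward actions that earned more than expected, and
        # update the reinforcement prediction itself.
        for i, xi in enumerate(x):
            self.w[i] += ALPHA * (r - r_hat) * ((action - mu) / sigma) * xi
            self.v[i] += BETA * (r - r_hat) * xi
```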

Book ChapterDOI
01 Jun 1991
TL;DR: This chapter describes how reinforcement learning techniques can take advantage of decomposing a reinforcement learning architecture into modules, where each module learns to solve a subgoal of the task, to improve both structural and temporal credit transfer.
Abstract: Complex tasks can often be decomposed into subgoals or chunks that can be solved individually. This chapter describes how reinforcement learning techniques can take advantage of this. By decomposing a reinforcement learning architecture into modules, where each module learns to solve a subgoal of the task, both structural and temporal credit transfer can be improved. The chapter also presents a variation of Q-learning that allows the modular architecture to reduce the effects of perceptual aliasing on reward estimation. Q-learning defines a mechanism for propagating estimates of expected utility backwards to the previous actions that resulted in this utility. However, it does not define a mechanism that allows the system to generalize, that is, to allow experience gained for one state to transfer to similar states. This is a serious problem because in naive representations the number of possible states often grows exponentially with the number of bits to be represented. Several approaches have been invented to deal with this scaling problem. Whitehead and Ballard have used indexical representations to avoid having to estimate a Q-value for each distinct world state and each action's possible variable binding. For more complicated tasks that involve generating sequences of actions, a robot's state vector might be devoted mostly to representing internal state, and only a small fraction used to represent perceptual data. In such situations, state vectors that seem very similar may represent very different situations and should not share estimates.

Proceedings Article
02 Dec 1991
TL;DR: A method is described for generating plan-like, reflexive, obstacle avoidance behaviour in a mobile robot that adapts its responses to sensory stimuli so as to minimise the negative reinforcement arising from collisions.
Abstract: A method is described for generating plan-like, reflexive, obstacle avoidance behaviour in a mobile robot. The experiments reported here use a simulated vehicle with a primitive range sensor. Avoidance behaviour is encoded as a set of continuous functions of the perceptual input space. These functions are stored using CMACs and trained by a variant of Barto and Sutton's adaptive critic algorithm. As the vehicle explores its surroundings it adapts its responses to sensory stimuli so as to minimise the negative reinforcement arising from collisions. Strategies for local navigation are therefore acquired in an explicitly goal-driven fashion. The resulting trajectories form elegant collision-free paths through the environment.
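A minimal sketch of the storage-and-training machinery described above: a small tile-coding (CMAC) approximator over a one-dimensional range reading, with a critic trained by a temporal-difference error in which collisions contribute negative reinforcement. The tiling layout, dimensionality, and constants are my own illustrative choices; the reported system encodes avoidance behaviour as continuous functions of the full perceptual input space and trains them with a variant of Barto and Sutton's adaptive critic.

```python
from collections import defaultdict

class CMAC:
    """Minimal tile-coding approximator: several offset tilings over a 1-D
    input, each contributing one weight to the output."""
    def __init__(self, n_tilings=8, tile_width=0.2):
        self.n_tilings, self.tile_width = n_tilings, tile_width
        self.weights = defaultdict(float)

    def tiles(self, x):
        for t in range(self.n_tilings):
            offset = t * self.tile_width / self.n_tilings
            yield (t, int((x + offset) // self.tile_width))

    def value(self, x):
        return sum(self.weights[tile] for tile in self.tiles(x))

    def update(self, x, error, alpha=0.1):
        for tile in self.tiles(x):
            self.weights[tile] += alpha / self.n_tilings * error

# Adaptive-critic-flavoured use (simplified): the critic evaluates a range
# reading and is trained by a TD error in which a collision delivers negative
# reinforcement; a controller would then steer toward higher-valued readings.
critic, GAMMA = CMAC(), 0.9

def critic_update(reading, next_reading, reinforcement):
    td_error = reinforcement + GAMMA * critic.value(next_reading) - critic.value(reading)
    critic.update(reading, td_error)
```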

01 May 1991
TL;DR: A class of problems is developed that isolates memory exploitation from other aspects of LCS behavior, showing that an LCS can form rule sets that exploit memory but does not form optimal rule sets because of a limitation in its allocation-of-credit scheme.
Abstract: Automated adaptation in a general setting remains a difficult and poorly understood problem. Reinforcement learning control problems model environments where an automated system must optimize a reinforcement signal by providing inputs to a black-box whose internal structure is initially unknown and persistently uncertain. Learning classifier systems (LCSs) are a class of rule-based systems for reinforcement learning control that use genetic algorithms (GAs) for rule discovery. Genetic algorithms are a class of computerized search procedures whose mechanics are based on natural genetics. This study examines two characteristic aspects of LCSs: default hierarchy formation and memory exploitation. Default hierarchies are sets of rules where the utilities of partially correct, but broadly applicable rules (defaults) are augmented by additional rules (exceptions). By forming default hierarchies, an LCS can store knowledge in parsimonious rule sets that can be incrementally refined. To do this, an LCS must have conflict resolution mechanisms that cause exceptions to consistently override defaults. This study examines typical LCS conflict resolution mechanisms and shows that they are inadequate in many situations. A new conflict resolution strategy called the priority tuning scheme is introduced. Experimentation shows that this scheme properly organizes default hierarchies in situations where traditional schemes fail. Analysis reveals that this technique greatly enlarges the class of exploitable default hierarchies. LCSs have the potential to adaptively exploit memory and extend their capabilities beyond simple stimulus-response behavior. This study develops a class of problems that isolate memory exploitation from other aspects of LCS behavior. Experiments show that an LCS can form rule sets that exploit memory. However, the LCS does not form optimal rule sets because of a limitation in its allocation of credit scheme. This study demonstrates this limitation and suggests an alternate scheme that automatically evolves multi-rule corporations as a remedy. Preliminary analysis illustrates the potential of this method. LCSs are a promising approach to reinforcement learning. This study has suggested several directions for refinement and improved understanding of LCSs. Further development of learning systems like LCSs should extend the applicability of automatic systems to tasks that currently require human intervention.

Book ChapterDOI
01 Jun 1991
TL;DR: The approach learns a task-dependent internal representation and a decision policy simultaneously in a finite, deterministic environment, maximizing the long-term discounted reward per action while reducing the average sensing cost per state.
Abstract: Standard reinforcement learning methods assume they can identify each state distinctly before making an action decision. In reality, a robot agent only has a limited sensing capability, and identifying each state by extensive sensing can be time consuming. This paper describes an approach that learns active perception strategies in reinforcement learning and considers sensing costs explicitly. The approach learns a task-dependent internal representation and a decision policy simultaneously in a finite, deterministic environment. It not only maximizes the long-term discounted reward per action but also reduces the average sensing cost per state. The initial experimental results in a simulated robot navigation domain are encouraging.

Proceedings ArticleDOI
19 Jun 1991
TL;DR: The authors present a machine learning approach, based on genetic algorithms and unsupervised reinforcement learning, to the generation and organisation of robot behaviour, and outline the implementation of an ethological model of behavioural organisation based on genetics-based machine learning.
Abstract: Behaviour-based robotics represents a different approach to modelling the interaction of an autonomous agent with its environment, hence providing the basis for the development of cognitive capabilities in artificially intelligent systems. The authors present a machine learning approach, based on genetic algorithms and unsupervised reinforcement learning, to the generation and organisation of robot behaviour. The implementation of an ethological model of behavioural organisation based on genetics-based machine learning is outlined.

Journal ArticleDOI
01 Jul 1991
TL;DR: A survey is presented of the state of the art in learning systems (automata and neural networks), which are of increasing importance in both theory and practice.
Abstract: A survey of the state of the art in learning systems (automata and neural networks), which are of increasing importance in both theory and practice, is presented. Learning systems are a response to engineering design problems arising from nonlinearities and uncertainty. Definitions and properties of learning systems are detailed. An analysis of the reinforcement schemes which are the heart of learning systems is given. Some results related to the asymptotic properties of learning automata are presented, as well as learning-system models that serve at the same time as the controller (optimiser) and as the controlled process (criterion to be optimised). Two learning schemes for neural network synthesis are presented. Several applications of learning systems are also described.
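As an example of the reinforcement schemes at the heart of such learning automata, here is the classic linear reward-inaction (L_R-I) update: a favourable environment response shifts probability mass toward the chosen action, while an unfavourable one leaves the probabilities unchanged. The interface and step size below are illustrative.

```python
import random

def lri_step(probs, env_response, a_rate=0.05):
    """One interaction of a variable-structure automaton under L_R-I.
    `probs` is the action-probability vector; `env_response(action)` returns
    True for a favourable environment response (a hypothetical interface)."""
    action = random.choices(range(len(probs)), weights=probs)[0]
    if env_response(action):                          # reward: shift mass toward `action`
        for i in range(len(probs)):
            if i == action:
                probs[i] += a_rate * (1.0 - probs[i])
            else:
                probs[i] -= a_rate * probs[i]
    # inaction: an unfavourable response leaves the probabilities unchanged
    return action, probs
```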

Proceedings ArticleDOI
26 Jun 1991
TL;DR: In this paper, a simple fixed control strategy has been developed, requiring no a priori dynamic model, which is then augmented using neural network learning, combining supervised learning, temporal difference learning, and reinforcement learning.
Abstract: This paper presents preliminary results of a study of the application of CMAC neural networks to the problem of biped walking with dynamic balance. A simple fixed control strategy has been developed, requiring no a priori dynamic model, which is then augmented using neural network learning. Standard supervised learning, temporal difference learning, and reinforcement learning are combined to train the neural network. Results of simulation studies using a simple two-dimensional simulation are presented. Random training using frequent sudden changes in desired velocity produced a robust controller able to track sudden changes in the desired velocity command, and able to rapidly adjust to unexpected disturbances.

Journal ArticleDOI
TL;DR: Associative learning is investigated using neural networks and concepts based on learning automata, and the extension of similar concepts to decentralized decision-making in a context space is introduced.
Abstract: Associative learning is investigated using neural networks and concepts based on learning automata. The behavior of a single decision-maker containing a neural network is studied in a random environment using reinforcement learning. The objective is to determine the optimal action corresponding to a particular state. Since decisions have to be made throughout the context space based on a countable number of experiments, generalization is inevitable. Many different approaches can be followed to generate the desired discriminant function. Three different methods which use neural networks are discussed and compared. In the most general method, the output of the network determines the probability with which one of the actions is to be chosen. The weights of the network are updated on the basis of the actions and the response of the environment. The extension of similar concepts to decentralized decision-making in a context space is also introduced. Simulation results are included. Modifications in the implementations of the most general method to make it practically viable are also presented. All the methods suggested are feasible and the choice of a specific method depends on the accuracy desired as well as on the available computational power.