
Showing papers on "Reinforcement learning published in 2005"


Journal ArticleDOI
TL;DR: This survey attempts to draw from multi-agent learning work in a spectrum of areas, including RL, evolutionary computation, game theory, complex systems, agent modeling, and robotics, and finds that this broad view leads to a division of the work into two categories.
Abstract: Cooperative multi-agent systems (MAS) are ones in which several agents attempt, through their interaction, to jointly solve tasks or to maximize utility. Due to the interactions among the agents, multi-agent problem complexity can rise rapidly with the number of agents or their behavioral sophistication. The challenge this presents to the task of programming solutions to MAS problems has spawned increasing interest in machine learning techniques to automate the search and optimization process. We provide a broad survey of the cooperative multi-agent learning literature. Previous surveys of this area have largely focused on issues common to specific subareas (for example, reinforcement learning (RL) or robotics). In this survey we attempt to draw from multi-agent learning work in a spectrum of areas, including RL, evolutionary computation, game theory, complex systems, agent modeling, and robotics. We find that this broad view leads to a division of the work into two categories, each with its own special issues: applying a single learner to discover joint solutions to multi-agent problems (team learning), or using multiple simultaneous learners, often one per agent (concurrent learning). Additionally, we discuss direct and indirect communication in connection with learning, plus open issues in task decomposition, scalability, and adaptive dynamics. We conclude with a presentation of multi-agent learning problem domains, and a list of multi-agent learning resources.

1,283 citations


Journal Article
TL;DR: Within this framework, several classical tree-based supervised learning methods and two newly proposed ensemble algorithms, namely extremely and totally randomized trees, are described; the ensemble methods based on regression trees are found to perform well in extracting relevant information about the optimal control policy from sets of four-tuples.
Abstract: Reinforcement learning aims to determine an optimal control policy from interaction with a system or from observations gathered from a system. In batch mode, it can be achieved by approximating the so-called Q-function based on a set of four-tuples (x_t, u_t, r_t, x_{t+1}) where x_t denotes the system state at time t, u_t the control action taken, r_t the instantaneous reward obtained and x_{t+1} the successor state of the system, and by determining the control policy from this Q-function. The Q-function approximation may be obtained from the limit of a sequence of (batch mode) supervised learning problems. Within this framework we describe the use of several classical tree-based supervised learning methods (CART, Kd-tree, tree bagging) and two newly proposed ensemble algorithms, namely extremely and totally randomized trees. We study their performances on several examples and find that the ensemble methods based on regression trees perform well in extracting relevant information about the optimal control policy from sets of four-tuples. In particular, the totally randomized trees give good results while ensuring the convergence of the sequence, whereas by relaxing the convergence constraint even better accuracy results are provided by the extremely randomized trees.

1,079 citations
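
As a concrete illustration of the batch-mode setup above, here is a minimal sketch of fitted Q iteration over a set of four-tuples, using scikit-learn's ExtraTreesRegressor as a stand-in for the paper's extremely randomized trees; the array layout, hyperparameters, and greedy-policy helper are illustrative assumptions rather than the authors' experimental setup.

```python
# Minimal fitted Q iteration sketch over a batch of four-tuples (x_t, u_t, r_t, x_{t+1}).
# ExtraTreesRegressor stands in for the extremely randomized trees studied in the paper;
# all array names and hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(X, U, R, X_next, actions, gamma=0.95, n_iters=50):
    """X: (N, d) states, U: (N,) discrete actions, R: (N,) rewards, X_next: (N, d)."""
    inputs = np.column_stack([X, U])            # regress Q on (state, action)
    targets = R.copy()                          # first iteration: Q_1 = expected reward
    model = None
    for _ in range(n_iters):
        model = ExtraTreesRegressor(n_estimators=50, min_samples_leaf=2)
        model.fit(inputs, targets)
        # Bellman backup: r + gamma * max_a' Q_k(x', a')
        q_next = np.column_stack([
            model.predict(np.column_stack([X_next, np.full(len(X_next), a)]))
            for a in actions
        ])
        targets = R + gamma * q_next.max(axis=1)
    return model

def greedy_action(model, x, actions):
    q = [model.predict(np.hstack([x, a]).reshape(1, -1))[0] for a in actions]
    return actions[int(np.argmax(q))]
```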


Book ChapterDOI
03 Oct 2005
TL;DR: NFQ, an algorithm for efficient and effective training of a Q-value function represented by a multi-layer perceptron, is introduced, and it is shown empirically that reasonably few interactions with the plant are needed to generate control policies of high quality.
Abstract: This paper introduces NFQ, an algorithm for efficient and effective training of a Q-value function represented by a multi-layer perceptron. Based on the principle of storing and reusing transition experiences, a model-free, neural network based Reinforcement Learning algorithm is proposed. The method is evaluated on three benchmark problems. It is shown empirically that reasonably few interactions with the plant are needed to generate control policies of high quality.

944 citations
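
The NFQ idea of storing transitions and repeatedly retraining on Bellman targets can be sketched as follows; scikit-learn's MLPRegressor is used here as a convenient stand-in for the Rprop-trained multi-layer perceptron of the paper, and the transition format and hyperparameters are assumptions.

```python
# Sketch of Neural Fitted Q iteration: batch retraining of an MLP on all stored transitions.
# MLPRegressor (default Adam training) stands in for the Rprop-trained net in the paper;
# the transition format and hyperparameters are assumptions for illustration.
import numpy as np
from sklearn.neural_network import MLPRegressor

def nfq(transitions, actions, gamma=0.95, n_iters=100):
    """transitions: list of (state, action, cost, next_state) with states as 1-D arrays."""
    S = np.array([s for s, a, c, s2 in transitions])
    A = np.array([a for s, a, c, s2 in transitions])
    C = np.array([c for s, a, c, s2 in transitions])
    S2 = np.array([s2 for s, a, c, s2 in transitions])
    X = np.column_stack([S, A])
    net = MLPRegressor(hidden_layer_sizes=(20, 20), max_iter=2000)
    targets = C.copy()                       # iteration 0: Q = immediate cost
    for _ in range(n_iters):
        net.fit(X, targets)
        q_next = np.column_stack([
            net.predict(np.column_stack([S2, np.full(len(S2), a)])) for a in actions
        ])
        targets = C + gamma * q_next.min(axis=1)   # NFQ minimizes path cost
    return net
```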


Journal ArticleDOI
Abstract: (2005). Information Theory, Inference, and Learning Algorithms. Journal of the American Statistical Association: Vol. 100, No. 472, pp. 1461-1462.

740 citations


Journal ArticleDOI
TL;DR: This model of reinforcement learning among cognitive strategies (RELACS) captures the 3 deviations, the learning curves, and the effect of information on uncertainty avoidance and outperforms other models in fitting the data and in predicting behavior in other experiments.
Abstract: Analysis of binary choice behavior in iterated tasks with immediate feedback reveals robust deviations from maximization that can be described as indications of 3 effects: (a) a payoff variability effect, in which high payoff variability seems to move choice behavior toward random choice; (b) underweighting of rare events, in which alternatives that yield the best payoffs most of the time are attractive even when they are associated with a lower expected return; and (c) loss aversion, in which alternatives that minimize the probability of losses can be more attractive than those that maximize expected payoffs. The results are closer to probability matching than to maximization. Best approximation is provided with a model of reinforcement learning among cognitive strategies (RELACS). This model captures the 3 deviations, the learning curves, and the effect of information on uncertainty avoidance. It outperforms other models in fitting the data and in predicting behavior in other experiments.

446 citations


Proceedings ArticleDOI
07 Aug 2005
TL;DR: An approach is presented in which supervised learning is first used to estimate depths from single monocular images, yielding an algorithm that learns monocular vision cues which accurately estimate the relative depths of obstacles in a scene.
Abstract: We consider the task of driving a remote control car at high speeds through unstructured outdoor environments. We present an approach in which supervised learning is first used to estimate depths from single monocular images. The learning algorithm can be trained either on real camera images labeled with ground-truth distances to the closest obstacles, or on a training set consisting of synthetic graphics images. The resulting algorithm is able to learn monocular vision cues that accurately estimate the relative depths of obstacles in a scene. Reinforcement learning/policy search is then applied within a simulator that renders synthetic scenes. This learns a control policy that selects a steering direction as a function of the vision system's output. We present results evaluating the predictive ability of the algorithm both on held out test data, and in actual autonomous driving experiments.

435 citations


Journal ArticleDOI
TL;DR: The application of episodic SMDP Sarsa(λ) with linear tile-coding function approximation and variable λ to learning higher-level decisions in a keepaway subtask of RoboCup soccer results in agents that significantly outperform a range of benchmark policies.
Abstract: RoboCup simulated soccer presents many challenges to reinforcement learning methods, including a large state space, hidden and uncertain state, multiple independent agents learning simultaneously, and long and variable delays in the effects of actions. We describe our application of episodic SMDP Sarsa(λ) with linear tile-coding function approximation and variable λ to learning higher-level decisions in a keepaway subtask of RoboCup soccer. In keepaway, one team, “the keepers,” tries to keep control of the ball for as long as possible despite the efforts of “the takers.” The keepers learn individually when to hold the ball and when to pass to a teammate. Our agents learned policies that significantly outperform a range of benchmark policies. We demonstrate the generality of our approach by applying it to a number of task variations including different field sizes and different numbers of players on each team.

430 citations
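
A compact sketch of Sarsa(λ) with linear tile-coding function approximation, in the spirit of the setup above; the simple grid tile coder below stands in for the CMAC-style coder used in keepaway, and the environment interface and all parameters are assumptions.

```python
# Sarsa(lambda) with linear function approximation over a simple tile coding.
# This grid tile coder is a simplified stand-in for the CMAC used in the keepaway work;
# the environment interface (env.reset/env.step) and all parameters are assumptions.
import numpy as np

class TileCoder:
    def __init__(self, lows, highs, n_tilings=8, n_tiles=8):
        self.lows, self.highs = np.array(lows, float), np.array(highs, float)
        self.n_tilings, self.n_tiles = n_tilings, n_tiles
        self.dim = len(lows)
        self.features_per_tiling = n_tiles ** self.dim

    def features(self, state):
        """Return active feature indices (one per tiling)."""
        scaled = (np.array(state) - self.lows) / (self.highs - self.lows) * self.n_tiles
        active = []
        for t in range(self.n_tilings):
            offset = t / self.n_tilings                      # diagonal tiling offsets
            coords = np.clip((scaled + offset).astype(int), 0, self.n_tiles - 1)
            idx = int(np.ravel_multi_index(coords, (self.n_tiles,) * self.dim))
            active.append(t * self.features_per_tiling + idx)
        return active

def sarsa_lambda(env, coder, n_actions, episodes=500,
                 alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.05):
    n_feat = coder.n_tilings * coder.features_per_tiling
    w = np.zeros((n_actions, n_feat))

    def q(s_feat, a):
        return w[a, s_feat].sum()

    def choose(s_feat):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax([q(s_feat, a) for a in range(n_actions)]))

    for _ in range(episodes):
        z = np.zeros_like(w)                                 # eligibility traces
        s_feat = coder.features(env.reset())
        a = choose(s_feat)
        done = False
        while not done:
            s2, r, done = env.step(a)
            delta = r - q(s_feat, a)
            z[a, s_feat] = 1.0                               # replacing traces
            if not done:
                s2_feat = coder.features(s2)
                a2 = choose(s2_feat)
                delta += gamma * q(s2_feat, a2)
            w += (alpha / coder.n_tilings) * delta * z
            z *= gamma * lam
            if not done:
                s_feat, a = s2_feat, a2
    return w
```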


Proceedings ArticleDOI
07 Aug 2005
TL;DR: A SARSA-based extension of GPTD, termed GPSARSA, is presented that allows the selection of actions and the gradual improvement of policies without requiring a world model.
Abstract: Gaussian Process Temporal Difference (GPTD) learning offers a Bayesian solution to the policy evaluation problem of reinforcement learning. In this paper we extend the GPTD framework by addressing two pressing issues, which were not adequately treated in the original GPTD paper (Engel et al., 2003). The first is the issue of stochasticity in the state transitions, and the second is concerned with action selection and policy improvement. We present a new generative model for the value function, deduced from its relation with the discounted return. We derive a corresponding on-line algorithm for learning the posterior moments of the value Gaussian process. We also present a SARSA based extension of GPTD, termed GPSARSA, that allows the selection of actions and the gradual improvement of policies without requiring a world-model.

402 citations
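
The GPTD generative model underlying this work can be illustrated in batch form: with r_t = v(x_t) − γ v(x_{t+1}) + noise, the rewards are a linear transformation of the latent values, so standard Gaussian-process conditioning yields the value posterior. The sketch below computes the posterior mean at the visited states; it omits the paper's online sparsification and the GPSARSA state-action extension, and the Gaussian kernel and simple noise model are illustrative assumptions.

```python
# Batch illustration of the GPTD value posterior: with r_t = v(x_t) - gamma*v(x_{t+1}) + noise,
# the rewards are a linear transformation H of the latent values, so GP conditioning applies.
# Online sparsification and the GPSARSA state-action extension from the paper are omitted;
# the kernel and noise level are illustrative assumptions.
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gptd_posterior_mean(states, rewards, gamma=0.95, sigma=0.1, lengthscale=1.0):
    """states: (T+1, d) visited states of one trajectory; rewards: (T,) observed rewards."""
    T = len(rewards)
    K = rbf_kernel(states, states, lengthscale)          # prior covariance of the values
    H = np.zeros((T, T + 1))
    for t in range(T):                                   # r_t = v(x_t) - gamma * v(x_{t+1}) + n_t
        H[t, t], H[t, t + 1] = 1.0, -gamma
    S = H @ K @ H.T + sigma ** 2 * np.eye(T)             # covariance of the observed rewards
    alpha = H.T @ np.linalg.solve(S, rewards)
    return K @ alpha                                     # posterior mean of v at the visited states
```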


Book ChapterDOI
01 Aug 2005
TL;DR: A novel reinforcement learning technique based on natural stochastic policy gradients allows a general approach to improving DMPs by trial-and-error learning with respect to almost arbitrary optimization criteria; the different ingredients of the DMP approach are demonstrated in various examples.
Abstract: This paper discusses a comprehensive framework for modular motor control based on a recently developed theory of dynamic movement primitives (DMP). DMPs are a formulation of movement primitives with autonomous nonlinear differential equations, whose time evolution creates smooth kinematic control policies. Model-based control theory is used to convert the outputs of these policies into motor commands. By means of coupling terms, on-line modifications can be incorporated into the time evolution of the differential equations, thus providing a rather flexible and reactive framework for motor planning and execution. The linear parameterization of DMPs lends itself naturally to supervised learning from demonstration. Moreover, the temporal, scale, and translation invariance of the differential equations with respect to these parameters provides a useful means for movement recognition. A novel reinforcement learning technique based on natural stochastic policy gradients allows a general approach of improving DMPs by trial and error learning with respect to almost arbitrary optimization criteria. We demonstrate the different ingredients of the DMP approach in various examples, involving skill learning from demonstration on the humanoid robot DB, and learning biped walking from demonstration in simulation, including self-improvement of the movement patterns towards energy efficiency through resonance tuning.

381 citations
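
A minimal numerical sketch of a single discrete DMP helps make the formulation concrete, showing the canonical and transformation systems with a locally weighted forcing term; the gains, basis placement, and weights below are illustrative assumptions rather than values from the paper.

```python
# Minimal integration of one discrete dynamic movement primitive (DMP):
# canonical system x' = -a_x * x and transformation system
#   tau * v' = a_z * (b_z * (g - y) - v) + f(x),   tau * y' = v,
# with a locally weighted forcing term f. Gains, basis placement, and the weights
# are illustrative assumptions, not values from the paper.
import numpy as np

def integrate_dmp(y0, g, weights, tau=1.0, dt=0.001, a_z=25.0, b_z=6.25, a_x=1.0):
    n_basis = len(weights)
    centers = np.exp(-a_x * np.linspace(0, 1, n_basis))   # basis centers spread in phase space
    widths = n_basis ** 1.5 / centers                     # common width heuristic
    y, v, x = y0, 0.0, 1.0
    trajectory = [y]
    for _ in range(int(tau / dt)):
        psi = np.exp(-widths * (x - centers) ** 2)
        f = (psi @ weights) / (psi.sum() + 1e-10) * x * (g - y0)   # forcing term
        v_dot = (a_z * (b_z * (g - y) - v) + f) / tau
        y_dot = v / tau
        x_dot = -a_x * x / tau
        v, y, x = v + v_dot * dt, y + y_dot * dt, x + x_dot * dt
        trajectory.append(y)
    return np.array(trajectory)

# With zero weights the DMP is a pure point attractor converging smoothly to the goal g.
traj = integrate_dmp(y0=0.0, g=1.0, weights=np.zeros(10))
```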


Journal ArticleDOI
TL;DR: The model suggests that long-term sensitivity enhancements to task-relevant or irrelevant stimuli occur as a result of timely interactions between diffused signals triggered by task performance and signals produced by stimulus presentation.

357 citations


Book ChapterDOI
03 Oct 2005
TL;DR: The Natural Actor-Critic is a model-free reinforcement learning architecture in which actor updates are based on stochastic policy gradients employing Amari's natural gradient approach, while the critic obtains both the natural policy gradient and additional parameters of a value function simultaneously by linear regression.
Abstract: This paper investigates a novel model-free reinforcement learning architecture, the Natural Actor-Critic. The actor updates are based on stochastic policy gradients employing Amari's natural gradient approach, while the critic obtains both the natural policy gradient and additional parameters of a value function simultaneously by linear regression. We show that actor improvements with natural policy gradients are particularly appealing as these are independent of the coordinate frame of the chosen policy representation, and can be estimated more efficiently than regular policy gradients. The critic makes use of a special basis function parameterization motivated by the policy-gradient compatible function approximation. We show that several well-known reinforcement learning methods such as the original Actor-Critic and Bradtke's Linear Quadratic Q-Learning are in fact Natural Actor-Critic algorithms. Empirical evaluations illustrate the effectiveness of our techniques in comparison to previous methods, and also demonstrate their applicability for learning control on an anthropomorphic robot arm.
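
An episodic flavor of the natural actor-critic can be sketched compactly: for a Gaussian policy with linear mean, regressing episode returns on the summed compatible features (the score function) recovers the natural gradient as the least-squares weight vector. The toy environment, policy form, and parameters below are assumptions, and the sketch omits the paper's LSTD-Q(λ) critic.

```python
# Episodic natural-gradient actor-critic sketch for a Gaussian policy with linear mean.
# The natural gradient is recovered as the least-squares solution w of
#   R_e ~ sum_t grad_theta log pi(a_t|s_t)^T w + baseline,
# i.e. a regression of episode returns on the summed compatible features.
# The toy 1-D environment, policy form, and all parameters are illustrative assumptions.
import numpy as np

def run_episode(theta, sigma=0.3, horizon=20):
    """Toy 1-D regulation task: the state drifts by the action, reward = -(s^2 + 0.1*a^2)."""
    s, ret, score = np.random.randn(), 0.0, np.zeros_like(theta)
    for _ in range(horizon):
        feats = np.array([s, 1.0])
        a = theta @ feats + sigma * np.random.randn()
        score += (a - theta @ feats) / sigma ** 2 * feats   # grad_theta log pi(a|s)
        ret += -(s ** 2 + 0.1 * a ** 2)
        s = 0.9 * s + a
    return score, ret

def natural_actor_critic(n_updates=200, episodes_per_update=30, lr=0.05):
    theta = np.zeros(2)
    for _ in range(n_updates):
        Psi, R = [], []
        for _ in range(episodes_per_update):
            score, ret = run_episode(theta)
            Psi.append(np.append(score, 1.0))               # last column: constant baseline
            R.append(ret)
        sol, *_ = np.linalg.lstsq(np.array(Psi), np.array(R), rcond=None)
        theta += lr * sol[:-1]                              # natural gradient step on the actor
    return theta
```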

Proceedings ArticleDOI
05 Dec 2005
TL;DR: Integral sliding mode and reinforcement learning control are presented as two design techniques for accommodating the nonlinear disturbances of outdoor altitude control; both result in greatly improved performance over classical control techniques.
Abstract: The Stanford Testbed of Autonomous Rotorcraft for Multi-Agent Control (STARMAC) is a multi-vehicle testbed currently comprised of two quadrotors, also called X4-flyers, with capacity for eight. This paper presents a comparison of control design techniques, specifically for outdoor altitude control, in and above ground effect, that accommodate the unique dynamics of the aircraft. Due to the complex airflow induced by the four interacting rotors, classical linear techniques failed to provide sufficient stability. Integral sliding mode and reinforcement learning control are presented as two design techniques for accommodating the nonlinear disturbances. The methods both result in greatly improved performance over classical control techniques.
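
A bare-bones integral sliding mode altitude loop for a point-mass vehicle model gives a feel for the first technique; the model, gains, disturbance, and boundary-layer width below are illustrative assumptions, not the STARMAC design.

```python
# Bare-bones integral sliding mode altitude controller for a point-mass model
#   m * z_ddot = u - m * g + d(t),   with d(t) an unknown bounded disturbance.
# Sliding surface: s = e_dot + lam * e + k_i * integral(e), with e = z_ref - z.
# The thrust adds a saturated switching term to a feedback-linearizing nominal term.
# Model, gains, and boundary-layer width are assumptions, not the STARMAC design.
import numpy as np

def sat(x, width):
    return np.clip(x / width, -1.0, 1.0)

def simulate(z_ref=1.0, m=0.6, g=9.81, lam=3.0, k_i=1.0, K=4.0, phi=0.1,
             dt=0.01, T=10.0):
    z, z_dot, e_int = 0.0, 0.0, 0.0
    log = []
    for k in range(int(T / dt)):
        e = z_ref - z
        e_dot = -z_dot                                   # z_ref is constant
        e_int += e * dt
        s = e_dot + lam * e + k_i * e_int
        # nominal (feedback-linearizing) term + switching term on the sliding surface
        u = m * (g + lam * e_dot + k_i * e) + K * sat(s, phi)
        d = 0.5 * np.sin(2 * np.pi * 0.2 * k * dt)       # bounded unmodeled disturbance
        z_ddot = (u - m * g + d) / m
        z_dot += z_ddot * dt
        z += z_dot * dt
        log.append(z)
    return np.array(log)
```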

Proceedings ArticleDOI
07 Aug 2005
TL;DR: This paper considers the apprenticeship learning setting in which a teacher demonstration of the task is available, and shows that, given the initial demonstration, no explicit exploration is necessary, and the student can attain near-optimal performance simply by repeatedly executing "exploitation policies" that try to maximize rewards.
Abstract: We consider reinforcement learning in systems with unknown dynamics. Algorithms such as E^3 (Kearns and Singh, 2002) learn near-optimal policies by using "exploration policies" to drive the system towards poorly modeled states, so as to encourage exploration. But this makes these algorithms impractical for many systems; for example, on an autonomous helicopter, overly aggressive exploration may well result in a crash. In this paper, we consider the apprenticeship learning setting in which a teacher demonstration of the task is available. We show that, given the initial demonstration, no explicit exploration is necessary, and we can attain near-optimal performance (compared to the teacher) simply by repeatedly executing "exploitation policies" that try to maximize rewards. In finite-state MDPs, our algorithm scales polynomially in the number of states; in continuous-state linear dynamical systems, it scales polynomially in the dimension of the state. These results are proved using a martingale construction over relative losses.
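
For the finite-state MDP case, the "exploit only" loop can be sketched as: fit a model to the teacher's demonstration plus the agent's own experience, compute a greedy policy for the fitted model, execute it, and refit. The environment and demonstration interfaces below are assumptions.

```python
# Finite-MDP sketch of the exploit-only apprenticeship loop: fit a model to the teacher's
# demonstration plus the agent's own experience, compute a greedy policy by value iteration,
# execute it, and refit. The env/demonstration interfaces are assumptions.
import numpy as np

def value_iteration(P, R, gamma=0.95, iters=500):
    """P: (S, A, S) transition probabilities, R: (S, A) rewards; returns a greedy policy."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = R + gamma * P @ V           # (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def apprenticeship_exploit(env, demo, n_states, n_actions, n_rounds=20,
                           horizon=100, gamma=0.95):
    counts = np.ones((n_states, n_actions, n_states)) * 1e-3   # smoothed transition counts
    r_sum = np.zeros((n_states, n_actions))
    r_cnt = np.zeros((n_states, n_actions)) + 1e-6

    def record(s, a, r, s2):
        counts[s, a, s2] += 1.0
        r_sum[s, a] += r
        r_cnt[s, a] += 1.0

    for (s, a, r, s2) in demo:                                  # teacher demonstration
        record(s, a, r, s2)
    for _ in range(n_rounds):
        P = counts / counts.sum(axis=2, keepdims=True)
        R = r_sum / r_cnt
        policy = value_iteration(P, R, gamma)                   # greedy exploitation policy
        s = env.reset()
        for _ in range(horizon):                                # run it, collect more data
            a = int(policy[s])
            s2, r, done = env.step(a)
            record(s, a, r, s2)
            s = s2
            if done:
                break
    return policy
```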

Journal ArticleDOI
TL;DR: A model-free, heuristic reinforcement learning algorithm is presented that aims at finding good deterministic policies by weighting the original value function against the risk; it was successfully applied to the control of a feed tank with stochastic inflows that lies upstream of a distillation column.
Abstract: In this paper, we consider Markov Decision Processes (MDPs) with error states. Error states are those states entering which is undesirable or dangerous. We define the risk with respect to a policy as the probability of entering such a state when the policy is pursued. We consider the problem of finding good policies whose risk is smaller than some user-specified threshold, and formalize it as a constrained MDP with two criteria. The first criterion corresponds to the value function originally given. We will show that the risk can be formulated as a second criterion function based on a cumulative return, whose definition is independent of the original value function. We present a model free, heuristic reinforcement learning algorithm that aims at finding good deterministic policies. It is based on weighting the original value function and the risk. The weight parameter is adapted in order to find a feasible solution for the constrained problem that has a good performance with respect to the value function. The algorithm was successfully applied to the control of a feed tank with stochastic inflows that lies upstream of a distillation column. This control task was originally formulated as an optimal control problem with chance constraints, and it was solved under certain assumptions on the model to obtain an optimal solution. The power of our learning algorithm is that it can be used even when some of these restrictive assumptions are relaxed.
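
One way to sketch the weighted two-criteria idea: learn the usual Q-function alongside a risk estimate ρ(s, a) for the probability of ever reaching an error state, act greedily on Q − ξρ, and adapt ξ toward the user-specified risk threshold. The interfaces, the ξ adaptation rule, and all parameters below are assumptions rather than the paper's exact algorithm.

```python
# Sketch of a risk-sensitive Q-learning variant: alongside Q, learn a risk estimate rho(s, a),
# the probability of ever entering an error state under the current policy, and act greedily
# on the weighted criterion Q - xi * rho. The weight xi is nudged up when the observed risk
# exceeds the threshold omega and down otherwise. Interfaces, the xi adaptation rule, and all
# parameters are illustrative assumptions, not the paper's exact algorithm.
import numpy as np

def risk_sensitive_q_learning(env, n_states, n_actions, error_states, omega=0.05,
                              episodes=2000, alpha=0.1, gamma=0.95, eps=0.1, xi_step=0.01):
    Q = np.zeros((n_states, n_actions))
    rho = np.zeros((n_states, n_actions))      # estimated probability of reaching an error state
    xi = 0.0                                   # weight on the risk criterion
    for _ in range(episodes):
        s, done, entered_error = env.reset(), False, False
        while not done:
            weighted = Q[s] - xi * rho[s]
            a = np.random.randint(n_actions) if np.random.rand() < eps else int(weighted.argmax())
            s2, r, done = env.step(a)
            is_error = s2 in error_states
            entered_error |= is_error
            a2 = int((Q[s2] - xi * rho[s2]).argmax())
            q_target = r if done else r + gamma * Q[s2, a2]
            risk_target = 1.0 if is_error else (0.0 if done else rho[s2, a2])
            Q[s, a] += alpha * (q_target - Q[s, a])
            rho[s, a] += alpha * (risk_target - rho[s, a])
            s = s2
        # crude stochastic adaptation of the risk weight toward the target risk level omega
        xi = max(0.0, xi + xi_step * (1.0 if entered_error else -omega))
    return Q, rho, xi
```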

Proceedings ArticleDOI
07 Aug 2005
TL;DR: A new subgoal-based method for automatically creating useful skills in reinforcement learning that identifies subgoals by partitioning local state transition graphs---those that are constructed using only the most recent experiences of the agent.
Abstract: We present a new subgoal-based method for automatically creating useful skills in reinforcement learning. Our method identifies subgoals by partitioning local state transition graphs---those that are constructed using only the most recent experiences of the agent. The local scope of our subgoal discovery method allows it to successfully identify the type of subgoals we seek---states that lie between two densely-connected regions of the state space while producing an algorithm with low computational cost.
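
The local-graph idea can be illustrated in a few lines with networkx: keep only the most recent transitions, build a directed graph from them, and flag states that lie between densely connected regions. Betweenness centrality is used below as a simple stand-in for the paper's partitioning-based measure; the window size and ranking cutoff are assumptions.

```python
# Sketch of local-graph subgoal discovery: build a graph from only the agent's most recent
# transitions and flag states that lie between densely connected regions. Betweenness
# centrality is a simple stand-in for the paper's graph-partitioning measure; the window
# size and top-k cutoff are illustrative assumptions.
from collections import deque
import networkx as nx

class LocalSubgoalDiscovery:
    def __init__(self, window=500, top_k=3):
        self.recent = deque(maxlen=window)     # only the most recent transitions are kept
        self.top_k = top_k

    def observe(self, s, s_next):
        self.recent.append((s, s_next))

    def candidate_subgoals(self):
        g = nx.DiGraph()
        g.add_edges_from(self.recent)          # local state transition graph
        if g.number_of_nodes() < 3:
            return []
        centrality = nx.betweenness_centrality(g)
        ranked = sorted(centrality, key=centrality.get, reverse=True)
        return ranked[: self.top_k]            # states most likely to act as "doorways"

# Usage: call observe(s, s') after every step and periodically query candidate_subgoals()
# to seed option/skill creation for the flagged states.
```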

Journal Article
TL;DR: It is found that the magnitude of ERPs after losing to the computer opponent predicted whether subjects would change decision behavior on the subsequent trial, and FRNs to decision outcomes were disproportionately larger over the motor cortex contralateral to the response hand that was used to make the decision.

Journal ArticleDOI
TL;DR: This work shows that this so-called credit assignment problem can be solved by a new role for attention in learning and shows that the new scheme, called attention-gated reinforcement learning (AGREL), is as efficient as supervised learning in classification tasks.
Abstract: Animal learning is associated with changes in the efficacy of connections between neurons. The rules that govern this plasticity can be tested in neural networks. Rules that train neural networks to map stimuli onto outputs are given by supervised learning and reinforcement learning theories. Supervised learning is efficient but biologically implausible. In contrast, reinforcement learning is biologically plausible but comparatively inefficient. It lacks a mechanism that can identify units at early processing levels that play a decisive role in the stimulus-response mapping. Here we show that this so-called credit assignment problem can be solved by a new role for attention in learning. There are two factors in our new learning scheme that determine synaptic plasticity: (1) a reinforcement signal that is homogeneous across the network and depends on the amount of reward obtained after a trial, and (2) an attentional feedback signal from the output layer that limits plasticity to those units at earlier processing levels that are crucial for the stimulus-response mapping. The new scheme is called attention-gated reinforcement learning (AGREL). We show that it is as efficient as supervised learning in classification tasks. AGREL is biologically realistic and integrates the role of feedback connections, attention effects, synaptic plasticity, and reinforcement learning signals into a coherent framework.
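
A highly simplified single-hidden-layer sketch in the spirit of AGREL: the output unit is selected stochastically, a global reward-prediction-error signal δ is broadcast, and hidden-layer plasticity is gated by feedback from the selected output unit only. The published model's expansive feedback function and other details are omitted; dimensions and learning rates are assumptions.

```python
# Highly simplified sketch in the spirit of AGREL: a stochastic winner-take-all output,
# a global reward-prediction-error signal delta, and hidden-layer plasticity gated by
# feedback from the selected output unit only. The published model's expansive feedback
# function and other details are omitted; dimensions and learning rates are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def agrel_like_train(X, labels, n_hidden=20, n_classes=3, beta=0.1, epochs=50):
    V = 0.1 * rng.standard_normal((n_hidden, X.shape[1]))   # input -> hidden weights
    W = 0.1 * rng.standard_normal((n_classes, n_hidden))    # hidden -> output weights
    for _ in range(epochs):
        for x, label in zip(X, labels):
            h = sigmoid(V @ x)
            logits = W @ h
            z = np.exp(logits - logits.max()); z /= z.sum() # output probabilities
            k = rng.choice(n_classes, p=z)                  # stochastic classification
            r = 1.0 if k == label else 0.0                  # global reward signal
            delta = r - z[k]                                # reward-prediction error
            W[k] += beta * delta * h                        # only the selected unit's weights
            # hidden plasticity gated by feedback from the selected output unit
            V += beta * delta * (W[k] * h * (1 - h))[:, None] * x[None, :]
    return V, W
```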

Journal ArticleDOI
TL;DR: This review compares methods for temporal sequence learning (TSL) across the disciplines machine-control, classical conditioning, neuronal models for TSL as well as spike-timing-dependent plasticity (STDP) and focuses on to what degree are reward-based and correlation-based learning related.
Abstract: In this review, we compare methods for temporal sequence learning (TSL) across the disciplines machine-control, classical conditioning, neuronal models for TSL as well as spike-timing-dependent plasticity (STDP). This review introduces the most influential models and focuses on two questions: To what degree are reward-based (e.g., TD learning) and correlation-based (Hebbian) learning related? and How do the different models correspond to possibly underlying biological mechanisms of synaptic plasticity? We first compare the different models in an open-loop condition, where behavioral feedback does not alter the learning. Here we observe that reward-based and correlation-based learning are indeed very similar. Machine control is then used to introduce the problem of closed-loop control (e.g., actor-critic architectures). Here the problem of evaluative (rewards) versus nonevaluative (correlations) feedback from the environment will be discussed, showing that both learning approaches are fundamentally different in the closed-loop condition. In trying to answer the second question, we compare neuronal versions of the different learning architectures to the anatomy of the involved brain structures (basal-ganglia, thalamus, and cortex) and the molecular biophysics of glutamatergic and dopaminergic synapses. Finally, we discuss the different algorithms used to model STDP and compare them to reward-based learning rules. Certain similarities are found in spite of the strongly different timescales. Here we focus on the biophysics of the different calcium-release mechanisms known to be involved in STDP.

Journal ArticleDOI
Alan Beggs1
TL;DR: This paper examines the convergence of payoffs and strategies in Erev and Roth's model of reinforcement learning and shows that it guarantees that the lim sup of a player's average payoffs is at least his minmax payoff.

Journal ArticleDOI
TL;DR: This paper examines methods for adapting the basis function during the learning process in the context of evaluating the value function under a fixed control policy using the Bellman approximation error as an optimization criterion.
Abstract: Reinforcement Learning (RL) is an approach for solving complex multi-stage decision problems that fall under the general framework of Markov Decision Problems (MDPs), with possibly unknown parameters. Function approximation is essential for problems with a large state space, as it facilitates compact representation and enables generalization. Linear approximation architectures (where the adjustable parameters are the weights of pre-fixed basis functions) have recently gained prominence due to efficient algorithms and convergence guarantees. Nonetheless, an appropriate choice of basis function is important for the success of the algorithm. In the present paper we examine methods for adapting the basis function during the learning process in the context of evaluating the value function under a fixed control policy. Using the Bellman approximation error as an optimization criterion, we optimize the weights of the basis function while simultaneously adapting the (non-linear) basis function parameters. We present two algorithms for this problem. The first uses a gradient-based approach and the second applies the Cross Entropy method. The performance of the proposed algorithms is evaluated and compared in simulations.
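
The gradient-based variant can be sketched as joint gradient descent on the empirical Bellman error with respect to both the linear weights and the Gaussian RBF centers, over a fixed batch of transitions gathered under the fixed policy; the Cross-Entropy variant is omitted, and the batch format, bandwidth, and step sizes are assumptions.

```python
# Gradient-based sketch of basis adaptation for policy evaluation: V(s) = sum_i w_i * phi_i(s)
# with Gaussian RBF features, and joint gradient descent on the mean Bellman error
#   E = mean_t ( r_t + gamma * V(s_{t+1}) - V(s_t) )^2
# with respect to both the weights w and the RBF centers. The paper's Cross-Entropy variant
# is omitted; batch format, bandwidth, and step sizes are illustrative assumptions.
import numpy as np

def rbf(S, centers, bandwidth):
    d2 = ((S[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / bandwidth ** 2)                 # (T, n_basis)

def adapt_basis(S, R, S_next, n_basis=10, gamma=0.95, bandwidth=0.5,
                lr_w=1e-2, lr_c=1e-3, iters=2000):
    dim = S.shape[1]
    centers = S[np.random.choice(len(S), n_basis)] + 0.01 * np.random.randn(n_basis, dim)
    w = np.zeros(n_basis)
    for _ in range(iters):
        phi, phi_next = rbf(S, centers, bandwidth), rbf(S_next, centers, bandwidth)
        delta = R + gamma * phi_next @ w - phi @ w            # Bellman residuals, (T,)
        # gradient of the mean squared residual w.r.t. w
        grad_w = 2 * (gamma * phi_next - phi).T @ delta / len(S)
        # gradient w.r.t. each center via d phi_i(s)/d c_i = phi_i(s) * (s - c_i) / bandwidth^2
        g_next = (phi_next * delta[:, None])[..., None] * (S_next[:, None, :] - centers[None])
        g_cur = (phi * delta[:, None])[..., None] * (S[:, None, :] - centers[None])
        grad_c = 2 * w[:, None] * (gamma * g_next - g_cur).sum(0) / bandwidth ** 2 / len(S)
        w -= lr_w * grad_w
        centers -= lr_c * grad_c
    return w, centers
```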

Journal ArticleDOI
TL;DR: This work considers Q-learning with function approximation for this setting and derives an upper bound on the generalization error in terms of quantities minimized by a Q- Learning algorithm, the complexity of the approximation space and an approximation term due to the mismatch between Q- learning and the goal of learning a policy that maximizes the value function.
Abstract: Planning problems that involve learning a policy from a single training set of finite horizon trajectories arise in both social science and medical fields. We consider Q-learning with function approximation for this setting and derive an upper bound on the generalization error. This upper bound is in terms of quantities minimized by a Q-learning algorithm, the complexity of the approximation space and an approximation term due to the mismatch between Q-learning and the goal of learning a policy that maximizes the value function.

Journal ArticleDOI
TL;DR: Encouraging results are provided that show the potential of RL for application to agent-based production scheduling in manufacturing systems, including three example cases in which the best dispatching rules have been previously defined.

Book ChapterDOI
28 Jan 2005

Journal ArticleDOI
TL;DR: This work considers the behavior of value-based learning agents in the multi-agent multi-armed bandit problem, and shows that such agents cannot generally play at a Nash equilibrium, although if smooth best responses are used, a Nash distribution can be reached.
Abstract: The single-agent multi-armed bandit problem can be solved by an agent that learns the values of each action using reinforcement learning. However, the multi-agent version of the problem, the iterated normal form game, presents a more complex challenge, since the rewards available to each agent depend on the strategies of the others. We consider the behavior of value-based learning agents in this situation, and show that such agents cannot generally play at a Nash equilibrium, although if smooth best responses are used, a Nash distribution can be reached. We introduce a particular value-based learning algorithm, which we call individual Q-learning, and use stochastic approximation to study the asymptotic behavior, showing that strategies will converge to a Nash distribution almost surely in 2-player zero-sum games and 2-player partnership games. Player-dependent learning rates are then considered, and it is shown that this extension converges in some games for which many algorithms, including the basic algorithm initially considered, fail to converge.
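
Individual Q-learning with a smooth (Boltzmann) best response can be demonstrated in a 2-player zero-sum game such as matching pennies, where empirical play settles near the 50/50 Nash distribution; the temperature and learning rate below are illustrative assumptions.

```python
# Individual Q-learning with a Boltzmann (smooth best response) policy in matching pennies.
# Each player only tracks Q-values for its own two actions and updates the played action
# from its own received payoff; empirical play settles near the 50/50 Nash distribution.
# Temperature and learning rate are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
payoff_row = np.array([[1, -1], [-1, 1]])       # matching pennies; the column player gets -payoff

def softmax(q, temperature):
    z = np.exp(q / temperature)
    return z / z.sum()

def individual_q_learning(rounds=50000, alpha=0.05, temperature=0.5):
    q_row, q_col = np.zeros(2), np.zeros(2)
    counts = np.zeros((2, 2))
    for _ in range(rounds):
        p_row, p_col = softmax(q_row, temperature), softmax(q_col, temperature)
        a_row = rng.choice(2, p=p_row)
        a_col = rng.choice(2, p=p_col)
        r = payoff_row[a_row, a_col]
        q_row[a_row] += alpha * (r - q_row[a_row])      # each player updates only its own action
        q_col[a_col] += alpha * (-r - q_col[a_col])
        counts[a_row, a_col] += 1
    return counts / rounds

print(individual_q_learning())   # empirical joint frequencies, roughly 0.25 each
```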

Journal ArticleDOI
TL;DR: In this article, the effect of the feedback mechanism on route-choice decision-making under uncertainty was discussed, and the experimental results were compared to the predictions of two static models (random utility maximization and cumulative prospect theory) and two dynamic models (stochastic fictitious play and reinforcement learning).
Abstract: This paper discusses the effect of the feedback mechanism on route-choice decision-making under uncertainty. Recent ITS (intelligent transportation systems) applications have highlighted the need for better models of the behavioral processes involved in travel decisions. However, travel behavior, and specifically route-choice decision-making, is usually modeled using normative models instead of descriptive models. Common route-choice models are based on the assumption of utility maximization. In this work, route-choice laboratory experiments and computer simulations were conducted in order to analyze route-choice behavior in iterative tasks with immediate feedback. The experimental results were compared to the predictions of two static models (random utility maximization and cumulative prospect theory) and two dynamic models (stochastic fictitious play and reinforcement learning). Based on the experimental results, it is shown that the higher the variance in travel times, the lower is the travelers’ sensitivity to travel time differences. These results are in conflict with the paradigm about travel time variability and risk-taking behavior. The empirical results may be explained by the payoff variability effect: high payoff variability seems to move choice behavior toward random choice.

Book
01 Jan 2005
TL;DR: This book presents an analysis of the XCS Classifier System and its applications in binary classification problems, reinforcement learning problems, and cognitive learning classifier systems.
Abstract: Contents: Prerequisites; Simple Learning Classifier Systems; The XCS Classifier System; How XCS Works: Ensuring Effective Evolutionary Pressures; When XCS Works: Towards Computational Complexity; Effective XCS Search: Building Block Processing; XCS in Binary Classification Problems; XCS in Multi-Valued Problems; XCS in Reinforcement Learning Problems; Facetwise LCS Design; Towards Cognitive Learning Classifier Systems; Summary and Conclusions.

Journal ArticleDOI
01 May 2005
TL;DR: This work evaluates an implementation of CRL in a routing protocol for mobile ad hoc networks, called SAMPLE, and shows how feedback in the selection of links by routing agents enables SAMPLE to adapt and optimize its routing behavior to varying network conditions and properties, resulting in optimization of network throughput.
Abstract: Designers face many system optimization problems when building distributed systems. Traditionally, designers have relied on optimization techniques that require either prior knowledge or centrally managed runtime knowledge of the system's environment, but such techniques are not viable in dynamic networks where topology, resource, and node availability are subject to frequent and unpredictable change. To address this problem, we propose collaborative reinforcement learning (CRL) as a technique that enables groups of reinforcement learning agents to solve system optimization problems online in dynamic, decentralized networks. We evaluate an implementation of CRL in a routing protocol for mobile ad hoc networks, called SAMPLE. Simulation results show how feedback in the selection of links by routing agents enables SAMPLE to adapt and optimize its routing behavior to varying network conditions and properties, resulting in optimization of network throughput. In the experiments, SAMPLE displays emergent properties such as traffic flows that exploit stable routes and reroute around areas of wireless interference or congestion. SAMPLE is an example of a complex adaptive distributed system.
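
The flavor of learned routing can be conveyed with a classic Q-routing-style update, in which each node keeps an estimated delivery time per (destination, neighbor) and updates it from the chosen neighbor's advertised best estimate plus the experienced link delay. This is a simple stand-in for SAMPLE's collaborative reinforcement learning, not the protocol itself; the topology, delays, and exploration scheme below are assumptions.

```python
# Q-routing-style sketch of learned packet routing: each node keeps Q[dest][neighbor],
# an estimate of remaining delivery time via that neighbor, and updates it from the chosen
# neighbor's advertised best estimate plus the experienced link delay. A classic stand-in
# for the flavor of SAMPLE's collaborative reinforcement learning, not the protocol itself;
# topology, delays, and epsilon-greedy exploration are illustrative assumptions.
import random
from collections import defaultdict

class QRoutingNode:
    def __init__(self, name, neighbors, alpha=0.5, epsilon=0.1):
        self.name, self.neighbors = name, neighbors
        self.alpha, self.epsilon = alpha, epsilon
        self.q = defaultdict(lambda: {n: 1.0 for n in neighbors})   # Q[dest][neighbor]

    def best_estimate(self, dest):
        return 0.0 if dest == self.name else min(self.q[dest].values())

    def choose_next_hop(self, dest):
        if random.random() < self.epsilon:
            return random.choice(self.neighbors)
        return min(self.q[dest], key=self.q[dest].get)

    def update(self, dest, neighbor, link_delay, neighbor_estimate):
        target = link_delay + neighbor_estimate
        self.q[dest][neighbor] += self.alpha * (target - self.q[dest][neighbor])

def route_packet(nodes, links, src, dest, max_hops=50):
    """nodes: dict name -> QRoutingNode; links[(a, b)] -> current delay of link a -> b."""
    current, hops = src, 0
    while current != dest and hops < max_hops:
        node = nodes[current]
        nxt = node.choose_next_hop(dest)
        delay = links[(current, nxt)]
        node.update(dest, nxt, delay, nodes[nxt].best_estimate(dest))
        current, hops = nxt, hops + 1
    return hops
```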

Journal ArticleDOI
TL;DR: An architectural modification to Soar is described that gives a Soar agent the opportunity to learn statistical information about the past success of its actions and utilize this information when selecting an operator.

Proceedings ArticleDOI
07 Aug 2005
TL;DR: The idea is to grow a sparse lookahead tree, intelligently, by exploiting information in a Bayesian posterior---rather than enumerate action branches (standard sparse sampling) or compensate myopically (value of perfect information).
Abstract: We present an efficient "sparse sampling" technique for approximating Bayes optimal decision making in reinforcement learning, addressing the well known exploration versus exploitation tradeoff. Our approach combines sparse sampling with Bayesian exploration to achieve improved decision making while controlling computational cost. The idea is to grow a sparse lookahead tree, intelligently, by exploiting information in a Bayesian posterior---rather than enumerate action branches (standard sparse sampling) or compensate myopically (value of perfect information). The outcome is a flexible, practical technique for improving action selection in simple reinforcement learning scenarios.
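
The Bayesian lookahead idea is easiest to see for Bernoulli arms with Beta posteriors: score each action by its expected immediate reward plus the value of the updated belief state, expanded to a fixed depth over posterior-predictive outcomes. With only two outcomes per arm the branches are enumerated exactly below rather than sampled sparsely as in the paper; priors and depth are illustrative assumptions.

```python
# Bayesian lookahead for Bernoulli bandits: each arm has a Beta(alpha, beta) posterior, and an
# action is scored by its expected immediate reward plus the value of the resulting belief
# state, expanded to a fixed depth over posterior-predictive outcomes. With two outcomes per
# arm the branches are enumerated exactly here rather than sampled sparsely as in the paper;
# priors and depth are illustrative assumptions.

def belief_value(belief, depth):
    """belief: tuple of (alpha, beta) pairs, one per arm; depth-limited lookahead value."""
    if depth == 0:
        return 0.0
    return max(action_value(belief, a, depth) for a in range(len(belief)))

def action_value(belief, arm, depth):
    alpha, beta = belief[arm]
    p_success = alpha / (alpha + beta)              # posterior-predictive success probability
    # successor beliefs after observing a success / failure on this arm
    b_succ = belief[:arm] + ((alpha + 1, beta),) + belief[arm + 1:]
    b_fail = belief[:arm] + ((alpha, beta + 1),) + belief[arm + 1:]
    return (p_success * (1.0 + belief_value(b_succ, depth - 1))
            + (1 - p_success) * belief_value(b_fail, depth - 1))

def bayes_lookahead_action(belief, depth=3):
    return max(range(len(belief)), key=lambda a: action_value(belief, a, depth))

# Usage: start from uniform Beta(1, 1) priors, act, then update the chosen arm's counts.
belief = ((1, 1), (1, 1), (1, 1))
print(bayes_lookahead_action(belief, depth=3))
```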

Proceedings ArticleDOI
07 Aug 2005
TL;DR: This paper considers the problem of learning in the presence of time-varying preferences among multiple objectives, using numeric weights to represent their importance, and proposes a method that allows us to store a finite number of policies, choose an appropriate policy for any weight vector and improve upon it.
Abstract: The current framework of reinforcement learning is based on maximizing the expected returns based on scalar rewards. But in many real world situations, tradeoffs must be made among multiple objectives. Moreover, the agent's preferences between different objectives may vary with time. In this paper, we consider the problem of learning in the presence of time-varying preferences among multiple objectives, using numeric weights to represent their importance. We propose a method that allows us to store a finite number of policies, choose an appropriate policy for any weight vector and improve upon it. The idea is that although there are infinitely many weight vectors, they may be well-covered by a small number of optimal policies. We show this empirically in two domains: a version of the Buridan's ass problem and network routing.
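
The storage-and-selection step can be illustrated directly: keep each learned policy together with its vector of per-objective expected returns, and for a new weight vector return the stored policy maximizing the weighted sum as the starting point for further improvement. The types and the improvement hook below are assumptions.

```python
# Sketch of the storage-and-selection step for time-varying objective weights: each stored
# policy carries its estimated value vector (one expected return per objective), and for a
# new weight vector we return the stored policy with the best weighted value as the starting
# point for further improvement. Types and the improvement hook are illustrative assumptions.
import numpy as np

class PolicyLibrary:
    def __init__(self):
        self.entries = []                      # list of (policy, value_vector) pairs

    def add(self, policy, value_vector):
        self.entries.append((policy, np.asarray(value_vector, dtype=float)))

    def select(self, weights):
        """Return the stored policy maximizing weights . value_vector."""
        weights = np.asarray(weights, dtype=float)
        scores = [weights @ v for _, v in self.entries]
        return self.entries[int(np.argmax(scores))][0]

# Usage: library.add(pi_a, [10.0, 2.0]); library.add(pi_b, [4.0, 9.0]);
# library.select([0.2, 0.8]) returns pi_b, which the learner can then improve for these weights.
```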