
Showing papers on "Markov decision process published in 2009"


Journal ArticleDOI
TL;DR: It is shown that a myopic policy that maximizes the immediate one-step reward is optimal when the state transitions are positively correlated over time; when the transitions are negatively correlated, the same policy remains optimal for two or three channels, and a counterexample is given for the case of four channels.
Abstract: This paper considers opportunistic communication over multiple channels where the state ("good" or "bad") of each channel evolves as independent and identically distributed (i.i.d.) Markov processes. A user, with limited channel sensing capability, chooses one channel to sense and decides whether to use the channel (based on the sensing result) in each time slot. A reward is obtained whenever the user senses and accesses a "good" channel. The objective is to design a channel selection policy that maximizes the expected total (discounted or average) reward accrued over a finite or infinite horizon. This problem can be cast as a partially observed Markov decision process (POMDP) or a restless multiarmed bandit process, to which optimal solutions are often intractable. This paper shows that a myopic policy that maximizes the immediate one-step reward is optimal when the state transitions are positively correlated over time. When the state transitions are negatively correlated, we show that the same policy is optimal when the number of channels is limited to two or three, while presenting a counterexample for the case of four channels. This result finds applications in opportunistic transmission scheduling in a fading environment, cognitive radio networks for spectrum overlay, and resource-constrained jamming and antijamming.
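
For illustration, here is a minimal simulation sketch of the myopic policy described above for two-state (Gilbert-Elliott) channels; the parameter values and the simulation loop are illustrative assumptions, not details taken from the paper.

    import numpy as np

    # Myopic sensing over N independent two-state channels: keep a belief
    # omega_i = P(channel i is "good"), sense the channel with the largest
    # belief, and update beliefs with the one-step transition probabilities
    # p11 = P(good -> good) and p01 = P(bad -> good).
    rng = np.random.default_rng(0)
    N, T = 4, 10_000
    p11, p01 = 0.8, 0.3                          # positively correlated case (p11 > p01)
    state = rng.random(N) < 0.5                  # true channel states, hidden from the user
    belief = np.full(N, p01 / (1 - p11 + p01))   # stationary prior

    total_reward = 0
    for t in range(T):
        k = int(np.argmax(belief))               # myopic choice: most likely "good" channel
        good = bool(state[k])
        total_reward += int(good)                # unit reward for accessing a good channel
        belief = p11 * belief + p01 * (1 - belief)   # one-step prediction for all channels
        belief[k] = p11 if good else p01             # exact update for the sensed channel
        stay, turn = rng.random(N) < p11, rng.random(N) < p01
        state = np.where(state, stay, turn)          # channels evolve independently

    print(f"average reward per slot: {total_reward / T:.3f}")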

416 citations


Journal ArticleDOI
TL;DR: The current state of the art for learning near-optimal behavior in finite Markov Decision Processes with a polynomial number of samples is summarized by presenting bounds for the problem in a unified theoretical framework.
Abstract: We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. These "PAC-MDP" algorithms include the well-known E3 and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current state-of-the-art by presenting bounds for the problem in a unified theoretical framework. A more refined analysis for upper and lower bounds is presented to yield insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX.
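
As a concrete reference point for the model-based algorithms surveyed here, the following is a rough sketch of R-MAX on a generic finite MDP; the environment interface (env.reset/env.step returning a next state and reward) and the "known-ness" threshold m are assumptions made for the example, not details from the paper.

    import numpy as np

    def value_iteration(P, R, gamma=0.95, iters=500):
        """P: (S, A, S) transition tensor, R: (S, A) reward matrix; returns Q-values."""
        V = np.zeros(R.shape[0])
        for _ in range(iters):
            Q = R + gamma * (P @ V)       # shape (S, A)
            V = Q.max(axis=1)
        return Q

    def rmax(env, S, A, m=10, r_max=1.0, gamma=0.95, steps=5000):
        counts = np.zeros((S, A))
        trans = np.zeros((S, A, S))
        rew_sum = np.zeros((S, A))
        s = env.reset()
        for _ in range(steps):
            known = counts >= m
            # Optimism: unknown (s, a) pairs get reward r_max and a self-loop.
            P = np.where(known[..., None], trans / np.maximum(counts[..., None], 1), 0.0)
            R = np.where(known, rew_sum / np.maximum(counts, 1), r_max)
            for i in range(S):
                for a in range(A):
                    if not known[i, a]:
                        P[i, a, i] = 1.0
            a = int(np.argmax(value_iteration(P, R, gamma)[s]))
            s2, r = env.step(a)
            if counts[s, a] < m:          # freeze statistics once a pair is "known"
                counts[s, a] += 1
                trans[s, a, s2] += 1
                rew_sum[s, a] += r
            s = s2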

289 citations


Journal ArticleDOI
TL;DR: A generic mathematical framework is proposed to characterize the policy for single hop transmission over a replenishable sensor network, and a Markov chain model is introduced to describe different modes of energy renewal.
Abstract: Energy harvesting from the working environment has received increasing attention in the research of wireless sensor networks. Recent developments in this area can be used to replenish the power supply of sensors. However, power management is still a crucial issue for such networks due to the uncertainty of stochastic replenishment. In this paper, we propose a generic mathematical framework to characterize the policy for single hop transmission over a replenishable sensor network. Firstly, we introduce a Markov chain model to describe different modes of energy renewal. Then, we derive the optimal transmission policy for sensors with different energy budgets. Depending on the energy status of a sensor and the reward for successfully transmitting a message, we prove the existence of optimal thresholds that maximize the average reward rate. Our results are quite general since the reward values can be made application-specific for different design objectives. Compared with the unconditional transmit-all policy, which transmits every message as long as the energy storage is positive, the proposed optimal transmission policy is shown to achieve significant gains in the average reward rate.
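
The threshold structure can be reproduced on a toy version of the problem. In the sketch below, a sensor with battery level e sees a message with a discretized random reward each slot and either transmits it (spending one energy unit) or drops it, while one unit of energy is harvested with probability q per slot; this Bernoulli harvesting is a crude stand-in for the paper's Markov-modulated renewal model, and all numbers are illustrative.

    import numpy as np

    E = 10                                   # battery capacity
    rewards = np.array([0.2, 1.0, 3.0])      # possible message rewards
    probs = np.array([0.5, 0.3, 0.2])        # probabilities of those rewards
    q, gamma = 0.4, 0.98                     # harvest probability, discount factor

    def backup(V):
        """One Bellman backup over states (energy level, current message reward)."""
        EV = V @ probs                       # expected next-slot value, per energy level
        newV = np.empty_like(V)
        transmit = np.zeros(V.shape, dtype=bool)
        for e in range(E + 1):
            drop = gamma * ((1 - q) * EV[e] + q * EV[min(e + 1, E)])
            for k, r in enumerate(rewards):
                tx = r + gamma * ((1 - q) * EV[e - 1] + q * EV[e]) if e > 0 else -np.inf
                transmit[e, k] = tx > drop
                newV[e, k] = max(tx, drop)
        return newV, transmit

    V = np.zeros((E + 1, len(rewards)))
    for _ in range(3000):
        V, policy = backup(V)

    # The optimal rule is a reward threshold that depends on the energy level.
    for e in range(E + 1):
        worth = rewards[policy[e]]
        print(f"energy {e:2d}: transmit if reward >= "
              f"{worth.min() if worth.size else float('inf')}")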

266 citations


Proceedings ArticleDOI
29 Jun 2009
TL;DR: A recommendation model based on reinforcement learning allows conversational systems to autonomously improve a fixed strategy and eventually learn a better one; a user study shows that the learned optimal strategy differs from the fixed one and supports more effective and efficient interaction sessions.
Abstract: Conversational recommender systems (CRSs) assist online users in their information-seeking and decision making tasks by supporting an interactive process. Although these processes could be rather diverse, CRSs typically follow a fixed strategy, e.g., based on critiquing or on iterative query reformulation. In a previous paper, we proposed a novel recommendation model that allows conversational systems to autonomously improve a fixed strategy and eventually learn a better one using reinforcement learning techniques. This strategy is optimal for the given model of the interaction and it is adapted to the users' behaviors. In this paper we validate our approach in an online CRS by means of a user study involving several hundreds of testers. We show that the optimal strategy is different from the fixed one, and supports more effective and efficient interaction sessions.

250 citations


Journal ArticleDOI
TL;DR: This work develops a column generation algorithm to solve the problem for a multinomial logit choice model with disjoint consideration sets (MNLD), and derives a bound as a by-product of a decomposition heuristic.
Abstract: We consider a network revenue management problem where customers choose among open fare products according to some prespecified choice model. Starting with a Markov decision process (MDP) formulation, we approximate the value function with an affine function of the state vector. We show that the resulting problem provides a tighter bound for the MDP value than the choice-based linear program. We develop a column generation algorithm to solve the problem for a multinomial logit choice model with disjoint consideration sets (MNLD). We also derive a bound as a by-product of a decomposition heuristic. Our numerical study shows the policies from our solution approach can significantly outperform heuristics from the choice-based linear program.
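
To make the affine approximation concrete, the sketch below sets up the generic approximate linear program for a discounted MDP with an affine value-function approximation and solves it with scipy; the paper's finite-horizon network revenue management formulation and its column generation procedure are not reproduced, so P, R, phi and c are placeholders.

    import numpy as np
    from scipy.optimize import linprog

    def affine_alp(P, R, phi, c, gamma=0.95):
        """Approximate LP with V(s) ~ theta_0 + theta . phi(s).
        P: (S, A, S) transitions, R: (S, A) rewards, phi: (S, d) features,
        c: (S,) nonnegative state-relevance weights."""
        S, A = R.shape
        basis = np.hstack([np.ones((S, 1)), phi])            # affine basis, (S, d+1)
        obj = c @ basis                                       # minimize sum_s c(s) V(s)
        # One constraint per (s, a): V(s) >= R(s,a) + gamma * E[V(s') | s, a],
        # rewritten as A_ub @ theta <= b_ub for linprog.
        A_ub = np.array([-(basis[s] - gamma * P[s, a] @ basis)
                         for s in range(S) for a in range(A)])
        b_ub = np.array([-R[s, a] for s in range(S) for a in range(A)])
        res = linprog(obj, A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * basis.shape[1], method="highs")
        return res.x                                          # theta_0, theta_1, ..., theta_d

Any feasible theta in this program gives a pointwise upper bound on the true value function, which is the general sense in which such approximations yield bounds on the MDP value.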

223 citations


Journal ArticleDOI
TL;DR: This work considers a Markov decision process (MDP) setting in which the reward function is allowed to change after each time step, yet the dynamics remain fixed, and provides efficient algorithms, which have regret bounds with no dependence on the size of the state space.
Abstract: We consider a Markov decision process (MDP) setting in which the reward function is allowed to change after each time step (possibly in an adversarial manner), yet the dynamics remain fixed. Similar to the experts setting, we address the question of how well an agent can do when compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which have regret bounds with no dependence on the size of the state space. Instead, these bounds depend only on a certain horizon time of the process and logarithmically on the number of actions.
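
A core ingredient in algorithms of this kind is to run a no-regret "experts" learner over the actions in every state. The sketch below keeps a multiplicative-weights (Hedge) learner per state; feeding it raw one-step rewards is a simplification of the Q-value-like quantities the actual algorithms construct from the fixed dynamics, and the class and parameter names are invented for the example.

    import numpy as np

    class PerStateHedge:
        """One multiplicative-weights (Hedge) learner over the actions of each state."""

        def __init__(self, n_states, n_actions, eta=0.1, seed=0):
            self.w = np.ones((n_states, n_actions))
            self.eta = eta
            self.rng = np.random.default_rng(seed)

        def act(self, s):
            p = self.w[s] / self.w[s].sum()
            return int(self.rng.choice(len(p), p=p))

        def update(self, s, payoff):
            # payoff[a]: estimated reward of action a in state s for this round
            # (the full algorithms feed in Q-value-like estimates built from the
            # known, fixed transition dynamics rather than raw one-step rewards).
            self.w[s] *= np.exp(self.eta * np.asarray(payoff))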

190 citations


Book ChapterDOI
27 Aug 2009
TL;DR: An algorithm is proposed that allows the agent to query the demonstrator for samples at specific states, instead of relying only on samples provided at "arbitrary" states, to estimate the reward function with similar accuracy as other methods from the literature while reducing the amount of policy samples required from the expert.
Abstract: Inverse reinforcement learning addresses the general problem of recovering a reward function from samples of a policy provided by an expert/demonstrator. In this paper, we introduce active learning for inverse reinforcement learning. We propose an algorithm that allows the agent to query the demonstrator for samples at specific states, instead of relying only on samples provided at "arbitrary" states. The purpose of our algorithm is to estimate the reward function with similar accuracy as other methods from the literature while reducing the amount of policy samples required from the expert. We also discuss the use of our algorithm in higher dimensional problems, using both Monte Carlo and gradient methods. We present illustrative results of our algorithm in several simulated examples of different complexities.

189 citations


Journal ArticleDOI
TL;DR: Comparisons with an existing heuristic from the literature and a lower bound computed with complete knowledge of customer demands show that the best partial reoptimization heuristics outperform this heuristic and are on average no more than 10%--13% away from this lower bound, depending on the type of instances.
Abstract: We consider the vehicle-routing problem with stochastic demands (VRPSD) under reoptimization. We develop and analyze a finite-horizon Markov decision process (MDP) formulation for the single-vehicle case and establish a partial characterization of the optimal policy. We also propose a heuristic solution methodology for our MDP, named partial reoptimization, based on the idea of restricting attention to a subset of all the possible states and computing an optimal policy on this restricted set of states. We discuss two families of computationally efficient partial reoptimization heuristics and illustrate their performance on a set of instances with up to and including 100 customers. Comparisons with an existing heuristic from the literature and a lower bound computed with complete knowledge of customer demands show that our best partial reoptimization heuristics outperform this heuristic and are on average no more than 10%--13% away from this lower bound, depending on the type of instances.

176 citations


Journal ArticleDOI
TL;DR: This work designs distributed spectrum sensing and access strategies for opportunistic spectrum access (OSA) under an energy constraint on secondary users that maximize the throughput of the secondary user during its battery lifetime and establishes threshold structures of the optimal policies.
Abstract: We design distributed spectrum sensing and access strategies for opportunistic spectrum access (OSA) under an energy constraint on secondary users. Both the continuous and the bursty traffic models are considered for different applications of the secondary network. In each slot, a secondary user sequentially decides whether to sense, where in the spectrum to sense, and whether to access. By casting this sequential decision-making problem in the framework of partially observable Markov decision processes, we obtain stationary optimal spectrum sensing and access policies that maximize the throughput of the secondary user during its battery lifetime. We also establish threshold structures of the optimal policies and study the fundamental tradeoffs involved in the energy-constrained OSA design. Numerical results are provided to investigate the impact of the secondary user's residual energy on the optimal spectrum sensing and access decisions.

175 citations


Journal ArticleDOI
TL;DR: There are at least two general theories for building probabilistic-dynamical systems: Markov theory and quantum theory. The decision about whether to use a Markov or a quantum model depends on which of these theories' laws are empirically obeyed in a given application.

170 citations


Proceedings Article
18 Jun 2009
TL;DR: An algorithm is provided that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP) where, in each episode, it picks a policy using regularization based on the span of the optimal bias vector.
Abstract: We provide an algorithm that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP). The algorithm proceeds in episodes where, in each episode, it picks a policy using regularization based on the span of the optimal bias vector. For an MDP with S states and A actions whose optimal bias vector has span bounded by H, we show a regret bound of O(HS√(AT)). We also relate the span to various diameter-like quantities associated with the MDP, demonstrating how our results improve on previous regret bounds.

Journal ArticleDOI
TL;DR: Algorithms for counterexample generation for probabilistic CTL formulae in discrete-time Markov chains are considered, together with a simple algorithm to generate (minimal) regular expressions that can act as counterexamples.
Abstract: Providing evidence for the refutation of a property is an essential, if not the most important, feature of model checking. This paper considers algorithms for counterexample generation for probabilistic CTL formulae in discrete-time Markov chains. Finding the strongest evidence (i.e., the most probable path) violating a (bounded) until-formula is shown to be reducible to a single-source (hop-constrained) shortest path problem. Counterexamples of smallest size that deviate most from the required probability bound can be obtained by applying (small amendments to) k-shortest (hop-constrained) paths algorithms. These results can be extended to Markov chains with rewards, to LTL model checking, and are useful for Markov decision processes. Experimental results show that typically the size of a counterexample is excessive. To obtain much more compact representations, we present a simple algorithm to generate (minimal) regular expressions that can act as counterexamples. The feasibility of our approach is illustrated by means of two communication protocols: leader election in an anonymous ring network and the Crowds protocol.
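
The reduction for the strongest evidence is straightforward to sketch: a path's probability is the product of its transition probabilities, so maximizing it is the same as minimizing the sum of -log(p) edge weights, i.e. a single-source shortest-path problem. The toy chain and function below are made up for illustration; hop bounds and the k-shortest-paths machinery for full counterexamples are omitted.

    import heapq, math

    def most_probable_path(P, source, targets):
        """P: dict state -> list of (successor, probability); targets: set of states.
        Dijkstra on edge weights -log(p) returns the most probable path."""
        dist, prev = {source: 0.0}, {}
        heap = [(0.0, source)]
        while heap:
            d, s = heapq.heappop(heap)
            if d > dist.get(s, math.inf):
                continue
            if s in targets:                    # first target popped = strongest evidence
                path = [s]
                while s != source:
                    s = prev[s]
                    path.append(s)
                return list(reversed(path)), math.exp(-d)
            for t, p in P.get(s, []):
                nd = d - math.log(p)
                if nd < dist.get(t, math.inf):
                    dist[t], prev[t] = nd, s
                    heapq.heappush(heap, (nd, t))
        return None, 0.0

    chain = {"s0": [("s1", 0.6), ("s2", 0.4)],
             "s1": [("bad", 0.1), ("s0", 0.9)],
             "s2": [("bad", 0.5), ("s0", 0.5)]}
    print(most_probable_path(chain, "s0", {"bad"}))   # (['s0', 's2', 'bad'], ~0.2)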

Book ChapterDOI
01 Jan 2009
TL;DR: The basic model of continuous-time MDPs and the concept of a Markov policy are stated in precise terms, and the basic optimality criteria of interest are introduced.
Abstract: In Chap. 2, we formally introduce the concepts associated to a continuous time MDP. Namely, the basic model of continuous-time MDPs and the concept of a Markov policy are stated in precise terms in Sect. 2.2. We also give, in Sect. 2.3, a precise definition of state and action processes in continuous-time MDPs, together with some fundamental properties of these two processes. Then, in Sect. 2.4, we introduce the basic optimality criteria that we are interested in.

Proceedings Article
Jaedeug Choi, Kee-Eung Kim
11 Jul 2009
TL;DR: This paper presents IRL algorithms for partially observable environments that can be modeled as a partially observable Markov decision process (POMDP) and deals with two cases according to the representation of the given expert's behavior.
Abstract: Inverse reinforcement learning (IRL) is the problem of recovering the underlying reward function from the behaviour of an expert. Most of the existing algorithms for IRL assume that the expert's environment is modeled as a Markov decision process (MDP), although they should be able to handle partially observable settings in order to widen the applicability to more realistic scenarios. In this paper, we present an extension of the classical IRL algorithm by Ng and Russell to partially observable environments. We discuss technical issues and challenges, and present the experimental results on some of the benchmark partially observable domains.
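
For context, the finite-MDP characterization behind Ng and Russell's classical algorithm, which this paper extends to partially observable settings, can be stated in one line. With a state-dependent reward vector R, transition matrix P_a for each action a, discount factor \gamma, and an expert that takes action a_1 in every state, the expert's policy is optimal exactly when

    (P_{a_1} - P_a)\,(I - \gamma P_{a_1})^{-1} R \succeq 0 \qquad \text{for every action } a .

The classical algorithm then searches, among reward vectors satisfying this condition, for one that maximizes the margin by which the expert's actions beat the alternatives, minus a sparsity penalty; in the partially observable case the role of states is played by beliefs or by nodes of a finite-state controller, depending on how the expert's behaviour is represented, which is what the extension has to handle.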

Journal ArticleDOI
TL;DR: An average cost method is introduced, patterned after the known discounted cost method; its convergence is proved for a range of constant stepsize choices, and the convergence rate is shown to be optimal within the class of temporal difference methods.
Abstract: We consider finite-state Markov decision processes, and prove convergence and rate of convergence results for certain least squares policy evaluation algorithms of the type known as LSPE(λ). These are temporal difference methods for constructing a linear function approximation of the cost function of a stationary policy, within the context of infinite-horizon discounted and average cost dynamic programming. We introduce an average cost method, patterned after the known discounted cost method, and we prove its convergence for a range of constant stepsize choices. We also show that the convergence rate of both the discounted and the average cost methods is optimal within the class of temporal difference methods. Analysis and experiment indicate that our methods are substantially and often dramatically faster than TD(λ), as well as more reliable.

Journal ArticleDOI
TL;DR: An efficient online algorithm is presented that ensures that the agent's average performance loss vanishes over time, provided that the environment is oblivious to the agent's actions.
Abstract: We consider a learning problem where the decision maker interacts with a standard Markov decision process, with the exception that the reward functions vary arbitrarily over time. We show that, against every possible realization of the reward process, the agent can perform as well---in hindsight---as every stationary policy. This generalizes the classical no-regret result for repeated games. Specifically, we present an efficient online algorithm---in the spirit of reinforcement learning---that ensures that the agent's average performance loss vanishes over time, provided that the environment is oblivious to the agent's actions. Moreover, it is possible to modify the basic algorithm to cope with instances where reward observations are limited to the agent's trajectory. We present further modifications that reduce the computational cost by using function approximation and that track the optimal policy through infrequent changes.

Journal ArticleDOI
TL;DR: In this article, the authors present an optimal policy iteration algorithm for solving DEC-POMDPs, which alternates between expanding the controller and performing value-preserving transformations.
Abstract: Coordination of distributed agents is required for problems arising in many areas, including multi-robot systems, networking and e-commerce. As a formal framework for such problems, we use the decentralized partially observable Markov decision process (DEC-POMDP). Though much work has been done on optimal dynamic programming algorithms for the single-agent version of the problem, optimal algorithms for the multiagent case have been elusive. The main contribution of this paper is an optimal policy iteration algorithm for solving DEC-POMDPs. The algorithm uses stochastic finite-state controllers to represent policies. The solution can include a correlation device, which allows agents to correlate their actions without communicating. This approach alternates between expanding the controller and performing value-preserving transformations, which modify the controller without sacrificing value. We present two efficient value-preserving transformations: one can reduce the size of the controller and the other can improve its value while keeping the size fixed. Empirical results demonstrate the usefulness of value-preserving transformations in increasing value while keeping controller size to a minimum. To broaden the applicability of the approach, we also present a heuristic version of the policy iteration algorithm, which sacrifices convergence to optimality. This algorithm further reduces the size of the controllers at each step by assuming that probability distributions over the other agents' actions are known. While this assumption may not hold in general, it helps produce higher quality solutions in our test problems.

Journal ArticleDOI
TL;DR: A new approximation method called nominal belief-state optimization (NBO), combined with other application-specific approximations and techniques within the POMDP framework, produces a practical design that coordinates the UAVs to achieve good long-term mean-squared-error tracking performance in the presence of occlusions and dynamic constraints.
Abstract: This paper discusses the application of the theory of partially observable Markov decision processes (POMDPs) to the design of guidance algorithms for controlling the motion of unmanned aerial vehicles (UAVs) with onboard sensors to improve tracking of multiple ground targets. While POMDP problems are intractable to solve exactly, principled approximation methods can be devised based on the theory that characterizes optimal solutions. A new approximation method called nominal belief-state optimization (NBO), combined with other application-specific approximations and techniques within the POMDP framework, produces a practical design that coordinates the UAVs to achieve good long-term mean-squared-error tracking performance in the presence of occlusions and dynamic constraints. The flexibility of the design is demonstrated by extending the objective to reduce the probability of a track swap in ambiguous situations.

Journal ArticleDOI
TL;DR: This work revisits Akamatsu's model by recasting it into a sum-over-paths statistical physics formalism allowing easy derivation of all the quantities of interest in an elegant, unified way and shows that the unique optimal policy can be obtained by solving a simple linear system of equations.
Abstract: This letter addresses the problem of designing the transition probabilities of a finite Markov chain (the policy) in order to minimize the expected cost for reaching a destination node from a source node while maintaining a fixed level of entropy spread throughout the network (the exploration). It is motivated by the following scenario. Suppose you have to route agents through a network in some optimal way, for instance, by minimizing the total travel cost---nothing particular up to now---you could use a standard shortest-path algorithm. Suppose, however, that you want to avoid pure deterministic routing policies in order, for instance, to allow some continual exploration of the network, avoid congestion, or avoid complete predictability of your routing strategy. In other words, you want to introduce some randomness or unpredictability in the routing policy (i.e., the routing policy is randomized). This problem, which will be called the randomized shortest-path problem (RSP), is investigated in this work. The global level of randomness of the routing policy is quantified by the expected Shannon entropy spread throughout the network and is provided a priori by the designer. Then, necessary conditions to compute the optimal randomized policy---minimizing the expected routing cost---are derived. Iterating these necessary conditions, reminiscent of Bellman's value iteration equations, allows computing an optimal policy, that is, a set of transition probabilities in each node. Interestingly and surprisingly enough, this first model, while formulated in a totally different framework, is equivalent to Akamatsu's model (1996), appearing in transportation science, for a special choice of the entropy constraint. We therefore revisit Akamatsu's model by recasting it into a sum-over-paths statistical physics formalism allowing easy derivation of all the quantities of interest in an elegant, unified way. For instance, it is shown that the unique optimal policy can be obtained by solving a simple linear system of equations. This second model is therefore more convincing because of its computational efficiency and soundness. Finally, simulation results obtained on simple, illustrative examples show that the models behave as expected.
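
A closely related computation can be sketched with a "soft" (log-sum-exp) variant of value iteration, in which a temperature theta plays the role of the Lagrange multiplier of the entropy constraint: large theta recovers the deterministic shortest path, while small theta spreads probability over many paths. This is only meant to illustrate the cost/entropy trade-off; the paper fixes the entropy level directly and obtains the optimum from a linear system via its sum-over-paths formalism. The graph and numbers below are made up.

    import numpy as np

    def soft_routing_policy(costs, dest, theta=1.0, sweeps=200):
        """costs: dict (u, v) -> nonnegative cost of directed edge u -> v."""
        nodes = {u for u, _ in costs} | {v for _, v in costs}
        succ = {u: [v for (a, v) in costs if a == u] for u in nodes}
        val = {n: 0.0 for n in nodes}                 # free-energy values, val[dest] = 0
        for _ in range(sweeps):                       # log-sum-exp ("soft") Bellman sweeps
            for u in nodes:
                if u == dest or not succ[u]:
                    continue
                z = sum(np.exp(-theta * (costs[(u, w)] + val[w])) for w in succ[u])
                val[u] = -np.log(z) / theta
        policy = {}
        for u in nodes:
            if u == dest or not succ[u]:
                continue
            z = np.array([np.exp(-theta * (costs[(u, w)] + val[w])) for w in succ[u]])
            policy[u] = dict(zip(succ[u], z / z.sum()))
        return val, policy

    edges = {("s", "a"): 1.0, ("s", "b"): 2.0, ("a", "t"): 2.0, ("b", "t"): 1.0}
    values, policy = soft_routing_policy(edges, "t", theta=1.0)
    print(policy["s"])    # the two equal-cost routes get equal probability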

Journal ArticleDOI
TL;DR: New graphical representations are developed for the problem of sequential decision making in partially observable multiagent environments, as formalized by interactive partially observable Markov decision processes (I-POMDPs), and the error bound of an approximation technique for solving them is discussed and demonstrated.
Abstract: We develop new graphical representations for the problem of sequential decision making in partially observable multiagent environments, as formalized by interactive partially observable Markov decision processes (I-POMDPs). The graphical models called interactive influence diagrams (I-IDs) and their dynamic counterparts, interactive dynamic influence diagrams (I-DIDs), seek to explicitly model the structure that is often present in real-world problems by decomposing the situation into chance and decision variables, and the dependencies between the variables. I-DIDs generalize DIDs, which may be viewed as graphical representations of POMDPs, to multiagent settings in the same way that I-POMDPs generalize POMDPs. I-DIDs may be used to compute the policy of an agent given its belief as the agent acts and observes in a setting that is populated by other interacting agents. Using several examples, we show how I-IDs and I-DIDs may be applied and demonstrate their usefulness. We also show how the models may be solved using the standard algorithms that are applicable to DIDs. Solving I-DIDs exactly involves knowing the solutions of possible models of the other agents. The space of models grows exponentially with the number of time steps. We present a method of solving I-DIDs approximately by limiting the number of other agents' candidate models at each time step to a constant. We do this by clustering models that are likely to be behaviorally equivalent and selecting a representative set from the clusters. We discuss the error bound of the approximation technique and demonstrate its empirical performance.

Journal ArticleDOI
TL;DR: The purpose of this article is to introduce the POMDP model to behavioral scientists who may wish to apply the framework to the problem of understanding normative behavior in experimental settings.

Journal ArticleDOI
TL;DR: This work describes an approach to adaptive sensing based on approximately solving a partially observable Markov decision process (POMDP) formulation of the problem, and describes a variety of approximation methods.
Abstract: Adaptive sensing involves actively managing sensor resources to achieve a sensing task, such as object detection, classification, and tracking, and represents a promising direction for new applications of discrete event system methods. We describe an approach to adaptive sensing based on approximately solving a partially observable Markov decision process (POMDP) formulation of the problem. Such approximations are necessary because of the very large state space involved in practical adaptive sensing problems, precluding exact computation of optimal solutions. We review the theory of POMDPs and show how the theory applies to adaptive sensing problems. We then describe a variety of approximation methods, with examples to illustrate their application in adaptive sensing. The examples also demonstrate the gains that are possible from nonmyopic methods relative to myopic methods, and highlight some insights into the dependence of such gains on the sensing resources and environment.

Journal ArticleDOI
TL;DR: A Markov decision process model is developed that simultaneously determines maintenance and production schedules for a multiple-product, single-machine production system, accounting for the fact that equipment condition can affect the yield of different product types differently.
Abstract: Traditionally, the problems of equipment maintenance scheduling and production scheduling in a multi-product environment have been treated independently. In this paper, we develop a Markov decision process model that simultaneously determines maintenance and production schedules for a multiple-product, single-machine production system, accounting for the fact that equipment condition can affect the yield of different product types differently. The problem was motivated by an application in semiconductor manufacturing. After examining structural properties of the optimal policy, we compare the combined method to an approach often used in practice. In the nearly 6,000 test problems studied, the reward from the combined method was on average more than 25 percent greater than the reward from the traditional method.
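
The flavour of the combined model can be conveyed with a small toy MDP: the machine condition degrades stochastically, each product's yield falls off with condition at a different rate, and maintenance resets the condition at a cost. The parameters below are invented for illustration and are not from the paper's semiconductor application.

    import numpy as np

    C = 4                                      # machine condition 0 (new) .. C (worst)
    degrade = 0.3                              # P(condition worsens by one step per period)
    price = np.array([4.0, 6.0])               # revenue per good unit, per product
    yields = np.array([[0.95, 0.90, 0.85, 0.80, 0.70],    # product 0: robust to wear
                       [0.98, 0.90, 0.75, 0.55, 0.30]])   # product 1: sensitive to wear
    maint_cost, gamma = 5.0, 0.95

    def q_values(V):
        """Q-values of: produce product 0, produce product 1, do maintenance."""
        Q = np.empty((C + 1, 3))
        for c in range(C + 1):
            nxt = (1 - degrade) * V[c] + degrade * V[min(c + 1, C)]
            Q[c, 0] = price[0] * yields[0, c] + gamma * nxt
            Q[c, 1] = price[1] * yields[1, c] + gamma * nxt
            Q[c, 2] = -maint_cost + gamma * V[0]
        return Q

    V = np.zeros(C + 1)
    for _ in range(1000):                      # value iteration
        V = q_values(V).max(axis=1)

    actions = ["produce product 0", "produce product 1", "maintain"]
    for c, a in enumerate(q_values(V).argmax(axis=1)):
        print(f"condition {c}: {actions[a]}")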

Journal ArticleDOI
TL;DR: Extensive numerical experiments show the value- and policy-approximation approaches to work well across a wide range of problem parameters, and to outperform the pooling-based heuristics in most cases.

Journal ArticleDOI
TL;DR: This article proposes an approach that translates an expressive subset of the PPDDL representation to a first-order MDP (FOMDP) specification and then derives a domain-independent policy without grounding at any intermediate step, and it presents proof-of-concept results of this approach.

Journal ArticleDOI
TL;DR: This paper proposes a framework of combining intrusion detection and continuous authentication in MANETs, where multimodal biometrics are used for continuous authentication, and intrusion detection is modeled as sensors to detect system security state.
Abstract: Two complementary classes of approaches exist to protect high security mobile ad hoc networks (MANETs), prevention-based approaches, such as authentication, and detection-based approaches, such as intrusion detection. Most previous work studies these two classes of issues separately. In this paper, we propose a framework of combining intrusion detection and continuous authentication in MANETs. In this framework, multimodal biometrics are used for continuous authentication, and intrusion detection is modeled as sensors to detect system security state. We formulate the whole system as a partially observed Markov decision process considering both system security requirements and resource constraints. We then use dynamic programming-based hidden Markov model scheduling algorithms to derive the optimal schemes for both intrusion detection and continuous authentication. Extensive simulations show the effectiveness of the proposed scheme.

Proceedings Article
19 Sep 2009
TL;DR: A technique is presented to explain policies for factored MDPs by populating a set of domain-independent templates, together with a mechanism to determine a minimal set of templates that, viewed together, completely justify the policy.
Abstract: Explaining policies of Markov Decision Processes (MDPs) is complicated due to their probabilistic and sequential nature. We present a technique to explain policies for factored MDPs by populating a set of domain-independent templates. We also present a mechanism to determine a minimal set of templates that, viewed together, completely justify the policy. Our explanations can be generated automatically at run-time with no additional effort required from the MDP designer. We demonstrate our technique using the problems of advising undergraduate students in their course selection and assisting people with dementia in completing the task of handwashing. We also evaluate our explanations for course-advising through a user study involving students.

Proceedings ArticleDOI
06 Jul 2009
TL;DR: A model combining a Markov decision process model and HTN planning to address Web services composition is proposed, and it is shown that the proposed approach works effectively.
Abstract: Automatic Web services composition can be achieved by using AI planning techniques. HTN planning has been adopted to handle the OWL-S Web service composition problem. However, existing composition methods based on HTN planning have not considered the choice of decompositions available to a problem, which can lead to a variety of valid solutions. In this paper, we propose a model that combines a Markov decision process model and HTN planning to address Web services composition. In the model, HTN planning is enhanced to decompose a task in multiple ways and hence be able to find more than one plan, taking both functional and non-functional properties into account. Furthermore, an evaluation method is given to choose the optimal plan, and experimental results illustrate that the proposed approach works effectively.

Proceedings ArticleDOI
18 May 2009
TL;DR: An optimal power management (OPM) used by a batch scheduler in a server farm is proposed and the result shows that with OPM the job waiting time can be maintained below the maximum threshold while the power consumption is much smaller than that without OPM.
Abstract: Green computing is a new paradigm of designing the computer system which considers not only the processing performance but also the energy efficiency. Power management is one of the approaches in green computing to reduce the power consumption in distributed computing system. In this paper, we first propose an optimal power management (OPM) used by a batch scheduler in a server farm. This OPM observes the state of a server farm and makes the decision to switch the operation mode (i.e., active or sleep) of the server to minimize the power consumption while the performance requirements are met. An optimization problem based on constrained Markov decision process (CMDP) is formulated and solved to obtain an optimal decision of OPM. Given that OPM is used in the server farm, then an assignment of users to the server farms by a job broker is considered. This assignment is to ensure that the cost due to power consumption and network transportation is minimized. The performance of the system is extensively evaluated. The result shows that with OPM the job waiting time can be maintained below the maximum threshold while the power consumption is much smaller than that without OPM.
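
For readers unfamiliar with CMDPs, the standard way such a formulation is solved is as a linear program over (discounted) state-action occupancy measures. The sketch below shows that generic LP; the server-farm specifics (active/sleep modes, queue dynamics, delay threshold) are left as placeholder arrays rather than the paper's actual model.

    import numpy as np
    from scipy.optimize import linprog

    def solve_cmdp(P, power, delay, mu, delay_budget, gamma=0.95):
        """Discounted CMDP as an LP over occupancy measures rho(s, a):
          minimize   sum_{s,a} rho(s,a) * power(s,a)
          subject to sum_a rho(s',a) - gamma * sum_{s,a} P(s'|s,a) rho(s,a) = mu(s'),
                     sum_{s,a} rho(s,a) * delay(s,a) <= delay_budget,  rho >= 0.
        P: (S, A, S), power/delay: (S, A), mu: (S,) initial distribution."""
        S, A = power.shape
        n = S * A
        A_eq = np.zeros((S, n))                # Bellman flow-conservation constraints
        for s in range(S):
            for a in range(A):
                j = s * A + a
                A_eq[s, j] += 1.0
                A_eq[:, j] -= gamma * P[s, a]
        res = linprog(power.reshape(n),
                      A_ub=delay.reshape(1, n), b_ub=[delay_budget],
                      A_eq=A_eq, b_eq=mu,
                      bounds=[(0, None)] * n, method="highs")
        rho = res.x.reshape(S, A)
        # The (possibly randomized) optimal policy is rho normalized per state.
        return rho / np.maximum(rho.sum(axis=1, keepdims=True), 1e-12)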

Journal ArticleDOI
TL;DR: The results indicate that these policies rarely differ, and that when they do, the difference in long-run mission reliability is minimal, which suggests that future work should concentrate on extending results for the single-mission problem.
Abstract: "Selective maintenance" models determine the optimal subset of desirable maintenance actions to perform when maintenance resources are constrained. We analyse a corrective selective maintenance model that identifies which components to replace in the finitely long periods of time between missions performed by a series-parallel system. We formulate this multi-mission problem as a stochastic dynamic program, and compare the resulting optimal infinite-horizon policy to both the optimal single-mission and two-mission policies by executing a large numerical experiment. Our results indicate that these policies rarely differ, and that when they do, the difference in long-run mission reliability is minimal, which suggests that future work should concentrate on extending results for the single-mission problem.