scispace - formally typeset
Search or ask a question

Showing papers in "Journal of Artificial Intelligence Research in 2000"


Journal ArticleDOI
TL;DR: The paper presents an online model-free learning algorithm, MAXQ-Q, and proves that it converges with probability 1 to a kind of locally-optimal policy known as a recursively optimal policy, even in the presence of the five kinds of state abstraction.
Abstract: This paper presents a new approach to hierarchical reinforcement learning based on decomposing the target Markov decision process (MDP) into a hierarchy of smaller MDPs and decomposing the value function of the target MDP into an additive combination of the value functions of the smaller MDPs. The decomposition, known as the MAXQ decomposition, has both a procedural semantics--as a subroutine hierarchy--and a declarative semantics--as a representation of the value function of a hierarchical policy. MAXQ unifies and extends previous work on hierarchical reinforcement learning by Singh, Kaelbling, and Dayan and Hinton. It is based on the assumption that the programmer can identify useful subgoals and define subtasks that achieve these subgoals. By defining such subgoals, the programmer constrains the set of policies that need to be considered during reinforcement learning. The MAXQ value function decomposition can represent the value function of any policy that is consistent with the given hierarchy. The decomposition also creates opportunities to exploit state abstractions, so that individual MDPs within the hierarchy can ignore large parts of the state space. This is important for the practical application of the method. This paper defines the MAXQ hierarchy, proves formal results on its representational power, and establishes five conditions for the safe use of state abstractions. The paper presents an online model-free learning algorithm, MAXQ-Q, and proves that it converges with probability 1 to a kind of locally-optimal policy known as a recursively optimal policy, even in the presence of the five kinds of state abstraction. The paper evaluates the MAXQ representation and MAXQ-Q through a series of experiments in three domains and shows experimentally that MAXQ-Q (with state abstractions) converges to a recursively optimal policy much faster than flat Q learning. The fact that MAXQ learns a representation of the value function has an important benefit: it makes it possible to compute and execute an improved, non-hierarchical policy via a procedure similar to the policy improvement step of policy iteration. The paper demonstrates the effectiveness of this nonhierarchical execution experimentally. Finally, the paper concludes with a comparison to related work and a discussion of the design tradeoffs in hierarchical reinforcement learning.

1,486 citations


Journal ArticleDOI
TL;DR: Under certain restrictions on the set of all hypothesis spaces available to the learner, it is shown that a hypothesis space that performs well on a sufficiently large number of training tasks will also perform well when learning novel tasks in the same environment.
Abstract: A major problem in machine learning is that of inductive bias: how to choose a learner's hypothesis space so that it is large enough to contain a solution to the problem being learnt, yet small enough to ensure reliable generalization from reasonably-sized training sets. Typically such bias is supplied by hand through the skill and insights of experts. In this paper a model for automatically learning bias is investigated. The central assumption of the model is that the learner is embedded within an environment of related learning tasks. Within such an environment the learner can sample from multiple tasks, and hence it can search for a hypothesis space that contains good solutions to many of the problems in the environment. Under certain restrictions on the set of all hypothesis spaces available to the learner, we show that a hypothesis space that performs well on a sufficiently large number of training tasks will also perform well when learning novel tasks in the same environment. Explicit bounds are also derived demonstrating that learning multiple tasks within an environment of related tasks can potentially give much better generalization than learning a single task.

1,084 citations


Journal ArticleDOI
TL;DR: This work surveys various approximation methods, analyzes their properties and relations and provides some new insights into their differences, and presents a number of new approximation methods and novel refinements of existing techniques.
Abstract: Partially observable Markov decision processes (POMDPs) provide an elegant mathematical framework for modeling complex decision and planning problems in stochastic domains in which states of the system are observable only indirectly, via a set of imperfect or noisy observations. The modeling advantage of POMDPs, however, comes at a price -- exact methods for solving them are computationally very expensive and thus applicable in practice only to very simple problems. We focus on efficient approximation (heuristic) methods that attempt to alleviate the computational problem and trade off accuracy for speed. We have two objectives here. First, we survey various approximation methods, analyze their properties and relations and provide some new insights into their differences. Second, we present a number of new approximation methods and novel refinements of existing techniques. The theoretical results are supported by experiments on a problem from the agent navigation domain.

583 citations


Journal ArticleDOI
Marilyn A. Walker1
TL;DR: A novel method by which a spoken dialogue system can learn to choose an optimal dialogue strategy from its experience interacting with human users, based on a combination of reinforcement learning and performance modeling of spoken dialogue systems.
Abstract: This paper describes a novel method by which a spoken dialogue system can learn to choose an optimal dialogue strategy from its experience interacting with human users. The method is based on a combination of reinforcement learning and performance modeling of spoken dialogue systems. The reinforcement learning component applies Q-learning (Watkins, 1989), while the performance modeling component applies the PARADISE evaluation framework (Walker et al., 1997) to learn the performance function (reward) used in reinforcement learning. We illustrate the method with a spoken dialogue system named elvis (EmaiL Voice Interactive System), that supports access to email over the phone. We conduct a set of experiments for training an optimal dialogue strategy on a corpus of 219 dialogues in which human users interact with elvis over the phone. We then test that strategy on a corpus of 18 dialogues. We show that elvis can learn to optimize its strategy selection for agent initiative, for reading messages, and for summarizing email folders.

231 citations


Journal ArticleDOI
Ke Xu1, Wei Li1
TL;DR: It is proved that phase transitions from a region where almost all problems are satisfiable to a region Where almost all Problems are unsatisfiable do exist for Model RB as the number of variables approaches infinity.
Abstract: In this paper we propose a new type of random CSP model, called Model RB, which is a revision to the standard Model B. It is proved that phase transitions from a region where almost all problems are satisfiable to a region where almost all problems are unsatisfiable do exist for Model RB as the number of variables approaches infinity. Moreover, the critical values at which the phase transitions occur are also known exactly. By relating the hardness of Model RB to Model B, it is shown that there exist a lot of hard instances in Model RB.

209 citations


Journal ArticleDOI
TL;DR: An adaptive importance sampling algorithm, AIS-BN, is proposed that shows promising convergence rates even under extreme conditions and seems to outperform the existing sampling algorithms consistently, and two heuristics for initialization of the importance function that are based on the theoretical properties of importance sampling in finite-dimensional integrals and the structural advantages of Bayesian networks.
Abstract: Stochastic sampling algorithms, while an attractive alternative to exact algorithms in very large Bayesian network models, have been observed to perform poorly in evidential reasoning with extremely unlikely evidence. To address this problem, we propose an adaptive importance sampling algorithm, AIS-BN, that shows promising convergence rates even under extreme conditions and seems to outperform the existing sampling algorithms consistently. Three sources of this performance improvement are (1) two heuristics for initialization of the importance function that are based on the theoretical properties of importance sampling in finite-dimensional integrals and the structural advantages of Bayesian networks, (2) a smooth learning method for the importance function, and (3) a dynamic weighting function for combining samples from different stages of the algorithm. We tested the performance of the AIS-BN algorithm along with two state of the art general purpose sampling algorithms, likelihood weighting (Fung & Chang, 1989; Shachter & Peot, 1989) and self-importance sampling (Shachter & Peot, 1989). We used in our tests three large real Bayesian network models available to the scientific community: the CPCS network (Pradhan et al., 1994), the PATHFINDER network (Heckerman, Horvitz, & Nathwani, 1990), and the ANDES network (Conati, Gertner, VanLehn, & Druzdzel, 1997), with evidence as unlikely as 10-41. While the AIS-BN algorithm always performed better than the other two algorithms, in the majority of the test cases it achieved orders of magnitude improvement in precision of the results. Improvement in speed given a desired precision is even more dramatic, although we are unable to report numerical results here, as the other algorithms almost never achieved the precision reached even by the first few iterations of the AIS-BN algorithm.

208 citations


Journal ArticleDOI
TL;DR: This work reduces the complexity of reasoning with cardinality restrictions to reasoning with the (in general weaker) terminological formalism of general axioms for ALCQ extended with nominals, and shows that pure concept satisfiability for A LCQI with Nominals is NExpTime-complete.
Abstract: We study the complexity of the combination of the Description Logics ALCQ and ALCQI with a terminological formalism based on cardinality restrictions on concepts. These combinations can naturally be embedded into C2, the two variable fragment of predicate logic with counting quantifiers, which yields decidability in NExpTime. We show that this approach leads to an optimal solution for ALCQI, as ALCQI with cardinality restrictions has the same complexity as C2 (NExpTime-complete). In contrast, we show that for ALCQ, the problem can be solved in ExpTime. This result is obtained by a reduction of reasoning with cardinality restrictions to reasoning with the (in general weaker) terminological formalism of general axioms for ALCQ extended with nominals. Using the same reduction, we show that, for the extension of ALCQI with nominals, reasoning with general axioms is a NExpTime-complete problem. Finally, we sharpen this result and show that pure concept satisfiability for ALCQI with nominals is NExpTime-complete. Without nominals, this problem is known to be PSpace-complete.

192 citations


Journal ArticleDOI
TL;DR: It is shown that to reason about causality in the most general third class, it must extend the language used by Galles and Pearl.
Abstract: Causal models defined in terms of a collection of equations, as defined by Pearl, are axiomatized here. Axiomatizations are provided for three successively more general classes of causal models: (1) the class of recursive theories (those without feedback), (2) the class of theories where the solutions to the equations are unique, (3) arbitrary theories (where the equations may not have solutions and, if they do, they are not necessarily unique). It is shown that to reason about causality in the most general third class, we must extend the language used by Galles and Pearl (1997, 1998). In addition, the complexity of the decision procedures is characterized for all the languages and classes of models considered.

191 citations


Journal ArticleDOI
TL;DR: In this article, the authors present a general planning algorithm for conformant planning, which applies to fully nondeterministic domains, with uncertainty in the initial condition and in action effects, based on the representation of the planning domain as a finite state automaton.
Abstract: We tackle the problem of planning in nondeterministic domains, by presenting a new approach to conformant planning. Conformant planning is the problem of finding a sequence of actions that is guaranteed to achieve the goal despite the nondeterminism of the domain. Our approach is based on the representation of the planning domain as a finite state automaton. We use Symbolic Model Checking techniques, in particular Binary Decision Diagrams, to compactly represent and efficiently search the automaton. In this paper we make the following contributions. First, we present a general planning algorithm for conformant planning, which applies to fully nondeterministic domains, with uncertainty in the initial condition and in action effects. The algorithm is based on a breadth-first, backward search, and returns conformant plans of minimal length, if a solution to the planning problem exists, otherwise it terminates concluding that the problem admits no conformant solution. Second, we provide a symbolic representation of the search space based on Binary Decision Diagrams (BDDs), which is the basis for search techniques derived from symbolic model checking. The symbolic representation makes it possible to analyze potentially large sets of states and transitions in a single computation step, thus providing for an efficient implementation. Third, we present CMBP (Conformant Model Based Planner), an efficient implementation of the data structures and algorithm described above, directly based on BDD manipulations, which allows for a compact representation of the search layers and an efficient implementation of the search steps. Finally, we present an experimental comparison of our approach with the state-of-the-art conformant planners CGP, QBFPLAN and GPT. Our analysis includes all the planning problems from the distribution packages of these systems, plus other problems defined to stress a number of specific factors. Our approach appears to be the most effective: CMBP is strictly more expressive than QBFPLAN and CGP and, in all the problems where a comparison is possible, CMBP outperforms its competitors, sometimes by orders of magnitude.

147 citations


Journal ArticleDOI
TL;DR: The paper formally defines and discusses two different goal orderings, which are called the reasonable and the forced ordering, and shows how the ordering relations can be used to compute a so-called goal agenda that divides G into an ordered set of subgoals.
Abstract: The paper addresses the problem of computing goal orderings, which is one of the longstanding issues in AI planning. It makes two new contributions. First, it formally defines and discusses two different goal orderings, which are called the reasonable and the forced ordering. Both orderings are defined for simple STRIPS operators as well as for more complex ADL operators supporting negation and conditional effects. The complexity of these orderings is investigated and their practical relevance is discussed. Secondly, two different methods to compute reasonable goal orderings are developed. One of them is based on planning graphs, while the other investigates the set of actions directly. Finally, it is shown how the ordering relations, which have been derived for a given set of goals G, can be used to compute a so-called goal agenda that divides G into an ordered set of subgoals. Any planner can then, in principle, use the goal agenda to plan for increasing sets of subgoals. This can lead to an exponential complexity reduction, as the solution to a complex planning problem is found by solving easier subproblems. Since only a polynomial overhead is caused by the goal agenda computation, a potential exists to dramatically speed up planning algorithms as we demonstrate in the empirical evaluation, where we use this method in the IPP planner.

146 citations


Journal ArticleDOI
TL;DR: In this paper, the expressiveness of a large family of propositional planning formalisms, ranging from basic STRIPS to a formalism with conditional effects, partial state specifications, and propositional formulae in the preconditions, is analyzed.
Abstract: The recent approaches of extending the GRAPHPLAN algorithm to handle more expressive planning formalisms raise the question of what the formal meaning of "expressive power" is. We formalize the intuition that expressive power is a measure of how concisely planning domains and plans can be expressed in a particular formalism by introducing the notion of "compilation schemes" between planning formalisms. Using this notion, we analyze the expressiveness of a large family of propositional planning formalisms, ranging from basic STRIPS to a formalism with conditional effects, partial state specifications, and propositional formulae in the preconditions. One of the results is that conditional effects cannot be compiled away if plan size should grow only linearly but can be compiled away if we allow for polynomial growth of the resulting plans. This result confirms that the recently proposed extensions to the GRAPHPLAN algorithm concerning conditional effects are optimal with respect to the "compilability" framework. Another result is that general propositional formulae cannot be compiled into conditional effects if the plan size should be preserved linearly. This implies that allowing general propositional formulae in preconditions and effect conditions adds another level of difficulty in generating a plan.

Journal ArticleDOI
TL;DR: In this paper, a randomized algorithm for finding a minimum weight loop cutset in a Bayesian network with high probability is presented, with probability at least 1 - (1 - 1/6k)c6k, where c > 1 is a constant specified by the user.
Abstract: We show how to find a minimum weight loop cutset in a Bayesian network with high probability. Finding such a loop cutset is the first step in the method of conditioning for inference. Our randomized algorithm for finding a loop cutset outputs a minimum loop cutset after O(c 6kkn) steps with probability at least 1 - (1 - 1/6k)c6k, where c > 1 is a constant specified by the user, k is the minimal size of a minimum weight loop cutset, and n is the number of vertices. We also show empirically that a variant of this algorithm often finds a loop cutset that is closer to the minimum weight loop cutset than the ones found by the best deterministic algorithms known.

Journal ArticleDOI
TL;DR: This work empirically and analytically explores a family of socially-attentive teamwork monitoring algorithms in two dynamic, complex, multi-agent domains, under varying conditions of task distribution and uncertainty and shows that a centralized scheme using a complex algorithm trades correctness for completeness and requires monitoring all teammates.
Abstract: Agents in dynamic multi-agent environments must monitor their peers to execute individual and group plans. A key open question is how much monitoring of other agents' states is required to be effective: The Monitoring Selectivity Problem. We investigate this question in the context of detecting failures in teams of cooperating agents, via Socially-Attentive Monitoring, which focuses on monitoring for failures in the social relationships between the agents. We empirically and analytically explore a family of socially-attentive teamwork monitoring algorithms in two dynamic, complex, multi-agent domains, under varying conditions of task distribution and uncertainty. We show that a centralized scheme using a complex algorithm trades correctness for completeness and requires monitoring all teammates. In contrast, a simple distributed teamwork monitoring algorithm results in correct and complete detection of teamwork failures, despite relying on limited, uncertain knowledge, and monitoring only key agents in a team. In addition, we report on the design of a socially-attentive monitoring system and demonstrate its generality in monitoring several coordination relationships, diagnosing detected failures, and both on-line and off-line applications.

Journal ArticleDOI
TL;DR: In this article, a new planning domain description language, NADL, has been proposed to specify non-deterministic, multi-agent domains with controllable agents and uncontrollable environment agents.
Abstract: Recently model checking representation and search techniques were shown to be efficiently applicable to planning, in particular to non-deterministic planning Such planning approaches use Ordered Binary Decision Diagrams (OBDDS) to encode a planning domain as a non-deterministic finite automaton and then apply fast algorithms from model checking to search for a solution OBDDS can effectively scale and can provide universal plans for complex planning domains We are particularly interested in addressing the complexities arising in non-deterministic, multi-agent domains In this article, we present UMOP, a new universal OBDD-based planning framework for non-deterministic, multi-agent domains We introduce a new planning domain description language, NADL, to specify non-deterministic, multi-agent domains The language contributes the explicit definition of controllable agents and uncontrollable environment agents We describe the syntax and semantics of NADL and show how to build an efficient OBDD-based representation of an NADL description The UMOP planning system uses NADL and different OBDD-based universal planning algorithms It includes the previously developed strong and strong cyclic planning algorithms In addition, we introduce our new optimistic planning algorithm that relaxes optimality guarantees and generates plausible universal plans in some domains where no strong nor strong cyclic solution exists We present empirical results applying UMOP to domains ranging from deterministic and single-agent with no environment actions to non-deterministic and multi-agent with complex environment actions UMOP is shown to be a rich and efficient planning system

Journal ArticleDOI
TL;DR: This paper describes how explanation based learning, dependency directed backtracking, dynamic variable ordering, forward checking, sticky values and random-restart search strategies can be adapted to Graphplan and demonstrates that these augmentations improve Graphplan's performance significantly.
Abstract: This paper reviews the connections between Graphplan's planning-graph and the dynamic constraint satisfaction problem and motivates the need for adapting CSP search techniques to the Graphplan algorithm. It then describes how explanation based learning, dependency directed backtracking, dynamic variable ordering, forward checking, sticky values and random-restart search strategies can be adapted to Graphplan. Empirical results are provided to demonstrate that these augmentations improve Graphplan's performance significantly (up to 1000x speedups) on several benchmark problems. Special attention is paid to the explanation-based learning and dependency directed backtracking techniques as they are empirically found to be most useful in improving the performance of Graphplan.

Journal ArticleDOI
TL;DR: It is proposed that high-cost random instances for local search are those with very large backbones which are also backbone-fragile, and the decay in cost beyond the satisfiability threshold is due to increasing backbone robustness (the opposite of backbone fragility).
Abstract: The local search algorithm WSAT is one of the most successful algorithms for solving the satisfiability (SAT) problem. It is notably effective at solving hard Random 3-SAT instances near the so-called 'satisfiability threshold', but still shows a peak in search cost near the threshold and large variations in cost over different instances. We make a number of significant contributions to the analysis of WSAT on high-cost random instances, using the recently-introduced concept of the backbone of a SAT instance. The backbone is the set of literals which are entailed by an instance. We find that the number of solutions predicts the cost well for small-backbone instances but is much less relevant for the large-backbone instances which appear near the threshold and dominate in the overconstrained region. We show a very strong correlation between search cost and the Hamming distance to the nearest solution early in WSAT's search. This pattern leads us to introduce a measure of the backbone fragility of an instance, which indicates how persistent the backbone is as clauses are removed. We propose that high-cost random instances for local search are those with very large backbones which are also backbone-fragile. We suggest that the decay in cost beyond the satisfiability threshold is due to increasing backbone robustness (the opposite of backbone fragility). Our hypothesis makes three correct predictions. First, that the backbone robustness of an instance is negatively correlated with the local search cost when other factors are controlled for. Second, that backbone-minimal instances (which are 3-SAT instances altered so as to be more backbone-fragile) are unusually hard for WSAT. Third, that the clauses most often unsatisfied during search are those whose deletion has the most effect on the backbone. In understanding the pathologies of local search methods, we hope to contribute to the development of new and better techniques.

Journal ArticleDOI
TL;DR: In this paper, the authors investigate the space efficiency of a Propositional Knowledge Representation (PKR) formalism, where knowledge is either a set of propositional interpretations (models) or propositional formulae (theorems).
Abstract: We investigate the space efficiency of a Propositional Knowledge Representation (PKR) formalism. Intuitively, the space efficiency of a formalism F in representing a certain piece of knowledge α, is the size of the shortest formula of F that represents α. In this paper we assume that knowledge is either a set of propositional interpretations (models) or a set of propositional formulae (theorems). We provide a formal way of talking about the relative ability of PKR formalisms to compactly represent a set of models or a set of theorems. We introduce two new compactness measures, the corresponding classes, and show that the relative space efficiency of a PKR formalism in representing models/theorems is directly related to such classes. In particular, we consider formalisms for nonmonotonic reasoning, such as circumscription and default logic, as well as belief revision operators and the stable model semantics for logic programs with negation. One interesting result is that formalisms with the same time complexity do not necessarily belong to the same space efficiency class.

Journal ArticleDOI
TL;DR: In this article, it was shown that the d-separation criterion for conditional independence in acyclic causal networks also applies to networks of discrete variables that have feedback cycles, provided that the variables of the system are uniquely determined by the random disturbances.
Abstract: Pearl and Dechter (1996) claimed that the d-separation criterion for conditional independence in acyclic causal networks also applies to networks of discrete variables that have feedback cycles, provided that the variables of the system are uniquely determined by the random disturbances. I show by example that this is not true in general. Some condition stronger than uniqueness is needed, such as the existence of a causal dynamics guaranteed to lead to the unique solution.

Journal ArticleDOI
TL;DR: Two solutions are presented: positive results that certain learning operators are a priori guaranteed to preserve useful classes of behavioral assurance constraints (which implies that no reverification is needed for these operators), and efficient incremental reverification algorithms for those learning operators that have negative a priora results.
Abstract: The goal of this research is to develop agents that are adaptive and predictable and timely. At first blush, these three requirements seem contradictory. For example, adaptation risks introducing undesirable side effects, thereby making agents' behavior less predictable. Furthermore, although formal verification can assist in ensuring behavioral predictability, it is known to be time-consuming. Our solution to the challenge of satisfying all three requirements is the following. Agents have finite-state automaton plans, which are adapted online via evolutionary learning (perturbation) operators. To ensure that critical behavioral constraints are always satisfied, agents' plans are first formally verified. They are then reverified after every adaptation. If reverification concludes that constraints are violated, the plans are repaired. The main objective of this paper is to improve the efficiency of reverification after learning, so that agents have a sufficiently rapid response time. We present two solutions: positive results that certain learning operators are a priori guaranteed to preserve useful classes of behavioral assurance constraints (which implies that no reverification is needed for these operators), and efficient incremental reverification algorithms for those learning operators that have negative a priori results.

Journal ArticleDOI
TL;DR: A temporal model for reasoning on disjunctive metric constraints on intervals and time points in temporal contexts that is able to represent non-binary constraints, such that logical dependencies on disjuncts in constraints can be handled.
Abstract: We introduce a temporal model for reasoning on disjunctive metric constraints on intervals and time points in temporal contexts. This temporal model is composed of a labeled temporal algebra and its reasoning algorithms. The labeled temporal algebra defines labeled disjunctive metric point-based constraints, where each disjunct in each input disjunctive constraint is univocally associated to a label. Reasoning algorithms manage labeled constraints, associated label lists, and sets of mutually inconsistent disjuncts. These algorithms guarantee consistency and obtain a minimal network. Additionally, constraints can be organized in a hierarchy of alternative temporal contexts. Therefore, we can reason on context-dependent disjunctive metric constraints on intervals and points. Moreover, the model is able to represent non-binary constraints, such that logical dependencies on disjuncts in constraints can be handled. The computational cost of reasoning algorithms is exponential in accordance with the underlying problem complexity, although some improvements are proposed.

Journal ArticleDOI
TL;DR: A major problem in machine learning is that of inductive bias: how to choose a learner's hypothesis space so that it is large enough to contain a solution to the problem being learnt, yet small eno...
Abstract: A major problem in machine learning is that of inductive bias: how to choose a learner's hypothesis space so that it is large enough to contain a solution to the problem being learnt, yet small eno...

Journal ArticleDOI
TL;DR: This paper reviews the connections between Graphplan's planning-graph and the dynamic constraint satisfaction problem and motivates the need for adapting CSP search techniques to the Graphplan algo.
Abstract: This paper reviews the connections between Graphplan's planning-graph and the dynamic constraint satisfaction problem and motivates the need for adapting CSP search techniques to the Graphplan algo...

Journal ArticleDOI
TL;DR: In this paper, a new type of random CSP model, called Model RB, is proposed, which is a revision to the standard Model B. It is proved that phase transitions from a region where almost all problems ar...
Abstract: In this paper we propose a new type of random CSP model, called Model RB, which is a revision to the standard Model B. It is proved that phase transitions from a region where almost all problems ar...