Planning is tricky because:

In planning we represent the world in belief states, in which multiple states are possible (because of incomplete information, we are not certain what the true state of the world is). Actions that are taken can either increase or decrease the number of possible states, in some cases narrowing it down to a single state.

Sequences of actions can be defined as trees, where each branch is an action and each node is a state or is an observation of the world. Then we search this tree for a plan which will satisfy the goal.

Broadly within planning, there are two kinds:

The field of planning includes many areas of research and different techniques:

An example planning problem

Planning approaches are often presented on toy problems, which can be quite different from real-world problems. Toy problems have a concise and exact description, whereas real-world problems seldom, if ever, have an agreed-upon or unambiguous description. Real-world problems also have important consequences, whereas toy problems do not. But toy problems provide a standard way of comparing approaches.

Some example toy problems:

For the following notes on planning, we will use the Dock-Worker Robots problem as a toy problem:

The available actions are:

State-space planning vs plan-space (partial-order) planning

There are two main approaches to planning: state-space planning and plan-space planning, sometimes called partial-order planning.

Nowadays, with efficient heuristics, state-space planning is the more efficient way of finding solutions.

State-space planning

With state-space planning, we generate plans by searching through state space.

Representing plans and systems

We can use a state-transition system as a conceptual model for planning. Such a system is described by a 4-tuple $\Sigma = (S,A,E,\gamma)$, where:

If $a \in A$ and $\gamma(s,a) \neq \emptyset$, then $a$ is applicable in $s$. Applying $a$ in $s$ will take the system to $s' \in \gamma(s,a)$.

We can also represent such a state-transition system as a directed labelled graph $G=(N_G, E_G)$, where:

A plan is a structure that gives us appropriate actions to apply in order to achieve some objective when starting from a given state.

The objective can be:

A permutation of a solution (a plan) is a reordering in which some actions on the path to the solution swap order without affecting the success of the path (that is, the permuted path still leads to the solution with the same cost). In this case, the actions are said to be independent.

Generally we have a planner which generates a plan and then passes it to a controller, which executes the actions in the plan. Executing an action changes the state of the system. The system, however, changes not only via the controller's actions but also through external events. So the controller must observe the system using an observation function $\eta: S \to O$ and generate the appropriate action.

Sometimes, however, there may be parts of the system which we cannot observe, so, given the observations that could be collected, there may be many possible states of the world - this is the belief state of the controller.

The system as it actually is often differs from how it was described to the planner (as $\Sigma$, which, as an abstraction, loses some details). In dynamic planning, planning and execution are more closely linked to compensate for this scenario (which is more the rule than the exception). That is, the controller must supervise the plan, i.e. it must detect when observations differ from expected results. The controller can pass this information to the planner as an execution status, and the planner can then revise its plan to take the new state into account.

STRIPS (Stanford Research Institute Problem Solver)

The STRIPS representation gives an internal structure to our states, which up until now have been left as black boxes. It is based on first-order predicate logic: we have objects in our domain, represented by symbols and grouped according to type, and these objects stand in relationships (predicates) to each other.

For example, in the Dock-Worker Robot domain, one type of object is robot and each robot would be represented with a unique symbol, e.g. robot1, robot2, etc.

We must specify all of this in a syntax that the planner can understand. The most common syntax is PDDL (Planning Domain Definition Language). For example:

(define (domain dock-worker-robot)

  (:requirements :strips :typing)

  (:types
    location  ;there are several connected locations
    pile      ;is attached to a location,
              ;it holds a pallet and a stack of containers
    robot     ;holds at most 1 container,
              ;only 1 robot per location
    crane     ;belongs to a location to pickup containers
    container);containers are stacked in piles

  (:predicates
    (adjacent ?l1 ?l2 - location)       ;location ?l1 is adjacent to ?l2
    (attached ?p - pile ?l - location)  ;pile ?p attached to location ?l
    (belong ?k - crane ?l - location)   ;crane ?k belongs to location ?l
    (at ?r - robot ?l - location)       ;robot ?r is at location ?l
    (occupied ?l - location)            ;there is a robot at location ?l
    (loaded ?r - robot ?c - container)  ;robot ?r is loaded with container ?c
    (unloaded ?r - robot)               ;robot ?r is empty
    (holding ?k - crane ?c - container) ;crane ?k is holding a container ?c
    (empty ?k - crane)                  ;crane ?k is empty
    (in ?c - container ?p - pile)       ;container ?c is within pile ?p
    (top ?c - container ?p - pile)      ;container ?c on top of pile ?p
    (on ?c1 ?c2 - container)            ;container ?c1 is on container ?c2
  ))

Let $\mathcal{L}$ be a first-order language with finitely many predicate symbols, finitely many constant symbols, and no function symbols (e.g. as defined with PDDL above).

A state in a STRIPS planning domain is a set of ground atoms of $\mathcal{L}$. An atom is a predicate applied to an appropriate number of arguments (e.g. the objects we defined above). An atom is ground if all its arguments are constant symbols (rather than variables).

Say we have the symbols loc1, loc2, p1, p2, crane1, r1, c1, c2, c3, pallet. An example state for the DWR problem:

state = {
  adjacent(loc1, loc2), adjacent(loc2, loc1),
  attached(p1, loc1), attached(p2, loc1),
  belong(crane1, loc1),
  in(c1,p1), in(c3,p1),
  on(c3,c1), on(c1,pallet),
  ...
}

In STRIPS, a planning operator is a triple $o = (\text{name}(o), \text{precond}(o), \text{effects}(o))$, where:

An action in STRIPS is a ground instance of a planning operator (that is, we substitute constant symbols for the variables, much like "calling" the operator as a function).

For example, we may have an operator named move(r,l,m) with the preconditions adjacent(l,m), at(r,l), !occupied(m) and the effects at(r,m), occupied(m), !occupied(l), !at(r,l). An action might be move(robot1, loc1, loc2), since we are specifying specific instances to operate on.

In the PDDL syntax, this can be written:

(:action move
  :parameters (?r - robot ?from ?to - location)
  :precondition (and
    (adjacent ?from ?to) (at ?r ?from)
    (not (occupied ?to)))
  :effect (and
    (at ?r ?to) (occupied ?to)
    (not (occupied ?from)) (not (at ?r ?from)) ))

This is a bit confusing because PDDL does not distinguish "action" from "operator".

Other representations

Representations other than STRIPS include:

These representations, however, can all be translated into one another.

Applicability and state transitions

When is an action applicable in a state?

Let $L$ be the set of literals. $L^+$ is the set of atoms that are positive literals in $L$ and $L^-$ is the set of all atoms whose negations are in $L$.

Let $a$ be an action and $s$ a state. $a$ is applicable in $s$ if and only if:

This just says that all positive preconditions must be true in the current state, and all negative preconditions must be false in it.

The state transition function $\gamma$ for an applicable action $a$ in state $s$ is defined as:

$$ \gamma(s,a) = (s - \text{effects}^-(a)) \cup \text{effects}^+(a) $$

That is, we apply the delete list (remove those effects from the state) and apply the add list (add those effects to the state).
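The applicability test and the transition function can be sketched in Python. This is an illustrative sketch, not part of the text: states are sets of ground-atom strings, an action is a plain dict, and the field names (precond_pos etc.) are assumptions.

```python
# States as sets of ground-atom strings; an action as a plain dict. The field
# names (precond_pos etc.) are illustrative assumptions, not standard syntax.

def applicable(state, action):
    """a is applicable in s iff precond+ is contained in s and precond- is absent."""
    return (action["precond_pos"] <= state
            and action["precond_neg"].isdisjoint(state))

def gamma(state, action):
    """gamma(s, a) = (s - effects-(a)) | effects+(a), for an applicable a."""
    assert applicable(state, action)
    return (state - action["effects_neg"]) | action["effects_pos"]

# The move(robot1, loc1, loc2) action from the example above:
move = {
    "precond_pos": {"adjacent(loc1,loc2)", "at(robot1,loc1)"},
    "precond_neg": {"occupied(loc2)"},
    "effects_pos": {"at(robot1,loc2)", "occupied(loc2)"},
    "effects_neg": {"occupied(loc1)", "at(robot1,loc1)"},
}
```

Applying move in a state where robot1 is at loc1 removes the delete list and adds the add list, leaving robot1 at loc2.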

Finding actions applicable for a given state is a non-trivial problem, in particular because there may be many, many available actions.

We can define an algorithm which will find the applicable actions for a given operator in a given state:

We can formally define a planning domain in STRIPS.

Given our function-free first-order language $\mathcal{L}$, a STRIPS planning domain on $\mathcal{L}$ is a restricted (meaning there are no events) state-transition system $\Sigma = (S, A, \gamma)$ such that:

We can formally define a planning problem as a triple $\mathcal{P} = (\Sigma, s_i, g)$, where:

In PDDL syntax, we can define the initial state like so:

  (adjacent l1 l2)
  (adjacent l2 l1)

and the goal like so:

(:goal (and
    (in c1 p2) (in c2 p2)

We formally define a plan as any sequence of actions $\pi = a_1, \dots, a_k$ where $k \geq 0$:

A plan $\pi$ is a solution for a planning problem $\mathcal{P}$ if $\gamma(s_i, \pi)$ satisfies $g$.

A solution $\pi$ is redundant if there is a proper subsequence of $\pi$ that is also a solution for $\mathcal{P}$.

A solution $\pi$ is minimal if no other solution for $\mathcal{P}$ contains fewer actions than $\pi$.

Searching for plans

Forward Search

The basic idea is to apply standard search algorithms (e.g. breadth-first, depth-first, A*, etc.) to the planning problem.

Forward search is sound (if a plan is returned, it will indeed be a solution) and it is complete (if a solution exists, it will be found).
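As a sketch of forward search, here is a breadth-first variant over the same set-of-atoms state representation; all names (the "pre", "pre_neg", "add", "del" fields) are illustrative assumptions, and a practical planner would use heuristic search such as A* instead.

```python
from collections import deque

def applicable(s, a):
    # all positive preconditions hold, no negative precondition holds
    return a["pre"] <= s and a["pre_neg"].isdisjoint(s)

def step(s, a):
    # gamma(s, a) = (s - delete list) | add list
    return (s - a["del"]) | a["add"]

def forward_search(init, goal, actions):
    """Breadth-first forward search: return a shortest sequence of action
    names reaching a state satisfying goal, or None if none exists."""
    start = frozenset(init)
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:          # goal is a set of atoms to satisfy
            return plan
        for name, a in actions.items():
            if applicable(state, a):
                nxt = frozenset(step(state, a))
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, plan + [name]))
    return None
```

Because breadth-first search explores all states reachable in $k$ steps before any reachable in $k+1$, this sketch is sound and complete (on finite state spaces), matching the claims above.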

Backward Search

Alternatively, we can search backwards from a goal state to the initial state.

First we define two new concepts:

An action $a \in A$ is relevant for $g$ if:

Essentially what this says is that the action must contribute to the goal (the first item) and must not interfere with the goal (the last two items).

Relevance is the backward-search counterpart of applicability.

The regression set of $g$ for a relevant action $a \in A$ is:

$$ \gamma^{-1}(g,a) = (g - \text{effects}^{+}(a)) \cup \text{precond}(a) $$

That is, it is the inverse of the state transition function.
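Relevance and regression can be sketched over the same set-of-atoms representation (field names "pre", "add", "del" are illustrative; the relevance test below covers only purely positive goals):

```python
def relevant(goal, a):
    """a is relevant for goal: it contributes at least one goal atom
    and deletes none of them (positive goals only)."""
    return bool(a["add"] & goal) and not (a["del"] & goal)

def regress(goal, a):
    """gamma^-1(g, a) = (g - effects+(a)) | precond(a)."""
    return (goal - a["add"]) | a["pre"]
```

Regressing the goal through a relevant action yields the subgoal that must hold one step earlier; backward search repeats this until the subgoal is satisfied by the initial state.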

When searching backwards, sometimes we end up with operators rather than actions (i.e. some of the parameters are still variables). We could in theory branch out to all possible actions from this operator by just substituting all possible values for the variable, but that will increase the branching factor by a lot. Instead, we can do lifted backward search, in which we just stick with these partially instantiated operators instead of actions.

Keeping variables in a plan, such as with lifted backward search, is called least commitment planning.

The FF Planner

The FF Planner performs a forward state-space search. The basic strategy can be A* or enforced hill climbing (EHC), a kind of best-first search in which we commit to the first state that looks better than all states considered so far.

It uses a relaxed problem heuristic $h^{FF}$. The relaxed problem is constructed by ignoring the delete lists of all operators.

Then we solve this relaxed problem; this can be done in polynomial time:
- chain forward to build a relaxed planning graph
- chain backward to extract a relaxed plan from the graph

Then we use the length (i.e. number of actions) of the relaxed plan as a heuristic value (e.g. for A*).
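The relaxed-reachability computation can be sketched as follows. This is a simplification of FF's full relaxed-plan extraction (it sums the layer at which each goal atom first appears, which is closer to the additive heuristic than to $h^{FF}$ proper); field names "pre"/"add" are illustrative assumptions.

```python
def hff_estimate(state, goal, actions):
    """Delete-relaxation sketch: ignore all delete lists and grow
    proposition layers until every goal atom appears; return the sum of
    the layers at which each goal atom first appears (its "firstlevel").
    Returns None if the relaxed problem is unsolvable."""
    first_level = {p: 0 for p in state}
    props = set(state)
    layer = 0
    while not goal <= props:
        layer += 1
        new = set()
        for a in actions:
            if a["pre"] <= props:
                new |= a["add"] - props
        if not new:
            return None  # fixpoint reached without covering the goal
        for p in new:
            first_level[p] = layer
        props |= new
    return sum(first_level[g] for g in goal)
```

Because delete lists are ignored, the proposition set only grows, so the loop reaches a fixpoint in polynomially many iterations.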

For example, with the simplified DWR from before:

To get the relaxed problem, we drop all delete lists:

Pseudocode for computing the relaxed planning graph (RPG):

Pseudocode for extracting a plan from the RPG (in particular, the size of the plan, since this is a heuristic calculation):

The firstlevel function tells us which layer (by index) a goal $g_i$ first appears in the planning graph.

This heuristic is not admissible (the extracted relaxed plan is not guaranteed to be minimal, so its length can overestimate the true cost), but in practice it is quite accurate, so it (and ideas inspired by it) is frequently used in state-of-the-art planners.

Plan-space (partial-order) planning

Partial plans are like plans mentioned thus far (i.e. simple sequences of actions), but we also record the rationale behind each action, e.g. to achieve the precondition of another action. In aggregate these partial plans may form the solution to the problem (i.e. they act as component plans).

We also have explicit ordering constraints (i.e. these actions must occur in this order), so we can have partial plans with partial order, which means that we can execute actions in parallel.

And as with lifted backward search, we may have variables in our actions as well.

We adjust the planning problem a bit - instead of achieving goals, we want to accomplish tasks. Tasks are high-level descriptions of some activity we want to execute - this is typically accomplished by decomposing the high-level task into lower-level subtasks. This is the approach for Hierarchical Task Network (HTN) planning. There is a simpler version called STN planning as well.

Rather than searching through the state-space, we search through plan-space - a graph of partial plans. The nodes are partially-specified plans, the arcs are plan refinement operations (which is why this is called refinement planning), and the solutions are partial-order plans.

More concretely, if a plan is a set of actions organized into some structure, then a partial plan is:

More formally, we define a partial plan as a tuple $\pi = (A, \prec, B, L)$ where:

Note that for causal links, $a_i$ is the producer in the causal link and $a_j$ is the consumer.
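One possible container for a partial plan $\pi = (A, \prec, B, L)$ is sketched below; the field names are illustrative assumptions, not from any planning library.

```python
from dataclasses import dataclass, field

@dataclass
class PartialPlan:
    actions: set = field(default_factory=set)    # A: partially instantiated operators
    orderings: set = field(default_factory=set)  # <: pairs (a_i, a_j), a_i before a_j
    bindings: set = field(default_factory=set)   # B: variable binding constraints
    links: set = field(default_factory=set)      # L: causal links (producer, atom, consumer)

# The empty plan used to start the plan-space search: two dummy actions,
# one ordering constraint, no bindings, no causal links.
empty_plan = PartialPlan(actions={"init", "goal"}, orderings={("init", "goal")})
```

Plan refinement operations then add elements to these four sets while keeping the ordering and binding constraints consistent.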

Plan refinement operations

Adding actions

With least-commitment planning, we only want to add actions (more specifically, we are adding partially-instantiated operators) if it's justified:

Actions can be added anywhere in the plan.

Note that each action that we add has its own set of variables, unrelated to those of other actions.

Adding causal links

Causal links link a provider (an effect of an action or an atom that holds in the initial state) to a consumer (a precondition of an action or a goal condition). There is an ordering constraint here as well, in that the provider must come before the consumer (but not necessarily directly before).

We add causal links to prevent interference with other actions.

Adding variable bindings

A solution plan must have actions, not partially-instantiated operators. Variable bindings are what allow us to turn operators into actions.

Variable binding constraints keep track of possible values for variables, and can also specify co-designation (i.e. that certain variables must or must not have the same value). For example, with causal links, there are variables in the producer that must be the same as the corresponding variables in the consumer (because they are "carried over"). When two variables must share the same value, we say they are unified.
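Unification of two atoms can be sketched as follows. This is a minimal single-pass version (it does not resolve chains of substitutions, and no occurs check is needed since there are no function symbols); representing atoms as tuples and variables as strings starting with `?` is an assumed convention.

```python
def unify(t1, t2, subst=None):
    """Unify two atoms, e.g. ('at', '?r', 'loc1') and ('at', 'robot1', '?l').
    Returns a substitution dict mapping variables to values, or None."""
    if subst is None:
        subst = {}
    if len(t1) != len(t2) or t1[0] != t2[0]:
        return None  # different predicate or arity
    for x, y in zip(t1[1:], t2[1:]):
        x, y = subst.get(x, x), subst.get(y, y)  # apply bindings so far
        if x == y:
            continue
        if x.startswith("?"):
            subst[x] = y
        elif y.startswith("?"):
            subst[y] = x
        else:
            return None  # two distinct constants cannot be unified
    return subst
```

For example, unifying `('at', '?r', 'loc1')` with `('at', 'robot1', '?l')` binds `?r` to `robot1` and `?l` to `loc1`.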

Adding ordering constraints

Ordering constraints are just binary relations specifying the temporal order between actions in a plan.

Ordering constraints help us avoid possible interference. Causal links imply ordering constraints, and some trivial ordering constraints are that all actions must come after the initial state and before the goal.

The Plan-Space Search Problem

The initial search state includes just the initial state and the goal as "dummy actions":

We start with the empty plan: $\pi_0 = (\{\text{init, goal}\}, \{(\text{init} \prec \text{goal})\}, \{\}, \{\})$

It includes just the two dummy actions, one ordering constraint (init before goal), and no variable bindings or causal links.

We generate successors through one or more plan refinement operators:

Threats and flaws

A threat in a partial plan is when we have an action that might occur in parallel with a causal link and has an effect that is complementary to the condition we want to protect (that is, it interferes with a condition we want to protect).

We can often get around this by introducing a new causal link that requires this conflicting action to follow the causal link, instead of occurring in parallel.

More formally, an action $a_k$ in a partial plan is a threat to a causal link $a_i \to [p] \to a_j$ if and only if:

That is, if we have one action which produces the precondition $p$, which is what we want, but another action which may simultaneously produce the effect $\neg q$, where $q$ and $p$ are unifiable, then we have a threat.

A flaw in a partial plan is either:

Partial order solutions

We consider a plan $\pi=(A, \prec, B, L)$ a partial-order solution for a planning problem $\mathcal P$ if:

The Plan-Space Planning (PSP) algorithm

The main principle is to refine the partial plan $\pi$ while maintaining $\prec$ and $B$ consistent until $\pi$ has no more flaws.

Basic operations:

The PSP procedure is sound and complete whenever $\pi_0$ can be refined into a solution plan and $\text{PSP}(\pi_0)$ returns such a plan.

To clarify, soundness: if PSP returns a plan, it is a solution plan, and completeness: if there is a solution plan, PSP will return it.


The general algorithm:

function PSP(plan):
  all_flaws = plan.open_goals() + plan.threats()
  if not all_flaws:
    return plan
  flaw = all_flaws.select_one()
  all_resolvers = flaw.get_resolvers(plan)
  if not all_resolvers:
    return failure
  resolver = all_resolvers.choose_one()
  new_plan = plan.refine(resolver)
  return PSP(new_plan)

Where the initial plan is the empty plan as described previously.

all_resolvers.choose_one is a non-deterministic decision (i.e. this is something we may need to backtrack to; that is, if one resolver doesn't work out, we need to try another branch).

all_flaws.select_one is a deterministic selection (we don't need to backtrack to this because all flaws must be resolved). The order is not important for completeness, but it is important for efficiency.

Implementing plan.open_goals(): we find unachieved subgoals incrementally:

Implementing plan.threats(): we find threats incrementally:

Implementing flaw.get_resolvers(plan): for an unachieved precondition $p$ of $a_g$:

For an effect $q$ of action $a_t$ threatening $a_i \to [p] \to a_j$:

Implementing plan.refine(resolver): refines a partial plan by adding elements specified in resolver, i.e.:

This may introduce new flaws, so we must update the flaws (i.e. plan.open_goals() and plan.threats()).

Implementing ordering constraints: ordering constraint management can be implemented as an independent module with two operations:

Implementing variable binding constraints:

Types of constraints:

Unary and equality constraints can be dealt with in linear time, but inequality constraints cause exponential complexity here - with inequalities, this is a general constraint satisfaction problem which is NP-complete. So these variable binding constraints can become problematic.

The UC Partial-Order Planning (UCPoP) Planner

This is a slight variation of PSP as outlined above, in which threats are instead dealt with after an open goal is dealt with (that is, it deals with threats introduced by the resolver that was used to deal with an open goal, which is to say that threats are resolved as part of the successor generation process).

UCPoP takes in a slightly different input as well. In addition to the partial plan, it also takes in an agenda, which is a set of $(a,p)$ pairs where $a$ is an action and $p$ is one of its preconditions (i.e. this is a list of things that still need to be dealt with, that is the remaining open goal flaws).


With task network planning, we still have terms, literals, operators, actions, state transition functions, and plans.

A few new things are added:


Simple Task Networks (STN)

We can group tasks into task networks.

A simple task network $w$ is an acyclic directed graph $(U, E)$ in which:

A task network $w$ is ground/primitive if all tasks $t_u \in U$ are ground/primitive, otherwise it is unground/non-primitive.

A task network may also be totally ordered or partially ordered. We have an ordering $t_u \prec t_v$ in $w$ if there is a path from $t_u$ to $t_v$.

A network $w$ is totally ordered if and only if $E$ defines a total order on $U$ (that is, every node is ordered with respect to every other node in the network). If $w$ is totally ordered, we can represent it as a sequence of tasks $t_1, \dots, t_n$.

Let $w = t_1, \dots, t_n$ be a totally ordered, ground, primitive STN. Then the plan $\pi(w)$ is defined as:

$$ \pi(w) = a_1, \dots, a_n $$

Where $a_i = t_i, 1 \leq i \leq n$.

Simple task networks are a simplification of the more general hierarchical task networks (HTNs).

Example (DWR)


Task networks:


Methods are plan refinements (i.e. they correspond to state transitions in our search space).

Let $M_S$ be a set of method symbols. A STN method is a 4-tuple $m=(\text{name}(m), \text{task}(m), \text{precond}(m), \text{network}(m))$ where:

Example (DWR)

Say we want to define a method which involves taking the topmost container of a stack and moving it. If you recall, we have the operators take and put, which we can use to define this method.

The name of the method could be $\text{take-and-put}(c,k,l,p_o,p_d,x_o,x_d)$. The task this method completes would be $\text{move-topmost}(p_o, p_d)$. The preconditions would be $\text{top}(c,p_o), \text{on}(c, x_o), \text{attached}(p_o,l), \text{belong}(k,l), \text{attached}(p_d,l), \text{top}(x_d,p_d)$. Finally, the subtasks would be $\text{take}(k,l,c,x_o,p_o), \text{put}(k,l,c,x_d,p_d)$.
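The take-and-put method above can be written out as a data structure; the dict layout below is an illustrative assumption, not a standard STN encoding.

```python
# STN method m = (name(m), task(m), precond(m), network(m)) as a plain dict.
# Atoms and tasks are tuples; variables are strings starting with "?".
take_and_put = {
    "name": ("take-and-put", "?c", "?k", "?l", "?po", "?pd", "?xo", "?xd"),
    "task": ("move-topmost", "?po", "?pd"),
    "precond": {("top", "?c", "?po"), ("on", "?c", "?xo"),
                ("attached", "?po", "?l"), ("belong", "?k", "?l"),
                ("attached", "?pd", "?l"), ("top", "?xd", "?pd")},
    # network(m): the subtasks, totally ordered (take before put)
    "network": [("take", "?k", "?l", "?c", "?xo", "?po"),
                ("put", "?k", "?l", "?c", "?xd", "?pd")],
}
```

Decomposing a move-topmost task with this method replaces it by the two subtasks in the listed order.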


Applicability and relevance

A method instance $m$ is applicable in a state $s$ if:

A method instance $m$ is relevant for a task $t$ if there is a substitution $\sigma$ such that $\sigma(t) = \text{task}(m)$

Decomposition of tasks

The decomposition of an individual task $t$ by a relevant method $m$ under $\sigma$ is either:

$\delta$ is called the decomposition method for a task given a method and a substitution.

That is, we break the task down into its subtasks.

The decomposition of tasks in a STN is as follows:


The decomposition of $t$ in $w$ by $m$ under $\sigma$ is the STN $\delta(w,t,m,\sigma)$ where:

That is, we replace the task with its subtasks.

Planning Domains and Problems

An STN planning domain is a pair $\mathcal D = (O,M)$ where:

$\mathcal D$ is a total-order STN planning domain if every $m \in M$ is totally ordered.

An STN planning problem is a 4-tuple $\mathcal P = (s_i, w_i, O, M)$ where:

So it is quite similar to a STRIPS planning problem (except the domain also includes STN methods and we have an initial task network).

$\mathcal P$ is a total-order STN planning problem if $w_i$ and $\mathcal D$ are both totally ordered.

A plan $\pi = a_1, \dots, a_n$ is a solution for an STN planning problem $\mathcal P = (s_i, w_i, O, M)$ if:



Planning with task networks

The Ground Total order Forward Decomposition (Ground-TFD) procedure

function GroundTFD(s, (t1, ..., tk), O, M):
  if k = 0 then return []
  if t1.is_primitive() then
    actions = {(a, sigma) | a = sigma(t1) and a.applicable_in(s)}
    if not actions then return failure
    (a, sigma) = actions.choose_one() # non-deterministic choice, we may need to backtrack to here
    plan = GroundTFD(gamma(s,a), sigma((t2, ..., tk)), O, M)
    if plan = failure then return failure
    else return [a].extend(plan)
  else
    methods = {(m, sigma) | m.relevant_for(sigma(t1)) and m.applicable_in(s)}
    if not methods then return failure
    (m, sigma) = methods.choose_one() # non-deterministic choice
    tasks = subtasks(m).extend(sigma((t2, ..., tk)))
    return GroundTFD(s, tasks, O, M)

TFD considers only applicable actions, much like forward search, and it considers only relevant methods, much like backward search.

Ground-TFD can be generalized to Lifted-TFD, giving the same advantages as lifted backward search (e.g. least commitment planning).

The Ground Partial order Forward Decomposition (Ground-PFD) procedure

function GroundPFD(s, w, O, M):
  if w.U = {} then return []
  task = {t in w.U | t.no_predecessors_in(w.E)}.choose_one()
  if task.is_primitive() then
    actions = {(a, sigma) | a = sigma(task) and a.applicable_in(s)}
    if not actions then return failure
    (a, sigma) = actions.choose_one() # non-deterministic choice
    plan = GroundPFD(gamma(s,a), sigma(w - {task}), O, M)
    if plan = failure then return failure
    else return [a].extend(plan)
  else
    methods = {(m, sigma) | m.relevant_for(sigma(task)) and m.applicable_in(s)}
    if not methods then return failure
    (m, sigma) = methods.choose_one() # non-deterministic choice
    return GroundPFD(s, delta(w, task, m, sigma), O, M)

Hierarchical Task Network (HTN) planning

HTN planning is more general than STN planning, which also means there is no single algorithm for implementing HTN planning.

In STN planning, we had two types of constraints:

HTN planning has the flexibility to use these constraints or other arbitrary constraints as needed; that is, it maintains more general constraints explicitly (in contrast, with STN planning the constraints are embedded as part of the network or the planning). For instance, we could include constraints on the resources used by the tasks.

HTN methods are different than STN methods. Formally:

Let $M_S$ be a set of method symbols. An HTN method is a 4-tuple $m=(\text{name}(m), \text{task}(m), \text{subtasks}(m), \text{constr}(m))$ where:

So the main difference between HTN and STN is that HTN can handle arbitrary constraints, which makes it more powerful, but also more complex.

HTN vs STRIPS planning

The STN/HTN formalism is more expressive; you can encode more problems with this formalism than you can with STRIPS. For example, STN/HTN planning can encode undecidable problems.

However, if you leave out the recursive aspects of STN planning, you can translate such a non-recursive STN problem into an equivalent STRIPS problem, though the size of the problem may become exponentially larger.

There is also a set of STN domains called "regular" STN domains which are equivalent to STRIPS.

It is important to note that STN/HTN and STRIPS planning are meant to solve different kinds of problems - STN/HTN for task-based planning, and STRIPS for goal-based planning.


Like STRIPS, a planning problem for Graphplan consists of a set of operators constituting a domain, an initial state, and set of goals that need to be achieved.

A major difference is that Graphplan works on a propositional representation, i.e. the atoms that make up the world are no longer structured - they don't consist of objects and their relationships but of individual symbols (facts about the world) which can be either true or false. Actions are also individual symbols, not parameterized actions. However, note that every STRIPS planning problem can be translated into an equivalent propositional problem, so long as its operators have no negative preconditions.

The Graphplan algorithm creates a data structure called a planning graph. The algorithm has two major steps:

  1. the planning graph is expanded with two new layers, an action layer and a proposition layer
  2. the graph is searched for a plan

The initial layer is a layer of propositions that are true in the initial state. Then a layer of actions applicable given this initial layer is added, followed by a proposition layer of those propositions that would be true after these actions (including those that were true before which have not been altered by the actions).

An example Graphplan planning graph, from Jiří Iša

This expansion step runs in polynomial time (so it is quite fast).

The search step searches backwards, starting from the last proposition layer of the graph and working back towards the initial state. The search itself can be accomplished with something like A*. If a plan is not found, the algorithm goes back to the first step.

Example: Simplified DWR problem:

Here are the STRIPS operators we could use in this domain:

There are no negative preconditions here so we can translate this into a propositional representation.

Basically for each operator, we have to consider every possible configuration (based on the preconditions), and each configuration is given a symbol. For example, for move(r,l,l'), we have two robots and two locations. So there are four possibilities for at(r,l) (i.e. at(robot r, location 1), at(robot q, location 1), at(robot r, location 2), at(robot q, location 2)), and because we have only two locations, there is only one possibility for adjacent(l,l'), so we will have eight symbols for that STRIPS operator in the propositional representation.

So for instance, we could use the symbol r1 for at(robot r, location 1), the symbol r2 for at(robot r, location 2) , the symbol ur for unloaded(robot r), and so on.

Then we'd represent the initial state like so: {r1, q2, a1, b2, ur, uq}.

Our actions are also symbolically represented. For instance, we could use the symbol Mr12 for move(robot r, location 1, location 2).
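This grounding can be generated mechanically. The sketch below enumerates the propositional move actions for the two-robot, two-location problem; the Mr12-style names follow the text, but for brevity it keeps only the at-propositions (r1, r2, q1, q2) and drops the occupied/adjacent preconditions.

```python
from itertools import product

robots = ["r", "q"]
locations = ["1", "2"]

ground_moves = {}
for r, l, m in product(robots, locations, locations):
    if l == m:
        continue  # a robot cannot move to the location it is already at
    name = f"M{r}{l}{m}"  # e.g. Mr12 = move(robot r, location 1, location 2)
    ground_moves[name] = {
        "pre": {f"{r}{l}"},  # robot r is at location l
        "add": {f"{r}{m}"},  # robot r ends up at location m
        "del": {f"{r}{l}"},  # robot r is no longer at location l
    }
```

With two robots and two locations this yields four ground move actions: Mr12, Mr21, Mq12, Mq21.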

A propositional planning problem $\mathcal P = (\Sigma, s_i, g)$ has a solution if and only if $S_g \cap \Gamma^{>}(\{s_i\}) \neq \emptyset$, where $\Gamma^{>}(\{s_i\})$ is the set of all states reachable from the initial state.

We can identify reachable states by constructing a reachability tree, where:

The set of all nodes in the reachability tree is denoted $\Gamma^{>}(\{s_i\})$.

All nodes up to depth $d$ are $\Gamma^d(\{s_i\})$.

These trees are usually very large: there are $O(k^d)$ nodes, where $k$ is the number of applicable actions per state. So we cannot simply traverse the entire tree.

Instead, we can construct a planning graph.

A planning graph is a layered (layers as mentioned earlier) directed graph $G=(N,E)$, where:

The first proposition layer $P_0$ has the propositions in the initial state $s_i$.

An action layer $A_j$ has all actions $a$ where $\text{precond}(a) \subseteq P_{j-1}$.

A proposition layer $P_j$ has all propositions $p$ where $p \in P_{j-1}$ or $\exists a \in A_j: p \in \text{effects}^{+}(a)$. Note that we do not apply negative effects; propositions are never removed from a layer.

As a result, both proposition layers and action layers increase (grow larger) monotonically as we move forward through the graph.
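One expansion step, following the layer definitions above, can be sketched as (field names "pre"/"add" are illustrative; delete lists are never applied, so layers only grow):

```python
def expand(prop_layer, actions):
    """From proposition layer P_{j-1}, build action layer A_j (all actions
    whose preconditions are contained in P_{j-1}) and proposition layer P_j
    (the old propositions plus all positive effects of A_j)."""
    action_layer = {name for name, a in actions.items()
                    if a["pre"] <= prop_layer}
    next_props = set(prop_layer)
    for name in action_layer:
        next_props |= actions[name]["add"]
    return action_layer, next_props
```

Calling expand repeatedly from the initial proposition layer produces the alternating action and proposition layers of the planning graph.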

We create arcs throughout the graph like so:

If a goal $g$ is reachable from the initial state $s_i$, then there will be some proposition layer $P_g$ in the planning graph such that $g \subseteq P_g$.

This is a necessary condition, but not sufficient because the planning graph's proposition layers contain propositions that may be true depending on the selected actions in the previous action layer; furthermore, these proposition layers may contain inconsistent propositions (e.g. a robot cannot be in two different locations simultaneously). Similarly, actions in an action layer may not be applicable at the same time (e.g. a robot cannot move to two different locations simultaneously).

The advantage of the planning graph is that it is of polynomial size and we can evaluate this necessary condition in polynomial time.

Action independence

Actions which cannot be executed simultaneously/in parallel are dependent, otherwise, they are independent.

More formally:

Two actions $a_1$ and $a_2$ are independent if and only if:

(that is, they don't interfere with each other)

A set of actions $\pi$ is independent if and only if every pair of actions $a_1, a_2 \in \pi$ is independent.

In pseudocode:

function independent(a1, a2):
  for each p in neg_effects(a1):
    if p in precond(a2) or p in pos_effects(a2):
      return false
  for each p in neg_effects(a2):
    if p in precond(a1) or p in pos_effects(a1):
      return false
  return true

A set $\pi$ of independent actions is applicable to a state $s$ if and only if $\bigcup_{a \in \pi} \text{precond}(a) \subseteq s$.

The result of applying the set $\pi$ in $s$ is defined as $\gamma(s, \pi) = (s - \text{effects}^{-}(\pi)) \cup \text{effects}^{+}(\pi)$, where:

Independent action execution order

To turn a set of independent actions into a sequential plan:

If a set $\pi$ of independent actions is applicable in state $s$, then for any permutation $a_1, \dots, a_k$ of the elements of $\pi$:

- the sequence $\langle a_1, \dots, a_k \rangle$ is applicable in $s$, and executing it results in the state $\gamma(s, \pi)$

Which is to say that the execution order doesn't matter for a set of independent actions; we can execute them in any order we like (intuitively, this makes sense, because none of them interfere with each other, so they can happen in any order).
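This claim can be checked directly. A sketch, assuming each action is a `(precond, pos_effects, neg_effects)` triple of frozensets (a made-up representation for illustration):

```python
from itertools import permutations

def apply_action(s, a):
    precond, pos, neg = a
    assert precond <= s  # the action must be applicable in s
    return (s - neg) | pos

def apply_set(s, pi):
    # gamma(s, pi) = (s - effects-(pi)) | effects+(pi)
    neg = frozenset().union(*(a[2] for a in pi))
    pos = frozenset().union(*(a[1] for a in pi))
    return (s - neg) | pos

# Two independent actions: neither deletes anything the other needs/adds.
a1 = (frozenset({"p"}), frozenset({"q"}), frozenset({"p"}))
a2 = (frozenset({"r"}), frozenset({"t"}), frozenset({"r"}))
s0 = frozenset({"p", "r"})

# Every execution order reaches the same state, gamma(s0, {a1, a2}).
results = set()
for order in permutations([a1, a2]):
    s = s0
    for a in order:
        s = apply_action(s, a)
    results.add(s)
```

Both orders produce a single result, equal to `apply_set(s0, [a1, a2])`.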

Layered plans

Let $P = (A, s_i, g)$ be a statement of a propositional planning problem and let $G=(N,E)$ be the planning graph as previously defined.

A layered plan over $G$ is a sequence of sets of actions $\Pi = \langle \pi_1, \dots, \pi_k \rangle$ where:

- each $\pi_j$ is a subset of action layer $A_j$ of $G$
- each $\pi_j$ is an independent set of actions

A layered plan $\Pi$ is a solution to a planning problem $P$ if and only if:

- $\pi_1$ is applicable in $s_i$, and each subsequent $\pi_j$ is applicable in the state resulting from applying $\pi_1, \dots, \pi_{j-1}$
- $g \subseteq \gamma(\dots \gamma(\gamma(s_i, \pi_1), \pi_2) \dots, \pi_k)$

Mutual exclusivity (mutex)

In addition to the nodes and edges thus far mentioned, there are also mutual exclusivity (mutex) propositions and actions.

Two propositions in a layer may be incompatible if:

- they can only be produced by dependent actions, or
- they are a positive and a negative effect of the same action

We introduce a No-Op operation for a proposition $p$, notated $Ap$. It carries $p$ from one proposition layer to the next: its only precondition is $p$, and its only effect is $p$. These were implied previously when we said that all propositions are carried over to the next proposition layer; we are just making them explicit through these No-Op actions.

With these no-op actions, we can now encode the second reason for incompatible propositions (i.e. if they are positive and negative effects of the same action) as the first (i.e. they are produced by dependent actions), the dependent actions now being the no-op action and the original action.

We say that two propositions $p$ and $q$ in proposition layer $P_j$ are mutex (mutually exclusive) if:

- every pair of actions $a_1, a_2 \in A_j$ with $p \in \text{effects}^{+}(a_1)$ and $q \in \text{effects}^{+}(a_2)$ is mutex (in particular, if a single action produces both, then $p$ and $q$ are not mutex)

Notation: $\mu P_j = \{(p, q) | p, q \in P_j \text{ are mutex} \}$

In pseudocode, this would be:

function mutex_proposition(p1, p2, mu_a_j):
  for each a1 in p1.producers:
    for each a2 in p2.producers:
      if (a1, a2) not in mu_a_j:
        return false
  return true

See below for the definition of mu_a_j.

Two actions $a_1$ and $a_2$ in action layer $A_j$ are mutex if:

- they are dependent, or
- a precondition of $a_1$ is mutex (in $\mu P_{j-1}$) with a precondition of $a_2$

Notation: $\mu A_j = \{(a_1, a_2) | a_1, a_2 \in A_j \text{ are mutex} \}$

In pseudocode, this would be:

function mutex_action(a1, a2, mu_P):
  if not independent(a1, a2):
    return true
  for each p1 in precond(a1):
    for each p2 in precond(a2):
      if (p1, p2) in mu_P:
        return true
  return false
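Both mutex tests can be made runnable. A sketch, assuming actions are `(precond, pos_effects, neg_effects)` frozenset triples and mutex sets store unordered pairs as frozensets (a representation chosen purely for illustration):

```python
def independent(a1, a2):
    p1, pos1, neg1 = a1
    p2, pos2, neg2 = a2
    return not (neg1 & (p2 | pos2)) and not (neg2 & (p1 | pos1))

def mutex_action(a1, a2, mu_P):
    # Mutex if dependent, or if some pair of their preconditions is mutex.
    if not independent(a1, a2):
        return True
    return any(frozenset({q1, q2}) in mu_P
               for q1 in a1[0] for q2 in a2[0])

def mutex_proposition(producers_p, producers_q, mu_A):
    # Mutex iff every pair of producing actions is mutex; a shared
    # producer (pair {a, a} never in mu_A) makes the propositions compatible.
    return all(frozenset({a1, a2}) in mu_A
               for a1 in producers_p for a2 in producers_q)

# Example: a2 requires "p", which a1 deletes, so the two are dependent.
a1 = (frozenset({"p"}), frozenset({"q"}), frozenset({"p"}))
a2 = (frozenset({"p"}), frozenset({"r"}), frozenset())
```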

How do mutex relations propagate through the planning graph?

If $p,q \in P_{j-1}$ and $(p, q) \notin \mu P_{j-1}$ then $(p,q) \notin \mu P_j$.


If $a_1,a_2 \in A_{j-1}$ and $(a_1, a_2) \notin \mu A_{j-1}$ then $(a_1,a_2) \notin \mu A_j$.


So mutex relations can only disappear, never appear, as we move forward through the planning graph: once a pair is non-mutex in some layer, it stays non-mutex in all later layers.

An action whose preconditions $p$ and $q$ are mutex is impossible, so we can remove it from the graph.

Forward planning graph expansion

This is the process of growing the planning graph. Theoretically, the planning graph is infinite, but we can set a limit on it given a planning problem $P=(A,s_i,g)$.

If $g$ is reachable from $s_i$, then there is a proposition layer $P_g$ such that $g \subseteq P_g$ and $\not \exists g_1, g_2 \in g: (g_1, g_2) \in \mu P_g$ (that is, there are no pairs of goal propositions that are mutually exclusive in the proposition layer $P_g$).

The basic idea behind the Graphplan algorithm:

- expand the planning graph one layer at a time, starting from the initial state
- once a proposition layer contains all goal propositions with no mutex pair among them, search backward from that layer for a solution plan
- if the backward search fails, expand the graph by another layer and try again
- stop when a solution is found, or when it can be shown that no solution exists (using the fixed-point level and the nogood table)

Pseudocode for the expand step:
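The expand step can be sketched in Python as follows; the tuple-based `(name, precond, pos, neg)` action representation and the small move example are assumptions for illustration:

```python
def independent(a1, a2):
    _, p1, pos1, neg1 = a1
    _, p2, pos2, neg2 = a2
    return not (neg1 & (p2 | pos2)) and not (neg2 & (p1 | pos1))

def noop(p):
    # No-op action for p: precondition p, positive effect p.
    return ("noop-" + p, frozenset({p}), frozenset({p}), frozenset())

def expand(P_prev, mu_P_prev, actions):
    # A_j: actions (plus no-ops) whose preconditions all appear in P_{j-1}
    # and contain no mutex pair.
    candidates = list(actions) + [noop(p) for p in P_prev]
    A = [a for a in candidates
         if a[1] <= P_prev
         and not any(frozenset({p, q}) in mu_P_prev
                     for p in a[1] for q in a[1] if p != q)]
    # mu_A_j: dependent actions, or actions with mutex preconditions.
    mu_A = {frozenset({a1, a2})
            for i, a1 in enumerate(A) for a2 in A[i + 1:]
            if not independent(a1, a2)
            or any(frozenset({p, q}) in mu_P_prev
                   for p in a1[1] for q in a2[1])}
    # P_j: every positive effect (no-ops carry P_{j-1} forward).
    P = frozenset().union(*(a[2] for a in A)) if A else frozenset()
    # mu_P_j: p and q are mutex iff every pair of their producers is mutex.
    producers = {p: [a for a in A if p in a[2]] for p in P}
    mu_P = {frozenset({p, q}) for p in P for q in P
            if p != q
            and all(frozenset({a1, a2}) in mu_A
                    for a1 in producers[p] for a2 in producers[q])}
    return A, mu_A, P, mu_P

# Tiny example: a robot at l1 can move to l2 or l3 (but not both at once).
P0 = frozenset({"at(r1,l1)"})
m12 = ("move(r1,l1,l2)", frozenset({"at(r1,l1)"}),
       frozenset({"at(r1,l2)"}), frozenset({"at(r1,l1)"}))
m13 = ("move(r1,l1,l3)", frozenset({"at(r1,l1)"}),
       frozenset({"at(r1,l3)"}), frozenset({"at(r1,l1)"}))
A1, mu_A1, P1, mu_P1 = expand(P0, set(), [m12, m13])
```

All three resulting propositions (the two destinations plus the carried-over origin) end up pairwise mutex, as expected.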

The size of a planning graph up to level $k$ and the time required to expand it to that level are both polynomial in the size of the planning problem.

Proof: given a problem size of $n$ propositions and $m$ actions, $|P_j| \leq n, |A_j| \leq n+m$, including no-op actions. The algorithms for generating each layer and all link types are polynomial in the size of the layer and we have a linear number of layers $k$.

Eventually a planning graph $G$ will reach a fixed-point level, which is the $k$th level such that for all $i, i>k$, level $i$ of $G$ is identical to level $k$, i.e. $P_i = P_k, \mu P_i = \mu P_k, A_i = A_k, \mu A_i = \mu A_k$.

This is because as $P_i$ grows monotonically, $\mu P_i$ shrinks monotonically, and $A_i, P_i$ depend only on $P_{i-1}, \mu P_{i-1}$.

Backward graph search

This is just a depth-first graph search, starting from the last proposition layer $P_k$, where the search nodes are subsets of nodes from the different layers.

The general idea:

- at proposition layer $P_j$, find a set of non-mutex actions in $A_j$ (possibly including no-ops) whose positive effects cover the current goal propositions
- the union of those actions' preconditions becomes the new goal set at layer $P_{j-1}$
- repeat until layer $P_0$ is reached (success), backtracking whenever no suitable set of actions exists

When implementing this search, we want to keep track of, for each proposition layer, which set of subgoals have failed the search (i.e. are unachievable). The motivation is that if we search and run into a failure state, we have to backtrack, and we don't want to end up in the same failure state some other way.

We use a nogood table $\nabla$ to keep track of what nodes we've seen before. Up to layer $k$, the nogood table is an array of $k$ sets of sets of goal propositions. That is, for each layer we have a set of sets. The inner sets represent a single combination of propositions that cannot be achieved. The outer set, then, contains all combinations of propositions that cannot be achieved for that layer.

So before searching for the set $g$ in a proposition layer $P_j$, we first check whether $g \in \nabla(j)$; that is, whether $g$ has already been determined unachievable for $P_j$.

Otherwise, if we do search for $g$ in $P_j$ and find that it is unachievable, we add $g$ to $\nabla(j)$.

For this backward search, we define a function extract:

The function gpSearch is defined as:
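A compact Python sketch of extract and gpSearch; the layer-indexed structures `A` and `mu_A`, the nogood table, and the `(name, precond, pos_effects)` action representation are all assumptions for illustration:

```python
def extract(g, j, A, mu_A, nogood):
    """Try to achieve goal set g at proposition layer j; return a
    layered plan (list of action sets) or None."""
    if j == 0:
        # Layer 0 is the initial state; graph construction guarantees
        # the subgoals hold there.
        return []
    if g in nogood[j]:
        return None
    plan = gp_search(g, frozenset(), j, A, mu_A, nogood)
    if plan is None:
        nogood[j].add(g)  # record the failed goal set for this layer
    return plan

def gp_search(g, chosen, j, A, mu_A, nogood):
    if not g:
        # All goals covered: the chosen actions' preconditions become
        # the goal set one layer down.
        subgoals = (frozenset().union(*(a[1] for a in chosen))
                    if chosen else frozenset())
        subplan = extract(subgoals, j - 1, A, mu_A, nogood)
        return None if subplan is None else subplan + [set(chosen)]
    p = next(iter(g))  # pick one pending goal proposition
    for a in A[j]:     # resolvers: produce p, not mutex with chosen actions
        if p in a[2] and not any(frozenset({a, b}) in mu_A[j]
                                 for b in chosen):
            plan = gp_search(g - a[2], chosen | {a}, j, A, mu_A, nogood)
            if plan is not None:
                return plan
    return None

# One-layer example: a single action achieves the goal {"q"} from {"p"}.
act = ("a", frozenset({"p"}), frozenset({"q"}))
plan = extract(frozenset({"q"}), 1, {1: [act]}, {1: set()},
               {0: set(), 1: set()})
```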

We can combine everything into the complete Graphplan algorithm:

The Graphplan algorithm is sound, complete, and always terminates.

The plan that is returned (if the planning problem has a solution, otherwise, no plan is returned) will have a minimal number of layers, but not necessarily a minimal number of actions.

It is orders of magnitude faster than the previously-discussed techniques due to the planning graph structure (the backwards search still takes exponential time though).

Other considerations

Planning under uncertainty

Thus far, all the approaches have assumed that the outcomes of actions are deterministic.

However, the outcomes of actions are often uncertain, so the resulting state is uncertain.

One approach is belief state search. A belief state is a set of world states, one of which is the true state, but we don't know which one. The resulting solution plan is a sequence of actions.

Another approach is contingency planning. The possible outcomes of an action are called contingencies. The resulting solution plan is a tree that branches according to contingencies. At branching points, observation actions are included to see which branch is actually happening.

Both of these approaches are naturally way more complex due to multiple possible outcomes for actions.

If we can quantify the degree of uncertainty we have (i.e. we know the probabilities of different outcomes given an action), we can use probabilistic planning. Instead of simple state transition systems, we use a partially observable Markov decision process (POMDP) as our model:

- a set of states $S$, not directly observable
- a set of actions $A$
- a set of observations $O$
- transition probabilities $P(s' | s, a)$
- observation probabilities $P(o | s)$
- rewards $R(s, a)$

Planning with time

So far we have assumed actions are instantaneous but in reality, they take time. That is, we should assume that our actions are durative (they take time). We can assume that actions take a known amount of time, with a start time point and an end time point.

With A* we can include time as part of an action's cost.

With partial plans (e.g. HTN) we can use a temporal constraint manager, which could be:

One way:

We can specify an earliest start time $ES(s)$ and a latest start time $LS(s)$ for each task/state in the network.

We define $ES(s_0) = 0$. For any other state:

$$ ES(s) = \max_{A \to s} \left( ES(A) + \text{duration}(A) \right) $$

Where $A \to s$ just denotes each predecessor state of $s$.

We define $LS(s_g) = ES(s_g)$. For any other state:

$$ LS(s) = \min_{s \to B} LS(B) - \text{duration}(s) $$

Where $s \to B$ just denotes each successor state of $s$.
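These two recurrences can be computed with one forward and one backward pass over the network (given in topological order). A sketch; the task names and the tiny example network are made up for illustration:

```python
def earliest_starts(duration, preds, order):
    # ES(s) = max over predecessors A of ES(A) + duration(A); ES(start) = 0.
    ES = {}
    for s in order:  # order must be topological
        ES[s] = max((ES[a] + duration[a] for a in preds[s]), default=0)
    return ES

def latest_starts(duration, preds, order, ES, goal):
    # LS(goal) = ES(goal); LS(s) = min over successors B of LS(B) - duration(s).
    succs = {s: [] for s in order}
    for s in order:
        for a in preds[s]:
            succs[a].append(s)
    LS = {goal: ES[goal]}
    for s in reversed(order):
        if s != goal:
            LS[s] = min(LS[b] for b in succs[s]) - duration[s]
    return LS

# Example: tasks a (2 units) and b (3 units) must both finish before c.
duration = {"a": 2, "b": 3, "c": 1}
preds = {"a": [], "b": [], "c": ["a", "b"]}
order = ["a", "b", "c"]
ES = earliest_starts(duration, preds, order)
LS = latest_starts(duration, preds, order, ES, "c")
```

Here task "b" is on the critical path ($LS = ES = 0$), while "a" has one unit of slack.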

Multi-agent planning

So far we have assumed planners working in isolation, with control over all the other agents in the plan.

Planners may need to work in concert with other planners or other agents (e.g. people); such scenarios introduce a few new concerns, such as how plans and outcomes are represented and communicated across agents.

Multi-agent planning does away with the assumption of an isolated planner, and results in a much more complex problem:

Other things that must be considered during execution of multi-agent plans:

Scheduling: Dealing with resources

Actions need resources (time can also be thought of as a resource).

Planning which deals with resources is known as scheduling.

A resource is an entity needed to perform an action; resources are described by resource variables.

A distinction between state and resource variables:

Some resource types include:

Planning approaches can be arranged in a table:

|                      | Deterministic                  | Stochastic |
|----------------------|--------------------------------|------------|
| Fully observable     | A*, depth-first, breadth-first | MDP        |
| Partially observable |                                | POMDP      |

Learning plans

There are a few ways an agent can learn plans. Presented here are some simpler ones; see the section on Reinforcement Learning for more sophisticated methods.


We can use machine learning to learn how to navigate a state space (i.e. to plan) by "watching" another agent (an "expert") perform. For example, if we are talking about a game, the learning agent can watch a skilled player play, and based on that learn how to play on its own.

In this case, the examples are states $s$, the candidates are pairs $(s,a)$, and the "correct" actions are those taken by the expert.

We define features over $(s,a)$ pairs: $f(s,a)$.

The score of a q-state $(s,a)$ is given by $w \cdot f(s,a)$.

This is basically classification, where the inputs are states and the labels are actions.
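A minimal sketch of this setup; the feature function and weight values are illustrative, and in practice $w$ would be learned from the expert's $(s,a)$ pairs:

```python
def score(w, f, s, a):
    # w . f(s, a): dot product of weights with the q-state's features.
    return sum(w.get(k, 0.0) * v for k, v in f(s, a).items())

def choose_action(w, f, s, actions):
    # Classification view: the predicted "label" is the highest-scoring action.
    return max(actions, key=lambda a: score(w, f, s, a))

# Toy domain: the state is a position on a line; the goal sits at 10.
def features(s, a):
    step = 1 if a == "right" else -1
    return {"moves_closer": 1.0 if abs(s + step - 10) < abs(s - 10) else 0.0}

w = {"moves_closer": 2.5}  # made-up learned weight
```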

Case-Based Goal Formulation

With case-based goal formulation, a library of cases relevant to the problem is maintained (e.g. with RTS games, this could be a library of replays for that game). Then the agent uses this library to select a goal to pursue, given the current world state. That is, the agent finds the case state $q$ (from a case library $L$) most similar to the current world state $s$:

$$ q = \argmin_{c \in L} \text{distance}(s, c) $$

Where the distance metric may be domain independent or domain specific.

Then, the goal state $g$ is formulated by looking ahead $n$ actions from $q$ to a future state in that case $q'$, finding that difference, and adding that to the current world state $s$:

$$ g = s + (q' - q) $$

The number of actions $n$ is called the planning window size. A small planning window is better for domains where plans are invalidated frequently.
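A sketch of the whole procedure over vector-valued states; the case-library format (one list of state vectors per recorded trace) and the Euclidean distance metric are illustrative assumptions:

```python
def distance(s, c):
    # Squared Euclidean distance; a real system might substitute a
    # domain-specific metric here.
    return sum((x - y) ** 2 for x, y in zip(s, c))

def formulate_goal(s, library, n):
    # Find the case state q closest to s that has at least n successor
    # states in its trace (n is the planning window size).
    best = None
    for trace in library:
        for i in range(len(trace) - n):
            d = distance(s, trace[i])
            if best is None or d < best[0]:
                best = (d, trace[i], trace[i + n])
    _, q, q_ahead = best
    # g = s + (q' - q), applied componentwise.
    return [si + (qa - qi) for si, qi, qa in zip(s, q, q_ahead)]

library = [[[0, 0], [1, 0], [2, 1]]]  # a single made-up recorded trace
goal = formulate_goal([0, 1], library, 2)
```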