In artificial intelligence, problems are often represented using the **state-space** representation (sometimes called a **state-transition system**), in which the possible states of the problem and the operations that move between them are represented as a graph or a tree:

- Nodes are (abstracted) world configurations (states)
- Arcs represent successors (action results)
- A goal test identifies the goal nodes (there may be just a single goal)
- Each state occurs only once as a node

More formally, we consider a problem to consist of a starting state (or a set of possible starting states), a set of operators for moving between states, and a goal test.

The distinction between state-space and **situation-space** is as follows: if the relevant parts of the problem are **fully-specified** (fully-known), then we work with *states* and *operators*, and have a state-space problem. If there is missing information (i.e., the problem is **partially-specified**), then we work with *situations* and *actions* (note that operators are often referred to as actions in state-space as well), and we have a situation-space problem. Most of what is said for state-space problems is applicable to situation-space problems.

For now we will focus on state-space problems.

This state-space model can be applied to itself, such that a given problem can be decomposed into subproblems (also known as *subgoals*); the relationships between the problem and its subproblems (and their subproblems' subproblems, etc.) are also represented as a graph. Successor relationships can be grouped together by AND or OR arcs. A problem node whose subproblems are linked by AND arcs requires all of the grouped subproblems to be resolved; a problem whose subproblems are linked by OR arcs requires only one of them to be resolved. Using this graph, you can identify a set of subproblems which can be used to solve the primary problem. This process is known as *problem reduction*.

We take this state-space representation as the basis for a search problem.

A search problem consists of:

- an **initial state**
- a set of possible *actions*/applicability conditions
- a **successor function**: maps a state to a set of (action, state) pairs
- the *state space*: the successor function plus the initial state (a directed graph as described before)
- a **path** (i.e. a solution)
- a **goal** (a goal state or a goal test function)
- a **path cost** function (for optimality; generally it is the sum of the step costs)

To clarify some terminology:

- if node A leads to node B, then node A is a *parent* of B and B is a **successor** (or *child*) of A
- if the edge connecting A to B is due to an operator $q$, we say that "B is a successor of A under the operator $q$"
- if a node has no successors, it is a **terminal** node
- if there is a path between node A and node C such that node A is a parent of a parent ... of a parent of C, then A is an *ancestor* of C and C is a *descendant* of A
- if the graph is cyclic, e.g. there is a path from A through C back to A, then A is both an ancestor and a descendant of C

Practically, we may use a data structure for nodes that encapsulate the following information for the node:

- *state* - a state in the state space
- *parent node* - the immediate predecessor in the search tree (only the root node has no parent)
- *action* - the action that, when performed in the parent node's state, leads to this node's state
- *path cost* - the path cost leading to this node
- *depth* - the depth of this node in the search tree
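
A node data structure along these lines might look like the following minimal Python sketch (the names `Node`, `child_node`, and `path_to` are illustrative, not from any particular library):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Node:
    """A search-tree node wrapping a state with bookkeeping."""
    state: Any                       # a state in the state space
    parent: Optional["Node"] = None  # immediate predecessor (None for the root)
    action: Any = None               # action that led from parent's state here
    path_cost: float = 0.0           # cumulative cost from the root to this node
    depth: int = 0                   # number of ancestors

def child_node(parent: Node, action, state, step_cost: float) -> Node:
    """Build a successor node from a parent, an action, and its step cost."""
    return Node(state, parent, action, parent.path_cost + step_cost, parent.depth + 1)

def path_to(node: Node):
    """Recover the action sequence from the root to this node."""
    actions = []
    while node.parent is not None:
        actions.append(node.action)
        node = node.parent
    return list(reversed(actions))
```

Storing a parent pointer rather than a full path per node keeps memory down; the path is reconstructed only when a goal is found.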

In the context of artificial intelligence, a path through state-space is called a *plan* - search is fundamental to planning (other aspects of planning are covered in more detail later).

*Problem formulation* can itself be a problem, as it typically is with real-world problems. We have to consider how granular/abstract we want to be and what actions and states to include. To make this a bit easier, we typically make the following assumptions about the environment:

- finite and discrete
- fully observable
- deterministic
- static (no events)

And other assumptions are typically included as well:

- restricted goals
- sequential plans (no parallel activity in plans)
- implicit time (activities do not have a duration)
- offline planning (the state transition system is not changing while we plan)

In practice, we rarely build the full state-space graph in memory (because it is often way too big). Rather, we work with **trees**.

Trees have a few constraints:

- only one node does not have a parent: the **root node**
- every other node in the tree is a descendant of the root node
- every other node has only one parent

An additional term relevant to trees is **depth**, which is the number of ancestors a node has.

The root node is the current state and branches out into possible future states (i.e. the children are successors).

Given a tree with branching factor $b$ and maximum depth $m$, the tree contains $O(b^m)$ nodes.

To build out parts of the tree we are interested in, we take a node and apply a **successor function** (sometimes called a **generator function**) to **expand** the node, which gives us all of that node's successors (children).

There is also often a lot of repetition in search trees, which some search algorithm enhancements take advantage of.

We apply algorithms to this tree representation in order to identify paths (ideally the optimal path) from the root node (start state) to a goal node (a goal state, of which there may be many).

Most search algorithms share some common components:

- a **fringe** (sometimes called a **frontier**) of unexplored nodes is maintained
- some process for deciding which nodes to expand

The **general tree search** algorithm is as follows:

- initialize the fringe with a search node for the initial state
- iteratively:
- if the fringe is empty, return a failure
- otherwise, select a node from the fringe based on the current search strategy
- if this node's state passes the goal test (or is the goal state), return the path to this node
- otherwise, expand the fringe with this node's children (successors)
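
The loop above can be sketched as follows (Python; the fringe is left pluggable, which is exactly what different strategies vary - the `tree_search`, `StackFringe`, and `QueueFringe` names are illustrative):

```python
from collections import deque

def tree_search(initial_state, goal_test, successors, make_fringe):
    """General tree search: the fringe type determines the strategy.
    `successors(state)` yields (action, next_state) pairs."""
    fringe = make_fringe()
    fringe.push([initial_state])            # the fringe holds candidate paths
    while not fringe.is_empty():
        path = fringe.pop()                 # the strategy decides which path comes out
        state = path[-1]
        if goal_test(state):
            return path
        for _action, nxt in successors(state):
            fringe.push(path + [nxt])       # expand: add each child's path
    return None                             # fringe exhausted: failure

class StackFringe:                          # LIFO -> depth-first search
    def __init__(self): self.items = []
    def push(self, x): self.items.append(x)
    def pop(self): return self.items.pop()
    def is_empty(self): return not self.items

class QueueFringe:                          # FIFO -> breadth-first search
    def __init__(self): self.items = deque()
    def push(self, x): self.items.append(x)
    def pop(self): return self.items.popleft()
    def is_empty(self): return not self.items
```

Swapping `StackFringe` for `QueueFringe` turns depth-first search into breadth-first search without touching the loop.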

Most search algorithms are based on this general structure, varying in how they choose which node to expand from the fringe.

When considering search algorithms, we care about:

- *completeness* - is it guaranteed to find a solution if one exists?
- *optimality* - is it guaranteed to find the optimal solution, if one exists?
- *size complexity* - how much space does the algorithm need? Basically, how big can the fringe get?
- *time complexity* - how does runtime change with input size? Basically, how many nodes get expanded?

These search algorithms discussed here are *unidirectional*, since they only expand in one direction (from the start state down, or, in some cases, from the terminal nodes up). However, there are also *bidirectional* search procedures which start from both the start state and from the goal state. They can be difficult to use, however.

**Uninformed** search algorithms, sometimes called **blind** search algorithms, vary in how they decide which node to expand.

Consider the following search space, where $b$ is the branching factor and $m$ is the maximum depth.

Exhaustively search all paths (without revisiting any previously visited points) - it doesn't really matter how you decide which node to expand because they will all be expanded.

- time complexity: expands $O(b^m)$ nodes (if $m$ is finite)
- size complexity: the fringe takes $O(bm)$ space
- complete: yes, if $m$ is finite (i.e. if there are no cycles)
- optimal: no, it finds the "leftmost" solution

Go down the left branch of the tree (by convention) until you can't go any further.

If that is not your target, then **backtrack** - go up to the closest branching node and take the other leftmost path. *Backtracking* is a technique that appears in almost every search algorithm, where we try extending a path, and if the extension fails or is otherwise unsatisfactory, we take a step back and try a different successor.

Repeat until you reach your target.

It stops at the first complete path found, which may not be the optimal path.

Another way to think about depth-first search is with a queue which holds your candidate paths as you construct them; new paths are placed at the front, so the queue behaves as a LIFO stack.

Your starting "path" includes just the starting point:

Then on each iteration, you take the left-most path (which is always the first in the queue) and check if it reaches your goal.

If it does not, you extend it to build new paths, and replace it with those new paths.

On this next iteration, you again take the left-most path. It still does not reach your goal, so you extend it. And so on:

You can no longer extend the left-most path, so just remove it from the queue.

Then keep going.

- time complexity: expands $O(b^s)$ nodes, where $s$ is the depth of the shallowest solution
- size complexity: the fringe takes $O(b^s)$ space
- complete: yes
- optimal: yes, if all costs are 1; otherwise, a deeper path could have a cheaper cost

Build out the tree level-by-level until you reach your target.

In the queue representation, the only thing that is different from depth-first is that instead of placing new paths at the front of the queue, you place them at the back. Another way of putting this is that instead of a LIFO data structure for its fringe (as is used with DFS), BFS uses a FIFO data structure for its fringe.

We can make breadth-first search sensitive to path cost with **uniform cost search** (also known as *Dijkstra's algorithm*), in which we simply prioritize paths by their cumulative cost.

- time complexity: if we say the solution costs $C^*$ and arcs cost at least $\epsilon$, then the "effective depth" is roughly $C^*/\epsilon$, so the time complexity is $O(b^{C^*/\epsilon})$
- size complexity: the fringe takes $O(b^{C^*/\epsilon})$ space
- complete: yes, if the best solution has finite cost and the minimum arc cost is positive
- optimal: yes

On each iteration, extend the shortest cumulative path. Once you reach your goal, extend every other extendible path to check that its length ends up being longer than your current path to the goal.

The fringe is kept sorted so that the shortest path is first.
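
A cost-ordered fringe is typically implemented with a priority queue. A minimal uniform cost search sketch (Python; the graph encoding via `successors` is illustrative):

```python
import heapq

def uniform_cost_search(start, goal_test, successors):
    """Expand the cheapest cumulative path first.
    `successors(state)` yields (next_state, step_cost) pairs."""
    fringe = [(0.0, [start])]                 # (path cost, path), kept as a heap
    while fringe:
        cost, path = heapq.heappop(fringe)    # cheapest path comes out first
        state = path[-1]
        if goal_test(state):
            return cost, path
        for nxt, step in successors(state):
            heapq.heappush(fringe, (cost + step, path + [nxt]))
    return None
```

Because a path is only returned when it is popped (not when it is generated), the first goal popped is guaranteed to be the cheapest, assuming positive arc costs.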

This approach can be quite exhaustive, but it can be improved by using extended list filtering.

The general idea is to combine depth-first search's space advantage with breadth-first search's time/shallow-solution advantages.

- Run depth-first search with depth limit 1
- If no solution, run depth-first search with depth limit 2
- If no solution, run depth-first search with depth limit 3
- (etc.)
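
A sketch of this scheme (Python; `depth_limited_search` and `successors` are illustrative names, with `successors(state)` yielding child states):

```python
def depth_limited_search(state, goal_test, successors, limit):
    """Depth-first search that refuses to go deeper than `limit`."""
    if goal_test(state):
        return [state]
    if limit == 0:
        return None
    for nxt in successors(state):
        result = depth_limited_search(nxt, goal_test, successors, limit - 1)
        if result is not None:
            return [state] + result
    return None

def iterative_deepening_search(start, goal_test, successors, max_depth=50):
    """Re-run depth-limited DFS with limits 1, 2, 3, ... until a solution appears."""
    for limit in range(1, max_depth + 1):
        result = depth_limited_search(start, goal_test, successors, limit)
        if result is not None:
            return result
    return None
```

Re-expanding the shallow levels on every iteration looks wasteful, but in a tree with branching factor $b$ the deepest level dominates the node count, so the overhead is only a constant factor.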

**Extended list filtering** involves maintaining a list of visited nodes and only expanding nodes in the fringe if they have not already been expanded - it would be redundant to search again from that node.

For example, branch and bound search can be combined with extended list filtering to make it less exhaustive.

Informed search algorithms improve on uninformed search by incorporating **heuristics** which tell us whether or not we're getting closer to the goal. With heuristics, we can search less of the search space.

In particular, we want **admissible** heuristics: a heuristic is admissible if it never overestimates the distance to the goal.

Formally, a heuristic $h$ is admissible if $h(n) \le h^*(n)$ for every node $n$, where $h^*(n)$ is the actual cost of the cheapest path from $n$ to the goal.

That is, a heuristic is admissible if the estimated distance $h(n)$ never exceeds the actual distance to the goal.

Note that sometimes inadmissible heuristics (i.e. those that sometimes overestimate the distance to the goal) can still be useful.

The specific heuristic function is chosen depending on the particular problem (i.e. we estimate the distance to the goal state differently in different problems, for instance, with a travel route, we might estimate the cost with linear distance to the target city).

The typical trade-off with heuristics is between simplicity/efficiency and accuracy.

The question of finding good heuristics, and doing so automatically, has been a big topic in AI planning recently.

**Best-first** search algorithms are those that select the next node from the fringe by an evaluation function $f(n)$, expanding the node with the best (lowest) value first.

With **greedy best-first search**, the fringe is kept sorted by heuristic distance to the goal; that is, $f(n) = h(n)$.

This often ends up with a suboptimal path, however.

Beam search is essentially breadth-first search, but we set a *beam width* $w$ that limits how many nodes are kept at each level.

The fringe is the same as in breadth-first search, but we keep only the $w$ best nodes (according to the heuristic) at each level.

Beam search is *not* complete, unless the iterative approach is used.

A* is an extension of branch & bound search which includes (admissible) heuristic distances in its sorting.

We define $f(n) = g(n) + h(n)$, where $g(n)$ is the cost of the path from the start to $n$ and $h(n)$ is the heuristic estimate of the distance from $n$ to the goal.

With A* search, we simply sort the fringe by $f(n)$.

A* search is optimal if the heuristic is *admissible*; that is, if it never overestimates the distance to the goal. It is complete as well; i.e. if a solution exists, A* will find it.

A* is also *optimally efficient* (with respect to the number of expanded nodes) for a given heuristic function. That is, no other optimal algorithm is guaranteed to expand fewer nodes than A*.

Uniform cost search is a special case of A* where $h(n) = 0$ for all $n$.

The downside of A* vs greedy best-first search is that it can be slower since it explores the space more thoroughly - it has worst case time and space complexity of $O(b^m)$.

Typically we are dealing with the worst case; the fringe usually grows exponentially. Sometimes the time complexity is permissible, but the space complexity is problematic because there may simply not be enough memory for some problems.
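
An A* sketch (Python), assuming `successors(state)` yields `(next_state, step_cost)` pairs and `h` is an admissible heuristic; the tie-breaking counter just keeps the heap from ever comparing paths:

```python
import heapq

def a_star(start, goal_test, successors, h):
    """Best-first search ordered by f(n) = g(n) + h(n)."""
    counter = 0
    fringe = [(h(start), 0, 0.0, [start])]        # (f, tie-breaker, g, path)
    while fringe:
        f, _, g, path = heapq.heappop(fringe)     # lowest f(n) comes out first
        state = path[-1]
        if goal_test(state):
            return g, path
        for nxt, step in successors(state):
            counter += 1
            g2 = g + step                          # cost so far
            heapq.heappush(fringe, (g2 + h(nxt), counter, g2, path + [nxt]))
    return None
```

With `h = lambda s: 0` this degenerates to the uniform cost search behavior described above, as expected from the special-case relationship.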

There is a variation of A* called **iterative deepening A*** (IDA*) which uses significantly less memory.

Iterative deepening A* is an extension of A* which uses an iterative approach, searching up to an $f$-cost limit and increasing that limit on each iteration until a solution is found.

Local search algorithms do not maintain a fringe; that is, we don't keep track of unexplored alternatives. Rather, we continuously try to improve a single option until we can't improve it anymore.

Instead of extending a plan, the successor function in local search takes an existing plan and just modifies a part of it.

Local search is generally much faster and more memory efficient, but because it does not keep track of unexplored alternatives, it is incomplete and suboptimal.

A basic method in local search is **hill climbing** - we choose a starting point, move to the best neighboring state (i.e. closest as determined by the heuristic), and repeat until there are no better positions to move to - we've reached the top of the hill. As mentioned, this is incomplete and suboptimal, as it can end up in local maxima.

The difference between hill climbing and greedy search is that with greedy search, the entire fringe is sorted by heuristic distance to the goal. With hill climbing, we only sort the children of the currently expanded node, choosing the one closest to the goal.
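
A hill-climbing sketch (Python), assuming a `neighbors(state)` function and a `score` function to maximize (for a heuristic distance you would maximize its negation):

```python
def hill_climb(state, neighbors, score):
    """Greedily move to the best neighbor until no neighbor improves the score.
    Returns a local maximum, which need not be the global one."""
    while True:
        best = max(neighbors(state), key=score, default=None)
        if best is None or score(best) <= score(state):
            return state                 # no better neighbor: top of (a) hill
        state = best
```

Note there is no fringe at all: only the current state is kept, which is where the memory advantage (and the incompleteness) comes from.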

You can also use *simulated annealing* (detailed elsewhere) to try to escape local maxima. This helps, and it has a theoretical guarantee of converging to the optimal state given infinite time, but of course that is not a practical guarantee for real-world applications. So in practice simulated annealing can do better, but it can still end up in local optima.

You can also use *genetic algorithms* (detailed elsewhere).

Up until now we have considered search algorithms in the context of trees.

With search trees, we often end up with states repeated throughout the tree, which will have redundant subtrees, and thus end up doing (potentially a lot of) redundant computation.

Instead, we can consider the search space as a graph.

Graph search algorithms are typically just slight modifications of tree search algorithms. One main modification is the introduction of a list of **explored** (or **expanded**) nodes, so that we only expand states which have not already been expanded.

Completeness is not affected by graph search, but optimality is. We may close off a branch because we have already expanded that state elsewhere, but it's possible that the shortest path still goes through that state. Graph search algorithms (such as the graph search version of A*) can be made optimal through an additional constraint on admissible heuristics: *consistency*.

The main idea of **consistency** is that the estimated heuristic costs should be less than or equal to the actual costs for each arc between *any* two nodes, not just between any node and the goal state: for every node $n$ and successor $n'$ with arc cost $c(n, n')$, we require $h(n) \le c(n, n') + h(n')$.

That is, the difference between the estimated distances of two nodes, $|h(n) - h(n')|$, must be no greater than the actual cost $c(n, n')$ between them.

Consistency enforces this for any two nodes, which includes the goal node, so consistency implies admissibility.

**Adversarial search** is essentially search for games involving two or more players.

There are many kinds of games - here we primarily consider games that are:

- deterministic (sometimes called "non-chance")
- two-player
- turn-based
- zero-sum: agents have opposite utilities (one's gain is another's loss), also known as a "pure competition" or "strictly competitive" game.
- perfect information: every player has full knowledge of what state the game is in and what actions are possible

Note that while we are only considering zero-sum games, in *general* games agents have independent utilities, so there is opportunity for cooperation, indifference, competition, and so on.

One way of formulating of games (there are many) is as a tree:

- states $S$, starting with $s_0$
- players $P = \{1, \dots, n\}$, usually taking turns
- actions $A$ (may depend on player/state)
- a transition function (analogous to a successor function): $S \times A \to S$
- a terminal test (analogous to a goal test): $S \to \{t, f\}$
- a terminal utility function (computes how much an end/terminal state is worth to each player): $S \times P \to \mathbb{R}$. For example, we may assign a utility of 100 to terminal states where we win, and -100 to terminal states where we lose.

We want our adversarial search algorithm to return a **strategy** (a **policy**), which is essentially a function that returns an action given a state. That is, a policy tells us what action to take in a state - this is contrasted with a plan, which details a step-by-step procedure from start to finish. We need a strategy because we can't count on opponents acting in any particular way, so we must be able to respond to their actions.

The solution, then, for an adversarial search algorithm is a policy $\pi: S \to A$ for the player.

In **minimax**, we start at the bottom of the tree (where we have computed utilities for the terminal nodes) and move upwards. We propagate the terminal utilities through the graph up to the root node, propagating at each depth the utility that satisfies a particular criterion.

For nodes at depths that correspond to the opponent's turns, we assume that the opponent chooses their best move (that is, we assume they are a perfect adversary), which means we propagate the *minimum* utility for us.

For nodes at depths that correspond to our turn, we want to choose our best move; that is, we propagate the *maximum* utility for us.

The propagated utility is known as the **backed-up evaluation**.

At the end, this gives us a utility for the root node, which gives us a value for the current state.

Minimax is just like exhaustive depth-first search, so its time complexity is $O(b^m)$ and its space complexity is $O(bm)$.

Minimax is optimal against a perfect adversarial player (that is, an opponent that always takes their best action), but not necessarily otherwise.
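
A minimax sketch over an explicit game tree (Python; assumes `children(state)` returns successor states, empty for terminals, and `utility(state)` scores terminals from the maximizing player's perspective):

```python
def minimax(state, children, utility, maximizing=True):
    """Back up terminal utilities: max on our turns, min on the opponent's."""
    kids = children(state)
    if not kids:                         # terminal node: use its utility directly
        return utility(state)
    values = [minimax(c, children, utility, not maximizing) for c in kids]
    return max(values) if maximizing else min(values)
```

Running this on the root gives the backed-up evaluation of the current state; choosing the child that achieves that value gives the move to play.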

Most interesting games have game trees far too deep to expand all the way to the terminal nodes.

Instead, we can use depth-limited search to only go down a few levels. However, since we don't reach the terminal nodes, their values never propagate up the tree. How will we compute the utility of any given move?

We can introduce an **evaluation function** which computes a utility for non-terminal positions, i.e. it estimates the value of an action. For instance, with chess, you could just take the difference between the number of your units and the number of the opponent's units. Generally moves that lower your opponent's unit count are better, but not always.

Often in games there are some time constraints - for instance, the computer opponent should respond within a reasonable amount of time.

Iterative deepening can be applied to minimax: run it for a set amount of time, then return the best policy found thus far.

This type of algorithm is called an **anytime algorithm** because it has an answer ready at any time.

If the game is not zero-sum or has multiple players, we can generalize minimax as such:

- terminal nodes have utility *tuples*
- node values are also utility tuples
- each player maximizes their own component

This can model cooperation and competition dynamically.

We can further improve minimax by *pruning* the game tree; i.e. removing branches we know won't be worthwhile. This variation is known as **alpha-beta** search.

Here we can look at branching and figure out a bound for describing its score.

First we look at the left-most branch and see the value 2 in its left-most terminal node. Since we are looking for the min here, we know that the score for this branch node will be at most 2. If we then look at the other terminal node, we see that it is 7 and we know the branch node's score is 2.

At this point we can apply a similar logic to the next node up (where we are looking for the max). We know that it will be at least 2.

So then we look at the next branch node and see that it will be at most 1. We don't have to look at the very last terminal node because now we know that the max node can only be 2. So we have saved ourselves a little trouble.

In larger trees this approach becomes very valuable, since you are effectively discounting entire branches and saving a lot of unnecessary computation. This allows you to compute deeper trees.
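
The pruning logic can be sketched as follows (Python; assumes, as with minimax, that `children(state)` returns successor states, empty for terminals, and `utility(state)` scores terminals for the maximizing player):

```python
import math

def alphabeta(state, children, utility, maximizing=True,
              alpha=-math.inf, beta=math.inf):
    """Minimax with pruning: alpha = best value max can force so far,
    beta = best value min can force so far; stop once alpha >= beta."""
    kids = children(state)
    if not kids:
        return utility(state)
    if maximizing:
        value = -math.inf
        for c in kids:
            value = max(value, alphabeta(c, children, utility, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:
                break                    # min will never allow this branch
        return value
    else:
        value = math.inf
        for c in kids:
            value = min(value, alphabeta(c, children, utility, True, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:
                break                    # max already has something better
        return value
```

On the worked example (terminals 2, 7, 1, and one more), the last terminal is never evaluated, exactly as the walkthrough describes.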

Note that with alpha-beta, the minimax value computed for the root is always correct, but the values of intermediate nodes may be wrong, and as such, (naive) alpha-beta is not great for action selection. Good ordering of child nodes improves upon this. With a "perfect ordering", time complexity drops to $O(b^{m/2})$.

Generally you want to generate game trees so that successors to each node are ordered left-to-right in descending order of their eventual backed-up evaluations (such an ordering is called the "correct" ordering). Naturally, it is quite difficult to generate this ordering before these evaluations have been computed.

Thus a *plausible* ordering must suffice. These are a few techniques for generating plausible orderings of nodes:

- **Generators** first produce the most immediately desirable choices (though without regard to possible consequences further on)
- **Shallow search** first generates some of the tree, then uses a static evaluation function and computes backed-up evaluations upwards to order the results
- **Dynamic generation**, in which alpha-beta is applied to identify plausible branches of the game tree; each branch is then evaluated, which can cause the ordering to change

In many situations the outcomes of actions are uncertain. Another way of phrasing this is that actions may be *noisy*.

Like adversarial search, non-deterministic search solutions take the form of policies.

We can model uncertainty as a "dumb" adversary in a game.

Whereas in minimax we assume a "smart" adversary, and thus consider worst-case outcomes (i.e. that the opponent plays their best move), with non-deterministic search, we instead consider average-case outcomes (i.e. expected utilities). This is called **expectimax** search.

So instead of minimax's min nodes, we have "chance" nodes, though we still keep max nodes. For a chance node, we compute its expected utility as the weighted (by probability) average of its children.
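
An expectimax sketch (Python; the `node_type` encoding is illustrative - chance nodes hold `(probability, child)` pairs):

```python
def expectimax(state, node_type, children, utility):
    """node_type(state) returns 'max', 'chance', or 'terminal'."""
    kind = node_type(state)
    if kind == "terminal":
        return utility(state)
    if kind == "max":
        return max(expectimax(c, node_type, children, utility)
                   for c in children(state))
    # chance node: probability-weighted average of child values
    return sum(p * expectimax(c, node_type, children, utility)
               for p, c in children(state))
```

The only structural change from minimax is the chance branch: a weighted sum instead of a min.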

Because we take the weighted average of children for a chance node's utility, we cannot use alpha-beta pruning as we could with minimax. There could conceivably be an unexplored child which increases the expected utility enough to make that move ideal, so we have to explore all child nodes to be sure.

We can have games that involve adversaries *and* chance, in which case we would have both minimax layers and expectimax layers. This approach is called **expectiminimax**.

Say you are at some arbitrary position in your search tree (it could be the start or somewhere further along). You can treat the problem of what node to move to next as a multi-armed bandit problem and apply the **Monte Carlo tree search** (MCTS) technique.

Say you have multiple options with uncertain payouts. You want to maximize your overall payout, and it seems the most prudent strategy would be to identify the one option which consistently yields better payouts than the other options.

However - how do you identify the best option, and do so quickly?

This problem is known as the **multi-armed bandit** problem, and a common strategy is based on upper confidence bounds (UCB).

To start, you randomly try the options and compute a confidence interval for each option's payout, e.g. of the form:

$$\bar x_i \pm \sqrt{\frac{2 \ln n}{n_i}}$$

where:

- $\bar x_i$ is the mean payout for option $i$
- $n_i$ is the number of times option $i$ was chosen
- $n$ is the total number of trials

You take the upper bound of these confidence intervals and continue to choose the option with the highest upper bound. As you use this option more, its confidence interval will narrow (since you have collected more data on it), and eventually another option's confidence interval upper bound will be higher, at which point you switch to that option.
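
The selection rule can be sketched as follows (Python; assumes per-option running means and play counts are tracked elsewhere, and uses $c = \sqrt{2}$, the common UCB1 exploration constant):

```python
import math

def ucb1(mean_payout, n_i, n, c=math.sqrt(2)):
    """Upper confidence bound for one option: mean payout plus an
    exploration bonus that shrinks as the option is tried more."""
    return mean_payout + c * math.sqrt(math.log(n) / n_i)

def pick_option(stats):
    """stats: {option: (mean_payout, times_chosen)}; pick the highest upper bound."""
    n = sum(plays for _, plays in stats.values())
    return max(stats, key=lambda o: ucb1(stats[o][0], stats[o][1], n))
```

Note how a rarely-tried option can win on its exploration bonus even with a lower mean, which is exactly the exploration behavior described above.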

At first, you have no statistical information about the child nodes to compute confidence intervals. So you randomly choose a child and run Monte Carlo simulations down that branch to see the outcomes.

For each simulation run, you go along each node in the branch that was walked and increment its play count (i.e. number of trials) by 1, and if the outcome is a win, you increment its win count by 1 as well (this explanation assumes a game, but it generalizes to other cases).

You repeat this until you have enough statistics for the direct child nodes of your current position to make a UCB choice as to where to move next.

You will need to run fewer simulations over time because you accumulate these statistics for the search tree.

There is also a variation of MCTS in which fixed scores are assigned to unvisited nodes.

MDPs are another way of modeling non-deterministic search.

MDPs are essentially Markov models, but there's a choice of action.

In MDPs, there may be two types of rewards (which can be positive or negative):

- terminal rewards (i.e. those that come at the end, these aren't always present)
- "living" rewards, which are given for each step (these are always present)

For instance, you could imagine a maze arranged on a grid. The desired end of the maze has a positive terminal reward and a dead end of the maze has a negative terminal reward. Every non-terminal position in the maze also has a reward ("living" rewards) associated with it. Often these living rewards are negative so that each step is penalized, thus encouraging the agent to find the desired end in as few steps as possible.

The agent doesn't have complete knowledge of the maze so every action has an uncertain outcome. It can try to move north - sometimes it will successfully do so, but sometimes it will hit a wall and remain in its current position. Sometimes our agent may even move in the wrong direction (e.g. maybe a wheel gets messed up or something).

This kind of scenario can be modeled as a Markov Decision Process, which includes:

- a set of states $s \in S$
- a set of actions $a \in A$
- a transition function $T(s,a,s')$, sometimes called a **state transition matrix**
    - gives the probability that taking $a$ in $s$ leads to $s'$, i.e. $P(s'|s,a)$
    - also called the "model" or the "dynamics"
- a reward function $R(s,a,s')$ (sometimes just $R(s)$ or $R(s')$), sometimes called a *utility function*, which associates a reward (or penalty) with each state
- a discount $\gamma$
- a start state
- maybe a terminal state

MDPs, as non-deterministic search problems, can be solved with expectimax search.

MDPs are so named because we make the assumption that action outcomes depend only on the current state (i.e. the Markov assumption).

The solution of an MDP is an optimal *policy* $\pi^*: S \to A$, which:

- gives us an action to take for each state
- an optimal policy maximizes expected utility if followed
- an explicit policy defines a reflex agent

In contrast, expectimax does not give us entire policies. Rather, it gives us an action for a single state only. It's similar to a policy, but requires re-computing at each step. Sometimes this is fine because a problem may be too complicated to compute an entire policy anyway.

The objective of an MDP is to maximize the expected sum of all future rewards, i.e. $E[\sum_{t \geq 0} r_t]$.

Sometimes a **discount factor** $\gamma$ is applied so that rewards received further in the future count for less: $E[\sum_{t \geq 0} \gamma^t r_t]$.

Using this, we can define a **value function** $V(s)$ over states.

That is, $V(s)$ is the expected sum of future discounted rewards provided we start in state $s$ and act optimally.

This can be computed empirically via simulations. In particular, we can use the **value iteration** algorithm.

With value iteration, we recursively calculate the value function, starting from the goal states, to get the optimal value function, from which we can derive the optimal policy.

More formally - we want to recursively estimate the value $V(s)$ of each state from the values of its successor states.

This method is called *back-up*.

In terminal states, we just set $V(s)$ to the terminal reward $R(s)$.

We estimate these values over all our states - these estimates eventually converge.

This function essentially defines the optimal policy - that is:

$$\pi^*(s) = \arg\max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V(s') \right]$$

(since it's a maximization, terms that are constant across actions can be dropped)

Note that the X square is a wall. Every movement has an uncertain outcome, e.g. if the agent moves to the east, it may only successfully do so with an 80% chance.

For a small step penalty, the optimal policy is:

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | → | → | → | +1 |
| 1 | ↑ | X | ← | -1 |
| 2 | ↑ | ← | ← | ↓ |

At C1 the agent plays very conservatively and moves in the opposite direction of the negative terminal position because it can afford doing so many times until it accidentally randomly moves to another position.

Similar reasoning is behind the policy at D2.

For a stronger step penalty, the optimal policy is:

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | → | → | → | +1 |
| 1 | ↑ | X | ↑ | -1 |
| 2 | ↑ | ← | ← | ← |

With a stronger step penalty, the agent finds it better to take a risk and move upwards at C1, since it's too expensive to play conservatively.

Similar reasoning is behind the change in policy at D2.

For a very large step penalty, the optimal policy is:

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | → | → | → | +1 |
| 1 | ↑ | X | → | -1 |
| 2 | → | → | → | ↑ |

With such a large movement penalty, the agent decides it's better to "commit suicide" by diving into the negative terminal node and end the game as soon as possible.

Each MDP state projects an expectimax-like search tree; that is, we build a search tree from the current state detailing what actions can be taken and the possible outcomes for each action.

We can describe an action and a state together as a **q-state** $(s, a)$ - being in a q-state means you are in state $s$ and committed to action $a$; the resulting outcome $(s, a, s')$ is a *transition*.

How should we encode preferences for *sequences* of utilities? For example, should the agent prefer a reward that arrives sooner over the same reward arriving later?

We can model this by *discounting*, that is, *decaying* reward value exponentially. If a reward is worth 1 now, it is worth $\gamma$ one time step later, $\gamma^2$ two steps later, and so on.

**Stationary preferences** are those which are invariant to the inclusion of another reward which delays the others in time, i.e.:

$$[a, r_1, r_2, \dots] \succ [a, r'_1, r'_2, \dots] \Leftrightarrow [r_1, r_2, \dots] \succ [r'_1, r'_2, \dots]$$

Nonstationary preferences are possible, e.g. if the delay of a reward changes its value relative to other rewards (maybe it takes a greater penalty for some reason).

With stationary preferences, there are only two ways to define utilities:

- Additive utility: $U([r_0, r_1, r_2, \dots]) = r_0 + r_1 + r_2 + \dots$
- Discounted utility: $U([r_0, r_1, r_2, \dots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \dots$

Note that additive utility is just discounted utility where $\gamma = 1$.
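
Both definitions are straightforward to compute (Python sketch):

```python
def additive_utility(rewards):
    """U([r0, r1, ...]) = r0 + r1 + ..."""
    return sum(rewards)

def discounted_utility(rewards, gamma):
    """U([r0, r1, ...]) = r0 + gamma*r1 + gamma^2*r2 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```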

For now we will assume stationary preferences.

If a game lasts forever, do we have infinite rewards? Infinite rewards make it difficult to come up with a good policy.

We can specify a finite horizon (like depth-limited search) and consider only up to some fixed number of steps. This gives us nonstationary policies, since the optimal action can depend on how much time is left.

Alternatively, we can just use discounting, where $0 < \gamma < 1$.

A smaller $\gamma$ means a shorter effective *horizon*.

Another way is to use an **absorbing state**. That is, we guarantee that for every policy, a terminal state will eventually be reached.

Usually we use discounting.

We say that the value (utility) of a state $s$, written $V^*(s)$, is the expected utility of starting in $s$ and acting optimally.

While a reward is for a state in a single time step, a value is the expected utility over all paths from that state.

The value (utility) of a q-state $(s, a)$ is called its **Q-value** $Q^*(s, a)$, which is the expected utility of starting out taking action $a$ from state $s$ and acting optimally afterwards.

The optimal policy $\pi^*(s)$ specifies the optimal action to take from each state.

So the main objective is to compute (expectimax) values for the states, since this gives us the expected utility (i.e. average sum of discounted rewards) under optimal action.

More concretely, we can define value recursively:

$V^*(s) = \max_a Q^*(s,a)$

$Q^*(s,a) = \sum_{s'} T(s,a,s')[R(s,a,s') + \gamma V^*(s')]$

These are the **Bellman equations**.

They can be more compactly written as:

$V^*(s) = \max_a \sum_{s'} T(s,a,s')[R(s,a,s') + \gamma V^*(s')]$

Again, because these trees can go on infinitely (or may just be very deep), we want to limit how far we search (that is, how far we do this recursive computation). We can specify *time-limited values*, i.e. define $V_k(s)$ to be the optimal value of $s$ if the game ends in $k$ more time steps.

To clarify, $V_0$ corresponds to the *bottom* of the tree, that is, the *last* time step (since there are 0 more steps to the end).

We can use this with the *value iteration* algorithm to efficiently compute these time-limited values:

- start with $V_0(s) = 0$ (i.e. with no time steps left, we have an expected reward sum of zero). Note that this is a zero vector over all states.
- given a vector of $V_k(s)$ values, do one ply of expectimax from each state:

$V_{k+1}(s) = \max_a \sum_{s'} T(s,a,s')[R(s,a,s') + \gamma V_k(s')]$

Note that since we are starting at the last time step and working backwards, each iteration adds one more step of lookahead.

Then we simply repeat until convergence. This converges if the discount is less than 1.

With the value iteration algorithm, each iteration has complexity $O(S^2 A)$ (for $S$ states and $A$ actions).

The approximations get refined towards optimal values the deeper you go into the tree. However, the policy may converge long before the values do - so while you may not have a close approximation of values, the policy/strategy they convey early on may already be optimal.
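A minimal sketch of value iteration on a hypothetical two-state MDP (the states, transition function `T`, and reward function `R` below are invented for illustration):

```
# Value iteration: V_{k+1}(s) = max_a sum_{s'} T(s,a,s') * (R(s,a,s') + gamma * V_k(s'))
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}  # V_0 = 0 for all states
    while True:
        V_new = {}
        for s in states:
            # one ply of expectimax: max over actions, expectation over outcomes
            V_new[s] = max(
                sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T(s, a))
                for a in actions(s)
            )
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new  # converged
        V = V_new

# Toy MDP: from 'a' you can 'stay' (reward 0) or 'go' to 'b' (reward 1); 'b' loops forever.
states = ['a', 'b']
actions = lambda s: ['stay', 'go'] if s == 'a' else ['stay']
T = lambda s, a: [('b', 1.0)] if (s, a) == ('a', 'go') else [(s, 1.0)]
R = lambda s, a, s2: 1.0 if (s, a) == ('a', 'go') else 0.0
V = value_iteration(states, actions, T, R)
```

Here the value of `a` converges to 1: taking `go` earns the single reward of 1, after which `b` yields nothing.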

**Partially-observable MDPs** (POMDPs) are MDPs in which the states are not (fully) observed. They add *observations* and an **observation function** $O(s, o)$ giving the probability of observing $o$ in state $s$.

When we take an action, we get an observation which puts us in a new *belief* state (a distribution of possible states).

Partially-observable environments may require information-gathering actions in addition to goal-oriented actions. Such information-gathering actions may require detours from goals but may be worth it in the long run. See the section on reinforcement learning for more.

With POMDPs the state space becomes very large because there are infinitely many probability distributions over a set of states.

As a result, you can't really run value iteration on POMDPs, but you can use approximate Q-learning (see the section on reinforcement learning) or a truncated (limited lookahead) expectimax approach to approximate the value of actions.

In general, however, POMDPs are very hard/expensive to solve.

Decision networks are a generalization of Bayes' networks. Some nodes are random variables (these are essentially embedded Bayes' networks), some nodes are **action variables**, in which a decision is made, and some nodes are utility functions, which compute a utility from the values of their parent nodes.

For instance, an action node could be "bring (or don't bring) an umbrella", and a random variable node could be "it is/isn't raining". These nodes may feed into a utility node which computes a utility based on the values of these nodes. For instance, if it is raining and we don't bring an umbrella, we will have a very low utility, compared to when it isn't raining and we don't bring an umbrella, for which we will have a high utility.

We want to choose actions that maximize the expected utility given observed evidence.

The general process for action selection is:

- instantiate all evidence
- set action node(s) each possible way
- calculate the posterior for all parents of the utility node, given the evidence
- calculate the expected utility for each action
- choose the maximizing action (it will vary depending on the observed evidence)

This is quite similar to expectimax/MDPs, except now we can incorporate evidence we observe.

More evidence helps, but typically there is a cost to acquiring it. We can quantify the benefit of acquiring evidence as the *value of information*, to determine whether more evidence is worth the cost. We can compute this with a decision network.

The value of information is simply the expected gain in the maximum expected utility given the new evidence.

For example, say someone hides 100 dollars behind one of two doors, and if we can correctly guess which door it is behind, we get the money.

There is a 0.5 chance that the money is behind either door.

In this scenario, we can use the following decision network:

Here the action node is which door we choose, and a random variable node represents which door the money is behind; both feed into a utility node.

The utility function at that node is:

| choose door | money door | utility |
|---|---|---|
| a | a | 100 |
| a | b | 0 |
| b | a | 0 |
| b | b | 100 |

In this current scenario, our maximum expected utility is 50. That is, choosing either door gives an expected utility of $0.5 \times 100 + 0.5 \times 0 = 50$.

How valuable is knowing which door the money is behind?

We can consider that if we know which door the money is behind, our maximum expected utility becomes 100, so we can quantify the value of that information as $100 - 50 = 50$.
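This calculation can be checked directly (a sketch of the door example, with the decision-network machinery reduced to plain arithmetic):

```
# Prior over which door hides the money, and the utility table from above.
p_money = {'a': 0.5, 'b': 0.5}
utility = lambda choose, money: 100 if choose == money else 0

# MEU without evidence: pick the door maximizing expected utility.
meu_no_info = max(
    sum(p_money[m] * utility(c, m) for m in p_money)
    for c in ['a', 'b']
)  # 50: either choice wins half the time

# With perfect information we always choose the right door, whichever it is.
meu_with_info = sum(
    p_money[m] * max(utility(c, m) for c in ['a', 'b'])
    for m in p_money
)  # 100

vpi = meu_with_info - meu_no_info  # 50
```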

In this scenario, we get *perfect information*, because we observe the evidence "perfectly" (that is, our friend tells us the truth and there's no chance that we misheard them).

More formally, the value of perfect information of evidence $E'$ given existing evidence $e$ is:

$\text{VPI}(E' | e) = \left(\sum_{e'} P(e' | e) \text{MEU}(e, e')\right) - \text{MEU}(e)$

Properties of VPI:

- nonnegative: $\forall E', e: \text{VPI}(E' | e) \geq 0$
- nonadditive: in general $\text{VPI}(E_j, E_k | e) \neq \text{VPI}(E_j | e) + \text{VPI}(E_k | e)$ (e.g. observing the same evidence twice adds nothing)
- order-independent: $\text{VPI}(E_j, E_k | e) = \text{VPI}(E_j | e) + \text{VPI}(E_k | e, E_j) = \text{VPI}(E_k | e) + \text{VPI}(E_j | e, E_k)$

Also: generally, if the parents of the utility node are conditionally independent of another node $Z$ given the current evidence, then observing $Z$ has zero value ($\text{VPI}(Z | e) = 0$).

What's the value of *imperfect* information? Well, we just say that "imperfect" information is perfect information of a noisy version of the variable in question.

For example, say we have a "light level" random variable that we observe through a sensor. Sensors always have some noise, so we add an additional random variable to the decision network (connected to the light level random variable) which corresponds to the sensor's light level measurement. Thus the sensor's observations are "perfect" in the context of the sensor random variable, because they are exactly what the sensor observed, though technically they are noisy in the context of the light level random variable.

How do we evaluate policies?

We can compute the values under a fixed policy. That is, we construct a tree based on the policy (it is a much simpler tree because for any given state, we only have one action - the action the policy says to take from that state), and then compute values from that tree.

More specifically, we compute the value $V^{\pi}(s)$ of applying a policy $\pi$ starting from state $s$.

Again, since we only have one action to choose from, the max over actions disappears from the computation.

We can use an approach similar to value iteration to compute these values, i.e.:

$V_{k+1}^{\pi}(s) = \sum_{s'} T(s,\pi(s),s')[R(s,\pi(s),s') + \gamma V_k^{\pi}(s')]$

This approach is sometimes called *simple value iteration* since we've dropped the max over actions.

This has complexity $O(S^2)$ per iteration, since we no longer iterate over actions.

**Policy extraction** is the problem opposite to policy evaluation - that is, given values, how do we extract the policy which yields these values?

Say we have optimal values $V^*(s)$. We can extract the optimal policy with a max over actions:

$\pi^*(s) = \arg\max_a \sum_{s'} T(s,a,s')[R(s,a,s') + \gamma V^*(s')]$

That is, we do one step of expectimax.

What if we have optimal Q-values instead?

With Q-values, it is trivial to extract the policy, since the hard work is already captured by the Q-value:

$\pi^*(s) = \arg\max_a Q^*(s,a)$

Value iteration is quite slow - $O(S^2 A)$ per iteration - and the max over actions at each state rarely changes, so the policy often converges long before the values do.

**Policy iteration** is another way of solving MDPs (an alternative to value iteration) in which we start with a given policy and improve on it iteratively:

- First, we evaluate the policy (calculate utilities for the given policy until the utilities converge).
- Then we update the policy using one-step look-ahead (one-step expectimax) with the resulting converged utilities as the future (given) values (i.e. policy extraction).
- Repeat until the policy converges.

Policy iteration is optimal and, under some conditions, can converge much faster.

More formally:

Evaluation: iterate values until convergence:

$V_{k+1}^{\pi_i}(s) = \sum_{s'} T(s,\pi_i(s),s')[R(s,\pi_i(s),s') + \gamma V_k^{\pi_i}(s')]$

Improvement: compute the new policy with one-step lookahead:

$\pi_{i+1}(s) = \arg\max_a \sum_{s'} T(s,a,s')[R(s,a,s') + \gamma V^{\pi_i}(s')]$

Policy iteration and value iteration are two ways of solving MDPs, and they are quite similar - they are just variations of Bellman updates that use one-step lookahead expectimax.
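A sketch of policy iteration on a hypothetical toy MDP (the states, `T`, and `R` below are invented for illustration):

```
# Policy iteration: alternate policy evaluation and one-step-lookahead improvement.
def policy_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    pi = {s: actions(s)[0] for s in states}  # arbitrary initial policy
    while True:
        # Evaluation: iterate V^pi until convergence (no max over actions).
        V = {s: 0.0 for s in states}
        while True:
            V_new = {s: sum(p * (R(s, pi[s], s2) + gamma * V[s2])
                            for s2, p in T(s, pi[s]))
                     for s in states}
            done = max(abs(V_new[s] - V[s]) for s in states) < tol
            V = V_new
            if done:
                break
        # Improvement: greedy one-step lookahead with the converged values.
        new_pi = {s: max(actions(s),
                         key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                                           for s2, p in T(s, a)))
                  for s in states}
        if new_pi == pi:
            return pi, V  # policy converged
        pi = new_pi

# Toy MDP: from 'a' you can 'stay' (reward 0) or 'go' to 'b' (reward 1); 'b' loops forever.
states = ['a', 'b']
actions = lambda s: ['stay', 'go'] if s == 'a' else ['stay']
T = lambda s, a: [('b', 1.0)] if (s, a) == ('a', 'go') else [(s, 1.0)]
R = lambda s, a, s2: 1.0 if (s, a) == ('a', 'go') else 0.0
pi, V = policy_iteration(states, actions, T, R)
```

On this toy problem the policy converges to taking `go` from `a` after a single improvement step.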

Search as presented thus far has been concerned with producing a plan or a policy describing how to act to achieve some goal state. However, there are search problems in which the aim is to identify the goal states themselves - such problems are called **identification problems**.

In **constraint satisfaction problems**, we want to identify states which satisfy a set of constraints.

We have a set of variables, and we **assign** each variable a value from its domain.

We want to satisfy a set of **constraints** on what combinations of values are allowed on different subsets of variables. So we want to identify states which satisfy these constraints; that is, we want to identify variable assignments that satisfy the constraints.

Constraints can be specified using a formal language, e.g. code that checks whether a given assignment is allowed.

We can represent constraints as a graph.

In a binary CSP, each constraint relates at most two variables. We can construct a binary constraint graph in which the nodes are variables, and arcs show constraints. We don't need to specify what the constraints are.

If we have constraints that are more than binary (that is, they relate more than just two variables), we can represent the constraints as square nodes in the graph and link them to the variables they relate (as opposed to representing constraints as the arcs themselves).

General-purpose CSP algorithms use this graph structure for faster search.

Variables may be:

- discrete, coming from
  - finite domains
  - infinite domains (integers, strings, etc)
- continuous

Constraints may be:

- unary (involve a single variable; this is essentially reducing a domain, e.g. $A \neq \text{green}$)
- binary (involve a pair of variables)
- higher-order (involve three or more variables)

We may also have *preferences*, i.e. soft constraints. We can represent these as costs for each variable assignment. This gives us a *constraint optimization problem*.

We can formulate CSPs as search problems using search trees or search graphs (in the context of CSPs, they are called **constraint graphs**).

States are defined by the values assigned so far (partial assignments).

The initial state is the empty assignment, $\{\}$.

Successor functions assign a value to an unassigned variable (one at a time).

The goal test is to check if the current assignment is complete (all variables have values) and satisfies all constraints.

Breadth-first search does not work well here because all the solutions will be at the bottom of the search tree (all variables must have values assigned, and that happens only at the bottom).

Depth-first search does a little better, but it is very naive - it can make a mistake early on in its path, but not realize it until reaching the end of a branch.

The main shortcoming with these approaches is that we aren't checking constraints until it's far too late.

Backtracking search is the basic uninformed search algorithm for solving CSPs. It is a simple augmentation of depth-first search.

Rather than checking the constraint satisfaction at the very end of a branch, we check constraints as we go, i.e. we only try values that do not conflict with previous assignments. This is called an **incremental goal test**.

Furthermore, we only consider one variable at a time, in some order. Variable assignments are commutative (i.e. the order in which we assign them doesn't matter; e.g. assigning $A$ and then $B$ yields the same partial assignment as assigning $B$ and then $A$).

The moment we violate a constraint, we backtrack and try a different variable assignment.

Simple backtracking can be improved in a few ways:

- ordering:
  - we can be smarter about the order in which we assign variables
  - we can be smarter about which value to try next for a variable
- filtering: we can detect failure earlier
- structure: we can exploit the problem structure

Backtracking pseudocode:

```
def backtracking(csp):
    def backtracking_recursive(assignment):
        if is_complete(assignment):
            return assignment
        var = select_unassigned_variable(csp.variables, assignment)
        for val in csp.order_domain_values(var, assignment):
            if is_consistent_with_constraints(val, assignment, csp.constraints):
                assignment[var] = val
                result = backtracking_recursive(assignment)
                if result is not None:  # if not a failure
                    return result
                else:  # otherwise, remove the assignment
                    del assignment[var]
        return None  # failure

    return backtracking_recursive({})
```

Filtering looks ahead to eliminate incompatible variable assignments early on.

With **forward checking**, when we assign a new variable, we look ahead and eliminate values for other variables that we know will be incompatible with this new assignment. So when we reach that variable, we only have to check values we know will not violate a constraint (that is, we only have to consider a subset of the variable's domain).

If we reach an empty domain for a variable, we know to back up.

With **constraint propagation** methods, we can check for failure ahead of time.

One constraint propagation method is **arc consistency** (AC3).

First, we must consider the *consistency* of an arc (here, in the context of binary constraints, but this can be extended to higher-order constraints). In the context of filtering, an arc $X \to Y$ is *consistent* if and only if for *every* value $x$ remaining in the tail $X$'s domain, there is *some* value $y$ in the head $Y$'s domain which can be assigned without violating a constraint.

An inconsistent arc can be made consistent by deleting values from its tail; that is, by deleting tail values which lead to constraint-violating head values.

Note that since arcs are directional, a consistency relationship (edge) must be checked in both directions.

We can re-frame forward checking as just enforcing consistency of arcs pointing to each new assignment.

A simple form of constraint propagation is to ensure all arcs in the CSP graph are consistent. Basically, we visit each arc, check if it's consistent, and if not, delete values from its tail until it is consistent. If we encounter an empty domain (that is, we've deleted all values from a tail), then we know we have failed.

Note that if a value is deleted from the tail of an arc, the arcs pointing into that node must be re-checked.

We combine this with backtracking search by applying this filtering after each new variable assignment. It's extra work at each step, but it should save us backtracking.

Arc consistency (AC3) pseudocode:

```
def AC3(csp):
    queue = csp.all_arcs()  # all directed arcs (tail, head) in the constraint graph
    while queue:
        from_node, to_node = queue.pop()
        if remove_inconsistent_values(csp, from_node, to_node):
            # the tail's domain shrank, so arcs pointing into it must be re-checked
            for node in csp.neighbors(from_node):
                queue.append((node, from_node))
    return csp

def remove_inconsistent_values(csp, from_node, to_node):
    removed = False
    for x in list(csp.domain[from_node]):
        # delete x if no value y in the head's domain satisfies the constraint
        if not any(csp.satisfies(from_node, x, to_node, y)
                   for y in csp.domain[to_node]):
            csp.domain[from_node].remove(x)
            removed = True
    return removed
```

Arc consistency can be generalized to $k$-consistency:

- 1-consistency is *node consistency*: each node's domain has a value which satisfies its own unary constraints.
- 2-consistency is arc consistency: for each pair of nodes, any consistent assignment to one can be extended to the other ("extended" meaning from the tail to the head).
- 3-consistency is called *path consistency*.
- $k$-consistency: for each $k$ nodes, any consistent assignment to $k-1$ of them can be extended to the $k$th node.

Naturally, a higher $k$ gives stronger guarantees but is more expensive to check.

We can extend this further with *strong* $k$-consistency, in which all lower orders of consistency ($k-1$, $k-2$, down to 1) also hold.

One method for selecting the next variable to assign to is called **minimum remaining values** (MRV), in which we choose the variable with the fewest legal values left in its domain (hence this is sometimes called *most constrained variable*). We know this number if we are running forward checking. Essentially we decide to try the hardest variables first so if we fail, we fail early on and thus have to do less backtracking (for this reason, this is sometimes called *fail-fast ordering*).

For choosing the next value to try, a common method is **least constraining value**. That is, we try the value that gives us the most options later on. We may have to re-run filtering to determine what the least constraining value is.

Sometimes there are features of the problem structure that we can use to our advantage.

For example, we may have independent subproblems (that is, we may have multiple connected components; i.e. isolated subgraphs), in which case we can divide-and-conquer.

In practice, however, you almost never see independent subproblems.

Some CSPs have a tree structure (i.e. have no loops). Tree-structured CSPs can be solved in $O(nd^2)$ time, compared to general CSPs, whose worst case is $O(d^n)$.

The algorithm for solving tree-structured CSPs is as follows:

- Ordering: we first choose a root variable, then order variables such that parents precede children.
- Backward pass: starting from the end moving backwards, we visit each arc once (the arc pointing from parent to child) and make it consistent.
- Forward assignment: starting from the root and moving forward, we assign each variable so that it is consistent with its parent.

This method has some nice properties:

- after the backward pass, all root-to-leaf arcs are consistent
- if root-to-leaf arcs are consistent, the forward assignment will not backtrack
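The algorithm above can be sketched as follows (a minimal version assuming a helper `satisfies(var1, val1, var2, val2)` and a parent map, both hypothetical names):

```
# Tree-structured CSP solver: backward pass makes parent->child arcs consistent,
# forward pass then assigns without backtracking. `order` lists variables with
# parents before children; `parent` maps each non-root variable to its parent.
def solve_tree_csp(order, parent, domains, satisfies):
    domains = {v: list(d) for v, d in domains.items()}
    # Backward pass: visit arcs (parent(x), x) from the leaves toward the root,
    # pruning parent values that have no compatible child value.
    for child in reversed(order[1:]):
        par = parent[child]
        domains[par] = [pv for pv in domains[par]
                        if any(satisfies(par, pv, child, cv)
                               for cv in domains[child])]
        if not domains[par]:
            return None  # failure: a domain emptied out
    # Forward assignment: root first, then each child consistent with its parent.
    assignment = {order[0]: domains[order[0]][0]}
    for child in order[1:]:
        par = parent[child]
        assignment[child] = next(cv for cv in domains[child]
                                 if satisfies(par, assignment[par], child, cv))
    return assignment

# Hypothetical 3-variable chain X - Y - Z with an "all neighbors differ" constraint.
diff = lambda v1, a, v2, b: a != b
result = solve_tree_csp(['X', 'Y', 'Z'], {'Y': 'X', 'Z': 'Y'},
                        {'X': ['r'], 'Y': ['r', 'g'], 'Z': ['r', 'g']},
                        diff)
```

Because the backward pass only keeps parent values with some compatible child value, the forward assignment is guaranteed to find a consistent value at every step.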

Unfortunately, in practice you don't typically encounter tree-structured CSPs.

Rather, we can improve an existing CSP's structure so that it is *nearly* tree-structured.

Sometimes there are just a few variables which prevent the CSP from having a tree structure.

With **cutset conditioning**, we assign values to these variables such that the rest of the graph is a tree.

This, for example, turns binary constraints into unary constraints, e.g. if we have a constraint $A \neq C$ and the cutset assignment fixes $C = \text{red}$, it becomes the unary constraint $A \neq \text{red}$.

Cutset conditioning with a cutset of size $c$ has runtime $O(d^c (n-c) d^2)$ - it is exponential in the size of the cutset, so we want the cutset to be as small as possible.

More specifically, the cutset conditioning algorithm:

- choose a cutset (the variables to set values for)
- instantiate the cutset in all possible ways (e.g. produce a graph for each possible combination of values for the cutset)
- for each instantiation, compute the *residual* (tree-structured) CSP by removing the cutset constraints and replacing them with simpler constraints (e.g. replace binary constraints with unary constraints as demonstrated above)
- solve the residual CSPs

Unfortunately, finding the smallest cutset is an NP-hard problem.

There are other methods for improving the CSP structure, such as *tree decomposition*.

Tree decomposition involves creating "mega-variables" which represent subproblems of the original problem, such that the graph of these mega-variables has a tree structure. For each of these mega-variables we consider valid combinations of assignments to its variables.

These subproblems must overlap in the right way (the *running intersection property*) in order to ensure consistent solutions.

Rather than building solutions step-by-step, iterative algorithms start with an incorrect solution and try to fix it.

Such algorithms are *local search* methods in that they work with "complete" states (that is, all variables are assigned, though constraints may be violated/unsatisfied), and there is no fringe.

Then we have operators which reassign variable values.

A very simple iterative algorithm:

- while not solved:
  - randomly select any conflicted variable
  - select a value which violates the fewest constraints (the *min-conflicts* heuristic), i.e. hill climb with $h(n) = \text{num. of violated constraints}$
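The loop above can be sketched as follows (the `conflicts` counter and the triangle-coloring problem are hypothetical examples, not from the text):

```
import random

# Min-conflicts local search: start from a complete random assignment and
# repeatedly reassign a conflicted variable to its least-conflicting value.
def min_conflicts(variables, domains, conflicts, max_steps=10000):
    # `conflicts(var, val, assignment)` counts constraints `val` would violate.
    assignment = {v: random.choice(domains[v]) for v in variables}
    for _ in range(max_steps):
        conflicted = [v for v in variables
                      if conflicts(v, assignment[v], assignment) > 0]
        if not conflicted:
            return assignment  # solved
        var = random.choice(conflicted)
        assignment[var] = min(domains[var],
                              key=lambda val: conflicts(var, val, assignment))
    return None  # step budget exhausted

# Toy problem: color a triangle's corners so that neighbors differ.
neighbors = {'A': ['B', 'C'], 'B': ['A', 'C'], 'C': ['A', 'B']}
count = lambda v, val, asg: sum(asg[n] == val for n in neighbors[v])
colors = {v: ['red', 'green', 'blue'] for v in neighbors}
solution = min_conflicts(list(neighbors), colors, count)
```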

In practice, this min-conflicts approach tends to perform quickly for randomly-generated CSPs; that is, there are some particular CSPs which are very hard for it, but for the most part, it can perform in almost constant time for arbitrarily difficult randomly-generated CSPs.

Though, again, unfortunately many real-world CSPs fall in this difficult domain.

Multi-action adversarial games (assuming turn-based) are tricky because they have enormous branching factors. The problem is no longer what the best single action is for a turn - now we need to find the best *sequence* of actions to take. An evolutionary algorithm can be applied to select these actions in a method called *online evolution*, because the agent doesn't learn in advance (offline learning); rather, it learns the best moves while it plays.

Online evolution evolves the actions in a single turn and uses an estimation of the state at the end of the turn as a fitness function. This is essentially a single iteration of *rolling horizon evolution*, a method that evolves a sequence of actions and evolves new action sequences as those actions are executed. In its application here, we have a horizon of just one turn.

An individual (to be evolved) in this context is a candidate sequence of actions for the turn. A basic genetic algorithm can be applied. The fitness function can include rollouts, e.g. to a depth of one extra turn, to incorporate how an opponent might counter-move, though this may not help performance.

- Introduction to Monte Carlo Tree Search. Jeff Bradberry.
- MIT 6.034 (Fall 2010): Artificial Intelligence. Patrick H. Winston. MIT.
- Introduction to Artificial Intelligence (2nd ed). Philip C. Jackson, Jr. 1985.
- Integrating Learning in a Multi-Scale Agent. Ben G. Weber. 2012.
- Planning Algorithms. Steven M. LaValle. 2006.
- Artificial Intelligence Planning. Dr. Gerhard Wickler, Prof. Austin Tate. The University of Edinburgh (Coursera). 2015.
- Intro to Artificial Intelligence. CS271. Peter Norvig, Sebastian Thrun. Udacity.
- Logical Foundations of Artificial Intelligence (1987) (Chapter 12: Planning)
- CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley (edX).
- Artificial Intelligence: Foundations of Computational Agents. David Poole, Alan Mackworth.
- *Algorithmic Puzzles*. Anany Levitin, Maria Levitin. 2011.
- A way to deal with enormous branching factors. Julian Togelius. March 25, 2016.
- Online Evolution for Multi-Action Adversarial Games. Niels Justesen, Tobias Mahlmann, Julian Togelius.