In artificial intelligence, problems are often represented using the state-space representation (sometimes called a state-transition system), in which the possible states of the problem and the operations that move between them are represented as a graph or a tree:
More formally, we consider a problem to have a set of possible starting states
The distinction between state-space and situation-space is as follows: if the relevant parts of the problem are fully-specified (fully-known), then we work with states and operators, and have a state-space problem. If there is missing information (i.e., the problem is partially-specified), then we work with situations and actions (note that operators are often referred to actions in state-space as well), and we have a situation-space problem. Most of what is said for state-space problems is applicable to situation-space problems.
For now we will focus on state-space problems.
This state-space model can be applied to itself, in such that a given problem can be decomposed into subproblems (also known as subgoals); the relationships between the problem and its subproblems (and their subproblems' subproblems, etc) are also represented as a graph. Successor relationships can be grouped by AND or OR arcs which group edges together. A problem node with subproblems linked by AND edges must have all of the grouped subproblems resolved; a problem with subproblems linked by OR edges must have only one of the subproblems resolved. Using this graph, you can identify a path of subproblems which can be used to solve the primary problem. This process is known as problem reduction.
We take this state-space representation as the basis for a search problem.
A search problem consists of:
To clarify some terminology:
Practically, we may use a data structure for nodes that encapsulate the following information for the node:
In the context of artificial intelligence, a path through state-space is called a plan - search is fundamental to planning (other aspects of planning are covered in more detail later).
Problem formulation can itself be a problem, as it typically is with real-world problems. We have to consider how granular/abstract we want to be and what actions and states to include. To make this a bit easier, we typically make the following assumptions about the environment:
And other assumptions are typically included as well:
In practice, we rarely build the full state-space graph in memory (because it is often way too big). Rather, we work with trees.
Trees have a few constraints:
An additional term relevant to trees is depth, which is the number of ancestors a node has.
The root node is the current state and branches out into possible future states (i.e. the children are successors).
Given a tree with branching factor
To build out parts of the tree we are interested in, we take a node and apply a successor function (sometimes called a generator function) to expand the node, which gives us all of that node's successors (children).
There is also often a lot of repetition in search trees, which some search algorithm enhancements take advantage of.
We apply algorithms to this tree representation in order to identify paths (ideally the optimal path) from the root node (start state) to a goal node (a goal state, of which there may be many).
Most search algorithms share some common components:
The general tree search algorithm is as follows:
Most search algorithms are based on this general structure, varying in how they choose which node to expand from the fringe.
When considering search algorithms, we care about:
These search algorithms discussed here are unidirectional, since they only expand in one direction (from the start state down, or, in some cases, from the terminal nodes up). However, there are also bidirectional search procedures which start from both the start state and from the goal state. They can be difficult to use, however.
Uninformed search algorithms, sometimes called blind search algorithms, vary in how they decide which node to expand.
Consider the following search space, where
Exhaustively search all paths (without revisiting any previously visited points) - it doesn't really matter how you decide which node to expand because they will all be expanded.
Go down the left branch of the tree (by convention) until you can't go any further.
If that is not your target, then backtrack - go up to the closest branching node and take the other leftmost path. Backtracking is a technique that appears in almost every search algorithm, where we try extending a path, and if the extension fails or is otherwise unsatisfactory, we take a step back and try a different successor.
Repeat until you reach your target.
It stops just on the first complete path, which may not be the optimal path.
Another way to think about depth-first search is with a queue (LIFO) which holds your candidate paths as you construct them.
Your starting "path" includes just the starting point:
Then on each iteration, you take the left-most path (which is always the first in the queue) and check if it reaches your goal.
If it does not, you extend it to build new paths, and replace it with those new paths.
On this next iteration, you again take the left-most path. It still does not reach your goal, so you extend it. And so on:
You can no longer extend the left-most path, so just remove it from the queue.
Then keep going.
Build out the tree level-by-level until you reach your target.
In the queue representation, the only thing that is different from depth-first is that instead of placing new paths at the front of the queue, you place them at the back. Another way of putting this is that instead of a LIFO data structure for its fringe (as is used with DFS), BFS uses a FIFO data structure for its fringe.
We can make breadth-first search sensitive to path cost with uniform cost search (also known as Dijkstra's algorithm), in which we simply prioritize paths by their cost
On each iteration, extend the shortest cumulative path. Once you reach your goal, extend every other extendible path to check that its length ends up being longer than your current path to the goal.
The fringe is kept sorted so that the shortest path is first.
This approach can be quite exhaustive, but it can be improved by using extended list filtering.
The general idea is to combine depth-first search's space advantage with breadth-first search's time/shallow-solution advantages.
(etc)
Extended list filtering involves maintaining a list of visited nodes and only expanding nodes in the fringe if they have not already been expanded - it would be redundant to search again from that node.
For example, branch and bound search can be combined with extended list filtering to make it less exhaustive.
Informed search algorithms improve on uninformed search by incorporating heuristics which tell us whether or not we're getting closer to the goal. With heuristics, we can search less of the search space.
In particular, we want admissible heuristics, which is simply a heuristic that never overestimates the distance to the goal.
Formally, we can define the admissible heuristic as:
That is, a node is admissible if the estimated distance
Note that sometimes inadmissible heuristics (i.e. those that sometimes overestimate the distance to the goal) can still be useful.
The specific heuristic function is chosen depending on the particular problem (i.e. we estimate the distance to the goal state differently in different problems, for instance, with a travel route, we might estimate the cost with linear distance to the target city).
The typical trade-off with heuristics is between simplicity/efficiency and accuracy.
The question of finding good heuristics, and doing so automatically, has been a big topic in AI planning recently.
Best-first search algorithms are those that select the next node from the fringe by
With greedy best-first search, the fringe is kept sorted by heuristic distance to the goal; that is,
This often ends up with a suboptimal path, however.
Beam search is essentially breadth-first search, but we set a beam width
The fringe is the same as in breadth-first search, but we keep only the
Beam search is not complete, unless the iterative approach is used.
A* is an extension of branch & bound search which includes (admissible) heuristic distances in its sorting.
We define
With A* search, we simply sort the fringe by
A* search is optimal if
A* is also optimally efficient (with respect to the number of expanded nodes) for a given heuristic function. That is, no other optimal algorithm is guaranteed to expand fewer nodes than A*.
Uniform cost search is a special case of A* where
The downside of A* vs greedy best-first search is that it can be slower since it explores the space more thoroughly - it has worst case time and space complexity of
Typically we are dealing with the worst case; the fringe usually grows exponentially. Sometimes the time complexity is permissible, but the space complexity is problematic because there may simply not be enough memory for some problems.
There is a variation of A* called iterative deepening A* (IDA*) which uses significantly less memory.
Iterative deepening A* is an extension of A* which uses an iterative approach, searching up to a distance
Local search algorithms do not maintain a fringe; that is, we don't keep track of unexplored alternatives. Rather, we continuously try to improve a single option until we can't improve it anymore.
Instead of extending a plan, the successor function in local search takes an existing plan and just modifies a part of it.
Local search is generally much faster and more memory efficient, but because it does not keep track of unexplored alternatives, it is incomplete and suboptimal.
A basic method in local search is hill climbing - we choose a starting point, move to the best neighboring state (i.e. closest as determined by the heuristic), and repeat until there are no better positions to move to - we've reached the top of hill. As mentioned, this is incomplete and suboptimal, as it can end up in local maxima.
The difference between hill climbing and greedy search is that with greedy search, the entire fringe is sorted by heuristic distance to the goal. With hill climbing, we only sort the children of the currently expanded node, choosing the one closest to the goal.
You can also use simulated annealing (detailed elsewhere) to try to escape local maxima - and this helps, and has a theoretical guarantee that it will converge to the optimal state given infinite time, but of course, this is not a practical guarantee for real-world applications. So simulated annealing in practice can do better but still can end up in local optima.
You can also use genetic algorithms (detailed elsewhere).
Up until now we have considered search algorithms in the context of trees.
With search trees, we often end up with states repeated throughout the tree, which will have redundant subtrees, and thus end up doing (potentially a lot of) redundant computation.
Instead, we can consider the search space as a graph.
Graph search algorithms are typically just slight modifications of tree search algorithms. One main modification is the introduction of a list of explored (or expanded) nodes, so that we only expand states which have not already expanded.
Completeness is not affected by graph search, but it is not optimal. We may close off a branch because we have already expanded that state elsewhere, but it's possible that the shortest path still goes through that state. Graph search algorithms (such as the graph search version of A*) can be made optimal through an additional constraint to admissible heuristics: consistency.
The main idea of consistency is that the estimated heuristic costs should be less than or equal to the actual costs for each arc between any two nodes, not just between any node and the goal state:
That is, the absolute value of the difference between the estimated distance between a node
Consistency enforces this for any two nodes, which includes the goal node, so consistency implies admissibility.
Adversarial search is essentially search for games involving two or more players.
There are many kinds of games - here we primarily consider games that are:
Note that while we are only considering zero-sum games, in general games agents have independent utilities, so there is opportunity for cooperation, indifference, competition, and so on.
One way of formulating of games (there are many) is as a tree:
We want our adversarial search algorithm to return a strategy (a policy) which is essentially a function which returns an action given a state. That is, a policy tells us what action take in a state - this is constrasted to a plan, which details a step-by-step procedure from start to finish. This is because we can't plan on opponents acting in a particular way, so we need a strategy to respond to their actions.
The solution then, for an adversarial search algorithm for a player is a policy
In minimax, we start at the bottom of the tree (where we have utilities computed terminal nodes), moving upwards. We propagate the terminal utilities through the graph up to the root node, propagating the utility at each depth that satisfies a particular criteria.
For nodes at depths that correspond to the opponent's turns, we assume that the opponent chooses their best move (that is, we assume they are a perfect adversary), which means we propagate the minimum utility for us.
For nodes at depths that correspond to our turn, we want to choose our best move; that is, we propagate the maximum utility for us.
The propagated utility is known as the backed-up evaluation.
At the end, this gives us a utility for the root node, which gives us a value for the current state.
Minimax is just like exhaustive depth-first search, so its time complexity is
Minimax is optimal against a perfect adversarial player (that is, an opponent that always takes their best action), but it is not otherwise.
Most interesting games have game trees far too deep to expand all the way to the terminal nodes.
Instead, we can use depth-limited search to only go down a few levels. However, since we don't reach the terminal nodes, their values never propagate up the tree. How will we compute the utility of any given move?
We can introduce an evaluation function which computes a utility for non-terminal positions, i.e. it estimates the value of an action. For instance, with chess, you could just take the different of the number of your units vs the number of the opponent's units. Generally moves that lower your opponent's units is better, but not always.
Often in games there are some time constraints - for instance, the computer opponent should respond within a reasonable amount of time.
Iterative deepening can be applied to minimax, running for a set amount of time, and return the best policy found thus far.
This type of algorithm is called an anytime algorithm because it has an answer ready at anytime.
If the game is not zero-sum or has multiple players, we can generalize minimax as such:
This can model cooperation and competition dynamically.
We can further improve minimax by pruning the game tree; i.e. removing branches we know won't be worthwhile. This variation is known as alpha-beta search.
Here we can look at branching and figure out a bound for describing its score.
First we look at the left-most branch and see the value 2 in its left-most terminal node. Since we are looking for the min here, we know that the score for this branch node will be at most 2. If we then look at the other terminal node, we see that it is 7 and we know the branch node's score is 2.
At this point we can apply a similar logic to the next node up (where we are looking for the max). We know that it will be at least 2.
So then we look at the next branch node and see that it will be at most 1. We don't have to look at the very last terminal node because now we know that the max node can only be 2. So we have saved ourselves a little trouble.
In larger trees this approach becomes very valuable, since you are effectively discounting entire branches and saving a lot of unnecessary computation. This allows you to compute deeper trees.
Note that with alpha-beta, the minimax value computed for the root is always correct, but the values of intermediate nodes may be wrong, and as such, (naive) alpha-beta is not great for action selection. Good ordering of child nodes improves upon this. With a "perfect ordering", time complexity drops to
Generally you want to generate game trees so that successors to each node are ordered left-to-right in descending order of their eventual backed-up evaluations (such an ordering is called the "correct" ordering). Naturally, it is quite difficult to generate this ordering before these evaluations have been computed.
Thus a plausible ordering must suffice. These are a few techniques for generating plausible orderings of nodes:
In many situations the outcomes of actions are uncertain. Another way of phrasing is that actions may be noisy.
Like adversarial search, non-deterministic search solutions take the form of policies.
We can model uncertainty as a "dumb" adversary in a game.
Whereas in minimax we assume a "smart" adversary, and thus consider worst-case outcomes (i.e. that the opponent plays their best move), with non-deterministic search, we instead consider average-case outcomes (i.e. expected utilities). This is called expectimax search.
So instead of minimax's min nodes, we have "chance" nodes, though we still keep max nodes. For a chance node, we compute its expected utility as the weighted (by probability) average of its children.
Because we take the weighted average of children for a chance node's utility, we cannot use alpha-beta pruning as we could with minimax. There could conceivably be an unexplored child which increases the expected utility enough to make that move ideal, so we have to explore all child nodes to be sure.
We can have games that involve adversaries and chance, in which case we would have both minimax layers and expectimax layers. This approach is called expectiminimax.
Say you are at some arbitrary position in your search tree (it could be the start or somewhere further along). You can treat the problem of what node to move to next as a multi-armed bandit problem and apply the Monte Carlo search technique.
Say you have multiple options with uncertain payouts. You want to maximize your overall payout, and it seems the most prudent strategy would be to identify the one option which consistently yields better payouts than the other options.
However - how do you identify the best option, and do so quickly?
This problem is known as the multi-armed bandit problem, and a common strategy is based on upper confidence bounds (UCB).
To start, you randomly try the options and compute confidence intervals for each options' payout:
where:
You take the upper bound of these confidence intervals and continue to choose the option with the highest upper bound. As you use this option more, it's confidence interval will narrow (since you have collected more data on it), and eventually another option's confidence interval upper bound will be higher, at which point you switch to that option.
At first, you have no statistical information about the child nodes to compute confidence intervals. So you randomly choose a child and run Monte Carlo simulations down that branch to see the outcomes.
For each simulation run, you go along each node in the branch that was walked and increment its play count (i.e. number of trials) by 1, and if the outcome is a win, you increment its win count by 1 as well (this explanation assumes a game, but is generalizes to other cases).
You repeat this until you have enough statistics for the direct child nodes of your current position to make a UCB choice as to where to move next.
You will need to run less simulations over time because you accumulate these statistics for the search tree.
A variation of MCTS where fixed scores are assigned to unvisited nodes.
MDPs are another way of modeling non-deterministic search.
MDPs are essentially Markov models, but there's a choice of action.
In MDPs, there may be two types of rewards (which can be positive or negative):
For instance, you could imagine a maze arranged on a grid. The desired end of the maze has a positive terminal reward and a dead end of the maze has a negative terminal reward. Every non-terminal position in the maze also has a reward ("living" rewards) associated with it. Often these living rewards are negative so that each step is penalized, thus encouraging the agent to find the desired end in as few steps as possible.
The agent doesn't have complete knowledge of the maze so every action has an uncertain outcome. It can try to move north - sometimes it will successfully do so, but sometimes it will hit a wall and remain in its current position. Sometimes our agent may even move in the wrong direction (e.g. maybe a wheel gets messed up or something).
This kind of scenario can be modeled as a Markov Decision Process, which includes:
MDPs, as non-deterministic search problems, can be solved with expectimax search.
MDPs are so named because we make the assumption that action outcomes depend only on the current state (i.e. the Markov assumption).
The solution of an MDP is an optimal policy
In contrast, expectimax does not give us entire policies. Rather, it gives us an action for a single state only. It's similar to a policy, but requires re-computing at each step. Sometimes this is fine because a problem may be too complicated to compute an entire policy anyways.
The objective MDP is to maximize the expected sum of all future rewards, i.e.
Sometimes a discount factor
Using this, we can define a value function
That is, it is the expected sum of future discounted reward provided we start in state
This can be computed empirically via simulations. In particular, we can use the value iteration algorithm.
With value iteration, we recursively calculate the value function, starting from the goal states, to get the optimal value function, from which we can derive the optimal policy.
More formally - we want to recursively estimate the value
This method is called back-up.
In terminal states, we just set
We estimate these values over all our states - these estimates eventually converge.
This function essentially defines the optimal policy - that is:
(since it's maximization we can drop
Note that the X square is a wall. Every movement has an uncertain outcome, e.g. if the agent moves to the east, it may only successfully do so with an 80% chance.
For
A | B | C | D | |
---|---|---|---|---|
0 | → | → | → | +1 |
1 | ↑ | X | ← | -1 |
2 | ↑ | ← | ← | ↓ |
At C1 the agent plays very conservatively and moves in the opposite direction of the negative terminal position because it can afford doing so many times until it accidentally randomly moves to another position.
Similar reasoning is behind the policy at D2.
For
A | B | C | D | |
---|---|---|---|---|
0 | → | → | → | +1 |
1 | ↑ | X | ↑ | -1 |
2 | ↑ | ← | ← | ← |
With a stronger step penalty, the agent finds it better to take a risk and move upwards at C1, since it's too expensive to play conservatively.
Similar reasoning is behind the change in policy at D2.
For
A | B | C | D | |
---|---|---|---|---|
0 | → | → | → | +1 |
1 | ↑ | X | → | -1 |
2 | → | → | → | ↑ |
With such a large movement penalty, the agent decides it's better to "commit suicide" by diving into the negative terminal node and end the game as soon as possible.
Each MDP state projects an expectimax-like search tree; that is, we build a search tree from the current state detailing what actions can be taken and the possible outcomes for each action.
We can describe actions and states together as a q-state
How should we encode preferences for sequences of utilities? For example, should the agent prefer the reward sequence
We can model this by discounting, that is, decaying reward value exponentially. If a reward is worth 1 now, it is worth
Stationary preferences are those which are invariant to the inclusion of another reward which delays the others in time, i.e.:
Nonstationary preferences are possible, e.g. if the delay of a reward changes its value relative to other rewards (maybe it takes a greater penalty for some reason).
With stationary preferences, there are only two ways to define utilities:
Note that additive utility is just discounted utility where
For now we will assume stationary preferences.
If a game lasts forever, do we have infinite rewards? Infinite rewards makes it difficult to come up with a good policy.
We can specify a finite horizon (like depth-limited search) and just consider only up to some fixed number of steps. This gives us nonstationary policies, since
Alternatively, we can just use discounting, where
A smaller
Another way is to use an absorbing state. That is, we guarantee that for every policy, a terminal state will eventually be reached.
Usually we use discounting.
We say that the value (utility) of a state
While a reward is for a state in a single time step, a value is the expected utility over all paths from that state.
The value (utility) of a q-state
The optimal policy
So the main objective is to compute (expectimax) values for the states, since this gives us the expected utility (i.e. average sum of discounted rewards) under optimal action.
More concretely, we can define value recursively:
These are the Bellman equations.
They can be more compactly written as:
Again, because these trees can go on infinitely (or may just be very deep), we want to limit how far we search (that is, how far we do this recursive computation). We can specify time-limited values, i.e. define
To clarify,
We can use this with the value iteration algorithm to efficiently compute these
Note that since we are starting at the last time step
Then we simply repeat until convergence. This converges if the discount is less than 1.
With the value iteration algorithm, each iteration has complexity
The approximations get refined towards optimal values the deeper you go into the tree. However, the policy may converge long before the values do - so while you may not have a close approximation of values, the policy/strategy they convey early on may already be optimal.
Partially-observed MDPs are MDPs in which the states are not (fully) observed. They include observations
When we take an action, we get an observation which puts us in a new belief state (a distribution of possible states).
Partially-observable environments may require information-gathering actions in addition to goal-oriented actions. Such information-gathering actions may require detours from goals but may be worth it in the long run. See the section on reinforcement learning for more.
With POMDPs the state space becomes very large because there are many (infinite) probability distributions over a set of states.
As a result, you can't really run value iteration on POMDPs, but you can use approximate Q-learning (see the section on reinforcement learning) or a truncated (limited lookahead) expectimax approach to approximate the value of actions.
In general, however, POMDPs are very hard/expensive to solve.
Decision networks are a generalization of Bayes' networks. Some nodes are random variables (these are essentially embedded Bayes' networks), some nodes are action variables, in which a decision is made, and some nodes are utility functions, which computes a utility for its parent nodes.
For instance, an action node could be "bring (or don't bring) an umbrella", and a random variable node could be "it is/isn't raining". These nodes may feed into a utility node which computes a utility based on the values of these nodes. For instance, if it is raining and we don't bring an umbrella, we will have a very low utility, compared to when it isn't raining and we don't bring an umbrella, for which we will have a high utility.
We want to choose actions that maximize the expected utility given observed evidence.
The general process for action selection is:
This is quite similar to expectimax/MDPs, except now we can incorporate evidence we observe.
More evidence helps, but typically there is a cost to acquiring it. We can quantify the value of acquiring evidence as the value of information to determine whether or not it is more evidence is worth the cost. We can compute this with a decision network.
The value of information is simply the expected gain in the maximum expected utility given the new evidence.
For example, say someone hides 100 dollars behind one of two doors, and if we can correctly guess which door it is behind, we get the money.
There is a 0.5 chance that the money is behind either door.
In this scenario, we can use the following decision network:
Where
The utility function at
choose door | money door | utility |
---|---|---|
a | a | 100 |
a | b | 0 |
b | a | 0 |
b | b | 100 |
In this current scenario, our maximum expected utility is 50. That is, choosing either door
How valuable is knowing which door the money is behind?
We can consider that if we know which door the money is behind, our maximum expected utility becomes 100, so we can quantify the value of that information as
In this scenario, we get perfect information, because we observe the evidence "perfectly" (that is, our friend tells us the truth and there's no chance that we misheard them).
More formally, the value of perfect information of evidence
Properties of VPI:
- nonnegative: $\forall E', e: \text{VPI}(E' | e) \geq 0 |
||
---|---|---|---|
- order-independent: $\text{VPI}(E_j,E_k | e) = \text{VPI}(E_j | e) + \text{VPI}(E_k | e,E_j) = \text{VPI}(E_k |
Also: generally, if the parents of the utility node is conditionally independent of another node
What's the value of imperfect information? Well, we just say that "imperfect" information is perfect information of a noisy version of the variable in question.
For example, say we have a "light level" random variable that we observe through a sensor. Sensors always have some noise, so we add an additional random variable to the decision network (connected to the light level random variable) which corresponds to the sensor's light level measurement. Thus the sensor's observations are "perfect" in the context of the sensor random variable, because they are exactly what the sensor observed, though technically they are noisy in the context of the light level random variable.
How do we evaluate policies?
We can compute the values under a fixed policy. That is, we construct a tree based on the policy (it is a much simpler tree because for any given state, we only have one action - the action the policy says to take from that state), and then compute values from that tree.
More specifically, we compute the value of applying a policy
Again, since we only have one action to choose from, the
We can use an approach similar to value iteration to compute these values, i.e.
This approach is sometimes called simple value iteration since we've dropped
This has complexity
Policy extraction is the problem opposite to policy evaluation - that is, given values, how do we extract the policy which yields these values?
Say we have optimal values
That is, we do one step of expectimax.
What if we have optimal Q-values instead?
With Q-values, it is trivial to extract the policy, since the hard work is already capture by the Q-value:
Value iteration is quite slow -
Policy iteration is another way of solving MDPs (an alternative to value iteration) in which we start with a given policy and improve on it iteratively:
Policy iteration is optimal and, under some conditions, can converge must faster.
More formally:
Evaluation: iterate values until convergence:
Improvement: compute the new policy with one-step lookahead:
Policy iteration and value iteration are two ways of solving MDPs, and they are quite similar - they are just variations of Bellman updates that use one-step lookahead expectimax.
Search as presented thusfar has been concerned with producing a plan or a policy describing how to act to achieve some goal state. However, there are search problems in which the aim is to identify the goal states themselves - such problems are called identification problems.
In constraint satisfaction problems, we want to identify states which statisfy a set of constraints.
We have a set of variables
We want to satisfy a set of constraints on what combinations of values are allowed on different subsets of variables. So we want to identify states which satisfy these constraints; that is, we want to identify variable assignments that satisfy the constraints.
Constraints can be specified using a formal language, e.g. code that
We can represent constraints as a graph.
In a binary CSP, each constraint relates at most two variables. We can construct a binary constraint graph in which the nodes are variables, and arcs show constraints. We don't need to specify what the constraints are.
If we have constraints that are more than binary (that is, they relate more than just two variables), we can represent the constraints as square nodes in the graph and link them to the variables they relate (as opposed to representing constraints as the arcs themselves).
General-purpose CSP algorithms use this graph structure for faster search.
Variables may be:
Constraints may be:
We may also have preferences, i.e. soft constraints. We can represent these as costs for each variable assignment. This gives us a constraint optimization problem.
We can formulate CSPs as search problems using search trees or search graphs (in the context of CSPs, they are called constraint graphs).
States are defined by the values assigned so far (partial assignments).
The initial state is the empty assignment,
Successor functions assign a value to an unassigned variable (one at a time).
The goal test is to check if the current assignment is complete (all variables have values) and satisfies all constraints.
Breadth-first search does not work well here because all the solutions will be at the bottom of the search tree (all variables must have values assigned, and that happens only at the bottom).
Depth-first search does a little better, but it is very naive - it can make a mistake early on in its path, but not realize it until reaching the end of a branch.
The main shortcoming with these approaches is that we aren't checking constraints until it's far too late.
Backtracking search is the basic uninformed search algorithm for solving CSPs. It is a simple augmentation of depth-first search.
Rather than checking the constraint satisfaction at the very end of a branch, we check constraints as we go, i.e. we only try values that do not conflict with previous assignments. This is called an incremental goal test.
Furthermore, we only consider one variable at a time in some order. Variable assignments are commutative (i.e. the order in which we assign them doesn't matter, e.g.
The moment we violate a constraint, we backtrack and try different a variable assignment.
Simple backtracking can be improved in a few ways:
Backtracking pseudocode:
def backtracking(csp):
def backtracing_recursive(assignment):
if is_complete(assignment):
return assignment
var = select_unassigned_variable(csp.variables, assignment)
for val in csp.order_domain_values(var, assignment):
if is_consistent_with_constraints(val, assignment, csp.constraints):
assignment[var] = val
result = backtracking_recursive(assignment)
if result is not None: # if not a failure
return result
else: # otherwise, remove the assignment
del assignment[var]
return None # failure
Filtering looks ahead to eliminate incompatible variable assignments early on.
With forward checking, when we assign a new variable, we look ahead and eliminate values for other variables that we know will be incompatible with this new assignment. So when we reach that variable, we only have to check values we know will not violate a constraint (that is, we only have to consider a subset of the variable's domain).
If we reach an empty domain for a variable, we know to backup.
With constraint propagation methods, we can check for failure ahead of time.
One constraint propagation method is arc consistency (AC3).
First, we must consider the consistency of an arc (here, in the context of binary constraints, but this can be extended to higher-order constraints). In the context of filtering, an arc
An inconsistent arc can be made consistent by deleting values from its tail; that is, by deleting tail values which lead to constraint-violating head values.
Note that since arcs are directional, a consistency relationship (edge) must be checked in both directions.
We can re-frame forward checking as just enforcing consistency of arcs pointing to each new assignment.
A simple form of constraint propagation is to ensure all arcs in the CSP graph are consistent. Basically, we visit each arc, check if its consistent, if not, delete values from its tail until it is consistent. If we encounter an empty domain (that is, we've deleted all values from its tail), then we know we have failed.
Note that if a value is deleted from a tail of a node, its incoming arcs must be-rechecked.
We combine this with backtracking search by applying this filtering after each new variable assignment. It's extra work at each step, but it should save us backtracking.
Arc consistency (AC3) pseudocode:
function AC3(csp):
queue = csp.all_arcs()
while queue:
from_node, to_node = queue.pop()
if remove_inconsistent_values(from_node, to_node):
for node in neighbors(from_node):
queue.append((node, from_node))
return csp
function remove_inconsistent_values(from_node, to_node):
removed = False
for x in domain[from_node]:
if no value y in domain[to_node] allows (x,y) to satisfy the constraint from_node <-> to_node:
domain[from_node].remove(x)
removed = True
return removed
Arc consistency can be generalized to
Naturally, a higher
We can extend this further with strong
One method for selecting the next variable to assign to is called minimum remaining values (MRV), in which we choose the variable with the fewest legal values left in its domain (hence this is sometimes called most constrained variable). We know this number if we are running forward checking. Essentially we decide to try the hardest variables first so if we fail, we fail early on and thus have to do less backtracking (for this reason, this is sometimes called fail-fast ordering).
For choosing the next value to try, a common method is least constraining value. That is, we try the value that gives us the most options later on. We may have to re-run filtering to determine what the least constraining value is.
Sometimes there are features of the problem structure that we can use to our advantage.
For example, we may have independent subproblems (that is, we may have multiple connected components; i.e. isolated subgraphs), in which case we can divide-and-conquer.
In practice, however, you almost never see independent subproblems.
Some CSPs have a tree structure (i.e. have no loops). Tree-structured CSPs can be solved in
The algorithm for solving tree-structured CSPs is as follows:
This method has some nice properties:
Unfortunately, in practice you don't typically encounter tree-structured CSPs.
Rather, we can improve an existing CSPs structure so that it is nearly tree-structured.
Sometimes there are just a few variables which prevent the CSP from having a tree structure.
With cutset conditioning, we assign values to these variables such that the rest of the graph is a tree.
This, for example, turns binary constraints into unary constraints, e.g. if we have a constraint
Cutset conditioning with a cutset size
More specifically, the cutset conditioning algorithm:
Unfortunately, finding the smallest cutset is an NP-hard problem.
There are other methods for improving the CSP structure, such as tree decomposition.
Tree decomposition involves creating "mega-variables" which represent subproblems of the original problem, such that the graph of these mega-variables has a tree structure. For each of these mega-variables we consider valid combinations of assignments to its variables.
These subproblems must overlap in the right way (the running intersection property) in order to ensure consistent solutions.
Rather than building solutions step-by-step, iterative algorithms start with an incorrect solution and try to fix it.
Such algorithms are local search methods in that they work with "complete" states (that is, all variables are assigned, though constraints may be violated/unsatisfied), and there is no fringe.
Then we have operators which reassign variable values.
A very simple iterative algorithm:
In practice, this min-conflicts approach tends to perform quickly for randomly-generated CSPs; that is, there are some particular CSPs which are very hard for it, but for the most part, it can perform in almost constant time for arbitrarily difficult randomly-generated CSPs.
Though, again, unfortunately many real-world CSPs fall in this difficult domain.
Multi-action adversarial games (assuming turn-based) are tricky because they have enormous branching factors. The problem is no longer what the best single action is for a turn - now we need to find the best sequence of actions to take. An evolutionary algorithm can be applied to select these actions in a method called online evolution because the agent doesn't not learn in advance (offline learning), rather, it learns the best moves while it plays.
Online evolution evolves the actions in a single turn and uses an estimation of the state at the end of the turn as a fitness function. This is essentially a single iteration of rolling horizon evolution, a method that evolves a sequence of actions and evolves new action sequences as those actions are executed. In its application here, we have a horizon of just one turn.
An individual (to be evolved) in this context is a candidate sequence of actions for the turn. A basic genetic algorithm can be applied. The fitness function can include rollouts, e.g. to a depth of one extra turn, to incorporate how an opponent might counter move, but it may not help performance.