Algorithms

We measure the performance of algorithms in terms of time complexity (how many steps are taken, as a function of input size) and space complexity (how much memory is used, as a function of input size).

There may be trade-offs between performance and correctness (a correct algorithm is one that always terminates with the correct answer) - for instance, in situations where speed matters more than exactness, we may opt for a faster algorithm that provides only an approximate solution or a local optimum.

There may be trade-offs between time and space complexity as well - for instance, we might have the choice between a fast algorithm that uses a lot of memory, or a slower one that has lower memory requirements.

It's important to be able to evaluate and compare algorithms in order to determine which is most appropriate for a particular problem.

Algorithm design paradigms

Algorithms generally fall into a few categories of "design paradigms" - for example, divide-and-conquer (covered below), randomized algorithms, greedy algorithms, and dynamic programming.

Algorithmic Analysis

The number of operations an algorithm performs is typically a function of its input size, denoted $n$.

There are a few kinds of algorithmic analysis: worst-case analysis (an upper bound that holds for every input of a given size), average-case analysis (the expected performance over some assumed distribution of inputs), and benchmarks (performance on an agreed-upon set of "typical" inputs).

Average-case analysis and benchmarks require some domain knowledge about what inputs to expect. When you want to do a more "general purpose" analysis, worst-case analysis is preferable.

When comparing the performance of algorithms, we focus on asymptotic analysis; that is, we focus on performance for large input sizes.

Note, however, that algorithms which are inefficient for large $n$ may be better than asymptotically faster algorithms when $n$ is small. For example, insertion sort has an upper-bound runtime of $\frac{n^2}{2}$, which for small $n$ (e.g. $n < 90$) is better than merge sort, because constant factors matter more for small inputs. In any case, with small $n$ it often doesn't matter much which algorithm you use: the input is so small that there are unlikely to be significant performance differences, so analysis of small input sizes is not very valuable (or interesting).

Thus we define a "fast" algorithm as one in which the worst-case running time grows slowly with input size.

Asymptotic Analysis

With asymptotic analysis, we suppress constant factors and lower-order terms, since they don't matter much for large inputs, and because the constant factors can vary quite a bit depending on the architecture, compiler, programmer, etc.

For example, if we have an algorithm with the upper bound runtime of $6n \log_2 n + 6n$, we would rewrite it as just $n \log n$ (note that $\log$ typically implies $\log_2$).

Then we say the running time is $O(n \log n)$, read "big-oh of $n \log n$"; the $O$ indicates that we have dropped the constant factors and lower-order terms.

Generally, we categorize big-oh performance by order of growth, each of which defines a set of algorithms that grow equivalently:

| Order of growth | Name |
|-----------------|------|
| $O(1)$ | constant |
| $O(\log_b n)$ | logarithmic (for any $b$) |
| $O(n)$ | linear |
| $O(n \log_b n)$ | $n \log n$ |
| $O(n^2)$ | quadratic |
| $O(n^3)$ | cubic |
| $O(c^n)$ | exponential (for any $c$) |

The order of growth is determined by the leading term, that is, the term with the highest exponent.

Note for $\log$, the base doesn't matter because it is equivalent to multiplying by a constant, which we ignore anyway.
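To see why, recall the change-of-base identity - switching bases only changes the constant factor:

$$ \log_b n = \frac{\log_2 n}{\log_2 b} = \frac{1}{\log_2 b} \log_2 n $$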

Loop examples

Consider the following algorithm for finding an element in an array:

def func(i, arr):
    for el in arr:
        if el == i:
            return True
    return False

This has the running time of $O(n)$ since, in the worst case, it checks every item.

Now consider the following:

def func2(i, arr):
    return func(i, arr), func(i, arr)

This still has a running time of $O(n)$: although it performs twice the number of operations (i.e. $\sim 2n$ operations total), we drop the constant factor of $2$.

Now consider the following algorithm for checking if two arrays have a common element:

def func3(arr1, arr2):
    for el1 in arr1:
        for el2 in arr2:
            if el1 == el2:
                return True
    return False

This has a runtime of $O(n^2)$, which is called a quadratic time algorithm.

The following algorithm for checking whether an array contains a duplicate also has a runtime of $O(n^2)$ - the inner loop does only about $\frac{n^2}{2}$ comparisons in total, but again we drop the constant factor:

def func4(arr):
    for i, el1 in enumerate(arr):
        # compare only against later elements, so an element
        # is never compared with itself
        for el2 in arr[i+1:]:
            if el1 == el2:
                return True
    return False

Big-Oh formal definition

Say we have a function $T(n)$, $n \geq 0$, which is usually the worst-case running time of an algorithm.

We say that $T(n) = O(f(n))$ if and only if there exist constants $c, n_0 > 0$ such that $T(n) \leq c f(n)$ for all $n \geq n_0$.

That is, we can multiply $f(n)$ by some constant $c$ such that there is some value $n_0$, after which $T(n)$ is always below $c f(n)$.

For example: we demonstrated that $6n \log_2 n + 6n$ is the worst-case running time for merge sort. For merge sort, this is $T(n)$. We described merge sort's running time in big-oh notation with $O(n \log n)$. This is appropriate because there exists some constant $c$ we can multiply $n \log n$ by such that, after some input size $n_0$, $c f(n)$ is always larger than $T(n)$. In this sense, $n_0$ defines a sufficiently large input.
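Concretely, one possible choice of constants: for $n \geq 2$ we have $\log_2 n \geq 1$, so

$$ 6n \log_2 n + 6n \leq 6n \log_2 n + 6n \log_2 n = 12 n \log_2 n $$

and the definition is satisfied with $c = 12$ and $n_0 = 2$.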

As a simple example, we can prove that $2^{n+10} = O(2^n)$.

So the inequality is:

$$ 2^{n+10} \leq c 2^n $$

We can re-write this:

$$ 2^{10} 2^n \leq c 2^n $$

Then it's clear that if we set $c=2^{10}$, this inequality holds, and it happens to hold for all $n$, so we can just set $n_0 = 1$. Thus $2^{n+10} = O(2^n)$ is in fact true.

Big-Omega notation

$T(n) = \Omega(f(n))$ if and only if there exist constants $c, n_0 > 0$ such that $T(n) \geq c f(n)$ for all $n \geq n_0$.

That is, we can multiply $f(n)$ by some constant $c$ such that there is some value $n_0$, after which $T(n)$ is always above $c f(n)$.

Big-Theta notation

$T(n) = \Theta(f(n))$ if and only if $T(n) = O(f(n))$ and $T(n) = \Omega(f(n))$. That is, $T(n)$ eventually stays sandwiched between $c_1 f(n)$ and $c_2 f(n)$ after some value $n_0$.

Little-Oh notation

Little-oh is stricter than big-oh: the inequality must hold for every positive constant $c$, not just for some constant.

$T(n) = o(f(n))$ if and only if for all constants $c>0$, there exists a constant $n_0$ such that $T(n) \leq c f(n)$ for all $n \geq n_0$.
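For example, $n = o(n^2)$: for any constant $c > 0$, we have $n \leq c n^2$ whenever $n \geq \frac{1}{c}$, so $n_0 = \lceil \frac{1}{c} \rceil$ works. By contrast, $n$ is not $o(n)$, since the inequality fails for every $c < 1$.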

The divide-and-conquer paradigm

This consists of:

  1. dividing the problem into smaller subproblems
  2. conquering the subproblems via recursive calls
  3. combining the solutions to the subproblems into a solution for the original problem

The Master Method/Theorem

The Master Method provides an easy way of computing the upper-bound runtime for a divide-and-conquer algorithm, so long as it satisfies the following assumption: all subproblems have equal size, and the recurrence has the format

$$ T(n) \leq a T(\frac{n}{b}) + O(n^d) $$

where $a$ is the number of recursive calls (subproblems), $b$ is the factor by which the input size shrinks in each subproblem, and $d$ is the exponent of the work done outside the recursive calls (e.g. the combine step). Then:

$$ T(n) = \begin{cases} O(n^d \log n) & \text{if $a = b^d$} \\ O(n^d) & \text{if $a < b^d$} \\ O(n^{\log_b a}) & \text{if $a > b^d$} \end{cases} $$

Note that the logarithm base does not matter in the first case, since it just changes the leading constant (which doesn't matter in Big-Oh notation), whereas in the last case the base does matter, because it has a more significant effect.

To use the master method, we just need to determine $a,b,d$.

For example:

With merge sort, there are two recursive calls (thus $a=2$), the input size to each recursive call is half the original input (thus $b=2$), and the combine step only involves the merge operation, which does linear work (thus $d=1$).

Working out the master method condition, we get $2 = 2^1$, i.e. $a = b^d$, thus:

$$ T(n) = O(n \log n) $$
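As another illustration of how the three cases are applied, here is a small helper sketch (the function name and output format are made up purely for demonstration):

import math

def master_method(a, b, d):
    """Big-oh bound for T(n) <= a*T(n/b) + O(n^d), per the master method."""
    if a == b ** d:
        return "O(n^{} log n)".format(d)
    elif a < b ** d:
        return "O(n^{})".format(d)
    else:
        return "O(n^{:g})".format(math.log(a, b))

print(master_method(2, 2, 1))  # merge sort: O(n^1 log n), i.e. O(n log n)
print(master_method(1, 2, 0))  # binary search: O(n^0 log n), i.e. O(log n)
print(master_method(4, 2, 1))  # four half-size subproblems, linear combine: O(n^2)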

Proof of the Master Method

(Carrying over the previously-stated assumptions)

For simplicity, we'll also assume that $n$ is a power of $b$, but this proof holds in the general case as well.

In the recursion tree, at each level $j=0,1,2,\dots,\log_b n$, there are $a^j$ subproblems, each of size $n/b^j$.

At a level $j$, the total work, not including the work in recursive calls, is:

$$ \leq a^j c (\frac{n}{b^j})^d $$

Note that the $a$ and $b$ terms are dependent on the level $j$, but the $c,n,d$ terms are not. We can rearrange the expression to separate those terms:

$$ \leq cn^d (\frac{a}{b^d})^j $$

To get the total work, we can sum over all the levels:

$$ \leq cn^d \sum_{j=0}^{\log_b n} (\frac{a}{b^d})^j $$

We can think of $a$ as the rate of subproblem proliferation (i.e. how the number of subproblems grow with level depth) and $b^d$ as the rate of work shrinkage per subproblem.

There are three possible scenarios, corresponding to the master method's three cases:

If $a = b^d$, then the summation term in the total work expression, $\sum_{j=0}^{\log_b n} (\frac{a}{b^d})^j$, simply becomes $\log_b n + 1$, thus the total work upper bound in that case is just $c n^d (\log_b n + 1)$, which is just $O(n^d \log n)$.

A geometric sum for $r \neq 1$:

$$ 1 + r + r^2 + r^3 + \dots + r^k $$

Can be expressed in the following closed form:

$$ \frac{r^{k+1} - 1}{r-1} $$

If $r < 1$ is constant, then this is $\leq \frac{1}{1-r}$ (i.e. it is some constant independent of $k$).
If $r > 1$ is constant, then this is $\leq r^k(1 + \frac{1}{r-1})$, where the last term $(1 + \frac{1}{r-1})$ is a constant independent of $k$.

We can bring this back to the master method by setting $r = \frac{a}{b^d}$.

If $a < b^d$, then the summation term in the total work expression becomes a constant (as demonstrated with the geometric sum); thus in Big-Oh, that summation term drops, and we are left with $O(n^d)$.

If $a > b^d$, then the summation term in the total work becomes a constant times $r^k$, where $k = \log_b n$; i.e. the summation term becomes a constant times $(\frac{a}{b^d})^{\log_b n}$. So we get $O(n^d (\frac{a}{b^d})^{\log_b n})$, which simplifies to $O(a^{\log_b n})$ since $(b^d)^{\log_b n} = n^d$ and the $n^d$ terms cancel. This is the number of leaves in the recursion tree, and it is equivalent to $O(n^{\log_b a})$ (taking $\log_b$ of $a^{\log_b n}$ and of $n^{\log_b a}$ gives the same value).

Data Structures

Data structures are particular ways of organizing data that support certain operations. Structures have strengths and weaknesses in what kinds of operations they can perform, and how well they can perform them.

For example: lists, stacks, queues, heaps, search trees, hashtables, bloom filters, etc.

Different data structures are appropriate for different operations (and thus different kinds of problems).

Heaps

A heap (sometimes called a priority queue) is a container for objects that have keys; these keys are comparable (e.g. we can say that one key is bigger than another).

The two core supported operations are insert (add an object to the heap) and extract-min (remove the object with the minimum key value, breaking ties arbitrarily); both run in $O(\log n)$ time.

Alternatively, there are max-heaps, which return the maximum key value instead (a max-heap can be emulated with a min-heap by negating key values so that the max becomes the min, and so on).

Sometimes additional operations are supported, such as heapify (batch-insert a set of elements in linear time) and delete (remove an arbitrary element in $O(\log n)$ time).

For example, you can have a heap where events are your objects and their keys are a scheduled time to occur. Thus when you extract-min you always get the next event scheduled to occur.
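For instance, a quick sketch using Python's built-in heapq module (the event names and times here are made up):

import heapq

# events keyed by their scheduled time
events = []
heapq.heappush(events, (9.0, "standup"))
heapq.heappush(events, (8.5, "coffee"))
heapq.heappush(events, (13.0, "lunch"))

# extract-min always yields the next event scheduled to occur
print(heapq.heappop(events))  # (8.5, 'coffee')

# a max-heap can be emulated by negating the keys
max_heap = []
for key in [3, 1, 4]:
    heapq.heappush(max_heap, -key)
print(-heapq.heappop(max_heap))  # 4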

Balanced Binary Search Tree

A balanced binary search tree can be thought of as a dynamic sorted array (i.e. a sorted array which supports insert and delete operations).

First, consider sorted arrays. They support operations such as binary search ($O(\log n)$), select by rank ($O(1)$), min and max ($O(1)$), predecessor and successor ($O(1)$), rank ($O(\log n)$), and output in sorted order ($O(n)$).

With sorted arrays, insertion and deletions have $O(n)$ runtime, which is too slow.

If we want logarithmic-time insertions and deletions, we can use a balanced binary search tree. It supports the same operations as sorted arrays (though some are slower), in addition to insertions and deletions in $O(\log n)$ time.

To understand how balanced binary search trees work, first consider (unbalanced) binary search trees.

Binary Search Tree

A binary search tree (BST) is a data structure for efficient searching.

The keys being stored are arranged as the nodes of the tree (one node per key).

Each node has three pointers: one to its parent, one to its left child, and one to its right child.

These pointers can be null (e.g. the root node has no parent).

The search tree property asserts that for every node, all the keys stored in its left subtree should be less than its key, and all the keys in its right subtree should be greater than its key.

You can also have a convention for handling equal keys (e.g. always put duplicates in the left subtree, or always in the right).

This search tree property is what makes it very easy to search for particular values.

Note that there are many possible binary search trees for a given set of keys. The same set of keys could be arranged as a very deep and narrow BST, or as a very shallow and wide one. The worst case is a depth of about $n$, which is more of a chain than a tree; the best case is a depth of about $\log_2 n$, which is perfectly balanced.

This search tree property also makes insertion simple. You search for the key to be inserted, which will fail since the key is not in the tree yet, and you get a null pointer - you just assign that pointer to point to the new key. In the case of duplicates, you insert it according to whatever convention you decided on (as mentioned previously). This insert method maintains the search tree property.
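A minimal sketch of search and insert, assuming a simple Node class as below (the class and function names are illustrative, not a particular library's API):

class Node:
    def __init__(self, key):
        self.key = key
        self.parent = None
        self.left = None
        self.right = None

def search(root, key):
    node = root
    while node is not None:
        if key == node.key:
            return node
        # the search tree property tells us which subtree to follow
        node = node.left if key < node.key else node.right
    return None  # unsuccessful search: we hit a null pointer

def insert(root, key):
    # assumes a non-empty tree; duplicates go to the right (one possible convention)
    parent, node = None, root
    while node is not None:
        parent = node
        node = node.left if key < node.key else node.right
    new = Node(key)
    new.parent = parent
    if key < parent.key:
        parent.left = new
    else:
        parent.right = new
    return new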

Search and insert performance are dependent on the depth of the tree, so at worst the runtime is $O(\text{height})$.

The min and max operations are simple: go down the leftmost branch for the min key and go down the rightmost branch for the max key.

The predecessor operation is a bit more complicated. First you search for the key in question. Then, if the key's node has a left subtree, just take the max of that subtree. However, if the key does not have a left subtree, move up the tree through its parent and ancestors until you find a node with a key less than the key in question. (The successor operation is accomplished in a similar way.)

The deletion operation is tricky. First we must search for the key we want to delete. Then there are three possibilities:

  1. the node has no children - just remove it from the tree
  2. the node has one child - "splice" the node out by connecting its parent directly to its only child
  3. the node has two children - find the node's predecessor (the max of its left subtree), swap it with the node to be deleted, then delete the node from its new position (where it has at most one child, reducing to one of the easier cases)

For the select and rank operations, we can augment our search tree by including additional information at each node: the size of its subtree, including itself (i.e. number of descendants + 1). Augmenting the data structure in this way does add some overhead, e.g. we have to maintain the data/keep it updated whenever we modify the tree.

For select, we want to find the $i$th value of the data structure. Starting at a node $x$, say $a$ is the size of its left subtree. If $a = i-1$, return $x$. If $a \geq i$, recursively select the $i$th value of the left subtree. If $a < i-1$, recursively select the $(i-a-1)$th value of the right subtree.
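A sketch of select, assuming each node stores its subtree size in a size attribute as described above (attribute and function names are illustrative):

def select(node, i):
    # a = size of the left subtree, i.e. the number of keys smaller than node.key
    a = node.left.size if node.left else 0
    if a == i - 1:
        return node
    elif a >= i:
        return select(node.left, i)
    else:
        # skip the a keys in the left subtree plus the node itself
        return select(node.right, i - a - 1)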

Balanced Binary Search Trees (Red-Black Trees)

Balanced binary search trees have the "best" depth of about $\log_2 n$.

There are different kinds of balanced binary search trees (which are all quite similar); here we will talk about red-black trees (other kinds include AVL trees, splay trees, B-trees, etc).

Red-Black trees maintain some invariants/constraints which are what guarantee that the tree is balanced:

  1. each node stores an additional bit indicating if it is a "red" or "black" node
  2. the root is always black
  3. never allow two red nodes in a row (i.e. all of a red node's children are black)
  4. every path from the root node to a null pointer (e.g. an unsuccessful search) must go through the same number of black nodes

Consider the following: for a binary search tree, if every root-null path has $\geq k$ nodes, then the tree includes at the top a perfectly balanced search tree of depth $k-1$. Thus there must be at least $2^k - 1$ nodes in the tree, i.e. $n \geq 2^k - 1$. We can restate this as $k \leq \log_2(n+1)$.

Applied to a red-black tree: the shortest root-null path has at most $\log_2(n+1)$ nodes, and therefore at most $\log_2(n+1)$ black nodes. By the fourth constraint, every root-null path then has $\leq \log_2(n+1)$ black nodes. The third constraint means a path can never have more red nodes than black nodes (since red nodes can never appear consecutively), so every root-null path has length $\leq 2 \log_2(n+1)$, which gives us a balanced tree.

Hash Tables

Hash tables (also called dictionaries) allow us to maintain a (possibly evolving) set of stuff.

The core operations include insert, delete, and lookup (checking whether a key is in the table and retrieving its associated value).

When implemented properly, and on non-pathological data, these operations all run in $O(1)$ time.

Hash tables do not maintain any ordering.

Basically, hash tables use some hash function to produce a hash for an object (some number); this hash is the "address" of the object in the hash table.

More specifically, we have a hash function which gives us a value in some range $[0, n-1]$; we have an array of length $n$ (the buckets), so the hash function tells us at which index to place a given object.

There is a chance of collisions in which two different objects produce the same hash.

There are two main solutions for resolving collisions: chaining (each bucket stores a list of all the objects that hash to it) and open addressing (each bucket holds at most one object; on a collision, we probe other buckets according to some probe sequence, e.g. linear probing or double hashing).

Each is more appropriate in different situations.
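A minimal sketch of the chaining approach, using Python's built-in hash as the hash function (the class and method names are illustrative, not a library API):

class ChainedHashTable:
    def __init__(self, n_buckets=16):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # the hash, reduced mod the number of buckets, is the key's "address"
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def lookup(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

    def delete(self, key):
        bucket = self._bucket(key)
        bucket[:] = [(k, v) for (k, v) in bucket if k != key]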

The performance of a hash table depends a lot on the particular hash function. Ideally, we want a hash function that spreads the data out evenly across the buckets (i.e. behaves like a uniformly random assignment) and that is cheap to compute.

Designing hash functions is as much an art as it is a science. They are quite difficult to design.

A hash table has a load factor (sometimes just called load), denoted $\alpha = \frac{\text{num. objects in hash table}}{\text{num. buckets in hash table}}$.

For hash table operations to run in constant time, it is necessary that $\alpha = O(1)$. Ideally, it is less than 1, especially with open addressing.

So for good performance, we need to control load. For example, if $\alpha$ passes some threshold, e.g. 0.75, then we may want to expand the hash table to lower the load.

Every hash function has a "pathological" data set which it performs poorly on. There is no hash function which is guaranteed to spread out every data set evenly (i.e. have low collisions on any arbitrary data set). You can often reverse engineer this pathological data set by analyzing the hash function.

However, for some hash functions, it is "infeasible" to figure out its pathological data set (as is the case with cryptographic hash functions).

One approach is to define a family of hash functions, rather than just one, and randomly choose a hash function to use at runtime (this is the idea behind universal hashing). This has the property that, on average, you do well across all datasets.

Bloom Filters

Bloom filters are a variant on hash tables - they are more space efficient, but they allow for some errors (that is, there's a chance of a false positive). In some contexts, this is tolerable.

They are more space efficient because they do not actually store the objects themselves; they are commonly used just to keep track of which objects have been seen so far. They typically do not support deletions (there are variants that incorporate deletions, but they are more complicated). There is also a small chance that a bloom filter will say it has seen an object that it hasn't (i.e. a false positive).

Like a hash table, a bloom filter consists of an array $A$, but each entry in the array is just one bit. Say we have a set of objects $S$ and the total number of bits $n$ - a bloom filter will use only $\frac{n}{|S|}$ bits per object in $S$.

We also have $k$ hash functions $h_1, \dots, h_k$ (usually $k$ is small).

The insert operation is defined:

hash_funcs = [...]  # the k hash functions h_1, ..., h_k
for i in range(k):
    A[hash_funcs[i](x)] = 1

That is, we just set the values of those bits to 1.

Thus lookup is just to check that all the corresponding bits for an object are 1.

We can't have any false negatives: once an object is inserted, all of its corresponding bits are set, and bits are never reset back to zero.

False positives are possible, however, because some other objects may have in aggregate set the bits corresponding to another object.
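Putting insert and lookup together, a minimal self-contained sketch which derives the $k$ hash functions from a single salted hash (the class and parameter names are illustrative):

import hashlib

class BloomFilter:
    def __init__(self, n_bits, k):
        self.n_bits = n_bits
        self.k = k
        self.bits = [0] * n_bits

    def _hashes(self, obj):
        # derive k hash values from one hash function by salting with i
        for i in range(self.k):
            digest = hashlib.sha256("{}:{}".format(i, obj).encode()).hexdigest()
            yield int(digest, 16) % self.n_bits

    def add(self, obj):
        for idx in self._hashes(obj):
            self.bits[idx] = 1

    def __contains__(self, obj):
        # no false negatives; false positives are possible
        return all(self.bits[idx] == 1 for idx in self._hashes(obj))

bf = BloomFilter(n_bits=1000, k=3)
bf.add("alice")
print("alice" in bf)  # True
print("bob" in bf)    # False (with high probability)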

P vs NP

Consider the following problem:

7 * 13 = ?

This is solved very quickly by a computer (it gives 91).

Now consider the following factoring problem:

? * ? = 91

This is a bit more difficult for a computer to solve, though it will yield the correct answers (7 and 13).

If we consider extremely large numbers, a computer can still very quickly compute their product.

But, given a product, it will take a computer a very, very long time to compute their factors.

In fact, modern cryptography is based on the fact that computers are not good at finding factors for a number (in particular, prime factors).

This is because computers basically have to use brute force search to identify a factor; with very large numbers, this search space is enormous (it grows exponentially).

However, once we find a possible solution, it is easy to check that we are correct (e.g. just multiply the factors and compare the product).
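As a toy sketch of this asymmetry - brute-force factoring versus verifying a proposed factorization (the function names are made up):

def factor(n):
    # brute-force search: try every candidate divisor; the search space
    # grows exponentially in the number of digits of n
    for d in range(2, n):
        if n % d == 0:
            return d, n // d
    return None  # n is prime

def verify(d1, d2, n):
    # checking a proposed answer is just one multiplication
    return d1 * d2 == n

print(factor(91))          # (7, 13)
print(verify(7, 13, 91))   # True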

There are many problems which are characterized in this way - they require brute force search to identify the exact answer (there are often faster ways of getting approximate answers), but once an answer is found, it can be easily checked if it is correct.

There are other problems, such as multiplication, where we can easily "jump" directly to the correct exact answer.

For the problems that require search, it is not known whether or not there is also a method that can "jump" to the correct answer.

Consider the "needle in the haystack" analogy. We could go through each piece of hay until we find the needle (brute force search). Or we could use a magnet to pull it out immediately. The question is open: does this magnet exist for problems like factorization?

Problems which we can quickly solve, like multiplication, are in a family of problems called "P", which stands for "polynomial time" (referring to how computation time grows as a polynomial function of the input size).

Problems which can be quickly verified to be correct are in a family of problems called "NP", which stands for "nondeterministic polynomial time".

P is a subset of NP, since answers to problems in P can also be quickly verified, but NP also includes the aforementioned search problems. Thus, a major question is whether or not P = NP: we have the "magnet" for P problems, but is there also one for the rest of the NP problems? Or is searching the only option?

NP-hard

Problems which are NP-hard are at least as hard as NP problems, so this includes problems which may not even be in NP.

NP-completeness

There are some NP problems which all NP problems can be reduced to. Such an NP problem is called NP-complete.

For example, any NP problem can be reduced to the clique problem (e.g. finding whether a clique of some given size exists in a graph); thus the clique problem is NP-complete. If we can find a way of solving the clique problem quickly, we can also solve all of those related problems quickly as well.

To show that a problem is NP-complete, you must:

  1. show that the problem is in NP (i.e. a proposed solution can be verified in polynomial time)
  2. show that a known NP-complete problem can be reduced to it in polynomial time
