Algorithms

We measure the performance of algorithms in terms of time complexity (how many steps are taken, as a function of input size) and space complexity (how much memory is used, as a function of input size).

There may be trade-offs between performance and correctness (a correct algorithm is one that always terminates with the correct answer) - for instance, in situations where speed matters more than exactness, we may opt for a faster algorithm that provides only an approximate solution or a local optimum.

There may be trade-offs between time and space complexity as well - for instance, we might have the choice between a fast algorithm that uses a lot of memory, or a slower one that has lower memory requirements.

It's important to be able to evaluate and compare algorithms in order to determine which is most appropriate for a particular problem.

Algorithm design paradigms

Algorithms generally fall into a few categories of "design paradigms" - for example, divide-and-conquer (covered below), randomized algorithms, greedy algorithms, and dynamic programming.

Algorithmic Analysis

The number of operations an algorithm performs is typically a function of its input size, denoted $n$.

There are a few kinds of algorithmic analysis: worst-case analysis (an upper bound that holds for every input of a given size), average-case analysis (the expected performance over some assumed distribution of inputs), and benchmarks (performance on an agreed-upon set of "typical" inputs).

Average-case analysis and benchmarks require some domain knowledge about what inputs to expect. When you want to do a more "general purpose" analysis, worst-case analysis is preferable.

When comparing the performance of algorithms, we focus on asymptotic analysis; that is, we focus on performance for large input sizes.

Note, however, that algorithms which are inefficient for large $n$ may be better than asymptotically faster algorithms when $n$ is small. For example, insertion sort has an upper-bound runtime of $\frac{n^2}{2}$, which for small $n$ (e.g. $n < 90$) is better than merge sort, because constant factors matter more for small inputs. In any case, with small $n$ it often doesn't matter much which algorithm you use: the input is so small that there are unlikely to be significant performance differences, so analysis of small input sizes is not very valuable (or interesting).

Thus we define a "fast" algorithm as one in which the worst-case running time grows slowly with input size.

Asymptotic Analysis

With asymptotic analysis, we suppress constant factors and lower-order terms, since they don't matter much for large inputs, and because the constant factors can vary quite a bit depending on the architecture, compiler, programmer, etc.

For example, if we have an algorithm with the upper bound runtime of $6n \log_2 n + 6n$, we would rewrite it as just $n \log n$ (note that $\log$ typically implies $\log_2$).

Then we say the running time is $O(n \log n)$, read "big-oh of $n \log n$"; the $O$ indicates that we have dropped the constant factors and lower-order terms.

Generally, we categorize big-oh performance by order of growth, each of which defines a set of algorithms that grow equivalently:

| Order of growth | Name |
|-----------------|------|
| $O(1)$ | constant |
| $O(\log_b n)$ | logarithmic (for any $b$) |
| $O(n)$ | linear |
| $O(n \log_b n)$ | $n \log n$ |
| $O(n^2)$ | quadratic |
| $O(n^3)$ | cubic |
| $O(c^n)$ | exponential (for any $c$) |

The order of growth is determined by the leading term, that is, the term with the highest exponent.

Note for $\log$, the base doesn't matter because it is equivalent to multiplying by a constant, which we ignore anyway.
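To see why, recall the change-of-base identity - switching bases only changes the constant factor:

$$ \log_b n = \frac{\log_2 n}{\log_2 b} = \frac{1}{\log_2 b} \log_2 n $$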

Loop examples

Consider the following algorithm for finding an element in an array:

def func(i, arr):
    for el in arr:
        if el == i:
            return True
    return False

This has the running time of $O(n)$ since, in the worst case, it checks every item.

Now consider the following:

def func2(i, arr):
    return func(i, arr), func(i, arr)

This still has a running time of $O(n)$: although it performs twice the number of operations (i.e. $\sim 2n$ operations total), we drop the constant factor of $2$.

Now consider the following algorithm for checking if two arrays have a common element:

def func3(arr1, arr2):
    for el1 in arr1:
        for el2 in arr2:
            if el1 == el2:
                return True
    return False

This has a runtime of $O(n^2)$, which is called a quadratic time algorithm.

The following algorithm for checking whether an array contains a duplicate also has a runtime of $O(n^2)$ - the inner loop does only about $\frac{n^2}{2}$ comparisons in total, but again we drop the constant factor:

def func4(arr):
    for i, el1 in enumerate(arr):
        # compare only against later elements, so an element
        # is never compared with itself
        for el2 in arr[i+1:]:
            if el1 == el2:
                return True
    return False

Big-Oh formal definition

Say we have a function $T(n)$, $n \geq 0$, which is usually the worst-case running time of an algorithm.

We say that $T(n) = O(f(n))$ if and only if there exist constants $c, n_0 > 0$ such that $T(n) \leq c f(n)$ for all $n \geq n_0$.

That is, we can multiply $f(n)$ by some constant $c$ such that there is some value $n_0$, after which $T(n)$ is always below $c f(n)$.

For example: we demonstrated that $6n \log_2 n + 6n$ is the worst-case running time for merge sort. For merge sort, this is $T(n)$. We described merge sort's running time in big-oh notation with $O(n \log n)$. This is appropriate because there exists some constant $c$ we can multiply $n \log n$ by such that, after some input size $n_0$, $c f(n)$ is always larger than $T(n)$. In this sense, $n_0$ defines a sufficiently large input.
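Concretely, one possible choice of constants: for $n \geq 2$ we have $\log_2 n \geq 1$, so

$$ 6n \log_2 n + 6n \leq 6n \log_2 n + 6n \log_2 n = 12 n \log_2 n $$

and the definition is satisfied with $c = 12$ and $n_0 = 2$.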

As a simple example, we can prove that $2^{n+10} = O(2^n)$.

So the inequality is:

$$ 2^{n+10} \leq c 2^n $$

We can re-write this:

$$ 2^{10} 2^n \leq c 2^n $$

Then it's clear that if we set $c=2^{10}$, this inequality holds, and it happens to hold for all $n$, so we can just set $n_0 = 1$. Thus $2^{n+10} = O(2^n)$ is in fact true.

Big-Omega notation

$T(n) = \Omega(f(n))$ if and only if there exist constants $c, n_0 > 0$ such that $T(n) \geq c f(n)$ for all $n \geq n_0$.

That is, we can multiply $f(n)$ by some constant $c$ such that there is some value $n_0$, after which $T(n)$ is always above $c f(n)$.

Big-Theta notation

$T(n) = \Theta(f(n))$ if and only if $T(n) = O(f(n))$ and $T(n) = \Omega(f(n))$. That is, $T(n)$ eventually stays sandwiched between $c_1 f(n)$ and $c_2 f(n)$ after some value $n_0$.

Little-Oh notation

Little-oh is stricter than big-oh: the inequality must hold for every positive constant $c$, not just for some constant.

$T(n) = o(f(n))$ if and only if for all constants $c>0$, there exists a constant $n_0$ such that $T(n) \leq c f(n)$ for all $n \geq n_0$.
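For example, $n = o(n^2)$: for any constant $c > 0$, we have $n \leq c n^2$ whenever $n \geq \frac{1}{c}$, so $n_0 = \lceil \frac{1}{c} \rceil$ works. By contrast, $n$ is not $o(n)$, since the inequality fails for every $c < 1$.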

The divide-and-conquer paradigm

This consists of:

  1. dividing the problem into smaller subproblems
  2. conquering the subproblems via recursive calls
  3. combining the solutions to the subproblems into a solution for the original problem

The Master Method/Theorem

The Master Method provides an easy way of computing the upper-bound runtime for a divide-and-conquer algorithm, so long as it satisfies the following assumption: all subproblems have equal size, and the recurrence has the format

$$ T(n) \leq a T(\frac{n}{b}) + O(n^d) $$

where $a$ is the number of recursive calls (subproblems), $b$ is the factor by which the input size shrinks in each subproblem, and $d$ is the exponent of the work done outside the recursive calls (e.g. the combine step). Then:

$$ T(n) = \begin{cases} O(n^d \log n) & \text{if $a = b^d$} \\ O(n^d) & \text{if $a < b^d$} \\ O(n^{\log_b a}) & \text{if $a > b^d$} \end{cases} $$

Note that the logarithm base does not matter in the first case, since it just changes the leading constant (which doesn't matter in Big-Oh notation), whereas in the last case the base does matter, because it has a more significant effect.

To use the master method, we just need to determine $a,b,d$.

For example:

With merge sort, there are two recursive calls (thus $a=2$), the input size to each recursive call is half the original input (thus $b=2$), and the combine step only involves the merge operation, which does linear work (thus $d=1$).

Working out the master method condition, we get $2 = 2^1$, i.e. $a = b^d$, thus:

$$ T(n) = O(n \log n) $$
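As another illustration of how the three cases are applied, here is a small helper sketch (the function name and output format are made up purely for demonstration):

import math

def master_method(a, b, d):
    """Big-oh bound for T(n) <= a*T(n/b) + O(n^d), per the master method."""
    if a == b ** d:
        return "O(n^{} log n)".format(d)
    elif a < b ** d:
        return "O(n^{})".format(d)
    else:
        return "O(n^{:g})".format(math.log(a, b))

print(master_method(2, 2, 1))  # merge sort: O(n^1 log n), i.e. O(n log n)
print(master_method(1, 2, 0))  # binary search: O(n^0 log n), i.e. O(log n)
print(master_method(4, 2, 1))  # four half-size subproblems, linear combine: O(n^2)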

Proof of the Master Method

(Carrying over the previously-stated assumptions)

For simplicity, we'll also assume that $n$ is a power of $b$, but this proof holds in the general case as well.

In the recursion tree, at each level $j=0,1,2,\dots,\log_b n$, there are $a^j$ subproblems, each of size $n/b^j$.

At a level $j$, the total work, not including the work in recursive calls, is:

$$ \leq a^j c (\frac{n}{b^j})^d $$

Note that the $a$ and $b$ terms are dependent on the level $j$, but the $c,n,d$ terms are not. We can rearrange the expression to separate those terms:

$$ \leq cn^d (\frac{a}{b^d})^j $$

To get the total work, we can sum over all the levels:

$$ \leq cn^d \sum_{j=0}^{\log_b n} (\frac{a}{b^d})^j $$

We can think of $a$ as the rate of subproblem proliferation (i.e. how the number of subproblems grow with level depth) and $b^d$ as the rate of work shrinkage per subproblem.

There are three possible scenarios, corresponding to the master method's three cases:

If $a = b^d$, then the summation term in the total work expression, $\sum_{j=0}^{\log_b n} (\frac{a}{b^d})^j$, simply becomes $\log_b n + 1$, thus the total work upper bound in that case is just $c n^d (\log_b n + 1)$, which is just $O(n^d \log n)$.

A geometric sum for $r \neq 1$:

$$ 1 + r + r^2 + r^3 + \dots + r^k $$

Can be expressed in the following closed form:

$$ \frac{r^{k+1} - 1}{r-1} $$

If $r < 1$ is constant, then this is $\leq \frac{1}{1-r}$ (i.e. it is some constant independent of $k$).
If $r > 1$ is constant, then this is $\leq r^k(1 + \frac{1}{r-1})$, where the last term $(1 + \frac{1}{r-1})$ is a constant independent of $k$.

We can bring this back to the master method by setting $r = \frac{a}{b^d}$.

If $a < b^d$, then the summation term in the total work expression becomes a constant (as demonstrated with the geometric sum); thus in Big-Oh, that summation term drops, and we are left with $O(n^d)$.

If $a > b^d$, then the summation term in the total work becomes a constant times $r^k$, where $k = \log_b n$; i.e. the summation term becomes a constant times $(\frac{a}{b^d})^{\log_b n}$. So we get $O(n^d (\frac{a}{b^d})^{\log_b n})$, which simplifies to $O(a^{\log_b n})$ since $(b^d)^{\log_b n} = n^d$ and the $n^d$ terms cancel. This is the number of leaves in the recursion tree, and it is equivalent to $O(n^{\log_b a})$ (taking $\log_b$ of $a^{\log_b n}$ and of $n^{\log_b a}$ gives the same value).

Data Structures

Data structures are particular ways of organizing data that support certain operations. Structures have strengths and weaknesses in what kinds of operations they can perform, and how well they can perform them.

For example: lists, stacks, queues, heaps, search trees, hashtables, bloom filters, etc.

Different data structures are appropriate for different operations (and thus different kinds of problems).

Heaps

A heap (sometimes called a priority queue) is a container for objects that have keys; these keys are comparable (e.g. we can say that one key is bigger than another).

The two core supported operations are insert (add an object to the heap) and extract-min (remove the object with the minimum key value, breaking ties arbitrarily); both run in $O(\log n)$ time.

Alternatively, there are max-heaps, which return the maximum key value instead (a max-heap can be emulated with a min-heap by negating key values so that the max becomes the min, and so on).

Sometimes additional operations are supported, such as heapify (batch-insert a set of elements in linear time) and delete (remove an arbitrary element in $O(\log n)$ time).

For example, you can have a heap where events are your objects and their keys are a scheduled time to occur. Thus when you extract-min you always get the next event scheduled to occur.
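For instance, a quick sketch using Python's built-in heapq module (the event names and times here are made up):

import heapq

# events keyed by their scheduled time
events = []
heapq.heappush(events, (9.0, "standup"))
heapq.heappush(events, (8.5, "coffee"))
heapq.heappush(events, (13.0, "lunch"))

# extract-min always yields the next event scheduled to occur
print(heapq.heappop(events))  # (8.5, 'coffee')

# a max-heap can be emulated by negating the keys
max_heap = []
for key in [3, 1, 4]:
    heapq.heappush(max_heap, -key)
print(-heapq.heappop(max_heap))  # 4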

Balanced Binary Search Tree

A balanced binary search tree can be thought of as a dynamic sorted array (i.e. a sorted array which supports insert and delete operations).

First, consider sorted arrays. They support operations such as binary search ($O(\log n)$), select by rank ($O(1)$), min and max ($O(1)$), predecessor and successor ($O(1)$), rank ($O(\log n)$), and output in sorted order ($O(n)$).

With sorted arrays, insertion and deletions have $O(n)$ runtime, which is too slow.

If we want logarithmic-time insertions and deletions, we can use a balanced binary search tree. It supports the same operations as sorted arrays (though some are slower), in addition to insertions and deletions in $O(\log n)$ time.

To understand how balanced binary search trees work, first consider (unbalanced) binary search trees.

Binary Search Tree

A binary search tree (BST) is a data structure for efficient searching.

The keys being stored are arranged as the nodes of the tree (one node per key).

Each node has three pointers: one to its parent, one to its left child, and one to its right child.

These pointers can be null (e.g. the root node has no parent).

The search tree property asserts that for every node, all the keys stored in its left subtree should be less than its key, and all the keys in its right subtree should be greater than its key.

You can also have a convention for handling equal keys (e.g. always put duplicates in the left subtree, or always in the right).

This search tree property is what makes it very easy to search for particular values.

Note that there are many possible binary search trees for a given set of keys. The same set of keys could be arranged as a very deep and narrow BST, or as a very shallow and wide one. The worst case is a depth of about $n$, which is more of a chain than a tree; the best case is a depth of about $\log_2 n$, which is perfectly balanced.

This search tree property also makes insertion simple. You search for the key to be inserted, which will fail since the key is not in the tree yet, and you get a null pointer - you just assign that pointer to point to the new key. In the case of duplicates, you insert it according to whatever convention you decided on (as mentioned previously). This insert method maintains the search tree property.
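A minimal sketch of search and insert, assuming a simple Node class as below (the class and function names are illustrative, not a particular library's API):

class Node:
    def __init__(self, key):
        self.key = key
        self.parent = None
        self.left = None
        self.right = None

def search(root, key):
    node = root
    while node is not None:
        if key == node.key:
            return node
        # the search tree property tells us which subtree to follow
        node = node.left if key < node.key else node.right
    return None  # unsuccessful search: we hit a null pointer

def insert(root, key):
    # assumes a non-empty tree; duplicates go to the right (one possible convention)
    parent, node = None, root
    while node is not None:
        parent = node
        node = node.left if key < node.key else node.right
    new = Node(key)
    new.parent = parent
    if key < parent.key:
        parent.left = new
    else:
        parent.right = new
    return new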

Search and insert performance are dependent on the depth of the tree, so at worst the runtime is $O(\text{height})$.

The min and max operations are simple: go down the leftmost branch for the min key and go down the rightmost branch for the max key.

The predecessor operation is a bit more complicated. First you search for the key in question. Then, if the key's node has a left subtree, just take the max of that subtree. However, if the key does not have a left subtree, move up the tree through its parent and ancestors until you find a node with a key less than the key in question. (The successor operation is accomplished in a similar way.)

The deletion operation is tricky. First we must search for the key we want to delete. Then there are three possibilities:

  1. the node has no children - just remove it from the tree
  2. the node has one child - "splice" the node out by connecting its parent directly to its only child
  3. the node has two children - find the node's predecessor (the max of its left subtree), swap it with the node to be deleted, then delete the node from its new position (where it has at most one child, reducing to one of the easier cases)

For the select and rank operations, we can augment our search tree by including additional information at each node: the size of its subtree, including itself (i.e. number of descendants + 1). Augmenting the data structure in this way does add some overhead, e.g. we have to maintain the data/keep it updated whenever we modify the tree.

For select, we want to find the $i$th value of the data structure. Starting at a node $x$, say $a$ is the size of its left subtree. If $a = i-1$, return $x$. If $a \geq i$, recursively select the $i$th value of the left subtree. If $a < i-1$, recursively select the $(i-a-1)$th value of the right subtree.
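A sketch of select, assuming each node stores its subtree size in a size attribute as described above (attribute and function names are illustrative):

def select(node, i):
    # a = size of the left subtree, i.e. the number of keys smaller than node.key
    a = node.left.size if node.left else 0
    if a == i - 1:
        return node
    elif a >= i:
        return select(node.left, i)
    else:
        # skip the a keys in the left subtree plus the node itself
        return select(node.right, i - a - 1)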

Balanced Binary Search Trees (Red-Black Trees)

Balanced binary search trees have the "best" depth of about $\log_2 n$.

There are different kinds of balanced binary search trees (which are all quite similar); here we will talk about red-black trees (other kinds include AVL trees, splay trees, B-trees, etc).

Red-Black trees maintain some invariants/constraints which are what guarantee that the tree is balanced:

  1. each node stores an additional bit indicating if it is a "red" or "black" node
  2. the root is always black
  3. never allow two red nodes in a row (i.e. all of a red node's children are black)
  4. every path from the root node to a null pointer (e.g. an unsuccessful search) must go through the same number of black nodes

Consider the following: for a binary search tree, if every root-null path has $\geq k$ nodes, then the tree includes at the top a perfectly balanced search tree of depth $k-1$. Thus there must be at least $2^k - 1$ nodes in the tree, i.e. $n \geq 2^k - 1$. We can restate this as $k \leq \log_2(n+1)$.

Applied to a red-black tree: the shortest root-null path has at most $\log_2(n+1)$ nodes, and therefore at most $\log_2(n+1)$ black nodes. By the fourth constraint, every root-null path then has $\leq \log_2(n+1)$ black nodes. The third constraint means a path can never have more red nodes than black nodes (since red nodes can never appear consecutively), so every root-null path has length $\leq 2 \log_2(n+1)$, which gives us a balanced tree.

Hash Tables

Hash tables (also called dictionaries) allow us to maintain a (possibly evolving) set of stuff.

The core operations include insert, delete, and lookup (checking whether a key is in the table and retrieving its associated value).

When implemented properly, and on non-pathological data, these operations all run in $O(1)$ time.

Hash tables do not maintain any ordering.

Basically, hash tables use some hash function to produce a hash for an object (some number); this hash is the "address" of the object in the hash table.

More specifically, we have a hash function which gives us a value in some range $[0, n-1]$; we have an array of length $n$ (the buckets), so the hash function tells us at which index to place a given object.

There is a chance of collisions in which two different objects produce the same hash.

There are two main solutions for resolving collisions: chaining (each bucket stores a list of all the objects that hash to it) and open addressing (each bucket holds at most one object; on a collision, we probe other buckets according to some probe sequence, e.g. linear probing or double hashing).

Each is more appropriate in different situations.
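A minimal sketch of the chaining approach, using Python's built-in hash as the hash function (the class and method names are illustrative, not a library API):

class ChainedHashTable:
    def __init__(self, n_buckets=16):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # the hash, reduced mod the number of buckets, is the key's "address"
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def lookup(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

    def delete(self, key):
        bucket = self._bucket(key)
        bucket[:] = [(k, v) for (k, v) in bucket if k != key]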

The performance of a hash table depends a lot on the particular hash function. Ideally, we want a hash function that spreads the data out evenly across the buckets (i.e. behaves like a uniformly random assignment) and that is cheap to compute.

Designing hash functions is as much an art as it is a science. They are quite difficult to design.

A hash table has a load factor (sometimes just called load), denoted $\alpha = \frac{\text{num. objects in hash table}}{\text{num. buckets in hash table}}$.

For hash table operations to run in constant time, it is necessary that $\alpha = O(1)$. Ideally, it is less than 1, especially with open addressing.

So for good performance, we need to control load. For example, if $\alpha$ passes some threshold, e.g. 0.75, then we may want to expand the hash table to lower the load.

Every hash function has a "pathological" data set which it performs poorly on. There is no hash function which is guaranteed to spread out every data set evenly (i.e. have low collisions on any arbitrary data set). You can often reverse engineer this pathological data set by analyzing the hash function.

However, for some hash functions, it is "infeasible" to figure out its pathological data set (as is the case with cryptographic hash functions).

One approach is to define a family of hash functions, rather than just one, and randomly choose a hash function to use at runtime (this is the idea behind universal hashing). This has the property that, on average, you do well across all datasets.

Bloom Filters

Bloom filters are a variant on hash tables - they are more space efficient, but they allow for some errors (that is, there's a chance of a false positive). In some contexts, this is tolerable.

They are more space efficient because they do not actually store the objects themselves; they are commonly used just to keep track of which objects have been seen so far. They typically do not support deletions (there are variants that incorporate deletions, but they are more complicated). There is also a small chance that a bloom filter will say it has seen an object that it hasn't (i.e. a false positive).

Like a hash table, a bloom filter consists of an array $A$, but each entry in the array is just one bit. Say we have a set of objects $S$ and the total number of bits $n$ - a bloom filter will use only $\frac{n}{|S|}$ bits per object in $S$.

We also have $k$ hash functions $h_1, \dots, h_k$ (usually $k$ is small).

The insert operation is defined:

hash_funcs = [...]  # the k hash functions h_1, ..., h_k
for i in range(k):
    A[hash_funcs[i](x)] = 1

That is, we just set the values of those bits to 1.

Thus lookup is just to check that all the corresponding bits for an object are 1.

We can't have any false negatives: once an object is inserted, all of its corresponding bits are set, and bits are never reset back to zero.

False positives are possible, however, because some other objects may have in aggregate set the bits corresponding to another object.
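Putting insert and lookup together, a minimal self-contained sketch which derives the $k$ hash functions from a single salted hash (the class and parameter names are illustrative):

import hashlib

class BloomFilter:
    def __init__(self, n_bits, k):
        self.n_bits = n_bits
        self.k = k
        self.bits = [0] * n_bits

    def _hashes(self, obj):
        # derive k hash values from one hash function by salting with i
        for i in range(self.k):
            digest = hashlib.sha256("{}:{}".format(i, obj).encode()).hexdigest()
            yield int(digest, 16) % self.n_bits

    def add(self, obj):
        for idx in self._hashes(obj):
            self.bits[idx] = 1

    def __contains__(self, obj):
        # no false negatives; false positives are possible
        return all(self.bits[idx] == 1 for idx in self._hashes(obj))

bf = BloomFilter(n_bits=1000, k=3)
bf.add("alice")
print("alice" in bf)  # True
print("bob" in bf)    # False (with high probability)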

P vs NP

Consider the following problem:

7 * 13 = ?

This is solved very quickly by a computer (it gives 91).

Now consider the following factoring problem:

? * ? = 91

This is a bit more difficult for a computer to solve, though it will yield the correct answers (7 and 13).

If we consider extremely large numbers, a computer can still very quickly compute their product.

But, given a product, it will take a computer a very, very long time to compute their factors.

In fact, modern cryptography is based on the fact that computers are not good at finding factors for a number (in particular, prime factors).

This is because computers basically have to use brute force search to identify a factor; with very large numbers, this search space is enormous (it grows exponentially).

However, once we find a possible solution, it is easy to check that we are correct (e.g. just multiply the factors and compare the product).
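As a toy sketch of this asymmetry - brute-force factoring versus verifying a proposed factorization (the function names are made up):

def factor(n):
    # brute-force search: try every candidate divisor; the search space
    # grows exponentially in the number of digits of n
    for d in range(2, n):
        if n % d == 0:
            return d, n // d
    return None  # n is prime

def verify(d1, d2, n):
    # checking a proposed answer is just one multiplication
    return d1 * d2 == n

print(factor(91))          # (7, 13)
print(verify(7, 13, 91))   # True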

There are many problems which are characterized in this way - they require brute force search to identify the exact answer (there are often faster ways of getting approximate answers), but once an answer is found, it can be easily checked if it is correct.

There are other problems, such as multiplication, where we can easily "jump" directly to the correct exact answer.

For the problems that require search, it is not known whether or not there is also a method that can "jump" to the correct answer.

Consider the "needle in the haystack" analogy. We could go through each piece of hay until we find the needle (brute force search). Or we could use a magnet to pull it out immediately. The question is open: does this magnet exist for problems like factorization?

Problems which we can quickly solve, like multiplication, are in a family of problems called "P", which stands for "polynomial time" (referring to how computation time grows as a polynomial function of the input size).

Problems which can be quickly verified to be correct are in a family of problems called "NP", which stands for "nondeterministic polynomial time".

P is a subset of NP, since answers to problems in P can also be quickly verified, but NP also includes the aforementioned search problems. Thus, a major question is whether or not P = NP: we have the "magnet" for P problems, but is there also one for the rest of the NP problems? Or is searching the only option?

NP-hard

Problems which are NP-hard are at least as hard as NP problems, so this includes problems which may not even be in NP.

NP-completeness

There are some NP problems which all NP problems can be reduced to. Such an NP problem is called NP-complete.

For example, any NP problem can be reduced to the clique problem (e.g. finding whether a clique of some given size exists in a graph); thus the clique problem is NP-complete. If we can find a way of solving the clique problem quickly, we can also solve all of those related problems quickly as well.

To show that a problem is NP-complete, you must:

  1. show that the problem is in NP (i.e. a proposed solution can be verified in polynomial time)
  2. show that a known NP-complete problem can be reduced to it in polynomial time
