We measure the performance of algorithms in terms of time complexity (how many steps are taken, as a function of input size) and space complexity (how much memory is used, as a function of input size).
There may be trade-offs in pursuit of better time and space complexity - for instance, there may be situations where speed is more important than correctness (a correct algorithm is one that always terminates with the correct answer), in which case we may opt for a faster algorithm that provides only an approximate solution or a local optimum.
There may be trade-offs between time and space complexity as well - for instance, we might have the choice between a fast algorithm that uses a lot of memory, or a slower one that has lower memory requirements.
It's important to be able to evaluate and compare algorithms in order to determine which is most appropriate for a particular problem.
Algorithms generally fall into a few categories of "design paradigms", such as divide and conquer, greedy algorithms, dynamic programming, and randomized algorithms.
The number of operations for an algorithm is typically a function of the size of its input, denoted $n$.
There are a few kinds of algorithmic analysis: worst-case analysis (an upper bound that holds for every input of a given size), average-case analysis (performance averaged over some assumed distribution of inputs), and benchmarks (empirical measurements on an agreed-upon set of "typical" inputs).
Average-case analysis and benchmarks require some domain knowledge about what inputs to expect. When you want to do a more "general purpose" analysis, worst-case analysis is preferable.
When comparing the performance of algorithms, we focus on asymptotic analysis; that is, we focus on performance for large input sizes.
Note, however, that algorithms which are inefficient for large inputs may still be fine (or even faster than asymptotically better alternatives) on small inputs.
Thus we define a "fast" algorithm as one in which the worst-case running time grows slowly with input size.
With asymptotic analysis, we suppress constant factors and lower-order terms, since they don't matter much for large inputs, and because the constant factors can vary quite a bit depending on the architecture, compiler, programmer, etc.
For example, if we have an algorithm with the upper-bound runtime of $6n \log_2 n + 6n$, we drop the lower-order term $6n$ and the constant factor $6$. Then we say the running time is $O(n \log n)$.
Generally, we categorize big-oh performance by order of growth, each of which defines a set of algorithms that grow equivalently:
| Order of growth | Name |
|---|---|
| $O(1)$ | constant |
| $O(\log_b n)$ | logarithmic (for any $b$) |
| $O(n)$ | linear |
| $O(n \log_b n)$ | linearithmic (for any $b$) |
| $O(n^2)$ | quadratic |
| $O(c^n)$ | exponential (for any $c > 1$) |
The order of growth is determined by the leading term, that is, the term with the highest exponent.
Consider the following algorithm for finding an element in an array:
```python
def func(i, arr):
    for el in arr:
        if el == i:
            return True
    return False
```
This has the running time of $O(n)$: in the worst case, it checks each of the $n$ elements once.
Now consider the following:
```python
def func2(i, arr):
    return func(i, arr), func(i, arr)
```
This still has the running time of $O(n)$ - it does twice the work ($2n$ operations), but the constant factor of 2 is suppressed.
Now consider the following algorithm for checking if two arrays have a common element:
```python
def func3(arr1, arr2):
    for el1 in arr1:
        for el2 in arr2:
            if el1 == el2:
                return True
    return False
```
This has a runtime of $O(n^2)$ (taking $n$ as the length of the arrays), due to the nested loops.
The following algorithm for checking for duplicates in an array also has a runtime of $O(n^2)$:
```python
def func4(arr):
    for i, el1 in enumerate(arr):
        for el2 in arr[i+1:]:  # only compare against later elements
            if el1 == el2:
                return True
    return False
```
Say we have a function $T(n)$, $n \ge 0$, denoting the worst-case number of operations an algorithm performs on an input of size $n$.
We say that $T(n) = O(f(n))$ if and only if there exist constants $c, n_0 > 0$ such that $T(n) \le c \cdot f(n)$ for all $n \ge n_0$.
That is, we can multiply $f(n)$ by some constant $c$ such that, past some input size $n_0$, $c \cdot f(n)$ always upper-bounds $T(n)$.
For example: we demonstrated above that an algorithm taking $6n \log_2 n + 6n$ operations is $O(n \log n)$; formally, the definition is satisfied with, e.g., $c = 12$ and $n_0 = 2$.
As a simple example, we can prove that $2^{n+10} = O(2^n)$.
So the inequality is: $2^{n+10} \le c \cdot 2^n$.
We can re-write this: $2^{10} \cdot 2^n \le c \cdot 2^n$.
Then it's clear that if we set $c = 2^{10}$ (with, say, $n_0 = 1$), the inequality holds for all $n \ge n_0$.
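As a quick numerical sanity check of this choice of constants (a sketch in Python, not a substitute for the proof):

```python
# Check that 2^(n+10) <= c * 2^n for c = 2^10 and all n >= n_0 = 1.
c = 2 ** 10

def holds(n):
    return 2 ** (n + 10) <= c * 2 ** n

# With c = 2^10 the two sides are exactly equal, so the bound holds.
assert all(holds(n) for n in range(1, 200))
```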
Big-omega is the analogous lower bound: $T(n) = \Omega(f(n))$ if and only if there exist constants $c, n_0 > 0$ such that $T(n) \ge c \cdot f(n)$ for all $n \ge n_0$. That is, we can multiply $f(n)$ by some constant $c$ such that, past some input size $n_0$, $c \cdot f(n)$ always lower-bounds $T(n)$.
Little-oh is stricter than big-oh, in that the bound must hold for all positive constants: $T(n) = o(f(n))$ if and only if for every constant $c > 0$ there exists an $n_0$ such that $T(n) \le c \cdot f(n)$ for all $n \ge n_0$.
The divide-and-conquer paradigm consists of: dividing the problem into smaller subproblems, conquering the subproblems recursively, and combining the subproblem solutions into a solution for the original problem.
The Master Method provides an easy way of computing the upper-bound runtime for a divide-and-conquer algorithm, so long as it satisfies the assumptions:
Assume that subproblems have equal size and the recurrence has the format:

$$T(n) \le a T\left(\frac{n}{b}\right) + O(n^d)$$

where $a \ge 1$ is the number of recursive calls, $b > 1$ is the factor by which the input size shrinks per subproblem, and $d \ge 0$ is the exponent in the running time of the combining step. Then:

$$T(n) = \begin{cases} O(n^d \log n) & \text{if } a = b^d \\ O(n^d) & \text{if } a < b^d \\ O(n^{\log_b a}) & \text{if } a > b^d \end{cases}$$
Note that the logarithm base does not matter in the first case, since it just changes the leading constant (which doesn't matter in Big-Oh notation), whereas in the last case the base does matter, because it has a more significant effect.
To use the master method, we just need to determine $a$, $b$, and $d$, then check which case applies.
With merge sort, there are two recursive calls (thus $a = 2$), each subproblem works on half of the input (thus $b = 2$), and the merge step does linear work outside of the recursive calls (thus $d = 1$).
Thus, working out the master method comparison, we get $a = b^d$ (since $2 = 2^1$), which is the first case, so merge sort runs in $O(n^d \log n) = O(n \log n)$.
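To make the parameters concrete, here is a minimal merge sort sketch (an illustrative implementation, not tuned for performance), annotated with where $a$, $b$, and $d$ come from:

```python
def merge_sort(arr):
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    # a = 2 recursive calls, each on an input of size n/b with b = 2
    left = merge_sort(arr[:mid])
    right = merge_sort(arr[mid:])
    # the merge step does O(n) work outside the recursion, so d = 1
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

assert merge_sort([5, 2, 4, 6, 1, 3]) == [1, 2, 3, 4, 5, 6]
```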
(Carrying over the previously-stated assumptions)
For simplicity, we'll also assume that $n$ is a power of $b$, so all subproblem sizes are integers.
In the recursion tree, at each level $j = 0, 1, \dots, \log_b n$ there are $a^j$ subproblems, each of size $n / b^j$.
At a level $j$, the total work done (not counting the work done in recursive calls) is at most $a^j \cdot c \left(\frac{n}{b^j}\right)^d = c n^d \left(\frac{a}{b^d}\right)^j$, where $c$ is the constant suppressed by the $O(n^d)$ term.
Note that the $c n^d$ factor is independent of the level $j$; only the $\left(\frac{a}{b^d}\right)^j$ factor varies with the level.
To get the total work, we can sum over all the levels:

$$\text{total work} \le c n^d \sum_{j=0}^{\log_b n} \left(\frac{a}{b^d}\right)^j$$
We can think of $\frac{a}{b^d}$ as a tug-of-war between $a$, the rate at which subproblems proliferate, and $b^d$, the rate at which the work per subproblem shrinks.
There are three possible scenarios, corresponding to the master method's three cases:
A geometric sum for $r \ne 1$,

$$1 + r + r^2 + \dots + r^k$$

can be expressed in the following closed form:

$$\frac{r^{k+1} - 1}{r - 1}$$
We can bring this back to the master method by setting $r = \frac{a}{b^d}$ and $k = \log_b n$. If $r = 1$ ($a = b^d$), the sum is just $\log_b n + 1$ equal terms, giving total work $O(n^d \log n)$. If $r < 1$ ($a < b^d$), the sum is bounded by a constant independent of $k$, giving $O(n^d)$. If $r > 1$ ($a > b^d$), the sum is dominated by its last term $r^k$, and $c n^d \left(\frac{a}{b^d}\right)^{\log_b n} = c \cdot a^{\log_b n} = c \cdot n^{\log_b a}$, giving $O(n^{\log_b a})$.
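The geometric-sum identity used above is easy to check numerically (a quick sketch, not part of the proof):

```python
# Check the closed form (r^(k+1) - 1) / (r - 1) against a direct sum, for r != 1.
def geometric_sum(r, k):
    return sum(r ** j for j in range(k + 1))

def closed_form(r, k):
    return (r ** (k + 1) - 1) / (r - 1)

for r in (0.5, 2, 3):
    for k in (1, 5, 10):
        assert abs(geometric_sum(r, k) - closed_form(r, k)) < 1e-9
```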
Data structures are particular ways of organizing data that support certain operations. Structures have strengths and weaknesses in what kinds of operations they can perform, and how well they can perform them.
For example: lists, stacks, queues, heaps, search trees, hashtables, bloom filters, etc.
Different data structures are appropriate for different operations (and thus different kinds of problems).
A heap (sometimes called a priority queue) is a container for objects that have keys; these keys are comparable (e.g. we can say that one key is bigger than another).
insert: add a new object to the heap, runtime $O(\log n)$
extract-min: remove the object with the minimum key value (ties broken arbitrarily), runtime $O(\log n)$
Alternatively, there are max-heaps which return the maximum key value (this can be emulated by a heap by negating key values such that the max becomes the min, etc).
Sometimes there are additional operations supported:
heapify: initialize a heap from $n$ elements in linear time (i.e. $O(n)$, rather than the $O(n \log n)$ of $n$ separate inserts)
delete: delete an arbitrary element from the middle of the heap in $O(\log n)$ time
For example, you can have a heap where events are your objects and their keys are a scheduled time to occur. Thus when you extract-min, you always get the next event scheduled to occur.
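Python's standard heapq module (a binary min-heap stored in a list) can sketch this event-scheduling use:

```python
import heapq

# Events keyed by scheduled time; the heap always surfaces the earliest one.
events = []
heapq.heappush(events, (12, "lunch"))    # insert: O(log n)
heapq.heappush(events, (9, "standup"))
heapq.heappush(events, (15, "review"))

assert heapq.heappop(events) == (9, "standup")  # extract-min: O(log n)
assert heapq.heappop(events) == (12, "lunch")
```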
A balanced binary search tree can be thought of as a dynamic sorted array (i.e. a sorted array which also supports fast insertions and deletions).
First, consider sorted arrays. They support the following operations:
search: binary search, runtime $O(\log n)$
select: select an element by index, runtime $O(1)$
min and max: return the first and last element of the array (respectively), runtime $O(1)$
predecessor and successor: return the next smallest and next largest element of the array (respectively), runtime $O(1)$
rank: the number of elements less than or equal to a given value, runtime $O(\log n)$ (a binary search)
output: output the elements in sorted order, runtime $O(n)$
With sorted arrays, insertions and deletions have $O(n)$ runtime, since in the worst case all of the elements need to be shifted.
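Several of these sorted-array operations can be sketched with Python's standard bisect module:

```python
import bisect

arr = [2, 3, 5, 7, 11]

# search: binary search in O(log n)
i = bisect.bisect_left(arr, 7)
assert i < len(arr) and arr[i] == 7

# rank: number of elements <= a value, also a binary search
assert bisect.bisect_right(arr, 6) == 3  # counts 2, 3, 5

# min / max: first and last element, O(1)
assert (arr[0], arr[-1]) == (2, 11)

# insert: finding the position is O(log n), but shifting elements is O(n)
bisect.insort(arr, 6)
assert arr == [2, 3, 5, 6, 7, 11]
```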
If we want faster (logarithmic-time) insertions and deletions, we can use a balanced binary search tree. This supports the same operations as sorted arrays (though some are slower) in addition to faster insertions and deletions:
search: $O(\log n)$
select: $O(\log n)$ (slower than sorted arrays)
min and max: $O(\log n)$ (slower)
predecessor and successor: $O(\log n)$ (slower)
rank: $O(\log n)$
output: $O(n)$
insert: $O(\log n)$ (faster)
delete: $O(\log n)$ (faster)
To understand how balanced binary search trees work, first consider (plain) binary search trees.
A binary search tree (BST) is a data structure for efficient searching.
Each key that is stored becomes a node of the tree.
Each node has three pointers: one to its parent, one to its left child, and one to its right child.
These pointers can be null (e.g. the root node has no parent).
The search tree property asserts that for every node, all the keys stored in its left subtree should be less than its key, and all the keys in its right subtree should be greater than its key.
You can also have a convention for handling equal keys (e.g. just put it on the left or the right subtree).
This search tree property is what makes it very easy to search for particular values.
Note that there are many possible binary search trees for a given set of keys. The same set of keys could be arranged as a very deep and narrow BST, or as a very shallow and wide one. The worst case is a depth of about $n$ (a chain of nodes), and the best case is a depth of about $\log_2 n$ (a perfectly balanced tree).
This search tree property also makes insertion simple. You search for the key to be inserted, which will fail since the key is not in the tree yet, and you get a null pointer - you just assign that pointer to point to the new key. In the case of duplicates, you insert it according to whatever convention you decided on (as mentioned previously). This insert method maintains the search tree property.
Search and insert performance are dependent on the depth of the tree, so at worst the runtime is $O(\text{height})$; for a very unbalanced tree (e.g. a chain), that is $O(n)$.
The min and max operations are simple: go down the leftmost branch for the min key and go down the rightmost branch for the max key.
The predecessor operation is a bit more complicated. First you search for the key in question. Then, if the key's node has a left subtree, just take the max of that subtree. However, if the key does not have a left subtree, move up the tree through its parent and ancestors until you find a node with a key less than the key in question. (The successor operation is accomplished in a similar way.)
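A minimal sketch of an (unbalanced) BST with search, insert, and min, just to illustrate the pointer-following logic (parent pointers are omitted for brevity):

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None   # subtree of keys smaller than self.key
        self.right = None  # subtree of keys greater than (or equal to) self.key

def insert(root, key):
    # Walk down as in search; attach the new node at the null pointer we hit.
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:  # convention: equal keys go to the right subtree
        root.right = insert(root.right, key)
    return root

def search(root, key):
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root

def tree_min(root):
    # The min key is at the end of the leftmost branch.
    while root.left is not None:
        root = root.left
    return root.key

root = None
for k in [5, 2, 8, 1, 3]:
    root = insert(root, k)

assert search(root, 3) is not None
assert search(root, 7) is None
assert tree_min(root) == 1
```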
The deletion operation is tricky. First we must search for the key we want to delete. Then there are three possibilities:
- the node has no children: just delete it
- the node has one child: "splice out" the node by connecting its child directly to its parent
- the node has two children: find the node's predecessor (the max of its left subtree), swap it with the node, then delete the node from its new position (where it has at most one child, reducing to one of the first two cases)
For the select and rank operations, we can augment our search tree by including additional information at each node: the size of its subtree, including itself (i.e. number of descendants + 1). Augmenting the data structure in this way does add some overhead, e.g. we have to maintain the data/keep it updated whenever we modify the tree.
For select, we want to find the $i$th smallest key (the $i$th order statistic). Starting at the root, let $j$ be the size of the left subtree: if $j = i - 1$, the root's key is the answer; if $j \ge i$, recursively select the $i$th smallest key in the left subtree; otherwise, recursively select the $(i - j - 1)$th smallest key in the right subtree.
Balanced binary search trees have the "best" depth of about $\log_2 n$, and they maintain this depth even as keys are inserted or deleted.
There are different kinds of balanced binary search trees (which are all quite similar); here we will talk about red-black trees (other kinds are AVL trees, splay trees, B-trees, etc.).
Red-black trees maintain some invariants/constraints which are what guarantee that the tree is balanced:
1. Each node is colored either red or black.
2. The root is black.
3. No two reds in a row: a red node cannot have a red parent or a red child.
4. Every root-null path (a path from the root down to a null pointer) has the same number of black nodes.
Consider the following: for a binary search tree, if every root-null path has at least $k$ nodes, then the tree includes at its top a perfectly balanced subtree of depth $k - 1$. The tree therefore has at least $2^k - 1$ nodes, so $n \ge 2^k - 1$, i.e. $k \le \log_2(n + 1)$.
In a red-black tree with $n$ nodes, every root-null path has at most $\log_2(n+1)$ black nodes: each path has at least as many total nodes as black nodes, so by the observation above the common number of black nodes $k$ satisfies $n \ge 2^k - 1$, i.e. $k \le \log_2(n+1)$.
The fourth constraint on red-black trees means that every root-null path has the same number of black nodes (at most $\log_2(n+1)$, as just shown), and the no-two-reds-in-a-row constraint means red nodes can at most double the length of a path. So every root-null path has length at most about $2\log_2(n+1)$, which guarantees that a red-black tree has depth $O(\log n)$.
Hash tables (also called dictionaries) allow us to maintain a (possibly evolving) set of stuff.
The core operations include:
insert (using a key)
delete (using a key)
lookup (using a key)
When implemented properly, and on non-pathological data, these operations all run in $O(1)$ time.
Hash tables do not maintain any ordering.
Basically, hash tables use some hash function to produce a hash for an object (some number); this hash is the "address" of the object in the hash table.
More specifically, we have a hash function which gives us a value in some range $\{0, 1, \dots, n - 1\}$, where $n$ is the number of buckets in the table; that value is the index of the bucket where the object is stored.
There is a chance of collisions in which two different objects produce the same hash.
There are two main solutions for resolving collisions: chaining, in which each bucket holds a list of all the objects that hash to it, and open addressing, in which each bucket holds at most one object and, upon a collision, we probe for another open bucket (e.g. via linear probing or double hashing).
Each is more appropriate in different situations.
The performance of a hash table depends a lot on the particular hash function. Ideally, we want a hash function that spreads the data out evenly across the buckets (even for non-uniform data) and is very cheap to evaluate (constant time).
Designing hash functions is as much an art as it is a science. They are quite difficult to design.
A hash table has a load factor (sometimes just called load), denoted $\alpha = \frac{\text{number of objects in the table}}{\text{number of buckets}}$.
For hash table operations to run in constant time, it is necessary that $\alpha = O(1)$.
So for good performance, we need to control load. For example, if $\alpha$ passes some threshold, we can double the number of buckets and rehash all of the elements into the larger table.
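A toy chaining hash table that doubles its bucket count once the load passes a threshold (the 0.75 threshold here is an illustrative choice, not a universal rule):

```python
class ChainedHashTable:
    def __init__(self, n_buckets=8, max_load=0.75):
        self.buckets = [[] for _ in range(n_buckets)]
        self.size = 0
        self.max_load = max_load

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for pair in bucket:
            if pair[0] == key:
                pair[1] = value
                return
        bucket.append([key, value])
        self.size += 1
        # Keep the load factor O(1) by doubling the bucket count as needed.
        if self.size / len(self.buckets) > self.max_load:
            self._rehash()

    def _rehash(self):
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]
        for bucket in old:
            for key, value in bucket:
                self._bucket(key).append([key, value])

    def lookup(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

t = ChainedHashTable()
for i in range(20):
    t.insert(f"key{i}", i)
assert t.lookup("key7") == 7
assert t.lookup("missing") is None
```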
Every hash function has a "pathological" data set which it performs poorly on. There is no hash function which is guaranteed to spread out every data set evenly (i.e. have low collisions on any arbitrary data set). You can often reverse engineer this pathological data set by analyzing the hash function.
However, for some hash functions, it is "infeasible" to figure out its pathological data set (as is the case with cryptographic hash functions).
One approach is to define a family of hash functions, rather than just one, and randomly choose a hash function to use at runtime. This has the property that, on average, you do well across all datasets.
Bloom filters are a variant on hash tables - they are more space efficient, but they allow for some errors (that is, there's a chance of a false positive). In some contexts, this is tolerable.
They are more space efficient because they do not actually store the objects themselves. They are more commonly used to keep track of what objects have been seen so far. They typically do not support deletions (there are variants that do incorporate deletions but they are more complicated). There is also a small chance that it will say that it's seen an object that it hasn't (i.e. false positives).
Like a hash table, a bloom filter consists of an array $A$; here, though, each entry is a single bit.
We also have $k$ hash functions $h_1, \dots, h_k$ (typically $k$ is a small constant).
The insert operation is defined:
```python
hash_funcs = [...]  # the k hash functions, each mapping an object to a bit index

def insert(x):
    for i in range(k):
        A[hash_funcs[i](x)] = 1
```
That is, we just set the values of those bits to 1.
Thus lookup just checks that all of the corresponding bits for an object are 1.
We can't have any false negatives because those bits will not have been set and bits are never reset back to zero.
False positives are possible, however, because some other objects may have in aggregate set the bits corresponding to another object.
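A small bloom filter sketch; deriving the $k$ bit positions from slices of a single SHA-256 digest is just one illustrative scheme:

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits=1024, k=3):
        self.bits = [0] * n_bits
        self.k = k

    def _positions(self, item):
        # Derive k bit indices from one digest (an illustrative choice).
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            chunk = digest[4 * i:4 * (i + 1)]
            yield int.from_bytes(chunk, "big") % len(self.bits)

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def lookup(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.insert("alice")
bf.insert("bob")
assert bf.lookup("alice")  # no false negatives are possible
assert bf.lookup("bob")
```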
Consider the following problem:
7 * 13 = ?
This is solved very quickly by a computer (it gives 91).
Now consider the following factoring problem:
? * ? = 91
This is a bit more difficult for a computer to solve, though it will yield the correct answers (7 and 13).
If we consider extremely large numbers, a computer can still very quickly compute their product.
But, given a product, it will take a computer a very, very long time to compute their factors.
In fact, modern cryptography is based on the fact that computers are not good at finding factors for a number (in particular, prime factors).
This is because computers basically have to use brute force search to identify a factor; with very large numbers, this search space is enormous (it grows exponentially).
However, once we find a possible solution, it is easy to check that we are correct (e.g. just multiply the factors and compare the product).
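This asymmetry is easy to sketch: finding a factor requires a search, while verifying a proposed factorization is a single multiplication:

```python
def factor(m):
    # Brute-force search for a nontrivial factor: slow for large m.
    for i in range(2, m):
        if m % i == 0:
            return i, m // i
    return None  # m is prime (or m < 4)

def verify(p, q, m):
    # Checking a proposed answer is just one multiplication: fast.
    return p * q == m

p, q = factor(91)
assert verify(p, q, 91)
assert (p, q) == (7, 13)
```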
There are many problems which are characterized in this way - they require brute force search to identify the exact answer (there are often faster ways of getting approximate answers), but once an answer is found, it can be easily checked if it is correct.
There are other problems, such as multiplication, where we can easily "jump" directly to the correct exact answer.
For the problems that require search, it is not known whether or not there is also a method that can "jump" to the correct answer.
Consider the "needle in the haystack" analogy. We could go through each piece of hay until we find the needle (brute force search). Or we could use a magnet to pull it out immediately. The question is open: does this magnet exist for problems like factorization?
Problems which we can quickly solve, like multiplication, are in a family of problems called "P", which stands for "polynomial time" (referring to the relationship between the number of inputs and how computation time increases).
Problems which can be quickly verified to be correct are in a family of problems called "NP", which stands for "nondeterministic polynomial time".
P is a subset of NP, since their answers are quickly verified, but NP also includes the aforementioned search problems. Thus, a major question is whether or not P = NP, i.e. we have the "magnet" for P problems, but is there also one for the rest of the NP problems? Or is searching the only option?
Problems which are NP-hard are at least as hard as NP problems, so this includes problems which may not even be in NP.
There are some NP problems which all NP problems can be reduced to. Such an NP problem is called NP-complete.
For example, any NP problem can be reduced to a clique problem (e.g. finding a clique of some arbitrary size in a graph); thus the clique problem is NP-complete. Any other NP problem can be reduced to a clique problem, and if we can find a way of solving the clique problem quickly, we can also solve all of those related problems quickly as well.
To show that a problem is NP-complete, you must: show that it is in NP (i.e. that a proposed solution can be verified in polynomial time), and show that a known NP-complete problem can be reduced to it in polynomial time.