In Practice

Machine Learning System Design

Before you start building your machine learning system, you should:

Then to start:

If you have an idea for a feature which may help, it's best to just test it out. This process is much easier if you have a single metric for your model's performance.

Machine learning diagnostics

In machine learning, a diagnostic is:

A test that you can run to gain insight [about] what is/isn't working with a learning algorithm, and gain guidance as to how best to improve its performance.

They take time to implement but can save you a lot of time by preventing you from going down fruitless paths.

Learning curves

To generate a learning curve, you deliberately shrink the size of your training set and see how the training and validation errors change as you increase the training set size. This way you can see how your model improves (or doesn't, if something unexpected is happening) with more training data.

With smaller training sets, we expect the training error will be low because it will be easier to fit to less data. So as training set size grows, the average training set error is expected to grow.
Conversely, we expect the average validation error to decrease as the training set size increases.

If it seems like the training and validation error curves are flattening out at a high error as training set size increases, then you have a high bias problem. The curves flattening out indicates that getting more training data will not (by itself) help much.

On the other hand, high variance problems are indicated by a large gap between the training and validation error curves as training set size increases. You would also see a low training error. In this case, the curves are converging and more training data would help.

Important training figures

Training figures (from Xiu-Shen Wei)
Training figures (from Xiu-Shen Wei)

Large Scale Machine Learning

Map Reduce

You can distribute the workload across computers to reduce training time.

For example, say you're running batch gradient descent with $b=400$.

$$ \theta_j := \theta_j - \alpha \frac{1}{400} \sum^400_{i=1} (h_{\theta}(x^{(i)}) - y^{(i)})x^{(i)}_j $$

You can divide up (map) your batch so that different machines calculate the error of a subset (e.g. with 4 machines, each machine takes 100 examples) and then those results are combined (reduced/summed) back on a single machine. So the summation term becomes distributed.

Map Reduce can be applied wherever your learning algorithm can be expressed as a summation over your training set.

Map Reduce also works across multiple cores on a single computer.

Online (live/streaming) machine learning

Distribution Drift

Say you train a model on some historical data and then deploy your model in a production setting where it is working with live data.

It is possible that the distribution of the live data starts to drift from the distribution your model learned. This change may be due to factors in the real world that influence the data.

Ideally you will be able to detect this drift as it happens, so you know whether or not your model needs adjusting. A simple way to do it is to continually evaluate the model by computing some validation metric on the live data. If the distribution is stable, then this validation metric should remain stable; if the distribution drifts, the model starts to become a poor fit for the new incoming data, and the validation metric will worsen.