Before you start building your machine learning system, you should:
Be explicit about the problem.
Start with a very specific and well-defined question: what do you want to predict, and what do you have to predict it with?
Brainstorm some possible strategies.
What features might be useful?
Do you need to collect more data?
Try and find good input data
Randomly split data into:
training sample
testing sample
if enough data, a validation sample too
Use features of or features built from the data that may help with prediction
Then to start:
Start with a simple algorithm which can be implemented quickly.
Apply a machine learning algorithm
Estimate the parameters for the algorithm on your training data
Test the simple algorithm on your validation data, evaluate the results
Plot learning curves to decide where things need work:
Do you need more data?
Do you need more features?
And so on.
Error analysis: manually examine the examples in the validation set that your algorithm made errors on. Try to identify patterns in these errors. Are there categories of examples that the model is failing on in particular? Are there any other features that might help?
If you have an idea for a feature which may help, it's best to just test it out. This process is much easier if you have a single metric for your model's performance.
Machine learning diagnostics
In machine learning, a diagnostic is:
A test that you can run to gain insight [about] what is/isn't working with a learning algorithm, and gain guidance as to how best to improve its performance.
They take time to implement but can save you a lot of time by preventing you from going down fruitless paths.
Learning curves
To generate a learning curve, you deliberately shrink the size of your training set and see how the training and validation errors change as you increase the training set size. This way you can see how your model improves (or doesn't, if something unexpected is happening) with more training data.
With smaller training sets, we expect the training error will be low because it will be easier to fit to less data. So as training set size grows, the average training set error is expected to grow.
Conversely, we expect the average validation error to decrease as the training set size increases.
If it seems like the training and validation error curves are flattening out at a high error as training set size increases, then you have a high bias problem. The curves flattening out indicates that getting more training data will not (by itself) help much.
On the other hand, high variance problems are indicated by a large gap between the training and validation error curves as training set size increases. You would also see a low training error. In this case, the curves are converging and more training data would help.
Important training figures
Large Scale Machine Learning
Map Reduce
You can distribute the workload across computers to reduce training time.
For example, say you're running batch gradient descent with $b=400$.
You can divide up (map) your batch so that different machines calculate the error of a subset (e.g. with 4 machines, each machine takes 100 examples) and then those results are combined (reduced/summed) back on a single machine. So the summation term becomes distributed.
Map Reduce can be applied wherever your learning algorithm can be expressed as a summation over your training set.
Map Reduce also works across multiple cores on a single computer.
Online (live/streaming) machine learning
Distribution Drift
Say you train a model on some historical data and then deploy your model in a production setting where it is working with live data.
It is possible that the distribution of the live data starts to drift from the distribution your model learned. This change may be due to factors in the real world that influence the data.
Ideally you will be able to detect this drift as it happens, so you know whether or not your model needs adjusting. A simple way to do it is to continually evaluate the model by computing some validation metric on the live data. If the distribution is stable, then this validation metric should remain stable; if the distribution drifts, the model starts to become a poor fit for the new incoming data, and the validation metric will worsen.
References
Review of fundamentals, IFT725. Hugo Larochelle. 2012.