Before you start building your machine learning system, you should:

- Be explicit about the problem.
    - Start with a very specific and well-defined question: what do you want to predict, and what do you have to predict it with?
- Brainstorm some possible strategies.
    - What features might be useful?
    - Do you need to collect more data?
- Try to find good input data.
    - Randomly split the data into:
        - a training sample
        - a testing sample
        - a validation sample too, if you have enough data
- Use features of, or features built from, the data that may help with prediction.
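The random split described above can be sketched in plain Python (the split fractions and fixed seed are illustrative choices, not prescriptions):

```python
import random

def split_data(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Randomly split examples into training, validation, and test samples."""
    rng = random.Random(seed)           # fixed seed so the split is reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = split_data(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Shuffling before slicing matters: if the data has any ordering (by time, by class), slicing without shuffling would give unrepresentative samples.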

Then to start:

- Start with a simple algorithm that can be implemented quickly.
    - Apply a machine learning algorithm.
    - Estimate the parameters for the algorithm on your training data.
- Test the simple algorithm on your validation data and evaluate the results.
- Plot *learning curves* to decide where things need work:
    - Do you need more data?
    - Do you need more features?
    - And so on.

- Error analysis: manually examine the examples in the validation set that your algorithm made errors on. Try to identify patterns in these errors. Are there categories of examples that the model is failing on in particular? Are there any other features that might help?
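Tallying validation mistakes by some grouping often surfaces these patterns quickly; a minimal sketch (the records and the `category` field here are invented for illustration):

```python
from collections import Counter

# Hypothetical validation records: (metadata, true_label, predicted_label).
val_records = [
    ({"category": "short_text"}, "spam", "ham"),
    ({"category": "short_text"}, "spam", "spam"),
    ({"category": "has_link"},   "spam", "ham"),
    ({"category": "short_text"}, "ham",  "spam"),
    ({"category": "plain"},      "ham",  "ham"),
]

# Keep only the examples the model got wrong, then count them per category.
errors = [r for r in val_records if r[1] != r[2]]
by_category = Counter(r[0]["category"] for r in errors)
print(by_category.most_common())  # categories the model fails on most
```

A skewed count (e.g. most errors in one category) suggests where a new feature or more data might help.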

If you have an idea for a feature which may help, it's best to just test it out. This process is much easier if you have a single metric for your model's performance.
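With a single metric, testing a candidate feature reduces to comparing two numbers; a toy sketch using accuracy (labels and predictions are made up):

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

labels          = ["spam", "ham", "spam", "ham", "spam"]
baseline_preds  = ["spam", "spam", "spam", "ham", "ham"]  # model without the new feature
candidate_preds = ["spam", "ham",  "spam", "ham", "ham"]  # model with the new feature

base = accuracy(baseline_preds, labels)   # 0.6
cand = accuracy(candidate_preds, labels)  # 0.8
print("keep feature" if cand > base else "drop feature")
```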

In machine learning, a diagnostic is:

> A test that you can run to gain insight [about] what is/isn't working with a learning algorithm, and gain guidance as to how best to improve its performance.

Diagnostics take time to implement, but they can save you a lot of time by preventing you from going down fruitless paths.

To generate a learning curve, you deliberately shrink the size of your training set and see how the training and validation errors change as you increase the training set size. This way you can see how your model improves (or doesn't, if something unexpected is happening) with more training data.

With smaller training sets, we expect the training error will be low because it will be easier to fit to less data. So as training set size grows, the average training set error is expected to grow.

Conversely, we expect the average validation error to decrease as the training set size increases.
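The curve-generation procedure can be sketched with the simplest possible regressor, a constant equal to the training mean (the synthetic data and training-set sizes are illustrative; in practice you would substitute your real model and metric):

```python
import random

def mse(constant, data):
    """Mean squared error of predicting a single constant value."""
    return sum((y - constant) ** 2 for y in data) / len(data)

rng = random.Random(0)
data = [rng.gauss(5.0, 2.0) for _ in range(200)]
train_full, val = data[:150], data[150:]

for m in (5, 25, 50, 100, 150):            # deliberately shrunken training sets
    subset = train_full[:m]
    model = sum(subset) / len(subset)      # "fit": the training mean
    print(m, round(mse(model, subset), 4), round(mse(model, val), 4))
```

Plotting the two error columns against `m` gives the learning curve: training error tends to rise toward the noise level as fitting gets harder, while validation error tends to fall as the estimate improves.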

If it seems like the training and validation error curves are flattening out at a high error as training set size increases, then you have a high bias problem. The curves flattening out indicates that getting more training data will not (by itself) help much.

On the other hand, a high variance problem is indicated by a large gap between the training and validation error curves as the training set size increases, along with a low training error. In this case the two curves have not yet converged, and more training data is likely to help: the validation error should keep falling toward the training error.

For large datasets, you can distribute the training workload across multiple computers to reduce training time.

For example, say you're running batch gradient descent with 400 training examples.

You can divide up (map) your batch so that different machines calculate the error of a subset (e.g. with 4 machines, each machine takes 100 examples) and then those results are combined (reduced/summed) back on a single machine. So the summation term becomes distributed.

Map Reduce can be applied wherever your learning algorithm can be expressed as a summation over your training set.

Map Reduce also works across multiple cores on a single computer.
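A single-machine simulation of that map/reduce split, for one-variable linear regression (chunking stands in for shipping work to separate machines; the dataset and learning rate are illustrative):

```python
def partial_gradient(chunk, theta):
    """Map step: one 'machine' computes the gradient sum over its chunk
    for the model h(x) = theta * x with squared error."""
    return sum((theta * x - y) * x for x, y in chunk)

def gradient_step(data, theta, alpha, n_machines=4):
    size = len(data) // n_machines
    chunks = [data[i * size:(i + 1) * size] for i in range(n_machines)]
    # Reduce step: sum the partial sums back on a single machine.
    total = sum(partial_gradient(c, theta) for c in chunks)
    return theta - alpha * total / len(data)

# 400 examples drawn from y = 2x, so each of 4 machines handles 100.
data = [(x, 2.0 * x) for x in range(1, 401)]
theta = 0.0
for _ in range(50):
    theta = gradient_step(data, theta, alpha=1e-5)
print(round(theta, 4))  # converges to the true slope, 2.0
```

Because the gradient is a sum over examples, the partial sums can be computed anywhere, and only one number per machine travels over the network.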

Say you train a model on some historical data and then deploy your model in a production setting where it is working with live data.

It is possible that the distribution of the live data starts to *drift* from the distribution your model learned. This change may be due to factors in the real world that influence the data.

Ideally you will be able to detect this drift as it happens, so you know whether or not your model needs adjusting. A simple way to do it is to continually evaluate the model by computing some validation metric on the live data. If the distribution is stable, then this validation metric should remain stable; if the distribution drifts, the model starts to become a poor fit for the new incoming data, and the validation metric will worsen.
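A minimal sketch of that monitoring idea: keep a rolling window of per-example errors on live data, and flag drift when the rolling average climbs past the deploy-time baseline by some tolerance (the class name, window size, and threshold are all assumptions, not a standard API):

```python
from collections import deque

class DriftMonitor:
    def __init__(self, baseline_error, window=100, tolerance=0.5):
        self.baseline = baseline_error      # validation error measured at deploy time
        self.tolerance = tolerance          # allowed relative worsening before alerting
        self.errors = deque(maxlen=window)  # rolling window of live per-example errors

    def observe(self, prediction, actual):
        """Record one live example; return True if drift is suspected."""
        self.errors.append(abs(prediction - actual))
        rolling = sum(self.errors) / len(self.errors)
        return rolling > self.baseline * (1 + self.tolerance)

monitor = DriftMonitor(baseline_error=1.0, window=10)
stable = [monitor.observe(5.0, 5.5) for _ in range(10)]   # errors well under baseline
drifted = [monitor.observe(5.0, 9.0) for _ in range(10)]  # the world has shifted
print(any(stable), drifted[-1])  # False True
```

Once the flag fires, the usual response is to retrain or recalibrate the model on recent data.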

- Review of fundamentals, IFT725. Hugo Larochelle. 2012.
- Exploratory Data Analysis Course Notes. Xing Su.
- Mining Massive Datasets (Coursera & Stanford, 2014). Jure Leskovec, Anand Rajaraman, Jeff Ullman.
- Machine Learning. 2014. Andrew Ng. Stanford University/Coursera.
- CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley (edX).
- *Evaluating Machine Learning Models*. Alice Zheng. 2015.
- Computational Statistics II (code). Chris Fonnesbeck. SciPy 2015.
- Intro to Artificial Intelligence. CS271. Peter Norvig, Sebastian Thrun. Udacity.
- MIT 6.034 (Fall 2010): Artificial Intelligence. Patrick H. Winston. MIT.
- Deep Learning. Yoshua Bengio, Ian Goodfellow, Aaron Courville.
- CS231n Convolutional Neural Networks for Visual Recognition, Module 1: Neural Networks Part 2: Setting up the Data and the Loss. Andrej Karpathy.
- POLS 509: Hierarchical Linear Models. Justin Esarey.
- Bayesian Inference with Tears. Kevin Knight, September 2009.
- Learning to learn, or the advent of augmented data scientists. Simon Benhamou.
- Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo Larochelle, Ryan P. Adams.
- What is the expectation maximization algorithm?. Chuong B Do & Serafim Batzoglou.
- Gibbs Sampling for the Uninitiated. Philip Resnik, Eric Hardisty. June 2010.
- Maximum Likelihood Estimation. Penn State Eberly College of Science.
- Data Science Specialization. Johns Hopkins (Coursera). 2015.
- Practical Machine Learning. Johns Hopkins (Coursera). 2015.
- Elements of Statistical Learning. 10th Edition. Trevor Hastie, Robert Tibshirani, Jerome Friedman.
- CS231n Convolutional Neural Networks for Visual Recognition, Linear Classification. Andrej Karpathy.
- Must Know Tips/Tricks in Deep Neural Networks. Xiu-Shen Wei.