*Model selection* is the process of choosing between different machine learning approaches - e.g. SVM, logistic regression, etc - or choosing between different hyperparameters or sets of features for the same machine learning approach - e.g. deciding between the polynomial degrees/complexities for linear regression.

The choice of the actual machine learning algorithm (e.g. SVM or logistic regression) is less important than you'd think - there may be a "best" algorithm for a particular problem, but often its performance is not much better than other well-performing approaches for that problem.

There may be certain qualities you look for in a model:

- Interpretable - can we see or understand why the model is making the decisions it makes?
- Simple - easy to explain and understand
- Accurate
- Fast (to train and test)
- Scalable (it can be applied to a large dataset)

Though there are generally trade-offs amongst these qualities.

In order to select amongst models, we need some way of evaluating their performance.

You can't evaluate a model's hypothesis function using its error on the training data alone, because minimizing training error can lead to overfitting.

A good approach is to take your data and split it randomly into a training set and a test set (e.g. a 70%/30% split). Then you train your model on the training set and see how it performs on the test set.

For linear regression, you might do things this way:

- Learn parameters $\theta$ from the training data by minimizing the training error $J(\theta)$.
- Compute the test set error using the squared error, where $m_{\text{test}}$ is the test set size:

$$ J_{\text{test}}(\theta) = \frac{1}{2m_{\text{test}}} \sum^{m_{\text{test}}}_{i=1} (h_{\theta}(x^{(i)}_{\text{test}}) - y^{(i)}_{\text{test}})^2 $$
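As a sanity check, $J_{\text{test}}$ can be computed directly; a minimal pure-Python sketch (each feature vector includes a leading 1 for the bias term; the function name is mine):

```python
def linreg_test_error(theta, X_test, y_test):
    """Squared-error test cost J_test for the linear hypothesis h(x) = theta . x."""
    m = len(y_test)
    total = 0.0
    for x, y in zip(X_test, y_test):
        h = sum(t_j * x_j for t_j, x_j in zip(theta, x))  # h_theta(x)
        total += (h - y) ** 2
    return total / (2 * m)

# a hypothesis that fits the test data exactly has zero test error
theta = [0.0, 2.0]                    # h(x) = 2x
X_test = [[1, 1], [1, 2], [1, 3]]
y_test = [2, 4, 6]
print(linreg_test_error(theta, X_test, y_test))  # 0.0
```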

For logistic regression, you might do things this way:

- Learn parameters $\theta$ from the training data by minimizing the training error $J(\theta)$.
- Compute the test set error, where $m_{\text{test}}$ is the test set size:

$$ J_{\text{test}}(\theta) = -\frac{1}{m_{\text{test}}} \sum^{m_{\text{test}}}_{i=1} \left[ y^{(i)}_{\text{test}} \log h_{\theta} (x^{(i)}_{\text{test}}) + (1 - y^{(i)}_{\text{test}}) \log (1 - h_{\theta} (x^{(i)}_{\text{test}})) \right] $$

- Alternatively, you can use the misclassification error ("0/1 misclassification error", read "zero-one"), which is just the fraction of examples that your hypothesis has mislabeled:

$$ \begin{aligned} err(h_{\theta}(x), y) &= \begin{cases} 1, & \text{if $h_{\theta}(x) \geq 0.5, y=0$ or if $h_{\theta}(x) < 0.5, y=1$} \\[2ex] 0, & \text{otherwise} \end{cases} \\ \text{test error} &= \frac{1}{m_{\text{test}}} \sum^{m_{\text{test}}}_{i=1} err(h_{\theta}(x^{(i)}_{\text{test}}), y^{(i)}_{\text{test}}) \end{aligned} $$

A better way of splitting the data is to not split it only into training and testing sets, but to also include a *validation* set. A typical ratio is 60% training, 20% validation, 20% testing.

So instead of just measuring the test error, you would also measure the validation error.

Validation is used mainly to tune hyperparameters - you don't want to tune them on the training set because that can result in overfitting, nor do you want to tune them on your test set because that results in an overly optimistic estimation of generalization. Thus we keep a separate set of data for the purpose of validation, that is, for tuning the hyperparameters - the **validation set**.
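For illustration, here is a minimal pure-Python three-way split (the function name and fixed seed are my own choices, not from any particular library):

```python
import random

def train_valid_test_split(data, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle the data and split it into training/validation/test sets."""
    data = list(data)
    random.Random(seed).shuffle(data)  # fixed seed for reproducibility
    n = len(data)
    n_train = int(ratios[0] * n)
    n_valid = int(ratios[1] * n)
    return (data[:n_train],
            data[n_train:n_train + n_valid],
            data[n_train + n_valid:])

train, valid, test = train_valid_test_split(range(100))
print(len(train), len(valid), len(test))  # 60 20 20
```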

You can use these errors to identify what kind of problem you have if your model isn't performing well:

- If your training error is large and your validation/test set error is large, then you have a high bias (underfitting) problem.
- If your training error is small and your validation/test set error is large, then you have a high variance (overfitting) problem.

Because the test set is used to estimate the generalization error, it should not be used for "training" in any sense - this includes tuning hyperparameters. You should not evaluate on the test set and then go back and tweak things - this will give an overly optimistic estimation of generalization error.

Some ways of evaluating a model's performance on (some of) your known data are:

- hold out: just set aside some portion of the data for validation; this is less reliable if the amount of data is small, such that the held-out portion is very small
- k-fold cross-validation (better than hold out for small datasets):
    - the training set is divided into $k$ folds
    - iteratively take $k-1$ folds for training and validate on the remaining fold
    - average the results
    - there is also "leave-one-out" cross-validation, which is k-fold cross-validation where $k=n$ ($n$ is the number of datapoints)
- bootstrapping:
    - new datasets are generated by sampling with replacement (uniformly at random) from the original dataset
    - then train on the bootstrapped dataset and validate on the unselected data
- jackknife resampling:
    - essentially equivalent to leave-one-out cross-validation, since leave-one-out is basically sampling without replacement
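The k-fold procedure above can be sketched in a few lines of Python (index-based, with round-robin fold assignment; the helper name is mine):

```python
def k_fold_splits(n, k):
    """Yield (train_indices, valid_indices) pairs for k-fold cross-validation."""
    folds = [list(range(n))[i::k] for i in range(k)]  # round-robin fold assignment
    for i in range(k):
        valid = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid

# 10 datapoints, 5 folds: each fold of 2 points is held out exactly once
for train, valid in k_fold_splits(10, 5):
    print(valid)
```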

*Validation* refers to the phase where you are tuning your model and its hyperparameters. Once you do that, you want to *test* this model on a new set of data it has not seen yet (i.e. data which has not been used in cross-validation or bootstrapping or whatever method you used). This is to simulate the model's performance on completely new data and see how it does, which is the most important quality of a model.

The main techniques for evaluating regression models are:

- mean absolute error
- median absolute error
- (root) mean squared error
- coefficient of determination ($R^2$)

A **residual** is the difference between an observed value and the value predicted by the model.

This can also be thought of as the vertical distance between an observed data point and the regression line.

Fitting a line by **least squares** minimizes the **mean squared error** (MSE) between the line and the data. But some error always remains after fitting; this remaining error is the residual.

Alternatively, the **mean absolute error** or **median absolute error** can be used instead of the mean squared error.

We can measure the quality of a linear model, which is called **goodness of fit**. One approach is to look at the variation of the residuals. You can also use the coefficient of determination ($R^2$), described below.

Residual variation measures how well a regression line fits the data points.

The average squared residual (the estimated residual variance) is the same as the mean squared error, i.e.

$$ \hat{\sigma}^2 = \frac{1}{n} \sum^n_{i=1} e_i^2 $$

where $e_i$ is the $i$th residual. However, to make this estimator unbiased, you're more likely to see:

$$ \hat{\sigma}^2 = \frac{1}{n-2} \sum^n_{i=1} e_i^2 $$

That is, with the degrees of freedom taken into account (here two, for the intercept and slope, which both have to be estimated).

The square root of this estimated variance, $\hat{\sigma}$, is the **residual standard error**.

The **total variation** is equal to the residual variation (variation after removing the predictor) plus the systematic/regression variation (the variation explained by the regression model):

$$ \sum^n_{i=1} (y_i - \bar{y})^2 = \sum^n_{i=1} (y_i - \hat{y}_i)^2 + \sum^n_{i=1} (\hat{y}_i - \bar{y})^2 $$

For a line $y = mx + b$ and a datapoint $(x^{(i)}, y^{(i)})$, the error of the line at that point is $y^{(i)} - (mx^{(i)} + b)$.

Intuitively, this is the vertical difference between the point on the line at $x^{(i)}$ and the actual point $(x^{(i)}, y^{(i)})$.

The **squared error of the line** is the sum of the squares of all of these errors:

$$ SE_{\text{line}} = \sum^n_{i=1} (y^{(i)} - (mx^{(i)} + b))^2 $$

To get a best fit line, you want to minimize this squared error. That is, you want to find the $m$ and $b$ which minimize $SE_{\text{line}}$:

$$ m = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - (\bar{x})^2}, \qquad b = \bar{y} - m\bar{x} $$

Note that you can alternatively calculate the regression line slope as $m = r \frac{s_y}{s_x}$, where $r$ is the correlation between $x$ and $y$ and $s_x, s_y$ are their standard deviations.

The line that these values yield is the **regression line**.

We can calculate the total variation in $y$ as the squared error from the mean:

$$ SE_{\bar{y}} = \sum^n_{i=1} (y^{(i)} - \bar{y})^2 $$

And then we can calculate the fraction of the total variation in $y$ that is explained by the regression line:

$$ R^2 = 1 - \frac{SE_{\text{line}}}{SE_{\bar{y}}} $$

This is known as the **coefficient of determination** or **R-squared**.

The closer R-squared is to 1, the better a fit the line is.
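The slope, intercept, and R-squared can be computed directly from the mean-based formulas; a small sketch (helper names are mine):

```python
def regression_line(xs, ys):
    """Least-squares slope m and intercept b via the mean formulas."""
    mean = lambda vs: sum(vs) / len(vs)
    x_bar, y_bar = mean(xs), mean(ys)
    xy_bar = mean([x * y for x, y in zip(xs, ys)])
    x2_bar = mean([x * x for x in xs])
    m = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)
    b = y_bar - m * x_bar
    return m, b

def r_squared(xs, ys):
    """Coefficient of determination: 1 - SE_line / SE_mean."""
    m, b = regression_line(xs, ys)
    y_bar = sum(ys) / len(ys)
    se_line = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    se_mean = sum((y - y_bar) ** 2 for y in ys)
    return 1 - se_line / se_mean

xs, ys = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]
m, b = regression_line(xs, ys)
print(round(m, 2), round(b, 2), round(r_squared(xs, ys), 4))
```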

Important quantities:

- Sensitivity: $\frac{TP}{TP+FN}$
- Specificity: $\frac{TN}{TN+FP}$
- Positive predictive value: $\frac{TP}{TP+FP}$
- Negative predictive value: $\frac{TN}{TN+FN}$
- Accuracy: $\frac{TP+TN}{TP+FP+TN+FN}$

This method is for binary classification and multilabel classification. In binary classification you may choose some cutoff above which you assign a sample to one class, and below which you assign a sample to the other class.

Depending on your cutoff, you will get different results - there is a trade off between the true and false positive rates.

You can plot a Receiver Operating Characteristic (ROC) curve, which shows the classifier's performance across *all* cutoff thresholds, whereas other metrics (such as the F-score and so on) only tell you the performance for one particular cutoff. By looking at all thresholds at once, you get a more complete and honest picture of how your classifier is performing, in particular, how well it is separating the classes. It is also insensitive to class imbalance - that is, to there being far more or far fewer examples of the positive class than of the negative class (other metrics may be deceptively favorable or punishing in such unbalanced circumstances).

The *area under the curve* (AUC) is used to quantify how good the classification algorithm is. In general, an AUC of above 0.8 is considered "good". An AUC of 0.5 (a straight line) is equivalent to random guessing.

So ROC curves (and the associated AUC metric) are very useful for evaluating binary classification.

Note that ROC curves can be extended to classification of three or more classes by using the one-vs-all approach (see section on classification).


AUC is a metric for binary classification and is especially useful when dealing with *high-bias* data, that is, where one class is much more common than the other. Using accuracy as a metric falls apart in high-bias datasets: for example, say you have 100 training examples, one of which is positive, the rest of which are negative. You could develop a model which just labels everything negative, and it would have 99% accuracy. So accuracy doesn't really tell you enough here.

Many binary classifiers output some continuous value (0-1) rather than class labels; there is some threshold (usually 0.5) above which one label is assigned, and below which the other label is assigned. Some models may work best with a different threshold. Changing this threshold leads to a trade-off between true positives and false positives - for example, decreasing the threshold will yield more true positives, but also more false positives.

Sweeping over all thresholds and plotting the true positive rate against the false positive rate yields the *receiver operating characteristic* curve, or *ROC* curve; AUC is the area under it. A random classifier would give you equal false and true positive rates, which leads to an AUC of 0.5; the curve in this case would be a straight line. The better the classifier is, the more area under the curve there is (so the AUC approaches 1).
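A minimal sketch of building the ROC curve and computing AUC by sweeping thresholds (trapezoidal integration; names are illustrative, not any library's API):

```python
def roc_auc(scores, labels):
    """Build ROC points over all thresholds and compute AUC by trapezoids."""
    thresholds = sorted(set(scores), reverse=True)  # sweep from strict to lenient
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # (FPR, TPR), starting with everything predicted negative
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    auc = sum((x2 - x1) * (y1 + y2) / 2
              for (x1, y1), (x2, y2) in zip(points, points[1:]))
    return points, auc

# a classifier that ranks all positives above all negatives has AUC = 1
_, auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
print(auc)  # 1.0
```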

This method is suitable for binary or multiclass classification.

For classification, evaluation often comes in the form of a **confusion matrix**.

The core values are:

- **True positives** (TP): samples classified as positive which were labeled positive
- **True negatives** (TN): samples classified as negative which were labeled negative
- **False positives** (FP): samples classified as positive which were labeled negative
- **False negatives** (FN): samples classified as negative which were labeled positive

A few other metrics are computed from these values:

- **Accuracy**: How often is the classifier correct? ($\frac{\text{TP} + \text{TN}}{\text{total}}$)
- **Misclassification rate** (or "**error rate**"): How often is the classifier wrong? ($\frac{\text{FP} + \text{FN}}{\text{total}} = 1 - \text{accuracy}$)
- **Recall** (or "**sensitivity**" or "**true positive rate**"): How often are positive-labeled samples predicted as positive? ($\frac{\text{TP}}{\text{num positive-labeled examples}}$)
- **False positive rate**: How often are negative-labeled samples predicted as positive? ($\frac{\text{FP}}{\text{num negative-labeled examples}}$)
- **Specificity** (or "**true negative rate**"): How often are negative-labeled samples predicted as negative? ($\frac{\text{TN}}{\text{num negative-labeled examples}}$)
- **Precision**: How many of the predicted-positive samples are correctly predicted? ($\frac{\text{TP}}{\text{TP} + \text{FP}}$)
- **Prevalence**: How many positive-labeled samples are there in the data? ($\frac{\text{num positive-labeled examples}}{\text{num examples}}$)

Some other values:

- **Positive predictive value** (PPV): precision, but taking prevalence into account. With a perfectly balanced dataset (i.e. equal positive and negative examples, that is, prevalence is 0.5), the PPV equals the precision.
- **Null error rate**: how often you would be wrong if you always predicted the majority class. This is a good starting baseline metric to compare your classifier against.
- **F-score**: the weighted average of recall and precision
- **Cohen's Kappa**: a measure of how well the classifier performs compared against how it would have performed by guessing randomly; a high Kappa score happens when there is a big difference between the accuracy and the null error rate.
- **ROC Curve**: (see the section on this)
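Most of these metrics fall out of the four confusion-matrix counts; a small sketch for binary labels (the helper name and dictionary keys are mine):

```python
def confusion_metrics(predicted, actual):
    """Derive the core confusion-matrix counts and metrics for binary labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "prevalence": (tp + fn) / total,
    }

m = confusion_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(m["accuracy"], m["precision"], m["recall"])
```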

This method is suitable for binary, multiclass, and multilabel classification.

**Log-loss** is an evaluation metric that can be used when the classifier outputs not a class but a probability, as is the case with logistic regression. It penalizes the classifier based on how far off its probability is; e.g. if it predicts class 1 with probability 0.51 but the correct class is 0, it is less "wrong" than if it had predicted class 1 with probability 0.95.

For a binary classifier, log-loss is computed:

$$ \text{log-loss} = -\frac{1}{N} \sum^N_{i=1} y_i \log p_i + (1 - y_i) \log (1 - p_i) $$

where $y_i$ is the true label of the $i$th example and $p_i$ is the predicted probability that it belongs to class 1.

Log-loss is the cross-entropy between the distribution of the true labels and the predictions. It is related to relative entropy (that is, Kullback-Leibler divergence).

Intuitively, the way this works is that the $y_i$ and $(1 - y_i)$ terms act as switches: for each example, only one of the two log terms is active, since $y_i$ is either 0 or 1. Because $\log p \to -\infty$ as $p \to 0$, confidently wrong predictions are penalized much more heavily than marginally wrong ones.
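A minimal log-loss implementation following the formula, with probability clipping to avoid $\log 0$ (the clipping epsilon is a common practical choice, not part of the definition):

```python
import math

def log_loss(probs, labels, eps=1e-15):
    """Binary cross-entropy between true labels and predicted probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # clip so log() never sees 0 or 1
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)

# a barely-wrong prediction costs far less than a confidently wrong one
print(log_loss([0.51], [0]))  # ~0.713
print(log_loss([0.95], [0]))  # ~2.996
```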

The **F1 score**, also called the **balanced F-score** or **F-measure**, is the harmonic mean of precision and recall:

$$ F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} $$

The best score is 1 and the worst is 0.

It can be used for binary, multiclass, and multilabel classification (for the latter two, use the weighted average of the F1 score for each class).

When it comes to **skewed classes** (or *high bias data*), metric selection is more nuanced.

For instance, say you have a dataset where only 0.5% of the data is in category 1 and the rest is in category 0. You run your model and find that it categorized 99.5% of the data correctly! But because of the skew in that data, your model could simply classify every example as category 0 and still achieve that accuracy.

Note that the convention is to set the rare class to 1 and the other class to 0. That is, we try to predict the rare class.

Instead, you may want to use *precision/recall* as your evaluation metric.

|    | 1T             | 0T             |
|----|----------------|----------------|
| 1P | True positive  | False positive |
| 0P | False negative | True negative  |

Where 1T/0T indicates the actual class and 1P/0P indicates the predicted class.

**Precision** is the number of true positives over the total number predicted as positive. That is, what fraction of the examples labeled as positive actually are positive?

**Recall** is the number of true positives over the number of actual positives. That is, what fraction of the positive examples in the data were identified?

So in the previous example, our simple classifier would have a recall of 0.

There is a trade-off between precision and recall.

Say you are using a logistic regression model for this classification task. Normally, the category threshold in logistic regression is 0.5; that is, predict class 1 if $h_{\theta}(x) \geq 0.5$.

But you may want to classify an example as 1 only if you're very confident. So you may change the threshold to 0.9 to be stricter about your classifications. In this case, you would increase precision, but lower recall, since the model may not be confident enough about some of the more ambiguous positive examples.

Conversely, you may want to lower the threshold to avoid false negatives, in which case recall increases, but precision decreases.

So how do you compare precision/recall values across algorithms to determine which is best? You can condense precision and recall into a single metric: the **F-score**, which is the harmonic mean of the precision $P$ and recall $R$:

$$ F = 2 \cdot \frac{PR}{P + R} $$
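Computing the F-score from precision and recall is a one-liner (the zero-division guard is a common convention, not part of the definition):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# the harmonic mean punishes imbalance between the two
print(f1_score(0.9, 0.1))  # 0.18
print(f1_score(0.5, 0.5))  # 0.5
```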

Although more data doesn't always help, it generally does. Many algorithms perform significantly better as they get more and more data. Even relatively simple algorithms can outperform more sophisticated ones, solely on the basis of having more training data.

If your algorithm doesn't perform well, here are some things to try:

- Get more training examples (can help with high variance problems)
- Try smaller sets of features (can help with high variance problems)
- Try additional features (can help with high bias problems)
- Try adding polynomial features ($x_1^2, x_2^2, x_1 x_2$, etc.) (can help with high bias problems)
- Try decreasing the regularization parameter $\lambda$ (can help with high bias problems)
- Try increasing the regularization parameter $\lambda$ (can help with high variance problems)

Another part of model selection is *hyperparameter selection*.

Hyperparameter tuning is often treated as an art, i.e. without a reliable and practical systematic process for optimizing them. However, there are some automated methods that can be useful, including:

- grid search
- random search
- evolutionary algorithms
- Bayesian optimization

Random search and grid search don't perform particularly well but are worth being familiar with.

Grid search just searches through combinations of different hyperparameters and sees which combination performs best. Generally hyperparameters are searched over specific intervals or scales, depending on the particular hyperparameter - e.g. 10, 20, 30, etc., or 1e-5, 1e-4, 1e-3, etc. It is easy to parallelize but quite brute-force.
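A minimal grid search sketch over a dictionary of hyperparameter lists (the toy objective stands in for validation performance; all names are illustrative):

```python
from itertools import product

def grid_search(evaluate, grid):
    """Exhaustively try every hyperparameter combination; keep the best score."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# toy objective: peaks at lr=0.1, reg=0.01 (a stand-in for validation accuracy)
def evaluate(params):
    return -abs(params["lr"] - 0.1) - abs(params["reg"] - 0.01)

grid = {"lr": [0.001, 0.01, 0.1], "reg": [0.001, 0.01, 0.1]}
best, _ = grid_search(evaluate, grid)
print(best)  # {'lr': 0.1, 'reg': 0.01}
```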

Surprisingly, randomly sampling from the full grid often works just as well as a complete grid search, but in much less time.

Intuitively: if we want a hyperparameter combination in the top 5% of performance, then any random combination from the grid has a 5% chance of being such a result. If we want to successfully find such a combination 95% of the time, how many random combinations do we need to try?

If we take $n$ random draws, the chance that all of them miss the top 5% is $0.95^n$, so we need $1 - 0.95^n \geq 0.95$, i.e. $n \geq \frac{\log 0.05}{\log 0.95} \approx 59$ - only about 60 random samples, regardless of the size of the grid.
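This sample-size arithmetic (solving $1 - 0.95^n \geq 0.95$ for $n$) can be checked directly:

```python
import math

# smallest n with 1 - 0.95**n >= 0.95, i.e. a 95% chance that at least one of
# n random draws lands in the top 5% of the grid
n = math.ceil(math.log(0.05) / math.log(0.95))
print(n)  # 59
```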

We can use Bayesian optimization to select good hyperparameters for us. We can sample hyperparameters from a Gaussian process (the prior) and use the result as observations to compute a posterior distribution. Then we select the next hyperparameters to try by optimizing the expected improvement over the current best result or the Gaussian process upper confidence bound (UCB). In particular, we choose an *acquisition function* to construct a utility function from the model posterior - this is what we use to decide what next set of hyperparameters to try.

Basic idea: Model the generalization performance of an algorithm as a smooth function of its hyperparameters and then try to find the maxima.

It has two parts:

- Exploration: evaluate this function on sets of hyperparameters where the outcome is most uncertain
- Exploitation: evaluate this function on sets of hyperparameters which seem likely to output high values

These two steps repeat until convergence.

This is faster than grid search by making "educated" guesses as to where the optimal set of hyperparameters might be, as opposed to brute-force searching through the entire space.

One problem is that computing the results of a hyperparameter sample can be very expensive (for instance, if you are training a large neural network).

We use a Gaussian process because its properties allow us to compute marginals and conditionals in closed form.

Some notation for the following:

- $f(x)$ is the function drawn from the Gaussian process prior, where $x$ is a set of hyperparameters
- observations are in the form $\{x_n, y_n\}_{n=1}^{N}$, where $y_n \sim \mathcal N (f(x_n), v)$ and $v$ is the variance of noise introduced into the function observations
- the acquisition function is $a : \mathcal X \to \mathbb R^+$, where $\mathcal X$ is the hyperparameter space
- the next set of hyperparameters to try is $x_{\text{next}} = \argmax_x a(x)$
- the current best set of hyperparameters is $x_{\text{best}}$
- $\Phi(\cdot)$ denotes the cumulative distribution function of the standard normal

A few popular choices of acquisition functions include:

*probability of improvement*: with a Gaussian process, this can be computed analytically as:

$$ a_{\text{PI}}(x) = \Phi(\gamma(x)), \qquad \gamma(x) = \frac{f(x_{\text{best}}) - \mu(x)}{\sigma(x)} $$

where $\mu(x)$ and $\sigma(x)$ are the predictive mean and standard deviation of the Gaussian process at $x$

*expected improvement*: under a Gaussian process, this also has a closed form:

$$ a_{\text{EI}}(x) = \sigma(x) \left( \gamma(x) \Phi(\gamma(x)) + \phi(\gamma(x)) \right) $$

where $\phi(\cdot)$ is the standard normal density

*Gaussian process upper confidence bound*: use upper confidence bounds (when maximizing; otherwise, lower confidence bounds) to construct acquisition functions that minimize regret over the course of their optimization:

$$ a_{\text{LCB}}(x) = \mu(x) - \kappa \sigma(x) $$

where $\kappa$ is a parameter tuned to balance exploitation against exploration.
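To make the expected improvement acquisition concrete, here is a sketch of its closed form for the minimization setting, assuming you already have the GP's predictive mean and standard deviation at a candidate point (function names are mine):

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI under a GP posterior (minimization setting).

    mu, sigma: the GP's predictive mean and standard deviation at a candidate x;
    f_best: the best (lowest) observed value so far.
    """
    if sigma == 0:
        return 0.0  # no uncertainty: no expected improvement over the mean
    gamma = (f_best - mu) / sigma
    return sigma * (gamma * normal_cdf(gamma) + normal_pdf(gamma))

# a candidate predicted well below the current best, with some uncertainty,
# has higher EI than one predicted right at the current best
print(expected_improvement(mu=0.5, sigma=0.2, f_best=1.0) >
      expected_improvement(mu=1.0, sigma=0.2, f_best=1.0))  # True
```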

Some difficulties with Bayesian optimization of hyperparameters include:

- it is often unclear what the appropriate choice is for the covariance function and its associated hyperparameters (these hyperparameters are distinct from the ones the method is optimizing; i.e. they are in some sense "hyper-hyperparameters")
- the function evaluation can itself be a time-consuming optimization procedure. One method is to optimize expected improvement *per second*, thereby taking wall clock time into account. That way, we prefer to evaluate points that are not only likely to be good, but can also be evaluated quickly. However, we don't know the *duration function* $c(x) : \mathcal X \to \mathbb R^+$, but we can use this same Gaussian process approach to model $c(x)$ alongside $f(x)$.

Furthermore, these Bayesian optimization procedures can be parallelized (refer to the paper for details).

You can plot a graph with the number of gradient descent iterations on the x-axis and the values of $J(\theta)$ on the y-axis; if gradient descent is working, $J(\theta)$ should decrease with every iteration.

You could use an *automatic convergence test*, which just declares convergence if $J(\theta)$ decreases by less than some small threshold $\epsilon$ (e.g. $10^{-3}$) in one iteration, though in practice an appropriate threshold can be difficult to choose.

The particular problem is called the *CASH* problem (*Combined Algorithm Selection and Hyperparameter optimization* problem).

It can be formalized as such:

- $\mathcal A = \{A^{(1)}, \dots, A^{(R)}\}$ is a set of algorithms
- the algorithm $A^{(j)}$'s hyperparameters have the domain $\Lambda^{(j)}$
- $D_{\text{train}} = \{(x_1, y_1), \dots, (x_n, y_n)\}$ is a training set
- it is split into $K$ cross-validation folds $\{D_{\text{valid}}^{(1)}, \dots, D_{\text{valid}}^{(K)}\}$ and $\{D_{\text{train}}^{(1)}, \dots, D_{\text{train}}^{(K)}\}$
- the loss $L(A_{\lambda}^{(j)}, D_{\text{train}}^{(i)}, D_{\text{valid}}^{(i)})$ is the loss an algorithm $A^{(j)}$ achieves on $D_{\text{valid}}^{(i)}$ when trained on $D_{\text{train}}^{(i)}$ with hyperparameters $\lambda$

We want to find the joint algorithm and hyperparameter setting that minimizes this loss:

$$ A^*_{\lambda^*} \in \argmin_{A^{(j)} \in \mathcal A, \lambda \in \Lambda^{(j)}} \frac{1}{K} \sum^K_{i=1} L(A_{\lambda}^{(j)}, D_{\text{train}}^{(i)}, D_{\text{valid}}^{(i)}) $$

Approaches to this problem include the aforementioned Bayesian optimization methods.

*Meta-learning* is another approach, in which machine learning is applied to machine learning itself, that is, to algorithm and hyperparameter selection (and additionally feature preprocessing). The input data are different machine learning tasks and datasets, and the output is a well-performing algorithm and hyperparameter combination. In meta-learning we learn "meta-features" to identify similar problems for which an algorithm and hyperparameter combination performs well.

These meta-features can include things like the number of datapoints, features, and classes, the data skewness, the entropy of the targets, etc.

Meta-learning can be combined with Bayesian optimization - it can be used to roughly identify good algorithm and hyperparameter choices, and Bayesian optimization can be used to fine-tune these choices. This approach of using meta-learning to support Bayesian optimization is called "warmstarting".

As Bayesian optimization searches for hyperparameters, it may come across many well-performing settings that it discards because they are not the best. However, they can be saved to construct a (weighted) ensemble model, which usually outperforms individual models. The *ensemble selection* method seems to work best for constructing the ensemble:

- start with an empty ensemble
- iteratively, up to a specified ensemble size
- add a model that maximizes ensemble validation performance

Models are unweighted, but a model can be added multiple times, so the end result is a weighted ensemble.
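The greedy loop above can be sketched as follows, for binary classifiers that output probabilities on a validation set (a toy setup; ties are broken by model index, and all names are illustrative):

```python
def ensemble_selection(model_preds, y_valid, ensemble_size):
    """Greedy forward selection of models (with replacement) into an ensemble.

    model_preds: list of per-model prediction lists on the validation set
                 (here, predicted probabilities of the positive class).
    """
    def ensemble_error(members):
        # average member probabilities, threshold at 0.5, count mistakes
        errors = 0
        for i, y in enumerate(y_valid):
            avg = sum(model_preds[m][i] for m in members) / len(members)
            errors += (1 if avg >= 0.5 else 0) != y
        return errors

    ensemble = []
    for _ in range(ensemble_size):
        # add whichever model (possibly already a member) helps most
        best = min(range(len(model_preds)),
                   key=lambda m: ensemble_error(ensemble + [m]))
        ensemble.append(best)
    return ensemble  # duplicate entries act as weights

preds = [[0.9, 0.8, 0.2], [0.6, 0.4, 0.4], [0.1, 0.9, 0.9]]
y = [1, 1, 0]
print(ensemble_selection(preds, y, 3))  # [0, 0, 0]
```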

- Review of fundamentals, IFT725. Hugo Larochelle. 2012.
- Exploratory Data Analysis Course Notes. Xing Su.
- Mining Massive Datasets (Coursera & Stanford, 2014). Jure Leskovec, Anand Rajaraman, Jeff Ullman.
- Machine Learning. 2014. Andrew Ng. Stanford University/Coursera.
- CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley (edX).
- *Evaluating Machine Learning Models*. Alice Zheng. 2015.
- Computational Statistics II (code). Chris Fonnesbeck. SciPy 2015.
- Intro to Artificial Intelligence. CS271. Peter Norvig, Sebastian Thrun. Udacity.
- MIT 6.034 (Fall 2010): Artificial Intelligence. Patrick H. Winston. MIT.
- Deep Learning. Yoshua Bengio, Ian Goodfellow, Aaron Courville.
- CS231n Convolutional Neural Networks for Visual Recognition, Module 1: Neural Networks Part 2: Setting up the Data and the Loss. Andrej Karpathy.
- POLS 509: Hierarchical Linear Models. Justin Esarey.
- Bayesian Inference with Tears. Kevin Knight, September 2009.
- Learning to learn, or the advent of augmented data scientists. Simon Benhamou.
- Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo Larochelle, Ryan P. Adams.
- What is the expectation maximization algorithm?. Chuong B Do & Serafim Batzoglou.
- Gibbs Sampling for the Uninitiated. Philip Resnik, Eric Hardisty. June 2010.
- Maximum Likelihood Estimation. Penn State Eberly College of Science.
- Data Science Specialization. Johns Hopkins (Coursera). 2015.
- Practical Machine Learning. Johns Hopkins (Coursera). 2015.
- Elements of Statistical Learning. 10th Edition. Trevor Hastie, Robert Tibshirani, Jerome Friedman.
- Model evaluation: quantifying the quality of predictions. scikit-learn.
- Efficient and Robust Automated Machine Learning. Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, Frank Hutter.