Model Selection

Model selection is the process of choosing between different machine learning approaches - e.g. SVM, logistic regression, etc. - or choosing between different hyperparameters or feature sets for the same approach - e.g. deciding on the polynomial degree/complexity for linear regression.

The choice of the actual machine learning algorithm (e.g. SVM or logistic regression) is less important than you'd think - there may be a "best" algorithm for a particular problem, but often its performance is not much better than other well-performing approaches for that problem.

There may be certain qualities you look for in a model:

Though there are generally trade-offs amongst these qualities.

Model evaluation

In order to select amongst models, we need some way of evaluating their performance.

You can't evaluate a model's hypothesis function with the training cost alone, because minimizing that error can lead to overfitting.

A good approach is to take your data and split it randomly into a training set and a test set (e.g. a 70%/30% split). Then you train your model on the training set and see how it performs on the test set.

For linear regression, you might measure the mean squared error on the test set; for logistic regression, you might measure the misclassification (0/1) error on the test set.
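
A minimal sketch of the linear regression case, assuming scikit-learn and synthetic data (any framework with a split utility works the same way):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real dataset.
X, y = make_regression(n_samples=100, n_features=3, noise=10, random_state=0)

# 70%/30% train/test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(mean_squared_error(y_test, model.predict(X_test)))  # test error (MSE)
```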

A better way of splitting the data is to split it not only into training and testing sets, but to also include a validation set. A typical ratio is 60% training, 20% validation, 20% testing.

So instead of just measuring the test error, you would also measure the validation error.

Validation is used mainly to tune hyperparameters - you don't want to tune them on the training set because that can result in overfitting, nor do you want to tune them on your test set because that results in an overly optimistic estimation of generalization. Thus we keep a separate set of data for the purpose of validation, that is, for tuning the hyperparameters - the validation set.
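
A minimal sketch of a 60%/20%/20% split, again assuming scikit-learn and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve out 20% for testing, then split the remaining 80% into
# 75%/25%, giving a 60/20/20 train/validation/test split overall.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # 600, 200, 200
```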

You can use these errors to identify what kind of problem you have if your model isn't performing well:

Because the test set is used to estimate the generalization error, it should not be used for "training" in any sense - this includes tuning hyperparameters. You should not evaluate on the test set and then go back and tweak things - this will give an overly optimistic estimation of generalization error.

Some ways of evaluating a model's performance on (some of) your known data are:

Validation vs Testing

Validation refers to the phase where you are tuning your model and its hyperparameters. Once you have done that, you want to test the model on a new set of data it has not seen yet (i.e. data which has not been used in cross-validation or bootstrapping or whatever method you used). This simulates the model's performance on completely new data, which is the most important measure of a model's quality.

Evaluating regression models

The main techniques for evaluating regression models are:

Residuals

A residual $e_i$ is the difference between the observed and predicted outcome, i.e.:

$$ e_i = y_i - \hat y_i $$

This can also be thought of as the vertical distance between an observed data point and the regression line.

Fitting a line by least squares minimizes $\sum_{i=1}^n e_i^2$; that is, it minimizes the mean squared error (MSE) between the line and the data. But some error always remains after fitting the line; this remaining error is the residual.

Alternatively, the mean absolute error or median absolute error can be used instead of the mean squared error.

The residuals $e_i$ can be interpreted as estimates of the regression errors $\epsilon_i$, since the true errors can only be computed if the true model parameters are known.

We can measure the quality of a linear model, which is called goodness of fit. One approach is to look at the variation of the residuals. You can also use the coefficient of determination ($R^2$), explained previously, which measures the variance explained by the least squares line.

Residual (error) variation

Residual variation measures how well a regression line fits the data points.

The average squared residual (the estimated residual variance) is the same as the mean squared error, i.e. $\hat \sigma^2 = \frac{1}{n} \sum_{i=1}^n e_i^2$.

However, to make this estimator unbiased, you're more likely to see:

$$ \hat \sigma^2 = \frac{1}{n-2} \sum_{i=1}^n e_i^2 $$

That is, with the degrees of freedom taken into account (here two are lost, for the intercept and slope, which both have to be estimated).

The square root of this estimated variance, $\hat \sigma$, is the root mean squared error (RMSE).
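
As a small numeric illustration (the observations and fitted values here are made up):

```python
import numpy as np

# Assumed: y are observed outcomes and y_hat the fitted values from a simple
# linear regression (intercept + slope), so two degrees of freedom are lost.
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
y_hat = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

residuals = y - y_hat
n = len(y)
sigma2_hat = np.sum(residuals ** 2) / (n - 2)  # unbiased residual variance estimate
rmse = np.sqrt(sigma2_hat)
print(sigma2_hat, rmse)
```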

Coefficient of determination

The total variation is equal to the residual variation (variation after removing the predictor) plus the systematic/regression variation (the variation explained by the regression model):

$$ \sum_{i=1}^n (Y_i - \bar Y)^2 = \sum_{i=1}^n (Y_i - \hat Y_i)^2 + \sum_{i=1}^n (\hat Y_i - \bar Y)^2 $$

$R^2$ ($0 \leq R^2 \leq 1$) is the percent of total variability that is explained by the regression model, that is:

$$ R^2 = \frac{\text{regression variation}}{\text{total variation}} = \frac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} = 1 - \frac{\text{residual variation}}{\text{total variation}} = 1 - \frac{\sum_{i=1}^n (Y_i - \hat Y_i)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} $$

$R^2$ can be a misleading summary of model fit since deleting data or adding terms will inflate it.

The coefficient of determination can also be derived directly from a least squares fit, as follows.

Example of error

For a line $y = mx+b$, the error of a point $(x_n, y_n)$ against that line is:

$$ y_n - (mx_n + b) $$

Intuitively, this is the vertical difference between the point on the line at $x_n$
and the actual point at $x_n$.

The squared error of the line is the sum of the squares of all of these errors:

$$ \SE_{\text{line}} = \sum_{i=1}^n (y_i - (mx_i + b))^2 $$

To get a best fit line, you want to minimize this squared error. That is, you want to
find the $m$ and $b$ that minimize $\SE_{\text{line}}$. This works out as ^[Reminder: a bar over a variable ($\bar x$) means the mean of those values. So $\bar{x^2} = \frac{x_1^2 + x_2^2 + \dots + x_n^2}{n}$]:

$$ m = \frac{\bar{x}\bar{y} - \bar{xy}}{\bar{x}^2 - \bar{x^2}} $$
$$ b = \bar{y} - m\bar{x} $$

Note that you can alternatively calculate the regression line slope $m$ using the covariance and variance:

$$ m = \frac{\Cov(x,y)}{\Var(x)} $$

The line that these values yield is the regression line.

We can calculate the total variation in $y$, $\SE_{\bar{y}}$, as:

$$ \SE_{\bar{y}} = \sum_{i=1}^n (y_i - \bar{y})^2 $$

And then we can calculate the percentage of total variation in $y$ described by the
regression line:

$$ 1 - \frac{\SE_{\text{line}}}{\SE_{\bar{y}}} $$

This is known as the coefficient of determination or R-squared.

The closer R-squared is to 1, the better a fit the line is.
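
A minimal numpy sketch of the mean-based formulas above, using made-up data:

```python
import numpy as np

# Hypothetical data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Slope and intercept from the mean-based formulas above.
m = (np.mean(x) * np.mean(y) - np.mean(x * y)) / (np.mean(x) ** 2 - np.mean(x ** 2))
b = np.mean(y) - m * np.mean(x)

y_hat = m * x + b
se_line = np.sum((y - y_hat) ** 2)    # squared error of the line
se_y = np.sum((y - np.mean(y)) ** 2)  # total variation in y
r_squared = 1 - se_line / se_y
print(m, b, r_squared)
```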

Evaluating classification models

Important quantities:

Area under the curve (AUC)

This method is for binary classification and multilabel classification. In binary classification you may choose some cutoff above which you assign a sample to one class, and below which you assign a sample to the other class.

Depending on your cutoff, you will get different results - there is a trade-off between the true positive and false positive rates.

You can plot a Receiver Operating Characteristic (ROC) curve, which has the true positive rate on its y-axis and the false positive rate on its x-axis. Every point on the curve corresponds to a cutoff value. That is, the ROC curve visualizes a sweep through all of the cutoff thresholds, so you can see your classifier's performance across all of them, whereas other metrics (such as the F-score) only tell you the performance at one particular cutoff. By looking at all thresholds at once, you get a more complete and honest picture of how the classifier is performing - in particular, how well it separates the classes. The ROC curve is also insensitive to class imbalance - that is, to there being far more of the positive class than the negative class, or vice versa (other metrics may be deceptively favorable or punishing in such unbalanced circumstances).

The area under the curve (AUC) is used to quantify how good the classification algorithm is. In general, an AUC of above 0.8 is considered "good". An AUC of 0.5 (a straight line) is equivalent to random guessing.

ROC curves

So ROC curves (and the associated AUC metric) are very useful for evaluating binary classification.

Note that ROC curves can be extended to classification of three or more classes by using the one-vs-all approach (see section on classification).

To expand on why AUC is useful:

AUC is a metric for binary classification and is especially useful when dealing with imbalanced (skewed) data, that is, where one class is much more common than the other. Using accuracy as a metric falls apart on imbalanced datasets: for example, say you have 100 training examples, one of which is positive and the rest of which are negative. You could develop a model which just labels everything negative, and it would have 99% accuracy. So accuracy doesn't really tell you enough here.

Many binary classifiers output some continuous value (0-1) rather than class labels; there is some threshold (usually 0.5) above which one label is assigned, and below which the other label is assigned. Some models may work best with a different threshold. Changing this threshold leads to a trade-off between true positives and false positives - for example, decreasing the threshold will yield more true positives, but also more false positives.

AUC runs over all thresholds and plots the true positive rate against the false positive rate. This curve is called a receiver operating characteristic curve, or ROC curve. A random classifier would give you equal true and false positive rates, which leads to an AUC of 0.5; the curve in this case would be a straight line. The better the classifier is, the more area under the curve there is (so the AUC approaches 1).
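
A minimal sketch of computing the ROC curve and AUC, assuming scikit-learn and made-up labels/scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (fpr, tpr) point per threshold
auc = roc_auc_score(y_true, y_score)
print(auc)
```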

Confusion Matrices

This method is suitable for binary or multiclass classification.

For classification, evaluation often comes in the form of a confusion matrix.

The core values are:

A few other metrics are computed from these values:

Some other values:

Log-loss

This method is suitable for binary, multiclass, and multilabel classification.

Log-loss is an accuracy metric that can be used when the classifier output is not a class but a probability, as is the case with logistic regression. It penalizes the classifier based on how far off it is, e.g. if it predicts 1 with probability of 0.51 but the correct class is 0, it is less "wrong" than if it had predicted class 1 with probability 0.95.

For a binary classifier, log-loss is computed:

$$ -\frac{1}{N} \sum_{i=1}^N \left[ y_i \log(\hat y_i) + (1-y_i) \log(1 - \hat y_i) \right] $$

Log-loss is the cross-entropy between the distribution of the true labels and the predictions. It is related to relative entropy (that is, Kullback-Leibler divergence).

Intuitively, the way this works is the $y_i$ terms "turn on" the appropriate parts, e.g. when $y_i = 1$ then the term $y_i \log(\hat y_i)$ is activated and the other is 0. The reverse is true when $y_i = 0$.

Because $\log(1) = 0$, we get the best loss (0) when the term within the $\log$ operation is 1; i.e. when $y_i = 1$ we want $\hat y_i$ to equal 1, so the loss comes down to $\log(\hat y_i)$, but when $y_i = 0$, we want $\hat y_i = 0$, so the loss in that case comes down to $\log(1 - \hat y_i)$.
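
A minimal numpy sketch of binary log-loss, using the 0.51 vs 0.95 example from above:

```python
import numpy as np

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Average cross-entropy between true labels and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Confident wrong predictions are penalized much more than uncertain ones.
print(binary_log_loss(np.array([0]), np.array([0.51])))  # ~0.71
print(binary_log_loss(np.array([0]), np.array([0.95])))  # ~3.0
```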

F1 score

The F1 score, also called the balanced F-score or F-measure, is the harmonic mean of precision and recall:

$$ F_1 = 2 \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} $$

The best score is 1 and the worst is 0.

It can be used for binary, multiclass, and multilabel classification (for the latter two, use the weighted average of the F1 score for each class).

Metric selection

When it comes to skewed classes (imbalanced data), metric selection is more nuanced.

For instance, say you have a dataset where only 0.5% of the data is in category 1 and the rest is in category 0. You run your model and find that it categorized 99.5% of the data correctly! But because of the skew in the data, a model that simply classifies every example as category 0 would achieve that same accuracy.

Note that the convention is to set the rare class to 1 and the other class to 0. That is, we try to predict the rare class.

Instead, you may want to use precision/recall as your evaluation metric.

|    | 1T             | 0T             |
|----|----------------|----------------|
| 1P | True positive  | False positive |
| 0P | False negative | True negative  |

Where 1T/0T indicates the actual class and 1P/0P indicates the predicted class.

Precision is the number of true positives over the total number predicted as positive. That is, what fraction of the examples labeled as positive actually are positive?

$$ \frac{\text{true positives}}{\text{true positives} + \text{false positives}} $$

Recall is the number of true positives over the number of actual positives. That is, what fraction of the positive examples in the data were identified?

$$ \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} $$

So in the previous example, our simple classifier would have a recall of 0.

There is a trade-off between precision and recall.

Say you are using a logistic regression model for this classification task. Normally, the category threshold in logistic regression is 0.5, that is, predict class 1 if $h_{\theta}(x) \geq 0.5$ and predict class 0 if $h_{\theta}(x) < 0.5$.

But you may want to classify an example as 1 only if you're very confident. So you may raise the threshold to 0.9 to be stricter about your classifications. In this case, you would increase precision, but lower recall, since the model may not be confident enough about some of the more ambiguous positive examples.

Conversely, you may want to lower the threshold to avoid false negatives, in which case recall increases, but precision decreases.

So how do you compare precision/recall values across algorithms to determine which is best? You can condense precision and recall into a single metric: the $F_1$ score (also just called the F-score, which is the harmonic mean of the precision and recall):

$$ F_1 = 2 \frac{PR}{P+R} $$
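
A minimal sketch of computing precision, recall, and the F-score, assuming scikit-learn and made-up predictions on an imbalanced problem:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical predictions; class 1 is the rare (positive) class.
y_true = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 1, 1, 0, 0]

p = precision_score(y_true, y_pred)  # TP / (TP + FP)
r = recall_score(y_true, y_pred)     # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)        # 2PR / (P + R)
print(p, r, f1)
```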

Although more data doesn't always help, it generally does. Many algorithms perform significantly better as they get more and more data. Even relatively simple algorithms can outperform more sophisticated ones, solely on the basis of having more training data.

If your algorithm doesn't perform well, here are some things to try:

Hyperparameter selection

Another part of model selection is hyperparameter selection.

Hyperparameter tuning is often treated as an art, i.e. approached without a reliable, practical, and systematic process for optimizing the hyperparameters. However, there are some automated methods that can be useful, including:

Random search and grid search don't perform particularly well but are worth being familiar with.

Grid search

Grid search simply tries combinations of different hyperparameter values and sees which combination performs the best. Values are generally searched over specific intervals or scales, depending on the particular hyperparameter - it may be 10, 20, 30, etc., or 1e-5, 1e-4, 1e-3, etc. Grid search is easy to parallelize but quite brute-force.
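
A minimal sketch of a grid search, assuming scikit-learn's `GridSearchCV`, an SVM, and a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Candidate values laid out on the kinds of scales mentioned above.
param_grid = {"C": [0.01, 0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2, 1e-1]}

search = GridSearchCV(SVC(), param_grid, cv=5)  # tries all 20 combinations
search.fit(X, y)
print(search.best_params_, search.best_score_)
```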

Random search

Surprisingly, randomly sampling from the full grid often works just as well as a complete grid search, but in much less time.

Intuitively: if we want the hyperparameter combination leading to the top 5% of performance, then any random hyperparameter combination from the grid has a 5% chance of leading to that result. If we want to successfully find such a combination 95% of the time, how many random combinations do we need to run through?

If we take $n$ hyperparameter combinations, the probability that all $n$ are outside of this top 5% is $(1 - 0.05)^n$, so the probability that at least one is in the top 5% is $1 - (1-0.05)^n$. If we want to find such a combination 95% of the time, we set $1 - (1-0.05)^n = 0.95$, which gives $n \geq 60$ (roughly). So we only need to try about 60 random hyperparameter combinations to have a 95% chance of finding at least one that yields top-5% performance for the model.
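
A minimal sketch of random search with roughly 60 sampled combinations, assuming scikit-learn's `RandomizedSearchCV` and scipy's `loguniform` distribution:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Sample ~60 random combinations instead of exhausting the grid.
param_distributions = {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-4, 1e-1)}
search = RandomizedSearchCV(SVC(), param_distributions, n_iter=60, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```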

Bayesian Hyperparameter Optimization

We can use Bayesian optimization to select good hyperparameters for us. We place a Gaussian process prior over the function mapping hyperparameters to performance, and use the results of the hyperparameter settings evaluated so far as observations to compute a posterior distribution. Then we select the next hyperparameters to try by optimizing the expected improvement over the current best result, or the Gaussian process upper confidence bound (UCB). In particular, we choose an acquisition function to construct a utility function from the model posterior - this is what we use to decide which set of hyperparameters to try next.

Basic idea: Model the generalization performance of an algorithm as a smooth function of its hyperparameters and then try to find the maxima.

It has two parts:

These repeat until convergence.

This is faster than grid search by making "educated" guesses as to where the optimal set of hyperparameters might be, as opposed to brute-force searching through the entire space.

One problem is that computing the results of a hyperparameter sample can be very expensive (for instance, if you are training a large neural network).

We use a Gaussian process because its properties allow us to compute marginals and conditionals in closed form.

Some notation for the following:

A few popular choices of acquisition functions include:

$$ \begin{aligned} a_{\text{PI}}(x ; \{x_n, y_n\}, \theta) &= \Phi(\gamma(x)) \\ \gamma(x) &= \frac{f(x_{\text{best}}) - \mu(x; \{x_n, y_n\}, \theta)}{\sigma(x; \{x_n, y_n\}, \theta)} \end{aligned} $$

$$ a_{\text{EI}} (x; \{x_n, y_n\}, \theta) = \sigma(x; \{x_n, y_n\}, \theta) (\gamma(x)\Phi(\gamma(x)) + \mathcal N (\gamma(x); 0, 1)) $$

$$ a_{\text{LCB}} (x; \{x_n, y_n\}, \theta) = \mu(x; \{x_n, y_n\}, \theta) - \kappa \sigma(x; \{x_n, y_n\}, \theta) $$

Where $\kappa$ is tunable to balance exploitation against exploration.

Some difficulties with Bayesian optimization of hyperparameters include:

Furthermore, we can parallelize these Bayesian optimization procedures (refer to the paper).
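
A minimal sketch using the scikit-optimize library (one library choice among several; the toy objective here stands in for a real cross-validated error):

```python
from skopt import gp_minimize
from skopt.space import Real

# Objective: validation error as a function of hyperparameters.
# A made-up quadratic stands in for training/evaluating a real model.
def objective(params):
    c, gamma = params
    return (c - 1.0) ** 2 + (gamma - 0.01) ** 2  # pretend validation error

space = [Real(1e-2, 1e2, prior="log-uniform", name="C"),
         Real(1e-4, 1e-1, prior="log-uniform", name="gamma")]

# Gaussian-process surrogate + expected improvement acquisition.
result = gp_minimize(objective, space, acq_func="EI", n_calls=30, random_state=0)
print(result.x, result.fun)
```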

Choosing the Learning Rate $\alpha$

You can plot out a graph with the number of gradient descent iterations on the x-axis and the values of $\min_{\theta} J(\theta)$ on the y-axis and visualize how the latter changes with the number of iterations. At some point, that curve will flatten out; that's about the number of iterations it took for gradient descent to converge on your particular problem.

Decreasing error over number of iterations

You could use an automatic convergence test which just declares convergence if $J(\theta)$ decreases by less than some threshold value in an iteration, but in practice that threshold value may be difficult to determine.

You would expect this curve to be similar to the one above. $\min_{\theta} J(\theta)$ should decrease with the number of iterations, if gradient descent is working correctly. If not, then you should probably be using a smaller learning rate ($\alpha$). But again, don't make it too small or convergence will be slow.
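
A minimal sketch of producing such a curve, assuming batch gradient descent on a made-up linear regression problem:

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy data for a single-feature linear regression (with a bias column).
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.uniform(0, 10, 100)]
y = 3 + 2 * X[:, 1] + rng.normal(0, 1, 100)

alpha = 0.01       # learning rate
theta = np.zeros(2)
costs = []
for _ in range(500):
    error = X @ theta - y
    costs.append((error ** 2).mean() / 2)    # J(theta) at the current theta
    theta -= alpha * (X.T @ error) / len(y)  # gradient descent step

plt.plot(costs)  # should flatten out as gradient descent converges
plt.xlabel("iteration")
plt.ylabel(r"$J(\theta)$")
plt.show()
```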

CASH

This combined problem - selecting the algorithm and its hyperparameters together - is called the CASH problem (Combined Algorithm Selection and Hyperparameter optimization).

It can be formalized as follows: we want to find the joint algorithm and hyperparameter settings that minimize the cross-validated loss:

$$ A^*, \lambda_* \in \argmin_{A^{(j)} \in \mathcal A, \lambda \in \Lambda^{(j)}} \frac{1}{K} \sum_{i=1}^K L(A_{\lambda}^{(j)}, D_{\text{train}}^{(i)}, D_{\text{valid}}^{(i)}) $$

Approaches to this problem include the aforementioned Bayesian optimization methods.

Meta-learning is another approach, in which machine learning is applied to machine learning itself - that is, to algorithm and hyperparameter selection (and additionally feature preprocessing). The input data are different machine learning tasks and datasets, and the output is a well-performing algorithm and hyperparameter combination. In meta-learning we learn "meta-features" that identify similar problems for which a given algorithm and hyperparameter combination works well.

These meta-features can include things like the number of datapoints, features, and classes, the data skewness, the entropy of the targets, etc.

Meta-learning can be combined with Bayesian optimization - it can be used to roughly identify good algorithm and hyperparameter choices, and Bayesian optimization can be used to fine-tune these choices. This approach of using meta-learning to support Bayesian optimization is called "warmstarting".

As Bayesian optimization searches over hyperparameters, it may come across many well-performing hyperparameter settings that it discards because they are not the best. However, these can be saved and used to construct a (weighted) ensemble model, which usually outperforms the individual models. The ensemble selection method seems to work best for constructing the ensemble:

Models are unweighted, but a model can be added multiple times, so the end result is a weighted ensemble.
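
A hedged sketch of the greedy selection loop (assuming the method referred to is Caruana-style ensemble selection; the prediction format here is also an assumption):

```python
import numpy as np

def ensemble_selection(val_predictions, y_val, n_rounds=20):
    """Greedily add, with replacement, the model whose inclusion most
    improves the ensemble's validation error."""
    # val_predictions: shape (n_models, n_val), each model's predicted
    # probabilities on the validation set.
    ensemble = []  # indices of chosen models; repeats act as weights
    for _ in range(n_rounds):
        best_idx, best_err = None, np.inf
        for i in range(len(val_predictions)):
            candidate = ensemble + [i]
            avg_pred = val_predictions[candidate].mean(axis=0)
            err = np.mean((avg_pred - y_val) ** 2)  # any validation metric works
            if err < best_err:
                best_idx, best_err = i, err
        ensemble.append(best_idx)
    return ensemble  # e.g. [2, 2, 5, ...] -> model 2 gets weight 2/len(ensemble)

# Example usage with three hypothetical models' validation predictions:
preds = np.array([[0.9, 0.1, 0.8], [0.6, 0.4, 0.7], [0.2, 0.9, 0.3]])
print(ensemble_selection(preds, np.array([1, 0, 1]), n_rounds=5))
```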

References