In supervised learning, the learning algorithm is provided some pre-labeled examples (a *training set*) to learn from.

In *regression* problems, you try to predict some continuous valued output (i.e. a real number).

In *classification* problems, you try to predict some discrete valued output (e.g. categories).

Typical notation:

$m$ = number training of examples$x$ 's = input variables or features$y$ 's = output variables or the "target" variable$(x^{(i)}, y^{(i)})$ = the$i$ th training example$h$ = the hypothesis, that is, the function that the learning algorithm learns, taking$x$ 's as input and outputting$y$ 's

The typical process is:

- Feed training set data into the learning algorithm
- The learning algorithm learns the hypothesis
$h$ - Input new data into
$h$ - Get output from
$h$

The hypothesis can thought of as the model that you try to learn for a particular task. You then use this model on new inputs, e.g. to make predictions - **generalization** is how the model performs on new examples; this is most important in machine learning.

**Capacity**: the flexibility of a model - that is, the variety of functions it can fit.**Representational capacity**- the functions which the model*can*learn**Effective capacity**- in practice, a learning algorithm is not likely to find the*best*function out of the possible functions it can learn, though it can learn one that performs exceptionally well - those functions that the learning algorithm is capable of finding defines the model's*effective*capacity.

**Hypothesis space**: the set of functions the model is limited to learning. For instance, linear regression can be limited to linear functions as its hypothesis space, or it can be expanded to learn polynomials as well, e.g. by introducing an$x^2$ term.**Hyperparameter**: a parameter of a model that is not learned (that is, you specify it yourself)**Underfitting**: when the model could achieve better generalization with more training or capacity. Characterized by a high training error.**Overfitting**: when the model could achieve better generalization with more training or capacity; in particular, the model is too tuned to the idiosyncrasies of the training data (for instance, it may fit to sampling error, which we don't want). Too much capacity can lead to overfitting in that the model may be able to learn functions too specific to the data. Characterized by a large gap between the training error and the test error.**Model selection**: the process of choosing the best hyperparameters on a validation set

If the true function is in your hypothesis space *realizable* in

In machine learning, there are generally two kinds of problems: *regression* and *classification* problems. Machine learning algorithms are typically designed for one or the other.

**Regression** involves fitting a model to data. The goal is to understand the relationship between one set of variables - the **dependent** or **response** or **target** or **outcome** or **explained** variables (e.g. **independent** or **explanatory** or **predictor** or **regressor** variables (e.g. **simple regression**. In scenarios with more than one explanatory variable, we have **multiple regression**. In scenarios with more than one dependent variable, we have **multivariate regression**.

With **linear regression** we expect that the dependent and explanatory variables have a linear relationship; that is, can be expressed as a **linear combination** of random variables, i.e.:

For some dependent variable

Of course, we do not know the true values for these **regression coefficients**) so they end up being point estimates as well. We can estimate them as follows.

When given data, one technique we can use is **ordinary least squares**, sometimes just called **least squares regression**, which looks for parameters **least squares criterion**). The resulting line is called the **least squares line**.

Note that *linear model* does not mean the model is necessarily a *straight* line. It can be polynomial as well - but you can think of the polynomial terms as additional explanatory variables; looking at it this way, the line (or curve, but for consistency, they are all called "lines") still follows the form above. And of course, in higher-dimensions (that is, for multiple regression) we are not dealing with lines but planes, hyperplanes, and so on. But again, for the sake of simplicity, they are all just referred to as "lines".

For example, the line:

Can be re-written:

When we use a regression model to predict a dependent variable, e.g.

Classification problems are where your target variables are discrete, so they represent categories or classes.

For *binary classification*, there are only two classes (that is, *negative* class, and the 1 class the *positive* class.

Otherwise, the classification problem is called a *multiclass* classification problem - there are more than two classes.

Much of machine learning can be framed as optimization problems - there is some kind of *objective* (also called *loss* or *cost*) function which we want to optimize (e.g. minimize classification error on the training set). Typically you are trying to find some parameters for your model,

Generally this framework for machine learning is called **empirical risk minimization** and can be formulated:

Where:

$f(x^{(i)}; \theta)$ is your model, which outputs some predicted value for the input$x^{(i)}$ and$\theta$ are the parameters for the model$y^{(i)}$ is the training label (i.e. the ground-truth) for the input$x^{(i)}$ $l$ is the loss function$\Omega(\theta)$ is a*regularlizer*to penalize certain values of$\theta$ and$\lambda$ is the regularization parameter (see below on*regularlization*)

Some optimization terminology:

**Critical points**:$\\{x \in \mathbb R^n | \nabla_x f(x) = 0 \\}$ **Curvature in direction**:$v$ $v^T \nabla_x^2 f(x)v$ - Types of critical points:
- local minima:
$v^T \nabla_x^2 f(x)v > 0, \, \forall v$ , that is$\nabla_x^2f(x)$ is positive definite - local maxima:
$v^T \nabla_x^2 f(x)v < 0, \, \forall v$ , that is$\nabla_x^2f(x)$ is negative definite - saddle point: curvature is positive in some directions and negative in others

- local minima:

So we have our training set

Here is the cost function for *linear* regression:

Note that the

Note that now the

We can extract

The cost function for logistic regression is different than that used for linear regression because the hypothesis function of logistic regression causes *non-convex*, that is, look something like the following with many local optima, making it hard to converge on the global minimum.

So we want to find a way to define

Some properties of this function is that if

We can rewrite

So our entire cost function is:

You could use other cost functions for logistic regression, but this one is derived from the principle of maximum likelihood estimation and has the nice property of being convex, so this is the one that basically everyone uses for logistic regression.

Then we can calculate

This looks exactly the same as the linear regression gradient descent algorithm, but it is different because

*Gradient descent* is an optimization algorithm for finding parameter values which minimize a cost function.

Gradient descent perhaps the most common optimization algorithm in machine learning.

So we have some cost function

The general approach is:

- Start with some
$\theta_0, \theta_1, \dots, \theta_n$ . - Changing
$\theta_0, \theta_1, \dots, \theta_n$ in some increment/step to reduce$J(\theta_0, \theta_1, \dots, \theta_n)$ as much as possible. - Repeat the previous step until convergence on a minimum (hopefully)

Repeat the following until convergence:

(Note that := is the assignment operator.)

For each

Every *simultaneously*. So technically, you'd calculate this value for each

For example, if the right-hand side of that equation was a function `func(j, t0, t1)`

, you would implement it like so (example is

```
temp_0 = func(0, theta_0, theta_1)
temp_1 = func(1, theta_0, theta_1)
theta_0 = temp_0
theta_1 = temp_1
```

*learning rate* and tells how large a step/increment to change the parameters by.

Learning rates which are too small cause the gradient descent to go slowly. Learning rates which are too large can cause the gradient descent to overshoot the minimum, and in those cases it can fail to converge or even diverge.

The partial derivative on the right is just the rate of change from the current value.

The normal equation is an approach which allows for the direct determination of an optimal

With calculus, you find the optimum of a function by calculating where its derivatives equal 0 (the intuition is that derivatives are rates of change, when the rate of change is zero, the function is "turning around" and is at a peak or valley).

So we can take the same cost function we've been using for linear regression and take the partial derivatives of the cost function

And for each

Then solve for

The fast way to do this is to construct a matrix out of your features, including a column for

If you have

From which we can construct *design matrix*:

That is, the design matrix is composed of the transposes of the feature vectors for all the training examples. Thus a column in the design matrix corresponds to a feature, and each row corresponds to an example.

Typically, all examples have the same length, but this may not necessarily be the case. You may have, for instance, images of different dimensions you wish to classify. This kind of data is *heterogeneous*.

And then the vector

Then you can calculate the

With this method, feature scaling isn't necessary.

Note that it's possible that *singular*, also called *degenerate*), but this is usually due to redundant features (e.g. having a feature in feet and in meters; they communicate the same information) or having too many features (e.g. *regularization*.

Programs which calculate the inverse of a matrix often have a method which allows it to calculate the optimal

- Gradient Descent
- requires that you choose
$\alpha$ - needs many iterations
- works well when
$n$ is large

- requires that you choose
- Normal Equation
- don't need to choose
$\alpha$ - don't need to iterate
- slow if
$n$ is very large (computing$(X^TX)^{-1}$ has a complexity of$O(n^3)$ ), but is usually ok up until around$n = 10000$

- don't need to choose

Also note that for some learning algorithms, the normal equation is not applicable, whereas gradient descent still works.

There are other advanced optimization algorithms, such as:

- Conjugate gradient
- BFGS
- L-BFGS

These shouldn't be implemented on your own since they require an advanced understanding of numerical computing, even just to understand what they're doing.

They are more complex, but (in the context of machine learning) there's no need to manually pick a learning rate

Prior to applying a machine learning algorithm, data almost always must be *preprocessed*, i.e. prepared in a way that helps the algorithm perform well (or in a way necessary for the algorithm to work at all).

Good features:

- lead to data compression
- retain relevant information
- are based on expert domain knowledge

Common mistakes:

- trying to automate feature selection
- not paying attention to data-specific quirks
- throwing away information unnecessarily

The model where you include all available features is called the **full model**. But sometimes including all features can hurt prediction accuracy.

There are a few feature selection strategies that can be used.

One class of selection strategies is called **stepwise** selection because they iteratively remove or add one feature at a time, measuring the goodness of fit for each. The two approaches here are the **backward-elimination** strategy which begins with the full model and removes one feature at a time, and the **forward-selection** strategy which is the reverse of backward-elimination, starting with one feature and adding the rest one at a time. These two strategies don't necessarily lead to the same model.

Your data may have features explicitly present, e.g. a column in a database. But you can also design or *engineer* new features by combining these explicit features or through observing patterns on your own in the data that haven't yet been explicitly encoded. We're doing a form of this in polynomial regression above by encoding the polynomials as new features.

A very important choice in machine learning is how you represent the data. What are its salient features, and in what form is it best presented? Each field in the data (e.g. column in the table) is a **feature** and a great deal of time is spent getting this representation right. The best machine learning algorithms can't do much if the data isn't represented in a way suited to the task at hand.

Sometimes it's not clear how to represent data. For instance, in identifying an image of a car, you may want to use a wheel as a feature. But how do you define a wheel in terms of pixel values?

**Representation learning** is a kind of machine learning in which representations themselves can be learned.

An example representation learning algorithm is the **autoencoder**. It's a combination of an *encoder* function that converts input data into a different representation and a *decoder* function which converts the new representation back into its original format.

Successful representations separate the *factors of variations* (that is, the contributors to variability) in the observed data. These may not be explicit in the data, "they may exist either as unobserved objects or forces in the physical world that affect the observable quantities, or they are constructs in the human mind that provide useful simplifying explanations or inferred causes of the observed data." (*Deep Learning*).

Deep learning builds upon representation learning. It involves having the program learn some hierarchy of concepts, such that simpler concepts are used to construct more complicated ones. This hierarchy of concepts forms a deep (many-layered) graph, hence "deep learning".

With deep learning we can have simpler representations aggregate into more complex abstractions.

A basic example of a deep learning model is the multilayer perceptron (MLP), which is essentially a function composed of simpler functions (layers); each function (i.e. layer) can be thought of as taking the input and outputting a new representation of it.

For example, if we trained a MLP for image recognition, the first layer may end up learning representations of edges, the next may see corners and contours, the next may identify higher level features like faces, etc.

If you design your features such that they are on a similar scale, gradient descent can converge more quickly.

For example, say you are developing a model for predicting the price of a house. Your first feature may be the area, ranging from 0-2000 sqft, and your second feature may be the number of bedrooms, ranging from 1-5.

These two ranges are very disparate, causing the contours of the cost function to be such that the gradient descent algorithm jumps around a lot trying to find an optimum.

If you *scale* these features such that they share the same (or at least a similar) range, you avoid this problem.

More formally, with feature scaling you want to get every feature into approximately a

With feature scaling, you could also apply *mean normalization*, where you replace

Mean subtraction centers the data around the origin (i.e. it "zero-centers" it), simply by subtracting each feature's mean from itself.

Sometimes some of your features may be redundant. You can combine these features in such a way that you project your higher dimension representation into a lower dimension representation while minimizing information loss. With the reduction in dimensionality, your algorithms will run faster.

The most common technique for dimensionality reduction is *principal component analysis* (PCA), although other techniques, such as non-negative matrix factorization (NMF) can be used.

Say you have some data. This data has two dimensions, but you could more or less capture it in one dimension:

Most of the variability of the data happens along that axis.

This is basically what PCA does.

PCA is the most commonly used algorithm for dimensionality reduction. PCA tries to identify a lower-dimensional surface to project the data onto such that the square *projection error* is minimized.

PCA might project the data points onto the green line on the left. The projection error are the blue lines. Compare to the line on the right - PCA would not project the data onto that line since the projection error is much larger for that line.

This example is going from 2D to 1D, but you can use PCA to project from any

Note that this is different than linear regression, though the example might look otherwise. In PCA, the projection error is orthogonal to the line in question. In linear regression, it is vertical to the line. Linear regression also favors the target variable

Prior to PCA you should perform mean normalization (i.e. ensure every feature has zero mean) on your features and scale them.

First you compute the *covariance matrix*, which is denoted

Then, you compute the *eigenvectors* of the matrix *singular value decomposition*:

The resulting

So how do you choose

One way to choose

If the average squared projection error (which is what PCA tries to minimize) is:

And the total variation in the data is given by:

Then you would choose the smallest value of

That is, so that 99% of variance is retained.

This procedure for selecting

The

Or, to put it another way:

In practice, you can reduce the dimensionality quite drastically, such as by 5 or 10 times, such as from 10,000 features to 1,000, and retain variance.

But you should not use PCA prematurely - first try an approach without it, then later you can see if it helps.

The process of using principal component analysis (PCA) to reduce dimensionality of data is called *factor analysis*.

In factor analysis, the retained principal components are called *common factors* and their correlations with the input variables are called *factor loadings*.

PCA becomes more reliable the more data you have. The number of examples must be larger than the number of variables in the input matrix. The assumptions of linear correlation must hold as well (i.e. that the variables must be linearly related).

You can go a step further with the resulting *PCA whitening*, which can improve the training process.

PCA whitening is used to decorrelate features and equalize the variance of the features.

Thus the first step is to decorrelate the original data

Then the data is normalized to have a variance of 1 for all of its components. To do so we just divide each component by the square root of its eigenvalue. An epsilon value is included to prevent division by zero:

Basic idea: Generate more data from your existing data by resampling

Bagging (stands for Bootstrap Aggregation) is the way decrease the variance of your prediction by generating additional data for training from your original dataset using combinations with repetitions to produce multisets of the same cardinality/size as your original data. By increasing the size of your training set you can't improve the model predictive force, but just decrease the variance, narrowly tuning the prediction to expected outcome. - http://stats.stackexchange.com/a/19053/55910

*Univariate linear regression* or *simple linear regression* (SLR) is linear regression with a single variable.

In univariate linear regression, we have one input variable

The hypothesis takes the form:

Where the *parameters* that the learning algorithm learns.

This should look familiar: it's just a line.

The general idea is that you want to choose your parameters so that

To the math easier, you multiply everything by

This is the *cost function* (or *objective function*). In this case, we call it

Here it is the *squared error function* - it is probably the most commonly used cost function for regression problems.

The squared error loss function is not the only loss function available. There are a variety you can use, and you can even come up with your own if needed. Perhaps, for instance, you want to weigh positive errors more than negative errors.

We want to find *minimize*

For univariate linear regression, the derivatives are:

so overall, the algorithm involves repeatedly updating:

Remember that the *simultaneously*.

Note that because we are summing over *all* training examples for each step, this particular type of gradient descent is known as *batch gradient descent*. There are other approaches which only sum over a subset of the training examples for each step.

Univariate linear regression's cost function is always convex ("bowl-shaped"), which has only one optimum, so gradient descent int his case will always find the global optimum.

*Multivariate* linear regression is simply linear regression with multiple variables. This technique is for using multiple features with linear regression.

Say we have:

$n$ = number of features$x^{(i)}$ = the input features of the$i$ th training example$x_j^{(i)}$ = the value of feature$j$ in the$i$ th training example

Instead of the simple linear regression model we can use a **generalized linear model** (GLM). That is, the hypothesis

For convenience of notation, you can define

And the hypothesis can be re-written as:

Sometimes in multiple regression you may have predictor variables which are correlated with one another; we say that these predictors are **collinear**.

The previous gradient descent algorithm for univariate linear regression is just generalized (this is still repeated and simultaneously updated):

```
"""
- X = feature vectors
- y = labels/target variable
- theta = parameters
- hyp = hypothesis (actually, the vector computed from the hypothesis function)
"""
import numpy as np
def cost_function(X, y, theta):
"""
This isn't used, but shown for clarity
"""
m = y.size
hyp = np.dot(X, theta)
sq_err = sum(pow(hyp - y, 2))
return (0.5/m) * sq_err
def gradient_descent(X, y, theta, alpha=0.01, iterations=10000):
m = y.size
for i in range(iterations):
hyp = np.dot(X, theta)
for i, p in enumerate(theta):
temp = X[:,i]
err = (hyp - y) * temp
cost_function_derivative = (1.0/m) * err.sum()
theta[i] = theta[i] - alpha * cost_function_derivative
return theta
if __name__ == '__main__':
def true_function(X):
# Create random parameters for X's dimensions, plus one for x0.
true_theta = np.random.rand(X.shape[1] + 1)
return true_theta[0] + np.dot(true_theta[1:], X.T), true_theta
# Create some random data
n_samples = 20
n_dimensions = 5
X = np.random.rand(n_samples, n_dimensions)
y, true_theta = true_function(X)
# Add a column of 1s for x0
ones = np.ones((n_samples, 1))
X = np.hstack([ones, X])
# Initialize parameters
theta = np.zeros((n_dimensions+1))
# Split data
X_train, y_train = X[:-1], y[:-1]
X_test, y_test = X[-1:], y[-1:]
# Estimate parameters
theta = gradient_descent(X_train, y_train, theta, alpha=0.01, iterations=10000)
# Predict
print('true', y_test)
print('pred', np.dot(X_test, theta))
print('true theta', true_theta)
print('pred theta', theta)
```

Outliers can pose a problem for fitting a regression line. Outliers that fall horizontally away from the rest of the data points can influence the line more, so they are called points with **high leverage**. Any such point that actually does influence the line's slope is called an **influential point**. You can examine this effect by removing the point and then fitting the line again and seeing how it changes.

Outliers should only be removed with good reason - they can still be useful and informative and a good model will be able to capture them in some way.

Your data may not fit a straight line and might be better described by a polynomial function, e.g.

A trick to this is that you can write this in the form of plain old multivariate linear regression. You would, for example, just treat

Note that in situations like this, feature scaling is very important because these features' ranges differ by a lot due to the exponents.

Logistic regression is a common approach to classification. The name "regression" may be a bit confusing - it is a classification algorithm, though it returns a continuous value. In particular, it returns the probability of the positive class; if that probability is

Logistic regression outputs a value between zero and one (that is,

Say we have our hypothesis function

With logistic regression, we apply an additional function

where

This function is known as the *sigmoid* function, also known as the *logistic* function, with the form:

So in logistic regression, the hypothesis ends up being:

The output of the hypothesis is interpreted as the probability of the given input belonging to the positive class. that is:

Which is read: "the probability that

Since we are classifying input, we want to output a label, not a continuous value. So we might say *decision boundary*. Note that decision boundaries can be non-linear as well (e.g. they could be a circle or something).

Logistic regression is also a GLM - you're fitting a line which models the probability of being in the positive class. We can use the Bernoulli distribution since it models events with two possible outcomes and is parameterized by only the probability of the positive outcome,

But to represent a probability, **logit transformation**:

The

The

The inverse of the logit transformation is:

So the model is now:

So the likelihood here is:

And the log likelihood then is:

Linear regression is good for explaining continuous dependent variables. But for discrete variables, linear regression gives ambiguous results - what does a fractional result mean? It can't be interpreted as a probability because linear regression models are not bound to

When dealing with boolean/binary dependent variables you can use **logistic regression**. When dealing with non-binary discrete dependent variables, you can use **Poisson regression** (which is a GLM that uses the log link function).

So we expect the logistic regression function to output a probability. In linear regression, the model can output any value, not bound to **logit transformation**, so that our resulting values can be interpreted as probability:

So if we solve the original regression equation for

Logistic regression does not have a closed form solution - that is, it can't be solved in a finite number of operations, so we must estimate its parameters using other methods, more specifically, we use iterative methods. Generally the goal is to find the **maximum likelihood estimate** (MLE), which is the set of parameters that maximizes the likelihood of the data. So we might start with random guesses for the parameters, the compute the likelihood of our data (that is, we can compute the probability of each data point; the likelihood of the data is the product of these individual probabilities) based on these parameters. We iterate until we find the parameters which maximize this likelihood.

The technique of *one-vs-all* (or *one-vs-rest*) involves dividing your training set into multiple binary classification problems, rather than as a single multiclass classification problem.

For example, say you have three classes 1,2,3. Your first binary classifier will distinguish between class 1 and classes 2 and 3,, your second binary classifier will distinguish between class 2 and classes 1 and 3, and your final binary classifier will distinguish between class 3 and classes 1 and 2.

Then to make the prediction, you pick the class

Softmax regression generalizes logistic regression to beyond binary classification (i.e. multinomial classification; that is, there are more than just two possible classes). Logistic regression is the reduced form of softmax regression where

In the case of many, many classes, the *hierarchical* variant of softmax may be preferred. In hierarchical softmax, the labels are structured as a hierarchy (a tree). A Softmax classifier is trained for each node of the tree, to distinguish the left and right branches.

There is a class of machine learning models known as *generalized linear models* (GLMs) because they are expressed as a linear combination of parameters, i.e.

We can use linear models for non-regression situations, as we do with logistic regression - that is, when the output variable is not an unbounded continuous value directly computed from the inputs (that is, the output variable is not a linear function of the inputs), such as with binary or other kinds of classification. In such cases, the linear models we used are called *generalized linear models*. Like any linear function, we get some value from our inputs, but we then also apply a *link function* which transforms the resulting value into something we can use. Another way of putting it is that these link functions allow us to *generalize* linear models to other situations.

Linear regression also assumes *homoscedasticity*; that is, that the variance of the error is uniform along the line. GLMs do not need to make this assumption; the link function transforms the data to satisfy this assumption.

For example, say you want to predict whether or not someone will buy something - this is a binary classification and we want either a 0 or a 1. We might come up with some linear function based on income and number of items purchased in the last month, but this won't give us a 0/no or a 1/yes, it will give us some continuous value. So then we apply some link function of our choosing which turns the resulting value to give us the probability of a 1/yes.

Linear regression is also a GLM, where the link function is the identity function.

Logistic regression uses the logit link function.

Logistic regression is a type of models called **generalized linear models** (GLM), which involves two steps:

- Model the response variable with a probability distribution.
- Model the distribution's parameters using the predictor variables and a special form of multiple regression.

This probability distribution is taken from the exponential family of probability distributions, which includes the normal, Bernoulli, beta, gamma, Dirichlet, and Poisson distributions (among others). A distribution is in the exponential family if it can be written in the form:

**natural parameter** or the **canonical parameter** of the distribution, **sufficient statistics**, which is often just **log partition function**.

We can set

For instance, the Bernoulli distribution is in the exponential family, where

Same goes for the Gaussian distribution, where

Note that with linear models, you should avoid **extrapolation**, that is, estimating values which are outside the original data's range. For example, if you have data in some range

In a linear model there may be *mixed effects*, which includes *fixed* and *random* effects. Fixed effects are variables in your model where their coefficients are fixed (non-random). Random effects are variables in your model where their coefficients are random.

For example, say you want to create a model for crop yields given a farm and amount of rainfall. We have data from several years and the same farms are represented multiple times throughout. We could consider that some farms may be better at producing greater crop yields given the same amount of rainfall as another farm. So we expect that samples from different farms will have different variances - e.g. if we look at just farm A's crop yields, that sample would have different variance than if we just looked at farm B's crop yields. In this regard, we might expect that models for farm A and farm B will be somewhat different.

The naive approach would be to just ignore differences between farms and consider only rainfall as a fixed effect (i.e. with a fixed/constant coefficient). This is sometimes called "pooling" because we've lumped everything (in our case, all the farms) together.

We could create individual models for each farm ("no pooling") but perhaps for some farms we only have one or two samples. For those farms, we'd be building very dubious models since their sample sizes are so small. The information from the other farms are still useful for giving us more data to work with in these cases, so no pooling isn't necessarily a good approach either.

We can use a mixed model ("partial pooling") to capture this and make it so that the rainfall coefficient random, varying by farm.

We may run into situations like the following:

Where our data seems to encompass multiple models (the red, green, blue, and black ones going up from left to right), but if we try to model them all simultaneously, we get a complete incorrectly model (the dark grey line going down from left to right).

Each of the true lines (red, green, blue, black) may come from distinct *units*, i.e. each could represent a different part of the day or a different US state, etc. When there are different effects for each unit, we say that there is *unit heterogeneity* in the data.

In the example above, each line has a different intercept. But the slopes could be different, or both the intercepts and slopes could be different:

In this case, we use a random-effects model because some of the coefficients are random.

For instance, in the first example above, the intercepts varied, in which case the intercept coefficient would be replaced with a random variable

Or in the case of the slopes varying, we'd say that

When both slope and intercept vary, we draw them together from a multivariate normal distribution since they may have some relation, i.e.

Now consider when there are multiple levels of these effects that we want to model. For instance, perhaps there are differences across US states but also differences across US regions.

In this case, we will have a hierarchy of effects. Let's say only the intercept is affected - if we wanted to model the effects of US regions and US states on separate levels, then the

SVMs can be powerful for learning non-linear functions and are widely-used.

With SVMs, the optimization objective is:

Where the term at the end is the regularization term. Note that this is quite similar to the objective function for logistic regression; we have just removed the

If we break up the logistic regression objective function into terms (that is, the first sum and the regularization term), we might write it as

The SVM objective is often instead notated by convention as

With that representation in mind, we can rewrite the objective by replacing the

The SVM hypothesis is:

SVMs are sometimes called *large margin* classifiers.

Take the following data:

On the left, a few different lines separating the data are drawn. The optimal one found by SVM is the one in orange. It is the optimal one because it has the largest margins, illustrated by the red lines on the right (technically, the margin is orthogonal from the decision boundary *to* those red lines). When

However, outliers can throw SVM off if your regularization parameter

*Kernels* are the main technique for adapting SVMs to do complex non-linear classification.

A note on notation. Say your hypothesis looks something like:

We can instead notate each non-parameter term as a feature

For SVMs, how do we choose these features?

What we can do is compute features based on

Here, the *kernel*, sometimes just notated

We have a choice in what kernel function we use; here we are using *Gaussian kernels*. In the Gaussian kernel we have a parameter

If

With this approach, classification becomes based on distances to the landmarks - points that are far away from certain landmarks will be classified 0, points that are close to certain landmarks will be classified 1. And thus we can get some complex decision boundaries like so:

So how do you choose the landmarks?

You can take each training example and place a landmark there. So if you have

So given

Then given a training example

Then instead of

Note that here

Of course, using a landmark for each of your training examples makes SVM difficult on large datasets. There are some implementation tricks to make it more efficient, though.

When choosing the regularization parameter

- A large
$C$ means lower bias, high variance - A small
$C$ means higher bias, low variance

For the Gaussian kernel, we also have to choose the parameter

- A large
$\sigma^2$ means that features$f_i$ vary more smoothly. Higher bias, lower variance. - A small
$\sigma^2$ means that features$f_i$ vary less smoothly. Lower bias, higher variance.

When using SVM, you also need to choose a kernel, which could be the Gaussian kernel, or it could be no kernel (i.e. a linear kernel), or it could be one of many others. The Gaussian and linear kernels are by far the most commonly used.

You may want to use a linear kernel if

The Gaussian kernel is appropriate if

Not all similarity functions make valid kernels - they must satisfy a condition called Mercer's Theorem which allows the optimizations that most SVM implementations provide and also so they don't diverge.

Other off-the-shelf kernels include:

- Polynomial kernel:
$k(x,l) = (x^T l)^2$ , or$k(x,l) = (X^T l)^3$ , or$k(x,l + 1)^3$ , etc (there are many variations), the general form is$(x^Tl + \text{constant})^{\text{degree}}$ . It usually performs worse than the Gaussian kernel. - More esoteric ones: String kernel, chi-square kernel, histogram intersection kernel, ...

But these are seldom, if ever, used.

Some SVM packages have a built-in multi-class classification functionality. Otherwise, you can use the one-vs-all method. That is, train

If

If

If

SVM without a kernel works out to be similar to logistic regression for the most part.

Neural networks are likely to work well for most of these situations, but may be slower to train.

The SVM's optimization problem turns out to be convex, so good SVM packages will find global minimum or something close to it (so no need to worry about local optima).

Other rules of thumb:

- Use linear kernel when number of features is larger than number of observations.
- Use gaussian kernel when number of observations is larger than number of features.
- If number of observations is larger than 50,000 speed could be an issue when using gaussian kernel; hence, one might want to use linear kernel.

Source

Also:

Usually, the decision is whether to use linear or an RBF (aka Gaussian) kernel. There are two main factors to consider:

Solving the optimisation problem for a linear kernel is much faster, see e.g. LIBLINEAR.

Typically, the best possible predictive performance is better for a nonlinear kernel (or at least as good as the linear one).It's been shown that the linear kernel is a degenerate version of RBF, hence the linear kernel is never more accurate than a properly tuned RBF kernel. Quoting the abstract from the paper I linked:

The analysis also indicates that if complete model selection using the Gaussian kernel has been conducted, there is no need to consider linear SVM.

A basic rule of thumb is briefly covered in NTU's practical guide to support vector classification (Appendix C).

If the number of features is large, one may not need to map data to a higher dimensional space. That is, the nonlinear mapping does not improve the performance. Using the linear kernel is good enough, and one only searches for the parameter C.

Your conclusion is more or less right but you have the argument backwards. In practice, the linear kernel tends to perform very well when the number of features is large (e.g. there is no need to map to an even higher dimensional feature space). A typical example of this is document classification, with thousands of dimensions in input space.

In those cases, nonlinear kernels are not necessarily significantly more accurate than the linear one. This basically means nonlinear kernels lose their appeal: they require way more resources to train with little to no gain in predictive performance, so why bother.

TL;DRAlways try linear first since it is way faster to train (AND test). If the accuracy suffices, pat yourself on the back for a job well done and move on to the next problem. If not, try a nonlinear kernel.

Source

*Support vector machines* is another way of coming up with decision boundaries to divide a space.

Here the decision boundary is positioned so that its *margins* are as wide as possible.

We can consider some vector

Then we can consider an unknown vector

We can compute their dot product,

To make things easier to work with mathematically, we set

This is our decision rule: if this inequality is true, we have a positive example.

Now we will define a few things about this system:

Where

For mathematical convenience, we will define another variable

So we can rewrite our constraints as:

Which ends up just collapsing into:

Or:

We then add an additional constraint for an

So how do you compute the total width of the margins?

You can take a negative example

Using our previous constraints we get

We want to maximize the margins, that is, we want to maximize the width, and we can divide by

Let's turn this into something we can maximize, incorporating our constraints. We have to use Lagrange multipliers which provide us with this new function we can maximize without needing to think about our constraints anymore:

(Note that the Lagrangian is an objective function which includes equality constraints).

Where

So then to get the maximum, we just compute the partial derivatives and look for zeros:

Let's take these partial derivatives and re-use them in the original Lagrangian:

Which simplifies to:

We see that this depends on

Similarly, we can rewrite our decision rule, substituting for

And similarly we see that this depends on

The nice thing here is that this works in a convex space (proof not shown) which means that it cannot get stuck on a local maximum.

Sometimes you may have some training data *is* linearly separable.

Since the maximization and the decision rule depend only on the dot products of vectors, we can just substitute the transformation, so that:

- we want to maximize
$\phi(\vec x_i) \cdot \phi(\vec x_j)$ - for the decision rule, we have
$\phi(\vec x_i) \cdot \phi(\vec u)$

Since these are just dot products between the transformed vectors, we really only need a function which gives us that dot product:

This function *kernel* function.

So if you have the kernel function, you don't even need to know the specific transformation - you just need the kernel function.

Some popular kernels:

- linear kernel:
$K(\vec u, \vec v) = (\vec u \cdot \vec v + 1)^n$ - radial basis kernel:
$e^{-\frac{||\vec x_i - \vec x_j||}{\sigma}}$

Many machine learning algorithms can be written in the form:

Where

We can substitute **kernel**.

Thus we are left with:

Machine learning algorithms which use this trick are called **kernel methods** or **kernel machines**.

Basic algorithm:

- Start with data all in one group
- Find some criteria which best splits the outcomes
- Divide the data into two groups (which become the leaves) on that split (which becomes a node)
- Within each split, repeat
- Repeat until the groups are too small or are sufficiently "pure" (homogeneous)

Classification trees are non-linear models:

- They use interactions b/w variables
- Data transformations may be less important (monotone transformations probably won't affect how data is split)
- Trees can be used for regression problems (continuous outcome)

That is, within the

**misclassification error**:$1 - \hat p_{mk(m)} ; k(m) = \text{most; common; k}$ - 0 = perfect purity
- 0.5 = no purity

**Gini index**:$\sum_{k \neq k'} \hat p_{mk} \times \hat p_{mk'} = \sum_{k=1}^K \hat p_{mk} (1 - \hat p_{mk}) = 1 - \sum_{k=1}^K p^2_{mk}$ - 0 = perfect purity
- 0.5 = no purity

**Deviance/information gain**:$- \sum^K_{k=1} \hat p_{mk} \log_2 \hat p_{mk}$ - 0 = perfect purity
- 1 = no purity

Random forests are the ensemble model version of decision trees.

Basic idea:

- Bootstrap samples (i.e. resample)
- At each split in the tree, bootstrap the variables (i.e. only a subset of the variables is considered at each split)
- Grow multiple trees
- Each tree votes on a classification

This can be very accurate but slow, prone to overfitting (cross-validation helps though), and not easy to interpret. However, they generally perform very well.

The hinge loss function takes the form

where

$N$ = number of samples$C$ = number of classes

Typically used for Softmax classifiers.

Basically, taking many models and combining their outputs as a weighted average.

Basic idea:

- Take lots of (possibly) weak predictors
$h_1, \dots, h_k$ , e.g. a bunch of different trees or regression models or different cutoffs. - Weight them and combine them by creating a classifier which combine the predictors:
$f(x) = \text{sign}(\sum^T_{t=1} \alpha_t h_t(x))$ - Goal is to minimize error on training set
- Iteratively select a classifier
$h$ at each step - Calculate weights based on errors
- Increase the weight of missed classifications and select the next classifier
- The sign of the result tells you the class

*Adaboost* is a popular boosting algorithm.

One class of boosting is *gradient boosting*.

Boosting typically does very well.

Here we focus on binary classification.

Say we have a classifier

We have some error rate, which ranges from 0 to 1. A weak classifier is one where the error is just less than 0.5 (that is, it works slightly better than chance). A stronger classifier has an error rate closer to 0.

Let's say we have several weak classifiers,

We can combine them into a bigger classifier, *vote* on the classification:

How do we generate these weak classifiers?

- We can create one by taking the data, training classifiers on it, and selecting with the smallest error rate (this will be classifier
$h_1$ .) - We can create another by taking the data and giving it some exaggeration of
$h_1$ 's errors (e.g. pay more attention to the samples that$h_1$ has trouble one). Training a new classifier on this gives us$h_2$ . - We can create another by taking the data and giving it some exaggeration to the samples where the results of
$h_1 \neq h_2$ . Training a new classifier on this gives us$h_3$ .

This process can be recursive. That is,

For our classifiers we could use *decision tree stumps*, which is just a single test to divide the data into groups (i.e. just a part of a fuller decision tree). Note that boosting doesn't have to use decision tree (stumps), it can be used with any classifier.

We can assign a weight to each training example,

We can compute the error

For our aggregate classifier, we may want to weight the classifiers with the weights

The general algorithm is:

- We can set the starting weights
$w^t_i$ for our training examples to be$\frac{1}{N}$ where$N$ is the number of examples and$t=1$ , representing the time (or the iteration). - Then we pick a classifier
$h^t$ which minimizes the error rate. - Then we can pick
$\alpha^t$ . - And we can calculate
$w^{t+1}$ . - Then repeat.

Now suppose

We want to minimize the error bound for

Stacking is similar to boosting, except that you also learn the weights for the weighted average by wrapping the ensemble of models into another model.

*Overfitting* is a problem where your hypothesis describes the training data *too* well, to the point where it cannot generalize to new examples. It is a *high variance* problem. In contrast, *underfitting* is a *high bias* problem.

To clarify, if your model has no bias, it means that it makes no errors on your training data (i.e. it does not underfit). If your model has no variance, it means your model generalizes well on your test data (i.e. it does not overfit). It is possible to have bias and variance problems simultaneously.

Another way to think of this is that:

- variance = how much does the model vary if the training data changes? I.e. what space of possible models does this cover? High variance implies that the model is too sensitive to the particular training examples it looked at, and thus will not adapt well to other examples.
- bias = is the average model close to the "true" solution/model? High bias means that the model is systematically incorrect.

There is a *bias-variance trade-off*, in which improvement of one is typically at the detriment of the other.

You can think of generalization error as the sum of bias and variance. You want to keep both low, if possible.

Overfitting can happen if your hypothesis is too complex, which can happen if you have too many features. So you will want to through a *feature selection* phase and pick out features which seem to provide the most value.

Alternatively, you can use the technique of *regularization*, in which you keep all your features, but reduce the magnitudes/values of parameters

**Regularization** can be defined as any method which aims to reduce the generalization error of a model though it may not have the same effect on the training error. Since good generalization error is the main goal of machine learning, regularization is essential to success.

Perhaps the most common form of regularization aims to favor smaller parameters. The intuition is that, if you have small values for your parameters

In practice, there may be many combination of parameters which fit your data well. However, some may overfit/not generalize well. We want to introduce some means of valuing these simpler hypotheses over more complex ones (i.e. with larger parameters). We can do so with regularization.

So generally regularization is about shrinking your parameters to make them smaller; for this reason it is sometimes called *weight decay*. For linear regression, you accomplish this by modifying the cost function to include the term

Note that we are not shrinking

*regularization parameter*. It tunes the balance between fitting the data and keeping the parameters small (i.e. each half of the cost function). If you make *too* large for your problem, you may make your parameters too close to 0 for them to be meaningful - large values of lambda can lead to underfitting problems (since the parameters get close to 0).

The additional *regularization loss*, and the rest of the loss function is called the *data loss*.

A regularization method used in linear regression; the L2 norm of the parameters is constrained so that it less than or equal to some specified value (that is, this is L2 regularization):

Where:

$\lambda \geq 0$ is a hyperparameter controlling the amount of shrinkage.$N$ is the number of data points$p$ is the number of dimensions

**LASSO** (Least Absolute Shrinkage and Selection Operator) is a regularization method which constrains the L1 norm of the parameters such that it is less than or equal to some specified value:

(These two regularization methods are sometimes called *shrinkage* methods)

We can update gradient descent to work with our regularization term:

The

If we are using the normal equation, we can update it to a regularized form as well:

Where

We can also update the logistic regression cost function with the regularization term:

Then we can update gradient descent with the new derivative of this cost function for the parameters

It looks the same as the one for linear regression, but again, the actual hypothesis function

Fundamentally, machine learning is all about data:

- Stochastic, chaotic, and/or complex generative processes
- Noisily observed
- Partially observed

So there is a lot of uncertainty - we can use probability theory to express this uncertainty in the form of *probabilistic models*. Generally, learning probabilistic models is known as *probablistic machine learning*; here we are primarily concerned with non-Bayesian machine learning.

We have some data

We have a parameter

A probabilistic model is just a parameterized joint distribution over all the variables:

We usually interpret such models as a *generative* model - how was our observed data *generated* by the world?

So the problem of inference is about learning about our latent variables given the observed data, which we can get via the posterior distribution:

Learning is typically posed as a *maximum likelihood* problem; that is, we try to find

Then, to make a prediction we want to compute the conditional distribution of some future data:

Or, for classification, if we some classes, each parameterizing a joint distribution, we want to pick the class which maximizes the probability of the observed data:

*Discriminative* learning algorithms include algorithms like logistic regression, decision trees, kNN, and SVM. Discriminative approaches try to find some way of separating data (*discriminating* them), such as in logistic regression which tries to find a dividing line and then sees where new data lies in relation to that line. They are unconcerned with *how* the data was generated.

Say our input features are

Discriminative learning algorithms learn

*Generative* learning algorithms instead tries to develop a model for each class and sees which model new data conforms to.

Generative learning algorithms learn *generated* the given features

It is easier to estimate the conditional distribution

For both discriminative and generative approaches, you will have parameters and latent variables

Say we have some observed values

If we flip this we get the **likelihood** of

*Likelihood* is just the name for the probability of observed data as a function of the parameters.

The **maximum likelihood estimation** is the

(The Expectation-Maximization (EM) algorithm is a way of computing a maximum likelihood estimate for situations where some variables may be hidden.)

If the random variables associated with the values, i.e.

Sometimes this is just notated

So we are looking to estimate the

Logarithms are used, however, for convenience (i.e. dealing with sums rather than products), so instead we are often maximizing the **log likelihood** (which has its maximum at the same value (i.e. the same argmax) as the regular likelihood, though the actual maximum value may be different):

Another way of explaining MLE:

We have some data

(Notation note:

Though the logarithm version mentioned above is typically preferred to avoid underflow.

Typically, we are more interested in the conditional probability

Assuming the examples are iid, this is:

Say we have a coin which may be unfair. We flip it ten times and get `HHHHTTTTTT`

(we'll call this observed data

Here we just have a binomial distribution so the parameters here are

Where

For this case our MLE would be

Also formulated as:

For a Gaussian distribution, the sample mean is the MLE.

The **expectation maximization** (EM) algorithm is a two-staged iterative algorithm.

Say you have a dataset which is missing some values. How can you complete your data?

The EM algorithm allows you to do so.

The two stages work as such:

- Begin with initial parameters
$\hat \theta^{(t)}, t=0$ . - The "E-step"
- Using the current parameters
$\hat \theta^{(t)}$ , compute probabilities for each possible completion of the missing data. - Use these probabilities to create a weighted training set of these possible completions.

- Using the current parameters
- The "M-step"
- Use a modified version of MLE (one which can deal with weighted samples) to derive new parameter estimates,
$\hat \theta^{(t+1)}$ .

- Use a modified version of MLE (one which can deal with weighted samples) to derive new parameter estimates,
- Repeat the E and M steps until convergence.

Intuitively, what EM does is tries to find the parameters

Say we have two coins

We conduct 5 experiments in which we randomly choose one of the coins (with equal probability) and flip it 10 times.

We have the following results:

```
1. HTTTHHTHTH
2. HHHHTHHHHH
3. HTHHHHHTHH
4. HTHTTTHHTT
5. THHHTHHHTH
```

We are still interested in learning a parameter for each coin,

If we knew which coin we flipped during each experiment, this would be a simple MLE problem. Say we did know which coin was picked for each experiment:

```
1. B: HTTTHHTHTH
2. A: HHHHTHHHHH
3. A: HTHHHHHTHH
4. B: HTHTTTHHTT
5. A: THHHTHHHTH
```

Then we just use MLE and get:

That is, for each coin we just compute

But, alas, we are missing the data of which coin we picked for each experiment. We can instead apply the EM algorithm.

Say we initially guess that

We're dealing with a binomial distribution here, so we are using:

The binomial coefficient is the same for both coins (the

For the first experiment we have 5 heads (and

So for this first iteration and for the first experiment, we estimate that the chance of the picked coin being coin

Then we generate the weighted set of these possible completions by computing how much each of these coins, as weighted by the probabilities we just computed, contributed to the results for this experiment (

Then we repeat this for the rest of the experiments, getting the following weighted values for each coin for each experiment:

and sum up the weighted values for each coin:

Then we use these weighted values and MLE to update

And repeat until convergence.

In K-Means we make hard assignments of datapoints to clusters (that is, they belong to only one cluster at a time, and that assignment is binary).

EM is similar to K-Means, but we use soft assignments instead - datapoints can belong to multiple clusters in varying strengths (the "strengths" are probabilities of assignment to each cluster). When the centroids are updated, they are updated against *all* points, weighted by assignment strength (whereas in K-Means, centroids are updated only against their members).

EM converges to approximately the same clusters as K-Means, except datapoints still have some membership to other clusters (though they may be very weak memberships).

In EM, we consider that each datapoint is generated from a mixture of classes.

For each

These terms may be notated:

What this is modeling here is that each centroid is the center of a Gaussian distribution, and we try to fit these centroids and their distributions to the data.

- Review of fundamentals, IFT725. Hugo Larochelle. 2012.
- Exploratory Data Analysis Course Notes. Xing Su.
- Mining Massive Datasets (Coursera & Stanford, 2014). Jure Leskovec, Anand Rajaraman, Jeff Ullman.
- Machine Learning. 2014. Andrew Ng. Stanford University/Coursera.
- CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley (edX).
*Evaluating Machine Learning Models*. Alice Zheng. 2015.- Computational Statistics II (code). Chris Fonnesbeck. SciPy 2015.
- Intro to Artificial Intelligence. CS271. Peter Norvig, Sebastian Thrun. Udacity.
- MIT 6.034 (Fall 2010): Artificial Intelligence. Patrick H. Winston. MIT.
- Deep Learning. Yoshua Bengio, Ian Goodfellow, Aaron Courville.
- CS231n Convolutional Neural Networks for Visual Recognition, Module 1: Neural Networks Part 2: Setting up the Data and the Loss. Andrej Karpathy.
- POLS 509: Hierarchical Linear Models. Justin Esarey.
- Bayesian Inference with Tears. Kevin Knight, September 2009.
- Learning to learn, or the advent of augmented data scientists. Simon Benhamou.
- Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo Larochelle, Ryan P. Adams.
- What is the expectation maximization algorithm?. Chuong B Do & Serafim Batzoglou.
- Gibbs Sampling for the Uninitiated. Philip Resnik, Eric Hardisty. June 2010.
- Maximum Likelihood Estimation. Penn State Eberly College of Science.
- Data Science Specialization. Johns Hopkins (Coursera). 2015.
- Practical Machine Learning. Johns Hopkins (Coursera). 2015.
- Elements of Statistical Learning. 10th Edition. Trevor Hastie, Robert Tibshirani, Jerome Friedman.
- CS231n Convolutional Neural Networks for Visual Recognition, Linear Classification. Andrej Karpathy.
- How does expectation maximization work?. joriki.
- How does expectation maximization work in coin flipping problem. joriki.
- Deep Learning Tutorial - PCA and Whitening. Chris McCormick.