The **slope** of a two-dimensional function (in higher dimensions, the term **gradient** is used instead of "slope"; in particular, the gradient is the vector of partial derivatives) can be thought of as the *rate of change* for that function.

For a linear function

But for non-linear functions, e.g.

**Differentiation** is a way to find another function, called the **derivative** of the original function, that gives us the rate of change (slope) of one variable with respect to another variable.

It tells us how to change the input in order to get a change in the output:

This will become useful later on - many machine learning training methods use derivatives (in particular, multidimensional partial derivatives, i.e. gradients) to determine how to update weights (inputs) in order to reduce error (the output).

Say that we want to compute the rate of change (slope) at a *single point*. How? It takes two points to define a line, which we can easily compute the slope for.

Instead of a single point, we can consider two points that are very, very close together:

Note that sometimes

Their slope is then given by:

We want the two points as close as possible, so we can look at the limit of

That is the derivative of

If this limit exists, we say that the function is **differentiable** at that point.
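This limit definition suggests a direct numerical sketch (the function $f(x) = x^2$ and the step size here are illustrative assumptions, not from the original notes):

```python
def difference_quotient(f, x, h=1e-6):
    """Slope between two points of f that are very close together."""
    return (f(x + h) - f(x)) / h

# For f(x) = x^2 the derivative is 2x, so the slope at x = 3 should be near 6.
slope_at_3 = difference_quotient(lambda x: x**2, 3.0)
```

Shrinking $h$ further brings the approximation closer to the true derivative, mirroring the limit.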

We have a car and have a variable

With differentiation we can get

Note that this is *not* the same as

If instead we want instantaneous velocity - the velocity at a given point in time - we need to have the time interval

This can be read as

- "the rate of change in $x$ with respect to $t$", or
- "an infinitesimal value of $x$ divided by an infinitesimal value of $t$"

For a given function

A derivative of a function

- $f'(x)$
- $D_x [f(x)]$
- $Df(x)$
- $\frac{dy}{dx}$
- $\frac{d}{dx}[y]$

As a special case, if we are looking at a variable with respect to time

$\dot f = \frac{df}{dt}$

**Derivative of a constant function**: For any fixed real number $c$, $\frac{d}{dx}[c] = 0$. This is because a constant function is just a horizontal line (it has a slope of 0).

- **Derivative of a linear function**: For any fixed real numbers $m$ and $c$, $\frac{d}{dx}[mx+c] = m$
- **Constant multiple rule**: For any fixed real number $c$, $\frac{d}{dx}[cf(x)] = c\frac{d}{dx}[f(x)]$
- **Addition rule**: $\frac{d}{dx}[f(x) \pm g(x)] = \frac{d}{dx}[f(x)] \pm \frac{d}{dx}[g(x)]$
- **The power rule**: $\frac{d}{dx}[x^n] = nx^{n-1}$
- **Product rule**: $\frac{d}{dx}[f(x) \cdot g(x)] = f(x) \cdot g'(x) + f'(x) \cdot g(x)$
- **Quotient rule**: $\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right] = \frac{g(x)f'(x) - f(x)g'(x)}{g(x)^2}$

- Apply the addition rule: $\frac{d}{dx}[6x^5] + \frac{d}{dx}[3x^2] + \frac{d}{dx}[3x] + \frac{d}{dx}[1]$
- Apply the linear and constant rules: $\frac{d}{dx}[6x^5] + \frac{d}{dx}[3x^2] + 3 + 0$
- Apply the constant multiple rule: $6\frac{d}{dx}[x^5] + 3\frac{d}{dx}[x^2] + 3$
- Then the power rule: $6(5x^4) + 3(2x) + 3$
- And finally: $30x^4 + 6x + 3$
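As a sanity check on the result above, we can compare $30x^4 + 6x + 3$ against a finite-difference approximation of the original polynomial (the step size and test point are arbitrary choices):

```python
def f(x):
    return 6 * x**5 + 3 * x**2 + 3 * x + 1

def f_prime(x):
    return 30 * x**4 + 6 * x + 3  # the derivative computed above

# A small central difference should agree with f_prime at any point.
h = 1e-6
x = 1.0
approx = (f(x + h) - f(x - h)) / (2 * h)
```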

If a function

This rule can be applied sequentially to nestings (compositions) of many functions:

The chain rule is very useful when you can rewrite a function as a composition of simpler, nested functions.

Given

- We can apply the chain rule:
$\frac{df}{dx} = \frac{df}{du} \cdot \frac{du}{dx}$ . - Then substitute:
$\frac{df}{dx} = \frac{d}{du}[u^3] \cdot \frac{d}{dx}(x^2 + 1)$ . - Then we can just apply the rest of our rules:
$\frac{df}{dx} = 3u^2 \cdot 2x$ . - Then substitute again:
$\frac{df}{dx} = 3(x^2+1)^2 \cdot 2x$ and simplify.
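The chain-rule result can be checked numerically (the step size and test point are arbitrary choices):

```python
def f(x):
    return (x**2 + 1)**3

def df(x):
    return 3 * (x**2 + 1)**2 * 2 * x  # the chain-rule result from above

# Central-difference check at an arbitrary point.
h = 1e-6
x = 1.5
approx = (f(x + h) - f(x - h)) / (2 * h)
```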

The derivative of a function as described above is the **first derivative**.

The **second derivative**, or **second order derivative**, is the derivative of the first derivative, denoted

There's also the third derivative,

Any derivative beyond the first is a **higher order derivative**.

The above notation gets unwieldy, so there are alternate notations.

For the

$f^{(n)}(x)$ (this is to distinguish from$f^n(x)$ which is the quantity$f(x)$ raised to the$n^{th}$ power)$\frac{d^n f}{dx^n}$ (Leibniz notation)$\frac{d^n}{dx^n}[f(x)]$ (another form of Leibniz notation)$D^n f$ (Euler's notation)

When dealing with multiple variables, there is sometimes the option of **explicit differentiation**. This simply involves expressing one variable in terms of the other.

For example:

Here it is easy to apply the chain rule:

Implicit differentiation is useful for differentiating equations which cannot be explicitly differentiated because it is impossible to isolate variables. With implicit differentiation, you do not need to define one of the variables in terms of the other.

For example, using the same equation from before:

First, differentiate with respect to

To differentiate

So returning to our other in-progress derivative:

We can substitute and bring it to completion:

A **global maximum** (or *absolute maximum*) of a function

A **global minimum** (or *absolute minimum*) of a function

The **extreme value theorem** states that if

Note that at *any* **extremum** (i.e. a minimum or a maximum), global or local, the slope is 0 because the graph stops rising/falling and "turns around". For this reason, extrema are also called **stationary points** or *turning points*.

Thus, the first derivative of a function is equal to 0 at extrema. But the converse does not hold: a point where the first derivative equals 0 is not always an extremum. This is because a slope of 0 may also be found at a point of **inflection**:

To discern extrema from inflection points, you can use the *extremum test*, aka the *second derivative test*.

If the second derivative at the stationary point is positive (the slope is increasing), we have a minimum; if it is negative (the slope is decreasing), we have a maximum.

The intuition here is that the rate of change is *also* changing at extrema (e.g. it is going from a positive slope to a negative slope, which indicates a maximum, or the reverse, which indicates a minimum).

However, if the second derivative is also 0, then we still have not distinguished the point. It may be a saddle point or on a flat region. What you can do is continue differentiating until you get a non-zero result.
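A numerical sketch of the second derivative test (the function $f(x) = x^3 - 3x$ and the step size are illustrative assumptions; its stationary points are $x = \pm 1$):

```python
def second_derivative(f, x, h=1e-4):
    """Central-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

f = lambda x: x**3 - 3 * x  # f'(x) = 3x^2 - 3, stationary at x = -1 and x = 1

at_max = second_derivative(f, -1.0)  # negative second derivative: a maximum
at_min = second_derivative(f, 1.0)   # positive second derivative: a minimum
```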

If we take

However, if

A **critical point** is a point where the function's derivative is either 0 or undefined.

If a function

This is basically saying that if you have an interval which ends with the same value it starts with, at some point in that curve the slope will be 0:

If

This is basically saying that there is some point on the interval where its instantaneous slope is equal to the average slope of the interval.

Rolle's Theorem is a special case of the Mean Value Theorem where

An **indeterminate limit** is one which results in

If

If the resulting limit here is also indeterminate, you can re-apply L'Hopital's rule until it is not.
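A standard worked example (not from the original notes): $\lim_{x \to 0} \frac{\sin x}{x}$ has the indeterminate form $\frac{0}{0}$, so we differentiate the numerator and denominator separately:

$$\lim_{x \to 0} \frac{\sin x}{x} = \lim_{x \to 0} \frac{\cos x}{1} = \cos 0 = 1$$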

Note that

Certain functions can be expressed as an expansion of themselves around a point. Such an expansion is called a **Taylor series** and is an infinite sum of that function and its derivatives around that point.

When the expansion is around the point 0, it is called a **Maclaurin series**.

How can we find the area under a graph?

We can try to approximate the area using a finite number (

The more rectangles (i.e. increasing

So we can have

Say we have a function

The endpoint of a subinterval can be denoted

For each

Thus, for the

or

So the total area for the interval is:

This kind of area approximation is called a **Riemann sum**.

The best approximation then is:

So we define the **definite integral**:

Suppose

where

In this expression, $f(x)$ is called the **integrand**, and $a$ and $b$ are the **lower limit** and **upper limit** of integration.

A right-handed Riemann sum is just one where
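A right-handed Riemann sum is straightforward to sketch in code (the integrand $x^2$ and the choice of $n$ are illustrative assumptions):

```python
def right_riemann_sum(f, a, b, n):
    """Sum the areas of n rectangles whose heights come from right endpoints."""
    dx = (b - a) / n
    return sum(f(a + i * dx) * dx for i in range(1, n + 1))

# The area under x^2 on [0, 1] is exactly 1/3; more rectangles get closer.
approx = right_riemann_sum(lambda x: x**2, 0.0, 1.0, 100_000)
```

Increasing $n$ shrinks the error of each rectangle, approaching the definite integral in the limit.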

- **The constant rule**: $\int_a^b cf(x)dx = c\int_a^b f(x)dx$
  - A special case rule for integrating constants is: $\int_a^b c\,dx = c(b-a)$
- **Addition and subtraction rule**: $\int_a^b (f(x) \pm g(x))dx = \int_a^b f(x)dx \pm \int_a^b g(x)dx$
- **The comparison rule**:
  - Suppose $f(x) \ge 0$ for all $x$ in $[a, b]$. Then $\int_a^b f(x)dx \ge 0$.
  - Suppose $f(x) \ge g(x)$ for all $x$ in $[a, b]$. Then $\int_a^b f(x)dx \ge \int_a^b g(x)dx$.
  - Suppose $M \ge f(x) \ge m$ for all $x$ in $[a, b]$. Then $M(b-a) \ge \int_a^b f(x)dx \ge m(b-a)$.
- **Additivity with respect to endpoints**: Suppose $a < c < b$. Then $\int_a^b f(x)dx = \int_a^c f(x)dx + \int_c^b f(x)dx$.
  - This is basically saying the area under the graph from $a$ to $b$ is equal to the area under the graph from $a$ to $c$ plus the area under the graph from $c$ to $b$, so long as $c$ is some point between $a$ and $b$.
- **Power rule of integration**: As long as $n \neq -1$, and either $0 \notin [a,b]$ or $n > 0$, $\int_a^b x^n dx = \frac{x^{n+1}}{n+1}\Big|_a^b = \frac{b^{n+1} - a^{n+1}}{n+1}$
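The power rule of integration can be checked directly (the bounds and exponent below are just example values):

```python
def power_rule_integral(a, b, n):
    """Definite integral of x^n on [a, b] via the power rule (n != -1)."""
    return (b ** (n + 1) - a ** (n + 1)) / (n + 1)

# Integral of x^3 on [1, 2]: (2^4 - 1^4) / 4 = 15 / 4 = 3.75
value = power_rule_integral(1.0, 2.0, 3)
```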

Suppose

If we have a function **antiderivative** of

Generally, a function

So we usually include a *set* of functions rather than a unique function.

We say that the integral of

This is the **indefinite integral** since we are not specifying a range the integral is computed over. Thus, we are not given an explicit value but rather the function(s) that result (these typically include the arbitrary constant of integration $C$).

Here, too, the function being integrated is called the **integrand**.

In a definite integral, we specify the upper and lower limits:

The fundamental theorem of calculus connects the concept of a derivative to that of an integral.

Suppose that

We can define a function

Suppose

Then

Thus

Suppose

Note that

To understand why this is so, consider:

Say that

If we want to compute the area under

Note that this is also how the derivative is calculated.

So as we take the limit of

Thus we have shown that

Given some arbitrary:

We know that this is also equal to:

Therefore:

The integral rules defined above still apply.

- **Power rule for indefinite integrals**: For all $n \neq -1$, $\int x^ndx = \frac{1}{n+1}x^{n+1} + C$
- **Integral of the inverse function**: For $f(x) = \frac{1}{x}$, remember that $\frac{d}{dx}\ln x = \frac{1}{x}$, so $\int \frac{dx}{x} = \ln|x| + C$
- **Integral of the exponential function**: Because $\frac{d}{dx}e^x = e^x$, $\int e^x dx = e^x + C$
- **The substitution rule for indefinite integrals**: Assume $u$ is differentiable with a continuous derivative and that $f$ is continuous on the range of $u$. Then $\int f(u(x))\frac{du}{dx}dx = \int f(u)du$.
  - Remember that $\frac{du}{dx}$ is *not* a fraction, so you're not just "canceling" things out here.
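A standard worked example of the substitution rule (not from the original notes): to integrate $\int (x^2+1)^3 \cdot 2x \, dx$, let $u = x^2 + 1$, so $\frac{du}{dx} = 2x$:

$$\int (x^2+1)^3 \cdot 2x \, dx = \int u^3 \, du = \frac{u^4}{4} + C = \frac{(x^2+1)^4}{4} + C$$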

Suppose

You set $u$ according to the order of preference *ILATE*:

- I for inverse trigonometric functions
- L for log functions
- A for algebraic functions
- T for trigonometric functions
- E for exponential functions
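A standard worked example (not from the original notes): consider $\int x e^x \, dx$. By ILATE, the algebraic factor $x$ is preferred over the exponential $e^x$, so set $u = x$ and $dv = e^x dx$; then $du = dx$ and $v = e^x$, and integration by parts ($\int u \, dv = uv - \int v \, du$) gives:

$$\int x e^x \, dx = x e^x - \int e^x \, dx = x e^x - e^x + C$$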

There are two types of **improper integrals**:

- Those on an unbounded function, e.g.:

- Those on an unbounded interval, e.g.:

The integral on an unbounded function depicted above is known as an "improper integral with infinite integrand at

The integral on an unbounded interval is known as an "improper integral on an infinite interval". Here we just consider the limit:

If the interval is unbounded in both directions, we consider instead two separate intervals:

Say we have the integral

If we set the upper bound to be a finite value
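Completing this idea with a standard example (not from the original notes), the improper integral of $\frac{1}{x^2}$ on $[1, \infty)$ converges:

$$\int_1^\infty \frac{dx}{x^2} = \lim_{t \to \infty} \int_1^t \frac{dx}{x^2} = \lim_{t \to \infty} \left( 1 - \frac{1}{t} \right) = 1$$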

The formal definition:

1. Suppose

as long as this limit exists and is finite. If it does exist we say the integral is *convergent* and otherwise we say it is *divergent*.

2. Similarly if

3. Finally, suppose

Suppose

We define

as long as each integral on the right converges.

As a simpler example, say we have an improper integral with a single discontinuity.

If

If this limit exists, we say the integral converges; otherwise we say it diverges.

Similarly, if

Finally, if

We are frequently dealing with data in many dimensions, so we must expand the previous concepts of derivatives and integrals to higher-dimensional spaces.

A definite integral for

But say we are working with three dimensions, i.e. we have *volume under the surface* of

The area of one face of that chunk is the area under the curve, with respect to

Because this is with respect to

To get the volume of this chunk, we multiply that area by some depth

So if we want to get the volume in the bounds of

A double integral!

It is also written without the parentheses:

Note that here we first integrated with respect to

Note: the lower bounds here were 0 but that's just an example.

You could instead conceptualize the double integral as the sum of the volumes of infinitely small columns:

The area of each column's base,
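The column picture translates directly into code: sum $f(x, y)\,dA$ over a grid of small bases (the integrand $xy$ over the unit square and the grid size are illustrative assumptions):

```python
def double_integral(f, x_bounds, y_bounds, n=200):
    """Sum the volumes of n*n thin columns with base dx*dy and height f(x, y)."""
    (a, b), (c, d) = x_bounds, y_bounds
    dx, dy = (b - a) / n, (d - c) / n
    total = 0.0
    for i in range(n):
        for j in range(n):
            x = a + (i + 0.5) * dx  # midpoint of the column's base
            y = c + (j + 0.5) * dy
            total += f(x, y) * dx * dy
    return total

# Volume under z = x * y over the unit square is 1/4.
volume = double_integral(lambda x, y: x * y, (0.0, 1.0), (0.0, 1.0))
```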

In the previous example we had fixed boundaries (see accompanying illustration, on the left).

What if instead we have a variable boundary (see accompanying illustration, on the right. The lower

Well, you express variable boundaries as functions. As is the case in the example above, the lower

That's if you first integrate with respect to

Triple integrals also involve infinitely small volumes and in many cases are no different than double integrals.

So why use triple integrals? Well they are good for calculating the *mass* of something - if the density under the surface is not uniform. The density at a given point is expressed as

Say you have a function

With two variables, we are now working in three dimensions. How does differential calculus work in 3 (or more) dimensions? In three dimensions, what is the slope at a given point? Any given point has an infinite number of tangent lines (only one tangent plane though). So when you take a derivative in three dimensions, you have to specify what direction that derivative is in.

Say we have **partial derivative**. If we were doing it with respect to

So we could work this out as:

Then you could get the partial derivative with respect to

The plane that these two functions define together for a given point

More generally, for a function

The partial derivative tells us how much the output of a function

Partial derivatives can be generalized into *directional derivatives*, which are derivatives with respect to any arbitrary line (it does not have to be, for example, with respect to the

A gradient is a vector of all the partial derivatives at a given point, which is to say it is a generalization of the derivative from two dimensions to higher dimensions.

The gradient of a function

Sometimes it is just notated as

The gradient of some function

That is, the partial derivative of

It can also be written (this is just different notation):

It's worth noting that this can be thought of in terms of matrices, i.e. given some function

Which is to say that

Some properties, taken from equivalent properties of partial derivatives, are:

- $\nabla_x (f(x) + g(x)) = \nabla_x f(x) + \nabla_x g(x)$
- For $t \in \mathbb R$, $\nabla_x (tf(x)) = t \nabla_x f(x)$

Say we have the function

Using the partials we calculated previously, the gradient is:

So what we're really calculating here is a *vector field*, which gives an

What the gradient tells us is, for a given point, what direction to travel to get the maximum slope for
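A numerical sketch of the gradient as a vector of partial derivatives, and of stepping in its direction (the function $f(x, y) = x^2 + y^2$ and the step sizes are illustrative assumptions):

```python
def gradient(f, point, h=1e-6):
    """Vector of partial derivatives of f at a point, via central differences."""
    grad = []
    for i in range(len(point)):
        forward, backward = list(point), list(point)
        forward[i] += h
        backward[i] -= h
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad

f = lambda p: p[0] ** 2 + p[1] ** 2  # analytically, the gradient is (2x, 2y)
g = gradient(f, [1.0, 2.0])

# A small step in the gradient's direction increases f: steepest ascent.
stepped = [1.0 + 0.01 * g[0], 2.0 + 0.01 * g[1]]
```

Gradient-based training methods step *against* the gradient (steepest descent) to reduce error.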

For the vector **Jacobian**, notated

That is, it is an

To clarify, the difference between the gradient and the Jacobian is that the gradient is for a single function, thus yielding a vector, whereas the Jacobian is for *multiple* functions, thus yielding a matrix.

Say we have a function

The **Hessian** matrix with respect to

Which is to say

Wherever the second partial derivatives are continuous, the Hessian is symmetric, i.e.

Because the Hessian is composed of second-order partial derivatives, it plays the same role in multiple dimensions that the second derivative test plays in one: checking whether a critical point is a maximum, a minimum, or still ambiguous.

This is accomplished as follows. If the Hessian matrix is real and symmetric, it can be decomposed into a set of real eigenvalues and an orthogonal basis of eigenvectors. At critical points of the function we can look at the Hessian's eigenvalues:

- If the Hessian is positive definite, we have a local minimum (because movement in any direction is positive)
- If the Hessian is negative definite, we have a local maximum (because movement in any direction is negative)
- When at least one eigenvalue is positive and at least one is negative, we have a saddle point
- When all non-zero eigenvalues are of the same sign, but at least one is zero, we still have an ambiguous critical point
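A sketch of this test with NumPy (the function $f(x, y) = x^2 - y^2$, which has a saddle at the origin, and the step size are illustrative assumptions):

```python
import numpy as np

def hessian(f, point, h=1e-4):
    """Matrix of second partial derivatives of f at a point (central differences)."""
    x = np.asarray(point, dtype=float)
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            xpp, xpm, xmp, xmm = x.copy(), x.copy(), x.copy(), x.copy()
            xpp[i] += h; xpp[j] += h
            xpm[i] += h; xpm[j] -= h
            xmp[i] -= h; xmp[j] += h
            xmm[i] -= h; xmm[j] -= h
            H[i, j] = (f(xpp) - f(xpm) - f(xmp) + f(xmm)) / (4 * h**2)
    return H

f = lambda p: p[0] ** 2 - p[1] ** 2  # critical point at the origin
eigenvalues = np.linalg.eigvalsh(hessian(f, [0.0, 0.0]))
# One positive and one negative eigenvalue: a saddle point.
```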

The Jacobian and the Hessian are related by:

Intuitively, the

A **scalar field** just means a space where, for any point, you can get a scalar value.

For example, with

A **vector field** is similar but instead of just a scalar value, you get a value and a direction.

For example,

Say we have a vector field

The **divergence** of that vector field is:

That is, it is the dot product (which tells us how much two vectors move together) of the gradient and the vector field.

So for our example:

The divergence, which is a scalar for any point in a vector field, represents the change in volume density of an infinitesimal volume around a given point in that field. A positive divergence means the volume density is decreasing (more is going out than coming in); a negative divergence means the volume density is increasing (more is coming in than going out; this is also called **convergence**). A divergence of 0 means the volume density is not changing.

Using our previously calculated divergence, say we want to look at the point
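A numerical sketch of divergence (the radial field $F(x, y) = (x, y)$ and the step size are illustrative assumptions; this field's divergence is 2 everywhere):

```python
def divergence(field, point, h=1e-6):
    """Dot product of the gradient operator with the field: sum of dF_i/dx_i."""
    total = 0.0
    for i in range(len(point)):
        forward, backward = list(point), list(point)
        forward[i] += h
        backward[i] -= h
        total += (field(forward)[i] - field(backward)[i]) / (2 * h)
    return total

outward = lambda p: (p[0], p[1])  # everything flows away from the origin
div = divergence(outward, [3.0, -1.0])
```

A positive value here matches the intuition: more flows out of any small volume than flows in.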

The **curl** measures the rotational effect of a vector field at a given point. Unlike divergence, where we are seeing how much the gradient and the vector field move together, we are interested in seeing how they move against each other. So we use their cross product:

Consider, for a symmetric matrix

Optimization problems with equality constraints are typically solved by forming the **Lagrangian**, an objective function which includes the equality constraints. For this particular problem (i.e. with the quadratic form), the Lagrangian is:

Where **Lagrange multiplier** associated with the equality constraint. For

This is just the linear equation

Differential equations are simply equations that contain derivatives.

**Ordinary differential equations** (ODEs) involve equations containing:

- variables
- functions
- their derivatives
- their solutions

This is contrasted to **partial differential equations** (PDEs), which contain partial derivatives instead of ordinary derivatives.

Say we have:

First we can integrate both sides:

Then we can integrate once more:

So our solution is

The values *initial conditions*, e.g. the starting conditions of a model.

There are four main types (though there are many others) of differential equations:

- separable
- homogeneous
- linear
- exact

A *separable equation* is in the form:

You can group the terms together like so:

And then integrate both sides to obtain the solution:

Say we want to solve

Separate the terms:

Then integrate:

If we let
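A numerical cross-check (not in the original notes): the separable equation $\frac{dy}{dx} = y$ gives $\int \frac{dy}{y} = \int dx$, so $y = Ce^x$; with $y(0) = 1$, stepping the equation forward with Euler's method should land near $e$ at $x = 1$. The step count is an arbitrary choice:

```python
import math

def euler(f, y0, x0, x1, n=100_000):
    """Euler's method: repeatedly step y forward along its slope dy/dx = f(x, y)."""
    x, y = x0, y0
    dx = (x1 - x0) / n
    for _ in range(n):
        y += f(x, y) * dx
        x += dx
    return y

# dy/dx = y with y(0) = 1 has the exact solution y = e^x.
approx = euler(lambda x, y: y, 1.0, 0.0, 1.0)
```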

A *homogeneous equation* is in the form:

To make things easier, we can use the substitution

so

Then we can set

so now the equation is in separable form and can be solved as a separable equation.

A *linear* first order differential equation is a differential equation in the form:

To solve, you multiply both sides by the *integrating factor*.

So in this case,

So, we calculate the integrating factor:

and multiply both sides by

and work out the integration.

An *exact equation* is in the form:

such that

There exists some function

so long as

- Calculus. Revised 14 October 2013. Wikibooks.
- Multivariable Calculus. Khan Academy.
- Linear Algebra Review and Reference. Zico Kolter. October 16, 2007.
- Deep Learning. Yoshua Bengio, Ian Goodfellow, Aaron Courville.
- Math for Machine Learning. Hal Daumé III. August 28, 2009.