$$ \DeclareMathOperator{\Img}{im} $$


Fundamentally, a function is a relationship (mapping) between the values of some set $X$ and some set $Y$:

$$ f:X \to Y $$

A function is a mapping between domains.

A function can map a set to itself. For example, $f(x) = x^2$, also notated $f:x \mapsto x^2$, is the mapping of all real numbers to all real numbers, or $f:\mathbb R \to \mathbb R$.

The set you are mapping from is called the domain.

The set that is being mapped to is called the codomain.

The range is the subset of the codomain that the function actually maps to (a function doesn't necessarily map to every value in the codomain; when it does, the range equals the codomain).
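As a minimal sketch of these three sets in Python (the finite domain and codomain here are just illustrative choices):

```python
# Hypothetical example: f(x) = x^2 over a small finite domain,
# treating the codomain as the integers 0..10.
domain = {-2, -1, 0, 1, 2}
codomain = set(range(11))

def f(x):
    return x ** 2

# The range is the set of values the function actually produces.
range_of_f = {f(x) for x in domain}

print(range_of_f)                      # {0, 1, 4}
print(range_of_f.issubset(codomain))   # True: the range is a subset of the codomain
print(range_of_f == codomain)          # False: f does not hit every codomain value
```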

Functions which map to $\mathbb R$ are known as scalar-valued or real-valued functions.

Functions which map to $\mathbb R^n$ where $n > 1$ are known as vector-valued functions.
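For instance, a sketch of one of each (the particular functions are just examples):

```python
import numpy as np

def scalar_valued(x):
    """Maps R^2 -> R (real/scalar-valued)."""
    return x[0] ** 2 + x[1] ** 2

def vector_valued(t):
    """Maps R -> R^3 (vector-valued)."""
    return np.array([np.cos(t), np.sin(t), t])

print(scalar_valued(np.array([1.0, 2.0])))  # 5.0, a single real number
print(vector_valued(0.0))                   # [1. 0. 0.], a vector in R^3
```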

Identity functions

An identity function maps something to itself:

$$ I_X : X \to X $$

That is, for every $a$ in $X$, $I_X(a) = a$:

$$ I_X(a) = a, \forall \, a \in X $$

The inverse of a function

Say we have a function $f: X \to Y$, which maps each $a \in X$ to some $b \in Y$, i.e. $f(a) = b$.

We say $f$ is invertible if and only if there exists a function $f^{-1}: Y \to X$ such that $f^{-1} \circ f = I_X$ and $f \circ f^{-1} = I_Y$. Note that $\circ$ denotes function composition, i.e. $(f \circ g)(x) = f(g(x))$.

If a function is invertible, its inverse is unique. Equivalently, $f$ is invertible if and only if it is both surjective and injective (described below), that is, there is exactly one $x$ for each $y$.
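As a quick numerical sanity check of the defining identities, here is a sketch using $\exp$ and its inverse $\log$ as the example pair:

```python
import math

# exp: R -> (0, inf) and log: (0, inf) -> R are inverses of each other.
def f(x):
    return math.exp(x)

def f_inv(y):
    return math.log(y)

x, y = 1.5, 7.0
print(math.isclose(f_inv(f(x)), x))  # True: f^-1 o f acts as the identity on X
print(math.isclose(f(f_inv(y)), y))  # True: f o f^-1 acts as the identity on Y
```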

Surjective functions

A surjective function, also called "onto", is a function $f: X \to Y$ where, for every $y \in Y$, there exists at least one $x \in X$ such that $f(x) = y$. That is, every $y$ has at least one corresponding $x$ value.

This is equivalent to:

$$ \text{range}(f) = Y $$

Injective functions

An injective function, also called "one-to-one", is a function $f: X \to Y$ where, for every $y \in Y$, there exists at most one $x \in X$ such that $f(x) = y$.

That is, not every $y$ necessarily has a corresponding $x$, but those that do have only one corresponding $x$.

Surjective & injective functions

A function can be both surjective and injective (such a function is called bijective), which just means that for every $y \in Y$ there exists exactly one $x \in X$ such that $f(x) = y$, that is, every $y$ has exactly one corresponding $x$.

As mentioned before, an invertible function is both surjective and injective (and so is its inverse)!
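Over finite sets these properties can be checked directly by enumeration; here is a sketch (the particular sets and mapping are just an example):

```python
def is_injective(f, domain):
    # No two domain elements map to the same value.
    images = [f(x) for x in domain]
    return len(images) == len(set(images))

def is_surjective(f, domain, codomain):
    # Every codomain element is hit by at least one domain element.
    return {f(x) for x in domain} == set(codomain)

def is_bijective(f, domain, codomain):
    return is_injective(f, domain) and is_surjective(f, domain, codomain)

domain = {0, 1, 2, 3}
codomain = {0, 1, 4, 9}
f = lambda x: x ** 2

print(is_injective(f, domain))             # True on this domain
print(is_surjective(f, domain, codomain))  # True: the range equals the codomain
print(is_bijective(f, domain, codomain))   # True, so f is invertible here
```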

Convex and non-convex functions

A convex function is a continuous function whose value at the midpoint of every interval in its domain does not exceed the arithmetic mean of its values at the ends of the interval. (Convex Function. Weisstein, Eric W. Wolfram MathWorld)

A convex region is one in which any two points in the region can be joined by a straight line that does not leave the region.

Which is to say that a convex function has at most one minimum value, and any point where the derivative is 0 attains it (there are no other local minima).

More formally, a twice-differentiable function is convex if its second derivative is nonnegative everywhere (and strictly convex if it is positive everywhere). A function can be convex on an interval $[a,b]$ if its second derivative is nonnegative everywhere in that interval.

In higher dimensions these derivatives aren't scalar values, so we instead say a function is convex if the Hessian $H$ (the matrix of second derivatives) is positive semidefinite (notated $H \succeq 0$). It is strictly convex if $H$ is positive definite (notated $H \succ 0$). Refer to the Calculus section for more details on this.
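One quick way to check this numerically is via the eigenvalues of the Hessian; here is a sketch for $f(x, y) = x^2 + y^2$, whose Hessian happens to be constant (the function choice is just an example):

```python
import numpy as np

# Hessian of f(x, y) = x^2 + y^2 (constant in this case).
H = np.array([[2.0, 0.0],
              [0.0, 2.0]])

eigenvalues = np.linalg.eigvalsh(H)  # eigvalsh is for symmetric matrices

print(np.all(eigenvalues >= 0))  # True -> H is positive semidefinite (convex)
print(np.all(eigenvalues > 0))   # True -> H is positive definite (strictly convex)
```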


Transcendental functions

Transcendental functions are those that cannot be expressed in terms of polynomial operations (i.e. they are not algebraic), e.g. $\sin$, $\exp$, $\log$, etc.


Logarithms are frequently encountered. They have many useful properties, such as turning multiplication into addition:

$$ \log(xy) = \log(x) + \log(y) $$

Multiplying many small numbers is problematic with computers, leading to underflow errors. Logarithms are commonly used to turn this kind of multiplication into addition and avoid underflow errors.

Note that $\log(x)$, without any base, typically implies the natural log, i.e. $\log_e(x)$, sometimes notated $\ln(x)$, which has the inverse $\exp(x)$, more commonly seen as $e^x$.
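For example, a sketch of the underflow problem and the log-space workaround (the probabilities here are made-up values):

```python
import math

# Multiplying many small probabilities underflows to 0 in floating point.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- underflow

# Summing logs stays well within floating-point range.
log_product = sum(math.log(p) for p in probs)
print(log_product)  # ~ -1151.29, i.e. the log of the true product
```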

Other useful concepts

Solving analytically vs numerically

Often you may see a distinction made between solving a problem analytically (sometimes the term algebraically is used) and solving it numerically.

Solving a problem analytically means you can exploit properties of the objects and equations, e.g. through methods from calculus, avoiding substituting numerical values for the variables you are manipulating (that is, you only need to manipulate symbols). If a problem may be solved analytically, the resulting solution is called a closed form solution (or the analytic solution) and is an exact solution.

Not all problems can be solved analytically; generally more complex mathematical models have no closed form solution. These problems are also often the ones of most interest. Such problems need to be approximated numerically, which involves evaluating the equations many times by substituting different numerical values for variables. The result is an approximate (numerical) solution.
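As a sketch of the difference, here is the same root-finding problem solved in closed form and approximated numerically by bisection (the equation $x^2 - 2 = 0$ is just an example):

```python
import math

# Analytic (closed-form) solution of x^2 - 2 = 0 for x > 0:
analytic = math.sqrt(2)

# Numerical solution: bisection repeatedly evaluates f at trial values.
def f(x):
    return x ** 2 - 2

lo, hi = 0.0, 2.0
for _ in range(50):
    mid = (lo + hi) / 2
    if f(mid) > 0:
        hi = mid
    else:
        lo = mid
numerical = (lo + hi) / 2

print(analytic, numerical)  # the numerical answer approximates the exact one
```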

Linear vs nonlinear models

You'll often see a caveat with algorithms that they only work for linear models. On the other hand, some methods are touted for their capacity to fit nonlinear models.

A linear model is a model which takes the general form:

$$ y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n $$

Note that this function does not need to produce a literal line. The "linear" constraint does not apply to the predictor variables $x_1, \dots, x_n$. For instance, the model $y = \beta_0 + \beta_1 x^2$ is linear, even though its graph is a parabola.

"Linear" refers to the parameters; i.e. the function must be "linear in the parameters", meaning that the parameters $\beta_0, \dots, \beta_n$ themselves must form a line (or its equivalent in whatever dimensional space you're working in).

A nonlinear model includes terms such as $\beta^2$ or $\beta_0 \beta_1$ (that is, a parameter raised to a power, or multiple parameters multiplied together in the same term, neither of which is linear) or parameters inside transcendental functions, e.g. $\sin(\beta_1 x)$.
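A sketch of why "linear in the parameters" matters: the model $y = \beta_0 + \beta_1 x^2$ has a curved graph, but because it is linear in $\beta_0, \beta_1$ it can be fit directly by ordinary least squares (the data below is synthetic):

```python
import numpy as np

# Synthetic data from y = 1 + 3 x^2 plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = 1 + 3 * x ** 2 + rng.normal(scale=0.1, size=x.shape)

# Design matrix: one column per parameter (intercept and the x^2 feature).
A = np.column_stack([np.ones_like(x), x ** 2])

# Linear least squares recovers the parameters directly.
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)  # approximately [1, 3]
```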


Metrics

Many artificial intelligence and machine learning algorithms are based on or benefit from some kind of metric. In this context the term has a concrete definition.

The typical case for metrics is around similarity. Say you have a bunch of random variables $X_i$ which take on values in a label space $V$. If $X_i$ and $X_j$ are connected by an edge (e.g. in a graphical model), we want them to take on "similar" values.

How do we define "similar"?

We'll use a distance function $\mu: V \times V \to \mathbb R^+$, which needs to satisfy:

- Reflexivity: $\mu(v, v) = 0$ for all $v \in V$
- Symmetry: $\mu(u, v) = \mu(v, u)$ for all $u, v \in V$
- Triangle inequality: $\mu(u, w) \leq \mu(u, v) + \mu(v, w)$ for all $u, v, w \in V$

If all these are satisfied, we say that $\mu$ is a metric.

If only reflexivity and symmetry are satisfied, we have a semi-metric instead.
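Here is a sketch of checking these properties by enumeration for the absolute-difference distance on a few sample values (the candidate distance and sampled points are arbitrary choices):

```python
import itertools

# Absolute difference as a candidate metric on real-valued labels.
def mu(u, v):
    return abs(u - v)

values = [0.0, 1.5, -2.0, 3.0]

reflexive = all(mu(v, v) == 0 for v in values)
symmetric = all(mu(u, v) == mu(v, u)
                for u, v in itertools.product(values, repeat=2))
triangle = all(mu(u, w) <= mu(u, v) + mu(v, w)
               for u, v, w in itertools.product(values, repeat=3))

print(reflexive, symmetric, triangle)  # True True True -> behaves as a metric here
```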

So we can create a feature $f_{ij}(X_i, X_j) = \mu(X_i, X_j)$ and use it in a factor of the form:

$$ \exp(- w_{ij} f_{ij} (X_i, X_j)), \quad w_{ij} > 0 $$

so that the lower the distance (metric), the higher the probability.
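A minimal sketch of that behavior (the weight and distances below are arbitrary):

```python
import math

w = 2.0  # plays the role of w_ij > 0

def factor(distance):
    return math.exp(-w * distance)

for d in [0.0, 0.5, 1.0, 2.0]:
    print(d, factor(d))  # the factor shrinks as the distance grows
```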