Notes from *Mining Massive Datasets* (Coursera & Stanford, 2014). Jure Leskovec, Anand Rajaraman, Jeff Ullman.

Often with a graph, you want to rank the nodes in some way. A common approach is to rank the nodes by the link (edge) structure. *Link analysis* provides some approaches for computing these node importance scores.

Think of links as votes - a node with more links is more important. In-links are preferable here because they are harder to fake than out-links.

We want to weight in-links based on where they are coming from (e.g. in-links from more important nodes should have higher weight).

So we have a recursive approach:

- Each link's vote is proportional to the importance of its source node
- If node $j$ with importance (rank) $r_j$ has $n$ out-links, each link gets $\frac{r_j}{n}$ votes
- Node $j$'s own importance is the sum of the votes on its in-links
- We add an additional constraint that the ranks must sum to 1 (so the solutions aren't unbounded)

Another way of formulating this is:

$$ r_j = \sum_{i \to j} \frac{r_i}{d_i} $$

That is, the rank of node $j$ is the sum of the votes on its in-links, where $d_i$ is the out-degree of node $i$.

If we consider this rank for every node of the graph, we get a system of "flow" equations, which we could solve directly (e.g. with Gaussian elimination).

However this won't scale to large networks.
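For a tiny graph the flow equations can be solved directly as a linear system. A minimal sketch, on a hypothetical 3-node graph (1 links to 2 and 3, 2 links to 3, 3 links to 1); since the flow equations alone are rank-deficient, one of them is swapped out for the sum-to-1 constraint:

```python
import numpy as np

# Flow equations for the graph above:
#   r1 = r3,  r2 = r1/2,  r3 = r1/2 + r2,  with r1 + r2 + r3 = 1.
# Replace the (redundant) third flow equation with the constraint.
A = np.array([
    [1.0, 0.0, -1.0],   # r1 - r3 = 0
    [-0.5, 1.0, 0.0],   # r2 - r1/2 = 0
    [1.0, 1.0, 1.0],    # r1 + r2 + r3 = 1
])
b = np.array([0.0, 0.0, 1.0])
r = np.linalg.solve(A, b)  # -> [0.4, 0.2, 0.4]
```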

We define a *stochastic adjacency matrix* $M$, where $M_{ji} = \frac{1}{d_i}$ if $i \to j$ and 0 otherwise. $M$ is *column stochastic*, which means that every column in the matrix sums to 1.

Then we define a *rank vector* $r$, with one entry $r_i$ per node, where $\sum_i r_i = 1$.

We can re-write the flow equations as:

$$ r = M \cdot r $$

Which may seem familiar as an eigenvalue equation, i.e.:

$$ Ax = \lambda x $$

Where $\lambda$ is an eigenvalue and $x$ is its eigenvector.

In this case, we have an implied eigenvalue of $\lambda = 1$, so the rank vector $r$ is an eigenvector of $M$; because $M$ is column stochastic, it is in fact the *principal* eigenvector (the one with the largest eigenvalue).

Now we can solve for $r$, using the *power iteration* method to do so.

Power iteration is relatively simple. Say we have $N$ nodes; we initialize $r^{(0)} = [\frac{1}{N}, \dots, \frac{1}{N}]^T$.

Then we iteratively compute $r^{(t+1)} = M \cdot r^{(t)}$, stopping when $|r^{(t+1)} - r^{(t)}|_1 < \varepsilon$.

An alternative interpretation of power iteration is that we are using a Markov process (i.e. a random walker), i.e. $p(t+1) = M \cdot p(t)$, where $p(t)$ is the probability distribution over which node the walker is at; the rank vector $r$ is the stationary distribution of this walk.
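The iteration itself is a few lines. A minimal sketch on a hypothetical 3-node graph (1 links to 2 and 3, 2 links to 3, 3 links to 1; the function name is my own):

```python
def power_iteration(M, eps=1e-10, max_iter=1000):
    """Iterate r <- M r until the L1 change drops below eps."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_new = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(r_new, r)) < eps:
            return r_new
        r = r_new
    return r

# Column-stochastic M (column i spreads node i's rank over its out-links).
M = [
    [0.0, 0.0, 1.0],   # node 1 receives all of node 3's rank
    [0.5, 0.0, 0.0],   # node 2 receives half of node 1's rank
    [0.5, 1.0, 0.0],   # node 3 receives from nodes 1 and 2
]
r = power_iteration(M)  # -> approx [0.4, 0.2, 0.4]
```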

There are a couple of problems with PageRank as it is formulated so far:

- The "dead end" problem: some nodes have no out-links, in which case their importance can't be passed on to any other nodes, i.e. the random walker has nowhere to move and we get a nonsensical stationary distribution.
- The "spider trap" problem: there may be a group of nodes that only has out-links to other nodes in the group. That is, the random walker of the Markov process will get stuck in the group with no way out, which results in another nonsensical stationary distribution - all the importance gets stuck in the group.

For spider traps, the Google formulation of PageRank has the following solution:

At each time step, the random walker can:

- with probability $\beta$, follow a random out-link, as before
- with probability $1 - \beta$, teleport to a random node in the graph

With this method, the random walker can "teleport" out of spider traps within a few time steps. This preserves the aperiodicity condition (which requires that there are no periods, i.e. inescapable loops, within the graph) necessary for convergence to a unique stationary distribution in the Markov process.

Similarly, for the dead end problem, the random walker teleports from dead ends with probability 1. We can represent this in the matrix $M$ by replacing each all-zero column (a dead end) with a column where every entry is $\frac{1}{N}$.

With these fixes in mind, the PageRank equation is:

$$ r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1 - \beta) \frac{1}{N} $$

Instead of our previous matrix $M$, we use the *Google matrix* $A$:

$$ A = \beta M + (1 - \beta) \left[\frac{1}{N}\right]_{N \times N} $$

Where $\left[\frac{1}{N}\right]_{N \times N}$ is an $N \times N$ matrix with every entry set to $\frac{1}{N}$.

And so we now have $r = A \cdot r$.

So we end up running power iteration with:

$$ r^{(t+1)} = A \cdot r^{(t)} $$

Introducing the matrix $A$ solves both the spider trap and the dead end problems (assuming dead-end columns of $M$ have been replaced as described above).

However, with the way we construct $A$, it is a dense matrix, whereas $M$ was sparse, so it requires far too much memory for large graphs.
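For intuition, here is a minimal sketch (hypothetical 2-node spider-trap graph, function names my own) that builds the dense Google matrix and power-iterates it; teleports let rank escape the trap:

```python
def google_matrix(M, beta=0.85):
    """A = beta*M + (1-beta)*[1/N]: dense, so only viable for small graphs."""
    n = len(M)
    return [[beta * M[j][i] + (1 - beta) / n for i in range(n)]
            for j in range(n)]

def pagerank_dense(M, beta=0.85, eps=1e-10):
    A = google_matrix(M, beta)
    n = len(A)
    r = [1.0 / n] * n
    while True:
        r_new = [sum(A[j][i] * r[i] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(r_new, r)) < eps:
            return r_new
        r = r_new

# Spider trap: node 2 links only to itself, node 1 links only to node 2.
M = [
    [0.0, 0.0],  # node 1 receives nothing via links
    [1.0, 1.0],  # node 2 receives from node 1 and from itself
]
r = pagerank_dense(M)  # -> [0.075, 0.925] with beta = 0.85
```

Without the teleport term, node 2 would absorb all the rank; with it, node 1 keeps the teleport share $\frac{1-\beta}{N}$.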

How can we preserve the sparsity of $M$?

What we can do instead is the following.

Suppose there are $N$ pages, and that page $i$ has $d_i$ out-links.

We have $A_{ji} = \beta M_{ji} + \frac{1-\beta}{N}$; that second term is what makes $A$ dense, since it effectively adds a teleport transition between every pair of nodes.

Rather than introducing all the extra transition edges, we can instead capture random teleportation by taxing the PageRank score of every node by a fraction $1 - \beta$ and redistributing it evenly across all nodes.

Mathematically, we are saying that:

$$ r = \beta M \cdot r + \left[\frac{1-\beta}{N}\right]_N $$

Where $\left[\frac{1-\beta}{N}\right]_N$ is a vector with all $N$ entries set to $\frac{1-\beta}{N}$.

Which allows us to re-write the power iteration in terms of the sparse matrix $M$ rather than the dense matrix $A$.

Note that this formulation assumes $M$ has no dead ends (dead ends leak rank, which we handle by renormalizing).

So we can again apply a slightly modified power iteration, where for each iteration we:

- Compute $r^{(t+1)} = \beta M r^{(t)}$
- Add a constant value $\frac{1-\beta}{N}$ to each entry in $r^{(t+1)}$
- If $M$ contains dead ends, then $\sum_i r^{(t+1)}_i < 1$ (i.e. it does not sum to 1), so we must renormalize $r^{(t+1)}$ so that it does sum to 1

So the full algorithm is:

- Take as inputs the graph $G$ and the parameter $\beta$ (typically in the range 0.8-0.9)
- Set $r_j^{(0)} = \frac{1}{N}, t = 1$
- While $\sum_j |r_j^{(t)} - r_j^{(t-1)}| > \varepsilon$, do:
    - $\forall j$, do:

    $$ r'^{(t)}_j = \begin{cases} 0 & \text{if in-degree of $j$ is 0} \\ \sum_{i \to j} \beta \frac{r_i^{(t-1)}}{d_i} & \text{otherwise} \end{cases} $$

    - Re-insert the leaked PageRank ($\frac{1-S}{N}$). $\forall j$, do:

    $$ r_j^{(t)} = r'^{(t)}_j + \frac{1-S}{N}, \quad \text{where } S = \sum_j r'^{(t)}_j $$

    - $t = t + 1$
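The algorithm above can be sketched over an adjacency list, which keeps the work proportional to the number of edges (hypothetical input format and names; re-inserting the leaked rank covers both the teleport tax and dead-end leakage):

```python
def pagerank_sparse(out_links, beta=0.85, eps=1e-10):
    """out_links[i] is the list of nodes that page i links to."""
    n = len(out_links)
    r = [1.0 / n] * n
    while True:
        # r'_j = sum over in-links (i -> j) of beta * r_i / d_i
        r_new = [0.0] * n
        for i, targets in enumerate(out_links):
            for j in targets:
                r_new[j] += beta * r[i] / len(targets)
        # Re-insert the leaked PageRank evenly across all nodes.
        leaked = 1.0 - sum(r_new)
        r_new = [x + leaked / n for x in r_new]
        if sum(abs(a - b) for a, b in zip(r_new, r)) < eps:
            return r_new
        r = r_new

# Node 2 is a dead end; its rank leaks and is redistributed.
r = pagerank_sparse([[1, 2], [2], []])
```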

However, with very large graphs, even the sparse matrix $M$ may not fit in main memory, in which case we need to update $r$ using one block of $M$ at a time.

Up until now we have been defining importance in the context of the *entire* graph. But what if we want to compute importance for a particular "domain" or "topic"?

This requires only a small modification of the previous PageRank formulation. Instead of the random walker being able to teleport to any node with equal probability, we restrict it so that it can only teleport to a topic-specific set of "relevant" nodes (the *teleport set*). This way we compute a PageRank relevant to that specific topic.
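The change is one line relative to ordinary PageRank: the leaked rank is re-inserted only into the teleport set rather than spread over all nodes. A minimal sketch (hypothetical adjacency-list input, names my own):

```python
def topic_pagerank(out_links, teleport_set, beta=0.85, eps=1e-10):
    """out_links[i] lists the nodes page i links to; teleport_set is
    the topic-specific set of nodes the walker may teleport to."""
    n = len(out_links)
    r = [1.0 / n] * n
    while True:
        r_new = [0.0] * n
        for i, targets in enumerate(out_links):
            for j in targets:
                r_new[j] += beta * r[i] / len(targets)
        # Leaked rank goes only to the teleport set.
        leaked = 1.0 - sum(r_new)
        for j in teleport_set:
            r_new[j] += leaked / len(teleport_set)
        if sum(abs(a - b) for a, b in zip(r_new, r)) < eps:
            return r_new
        r = r_new

# Cycle 0 -> 1 -> 2 -> 0; teleports land only on node 0,
# biasing the ranking toward node 0's neighborhood.
r = topic_pagerank([[1], [2], [0]], teleport_set=[0])
```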

How do you measure node proximity (aka relevance, closeness, similarity) on a graph?

You could use shortest path, but it neglects that there may be multiple paths between nodes (which could indicate greater proximity).

It also ignores that there may be other nodes dangling off the path (which could indicate less proximity).

We can instead use SimRank, which is a variation of PageRank: random walks with restarts from a fixed node $u$.

A random walk with restarts is a random walk with teleports, but the teleport set consists of only one node. Here, that node is $u$, the node whose proximity to the others we want to measure.

The graph may have multiple node types, e.g. a bipartite graph of authors and the conferences they publish at.

Then the similarity between a node and the others of the same type is given by the PageRank scores of each of the other nodes when the walker restarts from that node (i.e. it is PageRank with the teleport set $\{u\}$).

The downside with this method is that it is not very scalable; it has to be run separately for each node $u$.

Another link-analysis approach is *Hubs and Authorities* (HITS). Here, each node (page) has two scores:

- Quality as an expert (hub): the total sum of the votes of the authorities it points *to*
- Quality as a content provider (authority): the total sum of the votes coming *from* experts

So *authorities* are pages containing useful information, and *hubs* are pages that link to authorities (they are not mutually exclusive; both scores are computed for each page).

Each page starts with a hub score of 1. Then we compute the authority scores. Then we re-compute the hub scores based on these authority scores, and then we re-compute the authority scores based on these hub scores, and so on until convergence. Then we normalize the scores.

More formally:

- We have a vector $h$ of hub scores (the hub score for a single page is $h_i$) and a vector $a$ of authority scores (the authority score for a single page is $a_i$)
- Initialize $a_j^{(0)} = \frac{1}{\sqrt n}, h_j^{(0)} = \frac{1}{\sqrt n}$ (where we have $n$ pages)
- Repeat until convergence:
    - for each $i$, $a_i^{(t+1)} = \sum_{j \to i} h_j^{(t)}$
    - for each $i$, $h_i^{(t+1)} = \sum_{i \to j} a_j^{(t)}$
    - normalize: $\sum_i (a_i^{(t+1)})^2 = 1, \sum_j (h_j^{(t+1)})^2 = 1$
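The iteration above can be sketched directly (hypothetical tiny graph and names; a fixed iteration count stands in for a convergence test):

```python
def hits(out_links, iters=100):
    """out_links[i] lists the pages that page i links to."""
    n = len(out_links)
    h = [1.0 / n ** 0.5] * n
    a = [1.0 / n ** 0.5] * n
    for _ in range(iters):
        # Authority: sum of hub scores of the pages linking in.
        a = [sum(h[j] for j in range(n) if i in out_links[j])
             for i in range(n)]
        # Hub: sum of authority scores of the pages linked to.
        h = [sum(a[j] for j in out_links[i]) for i in range(n)]
        # Normalize both vectors to unit L2 norm.
        na = sum(x * x for x in a) ** 0.5
        nh = sum(x * x for x in h) ** 0.5
        a = [x / na for x in a]
        h = [x / nh for x in h]
    return h, a

# Pages 0 and 1 both link to page 2: page 2 is the pure authority,
# pages 0 and 1 are equal hubs.
h, a = hits([[2], [2], []])
```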

Alternatively, we can say we have an adjacency matrix $A$, where $A_{ij} = 1$ if $i \to j$ and 0 otherwise.

So we can re-write the algorithm's iteration as:

- Repeat until convergence:
    - $h = A \cdot a$
    - $a = A^T \cdot h$
    - Normalize $a$ and $h$

If we substitute terms, something becomes clear:

$$ a = A^T \cdot h = A^T A \cdot a $$

That is, the converged $a$ is (up to normalization) the principal eigenvector of $A^T A$.

If you do a similar substitution for $h$, you'll see that the converged $h$ is the principal eigenvector of $A A^T$.

Spammers can be an issue on such networks, where they engage in *link spam*: adding hyperlinks on sites where they can post content (e.g. blog posts) to pages that they own. If successfully executed, this can increase the PageRank of their own page, from which they can link to other pages they own and distribute this ill-gained PageRank score across those pages.

TrustRank is a variation on PageRank where the teleport set is a set of *trusted pages*, e.g. `.edu` domains.

The basic principle is called **approximate isolation**, which says that it is rare for a "good" page to point to a "bad" page.

So you perform TrustRank like PageRank, just with a different teleport set, and the importance scores which propagate through the network are now considered *trust* scores.

AGM (the Affiliation Graph Model) is a generative model for networks. That is, given a set of nodes $V$, it generates the edges of the network.

In addition to the nodes $V$, we have a set of communities $C$ and a set of memberships $M$ connecting nodes to communities.

We say that any node can be a member of any number of communities, and that the nodes in a community are likely to link to each other.

Each community $c \in C$ has an associated probability $p_c$: the probability that two members of $c$ are connected by an edge.

This initial bipartite graph of memberships (nodes on one side, communities on the other) is called the **affiliation graph**.

We then use this affiliation graph to generate the network itself.

The probability that a node $u$ and a node $v$ are connected is:

$$ p(u, v) = 1 - \prod_{c \in M_u \cap M_v} (1 - p_c) $$

Where $M_u$ is the set of communities that $u$ belongs to.

So the more communities they share in common, the more likely they have an edge between them.

If $u$ and $v$ share no communities, then $p(u, v) = 0$.
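The generative step is a direct translation of the formula. A sketch with hypothetical data (`memberships` maps each node to its set of communities, `p_c` maps each community to its edge probability):

```python
import random

def edge_prob(u, v, memberships, p_c):
    """P(edge u-v) = 1 - prod over shared communities c of (1 - p_c)."""
    shared = memberships[u] & memberships[v]
    prob = 1.0
    for c in shared:
        prob *= 1.0 - p_c[c]
    return 1.0 - prob

def generate_network(nodes, memberships, p_c, seed=0):
    """Flip a biased coin for every node pair to generate the edges."""
    rng = random.Random(seed)
    edges = set()
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if rng.random() < edge_prob(u, v, memberships, p_c):
                edges.add((u, v))
    return edges

memberships = {"u": {"A", "B"}, "v": {"A", "B"}, "w": {"C"}}
p_c = {"A": 0.5, "B": 0.5, "C": 0.9}
p = edge_prob("u", "v", memberships, p_c)  # 1 - 0.5*0.5 = 0.75
```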

Using AGM, we can develop another algorithm, BIGCLAM, which we can use to detect communities in (very) large networks.

In AGM, nodes had binary memberships to the communities in $C$; in BIGCLAM, memberships have continuous strengths.

We say that $F_{uC}$ is the (nonnegative) membership strength of node $u$ to community $C$, where $F_{uC} = 0$ means no membership.

Then we revise the probability that nodes $u$ and $v$ are connected via community $C$:

$$ p_C(u, v) = 1 - \exp(-F_{uC} \cdot F_{vC}) $$

So the stronger the memberships of $u$ and $v$ to community $C$, the more likely they are to be connected through $C$.

If one of the two nodes is not in $C$ (i.e. $F_{uC} = 0$ or $F_{vC} = 0$), then $p_C(u, v) = 0$.

We define a community membership strength matrix $F$, where the rows are nodes and the columns are communities.

The row vector $F_u$ describes node $u$'s membership strength to each community.

With our new definition of $F$, the probability that $u$ and $v$ are connected (through any community) becomes:

$$ p(u, v) = 1 - \exp(-F_u \cdot F_v^T) $$
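In code, this edge probability is just a dot product (a sketch with hypothetical membership vectors):

```python
import math

def bigclam_edge_prob(F_u, F_v):
    """p(u, v) = 1 - exp(-F_u . F_v) for membership-strength rows."""
    dot = sum(a * b for a, b in zip(F_u, F_v))
    return 1.0 - math.exp(-dot)

# Two nodes strongly sharing community 0, neither in community 1:
p = bigclam_edge_prob([2.0, 0.0], [1.5, 0.0])  # 1 - e^-3
```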

Now that we have this adaptation of AGM defined, let's go backwards. Instead of generating a network from this AGM, let's generate this AGM from a network. Say we have a network $G(V, E)$: we want to find the matrix $F$ that most likely generated it.

The basic idea is to find the $F$ that maximizes the likelihood:

$$ P(G|F) = \prod_{(u,v) \in E} p(u, v) \prod_{(u,v) \notin E} (1 - p(u, v)) $$

Although it is usually easier to maximize the log likelihood:

$$ l(F) = \sum_{(u,v) \in E} \log(1 - \exp(-F_u F_v^T)) - \sum_{(u,v) \notin E} F_u F_v^T $$

We can use gradient ascent here:

Compute the gradient of a single row $F_u$, keeping the other rows fixed:

$$ \nabla l(F_u) = \sum_{v \in \mathcal{N}(u)} F_v \frac{\exp(-F_u F_v^T)}{1 - \exp(-F_u F_v^T)} - \sum_{v \notin \mathcal{N}(u)} F_v $$

Where $\mathcal{N}(u)$ is the set of $u$'s neighbors.

And we can just do this for all rows of $F$:

- Iterate over the rows of $F$ until convergence:
    - Compute the gradient $\nabla l(F_u)$ for row $u$ while keeping the other rows fixed
    - Update the row $F_u$ like so: $F_u := F_u + \eta \nabla l(F_u)$
    - If $F_{uC} < 0$, set $F_{uC} = 0$ (i.e. change any negative values to 0)

However, this is slow because the second term of the gradient, $\sum_{v \notin \mathcal{N}(u)} F_v$, sums over all non-neighbors, which takes time linear in the size of the network.

We can speed this up by computing $\sum_{v \notin \mathcal{N}(u)} F_v = \sum_v F_v - F_u - \sum_{v \in \mathcal{N}(u)} F_v$, caching $\sum_v F_v$; each update then only needs to touch $u$'s neighbors.
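A rough sketch of one such row update (hypothetical helper, names my own; it assumes strengths are initialized positive so that the $1 - e^{-F_u F_v^T}$ denominator is nonzero, and recomputes the column sums for clarity rather than maintaining a running cache):

```python
import math

def update_row(F, u, neighbors, eta=0.01):
    """One gradient-ascent step on row F[u], holding other rows fixed."""
    k = len(F[u])
    # Column sums over all rows (in practice this would be cached).
    total = [sum(F[v][c] for v in range(len(F))) for c in range(k)]
    grad = [0.0] * k
    for v in neighbors[u]:
        dot = sum(a * b for a, b in zip(F[u], F[v]))
        w = math.exp(-dot) / (1.0 - math.exp(-dot))
        for c in range(k):
            grad[c] += F[v][c] * w
    for c in range(k):
        # Sum over non-neighbors = total - F_u - sum over neighbors.
        non_nbr = total[c] - F[u][c] - sum(F[v][c] for v in neighbors[u])
        grad[c] -= non_nbr
    # Step, then project back to nonnegative strengths.
    F[u] = [max(0.0, x + eta * g) for x, g in zip(F[u], grad)]

F = [[1.0, 0.1], [1.0, 0.1], [0.1, 1.0]]
neighbors = [[1], [0], []]
update_row(F, 0, neighbors)
```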

Clustering in graphs is a **partitioning** task - that is, we want to find a way to divide the nodes into disjoint groups.

One graph partitioning criterion is **conductance**, which measures the connectivity of the cluster to the rest of the network, relative to the density of the group:

$$ \phi(A) = \frac{|\{(i, j) \in E : i \in A, j \notin A\}|}{\min(vol(A),\ 2m - vol(A))} $$

Where $A$ is a proposed cluster (a subset of nodes), $m$ is the number of edges, and $vol(A) = \sum_{i \in A} d_i$ is the total degree of the nodes in $A$.

The numerator is the **cut** of $A$: the number of edges crossing between $A$ and the rest of the graph. A lower conductance indicates a better cluster.
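Computing conductance is straightforward from an edge list. A small sketch (hypothetical undirected edge-list input):

```python
def conductance(edges, cluster):
    """phi(A) = cut(A) / min(vol(A), 2m - vol(A)) for undirected edges."""
    cluster = set(cluster)
    m = len(edges)
    # Edges with exactly one endpoint inside the cluster.
    cut = sum(1 for u, v in edges if (u in cluster) != (v in cluster))
    # vol(A): each edge contributes 1 to the degree of each endpoint.
    vol = sum((u in cluster) + (v in cluster) for u, v in edges)
    return cut / min(vol, 2 * m - vol)

# Two triangles joined by a single bridge edge: a natural cut.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
phi = conductance(edges, {0, 1, 2})  # cut=1, vol=7, 2m-vol=7 -> 1/7
```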

The general approach of spectral graph partitioning is:

- Compute the Laplacian matrix $L$ for the graph
- Compute eigenpairs of $L$, then use these to map points to lower-dimensional representations
- Then assign points to clusters based on this new representation

Say we represent our graph as an adjacency matrix $A$, where $A_{ij} = 1$ if there is an edge between nodes $i$ and $j$, and 0 otherwise (symmetric, for an undirected graph).

Then we also define a vector $x$ with one entry (e.g. a label or value) per node.

Consider the dot product of a row of $A$ with $x$: entry $j$ of the product $y = A \cdot x$ is the sum of the $x$ values of $j$'s neighbors:

$$ y_j = \sum_{i=1}^n A_{ji} x_i = \sum_{(i, j) \in E} x_i $$

To put it another way: $A \cdot x$ produces a new vector whose $j$th entry is the sum of the labels of node $j$'s neighbors.

We can think of the result in terms of eigenvectors, i.e.:

$$ A \cdot x = \lambda \cdot x $$

We define a **spectrum** as the eigenvectors $x_i$ of the graph's matrix, ordered by the magnitude of their corresponding eigenvalues $\lambda_i$.

We define the **degree matrix** $D$ as the diagonal matrix with $D_{ii} = d_i$, the degree of node $i$.

Then we can define the **Laplacian matrix** $L = D - A$.

One property of $L$ is that each of its rows (and columns) sums to 0.

This means we have the first eigenpair for $L$ for free: the all-ones vector $x = (1, \dots, 1)$ with eigenvalue $\lambda = 0$, since $L \cdot x = 0 = 0 \cdot x$. This is the trivial eigenpair.

So now with this Laplacian and other eigenvectors, we can start partitioning our graph.

We do this by computing eigenpairs beyond the trivial eigenpair and using these to map the rows of $L$ (i.e. the nodes) to lower-dimensional representations. In the simplest case, the eigenvector for the second-smallest eigenvalue splits the nodes into two clusters according to the sign of its entries.
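A sketch of this bipartitioning step (a hypothetical two-triangles-plus-bridge graph; `numpy.linalg.eigh` returns eigenvalues in ascending order for symmetric matrices, so column 1 is the second-smallest eigenvector, often called the Fiedler vector):

```python
import numpy as np

def spectral_bipartition(A):
    """A: symmetric 0/1 adjacency matrix. Split nodes by the sign of
    the entries of the Laplacian's second eigenvector."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    vals, vecs = np.linalg.eigh(L)  # ascending eigenvalues
    fiedler = vecs[:, 1]
    return fiedler >= 0

# Two triangles (nodes 0-2 and 3-5) joined by one bridge edge.
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1
side = spectral_bipartition(A)  # separates the two triangles
```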