
The final topic of this module on Eigenproblems, as well as the final topic of this course as a whole, will focus on an algorithm called PageRank. This algorithm was famously published by, and named after, Google founder Larry Page and colleagues in 1998, and was used by Google to help them decide the order in which to display websites when they were returned from a search.

The central assumption underpinning PageRank is that the importance of a website is related to its links to and from other websites, and somehow eigentheory comes up.

This bubble diagram represents a model mini Internet, where each bubble is a webpage and each arrow from A, B, C, and D represents a link on that webpage which takes you to one of the others. We're trying to build an expression that tells us, based on this network structure, which of these webpages is most relevant to the person who made the search.

As such, we're going to use the concept of Procrastinating Pat, an imaginary person who goes on the Internet and just randomly clicks links to avoid doing their work. By mapping all the possible links, we can build a model to estimate the amount of time we would expect Pat to spend on each webpage.

We can describe the links present on page A as a vector, where each row is either a one or a zero depending on whether there is a link to the corresponding page, and then normalise the vector by the total number of links, so that it describes a probability distribution for that page.

For example, the vector of links from page A will be (0, 1, 1, 1), because page A has links to sites B, C, and D, but no link to itself. And because there are three links on this page in total, we normalise by a factor of a third, so the total click probability sums to one. So we can write L_A = (0, 1/3, 1/3, 1/3).

Following the same logic, the link vectors for the next two sites are shown here. Finally, for page D, which is connected to B and C but not A, two sites in total, we can write L_D = (0, 1/2, 1/2, 0).

We can now build our link matrix L by using each of our link vectors as a column, which you can see will form a square matrix.
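Stacking the link vectors as columns can be sketched in a few lines of NumPy. Note that only L_A and L_D are stated explicitly here; the link vectors shown on screen for B and C are not in the transcript, so the values below (B linking to A and D, C linking only to D) are assumptions chosen to be consistent with the final ranking quoted at the end of the video.

```python
import numpy as np

# Outgoing-link vectors, rows ordered A, B, C, D.
# L_A and L_D are given above; L_B and L_C are assumed
# (B links to A and D; C links only to D).
L_A = np.array([0, 1/3, 1/3, 1/3])
L_B = np.array([1/2, 0, 0, 1/2])
L_C = np.array([0, 0, 0, 1])
L_D = np.array([0, 1/2, 1/2, 0])

# Each link vector becomes a column of the square link matrix L.
L = np.column_stack([L_A, L_B, L_C, L_D])

print(L.shape)        # (4, 4)
print(L.sum(axis=0))  # every column sums to 1: [1. 1. 1. 1.]
```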

What we're trying to represent here with our matrix L is the probability of ending up on each of the pages. For example, the only way to get to A is by being at B, so you then need to know the probability of being at B, which you could have got to from either A or D. As you can see, this problem is self-referential, as the ranks of all the pages depend on those of all the others. Although we built our matrix from columns of outward links, we can see that the rows describe inward links, normalised with respect to their page of origin.


We can now write an expression which summarises the approach. We're going to use the vector r to store the rank of all webpages. To calculate the rank of page A, you need to know three things about all other pages on the Internet: what's your rank, do you link to page A, and how many outgoing links do you have in total? The following expression combines these three pieces of information for webpage A only.

So r_A = Σ_{j=1}^{n} L_{A,j} r_j, where n is the total number of webpages, so the sum runs through each of our webpages. This means that the rank of A is the sum of the ranks of all the pages which link to it, weighted by their specific link probability taken from matrix L.
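As a sanity check of this sum for one page, here is r_A computed directly, using the final ranks quoted later in the video. Row A of the link matrix is partly an assumption: the transcript says the only way to get to A is via B, and B's two outgoing links give it probability 1/2.

```python
import numpy as np

# Row A of the link matrix: only B links to A, with probability 1/2
# (B's link vector is an assumption; only L_A and L_D are given).
L_row_A = np.array([0, 1/2, 0, 0])

# Ranks r_j, taken from the result quoted at the end of the video.
r = np.array([0.12, 0.24, 0.24, 0.40])

# r_A = sum over j of L[A, j] * r[j]
r_A = sum(L_row_A[j] * r[j] for j in range(4))
print(r_A)  # 0.12
```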

Now we want to be able to write this expression for all pages and solve them simultaneously. Thinking back to our linear algebra, we can rewrite the above expression, applied to all webpages, as a simple matrix multiplication: r = Lr.

Clearly, we start off not knowing r, so we simply assume that all the ranks are equal and normalise them by the total number of webpages in our analysis, which in this case is 4. So r = (1/4, 1/4, 1/4, 1/4).


Applying this expression repeatedly means that we are solving the problem iteratively. Each time we do this, we update the values in r until, eventually, r stops changing, so that r really does equal Lr.
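A minimal sketch of this iteration in NumPy (as before, the columns for B and C are assumptions consistent with the final ranking; only L_A and L_D are stated in the transcript):

```python
import numpy as np

# Link matrix: column j is page j's outgoing-link vector.
# Columns A and D are from the transcript; B and C are assumed.
L = np.array([
    [0,   1/2, 0, 0  ],   # links into A
    [1/3, 0,   0, 1/2],   # links into B
    [1/3, 0,   0, 1/2],   # links into C
    [1/3, 1/2, 1, 0  ],   # links into D
])

r = np.full(4, 1/4)           # start with all ranks equal
for _ in range(1000):
    r_next = L @ r            # apply r <- L r
    if np.allclose(r_next, r, atol=1e-10):
        break                 # r has stopped changing
    r = r_next

print(r)  # ~ [0.12, 0.24, 0.24, 0.40]
```

Because each column of L sums to one, the total rank stays fixed at 1 throughout the iteration.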

Thinking back to the previous videos in this module, this implies that r is now an eigenvector of the matrix L, with an eigenvalue of 1. At this point, you might well be thinking that if we want to multiply r by L many times, perhaps this would be best tackled by applying the diagonalisation method that we saw in the last video. But don't forget, this would require us to already know all of the eigenvectors, which is what we're trying to find in the first place.

So now that we have an equation, and hopefully some idea of where it came from, we can ask our computer to apply it iteratively until it converges, to find our rank vector. You can see that although it takes about ten iterations for the numbers to settle down, the order is already established after the first iteration. However, this is just an artifact of our system being so tiny.

So now we have our result, which says that as Procrastinating Pat randomly clicks around our network, we'd expect them to spend about 40% of their time on page D, but only about 12% of their time on page A, with 24% on each of pages B and C. We now have our ranking: D at the top, A at the bottom, and B and C equal in the middle.

As it turns out, although many approaches for efficiently calculating eigenvectors have been developed over the years, repeatedly multiplying a randomly selected initial guess vector by a matrix, which is called the power method, is still very effective for the PageRank problem, for two reasons. Firstly, although the power method will clearly only give you one eigenvector, when we know that there will be n of them for an n-webpage system, it turns out that, because of the way we've structured our link matrix, the vector it gives you will always be the one you're looking for, with an eigenvalue of 1.

Secondly, although this is not true for our little four-webpage mini Internet, when looking at the real Internet you can imagine that almost every entry in the link matrix will be zero, i.e., most pages don't link to most other pages. This is referred to as a sparse matrix, and algorithms exist such that the multiplications can be performed very efficiently.
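As a toy illustration of the idea (not Google's actual data structure), a sparse matrix can be stored as just its nonzero entries, and the product Lr then only touches those entries rather than all n² of them. The entries for pages B and C are again assumptions:

```python
import numpy as np

# Nonzero entries of L as (row, col, value) triples.
# Pages are ordered A=0, B=1, C=2, D=3; B's and C's links are assumed.
nonzeros = [
    (0, 1, 1/2),                             # B -> A
    (1, 0, 1/3), (1, 3, 1/2),                # A -> B, D -> B
    (2, 0, 1/3), (2, 3, 1/2),                # A -> C, D -> C
    (3, 0, 1/3), (3, 1, 1/2), (3, 2, 1.0),   # A -> D, B -> D, C -> D
]

def sparse_matvec(triples, v):
    """Compute L @ v touching only L's nonzero entries."""
    out = np.zeros(len(v))
    for row, col, value in triples:
        out[row] += value * v[col]
    return out

r = np.full(4, 1/4)
print(sparse_matvec(nonzeros, r))  # same result as the dense product L @ r
```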


Adding a damping factor, d, contributes an additional term to our iterative formula, which becomes r_{i+1} = d L r_i + (1 - d)/n. Here d is something between 0 and 1, and you can think of it as 1 minus the probability with which Procrastinating Pat suddenly, randomly types in a web address rather than clicking on a link on their current page. The effect this has on the actual calculation is about finding a compromise between the speed and the stability of the iterative convergence process.
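The damped update can be sketched as below. The value d = 0.85 is a commonly quoted choice, not one stated in this video, and the link matrix columns for B and C are again assumptions:

```python
import numpy as np

# Link matrix as before (columns B and C assumed).
L = np.array([
    [0,   1/2, 0, 0  ],
    [1/3, 0,   0, 1/2],
    [1/3, 0,   0, 1/2],
    [1/3, 1/2, 1, 0  ],
])

d = 0.85            # damping factor, 0 < d < 1 (0.85 is an assumed value)
n = 4
r = np.full(n, 1/n)
for _ in range(1000):
    r = d * (L @ r) + (1 - d) / n   # r_{i+1} = d L r_i + (1 - d)/n

print(r)  # damped ranks; they still sum to 1
```

Because the extra term spreads (1 - d)/n of rank to every page uniformly, the total rank remains 1 at every step.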

There are over one billion websites on the Internet today, compared with just a few million when the PageRank algorithm was first published in 1998, and so the methods for search and ranking have had to evolve to maximise efficiency, even though the core concept has remained unchanged for many years.
