0:00

With all this work done, let's look at a bigger example. It does not illustrate the scale challenge, which we'll come back to very briefly in a moment, but it is slightly bigger than four nodes: now we're talking about eight nodes, one through eight, representing eight webpages interconnected by directional links, these hyperlinks. If we just stare at this graph and ask:

which node is most important? Well, by in-degree count we might say node four is most important: its in-degree is four, bigger than any other node's. But if you think along the lines of Google PageRank, which implicitly assumes a certain navigation behavior, you see that node four basically gives away all of its importance score to a single node downstream, node three, and node three does not give it back to four; it spreads it further to nodes two and eight. So this gives you a hesitation: maybe node three's importance score by the PageRank calculation will be higher than node four's. And indeed, we'll see that's the case.

Intuitively, nodes two, three, and four should be highly ranked, and that is what we will see. And which nodes will have the least importance score? Well, those nodes that are not pointed to by many other important nodes. We can see that node seven and node six will probably have low importance scores, and indeed that's the intuition we'll confirm in a short while: six and seven will be lowly ranked, and two, three, and four highly ranked.

So, to make this a little more efficient, I've written out the H matrix already, based on the topology of the given graph here. Now we can construct H hat, but H hat is the same as H, because there are no dangling nodes. Then we can construct the matrix G, with the randomization parameter, or rather the anti-randomization parameter, set to 85%: fifteen percent of the time, we'll be doing random hopping among the pages. Then you can write down the G matrix and start doing the iteration, pi transpose at the next step equals pi transpose at the current step times G, from one iteration to the next. This shows the initialization: all the nodes get one-eighth, exactly evenly distributed. It doesn't matter what initialization is used, actually; it would change the convergence speed a little, but not the end result. The first iteration is shown here.

Â 3:29

And I've already normalized them, so what you see here are the normalized importance scores, ordered in descending order. Node three has the highest importance score and is therefore ranked number one, followed by node two, followed by node four. You can see that node three is actually by far the most important. 0.2 sounds like a small number, but remember, pi is now normalized so the entries add up to one, and 0.2 out of eight nodes is actually quite impressive. Nodes two and four are about the same. And nodes six and seven are by far the smallest, at about 0.06 and 0.04. But then again, Google PageRank returns only the ranking; it does not show you the actual scores, so the scores are just an intermediate step.
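The full link structure isn't spelled out in the transcript, so the eight-node graph below is a hypothetical stand-in that matches the features described (no dangling nodes; node four has in-degree four and points only to node three; node three points to nodes two and eight). A minimal sketch of building G and running the power iteration:

```python
import numpy as np

# Hypothetical 8-node hyperlink graph (illustrative stand-in, NOT the
# lecture's exact topology): out_links[i] lists the pages that page i links to.
out_links = {1: [2], 2: [3], 3: [2, 8], 4: [3],
             5: [4, 6], 6: [4], 7: [4, 5], 8: [4, 1]}

N = 8
H = np.zeros((N, N))
for i, outs in out_links.items():
    for j in outs:
        H[i - 1, j - 1] = 1.0 / len(outs)   # each page splits its score evenly

# No dangling nodes here, so H_hat = H.
alpha = 0.85                                # 85% follow-a-link, 15% random hop
G = alpha * H + (1 - alpha) / N             # Google matrix

pi = np.full(N, 1.0 / N)                    # initialize evenly: one-eighth each
for _ in range(100):
    pi = pi @ G                             # pi^T <- pi^T G

print(np.argsort(-pi) + 1)                  # pages in descending importance
print(np.round(pi, 3))                      # normalized scores (sum to one)
```

On this stand-in topology, node three comes out on top and node seven at the bottom, echoing the intuition above; the exact scores, of course, depend on the true link structure.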

Â 4:22

And this confirms our intuition, what we just said before: node three is perhaps the most important, even though node four has a higher in-degree. That's because the PageRank calculation models the navigation behavior. If a lot of people come to a webpage, but then they all have to go out through one downstream page, then that downstream page actually receives one hundred percent of the importance score spreading from node four.

Â 5:27

In fact, when we generalize PageRank, the famous algorithm we just showed, a little bit--and you'll see exactly what I mean by that in the advanced material part of the lecture--you will see a striking parallel between PageRank and DPC. PageRank is used by Google for billions of searches each day, and DPC, distributed power control, is used by all the 3G cellphones; it was invented by a number of companies, but eventually most of that work funneled through Qualcomm. So these are two vastly different industries. DPC was invented in the early 90s, PageRank in the mid to late 90s, but mathematically they have a very interesting parallel.

On one side is a network of interfering nodes, where the topology, the graph, is represented by a matrix G whose entries Gij are the direct and interference channel gains. On the other side is a graph of webpages, hyperlinked and directed. It turns out there's another matrix, with the same symbol but a very different definition, the Google matrix, which helps you rank these webpages. The functionalities are vastly different, but it turns out the generalized version of the PageRank we just saw has exactly the same mathematical formula as DPC. So if you are curious about that, please come to the advanced material part of the lecture.

Now, PageRank can be understood from quite a few different angles. We talked about the eigenvector angle. We talked about the iterative computation angle, modeling the hopping behavior, the navigation behavior of a user, or this random-walk-on-a-graph behavior. There's also an economic growth model angle, which is actually what led to this equivalence. But the biggest challenge for Google in using PageRank is the challenge of scale: the N is too big. Fortunately, these graphs are sparse: H is a sparse matrix.

Â 7:50

And the other matrices added to it are rank-one matrices. So you add two rank-one matrices to a sparse matrix, and you get a very well-structured matrix G that helps a lot in scaling up the computation. Notice that this computation is done on a central server; DPC is done in a distributed way--that's the word "distributed" in distributed power control. So there's a big contrast between the two. But then again, with DPC we're talking about tens of nodes; here we're talking about millions of nodes relevant to a search, if not the entire set of webpages, 40 to 60 billion out there.
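A minimal sketch of why that structure helps: one step of pi transpose times G can be computed by touching only the actual links, because both rank-one corrections (the dangling-node fix and the random-hop term) collapse to a single scalar spread uniformly over all pages. This helper is illustrative, not Google's implementation:

```python
import numpy as np

def pagerank_step(pi, out_links, alpha=0.85):
    """One iteration pi^T <- pi^T G without forming the dense N-by-N G.

    G = alpha * H_hat + (1 - alpha) * (1/N) * ones * ones^T, where H_hat
    is the sparse H plus a rank-one dangling-node correction.  Both
    rank-one pieces reduce to a scalar spread uniformly over the N pages,
    so the work per step is proportional to the number of links, not N^2.
    """
    N = len(pi)
    new = np.zeros(N)
    dangling_mass = 0.0
    for i, outs in enumerate(out_links):
        if outs:                                 # sparse part: real links only
            share = alpha * pi[i] / len(outs)
            for j in outs:
                new[j] += share
        else:                                    # dangling row of H_hat
            dangling_mass += pi[i]
    # rank-one terms: dangling mass and random hop, spread uniformly
    new += (alpha * dangling_mass + (1 - alpha) * pi.sum()) / N
    return new

# Tiny 4-page example with one dangling page (page index 2):
out_links = [[1], [0, 2], [], [0, 2]]
pi = np.full(4, 0.25)
for _ in range(50):
    pi = pagerank_step(pi, out_links)
print(np.round(pi, 3))
```

The cost per iteration is proportional to the number of hyperlinks rather than N squared, which is what makes the computation feasible at web scale.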

There are quite a few tricks you can use to improve the scalability of this centralized computation. One is numerical linear algebra methods: there are a lot of ways to decompose the matrix H so that you speed up the matrix multiplication. There are also a few more tricks. For example, it's only the order that matters, so don't worry too much about converging on the exact values of the importance scores. Certain webpages can be grouped together. The dangling nodes can be treated separately. And you may also build a hierarchy: group certain clusters of nodes together, run an approximation, and only then distribute a given importance score among the nodes within a cluster. In the homework problem, you look at this cluster-based hierarchical approximation. So whether it's speed-up methods or approximation tricks, there are many approaches to make this algorithm run fast for a large N.

Now, finally, Google also plays a game with companies doing SEO, search engine optimization. As soon as Google became popular around '98, SEO popped up around '99 to 2000, and since then it's been over a decade of constant struggle back and forth between the two. SEO, in a nutshell, basically tries to increase your webpage's ranking by doing things that leverage the PageRank algorithm. For example, an SEO company may say: here's a truly important webpage I've constructed; if you pay me, I will provide pointers to your webpages, even though your webpages may not deserve a high rank. So they basically try to change the structure of the H matrix. But Google is not sitting idle; it's also reacting. Two recent reactions came in early 2011 and May 2012, when Google announced changes in the way they run the actual, detailed version of the PageRank methods, to counter SEO's artificial lift of certain webpages' importance scores.

Â 10:57

So, there are a lot of interesting stories that we'll not have time to go into. Let's quickly summarize what we have seen so far in this lecture. The key concept is that we can view the collection of hyperlinked webpages as a network. There's a graph, which we represent by matrices such as H, H hat, and G. There's also a functionality, navigation, that we implicitly model through the way we construct these matrices. You can disagree with those models, but so far Google's search results have proven to be quite useful. And this is a general theme: the connectivity pattern provides a hint of node importance. We'll come back to this in lecture four with three additional importance metrics suitable for other purposes.

So PageRank, this famous algorithm we just saw and constructed, uniquely defines and then efficiently computes a consistent set of importance scores based on these ideas. And one of the angles we just saw and emphasized is the dominant eigenvector angle: pi star transpose equals pi star transpose times the Google matrix G.
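As a quick numerical check of that eigenvector statement, here is a tiny made-up three-page example: the normalized dominant left eigenvector of G satisfies the fixed-point equation pi-star transpose equals pi-star transpose times G.

```python
import numpy as np

# A small row-stochastic Google matrix built from a made-up 3-page graph:
# page 0 -> 1, page 1 -> 0 and 2, page 2 -> 0; alpha = 0.85.
H = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [1.0, 0.0, 0.0]])
alpha = 0.85
G = alpha * H + (1 - alpha) / 3

# Dominant LEFT eigenvector of G = right eigenvector of G^T for eigenvalue 1.
vals, vecs = np.linalg.eig(G.T)
k = np.argmax(vals.real)                  # Perron eigenvalue 1 is dominant
pi_star = np.abs(vecs[:, k].real)
pi_star /= pi_star.sum()                  # normalize scores to sum to one

# Fixed-point check: pi*^T = pi*^T G.
print(np.allclose(pi_star @ G, pi_star))
```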
