0:09

Hello, today we're going to talk about another way to find central nodes in

Â the network.

Â Just like PageRank,

Â this way was also developed in the context of how a search engine might go about

Â finding important web pages given a query using the hyperlink structure of the web.

Â So the first step will be to find a set of relevant webpages.

Â So for example,

Â web pages that contain the query string in the text of the web page or for

Â some reason the search engine thinks these might be an important page to look at.

Â So these are potential authorities,

Â potential pages that are important given the query that the user submitted.

Â This will be called the root set.

Â And so let's say in this example that nodes A, B, and

Â C are these potential authorities, this is the root set.

Â And the next step will be to find all the web pages that link to

Â any page in the root set, and these pages will be potential hubs.

Â So hubs are pages that are not themselves necessarily relevant to

Â the query that the user submitted, but they link to pages that are relevant.

Â So they're pages that are good at pointing at things that may be relevant.

Â 1:16

And let's say that in this example the nodes E, F, G, D,

Â and H are these pages that point to at least one of the pages in the root set.

Â This whole set of nodes, whatever it was in the root and

Â anything that points to something in the root, is going to be called the base set.

Â And we're going to consider all the hyperlinks that link

Â any node in the base set to any other node in the base set.

Â So there may be many other edges in this network and

Â we're going to consider them all.

Â And so this is the network that we're going to use in

Â order to find the important web pages.

Â 2:22

Now, the difference is that the HITS algorithm is going to keep track of two

Â kinds of scores for every node, the authority score and the hub score.

Â The first step is to give every node an authority and

Â a hub score of 1, and then we're going to apply two different rules.

Â These rules are going to be similar to the rules that we

Â used when we computed PageRank, but

Â again now we're going to have to keep track of two different scores.

Â So the first rule is going to be the Authority Update Rule,

Â which says that each node's authority score is going to be the sum of the hub

Â scores of each node that points to that node.

Â 2:55

And then the other rule is the Hub Update Rule,

Â which is sort of the symmetric thing.

Â So each node's hub score is going to be the sum of the authority scores

Â of each node that it points to.

Â And then we're going to have to do some normalizations because these scores

Â are going to keep growing and growing, so

Â at every iteration we're going to normalize the hub and the authority score.

Â And so, for example, the way you normalize the authority score is you take each

Â node's authority score, let's say j, and you divide the authority score of

Â j by the sum of all the authorities in the whole network.

Â And then we're going to repeat this process eight times.

Â Now, this might seem a little abstract at this point but

Â let's run through an example to see exactly how it works.

Â So we're going to take the network that we had just discussed,

Â and we're going to compute two iterations of the HITS algorithm of the network.

Â So just like in PageRank, we're going to have to keep track of the old

Â scores to be able to compute the new scores.

Â And because the first step is to give every node a hub and

Â authority score of one, we're going to start there.

Â We're going to give every node an old authority on hub score of 1,

Â and then we're going to compute the new scores.

Â So let's start with the authority scores.

Â We look at node A and we're going to look at what nodes point to node A in order to

Â figure out what the new authority score of A is, and

Â it turns out that C, G, and H point to A.

Â And because C, G, and H all have hub score of 1,

Â then the new authority score of A is going to be 3.

Â Next, node B, nodes D and E point to B, and E and D both have hub score of 1,

Â so B is going to have a new authority score of 2.

Â So you can see at this point that what we're really doing when we get this new

Â authority score is looking at the in-degree of each one of the nodes, and

Â that's what's going to happen.

Â So, for example, node C has an in-degree of 5.

Â And because all the nodes at this point have a hub score of 1,

Â then C is going to have a new authority score of 5.

Â And then D has two nodes pointing to it, so it has a new authority score of 2.

Â And E has in-degree of 1, F has in-degree of 1.

Â G has in-degree of 0, so it's going to have a new authority score of 0.

Â And H has one node pointing to it, so it has a new authority score of 1.

Â Okay, now let's move to the new hub scores.

Â It's going to be very similar, but now instead of looking at the in-degree

Â of every node, we're going to look at the auth degree.

Â And so, for example, A has an auth degree of 1, it points to D.

Â And now we have to look at these old authority score.

Â And again, because this is our first step,

Â every node has an old authority score of 1.

Â And D does as well, and so A is going to have a new hub score of 1.

Â B points to two things, both of them have an authority score of 1,

Â and so B has hub score of 2.

Â C has an auth degree of 1, so it has a new hub score of 1.

Â D has an auth degree of 2.

Â E has an auth degree of 4.

Â F has an auth degree of 2.

Â G has auth degree 2, and H has auth degree 1.

Â 6:05

Next, we have to normalize.

Â And so, to normalize we have to add up the authority scores and

Â add up the hub scores.

Â In this case, they'll both add up to 15.

Â It's not the case that in every duration they're going to add up to the same thing,

Â but in the first one they do.

Â So they both in this case add up to 15.

Â And so we have to divide all the scores by 15.

Â And so if we do that we normalize its scores.

Â And now the new authority and hub scores are going to become our old scores.

Â And we're ready to go for the next iteration, which is going to be slightly

Â more interesting since now the nodes don't all have the same authority and hub score,

Â so we have to pay attention at the nodes that we're looking at.

Â So let's start with node A.

Â We want to figure out the new authority score of A, and so

Â we have to figure out what nodes point to A.

Â So C, G, and H all point to A.

Â And now we have to look at the hub scores of C, G, and H,

Â which are 1/15, 2/15, and 1/15.

Â And so we add those up and

Â we get 4/15, and that is going to be a new authority score.

Â 7:07

Now let's move to B. B has two nodes pointing to it, E and D.

Â And they have old hub score of 2/15 and 4/15, which adds up to 6/15.

Â And then C has five things pointing to it, nodes E, F, G, B, and

Â D, which have these old hub scores that I'm highlighting here.

Â And they add up to 12/15, so that's C's new authority score.

Â D has two things pointing to it with old hub score of 1/15 and 4/14,

Â which adds up to 1/3, and so on, right?

Â We can continue doing this for all of the other nodes and

Â find all of the new authority scores.

Â Now, let's go and try to find the new hub scores.

Â So, looking at A, again now, we're not looking at the in-degree,

Â we're not looking at who points to it, but who does A point to.

Â 7:56

And now we have to pay attention to the old authority scores of those nodes.

Â So in this case, A points to D, and D has a old authority score of 2/15,

Â so that's going to be A's new hub score.

Â And then for B it points to E and D, and they have and

Â old authorities score of 1/15 and 1/3,

Â which adds up to 2/5, and so that's going to be B's new hub score.

Â C points to A and A has an old authority score of 1/5, so that's C's new hub score.

Â D points to B and C, and they have an old authority score of 2/15 and

Â 1/3, which adds up to 7/15, and then so on.

Â We can continue doing this for all the other nodes and find the new hub scores.

Â Next we have to normalize.

Â So we have to add up all the authority scores.

Â In this case, they add up to 35/15.

Â So we have to divide every new authority score by 35/15.

Â So if we do that, these are the updated normalized scores.

Â And we do the same thing for the hubs.

Â So we add up the hub scores for all the nodes, which adds up to 3 in this case.

Â And so we have to divide every new hub score by 3, and

Â then these are the updated normalized scores.

Â And so these are our final new authority and

Â hub scores after two iterations of the HITS algorithm.

Â 9:29

Well, let's take a look.

Â So here are the scores that we just computed for k equals 2, 2 iterations,

Â these are the authority scores.

Â So what happens as we go to 4 iterations?

Â Well, this is what you would find.

Â And 6 iterations, this is what you would find.

Â And same thing for the hub scores.

Â These are the scores that we found so far with 2 iterations, and

Â we can figure out what they are for 4 iterations and 6 iterations.

Â So what you see here is that for some nodes these scores aren't changing,

Â but for others they are changing.

Â So here I'm highlighting the nodes for

Â which the authority or hub score continued to change after 6 iterations.

Â So for example, node B here starts out with an authority score of .15.

Â Then after 4 iterations, it goes to .18.

Â Then after 6 iterations, it goes to .19.

Â So will this score for B continue to grow or

Â will it saturate at some point, what could happen here?

Â And so if we continue iterating, here I'm showing you what happens to the hub and

Â authority score of node B.

Â So in this plot on the x-axis we have the number of iterations, and

Â then the y-axis we have the authority and hub scores for node B.

Â As it turns out for most networks, as k gets larger, the authority and

Â the hub scores actually do converge to a unique value.

Â And in this case,

Â this is a unique value that you find as k becomes larger and larger.

Â And so for this network the nodes with the highest authority score are B and

Â C, and the nodes with the highest hub score are D and E.

Â And if you remember the original set up of the network, B, C,

Â and A, were these root nodes, right?

Â The ones that were supposed to be very relevant, and it turns out that B and

Â C have the highest authority scores.

Â And then biggest hubs, the nodes that are good at pointing at nodes that

Â are particularly irrelevant or particular authorities, are D and E.

Â And if you notice both A and D point to B and C, as well as other nodes.

Â In particular, D only points to B and C, and E points to them too, and also others.

Â