0:00

Okay, now, we are going to talk about a different class of random graph

Â models that begins to bring in attributes of individuals or nodes into the picture.

Â And as we have looked in terms of random graph models so far, we have talked about

Â Erdos-Renyi really basic uniform at random, used for benchmark models.

Â And we saw that they missed a lot of real world features.

Â Then we talked about other kinds of link

Â by link models that began to bring in features.

Â We looked at, at modes that kept clustering.

Â They could match different kinds of degree distributions.

Â They could get some aspect of co, correlation.

Â And what we are going to start to do now is, is bring in attributes of nodes and,

Â so, we'll start with stochastic blocks models which are

Â sort of the natural enrichment of Erdos-Renyi random networks.

Â So here they are going to allow

Â the probability of a link to actually

Â depend on some characteristics of the nodes.

Â And this could be characteristics attributes that,

Â that could be, that could be latent.

Â So it might be that that we are not fully observing all these things.

Â We're trying to infer them or there could be observed characteristics.

Â And so we are going to start with

Â simple block models that, where everything is, is observed.

Â 1:14

And so here,

Â the idea now is that nodes have characteristics,

Â things like age, gender, religion, profession, and so forth.

Â And links between nodes depend on those characteristics.

Â So people of of similar ages are more

Â likely to link than people of different ages.

Â People of similar religions and so forth.

Â And so now let's start with a simple situation.

Â We have three types of nodes in this example,

Â blue, green, and yellow nodes. They're connected in a network.

Â And we can imagine, then, that there's, the

Â model is that there's different probabilities of connections.

Â So if there were homophily here, we think of

Â the blues being more likely to connect to blues.

Â There's a probability that yellows connect with yellows.

Â There's a probability that greens connect with greens.

Â Probability that blues connect with greens, and so forth.

Â So we have different probabilities for every different possible combination

Â of types of nodes in this situation.

Â But then everything else about this model is independent.

Â So, the links are formed independently, it's just

Â the probabilities are varying depending on node attributes.

Â And any two pair of blue nodes have the same probability of linking to each other.

Â Any, pair of a green and a blue node has the same

Â probability of, of linking as any other pair of, of green and blue.

Â It's just that we're allowing the probabilities to vary with types

Â of nodes.

Â Okay, so if this were the, the resulting network that we observed, this is

Â a very easy kind of model to go ahead and, and estimate probabilities for.

Â And in this case, for instance, if we're trying to estimate what's the probability

Â of, of blue links links from, between two blue nodes, then what do we see?

Â Well we see that that there is five actual links present among blues right,

Â one, two, three, four, five and out of four possible nodes

Â we have six possible total so we end up with an estimate of five six.

Â 3:41

More generally, these kinds of models are

Â going to be worked with, often with continuous variables.

Â And, so if we have continuous covariates so, for instance, we keep track of age in,

Â in, in days or some for instance or in fractions of years, then we

Â would have, instead of just green yellow and blue, we might have people that are

Â 34 and half years old and people that are 60 years old and so forth.

Â And so then when we start to think about a, links between two nodes,

Â we could allow them to, to depend

Â in more complicated ways on, on the characteristics.

Â So if we have some vector of characteristics that i has, some vector of

Â characteristics that j has, then we could have a, a vector of parameters beta i.

Â Vector of parameters could be the j.

Â 4:28

It could also be that, that there's, that the probability of the link

Â depends on differences, so people that are close in age are more likely to link

Â than people that have distance between ages

Â or it could be geographic locations, GPS data.

Â If you live closer together you're more likely to,

Â to link together than if you live far apart.

Â And then we have basically a function,

Â which we're going to base the probability on.

Â And now the standard formulation for this, since this could be depending on

Â the parameters and so forth, this might be positive or negative.

Â The idea would then be that, that often one uses

Â a logistic form where you look at the log of the

Â probability that there's a link between i and j compared to

Â the probability that there's not a link between i and j.

Â Right, so we take the odds ratios here.

Â What's the ratio of the odds that there are a link, compared to not.

Â And we just say that the log of that is proportional to this.

Â So this would be a standard way of fitting this

Â kind of model.

Â And what that's going to do is then give you

Â an ability to continuously estimate

Â 5:30

probabilities based on different attributes.

Â And so, so now it's a more complicated stochastic model,

Â but one that still very easy to go ahead and estimate.

Â And you can use this in most any standard

Â statistics package or allow you to do logistic regressions.

Â And estimate this quite easily.

Â Now one possibility to

Â using this kind of model would be to test for homophily for instance.

Â So is there real difference between the probability of,

Â of certain types linking compared to linking across types?

Â So for instance, here is an example.

Â This is from the data set I talked about earlier in, in the course from

Â work that I did with Abijeet Bannerjee, Arun Chandrasekar, and Esther du Fleau.

Â 6:13

So this is the 2013 Science paper and, and this is what I

Â did here is I pulled out one of the villages, village 26 in particular.

Â And here the nodes are, are color-coded by caste.

Â And in particular, the blue nodes are what are known as scheduled castes and

Â scheduled tribes, so these are castes that

Â are under affirmative action by the Indian government.

Â The reds are general castes and otherwise backward castes.

Â So these are not affirmative action castes.

Â So.

Â This is a differences in caste, and then we could look here,

Â and we could say okay, let's do the very simple block model.

Â We're just going to see what's the probability of linking with somebody of

Â your same caste compared to linking with somebody of a different caste.

Â So here we can go through and just directly count how

Â many links there are among members of the same caste and

Â going across castes, where caste here is

Â just this dichotomous version actually, the particular caste.

Â There is quite a number of castes in these villages.

Â So what do we end up with, if you

Â do the count here the probability of having a link.

Â Between a red and a blue node in this graph is 0.006, the probability

Â of having a link between either two reds or two blues is 0.089.

Â So more than ten times more likely to have a link within the same

Â caste as a caste designation in this case, as oppose to going across this boundary.

Â And so we see that it indeed there is a

Â substantial homophily based on this kind of simple block model.

Â And that block model then allows us to, to, keep track of these things.

Â And if we hadn't colored these nodes, it's

Â not so obviously that we would see that there

Â is a, a strong dichotomy here as once we

Â start keeping track of this and then estimating directly.

Â Now, we could have not had the pictures of the, of

Â the colors here, and we could have tried to discover that.

Â That's known as community detection or clustering algorithms.

Â So there's ways in which we could begin to look for things.

Â And we might begin to say, okay, look actually this,

Â even the blues look like they're segmented into different groups.

Â We could try and, and fit,

Â and see whether there's additional blocks, and whether there is more density in here

Â than over here and your algorithms and techniques people use for doing that.

Â So what's the important aspect that's missed from block models?

Â Well, the likelihood does depend on node attributes, either observed or, or latent.

Â But in practice. Often that the probability that that two

Â nodes have been interacting, so people would been interacting

Â could depend on whether they have friends in common.

Â So there is real social structure to it.

Â So let's you know, the fact that, that A and

Â B both have a common friend C makes it more

Â likely that they are going to be linked to each

Â other than if, if they didn't have a friend in common.

Â And so that could be an aspect, which independently of their

Â characteristics, people tend to meet people through other, through friends or

Â have reasons for spending time together in,

Â in triples and so forth, and that's going

Â to lead to extra things, so we're going to need richer models to capture that.

Â So next up we'll be talking about classes of models

Â that allow us to keep track of these explicit dependencies.

Â Which is going to make our life a little

Â more difficult statistically, because things aren't independent any longer,

Â but will allow us to capture a lot

Â of things in networks that are going to be important.

Â