0:09

Well, what beta diversity is instead is a measure of the similarity or

dissimilarity between two different samples.

And we especially think about beta diversity in an

ecological context as measuring the amount of change between different environments.

0:24

So in other words, the similarity between

the communities in different samples is inversely related to beta diversity.

The higher the beta diversity, the less similar the communities are.

Â For any two people, base diversity turns out to be relatively low in their mouths.

Â In other words, all of us have relatively similar microbes.

Â Well reality high on the gut.

Â In other words, different people have different microbes from one another.

Â 1:06

You probably remember our friends from the last lecture on alpha diversity.

But when we're talking about entire communities and

their similarity and dissimilarity, it gets a bit unwieldy to try to juggle them all, so

we're going to resort to slides for this.

1:42

Therefore, the dissimilarity, or beta diversity, is higher between sample A and

sample B than between sample C and

sample D, because A and B have fewer things in common.

We also need to

consider whether organisms that are shared are present in the same abundance.

2:06

However, in sample E the blue organisms make up about 73% of the community,

whereas they make up only about 29% of sample F.
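
One way to make the abundance point concrete is to compare a measure that only looks at presence or absence with one that also looks at abundance. The sketch below uses Jaccard and Bray-Curtis dissimilarity, two common choices that the lecture itself does not name. The blue proportions are the ones quoted above; the second organism ("other") and its proportions are hypothetical, chosen only so that each sample adds up to 100%.

```python
def jaccard_dissimilarity(a, b):
    """1 - (shared organisms / all organisms); ignores abundance."""
    present_a = {t for t, n in a.items() if n > 0}
    present_b = {t for t, n in b.items() if n > 0}
    return 1 - len(present_a & present_b) / len(present_a | present_b)


def bray_curtis_dissimilarity(a, b):
    """Sum of absolute abundance differences divided by total abundance."""
    taxa = set(a) | set(b)
    diff = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    total = sum(a.get(t, 0) + b.get(t, 0) for t in taxa)
    return diff / total


# "blue" values are from the lecture; "other" is a hypothetical filler.
sample_e = {"blue": 0.73, "other": 0.27}
sample_f = {"blue": 0.29, "other": 0.71}

jac = jaccard_dissimilarity(sample_e, sample_f)
bc = bray_curtis_dissimilarity(sample_e, sample_f)
print(jac, round(bc, 2))
```

Under these assumptions the presence/absence measure sees the two samples as identical, while the abundance-aware measure does not.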

We can also look at how closely related the organisms are between two communities.

2:30

In this case we also have the phylogenetic tree that describes

the evolutionary relationships between these organisms.

So in this case what you can see is that all three communities share one organism,

the dark blue one, at about the same abundance of 50%.

However, the light blue microbe is very closely related to

that dark blue one according to the phylogenetic tree,

whereas, for example, if we look at the green microbe in the second sample,

that's much more distantly related.

2:58

Therefore, sample one is more similar to sample three than it is to sample two,

even though they share the same number of species.

It matters whether the species are closely related or distantly related.

3:18

This method of using the phylogenetic tree as a measuring stick to tell how

similar or dissimilar different communities are

is a technique called UniFrac, which Cathy Lozupone and I introduced back in 2005.
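
The idea of the tree as a measuring stick can be sketched with the unweighted form of UniFrac: the branch length that leads to only one of two samples, divided by the branch length that leads to either. The tree below is hypothetical, since the real one is on a slide: the branch lengths, the "teal" organism, and the assumption that it is sample one's second organism are all made up for illustration.

```python
# HYPOTHETICAL tree: child -> (parent, branch length).
tree = {
    "blue_clade": ("root", 0.2),
    "dark_blue": ("blue_clade", 0.1),
    "light_blue": ("blue_clade", 0.1),
    "teal": ("blue_clade", 0.1),   # made-up close relative of the blues
    "green": ("root", 1.0),        # distantly related
}


def branches(sample):
    """All branches on the paths from a sample's organisms up to the root."""
    seen = set()
    for node in sample:
        while node != "root":
            seen.add(node)
            node = tree[node][0]
    return seen


def unweighted_unifrac(s1, s2):
    """Branch length unique to one sample / branch length in either sample."""
    b1, b2 = branches(s1), branches(s2)
    unique = sum(tree[b][1] for b in b1 ^ b2)
    total = sum(tree[b][1] for b in b1 | b2)
    return unique / total


sample_1 = {"dark_blue", "teal"}         # second organism is an assumption
sample_2 = {"dark_blue", "green"}
sample_3 = {"dark_blue", "light_blue"}

d13 = unweighted_unifrac(sample_1, sample_3)
d12 = unweighted_unifrac(sample_1, sample_2)
print(round(d13, 2), round(d12, 2))
```

With these made-up lengths, samples one and three come out closer than samples one and two, even though each pair shares exactly one organism.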

Cathy completed a really brilliant PhD thesis on this when she was a member of

my lab as a graduate student.

And she's now a faculty member at the Medical Campus, where she

works, among other things, on links between microbes and autism.

And we'll hear from her on that topic in an interview later in the course.

3:50

But right now, we're going to hear from another talented current member of my

lab, Will Van Treuren.

Will has a background in applied math and MCDB, a very unusual combination.

And more recently he's been working on a lot of

fascinating microbial analysis,

including pushing the envelope of how we can tell whether two microbes

correlate with one another in a particular set of samples, and whether or

not they interact with one another to produce

an interesting effect that neither of them could pull off on their own.

4:21

What he's going to tell you about now, though,

is some of the techniques that we use to visualize

the similarities, differences, and diversity between microbial [INAUDIBLE].

>> Recording data is one of the essential functions of the scientist.

In microbial ecology in particular, and data science in general,

the contingency table is widely used.

A contingency table has rows and columns.

The columns usually record the things found in a given sample,

and the rows record the specific thing or things observed.

The rows of a contingency table are often called the features of the data.

As an example, I have taken four samples of some arbitrary environment and

recorded the data.

5:04

Sample A has three green bugs, two pink bugs, and two tan bugs.

So in the table, the first column, the one corresponding to Sample A,

has three in the green-bug row, two in the pink-bug row, and two in the tan-bug row.

The contingency table is really the workhorse of data science in general, so

familiarize yourself with this concept before moving on.
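
As a rough illustration, a contingency table can be held in a small Python structure with features as rows and samples as columns. Only the counts the lecture actually states are included: Sample A from above and Sample C, whose counts are read out a little later. The other two samples are left out rather than guessed, and the `column` helper is a hypothetical name.

```python
# Rows are features (bug types); columns are samples, in this order.
samples = ["A", "C"]
table = {
    "green": [3, 3],
    "pink":  [2, 1],
    "tan":   [2, 0],
}


def column(table, samples, name):
    """Return one sample's counts as a feature -> count mapping."""
    i = samples.index(name)
    return {feature: counts[i] for feature, counts in table.items()}


print(column(table, samples, "A"))
print(column(table, samples, "C"))
```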

The phrase "a picture is worth a thousand words" is doubly true in science.

To make sense of contingency tables, and

data in general, scientists turn to various visualizations, or plots.

One of the simplest plotting schemes goes like this.

Treat each feature of the contingency table as a dimension, or axis, and

mark where each sample would go based on its value for that feature.

For instance, let's look at sample C.

It has three of the green bug, so

we locate it at the three position on our green bug axis.

It has one of the pink bug, so

we locate it at the one position on the pink bug axis.

Finally, it has zero of the tan bug, so we put it at zero on the tan bug axis.

6:10

If you prefer to think of these axes as X, Y, and Z, we've located the point for

sample C at X equals three, Y equals one, and Z equals zero.

The patterns a visualization shows help scientists derive conclusions and

develop applications for their science.

For instance, in our microbiome studies we are frequently concerned with how

close someone is to developing a certain type of disease, like ulcerative colitis.

If we have a known disease sample, say sample B, and two unknown samples, D and

C, our intuition tells us that the closer an unknown sample is to B,

the more likely it is to come from a person who has that disease.

In many cases, however, saying that sample X looks closer to sample Y

than to sample Z is not rigorous enough for scientific or medical use.

To help us in these cases, we introduce the notion of distance.

7:04

Now, there are a lot of notions of distance that are mathematically complex.

But a familiar one to most of us is Euclidean distance.

The Euclidean distance between two points is just the square root

of the sum of the squared differences in their locations on each axis.
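
Written as a formula for two samples $p$ and $q$ described by $n$ features (these symbols are illustrative labels, not notation from the lecture), that sentence reads:

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
```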

7:40

Both A and B have two pink bugs, so we have 2 minus 2, and

finally B has two more tan bugs than A, so we have 2 minus 4.

What's important to note, even if you're not fluent in math,

is that the distance between sample A and sample B, the green line,

looks smaller than the distance between sample B and sample C, the yellow line.

The distance calculation confirms our observation and

shows us that A is 3.61 units from B, while B is 5.09 units from C.
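
The calculation can be reproduced in a few lines. This is a sketch under one assumption: the lecture states sample B's pink and tan counts but not its green count, so 0 is assumed here (6 would give exactly the same distances, and both are consistent with the quoted results).

```python
import math

# (green, pink, tan) counts. A and C are stated in the lecture.
# B's pink (2) and tan (4) are stated; its green count of 0 is an ASSUMPTION.
a = (3, 2, 2)
b = (0, 2, 4)
c = (3, 1, 0)


def euclidean(p, q):
    """Square root of the summed squared per-axis differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


d_ab = euclidean(a, b)   # sqrt(9 + 0 + 4)  = sqrt(13)
d_bc = euclidean(b, c)   # sqrt(9 + 1 + 16) = sqrt(26)
print(f"{d_ab:.3f} {d_bc:.3f}")
```

The first value rounds to the 3.61 quoted above. The second is about 5.099, which matches the quoted 5.09 if the figure was truncated rather than rounded.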

8:12

When we have three dimensions of data,

the visualization strategy we just outlined works great.

However, when we have more than three things, it won't work.

Imagine these four samples are the same samples as before;

we have just looked harder through them to find some new bugs.

Now, instead of three types of bug, or three features, we have six types of bug.

We can't plot these samples in the way we did before because we have no way to

visualize six dimensions.

Although we can't visualize this data, our notion of distance still works just fine.

In fact, it will work with any number of dimensions.

The bottom calculation shows the distance between sample A and sample B again,

but this time with the inclusion of the new features.
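
A minimal sketch of why the same distance carries over: the function below never mentions the number of features. The first three counts follow the three-bug example (with B's green count assumed to be 0, since the lecture does not state it). The last three counts are hypothetical stand-ins for the new bugs, because the slide's values are not in the transcript.

```python
import math


def euclidean(p, q):
    """Euclidean distance for feature vectors of any (equal) length."""
    if len(p) != len(q):
        raise ValueError("samples must have the same number of features")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


a3, b3 = (3, 2, 2), (0, 2, 4)   # three original features
a6 = a3 + (1, 0, 5)             # HYPOTHETICAL counts for three new bugs
b6 = b3 + (4, 0, 2)

d3 = euclidean(a3, b3)
d6 = euclidean(a6, b6)
print(round(d3, 2), round(d6, 2))
```

Because each new feature adds a squared difference that cannot be negative, including more features can only keep the distance the same or increase it.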

You might think, well, if we have something that works, i.e., we can

compare these samples using distances, why do we need visualizations at all?

The answer is that we might never even develop the intuition that there is

a pattern in the data if we don't visualize it.

Scientists have been faced with the dilemma of how to

visualize high-dimensional data for a long time, and

they've come up with some pretty ingenious methods.

The one I'll discuss now is called dimensionality reduction.

In essence, when we are doing dimensionality reduction,

we are looking to recapture whatever patterns are in the data, but

reduce the number of dimensions we need to see that pattern.

We'll start with an example.

9:58

The orange point is 3 units east and 3.35 units north,

while the green point is negative 1 unit east and negative 4.38 units north.

The contingency table at the bottom left records this data.

Now, if you look on the right, I have the same circle with the same points, but

I've recorded their positions differently.

10:30

If I gave you the representation on the right, it would be just as unambiguous as

the one on the left, but I only used a single dimension of data, that is, degrees,

rather than two dimensions of data, that is, north and east.

This is the essence of dimensionality reduction:

find a new coordinate system or

representation of the data that captures the same patterns in fewer dimensions.
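
The circle example can be sketched as a change of coordinates. The east and north values are the ones quoted for the orange and green points; measuring the angle counterclockwise from due east is an assumption, since the slide's convention is not in the transcript. Both points sit roughly 4.5 units from the center, which is why a single angle is enough once the radius is known.

```python
import math


def to_degrees(east, north):
    """Angle of a point, counterclockwise from due east, in [0, 360)."""
    return math.degrees(math.atan2(north, east)) % 360


def from_degrees(angle, radius):
    """Recover (east, north) from the angle, given the circle's radius."""
    rad = math.radians(angle)
    return radius * math.cos(rad), radius * math.sin(rad)


orange = (3.0, 3.35)    # (east, north) from the lecture
green = (-1.0, -4.38)

radius = math.hypot(*orange)   # shared radius, about 4.5
for east, north in (orange, green):
    angle = to_degrees(east, north)
    recovered = from_degrees(angle, radius)
    print(round(angle, 1), [round(v, 2) for v in recovered])
```

Going from two numbers to one loses nothing here only because every point lies on the same circle; the radius is the redundant dimension.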

In microbial ecology, we frequently use a specific type of

dimensionality reduction called principal components analysis, PCA,

and a related technique called principal coordinates analysis, PCoA.

The math required for these techniques is basic linear algebra,

but it's well beyond the scope of this course.

Instead of slogging through that, we're going to give you a high-level overview.

11:18

Imagine you have a 2D oval of paper that is tilted at

an angle from the horizontal, like in panel A.

Now, to describe a point on our oval we need an X, a Y, and a Z coordinate.

If you think about it,

though, the oval is only 2D, so it could really lie flat in the plane.

It doesn't need a Z axis to describe it.

If we were to choose a new set of coordinates,

one along the major axis of the oval, and

one perpendicular to it along the minor axis of the oval, we could

unambiguously represent any point on that oval using only two dimensions.

11:53

Panel B shows our axes in the original space with the new coordinate system, and

panel C shows the oval displayed only in 2D.

The oval actually lay in a lower-dimensional manifold of our

original coordinate system.

This is the essence of PCA and PCoA.

By choosing new coordinate systems, we can eliminate redundant or

useless dimensions and see the pattern in our data visually.

12:17

In reality, the process is not quite this easy.

Usually we can reduce the importance of a dimension, but

we can't actually eliminate it.

In the context of this example,

this means that the oval has a bit of thickness to it.
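
The tilted oval with a little thickness can be sketched numerically. This is a toy PCA, not a production method: it finds the principal axes of the covariance matrix by power iteration so that it needs only the standard library. The oval's size, its 30-degree tilt, and the small out-of-plane wobble standing in for thickness are all hypothetical values.

```python
import math


def tilted_oval(n=200, a=3.0, b=1.0, tilt_deg=30.0, thickness=0.05):
    """3-D points on an oval tilted about the x axis, with a slight wobble
    out of its plane standing in for the oval's thickness."""
    t = math.radians(tilt_deg)
    points = []
    for k in range(n):
        ang = 2 * math.pi * k / n
        u, v = a * math.cos(ang), b * math.sin(ang)   # position within the oval
        w = thickness * math.sin(7 * ang)             # small out-of-plane offset
        points.append((u,
                       v * math.cos(t) - w * math.sin(t),
                       v * math.sin(t) + w * math.cos(t)))
    return points


def covariance(points):
    """3x3 covariance matrix of a list of (x, y, z) points."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    return [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
             for j in range(3)] for i in range(3)]


def top_eigenpair(m, iters=200):
    """Largest eigenvalue and eigenvector of a symmetric 3x3 matrix."""
    v = [1.0, 0.7, 0.3]   # arbitrary starting direction
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(v[i] * m[i][j] * v[j] for i in range(3) for j in range(3))
    return lam, v


cov = covariance(tilted_oval())
lam1, v1 = top_eigenpair(cov)
# Remove the first axis's contribution, then find the next-largest axis.
deflated = [[cov[i][j] - lam1 * v1[i] * v1[j] for j in range(3)]
            for i in range(3)]
lam2, v2 = top_eigenpair(deflated)
lam3 = cov[0][0] + cov[1][1] + cov[2][2] - lam1 - lam2   # trace minus the rest

explained = (lam1 + lam2) / (lam1 + lam2 + lam3)
print(round(lam1, 3), round(lam2, 3), round(lam3, 5), round(explained, 4))
```

The three numbers are the variances along the new axes. The third is small but not zero, so dropping it loses a little information, which is the sense in which a dimension's importance is reduced without being eliminated.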

12:36

To give you an example of how this is useful in the real world,

I've included a plot of some Human Microbiome Project data.

This data initially had thousands of dimensions, so

to accurately graph sample relationships we would have needed thousands of axes.

By using the magic of PCoA, however, we reduced that down to three dimensions,

which allowed us to see that all of the fecal samples are similar and

cluster on the bottom,

and that all of the oral samples are similar and cluster on the top left.

In contrast, the skin and

vaginal samples, while similar to one another, are more spread out.

Without PCoA, we'd never have seen this pattern, and we'd never have been able to

make some of the exciting discoveries that we'll talk about in the coming weeks.