0:22

How do they work?

Â We have now learned how to construct Profile given

Â Motifs, but can we construct Motifs given Profile?

Â We can, because given an arbitrary Profile and a set of sequences Dna,

Â we can construct the set Motifs(Profile,DNA) by simply finding

Â Profile-most probable k-mers in each sequence from Dna.

Â 0:54

And therefore, we can further iterate this process by finding profile of motifs,

Â motifs of profiles, profile of motifs, until our motifs keep improving.

Â So this is the key idea of the algorithm called RandomizedMotifSearch.

Â We simply iterate and see whether the score of motif continues

Â to improve. And stop when it doesn't improve anymore.

Â 1:38

If the motifs are completely random,

Â and if they're taken from random sequences,

Â then presumably. expected probability of every element

Â in the profile matrix is 1/4.

Â There is absolutely no information in such a motif.

Â How possibly it can lead us to the correct motif in the sequence?

Â Let's just keep our fingers crossed and see how this approach works on an example.

Â Here's my example.

Â These are very short DNA strings and I implanted

Â 4-nucleotide long motifs (one motif on each string). They're shown in blue, so

Â 2:30

First, it will randomly select motifs shown in bold here.

Â And, of course, it has missed most of the motifs.

Â They are randomly selected.

Â 2:53

Using Profile(Motifs) it will calculate the probability

Â of every k-mer in the set of strings,

Â shown here, and will simply select the most probable k-mers in every string.

Â The results are in this matrix, and I should probably start iterating.

Â 3:15

But if you look carefully at this matrix, I don't have to iterate.

Â It already found all blue inserted motifs right there.

Â It almost sounds like a magic

Â trick, because we started from something completely random

Â and, in a single iteration, found our implanted motifs!

Â Well, what happened?

Â 4:09

Where does this statistical bias point to? If it points

Â in the direction of the implanted motif, then there is no miracle

Â that we start from something random (which is not really random),

Â and, applying iterative procedure, arrived to a hidden part.

Â So, let's see how it actually works.

Â I return back to the case that we considered before.

Â You saw that when we randomly selected

Â motifs, in the first sequence, we missed the motif (the correct, the blue one).

Â In the second sequence, we also missed it.

Â In the third sequnce, we missed it.

Â In the fourth sequence, we missed it.

Â But in the fifth one, simply by pure chance, we actually found it.

Â It's not a big deal.

Â There is always a probability that we simply by chance

Â select one of the implanted motifs.

Â But it results in a biased profile.

Â If you look at the profiles that

Â resulted from this implantation, then you see that the largest elements

Â in this profile actually correspond to A, C, G and T

Â of our implanted pattern ACGT. Therefore, even

Â a single correctly found k-mer in

Â the set of Dna strings creates a bias in the profile, and this bias

Â leads us to the correct sequence. Of course, it doesn't always happen.

Â It depends on how we have chosen initial

Â set of randomly selected k-mers in the sequence.

Â But if we try this algorithm, a thousand or

Â a million times, there is a good chance that

Â eventually one of these attempts will bring us to

Â the correct motif. Even if this procedure sounds ridiculous.

Â 6:06

We now move to the next topic: "How do bacteria hibernate?"

Â Our goal is to apply our motifs finding algorithm to real biological problems.

Â To figure out how our algorithms work on

Â real biological data, we will start by analyzing tuberculosis.

Â Tuberculosis is a deadly disease that

Â killed quarter of population of Europe in the 19th century.

Â Unfortunately, humanity is not done with tuberculosis yet since

Â tuberculosis strains that resist to all antibiotics are now emerging.

Â Tuberculosis is caused by

Â Mycobacterium tuberculosis also called MTB.

Â An interesting thing about MTB

Â is that it can persist in for decades in humans

Â and one third of the world population is affected by MTB.

Â Of course, it usually doesn't lead to tuberculosis

Â nut, in some cases. bacteria switches from its latent

Â state to the active state and it's when the tuberculosis begins.

Â The question that we will be asking is how

Â MTB switches from the latent to the active state and how

Â does it survive during latency? MTB has an ability

Â to form dormant spores where metabolism is shut down.

Â In such sporulated form it can survive in

Â very tough conditions, for example, in conditions of food shortage or

Â oxygen shortage. And oxygen shortage, called hypoxia, is

Â associated with these latent forms of tuberculosis.

Â So, how can we figure out what makes

Â MTB switch into the latent state and return back to

Â the active states?

Â Biologists have performed a DNA array experiment and identified

Â roughly 25 genes that are activated in the hypoxic condition.

Â So when bacteria is deprived from

Â oxygen, these gene dramatically change their expression.

Â And our goal today is to discover what regulatory

Â motif controls these genes when they're activated or repressed.

Â 8:44

We now have quite an arsenal of algorithms for finding motifs.

Â Let's try them.

Â But the first problem is that we have no clue of what size of motif to try.

Â Should we try k equal to 8 or k equal to 20? Let's try all of them

Â using median string problem and randomized motif search.

Â And you will see that, for different k, we are getting quite different results.

Â It is not clear which one to choose.

Â 9:16

for median string, we actually will have

Â difficulties going beyond k=12 because

Â it is an exponential algorithm whose running time grows as four to the power k.

Â But for randomized motif search we can search

Â for very long motif and, for k=20,

Â we actually find this rather long motif.

Â You can see that there are similarities

Â Between the 12-mer found by the Median String.

Â the 12-mer found by the RandomizedMotifSearch,

Â and the 20-mer found by the RandomizedMotifSearch. Which one is correct?

Â To answer this

Â question, we need to introduce the notion of the motif entropy.

Â 9:59

And entropy of motif is defined in the following way.

Â First, we construct the profile matrix.

Â Every column of the profile matrix correspond to a biased four-sided dice.

Â Or, in other words, to a probability distribution, (p1, p2, ...p_n).

Â Sum of p_i is equal

Â to 1, of course. And entropy is simply defined as

Â - p_i * log (p_i)

Â and summed up over all indices i. For example, we can compute the entropy of

Â the first column, and it will be 1.147.

Â 10:40

We can compare it with the minimal

Â possible entropy, which corresponds to the extremely conserved motif

Â (where all symbols in the column are the same),

Â and with the maximal entropy

Â (where symbola in the column are equally likely).

Â So the entropy of a motif is defined as the sum of the entropies of its columns.

Â And now, the moment of truth.

Â Let's see how the algorithms that we developed

Â 11:15

find out whether the motifs that we find are close to the biological one.

Â Here is the motif logo of the dormancy survival regulator site.

Â The motif logo is constructed in such a way that the total length of four

Â nucleotides in the motif logo correspond to two minus entropy.

Â So for completely conserved column, like nucleotide G

Â in this motif, we will have height two.

Â It is the most conserved columns of the motif.

Â The relative sizes of four nucleotide in

Â each position correspond to frequencies of these nucleotides in the column.

Â 12:10

You can see that our randomized motif

Â search has actually captured a good portion of the motif.

Â But you can also see that we missed in some places.

Â We need to work harder and to figure

Â out what to do to develop better regulatory motif finding algorithms.

Â We move next to another randomized algorithm called Gibbs sampler.

Â