0:29

As we've discussed already, this is just a small example of

a genome assembly problem, and it is actually an over simplification.

In real life, when we assemble a real genome, life is much more complicated,

because these strings are longer.

The resulting strings that we are looking for is of length millions or

even billions, and reads contain errors, I mean the substrings contain errors.

And also there are many repeated elements and so on.

But still, using this toy example, we will be able to

demonstrate the main ideas used in modern genome assemblers.

And as we've discussed already, we are going to use Hamiltonian cycles and

Eulerian cycles to solve this problem.

So in search for an inspiration for solving this problem,

let's consider some known string, and let's consider all its 3-strings.

For example let's consider the word DISCRETE and

all its possible substrings of length 3.

So what we see by looking at this example is that any two

consecutive strings have an overlap of size 2.

By saying overlap, we just mean a common part of size 2.

So for example, the first two substrings, this is DIS and

ISC, they have this common part, IC, right?

And this is, of course, true for any two consecutive strings.

So for example, for CRE and RET, the suffix of CRE,

it is just RE, is equal to prefix of RET, which is also RE.

This just reflects the fact that all these three strings come from the same

initial string, right?

And this gives us the following idea.

What we need to find is an order on our 3-strings

such that if we place these 3-strings in this order,

then there is an overlap of size 2 between any consecutive strings.

For example, the initial order doesn't work because right after AGC

goes ATC and they do not have this overlap of size two, right?

So what we need to find is a permutation of all these 3-strings

which satisfies this property that there is always an overlap of size two

between any two consecutive strings, okay?

But this is actually a Hamiltonian cycle problem, why is that?

Well, because we can draw the following natural graph.

Let's use our strings as nodes in our graph.

And let's add a directed edge from this string to that string if there

is an overlap of size two between them.

Namely, if the suffix of this string of length

two is equal to the prefix of length two of this string, okay?

Let me give you an example.

For example, in this graph, there is an edge from CCA to CAG,

exactly because the suffix of CCA is equal to two,

is equal CA, I'm sorry.

And the prefix of CAG of length two is equal to CA.

So this edge basically reflects the fact that CAG potentially might follow

CCA in the string that we are looking for.

At the same time, there is no edge from GCA, for example, to ATC,

because we know for sure that ATC should not follow GCA, okay?

Because there is no overlap of size two between these two strings, okay?

So we're looking for a Hamiltonian cycle in this graph.

And this is a Hamiltonian cycle.

And in fact, when you have a Hamiltonian cycle, it is not so

difficult to reconstruct the string.

So let's do this together, so we start with the stream TCA.

So at this point we have just a string TCA, and

then by traversing the Hamiltonian cycle that is shown here,

we will gradually extend our string till we get the resulting string.

So from TCA we go to CAG, so we append G to our current string.

From CAG we go to AGC so we append C.

Then we go to GCA, then to CAT, to ATC, to TCC,

and finally to CCA, which finally gives us the resulting string.

So let's just do the sanity check, let's check, so

the first substring is TCA, it is exactly what is needed.

The second is CAG, this is our string.

The next one is AGC, it is AGC, okay,

GCA, GCA, then CAT, CAT is here.

Then ATC, ATC is here, then TCC which is here,

and finally CCA, CCA.

So indeed all the substrings of length three are exactly our initial substrings.

So in fact, we solved the problem.

But the issue with this solution is that, in fact, we reduced our problem

of finding the unknown string to the Hamiltonian cycle problem.

But the issue is that for Hamiltonian cycle problem,

we do not currently have a provably efficient solution.

We do not have an algorithm or a program that solves Hamiltonian cycle problem for

all graphs in practice of size, for example, up to one million or something.

So unfortunately,

the Hamiltonian cycle problem is still an intractable problem for us.

So if the graph has millions of nodes and edges, we do not expect our current

technologies to be able to find a solution quickly for such graphs.

So we definitely need a different method to be able to solve the real

genome assembly problem where the number of the strings is counted in millions or

even billions.

And what we're looking for is also a huge string of size millions or billions, okay?

And as you might have guessed already,

Eulerian cycles are going to help us in this case.

So in the graph that we constructed, it is the so-called overlap graph because, well,

basically an edge represents an overlap.

In this case, each of our input strings is represented by a node.

7:11

Now instead of this, let's use this beautiful and strange looking idea.

Let's try to construct a graph where each of our input strings

is going to be represented by an edge but not by a node.

So this idea was initially proposed by De Bruijn,

who was constructing a so-called universal strings.

And then, it was proposed by Pevzner, Tang,

and Waterman to use this idea in genome assemblers.

Very roughly, the idea is the following.

We're going to represent each string, for

example CAT, as an edge from its prefix to its suffix.

In this case, for CAT, it is going to be an edge from CA to AT.

I will show you an example in a second.

Before this, let me mention that exactly this

idea is used in modern genome assemblers, in state-of-the-art genome assemblers.

Okay, so this is how a De Bruijn graph looks like for our input strings.

So these are our eight input strings.

As we've discussed already, we are going to represent each of the three strings

as a edge in the corresponding graph.

For example, AGC is represented as as an edge from AG to GC.

8:34

Okay, and so on, so we have exactly the same number of edges

as the number of input strings in our case.

Now, what is important in this case is that now

in this case we are looking for an order of edges.

So what we would like to do is to traverse all the edges on by one.

That is, we are looking for an Eulerian cycle.

And this are very good news for us, because for Eulerian cycles,

we have extremely efficient algorithms for finding an Eulerian cycle in practice.

Let me also show you that an Eulerian cycle, indeed,

spells the strings that is needed for our purposes.

So let's start from TC, and from TC we go to to CC,

and at this point we spell the string TCC.

Then we go from TCC to CA, and the edge CC to

CA spells the string CCA, so we add A to our resulting string.

And we continue in the same fashion until we traverse all the edges.

Let's once again do a sanity check, and

let's check that our string indeed contains all input strings.

Okay, so the first string is TCC, this is TCC.

The next one is CCA, this is CCA.

The next one is CAG, so CAG.

The next one is AGC, AGC.

The next one is GCA, GCA is here.

CAT then ATC, and finally TCA.

Okay, it is also interesting to note that the solution that we got here is

different from the solution that we got from the Hamiltonian cycle approach.

This is just because, for our problem, the solution is not unique.

It may be the case that there are many strings that satisfies the requirement

that all its three substrings are exactly the even ones.

And in particular, as you see here in the De Bruijn graph,

there are actually more than one Eulerian cycles here.

By spelling each of these cycles, you actually spell a different solution for

our problem, okay?

So now let me summarize the whole lesson.

First of all, the definition of Eulerian cycle is that, or Eulerian path,

is that it visits every edge exactly once.

While Hamiltonian cycle is required to visit every node exactly once.

The difference in the definition of these problems is very tiny, seems to be tiny.

So we just need to need to traverse all edges or all nodes, okay?

So these problems definitely look very similar to each other, but

they differ dramatically from the computational point of view.

For Eulerian cycles, we have a very simple criteria for

checking whether a given graph has an Eulerian cycle.

And we have a very efficient algorithm for finding a Eulerian cycles.

Which will provably find us very quickly an Eulerian cycle in any graph in

practice consisting of millions of nodes and millions of edges.

And we do not know anything like this for Hamiltonian cycle problem.

And actually the existence of such a polynomial time algorithm is a big

open problem in computer science.

Finally, as we have shown by the genome assembly problem,

the right problem formulation makes all the difference.

So we first reduced our problem of reconstructing a genome from

its unknown part to Hamiltonian cycle problem.

But this leads us to nowhere, because for

Hamiltonian cycle problem, we do not have a polynomial time algorithm.

But then we managed to reduce it to Eulerian cycle problem,

which allowed us to solve this problem much more quicker.

And it is actually used in modern genome assemblers exactly because for

the Eulerian cycle problem, we do have efficient algorithms.