0:04

Welcome back to Peking University MOOC: "Bioinformatics: Introduction and Methods".

I am Ge Gao from the Center for Bioinformatics, Peking University.

Let's continue this week's topic: Sequence Database Search.

Last time we have illustrated with an example on how to use BLAST

to search databases for similar sequences of the query sequence

within a selected species or across species.

This can facilitate the study of the function and evolution of the query sequence.

In this unit, we will introduce the ideas behind the BLAST algorithm.

As mentioned before, BLAST was first published in 1990.

According to Google Scholar, its two main papers have been cited nearly 50,000 times each.

It is one of the most widely used and appreciated algorithms in bioinformatics.

So far there have been more than 30 different toolkits developed for BLAST.

Many papers have been written on how to use BLAST.

There are even books devoted to this algorithm.

Due to time constraints, we cannot cover every single aspect of BLAST here today.

We will only introduce its basic ideas and algorithms.

Students who are interested can get further information from the Readings section.

As mentioned before, in dynamic programming, the key to reduce the computational time is

to reduce the number of cells to fill in as much as possible.

Specifically, we should try to compute only those cells in or surrounding the optimal alignment path.

BLAST is an extension of this idea.

Specifically, BLAST will first find highly similar small segments between the two sequences,

called the “seeds”.

These seeds will then be extended outward to construct the alignments.

Finally,BLAST will compute the statistical significance for this alignment

in order to reduce false positives.

Specifically, BLAST will first slice the query sequence into multiple small segments,

called “seed words”.

Then, BLAST will search a pre-built index table of all seed words in the sequences in the database

Then, BLAST will search a pre-built index table of all seed words in the sequences in the database

to find the query sequence’s seed words which signifies candidate matches between

the query sequence and sequences in the database.

Students with background in computer sciences may realize immediately that

it would take only linear time (or even near constant time) to search an index table.

This will improve the computational efficiency considerably.

By locating candidates for every seed,

we can get a hit map between the query sequence and candidate match sequences in the database.

As discussed in the previous unit, the optimal alignment path should be parallel to the main diagonal.

Therefore, we can discard those isolated hits.

We keep only the “hit clusters” each of which has two or more successive hits parallel to the main diagonal.

This will reduce the search space further.

Then for each hit cluster, we can extend outward from the two endpoints

until the total score has decreased below a given value X.

Then for each hit cluster, we can extend outward from the two endpoints

until the total score has decreased below a given value X.

Next we can identify the optimal alignment

for each extended hit cluster by dynamic programming,

thus reducing the computational time considerably.

BLAST also uses some other tricks to further enhance its speed and sensitivity.

First, BLAST will mask (i.e. mark as “to be ignored” ) low-complexity sequences

that have numerous copies in the genomes to avoid excessive false positive hits.

Whether a sequence is low-complexity is measured by its information content.

This (information content) formula is very important.

Without it there are too many seed words to choose and use.

More importantly, without it we cannot avoid selecting seed words from the low-complexity regions,

leading to lots of false positive hits.

For example, the “CACACA” repeat

is a typical microsatellite sequence

that has a K value of 0.36.

This sequence will thus be masked to prevent it from giving false positives in database searches.

Also, to improve the sensetivity ,BLAST will consider not only the seed words from the query sequence,

but also the “neighborhood words” similar to the seed words.

The similarity here is evaluated by the substitution matrix.

For example, the similarity between the seed word “DKT”

and its neightbourhood word “DRT” will be 6+2+5=13.

Similarly, the similarity between all possible neighborhood words

and a seed word can be worked out.

Of course, we should only consider highly similar neighborhood words to reduce false positives.

In practice, in the latest version of BLAST,

a neighborhood word will only be considered if it has a similarity of 11 or more with the seed word.

8:14

whether or not there is no real match.

Therefore, we need a method to objectively evaluate the statistical significance of an alignment.

After we get an alignment,

we need to assess its statistical significance to ensure that

this alignment does not just arise by chance.

BLAST uses the E-value to evaluate this significance.

Briefly, E-value denotes the expected number of alignments that can happen by chance

to have a score equal to or greater than that of the current alignment.

Specifically, if the E-value of an alignment is 10,

it means that 10 alignments are expected by chance

to happen whose scores are equal to or greater than that of the current alignment.

Please note that it is not a probability;

it is an expectation, and may be larger than 1 as defined above.

The expectation is worked out from this formula.

We will not discuss this formula in detail today.

But I will explain every variable in this formula.

The "m" is the length of the query sequence.

The "n" is the size of the database.

The "e" is the natural logarithm, and the "S" is the score of your alignment.

The "k" and "lambda" are related to the scoring matrix,

serving as the normalization factors.

As the formula below shows,

the E-value is directly proportional to n, the size of the database.

In other words, the bigger the database, the more likely you will get an alignment just by chance.

This is consistent with the example we have just discussed.

The E-value is also directly proportional to m, the length of the query sequence.

This is because BLAST is a local alignment;

it does not require the alignment to cover the full length of the query sequence.

Intuitively, the E-value is negatively correlated with the score S.

Alignments with higher scores are thus less likely to occur by chance.

The lambda and k in the formula are two correction coefficients

related to the scoring matrix and search space.

The lambda and k in the formula are two correction coefficients related to the scoring system and search space.

They are used to correct the effects of different scoring matrices and search spaces on the final results.

To make it easier to understand, we can convert between p-values and E-values.

As shown in the figure, the E-value and the p-value, i.e. the probability,

are nearly the same when they are less than 0.1.

In particular, when p is 0.05, the E-value is 0.0513.

0.05 is often used as a cut-off for E-value.is.

The smaller the E-value, the more reliable the result is.

Good. Now let's summarize. BLAST is heuristic, making it different

from the previous dynamic programming-based algorithms

(e.g. Needleman-Wunsch and Smith-Waterman).

It cannot guarantee the optimal solution, but it will find a solution that is good enough in much less time.

Specifically, BLAST uses a seeding-and-extending strategy

and applies dynamic programming only in a limited number of short regions.

This has reduced the computational time and improved the speed effectively.

This has reduced the computational time and improved the speed effectively.

However, the reduction of computational time comes at the cost of a decrease in sensitivity.

This is a common tradeoff for all the heuristic algorithms that we will discuss in later courses.

Good. We have briefly introduced the basic ideas behind the BLAST algorithm.

Here are the summary questions for this unit.

These questions are not part of the assignment.

You are encouraged to think about them and discuss your answers and ideas in the online forum.

That's all for this unit. Thank you!