Hello everyone, this is the supplementary video for this week.

Here I will explain the models and calculations used in SNP calling and genotyping.

First, I will briefly introduce the concepts of likelihood and the Bayesian approach.

Then I will introduce the genotyping model used by MAQ, one of the early mapping programs,

and the one used in SNVMix, which is designed to handle cancer genome data.

The likelihood function is a term frequently used in statistics.

It is a function of the parameters of a statistical model.

It represents the probability of observing the current data given the values for parameters.

The likelihood function plays a key role in statistical inference.

The Bayesian approach, or Bayesian statistics, is a major branch of statistics.

It is also considered a counterpart of frequentist statistics.

This approach follows Bayes' rule and treats unknown parameters, θ, as random variables.

It then transforms the problem of estimating parameters from data

into the problem of estimating the posterior probability distribution of the parameters, given the observed data.

We can write this procedure, via Bayes' theorem, as the following formula:

the posterior probability distribution is proportional to the prior probability distribution multiplied by the likelihood function.

Please note that all the three terms in this formula are functions of θ.
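As a small numerical sketch of "posterior ∝ prior × likelihood" (not from the slides), we can evaluate all three terms on a discrete grid of θ values; the data (3 successes in 4 trials) and the flat prior below are illustrative assumptions.

```python
from math import comb

# Sketch: posterior ∝ prior × likelihood, evaluated on a grid of θ values.
# The data (k = 3 successes in n = 4 trials) and the uniform prior are
# illustrative assumptions, not part of the lecture.

def posterior_grid(k, n, prior, thetas):
    # unnormalized posterior at each grid point: prior(θ) × binomial likelihood
    unnorm = [prior(t) * comb(n, k) * t**k * (1 - t)**(n - k) for t in thetas]
    z = sum(unnorm)                       # normalizing constant, P(data)
    return [u / z for u in unnorm]

thetas = [i / 1000 for i in range(1001)]
post = posterior_grid(3, 4, lambda t: 1.0, thetas)   # flat prior
mode = thetas[post.index(max(post))]                 # posterior mode
```

With a flat prior, the posterior mode coincides with the maximum-likelihood estimate, which connects this formula to the coin example that follows.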

This is a simple demonstration of likelihood function.

Assume that we repeatedly toss a coin whose probability of landing head-side up is not equal to that of landing tail-side up.

We also assume that these two probabilities are fixed.

Let’s denote by θ the probability of landing head-side up in a trial. Then the probability of landing tail-side up in a trial will be 1-θ.

Let’s try to calculate the probability of getting a sequence of HTHH(Head-Tail-Head-Head) after
4 trials.

It is easy to work out the result, which is θ*(1-θ)*θ*θ=θ^3*(1-θ).

Students who have taken a statistics course will immediately recognize that the model used here is a sequence of four Bernoulli trials.

The HTHH sequence observed in this problem is the Data.

What we have calculated is the probability of observing such data, given the value of θ, the
parameter for this model.

If we did not know the probability of the coin landing head-side up, and we wanted to estimate it from data,

then a straightforward way is to find the parameter value that maximizes the likelihood function; this is the maximum likelihood estimation (MLE) method.

We can find the MLE by differentiating the likelihood function with respect to the parameter θ: dL/dθ = 3θ^2*(1-θ) - θ^3 = θ^2*(3-4θ).

Setting the derivative to 0 (and discarding θ=0, which minimizes L), the MLE of θ can be obtained.

In this example, the MLE of θ is 3/4 .
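A quick numerical check of this result: a fine grid search over θ should recover the analytic solution θ = 3/4.

```python
# Numerical check of the MLE for L(θ) = θ³(1 − θ):
# a fine grid search should recover the analytic answer θ = 3/4.

def likelihood(theta):
    return theta**3 * (1 - theta)

grid = [i / 10000 for i in range(10001)]
mle = max(grid, key=likelihood)   # argmax of L over the grid
```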

If, however, we had only known that there are 3 head-side ups in the sequence, we might need
to modify the model.

We can switch to a binomial distribution model, and find that the likelihood function is the number of combinations of 4 elements taken 3 at a time, multiplied by the θ^3*(1-θ) from before,

making it 4*θ^3*(1-θ).

Given the same data, the MLE of θ under this model is still 3/4.
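This is no coincidence: the combinatorial constant C(4,3) = 4 scales the likelihood but does not move its maximum, so both models yield the same MLE. A small sketch:

```python
from math import comb

# The constant factor C(4,3) = 4 rescales the likelihood but does not
# change where its maximum lies, so both models give the same MLE.

def seq_likelihood(theta):       # likelihood of the exact sequence HTHH
    return theta**3 * (1 - theta)

def count_likelihood(theta):     # likelihood of "3 heads out of 4" (binomial)
    return comb(4, 3) * seq_likelihood(theta)

grid = [i / 10000 for i in range(10001)]
mle_seq = max(grid, key=seq_likelihood)
mle_count = max(grid, key=count_likelihood)
```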

Similar ideas are used in SNP calling and genotyping based on NGS data.

A statistical model is used to depict the NGS data and to estimate the genotype.

The statistical models used by bioinformatics tools in practice are diverse and more complex.

To make it easier to understand, Dr. Gao used in class a very simplified model that considers only the erroneously sequenced part.

This model is overly simplified compared to models used in practice.

There are no “correct” or “incorrect” statistical models in practice.

You can only get a feel for whether a model is good by checking its power in fitting and describing the data, and by checking how much of the result is correct.

Sometimes data in practice violate the assumptions of classical simple models.

That’s why the models used in practice often appear complex.

They might even take advantage of some technical details of algorithm implementation.

If you’re interested in the statistical models and algorithms used for SNP calling and genotyping, you can check the original papers for the several representative tools listed here.

Of course, there are more tools than those listed here; I have included only several models that are, in my opinion, easier to understand.

MAQ is one of the early mapping programs.

The original paper and supplementary materials have explained in detail the genotyping model it
employs.

This model might be simpler than other models developed later, and I will introduce it in detail
here.

You can see next how the model used in practice is adapted according to some features of the
data observed.

First, let’s review our problem again.

We get a lot of reads by NGS sequencing.

After mapping to the reference genome, a locus might be covered by multiple reads, and might
correspond to multiple bases (from the reads).

Each base from the reads has a probability of being erroneously sequenced, which is estimated from its base quality (baseQ) and denoted by ε.

What we want to do is to get the genotype of this locus.

For common diploid species, the genotype of a locus in an individual can be one of three types:

homozygous for the reference allele/base, heterozygous, or homozygous for the alternative allele/base.

For a diploid individual, we focus only on the two most frequently sequenced bases, denoted by b and b’.

Assume that they are sequenced k and (n-k) times, respectively.

Now let’s calculate the likelihoods for each genotype.

Let’s tackle the heterozygous case first.

If the genotype is heterozygous, then we expect to see many “b”s and “b’”s here.

If baseQ is high enough, the variation of k (or (n-k)) depends primarily on binomial sampling variation.

We can thus ignore the effect of baseQ, and use a binomial distribution model with the probability being 0.5 and the total number being n.

The likelihood function is written at the lower-right corner of this slide.

In fact, this can be obtained by setting θ=1/2 in the binomial distribution example mentioned
earlier.
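The heterozygous likelihood can be sketched directly from this binomial form; the read counts below (10 of 20 reads showing allele b) are illustrative.

```python
from math import comb

# Heterozygous-genotype likelihood from the slide:
# k of n reads carry allele b, modeled as Binomial(n, 1/2),
# with base-call errors ignored as the slide does.

def het_likelihood(k, n):
    return comb(n, k) * 0.5**n

p = het_likelihood(10, 20)   # e.g. 10 of 20 reads show allele b
```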

If the genotype is homozygous (for the reference base or the alternative base), then the base not belonging to this homozygous genotype must come from

sequencing error (if we include the errors introduced by PCR at an early stage).

Therefore, the paper defined the symbol α_nk as the probability that exactly k of n bases are erroneously sequenced.

This is just the likelihood function we want.

If the bases from the reads have the same baseQ, and their probabilities of being erroneously sequenced are independent of each other,

then we can easily work out α_nk using the binomial distribution.

We denote it by α-bar.
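Under these simplifying assumptions (a common error rate and independent errors), the error count is simply Binomial(n, ε), so ᾱ_nk can be sketched as:

```python
from math import comb

# ᾱ_nk under the simplifying assumptions: a common error rate eps and
# independent errors, so the number of errors is Binomial(n, eps).

def alpha_bar(n, k, eps):
    return comb(n, k) * eps**k * (1 - eps)**(n - k)
```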

However, the paper mentioned that real read data do not have identical baseQ values.

Also, the independence between the probabilities of being erroneously sequenced does not seem to hold.

At some loci, if one base is a sequencing error, the others are more likely to be sequencing errors as well.

Therefore, the paper defined another variable, β_nk, as the probability that at least k of the n bases are erroneously sequenced, given that at least (k-1) are erroneously sequenced.

We can deduce by definition the following 4 formulae.

Let’s start from the case where the probabilities of being erroneously sequenced are the same
and independent of each other.

First we write down the formulae for α-bar and β-bar here.

As you can see, β-bar can be defined by α-bar.
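One way to read "β-bar is defined by α-bar", consistent with the conditional-probability definition of β_nk above, is as a ratio of binomial tail sums. This is a sketch under the equal-and-independent-error assumptions, not necessarily MAQ's exact expression:

```python
from math import comb

# Sketch: β̄_nk as a ratio of binomial tail sums, i.e.
# P(at least k errors | at least k-1 errors), built from ᾱ_nk.
# This follows the conditional-probability definition in the transcript;
# it is not claimed to be MAQ's exact formula.

def alpha_bar(n, k, eps):
    return comb(n, k) * eps**k * (1 - eps)**(n - k)

def beta_bar(n, k, eps):
    tail = lambda j: sum(alpha_bar(n, i, eps) for i in range(j, n + 1))
    return tail(k) / tail(k - 1)   # P(≥ k errors | ≥ k-1 errors)
```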

Next we assume that the real β is the corresponding β-bar to the power of f_k.

The f_k is larger than 0 and not larger than 1.

Feeding these variables into the formulae with α and β on the previous slide, we obtain an α that takes into account the dependency between the probabilities of being erroneously sequenced.

Here we define c_nk as the slide shows.

As the paper has claimed, c_nk is not sensitive to fluctuations in ε-bar when f_k=1 .

Also, compared to the product term of ε-bar on the right, c_nk contributes less to α_nk.

Therefore, the paper made an approximation here for the case where the probabilities of being
erroneously sequenced are not the same.

This c_nk can be pre-computed given β, ε-bar, and f.

This formula is an approximation to the situation that each base has different quality.

The ε-bar is a weighted harmonic mean of ε, the probability of being erroneously sequenced.
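A weighted harmonic mean can be sketched as below; note the transcript does not specify MAQ's weights, so the weights here are illustrative (defaulting to an unweighted harmonic mean).

```python
# Sketch of a weighted harmonic mean of per-base error probabilities ε_i.
# The weights are illustrative assumptions; the transcript does not
# specify which weights MAQ actually uses.

def weighted_harmonic_mean(eps, weights=None):
    if weights is None:
        weights = [1.0] * len(eps)          # unweighted by default
    return sum(weights) / sum(w / e for w, e in zip(weights, eps))

eps_bar = weighted_harmonic_mean([0.01, 0.02])
```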

The dependency between probabilities of being erroneously sequenced is determined by f_k.

In principle, it can be estimated from data.

But MAQ used a simple formula instead to calculate it, and it is said that this formula works
well.

In practice, c_nk(ε-bar) is calculated in advance and stored in a table for the program to look up.
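The precompute-and-look-up idea can be sketched as follows; since the exact c_nk formula is not reproduced in this transcript, the table below stores the simpler ᾱ_nk values as an illustrative stand-in.

```python
from math import comb

# Sketch of precomputing a lookup table, as MAQ does for c_nk(ε̄).
# The actual c_nk formula (involving β and f_k) is not reproduced in the
# transcript, so we tabulate ᾱ_nk values as an illustrative stand-in.

EPS_BAR = 0.01   # assumed fixed error-rate estimate for the table
TABLE = {(n, k): comb(n, k) * EPS_BAR**k * (1 - EPS_BAR)**(n - k)
         for n in range(1, 41) for k in range(n + 1)}

def lookup(n, k):
    return TABLE[(n, k)]   # O(1) dictionary access at genotyping time
```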

Also, the paper claimed that, in practice, mapped reads in different orientations tend to be independent of each other.

Thus the likelihood calculated in practice is the formula below, which looks relatively more complex.