0:00

Hello, everyone. I am Chai Zuying.

Today we will look at differential gene expression analysis.

High-throughput sequencing technology has developed explosively,

and it is rapidly becoming the standard method for measuring RNA expression levels.

I will show you some examples of differential gene expression analysis by high-throughput sequencing, and several analysis tools for it.

Before analyzing RNA expression levels, we first need to obtain a series of datasets by RNA-seq.

Then we identify the genes that are differentially expressed between conditions from these datasets.

Generally, there are three steps.

First, normalization of counts; second, parameter estimation for the statistical model;

finally, testing for differential gene expression.
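
As a rough illustration of the first step only, here is a minimal Python sketch of median-of-ratios normalization, the size-factor idea used by DESeq. The function names and the toy counts are made up for this example, and the other tools in the comparison normalize in their own ways.

```python
import math
from statistics import median


def size_factors(counts):
    """Estimate one size factor per sample by the median-of-ratios idea.

    counts: list of genes, each a list of raw read counts per sample.
    """
    n_samples = len(counts[0])
    # Use only genes with nonzero counts in every sample so logs are defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log of the geometric mean of each usable gene across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts):
    """Divide every count by the size factor of its sample."""
    factors = size_factors(counts)
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Hypothetical toy data: sample 2 was sequenced about twice as deeply as sample 1.
toy_counts = [[10, 20], [100, 200], [50, 100], [0, 3]]
factors = size_factors(toy_counts)
normalized = normalize(toy_counts)
print(factors)
print(normalized)
```

After normalization, the first three toy genes have equal values in both samples, which is what we expect when the only difference between the samples is depth.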

The goal of this paper, published in 2013, is to compare different methods for analyzing RNA-seq data, such as Cuffdiff, edgeR, DESeq and so on.

Two datasets are used in this paper, from the SEQC and ENCODE projects.

2:13

The next slide shows the normalization results

of the different approaches we just mentioned,

including Cuffdiff,

compared with the gold standard, real-time PCR, in the differential analysis.

The root-mean-square deviation (RMSD) they computed summarizes the results of this comparison.

In this slide, the Y-axis shows the RMSD value

and the X-axis shows the different approaches used.

So we can see

that the results of these approaches are quite close to those of real-time PCR,

which means they are reliable at the normalization level.
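
To make the RMSD comparison concrete, here is a minimal Python sketch. It assumes the comparison is made on per-gene log2 fold changes from real-time PCR and from one RNA-seq method; the numbers are invented for illustration and are not from the paper.

```python
import math


def rmsd(reference, estimate):
    """Root-mean-square deviation between two equally long lists of values."""
    if len(reference) != len(estimate):
        raise ValueError("both lists must have the same length")
    squared = [(r - e) ** 2 for r, e in zip(reference, estimate)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical log2 fold changes for four genes.
qpcr_log2fc = [1.0, -2.0, 0.5, 3.0]
rnaseq_log2fc = [1.5, -2.0, 0.0, 2.0]

value = rmsd(qpcr_log2fc, rnaseq_log2fc)
print(round(value, 4))
```

A smaller RMSD means the method agrees more closely with the real-time PCR measurements.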

The second picture looks at the same question from a different point of view.

By analyzing the specificity and sensitivity of these approaches,

the authors evaluate their advantages and disadvantages.

They use 1,000 genes that have been measured by real-time PCR;

some of these genes are the same between conditions, and some are different.

Then they use ROC analysis to test the approaches we have mentioned.

3:40

This is the picture Doctor Wei also showed.

We can see sensitivity and specificity in the top left corner.

We use the sensitivity and specificity to draw the curve,

then calculate the AUC; AUC means the area under the curve.

By analyzing the AUC, we can evaluate the approaches.

Picture A shows what happens when you set a threshold value.

We already know that 600 of the 1,000 genes are different and 400 are the same.

For example, when you set the threshold to 0.5,

each of these four values takes one value.

When you set the threshold to 0.6, a different set of these four values comes out.
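
Here is a minimal Python sketch of how a threshold leads to sensitivity, specificity and the AUC. It assumes that the four values are the counts of true positives, false positives, true negatives and false negatives, and that each gene has a score where a higher value means stronger evidence of a difference. The scores and labels are invented for illustration.

```python
def confusion(scores, labels, threshold):
    """Return (tp, fp, tn, fn) when genes with score >= threshold are called different."""
    tp = fp = tn = fn = 0
    for score, truly_different in zip(scores, labels):
        called = score >= threshold
        if called and truly_different:
            tp += 1
        elif called and not truly_different:
            fp += 1
        elif not called and truly_different:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def roc_points(scores, labels):
    """(1 - specificity, sensitivity) pairs, one per distinct threshold."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp, fp, tn, fn = confusion(scores, labels, threshold)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical scores; labels say whether real-time PCR found a difference.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [True, True, False, True, False, False]

at_half = confusion(scores, labels, 0.5)
at_stricter = confusion(scores, labels, 0.75)
area = auc(roc_points(scores, labels))
print(at_half, at_stricter, round(area, 4))
```

Changing the threshold changes the four counts, and sweeping it over all values traces the whole ROC curve.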

In picture A, when the cutoff is 0.5,

we can see these approaches give the same results,

which means they are reliable in this case.

But in picture B, when you increase the cutoff,

in other words, when the standard for calling a difference is stricter,

you will find that the results of the ROC tests differ.

Apparently, the three approaches at the bottom, including Cuffdiff, are not reliable.

5:22

In fact, the idea here is that you can evaluate different approaches

by analyzing their false positive rates.

How do we do it?

In a normal experiment, we design a sample with a control.

Here the two are the same,

so in theory they should show no difference.

But if your approach finds that they are different, it means your approach is not reliable.

How do we read this picture? First, from top to bottom, the panels are 25%, 50%, 75% and 100%,

according to the number of reads you have and the kurtosis.

From top to bottom,

you will find that when the kurtosis is low, the SNR (signal-to-noise ratio) is high.

When it comes to the bottom panel, in theory

the two samples are the same.

With P values on the X axis and density on the Y axis,

the curve should be flat, with little change.

But we can see that when the P value is less than 0.05,

Cuffdiff's curve shows an increase.

This is the main point of this picture,

which means its false positive rate is high there.

Cuffdiff is not reliable.
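
The reasoning can be sketched in a few lines of Python. When two identical samples are compared, every gene called different is a false positive, and well-behaved P values should be spread evenly between 0 and 1. The P values below are invented for illustration; they are not output from any of the tools.

```python
def false_positive_rate(p_values, alpha=0.05):
    """Fraction of genes called different when no gene truly differs."""
    return sum(1 for p in p_values if p < alpha) / len(p_values)


def p_value_density(p_values, n_bins=20):
    """Histogram density of P values over equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for p in p_values:
        index = min(int(p * n_bins), n_bins - 1)
        counts[index] += 1
    return [c * n_bins / len(p_values) for c in counts]


# Hypothetical method A: P values spread evenly, so the density is flat.
well_behaved = [(i + 0.5) / 100 for i in range(100)]
# Hypothetical method B: the same values plus an excess of small P values.
inflated = well_behaved + [0.01] * 20

flat = p_value_density(well_behaved)
peaked = p_value_density(inflated)
print(false_positive_rate(well_behaved), flat[0])
print(round(false_positive_rate(inflated), 4), round(peaked[0], 2))
```

The flat case gives a false positive rate of about 5% at a 0.05 cutoff, while the excess of small P values shows up as a rise in the first bin of the density.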

The next slide mainly shows that sequencing depth and the number of samples

affect differential gene expression analysis.

This picture is complex.

Each line stands for a different analysis approach.

The right side of the picture gives the names of the approaches,

and the Y axis on the left is the false positive rate.

And the X axis shows the dilution ratio of the reads.

100% means there is no dilution,

50% means half of the reads are tested,

and 25% means 25% of the reads are tested.
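
One simple way to imitate this dilution is sketched below in Python. It assumes each read is kept independently with the given probability, which may not be exactly how the authors subsampled; the function name and the counts are made up for this example.

```python
import random


def downsample(counts, fraction, seed=0):
    """Keep each read independently with probability `fraction`.

    counts: raw read counts per gene for one sample.
    """
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    return [sum(1 for _ in range(c) if rng.random() < fraction) for c in counts]


# Hypothetical read counts for five genes in one sample.
full_depth = [200, 35, 0, 1200, 8]

no_dilution = downsample(full_depth, 1.0)
half_depth = downsample(full_depth, 0.5)
quarter_depth = downsample(full_depth, 0.25)
print(no_dilution, half_depth, quarter_depth)
```

Running the same analysis on each diluted version shows how the results change with sequencing depth.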

We can run these methods on each subset to get the false positive rate.

But what is the meaning of this approach?

This percentage stands for the sequencing depth.

The deeper the sequencing, the lower the false positive rate.

These four boxes represent kurtosis of 25%, 50%, 75% and 100%.

8:10

We can see that when the kurtosis is 25%, the false positive rate is very high.

But when the kurtosis is higher than 25%, the false positive rate becomes low.

This picture tells us that the deeper the sequencing,

the lower the false positive rate.

What’s more, we can see dark red lines and light red lines in the picture.

They represent two and three replicate samples.

We can see that the false positive rate of the light lines is lower than that of the dark ones.

The more samples tested, the lower the false positive rate.

The conclusion from this picture is that with deeper sequencing

and more samples, the false positive rate will be lower.

Similarly, this figure shows that

the sequencing depth and the number of samples tested affect sensitivity.

These two pictures tell us that with deeper sequencing

and more samples, the false positive rate will be lower and the sensitivity will be higher.

Comparing increasing the number of samples with increasing the sequencing depth,

we want to know which is better.

The authors emphasize that increasing the number of samples is better than increasing the sequencing depth.

9:54

When we look at this line, different sequencing depths bring only a small range in the values,

while an increased sample number brings a large range in the final values.

So the authors say in the paper, and this is also what the picture shows,

that increasing the sample number is always better than increasing the sequencing depth.

This is the paper’s brief conclusion.

First, Cuffdiff is not so reliable in some cases:

its false positive rate can be high and its sensitivity is not so high.

Also, both increasing the sample number and increasing the sequencing depth

can raise the sensitivity and lower the false positive rate.

Thanks!