How should I normalize RNA-seq data to draw gene expression time-series curves?
9.5 years ago
jfhuang.dg ▴ 40

Hello,

I am processing insect development paired-end RNA-seq data.

Eight time points were selected for sample preparation; at each time point, several whole insects were pooled to extract total mRNA. As a result, I have eight RNA-seq data sets, one from each developmental stage:

Egg -> 24 hour -> 48 hour -> 72 hour -> 96 hour -> Prepupa -> Pupa -> Adult

I want to plot gene expression across time, like this:

But the question is how I should normalize my data across samples to make the expression values comparable. I used three different methods to calculate expression values, but the results confuse me a lot. I will post the gene expression correlation plots to illustrate.

(1) Map the RNA-seq data to the genome using TopHat, then use Cuffdiff to generate FPKM values (as Cuffdiff normalizes data across samples). This gives very weak correlation across samples.

(2) Map the data to the genome using TopHat, then generate RPKM values from the SAM output (using rpkmforgenes.py from the Sandberg lab). This gives the following figure:

(3) Use RSEM to calculate TPM values for each gene, and plot.

The results of (1) differ greatly from those of (2) and (3). Methods (2) and (3) give similar correlations between samples, but there is a problem: the first sample (Egg) and the last sample (Adult), for example, have totally different phenotypes, yet their gene expression correlation seems too high (>0.9).

Could anyone comment on my results? Do the expression values need to be normalized across samples, either after or before the RPKM/TPM calculation (especially for (2) and (3))? Which method should I use?
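For reference, here is a minimal sketch of how TPM is derived from read counts and gene lengths, assuming the usual definition (reads per kilobase, rescaled so that each sample sums to one million). The function name `counts_to_tpm` and the toy numbers are made up for illustration; this is not RSEM's actual implementation.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert read counts for ONE sample to TPM.

    counts     -- reads assigned to each gene (hypothetical toy input)
    lengths_bp -- gene (or effective transcript) lengths in base pairs
    """
    # Reads per kilobase: corrects for gene length only.
    rpk = [c / (l / 1000.0) for c, l in zip(counts, lengths_bp)]
    total = sum(rpk)
    if total == 0:
        raise ValueError("sample has no assigned reads")
    # Rescale so the values in this sample sum to one million.
    return [r / total * 1e6 for r in rpk]


# Toy example: three genes in one sample.
tpm = counts_to_tpm([100, 200, 300], [1000, 2000, 3000])
```

Because the rescaling uses only the sample's own total, TPM (like RPKM) is a within-sample measure; it does not by itself correct for composition differences between samples.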

RNA-Seq • 8.2k views

Sorry for the watermarks on the pictures; I just could not find a good place to post my figures.


Did you try a logarithmic transformation before calculating the correlations?
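To illustrate why the transformation matters, here is a small sketch comparing Pearson correlation on raw and log-transformed values. The gene values, the helper names `pearson` and `log2p1`, and the pseudocount of 1 are all assumptions made for the example, not taken from the data in this thread.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def log2p1(values):
    """log2(x + 1); the pseudocount of 1 keeps zeros finite."""
    return [math.log2(v + 1.0) for v in values]


# Toy TPM values for the same five genes in two samples (made up).
egg = [0.0, 5.0, 20.0, 100.0, 10000.0]
adult = [3.0, 1.0, 50.0, 20.0, 9000.0]

r_raw = pearson(egg, adult)                   # dominated by the one huge gene
r_log = pearson(log2p1(egg), log2p1(adult))   # all genes contribute
print(r_raw, r_log)
```

On untransformed values a handful of very highly expressed genes can drive the correlation close to 1 even when most genes differ, which may be part of the reason Egg and Adult look so similar.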


I tried taking log10 of my RSEM results, and the plot looks different.

But I don't know whether this means my data are usable or not.

image

9.5 years ago
Amitm ★ 2.3k

Hi,

This may seem very primitive, but after log transformation, make a boxplot and look at the data. I find this very intuitive before puzzling over concordance or correlation.
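If you want the numbers behind such a boxplot without plotting, a sketch along these lines would do. The helper name `log_summary`, the pseudocount of 1, and the toy values are assumptions for illustration only.

```python
import math
import statistics


def log_summary(values):
    """Return (min, Q1, median, Q3, max) of log2(x + 1) values,
    i.e. the numbers a boxplot would draw for one sample."""
    logged = sorted(math.log2(v + 1.0) for v in values)
    q1, med, q3 = statistics.quantiles(logged, n=4)
    return (logged[0], q1, med, q3, logged[-1])


# Toy expression values per sample (made up for illustration).
samples = {
    "Egg": [0.0, 1.0, 3.0, 7.0, 15.0, 31.0, 63.0],
    "Adult": [0.0, 3.0, 7.0, 15.0, 31.0, 63.0, 127.0],
}

for name, values in samples.items():
    print(name, log_summary(values))
```

If the medians and quartiles of the samples sit at clearly different levels, the samples are probably not yet comparable and need some between-sample normalization.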

Also, some normalization is important before drawing conclusions. Bioconductor/R has good packages for RNA-seq data, such as DESeq2, though I have seen them not perform so well with non-replicate data such as yours.
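To give an idea of what between-sample normalization does, here is a simplified sketch of the median-of-ratios idea that DESeq2's size factors are based on. It is a toy re-implementation for illustration, not the package's code; the function name `size_factors` and the counts are made up. Note that it works on raw read counts, not on FPKM/RPKM/TPM.

```python
import math
import statistics


def size_factors(counts):
    """Median-of-ratios size factors (simplified sketch).

    counts -- list of genes, each a list of raw read counts per sample.
    Genes with a zero in any sample are skipped, because their
    geometric mean would be zero.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if min(gene) <= 0:
            continue
        # Log of the geometric mean across samples: a pseudo-reference.
        log_ref = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_ref)
    # One factor per sample: the median ratio to the pseudo-reference.
    return [math.exp(statistics.median(r)) for r in log_ratios]


# Toy counts: sample 2 was sequenced twice as deeply as sample 1.
counts = [
    [10, 20],
    [50, 100],
    [200, 400],
    [0, 3],  # skipped: zero count in one sample
]
factors = size_factors(counts)
normalized = [[c / f for c, f in zip(gene, factors)] for gene in counts]
```

Dividing each sample's counts by its factor puts the samples on a common scale, under the assumption that most genes do not change between samples. That assumption may be questionable across very different developmental stages.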

I can suggest some basic steps that I use to reduce variation. RNA-seq data has a large number of genes/transcripts with values of 0 or near 0 (caveat: my experience is with data from human tissues and cell lines only).

1) Calculate the average expression value for each gene across all samples.

2) Sort this vector and apply an (arbitrary) threshold. The idea is to remove the genes that have only basal expression across all samples. If you make a density plot of this vector in R, it should be clear where a cutoff could be made.

3) After this, with the selected gene list, calculate the standard deviation and again select the highly variable genes.

Ultimately you would be left with a gene set that is non-basal and "responding" to your biological question.

Then calculate correlations or perform clustering to discover groups of genes.
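The filtering steps above can be sketched roughly as follows. The function name `select_genes`, the thresholds, and the toy values are all arbitrary choices for illustration; real cutoffs should come from looking at your own data.

```python
import statistics


def select_genes(expr, min_mean, min_sd):
    """Keep genes whose mean expression exceeds min_mean and whose
    standard deviation across samples exceeds min_sd.

    expr -- dict mapping gene name to a list of (log) expression
            values, one per time point. Both thresholds are arbitrary.
    """
    kept = {}
    for gene, values in expr.items():
        if statistics.mean(values) <= min_mean:
            continue  # steps 1-2: drop basally expressed genes
        if statistics.stdev(values) <= min_sd:
            continue  # step 3: drop genes that barely change
        kept[gene] = values
    return kept


# Toy log-expression values over four time points (made up).
expr = {
    "geneA": [0.0, 0.1, 0.0, 0.2],  # basal: low everywhere
    "geneB": [8.0, 8.1, 7.9, 8.0],  # expressed but flat
    "geneC": [2.0, 5.0, 9.0, 4.0],  # expressed and changing
}
selected = select_genes(expr, min_mean=1.0, min_sd=1.0)
print(sorted(selected))
```

The surviving genes are the ones to feed into the correlation or clustering step.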

With non-replicate RNA-seq data there aren't any rigorous statistical methods out there. The above is my take on making the best of the available data.

9.5 years ago

Take a look at this RPub to see if it is useful for you.


An excellent tutorial! The figures inspire me a lot!

Array data usually use log2 values. The difference is that array data for samples from different time points are usually held on a single array, so they are normalized together.

I am not sure whether RNA-seq data (FPKM/RPKM/TPM values) need further processing to normalize among samples. The results from my data (RPKM and TPM) look unreasonable, as the correlation between the first and the last time points is quite high. But with the Cuffdiff results, they look quite different.

If the RPKM/TPM results need no further treatment (normalization among samples), then it means there is some problem with my data (it may be wrong).

The data are from my collaborator, so I must make sure that the data processing is right, in order to decide whether the problem arose in my data processing or before sequencing (in the design of the wet-lab experiment).
