Hello,
I am processing paired-end RNA-seq data from an insect developmental time course.
Eight time points were selected for sampling; at each time point, several whole insects were pooled to extract total mRNA. As a result, I have eight RNA-seq datasets, one per developmental stage:
Egg -> 24 hour -> 48 hour -> 72 hour -> 96 hour -> Prepupa -> Pupa -> Adult
I want to plot gene expression across time, like:
But the question is: how should I normalize my data across samples to make the expression values comparable? I used three different methods to calculate expression values, but the results confuse me a lot. I will post the gene expression correlation plots to illustrate.
(1) Map the RNA-seq data to the genome using TopHat, then use Cuffdiff to generate FPKM values (since Cuffdiff normalizes data across samples). This gives very weak correlation across samples.
(2) Map the data to the genome using TopHat, then generate RPKM values from the SAM output (using rpkmforgenes.py from the Sandberg lab). This gives the following figure:
(3) Use RSEM to calculate a TPM value for each gene, and plot:
The results from (1) differ greatly from those of (2) and (3). Methods (2) and (3) give similar correlations between samples. But, for example, the first sample (Egg) and the last sample (Adult) have totally different phenotypes, yet their gene expression correlation seems too high (>0.9).
Could anyone comment on my results? Do the expression values need to be normalized across samples, and if so, before or after calculating RPKM/TPM (especially for (2) and (3))? Which method should I use?
Sorry for the watermarks on the pictures; I just could not find a good place to post my figures ......
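For anyone comparing the RPKM/FPKM values from (2) with the TPM values from (3): as far as I understand, TPM is just FPKM rescaled so that each sample sums to one million. Here is a minimal sketch of that conversion; the function name `fpkm_to_tpm` and the numbers are made up for illustration and are not taken from RSEM or rpkmforgenes.py. Note that this is only a within-sample rescaling, so by itself it does not correct for composition differences between stages such as Egg and Adult.

```python
def fpkm_to_tpm(fpkm):
    """Rescale one sample's FPKM values so that they sum to 1e6 (TPM)."""
    total = sum(fpkm)
    if total == 0:
        raise ValueError("all FPKM values are zero; TPM is undefined")
    return [value / total * 1e6 for value in fpkm]


# Made-up FPKM values for three genes in a single sample.
sample_fpkm = [10.0, 30.0, 60.0]
sample_tpm = fpkm_to_tpm(sample_fpkm)

# Relative abundances within the sample are unchanged; only the scale differs.
print(sample_tpm)
```

Because every sample ends up with the same total, TPM values are a bit easier to compare across samples than raw FPKM, but one or two very highly expressed genes can still shift everything else.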
Did you try a logarithmic transformation before calculating the correlations?
I tried taking log10 of my RSEM results, and the plot looks different.
But I don't know whether that means my data are usable or not.
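To illustrate why the log transformation changes the picture, here is a small sketch with made-up TPM values (not data from this thread). On the raw scale, a Pearson correlation is dominated by a few very highly expressed genes, so two quite different samples can still score above 0.9. The pseudocount of 1 added before taking log10 is an assumption on my part to avoid log10(0); other values are possible.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def log_transform(values, pseudocount=1.0):
    """log10(value + pseudocount); the pseudocount avoids log10(0)."""
    return [math.log10(v + pseudocount) for v in values]


# Made-up TPM values for six genes in two samples. The first gene is
# highly expressed in both; the remaining genes differ considerably.
egg = [5000.0, 3.0, 40.0, 0.0, 12.0, 150.0]
adult = [4800.0, 60.0, 2.0, 25.0, 1.0, 9.0]

raw_r = pearson(egg, adult)
log_r = pearson(log_transform(egg), log_transform(adult))

print("raw scale:", round(raw_r, 3))
print("log scale:", round(log_r, 3))
```

In this toy example the raw correlation is close to 1 while the log-scale correlation is much lower, because the single abundant gene no longer outweighs the others. So a change after taking log10 does not in itself say the data are unusable; it mostly shows that the raw-scale correlation was driven by the top few genes.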