Question

Normalizing Count Data in RNA-Seq

12.8 years ago

Arun 2.4k

Hello, suppose I have RNA-seq data for 1) a control (T0), 2) treatment after 4 hours (T4), and 3) treatment after 8 hours (T8), and I would like to find the genes that are differentially expressed between each of these pairs (T0 vs T4 and T0 vs T8 being the most informative/essential to the experimenter).

I normalize using the TMM method in edgeR. So far, I have been normalizing the count data separately for each pair (A). That is, for T0 vs T4, I take the counts for those two samples, perform TMM normalization, and obtain the candidate genes; then for T0 vs T8, I normalize those two sets of counts again and obtain DE genes, and so on.

However, I am beginning to wonder whether this is the way to go, or whether I should perform a single normalization using the counts for all genes from all time points together (B).

I am not able to convince myself of a good reason to choose one over the other. Has any of you had to work on this type of data, or do you have an idea why you would go for (A) or (B)?

Thank you.

edger rna-seq differential-expression • 5.8k views

updated 12.8 years ago by seidel 11k • written 12.8 years ago by Arun 2.4k

score 2 · Answer 1 · 2012-05-18

12.8 years ago

Frenkiboy ▴ 260

You can try the DESeq package. Its estimateSizeFactors function uses the complete dataset to perform the normalization.

Then you can test for differential expression on sample vs sample, or fit a GLM.
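
To make the "complete dataset" idea concrete, here is a minimal pure-Python sketch of median-of-ratios size factors, the general approach behind estimateSizeFactors. It is a simplified illustration rather than the package's actual implementation: the `size_factors` helper and the toy counts are made up for this example.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors (simplified sketch).

    counts: dict mapping sample name -> list of raw counts, same gene order.
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples would be zero.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Per-gene log geometric mean across ALL samples (the pseudo-reference).
    log_ref = []
    for g in range(n_genes):
        row = [counts[s][g] for s in samples]
        if all(c > 0 for c in row):
            log_ref.append(sum(math.log(c) for c in row) / len(row))
        else:
            log_ref.append(None)

    # A sample's factor is the median ratio of its counts to the reference.
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - log_ref[g]
            for g in range(n_genes)
            if log_ref[g] is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Made-up toy counts for five genes (illustrative only).
counts = {
    "T0": [100, 200, 50, 400, 10],
    "T4": [200, 400, 100, 800, 80],
    "T8": [50, 100, 25, 200, 40],
}

factors = size_factors(counts)
normalized = {s: [c / factors[s] for c in counts[s]] for s in counts}
```

Because the reference is built from all three samples at once, every sample is scaled against the same baseline before any pairwise test is run.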

Thank you for your answer. However, I don't think the issue is whether edgeR has the option to normalize all (or more than two) samples. Rather, which approach is better / right: normalizing each pair as and when I test for DE, or normalizing all samples together and then testing for DE on all pairs? From what you say, it seems to be the latter: normalize everything together, then test for DE on all pairs. Right?

reply • 12.8 years ago by Arun 2.4k

I think you have it right, yes.

reply • 12.8 years ago by Sean Davis 27k

score 1 · Answer 2 · 2012-05-18

12.8 years ago

seidel 11k

The problem with option A is that you calculate different normalization factors for T0 vs T4 and for T0 vs T8. Since T4 and T8 are related samples from the same time course, you will inevitably end up comparing the T4 and T8 results, but they will have been adjusted differently, so they will differ by whatever the two normalizations differ by. With option B, everything in the pool has been adjusted to the same mean.
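
A toy illustration of this point, as a hedged sketch: it uses a simplified median-of-ratios factor instead of edgeR's TMM (which is more involved), and both the `size_factors` helper and the counts are invented for the example. The qualitative effect should carry over: a factor depends on which samples went into the normalization.

```python
import math
from statistics import median


def size_factors(counts):
    # Simplified median-of-ratios factors; assumes every count is > 0.
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    log_ref = [
        sum(math.log(counts[s][g]) for s in samples) / len(samples)
        for g in range(n_genes)
    ]
    return {
        s: math.exp(
            median([math.log(counts[s][g]) - log_ref[g] for g in range(n_genes)])
        )
        for s in samples
    }


# Made-up counts: T4 is sequenced twice as deep as T0, T8 half as deep.
t0 = [100, 200, 50, 400]
t4 = [200, 400, 100, 800]
t8 = [50, 100, 25, 200]

# Option A: normalize each pair separately.
pair_a = size_factors({"T0": t0, "T4": t4})
pair_b = size_factors({"T0": t0, "T8": t8})

# Option B: normalize all samples together.
pooled = size_factors({"T0": t0, "T4": t4, "T8": t8})

# First gene of T4 and T8 after each kind of normalization.
t4_a, t8_b = t4[0] / pair_a["T4"], t8[0] / pair_b["T8"]
t4_pooled, t8_pooled = t4[0] / pooled["T4"], t8[0] / pooled["T8"]
```

Here T0 gets a factor of roughly 0.71 when paired with T4 and roughly 1.41 when paired with T8, so the pairwise-normalized T4 and T8 values (`t4_a`, `t8_b`) sit on different scales and appear to differ two-fold. After pooled normalization the same gene comes out equal in both samples.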