Question

How to know which normalization method should be used for RNA-seq read counts?


7.5 years ago

statfa ▴ 790

Hi,

I'm applying EBSeq-HMM on my time course read counts (4 time points). There are two possible normalization methods available in the package: "Median" and "Quantile".

This model clusters genes into their most likely expression path. When I use Median normalization on my data, Gene X's most likely path is "Up-EE-Up" (EE stands for equally expressed). When I use Quantile normalization, the same gene's most likely path is "Up-Up-Up".

When I plotted the Median-normalized and Quantile-normalized expression for this gene, I saw that the slope between time points 2 and 3 is smaller under Median normalization than under Quantile normalization. That is probably why EBSeq-HMM did not find the difference large enough to call an "Up" step for this gene. Now I don't know which normalization method to trust, or how to tell which one works better with EBSeq-HMM.
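To illustrate the point about slopes, here is a minimal Python sketch with made-up numbers (the counts and both sets of size factors are hypothetical, not my data, and `log2_steps` is just a helper name for this example). It only shows that dividing the same raw counts by two different sets of per-sample size factors changes the log2 fold change between consecutive time points, which could be enough to move a step from "Up" towards "EE".

```python
import math

def log2_steps(raw_counts, size_factors):
    """Log2 fold change between consecutive time points, after dividing
    each raw count by the size factor of its sample."""
    normalized = [c / s for c, s in zip(raw_counts, size_factors)]
    return [math.log2(b / a) for a, b in zip(normalized, normalized[1:])]

# Hypothetical raw counts for one gene at 4 time points (one sample each).
gene_x = [100, 200, 260, 520]

# Hypothetical size factors, standing in for two normalization methods.
factors_a = [1.0, 1.0, 1.25, 1.0]  # method A scales sample 3 down more
factors_b = [1.0, 1.0, 1.0, 1.0]   # method B leaves all samples as they are

steps_a = log2_steps(gene_x, factors_a)
steps_b = log2_steps(gene_x, factors_b)

# The step between time points 2 and 3 is the second entry of each list.
print([round(x, 2) for x in steps_a])
print([round(x, 2) for x in steps_b])
```

With these toy numbers the step from time point 2 to 3 is much flatter under method A than under method B, even though the raw counts are identical.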

How can I upload the photos?

https://ibb.co/jUet7a https://ibb.co/mszt7a

Normalization RNA-seq • 3.0k views

ADD COMMENT • link 7.5 years ago by statfa ▴ 790


You can use either median or quantile, but what is your hypothesis statement? Choose the method that best helps to test your hypothesis.

ADD REPLY • link 7.5 years ago by theobroma22 ★ 1.2k


There are no units on the Y-axis, my friend. Big no-no.

RNA-seq expression is usually normalised by FPKM or by TPM (better). Why are you using median or quantile? Unless I'm mistaken, it sounds like these only take the raw expression counts. Median just means the middle count, while quantile normalising just means grouping the counts into bins, in which case you are not normalising for fragment length or library size, which would make your comparison meaningless. Do you have a way of finding out the exact equation being used in median and quantile?

ADD REPLY • link 7.5 years ago by BioinfGuru ★ 2.1k


I normalize the data in order to find the DE genes. Samples are normalized for library size using methods such as Median, Quantile, TMM, Total, etc. Median and Quantile normalization are available in the EBSeq-HMM package. I normalized my raw read counts using those two methods and compared the results. FPKM or RPKM normalize the read counts for gene length, which, as far as I know, is not needed for DE analysis. Is that correct?
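To make "normalizing by library size" concrete, here is a minimal Python sketch of three ways to get one size factor per sample. It assumes that "Median" means the DESeq-style median of ratios and that "Quantile" means upper-quantile scaling at the 0.75 quantile; I have not copied either from the EBSeq-HMM source, so please check the package documentation for the exact equations. The function names and the toy count matrix are made up for illustration.

```python
import math
import statistics

def total_count_factors(counts):
    """Total count: library size of each sample divided by the mean library size."""
    totals = [sum(col) for col in zip(*counts)]
    mean_total = sum(totals) / len(totals)
    return [t / mean_total for t in totals]

def median_of_ratios_factors(counts):
    """Median of ratios: per sample, the median over genes of
    count / (geometric mean of that gene across samples).
    Genes with a zero count in any sample are skipped."""
    n_samples = len(counts[0])
    ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if min(gene) <= 0:
            continue
        geo_mean = math.exp(sum(math.log(c) for c in gene) / n_samples)
        for j, c in enumerate(gene):
            ratios[j].append(c / geo_mean)
    return [statistics.median(r) for r in ratios]

def quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def upper_quantile_factors(counts, q=0.75):
    """Upper quantile: per sample, the q-th quantile of its counts,
    rescaled so the factors have geometric mean 1.
    Assumes every sample has a nonzero q-th quantile."""
    raw = [quantile(col, q) for col in zip(*counts)]
    geo_mean = math.exp(sum(math.log(r) for r in raw) / len(raw))
    return [r / geo_mean for r in raw]

# Hypothetical counts: rows are genes, columns are 3 samples.
# Samples 2 and 3 are sequenced 2x and 4x deeper than sample 1;
# only the last gene also changes in sample 3.
counts = [
    [10, 20, 40],
    [100, 200, 400],
    [50, 100, 200],
    [5, 10, 20],
    [30, 60, 240],
]

t = total_count_factors(counts)
m = median_of_ratios_factors(counts)
u = upper_quantile_factors(counts)
for name, f in (("total", t), ("median", m), ("quantile", u)):
    print(name, [round(x, 3) for x in f])
```

In this toy matrix the three methods disagree about how much deeper sample 3 is than sample 1, because the one changing gene influences the total and the upper quantile but not the median of ratios. Normalized counts are then the raw counts divided by the size factor of their sample.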

Please read this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4917940/

"It is widely known that raw counts are not directly comparable between genes due to differential gene lengths and sequencing depths, and reads per kilobase per million reads (RPKM) can be used to correct the resultant technical bias [11]. In DE analysis between multiple conditions, the gene length does not affect the analysis result since such DE analysis focuses on the same gene. However, the condition comparison could greatly suffer from sample specific effects such as sequencing depth and sample specific GC-content effect. The sample specific GC-content effect could arise if two or more samples are sequenced in the same lane. Several within-lane normalization methods (i.e., regression normalization, global-scaling normalization, and full-quantile normalization) can be used to correct the resultant technical bias [12]. On the other hand, such effect can be absorbed into sample specific sequencing depth if only a single sample is sequenced in each lane, and the following four between-lane normalization methods are designed for correcting the technical bias due to sequencing depth: median normalization, total count normalization, quantile normalization, TMM normalization"