Hi,
Until now I have mostly analyzed RNA-seq data from our own samples, which are usually sequenced in the same batch under the same conditions. In that setting, both DESeq2 and TMM normalization work very well.
However, when I tried to analyze human data from a large cohort, I saw someone using cqn (Conditional Quantile Normalization), which corrects for GC-content bias between lanes (batches, samples?). From the original CQN article (DOI: 10.1093/biostatistics/kxr054) and a paper from Jonathan K. Pritchard's group (DOI: 10.1093/biostatistics/kxr054, supplementary figure 12), it seems that GC content can affect the results significantly and that the GC bias is sample-specific.
I compared my results based on logCPM from TMM (logCPM from limma) and from CQN (cqn_result$y + cqn_result$offset). The two are similar, but the effects I see are much more pronounced and significant with the TMM-normalized values than with the CQN-normalized values.
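One way to judge whether a GC correction matters for a given dataset is to check whether the dependence of expression on GC content differs between samples. The following is only a minimal sketch of such a diagnostic, not part of cqn, edgeR, or limma: the function name `gc_slopes` and the toy data are hypothetical, and it assumes a genes-by-samples matrix of log-expression values (for example logCPM) plus one GC fraction per gene.

```python
from statistics import mean


def gc_slopes(log_expr, gc):
    """Per-sample least-squares slope of centered log-expression on GC content.

    log_expr: list of gene rows, each a list of log-expression values (one per sample).
    gc:       list of GC fractions, one per gene.
    """
    n_samples = len(log_expr[0])
    # Center each gene across samples so the gene's own expression level drops out.
    centered = [[v - mean(row) for v in row] for row in log_expr]
    gc_mean = mean(gc)
    gc_dev = [g - gc_mean for g in gc]
    denom = sum(d * d for d in gc_dev)
    slopes = []
    for j in range(n_samples):
        col = [row[j] for row in centered]
        col_mean = mean(col)
        # Ordinary least-squares slope of this sample's centered values on GC.
        num = sum(d * (c - col_mean) for d, c in zip(gc_dev, col))
        slopes.append(num / denom)
    return slopes


# Toy data (hypothetical): sample 1 has no GC effect,
# sample 2 a positive one, sample 3 a negative one.
gc = [0.3, 0.4, 0.5, 0.6, 0.7]
base = [5.0, 6.0, 4.0, 7.0, 5.0]
log_expr = [[b, b + 2 * (g - 0.5), b - 2 * (g - 0.5)] for b, g in zip(base, gc)]

print([round(s, 3) for s in gc_slopes(log_expr, gc)])
```

Because each gene is centered across samples, the slopes are relative to the average sample and sum to roughly zero. Slopes that clearly differ between samples suggest a sample-specific GC effect, which a single scaling factor per sample (as in TMM) cannot remove.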
Searching the available literature, CQN does not seem to be as widely used as TMM or DESeq2, and few people have compared the three methods.
I'm now quite confused about my results, especially about how reliable they are.
Any suggestions are welcome and appreciated.
Best regards,
Raymond
If you are interested in GC bias in RNA-seq, you should read the alpine paper (https://doi.org/10.1038/nbt.3682) and check the related Bioconductor package (https://bioconductor.org/packages/release/bioc/html/alpine.html). There is also a talk from Mike Love (the developer) on YouTube. I will not say whether or not you should use a method that corrects for this, as I have no expertise in it. I personally use salmon to quantify my reads, which has a --gcBias flag that implements the GC bias correction method from alpine, followed by tximport and a standard edgeR workflow.
Thanks very much! I will look into the raw data processing steps to check the bias corrections. Best regards, Raymond