5.1 years ago
Hi,
I am working on a human RNA-Seq project that includes two conditions (tumor vs. normal) with 21 and 5 replicates, respectively. Four of the tumor samples contain 250-300 million reads, whereas the other samples average 60-100 million reads. Since the results show significant differences between these samples, it seems DESeq2 normalization is not helping.
Do you have any suggestions for fixing this problem?
Thanks.
Reza
What makes you conclude this?
The heatmap and PCA plot of my data show large differences between these 4 samples and the others, which I think is caused by the difference in reads per sample.
Can you post the code you used? DESeq2 does account for library-size differences; check out the estimateSizeFactors function. Mike Love answers a similar question here: https://support.bioconductor.org/p/107292/
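A minimal sketch of what that looks like, assuming you already have a DESeqDataSet named dds (a placeholder name) built from your raw count matrix:

```r
library(DESeq2)

# dds is a DESeqDataSet built from raw counts,
# e.g. via DESeqDataSetFromMatrix()
dds <- estimateSizeFactors(dds)

# One size factor per sample; more deeply sequenced
# samples get larger factors
sizeFactors(dds)

# Normalised counts = raw counts divided column-wise
# by those size factors
head(counts(dds, normalized = TRUE))
```

Note that estimateSizeFactors is called automatically inside DESeq(), so you only need to call it yourself if you want to inspect the factors.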
He says that a difference of roughly 2x in read depth can be handled by the size-factor correction in DESeq2 (the function above). By the way, the edgeR package uses a different normalisation method (TMM): https://www.rdocumentation.org/packages/edgeR/versions/3.14.0/topics/calcNormFactors. You can also tell edgeR to use the same normalisation method as DESeq2 if you like.
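For comparison, a short edgeR sketch; here counts and group are placeholders for your raw count matrix and condition factor, and method = "RLE" is edgeR's implementation of the DESeq-style relative-log-expression normalisation:

```r
library(edgeR)

# counts: raw integer count matrix (genes x samples)
# group:  factor of conditions (tumor / normal)
y <- DGEList(counts = counts, group = group)

# Default TMM normalisation
y_tmm <- calcNormFactors(y, method = "TMM")

# Or the DESeq-style median-of-ratios ("RLE") normalisation
y_rle <- calcNormFactors(y, method = "RLE")

y_tmm$samples$norm.factors  # one normalisation factor per sample
```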
Thanks for your reply. You can see the code I used below. I also read Mike Love's answer and his suggestions for handling different sequencing depths across samples, but I have no idea how to apply them to my own data.
I used this code to check the number of reads, and it seems the problem comes from here: the "S1", "S2", "S3" and "S4" samples have very low read counts compared to the others.
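One way to look at per-sample library sizes directly from the raw counts (a sketch; dds is assumed to be your DESeqDataSet):

```r
library(DESeq2)

# Total raw (non-normalised) counts assigned to genes, per sample
libsize <- colSums(counts(dds))

sort(libsize)  # the problematic samples should stand out at one end

barplot(libsize, las = 2,
        ylab = "Total reads assigned to genes")
```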
This is the heatmap of the top 500 DE genes, which clearly shows the issue.
counts returns the original, non-normalised counts. His solution is to filter out those genes with low counts:
Note the normalized=TRUE flag in counts, which you should be using when you look at the normalised counts. At this stage, either drop those samples completely, filter as suggested above, or continue with your current setup under the assumption that the default normalisation is enough to handle the library differences. From my reading, the number of replicates has more of an impact on DE gene discovery than library-size differences.
Thanks for your recommendations. One of our project members suggests dividing the counts of each gene by the sum of counts of all genes in that sample, to transform absolute frequencies into relative frequencies, and then performing normalization. Do you think this is the right thing to do?
That sounds very similar to total-count scaling, like the per-million step in CPM/RPKM. The DESeq2 manual specifically says to input un-normalised counts: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
IMO that's a bad idea.
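To see why pre-dividing by the library sum is unnecessary, here is the median-of-ratios calculation that DESeq2's estimateSizeFactors performs, written out in base R on a made-up toy matrix (the numbers are purely illustrative):

```r
# Toy counts: sample B is the same library as A, sequenced 3x deeper
counts <- matrix(c(10, 20, 40,
                   30, 60, 120),
                 ncol = 2,
                 dimnames = list(paste0("gene", 1:3), c("A", "B")))

# Median-of-ratios: per-gene geometric mean as a pseudo-reference,
# then the median ratio of each sample to that reference
log_geomeans <- rowMeans(log(counts))
size_factors <- apply(counts, 2, function(col)
  exp(median(log(col) - log_geomeans)))

# Dividing each column by its size factor absorbs the depth
# difference entirely; both columns become identical here
norm <- sweep(counts, 2, size_factors, "/")
```

Because the size factors already capture the per-sample depth, dividing the counts by the library total first only destroys the count nature of the data that DESeq2's statistical model relies on.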
I'm still confused about why you are not satisfied with the normalisation method used by DESeq2, especially since the library sizes are within normal ranges.