I used SortMeRNA to remove rRNA sequences from my raw RNA-seq data. For 7 out of 8 samples, ~95% of the reads were kept as clean data. For the remaining sample, only ~75% were kept, and around 20% of the reads mapped to eukaryotic 18S and 28S rRNA sequences. Later, in the differential expression analysis, this weird sample appeared as an outlier in the PCA plot and did not cluster with its replicates.
Therefore, I may have to discard this sample from my DE analysis. Alternatively, I could skip the rRNA removal step so that it does not cause the problem. What should I do?
Also, since you generally have some residual rRNA "contamination" even after poly-A selection ... you could be throwing off normalization factors that take your library size into account.
So this means that even after counting, one should not remove rRNA genes? E.g., HISAT2 --> featureCounts --> DESeq2
Yep, exactly. You can filter them out of the DE gene list at the end if they are not interesting to you.
No, if you don't care about them then you should remove their counts from the matrix. Otherwise you're needlessly inflating the number of tests you're doing and deflating your power. The normalization should be robust to their presence, but if there's a LOT of rRNA contamination in one sample then that tends to cause issues with the normalization factors.
Yeah, I was thinking they should keep them in for the size factor calculation, and then it would be OK to remove them. But checking through the DESeq2 manual, it wasn't obvious how to do that. In edgeR, it is a little more straightforward...
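To make the "size factors first, then drop" idea concrete, here is a minimal sketch in plain Python. It is not DESeq2's own code; it only re-implements the median-of-ratios idea on a toy matrix, and the gene IDs and counts are made up for illustration.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts maps a gene ID to a list of raw counts, one per sample.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if any(c == 0 for c in row):
            continue  # a gene with a zero count has no finite geometric mean
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]

def drop_genes(counts, gene_ids):
    """Return a copy of the matrix without the listed genes."""
    return {g: row for g, row in counts.items() if g not in gene_ids}

# Toy matrix with two samples; all IDs and numbers are hypothetical.
counts = {
    "geneA": [100, 200],
    "geneB": [50, 100],
    "geneC": [10, 20],
    "rRNA_1": [1000, 9000],  # heavy rRNA signal in sample 2
}
rrna_ids = {"rRNA_1"}

sf_all = size_factors(counts)          # size factors with rRNA rows kept
filtered = drop_genes(counts, rrna_ids)  # matrix to carry into testing
sf_filtered = size_factors(filtered)   # size factors without rRNA rows
```

In this toy case the two sets of size factors agree, because the median ignores a single extreme row. With many rRNA features or very few genes that would no longer be guaranteed, which is why it is worth comparing both.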
I think dropping them off the bat would only be OK if you checked that they were similar across samples.
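One way to check whether rRNA is similar across samples is to compute the fraction of counts assigned to rRNA genes in each sample. A minimal sketch, assuming the counts are already in memory as a dict and that you have a set of rRNA gene IDs (the IDs and numbers below are made up):

```python
def rrna_fraction(counts, rrna_ids):
    """Fraction of each sample's total counts that falls on rRNA genes.

    counts maps a gene ID to a list of raw counts, one per sample.
    """
    n_samples = len(next(iter(counts.values())))
    totals = [0] * n_samples
    rrna = [0] * n_samples
    for gene, row in counts.items():
        for j, c in enumerate(row):
            totals[j] += c
            if gene in rrna_ids:
                rrna[j] += c
    return [r / t if t else 0.0 for r, t in zip(rrna, totals)]

# Hypothetical toy matrix: sample 2 carries far more rRNA than sample 1.
toy_counts = {
    "geneA": [600, 500],
    "geneB": [350, 250],
    "rRNA_1": [50, 250],
}
fractions = rrna_fraction(toy_counts, {"rRNA_1"})
```

A sample whose fraction is far from the others is the one to look at before deciding whether to drop the rRNA rows or the sample itself.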
Thank you, Devon. Yes, this is one of the reasons I would prefer to remove them, only I have not been able to find the gene IDs. Any idea where I can get the list of Drosophila melanogaster rDNA gene Ensembl IDs?
So you did not count them in the first place? Here is the Ensembl Drosophila rDNA scaffold. The same scaffold is available at FlyBase.
Thanks, genomax. You filter them from the raw counts before DESeq2. The scaffold works for alignment, but at this point I already have the counts matrix. How can I obtain just the gene IDs?
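If the counts were made against an Ensembl-style GTF, one possible route is to pull the gene IDs straight from the annotation by biotype. The sketch below assumes the attribute column carries `gene_id` and `gene_biotype` keys, as Ensembl GTF files do; other sources may name the biotype attribute differently, so check your own file. The example lines and IDs are invented.

```python
import re

# Matches key "value" pairs in the GTF attribute column.
ATTR = re.compile(r'(\S+) "([^"]*)"')

def rrna_gene_ids(gtf_lines, biotypes=("rRNA",)):
    """Collect gene_id values of 'gene' records whose gene_biotype is listed."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only look at gene-level records
        attrs = dict(ATTR.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        if gene_id and attrs.get("gene_biotype") in biotypes:
            ids.add(gene_id)
    return ids

# Hypothetical GTF lines for illustration only.
example = [
    "#!genome-build example\n",
    'rDNA\tFlyBase\tgene\t1\t1995\t.\t+\t.\tgene_id "GENE_R1"; gene_biotype "rRNA";\n',
    '2L\tFlyBase\tgene\t7529\t9484\t.\t+\t.\tgene_id "GENE_P1"; gene_biotype "protein_coding";\n',
    '2L\tFlyBase\texon\t7529\t8116\t.\t+\t.\tgene_id "GENE_P1"; gene_biotype "protein_coding";\n',
]
ids = rrna_gene_ids(example)
```

For a real annotation you would pass an open file handle instead of the list. Mitochondrial rRNA genes may be listed under a separate biotype (for example `Mt_rRNA`), so it may be worth adding that to `biotypes` after checking which values your file actually contains.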
If your reference does not contain the rRNA genes, rRNA reads could be mapped to coding genes that share partial sequence similarity with rRNAs. So removing them before alignment might be a better choice.