I have some RNA-seq data (human cell line) with ~100M reads. We asked the company to perform ribodepletion, and in the first report we received everything seemed okay.
However, when I ran FastQ Screen, I got a high percentage of rRNA in this data. According to this post, and assuming my data were fine, I was expecting less than 10% rRNA (as little as 2-3%). However, I got more than that in some samples, and surprisingly the amount of rRNA is not even the same in controls and condition. As you can see in the screenshot below, controls reach >30% rRNA, while the maximum in the condition does not exceed 11%.
After checking some posts, I think the best way to proceed (if I want to use this data) is to remove the rRNA, since there are several tools that allow you to do that.
However, I also saw the possibility of adding the % rRNA as a covariate in my DE analysis (I will use DESeq2), but I do not know if that makes sense (or whether I will be able to do it), because the differences are not homogeneous (I don't have a balanced split of samples with and without rRNA across controls and condition).
But... I was wondering: if I proceed with removing the rRNA, should I add a covariate as a sort of "batch effect", since my samples will not end up with the same number of reads (some will have fewer because I had to remove more rRNA reads)? Or does DESeq2 already take this into account?
Has anybody ever had a similar problem? If so, what would you recommend? I would like to use this data (and, of course, obtain sensible results).
Thanks very much in advance
Probably more technical (ribodepletion efficiency) than biological, though.
I agree 100%. Usually I do this by excluding the rows of the count matrix that correspond to rRNA right before doing the DESeq2 analysis.
After removal, make sure that the usable numbers of reads are not vastly different between samples and that none of them are undersequenced, which would give you zero counts due to low depth. If so, just sequence deeper. I agree with removing rRNA reads; they do not compromise quality but simply suck up sequencing depth.
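A minimal sketch of that row-filtering step, using a toy pandas count matrix and a hypothetical set of rRNA gene IDs (in practice you would load your real counts and pull the rRNA IDs from your annotation, e.g. genes with `gene_biotype == "rRNA"` in the GTF; in R you would do the analogous subsetting on the matrix passed to `DESeqDataSetFromMatrix`):

```python
import pandas as pd

# Toy count matrix (rows = genes, columns = samples); replace with your own.
counts = pd.DataFrame(
    {"ctrl_1": [120, 5000, 80], "cond_1": [100, 300, 90]},
    index=["GeneA", "RNA5-8S", "GeneB"],  # hypothetical gene names
)

# Gene IDs annotated as rRNA (here just a toy set).
rrna_genes = {"RNA5-8S"}

# Drop the rRNA rows right before building the DESeq2 dataset.
filtered = counts.loc[~counts.index.isin(rrna_genes)]
print(filtered.index.tolist())  # ['GeneA', 'GeneB']
```

The filtering is purely on row labels, so it works the same whatever quantifier produced the matrix.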
@ATpoint from your point of view, how would you check that? Just by running `fastqc`, or did you mean checking for zero counts in the "counts" file that you get after the alignment? Thanks very much for your feedback!
You would check, e.g. in R based on your count matrix, whether there is a correlation between the number of detected genes (for example defined as `counts > 0`) and sequencing depth (that is, total counts per sample). If there is some sort of correlation, then make sure it is not nested by condition, so that it does not systematically penalize one group. If it is, consider sequencing the undersequenced samples deeper.

Understood, I will do that. Many thanks again for your help!
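The check described above can be sketched like this (a toy example with simulated counts; with real data you would run the same computation on your post-filtering count matrix, and in R `colSums(counts)` and `colSums(counts > 0)` are the analogous calls):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy count matrix: 200 genes x 6 samples (3 controls, 3 condition).
# In practice, load your real counts after rRNA removal.
counts = rng.negative_binomial(5, 0.3, size=(200, 6))

depth = counts.sum(axis=0)            # total counts per sample
detected = (counts > 0).sum(axis=0)   # genes with counts > 0 per sample

# Correlation between library depth and number of detected genes.
r = np.corrcoef(depth, detected)[0, 1]
print("depth vs detected genes, r =", r)

# Check whether depth is nested by condition: compare group means.
condition = np.array(["ctrl", "ctrl", "ctrl", "trt", "trt", "trt"])
for g in ("ctrl", "trt"):
    print(g, "mean depth:", depth[condition == g].mean())
```

If `r` is high and the group means of `depth` differ a lot, depth is confounded with condition and the shallow samples are worth topping up.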
@Carlo Yague Thanks for your feedback! I also saw that approach (excluding the rows before the DGE analysis); however, I came across this post, where they say that rRNA reads could be mapped to coding genes that share partial sequence similarity with rRNA... so it may be better to remove them before anything else.
The best way to ribo-deplete is to align your reads to rRNA sequences (e.g. with bowtie2) and discard the aligned reads.
Otherwise you will indeed get weird mapping results (e.g. I know several genes that have a substantial region which overlaps perfectly with an rRNA).
To quote the original comment, rRNA reads could be mapped to coding genes which share partial sequence similarity to rRNA only if your reference does not contain the rRNA genes. Make sure to map to the full genome/transcriptome (including rRNA genes) and you will be fine: an rRNA read will preferentially map to the rRNA gene rather than to an mRNA, where matches are unlikely to be perfect (and even with a perfect match to both an mRNA and an rRNA, the mapping quality will be 0 and the read will not be counted).
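A sketch of that in-silico ribodepletion with bowtie2, assuming paired-end reads and an index built from rRNA sequences (all file names here are placeholders; adapt them to your data):

```shell
# Build an index from rRNA sequences (e.g. collected from your genome
# annotation or a database such as SILVA); "rRNA.fa" is a placeholder.
bowtie2-build rRNA.fa rRNA_index

# Align against rRNA only; keep the read pairs that do NOT align
# concordantly (--un-conc-gz) and discard the alignments themselves
# by sending the SAM output to /dev/null.
bowtie2 -x rRNA_index \
        -1 sample_R1.fq.gz -2 sample_R2.fq.gz \
        --un-conc-gz sample_ribofree_R%.fq.gz \
        -S /dev/null
```

bowtie2 replaces the `%` in the `--un-conc-gz` path with 1 and 2, so you get a cleaned pair of FASTQ files to feed into your normal alignment/quantification pipeline.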
Thanks very much for your help @dsull!