Question

How to handle RNA seq data with unbalanced % rRNA by group?

1

Entering edit mode

7 months ago

ev97 ▴ 40

I have some RNAseq data (human cell line) with ~100m reads. We asked the company to perform ribodepletion and the first report that we received everything seemed okay.

However, when I run FastQ Screen, I got high % of rRNA in this data. According to this post and assuming that my data would be okay, I was expecting less than 10% rRNA in it (as little as 2-3%). However, I got more than that in some samples and surprisingly the amount of rRNA is not even the same in controls and condition. As you can see in the screenshot below, controls reach > 30% rRNA and the maximum % in the condition does not exceed the 11%

enter image description here

After checking some posts, I think that the best way to proceed (if I want to use this data) is to remove the rRNA, since there are several tools that allows you to do that.

However, I also saw the possibility to add the % rRNA as a covariate in my DE analysis (I will use DESeq2). but I do not know if it makes sense (or I will be able to do it) cause the differences are not homogeneous (I don't have 50% with and without rRNA in controls and conditions).

But... I was wondering if I proceed with the removal of rRNA, I should add a covariate that would be my "batch effect" as my samples will not have the same amount of reads -some will have less cause I had to remove more rRNA reads- (?) (or maybe DESeq2 already tackles this into account?).

Has anybody ever had a similar problem as mine? If so, what would you recommend? I would like to use this data (and of course obtain logic results)

Thanks very much in advance

fastqscreen rRNA DESeq2 RNAseq DGE • 1.1k views

ADD COMMENT • link updated 6 months ago by Carlo Yague 9.0k • written 7 months ago by ev97 ▴ 40

score 3 · Accepted Answer · 2024-10-15

3

Entering edit mode

7 months ago

dsull ★ 7.5k

Just remove the rRNA. No need to add a covariate. You can consider those rRNA differences "biological variation".

I'd only use a covariate if the samples was processed in different ways (e.g. if you intentionally depleted rRNA in some samples but not others).

You can always include more (some ridiculous) covariates to "explain" variation in your model: bioanalyzer fragment size, rRNA content, flow cell, the shelf a particular sample of cells was on in your TC incubator (lol), etc.

But, at some point, you should just let those things be residual variance, unless you have strong evidence that you need to account for it.

If you're really concerned, make a PCA plot after you filter out the rRNA to see if initial rRNA content drives clustering. Or try both models (one w/ the covariate and one w/o).

ADD COMMENT • link 7 months ago by dsull ★ 7.5k

2

Entering edit mode

You can consider those rRNA differences "biological variation".

Probably more technical (ribodepletion efficiency) than biological though.

Just remove the rRNA.

I agree 100%. Usually I do this by excluding the rows of the count matrix that correspond to rRNA right before doing the DESeq2 analysis.

ADD REPLY • link 7 months ago by Carlo Yague 9.0k

1

Entering edit mode

After removal, be sure that you check that the usable number of reads are not vastly different and you don't have some samples undersequenced so you would get zero counts due to low depth. If so, just sequence deeper. I agree with removing rRNA, they're not compromising quality but simply suck up sequencing depth.

ADD REPLY • link 7 months ago by ATpoint 88k

0

Entering edit mode

@ATpoint from your point of view, how would you check that? Just running fastqc or you meant checking the presence of zero counts in the "counts" file that you get after the alignment?

Thanks very much for your feedback!

ADD REPLY • link 6 months ago by ev97 ▴ 40

1

Entering edit mode

You would check, e.g. in R based on your count matrix, whether there is a correlation between number of detected genes (for example defined as counts > 0) and sequencing depth (that is total counts per sample). And if there is some sort of correlation then make sure that it is not nested by condition, so it does not systematically penalize one group. If so then consider to sequence the undersequenced samples deeper.

ADD REPLY • link 6 months ago by ATpoint 88k

0

Entering edit mode

Understood, I will do that. Many thanks again for your help!

ADD REPLY • link 6 months ago by ev97 ▴ 40

0

Entering edit mode

@Carlo Yague Thanks for your feedback! I also saw that approach (exclude the rows before doing DGE analysis), however, I encountered with this post, where they say that rRNA reads could be mapped to coding genes which share partial sequence similarity to rRNA... so it may be better to remove them before anything else.

ADD REPLY • link 6 months ago by ev97 ▴ 40

1

Entering edit mode

The best way to ribo-deplete is to align your reads to rRNA sequences (e.g. with bowtie2) and discard the aligned reads.

Otherwise you will indeed get weird mapping results (e.g. I know several genes that have a substantial region which overlaps perfectly with a rRNA).

ADD REPLY • link 6 months ago by dsull ★ 7.5k

0

Entering edit mode

To quote the original comment, rRNA reads could be mapped to coding genes which share partial sequence similarity to rRNA if your reference does not contain rRNA gene. Make sure to map on the full genome/transcriptome (including rRNA genes) and you will be fine: rRNA read will perferentially map on the rRNA gene compared so mRNA where matches are unlikely to be perfect (and even with perfect match to both mRNA and rRNA, then the mapping quality will be 0 and the read will not be counted).