Question

What % of rRNA contamination is acceptable?

2

Entering edit mode

5.4 years ago

John ▴ 270

Hi

I have n=3 samples of neuronal ribosome immuno-precipitated RNA seq data. All the data has more or less same alignment rate like below:

34167279 reads; of these:
  34167279 (100.00%) were unpaired; of these:
    22847962 (66.87%) aligned 0 times
    7904836 (23.14%) aligned exactly 1 time
    3414481 (9.99%) aligned >1 times
33.13% overall alignment rate

I checked those (66.87%) aligned 0 times they are rRNAs. Looks like need to do more special rRNA depletion steps than given in the Takara Kit, as the sample has enriched Ribosome.

I have around 100 differentially expressed genes (FDR<0.05) but they don't make any sense to the experiment. PCA also didn't separate well. Anybody has suggestion about what percentage of rRNA contamination is acceptable? The number of reads aligned is enough to do differential expression?

Thanks for your help!

RNA-Seq rna-seq alignment • 9.0k views

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 5.4 years ago by John ▴ 270

2

Entering edit mode

Anybody has suggestion about what percentage of rRNA contamination is acceptable?

That is going change based on experimental aim. Sounds like in your case the results are not making experimental sense with data you have. So it may be time to think about redoing the experiment making sure rRNA depletion works more efficiently. If you have money available you could more deeply sequence existing libraries (giving up on 66% of reads since they will be rRNA again).

ADD REPLY • link 5.4 years ago by GenoMax 151k

0

Entering edit mode

How many replicates do you have per sample?

ADD REPLY • link 5.4 years ago by Istvan Albert 102k

0

Entering edit mode

Three replicates

ADD REPLY • link updated 3.0 years ago by Ram 45k • written 5.4 years ago by John ▴ 270

0

Entering edit mode

Are you making a pairwise comparison with 3 replicates over two conditions, thus comparing a total of 6 sequencing runs?

When asking a question on how much data ought to be enough the most important detail is the experimental design.

ADD REPLY • link updated 3.0 years ago by Ram 45k • written 5.4 years ago by Istvan Albert 102k

1

Entering edit mode

I have control and treatment, 3 samples each. They are not paired!

ADD REPLY • link updated 3.0 years ago by Ram 45k • written 5.4 years ago by John ▴ 270

0

Entering edit mode

"paired" refers to control vs. treatment (or rather condition A vs condition B), not to the read data.

ADD REPLY • link updated 3.0 years ago by Ram 45k • written 5.4 years ago by cschu181 ★ 2.8k

0

Entering edit mode

Got it thanks. Then I should call it pairwise comparison between two condition with three replicates each.

ADD REPLY • link 5.4 years ago by John ▴ 270

score 6 · Answer 1 · 2020-01-06

6

Entering edit mode

5.4 years ago

Istvan Albert 102k

Align your reads against the genome, then visualize the alignments with IGV.

Investigate how uniformly are your transcripts covered.

7 million reads that align evenly to transcripts ought to work, as long as transcripts are evenly covered.

A properly depleted RNAseq should have less then 10% rRNA in it (as little as 2-3%). Normally the rRNA would be over 80-90%, thus your data already is depleted.

Now what often happens is that one problem feeds on another. It is rare that "bad" data has only one type of problem. Problems come in bunches.

For example, if the transcripts are fragmented incorrectly you may end up with coverage from one end (5' or 3' only). Most RNA-seq methods do not recognize or inform you of these types of problems, they will happily report various low P-values, when in fact the transcript is not properly covered.

Thus my recommendation is to align against the genome and visually investigate the data to identify the reasons the methods produce odd results.

ADD COMMENT • link 5.4 years ago by Istvan Albert 102k

0

Entering edit mode

Thanks, I looked at them in IGV.

if the transcripts are fragmented incorrectly you may end up with coverage from one end (5' or 3' only).

This is not a problem.

ADD REPLY • link 5.4 years ago by John ▴ 270

0

Entering edit mode

take a screenshot of a gene/transcript that comes up differentially expressed and show it here.

show the coverages for both tracks the control and condition, here is an example (host with imgur):

enter image description here

ADD REPLY • link 5.4 years ago by Istvan Albert 102k

0

Entering edit mode

Here is the screen shot! first three samples are Condition A, Second three samples are condition B. Kat6a is upregulated in Condition B (with log2FC=0.83, padj<0.01 shown by DESEQ2). I don't know why it is aligned to intronic regions. Because the sample is ribosome bound RNA. Thanks

middle

ADD REPLY • link 5.4 years ago by John ▴ 270

0

Entering edit mode

I would say that the data shows too many reads in the intronic regions (your second image) - 10 to 20% of the exonic read counts are present.

Also, there is unexpectedly weird fragmentation throughout - coverages drop the zero in the middle of the exon.

ADD REPLY • link 5.4 years ago by Istvan Albert 102k

0

Entering edit mode

I used featurecounts to count only exons, still the same genes are showing up as DE genes.

Do you think the problem is only in library prep (fragmentation)? or The low number of reads (due to rRNA contamination?

My question is Will I get good coverage If I do deep sequencing with the same library or not? Thanks

ADD REPLY • link 5.4 years ago by John ▴ 270

2

Entering edit mode

The problem here goes much deeper. Simply put most statistical methods cannot account for the type of errors you observe.

The concept of transcript integrity as described here:

http://rseqc.sourceforge.net/#tin-py

is the closest to what you might be after. What you want is to filter not just for p-values but to select transcripts with high transcript integrity.

To be honest, I think your library is a little suspect. It is quite unexpected to see so many intronic regions have coverages.

ADD REPLY • link 5.4 years ago by Istvan Albert 102k