Question

How to screen for rRNA and gDNA contamination in RNA-seq data?

3

7.0 years ago

arvchi ▴ 60

I have 24 RNA-seq samples from pig (Sus scrofa) and have seen some strange results during QC. When counting features, many samples have around 20-80% of reads assigned to "no feature" and 20-50% not assigned due to "multi-mapping". The proportions vary a lot between samples.

Overall mapping with STAR is not that bad: in total, around 90% of reads are either uniquely mapped or multi-mapped, so I don't suspect contamination from other species. However, I do want to check for genomic DNA and rRNA.

Questions:

For gDNA: I have checked some samples in IGV, but doing this for all samples is cumbersome, and IGV constantly crashes on my MacBook. Are there any systematic ways to assess gDNA contamination?
For rRNA: Searching around turns up numerous suggestions, but none of them sounds straightforward to me. Where do I even get a reference FASTA file for pig rRNA sequences? Should I get gene sequences or transcripts? Any simple explanation of this would be extremely helpful.
Can the high proportion of "no feature" reads be due to poor annotation? The library protocol uses poly-A enrichment, but it's a custom protocol and we don't know how well it works, so the error could be anywhere.

RNA-Seq • 7.8k views

ADD COMMENT • link updated 7.0 years ago by igor 13k • written 7.0 years ago by arvchi ▴ 60

0

Generally, RNA samples are checked on a Bioanalyzer or a similar platform before library preparation, so the chances of gDNA contamination are low. I check for rRNA contamination by looking at duplication levels and at the raw reads over the rRNA genes.

Best,

ADD REPLY • link 7.0 years ago by Satyajeet Khare ★ 1.6k

0

gDNA: I know about the Bioanalyzer, but I am wondering whether there is any way to check computationally, now that I already have the RNA-seq data.

rRNA: How do you do this, more specifically? How do you find a reference file of rRNA genes, and what programs do you use to check duplication levels and mapping to rRNA genes?

ADD REPLY • link 7.0 years ago by arvchi ▴ 60

0

rRNA genes such as Rn18s will show millions of reads. That leaves fewer reads for mRNA-coding genes; as a result, even housekeeping genes such as RNA Pol II will show negligible reads on exons. So just load the BAM files and check. Duplication levels are generally 10-40% for RNA-seq. If there is rRNA contamination, duplication levels will skyrocket and approach 100%.

ADD REPLY • link 7.0 years ago by Satyajeet Khare ★ 1.6k

score 2 · Answer 1 · 2017-12-09

7.0 years ago

h.mon 35k

A general, fast, and annotation-independent method for checking rRNA contamination is BBDuk with the ribokmers.fa.gz file.

RSeQC's read_distribution.py and infer_experiment.py may shed some light on your "no feature" problem. I guess it may be due in part to poor annotation, but then I would expect all samples to be equally affected, unlike what you are seeing.

P.S.: I am assuming your RNA-seq protocol is stranded; otherwise infer_experiment.py won't help.
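
For running the two RSeQC scripts over many samples, here is a minimal Python sketch that only builds the command lines and prints them. The file names are hypothetical, and it assumes the annotation is available as a BED12 file, which is what both scripts take through `-r`.

```python
import shlex


def rseqc_commands(bam, bed12):
    """Build RSeQC command lines for one BAM file (nothing is executed here)."""
    return [
        # Reports how reads are distributed over exons, introns and intergenic regions.
        ["read_distribution.py", "-i", bam, "-r", bed12],
        # Guesses library strandedness; only informative for stranded protocols.
        ["infer_experiment.py", "-i", bam, "-r", bed12],
    ]


if __name__ == "__main__":
    # Hypothetical file names; replace with your own BAM and BED12 annotation.
    for cmd in rseqc_commands("sample01.bam", "Sus_scrofa.bed12"):
        print(shlex.join(cmd))
```

The printed lines can be pasted into a shell or passed to `subprocess.run` once the paths are correct.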

ADD COMMENT • link 7.0 years ago by h.mon 35k

0

Thanks! Unfortunately, BBDuk is not available on the cloud computer I am using, though BBMap is. Can that work? I am still really confused about where to get the correct rRNA reference FASTA file. What is the file you referred to? Does it contain ribosomal RNA sequences (RNA) or rRNA gene sequences (DNA)? Is it valid for pig (Sus scrofa)?

The protocol is unstranded, unfortunately.

ADD REPLY • link 7.0 years ago by arvchi ▴ 60

1

BBDuk (bbduk.sh command) comes in the same bundle as BBMap (bbmap.sh), so if one is available, the other should be as well.

ribokmers.fa is a file of rRNA k-mers from the SILVA database. Brian Bushnell explained how it was created here (which, by the way, is a post in the same thread I linked above), and it is available from the Google Drive link he provided.
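
As a rough illustration of that workflow, the Python sketch below builds a single-end `bbduk.sh` command that splits reads into rRNA-matching (`outm`) and non-matching (`outu`) files, then estimates the rRNA fraction by counting records in the two outputs. The file names are hypothetical, `k=31` is an assumption about the k-mer length of the ribokmers file, and the counter assumes plain four-line FASTQ records.

```python
import gzip


def bbduk_rrna_command(reads, ribo_out, clean_out, ref="ribokmers.fa.gz", k=31):
    """Build a bbduk.sh command line (not executed here)."""
    return [
        "bbduk.sh",
        f"in={reads}",
        f"outm={ribo_out}",   # reads sharing a k-mer with the reference
        f"outu={clean_out}",  # reads with no matching k-mer
        f"ref={ref}",
        f"k={k}",             # assumed k-mer length, see lead-in
    ]


def count_fastq_records(path):
    """Count records in a FASTQ file, assuming four lines per record."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def rrna_fraction(n_matched, n_unmatched):
    """Fraction of reads that matched the rRNA k-mer reference."""
    total = n_matched + n_unmatched
    return n_matched / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical file names for one sample.
    print(" ".join(bbduk_rrna_command("s01.fq.gz", "s01.ribo.fq.gz", "s01.clean.fq.gz")))
```

After BBDuk has run, `rrna_fraction(count_fastq_records("s01.ribo.fq.gz"), count_fastq_records("s01.clean.fq.gz"))` gives a per-sample estimate that can be compared across all 24 samples.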

ADD REPLY • link 7.0 years ago by h.mon 35k

score 1 · Answer 2 · 2017-12-09

7.0 years ago

igor 13k

For gDNA, you already have your answer to some degree. As you mentioned, 20-80% of your reads are assigned to "no feature". In other words, they do not overlap exons, so they are intronic or intergenic. You can blame some of that on poor annotation, but the annotation is the same for every sample, so the samples should be consistent. If one sample is at 20% and another at 80%, that 60-point difference is not due to annotation. For poly-A libraries, 80% assigned is reasonable, especially for a less common (not well annotated) genome.
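
To make that comparison across all samples at once, here is a minimal Python sketch. It assumes the per-sample read counts per category have already been collected into a dictionary; the category names and the numbers are made up for illustration and will differ depending on the counting tool.

```python
def no_feature_fractions(counts_by_sample, key="no_feature"):
    """Return the fraction of reads in the 'no feature' category per sample."""
    fractions = {}
    for sample, counts in counts_by_sample.items():
        total = sum(counts.values())
        fractions[sample] = counts.get(key, 0) / total if total else 0.0
    return fractions


def spread(fractions):
    """Difference between the highest and lowest fraction across samples."""
    values = list(fractions.values())
    return max(values) - min(values) if values else 0.0


if __name__ == "__main__":
    # Made-up counts purely for illustration.
    example = {
        "sample01": {"assigned": 700, "no_feature": 200, "multi_mapping": 100},
        "sample02": {"assigned": 150, "no_feature": 800, "multi_mapping": 50},
    }
    fractions = no_feature_fractions(example)
    for sample, value in sorted(fractions.items()):
        print(f"{sample}\t{value:.2f}")
    print(f"spread\t{spread(fractions):.2f}")
```

A large spread points to something sample-specific (such as gDNA or pre-mRNA content) rather than the annotation, which is shared by all samples.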

For rRNA, check this previous discussion for some ideas: RNA-seq rRNA contamination

ADD COMMENT • link 7.0 years ago by igor 13k

1

gDNA: You are right, but the samples are not identical replicates (long story, but in short the samples are very small and may capture different tissue types/cell types, and they can therefore have different contents). In any case, it seems there is no solid way of assessing gDNA contamination vs annotation problems?

rRNA: I've seen that thread, but it still seems inconclusive. Is it viable to download rRNA gene sequences (DNA) from BioMart and use any aligner (e.g. STAR) to map against that FASTA file?
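
One annotation-based way to get at the rRNA loci, sketched in Python below, is to pull the rRNA genes out of the same GTF used for counting. It assumes an Ensembl-style GTF in which gene records carry a `gene_biotype "rRNA"` (or `"Mt_rRNA"`) attribute; other annotation sources may label these genes differently. It can only find rRNA copies that are present in the assembly and annotation.

```python
import re

BIOTYPE_RE = re.compile(r'gene_biotype "([^"]+)"')
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def rrna_genes_to_bed(gtf_lines, biotypes=("rRNA", "Mt_rRNA")):
    """Return BED-like tuples for gene records whose biotype is in `biotypes`."""
    bed = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep gene-level records only
        biotype = BIOTYPE_RE.search(fields[8])
        if biotype is None or biotype.group(1) not in biotypes:
            continue
        gene_id = GENE_ID_RE.search(fields[8])
        name = gene_id.group(1) if gene_id else "."
        # GTF coordinates are 1-based inclusive; BED is 0-based half-open.
        bed.append((fields[0], int(fields[3]) - 1, int(fields[4]), name, "0", fields[6]))
    return bed


if __name__ == "__main__":
    # Hypothetical file name; point this at the GTF used for feature counting.
    with open("Sus_scrofa.gtf") as handle:
        for record in rrna_genes_to_bed(handle):
            print("\t".join(str(x) for x in record))
```

The resulting intervals can be used to count reads over the rRNA genes in the existing STAR alignments, or to extract the corresponding sequences for a dedicated rRNA reference.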

ADD REPLY • link 7.0 years ago by arvchi ▴ 60