I have 24 RNA-seq samples from pig (Sus scrofa) and have seen some strange results during QC. When counting features, many samples have around 20 - 80 % of reads assigned to "no feature" and 20 - 50 % not assigned due to "multi-mapping". The proportions vary a lot between samples.
Overall mapping with STAR is not that bad: in total around 90 % of reads are either uniquely mapped or multi-mapped, so I don't suspect contamination from other species. However, I do want to check for genomic DNA and rRNA.
Questions:
For gDNA: I have checked some samples in IGV, but doing this for all samples is cumbersome, and IGV constantly crashes on my MacBook. Are there any systematic ways to assess gDNA contamination?
For rRNA: Searching around turns up numerous suggested approaches, but I can't find one that sounds straightforward to me. Where do I even get a reference FASTA file of pig rRNA sequences? Should I get gene sequences or transcripts? Any simple explanation of this would be extremely helpful.
Can the high amount of "no feature" be due to poor annotation? The lab protocol is poly-A enriched, but it's a custom protocol and we don't know how well it works, so the error could be anywhere.
Generally, RNA samples are checked on a Bioanalyzer or a similar platform before library preparation, so gDNA contamination is less likely. I check for rRNA contamination by looking at duplication levels and at the raw reads over the rRNA genes.
Best,
Genomic: I know about the Bioanalyzer, but I'm wondering whether there is any way to check computationally now that I have the RNA-seq data.
rRNA: How do you do this, more specifically? How do you find a reference file of rRNA genes, and what programs do you use to check duplication levels and mapping to rRNA genes?
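On the gDNA point, one heuristic (not something described above, just a common rule of thumb) is to compare, per sample, the share of reads that fall in exonic, intronic and intergenic regions. Poly-A RNA should be dominated by exonic reads, while gDNA spreads more evenly over the genome, so an unusually high intergenic share is a warning sign. A poor annotation inflates the same number, so this is only indicative. Below is a minimal Python sketch, assuming you already have per-region read counts for each sample (for example from Qualimap's RNA-seq QC or RSeQC's read_distribution.py, which report similar breakdowns). The sample names, counts and the 15 % cutoff are made up for illustration.

```python
from typing import Dict, List


def region_fractions(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw read counts per region class into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads counted")
    return {region: n / total for region, n in counts.items()}


def flag_possible_gdna(samples: Dict[str, Dict[str, int]],
                       max_intergenic: float = 0.15) -> List[str]:
    """Return samples whose intergenic fraction exceeds an (arbitrary) cutoff."""
    flagged = []
    for name, counts in samples.items():
        frac = region_fractions(counts)
        if frac.get("intergenic", 0.0) > max_intergenic:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical numbers, only to show the input shape.
samples = {
    "sample_A": {"exonic": 8_500_000, "intronic": 1_000_000, "intergenic": 500_000},
    "sample_B": {"exonic": 3_000_000, "intronic": 4_000_000, "intergenic": 3_000_000},
}
print(flag_possible_gdna(samples))  # ['sample_B']
```

Comparing the fractions across all 24 samples is probably more informative than any fixed cutoff, since the samples that stand out from the rest are the ones worth opening in IGV.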
rRNA genes such as Rn18s will show millions of reads, which leaves fewer reads for protein-coding genes. As a result, even housekeeping genes such as RNA Pol II will show negligible reads on exons. So just load the BAM files into a genome browser such as IGV and check. Duplication levels are generally 10-40% for RNA-Seq. If there is rRNA contamination, duplication levels will rise far above that range.
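To put a number on this without opening every BAM file, one option is to take the rRNA gene IDs from the annotation and compute what fraction of counted reads they receive in each sample. The sketch below is in Python and assumes an Ensembl-style GTF in which gene records carry a `gene_biotype` attribute with values such as `rRNA` or `Mt_rRNA`, plus a per-sample dictionary of gene counts. The gene IDs and counts are made up. Two caveats: rRNA genes are multi-copy, so counts that exclude multi-mapped reads will underestimate rRNA, and the annotation may not contain every rRNA locus.

```python
import re
from typing import Dict, Iterable, Set

# Biotype labels assumed to mark rRNA genes in an Ensembl-style GTF.
RRNA_BIOTYPES = {"rRNA", "Mt_rRNA"}


def rrna_gene_ids(gtf_lines: Iterable[str]) -> Set[str]:
    """Collect gene_id values of 'gene' records whose gene_biotype is rRNA."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        # Attributes look like: gene_id "X"; gene_biotype "rRNA";
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
        if attrs.get("gene_biotype") in RRNA_BIOTYPES and "gene_id" in attrs:
            ids.add(attrs["gene_id"])
    return ids


def rrna_fraction(gene_counts: Dict[str, int], rrna_ids: Set[str]) -> float:
    """Fraction of all counted reads that were assigned to rRNA genes."""
    total = sum(gene_counts.values())
    if total == 0:
        return 0.0
    return sum(n for g, n in gene_counts.items() if g in rrna_ids) / total


# Hypothetical two-gene annotation and counts, only to show the input shape.
gtf = [
    "#!genome-build example",
    '1\tensembl\tgene\t100\t900\t.\t+\t.\tgene_id "GENE_A"; gene_biotype "protein_coding";',
    '1\tensembl\tgene\t2000\t2150\t.\t-\t.\tgene_id "GENE_R"; gene_biotype "rRNA";',
]
counts = {"GENE_A": 600, "GENE_R": 400}
ids = rrna_gene_ids(gtf)
print(ids, rrna_fraction(counts, ids))  # {'GENE_R'} 0.4
```

If the annotation turns out to be missing rRNA loci, the alternative is to align a subset of reads directly against rRNA sequences taken from a dedicated database and look at the mapping rate.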