I would like to screen for rRNA contamination in my RNA-seq data. I tried two different methods, which can be summed up as:
- take human GENCODE GTF, filter for "rRNA", extract sequence for the matching coordinates
- download Rfam FASTA file, filter for "ribosomal_rna" and "homo_sapiens" (example protocol)
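For the Rfam route, the filtering step amounts to keeping every FASTA record whose header mentions both keywords. A minimal Python sketch of that idea follows; `filter_fasta` is a hypothetical helper, and the assumption that both keywords appear literally in the header line may not match the actual Rfam download.

```python
def filter_fasta(lines, keywords):
    """Yield the lines of FASTA records whose header contains every keyword.

    Matching is case-insensitive and looks only at the '>' header line
    (assumption: the keywords appear literally in the header text).
    """
    wanted = [k.lower() for k in keywords]
    keep = False
    for line in lines:
        if line.startswith(">"):
            # New record: decide once whether to keep it and its sequence lines.
            header = line.lower()
            keep = all(k in header for k in wanted)
        if keep:
            yield line
```

It could be used as `filter_fasta(open("rfam.fa"), ["ribosomal_rna", "homo_sapiens"])`, where `rfam.fa` is a placeholder file name.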
I then align against each one separately. I don't expect them to yield very close results, but the difference can be 100X. Why such a big difference? Can I trust either?
One possibility is that the Rfam sequences are overestimating the abundance and I am getting a lot of false positives. However, I frequently have an alignment rate of less than 5%, which is very reasonable.
How about screening against the human rDNA repeat sequence? A link for that is in this post.
But then I will have three different results.
I am actually trying to make this work with Picard CollectRnaSeqMetrics, which needs an intervals file (so it has to be based on reference genome coordinates).
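For what it's worth, a rough sketch of how such an intervals file might be built from the GENCODE GTF is shown below. It assumes the Picard interval_list layout (SAM-style header lines, then tab-separated chromosome, start, end, strand, name, with 1-based inclusive coordinates like the GTF) and assumes that `rRNA` and `Mt_rRNA` are the `gene_type` values of interest. The function names are hypothetical, and the header lines have to come from the same reference as the BAM (for example from its `@SQ` records).

```python
import re

# Assumption: these GENCODE gene_type values mark the rRNA genes of interest.
RRNA_TYPES = {"rRNA", "Mt_rRNA"}
GENE_TYPE_RE = re.compile(r'gene_type "([^"]+)"')
GENE_NAME_RE = re.compile(r'gene_name "([^"]+)"')


def rrna_intervals(gtf_lines, rrna_types=RRNA_TYPES):
    """Yield (chrom, start, end, strand, name) for rRNA 'gene' records in a GTF."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip GTF comment/header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # use gene-level records only, to avoid duplicate intervals
        gene_type = GENE_TYPE_RE.search(fields[8])
        if gene_type is None or gene_type.group(1) not in rrna_types:
            continue
        gene_name = GENE_NAME_RE.search(fields[8])
        name = gene_name.group(1) if gene_name else "rRNA"
        # GTF coordinates are 1-based inclusive, so they are copied unchanged.
        yield (fields[0], int(fields[3]), int(fields[4]), fields[6], name)


def format_interval_list(header_lines, intervals):
    """Join SAM header lines and interval rows into interval_list text."""
    rows = ["\t".join(str(value) for value in interval) for interval in intervals]
    return "\n".join(list(header_lines) + rows) + "\n"
```

The resulting text would be written to a file and passed as the ribosomal intervals argument. Note that this only covers rRNA copies annotated in the reference assembly, so reads from unplaced rDNA repeats would not be counted this way.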
I was wondering if you figured out what the best approach is?