I'm using an Ensembl GTF file (version 61) to remove rRNA from an RNA-seq dataset. Ensembl GTFs contain an rRNA annotation that makes this trivially easy to do:
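    import os

    fin_name = 'mygenome.0.61.gtf'
    # Output file holds every annotation line that is not labelled rRNA
    fout_name = os.path.splitext(fin_name)[0] + '.no_rRNA.gtf'

    with open(fin_name) as fin, open(fout_name, 'w') as fout:
        for line in fin:
            # GTF is tab-delimited; column 2 is where this file carries the rRNA label
            parts = line.rstrip('\n').split('\t')
            # Keep comment lines and anything not labelled rRNA
            if line.startswith('#') or len(parts) < 2 or parts[1] != 'rRNA':
                fout.write(line)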
However, after removing 1,980 rRNA transcripts from my dataset, I still find what look like obvious rRNAs in it, e.g.:
ENSACAG00000014849 ribosomal protein L38 (rpl38)
ENSACAG00000005015 ribosomal protein S21 (RPS21)
ENSACAG00000011604 ribosomal protein S27 (rps27)
ENSACAG00000010479 ribosomal protein S12 (Rps12)
ENSACAG00000007960 ribosomal protein S24 (Rps24)
etc.
Has anyone else had this issue? Can you suggest any workarounds? Are there better Ensembl gene lists out there that I could use to filter? GO terms, perhaps?
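To make the question concrete: if I had such a list of gene IDs, I imagine the filtering step would look roughly like the sketch below. This is only an illustration under my own assumptions. The helper name `drop_genes`, the transcript IDs `T1`/`T2`, and the gene ID `ENSACAG00000000001` are made up for the example; the only thing I'm relying on is that each GTF line stores `gene_id "..."` in its ninth, tab-separated column.

```python
import re

# Matches the gene_id attribute in the ninth GTF column,
# e.g. gene_id "ENSACAG00000014849";
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def drop_genes(lines, unwanted_ids):
    """Yield GTF lines whose gene_id is not in unwanted_ids (hypothetical helper)."""
    for line in lines:
        # Pass comment/header lines through untouched
        if line.startswith('#'):
            yield line
            continue
        fields = line.rstrip('\n').split('\t')
        match = GENE_ID_RE.search(fields[8]) if len(fields) > 8 else None
        # Skip the line only when its gene_id is on the unwanted list
        if match and match.group(1) in unwanted_ids:
            continue
        yield line


# Two IDs from the examples above; a real list would come from an external source.
unwanted = {'ENSACAG00000014849', 'ENSACAG00000005015'}

# Made-up GTF lines, for illustration only.
example = [
    '1\tprotein_coding\texon\t100\t200\t.\t+\t.\t'
    'gene_id "ENSACAG00000014849"; transcript_id "T1";\n',
    '1\tprotein_coding\texon\t300\t400\t.\t+\t.\t'
    'gene_id "ENSACAG00000000001"; transcript_id "T2";\n',
]
kept = list(drop_genes(example, unwanted))
```

With a real file, the same generator could be fed an open file handle and its output written line by line, so what I'm really missing is a trustworthy source for the `unwanted` set.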