I have high-throughput sequencing data (cDNA from ribosome profiling). When I ran FastQC on it and looked at the overrepresented sequences, many of them came from ribosomal genes. This is a problem, because the experimenters used a protocol that should have removed the ribosomal RNA.
I have mapped the reads using TopHat and now want to remove the reads that map to ribosomal genes. For this I need a list of all ribosomal genes in rat; I can then use samtools to remove those reads. Is there somewhere I can find such a list?
Thank you!
This may be your best bet. rDNA repeats are not well characterized in many cases.
I think both approaches, going through the GTF and using BioMart, are good ideas. However, I was not sure about the link you mentioned in this comment. As far as I can see, it corresponds to just one rRNA gene, and I was hoping to find a list of all such genes.
This is a repeating unit. There would be multiple copies of the core sequence. If you want to get all known copies then use the rRNA/GTF answer given below.
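A minimal sketch of the GTF route, assuming an Ensembl-style annotation in which each `gene` line carries a `gene_biotype` attribute (GENCODE-style files use `gene_type` instead). The function names `rrna_intervals_from_gtf` and `to_bed` are made up for illustration, and whether to also include biotypes such as `Mt_rRNA` is left to you:

```python
import re

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def rrna_intervals_from_gtf(gtf_lines, biotypes=("rRNA",)):
    """Collect (chrom, start0, end, gene_id, strand) for genes with a wanted biotype.

    Coordinates are converted from GTF (1-based, inclusive) to
    BED (0-based, half-open).
    """
    intervals = []
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only gene-level records, to avoid duplicates from exons
        attrs = dict(ATTR_RE.findall(fields[8]))
        # Assumption: Ensembl uses gene_biotype, GENCODE uses gene_type.
        biotype = attrs.get("gene_biotype") or attrs.get("gene_type")
        if biotype in biotypes:
            intervals.append(
                (fields[0], int(fields[3]) - 1, int(fields[4]),
                 attrs.get("gene_id", "."), fields[6])
            )
    return intervals


def to_bed(intervals):
    """Format the intervals as 6-column BED text."""
    return "\n".join(
        f"{chrom}\t{start}\t{end}\t{gene_id}\t0\t{strand}"
        for chrom, start, end, gene_id, strand in intervals
    )
```

To use it on a real file, something like `to_bed(rrna_intervals_from_gtf(open("annotation.gtf")))` should work, where `annotation.gtf` is a placeholder for the GTF matching the assembly the reads were aligned to. The resulting BED text can then be handed to an interval-based filter (for example `bedtools intersect -v`), after checking that the chromosome names match those in the BAM header.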
Thanks a lot for the help!
What have you tried so far to find this information?
I searched online and found that BioMart is one way to download them. I tried it for rat genes, but I am not sure the version is correct: TopHat was run against Rn6, and I could not tell whether the rat data in BioMart belong to that assembly. Also, the list I downloaded had ~330 genes, and I was unsure whether that is even close to comprehensive.
As long as you find the sequence of the repeat unit, you should be reasonably OK. There are multiple copies of these genes across multiple chromosomes, and they are not fully characterized even in human and mouse, afaik.
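Once the rRNA intervals are known (annotated genes, or the places where the repeat unit aligns), dropping the offending reads could look roughly like the sketch below. It is plain Python over SAM text rather than a samtools call, the names `aligned_blocks` and `drop_rrna_reads` are hypothetical, and it assumes the intervals are 0-based, half-open (BED-style) and use the same chromosome names as the alignments:

```python
import re
from collections import defaultdict

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_blocks(pos1, cigar):
    """Return 0-based, half-open reference blocks covered by an alignment.

    M, D, = and X consume the reference; N (a splice junction) also
    consumes it but splits the alignment into separate blocks.
    """
    blocks = []
    start = cur = pos1 - 1  # SAM POS is 1-based
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "MD=X":
            cur += length
        elif op == "N":
            if cur > start:
                blocks.append((start, cur))
            cur += length
            start = cur
    if cur > start:
        blocks.append((start, cur))
    return blocks


def drop_rrna_reads(sam_lines, intervals):
    """Keep header lines and alignments that overlap none of the intervals."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            kept.append(line)  # header lines pass through unchanged
            continue
        fields = line.split("\t")
        chrom, pos, cigar = fields[2], int(fields[3]), fields[5]
        if chrom == "*" or cigar == "*":
            kept.append(line)  # unmapped reads cannot overlap anything
            continue
        overlaps = any(
            b_start < end and start < b_end
            for b_start, b_end in aligned_blocks(pos, cigar)
            for start, end in by_chrom.get(chrom, ())
        )
        if not overlaps:
            kept.append(line)
    return kept
```

The linear scan over intervals is fine for a few hundred rRNA genes. Because spliced alignments are split at `N`, a read that merely spans an rRNA gene inside an intron is kept. Multi-mapped reads are judged per alignment record, which may or may not be what you want for repeated rDNA copies.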