Hi,
I need to filter rRNA and tRNA from mouse ribosome profiling and RNA-seq datasets. Am I right in assuming that, since the "complete repeating unit of Mus musculus ribosomal DNA" exists for mouse (https://www.ncbi.nlm.nih.gov/nuccore/BK000964), I can simply download the FASTA file, build a bowtie index from it and align my reads to it? I also found the Silva rRNA database https://www.arb-silva.de/download/arb-files/ . Does it include the same rRNA sequences that are in the repeating unit file from NCBI? Should I prefer one of them?
I would construct a bowtie index with:
$ bowtie-build species_rRNA.fa species_rRNA
and then exclude the mapping reads with:
$ bowtie -p4 -v3 species_rRNA my_reads.fastq \
--un my_reads_rRNA_unalign.fq >FileLocation
Then I would repeat the same for tRNA. I would get tRNAs from http://gtrnadb.ucsc.edu/genomes/eukaryota/Mmusc10/ under the FASTA link on the left side.
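To check how much each filtering pass removes, one could compare read counts before and after. This is only a minimal sketch, assuming uncompressed four-line-per-record FASTQ files and the file names used in the commands above; the helper function names are hypothetical and not part of bowtie:

```shell
# Count reads in an uncompressed FASTQ file (assumes exactly 4 lines per record).
count_fastq_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Print the percentage of reads removed by a filtering pass.
# $1 = FASTQ given to bowtie, $2 = FASTQ written by --un (unaligned reads).
percent_removed() {
    local total kept
    total=$(count_fastq_reads "$1")
    kept=$(count_fastq_reads "$2")
    awk -v t="$total" -v k="$kept" \
        'BEGIN { if (t == 0) print "NA"; else printf "%.1f\n", 100 * (t - k) / t }'
}

# Example call after the rRNA pass (file names taken from the commands above):
# percent_removed my_reads.fastq my_reads_rRNA_unalign.fq
```

The same call can be repeated after the tRNA pass, with the rRNA-filtered file as the first argument, to see what each reference removes on its own.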
Would this be correct? Would I miss some rRNA sequences, such as mitochondrial rRNA or rRNA sequences not annotated in the "complete repeating unit" from NCBI?
What about snoRNA or other RNA species? I found that most papers exclude only rRNA and sometimes tRNA, but usually no other RNA species.
Thank you for your help!
You can use bbsplit.sh from the BBMap suite to bin these sequences out (A: Filtering rRNA from RNAseq data). That said, if you leave those sequences in, can you exclude them downstream in the counting step?
Sorry for my delayed answer. Thank you for your suggestion; I will try this. I am still not sure about the differences between the "complete repeating unit of Mus musculus ribosomal DNA" and the Silva rRNA database. E.g., does the "complete repeating unit of Mus musculus ribosomal DNA" contain the 5S rRNA?
Also, is the source for tRNA the right one to use? What about snoRNA or other RNA species? Are they commonly excluded?
Did you ever get an answer? I have the same question now. I think the issue with excluding them in the downstream "counting" step is that, in ribosome profiling, the BAM file is used to call P-sites and periodicity, which would be affected by the large (30%+) proportion of rRNA/tRNA reads.