Hi, all
Recently I needed to do some analysis of ribosome profiling data. Many papers recommend filtering out rRNA sequences before mapping to the genome. So which way is better: using an rRNA database, or using only your organism's annotation file (GenBank)? It seems that many people use the SILVA database: https://www.arb-silva.de/download/arb-files/
For the GenBank method, I plan to download files from Ensembl: ftp://ftp.ensembl.org/pub/release-79/genbank/danio_rerio/ and then use Biopython to pull out all rRNA-related sequences.
What is the usual way to get rid of rRNA in ribosome profiling data analysis? Any suggestions are welcome!
Thanks for your helpful reply! I am working on zebrafish and do not have a full rRNA cassette sequence :( If I understand correctly, I should probably merge the SILVA database with the RepeatMasker (rRNA and tRNA) file and then map against that to filter, right?
Given the answer from Charles Plessy, you can probably just blacklist a bunch of regions. I don't know how you were planning on analysing the data. When I last did something like this, I used deepTools and a bit of Python for the final stuff, so I could trivially blacklist regions. If you plan to use something else, you'll want to make a BED file and reverse intersect with it (bedtools intersect).
I plan to map the raw data to the rRNA regions using bowtie and then map the unmapped FASTQ file to the genome using tophat, as this paper suggested: http://www.nature.com/nature/journal/v503/n7476/full/nature12632.html
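To make the "reverse intersect" idea from the answer above concrete, here is a minimal pure-Python sketch of what it amounts to: keep only the alignments that do not overlap any blacklisted rRNA/tRNA interval. With bedtools itself this corresponds to the -v flag of bedtools intersect. The function names (load_blacklist, overlaps_blacklist, drop_blacklisted) and the tuple layout for reads are made up for illustration, and coordinates are assumed to be BED-style (0-based, half-open).

```python
from bisect import bisect_right
from collections import defaultdict


def load_blacklist(bed_lines):
    """Parse BED lines into {chrom: sorted, merged [(start, end), ...]}."""
    regions = defaultdict(list)
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and BED header lines
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))

    merged = {}
    for chrom, intervals in regions.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:  # overlapping or adjacent: extend
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def overlaps_blacklist(blacklist, chrom, start, end):
    """True if [start, end) overlaps any blacklisted interval on chrom."""
    intervals = blacklist.get(chrom)
    if not intervals:
        return False
    # Last merged interval that begins before the read ends.
    i = bisect_right(intervals, (end - 1, float("inf"))) - 1
    return i >= 0 and intervals[i][1] > start


def drop_blacklisted(reads, blacklist):
    """Keep reads (chrom, start, end, ...) that miss every blacklisted region."""
    return [r for r in reads if not overlaps_blacklist(blacklist, r[0], r[1], r[2])]
```

For real data you would read the intervals from the RepeatMasker-derived BED file and the alignments from a BAM file; the in-memory lists here just keep the sketch self-contained.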
By the way, I have tested my data using the RepeatMasker file (rRNA and tRNA), and it turned out that only 2% of the raw reads mapped to rRNA regions. Is that normal?
I use deepTools a lot when dealing with ChIP-seq data ;) but I have never tried it with RNA-related experiments. Maybe it's time... Thanks!
2% is lower than what I got with human/mouse, but that likely just means your library prep was better than what I was given :)
You'll find the --Offset option in bamCoverage useful. I created that for RiboSeq and related datasets so I could quickly check for pausing with a bit of Python.
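As a rough illustration of the offset idea (this is a sketch of the concept, not deepTools' actual implementation), the signal for each read is reduced to a single base at a fixed position inside the read, counted from the read's 5' end and taking strand into account. The helper names below are hypothetical, reads are assumed to be unspliced (chrom, start, end, strand) tuples with 0-based half-open coordinates, and the offset of 12 in the usage note is only an example value.

```python
from collections import Counter


def offset_position(start, end, strand, offset):
    """Genomic coordinate (0-based) of the offset-th base of a read.

    offset is 1-based from the read's 5' end; negative values count
    back from the 3' end (-1 is the last aligned base). Returns None
    if the read is shorter than the requested offset.
    """
    if offset == 0:
        raise ValueError("offset must be non-zero")
    length = end - start
    if abs(offset) > length:
        return None
    # 0-based index along the read, in read orientation.
    idx = offset - 1 if offset > 0 else length + offset
    if strand == "+":
        return start + idx
    return end - 1 - idx  # minus strand: the 5' end is at end - 1


def offset_coverage(reads, offset):
    """Count, per (chrom, position), the reads whose offset base lands there."""
    counts = Counter()
    for chrom, start, end, strand in reads:
        pos = offset_position(start, end, strand, offset)
        if pos is not None:
            counts[(chrom, pos)] += 1
    return counts
```

Calling offset_coverage(reads, 12) then gives a per-base count where peaks suggest candidate pause sites; spliced alignments would need CIGAR-aware handling, which this sketch leaves out.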