Greetings!
I need some help performing the rRNA decontamination step properly. It is part of the pre-processing pipeline for my Illumina RNA-Seq reads, before mapping to the reference genome (a plant species).
SOME RELEVANT LINKS AND MY CRUDE INFERENCES:
Download Rrna Sequences - SILVA database is useful for human research, not so much for plants etc
rRNA in human - RFAM database searches will identify rRNA matches
rRNA remove from RNAseq data - SortMeRNA and BBSplit are comparable tools useful for rRNA decontamination
Cleaning RNA-Seq data from rRNA - BBDuk can also be used
Filtering rRNA from RNAseq data - BBMap is a tool I used, but the reads written to the outm file (reads mapped to the rRNA FASTA file) did not always correspond to rRNA sequences; I am not sure why
https://www.hindawi.com/journals/dpis/2013/854869/ - Eukaryotic SSU rRNA can have introns
MY GOAL:
Decontaminate RNA-Seq Illumina reads by removing rRNA sequences
note: I do have rRNA FASTA sequences from the genome annotation project (62 in all, of different lengths), but I am not sure whether they contain introns.
With these links, and my main inferences from those links listed above as background information, here are MY QUESTIONS:
1. Previous posts allude to SortMeRNA being much slower than the BBSuite tools, while both should give concordant results. Is there any reason to choose SortMeRNA over the BBSuite tools?
2. If not, within the BBSuite tools, is there a good reason to choose one over the other two (BBMap vs. BBSplit vs. BBDuk)? Should I check for introns in the rRNA sequences, and if yes, what is a good method for that?
3. If any of the 62 rRNA sequences contain introns, my understanding is that I should remove the intronic sequences and use the spliced rRNA sequences as the reference file for decontamination. Is that correct?
4. Are there common reasons why my BBMap trial returned outm reads (mapped to the rRNA reference FASTA file) of which some but not all had an rRNA sequence match, as determined using online BLASTn at NCBI?
note: I can post syntax and more detail if this thread goes in the direction of BBMap, but I thought that would distract the reader, so they are not included here yet.
I look forward to answers that cover all my questions 1 - 4. Thanks, in advance, for guidance from forum members.
For number 1: I don't think there would be a specific reason to choose one over the other.
For number 4: You may need to adjust the bbmap alignment parameters to make alignments more or less stringent. Reduce the value of k= to allow for more accurate matches. Or you could align your data to the human rDNA repeat I linked in the post you have above and get an idea of the % of reads aligning there. This is to make sure the % is roughly the same across samples (it should be less than 5% if the libraries are ribodepleted/poly-A enriched). In the final counting step, ignore the rRNA reads (don't count them).
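To make the per-sample check concrete, here is a minimal Python sketch of the comparison. It assumes you have already pulled two numbers for each sample out of your aligner's summary: reads that aligned to the rRNA reference, and total reads. The function names, the sample names, and the counts in the example are all made up for illustration. Only the 5% threshold comes from the rule of thumb above.

```python
# Sketch only: per-sample rRNA fraction check.
# All names and counts here are hypothetical; plug in your own numbers.

def rrna_percent(rrna_reads, total_reads):
    """Percentage of reads that aligned to the rRNA reference."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * rrna_reads / total_reads


def summarize(counts, max_percent=5.0):
    """Map sample -> (percent rRNA, True if below max_percent).

    counts: dict of sample name -> (rrna_reads, total_reads)
    max_percent: 5.0 follows the rule of thumb for
                 ribodepleted / poly-A enriched libraries.
    """
    report = {}
    for sample, (rrna, total) in counts.items():
        pct = rrna_percent(rrna, total)
        report[sample] = (pct, pct < max_percent)
    return report


def percent_spread(report):
    """Difference between the highest and lowest rRNA percentage."""
    pcts = [pct for pct, _ in report.values()]
    return max(pcts) - min(pcts)


if __name__ == "__main__":
    # Made-up counts, NOT real data.
    counts = {
        "sampleA": (120_000, 10_000_000),
        "sampleB": (900_000, 12_000_000),
    }
    report = summarize(counts)
    for sample, (pct, ok) in report.items():
        flag = "ok" if ok else "check this library"
        print(f"{sample}: {pct:.2f}% rRNA ({flag})")
    print(f"spread across samples: {percent_spread(report):.2f} points")
```

A large spread between samples, or any sample well above the threshold, would be the cue to look at that library more closely before counting.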