Hello everyone,
I have a question about filtering an ncRNA dataset that contains miRNAs. I want to get rid of plant-derived miRNAs. My approach is to use Bowtie2:
- Index: built from "fused" miRNA FASTA records, concatenated into one continuous sequence (instead of many short miRNA FASTA records; I did this hoping to improve the outcome of the alignment)
- Sequences: ncRNA
My code:
bowtie2 -f -p 15 --very-sensitive-local -x ./index/miRNA -U ./data/ncRNA.fa -S ./alignment/ncRNA_miRNA.sam
with almost no reported alignments:
3106286 reads; of these:
  3106286 (100.00%) were unpaired; of these:
    3105977 (99.99%) aligned 0 times
    308 (0.01%) aligned exactly 1 time
    1 (0.00%) aligned >1 times
0.01% overall alignment rate
When I go the other way around, using the ncRNA dataset as the reference and aligning the miRNA data against it, I get the following report:
3769 reads; of these:
  3769 (100.00%) were unpaired; of these:
    2789 (74.00%) aligned 0 times
    148 (3.93%) aligned exactly 1 time
    832 (22.07%) aligned >1 times
26.00% overall alignment rate
If I'm not wrong, my first alignment is missing a lot of the miRNAs present in my ncRNA dataset. I would appreciate some hints on how to improve the alignment so that it catches as many miRNAs as possible, even if that means getting some false positives.
Update: Initially, I used mature miRNA sequences for this alignment. Following a suggestion from one of the commenters, I tried it with hairpin sequences from miRBase and got a solid number of filtered sequences. Since I want my dataset to be as thoroughly purged of miRNAs as possible, I am satisfied with this, even though I will have some false positives. Thanks to everyone who helped out.
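For anyone who wants to reproduce the filtering step described in the update, here is a minimal Python sketch; it is not the OP's actual code. It assumes you have the SAM file from the Bowtie2 run and the original ncRNA FASTA, and that the read name in the SAM file is the FASTA header up to the first whitespace. The function names `aligned_read_names` and `drop_aligned_records` are made up for illustration.

```python
def aligned_read_names(sam_lines):
    """Return the names of reads that have at least one reported alignment."""
    names = set()
    for line in sam_lines:
        # Skip blank lines and SAM header lines (they start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # SAM flag bit 0x4 means "read unmapped"; keep only mapped reads.
        if not flag & 0x4:
            names.add(fields[0])
    return names


def drop_aligned_records(fasta_lines, aligned):
    """Yield the FASTA lines of every record whose ID is not in `aligned`."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumption: the read ID is the header text up to the first whitespace.
            parts = line[1:].split()
            read_id = parts[0] if parts else ""
            keep = read_id not in aligned
        if keep:
            yield line
```

Alternatively, Bowtie2's `--un <path>` option writes the unpaired reads that failed to align to a separate file, which may make a post-processing script like this unnecessary.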
Do you have a reference for that? I have never heard of this strategy.
How long are the reads?
I'm guessing the OP is using miRNA hairpin references? OP, did you replace U's with T's in the RNA reference? Are you using a library prep that will capture mature miRNA?
I'm using mature miRNA sequences (and didn't see the point of using the hairpin sequences; note that right now I'm preparing an ncRNA filter dataset for my actual small RNA-seq data, to get rid of all ncRNA except miRNA). All U's were replaced with T's. Could you please elaborate on your last question?
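For reference, the U-to-T replacement mentioned here can be done safely with a few lines of Python. This is only a sketch of one way to do it, not necessarily how the OP did it; the function name `rna_to_dna_fasta` is hypothetical. The point is to convert sequence lines only, so that a "U" in a header line is left alone.

```python
def rna_to_dna_fasta(lines):
    """Yield FASTA lines with U/u replaced by T/t in sequence lines only."""
    for line in lines:
        if line.startswith(">"):
            # Header line: pass through unchanged so record names stay intact.
            yield line
        else:
            yield line.replace("U", "T").replace("u", "t")
```

It can be applied to an open file handle, for example by writing `rna_to_dna_fasta(in_handle)` to an output handle with `writelines`.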
What kind of RNA-seq is that? In common RNA-seq, the RNA extraction kit does not capture short RNAs well (e.g. < 200 bp), and the library prep also enriches for RNAs of a certain size, so short RNAs are not captured. That means you need a special kit and library prep for short RNAs.
There is no reference I was following when I decided to fuse the miRNA data. It's just that I don't get any alignments when I align the ncRNA sequences against the normal, unfused miRNAs. My guess was that alignments are normally run against longer references rather than short sequences, so short miRNA references would somehow conflict with Bowtie2's algorithm, leading to low scores and no reported alignments.
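One possible way to sanity-check this guess is to compare the best score a short match can earn with the minimum score Bowtie2 requires. The sketch below assumes Bowtie2's documented defaults for local mode, namely a match bonus of 2 (`--ma 2`) and a minimum score of 20 + 8 * ln(read length) (`--score-min G,20,8`); please check these against the manual for your version. The function names and the example lengths are made up for illustration.

```python
import math

MATCH_BONUS = 2  # assumed local-mode default for --ma


def min_score_local(read_len):
    """Assumed default local-mode threshold: G,20,8 -> 20 + 8 * ln(read_len)."""
    return 20 + 8 * math.log(read_len)


def best_possible_score(matched_len):
    """Score of a perfect, gap-free local match covering `matched_len` bases."""
    return MATCH_BONUS * matched_len


def can_reach_threshold(read_len, matched_len):
    """True if a perfect match of `matched_len` bases could pass the threshold."""
    return best_possible_score(matched_len) >= min_score_local(read_len)
```

Under these assumptions, a hypothetical 100-nt ncRNA read that contains a perfect 22-nt mature miRNA match scores at most 44, while the threshold for a 100-nt read is about 56.8, so the alignment would not be reported. Fusing the miRNAs does not help, because the stretch that matches is still only as long as one miRNA. A longer hairpin match (say 80 nt) clears the threshold easily, which would be consistent with the improvement reported in the update.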
I hope it is clear that I'm not talking about normal reads obtained from an RNA-seq run. I'm preparing an ncRNA filter dataset for my actual small RNA-seq data. Filter dataset = ncRNA (from various ncRNA databases) - miRNA (plant-derived, from miRBase)
My stats are: