Hi everyone, I'm starting to work with miRNA sequences (this is my first time handling them during preprocessing).
I'm currently processing some human miRNA (hsa) sequences obtained from Ion Torrent (they come from a GEO experiment). According to the paper, adapter removal was already performed, and the FastQC report doesn't show anything particularly concerning or that would justify further trimming beyond removing a few low-quality reads.
Right now, I'm applying relatively lenient trimming settings (AVGQUAL: 17, MINLEN: 17). For alignment, I'm using Bowtie 1. I downloaded the reference directly from miRBase (mature.fa), and that's the same file I'm using to build the Bowtie index and align the reads. The results are terrible...
Without trimming: alignment rates range between 8–20%.
With trimming: alignment drops drastically to 0.02–0.03%.
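In case it matters, the trimming command looks something like this (Trimmomatic syntax assumed here only because AVGQUAL and MINLEN are Trimmomatic step names; the paths are placeholders):

# single-end mode: drop reads with average quality below 17 or shorter than 17 nt after trimming
java -jar trimmomatic-0.39.jar SE -threads 5 -phred33 \
    "$FASTQFILES/${acc}.fastq" "$FASTQFILESTRIM/${acc}_trimmed.fastq" \
    AVGQUAL:17 MINLEN:17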
My bowtie configuration is:
bowtie -v 1 -p 5 -S $BOWTIE_INDEX \
-q "$FASTQFILESTRIM/${acc}_trimmed.fastq" > "$BAM_DIR/${acc}.sam"
What might I be doing wrong?
Should I build the Bowtie 1 index using only the Homo sapiens miRNAs?
Could the issue be Bowtie itself? Would it be better to use STAR or another aligner?
Geno, thank you very much for your response.
The associated paper doesn't provide much information... they mention the kit (Ion Total RNA-Seq Kit v2.0), but there's not a lot of detail about the adapters used. What is somewhat clear is that the researchers removed the adapters, and the data uploaded to GEO already has some level of preprocessing (I can confirm that with FastQC). However, my main question is why one SAMN (BioSample) accession is associated with two SRR runs. As far as I understand, this is common when trying to increase sequencing depth, although I'm not sure whether it's a good idea to merge those files.
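If they do turn out to be technical replicates of the same library, the merge itself would be trivial; a sketch with made-up run accessions (the real SRR IDs aren't shown here):

# download both runs belonging to the same BioSample (hypothetical SRR accessions)
fasterq-dump SRR0000001
fasterq-dump SRR0000002
# concatenate them into a single per-sample FASTQ
cat SRR0000001.fastq SRR0000002.fastq > "${acc}_merged.fastq"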
Thanks for your clarity regarding Bowtie; I was also quite confident about using it, although I've noticed miRDeep2 has gained a lot of popularity (and it uses Bowtie 1 internally for alignment). Maybe I could try aligning with it and bring you an update comparing both approaches.
Indeed, after trimming most of the reads fall within the expected size range, although a few reads longer than 50 nt remain. The top length counts are below (a quick way to compute such a distribution is sketched after the list):
22 bp -- 11,213,664 reads
21 bp -- 3,498,946 reads
23 bp -- 3,443,761 reads
18 bp -- 2,228,113 reads
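For what it's worth, a distribution like this can be pulled from a FASTQ file with a one-liner along these lines (file path is a placeholder):

# every 4th line starting at line 2 is a sequence; tally lengths, then sort by count
awk 'NR % 4 == 2 { len[length($0)]++ } END { for (l in len) print l, len[l] }' \
    "$FASTQFILESTRIM/${acc}_trimmed.fastq" | sort -k2,2nr | head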
Still, I find it very strange that without trimming, I get more alignments than with trimming…
Lastly, I was really surprised by your mention of converting the U bases in the reference to T's before building the index. Honestly, I hadn't considered that, and when I checked... my reference mature.fa does indeed contain U's, while my reads contain T's. That blew my mind. Do you think this could be the reason for the terrible alignment?
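If that is indeed the problem, the fix before rebuilding the index would be something like this (a sketch; the output file name is arbitrary):

# convert U -> T on sequence lines only, leaving the FASTA headers untouched
sed '/^>/!s/U/T/g; /^>/!s/u/t/g' mature.fa > mature_dna.fa
# equivalently: seqkit seq --rna2dna mature.fa > mature_dna.fa
bowtie-build mature_dna.fa "$BOWTIE_INDEX"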
I’ll try to rerun the alignment and keep you posted.
If your reference has invalid bases, you can get quite unexpected behavior.
I am not sure exactly what takes place; probably the scoring goes haywire, and the short reads cannot rescue many alignments even when the matches would otherwise be better, since their length sits right at the border where alignments get dropped.
Looking at the size distribution, the data must already be pre-trimmed, so it does not make sense to trim again. You should merge data from technical replicates of sequencing if they have the same sample accession.
That being said, I'd like to hear whether the Us in the reference made any difference. It is conceivable that Bowtie 1 treats U as a T.
Looking at the code, I see that Ns will be treated as As, for example.
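A quick way to check a reference for characters outside the expected DNA alphabet (a sketch; works on any FASTA):

# list every character other than A/C/G/T/N on the sequence lines, with counts
grep -v '^>' mature.fa | grep -o '[^ACGTNacgtn]' | sort | uniq -c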
Quick representative check with a random Illumina dataset (50 bp, 200K reads sampled from it), using Bowtie v1.3.1 and mature.fa downloaded from miRBase (https://www.mirbase.org/download/mature.fa), aligning against the reference as downloaded and again after converting the U's to T's. The same check was run on an Ion Torrent Proton dataset (200K reads, 25 bp or shorter).
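The check can be reproduced roughly like this (a sketch; file names are placeholders, and using seqtk for the subsampling is an assumption, not necessarily what was run):

# take 200K reads from the dataset
seqtk sample reads.fastq 200000 > subset.fastq

# index 1: mature.fa exactly as downloaded (contains U's)
bowtie-build mature.fa mature_U

# index 2: the same file after converting U's to T's
sed '/^>/!s/U/T/g' mature.fa > mature_T.fa
bowtie-build mature_T.fa mature_T

# align the same subset against both indexes; bowtie reports the alignment rate on stderr
bowtie -q -v 1 -p 5 -S mature_U subset.fastq > /dev/null
bowtie -q -v 1 -p 5 -S mature_T subset.fastq > /dev/null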
Needs a summary :-) TL;DR:
It looks like references containing U's do not work correctly with Bowtie 1.