Poor alignment rates using Bowtie
3 days ago
omicon ▴ 40

Hi everyone, I'm starting to work with miRNA sequences (this is my first time handling them during preprocessing).

I'm currently processing some human miRNA (hsa) sequences obtained from ION Torrent (they come from a GEO experiment). According to the paper, adapter removal was already performed, and the FastQC report doesn't show anything particularly concerning or that would justify further trimming beyond removing a few low-quality reads.

Right now, I'm applying relatively lenient trimming settings (AVGQUAL: 17, MINLEN: 17). For alignment, I'm using Bowtie1. I downloaded the reference directly from miRBase (mature.fa), and that's the same file I'm using to build the Bowtie index and align the reads. The results are terrible...

Without trimming: alignment rates range between 8–20%.

With trimming: alignment drops drastically to 0.02–0.03%.

My bowtie configuration is:

# -v 1: allow up to one mismatch; -p 5: five threads; -S: SAM output
bowtie -v 1 -p 5 -S "$BOWTIE_INDEX" \
    -q "$FASTQFILESTRIM/${acc}_trimmed.fastq" > "$BAM_DIR/${acc}.sam"

What might I be doing wrong?

Should I build the Bowtie1 index using only the Homo sapiens miRNAs?

Could the issue be Bowtie itself? Would it be better to use STAR or another aligner?
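(On the species question, one way to subset the miRBase FASTA to Homo sapiens entries before indexing might look like this; file names here are placeholders, not from the original post:)

```shell
# Sketch: keep only Homo sapiens (hsa-) records from miRBase mature.fa
# before building a species-specific Bowtie index; file names are placeholders.
awk '/^>/ {keep = /^>hsa-/} keep' mature.fa > mature_hsa.fa
```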

Tags: alignment bowtie2 bowtie miRNAs
3 days ago
GenoMax 150k

The data above is likely referenced in this past thread from OP (if it is not please let us know): miRNAs - Adapters, adapters, adapters i´m so confused

Since this data is published, what method was used in that paper and what did the results say?

Would it be better to use STAR or another aligner?

Likely not. miRNAs are short, and bowtie v1.x, an ungapped aligner, is appropriate for them.

I’m applying relatively lenient trimming settings (AVGQUAL: 17, MINLEN: 17).

What is the size distribution of the resulting reads? Are they at least 21-22 bp or longer? If you take a few of the reads that do not align and BLAST them, what do they align to?
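(For reference, one quick way to get that size distribution from the command line; the FASTQ file name below is a placeholder:)

```shell
# Sketch: read-length histogram from an uncompressed FASTQ.
# In FASTQ, every 4th line starting at line 2 is a sequence line.
awk 'NR % 4 == 2 {len[length($0)]++} END {for (l in len) print l, len[l]}' \
    sample_trimmed.fastq | sort -k2,2nr
```

The output is one "length count" pair per line, most frequent length first.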

reference directly from miRBase

Have you converted the U bases in the reference to T's before building the index?
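(A minimal sketch of that conversion, assuming the mature.fa file from miRBase: only sequence lines are touched, headers are left alone. After this, the index can be rebuilt, e.g. with bowtie-build.)

```shell
# Sketch: convert RNA bases (U/u) to DNA (T/t) in the reference FASTA,
# leaving header lines untouched; then rebuild the Bowtie 1 index,
# e.g.: bowtie-build mature_t.fa mature_t
awk '/^>/ {print; next} {gsub(/U/, "T"); gsub(/u/, "t"); print}' mature.fa > mature_t.fa
```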


Geno, thank you very much for your response.

The associated paper doesn't provide much information. They mention the kit (Ion Total RNA-Seq Kit v2.0), but there is little detail about the adapters used. What is clear is that the researchers removed the adapters, and the data uploaded to GEO already has some level of preprocessing (I can confirm that with FastQC). However, my main question is why one SAMN code is associated with two SRRs. As far as I understand, this is common when trying to increase sequencing depth, although I'm not sure whether it's a good idea to merge those files.

Thanks for your clarity regarding Bowtie; I was also quite confident about using it, although I've noticed miRDeep2 has gained a lot of popularity (and it also uses Bowtie1 internally for alignment). Maybe I could try aligning with it and post an update comparing both approaches.

Indeed, after trimming, most of the reads are within the expected size range, although there are still very few reads longer than 50 nt.

22bp -- 11,213,664 reads

21bp -- 3,498,946 reads

23bp -- 3,443,761 reads

18bp -- 2,228,113 reads

Still, I find it very strange that without trimming, I get more alignments than with trimming…

Lastly, I was really surprised by your mention of converting U bases in the reference to T's before building the index. Honestly, I hadn't thought about that, and when I checked, my reference "mature.fa" does have U bases, but my reads have T. This blew my mind. Do you think this could be the reason for the terrible alignment?

I’ll try to rerun the alignment and keep you posted.


If your reference has invalid bases then you can get quite unexpected behaviors.

I am not sure exactly what takes place; probably the scoring is weird, and shorter reads cannot rescue many alignments even though the matches may be better, since the read length is right at the border where the alignment is dropped.


Looking at the size distribution, the data must already be pre-trimmed, so it does not make sense to trim again. You should merge data from technical replicates of sequencing if they share the same sample accession.
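(As a sketch of that merge, with placeholder file names: FASTQ records are independent, so technical replicates from the same sample can simply be concatenated before alignment.)

```shell
# Sketch: concatenate FASTQ files from technical replicates of one sample.
# Run/accession names are placeholders, not real SRR identifiers.
cat sampleA_run1.fastq sampleA_run2.fastq > sampleA_merged.fastq
```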


That being said, I'd like to hear whether the Us in the reference made any difference. It is conceivable that Bowtie 1 treats U as a T.

Looking at the code, I see that Ns will be treated as As, for example.


Quick representative check with a random Illumina dataset (50 bp reads; 200K reads sampled from it).

Bowtie v.1.3.1 and mature.fa downloaded from miRBase (https://www.mirbase.org/download/mature.fa )

$ bowtie -x mature_u SRR12730374.fastq --trim3 28 -l 10 -S mature_u.sam
# reads processed: 200000
# reads with at least one alignment: 0 (0.00%)
# reads that failed to align: 200000 (100.00%)
**No alignments**

After converting U's to T's

$ bowtie -x mature_t SRR12730374.fastq --trim3 28 -l 10 -S mature_t.sam
# reads processed: 200000
# reads with at least one alignment: 12642 (6.32%)
# reads that failed to align: 187358 (93.68%)
Reported 12642 alignments

Checking an Ion Torrent Proton dataset (200K reads, 25 bp or shorter)

$ bowtie -x mature_u SRR31362081.fastq -l 10 -S mature_u.sam
# reads processed: 200000
# reads with at least one alignment: 16153 (8.08%)
# reads that failed to align: 183847 (91.92%)
Reported 16153 alignments

$ bowtie -x mature_t SRR31362081.fastq -l 10 -S mature_t.sam
# reads processed: 200000
# reads with at least one alignment: 40724 (20.36%)
# reads that failed to align: 159276 (79.64%)
Reported 40724 alignments
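(If useful for comparing runs: the alignment rate can be pulled out of a saved bowtie stderr log like the ones above. The log file name here is an assumption.)

```shell
# Sketch: extract the percentage from the
# "# reads with at least one alignment: N (X.XX%)" line of a bowtie 1 log.
grep 'reads with at least one alignment' bowtie.log | sed 's/.*(\(.*\)%).*/\1/'
```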

Needs a summary :-) and TLDR:

It looks like references with U do not work correctly with bowtie 1.

