Question

Aligning very short sequences using bowtie2, I need fine control over how many mismatches are allowed. Adjust scoring?

1

Entering edit mode

16 months ago

Mauro ▴ 20

Hi, I'm using bowtie2 to align very short sRNA sequences (18-45nt) to reference files. I've been using the very-sensitive-local preset to align to the human genome, and bowtie2 was allowing for one, maybe 2 mismatches. However when testing against shorter references (my tRNAs) the sequences with mismatches and gaps don't align properly. I need granular control to align first without allowing any mismatches, and then allowing only one.

This is the test fasta file I'm using, with fragments of a tRNA and some induced errors (mismatches, gaps, etc).

https://pastebin.com/VhcES6ay

Against a index made from the human tRNA reference.

http://gtrnadb.ucsc.edu/genomes/eukaryota/Hsapi38/hg38-mature-tRNAs.fa

The bowtie2 command

bowtie2 --local --very-sensitive-local -x human_trna -f sequences.fa -S out.sam

The parameter -N changes the amount of mismatches allowed during the preseed alignment, but doesen't affect the end result in this case. I believe I can achieve what I want by changing the scoring function, but I'd like some feedback on whether this is a good approach or if there's another tool better fitted for me. I need a reliable system to align with a maximun amount of mismatches to work for 100nt references (tRNA), 2000nt (rRNA) and the human genome, so I worry about reliablity. I know I can check how many mismatches there are on a sam file, but in this case my sequences are not aligning properly.

I appreciate any feedback, thanks.

EDIT: I need to replicate the -v bowtie 1 option (allow X mismatches) changing the scoring options in bowtie2, but I'm unsure on how to proceed. I read the --score-min docs and tried to change it to L,-1,1 but it throws an error when running in local mode because "match score is set to 2 (default in local mode) and my score function can be negative". Which it should't unless my read lengh is 0, since my score equation is Threshold = -1 + len(read). I can force my match bonus to 0 but then I don get alignments.

bowtie2 • 2.6k views

ADD COMMENT • link updated 14 months ago by Brian Bushnell 20k • written 16 months ago by Mauro ▴ 20

1

Entering edit mode

Unless you need gapped alignments (which you probably don't need with reads < 45 bp) using bowtie v.1.x may be a better choice.

ADD REPLY • link 16 months ago by GenoMax 148k

0

Entering edit mode

You are right, bowtie 1 has -v which let's me set a maximum number of mismatches, but I lose the reads that used to align thanks to soft clipping. I may have to run both if no other answer comes up. According to bowtie2 docs, I should be able to replace the -v command by changing my scoring options, but that's why I posted my question initially.

ADD REPLY • link 16 months ago by Mauro ▴ 20

0

Entering edit mode

16 months ago

Cristian • 0

Hey there, Have you tried reported parameters to map short sequences? Just have a look to this parameters from ShapeMapper2 program, reported here and adapted to my tests as:

bowtie2 -p 4 --local --sensitive-local --mp 3,1 --rdg 5,1 --rfg 5,1 --dpad 30 --maxins 800 --ignore-quals --no-unal -x $idx -U $reads --un $unmapped -S $sam_file

Coincidentally, I am using those parameters to map tRNAs. Instead using bowtie2, I am using segemehl, as suggested this publication from Hoffmann et. al. (2018).

ADD COMMENT • link 16 months ago by Cristian • 0

0

Entering edit mode

I looked into this performance asessment to choose bowtie2: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4931105/

It's not applicable to all short sequences, since afaik miRNA never have splicing, but i thought their conclusions should translate well into alinging to known mature transcripts. Thanks for your suggestions. I'll have a look and see if it's better than what I'm using. However I'm trying to be very strict with my alignments since I'm detecting tRNA fragments, and they have 100nt reads

ADD REPLY • link 16 months ago by Mauro ▴ 20

0

Entering edit mode

14 months ago

Brian Bushnell 20k

BBMap has a "subfilter" flag that you can use to control this; e.g. "subfilter=2" bans alignments with more than 2 substitutions (they will be considered unmapped). However, I think it might make more sense to post-filter the sam file. You can do that with filtersam.sh (in the BBTools package) like this:

filtersam.sh in=mapped.sam out=filtered.sam maxsubs=2 ref=ref.fa mbv=999

The point of the program is actually to get rid of reads with sequencing errors, so it does other stuff too, but you can also use it this way.

ADD COMMENT • link 14 months ago by Brian Bushnell 20k

score 1 · Accepted Answer · 2023-08-08

1

Entering edit mode

16 months ago

Mauro ▴ 20

Summary :

I was trying to use these settings to explicitly allow one (and only one) mismatch:

bowtie2 --local --very-sensitive-local --score-min L,-1,1

Which resolves to min-score-threshold = -1 + len(read). However I get the following error:

Error: the match penalty is greater than 0 (2) but the --score-min function can be less than or equal to zero. Either let the match penalty be 0 or make --score-min always positive.

And changing the match score to 0 breaks all alignments. So I ended up with

bowtie2 --local --very-sensitive-local --score-min L,0,0.99

Which is not as explicit as i'd like, and does not allow me to change it to be two mismatches instead of one, but get's the job done. I feel like my solution should be valid, but since I don't know why it needs the score to be a positive value I'll leave it at that.

ADD COMMENT • link 16 months ago by Mauro ▴ 20

0

Entering edit mode

Have you looked into STAR?

I don't know if anybody truly tested performance with short reads, but it has worked well for me in the past with <50nt reads.

I had to change a number of options, gleaned from https://github.com/alexdobin/STAR/issues/169, since I also wanted to retain soft-clipping.

--outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --outFilterMatchNmin 0 --outFilterMismatchNmax 2

From my understanding, setting the the first two to 0 (the third one should be same as default value) helps with soft-clipping of short reads. In theory, at its default value of 0.66, with a 45nt read, you may be able to clip up to 15 bp, but I haven't seen if this performs as expected, and I'm sure the interplay of Scoring/number of matches complicates the picture a little bit.

For a little more context, when left these at their default (0.66) or changed to (0.3) the majority of reads were "too short" unmapped reads. After setting the above options, the rate was rescued and mapping seemed to behave as expected, but I didn't check extensively for any mistakes.

So, if you set those first to options to 0, you should be able to explicitly control mismatches using the last option and still retain soft clipping.

ADD REPLY • link 16 months ago by rfran010 ★ 1.3k

0

Entering edit mode

Thanks. I looked into STAR when designing the pipeline and working with the full genome, but it's ram requirements were a real pain. I ended up going for transcriptome first (bowtie2), then genome (HISAT2).

I looked into this performance asessment to choose bowtie2: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4931105/

It's not applicable to all short sequences, since afaik miRNA never have splicing, but should translate well into alinging to known mature transcripts. I'll try to make bowtie2 work and jump to STAR or other tool if it doesen't. Thanks for the parameter explanation.

ADD REPLY • link 16 months ago by Mauro ▴ 20