I am fairly new to bioinformatics and I am trying to understand why I get extremely different results when I trim my reads to 20 bp versus when I use a seed length of 20 in bowtie. (Sorry if it is a stupid question, but I need to know.)
My reads are 75 bp. If I remove adapter sequences etc. and run bowtie with -n 0 -l 20, only about 45% of reads align to my bacterial genome.
If I trim them all down to 20 bp and do the same thing, I get 75% of reads aligned.
I thought by limiting the seed to 20 bp, only the first 20 bp would be considered for the alignment with 0 mismatches. Shouldn't that give me a similar result to the trimmed ones? Or is the whole 75 bp considered despite the seed length?
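You understand the -n and -l options correctly. However, meeting the requirement "0 mismatches in the 20-nucleotide seed" is not enough for a read to map: bowtie still looks at the rest of the read. In -n mode, the sum of the quality values at all mismatched positions (over the whole read, not only the seed) must not exceed the -e threshold, which is 70 by default. So even if the first 20 nucleotides align perfectly, the read is rejected when the remaining nucleotides (up to 55 for a 75 bp read) mismatch too much. ;) For example: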
read = ATGCAATT GCATGGACATCGA
ref = ATGCAATT AATTAAGGCCAATT
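This read will not align even with the options -l 8 -n 0: the 8-base seed matches perfectly, but almost every base after it is a mismatch.
That is why you get different results with your trimmed reads: a read trimmed to 20 bp is all seed, so there is nothing after the seed that can disqualify the alignment.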
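To make the example above concrete, here is a rough Python sketch of the acceptance rule in -n mode. It is a simplified model, not bowtie's actual code: it compares the read to a single fixed reference position without gaps, it assumes every base has Phred quality 30, and the function names are made up for illustration.

```python
# Simplified model of bowtie's -n mode acceptance rule (NOT bowtie's real code).
# Assumptions: ungapped comparison at one fixed position, Phred 30 for every base.

def mismatch_positions(read, ref):
    """Indices where read and reference differ, compared over the read length."""
    return [i for i, (r, g) in enumerate(zip(read, ref)) if r != g]

def passes_n_mode(read, ref, quals, seed_len, max_seed_mm=0, max_qual_sum=70):
    """True if the seed has at most max_seed_mm mismatches (-n) and the summed
    quality of ALL mismatches in the read is at most max_qual_sum (-e)."""
    mm = mismatch_positions(read, ref)
    seed_mm = sum(1 for i in mm if i < seed_len)
    qual_sum = sum(quals[i] for i in mm)
    return seed_mm <= max_seed_mm and qual_sum <= max_qual_sum

read = "ATGCAATT" + "GCATGGACATCGA"   # read from the example above
ref = "ATGCAATT" + "AATTAAGGCCAATT"   # reference from the example above
quals = [30] * len(read)              # assumed uniform Phred 30

# Full read with -l 8 -n 0: seed is perfect, but the tail mismatches heavily.
full_ok = passes_n_mode(read, ref, quals, seed_len=8)
# Same read trimmed to the seed only: nothing left to disqualify it.
trimmed_ok = passes_n_mode(read[:8], ref, quals[:8], seed_len=8)
print(full_ok, trimmed_ok)
```

The full read fails because the mismatches after the seed push the quality sum far above 70, while the trimmed read passes. That is the same pattern as 45% vs 75% aligned.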
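To deal with this, you can try the -v option, which lets you set the maximum number of mismatches over the whole read (-l is ignored, and quality values play no role). Note that -v and -n are mutually exclusive, so use -v instead of -n.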
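For comparison, the -v rule can be sketched the same way. Again this is only a toy model under the same assumptions (ungapped, one fixed position); the function name and the sequences are hypothetical and chosen just to show the behaviour.

```python
# Toy model of bowtie's -v mode acceptance rule (NOT bowtie's real code).
# Only the total number of mismatches matters; seed length and qualities do not.

def passes_v_mode(read, ref, max_mm):
    """True if the whole read has at most max_mm mismatches against ref."""
    mismatches = sum(1 for r, g in zip(read, ref) if r != g)
    return mismatches <= max_mm

ref = "ATGCAATTAATTAAGGCCAATT"
read = "ATGCAATTAATTAAGGCGAATA"  # hypothetical read: 2 mismatches near the 3' end

v0_ok = passes_v_mode(read, ref, max_mm=0)  # like -v 0
v2_ok = passes_v_mode(read, ref, max_mm=2)  # like -v 2
print(v0_ok, v2_ok)
```

With this read, -v 0 rejects it and -v 2 accepts it, no matter where in the read the mismatches fall.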
The seed is just a starting point for the alignment; the rest of the read is still compared against the reference.