My aim is to dispatch query sequences from fasta file using blastn according to 2 subject sequences.
Here, an example of a query sequence :
>M02945:227:000000000-BHPRH:1:2114:12657:28629 1:N:0:1
GAGAGCTCTATTACTGCACTGGAGGGTAGC
Here, my 2 subjects sequences, they are both 30bp long :
>ESR
GAGAGCTCTATTTCTTCGGTGTTACCAGCT
>Cercle
GAGAGCTCTATTACTGCACTGGAGGGAAGC
By look, the sequence M02945:227:000000000-BHPRH:1:2114:12657:28629 1:N:0:1 will go to Cercle.fa file (29 first bases match with 1 mismatch position 27).
And using blastn :
blastn -query test_sequence_for_blast.fa -subject ref.fa -culling_limit 1 -task blastn-short -outfmt 6
M02945:227:000000000-BHPRH:1:2114:12657:28629 ESR 100.00 8 0 0 2 9 9 2 0.008 16.4
M02945:227:000000000-BHPRH:1:2114:12657:28629 Cercle 100.00 26 0 0 1 26 1 26 1e-13 52.0
My problem is, I would like to know how many mismatch do blastn found in the full length subject sequence. So, first try I decrease the mismatch penalty to -1 (because here I just have the 1->26 information).
blastn -query test_sequence_for_blast.fa -subject ref.fa -culling_limit 1 -penalty -1 -task blastn-short -outfmt 6
M02945:227:000000000-BHPRH:1:2114:12657:28629 ESR 100.00 8 0 0 2 9 9 2 0.027 14.3
M02945:227:000000000-BHPRH:1:2114:12657:28629 Cercle 96.67 30 1 0 1 30 1 30 8e-12 46.0
Nice, I got the 1->30 information, ok that's great.
I inserted a second mismatch position 29
>M02945:227:000000000-BHPRH:1:2114:12657:28629 1:N:0:1
GAGAGCTCTATTACTGCACTGGAGGGTACC
blastn -query test_sequence_for_blast.fa -subject ref.fa -culling_limit 1 -penalty -1 -task blastn-short -outfmt 6
M02945:227:000000000-BHPRH:1:2114:12657:28629 ESR 100.00 8 0 0 2 9 9 2 0.008 16.4
M02945:227:000000000-BHPRH:1:2114:12657:28629 Cercle 100.00 26 0 0 1 26 1 26 1e-13 52.0
And again I lost the info from 26->30... Because at the end I want to totally send to trash sequences with more than 2 mismatches
How to force blastn to align 1->30 straigth and yield the number of mismatch ?
I also tried to modify gap open penalty and gap extend penalty but that didn't do the trick...
Thanks for helping !
Thanks for this help, actually my query sequences are 100-150 bp long, I just trimmed the example sequence here for a better understanding. I think i'll substring the 30 first bases of each sequences and use bowtie2 --end-to-end on them.
Why substring the first 30 bases?
Bowtie2
orbwa mem
should be supporting longer readsI'm just interested in the first 30 bases
Nevertheless, don’t use a local aligner if you want a full sequence alignment.
You could also look at just calculating Hamming or Levenshtein distances if your sequences are pretty similar and don’t need aligning first