Heya,
Is it possible to configure blastn in such a way that it will split even single nucleotide mismatches into two HSPs? For testing, I made the following two fasta files:
>SEQ1_FASTA
CTGCGGATGACTTAACACGCTAGGGACGTGGAGTCGATTCCATCGATGGTTATAAATCAAAGATTCGGAATGCTGTCTGGAGGGTGAATCTAACGGTGCGTATCTCGATTGCTCAGTCGCTTTTCGTACTGCGCGAAAGTTCGTACCGCTCATTCACTTGGTTCCGAAGCCTGTCCTGATATATGAATCCAAACTAGAGCGGGGCTCTTGACATTTGGAGTTGTAAATATCTAATACTCCAATCGGCTTTTACGTGCACCACCGCGGGCGGCTGACGAGGGTCTCACACCGAGAAACAAGACAGTTCCGGGCTGGAAGTAGCGCCGGCTAAGGAAGACGCCTGGTACAGCAGGACTATGAAACCGGTACAAAGGCAACATCCTCACTTGGGTGAACCGAAACGCGGTATCAAGGTTACTTTTTGGATACCTGAAACAAATCCCATGGTAGTCCTTAGACTTGGGAGTCTATCACCCCTAGGGCCCATATCTGGAAATAGACGCCAAGTTCAATCCGTATTCCGACGTACGATGGAACAGTGTGGGTGAGACGTGCTTCATTTATACCCTACGCAGGCTGGACCGAGGTCCGCAAGGCGCGGCGGTGCACAAGCAATTGACAACTAACCACCGTGTATTCATTATGGTACCAGGGACTTTAAGCCGAGTCAATGGAGCTCGCAATACAGAGTTTACCGCATCTTGCCGTAACTGACAAACTGTGATCCACCACAAGTCAAGCCATTGCCTCTTAGACACGCCGTTAGAGTAATTATGTAAACTTTGCGCGGCTTGACTACGACTCGTTCAGTCACGTCCGAGGGCACAATCCTATTCCCATTTGTATGTTCAGCTATCTTCTACCCATCCCCGGAAGTTAAGTAGGTCGTGAGATGCCGAGGAGGCTCTCGTTCATCCCGTGGGACATCAAGCTTCCCCTTGATAAAGCACCCCGCTCGGGTATGGCAGAGAGGACGCCTTCTGAATTGTGCTATCCCTCG
>SEQ2_FASTA
CTGCGGATGACTTAACACGCTAGGGACGTGGAGTCGATTCCATCGATGGTTATAAATCAAAGATTCGGAATGCTGTCTGGAGGGTGAATCTAACGGTGCGTATCTCGATTGCTCAGTCGCTTTTCGTACTGCGCGAAAGTTCGTACCGCTCATTCACTTGGTTCCGAAGCCTGTCCTGATATATGAATCCAAACTAGAGCGGGGCTCTTGACATTTGGAGTTGTAAATATCTAATACTCCAATCGGCTTTTACGTGCACCACCGCGGGCGGCTGACGAGGGTCTCACACCGAGAAACAAGACAGTTCCGGGCTGGAAGTAGCGCCGGCTAAGGAAGACGCCTGGTACAGCAGGACTATGAAACCGGTACAAAGGCAACATCCTCACTTGGGTGAACCGAAACGCGGTATCAAGGTTACTTTTTGGATACCTGAAACAAATCCCATGGTAGTCCTTAGACTTGGGAGTCTATCACCCCTAGGGCCCATATCTGGAAATAGGCGCCAAGTTCAATCCGTATTCCGACGTACGATGGAACAGTGTGGGTGAGACGTGCTTCATTTATACCCTACGCAGGCTGGACCGAGGTCCGCAAGGCGCGGCGGTGCACAAGCAATTGACAACTAACCACCGTGTATTCATTATGGTACCAGGGACTTTAAGCCGAGTCAATGGAGCTCGCAATACAGAGTTTACCGCATCTTGCCGTAACTGACAAACTGTGATCCACCACAAGTCAAGCCATTGCCTCTTAGACACGCCGTTAGAGTAATTATGTAAACTTTGCGCGGCTTGACTACGACTCGTTCAGTCACGTCCGAGGGCACAATCCTATTCCCATTTGTATGTTCAGCTATCTTCTACCCATCCCCGGAAGTTAAGTAGGTCGTGAGATGCCGAGGAGGCTCTCGTTCATCCCGTGGGACATCAAGCTTCCCCTTGATAAAGCACCCCGCTCGGGTATGGCAGAGAGGACGCCTTCTGAATTGTGCTATCCCTCG
The only difference between the two being a single SNP at position 500. I want to use the output of blastn downstream, so I'm hoping to do this with blast. I've been playing with the rewards / penalty for mismatches, but it seems I can't make it that sensitive. Anyone else knows how to do this? Possibly this is not feasible with blast, so if anyone knows some other tool to get HSPs based on small variants, that would be cool too :)
I notided there's an 'ungapped' option, that does precisely this for gaps (even if they are 1 nt). But I did not find the equivalant for a mismatch. Here's the result of ungapped:
blastn -query seq1.fn -subject seq2.fn -outfmt 6 -ungapped
SEQ1_FASTA SEQ2_FASTA 100.000 500 0 0 501 1000 500 999 0.0 962
SEQ1_FASTA SEQ2_FASTA 100.000 499 0 0 1 499 1 499 0.0 960
So, I want this, but then for single NT mismatches.
Those are two entries, not two files. Do you have them as one entry per file?
Yes, these are actually two files. I'm running:
Hi i know that i'm not answering what you asked but a few months ago i did read about snp detection algorithms and i learn that Blastn is a very error prone tool to detect small mutations like snps and smal in/dels. I strongly recommend Dnadiff from the Mummer Tools to do snap calls in large sequences. Dnadiff have a lot of output files that are really parsable.
Thanks for the tip. I'd still prefer to get this directly out of blast though. If it can detect a gap (option ungapped), why shouldn't it be able to detect a SNP?
I don't know how to explain exactly why, because there is some difficult programming that I don't fully understand. But the reason I read about why blast is not the best tool for calling the SNP is because of its scoring functions and alignment methods. They are optimized to provide the best possible local aligniment for a long chain of nucleotides. Therefore, it is possible to find large large mutations accurately, but small single mutations are not well recorded by the algorithm, leading to many false positives. Good results are possible, but you need to optimize the blasting parameters. Even so, I think (this is a personal opinion) that the blast is not the best tool for this sort of analysis.
I agree it's not the best tool. The problem is that I'm trying to show this to people, but in order to do so, I must compare it to other methods in a fair way.
Either way, thanks for the tip. I now at least got my SNPS using DNADIFF.