Blast Alignment Bug
13.1 years ago
Maria K ▴ 60

I tried to find a short oligonucleotide sequence (probe) in a transcript that I knew for certain contained the probe. However, the latest stand-alone BLAST (2.2.25+) found no match for the probe. Surprisingly, when I split the probe sequence into two parts, both were found in the transcript, one right after the other. Moreover, when I deleted the first two nucleotides from the probe sequence, BLAST found the correct match. Could anyone explain what kind of problem I am facing? I have about 20 such probe sequences that BLAST did not find even though there is a perfect match.

I ran the search with the following parameters:

blastn -query probe.fa -db target -task blastn-short -word_size 7 -evalue 100 -out res.out

UPDATE: Changing the -word_size parameter to 5 really helped: BLASTN found the correct match. BUT there are still several probes for which it fails to find the correct match, although the transcript definitely contains the probe sequence. The stand-alone BLAST accepts -word_size values >= 4, but even with -word_size 4 the match could not be found. The online BLAST finds the match. What should I do in this case?

The new problem data is:

>probe_seq

CCCCCCCCTCGGAGAGAGAGAGA

>transcript_seq

tccctctcccccccttctctctctctccgaggggggggggtcccagggagggaggggggg tcccccgatcagcatgtggctcctggcgctgtgtctggtggggctggcgggggctcaacg cgggggagggggtcccggcggcggcgccccgggcggccccggcctgggcctcggcagcct cggcgaggagcgcttcccggtggtgaacacggcctacgggcgagtgcgcggtgtgcggcg cgagctcaacaacgagatcctgggccccgtcgtgcagttcttgggcgtgccctacgccac gccgcccctgggcgcccgccgcttccagccgcctgaggcgcccgcctcgtggcccggcgt gcgcaacgccaccaccctgccgcccgcctgcccgcagaacctgcacggggcgctgcccgc catcatgctgcctgtgtggttcaccgacaacttggaggcggccgccacctacgtgcagaa ccagagcgaggactgcctgtacctcaacctctacgtgcccaccgaggacggtccgctcac aaaaaaacgtgacgaggcgacgctcaatccgccagacacagatatccgtgaccctgggaa gaagcctgtgatgctgtttctccatggcggctcctacatggaggggaccggaaacatgtt cgatggctcagtcctggctgcctatggcaacgtcattgtagccacgctcaactaccgtct tggggtgctcggttttctcagcaccggggaccaggctgcaaaaggcaactatgggctcct ggaccagatccaggccctgcgctggctcagtgaaaacatcgcccactttgggggcgaccc cgagcgtatcaccatctttggttccggggcaggggcctcctgcgtcaaccttctgatcct ctcccaccattcagaagggctgttccagaaggccatcgcccagagtggcaccgccatttc cagctggtctgtcaactaccagccgctcaagtacacgcggctgctggcagccaaggtggg ctgtgaccgagaggacagcgctgaagctgtggagtgtctgcgccggaagccctcccggga gctggtggaccaggacgtgcagcctgcccgctaccacatcgcctttgggcccgtggtgga tggcgacgtggtccccgatgaccctgagatcctcatgcagcagggagaattcctcaacta cgacatgctcatcggcgtcaaccagggagagggcctcaagttcgtggaggactctgcaga gagcgaggacggtgtgtctgccagcgcctttgacttcactgtctccaactttgtggacaa cctgtatggctacccggaaggcaaggatgtgcttcgggagaccatcaagtttatgtacac agactgggccgaccgggacaatggcgaaatgcgccgcaaaaccctgctggcgctctttac tgaccaccaatgggtggcaccagctgtggccactgccaagctgcacgccgactaccagtc tcccgtctacttttacaccttctaccaccactgccaggcggagggccggcctgagtgggc agatgcggcgcacggggatgaactgccctatgtctttggcgtgcccatggtgggtgccac cgacctcttcccctgtaacttctccaagaatgacgtcatgctcagtgccgtggtcatgac ctactggaccaacttcgccaagactggggaccccaaccagccggtgccgcaggataccaa gttcatccacaccaagcccaatcgcttcgaggaggtggtgtggagcaaattcaacagcaa ggagaagcagtatctgcacataggcctgaagccacgcgtgcgtgacaactaccgcgccaa caaggtggccttctggctggagctcgtgccccacctgcacaacctgcacacggagctctt 
caccaccaccacgcgcctgcctccctacgccacgcgctggccgcctcgtccccccgctgg cgccccgggcacacgccggcccccgccgcctgccaccctgcctcccgagcccgagcccga gcccggcccaagggcctatgaccgcttccccggggactcacgggactactccacggagct gagcgtcaccgtggccgtgggtgcctccctcctcttcctcaacatcctggcctttgctgc cctctactacaagcgggaccggcggcaggagctgcggtgcaggcggcttagcccacctgg cggctcaggctctggcgtgcctggtgggggccccctgctccccgccgcgggccgtgagct gccaccagaggaggagctggtgtcactgcagctgaagcggggtggtggcgtcggggcgga ccctgccgaggctctgcgccctgcctgcccgcccgactacaccctggccctgcgccgggc accggacgatgtgcctctcttggcccccggggccctgaccctgctgcccagtggcctggg gccaccgccacccccaccgcccccctcccttcatcccttcgggcccttccccccgccccc tcccaccgccaccagccacaacaacacgctaccccacccccactccaccactcgggtata gggggtgggtggggaggccctcctccccggccctccctggcccggccactccgaaggcag ggaggaggacttggcaactggcttttctcctgtggagtcgtcacacgccatccagcagcg ctaaggtggacatgggattcctccctgcgatgcgtgtctttcccacgcagagaagcccag tctcttctctggatctgggcctttgaacaactggggggcgttttctcccccccattggga caccagtcttcggtgtgtggaatgtggtattttcccgcgtggaggtgtgctttctcacaa cggggtgtgttttcccatgtgcagggtgaggtttttttttgccaccctggacacatgttg gccccctcaaagaatttctgtggggatttgtaccccagaatcctgttcccccatcccttc tcccacctcctcccctctccctccccctggagaccctggaagtggtgtgttcacatacag tgacccttggccaccagaccacagaggatggagcctgggaagcagcgaggaaatcacagc cccctcgcccctgcctcccttgcccctaccccggcgaagcatgttccccccgacgccccc cttggcacaagtcagatgaagcacgttctgccggggaggccctcaccttccagagaggac agacacagatttcctgctgggggagggaggagtccacgcatcctgatgctgcctggaagc ttattttcccgtggccaggacgcatttctctgagtggaaacaggttcttgcatgtggatg tgtgtttccccaggcagacggcccctctcttcccagcacttccctgcctcccccaggcct caggcccagcacccagttcctcctcacatggcaggtgagcacagacttctagttggcagg agctgaggagggtgaacaaaccccgagggaggcccggcccttgctcccgagttgggggga gggggtgtggcaacgtgccccccgcagaggccacgcatgtttgaccaaagccctcattgt ggtccgaggacagccttttccccaggcctcagagcattgctcatccgtgccaaactgggt aggtggatttgagcggaaagactcccaaaatgtgccaagaatttcccagtcccaggcagg gcaggggaaactaagggcaagcaggatacagggcgagggatgtggcaggtgagggggctc ccgcctgtgccccttctcctcaccatgtctcccccaccctgcctcagttctccgttcccc 
ttcatctccgtccccctctttgaagctgtccccatctcagtgtcagaccagccttctcct cagctgaccaccctcctctgacccacgccccctccttgtctgaaagaaaggagccttgaa tggtggagggaggcagtggggagaaaggtctcaccggacaggttgggagaatgaggtcag cggtgctggggaacagatggagggggcagtggggacagggcttgggcagacaccagcagg aataatttgaaatgtgtgaggtgactccccggagggccttgggcttgggcatttgggaaa agaatgatgtctggaagggcttaagggacacagtggacgaggggagagtcctcatctgct ggcattttgtggggtgttagtgccaaacttgaataggggctggggtgctgtcttccactg acacccaaatccagaatccctggtcttgagtccccagaactttgcctcttgactgtccct tctcttcctacctccatccatggaaaattagttattttctgatcctttcccctgcctggt ctagctcctctccaaacagccatgccctccaaatgctagagacctgggccctgaaccctg tagacagatgccctcagaattggggcatgggaggggggctgggggaccccatgattcagc cacggactccaatgcccagctcctctccccaaaacaatcccgacaatcccttatccctac cccaaccctttgcggctctgtacacatttttaaacctggcaaaagatgaagagaatattg taaatataaaagtttaactgtt

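The correct match is at positions 38-14 of the transcript. The descending coordinates indicate that the probe matches the reverse (minus) strand.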
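Before blaming the aligner, it can help to confirm the expected match independently of BLAST. The following is a minimal sketch, not part of the original post: the helper names `reverse_complement` and `find_probe` are made up for illustration, and only the first 60 bases of the transcript are used. It searches for the probe on both strands by plain string matching and reports 1-based coordinates.

```python
# Complement table for unambiguous DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_probe(probe, target):
    """Look for an exact copy of probe on either strand of target.

    Returns (strand, start, end) with 1-based inclusive coordinates on the
    target's plus strand, or None if there is no exact match.
    Hypothetical helper, written only for this illustration.
    """
    probe = probe.upper()
    # Drop the spaces/newlines that GenBank-style formatting leaves in the sequence.
    target = "".join(target.split()).upper()
    for strand, query in (("+", probe), ("-", reverse_complement(probe))):
        pos = target.find(query)
        if pos != -1:
            return strand, pos + 1, pos + len(query)
    return None


probe_seq = "CCCCCCCCTCGGAGAGAGAGAGA"
# First 60 bases of transcript_seq as posted above.
transcript_start = "tccctctcccccccttctctctctctccgaggggggggggtcccagggagggaggggggg"

print(find_probe(probe_seq, transcript_start))
```

A hit on the "-" strand is what BLAST reports with descending subject coordinates, as in the 38-14 quoted above. Note that the probe as posted is 23 nt long, so the span found this way need not be identical to the one quoted for the original 25-mer.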

blast blastn alignment • 3.8k views

Could you provide the command you used to run the local copy of BLAST? It may be that your parameter settings differ from the ones used by the web tool.


I used

blastn -query probe.fa -db target -task blastn-short -word_size 7 -evalue 100 -out blastprobe.out

It turned out that with -word_size 5 BLASTN manages to find the correct match.

13.1 years ago
Jake ▴ 150

You need to use a smaller word size. The online version works that out automatically. This is nicely explained here. Like you, when I run BLAST with the standard parameters I get no hits.

Running with:

blastn -db data.fasta -query target.fasta -word_size 5

I get the hit.
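The effect of the word size can be illustrated with a small sketch. blastn only starts an alignment from an exact word match of length -word_size, so a query whose longest exact stretch is shorter than the word size gives it nothing to start from. The function name `count_seed_words` and the two toy sequences below are made up for this illustration; real BLAST seeding involves more than this simple count.

```python
def count_seed_words(query, subject, word_size):
    """Count query words (k-mers) of length word_size that occur exactly in subject."""
    subject_words = {
        subject[i:i + word_size]
        for i in range(len(subject) - word_size + 1)
    }
    return sum(
        query[i:i + word_size] in subject_words
        for i in range(len(query) - word_size + 1)
    )


# Hypothetical toy data: the query equals a stretch of the subject
# except for a single mismatch in the middle.
subject = "TTGACCATGCAAGTC"
query = "GACCATCCAAGT"

for w in (7, 5):
    print(f"word_size={w}: {count_seed_words(query, subject, w)} shared word(s)")
```

With the larger word size every query word spans the mismatch, so none is shared; with the smaller one, words on either side of the mismatch still match exactly.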


Best answer +1. For oligos, you can even back the word size down a bit more.


Thanks, Jake! That was really helpful. But I still don't understand why BLASTN worked correctly for about 800,000 short 25-nucleotide sequences, while for only 20 sequences I have to use a smaller word_size parameter. Perhaps it's a problem with the implementation.


UPDATE: With this probe sequence the algorithm works, but I found several more sequences for which BLASTN fails to find the correct match even with -word_size 4. And 4 is the smallest value allowed for this parameter.


Also disable masking. For old blastall, the option is "-F F".


@Maria: to me, it is inappropriate to claim a program has a bug before you thoroughly understand how it works.

13.0 years ago
Maria K ▴ 60

Actually, the answer can be found in the NCBI BLAST FAQs: http://www.genebee.msu.su/blast/blast_faqs.html

The advice is to turn off filtering of the query sequence. BLAST masks low-complexity regions, so the usable part of the query can become even shorter than 25 nucleotides and the match will have low statistical significance. To disable filtering, set the -dust parameter to 'no'. With that setting, BLASTN finds the match:

blastn -query probe.fa -db target -task blastn-short -word_size 7 -evalue 1000 -dust no -out res.out
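Why masking matters for a probe like this can be seen in a toy sketch. The masker below is a crude stand-in that lower-cases homopolymer and dinucleotide runs; it is NOT the DUST algorithm, and the names `mask_simple_repeats` and `seed_words_left` as well as the `min_len=8` threshold are assumptions made for illustration only. The point is just that a repeat-rich query can be left with very few unmasked bases from which to seed.

```python
import re

# Crude stand-in for a low-complexity filter: a 1- or 2-base unit repeated
# at least twice. This is NOT how DUST scores sequence complexity.
_REPEAT = re.compile(r"([ACGT]{1,2})\1+")


def mask_simple_repeats(seq, min_len=8):
    """Lower-case homopolymer/dinucleotide runs that are at least min_len long.

    min_len is an arbitrary, hypothetical threshold chosen for this example.
    """
    def _mask(match):
        run = match.group(0)
        return run.lower() if len(run) >= min_len else run

    return _REPEAT.sub(_mask, seq.upper())


def seed_words_left(masked, word_size):
    """Count words of length word_size lying entirely in unmasked (upper-case) bases."""
    return sum(
        max(0, len(run) - word_size + 1)
        for run in re.findall(r"[ACGT]+", masked)
    )


probe_seq = "CCCCCCCCTCGGAGAGAGAGAGA"
masked = mask_simple_repeats(probe_seq)

print(masked)
for w in (7, 4):
    print(f"word_size={w}: {seed_words_left(masked, w)} unmasked word(s)")
```

The numbers this prints describe the toy masker only and should not be read as a prediction of what BLAST does with this probe. They do show the mechanism, though: once the C run and the AG repeat are hidden, almost nothing is left of the query, which fits the observation that -dust no restores the hit.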


Option -F F works the same way.
