Standalone blast results wrong
10.2 years ago
biobio ▴ 50

Hi,

I downloaded the NR database from NCBI about 2 months ago. The past few times I have run a BLAST search on some contigs, the results have been wrong for many of the query sequences. I checked this by taking the sequence, running it through NCBI's web BLAST, and comparing the results. Often, hits that are plant viruses in local BLAST turn up as plant sequences in web BLAST. The options are all kept the same between web BLAST and local BLAST, and the only difference I can think of is that my database is about 2 months older than NCBI's. Before I download this database again, do you think this is the reason for the different results? Could there be another confounding factor?

Thanks!

blast blastx • 3.0k views
Did you check for low-complexity filtering? AFAIK it's on in web BLAST; I'm not too sure about standalone, though. Also, can you check whether the alignments are for the exact same query sequence(s)? Since these are plant genomes and viruses, those sequences might be present in both references (maybe due to contamination of the plant's genome). If that is the case, I'd try a newer standalone db.
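One way to take the filtering question out of the comparison is to set it explicitly instead of relying on defaults. The snippet below is only a sketch: it builds (but does not run) two BLAST+ `blastx` command lines that differ only in the `-seg` setting. The file names `contigs.fasta`, `masked.tsv` and `unmasked.tsv` are made up for illustration, and `nr` is assumed to be the name of the local database.

```python
import shlex


def blastx_command(query, db, seg, out):
    """Build a BLAST+ blastx argument list with the SEG filter set explicitly.

    seg is passed straight to -seg: "yes", "no", or "window locut hicut".
    """
    return [
        "blastx",
        "-query", query,
        "-db", db,
        "-seg", seg,        # low-complexity filtering of the translated query
        "-outfmt", "6",     # tabular output, easy to compare afterwards
        "-out", out,
    ]


# Hypothetical file names, used only for illustration.
masked = blastx_command("contigs.fasta", "nr", "yes", "masked.tsv")
unmasked = blastx_command("contigs.fasta", "nr", "no", "unmasked.tsv")

# Print the commands so they can be inspected or pasted into a shell.
print(shlex.join(masked))
print(shlex.join(unmasked))
```

If the virus hits only show up in one of the two runs, filtering is a likely part of the explanation.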

Is it only a matter of the order of the results? I assume you get multiple hits to the sequence?

10.0 years ago
Michael 55k

You absolutely have to use identical input data, so download the most recent version of the database if you want to make any claims about the reproducibility of results between two different programs. Also, you didn't indicate which version of local BLAST you used (e.g. BLAST+); you should use the most recent version here too, and report GIs for hits whose taxa may have been reclassified. Indeed, you should look up the GIs of the top hits and compare their annotation.

Also, because the database is non-redundant, a sequence might have multiple IDs and multiple taxa. The hit is possibly a viral sequence integrated into the plant genome and therefore annotated with the plant's taxid.
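To find which queries actually need their hit IDs looked up, something like the following could help. It is a rough sketch, assuming both the local and the web results were saved as tab-separated hit tables (query ID in column 1, subject ID in column 2, best hit listed first for each query). The sample rows and IDs are invented, not real results.

```python
def top_hits(tabular_text):
    """Return {query_id: subject_id} for the first hit listed per query."""
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        # Keep only the first subject seen for each query.
        best.setdefault(fields[0], fields[1])
    return best


def differing_queries(local_text, web_text):
    """Return {query_id: (local_top, web_top)} where the top hits disagree."""
    local, web = top_hits(local_text), top_hits(web_text)
    shared = sorted(local.keys() & web.keys())
    return {q: (local[q], web[q]) for q in shared if local[q] != web[q]}


# Invented example rows in 12-column tabular layout.
local_sample = (
    "contig1\tprotein_A\t95.0\t100\t5\t0\t1\t300\t1\t100\t1e-50\t200\n"
    "contig2\tvirus_protein_B\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-40\t170\n"
    "contig2\tplant_protein_C\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-40\t170\n"
)
web_sample = (
    "contig1\tprotein_A\t95.0\t100\t5\t0\t1\t300\t1\t100\t1e-50\t200\n"
    "contig2\tplant_protein_C\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-40\t170\n"
    "contig2\tvirus_protein_B\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-40\t170\n"
)

print(differing_queries(local_sample, web_sample))
```

The IDs reported for each disagreeing query are the ones whose annotation and taxid are worth comparing by hand.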

Other misconceptions:

  1. Results are not "wrong". A wrong alignment would be one that shows sequences which do not align, or one that couldn't be reproduced using Smith-Waterman given the obtained score. You could spot such alignments immediately from the homology string, so this is not what happened here.
  2. The output of BLAST, especially when looking only at the top hits, gives no definitive answer about what the query sequences "are" (I guess you mean which organism they come from). In the case of multiple hits, the ordering might be arbitrary: some alignments might get an identical e-value and score, and therefore you cannot infer the origin of the sequence this way.
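To see whether point 2 applies to a given result, one could check for ties at the top of each query's hit list. The following is a minimal sketch, assuming 12-column tabular output with the e-value in column 11 and the bit score in column 12. The sample rows are invented.

```python
from collections import defaultdict


def tied_top_hits(tabular_text):
    """Return {query_id: [subject_ids]} for queries whose best hit is tied.

    A tie means more than one subject shares the highest bit score
    and the same e-value.
    """
    hits = defaultdict(list)
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        f = line.split("\t")
        hits[f[0]].append((f[1], float(f[10]), float(f[11])))

    tied = {}
    for query, rows in hits.items():
        # Best row: highest bit score; its e-value defines the tie group.
        _, best_evalue, best_bits = max(rows, key=lambda r: r[2])
        top = [s for s, e, b in rows if b == best_bits and e == best_evalue]
        if len(top) > 1:
            tied[query] = top
    return tied


# Invented example rows, not real BLAST results.
sample = (
    "contig1\tvirus_protein_A\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-50\t200\n"
    "contig1\tplant_protein_B\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-50\t200\n"
    "contig1\tother_protein_C\t80.0\t100\t20\t0\t1\t300\t1\t100\t1e-30\t150\n"
    "contig2\tplant_protein_D\t99.0\t100\t1\t0\t1\t300\t1\t100\t1e-60\t230\n"
)

print(tied_top_hits(sample))
```

For any query reported here, which subject appears first in the output says nothing about the origin of the sequence.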

The virus 'problem' is exactly the reason why I asked for complexity masking...
