Hello !
I'm currently trying to develop a local core-genome MLST tool using a combination of a huge genes database and blastn but I came into outputs I can't explain. Here's what I have:
- My genome: genome.fasta
- My database, comprised of ~1000 genes and n alleles:
- db/GENE01.fasta: 1500 sequences.
- db/GENE02.fasta: 1200 sequences.
- etc
Now, I proceed in two similar ways but with different outputs. Solution #1, generate one DB per gene using makeblastdb and blast each DB to my genome:
makeblastdb -in db/GENE01.fasta -dbtype nucl
blastn -query genome.fasta -db db/GENE01.fasta -evalue 1e-5 -outfmt "6 qseqid qlen length pident sseqid sseq" -out "output_01.txt"
This allowed me to get a match with 100% identity with the allele #27:
NODE_6_length_104687_cov_62.173891 104687 843 100.000 GENE01_27 TTATCTTAGTTGTT...
Now, solution #2 is to compile EVERY genes and alleles from my database (~1,900,000 sequences in total) into one big file "fullDB.fasta" and blast it to my genome:
cat "db/*.fasta" > fullDB.fasta
makeblastdb -max_file_sz '4GB' -in fullDB.fasta -dbtype nucl
blastn -query genome.fasta -db fullDB.fasta -evalue 1e-5 -outfmt "6 qseqid qlen length pident sseqid sseq" -out "fulloutput.txt"
But when I do that (no error and pretty quick), I'm unable to obtain the same results as solution #1. No more GENE01_27 for example.
Is there something I'm missing when using makeblastdb / blastn -db? I supposed it's because genes have closely similar alleles (oftenly one or two SNP differences) but in that case why am I able to obtain something when blasting all alleles from GENE01?
To summarize, solution #1 get correct results but is way too slow (blastn x1000) and solution #2 is fast (only one blastn) but with uncorrect results.
Any idea? Thanks for your time!
Your first solution seems to do the job, I didn't know local blast had such limitations ! I'll try to test your other ones to see if it also changes the output.
Thanks a lot :)
Yeah it scares me!!! I am running many BLASTs for many mitogenomes at the moment. There are many highly similar sequence hits on NT across many species and the default number of alignment often does not include the correct species unless I ramp up the number of reported alignments and switch to 'old' task blastn