Hi everyone,
I am quite new to using the BLAST Command Line Applications and would love any and all help. I am trying to use the application to run many sequences stored in a folder against a database I have created but am not sure exactly how to do this or if I am on the right track.
I downloaded and set up BLAST from NCBI using the standalone setup instructions for Windows online. Following these instructions, I created a database that I want to BLAST multiple sequences against, using the command makeblastdb -in db\elasmo_full.fa -parse_seqids -blastdb_version 5 -title "Elasmobranchii genomes full" -dbtype nucl. The data was from complete genomes I downloaded from NCBI. I have been able to run single sequences against this database using blastn -db elasmo_full.fa -query zAsic4.2_4.fa -perc_identity 50 -outfmt 7 -max_target_seqs 5 (I set the perc_identity low as a starting point to find anything even remotely similar and will increase it in subsequent searches).
I am wondering if there is a command to BLAST a folder of .fa files, as opposed to individual files, against the database I created. I was hoping to save all the small sequences in one folder to save time rather than running them one by one.
Thanks in advance for any help or advice.
Cheers,
D
You can put multiple FASTA sequences in your query file and run blastn once. If you want to find anything even remotely similar, remove the -perc_identity argument and add -task blastn.
Thanks so much for your response - so the best course is to just compile all my FASTA sequences in one file and then use that file to run against the database? Will this tell me which of the sequences in my FASTA file/database gets the hit? Cheers
Yes, this would be the simplest approach. As a result you will get one file (in outfmt 7) with BLAST results for all your query sequences. You will know which sequence gets a hit by looking at the first column (query id) in the BLAST output. Alternatively, you can BLAST a directory of query FASTA files. You would need a 'for loop' on the command line using bash that iterates over each file in the query directory. Inside the for loop you would run blastn (https://stackoverflow.com/questions/138497/iterate-all-files-in-a-directory-using-a-for-loop).
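The for-loop approach described above could look roughly like the sketch below. It assumes a bash-like shell (on Windows, for example Git Bash or WSL) and query files ending in .fa. The names blast_dir, queries and results are hypothetical and not from the original posts. The blastn options are the ones already used in this thread, plus -out so that each query file gets its own result file.

```shell
# Sketch: run blastn once per FASTA file in a directory.
# blast_dir and the directory names are illustrative assumptions.
# usage: blast_dir DB QUERY_DIR OUT_DIR
blast_dir() {
  db=$1
  query_dir=$2
  out_dir=$3
  mkdir -p "$out_dir"
  for query in "$query_dir"/*.fa; do
    # Skip the unexpanded pattern when the directory holds no .fa files
    [ -e "$query" ] || continue
    name=$(basename "$query" .fa)
    # One tabular report (outfmt 7) per query file
    blastn -db "$db" -query "$query" -task blastn \
      -outfmt 7 -max_target_seqs 5 -out "$out_dir/$name.txt"
  done
}

# Hypothetical call, with the query files stored in a folder named queries:
#   blast_dir elasmo_full.fa queries results
```

Compared with a single multi-FASTA query file, this produces one report per input file, which can be easier to inspect but means starting blastn once per file.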
Great, that's very helpful. Thanks again - being new to this area it feels a bit overwhelming so I appreciate you providing advice! Cheers
Let me know if you need anything else.
Hi,
I know it has been a while since this query, but after combining files and running some BLAST searches I have noticed oddities in my results. If I run a BLAST search on a database of two fish, the results appear to be missing matches that I get when I run a search on the two fish as individuals (e.g. blastn -db zebra.fasta -query zAsic2.fasta -task blastn -outfmt 7 -max_target_seqs 100). I thought this might be an issue with how I manually combined the FASTA files (converting to .txt, copying and pasting, and then converting back to .fa). So I tried to combine them through PowerShell instead, as I had seen others combine FASTA files this way. However, when I try to convert the result into a database I get the error Blast option error: file db\combined.fa does not match the input format type, default input type is FASTA. So I am currently stuck, as my manually combined FASTA files are not recognizing matches that should be there and combining them another way has also hit a roadblock. Any help would be greatly appreciated!
Thanks,
D
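For the combining step, one possible alternative to copy-and-paste is sketched below, assuming a bash-like shell with awk and grep available. The helper names combine_fasta and count_seqs are hypothetical. The awk call re-emits every line with a line ending and strips Windows carriage returns, so a file that lacks a final newline cannot glue its last sequence line onto the next header. Separately, if the PowerShell command used the > redirection (an assumption, since the command is not shown here), it is worth knowing that Windows PowerShell writes UTF-16 by default with >, which makeblastdb may not accept as FASTA.

```shell
# usage: combine_fasta OUT_FILE IN_FILE...
# OUT_FILE must not be one of the input files.
combine_fasta() {
  out=$1
  shift
  : > "$out"   # start from an empty output file
  for f in "$@"; do
    # print ends every line with a newline, including an unterminated last line;
    # sub() drops a trailing carriage return left by Windows line endings
    awk '{ sub(/\r$/, ""); print }' "$f" >> "$out"
  done
}

# usage: count_seqs FILE   (prints the number of header lines starting with ">")
count_seqs() {
  grep -c '^>' "$1"
}
```

After combining, the count for the combined file should equal the sum of the counts for the individual files.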
Hi, I would first check whether you have the same number of sequences in the individual FASTA files as in the manually combined FASTA file. This way, you will know if you combined the files correctly. You can count FASTA sequences by counting > signs. The easiest way is to use grep -c '^>' lamprey_zebra.fasta.
Also, could you show us some BLAST hits that you get with the individual databases but not with the combined database? The E-value in BLAST depends on the size of the database: the bigger the database, the greater the E-value. So a hit from species A will have a smaller E-value when you BLAST a sequence against species A alone than when you BLAST the same sequence against a database of multiple species that includes species A. By default, BLAST reports hits with E-value <= 10. It is thus possible that, with the combined database, some of your hits exceeded the E-value threshold and were not reported in the output.
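As a rough illustration of the database-size effect described above: for a fixed alignment score, the expected number of chance hits is approximately proportional to the size of the search space, so moving a hit from a small database into a larger one multiplies its E-value by about the ratio of the total database lengths (in residues, not in number of sequences). The sketch below ignores BLAST's effective-length corrections, so it only gives a ballpark figure. The function name rescale_evalue and the numbers in the comment are hypothetical.

```shell
# usage: rescale_evalue EVALUE OLD_DB_LENGTH NEW_DB_LENGTH
# Lengths are total residues in each database; prints the approximate new E-value.
rescale_evalue() {
  awk -v e="$1" -v old="$2" -v new="$3" \
    'BEGIN { printf "%.3g\n", e * new / old }'
}

# Hypothetical example: a hit with E-value 0.5 in a 1 Mb database,
# searched again as part of a 100 Mb database:
#   rescale_evalue 0.5 1000000 100000000
```

In that hypothetical case the estimate comes out at about 50, which is above the default cutoff of 10, so the hit would no longer be reported. blastn also has a -dbsize option for setting the effective database length, which may help when comparing E-values across databases of different sizes.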
Hi,
Thanks again for your help - I am unfortunately still having some difficulty. I checked, and the number of sequences appears to be correct when I use blastdb_aliastool; however, it is only returning results from one of the databases included, and changing the evalue inflates the matches beyond what I had expected.
I blast the combined database using:
and the individuals using:
The combined database only returned the same results as elasmo_full (4 hits), when the combined should ideally be returning a total of 21, which was the sum of all the individual blasts. If I change the evalue to 100 then I get 43 hits. I guess I don't understand why the change is so different between the individuals, as elasmo_full is by far the biggest file, containing 9815 sequences, while the rest contain 26, 26, 13 and 13. The % identity matches are all above 80%, with most being 90-100%. If I inflate the evalue, would the results not be considered very valuable or noteworthy? As an additional note, an evalue of 23 gives 16 returns and the next jump is at 78, which moves to the 43 returns. Any further insight would be greatly appreciated!
Many Thanks,
D
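To see how the number of reported hits changes with the cutoff without rerunning BLAST each time, one option is to run the search once with a generous -evalue and filter the tabular output afterwards. The sketch below assumes the default outfmt 7 column order, in which the E-value is the 11th column and the bit score the 12th. The names count_hits and results.txt are hypothetical.

```shell
# usage: count_hits RESULT_FILE MAX_EVALUE
# Counts non-comment rows whose E-value (column 11 in default outfmt 7) is <= MAX_EVALUE.
count_hits() {
  awk -v max="$2" '
    /^#/ { next }                       # skip outfmt 7 comment lines
    NF >= 12 && ($11 + 0) <= (max + 0) { n++ }
    END { print n + 0 }
  ' "$1"
}

# Hypothetical use, comparing the cutoffs mentioned above:
#   for cutoff in 10 23 78 100; do count_hits results.txt "$cutoff"; done
```

Hits that only appear at cutoffs well above 10 are expected to occur by chance many times in a database of that size, so they are usually weak evidence on their own, even when the percent identity over a short alignment is high.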
Is there any chance you could share FASTA files of query and individual species?
Of course - here is a dropbox link for the FASTA files https://www.dropbox.com/sh/19krs5c1s2wpoyh/AADQFcwHc1ygbHAtL8E93ynpa?dl=0
Let me know your thoughts. Cheers, D
Thanks! I would also need the query sequence. I'll check if I can reproduce your issue.
No problems - I think I have determined the issue is with the large discrepancy in the size of the databases, but I'll attach the query sequence as well in case I am incorrect!
Then:
Finally, I blasted the combined database and the individuals which yields different results:
Thanks again