Hello everyone, I made a blast research and I got the output in the table format 6.
So I got several hits and I would like to retrieve each fasta sequences from the blast database with the start and end corresponding of the hit.
I saw that that the command blastdbcmd
should be able do to the trick but I cannot find where you put the blast output to indicate where to take the sequence in the database (with start and end informations).
Thank you for your help.
I don't think you can do exactly what you want to do with
blastdbcmd
, as I don't think takes blast output specifically. According to the docs, it looks like you can specify coordinates based on the identifiers though:Extract different sequence ranges from the BLAST databases
The command below will extract two different sequences: bases 40-80 in human chromosome Y (GI 13626247) with the masked regions in lowercase characters (notice argument 30, the masking algorithm ID which is available in this BLAST database) and bases 1-10 in the minus strand of human chromosome 20 (GI 14772189).
I'm rusty when it comes to
blastdbcmd
but I think broadly what you'll need to do is: - Convert blast output to just a list of IDs of interest. - Pass this file with coordinates in a manner that matches the IDs of the sequences you created the database from, and print each sequence accordingly.For instance, in the above, you're replace the
printf
command with a manipulation of the columnar data from the blastoutput (seecut
/awk
etc).Hi everyone, I have some news concerning the idea you proposed me. The issue is in term on time because if I write a command line for each hit blast, it is about
310 000
commande lines to execute in order to recover the fasta sequences in the database corresponding to the range start and end.So I have my file with the 310 000
blastdbcmd -db ....
to execute by doingfor line in read -r file_script.txt line ; do command "$line"; done
and it takes a lot a time, at least 4 hours to execute all the blastdbcmd commande in the file.So I was wondering if there where not another faster way to recover a fasta sequences in database with the range start and end ?
Please don’t add answers unless you are actually answering the top level post.
As for your speed issue, retrieving that many hits from a blast database is never going to be especially fast, but you should be able to parallelise the process using
GNU parallel
instead of a loop.You could use a RAM disk or something (if you have plenty of RAM available) and read the database into memory. That would be about the fastest way to do this using blast.
You could also look into using
pyfaidx
orsamtools faidx
(e.g. Extract User Defined Region From An Fasta File ) with intervals to get the sequence from original database file. This would likely be the fastest way.