I am using the Biopython module NCBIWWW to blast some sequences online. I would like to blast my sequences against the different databases available; however, I cannot find a comprehensive list of them.
Here is an example of a simple query to the Nucleotide collection database using the "blastn" algorithm.
from Bio.Blast import NCBIWWW
result_handle = NCBIWWW.qblast("blastn", "nt", some_sequence)
As you can see, the database Nucleotide collection is specified as "nt". With what shall I substitute "nt" in case I want to query the Human GRCh37/hg19 database for example? And if I want to query other species/builds? Is there any comprehensive list available where I can find the short names for all the databases available at http://blast.ncbi.nlm.nih.gov ?
Thanks!
There doesn't seem to be a list anywhere. However, if you look at the NCBI BLAST database FTP site, you can see that the names match what is listed on the BLAST web service. I'm assuming that the name prefix (i.e. htgs in htgs.[0-9].tar.gz) is the name of the database.
ftp://ftp.ncbi.nlm.nih.gov/blast/db/
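To check that assumption against a directory listing, here is a minimal sketch that derives candidate database names from archive file names by stripping the volume number and the .tar.gz suffix. The helper name candidate_db_names and the sample file names are hypothetical, and a name found this way is not guaranteed to be accepted by the web service.

```python
import re

# Matches e.g. "htgs.00.tar.gz" or "taxdb.tar.gz"; checksum files (".md5") do not match.
_ARCHIVE = re.compile(r"^(?P<name>.+?)(?:\.\d+)?\.tar\.gz$")


def candidate_db_names(filenames):
    """Return the sorted, de-duplicated name prefixes of BLAST db archives."""
    names = set()
    for filename in filenames:
        match = _ARCHIVE.match(filename)
        if match:
            names.add(match.group("name"))
    return sorted(names)


# Hypothetical sample of a directory listing (not fetched from the server).
sample = [
    "htgs.00.tar.gz",
    "htgs.01.tar.gz",
    "htgs.00.tar.gz.md5",
    "nt.00.tar.gz",
    "taxdb.tar.gz",
]
print(candidate_db_names(sample))
```

The real listing could presumably be fetched with the standard-library ftplib (FTP.nlst on the blast/db directory) and passed to the same function.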
As for wanting to blast against a specific human chromosome only, maybe try the 'human_genomic' database, or select 'refseq_genomic' and provide an Entrez query for the human taxonomy ID: 9606[taxid].
If you're blasting a large number of sequences you may not want to use the WWW service.
I already tried using the names from the files on the FTP server you linked... it does not work, and if you look at the names there is no way to tell the version of the human genome build, for example; there is only "human_genomic" without any other indication... The same goes for the taxid: 9606 is human what? Which build?
Of course, if I need to blast a large number of sequences I will do it locally. But right now I just need to blast a few sequences against different human builds, and I wanted to make a script for it. It's a pity I cannot find info about this topic in the Biopython manual or on the NCBI website...
Using the refseq_genomic database and the entrez query "9606[taxid] AND grch38" gets me hits against the GRCh38 assembly of the human genome. I'm not sure if there's a way to get more specific; you might be stuck with filtering the blast results based on subject sequence GI/acc/title.
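As a sketch of how that combination could be passed to Biopython: NCBIWWW.qblast accepts an entrez_query keyword argument. The helper names and the build_term parameter below are made up for illustration, the query string is simply the one quoted above, and whether it still restricts hits to a given build is an assumption to verify against the results.

```python
def build_qblast_args(sequence, build_term="grch38"):
    """Collect keyword arguments for NCBIWWW.qblast; no network access here."""
    return {
        "program": "blastn",
        "database": "refseq_genomic",
        "sequence": sequence,
        # Query string taken from the post above; treat its syntax as an assumption.
        "entrez_query": f"9606[taxid] AND {build_term}",
    }


def run_remote_blast(sequence, build_term="grch38"):
    """Submit the search to NCBI and return the handle with the results."""
    from Bio.Blast import NCBIWWW  # requires Biopython; imported only when called

    return NCBIWWW.qblast(**build_qblast_args(sequence, build_term))


# Usage (needs network access and Biopython):
# result_handle = run_remote_blast(some_sequence)
```

Keeping the argument building separate from the network call makes it easy to print and inspect the exact query before submitting it.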
The information is available in the Biopython/NCBI literature; it might just be a matter of figuring out the right entrez query to use, or filtering the results. I'm not sure if you can get down to the level of a specific assembly.
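If the query alone cannot pin down an assembly, filtering on the hit title is one fallback. A minimal sketch, assuming the parsed hits expose a title attribute (as the alignment objects returned by Bio.Blast.NCBIXML do) and that the assembly name actually appears in the titles, which is not guaranteed. The helper name and the sample titles are hypothetical.

```python
from types import SimpleNamespace


def hits_mentioning(alignments, keyword):
    """Keep only the hits whose title contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return [aln for aln in alignments if needle in aln.title.lower()]


# Stand-in objects with made-up titles, in place of a real BLAST record.
fake_hits = [
    SimpleNamespace(title="Homo sapiens chromosome 1, GRCh38 Primary Assembly"),
    SimpleNamespace(title="Homo sapiens chromosome 1, alternate assembly"),
]
print([hit.title for hit in hits_mentioning(fake_hits, "grch38")])

# With Biopython and a real result handle, the same function could be applied as:
# from Bio.Blast import NCBIXML
# record = NCBIXML.read(result_handle)
# kept = hits_mentioning(record.alignments, "grch38")
```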
See also: Help To Blast Sequences Against Drosophila Genome In Biopython
Can you give an example of a NCBIWWW.qblast call using "refseq_genomic" and the entrez_query "9606[taxid] AND grch38"? Here is what I came up with, but I cannot find how to specify the genome build...
You can specify it by using the -entrez_query option.
EDIT: You can see the description of Entrez search fields here. There is a search field called "Genome Project" which was replaced by BioProject. Here's the list of BioProject IDs for all human genome assemblies. Once you decide which assembly you want to use, you can then restrict the BLAST search as follows (for GRCh38.p2 assembly).
EDIT2: I realized that the suggestion above won't be helpful for you as both the builds GRCh38 and GRCh37 will have the same BioProject ID. Also, if you want to use the "Genome Project" field, you need to use the "refseq_genomic" BLAST database instead of nt.
For future readers that happen upon this page, the correct entrez query to specify sequences from taxid 9606 is
Perhaps this is a change on NCBI's side since the previous comment.