BLAST via Biopython NCBIWWW. Where can I find the complete database list?
2
2
Entering edit mode
9.9 years ago
alec_djinn ▴ 390

I am using the module Biopython module NCBIWWW to blast some sequences online. I would like to blast my sequences against different databases available, however I cannot find a comprehensive list of them.

Here is an eample of simple query to the Nucleotide collection database using "blastn" algorithm.

from Bio.Blast import NCBIWWW

result_handle = NCBIWWW.qblast("blastn", "nt", some_sequence)

As you can see, the database Nucleotide collection is specified as "nt". With what shall I substitute "nt" in case I want to query the Human GRCh37/hg19 database for example? And if I want to query other species/builds? Is there any comprehensive list available where I can find the short names for all the databases available at http://blast.ncbi.nlm.nih.gov ?

Thanks!

biopython blast databases • 15k views
ADD COMMENT
2
Entering edit mode

There doesn't seem to be a list anywhere, however if you look at the NCBI BLAST database FTP website you can see the names match what is listed on the BLAST webservice. I'm assuming that the name prefix (i.e. htgs of htgs.[0-9].tar.gz) is the name of the database.

ftp://ftp.ncbi.nlm.nih.gov/blast/db/

As for wanting to blast against a specific human chromosome only, maybe the 'human_genomic', or select 'refseq_genomic' and provide a entrez query for the human taxonomy ID: [taxid]9606.

If you're blasting a large number of sequences you may not want to use the WWW service.

ADD REPLY
0
Entering edit mode

I tryed already using the name from the files of the ftp server you linked... it does not work and if you look at the name there is no way to understand the version of the human genome build for example, there is only "human_genome" without any other indications... The same in case of taxid, 9606 is human what? Which build?

Of course if I need to blas a large number of sequences I will make it local.. But right now I just need to blast few sequences against different human builds and I wante to make a script for it. It's pity I cannot find info about this topic in the biopython manual nor in the ncbi website...

ADD REPLY
0
Entering edit mode

refseq_genomic and the entrez query "9606[taxid] AND grch38" get me hits against the GRCh38 assembly of the human genome.

I'm not sure if there's a way to get more specific, you might be stuck with filtering the blast results based on subject sequence GI/acc/title.

The information is available in biopython/NCBI literature, it might just be a matter of figuring out the right entrez query to use, or filtering the results. I'm not sure if you can get down to the level of a specific assembly.

See also: Help To Blast Sequences Against Drosophila Genome In Biopython

ADD REPLY
0
Entering edit mode

Can you make an example of a NCBIWWW.qblast call using "refseq_genomic and the entrez_query 9606[taxid] AND grch38 "?

Here is what I came up with but I cannot find how to specify the genome build...

mycall = NCBIWWW.qblast(program="blastn", database="refseq_genomic", sequence=some_sequence, organism="`9606`")
ADD REPLY
1
Entering edit mode

You can specify by using the -entrez_query option

entrez_query="9606[taxid] AND grch38"

EDIT: You can see the description of Entrez search fields here. There is a search field called "Genome Project" which was replaced by BioProject. Here's the list of BioProject IDs for all human genome assemblies. Once you decide which assembly you want to use, you can then restrict the BLAST search as follows (for GRCh38.p2 assembly).

entrez_query="PRJNA168[Genome Project]"

EDIT2: I realized that the suggestion above won't be helpful for you as both the builds GRCh38 and GRCh37 will have the same BioProject ID. Also, if you want to use the "Genome Project" field, you need to use the "refseq_genomic" BLAST database instead of nt.

ADD REPLY
2
Entering edit mode

For future readers that happen upon this page, the correct entrez query to specify sequences from taxid 9606 is

"txid9606[ORGN]"

Perhaps this is a change on NCBI's side since the previous comment.

ADD REPLY
7
Entering edit mode
9.9 years ago
Siva ★ 1.9k

This is the list of preformatted BLAST databases available at BLAST FTP site.

I parsed the list and tested these databases with a script. All of them work except 3.

Protein databases

env_nr
nr
pataa
pdbaa
refseq_protein
swissprot

Non-RefSeq nucleotide databases

env_nt
est
est_human
est_mouse
est_others
gss
htgs
nt
patnt
pdbnt
sts
vector
wgs

RefSeq nucleotide databases

human_genomic #obsolete, renamed to refseq_genomic_human
human_genomic_transcript #obsolete
mouse_genomic_transcript #obsolete
other_genomic
refseq_genomic
refseq_rna
refseqgene #obsolete

To your second question about how to search against an older human genome build (Human GRCh37/hg19), I am afraid you cannot do it with remote BLAST. You need to download the older data and create a BLAST database and run BLAST locally.

ADD COMMENT
0
Entering edit mode

Thanks Siva, this solved my question. I finally got a list and I am sure I cannot ask for a specific build from remote (that's pity).

Thanks in general to everybody that participated to this thread. I have found your hints very useful!

ADD REPLY
0
Entering edit mode

Isn't this more or less what I linked?

ADD REPLY
1
Entering edit mode
9.9 years ago
pld 5.1k

Found it:

Make your database 'refseq_genomic' and your entrez query "GCF_000001405.28[refseq]". Each assembly has an alternate ID that is in RefSeq and GenBank.

You should be able to replace that with any RefSeq assembly ID.

ADD COMMENT
0
Entering edit mode

Did you try with a query sequence? Did you get any hits? I used a human gene from the current assembly as query and it did not produce any hits. If I remove the -entrez_query option, it runs fine and produces hits.

I am not sure if the field "refseq" even exists in Entrez search. The strange thing is there is no error or warning message if I use fields that definitely don't exist. For fun, I just tried "biostars",

blastn -query test.fa -db refseq_genomic -out test.out -remote -entrez_query "GCF_000001405.28[**biostars**]"

And there was no warning or error message and it produced the same output (No hits found) as using "refseq".

If I use "nt" database instead of refseq_genomic, it does throw a warning

Warning: Warning: conversion_warning: Failed to collect db stats for nt (entrez query: GCF_000001405.28[refseq]) Warning: conversion_warning: Query 'lcl|Query_1 gi|224589802:c5248301-5246696 Homo sapiens chromosome 11, GRCh37.p13 Primary Assembly' (# 1): Warning: GI or TI list filtering resulted in an empty database.
ADD REPLY
1
Entering edit mode

Using BioPython and searching for NPC1 (NM_000271) I get the following hits:

gi|732663877|ref|NW_011332701.1| Homo sapiens chromosome 15 genomic patch of type FIX, GRCh38.p2 PATCHES HG2139_PATCH
gi|698040068|ref|NW_009646197.1| Homo sapiens chromosome 3 genomic patch of type FIX, GRCh38.p2 PATCHES HG2066_PATCH
gi|698040066|ref|NW_009646199.1| Homo sapiens chromosome 5 genomic patch of type NOVEL, GRCh38.p2 PATCHES HSCHR5_7_CTG1
gi|732663880|ref|NW_011332698.1| Homo sapiens chromosome 13 genomic patch of type FIX, GRCh38.p2 PATCHES HG2288_HG2289_PATCH
gi|698040061|ref|NW_009646204.1| Homo sapiens chromosome 12 genomic patch of type FIX, GRCh38.p2 PATCHES HG23_PATCH
gi|732663887|ref|NW_011332691.1| Homo sapiens chromosome 3 genomic patch of type FIX, GRCh38.p2 PATCHES HG126_PATCH
gi|698040070|ref|NW_009646195.1| Homo sapiens chromosome 1 genomic patch of type FIX, GRCh38.p2 PATCHES HG2058_PATCH

Not finding hits versus not finding a result on the query are different errors, the tag is valid. No results means the entrez query was valid and blast didn't find anything in the sequences identified by the entrez query. The warning you got comes up when the entrez query is invalid or doesn't match anything in the database. No results is not 'basically the same' as getting a warning.

The difference seems to be the default programs run by blastn command line versus what BioPython does. Unless you specify the -task option, blastn will run megablast by default.

When I run blastn -remote I don't find any hits, this means searching with megablast didn't find anything. If I tell blastn to search with blastn (blastn -task=blastn), I get results:

Command:

blastn -task=blastn -db refseq_genomic -query npc1.fasta -entrez_query "GCF_000001405.28[assembly]" -remote

Results:

Sequences producing significant alignments:                          (Bits)  Value
ref|NW_011332701.1|  Homo sapiens chromosome 15 genomic patch of ...  42.8    0.008
ref|NW_009646197.1|  Homo sapiens chromosome 3 genomic patch of t...  39.2    0.096
ref|NW_009646199.1|  Homo sapiens chromosome 5 genomic patch of t...  39.2    0.096
ref|NW_011332698.1|  Homo sapiens chromosome 13 genomic patch of ...  35.6    1.2
ref|NW_009646204.1|  Homo sapiens chromosome 12 genomic patch of ...  35.6    1.2
ref|NW_011332691.1|  Homo sapiens chromosome 3 genomic patch of t...  33.7    4.1
ref|NW_009646195.1|  Homo sapiens chromosome 1 genomic patch of t...  33.7    4.1

These appear to be exactly the same as what is returned by BioPython.

I'm guessing that refseq_genomic and refseq both return no results because refseq_genomic might be a subset of refseq.

ADD REPLY
0
Entering edit mode

According to the NCBI page I linked to earlier, the fields "refseq" and "assembly" do not exist . I think if the entrez_query does not have any allowed keywords, it is doing a free text search. You can test this by using just "GCF_000001405.28" and you will get the same results as using "GCF_000001405.28[assembly]" or "GCF_000001405.28[refseq]".

You are correct that using the default "megablast" does not produce any hits but using "blastn" produces hits. This is even more confusing. We are searching a human nucleotide sequence against its own genome (intra-species search). From the BLAST program selection guide on page 3 [ PDF link ]

**megablast**: for sequence identification, intra‐species comparison
**discontiguous megablast**: for cross‐species comparison, searching with coding sequences                                                                          
**blastn**: for searching with shorter queries, cross‐species comparison

So, the default 'megablast' search should produce results. Also, if you look at the matches you posted (using blastn), all those are weak hits of shorter regions and not a single hit from Chromosome 18 (the chromosomal location of your query sequence). I think because of the entrez_query option, it is searching a very small subset of the refseq_genomic database and it produces unintended results.

EDIT: If I do a Entrez search at NCBI website for "GCF_000001405.28", it gives 31 sequences and the BLAST hits you posted are from these 31 sequences. So, it is clearly not searching the entire human genome from the refseq_genomic database.

ADD REPLY
0
Entering edit mode

No, you can have a human sequence not match well enough for megablast to find hits against it in a chromosome sequence, i.e. mRNAs.

I was searching with a human mRNA sequence and megablast didn't find any hits. I'd imagine discontiguous megablast would work.

ADD REPLY
0
Entering edit mode

Yes, the list is the same, but Siva's answer is more precise. Anyway, I really have appreciated your efforts.

ADD REPLY
0
Entering edit mode

Yes, mRNA is a different story. Just to be absolutely clear, the reason why we did not find hits is not because of using megablast but restricting the search space using entrez_query. Both megablast and blastn without the entrez_query produces more or less the same results for your query.

Anyway, thanks for the discussion. I learned now more about the entrez_query and megablast.

ADD REPLY

Login before adding your answer.

Traffic: 1810 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6