I looked at the BLAST FTP database directory. There are 18 nt files, each smaller than 800 MB, while refseq_genome has 83 files, most of them larger than 800 MB, which means refseq_genome is much larger than the nt database. However, when I looked up the definition of nt on http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml, it says the nt database includes: All GenBank + RefSeq Nucleotides + EMBL + DDBJ + PDB sequences (excluding HTGS0,1,2, EST, GSS, STS, PAT, WGS). No longer "non-redundant".
My questions are:
- In my understanding, RefSeq Nucleotides should include refseq_genome and refseq_rna, so refseq_genome should be much smaller than the nt database. Why is refseq_genome alone much larger than the whole nt database?
- I took one accession number, NZ_AARG01000001.1, from a RefSeq bacterial genome and ran blastn against the nt and refseq_genome databases. For nt, the search took a few seconds and returned fewer than 10 hits. For refseq_genome, it took more than 10 minutes and returned more than 100 results (all of the accession numbers began with NZ). I then looked up NZ and found that it denotes a project that is not completed. So is the difference between nt and refseq_genome that nt doesn't include NZ records?
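One way to check which record types dominate a hit list like the one described above is to tally the accession prefixes of the subject sequences. The following is only a minimal sketch: it assumes the hits were saved in BLAST tabular format with the subject accession in the second column as a bare accession (no `gi|...|` wrapper), and every accession in `example_hits` other than NZ_AARG01000001.1 is a made-up placeholder, not a real record.

```python
from collections import Counter


def accession_prefix(accession: str) -> str:
    """Return a RefSeq-style prefix such as 'NZ' or 'NC', or 'other' if none is found."""
    head, sep, _ = accession.partition("_")
    # RefSeq accessions start with two uppercase letters followed by an underscore.
    if sep and len(head) == 2 and head.isalpha() and head.isupper():
        return head
    return "other"


def count_prefixes(tabular_lines):
    """Tally subject accession prefixes from tab-separated BLAST hit lines."""
    counts = Counter()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not a hit line in the assumed layout
        counts[accession_prefix(fields[1])] += 1
    return counts


# Hypothetical hit lines: query ID, subject accession, percent identity.
example_hits = [
    "q1\tNZ_AARG01000001.1\t99.5",
    "q1\tNZ_PLACEHOLDER01.1\t98.0",
    "q1\tNC_PLACEHOLDER.1\t97.2",
    "q1\tPLACEHOLDER1.1\t95.0",
]

if __name__ == "__main__":
    for prefix, n in count_prefixes(example_hits).most_common():
        print(prefix, n)
```

Running the same tally on the nt results and on the refseq_genome results would show directly whether NZ records appear in one set and not the other.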
Hi, I just wonder how you get the information of
And also the number of bases? Thanks.
The summary information for the databases comes from NCBI's BLAST service: the database help (the '?' icon next to the database selection) shows the details of each database. The number of bases in a database comes from the summary information included in the BLAST search results for that database. Where this appears depends on the output format; on NCBI's BLAST service it is in the "Search Summary" section of the default HTML result.
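For a locally downloaded database, a similar summary can be printed with the BLAST+ command `blastdbcmd -db <name> -info`, and the counts can be pulled out of that text with a short script. This is a sketch under assumptions: the exact wording of the summary line ("N sequences; M total bases", or "residues"/"letters" in other contexts) is assumed here rather than guaranteed, and the numbers in `sample` are invented for illustration.

```python
import re

# Assumed shape of the summary line: "1,234 sequences; 5,678,901 total bases".
SUMMARY_RE = re.compile(
    r"([\d,]+)\s+sequences;\s+([\d,]+)\s+total\s+(?:bases|residues|letters)"
)


def parse_db_summary(text: str):
    """Return (number_of_sequences, total_letters) from a BLAST database summary."""
    match = SUMMARY_RE.search(text)
    if match is None:
        raise ValueError("no database summary line found")
    sequences, letters = (int(group.replace(",", "")) for group in match.groups())
    return sequences, letters


# Invented sample text in the assumed format; not real database statistics.
sample = """Database: example_db
           1,234 sequences; 5,678,901 total bases
"""

if __name__ == "__main__":
    n_seqs, n_bases = parse_db_summary(sample)
    print(f"{n_seqs} sequences, {n_bases} bases")
```

Comparing the base counts parsed this way is a more direct measure of database size than counting the files on the FTP site.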
Thank you. This helps me a lot.
I have a follow-up question. What would it look like if we drew a Venn diagram showing the relationship among the nt database, the RefSeq genome sequences, and the RefSeq representative sequences? Thanks.