9.7 years ago
jeremy.cox.2
I set out to download and compile the complete refseq bacteria database.
I downloaded the *.genomic.fna.gz files from ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/
After decompression, the files total ~100 GB, whereas my nt bacterial database is only 12 GB, and I expected refseq to be smaller than nt. So I think I have misunderstood what I need to download.
Can you help me figure out which files I actually want, or how to tell which ones they are? I am doing the same for viruses and fungi.
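One way to sanity-check a download before committing to it is to total the sizes of only the files matching the pattern of interest. The following is a minimal sketch, assuming you already have a directory listing as (name, size in bytes) pairs from your FTP client; the function name `total_size_gb` and the example listing are hypothetical, and the file names and sizes are invented purely for illustration.

```python
from fnmatch import fnmatch


def total_size_gb(listing, pattern="*.genomic.fna.gz"):
    """Return (number of matching files, their combined size in GB).

    `listing` is an iterable of (file_name, size_in_bytes) pairs.
    """
    matched = [size for name, size in listing if fnmatch(name, pattern)]
    return len(matched), sum(matched) / 1e9


# Hypothetical listing: names and sizes are made up, not real NCBI data.
example_listing = [
    ("bacteria.1.1.genomic.fna.gz", 250_000_000),
    ("bacteria.1.protein.faa.gz", 90_000_000),
    ("bacteria.2.1.genomic.fna.gz", 240_000_000),
]

count, size_gb = total_size_gb(example_listing)
print(f"{count} files, {size_gb:.2f} GB compressed")
```

Note that this only totals compressed sizes; the decompressed FASTA will be larger.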
This seems strange. Are you sure the bacterial nt database was downloaded accurately?
nt is non-redundant; refseq genomic is not. Just check how huge the refseq_genomic blast db is in comparison to nt (26 tar.gz vs 152 tar.gz files).
Yes, I downloaded the nt database correctly. I downloaded a single compressed "nt.fa" file.
It would seem that the total size of the refseq_genomic blast db is ~26 GB, so clearly I have downloaded the WRONG FILES.
The question is: how do I know which files are the correct ones?
Are you sure it's just 26 GB? On the FTP site, there are 152 refseq_genomic tar.gz files of ~0.9 GB each. That is already ~136 GB compressed, and uncompressing them surely makes it even bigger.
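The estimate above is just the file count times the approximate size per file. A quick sketch of that arithmetic, using only the two figures quoted in this thread (152 files, ~0.9 GB each); the function name is hypothetical, and the result is a lower bound because it covers the compressed archives only:

```python
def compressed_total_gb(n_files: int, avg_size_gb: float) -> float:
    """Approximate combined size of n_files archives of avg_size_gb each."""
    return n_files * avg_size_gb


# Figures quoted in the thread: 152 tar.gz files at roughly 0.9 GB apiece.
total = compressed_total_gb(152, 0.9)
print(f"~{total:.1f} GB compressed")
```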