All you should need to do to download all bacteria in RefSeq in fasta format is:
ncbi-genome-download -F fasta bacteria
You can optionally restrict the assembly completeness level, for example with -l complete, and the output format can be changed with -F.
For a small test I've done the following:
ncbi-genome-download -l complete -F fasta --genus "Serratia" -v bacteria
You obviously don't need to specify a genus for your case. I stopped it after a few downloads so that it didn't take ages. Nevertheless, I ended up with a large file structure, and find ./Serratia -name "*.fna.gz" gave me:
./refseq/bacteria/GCF_000422085.1/GCF_000422085.1_ASM42208v1_genomic.fna.gz
./refseq/bacteria/GCF_001417865.2/GCF_001417865.2_ASM141786v2_genomic.fna.gz
./refseq/bacteria/GCF_000975245.1/GCF_000975245.1_ASM97524v1_genomic.fna.gz
./refseq/bacteria/GCF_001280365.1/GCF_001280365.1_ASM128036v1_genomic.fna.gz
./refseq/bacteria/GCF_001022215.1/GCF_001022215.1_ASM102221v1_genomic.fna.gz
./refseq/bacteria/GCF_000783915.2/GCF_000783915.2_ASM78391v2_genomic.fna.gz
./refseq/bacteria/GCF_002220515.1/GCF_002220515.1_ASM222051v1_genomic.fna.gz
./refseq/bacteria/GCF_001294565.1/GCF_001294565.1_ASM129456v1_genomic.fna.gz
./refseq/bacteria/GCF_000513215.1/GCF_000513215.1_DB11_genomic.fna.gz
./refseq/bacteria/GCF_000336425.1/GCF_000336425.1_ASM33642v1_genomic.fna.gz
./refseq/bacteria/GCF_000828775.1/GCF_000828775.1_ASM82877v1_genomic.fna.gz
./refseq/bacteria/GCF_001559135.2/GCF_001559135.2_ASM155913v2_genomic.fna.gz
There may be far fewer folders with actual genomes downloaded than there are folders in total, as the filtering for completeness is done in a separate step (IIRC), so you may have many folders which are empty except for some md5sums.
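If you want to see which folders ended up without a genome, something like the following might help. It is only a rough sketch: list_missing is a helper name I made up (it is not part of ncbi-genome-download), it assumes one folder per accession as in the listing above, and the demo tree uses invented accession names.

```shell
# list_missing is a hypothetical helper (not part of ncbi-genome-download):
# print every accession folder under $1 that holds no *_genomic.fna.gz file.
list_missing() {
    for d in "$1"/*/; do
        [ -d "$d" ] || continue   # skip the literal pattern if nothing matched
        if [ -z "$(find "$d" -name '*_genomic.fna.gz' | head -n 1)" ]; then
            basename "$d"
        fi
    done
}

# Demo on a throwaway tree; the two accession names are made up.
demo=$(mktemp -d)
mkdir -p "$demo/GCF_000000001.1" "$demo/GCF_000000002.1"
: > "$demo/GCF_000000001.1/GCF_000000001.1_demo_genomic.fna.gz"
: > "$demo/GCF_000000001.1/MD5SUMS"
: > "$demo/GCF_000000002.1/MD5SUMS"
list_missing "$demo"   # prints GCF_000000002.1
```

On the real download you would point it at the folder that holds the accession directories, e.g. list_missing ./refseq/bacteria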
Anyway, to extract all your files you can simply do:
find ./ -name "*.fna.gz" -exec gunzip -v {} \; # -v is optional
If you have lots to do, you may want to do the extractions in parallel, in which case you can look into using find | xargs or parallel, but I'll leave that to you to research.
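As a starting point, a find | xargs version could look roughly like this. Treat it as a sketch under a few assumptions: extract_parallel is a name I invented, the default of 4 jobs is arbitrary, and -P and -0 are extensions found in GNU and BSD xargs rather than something every xargs is guaranteed to support.

```shell
# extract_parallel is a hypothetical helper: gunzip every *.fna.gz under $1,
# running up to $2 gunzip processes at once (default 4, an arbitrary choice).
extract_parallel() {
    dir="$1"
    jobs="${2:-4}"
    # -print0 / -0 keep odd file names intact; -n 1 hands one file to each gunzip
    find "$dir" -name '*.fna.gz' -print0 | xargs -0 -n 1 -P "$jobs" gunzip
}

# Demo on two tiny throwaway FASTA files
demo=$(mktemp -d)
mkdir -p "$demo/GCF_A" "$demo/GCF_B"
printf '>seq1\nACGT\n' > "$demo/GCF_A/a_genomic.fna"
printf '>seq2\nTTGA\n' > "$demo/GCF_B/b_genomic.fna"
gzip "$demo/GCF_A/a_genomic.fna" "$demo/GCF_B/b_genomic.fna"
extract_parallel "$demo" 2
find "$demo" -name '*.fna' | sort
```

For the real data you would call it on the download folder instead of the demo directory, and pick the job count to suit your disk and CPU.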
I then get the following from another find ./ -name "*.fna" command:
./GCF_000422085.1/GCF_000422085.1_ASM42208v1_genomic.fna
./GCF_001417865.2/GCF_001417865.2_ASM141786v2_genomic.fna
./GCF_000975245.1/GCF_000975245.1_ASM97524v1_genomic.fna
./GCF_001280365.1/GCF_001280365.1_ASM128036v1_genomic.fna
./GCF_001022215.1/GCF_001022215.1_ASM102221v1_genomic.fna
./GCF_000783915.2/GCF_000783915.2_ASM78391v2_genomic.fna
./GCF_002220515.1/GCF_002220515.1_ASM222051v1_genomic.fna
./GCF_001294565.1/GCF_001294565.1_ASM129456v1_genomic.fna
./GCF_000513215.1/GCF_000513215.1_DB11_genomic.fna
./GCF_000336425.1/GCF_000336425.1_ASM33642v1_genomic.fna
./GCF_000828775.1/GCF_000828775.1_ASM82877v1_genomic.fna
./GCF_001559135.2/GCF_001559135.2_ASM155913v2_genomic.fna
That's all you need to do to get the files extracted in fasta format (or whatever format you chose at the start, just swap the file extensions accordingly).
I don't think you actually downloaded the data files. Those appear to be just the md5sums for the files. I don't know how fast your internet connection is, but downloading all of bacterial RefSeq should need hundreds of GB of storage and will take a significant amount of time.
Can you show us the output of ls -lh * ?
OK, the whole size of the directory which has the bacteria is 26G. I found it with "du -h path/to/the/top/of/directory". When I used the command you mentioned I got the information below, which shows that the RefSeq bacteria haven't been downloaded. My connection speed is pretty good, so I really don't know what is happening. I'm not even sure that the command "ncbi-genome-download bacteria" is working properly.
Here is part of the output from ls -lh *
Those are (from this help page)
I don't recollect exactly what that file has in it. You can try to gunzip one and see.
I think you should have used the following command (adjust the reference part if you truly need all bacteria, and check the GitHub manual page for ncbi-genome-download carefully).
I am going to suggest that you run this as a test first before you try to redownload the entire bacterial set.
Thanks, but it seems this command
is also not working well. Only 120 fasta files were downloaded, while there are many more here: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/
Did you check all the options there are for that command on the GitHub page? If you need everything then go with
I used this initially before trying the command below:
And sometimes it doesn't even give me the _genomic.gbff.gz files. I really don't understand what is going wrong!
Are you trying to extract the 'cds_from_genomic.fna.gz' and the non CDS?
I am trying to get only the non-CDS files, which look something like "GCF_000007725.1_ASM772v1_genomic.fna.gz". That is just an example from one of the downloaded bacteria directories...
Try using the grep option -r [recursive], because based on your results from ls -lh it looks like the files you are trying to extract may be in a subdirectory and you are not finding them. For example, grep -r "*file_pattern_to_find.fna.gz" > output.file
I used this: grep -r "*_genomic.fna.gz" > output.file and got 'grep: input file ‘output.file’ is also the output', and nothing is in output.file.
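A note on that error: grep searches the contents of files, not their names, and with -r and no directory given it walks the current directory, which by then includes the output.file the shell has just created for the redirect. To list files by name, find (as used at the top of the thread) is probably the better fit. Here is a rough sketch that keeps only the whole-genome archives and skips the cds_from_genomic / rna_from_genomic ones; list_genomes is a name I made up, and the demo files are empty placeholders named after the example accession above.

```shell
# list_genomes is a hypothetical helper: list whole-genome FASTA archives
# under $1, leaving out the *_cds_from_genomic / *_rna_from_genomic files.
list_genomes() {
    find "$1" -type f -name '*_genomic.fna.gz' ! -name '*_from_genomic.fna.gz'
}

# Demo: empty placeholder files named after the accession quoted in the thread
demo=$(mktemp -d)
mkdir -p "$demo/GCF_000007725.1"
: > "$demo/GCF_000007725.1/GCF_000007725.1_ASM772v1_genomic.fna.gz"
: > "$demo/GCF_000007725.1/GCF_000007725.1_ASM772v1_cds_from_genomic.fna.gz"
: > "$demo/GCF_000007725.1/GCF_000007725.1_ASM772v1_rna_from_genomic.fna.gz"
list_genomes "$demo" > "$demo/output.file"
cat "$demo/output.file"
```

Writing output.file inside the searched directory is harmless here, because its name does not match the -name pattern.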