Question

How to extract Refseq of downloaded files from NCBI


6.7 years ago

Shelle ▴ 30

I have downloaded all bacterial RefSeq genomes from the NCBI website. I am interested only in the gzipped FASTA files, such as:

ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Abditibacterium_utsteinense/latest_assembly_versions/GCF_002973605.1_ASM297360v1/GCF_002973605.1_ASM297360v1_genomic.fna.gz

The files I got from running "ncbi-genome-download bacteria" look like this:

d3d4a4c01a15dee5a054b38a3178bf12  ./GCF_000007725.1_ASM772v1_assembly_report.txt
c132f1a3ba2b00383f2a1d92e4460e2b  ./GCF_000007725.1_ASM772v1_assembly_stats.txt
7a2f6dc85caefaf326362077f72bb1ad  ./GCF_000007725.1_ASM772v1_cds_from_genomic.fna.gz
7e65c3da25f5a35d8a7860d6c478bf67  ./GCF_000007725.1_ASM772v1_feature_count.txt.gz
2d82d4315ca7a2004a3b03bc55aa42af  ./GCF_000007725.1_ASM772v1_feature_table.txt.gz
576cc643ef00d289009c95518f3792f5  ./GCF_000007725.1_ASM772v1_genomic.fna.gz
5a491b9ae2550dd9b6379e4f9054c4a2  ./GCF_000007725.1_ASM772v1_genomic.gbff.gz
25b139d63e6cd46484ac27daa8532b79  ./GCF_000007725.1_ASM772v1_genomic.gff.gz

. . . Can someone tell me how to extract only the files whose names end in "_genomic.fna.gz"? I have tried commands like the one below:

 find path/to/my/current directory -name "*_genomic.fna.gz"

But it returns nothing, even though files with that name are listed in the directory. Does anyone have another solution or suggestion?

sequence assembly FASTA • 7.9k views

ADD COMMENT • link updated 6.7 years ago by Joe 22k • written 6.7 years ago by Shelle ▴ 30


I don't think you actually downloaded the data files. Those appear to be just the md5sums for the files. I don't know how fast your internet connection is, but downloading all of bacterial RefSeq should need hundreds of GB of storage and will take a significant amount of time.

Can you show us the output of ls -lh *?
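To illustrate the difference between a checksum listing and the data it describes, here is a minimal sketch. It assumes GNU coreutils `md5sum` is available, and the file name `GCF_demo_genomic.fna.gz` is a made-up stand-in created in a temporary directory, not a real assembly.

```shell
# Sketch only: build a mock download and verify it against an MD5SUMS-style listing.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical data file standing in for a real *_genomic.fna.gz download.
printf '>seq1\nACGT\n' | gzip -c > GCF_demo_genomic.fna.gz

# Write a listing in the same "<md5>  ./<name>" layout shown in the question.
md5sum ./GCF_demo_genomic.fna.gz > MD5SUMS

# The listing holds only hashes and file names, no sequence data.
cat MD5SUMS

# Re-hash every file named in the listing and compare with the stored value.
md5sum -c MD5SUMS
check_status=$?
echo "md5sum -c exit status: $check_status"
```

If a file named in the listing is absent from the directory, `md5sum -c` reports it as failed to open, which is a quick way to see that only the checksums arrived.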

ADD REPLY • link 6.7 years ago by GenoMax 151k


OK, the total size of the directory holding the bacteria is 26G; I found that with "du -h path/to/the/top/of/directory". Running the command you mentioned gave the output below, which shows that the RefSeq bacteria haven't been downloaded. My connection speed is pretty good, so I really don't know what is happening. I'm not even sure that "ncbi-genome-download bacteria" is working properly.

Here is part of the output from ls -lh *:

GCF_900478395.1:
total 3.2M
-rw-rw-r--. 1 ... 3.2M Aug 31 01:07 GCF_900478395.1_32135_B01_genomic.gbff.gz
-rw-rw-r--. 1 ... 1.1K Aug 31 01:07 MD5SUMS

GCF_900478415.1:
total 1.4M
-rw-rw-r--. 1 ... 1.4M Aug 31 01:06 GCF_900478415.1_35910_E02_genomic.gbff.gz
-rw-rw-r--. 1 ... 1.1K Aug 31 01:06 MD5SUMS

GCF_900478715.1:
total 2.2M
-rw-rw-r--. 1 ... 2.2M Aug 31 01:07 GCF_900478715.1_31885_B02_genomic.gbff.gz
-rw-rw-r--. 1 ... 1.1K Aug 31 01:07 MD5SUMS

GCF_900478735.1:
total 1.2M
-rw-rw-r--. 1 ... 1.2M Aug 31 01:05 GCF_900478735.1_33763_D01_genomic.gbff.gz
-rw-rw-r--. 1 ... 1.1K Aug 31 01:05 MD5SUMS

GCF_900478755.1:
total 2.5M
-rw-rw-r--. 1 ... 2.5M Aug 31 01:05 GCF_900478755.1_32473_C02_genomic.gbff.gz
-rw-rw-r--. 1 ... 1.1K Aug 31 01:05 MD5SUMS

GCF_900492165.1:
total 1.2M
-rw-rw-r--. 1 ... 1.2M Aug 31 01:06 GCF_900492165.1_chr1_genomic.gbff.gz
-rw-rw-r--. 1 ...  970 Aug 31 01:06 MD5SUMS

GCF_900492555.1:
total 1.9M
-rw-rw-r--. 1 ... 1.9M Aug 31 01:06 GCF_900492555.1_CECT9104_genomic.gbff.gz
-rw-rw-r--. 1 ... 1018 Aug 31 01:06 MD5SUMS

ADD REPLY • link 6.7 years ago by Shelle ▴ 30


Those are (from this help page):

GenBank flat file format of the genomic sequence(s) in the assembly. This file includes both the genomic sequence and the CONTIG description (for CON records), hence, it replaces both the .gbk & .gbs format files that were provided in the old genomes FTP directories.

I don't recollect exactly what that file has in it. You can try to gunzip one and see.
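One way to look inside such a file without unpacking it on disk is to stream it through `gzip -dc`. The sketch below uses a tiny mock record written to a temporary directory; the file name `demo_genomic.gbff.gz` and its contents are invented for illustration only.

```shell
# Sketch only: peek at the start of a gzipped flat file without extracting it.
workdir=$(mktemp -d)

# Hypothetical GenBank-style record standing in for a real *_genomic.gbff.gz file.
printf 'LOCUS       DEMO_0001   8 bp   DNA\nORIGIN\n        1 acgtacgt\n//\n' \
  | gzip -c > "$workdir/demo_genomic.gbff.gz"

# Decompress to stdout and keep only the first line; the .gz stays on disk as-is.
first_line=$(gzip -dc "$workdir/demo_genomic.gbff.gz" | head -n 1)
echo "$first_line"
```

On a real download, raising the `head -n` count to a few dozen lines should be enough to tell an annotated GenBank record from a plain FASTA file.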

I think you should have used the following command (adjust the reference part if you truly need all bacteria, and check the GitHub manual page for ncbi-genome-download carefully):

ncbi-genome-download --format fasta --refseq-category reference bacteria

I suggest running this as a test first, before you try to re-download the entire bacterial set:

ncbi-genome-download --format fasta --genus "Streptomyces coelicolor" bacteria

ADD REPLY • link 6.7 years ago by GenoMax 151k


Thanks, but it seems this command

ncbi-genome-download --format fasta --refseq-category reference bacteria

is also not working well. Only 120 FASTA files were downloaded, while there are many more here: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/

ADD REPLY • link 6.7 years ago by Shelle ▴ 30


Did you check all the options there are for that command on the GitHub page? If you need everything then go with

ncbi-genome-download --format fasta bacteria

ADD REPLY • link 6.7 years ago by GenoMax 151k


I used this initially before trying the command below:

ncbi-genome-download --format fasta --refseq-category reference bacteria

And sometimes it doesn't even give me the _genomic.gbff.gz files. I really don't understand what is going wrong!

ADD REPLY • link 6.7 years ago by Shelle ▴ 30


Are you trying to extract the 'cds_from_genomic.fna.gz' and the non CDS?

ADD REPLY • link 6.7 years ago by ddeemer ▴ 10


I am trying to get only the non-CDS files, such as "GCF_000007725.1_ASM772v1_genomic.fna.gz". I just gave one of the downloaded bacteria directories as an example...

ADD REPLY • link 6.7 years ago by Shelle ▴ 30


Try using the grep option -r [recursive], because based on your results from ls -lh it looks like the files you are trying to extract are in a subdirectory and you are not finding them. For example, grep -r "*file_pattern_to_find.fna.gz" > output.file

ADD REPLY • link 6.7 years ago by ddeemer ▴ 10


I used this: grep -r "*_genomic.fna.gz" > output.file and got this: 'grep: input file ‘output.file’ is also the output', and nothing is in output.file.
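A likely reason the grep attempt comes back empty is that grep -r searches the contents of files, whereas find -name matches the file names themselves. The warning appears because output.file is created inside the tree that grep is reading. The sketch below shows the difference on a mock directory; every path and name in it (GCF_demo.1 and so on) is hypothetical.

```shell
# Sketch only: compare a content search (grep -r) with a name search (find -name).
workdir=$(mktemp -d)
acc_dir="$workdir/refseq/bacteria/GCF_demo.1"
mkdir -p "$acc_dir"

# Mock FASTA download plus a checksum listing that mentions it by name.
printf '>seq1\nACGT\n' | gzip -c > "$acc_dir/GCF_demo.1_genomic.fna.gz"
printf 'abc123  ./GCF_demo.1_genomic.fna.gz\n' > "$acc_dir/MD5SUMS"

# grep -rl lists files whose CONTENTS contain the string: here that is MD5SUMS.
grep_hits=$(grep -rl "_genomic.fna.gz" "$workdir")
echo "grep matched: $grep_hits"

# find -name lists files whose NAME matches the glob: here that is the FASTA file.
find_hits=$(find "$workdir" -type f -name "*_genomic.fna.gz")
echo "find matched: $find_hits"
```

So for locating downloads by name, find (or ls with a glob) is the tool to reach for, and grep is only useful for searching inside text files such as MD5SUMS.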

ADD REPLY • link 6.7 years ago by Shelle ▴ 30

score 5 · Accepted Answer · 2018-09-02

All you should need to do to download all bacteria in RefSeq in fasta format is:

ncbi-genome-download -F fasta bacteria

You can optionally restrict the assembly completeness level with, for example, -l complete, and the output format can be changed with -F.

For a small test I've done the following:

ncbi-genome-download -l complete -F fasta --genus "Serratia" -v bacteria

You obviously don't need to specify a genus for your case. I stopped it after a few downloads so that it didn't take ages. Nevertheless I ended up with a large file structure and the result of find ./Serratia -name "*.fna.gz" gave me:

./refseq/bacteria/GCF_000422085.1/GCF_000422085.1_ASM42208v1_genomic.fna.gz
./refseq/bacteria/GCF_001417865.2/GCF_001417865.2_ASM141786v2_genomic.fna.gz
./refseq/bacteria/GCF_000975245.1/GCF_000975245.1_ASM97524v1_genomic.fna.gz
./refseq/bacteria/GCF_001280365.1/GCF_001280365.1_ASM128036v1_genomic.fna.gz
./refseq/bacteria/GCF_001022215.1/GCF_001022215.1_ASM102221v1_genomic.fna.gz
./refseq/bacteria/GCF_000783915.2/GCF_000783915.2_ASM78391v2_genomic.fna.gz
./refseq/bacteria/GCF_002220515.1/GCF_002220515.1_ASM222051v1_genomic.fna.gz
./refseq/bacteria/GCF_001294565.1/GCF_001294565.1_ASM129456v1_genomic.fna.gz
./refseq/bacteria/GCF_000513215.1/GCF_000513215.1_DB11_genomic.fna.gz
./refseq/bacteria/GCF_000336425.1/GCF_000336425.1_ASM33642v1_genomic.fna.gz
./refseq/bacteria/GCF_000828775.1/GCF_000828775.1_ASM82877v1_genomic.fna.gz
./refseq/bacteria/GCF_001559135.2/GCF_001559135.2_ASM155913v2_genomic.fna.gz

There may be many fewer results with actual genomes downloaded than the number of folders, as the filtering for completeness is done in a separate step (IIRC), so you may have many folders which are empty except for some md5sums.
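To see which accession folders ended up without a genome, something along these lines may help. It is a sketch run against a mock layout in a temporary directory: the folder names GCF_demo_A.1 and GCF_demo_B.1 and the missing.txt report file are invented for the example, and -maxdepth assumes GNU or BSD find.

```shell
# Sketch only: report accession folders that contain no *_genomic.fna.gz file.
workdir=$(mktemp -d)
mkdir -p "$workdir/GCF_demo_A.1" "$workdir/GCF_demo_B.1"

# Folder A has a mock genome and checksums; folder B has checksums only.
printf '>seq1\nACGT\n' | gzip -c > "$workdir/GCF_demo_A.1/GCF_demo_A.1_genomic.fna.gz"
touch "$workdir/GCF_demo_A.1/MD5SUMS" "$workdir/GCF_demo_B.1/MD5SUMS"

report="$workdir/missing.txt"
: > "$report"

for dir in "$workdir"/*/; do
  # "grep -q ." succeeds only if find printed at least one matching path.
  if ! find "$dir" -maxdepth 1 -type f -name "*_genomic.fna.gz" | grep -q .; then
    basename "$dir" >> "$report"
  fi
done

echo "folders without a FASTA file:"
cat "$report"
```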

Anyway, to extract all your files you can simply do:

 find ./ -name "*.fna.gz" -exec gunzip -v {} \;   # -v is optional

If you have lots to do, you may want to do the extractions in parallel, in which case you can look into using find | xargs or parallel, but I'll leave that to you to research.
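As a starting point for the parallel route, here is one possible find | xargs arrangement, tried on four mock files in a temporary directory. The file names are hypothetical, the job count of 4 is an arbitrary choice, and -print0, -0 and -P are extensions found in GNU and BSD tools rather than strict POSIX.

```shell
# Sketch only: decompress many .fna.gz files with several gunzip jobs at once.
workdir=$(mktemp -d)
for i in 1 2 3 4; do
  mkdir -p "$workdir/GCF_demo_$i"
  printf '>seq%s\nACGT\n' "$i" | gzip -c > "$workdir/GCF_demo_$i/GCF_demo_${i}_genomic.fna.gz"
done

# -print0 / -0 pass names NUL-separated so unusual characters survive;
# -n 1 hands one file to each gunzip call; -P 4 runs up to four calls at a time.
find "$workdir" -type f -name "*.fna.gz" -print0 | xargs -0 -n 1 -P 4 gunzip

# Count what is left in each state (tr strips padding some wc versions add).
n_fasta=$(find "$workdir" -type f -name "*.fna" | wc -l | tr -d ' ')
n_gz=$(find "$workdir" -type f -name "*.fna.gz" | wc -l | tr -d ' ')
echo "extracted: $n_fasta, still compressed: $n_gz"
```

Setting -P to roughly the number of CPU cores is a common rule of thumb, though on a slow disk the extraction may be limited by I/O rather than by CPU.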

I then get the following from another find ./ -name "*.fna" command:

./GCF_000422085.1/GCF_000422085.1_ASM42208v1_genomic.fna
./GCF_001417865.2/GCF_001417865.2_ASM141786v2_genomic.fna
./GCF_000975245.1/GCF_000975245.1_ASM97524v1_genomic.fna
./GCF_001280365.1/GCF_001280365.1_ASM128036v1_genomic.fna
./GCF_001022215.1/GCF_001022215.1_ASM102221v1_genomic.fna
./GCF_000783915.2/GCF_000783915.2_ASM78391v2_genomic.fna
./GCF_002220515.1/GCF_002220515.1_ASM222051v1_genomic.fna
./GCF_001294565.1/GCF_001294565.1_ASM129456v1_genomic.fna
./GCF_000513215.1/GCF_000513215.1_DB11_genomic.fna
./GCF_000336425.1/GCF_000336425.1_ASM33642v1_genomic.fna
./GCF_000828775.1/GCF_000828775.1_ASM82877v1_genomic.fna
./GCF_001559135.2/GCF_001559135.2_ASM155913v2_genomic.fna

That's all you need to do to get the files extracted in fasta format (or whatever format you chose at the start, just swap the file extensions accordingly).