Question

How to download all protein sequences of a bacterium using the NCBI FTP site?

6.3 years ago

sharmatina189059 ▴ 110

How can I download all protein sequences for the complete genome assemblies of Acinetobacter baumannii from the NCBI FTP site?

ncbi • 3.5k views


Joe · Answer 1 · 2018-09-13

6.3 years ago

Joe 21k

See my answer here: A: How to extract Refseq of downloaded files from NCBI


OP needs to get *protein.faa.gz files since protein data is needed.

Reply by GenoMax (6.3 years ago)
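Since the question asks about the FTP site directly, here is a minimal sketch of one way to build the *protein.faa.gz URLs from NCBI's assembly_summary.txt. The helper name protein_urls is hypothetical, and the column positions (8 = organism_name, 12 = assembly_level, 20 = ftp_path) are assumptions that should be checked against the header line of the file you download.

```shell
#!/usr/bin/env bash
# Sketch: print protein FASTA URLs for complete genomes of one species.
# Input: assembly_summary.txt (tab-separated) on stdin.
# Assumed columns: 8 = organism_name, 12 = assembly_level, 20 = ftp_path.

# protein_urls is a hypothetical helper, not part of any NCBI tool.
protein_urls() {
  local species="$1"
  awk -F '\t' -v sp="$species" '
    /^#/ { next }   # skip the commented header lines
    index($8, sp) == 1 && $12 == "Complete Genome" && $20 != "na" {
      n = split($20, parts, "/")   # last path element is the assembly directory
      print $20 "/" parts[n] "_protein.faa.gz"
    }'
}

# Example usage (needs network access, so it is left commented out):
# curl -s https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt \
#   | protein_urls "Acinetobacter baumannii" > urls.txt
# wget -i urls.txt
```

The filter matches on the start of the organism name, so strains such as "Acinetobacter baumannii ATCC 17978" are included.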

OP, take a look at the help for ncbi-genome-download. Pass the option --format protein-fasta to get what you want.

(Or download the genome or CDS data and transform it yourself.)

Reply by Joe (6.3 years ago)

I am running this command:

ncbi-genome-download -l complete,chromosome bacteria --genus "Acinetobacter baumannii" --format protein-fasta

but this gives me an MD5SUMS file listing file names like the ones below. I need the FASTA sequences.

260ac38772d1f9d98641f03bc5b07596  ./GCF_000018445.1_ASM1844v1_assembly_report.txt
d3b3df68700a410823ff5ab347294110  ./GCF_000018445.1_ASM1844v1_assembly_stats.txt
3c329eae370e70cb5fe3d318944ff2a9  ./GCF_000018445.1_ASM1844v1_cds_from_genomic.fna.gz
283123b31bc184dad8a5112758c3dac8  ./GCF_000018445.1_ASM1844v1_feature_count.txt.gz
3b2c5e5971cf64dec0cbc9b4105e4723  ./GCF_000018445.1_ASM1844v1_feature_table.txt.gz
21d351875d083b9d039e5152ee386b85  ./GCF_000018445.1_ASM1844v1_genomic.fna.gz
44f267ac471a1a751e007a77f2be976f  ./GCF_000018445.1_ASM1844v1_genomic.gbff.gz
8ab9cd32a6125e45e478315c4e933905  ./GCF_000018445.1_ASM1844v1_genomic.gff.gz
19a54b30d9fcdc1bdff15dd57d3ebe53  ./GCF_000018445.1_ASM1844v1_protein.faa.gz
f1216342941f7ec20fc52d35391c7a98  ./GCF_000018445.1_ASM1844v1_protein.gpff.gz
4ec4c663b32630249858689a42609eac  ./GCF_000018445.1_ASM1844v1_rna_from_genomic.fna.gz
f0cff22a6c824dc98013967ecfe8a418  ./GCF_000018445.1_ASM1844v1_translated_cds.faa.gz
b70f0ea964ce5c4f79deca5b287919f1  ./annotation_hashes.txt

Reply by sharmatina189059 (6.2 years ago; updated 6.2 years ago by Joe)

The sequences should be included in the *.faa files.

Reply by Sishuo Wang (6.2 years ago)

The MD5SUMS files are always provided. They correspond to the files you need, which should be present in a folder named GCF_000....

Your command is also wrong. complete,chromosome is not one argument to the --assembly-level option. You should specify one or the other. Also, bacteria is a positional argument and should come last in the command.

Make sure you read the documentation on the GitHub page.

Try:

ncbi-genome-download -s refseq -l complete --genus "Acinetobacter baumannii" -v -F protein-fasta bacteria

or

ncbi-genome-download -s genbank -l complete --genus "Acinetobacter baumannii" -v -F protein-fasta bacteria

I got:

$  ls /refseq/bacteria

GCF_000018445.1:
GCF_000018445.1_ASM1844v1_protein.faa.gz  MD5SUMS

GCF_000021145.1:
GCF_000021145.1_ASM2114v1_protein.faa.gz  MD5SUMS

GCF_000021245.2:
GCF_000021245.2_ASM2124v2_protein.faa.gz  MD5SUMS

GCF_000069245.1:
GCF_000069245.1_ASM6924v1_protein.faa.gz  MD5SUMS

GCF_000186665.3:
GCF_000186665.3_ASM18666v4_protein.faa.gz  MD5SUMS

GCF_000187205.2:
GCF_000187205.2_ASM18720v4_protein.faa.gz  MD5SUMS
...
...

You may still end up with some empty folders, so you'll need to pull out all the FASTA files separately afterwards with something like find ./ -name "*.faa.gz"

Reply by Joe (6.2 years ago)
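As a follow-up to the find suggestion above, here is a minimal sketch that gathers every downloaded protein file into one uncompressed FASTA. The helper name collect_faa and the output file name are hypothetical; the *_protein.faa.gz pattern is taken from the listing above.

```shell
#!/usr/bin/env bash
# Sketch: concatenate all *_protein.faa.gz files under a directory tree
# into a single uncompressed FASTA file. Empty folders are simply skipped
# because find returns no match for them.

# collect_faa is a hypothetical helper name.
collect_faa() {
  local root="$1" out="$2"
  # gzip -dc writes the decompressed content to stdout; find passes
  # all matching files to it, and the redirect collects everything.
  find "$root" -type f -name '*_protein.faa.gz' -exec gzip -dc {} + > "$out"
}

# Example usage (left commented out):
# collect_faa refseq/bacteria all_proteins.faa
# grep -c '^>' all_proteins.faa    # number of protein records
```

The order in which find visits the folders is not guaranteed, so the records are not sorted by assembly.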

Hi Joe, hope you are safe and well.

Why does this work:

ncbi-genome-download -n -s refseq bacteria --genera Zhihengliuella -l "complete,chromosome" -v -F "cds-fasta" --flat-output -p 4 -r 10

INFO: Using cached summary.
Considering the following 1 assemblies for download:
GCF_002848265.1 Zhihengliuella sp. ISTPL4   ISTPL4

And this script doesn't?

echo "Downloading genomes from NCBI"

input="bac_taxa.txt"

# Read one genus per line; "$line" is quoted so it is never word-split.
while IFS= read -r line
do
  mkdir -p "$line"
  cd "$line" || continue
  echo "Downloading $line genomes from NCBI"
  ncbi-genome-download -n -s refseq bacteria --genera "$line" -l "complete,chromosome" -v -F "cds-fasta" --flat-output -p 4 -r 10
  cd ..
done < "$input"

Downloading Acidiphilium genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidipila genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidipropionibacterium genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidisarcina genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidisoma genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidisphaera genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidithiobacillus genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidithrix genomes from NCBI
Unsupported assembly level: cds-fasta

My list of genera (example):

Acidovorax
Acinetobacter
Acrocarpospora
Actibacterium
Actinoallomurus
Actinoalloteichus
Actinobacillus
Actinobacteria
actinobacterium
Actinobaculum
Actinocatenispora
Actinocorallia
Actinocrispum
Actinokineospora
Actinomadura

I used the same approach to download all genomic FASTA files and it worked just fine! Thanks

Reply by psschlogl (4.0 years ago)

I'm not at a computer to test this at the moment, but my guess would be that your loop isn't building the command properly. It may be the quotes around cds-fasta. Check that your command is well formed, and introduce the flags one by one in the loop to narrow down the issue.
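One way to check whether the command is well formed is to build it as a bash array and print it before running anything. The sketch below does that; it does not identify the cause of the error above. The helper names build_cmd and preview_cmds are hypothetical, the flags are the ones already used in this thread, and stripping a trailing carriage return is only a guess at one thing that can corrupt lines read from a file (for example, a list saved with Windows line endings).

```shell
#!/usr/bin/env bash
# Sketch: build the ncbi-genome-download command as an array and print it
# (dry run), so every argument can be inspected exactly as the shell sees it.

# build_cmd (hypothetical helper) fills the global array CMD for one genus.
build_cmd() {
  local genus="${1%$'\r'}"   # assumption: drop a trailing CR if present
  CMD=(ncbi-genome-download -n -s refseq -l "complete,chromosome"
       -F cds-fasta --genera "$genus" --flat-output -p 4 -r 10 -v bacteria)
}

# preview_cmds (hypothetical helper) prints one shell-quoted command per genus.
preview_cmds() {
  local line
  while IFS= read -r line || [ -n "$line" ]; do
    [ -z "${line%$'\r'}" ] && continue   # skip blank lines
    build_cmd "$line"
    printf '%q ' "${CMD[@]}"             # %q shows hidden characters escaped
    printf '\n'
  done < "$1"
}

# Example usage (left commented out):
# preview_cmds bac_taxa.txt
# build_cmd Acinetobacter && "${CMD[@]}"   # run a single command for real
```

Running the array with "${CMD[@]}" passes each element as exactly one argument, which avoids quoting surprises inside the loop.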