How to download all protein sequences of a bacterium using the NCBI FTP site?
6.2 years ago

How can I download the protein sequences for all complete genomes of Acinetobacter baumannii from the NCBI FTP site?

ncbi • 3.5k views
6.2 years ago
Joe 21k

See my answer here: A: How to extract Refseq of downloaded files from NCBI


OP needs to get *protein.faa.gz files since protein data is needed.
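
If you want to stay with the plain FTP site, a rough sketch of the usual route is to take the species' assembly_summary.txt, keep the complete genomes, and build the path of each *_protein.faa.gz from the ftp_path column. The column positions (11 = version_status, 12 = assembly_level, 20 = ftp_path) follow NCBI's published assembly summary layout. The function name protein_urls and the species directory in the usage comment are my own assumptions, so check them against the live site before relying on this.

```shell
# Sketch: list *_protein.faa.gz URLs for complete genomes.
# Assumes the assembly_summary.txt layout: column 11 = version_status,
# column 12 = assembly_level, column 20 = ftp_path.
# protein_urls is a hypothetical helper name, not an NCBI tool.

protein_urls() {
  # $1: path to a local copy of assembly_summary.txt
  awk -F '\t' '
    $11 == "latest" && $12 == "Complete Genome" && $20 != "na" {
      n = split($20, parts, "/")   # last path component is <accession>_<asm_name>
      print $20 "/" parts[n] "_protein.faa.gz"
    }' "$1"
}

# Example usage (needs network access; the species directory is an assumption):
#   curl -sO https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Acinetobacter_baumannii/assembly_summary.txt
#   protein_urls assembly_summary.txt > protein_urls.txt
#   wget -i protein_urls.txt
```

Header lines starting with # are skipped automatically because their twelfth field never equals "Complete Genome".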


OP, take a look at the help for ncbi-genome-download. Pass the option --format protein-fasta to get what you want.

(Or download the genome or CDS data and transform it yourself.)


I am running this command: ncbi-genome-download -l complete,chromosome bacteria --genus "Acinetobacter baumannii" --format protein-fasta

but this gives me an MD5SUMS file listing file names like this. I need the FASTA sequences.

260ac38772d1f9d98641f03bc5b07596  ./GCF_000018445.1_ASM1844v1_assembly_report.txt
d3b3df68700a410823ff5ab347294110  ./GCF_000018445.1_ASM1844v1_assembly_stats.txt
3c329eae370e70cb5fe3d318944ff2a9  ./GCF_000018445.1_ASM1844v1_cds_from_genomic.fna.gz
283123b31bc184dad8a5112758c3dac8  ./GCF_000018445.1_ASM1844v1_feature_count.txt.gz
3b2c5e5971cf64dec0cbc9b4105e4723  ./GCF_000018445.1_ASM1844v1_feature_table.txt.gz
21d351875d083b9d039e5152ee386b85  ./GCF_000018445.1_ASM1844v1_genomic.fna.gz
44f267ac471a1a751e007a77f2be976f  ./GCF_000018445.1_ASM1844v1_genomic.gbff.gz
8ab9cd32a6125e45e478315c4e933905  ./GCF_000018445.1_ASM1844v1_genomic.gff.gz
19a54b30d9fcdc1bdff15dd57d3ebe53  ./GCF_000018445.1_ASM1844v1_protein.faa.gz
f1216342941f7ec20fc52d35391c7a98  ./GCF_000018445.1_ASM1844v1_protein.gpff.gz
4ec4c663b32630249858689a42609eac  ./GCF_000018445.1_ASM1844v1_rna_from_genomic.fna.gz
f0cff22a6c824dc98013967ecfe8a418  ./GCF_000018445.1_ASM1844v1_translated_cds.faa.gz
b70f0ea964ce5c4f79deca5b287919f1  ./annotation_hashes.txt

The sequences should be included in the *.faa files.


The MD5sums are always provided. They correspond to the files you need which should be present in a folder named GCF_000....
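
Since the MD5SUMS file lists every file in the assembly directory but only the requested format is downloaded, a quick integrity check has to skip the missing entries. The sketch below assumes GNU md5sum (coreutils 8.25 or newer, for --ignore-missing); verify_md5sums is a hypothetical helper name, not part of ncbi-genome-download.

```shell
# Sketch: check every downloaded file against the MD5SUMS file next to it.
# Assumes GNU md5sum; --ignore-missing skips entries for files that were
# not downloaded. verify_md5sums is a hypothetical helper name.

verify_md5sums() {
  # $1: top-level download directory, for example refseq/bacteria
  find "$1" -type f -name MD5SUMS | (
    status=0
    while IFS= read -r sums; do
      dir=$(dirname "$sums")
      if ! (cd "$dir" && md5sum --check --quiet --ignore-missing MD5SUMS); then
        echo "checksum problem in $dir" >&2
        status=1
      fi
    done
    exit "$status"   # exit status of this subshell becomes the function's status
  )
}

# Example usage:
#   verify_md5sums refseq/bacteria && echo "all downloaded files verified"
```

A folder that contains only MD5SUMS and no data file is reported as a problem too, because md5sum finds nothing to verify there.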

Your command is also wrong. complete,chromosome is not one argument to the --assembly-level option. You should specify one or the other. Similarly, bacteria is a positional argument and should come last in the command.

Make sure you read the documentation on the GitHub page.

Try:

ncbi-genome-download -s refseq -l complete --genus "Acinetobacter baumannii" -v -F protein-fasta bacteria

or

ncbi-genome-download -s genbank -l complete --genus "Acinetobacter baumannii" -v -F protein-fasta bacteria

I got:

$  ls /refseq/bacteria

GCF_000018445.1:
GCF_000018445.1_ASM1844v1_protein.faa.gz  MD5SUMS

GCF_000021145.1:
GCF_000021145.1_ASM2114v1_protein.faa.gz  MD5SUMS

GCF_000021245.2:
GCF_000021245.2_ASM2124v2_protein.faa.gz  MD5SUMS

GCF_000069245.1:
GCF_000069245.1_ASM6924v1_protein.faa.gz  MD5SUMS

GCF_000186665.3:
GCF_000186665.3_ASM18666v4_protein.faa.gz  MD5SUMS

GCF_000187205.2:
GCF_000187205.2_ASM18720v4_protein.faa.gz  MD5SUMS
...
...

You may still end up with some empty folders, so you'll need to pull out all the FASTA files separately afterwards with something like: find ./ -name "*.faa.gz"
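
If the goal is one combined protein file, the find call can feed the decompression directly. This is only a sketch: collect_proteins and the output file name in the usage comment are hypothetical, and it assumes the files follow the *_protein.faa.gz naming shown in the listing above.

```shell
# Sketch: gather every downloaded protein FASTA into one uncompressed file.
# collect_proteins is a hypothetical helper name.

collect_proteins() {
  # $1: download directory, $2: output FASTA file
  find "$1" -type f -name '*_protein.faa.gz' -exec gzip -dc {} + > "$2"
}

# Example usage:
#   collect_proteins refseq/bacteria all_proteins.faa
#   grep -c '^>' all_proteins.faa    # number of protein records collected
```

Empty folders simply contribute nothing, so they do not need to be cleaned up first.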


Hi Joe, hope you are safe and well.

Why does this work:

ncbi-genome-download -n -s refseq bacteria --genera Zhihengliuella -l "complete,chromosome" -v -F "cds-fasta" --flat-output -p 4 -r 10

INFO: Using cached summary.
Considering the following 1 assemblies for download:
GCF_002848265.1 Zhihengliuella sp. ISTPL4   ISTPL4

And this script doesn't?

echo "Downloading genomes from NCBI"

input="bac_taxa.txt"

while IFS= read -r line
do
  mkdir $line
  cd $line
  echo "Downloading $line genomes from NCBI"
  ncbi-genome-download -n -s refseq bacteria --genera $line -l "complete,chromosome" -v -F "cds-fasta" --flat-output -p 4 -r 10

  cd ..
done < "$input"

Downloading Acidiphilium genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidipila genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidipropionibacterium genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidisarcina genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidisoma genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidisphaera genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidithiobacillus genomes from NCBI
Unsupported assembly level: cds-fasta
Downloading Acidithrix genomes from NCBI
Unsupported assembly level: cds-fasta

My list of genera (example):

Acidovorax
Acinetobacter
Acrocarpospora
Actibacterium
Actinoallomurus
Actinoalloteichus
Actinobacillus
Actinobacteria
actinobacterium
Actinobaculum
Actinocatenispora
Actinocorallia
Actinocrispum
Actinokineospora
Actinomadura

I used the same approach to download all the genomic FASTA files and it worked just fine! Thanks


I'm not at a computer to test this at the moment, but my guess would be that your loop isn't synthesising the command properly. It may be the quotes around cds-fasta. Check that your command is well formed, and introduce the flags one by one in the loop to narrow down the issue.
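
One way to follow that advice is to have the loop print each command before running anything. The thread does not record what the actual fix was, so the sketch below only applies general hygiene: quoted variables, stripped Windows line endings, skipped blank lines, and the positional bacteria argument last. run_cmd, download_genera and the RUN variable are hypothetical names; the flags are the ones from the post above plus -o to set the output folder.

```shell
# Sketch: dry-run loop over a list of genera.
# By default each command is only printed; set RUN=1 to execute it.
# run_cmd, download_genera and RUN are hypothetical names.

run_cmd() {
  if [ "${RUN:-0}" = 1 ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

download_genera() {
  # $1: text file with one genus per line
  while IFS= read -r genus || [ -n "$genus" ]; do
    genus=$(printf '%s' "$genus" | tr -d '\r')   # drop Windows line endings
    [ -n "$genus" ] || continue                  # skip blank lines
    run_cmd ncbi-genome-download -s refseq -l complete,chromosome \
      -F cds-fasta --flat-output -p 4 -r 10 \
      -o "$genus" --genera "$genus" bacteria
  done < "$1"
}

# Example usage:
#   download_genera bac_taxa.txt          # print the commands only
#   RUN=1 download_genera bac_taxa.txt    # actually download
```

Comparing the printed line with the command that works interactively should show quickly whether an argument has shifted position.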


Yeah that worked. Thanks man. Paulo
