Question

How to retrieve a single protein FASTA file for multiple species?

0

6.8 years ago

arsilan324 ▴ 90

Hi all,

We are trying to build a protein database for multiple organisms, e.g. E. coli, T. ferrooxidans, B. subtilis, etc. We want to use it to match our Orbitrap output, restricted to the species we detected through Illumina sequencing; that is approximately 400+ genera. Can you suggest a smart way of doing this? For example, could I provide the names of the organisms and retrieve a single FASTA file?

Thank you very much!

FASTA Protein Multiple_Species Database • 2.6k views

ADD COMMENT • link updated 6.8 years ago by Elisabeth Gasteiger ★ 2.4k • written 6.8 years ago by arsilan324 ▴ 90

0

You can use @5heikki's script here.

Concatenating the individual protein FASTA files into one large file afterwards should be a simple task.

Note: See the new answer/comment below.

ADD REPLY • link 6.8 years ago by GenoMax 147k

0

Running this code didn't generate any FASTA file, although both the list of species (species.txt) and assembly_summary.txt are in the same folder. Am I missing something?

ADD REPLY • link 6.8 years ago by arsilan324 ▴ 90

score 2 · Answer 1 · 2018-02-09

2

6.8 years ago

GenoMax 147k

Try this if you need RefSeq (modified version of @5heikki's code):

$ more species.txt 
Bifidobacterium adolescentis

$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt

$ IFS=$'\n'; for next in $(cat species.txt); do awk -v SPECIES=^"$next" 'BEGIN{FS="\t"}{if($8 ~ SPECIES){print $20}}' assembly_summary_refseq.txt | awk 'BEGIN{OFS=FS="/"}{print "wget "$0,$NF"_protein.faa.gz"}'; done | sh

Otherwise

$ wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt

$ IFS=$'\n'; for next in $(cat species.txt); do awk -v SPECIES=^"$next" 'BEGIN{FS="\t"}{if($8 ~ SPECIES){print $20}}' assembly_summary.txt | awk 'BEGIN{OFS=FS="/"}{print "wget "$0,$NF"_protein.faa.gz"}'; done | sh

You will get many strains etc. by this method. If you need very specific strains, you could run awk -F'\t' '{print $8,$9,$10}' assembly_summary.txt > species to list the organism name, infraspecific name and isolate columns, and keep only the entries you need.
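For readability, the one-liner above can also be sketched as a small function that only prints the download URLs, so they can be inspected before anything is fetched. This is a sketch, not the original code: the function name protein_urls is hypothetical, and it uses a literal prefix match on column 8 instead of the regex match, which avoids surprises when a species name contains regex characters.

```shell
# Print one protein FASTA URL per assembly whose organism name (column 8)
# starts with a name listed in the species file (one name per line).
# protein_urls is a hypothetical helper name, not part of any NCBI tool.
protein_urls() {
  local species_file=$1 summary_file=$2
  awk -F'\t' '
    FILENAME == ARGV[1] { if ($0 != "") names[$0]; next }  # first file: species names
    /^#/ || $20 == "" || $20 == "na" { next }              # skip headers and missing paths
    {
      for (n in names)
        if (index($8, n) == 1) {                           # literal prefix match
          k = split($20, parts, "/")                       # last path part is the file prefix
          print $20 "/" parts[k] "_protein.faa.gz"
          break
        }
    }
  ' "$species_file" "$summary_file"
}
```

Once the list looks right, it can be passed to wget, for example: protein_urls species.txt assembly_summary_refseq.txt | wget -i -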

ADD COMMENT • link 6.8 years ago by GenoMax 147k

0

Thanks!! This worked perfectly. I now have a list of files such as GCF_000164035.1_ASM16403v1_protein.faa.gz, and the next step would be to combine them. Can you guide me there as well? Thanks a lot!!! :)

ADD REPLY • link 6.8 years ago by arsilan324 ▴ 90

1

If you want the final data file uncompressed: zcat G*.gz > final.faa
If you want to keep the final data compressed: cat G*.gz > final.faa.gz
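As a quick sanity check after combining, it can help to count the FASTA headers in the merged file. A minimal sketch, assuming the downloaded files are gzip-compressed; combine_faa is a hypothetical helper name, and gzip -dc is used instead of zcat only because zcat behaves differently on some systems.

```shell
# Decompress all given .gz files into one FASTA file and print the number
# of sequences (header lines starting with ">") it contains.
# combine_faa is a hypothetical helper name used for illustration.
combine_faa() {
  local out=$1
  shift
  gzip -dc "$@" > "$out"
  grep -c '^>' "$out"
}
```

For example: combine_faa final.faa G*.gz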

ADD REPLY • link 6.8 years ago by GenoMax 147k

0

I have prepared another list, of archaea this time, but this command is not working. Is there a separate assembly summary for archaea?

ADD REPLY • link 6.8 years ago by arsilan324 ▴ 90

0

Post examples of names that are not working.

ADD REPLY • link 6.8 years ago by GenoMax 147k

0

Here are some examples: (1) Halodesulfurarchaeum formicicum, (2) Methanosphaera cuniculi.

The whole list can be seen here:

https://gold.jgi.doe.gov/organisms?Organism.Domain=ARCHAEAL&Organism.Type%20Strain=Yes&Organism.Active=Yes

ADD REPLY • link 6.8 years ago by arsilan324 ▴ 90

0

The first one should work: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/886/955/GCF_001886955.1_ASM188695v1/GCF_001886955.1_ASM188695v1_protein.faa.gz

The second does not have a RefSeq genome. You may have to try the second option (the GenBank genomes). These sometimes contain only the genomic sequence, without a protein file. https://www.ncbi.nlm.nih.gov/protein/?term=txid1077256[Organism:noexp]
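With a long species list, it may be easier to check up front which names have no entry in a given assembly summary. A minimal sketch, assuming the same tab-separated layout with the organism name in column 8; missing_species is a hypothetical helper name.

```shell
# Print the names from the species file (one per line) that match no
# organism name (column 8, literal prefix match) in the assembly summary.
# missing_species is a hypothetical helper name used for illustration.
missing_species() {
  awk -F'\t' '
    FILENAME == ARGV[1] { if ($0 != "") seen[$0] = 0; next }  # first file: species names
    /^#/ { next }                                             # skip header lines
    { for (n in seen) if (index($8, n) == 1) seen[n]++ }
    END { for (n in seen) if (seen[n] == 0) print n }
  ' "$1" "$2" | sort
}
```

For example: missing_species species.txt assembly_summary_refseq.txt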

ADD REPLY • link 6.8 years ago by GenoMax 147k

score 1 · Answer 2 · 2018-02-12

1

6.8 years ago

Elisabeth Gasteiger ★ 2.4k

If you are working with UniProt, you can retrieve the data programmatically as described here (with code examples): https://www.uniprot.org/help/api_downloading and https://www.uniprot.org/help/api_queries
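A rough sketch of what such a download loop could look like in the shell. The endpoint (rest.uniprot.org/uniprotkb/stream) and the organism_id query field are assumptions based on the current UniProt REST API and are not taken from this thread, so check them against the help pages above. The function names and the taxids.txt file (one NCBI taxonomy ID per line) are hypothetical.

```shell
# Build the FASTA download URL for all UniProtKB entries of one taxon ID.
# Assumption: the "stream" endpoint and the organism_id query field.
uniprot_fasta_url() {
  printf 'https://rest.uniprot.org/uniprotkb/stream?format=fasta&query=organism_id:%s\n' "$1"
}

# Append the FASTA records for every taxon ID in the given file
# (one ID per line) to a single output file.
uniprot_fetch_all() {
  local taxids_file=$1 out=$2
  : > "$out"                                   # start with an empty output file
  while IFS= read -r taxid; do
    [ -n "$taxid" ] || continue                # skip blank lines
    curl -fsS "$(uniprot_fasta_url "$taxid")" >> "$out"
  done < "$taxids_file"
}
```

For example: uniprot_fetch_all taxids.txt uniprot_all.fasta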

ADD COMMENT • link 6.8 years ago by Elisabeth Gasteiger ★ 2.4k