Question

How to get fasta sequences for CDS if I have proteins IDs?

6.1 years ago

little_more ▴ 70

I have a large table where each column contains protein IDs of a particular group of orthologs. How do I map these protein IDs to gene IDs and then get a file with fasta sequences of all genes for each column?

sequence gene • 2.1k views

ADD COMMENT • link updated 6.1 years ago by h.mon 35k • written 6.1 years ago by little_more ▴ 70

Which organism? Can you use Ensembl/BioMart?

ADD REPLY • link 6.1 years ago by JC 13k

Let's put a bit more flesh on that bone: http://www.ensembl.org/biomart/martview/

ADD REPLY • link 6.1 years ago by Carambakaracho ★ 3.3k

Always provide a few examples when asking this type of question. Protein IDs could be anything, and the answer will depend on what kind they are.

NCBI's Unix utilities (Entrez Direct) would almost certainly work if the IDs are from GenBank.

ADD REPLY • link 6.1 years ago by GenoMax 147k

Oh, my bad. All IDs are from GenBank Escherichia genome assemblies (.faa files). For example, AAN78512.1, BAB33431.1, BAB33432.1.

P.S. I know that I can simply go to NCBI and get CDS for each protein manually but the question is how to do this for a large number of ID groups. I've heard something about EDirect but maybe there is a common way to do this with one line.

ADD REPLY • link 6.1 years ago by little_more ▴ 70

If you need all the CDSs for E. coli O157:H7, those are available here. If the IDs are from different genomes, then it is a different problem. Let me look into it some.

ADD REPLY • link 6.1 years ago by GenoMax 147k

IDs are from different genomes. In fact, I have a table with protein IDs:

           group1   group2   group3   group4   ... 
bac1          ID1      ID2      ID3      ID4
bac2          ID5      ID6      ID7      ID8
bac3          ID9     ID10     ID11     ID12
...

and I need to get a file with fasta sequences of CDS for each group. Suppose I have all .fna assembly files. Could I use BioPython to get the files?

ADD REPLY • link 6.1 years ago by little_more ▴ 70
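
For a table laid out like the one above, a first step is to collect the IDs column by column, so that each group ends up with its own list. A minimal Python sketch, assuming the table is whitespace-delimited with no empty cells, the header row holds only the group names, and the first field of every other row is the genome label. The function name `ids_by_group` is made up for illustration:

```python
def ids_by_group(table_text):
    """Return {group name: [protein IDs]} from a whitespace-delimited table."""
    lines = [line for line in table_text.splitlines() if line.strip()]
    groups = lines[0].split()  # header row: group names only (assumption)
    result = {group: [] for group in groups}
    for line in lines[1:]:
        fields = line.split()
        # fields[0] is the genome label (bac1, bac2, ...); the remaining
        # fields are assumed to line up one-to-one with the header.
        for group, protein_id in zip(groups, fields[1:]):
            result[group].append(protein_id)
    return result
```

Each list can then be passed to whatever fetches the CDS sequences, with the result written to one FASTA file per group. A table with missing cells would need a real delimiter (tab or comma) instead of plain whitespace splitting, otherwise the columns shift.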

score 0 · Answer 1 · 2018-10-17

6.1 years ago

h.mon 35k

Try:

efetch -db protein -format fasta_cds_na -id AAN78512

edit: works the same with:

efetch -db protein -format fasta_cds_na -id AAN78512.1

ADD COMMENT • link 6.1 years ago by h.mon 35k

Thank you! But is it possible to use the command for > 500 IDs? The documentation says 'a comma-delimited list of UIDs may be provided... but if more than about 200 UIDs are to be provided, the request should be made using the HTTP POST method'.

ADD REPLY • link 6.1 years ago by little_more ▴ 70
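
The HTTP POST route mentioned in the documentation quote can be sketched with Python's standard library alone. This is an illustration under assumptions, not a tested recipe: it assumes the E-utilities `efetch.fcgi` endpoint accepts `rettype=fasta_cds_na` as the equivalent of the `-format fasta_cds_na` option used above, and the function names `build_efetch_post` and `fetch_cds_fasta` are made up here.

```python
import urllib.parse
import urllib.request

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_post(protein_ids, api_key=None):
    """Build (but do not send) a POST request for the CDS of each protein ID."""
    params = {
        "db": "protein",
        "id": ",".join(protein_ids),
        "rettype": "fasta_cds_na",  # assumed equivalent of -format fasta_cds_na
        "retmode": "text",
    }
    if api_key:
        params["api_key"] = api_key
    body = urllib.parse.urlencode(params).encode("ascii")
    # Supplying `data` makes urllib issue a POST instead of a GET.
    return urllib.request.Request(EFETCH_URL, data=body)


def fetch_cds_fasta(protein_ids, api_key=None):
    """Send the request and return the response text (needs network access)."""
    request = build_efetch_post(protein_ids, api_key)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_cds_fasta(["AAN78512.1", "BAB33431.1", "BAB33432.1"])` would send all three IDs in one request body, so the ID list is not limited by URL length.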

You could run the efetch command in a loop. Be sure to sign up for an NCBI API key and use it. Pace your queries so that you do not get your IP banned.

ADD REPLY • link 6.1 years ago by GenoMax 147k
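
A loop like the one suggested here could look roughly as follows in Python. It is a sketch under assumptions: EDirect's `efetch` must be on the PATH, the batch size of 100 and the half-second pause are arbitrary conservative choices, and `run_efetch` and `fetch_in_batches` are illustrative names. EDirect picks up an API key from the `NCBI_API_KEY` environment variable, and NCBI documents a limit of 3 requests per second without a key and 10 with one.

```python
import subprocess
import time


def run_efetch(batch):
    """Call the efetch command from this thread for one batch of protein IDs."""
    result = subprocess.run(
        ["efetch", "-db", "protein", "-format", "fasta_cds_na",
         "-id", ",".join(batch)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def fetch_in_batches(protein_ids, fetch=run_efetch, batch_size=100, pause=0.5):
    """Fetch IDs in consecutive batches, sleeping `pause` seconds between calls."""
    chunks = []
    for start in range(0, len(protein_ids), batch_size):
        if start > 0:
            time.sleep(pause)  # stay well under the request-rate limit
        text = fetch(protein_ids[start:start + batch_size])
        # Make sure each batch ends with a newline before concatenating.
        if text and not text.endswith("\n"):
            text += "\n"
        chunks.append(text)
    return "".join(chunks)
```

Passing the fetch function as a parameter keeps the batching logic separate from the network call, so it can be tried with a stand-in function before sending anything to NCBI.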

Hi! When I try to run the same command, efetch does not take any action but just prints out the help. Any clue why this happens?

ADD REPLY • link 6.0 years ago by shubhra.bhattacharya ▴ 140

This can have many causes; the most frequent one is a typo. If you want more detailed help, please post your exact command here. Please use the 101010 code formatting button (fifth in the ribbon above).

ADD REPLY • link 6.0 years ago by Carambakaracho ★ 3.3k