I have a large table where each column contains protein IDs of a particular group of orthologs. How do I map these protein IDs to gene IDs and then get a file with fasta sequences of all genes for each column?
Try:
efetch -db protein -format fasta_cds_na -id AAN78512
edit: works the same with:
efetch -db protein -format fasta_cds_na -id AAN78512.1
You could run the efetch command in a loop over your IDs. Be sure to sign up for an NCBI API key and use it. Throttle those queries so that you don't get your IP banned.
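As a rough sketch of such a loop, something like the following Python wrapper around the same efetch command could write one CDS FASTA file per column. The function names, the batch size and the pause length are made up for illustration. It assumes the table is tab-separated with a header row naming each ortholog group, that EDirect's `efetch` is on your PATH, and that it accepts a comma-separated list after `-id`. EDirect reads your API key from the `NCBI_API_KEY` environment variable.

```python
import csv
import subprocess
import time


def read_groups(table_path, delimiter="\t"):
    """Return {column header: [protein IDs]} for a table with one group per column.

    Assumes the first row holds the group names (an assumption, adjust as needed).
    """
    groups = {}
    with open(table_path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader)
        for row in reader:
            for name, cell in zip(header, row):
                cell = cell.strip()
                if cell:  # columns can differ in length, so skip empty cells
                    groups.setdefault(name, []).append(cell)
    return groups


def chunks(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def efetch_command(ids):
    """Build the same efetch call as above, for one batch of protein IDs."""
    return ["efetch", "-db", "protein", "-format", "fasta_cds_na",
            "-id", ",".join(ids)]


def fetch_group(ids, out_path, batch_size=50, pause=0.5):
    """Write the CDS FASTA records for one ortholog group to out_path."""
    with open(out_path, "w") as out:
        for batch in chunks(ids, batch_size):
            result = subprocess.run(efetch_command(batch), capture_output=True,
                                    text=True, check=True)
            out.write(result.stdout)
            time.sleep(pause)  # be polite to NCBI between batches


def fetch_all(table_path):
    """Fetch every column of the table into its own <group>_cds.fna file."""
    for name, ids in read_groups(table_path).items():
        fetch_group(ids, f"{name}_cds.fna")
```

Calling `fetch_all("ortholog_groups.tsv")` (a hypothetical file name) would then produce one `.fna` file per column. Batching several IDs per call keeps the number of requests down. Check that each output file has as many records as the group has IDs, since some proteins may not return a CDS.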
which organism? can you use Ensembl/BioMart? how can I bake a pie?
let's put a bit more flesh to that bone http://www.ensembl.org/biomart/martview/
Always provide a few examples when asking this type of question. Protein IDs could be anything, and the answer will depend on what kind they are.
NCBI's Unix utils would almost certainly work if the IDs are from GenBank.
Oh, my bad. All IDs are from GenBank Escherichia genome assemblies (.faa files). For example, AAN78512.1, BAB33431.1, BAB33432.1.
P.S. I know that I can simply go to NCBI and get the CDS for each protein manually, but the question is how to do this for a large number of ID groups. I've heard something about EDirect, but maybe there is a common way to do this with a one-liner.
If you need to get all CDSs for E. coli O157:H7 then those are available here. If the IDs are from different genomes then it is a different problem. Let me look into it some.
IDs are from different genomes. In fact, I have a table with protein IDs:
and I need to get a file with fasta sequences of CDS for each group. Suppose I have all .fna assembly files. Could I use BioPython to get the files?