I would like to get a FASTA file with protein sequences given a list of Entrez Gene IDs, e.g.:
19084
112407
18113
...etc.
For example, gene id 19084 links to protein ID 54036156, with fasta sequence
>gi|54036156|sp|Q9DBC7.3|KAP0_MOUSE RecName: Full=cAMP-dependent protein kinase type I-alpha regulatory subunit
MASGSMATSEEERSLRECELYVQKHNIQALLKDSIVQLCTTRPERPMAFLREYFERLEKEEARQIQCLQK
TGIRTDSREDEISPPPPNPVVKGRRRRGAISAEVYTEEDAASYVRKVIPKDYKTMAALAKAIEKNVLFSH
LDDNERSDIFDAMFPVSFIAGETVIQQGDEGDNFYVIDQGEMDVYVNNEWATSVGEGGSFGELALIYGTP
RAATVKAKTNVKLWGIDRDSYRRILMGSTLRKRKMYEEFLSKVSILESLDKWERLTVADALEPVQFEDGQ
KIVVQGEPGDEFFIILEGTAAVLQRRSENEEFVEVGRLGPSDYFGEIALLMNRPRAATVVARGPLKCVKL
DRPRFERVLGPCSDILKRNIQQYNSFVSLSV
I bet there's a way to do this with e-utils, but I can't figure out how. I realize that this can be done using the Ensembl Biomart, but the ID conversion from Entrez to Ensembl gene IDs results in lots of duplicates, i.e. one to many and many to one mappings both ways.
Edit: need to do this for ~17,000 gene IDs.
Thanks Pierre. Do you know how this will scale to ~17,000 IDs?
No. I would download NCBI ftp://ftp.ncbi.nih.gov/gene/DATA/gene2accession.gz , extract the protein acn and use NCBI batch entrez with those acns.
Did this.
returns 37 lines. For gene id 19084, the protein accession I'm looking for is 54036156. How do I know this is the right one out of those 37 lines?
Using this approach I get a lot of duplicates unfortunately. Biomart, with your toy example of 3 ids, gives no duplicates.
what is a "duplicate" ? same id or same sequence ? anyway, It's easy to remove the sequence/name duplicate with a simple sort | uniq
Many, many protein IDs for the same gene ID. E.g. searching for entrez gene 19804 ( http://www.ncbi.nlm.nih.gov/gene/?term=19084 ) returns protein accession number Q9DBC7.3 (genepept 54036156 - http://www.ncbi.nlm.nih.gov/protein/54036156 ). But executing the command above:
... returns 37 lines with unique protein accession numbers. I can't see what's different about the line containing 54036156. A simple sort|uniq wouldn't work in this case.
I meant sort| uniq after linearizing the fasta sequence to get the unique sequences. Or just use "sort -u" to only keep one record per gene-id.
Thanks Pierre. Hmm, looks like I'm exhausting my available 32GB memory at the fgrep step.
I was using it to go faster. You can remove it from the pipeline.
Hello,
I'm looking at your answer two years later and am wondering if I could get some help with your updated code (for finding protein IDs as opposed to protein FASTA sequences). I tried the code as typed, but I get the following output:
with no results (
protein_gi.txt
is blank, as isordered_geneids.txt
). When I remove-t ' '
after
sort
,ordered_geneids.txt
is generated appropriately, but I am at a loss as to how I could alter thecurl
line to get the code to function. Any help would be greatly appreciated.