Question

Get Fasta File With Protein Sequences Given Entrez Gene Ids

4

Entering edit mode

12.2 years ago

Stephen 2.8k

I would like to get a FASTA file with protein sequences given a list of Entrez Gene IDs, e.g.:

19084
112407
18113
...etc.

For example, gene id 19084 links to protein ID 54036156, with fasta sequence

>gi|54036156|sp|Q9DBC7.3|KAP0_MOUSE RecName: Full=cAMP-dependent protein kinase type I-alpha regulatory subunit
MASGSMATSEEERSLRECELYVQKHNIQALLKDSIVQLCTTRPERPMAFLREYFERLEKEEARQIQCLQK
TGIRTDSREDEISPPPPNPVVKGRRRRGAISAEVYTEEDAASYVRKVIPKDYKTMAALAKAIEKNVLFSH
LDDNERSDIFDAMFPVSFIAGETVIQQGDEGDNFYVIDQGEMDVYVNNEWATSVGEGGSFGELALIYGTP
RAATVKAKTNVKLWGIDRDSYRRILMGSTLRKRKMYEEFLSKVSILESLDKWERLTVADALEPVQFEDGQ
KIVVQGEPGDEFFIILEGTAAVLQRRSENEEFVEVGRLGPSDYFGEIALLMNRPRAATVVARGPLKCVKL
DRPRFERVLGPCSDILKRNIQQYNSFVSLSV

I bet there's a way to do this with e-utils, but I can't figure out how. I realize that this can be done using the Ensembl Biomart, but the ID conversion from Entrez to Ensembl gene IDs results in lots of duplicates, i.e. one to many and many to one mappings both ways.

Edit: need to do this for ~17,000 gene IDs.

entrez eutils fasta • 20k views

ADD COMMENT • link updated 22 months ago by Ram 44k • written 12.2 years ago by Stephen 2.8k

Ram · Answer 1 · 2012-09-10

7

Entering edit mode

12.2 years ago

Pierre Lindenbaum 164k

use NCBI-ELINK to get the protein id from the gene-id

echo -e "19084\n112407\n18113" | while read G; do curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=protein&id=${G}" | grep -A 1 "<Link>" | grep "<Id>" | cut -d '>' -f 2 | cut -d '<' -f 1 | while read S ; do curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=${S}&retmode=text&rettype=fasta" ; done;  done

>gi|336020222|gb|AEH77257.1| cAMP-dependent protein kinase type I-alpha regulatory subunit transcript variant 4, partial [Mus musculus]
MASGSMATSEEERSLRECELYVQKHNIQALLKDSIVQLCTTRPERPMAFLREYFERLEKEEARQIQCLQK
TGIRTDSREDEISPPPPNPVVKGRRRRGAISAEVYTEEDAASYVRKVIPKDYKTMAALAKAIEKNVLFSH
LDDNERSDIFDAMFPVSFIAGETVIQQGDEGDNFYVIDQGEMDVYVNNEWATSVGEGGSFGELALIYGTP
RAATVKAKTNVKLWGIDRDSYRRILMGST

>gi|336020220|gb|AEH77256.1| cAMP-dependent protein kinase type I-alpha regulatory subunit transcript variant 3, partial [Mus musculus]
MASGSMATSEEERSLRECELYVQKHNIQALLKDSIVQLCTTRPERPMAFLREYFERLEKEEARQIQCLQK
TGIRTDSREDEISPPPPNPVVKGRRRRGAISAEVYTEEDAASYVRKVIPKDYKTMAALAKAIEKNVLFSH
LDDNERSDIFDAMFPVSFIAGETVIQQGDEGDNFYVIDQGEMDVYVNNEWATSVGEGGSFGELALIYGTP
RAATVKAKTNVKLWGIDRDSYRRILMGST

>gi|336020218|gb|AEH77255.1| cAMP-dependent protein kinase type I-alpha regulatory subunit transcript variant 2, partial [Mus musculus]
MASGSMATSEEERSLRECELYVQKHNIQALLKDSIVQLCTTRPERPMAFLREYFERLEKEEARQIQCLQK
TGIRTDSREDEISPPPPNPVVKGRRRRGAISAEVYTEEDAASYVRKVIPKDYKTMAALAKAIEKNVLFSH
LDDNERSDIFDAMFPVSFIAGETVIQQGDEGDNFYVIDQGEMDVYVNNEWATSVGEGGSFGELALIYGTP
RAATVKAKTNVKLWGIDRDSYRRILMGST

>gi|148702421|gb|EDL34368.1| protein kinase, cAMP dependent regulatory, type I, alpha, isoform CRA_a [Mus musculus]
MASGSMATSEEERSLRECELYVQKHNIQALLKDSIVQLCTTRPERPMAFLREYFERLEKEEARQIQCLQK
(...)

Updated: i've used ftp://ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz to get the relation gene-protein:

echo "19084" > geneids.txt
echo "112407" >> geneids.txt
sort -t '    ' -k1,1 geneids.txt | uniq > ordered_geneids.txt
curl -s "ftp://ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz" | gunzip -c | cut -d '    ' -f2,7 |\
        grep -v "-" | fgrep -w -f geneids.txt | sort -t '    ' -k1,1 -k2,2 -u  |\
            join -t '    ' -1 1 -2 1 ordered_geneids.txt - | cut -d '    ' -f2 > protein_gi.txt

result:

$ cat protein_gi.txt 
269308213
30794476

then use http://www.ncbi.nlm.nih.gov/sites/batchentrez and "download" to retrieve the sequences.

ADD COMMENT • link 12.2 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thanks Pierre. Do you know how this will scale to ~17,000 IDs?

ADD REPLY • link 12.2 years ago by Stephen 2.8k

0

Entering edit mode

No. I would download NCBI ftp://ftp.ncbi.nih.gov/gene/DATA/gene2accession.gz , extract the protein acn and use NCBI batch entrez with those acns.

ADD REPLY • link 12.2 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Did this.

zcat gene2accession.gz | awk '{if ($2==19084) print}' | wc -l

returns 37 lines. For gene id 19084, the protein accession I'm looking for is 54036156. How do I know this is the right one out of those 37 lines?

ADD REPLY • link 12.2 years ago by Stephen 2.8k

0

Entering edit mode

Using this approach I get a lot of duplicates unfortunately. Biomart, with your toy example of 3 ids, gives no duplicates.

ADD REPLY • link 12.2 years ago by Joachim Jacob • 0

0

Entering edit mode

what is a "duplicate" ? same id or same sequence ? anyway, It's easy to remove the sequence/name duplicate with a simple sort | uniq

ADD REPLY • link 12.2 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Many, many protein IDs for the same gene ID. E.g. searching for entrez gene 19804 ( http://www.ncbi.nlm.nih.gov/gene/?term=19084 ) returns protein accession number Q9DBC7.3 (genepept 54036156 - http://www.ncbi.nlm.nih.gov/protein/54036156 ). But executing the command above:

zcat gene2accession.gz | awk '{if ($2==19084) print}' | wc -l

... returns 37 lines with unique protein accession numbers. I can't see what's different about the line containing 54036156. A simple sort|uniq wouldn't work in this case.

ADD REPLY • link 12.2 years ago by Stephen 2.8k

0

Entering edit mode

I meant sort| uniq after linearizing the fasta sequence to get the unique sequences. Or just use "sort -u" to only keep one record per gene-id.

ADD REPLY • link 12.2 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thanks Pierre. Hmm, looks like I'm exhausting my available 32GB memory at the fgrep step.

ADD REPLY • link 12.2 years ago by Stephen 2.8k

0

Entering edit mode

I was using it to go faster. You can remove it from the pipeline.

ADD REPLY • link 12.2 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Hello,

I'm looking at your answer two years later and am wondering if I could get some help with your updated code (for finding protein IDs as opposed to protein FASTA sequences). I tried the code as typed, but I get the following output:

sort: multi-character tab `    '
sort: multi-character tab `    '
join: illegal tab character specification
cut: bad delimiter
cut: bad delimiter

with no results (protein_gi.txt is blank, as is ordered_geneids.txt). When I remove -t ' '

after sort, ordered_geneids.txt is generated appropriately, but I am at a loss as to how I could alter the curl line to get the code to function. Any help would be greatly appreciated.

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by hmg ▴ 20

score 1 · Answer 2 · 2012-09-10

1

Entering edit mode

12.2 years ago

Vivek ★ 2.7k

You could likely use Batch entrez to do a bulk download for the 17k ids and get a sequence file in genbank format.

This can be parsed using Bio perl.

ADD COMMENT • link 12.2 years ago by Vivek ★ 2.7k

0

Entering edit mode

It seems that the default genbank download does not contain that information because this is a related item and not the actual protein it encodes for. Perhaps there is a more detailed genbank or other format that does contain that information

ADD REPLY • link 12.2 years ago by Istvan Albert 102k

score 1 · Answer 3 · 2012-09-11

I would convert the Entrez ID's to RefSeq mRNA ID's using DAVID:

DAVID

Save the text output, then use awk to grab the column of RefSeq ID's

awk '{print $2}' DAVID_conversions.txt > refseq_ids.txt

Paste that list into the TableBrowser @ UCSC to get a fasta file of all your genes of interest:

Genome Browser

Output for your test cases:

>NP_035054.1
MESGFTSKDTYLSHFNPRDYLEKYYSFGSRHCAENEILRHLLKNLFKIFC
LGAVKGELLIDIGSGPTIYQLLSACESFTEIIVSDYTDQNLWELQKWLKK
EPGAFDWSPVVTYVCDLEGNRMKGPEKEEKLRRAIKQVLKCDVTQSQPLG
GVSLPPADCLLSTLCLDAACPDLPAYRTALRNLGSLLKPGGFLVMVDALK
SSYYMIGEQKFSSLPLGWETVRDAVEEAGYTIEQFEVISQNYSSTTSNNE
GLFSLVGRKPGRSE

>NP_068680.1
MASGSMATSEEERSLRECELYVQKHNIQALLKDSIVQLCTTRPERPMAFL
REYFERLEKEEARQIQCLQKTGIRTDSREDEISPPPPNPVVKGRRRRGAI
SAEVYTEEDAASYVRKVIPKDYKTMAALAKAIEKNVLFSHLDDNERSDIF
DAMFPVSFIAGETVIQQGDEGDNFYVIDQGEMDVYVNNEWATSVGEGGSF
GELALIYGTPRAATVKAKTNVKLWGIDRDSYRRILMGSTLRKRKMYEEFL
SKVSILESLDKWERLTVADALEPVQFEDGQKIVVQGEPGDEFFIILEGTA
AVLQRRSENEEFVEVGRLGPSDYFGEIALLMNRPRAATVVARGPLKCVKL
DRPRFERVLGPCSDILKRNIQQYNSFVSLSV

>NP_082409.2
MPLGHIMRLDLEKIALEYIVPCLHEVGFCYLDNFLGEVVGDCVLERVKQL
HYNGALRDGQLAGPCAGVSKRHLRGDQITWIGGNEEGCEAINFLLSLIDR
LVLYCGSRLGKYYVKERSKAMVACYPGNGTGYVRHVDNPNGDGRCITCIY
YLNKNWDAKLHGGVLRIFPEGKSFVADVEPIFDRLLFFWSDRRNPHEVQP
SYATRYAMTVWYFDAEERAEAKKKFRNLTRKTESALAKD