Question

How to get canonical protein sequences from Entrez gene id in batch?

0

Entering edit mode

3.1 years ago

Yep ▴ 20

Hi, I am trying to map some 10000+ Entrez Gene IDs to protein sequences. For the sake of my downstream task (learn protein representations for those Gene IDs), I can only accept N vs 1 mapping (i.e., N Gene IDs correspond to the same protein sequence), but not 1 vs N mapping (i.e., 1 Gene ID corresponds to N protein sequences).

As a Python/R user with no experience in Perl, the current strategy I am using is to transition between two online databases: [NCBI] Gene ID --> [UniProt] Accession ID --> [UniProt] canonical protein sequence. However, the problem is the first transition would result in a fair number of multi-to-multi mapping. After a careful examination, I found that the problem lies in the UniProt side, where the same Gene IDs are assigned to multiple proteins accession IDs.

I then tried to transition first from Gene ID to Gene Symbol, but gene symbols are even more misleading. Are there some ways to directly access canonical protein sequences from NCBI via Gene IDs using Python?

Example gene IDs: 276,277,278,801,805,3020,3021,7278,113457,11013,286527,6818,445329,7258,728137,100289087

protein sequence database python Entrez • 2.6k views

ADD COMMENT • link updated 3.1 years ago by GenoMax 147k • written 3.1 years ago by Yep ▴ 20

0

Entering edit mode

Please post a few example gene ID. EntrezDirect would be one way of getting this done programmatically.

ADD REPLY • link 3.1 years ago by GenoMax 147k

0

Entering edit mode

Have added a few. Thanks for the recommendations, I briefly skimmed through and this looks similar to Biopython. In fact, I wanted to try directly accessing the protein sequences from Entrez, but the conversion between gene IDs and protein IDs (GenBank protein accession) is quite ambiguous, by which I mean there are a lot of multi-to-multi mappings. I used bioDBnet to do the conversion. As an example, I can hardly distinguish which one is the canonical protein ID for gene ID 2:

ADD REPLY • link 3.1 years ago by Yep ▴ 20

0

Entering edit mode

If you are referencing a gene record ID then it points to a number of protein sequence. For example for 277 gene ID you have the following records in protein database.

$ esearch -db gene -query "277 [uid]" | elink -target protein | efetch -format fasta | grep ">"
>NP_001373854.1 alpha-amylase 1B precursor [Homo sapiens]
>NP_001008219.1 alpha-amylase 1B precursor [Homo sapiens]
>sp|P0DTE7.1|AMY1B_HUMAN RecName: Full=Alpha-amylase 1B; Flags: Precursor
>AAI44453.1 Amylase, alpha 1A (salivary) [Homo sapiens]
>AAI32996.1 Amylase, alpha 1A (salivary) [Homo sapiens]
>AAI32986.1 Amylase, alpha 1A (salivary) [Homo sapiens]
>AAI32998.1 Amylase, alpha 1A (salivary) [Homo sapiens]
>AAI32988.1 Amylase, alpha 1A (salivary) [Homo sapiens]
>AAH92444.1 Amylase, alpha 1A (salivary) [Homo sapiens]
>AAH69463.1 Amylase, alpha 1A (salivary) [Homo sapiens]
>AAH69347.1 Amylase, alpha 1A (salivary) [Homo sapiens]
>AAH70302.1 AMY1A protein, partial [Homo sapiens]
>AAH63129.1 Amylase, alpha 1A (salivary) [Homo sapiens]
>BAF85030.1 unnamed protein product [Homo sapiens]

(All sequence trimmed due to space constraint)

You could also UniProt sequence (which will likely get you one sequence) using the above search

$ esearch -db gene -query "277 [uid]" | elink -target protein | efetch -format fasta | awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' | grep "_HUMAN" | tr "\t" "\n"
>sp|P0DTE7.1|AMY1B_HUMAN RecName: Full=Alpha-amylase 1B; Flags: Precursor
MKLFWLLFTIGFCWAQYSSNTQQGRTSIVHLFEWRWVDIALECERYLAPKGFG

ADD REPLY • link 3.1 years ago by GenoMax 147k

0

Entering edit mode

Thanks for the great demo! As I mentioned in my reply to Istvan Albert I actually need only one canonical protein sequence from one gene ID.

The second script looks like what I need! I just installed EntrezDirect and am still dealing with some bugs, so I cannot test the script to gene ID 10627, which is a one-on-multiple-protein mapping gene. I am not familiar with awk and am not sure where in the script UniProt is linked, but it would be great if Entrez could directly do so. Will try test in Biopython later to see the results.

ADD REPLY • link 3.1 years ago by Yep ▴ 20

0

Entering edit mode

Use conda to install EntrezDirect and that should take care of the problems. I showed you examples of searches for 10627 and 103910 below.

ADD REPLY • link 3.1 years ago by GenoMax 147k

0

Entering edit mode

Since a gene by definition could be connected to multiple proteins, what is it that you are asking really?

What is the goal? How would/should the multiple mapping be resolved? Are you asking for a database to resolve that for you?

Edit: also I would not call this "accurate" protein ids. There is nothing "accurate" about collapsing multiple proteins into a single gene. If anything it is an "inaccurate" mapping.

ADD REPLY • link 3.1 years ago by Istvan Albert 101k

0

Entering edit mode

Thanks for the reminder and have revised the wording in my post. You are right that there could be a bunch of proteins connected to one gene. However, the reason I have to figure out one canonical protein sequence is that I need to learn the representation of proteins from a protein-protein interaction dataset, where the proteins are denoted only by gene IDs (I am also quite confused about this). I understand the protein sequence I choose for the downstream might not be the real isoform of the protein actually interacting with the other proteins, but for my downstream task, a canonical sequence is enough.

In fact, UniProt has the so-called "canonical sequence" posted for each protein, but I just need to resolve the multi-mapping stuff. For example, gene ID 10627 is mapped to both UniProt accession P19105 (aka gene symbol MYL12A) and UniProt accession O14950 (aka gene symbol MYL12B, mapped along with gene ID 103910) in UniProt. In NCBI the gene IDs are clearly distinct, and in UniPort the gene symbols for the two proteins are also different. But it is just mapped to two of them.

While I did notice that in NCBI Gene database I can download a single CDS protein sequence for one gene, I am not aware of how to do so in Entrez...

ADD REPLY • link 3.1 years ago by Yep ▴ 20

score 2 · Accepted Answer · 2021-11-01

2

Entering edit mode

3.1 years ago

GenoMax 147k

If you are happy with UniProt sequences then this should do the trick.

$ esearch -db gene -query "10627 [uid]" | elink -target protein | efetch -format fasta | awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' | grep "_HUMAN" | tr "\t" "\n"
>sp|P19105.2|ML12A_HUMAN RecName: Full=Myosin regulatory light chain 12A; AltName: Full=Epididymis secretory protein Li 24; Short=HEL-S-24; AltName: Full=MLC-2B; AltName: Full=Myosin RLC; AltName: Full=Myosin regulatory light chain 2, nonsarcomeric; AltName: Full=Myosin regulatory light chain MRLC3
MSSKRTKTKTKKRPQRATSNVFAMFDQSQIQEFKEAFNMIDQNRDGFIDKEDLHDMLASLGKNPTDEYLDAMMNEAPGPINFTMFLTMFGEKLNGTDPEDVIRNAFACFDEEATGTIQEDYLRELLTTMGDRFTDEEVDELYREAPIDKKGNFNYIEFTRILKHGAKDKDD

and

$ esearch -db gene -query "103910 [uid]" | elink -target protein | efetch -format fasta | awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' | grep "_HUMAN" | tr "\t" "\n"
>sp|O14950.2|ML12B_HUMAN RecName: Full=Myosin regulatory light chain 12B; AltName: Full=MLC-2A; Short=MLC-2; AltName: Full=Myosin regulatory light chain 2-B, smooth muscle isoform; AltName: Full=Myosin regulatory light chain 20 kDa; Short=MLC20; AltName: Full=Myosin regulatory light chain MRLC2; AltName: Full=SHUJUN-1
MSSKKAKTKTTKKRPQRATSNVFAMFDQSQIQEFKEAFNMIDQNRDGFIDKEDLHDMLASLGKNPTDAYLDAMMNEAPGPINFTMFLTMFGEKLNGTDPEDVIRNAFACFDEEATGTIQEDYLRELLTTMGDRFTDEEVDELYREAPIDKKGNFNYIEFTRILKHGAKDKDD

ADD COMMENT • link 3.1 years ago by GenoMax 147k

0

Entering edit mode

Thanks! I installed Entrez Direct on my Win10 laptop and there are (as expected) some bugs when implementing the code with Git Bash (or perhaps cygwin). Briefly speaking, I have no problem running esearch -db gene -query "10627 [uid]", but when it becomes esearch -db gene -query "10627 [uid]" | elink -target protein or more, some error keep coming out. Am trying to fix it. Maybe I will install a WSL instead.

nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ elink.fcgi -id Unable,to,locate,xtract,executable.,Please,execute,the,following,nquire,dwn,ftp.ncbi.nlm.nih.gov,entrez,entrezdirect,xtract.UNSUPPORTED.gz,gunzip,f,xtract.UNSUPPORTED.gz,chmod,x,xtract.UNSUPPORTED -WebEnv MCID_6180802004ade96d2d580872 -dbfrom gene -db protein -cmd neighbor_history -linkname gene_protein -tool edirect -edirect 16.2 -edirect_os MSYS_NT

Unable to locate xtract executable. Please execute the following:

  nquire -dwn ftp.ncbi.nlm.nih.gov entrez/entrezdirect xtract.UNSUPPORTED.gz
  gunzip -f xtract.UNSUPPORTED.gz
  chmod +x xtract.UNSUPPORTED

ADD REPLY • link 3.1 years ago by Yep ▴ 20

0

Entering edit mode

it only runs on Unix-like systems (thus you'd need to install WSL), that's what the "UNSUPPORTED" really means there