Hi, I am trying to map 10,000+ Entrez Gene IDs to protein sequences. For my downstream task (learning protein representations for those Gene IDs), I can only accept N-to-1 mappings (i.e., N Gene IDs correspond to the same protein sequence), not 1-to-N mappings (i.e., 1 Gene ID corresponds to N protein sequences).
As a Python/R user with no experience in Perl, my current strategy is to go through two online databases: [NCBI] Gene ID --> [UniProt] accession ID --> [UniProt] canonical protein sequence. The problem is that the first step produces a fair number of many-to-many mappings. After careful examination, I found that the problem lies on the UniProt side, where the same Gene ID is assigned to multiple protein accession IDs.
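To make the N-to-1 versus 1-to-N distinction concrete, here is a minimal sketch of how a mapping table could be audited before any sequences are fetched. The function name `split_by_cardinality` and the placeholder pairs are made up for illustration; real input would be the (Gene ID, accession) pairs returned by whatever ID-mapping service is used.

```python
from collections import defaultdict


def split_by_cardinality(pairs):
    """Group (gene_id, accession) pairs by how many accessions each gene has."""
    accessions_by_gene = defaultdict(set)
    for gene_id, accession in pairs:
        accessions_by_gene[gene_id].add(accession)

    unique = {}      # gene ID -> its single accession (acceptable, including N-to-1)
    ambiguous = {}   # gene ID -> sorted accessions (1-to-N, still needs resolving)
    for gene_id, accessions in accessions_by_gene.items():
        if len(accessions) == 1:
            unique[gene_id] = next(iter(accessions))
        else:
            ambiguous[gene_id] = sorted(accessions)
    return unique, ambiguous


# Hypothetical placeholder pairs, not real database content.
pairs = [
    ("gene1", "ACC_A"),
    ("gene2", "ACC_A"),  # N-to-1: two genes share one protein, acceptable
    ("gene3", "ACC_B"),
    ("gene3", "ACC_C"),  # 1-to-N: one gene, two proteins, needs resolving
]
unique, ambiguous = split_by_cardinality(pairs)
print(unique)
print(ambiguous)
```

Only the genes that end up in `ambiguous` need a tie-breaking rule; the rest can be used as they are.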
I then tried going from Gene ID to gene symbol first, but gene symbols are even more misleading. Is there a way to directly access canonical protein sequences from NCBI via Gene IDs using Python?
Example gene IDs: 276,277,278,801,805,3020,3021,7278,113457,11013,286527,6818,445329,7258,728137,100289087
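For reference, a rough Biopython sketch of a direct Gene ID to protein lookup through Entrez is shown here. It assumes Biopython is installed and that the ELink name `gene_protein_refseq` is the link of interest. The function names `extract_linked_ids` and `refseq_proteins_for_gene` are made up for this example, and the lookup returns every linked RefSeq protein, so it does not by itself pick a canonical one.

```python
def extract_linked_ids(link_sets):
    """Collect linked record IDs from a parsed ELink result.

    `link_sets` is the list-of-dicts structure that Bio.Entrez.read() returns
    for an elink call: each entry has a "LinkSetDb" list whose items hold a
    "Link" list of {"Id": ...} records.
    """
    ids = []
    for link_set in link_sets:
        for link_db in link_set.get("LinkSetDb", []):
            ids.extend(link["Id"] for link in link_db.get("Link", []))
    return ids


def refseq_proteins_for_gene(gene_id, email):
    """Return {protein accession: sequence} for one Entrez Gene ID (needs network)."""
    # Imported here so the parsing helper above works without Biopython installed.
    from Bio import Entrez, SeqIO

    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.elink(dbfrom="gene", db="protein", id=str(gene_id),
                          linkname="gene_protein_refseq")
    protein_ids = extract_linked_ids(Entrez.read(handle))
    handle.close()
    if not protein_ids:
        return {}

    handle = Entrez.efetch(db="protein", id=",".join(protein_ids),
                           rettype="fasta", retmode="text")
    sequences = {rec.id: str(rec.seq) for rec in SeqIO.parse(handle, "fasta")}
    handle.close()
    return sequences


# Offline check of the parsing step with a hand-written structure (IDs are made up).
fake_result = [{"LinkSetDb": [{"LinkName": "gene_protein_refseq",
                               "Link": [{"Id": "111"}, {"Id": "222"}]}]}]
print(extract_linked_ids(fake_result))
```

A call such as `refseq_proteins_for_gene(277, "you@example.org")` would be expected to return several entries for a gene with multiple isoforms, so a selection step is still needed afterwards.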
Please post a few example gene IDs. EntrezDirect would be one way of getting this done programmatically.
Have added a few. Thanks for the recommendation; I briefly skimmed through it and it looks similar to Biopython. In fact, I wanted to try directly accessing the protein sequences from Entrez, but the conversion between gene IDs and protein IDs (GenBank protein accessions) is quite ambiguous, by which I mean there are a lot of many-to-many mappings. I used bioDBnet to do the conversion. As an example, I can hardly distinguish which one is the canonical protein ID for gene ID 2:
If you are referencing a gene record ID, then it points to a number of protein sequences. For example, for gene ID 277 you have the following records in the protein database. (All sequences trimmed due to space constraints.)
You could also get the UniProt sequence (which will likely get you one sequence) using the above search.
Thanks for the great demo! As I mentioned in my reply to Istvan Albert, I actually need only one canonical protein sequence per gene ID. The second script looks like what I need! I just installed EntrezDirect and am still dealing with some bugs, so I cannot yet test the script on gene ID 10627, which maps to multiple proteins. I am not familiar with awk and am not sure where in the script UniProt is linked, but it would be great if Entrez could do so directly. I will try testing in Biopython later to see the results.
Use conda to install EntrezDirect and that should take care of the problems. I showed you examples of searches for 10627 and 103910 below.
Since a gene by definition could be connected to multiple proteins, what is it that you are asking really?
What is the goal? How would/should the multiple mapping be resolved? Are you asking for a database to resolve that for you?
Edit: also I would not call this "accurate" protein ids. There is nothing "accurate" about collapsing multiple proteins into a single gene. If anything it is an "inaccurate" mapping.
Thanks for the reminder; I have revised the wording in my post. You are right that there could be a bunch of proteins connected to one gene. However, the reason I have to settle on one canonical protein sequence is that I need to learn protein representations from a protein-protein interaction dataset in which the proteins are denoted only by gene IDs (I am also quite confused about this). I understand the protein sequence I choose might not be the actual isoform interacting with the other proteins, but for my downstream task a canonical sequence is enough.
In fact, UniProt has a so-called "canonical sequence" posted for each protein, but I just need to resolve the multi-mapping. For example, gene ID 10627 is mapped to both UniProt accession P19105 (aka gene symbol MYL12A) and UniProt accession O14950 (aka gene symbol MYL12B, mapped along with gene ID 103910) in UniProt. In NCBI the gene IDs are clearly distinct, and in UniProt the gene symbols for the two proteins are also different. Yet it is simply mapped to both of them.
While I did notice that in the NCBI Gene database I can download a single CDS protein sequence for one gene, I am not aware of how to do so in Entrez...