Clearly maps to a gene, which in turn has an Ensembl ID, but it does not appear in the tables available from uniprot like the one linked to above. Why is this and where can I find a complete mapping from any uniprot/uniref ID to an Ensembl gene ID? thanks.
As usual, the answer to "how do I map ID X to ID Y" is to use BioMart or the UCSC Table browser. Please search this site for answers on those topics; if you have trouble leave a comment and we can supply brief instructions. I just tried BioMart using UniProt/SwissProt Accession Q9Y5I3 as filter and it returned ENSG00000204970/ENST00000378133 as Gene ID and Transcript ID attributes.
I am very well aware of these resources and tried them. When you download BioMart tables and ask for the Uniprot ID along with ENSG ids, you get a table back that does not contain Q9Y5I3. The ID is not found. Searching for this ID as a filter with each ID is not a solution -- I am looking for a table that contains a proper mapping so that it can be programmatically searched.
I managed to get a table from BioMart that has the Q9Y5I3 ID, but now it's missing others like A2VEC9. I don't understand why these IDs are not in the tables.
Why are you downloading tables? The example I have was via the web interface. If certain IDs do not map, it's simply because Ensembl is unsure whether there's a canonical gene for that protein product.
The reason you're not getting ALL the IDs when you download from BioMart is that BioMart cannot handle that amount of data. It just stops working partway through your query. You need to filter by your list.
Alternatively, if you do want a complete list, then you can use the Perl API.
UniProt to Ensembl cross references are currently being cleaned up and corrected. This is not an instantaneous process and will take some time. Try writing to help@uniprot.org for more details.
Using the UCSC MySQL server and the tables uniProt.extDbRef and uniProt.extDb:
$ echo -e "Q9Y5I3\nQ04721" |\
awk '{printf("select REF.acc,REF.extAcc1,REF.extAcc2,REF.extAcc3 from uniProt.extDbRef as REF, uniProt.extDb as EXT where EXT.val=\"ENSEMBL\" and EXT.id=REF.extDb and REF.acc=\"%s\";\n",$0);}' |\
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -N
Q9Y5I3 ENST00000378133 ENSP00000367373 ENSG00000204970
Q04721 ENST00000256646 ENSP00000256646 ENSG00000134250
As usual, when it comes to mappings between UniProt and external IDs, the most reliable approach is to look into the TrEMBL/SwissProt flat files and parse for the UniProt accession and the desired external accession. The UniProt accession is listed in the AC field; external ones are in the DR fields. In the case of Q9Y5I3, the entry looks like this (uniprot_sprot.dat):
DR Ensembl; ENST00000378133; ENSP00000367373; ENSG00000204970.
BioPython offers a nice interface for parsing UniProt files without much ado. I'm sure BioPerl/... have similar interfaces.
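If you would rather not depend on BioPython, the AC/DR parsing described above can be sketched in plain Python. This is only a minimal sketch: the function name ensembl_xrefs is made up for illustration, and it assumes the usual flat-file layout (two-letter line codes, records terminated by "//", Ensembl gene IDs starting with ENSG).

```python
import re
from collections import defaultdict


def ensembl_xrefs(lines):
    """Map each record's primary accession to the Ensembl gene IDs in its DR lines.

    Hypothetical helper: assumes UniProt flat-file conventions, i.e. a two-letter
    line code, accessions on AC lines and one '//' line closing every record.
    """
    mapping = defaultdict(set)
    accessions, genes = [], set()
    for line in lines:
        code = line[:2]
        if code == "AC":
            # AC lines hold ';'-separated accessions; the first one is primary.
            accessions.extend(a.strip() for a in line[2:].split(";") if a.strip())
        elif code == "DR" and line[2:].strip().startswith("Ensembl;"):
            # Keep only the gene-level identifiers from the Ensembl cross-reference.
            genes.update(re.findall(r"ENSG\d+", line))
        elif code == "//":
            if accessions:
                # Records without an Ensembl DR line end up with an empty set.
                mapping[accessions[0]] |= genes
            accessions, genes = [], set()
    return dict(mapping)


if __name__ == "__main__":
    # Lines taken from the Q9Y5I3 example in this thread.
    record = [
        "AC Q9Y5I3; O75288; Q9NRT7;",
        "DR Ensembl; ENST00000378133; ENSP00000367373; ENSG00000204970.",
        "//",
    ]
    print(ensembl_xrefs(record))
```

For a real run you would pass an open file handle for uniprot_sprot.dat instead of the list. Entries that come back with an empty set are the ones lacking an Ensembl cross-reference in that release of the file.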
I could not find this. When I downloaded uniprot_sprot.dat from UniProt, I got this:
$ grep Q9Y5I3 uniprot_sprot.dat
AC Q9Y5I3; O75288; Q9NRT7;
CC IsoId=Q9Y5I3-1; Sequence=Displayed;
CC IsoId=Q9Y5I3-2; Sequence=VSP_000670;
CC IsoId=Q9Y5I3-3; Sequence=VSP_000671, VSP_000672;
DR ProteinModelPortal; Q9Y5I3; -.
DR SMR; Q9Y5I3; 27-678.
DR IntAct; Q9Y5I3; 1.
DR STRING; Q9Y5I3; -.
DR PRIDE; Q9Y5I3; -.
DR neXtProt; NX_Q9Y5I3; -.
DR InParanoid; Q9Y5I3; -.
DR Genevestigator; Q9Y5I3; -.
So no Ensembl reference... where did you get the flat file you mention?
UniParc is another way of doing this mapping since this is based on checksum collisions from numerous protein resources. If you want a 1:1 mapping this is a good place to look. http://www.uniprot.org/uniparc/UPI00001273C7
As well as the above pieces of software, or going into the flat file, there is a tool called PICR (Protein Identifier Cross-Reference Service; http://www.ebi.ac.uk/Tools/picr/search.do). It was built a few years back to assist with issues like this, although it was originally written to help cross-map data deposited in PRIDE (http://www.ebi.ac.uk/pride/). It is something I resort to frequently to identify other database entries for a particular protein or gene.
As to why it happens, I am not sure. But SwissProt, from which Q9Y5I3 comes, is the manually curated part of UniProt, i.e. the data is fully annotated by a human being. The entry has probably been built from a skeleton generated by the software used to create the TrEMBL portion of UniProt. The problem, as always, is how many links you follow, and which links and annotation you trust. The reason SwissProt has such a great reputation is the degree and quality of annotation it provides. Another possibility is that the data has been updated elsewhere and not amended in the UniProt entry: databases are continually updated, and keeping them in sync is a nightmare task.
I was writing this at the same time as Julian was adding his useful comment. I am also a Swiss-Prot annotation fan, but I can confirm that there are constitutive problems for ID x-mapping in general for a significant proportion of human proteins, as indicated by the following UniProt queries:
(organism:"Homo sapiens [9606]") AND reviewed:yes = 20,237
(organism:"Homo sapiens [9606]") AND reviewed:yes AND database:(type:ensembl) = 18,685
(organism:"Homo sapiens [9606]") AND reviewed:yes AND database:(type:ensembl) AND database:(type:hgnc) AND database:(type:geneid) = 18,250
Ensembl 67.37 = 21,065 including 568 novel (i.e. not 100% match to UniProt)
The BioMart numbers should be similar, but any way you look at it there is ~8% discordance Swiss-Prot > Ensembl, with a residual for HGNC and EGID. The numbers also indicate that ~1000 Ensembl proteins are not in Swiss-Prot (but some may be in TrEMBL).
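For anyone who wants to check the discordance figure, here is the arithmetic on the query counts above as a small Python sketch. The function name discordance is just for illustration; the counts are the ones quoted in the queries.

```python
def discordance(total, mapped):
    """Percentage of entries in `total` that lack the cross-reference."""
    return 100.0 * (total - mapped) / total


# Counts quoted from the UniProt queries above.
reviewed_human = 20237   # reviewed Homo sapiens entries
with_ensembl = 18685     # ... that also have an Ensembl cross-reference
with_all_three = 18250   # ... that have Ensembl, HGNC and GeneID cross-references

if __name__ == "__main__":
    print(f"no Ensembl x-ref: {discordance(reviewed_human, with_ensembl):.1f}%")
    print(f"missing one of the three: {discordance(reviewed_human, with_all_three):.1f}%")
```

The first figure is where the "~8%" comes from (1,552 of the 20,237 reviewed entries have no Ensembl cross-reference).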
For Q9Y5I3 it looks like the flat file had the x-ref but the UniProt web interface did not (i.e. I can click UniProt > HGNC > Ensembl but cannot go direct). Maybe this is the sync problem Julian points out.
Julian, can you get PICR numbers that are concordant with the type I have shown?
I recommend the UniProt idmapping.dat.gz file. After comparing the mappings available from UniProt, Ensembl, the UCSC Table Browser and BioMart, I found UniProt's to be the most complete.
The gene annotations in the xx_idmapping.dat.gz and xx_idmapping_selected.tab.gz files do not contain the same information -- despite being smaller, the .dat.gz file includes gene annotations which are left empty in the tabular "selected" version of the mapping.
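A minimal Python sketch for turning the idmapping file into a lookup table, in case it helps. It assumes the .dat file is three tab-separated columns (UniProt accession, ID type, external ID) and that Ensembl gene IDs carry the type label "Ensembl"; check both against the README that ships with the file. The function names and the HUMAN_9606 file name in the usage note are only examples.

```python
import gzip
from collections import defaultdict


def load_idmapping(handle, id_type="Ensembl"):
    """Collect accession -> set of external IDs for one ID type.

    Assumes three tab-separated columns per line: accession, ID type, ID.
    Lines that do not fit that layout are skipped.
    """
    mapping = defaultdict(set)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue
        accession, kind, external_id = fields
        if kind == id_type:
            # One accession can map to several genes, hence a set.
            mapping[accession].add(external_id)
    return dict(mapping)


def load_idmapping_file(path, id_type="Ensembl"):
    """Same as load_idmapping, but reads a gzip-compressed file from disk."""
    with gzip.open(path, "rt") as handle:
        return load_idmapping(handle, id_type)
```

Usage would be something like load_idmapping_file("HUMAN_9606_idmapping.dat.gz").get("Q9Y5I3", set()), which returns an empty set for accessions that have no Ensembl gene in the file.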
OK - so which of the six sources we have referred to (UniProt/UniParc/Ensembl/BioMart/UCSC/PICR) operationally executes the primary UniProt > Ensembl mapping, and how?
As far as my knowledge of these things goes, Ensembl maps its proteins to UniProtKB accessions either by a 100% identity match (please assume this even though the data in 67 will disagree) or by using a direct association given to Ensembl by UniProt. The BioMart referred to in this post is the Ensembl Gene Mart, so the same rules apply.
UniParc does its own mappings using MD5 digests of the sequence and clusters identical checksums together. PICR uses UniParc in its mappings but can also use other forms of alignment/lookup (see http://www.ebi.ac.uk/Tools/picr/implementation.do for more information).
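To make the checksum idea concrete, here is a toy Python sketch of grouping identifiers whose sequences have the same MD5 digest. It is not UniParc's actual code: the normalisation step (strip whitespace, upper-case) and the function names are my assumptions, and the identifiers in the usage note are placeholders.

```python
import hashlib
from collections import defaultdict


def md5_digest(sequence):
    """MD5 hex digest of a sequence after removing whitespace and upper-casing.

    The normalisation is an assumption for this sketch, not UniParc's rule.
    """
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def cluster_identical(entries):
    """Group identifiers that share a digest.

    `entries` is an iterable of (identifier, sequence) pairs, possibly drawn
    from different databases. Only groups with more than one member are returned.
    """
    clusters = defaultdict(list)
    for identifier, sequence in entries:
        clusters[md5_digest(sequence)].append(identifier)
    return [ids for ids in clusters.values() if len(ids) > 1]
```

For example, cluster_identical([("uniprot:X1", "MKTAYIAK"), ("ensembl:P1", "mktayiak"), ("uniprot:X2", "MKTAYIAR")]) groups the first two identifiers and leaves the third out. Because the match is on the whole sequence, a single residue difference between the UniProt and Ensembl versions of a protein is enough to break the link.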
UniProt also do a mapping to Ensembl but I would rather let them comment on this process to avoid mis-informing you.