Hi,
I annotated MAF files sourced from TCGA using the Variant Effect Predictor (VEP) from Ensembl, following this tutorial: https://www.biostars.org/p/91806/#200161
I then wanted to get the reference protein sequence for these variants, and used the Swiss-Prot identifier (e.g. AT1A3_HUMAN) to pull the sequence from ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
I noticed that for some variants the reference amino acid doesn't match the amino acid at the mutated position in the UniProt sequence. However, on the UniProt page for that entry (http://www.uniprot.org/uniprot/P13637), the listed sequence isoforms do have the correct amino acid at that position.
Eg. ATP1A3 c.257C>T p.Pro86Leu p.P86L AT1A3_HUMAN UPI0001914BDE
But if I use the UniParc FASTA file and the UniParc identifier (UPI0001914BDE), position 86 of the protein sequence is a proline, as expected.
Is the UniParc identifier the correct ID for mapping annotated variants to amino acid sequences, or should we ignore variants where the Swiss-Prot canonical sequence and the variant annotation disagree?
Thanks!
Hi Elisabeth - that is exactly my point: the UniParc identifier seems to have the correct sequence. Is that the correct identifier to use?
UniParc sequences are uncurated and reflect the raw data as submitted to the cross-referenced databases, e.g. EMBL, RefSeq and PDB. It is certainly not wrong to use UniParc identifiers, especially on a large scale.
However, you might miss out on the expert biocuration that has gone into individual UniProtKB/Swiss-Prot entries. Unfortunately, UniProtKB/Swiss-Prot is not necessarily complete, and variants/polymorphisms are not annotated to the level of individual isoforms.
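Whichever identifier you settle on, it may be worth checking each variant's reference residue against the sequence you retrieved before using it. Below is a minimal Python sketch of such a check, assuming the annotation provides a short protein change such as p.P86L and that the FASTA text has already been read into a string. The function names (`parse_fasta`, `ref_matches`) and the toy sequences are hypothetical, made up for illustration; they are not real ATP1A3 data, and only simple single-residue substitutions are handled.

```python
import re

def parse_fasta(text):
    """Parse FASTA text into {id: sequence}.

    The id is the first whitespace-delimited token of the header line,
    e.g. 'sp|P13637|AT1A3_HUMAN' for a UniProtKB/Swiss-Prot record.
    """
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}

# Simple substitutions only: p.<ref><position><alt>, e.g. p.P86L
PROTEIN_CHANGE = re.compile(r"^p\.([A-Z])(\d+)([A-Z*])$")

def ref_matches(protein_change, sequence):
    """Return True if the reference residue in protein_change is found
    at the stated (1-based) position of sequence."""
    match = PROTEIN_CHANGE.match(protein_change)
    if match is None:
        raise ValueError(f"unsupported protein change: {protein_change!r}")
    ref = match.group(1)
    pos = int(match.group(2))
    return 1 <= pos <= len(sequence) and sequence[pos - 1] == ref

# Toy example (made-up sequences): only 'isoform' has P at position 3.
TOY_FASTA = """>canonical toy record
MAAAL
>isoform toy record
MAPAL
"""

if __name__ == "__main__":
    for name, seq in parse_fasta(TOY_FASTA).items():
        print(name, ref_matches("p.P3L", seq))
```

Running the same check against the canonical Swiss-Prot sequence, its isoforms and the UniParc record for a variant would show which of them agrees with the annotation, and variants that match none of them can be flagged rather than silently dropped.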