Question

Which protein sequence identifier to use after annotating variants (Swiss-Prot or UniParc)?

8.4 years ago

aditi.qamra ▴ 270

Hi,

I annotated MAF files sourced from TCGA using the Variant Effect Predictor (VEP) from Ensembl, following this tutorial: https://www.biostars.org/p/91806/#200161

I then wanted to get the reference protein sequence for these variants, and used the Swiss-Prot identifier (e.g. AT1A3_HUMAN) to fetch the sequence from ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

I noticed that for some variants the reference amino acid doesn't match the residue at the mutated position in the UniProt sequence. However, if you go to the UniProt page for that ID (http://www.uniprot.org/uniprot/P13637), the sequence isoforms listed there do have the correct amino acid at the given position.

E.g. ATP1A3 c.257C>T p.Pro86Leu p.P86L AT1A3_HUMAN UPI0001914BDE

But if I use the UniParc FASTA file and the UniParc identifier (UPI0001914BDE), position 86 of the protein sequence is a proline, as expected.
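The check described above can be sketched roughly as follows. This is only an illustration: the function name `ref_matches` and the toy sequence are hypothetical (the toy sequence is not the real ATP1A3 protein), and it assumes the short protein change notation shown in the example (`p.P86L`).

```python
import re

# Matches simple missense/nonsense changes in short notation, e.g. "p.P86L".
# Other variant classes (frameshifts, indels, splice sites) are not handled.
SHORT_PROTEIN_CHANGE = re.compile(r"^p\.([A-Z])(\d+)([A-Z*])$")


def ref_matches(protein_change, protein_seq):
    """Return True if the reference residue of the change is found at the
    stated 1-based position of protein_seq, False otherwise."""
    m = SHORT_PROTEIN_CHANGE.match(protein_change)
    if m is None:
        raise ValueError(f"unsupported protein change: {protein_change!r}")
    ref_aa, pos = m.group(1), int(m.group(2))
    if pos < 1 or pos > len(protein_seq):
        return False  # position falls outside this sequence
    return protein_seq[pos - 1] == ref_aa


# Toy sequence (hypothetical): a proline at position 86, alanine elsewhere.
toy_seq = "A" * 85 + "P" + "A" * 10
print(ref_matches("p.P86L", toy_seq))
```

Running this against both the Swiss-Prot and the UniParc sequence for each variant would show which of the two the annotation was made against.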

Is the UniParc identifier the correct ID for mapping annotated variants to an amino acid sequence, or should we ignore variants where the Swiss-Prot canonical sequence and the variant annotation disagree?

Thanks!

vep uniprot ensembl • 2.2k views

score 0 · Answer 1 · 2016-07-13

8.4 years ago

Elisabeth Gasteiger ★ 2.4k

Each isoform annotated in UniProtKB/Swiss-Prot has its own UniParc identifier: http://www.uniprot.org/uniparc/?query=P13637&sort=score

Entry          UniProtKB                                                  Length
UPI0000124FC3  P13637.2 (obsolete)                                        1013
UPI000013E791  P13637; A0A0D9S121; F7HPP6; Q969K5.1 (obsolete); P13637-1  1013
UPI00001614EC  P13637.1 (obsolete)                                        1013
UPI0001914BDE  B7Z2T0.1 (obsolete); P13637-3                              1026
UPI00020651B0  F5H6J6.1 (obsolete); P13637-2                              1024

UPI0001914BDE corresponds to P13637-3, which has 13 extra residues near the N-terminus: MG → MGSGGSDSYRIATSQ compared to the canonical sequence.

Pro-86 in P13637-3 corresponds to Pro-73 in the canonical sequence. I am not familiar with MAF files, but it seems to be a coincidence that residue 86 in the canonical sequence is a Leu.
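The position arithmetic above can be written out as a small sketch. The two prefixes are the ones quoted in this answer; the function name `isoform3_to_canonical` is hypothetical, and the code assumes that the two sequences are identical after the altered N-terminal segment, which is consistent with the lengths 1013 and 1026 but not verified here.

```python
# N-terminal segments as quoted in the answer: MG -> MGSGGSDSYRIATSQ
CANONICAL_PREFIX = "MG"
ISOFORM3_PREFIX = "MGSGGSDSYRIATSQ"

# Net number of extra residues in isoform 3 (15 - 2 = 13).
OFFSET = len(ISOFORM3_PREFIX) - len(CANONICAL_PREFIX)


def isoform3_to_canonical(pos):
    """Map a 1-based position in P13637-3 to the canonical sequence.

    Returns None for positions inside the altered N-terminal segment,
    where there is no simple one-to-one correspondence.
    Assumes the sequences are identical downstream of that segment.
    """
    if pos < 1:
        raise ValueError("positions are 1-based")
    if pos <= len(ISOFORM3_PREFIX):
        return None
    return pos - OFFSET


print(isoform3_to_canonical(86))  # 73
```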

Hi Elisabeth, that is exactly my point: the UniParc identifier seems to have the correct sequence. Is it the correct identifier to use?

ADD REPLY • link 8.4 years ago by aditi.qamra ▴ 270

UniParc sequences are uncurated and reflect the raw data as submitted to the cross-referenced databases, e.g. EMBL, RefSeq or PDB. It is certainly not wrong to use UniParc identifiers, especially on a large scale.

However, you might miss out on the expert biocuration that has gone into individual UniProtKB/Swiss-Prot entries. Unfortunately, UniProtKB/Swiss-Prot is not necessarily complete, and variants/polymorphisms are not annotated to the level of individual isoforms.

ADD REPLY • link 8.3 years ago by Elisabeth Gasteiger ★ 2.4k