Isoforms Missing In Uniprot Isoforms Fasta File
13.0 years ago
Pablo Pareja ★ 1.6k

Hi,

I just ran into a problem: some isoforms specified in UniProt XML files are not present in the 'every isoform' FASTA file available from UniProt.

One example is the protein Q8IYH5, whose entry lists the following isoforms:

<comment type="alternative products">
  <event type="alternative splicing"/>
  <isoform>
    <id>Q8IYH5-1</id>
    <name>1</name>
    <sequence type="displayed"/>
  </isoform>
  <isoform>
    <id>Q8IYH5-2</id>
    <name>2</name>
    <sequence type="described" ref="VSP_025511"/>
  </isoform>
  <isoform>
    <id>Q8IYH5-3</id>
    <name>3</name>
    <sequence type="described" ref="VSP_025509 VSP_025510"/>
    <note>No experimental confirmation available.</note>
  </isoform>
  <isoform>
    <id>Q8IYH5-4</id>
    <name>4</name>
    <sequence type="described" ref="VSP_025512 VSP_025513"/>
    <note>No experimental confirmation available.</note>
  </isoform>
</comment>

Most of them can be found in the general isoform FASTA file; however, the isoform with ID Q8IYH5-1 is not present in it.

Do you know whether this is a bug or an isolated case? Is there any common reason for, or relationship between, the isoforms that are missing?

Thanks in advance,

Pablo

uniprot isoform protein xml fasta • 3.5k views

Hi Pablo, the isoform fasta file (uniprot_sprot_varsplic.fasta) only contains additional isoforms (see ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/README.varsplic) and not the "canonical" sequences, i.e. those marked <sequence type="displayed"/>

The isoform file needs to be combined with the file uniprot_sprot.dat (or uniprot_sprot.fasta) to obtain all isoform sequences.

See also

What is the canonical sequence? http://www.uniprot.org/faq/30

How to retrieve sets of protein sequences? http://www.uniprot.org/faq/38

Elisabeth

13.0 years ago

Sorry, this is my first post - I just posted my reply as a comment by mistake.

Here you go again:

The isoform fasta file (uniprot_sprot_varsplic.fasta) only contains additional isoforms (see ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/README.varsplic) and not the "canonical" sequences, i.e. those marked

<sequence type="displayed"/>

in the xml.

The isoform file needs to be combined with the file uniprot_sprot.dat (or uniprot_sprot.fasta) to obtain all isoform sequences.
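As a rough illustration of that combination step, here is a minimal Python sketch. It assumes UniProt-style FASTA headers of the form `>db|ACCESSION|ENTRY_NAME ...`; the helper names `read_fasta` and `merge_isoforms` are hypothetical, and the toy records use a made-up accession and made-up sequences in place of the real files.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (accession, sequence) pairs from FASTA lines.

    Assumption: headers look like '>db|ACCESSION|ENTRY_NAME description'.
    """
    accession = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if accession is not None:
                yield accession, "".join(chunks)
            accession = line[1:].split("|")[1]
            chunks = []
        else:
            chunks.append(line)
    if accession is not None:
        yield accession, "".join(chunks)


def merge_isoforms(canonical_lines: Iterable[str],
                   varsplic_lines: Iterable[str]) -> Dict[str, str]:
    """Combine canonical and additional-isoform records into one mapping."""
    merged = dict(read_fasta(canonical_lines))
    merged.update(read_fasta(varsplic_lines))
    return merged


# Toy records (made-up accession and sequences) standing in for the two files.
canonical = [">sp|X00001|TEST_HUMAN Example protein", "MKTAYIAK", "QRQISFVK"]
varsplic = [">sp|X00001-2|TEST_HUMAN Isoform 2 of Example protein", "MKTAYIAK"]

all_seqs = merge_isoforms(canonical, varsplic)
print(sorted(all_seqs))
```

With the real downloads, the two arguments could be open file handles for uniprot_sprot.fasta and uniprot_sprot_varsplic.fasta (decompressed first if they were fetched as .gz), since the reader only iterates over lines.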

See also

What is the canonical sequence? http://www.uniprot.org/faq/30

How to retrieve sets of protein sequences? http://www.uniprot.org/faq/38

Elisabeth


Thanks for your quick answer. I have a few remarks about this, though. Why not include isoform information such as the name and sequence in the XML files? Besides that, I just downloaded the file uniprot_sprot.fasta and there is no entry for the isoform Q8IYH5-1 there either. Should it be somewhere else?


Isoform names are in the xml: see e.g. <name>1</name> in your example above.

Sequences are not; I can check for you whether this has ever been discussed.

Q8IYH5-1 is the displayed (canonical) sequence of Q8IYH5, and its sequence is therefore identical to that of Q8IYH5. uniprot_sprot.fasta has an entry for all canonical isoforms, including Q8IYH5.
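To make that lookup rule concrete, here is a small Python sketch that reads the "alternative products" comment, finds which isoform is marked as displayed, and returns the accession under which each isoform's sequence would be filed. The function names `displayed_isoform_ids` and `fasta_accession` are hypothetical, and the XML is a shortened copy of the snippet from the question. Full UniProt XML files declare a default namespace, so the tag names would need to be qualified there; this sketch assumes the namespace-free fragment as quoted.

```python
import xml.etree.ElementTree as ET

# Shortened, namespace-free copy of the comment block from the question.
COMMENT_XML = """<comment type="alternative products">
  <event type="alternative splicing"/>
  <isoform>
    <id>Q8IYH5-1</id>
    <name>1</name>
    <sequence type="displayed"/>
  </isoform>
  <isoform>
    <id>Q8IYH5-2</id>
    <name>2</name>
    <sequence type="described" ref="VSP_025511"/>
  </isoform>
</comment>"""


def displayed_isoform_ids(comment_xml: str) -> set:
    """Return the IDs of isoforms whose <sequence> has type="displayed"."""
    root = ET.fromstring(comment_xml)
    ids = set()
    for isoform in root.iter("isoform"):
        seq = isoform.find("sequence")
        if seq is not None and seq.get("type") == "displayed":
            ids.add(isoform.findtext("id"))
    return ids


def fasta_accession(isoform_id: str, displayed: set) -> str:
    """Return the accession to look up for this isoform's sequence.

    The displayed (canonical) isoform is filed under the plain entry
    accession; every other isoform keeps its '-N' suffix.
    """
    if isoform_id in displayed:
        return isoform_id.rsplit("-", 1)[0]
    return isoform_id


displayed = displayed_isoform_ids(COMMENT_XML)
print(fasta_accession("Q8IYH5-1", displayed))  # Q8IYH5
print(fasta_accession("Q8IYH5-2", displayed))  # Q8IYH5-2
```

The suffix is stripped only for the isoform flagged as displayed rather than for any ID ending in "-1", since the sketch does not assume the canonical isoform always carries that number.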
