Hi,
I just ran into a problem with isoforms that are listed in UniProt XML files but are missing from the FASTA file of all isoforms available from UniProt.
One example is the protein Q8IYH5, for which the XML lists the following isoforms:
<comment type="alternative products">
  <event type="alternative splicing"/>
  <isoform>
    <id>Q8IYH5-1</id>
    <name>1</name>
    <sequence type="displayed"/>
  </isoform>
  <isoform>
    <id>Q8IYH5-2</id>
    <name>2</name>
    <sequence type="described" ref="VSP_025511"/>
  </isoform>
  <isoform>
    <id>Q8IYH5-3</id>
    <name>3</name>
    <sequence type="described" ref="VSP_025509 VSP_025510"/>
    <note>No experimental confirmation available.</note>
  </isoform>
  <isoform>
    <id>Q8IYH5-4</id>
    <name>4</name>
    <sequence type="described" ref="VSP_025512 VSP_025513"/>
    <note>No experimental confirmation available.</note>
  </isoform>
</comment>
Most of them are in the general isoform FASTA file, but the isoform with ID Q8IYH5-1 is not.
Do you know whether this is a bug or an isolated case? Is there a common reason why certain isoforms are missing?
Thanks in advance,
Pablo
Hi Pablo, the isoform FASTA file (uniprot_sprot_varsplic.fasta) contains only the additional isoforms (see ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/README.varsplic), not the "canonical" sequences, i.e. those marked <sequence type="displayed"/>.
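To make the displayed/described distinction concrete, here is a minimal Python sketch that splits the isoform IDs of an "alternative products" comment into canonical and additional ones. It assumes a namespace-free fragment like the one quoted in the question; full UniProt XML entries declare an XML namespace, so the tag lookups would need adjusting there. The function name split_isoforms is only illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened fragment based on the question (assumption: no XML namespace,
# unlike complete UniProt XML entries).
FRAGMENT = """
<comment type="alternative products">
  <isoform><id>Q8IYH5-1</id><name>1</name><sequence type="displayed"/></isoform>
  <isoform><id>Q8IYH5-2</id><name>2</name><sequence type="described" ref="VSP_025511"/></isoform>
</comment>
"""

def split_isoforms(comment_xml):
    """Return (canonical_ids, additional_ids) for one comment element."""
    root = ET.fromstring(comment_xml)
    canonical, additional = [], []
    for iso in root.iter("isoform"):
        iso_id = iso.findtext("id")
        seq = iso.find("sequence")
        seq_type = seq.get("type") if seq is not None else None
        # type="displayed" marks the canonical sequence; everything else
        # is treated here as an additional isoform.
        if seq_type == "displayed":
            canonical.append(iso_id)
        else:
            additional.append(iso_id)
    return canonical, additional

canonical, additional = split_isoforms(FRAGMENT)
print("canonical:", canonical)
print("additional:", additional)
```

Only the IDs in the second list would be expected in the isoform FASTA file.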
To obtain all isoform sequences, the isoform file needs to be combined with the file uniprot_sprot.dat (or uniprot_sprot.fasta).
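As a rough illustration of that combination step, the Python sketch below merges the two FASTA files into one dictionary keyed by accession. It assumes headers of the form >sp|ACCESSION|NAME ..., and the helper names read_fasta and all_isoforms are made up for this example. Note that, under this assumption, the canonical sequence ends up under the bare accession (e.g. Q8IYH5) rather than under Q8IYH5-1.

```python
def read_fasta(lines):
    """Yield (accession, sequence) pairs from an iterable of FASTA lines."""
    acc, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if acc is not None:
                yield acc, "".join(chunks)
            # Assumed header layout: >sp|ACCESSION|NAME description
            parts = line[1:].split("|")
            acc = parts[1] if len(parts) > 1 else line[1:].split()[0]
            chunks = []
        else:
            chunks.append(line)
    if acc is not None:
        yield acc, "".join(chunks)

def all_isoforms(canonical_lines, varsplic_lines):
    """Merge canonical and additional isoform sequences into one dict."""
    seqs = dict(read_fasta(canonical_lines))
    seqs.update(read_fasta(varsplic_lines))
    return seqs

# With the real files (paths are assumptions about your local copies):
# with open("uniprot_sprot.fasta") as c, open("uniprot_sprot_varsplic.fasta") as v:
#     seqs = all_isoforms(c, v)
```

Holding everything in a dictionary keeps the sketch short; for very large files a streaming approach or an indexed FASTA reader may be a better fit.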
See also:
What is the canonical sequence? http://www.uniprot.org/faq/30
How to retrieve sets of protein sequences? http://www.uniprot.org/faq/38
Elisabeth