Hi,
I'm currently preparing a new release of Bio4j and I just ran into a problem with the UniRef 100/90/50 XML files.
I found that a lot of representative members don't follow the XML structure I expected: either things are specified in a different way, or some of the information is missing.
Supposedly, this is how an entry should look:
<entry id="UniRef100_P99999" updated="2005-02-01">
<name>Cytochrome c</name>
<representativeMember>
<dbReference type="UniProtKB ID" id="CYC_HUMAN" >
<property type="UniProtKB accession" value="P99999" />
<property type="UniProtKB accession" value="P00001" />
<property type="UniProtKB accession" value="Q6NUR2" />
<property type="UniProtKB accession" value="Q6NX69" />
<property type="UniProtKB accession" value="Q96BV4" />
<property type="UniParc ID" value="UPI0000128BBF" />
<property type="UniRef90 ID" value="UniRef90_P99999"/>
<property type="UniRef50 ID" value="UniRef50_P99999"/>
<property type="protein name" value="Cytochrome c" />
<property type="NCBI taxonomy" value="9606" />
<property type="source organism" value="Homo sapiens" />
<property type="length" value="104" />
<property type="overlap region" value="2-105" />
</dbReference>
<sequence length="104" checksum="D47C9B513DF1C5C2">
GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWG
EDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
</sequence>
</representativeMember>
</entry>
And this is a sample of the weird ones I've found:
<entry id="UniRef100_UPI000194DDB6" updated="2011-10-19">
<name>Cluster: UPI000194DDB6 UniRef100 entry</name>
<property type="member count" value="1"/>
<property type="common taxon" value="root"/>
<property type="common taxon ID" value="1"/>
<representativeMember>
<dbReference type="UniParc ID" id="UPI000194DDB6">
<property type="UniRef90 ID" value="UniRef90_UPI0000E816FE"/>
<property type="UniRef50 ID" value="UniRef50_Q92625"/>
<property type="length" value="1128"/>
<property type="isSeed" value="true"/>
</dbReference>
<sequence length="1128" checksum="30DD0A7E86C8660E">
....
As you can see, there is no UniProt accession for the protein, and no protein name either.
In total I have found 965,244 entries with some sort of problem or missing information in UniRef100 alone (you can find them in this txt file).
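For anyone who wants to reproduce a count like this, here is a minimal Python sketch. It is only an illustration under a few assumptions, and not necessarily how the list above was produced: the function name `count_entries_without_accession` and the `local` helper are made up for this example, the `<UniRef100>` wrapper element in the sample is assumed, and the sample entries are trimmed versions of the two shown above (sequences omitted).

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Return the tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def count_entries_without_accession(source):
    """Stream <entry> elements from a path or file object.

    Returns (total entries, entries whose representative member has no
    'UniProtKB accession' property).
    """
    total = missing = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "entry":
            continue
        total += 1
        has_accession = False
        for child in elem:
            if local(child.tag) != "representativeMember":
                continue
            for node in child.iter():
                if (local(node.tag) == "property"
                        and node.get("type") == "UniProtKB accession"):
                    has_accession = True
        if not has_accession:
            missing += 1
        # Drop the entry's children so a large file does not pile up in memory.
        elem.clear()
    return total, missing


# Trimmed copies of the two entries above; the wrapper element is an assumption.
SAMPLE = b"""<UniRef100>
<entry id="UniRef100_P99999" updated="2005-02-01">
<name>Cytochrome c</name>
<representativeMember>
<dbReference type="UniProtKB ID" id="CYC_HUMAN">
<property type="UniProtKB accession" value="P99999"/>
<property type="UniParc ID" value="UPI0000128BBF"/>
</dbReference>
</representativeMember>
</entry>
<entry id="UniRef100_UPI000194DDB6" updated="2011-10-19">
<name>Cluster: UPI000194DDB6 UniRef100 entry</name>
<representativeMember>
<dbReference type="UniParc ID" id="UPI000194DDB6">
<property type="UniRef90 ID" value="UniRef90_UPI0000E816FE"/>
</dbReference>
</representativeMember>
</entry>
</UniRef100>"""

if __name__ == "__main__":
    total, missing = count_entries_without_accession(io.BytesIO(SAMPLE))
    print(f"{missing} of {total} entries have no UniProtKB accession")
```

The sketch uses `iterparse` rather than loading the whole document, since the UniRef files are far too large to parse in one go.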
Do you have any idea why this may be happening? Are all these IDs related somehow?
I'd really appreciate any feedback.
Cheers,
Pablo Pareja
You are correct, these are all UniParc-only clusters. The UniParc sequence is the representative because it's the only sequence in the cluster, i.e. there is always a representative sequence in a cluster.
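In practice this means an importer has to accept two kinds of representative member, distinguished by the `type` attribute of `dbReference`. A rough Python sketch of that branching follows. It is an assumption-laden illustration, not Bio4j's actual code: `representative_id` is a hypothetical helper, it returns the first accession listed when there are several, and the `{*}` wildcard (Python 3.8+) is used so it works whether or not the elements carry a namespace.

```python
import xml.etree.ElementTree as ET


def representative_id(entry):
    """Return (kind, identifier) for the representative member of an <entry>.

    kind is 'uniprot', 'uniparc', 'unknown', or 'none' (hypothetical labels).
    """
    ref = entry.find("./{*}representativeMember/{*}dbReference")
    if ref is None:
        return ("none", None)
    ref_type = ref.get("type")
    if ref_type == "UniProtKB ID":
        # Regular cluster: take the first UniProtKB accession listed.
        for prop in ref.findall("{*}property"):
            if prop.get("type") == "UniProtKB accession":
                return ("uniprot", prop.get("value"))
        return ("uniprot", None)
    if ref_type == "UniParc ID":
        # UniParc-only cluster: the UniParc ID is the only identifier available.
        return ("uniparc", ref.get("id"))
    return ("unknown", ref.get("id"))


if __name__ == "__main__":
    # Trimmed version of the UniParc-only entry quoted in the question.
    entry = ET.fromstring(
        '<entry id="UniRef100_UPI000194DDB6" updated="2011-10-19">'
        "<representativeMember>"
        '<dbReference type="UniParc ID" id="UPI000194DDB6">'
        '<property type="length" value="1128"/>'
        "</dbReference>"
        "</representativeMember>"
        "</entry>"
    )
    print(representative_id(entry))
```

Whether such entries should be stored under their UniParc ID or skipped is a design decision for the importer; the sketch only shows how to tell the two cases apart.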
@Raquel Tobes & @jerven thanks for the info ;)