I'm in the process of automating the assembly of paired multiple sequence alignments (i.e. MSAs in which two proteins are aligned one after the other) in order to do some paired sequence processing on them.
To do this, I'm querying two protein families from Pfam and trying to associate their members with each other.
As I understand it, checking that the genomic locations of the two proteins are adjacent is an appropriate way to associate them with high confidence in my case (since they're situated in the same operon).
So, this is the question:
Given the XML information in UniProt, what is the best way to check genomic proximity/adjacency? (Can I find it easily using the Biopython API, for example?)
If the specific locus index information is lacking (I suspect this is often the case), is it appropriate to compare the UniProt identifiers (e.g. K0D1W6 in http://www.uniprot.org/uniprot/K0D1W6.xml) for similarity using some measure?
Thanks!
Hope to get some intelligent mind out there to help me. I'd be forever grateful!
Thanks a lot!
That helps, especially in the case of merged genes. However, are there any standards for exactly what the ordered locus name (OLN) format should contain? The reason I ask is that I keep stumbling upon really weird-looking OLNs, for example:
These all come from the same organism, which seems slightly messy.
The identifiers you saw seem to be ORF names in UniProtKB, not OLNs, which suggests that they are not really ordered:
e.g. http://www.uniprot.org/uniprot/U5G7Y3
OLNs are included in UniProtKB only if they were attributed by the group that sequenced the genome.
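To make the distinction concrete, here is a minimal sketch of how one might separate the two name types when reading a UniProt XML entry, using only the Python standard library. It relies on the `type` attribute of the `<name>` elements inside `<gene>` (`"ordered locus"` versus `"ORF"`). The sample entry, the accession `X00001`, the locus names, and the helper names `gene_names_by_type` and `locus_gap` are all made up for illustration. The adjacency check is only a heuristic that assumes OLNs share a prefix and end in a running number, which will not hold for every genome.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, hand-written fragment in the shape of a UniProt XML entry.
# The accession and all gene names below are invented for illustration.
SAMPLE_XML = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>X00001</accession>
    <gene>
      <name type="primary">abcA</name>
      <name type="ordered locus">XYZ_0012</name>
      <name type="ORF">ORF_tmp7</name>
    </gene>
  </entry>
</uniprot>"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def gene_names_by_type(xml_text):
    """Collect gene names from an entry, grouped by their 'type' attribute."""
    names = {}
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if local(elem.tag) != "gene":
            continue
        for child in elem:
            if local(child.tag) == "name":
                names.setdefault(child.get("type"), []).append(child.text)
    return names


def locus_gap(oln_a, oln_b):
    """Heuristic: numeric distance between two OLNs with the same prefix.

    Returns None when the names do not end in digits or the prefixes differ,
    i.e. when the names cannot be compared this way.
    """
    pattern = re.compile(r"^(.*?)(\d+)$")
    match_a, match_b = pattern.match(oln_a), pattern.match(oln_b)
    if not match_a or not match_b or match_a.group(1) != match_b.group(1):
        return None
    return abs(int(match_a.group(2)) - int(match_b.group(2)))


names = gene_names_by_type(SAMPLE_XML)
print(names.get("ordered locus"))  # names assigned by the sequencing group
print(names.get("ORF"))            # provisional names, not necessarily ordered
print(locus_gap("XYZ_0012", "XYZ_0013"))
```

A small gap between two OLNs from the same genome is a hint that the genes are neighbours, but some genomes number their loci in steps of 5 or 10, so any threshold has to be chosen per organism. Entries that only carry ORF names give no such ordering information.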
Thanks, that clarifies it!