Question

Merging Data From Pfam Pdb And Uniprot

6

13.8 years ago

Aurobhima ▴ 100

Hi,

Does anyone know of a method to map between Pfam, PDB and UniProt? I have very specific criteria I want to select data with, and applying them requires combining all three databases.

I have been working on a solution for some time now on my own, but would like to know if anyone else has been doing something like this and if they'd be interested in discussing this with me.

Thanks

pdb uniprot • 10k views

ADD COMMENT • link updated 9.1 years ago by konrad.koehler • 0 • written 13.8 years ago by Aurobhima ▴ 100

score 8 · Answer 1 · 2011-07-11

8

13.8 years ago

Khader Shameer 18k

Residue-level cross reference data based on PDB is available via SIFTS annotations.

Please check the following files at SIFTS Quick Access:

pdb_chain_uniprot.lst - A summary of the PDBe to UniProt residue level mapping, showing the start and end residues of the mapping using SEQRES, PDB sequence and UniProt numbering.

pdb_chain_pfam.lst - A summary of the Pfam domain identifier(s)(derived via the UniProt mapping) for each PDB chain that has been processed.

You can combine the two files, using one identifier (for example the PDB ID plus chain) to map to the others. This is the best cross-reference for PDB-UniProt-Pfam I could find. I am using this in my analysis.
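As a rough illustration of joining the two SIFTS summaries on PDB ID and chain, here is a minimal sketch. It assumes tab-separated files with a header row naming the columns PDB, CHAIN, SP_PRIMARY and PFAM_ID, and optional leading comment lines starting with "#"; check the header of the files you actually download, since the layout may differ. The function names and the demo rows are made up for the example.

```python
import csv
import io
from collections import defaultdict


def read_sifts(handle):
    """Return a DictReader over a SIFTS-style table, skipping '#' comment lines."""
    lines = (line for line in handle if not line.startswith("#"))
    return csv.DictReader(lines, delimiter="\t")


def merge_by_chain(uniprot_handle, pfam_handle):
    """Map (pdb_id, chain) -> {"uniprot": set of accessions, "pfam": set of Pfam IDs}."""
    merged = defaultdict(lambda: {"uniprot": set(), "pfam": set()})
    for row in read_sifts(uniprot_handle):
        merged[(row["PDB"], row["CHAIN"])]["uniprot"].add(row["SP_PRIMARY"])
    for row in read_sifts(pfam_handle):
        merged[(row["PDB"], row["CHAIN"])]["pfam"].add(row["PFAM_ID"])
    return dict(merged)


# Tiny made-up tables standing in for the two downloaded files (assumed layout).
uniprot_demo = io.StringIO(
    "# illustrative comment line\n"
    "PDB\tCHAIN\tSP_PRIMARY\tSP_BEG\tSP_END\n"
    "1abc\tA\tP00001\t21\t120\n"
)
pfam_demo = io.StringIO(
    "PDB\tCHAIN\tSP_PRIMARY\tPFAM_ID\n"
    "1abc\tA\tP00001\tPF00001\n"
    "1abc\tA\tP00001\tPF00002\n"
)
demo = merge_by_chain(uniprot_demo, pfam_demo)
print(demo)
```

For the real files, pass handles from open() (or gzip.open(path, "rt") for compressed downloads) instead of the StringIO objects.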

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 13.8 years ago by Khader Shameer 18k

2

What kind of issues?

ADD REPLY • link 13.8 years ago by Khader Shameer 18k

0

Thanks.. we did try it before and found that there are some issues with it.. which is why we went our own way.. but it is the closest I've seen to what I'm looking for..

ADD REPLY • link 13.8 years ago by Aurobhima ▴ 100

score 4 · Answer 2 · 2011-07-11

Another answer for fun, using bio2rdf :-)

from http://uniprot.bio2rdf.org/sparql use the following query

select ?id ?pdb ?pfam  where {
?s <http://purl.org/dc/elements/1.1/identifier> ?id .
?s a <http://bio2rdf.org/core:Protein> .
?s  <http://www.w3.org/2000/01/rdf-schema#seeAlso>  ?pdb .  
?s  <http://www.w3.org/2000/01/rdf-schema#seeAlso>  ?pfam . 
FILTER regex(?pdb, "pdb:") 
FILTER regex(?pfam, "pfam:")

} limit 100 ##remove this for a larger answer

id  pdb     pfam
uniprot:P13744  http://bio2rdf.org/pdb:2E9Q     http://bio2rdf.org/pfam:PF00190
uniprot:P13744  http://bio2rdf.org/pdb:2EVX     http://bio2rdf.org/pfam:PF00190
uniprot:Q8GBW6  http://bio2rdf.org/pdb:1ON3     http://bio2rdf.org/pfam:PF01039
uniprot:Q8GBW6  http://bio2rdf.org/pdb:1ON9     http://bio2rdf.org/pfam:PF01039
uniprot:Q10666  http://bio2rdf.org/pdb:3C2G     http://bio2rdf.org/pfam:PF00505
uniprot:Q9FK25  http://bio2rdf.org/pdb:1NII     http://bio2rdf.org/pfam:PF08100
uniprot:Q9FK25  http://bio2rdf.org/pdb:1NII     http://bio2rdf.org/pfam:PF00891
uniprot:P31946  http://bio2rdf.org/pdb:2BQ0     http://bio2rdf.org/pfam:PF00244
uniprot:P31946  http://bio2rdf.org/pdb:2C23     http://bio2rdf.org/pfam:PF00244
uniprot:Q12802  http://bio2rdf.org/pdb:2DRN     http://bio2rdf.org/pfam:PF00169
(...)
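Note that the query matches the two seeAlso patterns independently, so each row pairs every PDB link of a protein with every Pfam link of the same protein; the pairing is per protein, not per structure and domain. A minimal sketch of collapsing such rows into per-protein sets is shown here, assuming the whitespace-separated layout printed above (the function name is made up for the example).

```python
from collections import defaultdict


def group_results(rows):
    """Group (uniprot, pdb_uri, pfam_uri) rows into per-protein sets of IDs."""
    grouped = defaultdict(lambda: {"pdb": set(), "pfam": set()})
    for line in rows:
        fields = line.split()
        if len(fields) != 3 or not fields[0].startswith("uniprot:"):
            continue  # skip the header, blank lines and "(...)"
        uniprot_id, pdb_uri, pfam_uri = fields
        accession = uniprot_id.split(":", 1)[1]
        # The ID is whatever follows the last ':' in each bio2rdf URI.
        grouped[accession]["pdb"].add(pdb_uri.rsplit(":", 1)[1])
        grouped[accession]["pfam"].add(pfam_uri.rsplit(":", 1)[1])
    return dict(grouped)


sample = """\
id  pdb     pfam
uniprot:Q9FK25  http://bio2rdf.org/pdb:1NII     http://bio2rdf.org/pfam:PF08100
uniprot:Q9FK25  http://bio2rdf.org/pdb:1NII     http://bio2rdf.org/pfam:PF00891
uniprot:P31946  http://bio2rdf.org/pdb:2BQ0     http://bio2rdf.org/pfam:PF00244
uniprot:P31946  http://bio2rdf.org/pdb:2C23     http://bio2rdf.org/pfam:PF00244
(...)
"""
grouped = group_results(sample.splitlines())
print(grouped)
```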

score 2 · Answer 3 · 2011-07-11

2

13.8 years ago

Michael Kuhn 5.0k

It's all there on UniProt in "Cross-references", e.g. see this entry for NMB1681. The data is also available in the export formats, e.g. text format.
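For the text export, the cross-references sit on the DR lines of each entry. The following is a minimal sketch of pulling the PDB and Pfam IDs out of them, assuming the usual flat-file layout (a two-letter line code followed by three spaces, semicolon-separated fields, entries terminated by "//"). The function name and the demo entry are invented for illustration.

```python
def parse_cross_refs(flat_text):
    """Return {primary accession: {"PDB": [...], "Pfam": [...]}} from flat-file text."""
    entries = {}
    accession = None
    refs = None
    for line in flat_text.splitlines():
        if line.startswith("AC   ") and refs is None:
            # The first accession on the first AC line is the primary one.
            accession = line[5:].split(";")[0].strip()
            refs = {"PDB": [], "Pfam": []}
        elif line.startswith("DR   ") and refs is not None:
            fields = [f.strip() for f in line[5:].rstrip(".").split(";")]
            if fields[0] in refs and len(fields) > 1:
                refs[fields[0]].append(fields[1])
        elif line.startswith("//") and refs is not None:
            entries[accession] = refs
            accession, refs = None, None
    return entries


# Made-up entry in the assumed layout; not a real UniProt record.
demo_entry = """\
ID   EXAMPLE_DEMO            Reviewed;         100 AA.
AC   P00001; Q00001;
DR   PDB; 1ABC; X-ray; 2.00 A; A=1-100.
DR   Pfam; PF00001; Demo_domain; 1.
DR   GO; GO:0000001; F:demo; IEA:Demo.
//
"""
cross_refs = parse_cross_refs(demo_entry)
print(cross_refs)
```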

ADD COMMENT • link 13.8 years ago by Michael Kuhn 5.0k

4

Example of an inconsistency?

ADD REPLY • link 13.8 years ago by Neilfws 49k

0

I have these data, but there are inconsistencies in the cross references between the 3 databases.. I wish it were that straightforward..

ADD REPLY • link 13.8 years ago by Aurobhima ▴ 100

score 1 · Answer 4 · 2011-07-11

1

13.8 years ago

Chris Evelo 10k

You might want to have a look at our BridgeDB, which was developed to help you solve questions like this. See: http://www.bridgedb.org

ADD COMMENT • link 13.8 years ago by Chris Evelo 10k

1

thanks I'll have a look into it.. it could be useful..

ADD REPLY • link 13.8 years ago by Aurobhima ▴ 100

score 0 · Answer 5 · 2011-07-11

0

13.8 years ago

Pierre Lindenbaum 166k

The file ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz seems to contain all the IDs.

curl  -s "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz" |\
gunzip -c |\
egrep -i '(accession|pfam|pdb)'

  (...)
  <accession>P0C9E9</accession>
  <accession>P0C9K3</accession>
  <dbReference id="PF01639" key="10" type="Pfam">
  <accession>P0C9I4</accession>
  <dbReference id="PF01639" key="10" type="Pfam">
  (...)
  <property type="PDB accession" value="1KMH"/>
  (...)
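The egrep output loses track of which Pfam and PDB IDs belong to which accession. A minimal sketch of keeping them together by streaming the XML entry by entry is given here; it assumes entry-level cross-references appear as dbReference elements with type="PDB" or type="Pfam" directly under each entry, and it ignores the XML namespace so it does not depend on the exact namespace URI. The function names and the demo document are made up.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def iter_cross_refs(xml_handle):
    """Yield (primary accession, PDB IDs, Pfam IDs) for each entry in the stream."""
    for _event, entry in ET.iterparse(xml_handle, events=("end",)):
        if local_name(entry.tag) != "entry":
            continue
        accession = None
        refs = {"PDB": [], "Pfam": []}
        for child in entry:  # direct children only, so citation references are skipped
            name = local_name(child.tag)
            if name == "accession" and accession is None:
                accession = child.text
            elif name == "dbReference" and child.get("type") in refs:
                refs[child.get("type")].append(child.get("id"))
        yield accession, refs["PDB"], refs["Pfam"]
        entry.clear()  # free memory; the full file is large


# Made-up document in the assumed layout; not real UniProt data.
demo_xml = b"""<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>P00001</accession>
    <accession>Q00001</accession>
    <dbReference type="PDB" id="1ABC"/>
    <dbReference type="Pfam" id="PF00001"/>
    <dbReference type="GO" id="GO:0000001"/>
  </entry>
</uniprot>"""
records = list(iter_cross_refs(io.BytesIO(demo_xml)))
print(records)
```

For the downloaded file, gzip.open(path, "rb") can be passed in place of the BytesIO object.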

ADD COMMENT • link 13.8 years ago by Pierre Lindenbaum 166k

0

There is also similar data in Pfam, and there is the UniProt ID in the header of PDB files.. but they don't play nicely with each other.

ADD REPLY • link 13.8 years ago by Aurobhima ▴ 100

score 0 · Answer 6 · 2011-07-11

0

13.8 years ago

Nabellaleen ▴ 10

There are also modern solutions using data-crossing software (e.g. http://www.isoft.fr/bio/biopack_data_en.htm ). It definitely fills me with despair to see people "reinvent the wheel" for the main, but only first, step of their work: data access and mining ...

ADD COMMENT • link 13.8 years ago by Nabellaleen ▴ 10

0

Thanks.. I'll have a look.. not sure I'm re-inventing the wheel though.. I have yet to find something that comes close to what I'm trying to do.. I need to apply very specific selection criteria, e.g. all Pfam domains which are only present in non-membrane mitochondrial proteins, or which protein structures are found exclusively extracellularly in eukaryotes.. if I'm reinventing the wheel, I'd be really happy to use the existing one.. :-)

ADD REPLY • link 13.8 years ago by Aurobhima ▴ 100

0

It sounds like you should be able to build a query to answer that using SRS. Or at most a couple of queries!

ADD REPLY • link 13.8 years ago by Iain ▴ 260

0

In fact, there is software that lets you easily import, read, parse, filter and cross data with total control over all parameters. With this type of software you can build a pipeline for your needs, or for many other needs, in a few days. And when I say "reinvent the wheel", it's not about your specific analysis but about redesigning a script each time with only minor changes but at a large time cost :)

ADD REPLY • link 13.8 years ago by Nabellaleen ▴ 10

score 0 · Answer 7 · 2011-07-11

On uniprot.org, use the "customize display" option in the UniProt entry view.

Or use the mapping service: http://www.uniprot.org/uniprot/?tab=mapping.

If you want to discuss the way UniProt maps to PDBe (not as straightforward as you might think), contact help@uniprot.org. Pfam annotations come directly out of the InterPro results, so there should not be much skew between these databases.

score 0 · Answer 8 · 2011-07-11

You could try using the SRS service at the EBI.

http://srs.ebi.ac.uk/

This service links many databases with each other.

There is a tutorial available: http://www.embl.de/~seqanal/courses/srscourse/srstut.html

An example taken directly from this tutorial, the query: enzyme < pdb gives all the enzyme database entries for which the 3D structure is known!

score 0 · Answer 9 · 2011-07-11

0

13.8 years ago

Aleksandr Levchuk 3.2k

I am planning to use a hash/checksum of the protein sequences to cross-link Uniprot to others.

SEquence Globally Unique IDentifier (SEGUID) is a hashing standard (based on SHA1) - it was specifically developed for uniquely identifying protein sequences.
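To make the idea concrete, here is a minimal sketch of a SEGUID-style checksum and of grouping identifiers from different databases by it. It assumes the commonly described recipe (SHA-1 of the upper-cased sequence, base64-encoded, with the trailing "=" padding removed); the helper names and the demo records are invented, and this is not the PostgreSQL implementation linked in this answer.

```python
import base64
import hashlib
from collections import defaultdict


def seguid(sequence):
    """SEGUID-style checksum: base64 of the SHA-1 digest of the upper-cased sequence."""
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def index_by_seguid(records):
    """Group (identifier, sequence) pairs by checksum to find identical sequences."""
    index = defaultdict(list)
    for identifier, sequence in records:
        index[seguid(sequence)].append(identifier)
    return dict(index)


# Made-up identifiers and sequences, purely for illustration.
demo_records = [
    ("uniprot:P00001", "MKTAYIAKQR"),
    ("pdb:1abc_A", "mktayiakqr"),  # same sequence, different case
    ("pdb:2xyz_B", "MKTAYIAKQK"),
]
index = index_by_seguid(demo_records)
print(index)
```

One caveat of this design: it only links exactly identical sequences, so PDB chains that cover a fragment of the UniProt sequence, or carry tags or mutations, will not match.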

See also: our PostgreSQL sequence-to-seguid implementation http://dba.stackexchange.com/questions/66/biological-sequences-of-uniprot-in-postgresql

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 13.8 years ago by Aleksandr Levchuk 3.2k

0

The sequence cross-referencing tool at the EBI might save you some time. http://www.ebi.ac.uk/Tools/picr/

ADD REPLY • link 13.8 years ago by Iain ▴ 260

0

Get in touch and let's see if we can merge my approach with yours. I think your idea has real potential.

ADD REPLY • link 13.8 years ago by Aurobhima ▴ 100

score 0 · Answer 10 · 2016-04-16

This is a very old thread; however, I still have not found a good way to do PDB-to-UniProt residue mapping that doesn't rely on a web server that may not be up to date. SIFTS may be the way to go, but it has a complicated data structure. Below is a simple self-contained Biopython function which relies on an on-the-fly sequence alignment to determine the residue mapping. There may be more elegant ways to script this, but the following works.

from Bio.PDB import PPBuilder, Selection
from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.PDBList import PDBList
from Bio import pairwise2
from Bio import SeqIO

def resmap(chain, uniprot_sequence):
    # Returns a PDB to UniProt residue number dictionary.

    ppb=PPBuilder()
    polypeptides = ppb.build_peptides(chain)
    pdb_sequence = ""
    for polypeptide in polypeptides:
        pdb_sequence = pdb_sequence + polypeptide.get_sequence()
    pdb_res_nums = sorted(res.id[1] for res in chain if res.id[0] == " ")

    residue_list = Selection.unfold_entities(chain, 'R')
    alignments = pairwise2.align.globalms(uniprot_sequence, pdb_sequence, 2, -1, -.5, -.1)    
    uniprot_align = str(alignments[0][0])
    pdb_align     = str(alignments[0][1])

    uniprot_map = []
    count = 0
    for residue in uniprot_align:
        if residue != "-":
            count += 1
            uniprot_map.append(count)
        else:
            uniprot_map.append(-1)

    pdb_map = []
    count = -1
    for residue in pdb_align:
        if residue != "-":
            count += 1
            pdb_map.append(pdb_res_nums[count])
        else:
            pdb_map.append(-1)

    matches = []
    for index, residue in enumerate(uniprot_map):
        if uniprot_align[index] == pdb_align[index] and uniprot_align[index] != "-" and pdb_align[index] != "-":
            matches.append(True)
        else:
            matches.append(False)

    mapping = {}
    for index, match in enumerate(matches):
        if match:
            mapping[pdb_map[index]] = uniprot_map[index]

    return mapping