Parsing PDB Codes To Their CATH Numbers
12.8 years ago
Reyhaneh ▴ 530

Hi;

I am looking for a simple file that, for every PDB code and chain (e.g. 1e6jP), lists the representative CATH numbers for all of its domains. For example:

PDBcode   CATH  
1e6jP01   1.10.375.10  
1e6jP02   1.10.1200.30

I have looked at the download section of the CATH website but was not able to find such a file. Do you have any suggestions?

Thank you;

Reyhaneh

12.8 years ago

EDIT Actually, I realised my first attempt didn't, strictly speaking, answer your question. This should be a bit better.

You want this file: <http://release.cathdb.info/v3.4.0/CathDomainList>

Column 0 gives you the domain name (PDB code, chain and domain number, e.g. 1e6jP01), and columns 1-4 give you the CATH classification down to the homologous superfamily level. So to parse:

import urllib.request

def get_pdb_dict():
    """Takes CATH domain list (from URL) and returns dictionary mapping
    PDB code + chain (e.g. '1e6jP') to (domain number, CATH code) pairs"""
    pdbs = {}
    url = 'http://release.cathdb.info/v3.4.0/CathDomainList'
    with urllib.request.urlopen(url) as fh:
        lines = fh.read().decode('utf-8').splitlines()
    for line in lines:
        #ignore comments and blank lines
        if line.strip() and not line.startswith('#'):
            #columns are whitespace-delimited, so str.split() is enough
            tokens = line.split()
            domain = tokens[0]
            #split the domain name into PDB code + chain and domain number
            pdb_chain = domain[0:5]
            domain_num = domain[5:]
            #could be more/less precise by using more/fewer columns
            cath = '.'.join(tokens[1:5])
            pdbs.setdefault(pdb_chain, []).append((domain_num, cath))
    return pdbs

if __name__ == '__main__':
    p = get_pdb_dict()
    chains = p['1e6jP']
    print(chains)

[biostar-code/python]$ python parse_cath_domain.py 
[('01', '1.10.375.10'), ('02', '1.10.1200.30')]
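
If the file has already been downloaded, the same parsing can be done without touching the network. A minimal sketch, assuming the whitespace-delimited column layout described above; `domain_table` is a hypothetical helper name, and the sample lines are cut down to the first five columns purely for illustration:

```python
def domain_table(lines):
    """Return (domain name, CATH code) pairs from CathDomainList-style lines."""
    rows = []
    for line in lines:
        # skip blank lines and '#' comment/header lines
        if not line.strip() or line.startswith('#'):
            continue
        tokens = line.split()
        # column 0 is the domain name, columns 1-4 are C, A, T and H
        rows.append((tokens[0], '.'.join(tokens[1:5])))
    return rows

# Illustrative input only: real lines carry further columns after these five.
sample = [
    "# example header line",
    "1e6jP01     1    10   375    10",
    "1e6jP02     1    10  1200    30",
    "",
]

# Print the two-column table asked for in the question
for domain, cath in domain_table(sample):
    print(domain + "\t" + cath)
```

Because the function accepts any iterable of lines, an open file handle for a local copy of CathDomainList could be passed in place of `sample`.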

Beaten by seconds! I'll just note that the file is about 11.3 MB and has not been updated for some time (dated 2010-11-21), and that "grep 1e6j CathDomainList" gives you a quick view of the entries.
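
For anyone without grep to hand, roughly the same quick look can be had from Python. This is only a sketch: `grep_pdb` is a made-up name, it matches the PDB code at the start of the line rather than anywhere in it, and the sample lines are cut down to the domain name plus the four classification columns.

```python
def grep_pdb(lines, pdb_code):
    """Yield data lines whose domain name starts with the given PDB code."""
    prefix = pdb_code.lower()
    for line in lines:
        # header and comment lines start with '#'
        if line.startswith('#'):
            continue
        if line.lower().startswith(prefix):
            yield line.rstrip('\n')

sample = [
    "# comment line\n",
    "1e6jH01     2    60    40    10\n",
    "1e6jP01     1    10   375    10\n",
    "9zzzA00     9     9     9     9\n",  # made-up entry that should not match
]

hits = list(grep_pdb(sample, "1E6J"))
for hit in hits:
    print(hit)
```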


+1 @neilfws grep would usually be my preferred solution, admittedly. I just fancied writing some code ;)


@Simon Cockell Thank you very much. I saw this file but didn't understand the format before. Thanks for the clear explanation.


Here is a more up-to-date version of the file:

http://release.cathdb.info/v3.5.0/CathDomainList

FILE NAME: CathDomainList.v3.5.0

FILE DATE: 21.09.2011

CATH VERSION: v3.5.0

VERSION DATE: 21.09.2011



12.8 years ago
Neilfws 49k

If you prefer not to download and parse files, CATH provides a web service which returns XML for a given PDB code. For example: 1E6J.

You could then extract the CATH code using the XML parsing library of your choice. Quick and dirty Ruby example:

#!/usr/bin/ruby
require 'rubygems' # ruby 1.8
require 'mechanize'
require 'crack'

agent = Mechanize.new
page  = agent.get("http://www.cathdb.info/pdb/1e6j?view=xml")
doc   = Crack::XML.parse(page.body)

doms  = doc['document']['cath_pdb_query']['cath_domain'].map {|d|
  [d['domain_id'], d['cath_code']]
}

doms.each {|d|
  puts d.join("\t")
}

# result
1e6jH01 2.60.40.10.5.1.11.1.2
1e6jH02 2.60.40.10.10.1.1.1.139
1e6jL01 2.60.40.10.6.5.2.2.2
1e6jL02 2.60.40.10.3.1.1.1.411
1e6jP01 1.10.375.10.1.1.2.7.1
1e6jP02 1.10.1200.30.2.1.2.3.1
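
The codes returned here run to nine levels, whereas the question asks for the four-level form (e.g. 1.10.375.10). A small sketch of trimming them, written in Python rather than Ruby; `to_cath_number` is a hypothetical helper and the input is copied from the result above:

```python
def to_cath_number(cath_code, levels=4):
    """Keep the first `levels` dot-separated parts of a CATH code."""
    return '.'.join(cath_code.split('.')[:levels])

# Two lines taken from the web service result shown above
result = """\
1e6jP01 1.10.375.10.1.1.2.7.1
1e6jP02 1.10.1200.30.2.1.2.3.1
"""

trimmed = {}
for line in result.splitlines():
    domain_id, cath_code = line.split()
    trimmed[domain_id] = to_cath_number(cath_code)

for domain_id, cath_number in trimmed.items():
    print(domain_id + "\t" + cath_number)
```

The `levels` argument is there so the same helper can stop at a coarser level, such as class and architecture only.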

A nice one. Thank you.
