Question

Mapping between PDB fasta sequence and DSSP

10.5 years ago

Kevin ▴ 100

Hi,

I am trying to map between a set of fasta sequences from PDB entries and DSSP.

My starting sequence is the fasta sequence supplied with each PDB record (no breaks!). I now want to map the values in the DSSP record onto that sequence. The desired result is shown below:

NLYFQSMKAAIAQINTAALRHNLAVVKRHAPQCKIIAVVKANAYGH - main sequence (no breaks)
_______KAAIAQIN_______LAVVKRHAPQCKIIAVV_______ - DSSP sequence (_ shows the breaks in the PDB chain)

I expected this to be a simple process but I now see that the RESNUM entry in the DSSP record does not actually map to a residue number in the fasta file.

For example (4BHY:A):

The fasta file is as follows:

>4BHY:A|PDBID|CHAIN|SEQUENCE
MHHHHHHDYDIPTTENLYFQSMKAAIAQINTAALRHNLAVVKRHAPQCKIIAVVKANAYGHGLLPVARTLVDADAYAVAR
# ------------^__^ (line added to highlight sequence)
IEEALMLRSCAVVKPIVLLEGFFSAADLPVLAANNLQTAVHTWEQLEALEQADLPAPVVAWLKLDTGMHRLGVRADEMPA
FIERLAKCKNVVQPFNIMTHFSRSDELEQPTTREQIDLFSQLTAPLLGERAMANSAGILAWPDSHCDWVRPGVILYGVSP
FPNTVAADYDLQPVMTLKTQLIAVRDHKAGEPVGYGANWVSDRDTRLGVIAIGYGDGYPRMAPNGTPVLVNGRIVPLVGR
VSMDMTTVDLGPGATDKAGDEAVLWGEGLPVERVADQIGTIPYELITKLTSRVFMEYV

The start of the DSSP file is:

 #  RESIDUE AA STRUCTURE BP1 BP2  ACC     N-H-->O    O-->H-N    N-H-->O    O-->H-N    
    1   -6 A E              0   0  122      0, 0.0  1074,-0.2     0, 0.0   288,-0.1    
    2   -5 A N        +     0   0  124    286,-0.3   287,-0.0  1073,-0.0  1072,-0.0   
    3   -4 A L        -     0   0   49      1,-0.1     3,-0.1     0, 0.0  1072,-0.1  
    4   -3 A Y        -     0   0  205      1,-0.3     2,-0.2  1072,-0.0    -1,-0.1   
    5   -2 A F  S    S-     0   0  146   1072,-0.0     2,-0.4     0, 0.0    -1,-0.3  
    6   -1 A Q        +     0   0   49     -2,-0.2  1071,-0.0     1,-0.2     0, 0.0  
    7    0 A S  S    S-     0   0   76     -2,-0.4    -1,-0.2     0, 0.0     0, 0.0

The problem is that the fasta sequence starts much earlier than the PDB sequence. I figured that this would not be a problem, as the RESNUM column should reference a residue in the fasta sequence. However, according to the DSSP RESNUM, the chain begins at residue -6.
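To make the mismatch concrete, here is a minimal Python sketch that reads the RESNUM and amino acid columns from the DSSP lines above and works out how far RESNUM is shifted relative to the fasta numbering. The function names are made up for illustration, the column slices assume the classic fixed-width DSSP layout, and the offset is found by plain string matching, so it only holds for an unbroken stretch with consecutive numbering and no insertion codes.

```python
# DSSP residue lines from the excerpt above (truncated after the ACC column).
DSSP_LINES = [
    "    1   -6 A E              0   0  122",
    "    2   -5 A N        +     0   0  124",
    "    3   -4 A L        -     0   0   49",
    "    4   -3 A Y        -     0   0  205",
    "    5   -2 A F  S    S-     0   0  146",
    "    6   -1 A Q        +     0   0   49",
    "    7    0 A S  S    S-     0   0   76",
]

# First line of the 4BHY:A fasta record shown above.
FASTA_START = (
    "MHHHHHHDYDIPTTENLYFQSMKAAIAQINTAALRHNLAVVKRHAPQCKIIAVVKANAYGH"
    "GLLPVARTLVDADAYAVAR"
)


def parse_dssp_residue(line):
    """Pull RESNUM, chain and amino acid out of one fixed-width DSSP line."""
    return {
        "resnum": int(line[5:10]),  # PDB residue number, may be <= 0
        "chain": line[11],
        "aa": line[13],
    }


def resnum_to_fasta_offset(lines, fasta):
    """Return k such that fasta_position (1-based) = RESNUM + k, or None.

    Assumes the lines form one unbroken, consecutively numbered segment
    that occurs verbatim in the fasta sequence.
    """
    residues = [parse_dssp_residue(line) for line in lines]
    segment = "".join(r["aa"] for r in residues)
    start = fasta.find(segment)  # 0-based index of the first match
    if start == -1:
        return None
    return (start + 1) - residues[0]["resnum"]


offset = resnum_to_fasta_offset(DSSP_LINES, FASTA_START)
print(offset)
```

For these seven residues the segment ENLYFQS is found at fasta position 15, so RESNUM -6 corresponds to position 15 and the shift is 21. This is exactly the string-matching shortcut I would rather not rely on across 20,000 structures.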

I need to get a starting point within the fasta sequence for each DSSP file. The fact that the sequence entry for this PDB shows the information I want means that these mappings are available, but I can't seem to find them. I could do string matching of the first few amino acids in a chain to find a starting point, but I would prefer to use manually curated mappings, as I have over 20,000 structures and there are bound to be some errors if I do this. Any suggestions are much appreciated.

EDIT:

A simple solution is to take all unbroken chains in DSSP and use their sequences directly. However, I would like to avoid this, as I have the full PDB fasta sequence and need to map to that.

mapping pdb dssp • 3.9k views

updated 3.1 years ago by Ram 44k • written 10.5 years ago by Kevin ▴ 100

Ram · Answer 1 · 2014-05-27

The data in DSSP is based on the PDB structure, and so exhibits the same problems as handling the sequences in the PDB entry. In a PDB entry there are typically two types of sequence:

The sequence from the SEQRES records, which is commonly the target sequence from cloning (EMBOSS formats 'pdbseq' or 'pdbnucseq')
The actual sequence appearing in the ATOM records (i.e. the sequence from the structure) (EMBOSS formats 'pdb' or 'pdbnuc')

In your example entry (PDB:4BHY) the ATOM records start with the residue numbered -6 and progress from there, which is consistent with the DSSP data (DSSP:4BHY).
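To illustrate the difference between the two sequence types, here is a rough Python sketch that pulls both out of PDB-format text. The function names are hypothetical, and the sample lines are hand-written in PDB column layout from the fasta and DSSP excerpts in the question (atom serial numbers are invented and the coordinates are cut off), so they are not copied from the real 4BHY file. It assumes a single-model file and standard amino acids only.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

# Illustrative lines only: written by hand, truncated after the residue number.
PDB_LINES = [
    "SEQRES   1 A  378  MET HIS HIS HIS HIS HIS HIS ASP TYR ASP ILE PRO THR",
    "SEQRES   2 A  378  THR GLU ASN LEU TYR PHE GLN SER MET LYS ALA ALA ILE",
    "ATOM      1  N   GLU A  -6",
    "ATOM      2  CA  GLU A  -6",
    "ATOM      3  N   ASN A  -5",
    "ATOM      4  N   LEU A  -4",
]


def seqres_sequence(pdb_lines, chain):
    """One-letter sequence from the SEQRES records of one chain."""
    names = []
    for line in pdb_lines:
        if line.startswith("SEQRES") and line[11] == chain:
            names.extend(line[19:].split())  # residue names start at column 20
    return "".join(THREE_TO_ONE.get(name, "X") for name in names)


def atom_sequence(pdb_lines, chain):
    """List of (residue number, one-letter code) actually present in ATOM records."""
    seen = set()
    observed = []
    for line in pdb_lines:
        if line.startswith("ATOM") and line[21] == chain:
            key = (line[22:26], line[26:27])  # residue number + insertion code
            if key not in seen:  # keep one entry per residue, not per atom
                seen.add(key)
                code = THREE_TO_ONE.get(line[17:20], "X")
                observed.append((int(line[22:26]), code))
    return observed


print(seqres_sequence(PDB_LINES, "A"))
print(atom_sequence(PDB_LINES, "A"))
```

The SEQRES sequence starts at the His-tag methionine, while the ATOM-derived list starts at residue -6, which is the same discrepancy you see between the fasta file and the DSSP output.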

A simple approach would be to extract the two types of sequence for each chain using EMBOSS seqret and then use a global pairwise alignment (for example using EMBOSS needle or EMBOSS stretcher) to figure out which parts of the chain sequences correspond. Fortunately this has already been done for the PDB entries in SIFTS, which provides residue mappings between the two types of PDB sequence data and UniProtKB and mappings to various other databases.
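If you only want to see what the aligned output would look like, the sketch below places each unbroken ATOM segment onto the full sequence in order. Note that this is not a global alignment and not what EMBOSS needle does: it is a simpler exact-substring stand-in, with a made-up function name, and it assumes the segments have already been split at the chain breaks and occur verbatim in the SEQRES sequence. Modified residues, mutations or repeats would need a real aligner or the SIFTS mappings.

```python
def place_segments(seqres, segments, gap="_"):
    """Lay observed segments onto the full sequence, padding gaps with `gap`.

    Segments are searched left to right, each one starting after the end
    of the previous match, so their order in the chain is preserved.
    """
    out = [gap] * len(seqres)
    cursor = 0
    for seg in segments:
        start = seqres.find(seg, cursor)
        if start == -1:
            raise ValueError("segment %r not found after position %d" % (seg, cursor))
        out[start:start + len(seg)] = list(seg)
        cursor = start + len(seg)
    return "".join(out)


# Sequences taken from the example in the question.
full = "NLYFQSMKAAIAQINTAALRHNLAVVKRHAPQCKIIAVVKANAYGH"
observed = ["KAAIAQIN", "LAVVKRHAPQCKIIAVV"]

print(full)
print(place_segments(full, observed))
```

The second printed line has the same length as the full sequence, with underscores wherever no structure was observed, so per-residue DSSP values can be attached by position.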

Depending on your requirements you might find that PDBFINDER has the data you are looking for in a more consumable form, for example see: PDBFINDER2:4BHY.