Hi,
I am trying to map between a set of fasta sequences from PDB entries and DSSP.
My starting sequence is the fasta sequence supplied with each PDB record (no breaks!). I now want to map the values in the DSSP record onto that sequence. The desired result can be seen below:
NLYFQSMKAAIAQINTAALRHNLAVVKRHAPQCKIIAVVKANAYGH - main sequence (no breaks)
_________KAAIAQIN_________LAVVKRHAPQCKIIAVV_________ - DSSP sequence (_ shows the breaks in the PDB chain)
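To make the target output concrete, here is a minimal Python sketch of the rendering step. It assumes a constant per-chain offset between the DSSP RESNUM and the 0-based fasta index is already known (finding that offset is the open question below), and that there are no insertion codes. The function name and the (resnum, amino acid) pairs are illustrative and do not come from any library.

```python
def gapped_dssp_sequence(fasta_seq, dssp_residues, offset, gap="_"):
    """Place DSSP residues onto the fasta sequence, with gaps elsewhere.

    dssp_residues: iterable of (resnum, one_letter_aa) pairs, a hypothetical
    pre-parsed form of the DSSP records for one chain.
    offset: assumed constant such that fasta_index = resnum + offset (0-based).
    """
    out = [gap] * len(fasta_seq)
    for resnum, aa in dssp_residues:
        idx = resnum + offset
        if not 0 <= idx < len(fasta_seq):
            raise ValueError(f"RESNUM {resnum} falls outside the fasta sequence")
        if fasta_seq[idx] != aa:
            # A mismatch means the assumed offset is wrong for this chain.
            raise ValueError(
                f"RESNUM {resnum}: DSSP has {aa}, fasta has {fasta_seq[idx]}"
            )
        out[idx] = aa
    return "".join(out)


# Toy data: only residues 2..9 are observed in the structure.
fasta = "NLYFQSMKAAIAQINTAALRHN"
observed = list(zip(range(2, 10), "KAAIAQIN"))
print(gapped_dssp_sequence(fasta, observed, offset=5))
# prints: _______KAAIAQIN_______
```

The residue check is deliberate: with thousands of structures, a wrong offset should fail loudly instead of silently producing a shifted mapping.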
I expected this to be a simple process but I now see that the RESNUM entry in the DSSP record does not actually map to a residue number in the fasta file.
For example (4BHY:A):
The fasta file is as follows:
>4BHY:A|PDBID|CHAIN|SEQUENCE
MHHHHHHDYDIPTTENLYFQSMKAAIAQINTAALRHNLAVVKRHAPQCKIIAVVKANAYGHGLLPVARTLVDADAYAVAR
# ------------^__^ (line added to highlight sequence)
IEEALMLRSCAVVKPIVLLEGFFSAADLPVLAANNLQTAVHTWEQLEALEQADLPAPVVAWLKLDTGMHRLGVRADEMPA
FIERLAKCKNVVQPFNIMTHFSRSDELEQPTTREQIDLFSQLTAPLLGERAMANSAGILAWPDSHCDWVRPGVILYGVSP
FPNTVAADYDLQPVMTLKTQLIAVRDHKAGEPVGYGANWVSDRDTRLGVIAIGYGDGYPRMAPNGTPVLVNGRIVPLVGR
VSMDMTTVDLGPGATDKAGDEAVLWGEGLPVERVADQIGTIPYELITKLTSRVFMEYV
The start of the DSSP file is:
  #  RESIDUE AA STRUCTURE BP1 BP2  ACC     N-H-->O    O-->H-N    N-H-->O    O-->H-N
    1   -6 A E              0   0  122      0, 0.0  1074,-0.2     0, 0.0   288,-0.1
    2   -5 A N        +     0   0  124    286,-0.3   287,-0.0  1073,-0.0  1072,-0.0
    3   -4 A L        -     0   0   49      1,-0.1     3,-0.1     0, 0.0  1072,-0.1
    4   -3 A Y        -     0   0  205      1,-0.3     2,-0.2  1072,-0.0    -1,-0.1
    5   -2 A F  S    S-     0   0  146   1072,-0.0     2,-0.4     0, 0.0    -1,-0.3
    6   -1 A Q        +     0   0   49     -2,-0.2  1071,-0.0     1,-0.2     0, 0.0
    7    0 A S  S    S-     0   0   76     -2,-0.4    -1,-0.2     0, 0.0     0, 0.0
The problem is that the fasta sequence starts much earlier than the PDB sequence. I figured that this would not be a problem, as the RESNUM column should reference a residue in the fasta sequence. However, according to the DSSP RESNUM column, the chain begins at residue -6.
I need to get a starting point within the fasta sequence for each DSSP file. The fact that the sequence entry for this PDB shows the information I want means that these mappings are available, but I can't seem to find them. I could do string matching of the first few amino acids in a chain to find a starting point, but I would prefer to use manually curated mappings: I have over 20,000 structures, and there are bound to be some errors if I rely on string matching. Any suggestions are much appreciated.
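For reference, a minimal Python sketch of the string-matching fallback mentioned above. It assumes the DSSP records for one chain have already been parsed into (resnum, amino acid) pairs, that numbering within the first unbroken run is consecutive, and that there are no insertion codes. The function name and the `seed_len` parameter are illustrative. It returns no answer when the seed is missing or occurs more than once, which is exactly the kind of case where this approach would otherwise introduce errors.

```python
def find_fasta_offset(fasta_seq, dssp_residues, seed_len=6):
    """Return offset with fasta_index = resnum + offset, or None if unsure.

    dssp_residues: list of (resnum, one_letter_aa) pairs for one chain
    (hypothetical pre-parsed form). The seed is the first run of
    consecutively numbered residues, capped at seed_len.
    """
    seed = []
    for resnum, aa in dssp_residues:
        if seed and resnum != seed[-1][0] + 1:
            break  # numbering jump: stop at the first chain break
        seed.append((resnum, aa))
        if len(seed) == seed_len:
            break
    if not seed:
        return None
    seed_str = "".join(aa for _, aa in seed)
    first = fasta_seq.find(seed_str)
    if first == -1:
        return None  # seed not present in the fasta sequence
    if fasta_seq.find(seed_str, first + 1) != -1:
        return None  # seed occurs more than once: ambiguous
    return first - seed[0][0]


# First fasta line and first DSSP residues of 4BHY:A, as quoted above.
fasta = ("MHHHHHHDYDIPTTENLYFQSMKAAIAQINTAALRHNLAVVKRHAPQCKIIAVV"
         "KANAYGHGLLPVARTLVDADAYAVAR")
dssp = [(-6, "E"), (-5, "N"), (-4, "L"), (-3, "Y"),
        (-2, "F"), (-1, "Q"), (0, "S")]
print(find_fasta_offset(fasta, dssp))
# prints: 20
```

For this entry the result means RESNUM -6 corresponds to 0-based fasta index 14. Short seeds in repetitive or low-complexity regions would need a longer `seed_len` or a full alignment, which is why curated mappings would still be preferable.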
EDIT:
A simple solution is to take all unbroken chains in DSSP and use their sequences directly. However, I would like to avoid this, as I have the full PDB fasta sequence and need to map to that.