Question

biopython.PDB residue.get_id() does not match sequence index

0

Entering edit mode

5.2 years ago

stefan-3000 • 0

Hi all,

I would like to retrieve the sequence index position for an amino acid residue from a .cif file using Biopython's PDB package. My goal is to get the sequence index of spatially neighboring AAs using Neighborsearch I have the following piece of code.

IN:

import Biopython.PDB as bpdb

structure = bpdb.MMCIFParser().get_structure('subt', 'file.cif')

for chain in structure[0]:
    for res in chain:
        for atom in res:
            atoms.append(atom)
neighbor = bpdb.NeighborSearch(atoms)

coords = []
for idx, residue in enumerate(structure[0].get_residues()):
    if bpdb.is_aa(residue):
        name = residue.get_resname()
        ca_atom = residue['CA']
        coords.append((idx, residue, ca_atom.get_coord()))
        occupancy = ca_atom.get_occupancy()
        if 33 < idx < 40:
            print(f'AA:{name}, idx:{idx}, get_id:{residue.get_id()[1]}')

OUT:

AA:ILE, idx:34, get_id:35
AA:SER, idx:35, get_id:36
AA:THR, idx:36, get_id:38
AA:HIS, idx:37, get_id:39
AA:PRO, idx:38, get_id:40
AA:ASP, idx:39, get_id:41

Now as you can see, the sequential index idx gives the real primary structure position of the AA, but the index, retrieved with the residue.get_id() function, sometimes jumps (as can be seen at idx:37) and therefore seems to indicate the wrong position. The problem becomes clear when searching for spatial neighbors, since I would like to find the (correct) sequence index of those neighbors:

IN:

radius = 5
coord = coords[36][2]
neighbors = neighbor.search(coord, radius, level='R')

for item in neighbors:
    print(item)

OUT:

<Residue SER het=  resseq=36 icode= >
<Residue HOH het=W resseq=1135 icode= >
<Residue THR het=  resseq=38 icode= >

But as you can see is is not possible to trace back the position of THR (get_id:38 against idx:36) Why does it jump and how can I get the correct get_id()?

Hope someone can help me out,

Stefan

biopython PDB NeighborSearch MMCIFParser index • 4.3k views

ADD COMMENT • link updated 5.2 years ago by fishgolden ▴ 520 • written 5.2 years ago by stefan-3000 • 0

score 0 · Answer 1 · 2019-10-02

I managed to find a workaround which gives me the idx for the neighboring AAs:

radius = 5
coord = coords[36][2]
neighbors = neighbor.search(coord, radius, level='R')

sequence_index = []
for item in neighbors:
    for i in range(len(coords)):
        if item == coords[i][1]:
            sequence_index.append(coords[i][0])

However this still does not answer my question why the get_id() skips a position in some cases, I have a feeling it has something to do with the water molecules (Residue HOH), but cannot find the arguments for that statement...

score 0 · Answer 2 · 2019-10-02

I'm not using biopython but as I thought this is one of the general pit fall of handling PDB structure, I give an answer. (please give a comment if it isn't)

Why idx and id are different is because "id" means not "index" but "identifier".

Residues in PDB have several ids. There are two such ids at least.

label_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_atom_site.label_seq_id.html

auth_seq_id http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_atom_site.auth_seq_id.html

Looking at your print result of the elements, "resseq" must be resSeq written in here https://www.wwpdb.org/documentation/file-format-content/format33/sect9.html

Thus they are the numbers given by the author (must be auth_seq_id), I guess.

Why it jumps is, most of the cases, because there are "missing residues (residues do not show in structure. thus do not have coordinates)".

Or sometimes, experiments were performed without those residues.