Hello BioStar community,
I'm having a small issue with list indexing. I am extracting information from a PDB (Protein Data Bank) file and need certain fields of each record copied into a list. The entries look like this:
ATOM 1512 N VAL A 222 8.544 -7.133 25.697 1.00 48.89 N
ATOM 1513 CA VAL A 222 8.251 -6.190 24.619 1.00 48.64 C
ATOM 1514 C VAL A 222 9.528 -5.762 23.898 1.00 48.32 C
I am using the following code to parse these lines into a list:
    atom_coord = []         # every ATOM record in the file
    charged_res_coord = []  # serial number through z coordinate of the charged residues
    for line in pdb:
        if line.startswith('ATOM'):
            atom_coord.append(line)
    for atom_line in atom_coord:
        for item in charged_res:
            if item in atom_line:
                charged_res_coord.append(atom_line.split()[1:9])
The problem begins with entries such as the following.
ROW1) ATOM 1572 NH2 ARG A 228 7.890 -13.328 16.363 1.00 59.63 N
ROW2) ATOM 1617 N GLU A1005 11.906 -2.722 7.994 1.00 44.02 N
Here, the code I use to extract the third spatial coordinate (the last of the three consecutive non-integer values) goes wrong. In the second row 'A1005' is treated as a single list entry, while in the first row 'A' and '228' are two separate entries. So when I take the element at index 7, I get '16.363' (the value I want) for the first row but '1.00' (not the value I want) for the second row:
    charged_res_coord[1]   ['1572', 'NH2', 'ARG', 'A', '228', '7.890', '-13.328', '16.363']
    charged_res_coord[10]  ['1617', 'N', 'GLU', 'A1005', '11.906', '-2.722', '7.994', '1.00']
The loop I use goes like this:
    for i in range(len(lys_charged_group)):
        lys_charged_group[i][7] = float(lys_charged_group[i][7])
The [7] is the problem: for lines like ROW1 the code extracts the correct value, but for lines like ROW2 it extracts the wrong one. Unfortunately, the two row formats are interspersed throughout the PDB file, because both fused values like 'A1005' and separated values like 'A 228' occur. Can I solve this with plain text-processing routines, or would I have to use regular expressions?
Many thanks for your help!
As dimkal mentions, using an existing parser is probably the best way to go. However, if you insist on doing this your own way, I would use a regex and pull the individual groups out into separate variables.
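A minimal sketch of the regex idea, assuming whitespace-separated lines like the ones quoted in the question. The pattern and the `parse_atom` helper are illustrative names, not code from any posted answer. The key detail is the optional whitespace (`\s*`) between the chain ID and the residue number, so 'A 228' and 'A1005' both match. It assumes a single-letter chain ID and coordinates that are separated by whitespace; in real files wide coordinates can also run together, which this pattern would not handle.

```python
import re

# Chain ID and residue number may be separate ("A 228") or fused ("A1005"),
# so the whitespace between them is optional.
ATOM_RE = re.compile(
    r"^ATOM\s+"
    r"(?P<serial>\d+)\s+"
    r"(?P<name>\S+)\s+"
    r"(?P<resname>[A-Z]{3})\s+"
    r"(?P<chain>[A-Za-z])\s*"
    r"(?P<resseq>-?\d+)\s+"
    r"(?P<x>-?\d+\.\d+)\s+"
    r"(?P<y>-?\d+\.\d+)\s+"
    r"(?P<z>-?\d+\.\d+)"
)


def parse_atom(line):
    """Return the named fields of an ATOM line, or None if it does not match."""
    m = ATOM_RE.match(line)
    if m is None:
        return None
    return {
        "serial": int(m.group("serial")),
        "name": m.group("name"),
        "resname": m.group("resname"),
        "chain": m.group("chain"),
        "resseq": int(m.group("resseq")),
        "x": float(m.group("x")),
        "y": float(m.group("y")),
        "z": float(m.group("z")),
    }


row1 = "ATOM 1572 NH2 ARG A 228 7.890 -13.328 16.363 1.00 59.63 N"
row2 = "ATOM 1617 N GLU A1005 11.906 -2.722 7.994 1.00 44.02 N"

for row in (row1, row2):
    atom = parse_atom(row)
    print(atom["chain"], atom["resseq"], atom["z"])
```

Because every field is pulled out by name, the z coordinate no longer depends on how many pieces `split()` happens to produce.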
See my answer for the code :)
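A regex-free alternative, offered as a sketch: the PDB format defines ATOM records as fixed-width columns (chain ID in column 22, residue number in columns 23-26, x, y and z in columns 31-38, 39-46 and 47-54). That is why 'A' and '1005' touch as soon as the residue number needs four digits. Slicing the raw line by column position avoids `split()` altogether. This assumes the file still has its original column spacing (the lines quoted in the question appear to have lost it when pasted). `format_atom` is a hypothetical helper that exists only to build correctly spaced demo lines, and `atom_fields` is likewise an illustrative name.

```python
def format_atom(serial, name, resname, chain, resseq, x, y, z, occ, temp, element):
    """Build a fixed-width ATOM line (demo helper only, not needed for real files)."""
    return (
        "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
        "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}          {:>2}"
    ).format("ATOM", serial, name, "", resname, chain, resseq, "",
             x, y, z, occ, temp, element)


def atom_fields(line):
    """Slice an ATOM record by column position (0-based Python slices)."""
    return {
        "chain": line[21],           # column 22
        "resseq": int(line[22:26]),  # columns 23-26
        "x": float(line[30:38]),     # columns 31-38
        "y": float(line[38:46]),     # columns 39-46
        "z": float(line[46:54]),     # columns 47-54
    }


row1 = format_atom(1572, " NH2", "ARG", "A", 228, 7.890, -13.328, 16.363, 1.00, 59.63, "N")
row2 = format_atom(1617, " N", "GLU", "A", 1005, 11.906, -2.722, 7.994, 1.00, 44.02, "N")

for row in (row1, row2):
    # split() fuses chain and residue number in row2; the column slices do not care
    print(row.split()[4], atom_fields(row)["resseq"], atom_fields(row)["z"])
```

With real files you would apply `atom_fields` directly to each line that starts with 'ATOM', before any call to `split()`.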