Dear all,
I have been using the ProDy module ESSA (http://prody.csb.pitt.edu/tutorials/essa_tutorial/index.html) for single proteins by directly retrieving them from the PDB. This worked fine, as described in the tutorial they provide. Now I would like to extend this and run the same calculation on ~200 structures by iterating over downloaded PDB files in a directory. However, it seems the parser ProDy uses doesn't recognize my downloaded PDB files, and I don't understand why. Here is my code:
# packages
from prody import *
from numpy import *
from matplotlib.pyplot import *
from pandas import *
import os

ion()

directory = '/home/lefebrej95/Documents/PDB_ESSA_test_format'
ext = ('.pdb')

for f in os.listdir(directory):
    if f.endswith(ext):
        fetchPDB(f)
        atoms = parsePDB(f, compressed = True)
        essa = ESSA()
        essa.setSystem(atoms)
        essa.scanResidues()
        with style.context({'figure.dpi': 600}):
            essa.showESSAProfile()
I am getting this error message, no matter what PDB file I am trying to parse:
runfile('/home/lefebrej95/.config/spyder-py3/untitled1.py', wdir='/home/lefebrej95/.config/spyder-py3')
@> WARNING '7l67.pdb' is not a valid identifier.
@> Matching 10 modes across 136 modesets... [ 99%] 1s
@> WARNING '7l62.pdb' is not a valid identifier.
Traceback (most recent call last):
  File "/home/lefebrej95/.config/spyder-py3/untitled1.py", line 35, in <module>
    atoms = parsePDB(f, compressed = True)
  File "/home/lefebrej95/anaconda3/lib/python3.8/site-packages/ProDy-2.0-py3.8-linux-x86_64.egg/prody/proteins/pdbfile.py", line 123, in parsePDB
    return _parsePDB(pdb[0], **kwargs)
  File "/home/lefebrej95/anaconda3/lib/python3.8/site-packages/ProDy-2.0-py3.8-linux-x86_64.egg/prody/proteins/pdbfile.py", line 205, in _parsePDB
    pdb, chain = _getPDBid(pdb)
  File "/home/lefebrej95/anaconda3/lib/python3.8/site-packages/ProDy-2.0-py3.8-linux-x86_64.egg/prody/proteins/pdbfile.py", line 192, in _getPDBid
    raise IOError('{0} is not a valid filename or a valid PDB '
OSError: 7l62.pdb is not a valid filename or a valid PDB identifier.
Does any of you have experience with ProDy, or has anyone run into a similar problem before?
Any help is appreciated!
Best regards,
Jonathan
I don't use this program, but a fetchPDB command sounds like something that downloads a file rather than opening it from the local disk. If that's the case, it should be no surprise that your file names are not recognized as valid PDB IDs. Maybe try the same script but skip the fetchPDB(f) line? And maybe set compressed = False in the next line?

I think the issue with it saying "not a valid filename" is that it isn't: you are forgetting the path. Have you tried changing the fetch command to include the path? And combine that with Mensur Dlakic's suggestion of compressed = False, unless your files are tar.gz versions. Another option is to use those files just to parse out the PDB code and then continue with the script as-is.
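Putting those two suggestions together, a loop over local files might look something like this. This is only a sketch: the helper names are mine, prody is imported lazily inside the second function, and I'm assuming the downloaded files are plain uncompressed .pdb files.

```python
import os

def local_pdb_paths(directory, ext=".pdb"):
    """Build full paths so parsePDB gets a real filename, not a bare '7l62.pdb'."""
    return [os.path.join(directory, f)
            for f in sorted(os.listdir(directory))
            if f.endswith(ext)]

def run_essa_on_file(path):
    # prody is imported inside the function so the path helper above
    # stays usable even where prody is not installed
    from prody import parsePDB, ESSA
    atoms = parsePDB(path)  # local, uncompressed file: no fetchPDB, no compressed=True
    essa = ESSA()
    essa.setSystem(atoms)
    essa.scanResidues()
    return essa
```

The key change is that os.path.join hands parsePDB a path that actually exists on disk, so ProDy never tries to interpret the name as a PDB identifier.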
I'm seeing it says

OSError: 7l62.pdb is not a valid filename or a valid PDB identifier.

Did you try changing the fetch command to include the path? Another option would be to use those files just to parse out the PDB code and then continue with the script as-is. That will be slower, because essentially you are getting the files from somewhere else a second time; however, it may not add much time overall, and it gives you options.

Plus, this option may allow things to work if the versions of the files you downloaded into your directory location aren't quite what fetchPDB and parsePDB want. Right now, from what you've provided, I cannot tell whether you have compressed versions, full old-style PDB files, or mmCIF versions of the structure files. I should add that I also don't use this program presently; however, it seems you actually have Python issues here (assuming you do have the PDB files in the format the program needs).
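For the parse-out-the-ID route, something along these lines might do; the helper name is my own, and it assumes the files are named after their 4-character PDB codes, as in your error messages.

```python
import os

def pdb_id_from_filename(filename):
    """'7L62.pdb' (or a full path to it) -> '7l62', a plain PDB identifier."""
    return os.path.splitext(os.path.basename(filename))[0].lower()
```

Handing that ID to fetchPDB/parsePDB would let ProDy download a fresh copy itself, sidestepping whatever format your local files happen to be in.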
Thanks!
Adding the directory path to the filename solved the problem and the PDBs are now recognized. Also, I deleted the fetchPDB(f) line because it was useless here!
Now I only have the problem of Python trying to do the calculation on all PDBs in parallel, which kills the computer because of insufficient RAM. I am currently trying to write a workaround to tell Python that it should load and process each file one by one. If any of you has a smart idea, that would also be highly appreciated!
Anyways, thanks a lot for your help!
There's probably a setting you can supply when calling the program to limit it to one processor; that would make it process serially. It's actually a good sign about how the software was developed that it defaults to parallel: parallelism is harder to bolt on yourself when you move to a more powerful system.
Not finding the setting easily, though. There's mention of n_cpu=1 here; however, those commands seem different from the ones you are using.

An alternative, hacky way to keep it from slamming your system would be to slow things down in your loop: instead of firing off all those tasks at once, build some waiting time into the loop. Put import time near the top of your script, and at the bottom of your loop tell it to sleep long enough for one file to finish processing, with something like:

time.sleep(600)

That makes it sleep 600 seconds (ten minutes) before looping to the next task. Adjust the number of seconds as you see fit.
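In loop form, that throttle might look like the sketch below. The function name is mine, process_one stands in for your ESSA calls, and the 600-second default is just a guess at the per-structure runtime, not a measured value.

```python
import os
import time

def throttled_run(directory, process_one, pause_seconds=600):
    """Handle one PDB file at a time, sleeping between files so tasks can't pile up."""
    for f in sorted(os.listdir(directory)):
        if f.endswith(".pdb"):
            process_one(os.path.join(directory, f))
            time.sleep(pause_seconds)  # wait before starting the next structure
```

Note that if the parallelism comes from ProDy itself rather than from your loop, a per-call setting like n_cpu=1 (where available) is the cleaner fix; the sleep only spaces the calls out.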