I'm writing some code to get sequences back from PDB files, wrapping the BioPython functionality.
Here's the code at the moment:
#!/usr/bin/env python
'''Convert a PDB file in to its representative sequence.'''
import argparse
import sys
import os
try:
from Bio.PDB import *
from Bio import SeqIO
except ImportError:
sys.stderr.write('Could not import from the Bio module. Try installing with `python -m pip install biopython`')
sys.exit(1)
def get_args():
"""Parse command line arguments"""
try:
parser = argparse.ArgumentParser(
description='Convert a PDB file to its representative sequence(s) via BioPython.')
parser.add_argument('-i', '--input', action='store', default=sys.stdin,
help='Input PDB file (defaults from STDIN for pipes).')
parser.add_argument('-o', '--output', action='store', default=sys.stdout,
help='Output sequence file (defaults to STDOUT for pipes).')
if len(sys.argv) == 1:
parser.print_help(sys.stderr)
exit(1)
except:
sys.stderr.write("An exception occurred with argument parsing. Check your provided options.")
return parser.parse_args()
def main():
args = get_args()
PDB, PPB = PDBParser(), PPBuilder()
if args.input is sys.stdin:
structure = PDB.get_structure("File from stdin", args.input)
else:
structure = PDB.get_structure(os.path.basename(args.input), args.input)
sequences = [peptide.get_sequence() for peptide in PPB.build_peptides(structure)]
for each in sequences:
print(each)
if __name__ == '__main__':
main()
Now, the problem arises that I want to be able to pipe a PDB file in to the script as well as be able to pass files. Ordinarily, this would be pretty trivial via sys.stdin
, however BioPython seemingly does something a little unusual with it's file
objects.
If I pass open(sys.stdin)
, it (rightly) protests that it expects a string for coercing to Unicode etc, but recieved a file
object. Ok, says I, it should be enough to pass sys.stdin
to the get_structure()
constructor method, as per the line
structure = PDB.get_structure("File from stdin", args.input)
However this results in the following error:
File "pdb2seq.py", line 55, in <module>
main()
File "pdb2seq.py", line 46, in main
structure = PDB.get_structure("File from stdin", args.input)
File "/home/wms_joe/bin/miniconda2/lib/python2.7/site-packages/Bio/PDB/PDBParser.py", line 86, in get_structure
self._parse(handle.readlines())
ValueError: I/O operation on closed file
Seems obvious enough, the file is closed some how. The BioPython source for this method does the following:
def get_structure(self, id, file):
69 """Return the structure.
70
71 Arguments:
72 - id - string, the id that will be used for the structure
73 - file - name of the PDB file OR an open filehandle
74
75 """
76 with warnings.catch_warnings():
77 if self.QUIET:
78 warnings.filterwarnings("ignore", category=PDBConstructionWarning)
79
80 self.header = None
81 self.trailer = None
82 # Make a StructureBuilder instance (pass id of structure as parameter)
83 self.structure_builder.init_structure(id)
84
85 with as_handle(file, mode='rU') as handle:
86 self._parse(handle.readlines())
87
88 self.structure_builder.set_header(self.header)
89 # Return the Structure instance
90 structure = self.structure_builder.get_structure()
91
92 return structure
So my guess is that BioPython's as_handle()
method is doing something to the file handle resulting in it being closed prematurely? However, I can't see anything obvious in as_handle()
's source that would result in the handle being closed?
46 def as_handle(handleish, mode='r', **kwargs):
47 r"""Context manager to ensure we are using a handle.
48
49 Context manager for arguments that can be passed to SeqIO and AlignIO read, write,
50 and parse methods: either file objects or path-like objects (strings, pathlib.Path
51 instances, or more generally, anything that can be handled by the builtin 'open'
52 function).
53
54 When given a path-like object, returns an open file handle to that path, with provided
55 mode, which will be closed when the manager exits.
56
57 All other inputs are returned, and are *not* closed.
58
59 Arguments:
60 - handleish - Either a file handle or path-like object (anything which can be
61 passed to the builtin 'open' function: str, bytes, and under
62 Python >= 3.6, pathlib.Path, os.DirEntry)
63 - mode - Mode to open handleish (used only if handleish is a string)
64 - kwargs - Further arguments to pass to open(...)
65
66 Examples
67 --------
68 >>> with as_handle('seqs.fasta', 'w') as fp:
69 ... fp.write('>test\nACGT')
70 >>> fp.closed
71 True
72
73 >>> handle = open('seqs.fasta', 'w')
74 >>> with as_handle(handle) as fp:
75 ... fp.write('>test\nACGT')
76 >>> fp.closed
77 False
78 >>> fp.close()
79
80 Note that if the mode argument includes U (for universal new lines)
81 this will be removed under Python 3 where is is redundant and has
82 been deprecated (this happens automatically in text mode).
83
84 """
85 # If we're running under a version of Python that supports PEP 519, try
86 # to convert `handleish` to a string with `os.fspath`.
87 if hasattr(os, 'fspath'):
88 try:
89 handleish = os.fspath(handleish)
90 except TypeError:
91 # handleish isn't path-like, and it remains unchanged -- we'll yield it below
92 pass
93
94 if isinstance(handleish, basestring):
95 if sys.version_info[0] >= 3 and "U" in mode:
96 mode = mode.replace("U", "")
97 if 'encoding' in kwargs:
98 with codecs.open(handleish, mode, **kwargs) as fp:
99 yield fp
100 else:
101 with open(handleish, mode, **kwargs) as fp:
102 yield fp
103 else:
104 yield handleish
You have forgotten to say what version of Biopython, Python, etc you are using. Also could you simplify your example?
Does this work for you - I was using Python 2.7.10 or Python 3.7.0rc1 with Biopython 1.73dev0 (i.e. a prerelease of 1.73 from git) and things seem fine?
It is unlikely to be the problem, but does that example file PDB file work for you? It is from
Tests/PDB/
in the Biopython source code.Ah good point thanks Peter:
Python 2.7.12 :: Continuum Analytics, Inc.
andPython 3.6.0 :: Continuum Analytics, Inc.
give the same issues.Biopython is
1.72
on Py3 and1.71
on Py2.Your example code does work for me, and also works for the PDB file I was attempting previously. The only thing that occurs to me is that there is then perhaps an issue with how argparse yields its default instead?
It seems the following code omitting argparse works:
Strange but perhaps you are right and something happens inside argparse? I would suggest setting the defaults to the string
"-"
(minus sign) in the command line API, and when you come to callget_structure
use:Some tools support the minus sign as short hand for stdin or stdout, so this is a bonus.
Yeah I had thought about offering
-
as a shorthand instead, though it seemed to me it should be possible to do without it altogether. That certainly works as a work around, though I'm surprised argparse interferes like this. I've never come across an issue like that before.I can see from the code you care about the peptide fragments as inferred from the structural data (breaking at discontinuities), otherwise it would be one line to get the PDB sequence out as FASTA format:
Yep, in this particular case I care about preserving the fragments (though do not care about heteroatoms etc).