Dear BioStar Community,
I am analyzing a dataset of bacterial proteins structured as follows:
>PDB1a0na_unknown
PPRPLPVAPGSSKT
>PDB1a1ta_ENZ
MQKGNFRNQRKTVKCFNCGKEGHIAKNCRAPRKKGCWKCGKEGHQMKDCTERQAN
etc ....
This dataset has about 15000 entries. What I am trying to do is extract all proteins according to their annotations (ENZ, MEM, unknown, etc.) to perform sequence/structure analysis on them. I am using a list comprehension to do this as follows (ENZ annotation shown here):
def retrieve_enzyme_proteins(filename):
    with open(filename) as file:
        for line in file:
            # search for the proteins annotated as enzymes (ENZ)
            if line[0] == '>' and '_ENZ' in line:
                # I need a line here telling the function to jump to the next line and extract the sequence
                return [line.strip('\n') for line in file if line[0] != '>']
So far I have only been able to extract EVERY protein sequence. I would like a logical structure that tells my program to look for the desired annotation (e.g. ENZ), then jump to the next line, where its sequence begins, and extract that sequence. Apologies if the indentation is messed up; the copy-paste does that. I am using Python 2.7.3. Any help would be much appreciated!
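One possible way to express "find the annotated header, then take the next line" is sketched below. It is only an illustration: the function name `retrieve_annotated_proteins` is hypothetical, and it assumes every record is exactly one header line followed by exactly one sequence line, as in the sample shown in the question. It should run on both Python 2.7 and Python 3.

```python
def retrieve_annotated_proteins(lines, annotation):
    """Return the sequences whose header ends with '_<annotation>'.

    Assumption: each record is one header line followed by exactly
    one sequence line. `lines` can be an open file or a list of strings.
    """
    lines = iter(lines)
    suffix = '_' + annotation
    sequences = []
    for line in lines:
        header = line.strip()
        if header.startswith('>') and header.endswith(suffix):
            # next() advances the same iterator the for loop is using,
            # so this consumes the line directly after the header.
            sequences.append(next(lines, '').strip())
    return sequences
```

It could be called as `retrieve_annotated_proteins(open(filename), 'ENZ')`, ideally inside a `with` block. Matching on the end of the header rather than `'_ENZ' in line` avoids accidental hits on identifiers that merely contain that text.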
@Matt Many thanks for the code and tips. The function raises "UnboundLocalError: local variable 'sequence' referenced before assignment". Is this because the line sequence=str() should be before the return statement?
I think you might have a problem with your file, but I've updated the code above so that you shouldn't get the error. This error just meant that the line
sequence += line.strip()
was being evaluated before a header was encountered. You might have a malformed line at the beginning of your file.
From what it looks like, OP has a blank line between the FASTA definition line and the associated sequence.
OP, this is not the correct format for FASTA. A blank line after a header delimits the end of that entry.
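Both issues raised in these comments (sequence text being accumulated before any header has been seen, and blank lines inside a record) can be guarded against in the parsing loop itself. The sketch below is a hedged illustration, not the code from the answer being discussed: the names `read_fasta` and `sequences_by_annotation` are hypothetical, and skipping blank lines is a deliberate leniency of this sketch rather than standard FASTA behaviour. Cleaning the input file remains the safer fix.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None   # no header seen yet
    chunks = []     # created before the loop, so it always exists
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are skipped, not treated as terminators
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
        # any text that appears before the first header is ignored
    if header is not None:
        yield header, ''.join(chunks)


def sequences_by_annotation(lines, annotation):
    """Return sequences whose header ends with '_<annotation>'."""
    suffix = '_' + annotation
    return [seq for header, seq in read_fasta(lines) if header.endswith(suffix)]
```

Because the accumulator is initialised before the loop and only filled once a header has been seen, a stray first line cannot trigger an UnboundLocalError. Sequences that span several lines are joined into one string.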