Greetings to the Biostar community,
I am currently following a bioinformatics module as part of a biomedical degree (I am basically a python newbie) and the following task is required as part of a Python programming assignment:
extract motif sequences (amino acid sequences, so basically strings in programmatic-speak, that have been excised from algorithms implementing a multiple sequence alignment and subsequently iterative database scanning to generate the best conserved sequences. The ultimate idea is to infer functional significance from such "motifs").
These motifs are stored on a public database in files which have multiple data fields corresponding to each protein (uniprot ID, Accession Number, the alignment itself stored in a hyperlink .seq file), currently one of which is of interest in this scope. The data field is called "extracted motif sets".
My question is how to go about writing a script that will essentially parse the "motif strings" and output them to a file. I have now coded the script so that it looks as follows (I don't write the results to files yet):
import os, re, sys, string
printsdb = open('/users/spyros/folder1/python/PRINTSmotifs/prints41_1.kdat', 'r')
protname = None
final_motifs = []
for line in printsdb.readlines():
if line.startswith('gc;'):
protname = line.lstrip()
#string.lower(name) # convert to lowercase
break
def extract_final_motifs(protname):
"""Extracts the sequences of the 'final motifs sets' for a PRINTS entry.
Sequences are on lines starting 'fd;' A simple regex is used for retrieval"""
for line in printsdb.readlines():
if line.startswith('fd;'):
final_motifs = re.compile('^\s+([A-Z]+)\s+<')
final_motifs = final_motifs.match(line)
#print(final_motifs.groups()[0])
motif_dict = {protname : final_motifs}
break
return
motif_dict = extract_final_motifs('ADENOSINER')
print(motif_dict)
The problem now is that while my code loops over a raw database file (prints41_!.kdat) instead of connecting to the public database using urllib module, as suggested by Simon Cockell below, the ouput of the script is simply "none" on the python shell, whereas it should be creating a list such as [AAYIGIEVLI, AAYIGIEVLI, AAYIGIEVLI, etc..]
Does anybody have any idea where the logic error is? Any input appreciated!!
@spyros: If the input is in a "public database" why not link to samples of the data being processed. Also, it'd be of use to know you crossed posted this question here.
@blunders is right, a link to the data you are trying to parse would really help people to help you out.
@blunders: I originally posted the question on stackoverflow, where a member suggested I post it here as well since it is a bioinformatics-field forum, hence the double posting.
A link to one of the files on database: http://www.bioinf.manchester.ac.uk/cgi-bin/dbbrowser/PRINTS/DoPRINTS.pl?cmd_a=Display&qua_a=/Full&fun_a=Code&qst_a=ADENOSINER
What I would like is basically to parse the sequences (strings of capitals) under the field "FINAL MOTIF SETS" (it is the last data field) which in this case is divided into 6 sections according to "motif number". I need to collect these sequences.
IMHO, you should address it to BioStar community, not stackoverflow :)
Spyros, if your edited code is exactly as you've pasted here then I think I can see the problem... extract_final_motifs() doesn't return anything :)
... actually looking back over it, there are few little errors. (The best way to deal with complex problems is to break them down into the little tasks you have to do, and test each step as you build your solution)
Yes, there are a number of problems with this script, not least of which is the fact that extract_final_motifs() returns None (which means that will always be the output of your program, regardless of whether what precedes it works or not, which it doesn't). You should read about what
break
does, and when to use it appropriately http://docs.python.org/release/2.5.2/ref/break.html