Hi,
I am new to computational biology and Python. Lately I've been using Python, specifically Biopython, to parse FASTA files and present summary information about the sequences in each file in an elegant manner.
I understand that ORFs can be identified and translated via Biopython using this tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc51
In my case though, I don't wish to translate my ORFs. I instead wish to compute the lengths of these ORFs and present them in a table next to information about the sequence they were found in (i.e. identifier).
I've tried the following messy code with very little success. I've highlighted (with **) where I've tried to add the ORF finder:
DNA = open("file.fasta").read()
import re
from Bio import SeqIO
for index, record in enumerate(SeqIO.parse(open("file.fasta"), "fasta")):
    for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
        for frame in range(3):
            length = 3 * ((len(record)-frame) // 3) #Multiple of three
            **for orf in DNA:
                orf = re.search(r"ATG([ATGC]{:}TAA|TAG|TGA)", DNA).group()**
                print "ID = %s, length %i, frame %i, strand %i" \
                    % (record.id, len(orf), frame, strand)
The reason I've opened my FASTA file twice is that I can't pass record.seq into the re.search expression without an error. I'd also like to add a bit of code to find the start position of my ORFs, but I'm already having trouble inserting the ORF finder.
Any advice on how to improve my code above is much appreciated!
Instead of opening your file twice, call str() on your record.seq object and pass the result to re.search(). It shouldn't throw an error, because you are then passing a string argument, which is what re.search() expects.
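To make that concrete, here is a rough sketch rather than a definitive ORF finder. It makes several assumptions of its own: an ORF is taken to be ATG followed by whole codons up to the first in-frame stop codon, the regex is a corrected guess at what your pattern intended, and the helper names (find_orfs, reverse_complement, summarize_fasta) are made up for illustration. The search itself works on plain strings, so the only Biopython-specific step is str(record.seq).

```python
import re

# ATG, then whole codons (non-greedy), then the first in-frame stop codon.
# Wrapping it in a lookahead lets overlapping ORFs (other frames, or nested
# in-frame ATGs) all be reported. This ORF definition is an assumption.
ORF_PATTERN = re.compile(r"(?=(ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)))")

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a plain DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq):
    """Return a list of (strand, frame, start, length) tuples.

    start is 0-based and, for strand -1, is counted along the
    reverse-complemented sequence, not the original one.
    """
    seq = seq.upper()
    results = []
    for strand, nuc in ((+1, seq), (-1, reverse_complement(seq))):
        for match in ORF_PATTERN.finditer(nuc):
            orf = match.group(1)
            start = match.start()
            results.append((strand, start % 3, start, len(orf)))
    return results


def summarize_fasta(path):
    """Print one line per ORF for every record in a FASTA file."""
    from Bio import SeqIO  # imported here so the helpers above work without Biopython

    for record in SeqIO.parse(path, "fasta"):
        # str() turns the Seq object into the plain string that re expects
        for strand, frame, start, length in find_orfs(str(record.seq)):
            print("ID = %s, length %i, frame %i, strand %i, start %i"
                  % (record.id, length, frame, strand, start))
```

Calling summarize_fasta("file.fasta") opens the file only once. Because the frame falls out of the match position (start % 3), the explicit loop over range(3) is no longer needed, and match.start() gives you the start position you asked about.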
Thank you for the advice!
Please don't delete posts after someone has helped you - imagine what the site would be like if everyone deleted their posts after someone took their time to assist.