I recently started playing around with NGS data and Biopython, so this question might be a bit of a no-brainer to some, but I haven't quite figured it out on my own.
I realize that it's pretty straightforward to check the AA sequence for a stop symbol, but since I filter those sequences out eventually anyway, I figured maybe I could avoid doing a whole bunch of translate() calls:
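starSeqs = 0  # number of reads whose translation contains a stop
for rec in read1:
    nucSeq = rec.seq
    aaSeq = nucSeq.translate()  # Seq.translate() marks stop codons as '*'
    if '*' in str(aaSeq):
        starSeqs += 1
    else:
        pass  # do awesome stuff!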
PS: I could obviously do a regex search over the sequence for the stop codons, but that's neither pretty nor efficient. I was wondering if there is a method/API in the library that does the black magic for me.
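One way to skip translation entirely is to walk the read codon by codon and stop at the first stop codon. Note that a plain substring or regex search would also hit out-of-frame matches, so the scan has to step in threes. This is a minimal sketch, assuming the standard genetic code (TAA, TAG, TGA), unambiguous DNA, and reads that start in frame 0; `has_inframe_stop` and `STOP_CODONS` are hypothetical names, not part of Biopython:

```python
# Stop codons of the standard genetic code (assumption: NCBI table 1).
STOP_CODONS = frozenset({"TAA", "TAG", "TGA"})


def has_inframe_stop(nuc_seq, stop_codons=STOP_CODONS):
    """Return True if any complete frame-0 codon is a stop codon."""
    s = str(nuc_seq).upper()
    # Ignore a trailing partial codon; it cannot encode a stop.
    usable = len(s) - len(s) % 3
    # any() short-circuits, so the scan ends at the first stop found.
    return any(s[i:i + 3] in stop_codons for i in range(0, usable, 3))
```

With Biopython records this would be called as `has_inframe_stop(rec.seq)`. For other genetic codes, the tables in `Bio.Data.CodonTable` expose a `stop_codons` list that could replace the hard-coded set.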
Just curious, would a regex search really be less efficient, and why?
It would be difficult to really know without testing it, but regexes usually carry more overhead than a static string match, even though both scale linearly with the input length (O(n)). I've seen a lot of complaints that Java's regex (which is used in everything, like split()) stays O(n) even when people provide hints like start/end-of-line anchors, etc. I mean, it's not a big issue, which is what I think you're really getting at: reading the file off the disk is probably 99% of the time taken to execute...
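Since this is hard to judge without measuring, here is a rough timing sketch using only the standard library. It compares an in-frame regex against a plain codon scan on randomly generated reads; all names (`stop_by_regex`, `stop_by_scan`, `make_reads`, `compare`) and the read count and length are made up for illustration, and random reads are only a stand-in for real NGS data:

```python
import random
import re
import timeit

STOPS = ("TAA", "TAG", "TGA")  # assumption: standard genetic code
# Used with .match(), so the pattern is anchored at position 0:
# lazily skip whole codons, then require a stop codon.
INFRAME_STOP_RE = re.compile(r"(?:[ACGTN]{3})*?(?:TAA|TAG|TGA)")


def stop_by_regex(s):
    return INFRAME_STOP_RE.match(s) is not None


def stop_by_scan(s):
    usable = len(s) - len(s) % 3  # drop a trailing partial codon
    return any(s[i:i + 3] in STOPS for i in range(0, usable, 3))


def make_reads(n_reads=1000, length=150, seed=0):
    """Generate uppercase random DNA reads (synthetic test data)."""
    rng = random.Random(seed)
    return ["".join(rng.choice("ACGT") for _ in range(length))
            for _ in range(n_reads)]


def compare(reads, repeats=3):
    """Return the best wall-clock time in seconds for each approach."""
    results = {}
    for name, fn in (("regex", stop_by_regex), ("scan", stop_by_scan)):
        timings = timeit.repeat(lambda: [fn(r) for r in reads],
                                number=1, repeat=repeats)
        results[name] = min(timings)
    return results
```

Calling `compare(make_reads())` returns a dict of timings per approach. The numbers will depend on the machine, the Python version, and the read length, so it is worth running on a sample of the actual data before drawing conclusions.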