Hey all,
So I'm using plain Python (not Biopython) to search for a string in the E. coli genome. I read each line of a FASTA file and check whether my sequence is in that line, with pseudocode like this:
ecoli_sequence = open('ecolik12.fasta', 'r')
a = ecoli_sequence.readlines()[1:]
for y in a:
    if "TATAAA" in y:
        print("it's here" + y)
    else:
        print("it wasn't in the genome, stupid code albeit")
ecoli_sequence.close()
However, there is one big problem: if my sequence sits at the interface of two lines, the code can't recognize it. What do you guys suggest?
Please help, I would really appreciate it.
Truly yours,
In your program you should store the whole sequence of the E. coli chromosome as a single large string of about 5 million characters. The line breaks in a FASTA file are only a formatting convention from the days when line length was restricted to 80 characters; they have no biological meaning. Please note that bacterial chromosomes are circular. It is common practice to represent them as linear strings, but that leaves one corner case: your pattern may span the junction between the end and the start of the linear sequence.
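A minimal sketch of that approach, assuming a FASTA file with a single record. The function names read_fasta_sequence and find_motif are made up for illustration, and the circular case is handled by appending the first len(motif) - 1 bases to the end of the string before searching:

```python
def read_fasta_sequence(path):
    """Return the sequence of a single-record FASTA file as one string."""
    parts = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and the ">" header line.
            if line and not line.startswith(">"):
                parts.append(line.upper())
    return "".join(parts)


def find_motif(sequence, motif, circular=True):
    """Return 0-based start positions of motif, including overlapping hits."""
    if not motif or len(motif) > len(sequence):
        return []
    # For a circular chromosome, append the first len(motif) - 1 bases so
    # that hits spanning the end/start junction are found as well.
    text = sequence + sequence[:len(motif) - 1] if circular else sequence
    positions = []
    start = text.find(motif)
    while start != -1:
        positions.append(start)
        start = text.find(motif, start + 1)
    return positions
```

With the file from the question, find_motif(read_fasta_sequence('ecolik12.fasta'), "TATAAA") would return the list of start positions, and an empty list means the motif is absent. This only searches the given strand; the reverse complement would need a separate search.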
If you want to parse a FASTA file with variable line lengths without Biopython, there are a couple of posts on Biostars with example code showing how to do that. Here is a great blog post on how to use Python's itertools.groupby for this:
https://drj11.wordpress.com/2010/02/22/python-getting-fasta-with-itertools-groupby/
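For reference, here is a rough sketch of the groupby idea. It is not the code from the blog post, just one way to write it, and the name iter_fasta is made up. Lines are grouped by whether they start with ">", so every run of sequence lines is joined into one string regardless of line length:

```python
from itertools import groupby


def iter_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    stripped = (line.strip() for line in handle)
    non_blank = (line for line in stripped if line)
    header = None  # stays None if sequence lines appear before any header
    for is_header, group in groupby(non_blank, key=lambda line: line.startswith(">")):
        if is_header:
            # A run of header lines: keep the last one, without the ">".
            header = list(group)[-1][1:]
        else:
            # A run of sequence lines: join them into a single string.
            yield header, "".join(group)
```

Usage would look something like this: open the file, loop with "for header, sequence in iter_fasta(handle):", and test "TATAAA" in sequence for each record.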
Umm, sorry, but what do you mean by "interface" here?
Okay, imagine the FASTA sequence like this (FASTA sequences are divided into lines):
With my code, the sequence can't be found if it starts at the end of the first line and continues at the beginning of the second line.
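To show what I mean with a made-up pair of lines (these are hypothetical, not real E. coli sequence), the motif below is split across the line break, so a per-line check misses it while a check on the joined string finds it:

```python
# Two made-up FASTA sequence lines; the motif straddles the line break.
lines = ["GGCCTAT\n", "AAAGGCC\n"]
motif = "TATAAA"

# Checking each line separately, as in my code above.
found_per_line = any(motif in line for line in lines)

# Checking the lines joined into one string, with newlines removed.
joined = "".join(line.strip() for line in lines)
found_in_joined = motif in joined

print(found_per_line, found_in_joined)
```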