Question

Searching for a sequence string in python?

1

Entering edit mode

8.7 years ago

auryndb ▴ 70

Hey all,

So I'm using plain python (I'm not using BioPython) to search for a string in E. Coli genome. How I do it is that I read each line of a fasta sequence, and I'll do an if sequence return thing on it, a pseudocode like this:

ecoli_sequence = open('ecolik12.fasta', 'r')
a = ecoli_sequence.readlines()[1:]
for y in a:
    if "TATAAA" in y:
        print ("it's here"+y)
    else:
        print("it wasn't in the genome, stupid code albeit")
ecoli_sequence.close()

however, there is one big problem. if my sequence is at the interface of each line, it can't recognize it. What do you guys suggest?

Please help, I really will appreciate it.

Truly yours,

python genome • 9.2k views

ADD COMMENT • link updated 8.7 years ago by John 13k • written 8.7 years ago by auryndb ▴ 70

2

Entering edit mode

In your program you should store the whole sequence of the Ecoli chromosome as a single large string of about 5 million characters. The FASTA format was developed in the age of punch cards when line length was restricted to 80 characters. Please note, that bacterial chromosomes are circular. It is common praxis to represent them as linear stings, but there remains one corner case, if your pattern partially overlaps the start and the end of the linear sequence.

ADD REPLY • link 8.7 years ago by piet ★ 1.9k

1

Entering edit mode

If you want to be able to parse a variable sequence line fasta file without biopython, there are a couple of posts on biostars with example code of how to do that. Here is a great blog post on how to use python's itertools.groupby to do that:

https://drj11.wordpress.com/2010/02/22/python-getting-fasta-with-itertools-groupby/

ADD REPLY • link 8.7 years ago by Damian Kao 16k

0

Entering edit mode

Umm, sorry, but what do you mean by "interface" here?

ADD REPLY • link 8.7 years ago by Ram 44k

0

Entering edit mode

okay imagine the fasta sequence like (fasta sequences are divided in lines):

 atatatacggcgcgagaatatctctc**atata**
 **atat**ctctctatattatagccctctctctctctaa

with my code, it can't find the sequence if it's at the end of the first line and at the beginning of the second line.

ADD REPLY • link 8.7 years ago by auryndb ▴ 70

score 1 · Answer 1 · 2016-03-06

1

Entering edit mode

8.7 years ago

GouthamAtla 12k

use biopython to properly handle fasta file.

from Bio import SeqIO

for rec in SeqIO.parse("ecolik12.fasta","fasta"):
    if "TATAAA" in rec.seq:
        . . . .

It looks like you were already suggested to use BioPython to read a fasta file. Importing big sequences into Python

ADD COMMENT • link 8.7 years ago by GouthamAtla 12k

score 1 · Answer 2 · 2016-03-06

If you don't want to add the BioPython dependency, you just need to stitch together the lines with ''.join(...):

a = ecoli_sequence.readlines()[1:]
a= ''.join([x.strip() for x in a])

if "TATAAA" in a:
    print ("Found")
else:
    print("it wasn't in the genome, stupid code albeit")

And if you want to test whether the substring is at the start-end junction of the circular genome:

x= "TATAAA"
xl= len(x)-1
junct= a[len(a)-xl:] + a[0:xl]
if x in junct:
    ...

score 1 · Answer 3 · 2016-03-06

The fastest algorithm I can think of without reading the whole file into memory first would be:

seq = 'TGCACTGATG'
seq_len = len(seq)
with open('ecoli.fa', 'r') as ecoli_sequence:
    previous = ''
    for line,data in enumerate(ecoli_sequence):
        if seq not in previous + data:
            previous = data[-seq_len:-1]
        else:
            if seq in previous + data[:seq_len]:
                print "It's split over lines:", line, "and", line+1
            else:
                print "It's in line: ", line+1
            break

I'm not sure for small genomes it's any faster than reading the whole file into memory. Actually I'd say reading the whole file into memory is probably significantly faster than this. But then you have to deal with parsing out the \n's Speaking of which, if you have \r\n at the end of each line, not \n, the -1 has to be a -2. You can use strip put meh