Question

How to read a DNA sequence for Motif analysis more efficiently?

8.7 years ago

auryndb ▴ 70

I wrote some Python code to read a DNA sequence and run a motif alignment on it, but I'm looking for a more efficient way to do this. See below if you can help:

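# Read the FASTA file, skip the header line, and join the sequence lines.
with open("a.fas.txt", "r") as handle:
    a = handle.readlines()[1:]
a = ''.join([x.strip() for x in a])
with open("Output.txt", "w") as text_file:
    text_file.write(a)

# Slide a 100 bp window along the sequence one base at a time,
# collecting every fragment as one line of b.
f = 0
z = 100
b = ''
while f < len(a):
    b += a[f:z]+'\n'
    f += 1
    z += 1
with open("2.txt", "w") as runner_mtfs:
    runner_mtfs.write(b)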

I want to run a number of analyses on each line of b, but I don't know of a more efficient way to do this. The output file is more than 500 megabytes. Any suggestions? The first file is just a DNA sequence. In the first lines of code I join all of its lines together, and then I take a 100-base-pair fragment starting at every position so that I can analyze each fragment.
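To see why the output grows so large, here is a rough sketch of the arithmetic behind the loop above: it writes one line per start position, so the file is roughly 101 bytes (100 bases plus a newline) for every base in the input. The function name `sliding_output_size` and the 5-million-base example are assumptions for illustration, not figures from the post.

```python
def sliding_output_size(seq_len, window=100):
    """Estimate the bytes written by a 1 bp sliding-window loop that emits
    one line per start position (hypothetical helper for illustration)."""
    # Start positions that still have a full window ahead of them.
    full = max(seq_len - window + 1, 0)
    # Remaining start positions near the end produce shorter fragments
    # of length partial, partial - 1, ..., 1.
    partial = seq_len - full
    bases = full * window + partial * (partial + 1) // 2
    # One newline character per emitted line.
    return bases + seq_len


# Assumed example: a sequence of 5 million bases.
print(sliding_output_size(5_000_000))
```

Under that assumption the estimate comes to about 505 MB, which is in line with the file size reported above, so nearly all of the output is the same sequence repeated 100 times over.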

python • 2.2k views

Python is pretty slow, particularly at tasks involving I/O and string processing. You'd get a huge speedup using C to process arrays of strings and running an analysis on the substrings within memory (if possible).
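Even staying in Python, the "analysis on the substrings within memory" idea can be sketched with a generator, so the fragments are never written to an intermediate file. This is only a minimal sketch: the names `read_sequence`, `windows`, and `count_windows_with` are hypothetical, the motif search is a stand-in for whatever analysis is actually needed, and unlike the original loop it skips the trailing fragments shorter than the window size.

```python
def read_sequence(path):
    """Return the sequence from a single-record FASTA file as one string."""
    with open(path) as handle:
        next(handle, None)  # skip the header line
        return "".join(line.strip() for line in handle)


def windows(seq, size=100, step=1):
    """Lazily yield (start, fragment) for every full-length window."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


def count_windows_with(seq, motif, size=100):
    """Placeholder analysis: count the windows that contain the motif."""
    return sum(1 for _, fragment in windows(seq, size) if motif in fragment)


# Small in-memory demonstration; for real data one would call
# read_sequence("a.fas.txt") instead of using this toy string.
demo = "ACGTACGTTT"
for start, fragment in windows(demo, size=4):
    print(start, fragment)
print(count_windows_with(demo, "GTA", size=4))
```

Because only one fragment exists at a time, memory use stays close to the size of the sequence itself, and the per-fragment analysis can run directly inside the loop.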

8.7 years ago by Alex Reynolds 36k

So you are generating 100bp fragments from the initial string, with a sliding window of 1bp, to find motifs?