Question

Fasta Sliding Window

0

Entering edit mode

11.4 years ago

st.ph.n ★ 2.7k

I'm trying to write a python script that uses a sliding window. Here is the code:

v = open("ex.fasta", "r")

def sliding_window(sequence, winSize, step):
        numOfChunks = ((len(sequence)-winSize)/step)+1
        for i in range(0,numOfChunks*step,step):
                yield sequence[i:i+winSize]

size = int(14786)
w = 500
while size > w:
        for line in v:
                if not line.startswith(">"):
                         myseq = line.rstrip()
                         myvect = sliding_window(myseq, 500, 500)
                         for r in myvect:
                                  print(r)

I want it to be able to produce chunks of the sequence in window sizes of 500, with a step size of 500, i.e. no overlap. However, the trouble I'm having is the lines for the fasta file are 76bp long for all lines except the last is ~40bp. Choosing anything <= 76 it will produce the desired outcome. Anything > 76 does not work. I've tried creating a string (using the whole sequence rather than a list, but it still does not work. Any help is appreciated.

python • 10k views

ADD COMMENT • link updated 11.4 years ago by Andrzej Zielezinski 11k • written 11.4 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

I really didn't get your question but you can concatenate multiple lines to make a complete sequence. See here:A: multiline fasta to single line fasta

ADD REPLY • link 11.4 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Thanks Ashutosh, but i'm in the middle of learning python. I'd like to stay away from one-liners and perl, and keep it all to a single script.

ADD REPLY • link 11.4 years ago by st.ph.n ★ 2.7k

score 4 · Answer 1 · 2014-02-27

4

Entering edit mode

11.4 years ago

Andrzej Zielezinski 11k

As Philip and Neil suggested, use BioPython for reading FASTA records. As for memory usage, the parse function from SeqIO module is a generator that iterates over FASTA records and holds in memory one record at a time.

Hope this is what you wanted:

UPDATED 2:

Say, ex.fasta contains:

>seq
MAFAE
PAASS
LPNGD
CGR

Running this:

from Bio import SeqIO

def chunks(seq, win, step):
    seqlen = len(seq)
    for i in range(0,seqlen,step):
        j = seqlen if i+win>seqlen else i+win
        yield seq[i:j]
        if j==seqlen: break

for seq_record in SeqIO.parse("ex.fasta", "fasta"):
    seq = seq_record.seq
    for subseq in chunks(seq, 5, 5):
        print subseq

Outputs:

MAFAE
PAASS
LPNGD
CGR

ADD COMMENT • link 11.4 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

Thanks a.zielezinski. I'll take a closer look at your suggestion.

I modified the code a bit, using Biopython to read the flie. But it's cuting the file off and I"m not sure at what bp position, but it's not finishing the file. I think it's because the size of the file (in bp) is not divisible by 500. So, I tried to tell it to break when it hits the negative number it would hit (I realize that this is bad practice, but this is for testing purposes at the moment), but it's still not printing it.

from Bio import SeqIO

for record in SeqIO.parse("ex.fasta", "fasta"):
        v = record.seq

def sliding_window(sequence, winSize, step):
        numOfChunks = ((len(sequence)-winSize)/step)+1
        for i in range(0,numOfChunks*step,step):
                yield sequence[i:i+winSize]

size = 14820
m = 500

for line in v:
        myvect = sliding_window(v, m, m)
        for r in myvect:
                print(r)
        size -= m
        if size <= -180:
                break

Suggestions on this current piece of code is appreciated.

ADD REPLY • link 11.4 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

Your approach would work since I want step size equal to the window size so there is no overlap. What if I wanted say a step size of 1? That can be easily modified by the function I wrote above, by changing the input for the step size.

ADD REPLY • link 11.4 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

I just updated my answer. I included your function as it allows step to be smaller than winSize. Does this answer your question?

ADD REPLY • link 11.4 years ago by Andrzej Zielezinski 11k

0

Entering edit mode

Yes. In order to change the step size. So, your code works exactly like the edit I posted. However, the issue I'm still having is the last window for the sequence is not printed b/c subtracting 500 from it puts it below 0 with size -= m. I sill need that last bit of sequence, and am unsure how to tell it to print.

ADD REPLY • link 11.4 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

Now I understand. I fixed it and updated my answer.

ADD REPLY • link 11.4 years ago by Andrzej Zielezinski 11k

1

Entering edit mode

Perfect! Dziękuję bardzo!

ADD REPLY • link 11.4 years ago by st.ph.n ★ 2.7k

score 1 · Answer 2 · 2014-02-26

For the sequence as one string, BioPython is the laziest way.

from Bio import SeqIO
for seq in SeqIO.parse("ex.fasta", "fasta"):
    myseq = str(seq.seq) # is now one huge string of all your nucleotides for each sequence in ex.fasta

For the sliding window, have a look at collections.deque. Just intialize a deque with your string and pop 500 from the left; if it raises an IndexError, you're at the end. Will post something later.

Edit: Here's something ugly with collections.deque:

def slide(a_deque, size):
    done = False
    while True:
        counter = 0
        sub = []
        while counter < size:
            try:
                sub.append(a_deque.popleft())
            except IndexError:
                done = True
                break
            counter += 1
        print sub
        if done:
            break
from collections import deque
slide(deque(myseq), 500)

Each of the "sub"-lists should be of length 500 if size = 500, except for the last one. You can then do with "sub" whatever you like. Haven't really tested this.

Note: This method looses the speed advantages of a deque since it re-introduces a list. There are many better ways.

score 0 · Answer 3 · 2014-02-26

0

Entering edit mode

11.4 years ago

Neilfws 49k

There's no need to reinvent the FASTA parser. Biopython will do this for you. See the tutorial.

Example: let's say file myseq.fa contains:

>myseq1
acaatcatactcta
aagatcgatgatag

Then the sequence string is accessed like so:

>>> from Bio import SeqIO
>>> for record in SeqIO.parse("seq.fa", "fasta"):
...     print(record.seq)
... 
acaatcatactctaaagatcgatgatag

ADD COMMENT • link 11.4 years ago by Neilfws 49k

0

Entering edit mode

I've used Biopython for other things, and I know how to open files and do some other nifty things. Does using parse store it to memory? I would later like to use this script for a larger fasta file, say human sequence.

ADD REPLY • link 11.4 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

No it loads it up from the disk to memory for each entry in the fasta file. If you put the whole human genome as a single fasta entry it will load all of it into ram. Python isn't the leanest with memory as it is and biopython doesn't help, I suggest you avoid it and use numpy character arrays to reduce the memory footprint if you must work with very large sequence.

ADD REPLY • link 11.4 years ago by pld 5.1k