I'm trying to write a python script that uses a sliding window. Here is the code:
v = open("ex.fasta", "r")
def sliding_window(sequence, winSize, step):
numOfChunks =((len(sequence)-winSize)/step)+1
for i in range(0,numOfChunks*step,step):
yield sequence[i:i+winSize]
size = int(14786)
w = 500
while size > w:
for line in v:
if not line.startswith(">"):
myseq = line.rstrip()
myvect = sliding_window(myseq, 500, 500)for r in myvect:
print(r)
I want it to be able to produce chunks of the sequence in window sizes of 500, with a step size of 500, i.e. no overlap. However, the trouble I'm having is the lines for the fasta file are 76bp long for all lines except the last is ~40bp. Choosing anything <= 76 it will produce the desired outcome. Anything > 76 does not work. I've tried creating a string (using the whole sequence rather than a list, but it still does not work. Any help is appreciated.
As Philip and Neil suggested, use BioPython for reading FASTA records. As for memory usage, the parse function from SeqIO module is a generator that iterates over FASTA records and holds in memory one record at a time.
Hope this is what you wanted:
UPDATED 2:
Say, ex.fasta contains:
>seq
MAFAE
PAASS
LPNGD
CGR
Running this:
from Bio import SeqIO
def chunks(seq, win, step):
seqlen = len(seq)for i in range(0,seqlen,step):
j = seqlen if i+win>seqlen else i+win
yield seq[i:j]if j==seqlen: breakfor seq_record in SeqIO.parse("ex.fasta", "fasta"):
seq= seq_record.seq
for subseq in chunks(seq, 5, 5):
print subseq
Thanks a.zielezinski. I'll take a closer look at your suggestion.
I modified the code a bit, using Biopython to read the flie. But it's cuting the file off and I"m not sure at what bp position, but it's not finishing the file. I think it's because the size of the file (in bp) is not divisible by 500. So, I tried to tell it to break when it hits the negative number it would hit (I realize that this is bad practice, but this is for testing purposes at the moment), but it's still not printing it.
from Bio import SeqIO
for record in SeqIO.parse("ex.fasta", "fasta"):
v = record.seq
def sliding_window(sequence, winSize, step):
numOfChunks =((len(sequence)-winSize)/step)+1
for i in range(0,numOfChunks*step,step):
yield sequence[i:i+winSize]
size = 14820
m = 500
for line in v:
myvect = sliding_window(v, m, m)for r in myvect:
print(r)
size -= m
if size <= -180:
break
Suggestions on this current piece of code is appreciated.
Your approach would work since I want step size equal to the window size so there is no overlap. What if I wanted say a step size of 1? That can be easily modified by the function I wrote above, by changing the input for the step size.
Yes. In order to change the step size. So, your code works exactly like the edit I posted. However, the issue I'm still having is the last window for the sequence is not printed b/c subtracting 500 from it puts it below 0 with size -= m. I sill need that last bit of sequence, and am unsure how to tell it to print.
For the sequence as one string, BioPython is the laziest way.
from Bio import SeqIO
forseqin SeqIO.parse("ex.fasta", "fasta"):
myseq = str(seq.seq)# is now one huge string of all your nucleotides for each sequence in ex.fasta
For the sliding window, have a look at collections.deque. Just intialize a deque with your string and pop 500 from the left; if it raises an IndexError, you're at the end. Will post something later.
Edit: Here's something ugly with collections.deque:
def slide(a_deque, size):
done= False
while True:
counter = 0
sub =[]while counter < size:
try:
sub.append(a_deque.popleft())
except IndexError:
done= True
break
counter += 1
print sub
if done:
break
from collections import deque
slide(deque(myseq), 500)
Each of the "sub"-lists should be of length 500 if size = 500, except for the last one. You can then do with "sub" whatever you like. Haven't really tested this.
Note: This method looses the speed advantages of a deque since it re-introduces a list. There are many better ways.
I've used Biopython for other things, and I know how to open files and do some other nifty things. Does using parse store it to memory? I would later like to use this script for a larger fasta file, say human sequence.
No it loads it up from the disk to memory for each entry in the fasta file. If you put the whole human genome as a single fasta entry it will load all of it into ram. Python isn't the leanest with memory as it is and biopython doesn't help, I suggest you avoid it and use numpy character arrays to reduce the memory footprint if you must work with very large sequence.
I really didn't get your question but you can concatenate multiple lines to make a complete sequence. See here:A: multiline fasta to single line fasta
Thanks Ashutosh, but i'm in the middle of learning python. I'd like to stay away from one-liners and perl, and keep it all to a single script.