Anyone knows if there is an implemented function in biopython to calculate gap windows and their length in a multiple sequence alignment.
I found a module (http://biopython.org/DIST/docs/api/Bio.Align.AlignInfo-pysrc.html) to calculate the consensus sequence but it do not allow to set that 1 gap in any sequence should give an output of a gap in the consensus.
My input is a Clustalw alignment.
I'm just interested in getting the number of gap windows and their length. Since the clustal format only gives you * or : or blank as output (consensus) I thought it was first needed to create this consensus showing the gaps, then count them.
This is the solution I came up with. It uses the SeqIO module from Bio (Biopython).
from Bio import SeqIO
handle = open('file_name', "rU") #File name corresponds to a multiple alignment in clustal format
records = list(SeqIO.parse(handle, "clustal"))
def calc_gaps(records):
j=0
gaps = 0
gap = 0
window_len = 0
windows = []
while j<len(records[0].seq): #j iterates through the positions of each sequence
i=0
while i< len(records):
if records[i].seq[j] == '-':
gap = 1
break
i+=1
if gap==1:
window_len +=1
gaps=gaps+1
else:
if window_len!=0:
windows.append(window_len)
window_len=0
j+=1
gap=0
if j==len(records[0].seq) and window_len!=0:
windows.append(window_len)
return gaps, len(windows), (float( sum(windows) ) / len(windows))
(gaps,windows,mean)=calc_gaps(records) #Here we call the function and get number of gaps, number of windows and mean number of gaps per window
would it not be easier to remove the "-" afterwards? Unless I am missing what you are trying to accomplish?
I'm just interested in getting the number of gap windows and their length. Since the clustal format only gives you * or : or blank as output (consensus) I thought it was first needed to create this consensus showing the gaps, then count them.