I need to calculate the conservation or percentage of identity across several sequences in a multiple alignment (not pairwise but across all sequences). I wrote a function that works with clustal format where column conservation is marked with *
(and therefore is just a matter of counting stars). I was wondering if someone would have an equivalent function for FASTA alignments.
def calc_ident(alignment_location):
iden = 0.0 #Method1: Calculated divided by the total length of the multiple alignment
iden2 = 0.0 #Method2: Calculated only with the positions suitable for evolutionary analysis (no gap in any sequence)
f = open(alignment_location, "r")
completo = f.read()
stars = completo.count('*')
f.close()
handle = open(alignment_location, "rU")
records = list(SeqIO.parse(handle, "clustal"))
long_total = len(records[0].seq)
iden = (stars*100/long_total)
long_useful = calc_long(records) #Calculating length for score2
if long_useful == 0: #Calculating score2
iden2=0
else:
iden2 = (stars*100/long_useful)
handle.close()
return iden, iden2, long_total, long_useful
def calc_long(records):
length = 0
j=0
while j<len(records[0].seq):
i=0
while i< len(records):
if records[i].seq[j] == '-':
break
elif i==(len(records)-1):
length += 1
i+=1
j+=1
return length
Why don't you just convert fasta to clustal? (http://sequenceconversion.bugaco.com/converter/biology/sequences/clustal_to_fasta.php)
That's what I do, but it's cumbersome to add middle steps when you can avoid them.