Obtain conservation (or % ident) in a multiple sequence alignment
2
0
Entering edit mode
9.8 years ago
Biojl ★ 1.7k

I need to calculate the conservation or percentage of identity across several sequences in a multiple alignment (not pairwise but across all sequences). I wrote a function that works with clustal format where column conservation is marked with * (and therefore is just a matter of counting stars). I was wondering if someone would have an equivalent function for FASTA alignments.

def calc_ident(alignment_location):
    iden = 0.0 #Method1: Calculated divided by the total length of the multiple alignment
    iden2 = 0.0 #Method2: Calculated only with the positions suitable for evolutionary analysis (no gap in any sequence)
    f = open(alignment_location, "r")
    completo = f.read()
    stars = completo.count('*')
    f.close()
    handle = open(alignment_location, "rU")
    records = list(SeqIO.parse(handle, "clustal"))
    long_total = len(records[0].seq)
    iden = (stars*100/long_total)
    long_useful = calc_long(records) #Calculating length for score2
    if long_useful == 0: #Calculating score2
        iden2=0
    else:
        iden2 = (stars*100/long_useful)
    handle.close()
    return iden, iden2, long_total, long_useful

def calc_long(records):
    length = 0
    j=0
    while j<len(records[0].seq):
        i=0
        while i< len(records):
            if records[i].seq[j] == '-':
                break
            elif i==(len(records)-1):
                length += 1
            i+=1
        j+=1
    return length
alignment Python • 5.9k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode

That's what I do, but it's cumbersome to add middle steps when you can avoid them.

ADD REPLY
1
Entering edit mode
9.8 years ago
Felix Francis ▴ 600

You could calculate Shannon's entropy for each column of the alignment. Lower entropy value indicates more conserved columns.

https://github.com/ffrancis/Multiple-sequence-alignment-Shannon-s-entropy

ADD COMMENT
0
Entering edit mode

Is there any publication describing (or comparing it against other measures) the Shannon entropy measure as a value to characterize how good a multiple alignment is, with real data?

ADD REPLY
1
Entering edit mode

Re: "calculate the conservation or percentage of identity across several sequences in a multiple alignment":

Yes, Shannon's entropy would be a good option. It may also be used to approximately evaluate alignment quality when you don't have any reference (not the best approach though). http://www.biomedcentral.com/1471-2148/10/210/

For evaluating alignment accuracy:

NORMD (Thompson et al., 2001), MUMSA (Lassmann and Sonnhammer, 2005)], SP score, TC score, Q-score, TCS, and other consistency and probabilistic scoring approaches may be used.

ADD REPLY
0
Entering edit mode
9.8 years ago

In R you can use Bio3d:

library(bio3d)

aln <- read.fasta("aln.fasta")
ent <- entropy(aln)

# standard entropy score for a 22-letter alphabet
ent$H

# plot results
plot(ent$H, type="h")

see the documentation for the entropy() function: http://thegrantlab.org/bio3d/html/entropy.html

ADD COMMENT

Login before adding your answer.

Traffic: 1931 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6