How can I obtain column statistics from a multiple sequence alignment?
4.1 years ago
ej6474 ▴ 10

I have aligned ~2k viral genomes (highly similar, about 99% identity) to a reference genome using MAFFT to build a full MSA, and I want to obtain column statistics for the ends of the genome (i.e. the percentage of matches, mismatches, or gaps in a particular column). I am having trouble finding tools to do this. So far I have tried an Easel miniapp from HMMER called esl-alistat to generate some statistics such as "information in bits for each column", "per-column residue counts", etc.

My end goal is to see where I need to trim the ends of the alignment for downstream phylogenetic analysis. Can anyone advise on the best way to do this? Thank you!

alignment sequencing genome • 1.2k views

Hi, have you tried guidance? http://guidance.tau.ac.il/

4.1 years ago
Mensur Dlakic ★ 28k

I may be misunderstanding what you are trying to do, so take that into account when considering my suggestions.

Trees represent phylogenetic distance between sequences, which is a function of their differences. You are not going to have much difference to begin with if your genomes of interest are in the 99% identity range. If on top of that you trim the ends of the alignment, which are presumably also the divergent parts, you will remove even more meaningful signal. Generally speaking, trimming is done for columns where there is little useful information, for example where the fraction of gaps is higher than 0.5. Trimming is generally not done based on sequence identity, unless you have a very good reason to suspect that a given part of the alignment is poor (which probably doesn't apply to genomes in the 99% identity range).
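
To make the gap-fraction criterion concrete, here is a minimal sketch in plain Python. The function names (`gap_fractions`, `end_trim_bounds`), the toy alignment, and the choice to trim only from the two ends are my own assumptions for illustration; this is not how any particular trimming program works internally.

```python
def gap_fractions(seqs, gap_chars="-"):
    """Return the fraction of gap characters in each alignment column."""
    n = len(seqs)
    # zip(*seqs) walks the alignment column by column
    return [sum(c in gap_chars for c in col) / n for col in zip(*seqs)]


def end_trim_bounds(fracs, max_gap=0.5):
    """Return (start, end) such that seq[start:end] drops only the gappy ends.

    Columns are kept from the first to the last column whose gap fraction
    is <= max_gap; internal gappy columns are left untouched.
    """
    keep = [i for i, f in enumerate(fracs) if f <= max_gap]
    if not keep:
        return 0, 0
    return keep[0], keep[-1] + 1


# Toy alignment (hypothetical data), all rows the same length
aln = [
    "--ACGTAC--",
    "-AACGTACG-",
    "--ACGTACGT",
    "---CGTAC--",
]

fracs = gap_fractions(aln)
start, end = end_trim_bounds(fracs, max_gap=0.5)
trimmed = [s[start:end] for s in aln]

print(fracs)
print(start, end)
print(trimmed)
```

With ~2k sequences the same idea applies unchanged; only the threshold is a judgment call.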

What I am suggesting is that you stop worrying about column statistics as a basis for manually removing columns, and instead use programs that remove columns objectively. One such tool is BMGE, which removes columns based on entropy.

https://research.pasteur.fr/en/software/bmge-block-mapping-and-gathering-with-entropy/
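
For intuition about what an entropy score per column looks like, here is a rough sketch using plain Shannon entropy over the non-gap residues of a column. This is an assumption-laden simplification: BMGE's actual scoring is more elaborate, and `column_entropy` and the toy alignment below are hypothetical names and data, not part of BMGE.

```python
import math
from collections import Counter


def column_entropy(column, gap_chars="-"):
    """Shannon entropy (in bits) of the non-gap residues in one column.

    0.0 means the column is fully conserved (or contains only gaps);
    2.0 is the maximum for four equally frequent nucleotides.
    """
    residues = [c for c in column.upper() if c not in gap_chars]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    # sum of p * log2(1/p) over the observed residues
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Toy alignment (hypothetical data)
aln = ["ACGT-", "ACGA-", "ACTC-", "AGTG-"]
entropies = [column_entropy("".join(col)) for col in zip(*aln)]
print(entropies)
```

Low-entropy columns are conserved, high-entropy columns are variable; in a 99% identical dataset nearly every column will score close to zero.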

trimAl has numerous options for removing columns based on many criteria:

https://github.com/scapella/trimal

4.1 years ago
Joe 21k

If you know Biopython, it's quite easy to calculate various statistics via the AlignIO module. I put something quite crude together a while back, for instance:

https://github.com/jrjhealey/bioinfo-tools/blob/master/MSAnalysis.py
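
As a rough idea of the kind of thing you can do (this is a fresh sketch, not the contents of the script linked above), the snippet below counts matches, mismatches, and gaps per column relative to a reference row. It assumes the reference is the first sequence in the alignment and that the file is aligned FASTA; `load_alignment`, `column_stats`, and `ref_index` are names made up for this example. Only `load_alignment` needs Biopython, so the import is kept inside that function.

```python
def load_alignment(path, fmt="fasta"):
    """Read an alignment with Biopython and return its rows as uppercase strings."""
    from Bio import AlignIO  # requires Biopython to be installed

    return [str(rec.seq).upper() for rec in AlignIO.read(path, fmt)]


def column_stats(seqs, ref_index=0, gap="-"):
    """Per-column counts of matches, mismatches and gaps versus a reference row.

    The reference row itself is excluded from the counts. Where the reference
    has a gap, any residue in another row is counted as a mismatch.
    """
    ref = seqs[ref_index]
    others = [s for i, s in enumerate(seqs) if i != ref_index]
    stats = []
    for pos, ref_base in enumerate(ref):
        col = [s[pos] for s in others]
        gaps = sum(c == gap for c in col)
        matches = sum(c == ref_base and c != gap for c in col)
        stats.append({
            "match": matches,
            "mismatch": len(col) - gaps - matches,
            "gap": gaps,
        })
    return stats


# Toy alignment (hypothetical data); with a real file you would use
# seqs = load_alignment("my_alignment.fasta")
seqs = ["ACGT", "ACGA", "-CGT", "ACTT"]
stats = column_stats(seqs)

n_others = len(seqs) - 1
for pos, s in enumerate(stats, start=1):
    pct = {k: round(100 * v / n_others, 1) for k, v in s.items()}
    print(pos, pct)
```

Printing only the first and last few hundred columns should be enough to eyeball where the ends of the alignment become unreliable.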
