7.2 years ago
TEman
I want to remove all rare insertions (those that occur in less than 5% of the sequences) from a multiple sequence alignment file (Clustal .aln) with 699 sequences.
That is, I have an MSA with many columns that contain a residue in only one or two sequences, while the rest of the sequences have a gap ("-") there. There are far too many to fix manually.
Any suggestions how to do this?
Do you specifically want to do this in R?
If you use Biopython, you can create an ungapped consensus sequence with a threshold for including a particular residue in a column.
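If you want to keep the whole alignment rather than just a consensus, here is a minimal sketch of the column filter itself in plain Python. It assumes the aligned sequences are already loaded as equal-length strings and that "-" is the gap character; the function name `drop_rare_insertion_columns` and the `min_fraction` parameter are made up for this example, not part of any library.

```python
def drop_rare_insertion_columns(seqs, min_fraction=0.05, gap="-"):
    """Return the aligned sequences with sparsely occupied columns removed.

    A column is kept only if at least `min_fraction` of the sequences
    have a non-gap character in it (hypothetical helper, not a library call).
    """
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")

    n = len(seqs)
    # Indices of columns whose non-gap fraction meets the threshold.
    keep = [
        col for col in range(length)
        if sum(s[col] != gap for s in seqs) / n >= min_fraction
    ]
    # Rebuild every sequence from the kept columns only.
    return ["".join(s[col] for col in keep) for s in seqs]


# Toy alignment: column 3 holds a residue in only 1 of 4 sequences.
aligned = ["AC-GT", "AC-GT", "ACTGT", "AC-G-"]
trimmed = drop_rare_insertion_columns(aligned, min_fraction=0.5)
print(trimmed)
```

With 699 sequences and the default 5% cutoff, 0.05 × 699 = 34.95, so any column with a residue in 34 or fewer sequences is dropped. To get the strings out of a Clustal file, something like `[str(rec.seq) for rec in AlignIO.read("your_file.aln", "clustal")]` with Biopython's `Bio.AlignIO` should work (the file name is a placeholder).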