I'm trying to build a tree from this alignment. I've converted the numbers with more than one digit into letters: a for 10, b for 11, and so on up to p for 25.
I have tried to use Phylip pars to build a tree, although I don't really understand how it works. So I want to understand the program well enough to provide the correct input, or if someone has other solutions, please feel free to help me :)
What do the numbers mean? Are they just 25 different categories, or is the number 3 more similar to 4 than it is to 15? In the latter case, turning the numbers into an alphabet like you have done would not make sense.
If I understand what you are trying to do, then a distance method is probably best. pars is for "unordered multi-state data", so, for instance, if you had taxa with 1, 3, and 12 repeats at one locus then it would treat all comparisons as one step away from each other.
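To make the difference concrete, here is a minimal sketch in base R for that single locus. The taxon names A, B and C are made up for illustration, and the "unordered" matrix is only a stand-in for the idea of every state change costing one step, not a reproduction of what pars does internally.

```r
# Repeat counts at one locus for three hypothetical taxa
repeats <- c(A = 1, B = 3, C = 12)

# Unordered treatment: any two different states are one step apart
unordered <- as.dist(outer(repeats, repeats,
                           FUN = function(x, y) as.numeric(x != y)))

# Ordered treatment: distance grows with the difference in repeat number
ordered <- dist(repeats, method = "manhattan")

print(unordered)
print(ordered)
```

In the unordered version all three pairs are equally far apart, while in the ordered version the taxon with 12 repeats sits much further from the other two.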
I'm pretty sure Arlequin has a method for this, but it's pretty easy in R if you are comfortable with it. The only thing you'll need to think about is the best distance measure:
data <- read.table('repeats.tsv', header=TRUE) # a subset of your data
head(data, 3)
taxon L1 L2 L3 L4 L5
1 1574 2 2 3 2 2
2 1585 4 2 2 2 3
3 1588 6 2 2 1 2
data.dist <- dist(data[2:6], method='manhattan')
data.dist
1 2 3 4 5
2 4
3 6 4
4 1 3 5
5 3 3 5 2
6 1 5 7 2 2
library(ape) # a phylogenetics package; run install.packages('ape') to get it
plot(nj(data.dist))
That uses Manhattan distance to compare taxa, which is just a fancy way of saying it adds up the differences in repeat number at each locus for every pair of taxa. I don't know about your markers, but that might not really reflect how they evolve. Maybe they can double in a generation, so the distance between 3 and 6 shouldn't be three times the distance from 3 to 4. You should probably do a little research into the best way to compare your markers.
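As a rough sketch of how much the choice matters, the snippet below computes three candidate distances for the three taxa shown in the head() output. The log2(count + 1) transform and the mismatch count are only illustrative assumptions on my part, not recommendations for MIRU data; they are there to show how you would swap one measure for another.

```r
# The three taxa shown in head(data, 3)
data <- data.frame(
  taxon = c(1574, 1585, 1588),
  L1 = c(2, 4, 6), L2 = c(2, 2, 2), L3 = c(3, 2, 2),
  L4 = c(2, 2, 1), L5 = c(2, 3, 2)
)
counts <- as.matrix(data[2:6])
rownames(counts) <- data$taxon

# 1. Manhattan: sum of absolute differences in repeat number
d.manhattan <- dist(counts, method = "manhattan")

# 2. Manhattan on log2(count + 1): an assumed transform under which a
#    doubling is penalised less than on the raw scale
d.log <- dist(log2(counts + 1), method = "manhattan")

# 3. Mismatch count: number of loci at which two taxa differ at all
n <- nrow(counts)
mismatch <- matrix(0, n, n,
                   dimnames = list(rownames(counts), rownames(counts)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    mismatch[i, j] <- sum(counts[i, ] != counts[j, ])
  }
}
d.mismatch <- as.dist(mismatch)

print(d.manhattan)
print(d.log)
print(d.mismatch)
```

Each result is a dist object, so any of them could be handed to nj() in place of data.dist. The raw Manhattan values for these three taxa agree with the first rows of the matrix printed earlier.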
The numbers represent the number of repeats at MIRU (Mycobacterial Interspersed Repetitive Units) loci. What would be the best option then?