I've used the R libraries pegas and ape to do this. pegas provides the function haplotype(), which finds each unique sequence and records which individuals carry it (and hence its frequency), which makes it all straightforward:
> library(ape)
> library(pegas)
#example sequence data, use read.dna() to get sequences from file
> data(woodmouse)
> seq_data <- woodmouse[sample(1:15, 100, replace = TRUE), ]
> h <- haplotype(seq_data)
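#turn the haplotype object into a 0/1 matrix
#attr(h, 'index') holds, for each haplotype, the row numbers of the individuals that carry it
> tab <- sapply(attr(h, 'index'), function(i)
+     sapply(seq_len(nrow(seq_data)), function(j) sum(i == j)))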
> head(tab[,1:5])
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0 0 0 0
## [2,] 1 0 0 0 0
## [3,] 0 1 0 0 0
## [4,] 0 0 1 0 0
## [5,] 0 1 0 0 0
## [6,] 0 0 0 1 0
#rows are individuals, all should have one and only one haplotype
> all(rowSums(tab)==1)
## [1] TRUE
#label the rows with their sequence name
> rownames(tab) <- labels(seq_data)
If you make this conversion a lot, it's easy to wrap it in an R script that takes command-line arguments and the like.
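For what it's worth, here is a rough sketch of what such a script could look like. The script name hap_matrix.R, the helper index_to_matrix(), the CSV output, and the assumption that the input is an aligned FASTA file are all mine, so adjust them to your data. It fills the matrix directly from the haplotype index rather than using the nested sapply() above, but the result should be the same.

```r
#!/usr/bin/env Rscript
# Sketch only: write a 0/1 individual-by-haplotype matrix for an alignment.
# Hypothetical usage: Rscript hap_matrix.R input.fasta output.csv

# index: list with one integer vector per haplotype, giving the row numbers
#        of the individuals that carry it (as in attr(h, "index")).
# ind_names: one name per individual, used as row names.
index_to_matrix <- function(index, ind_names) {
  tab <- matrix(0L, nrow = length(ind_names), ncol = length(index),
                dimnames = list(ind_names, NULL))
  for (k in seq_along(index)) {
    tab[index[[k]], k] <- 1L
  }
  tab
}

main <- function(args) {
  if (length(args) != 2) {
    stop("usage: Rscript hap_matrix.R <input.fasta> <output.csv>")
  }
  # Assumes an aligned FASTA file; change 'format' for other input types.
  seq_data <- ape::read.dna(args[1], format = "fasta")
  h <- pegas::haplotype(seq_data)
  tab <- index_to_matrix(attr(h, "index"), labels(seq_data))
  # Every individual should carry exactly one haplotype.
  stopifnot(all(rowSums(tab) == 1))
  write.csv(tab, file = args[2])
}

# Only run when arguments are supplied, so the file can also be source()d.
args <- commandArgs(trailingOnly = TRUE)
if (length(args) > 0) main(args)
```

Keeping the conversion in its own function means you can source() the file and check it on a small hand-made index before pointing it at real data.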
I like that idea of using R packages.
Thank you for your answer.