Question

How to cluster bacterial genomes based on Hamming distance?

1

8.6 years ago

jeccy.J ▴ 60

Can anyone suggest how to cluster a set of bacterial genomes based on their Hamming or SNP distance?

clustering genome • 2.8k views

ADD COMMENT • link updated 8.6 years ago by Sej Modha 5.3k • written 8.6 years ago by jeccy.J ▴ 60

0

More detail is really needed. What exactly is your problem? How to calculate the Hamming or SNP distance between two genomes? Which clustering algorithm to use once you've calculated the Hamming distances? Must it be Hamming or SNP distance, or are you in fact looking for distance metrics better suited to the problem you are trying to solve? How closely related are the genomes you want to cluster?
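If the question is about the distance calculation itself, a minimal sketch in Python could look like the following. It assumes the genomes have already been aligned to the same length (for example as a core-genome or reference-based alignment), which the original post does not state. The function names and the choice to skip gap (`-`) and `N` columns for the SNP distance are illustrative assumptions, not an established convention.

```python
def hamming_distance(a, b):
    """Count positions at which two equal-length aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def snp_distance(a, b, ignore="-N"):
    """Hamming-style count that skips columns containing a gap or N.

    Assumption: only unambiguous base differences should count as SNPs.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    skip = set(ignore)
    return sum(
        x != y
        for x, y in zip(a.upper(), b.upper())
        if x not in skip and y not in skip
    )


def pairwise_matrix(seqs, dist=hamming_distance):
    """Return (names, square matrix) of all-vs-all distances.

    seqs: dict mapping a genome name to its aligned sequence.
    """
    names = list(seqs)
    matrix = [[dist(seqs[p], seqs[q]) for q in names] for p in names]
    return names, matrix
```

Calling `pairwise_matrix` on a dict of aligned sequences gives a symmetric matrix that any distance-based clustering method can take as input.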

ADD REPLY • link 8.6 years ago by Lars Juhl Jensen 11k

0

Just get an MxN matrix of distances and use simple Ward clustering, or you could even try MDS. Both can be done in R. For example, Ward clustering with Manhattan distance:

library(pvclust)
pvclust(data = t(mydata), method.hclust = "ward.D", method.dist = "manhattan", nboot = 10000)

Additionally, you will get a bootstrap-based p-value for each clade, reflecting how often that cluster is recovered across the replicates.
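For readers who already have a square matrix of pairwise Hamming or SNP distances, the general idea of clustering from that matrix can be sketched in plain Python. Note that this is not the pvclust call above: it uses average linkage rather than Ward, has no bootstrap support values, and cuts the tree at a distance threshold. The function name, the threshold parameter, and the toy matrix in the usage note are hypothetical.

```python
def average_linkage_clusters(names, dist, threshold):
    """Naive agglomerative clustering with average linkage.

    names: list of genome labels.
    dist: symmetric square matrix (list of lists) in the same order.
    threshold: stop merging once the closest pair of clusters is
               further apart than this (assumed cut-off, user-chosen).
    Returns a sorted list of clusters, each a sorted list of labels.
    """
    clusters = [[i] for i in range(len(names))]

    def avg(c1, c2):
        # Mean of all pairwise distances between members of two clusters.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = avg(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break
        # Merge cluster b into cluster a (b > a, so index a stays valid).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    return sorted(sorted(names[i] for i in c) for c in clusters)
```

As a usage example, with labels `["g1", "g2", "g3", "g4"]`, the toy matrix `[[0, 2, 40, 42], [2, 0, 41, 43], [40, 41, 0, 3], [42, 43, 3, 0]]` and a threshold of 10, the function groups g1 with g2 and g3 with g4. The loop is cubic or worse in the number of genomes, so it only suits small sets.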

ADD REPLY • link 8.6 years ago by stolarek.ir ▴ 700

score 0 · Answer 1 · 2017-01-27

0

8.6 years ago

Sej Modha 5.3k

cd-hit can be used for clustering.

ADD COMMENT • link 8.6 years ago by Sej Modha 5.3k

0

It would be pointless to apply cd-hit to complete bacterial genome sequences (unless they were very similar, sharing the exact same gene order and so on). Perhaps a better strategy would be to build a distance matrix with e.g. all-vs-all MUMmer. Counting shared k-mers could also give a reasonably representative distance matrix.
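The shared k-mer idea can be illustrated with a small alignment-free sketch in Python. It assumes a Jaccard distance over the sets of canonical k-mers (each k-mer and its reverse complement treated as one), which is one possible way to turn shared k-mer counts into a distance and not necessarily the one meant above. The function names and the default k of 21 are assumptions, and holding every k-mer of a whole genome in a set is memory-hungry for large collections.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k=21):
    """Return the set of canonical k-mers in seq.

    A k-mer and its reverse complement are represented by whichever
    of the two sorts first, so strand does not matter.
    """
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        rc = kmer.translate(_COMPLEMENT)[::-1]
        kmers.add(min(kmer, rc))
    return kmers


def jaccard_distance(a, b, k=21):
    """1 - |shared k-mers| / |all k-mers| for two sequences."""
    ka, kb = canonical_kmers(a, k), canonical_kmers(b, k)
    union = ka | kb
    if not union:
        # Both sequences shorter than k: treat as identical (assumption).
        return 0.0
    return 1.0 - len(ka & kb) / len(union)
```

Computing `jaccard_distance` for every pair of genomes yields a distance matrix in the range 0 to 1 that can be fed to the same clustering methods as a SNP-based one.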

ADD REPLY • link 8.6 years ago by 5heikki 11k