Similarity Measures Appropriate For Hierarchical Clustering On Gene Content/Binary Data
11.4 years ago
simonalpha ▴ 10

Hi,

I've got a table of presence/absence data for around 100 genes across fewer than 10 samples, derived from genome sequence data. I have been experimenting with hierarchical clustering to illustrate the similarity between these strains, borrowing some ideas from microarray analysis.

However, I'm unsure which distance measure I should use to construct the distance matrix for clustering. Currently I'm using Hamming distance, given the binary nature of the data, but I'm concerned that it is not normalised and does not account for the joint presence of genes.
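
To make the concern concrete, here is a minimal sketch in R, assuming a toy 0/1 matrix with strains as rows and genes as columns (the matrix `pa` and the strain names are made up for illustration). The raw Hamming distance is a count of mismatching genes, dividing by the number of genes normalises it to the range 0 to 1, and genes absent from both strains count as agreement just as jointly present genes do.

```r
# Toy presence/absence matrix: rows = strains, columns = genes (hypothetical data)
pa <- rbind(
  strainA = c(1, 1, 0, 0, 0, 0),
  strainB = c(1, 0, 0, 0, 0, 0),
  strainC = c(0, 0, 1, 1, 1, 0)
)

# Raw Hamming distance: number of genes whose state differs between two strains.
# For 0/1 data this equals the Manhattan distance.
hamming_raw <- dist(pa, method = "manhattan")

# Normalised Hamming distance: proportion of genes that differ (range 0 to 1)
hamming_norm <- hamming_raw / ncol(pa)

# Genes absent from both strains are treated as agreement by Hamming distance
joint_absent_AB <- sum(pa["strainA", ] == 0 & pa["strainB", ] == 0)

print(as.matrix(hamming_raw))
print(round(as.matrix(hamming_norm), 3))
```

In this toy case strainA and strainB look very close mostly because they share four absent genes, which is the behaviour the question is worried about.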

Any suggestions for alternatives for this type of data, or recommendations of papers etc. that I could read to better understand the choice of distance metric, would be much appreciated.

Thanks,

Simon

statistics • 7.2k views
11.3 years ago
Christian ★ 3.1k

I have used the Jaccard index before as a similarity measure between two gene groups.


That is one of the metrics I'm trying to decide between. Any particular reason you chose that method?


It factors in group size.
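
A minimal sketch of what that means in practice, using two made-up gene profiles (the vectors `x` and `y` are hypothetical): the Jaccard index divides the number of shared genes by the number of genes present in at least one of the two groups, so genes absent from both do not inflate the similarity.

```r
# Two hypothetical presence/absence profiles over the same 8 genes
x <- c(1, 1, 1, 0, 0, 0, 0, 0)
y <- c(1, 1, 0, 1, 0, 0, 0, 0)

shared <- sum(x == 1 & y == 1)   # genes present in both
in_any <- sum(x == 1 | y == 1)   # genes present in at least one

jaccard_sim  <- shared / in_any  # Jaccard index (similarity)
jaccard_dist <- 1 - jaccard_sim  # Jaccard distance

# Simple matching, by contrast, also rewards joint absences
simple_matching <- mean(x == y)

# Base R's dist(method = "binary") gives the same Jaccard distance
d_binary <- as.numeric(dist(rbind(x, y), method = "binary"))
```

Here the four genes absent from both profiles raise the simple matching score but leave the Jaccard index unchanged.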

11.4 years ago
Biojl ★ 1.7k

Hi,

I usually build this kind of clustering from binary (presence/absence) data. I simply calculate the distance matrix using the binary method.

In R:

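library(amap)  # assumed source of Dist(); base R's dist() also accepts method='binary'
dist_matrix <- Dist(matrix, method='binary')  # Create distance matrix (rows are compared)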

I know it's quite simple, but this kind of analysis is just meant to give a first glimpse of the data. A better approximation might be to use the RPKM values instead of just presence/absence.
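
For completeness, a minimal sketch of the whole step from binary table to dendrogram, using base R's dist() and a made-up matrix. The object `pa`, the strain names, the average linkage and the cut into two clusters are all assumptions for illustration, not choices stated in this thread.

```r
# Hypothetical presence/absence table: rows = strains, columns = genes
pa <- rbind(
  s1 = c(1, 1, 1, 0, 0, 0),
  s2 = c(1, 1, 0, 0, 0, 0),
  s3 = c(0, 0, 0, 1, 1, 1),
  s4 = c(0, 0, 0, 1, 1, 0)
)

d  <- dist(pa, method = "binary")    # pairwise binary (Jaccard) distances between rows
hc <- hclust(d, method = "average")  # average linkage (UPGMA); an assumed choice
groups <- cutree(hc, k = 2)          # cut the tree into two clusters

plot(hc, main = "Strains clustered on gene presence/absence")
```

Note that dist() compares rows, so the table may need transposing with t() if strains are stored in columns.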


I probably should have been clearer: I'm using genomic sequence, as opposed to transcriptomics. Unless I've missed something, RPKM is for RNAseq-type data, right?


Yes, RPKM is for RNAseq. The code I posted creates a distance matrix from binary (presence/absence) data for genes, so it can be used for your data. I assumed it was RNAseq because you're comparing presence/absence of genes in different strains. Which species are you working with, where different strains differ in enough genes to create reliable clusters? I think it would be much more useful to construct the matrices from differences in the multiple alignments of the orthologous genes.


I wish I was able to do that! I'm looking at a region encoding surface antigens in bacteria that seems to undergo a fair bit of HGT, so where there are orthologous genes, they don't represent the entire region. Hence the presence/absence approach. I'm trying to come up with alternatives, but not having much luck.
