Question

Calculate AU p-value of the first division in R pvrect function


7.7 years ago

johntikas ▴ 10

I have recently been looking into hierarchical cluster analysis with multiscale bootstrap resampling, using the R package 'pvclust', and have two questions about one of my analyses:

1- What are the best choices of distance (default: correlation) and clustering method (default: average) when the input data are percentages (%)?
2- How can I obtain the AU and BP p-values for the very first division (the one labelled 'edge #' instead of actual numbers)?

R next-gen sequencing • 1.7k views

updated 7.7 years ago by Jean-Karim Heriche 27k • written 7.7 years ago by johntikas ▴ 10

score 0 · Answer 1 · 2017-03-10


7.7 years ago

Jean-Karim Heriche 27k

1- What kind of data do you have? If your data are vectors of percentages and each vector sums to 100, they are compositional data, and the standard way of handling them is a log-ratio transform. See "Compositional data analysis in a nutshell" and the reference book "The Statistical Analysis of Compositional Data" by Aitchison.
You may also be interested in the R package robCompositions.
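As a rough illustration of the log-ratio idea, here is a minimal base R sketch of the centered log-ratio (clr) transform, one common member of the log-ratio family. The toy matrix `comp` and the helper `clr()` are made up for this example and are not part of pvclust or robCompositions; the sketch also assumes there are no zeros in the data, which would have to be replaced first.

```r
# Hypothetical toy data: 3 samples (rows) x 4 parts (columns),
# each row is a composition summing to 100.
comp <- matrix(c(10, 20, 30, 40,
                 25, 25, 25, 25,
                  5, 15, 60, 20),
               nrow = 3, byrow = TRUE,
               dimnames = list(paste0("sample", 1:3), paste0("part", 1:4)))

# Centered log-ratio: log of each part minus the mean log of its row.
# Illustrative helper, assuming strictly positive entries (log(0) is -Inf).
clr <- function(x) {
  stopifnot(all(x > 0))
  logx <- log(x)
  logx - rowMeans(logx)  # recycling subtracts row i's mean from row i
}

comp_clr <- clr(comp)
rowSums(comp_clr)  # each row sums to 0, up to floating-point error
```

Note that pvclust clusters the columns of its input, so a samples-by-parts matrix like this one would need to be transposed with `t()` before clustering the samples.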

2- The BP and AU values are calculated for every branch of the dendrogram, including those from the first split: the first split of the data gives two branches, so you get a pair of values for each of them. The values are associated with a cluster; for example, the BP is the frequency with which that cluster occurs in the bootstrap samples.
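To make this concrete, here is a hedged sketch of how the values for the two branches of the first split could be read from a pvclust result. The simulated matrix `x` and the variable names are invented for illustration. The sketch relies on the `edges` data frame and the `hclust` component returned by `pvclust()`, and assumes that row i of `edges` corresponds to row i of `hclust$merge`. As far as I can tell, the topmost node in the plot only carries the legend ('au', 'bp', 'edge #'), since the cluster containing all items is trivial.

```r
library(pvclust)

set.seed(42)
# Hypothetical data: 30 features (rows) x 8 samples (columns) in two groups.
# pvclust clusters the columns.
base1 <- rnorm(30)
base2 <- rnorm(30)
x <- cbind(sapply(1:4, function(i) base1 + rnorm(30, sd = 0.3)),
           sapply(1:4, function(i) base2 + rnorm(30, sd = 0.3)))
colnames(x) <- paste0("s", 1:8)

# Small nboot to keep the example fast; use more replicates in practice.
fit <- pvclust(x, method.hclust = "average",
               method.dist = "correlation", nboot = 200)

# The last row of the merge matrix is the root (all items together).
merge <- fit$hclust$merge
root  <- nrow(merge)
kids  <- merge[root, ]         # positive = edge number, negative = single item
kid_edges <- kids[kids > 0]    # a singleton branch has no AU/BP of its own

# AU and BP for the clusters created by the first division.
fit$edges[kid_edges, c("au", "bp")]
```

If one side of the first split is a single item, `kids` contains a negative entry for it and only the other branch has a row in `fit$edges`.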


The percentages are expressed as proportions ranging from 0.00 to 1.00. I was told that correlation as the distance, together with ward.D2 as the clustering method, is suitable for this kind of data.
I know; I just need a metric, either AU/BP or some other statistical test, to report that the first two clusters (which in my case have AU > 95) are different. I have considered adding a dummy variable with 0.00 in every row to obtain that pair of numbers, but I am not sure whether that is appropriate in this context.

7.7 years ago by johntikas ▴ 10


If your features are independent percentages, then you can treat your vectors like any other multidimensional vectors. Which distance measure to use depends on several things, one being what notion of similarity you want to capture, another being the dimensionality of your vectors. On this last point, you may want to avoid anything based on Euclidean distance if your vectors have more than ~20 features, because Euclidean distance is prone to the distance concentration phenomenon, in which pairwise distances become nearly indistinguishable as the number of dimensions grows.
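A small base R simulation can give a feel for distance concentration. This is only a sketch on uniform random data, with a made-up helper `relative_contrast()`; it measures how much the farthest point differs from the nearest one, relative to the nearest, as the number of features grows. Real data with correlated features may behave less dramatically.

```r
set.seed(1)

# Relative contrast (max - min) / min of the Euclidean distances from one
# random query point to n uniform random points in d dimensions.
# Illustrative helper, not taken from any package.
relative_contrast <- function(d, n = 500) {
  pts   <- matrix(runif(n * d), nrow = n)
  query <- runif(d)
  dists <- sqrt(rowSums(sweep(pts, 2, query)^2))  # subtract query from each row
  (max(dists) - min(dists)) / min(dists)
}

dims     <- c(2, 20, 200, 2000)
contrast <- sapply(dims, relative_contrast)
data.frame(dims, contrast)  # contrast tends to shrink as dims increases
```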

AU/BP measure the confidence in the associated cluster. They can't tell you whether clusters are the same or different. Also, in hard clustering, the clusters are different by definition, since a data point can only belong to one cluster. So could you clarify your second question?

7.7 years ago by Jean-Karim Heriche 27k