I have recently been looking into hierarchical cluster analysis with multiscale bootstrap resampling, and I would like to ask the following questions about one of my cases, using the R package 'pvclust':
- What are the best options for the distance (default: correlation) and the clustering method (default: average) when the input data are percentages (%)?
- How can you calculate AU and BP p-values for the very first division (the one showing 'edge #' instead of actual numbers)?
If your features are independent percentages, then you can treat your vectors like any other multidimensional vectors. Which distance measure to use depends on several things: one is the notion of similarity you want to capture, another is the dimensionality of your vectors. On this last point, you may want to avoid anything based on Euclidean distance if your vectors have more than ~20 features, because Euclidean distance is prone to the distance concentration phenomenon: as the number of dimensions grows, the distances between all pairs of points become nearly equal, so "near" and "far" neighbours are hard to tell apart.
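To make the distance concentration point concrete, here is a minimal sketch in Python (standard library only). It is not part of pvclust and uses synthetic uniform random data, not your percentages. The function name `relative_contrast`, the sample size and the dimensions tried are all assumptions chosen for illustration. It measures how much the farthest point differs from the nearest one, relative to the nearest, as the number of features grows.

```python
import math
import random


def relative_contrast(dim, n_points=500, seed=0):
    """Return (max - min) / min of the Euclidean distances from one
    random query point to n_points random points in [0, 1]^dim.

    Hypothetical helper for illustration; the data are synthetic.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)


# The contrast shrinks as the dimensionality grows: the nearest and the
# farthest points end up at almost the same distance from the query.
for dim in (2, 20, 200):
    print(dim, round(relative_contrast(dim), 2))
```

A small contrast means a Euclidean-based dissimilarity matrix carries little information for the clustering step, which is one reason a correlation-based distance may be a reasonable choice for wide data.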
The AU (approximately unbiased) and BP (bootstrap probability) values measure the confidence in the associated cluster, that is, how strongly the data support it. They can't tell you whether clusters are the same or different. Also, in hard clustering, the clusters are different by definition, since a data point can only belong to one cluster. So could you clarify your second question?