Question

How to statistically compare two unit vectors?

0

7.2 years ago

clyumath ▴ 20

I want to statistically compare two unit vectors. Is there a statistical test that behaves as described below?

The null hypothesis is that the two unit vectors have the same distribution. Of the three unit vectors below, nv and nu are very similar, so Test(nv, nu) should produce a large p-value, whereas Test(nv, nw) should produce a small p-value that may reject the null.

Please note that this is about vectors, not groups of numbers: the position of each component in the vector is fixed, so the Kolmogorov-Smirnov test may not be appropriate here.

Any suggested statistical test?

nv = (0.29521, 0.61899, 0.72374, 0.03809, 0.066660)

nu = (0.28004, 0.59743, 0.74678, 0.03734, 0.07468)

nw = (0.01467, 0.04401, 0.99752, 0.04401, 0.02934)

Thank you!

statistics • 5.9k views

updated 7.2 years ago by russhh 5.7k • written 7.2 years ago by clyumath ▴ 20

1

How about using the Wilcoxon rank-sum test (the Shapiro–Wilk test is significant for nw, so it is not normally distributed)? I couldn't tell whether your data are paired. If they are, use the Wilcoxon signed-rank test instead.
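A minimal sketch of how these tests could be run in base R on the vectors from the question. It treats the five components of each vector as a sample of values, which is exactly the assumption the question doubts, so read the p-values with that in mind. `exact = FALSE` is set only because nw contains a tied value.

```r
nv <- c(0.29521, 0.61899, 0.72374, 0.03809, 0.066660)
nw <- c(0.01467, 0.04401, 0.99752, 0.04401, 0.02934)

# Shapiro-Wilk normality check on the components of nw
sw <- shapiro.test(nw)

# Unpaired case: Wilcoxon rank-sum test
# (normal approximation, because nw contains a tie)
rs <- wilcox.test(nv, nw, exact = FALSE)

# Paired by component position: Wilcoxon signed-rank test
sr <- wilcox.test(nv, nw, paired = TRUE)

c(shapiro = sw$p.value, rank_sum = rs$p.value, signed_rank = sr$p.value)
```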

7.2 years ago by Tom_L ▴ 360

0

You should give more information. How are these vectors generated? Do we know anything about the population they come from? If the vectors are independent samples drawn from multivariate normal distributions, look into Hotelling's two-sample T-squared statistic.
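For reference, a sketch of the two-sample Hotelling T-squared test in base R. It needs several independent vectors per group, so it cannot be applied to a single pair such as nv and nu. The matrices `X` and `Y` below are simulated placeholders, and `hotelling_t2` is a hypothetical helper written for this illustration, not a library function.

```r
# Two-sample Hotelling T-squared test with a pooled covariance matrix.
# X, Y: matrices with one observation (vector) per row.
hotelling_t2 <- function(X, Y) {
  n1 <- nrow(X); n2 <- nrow(Y); p <- ncol(X)
  d <- colMeans(X) - colMeans(Y)
  S <- ((n1 - 1) * cov(X) + (n2 - 1) * cov(Y)) / (n1 + n2 - 2)
  t2 <- (n1 * n2 / (n1 + n2)) * sum(d * solve(S, d))
  # T-squared rescaled to an F statistic with (p, n1 + n2 - p - 1) df
  df1 <- p
  df2 <- n1 + n2 - p - 1
  f <- t2 * df2 / (p * (n1 + n2 - 2))
  list(t2 = t2, f = f, df1 = df1, df2 = df2,
       p.value = pf(f, df1, df2, lower.tail = FALSE))
}

# Simulated placeholder data: 20 five-dimensional vectors per group
set.seed(1)
X <- matrix(rnorm(20 * 5), nrow = 20)
Y <- matrix(rnorm(20 * 5, mean = 1), nrow = 20)

res <- hotelling_t2(X, Y)
res$p.value
```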

7.2 years ago by Jean-Karim Heriche 27k

0

Thanks Jean-Karim. The unit vector is derived from a 5-category frequency vector.

7.2 years ago by clyumath ▴ 20

0

You've normalized the vectors to length 1, but do the original frequencies sum to 1 for each sample, i.e. are you dealing with compositional data? Or are the categories measured independently?

7.2 years ago by Jean-Karim Heriche 27k

0

No, the original vectors are frequency counts, e.g. (5, 4, 3, 2, 1) and (50, 40, 30, 20, 10). After normalizing, we treat these two as the same, i.e. their distributions are the same. On the second question: yes, the categories were measured independently.
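To make the normalization concrete, here is a small R check that the two example count vectors map to the same unit vector. The helper name `unit` is just for illustration.

```r
# Scale a vector to Euclidean length 1
unit <- function(x) x / sqrt(sum(x^2))

a <- c(5, 4, 3, 2, 1)
b <- c(50, 40, 30, 20, 10)

unit(a)
all.equal(unit(a), unit(b))  # TRUE: the total count is discarded
```

Note that the total count (15 versus 150) is lost in this step.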

7.2 years ago by clyumath ▴ 20

score 0 · Answer 1 · 2017-08-31

0

7.2 years ago

russhh 5.7k

I don't know about the statistical analysis, but you could certainly cluster using the cosine distance between pairs of vectors (1 - cosine_similarity, effectively a measure of the angle between them). This would mean that dist(a, b) == dist(-a, b) for your vectors (this might be unreasonable, so have a think about it), but it looks like all the components are non-negative. Based on this, you could proceed in R:

library(magrittr)    # %>%
library(proxy)       # cosine distance
# note that the columns are your unit vectors, 
#   so we have to transpose prior to `stats::dist` or use `by_rows` in `proxy::dist`:
DF <- data.frame(nv = c(0.29521, 0.61899, 0.72374, 0.03809, 0.066660),
    nu = c(0.28004, 0.59743, 0.74678, 0.03734, 0.07468),
    nw = c(0.01467, 0.04401, 0.99752, 0.04401, 0.02934)
    )
# Calculate cosine distance
d <- proxy::dist(DF, by_rows = FALSE, method = "cosine")
d
#>              nv           nu
#> nu 0.0006453473
#> nw 0.2428455211 0.2208336479
# Cluster, then plot a dendrogram
hclust(d) %>% plot

7.2 years ago by russhh 5.7k

0

Alternatively, you could randomly sample 2 points on a unit sphere in 5d (similar to here) [and, if necessary, map these values into the all-positive fraction of that sphere, if that's where you're theoretically working], then calculate the cosine distance between those two points. Repeat this many times and compare your three observed distances with the distances for the randomly selected pairs.
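A rough sketch of that simulation in base R. The assumptions here are mine: random directions come from normalized Gaussian draws, `abs()` folds them into the all-positive part of the sphere, and 10,000 pairs are simulated. `random_unit` and `cos_dist` are illustrative helpers.

```r
set.seed(42)

nv <- c(0.29521, 0.61899, 0.72374, 0.03809, 0.066660)
nu <- c(0.28004, 0.59743, 0.74678, 0.03734, 0.07468)
nw <- c(0.01467, 0.04401, 0.99752, 0.04401, 0.02934)

cos_dist <- function(a, b) 1 - sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# n random unit vectors in the all-positive part of the sphere in p dimensions
random_unit <- function(n, p) {
  x <- abs(matrix(rnorm(n * p), nrow = n))
  x / sqrt(rowSums(x^2))
}

n_sim <- 10000
a <- random_unit(n_sim, 5)
b <- random_unit(n_sim, 5)
null_d <- 1 - rowSums(a * b)  # cosine distance of each random pair

observed <- c(nv_nu = cos_dist(nv, nu),
              nv_nw = cos_dist(nv, nw),
              nu_nw = cos_dist(nu, nw))

# Share of random pairs that are at least as close as each observed pair
p_closer <- sapply(observed, function(d) (sum(null_d <= d) + 1) / (n_sim + 1))
p_closer
```

A small value here means the pair is closer than random pairs tend to be, so the reference null is "two unrelated random directions" rather than "same distribution".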

7.2 years ago by russhh 5.7k

0

Thanks russhh! Yes, mathematical distances definitely work for that. Another alternative, treating the unit vector as a probability distribution, is the Kullback–Leibler divergence or the Jensen–Shannon divergence. But what I need is a statistical test; simply put, I need a p-value.
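For completeness, a sketch of the Jensen–Shannon divergence in base R. A vector of Euclidean length 1 does not sum to 1, so this assumes each vector is first rescaled to sum to 1. `to_prob`, `kl_div` and `js_div` are illustrative helpers, and natural logarithms are used.

```r
nv <- c(0.29521, 0.61899, 0.72374, 0.03809, 0.066660)
nu <- c(0.28004, 0.59743, 0.74678, 0.03734, 0.07468)
nw <- c(0.01467, 0.04401, 0.99752, 0.04401, 0.02934)

# Rescale non-negative values so they sum to 1
to_prob <- function(x) x / sum(x)

# Kullback-Leibler divergence KL(p || q); terms with p == 0 contribute 0
kl_div <- function(p, q) sum(ifelse(p > 0, p * log(p / q), 0))

# Jensen-Shannon divergence: symmetric, bounded above by log(2)
js_div <- function(p, q) {
  m <- (p + q) / 2
  0.5 * kl_div(p, m) + 0.5 * kl_div(q, m)
}

p_nv <- to_prob(nv)
p_nu <- to_prob(nu)
p_nw <- to_prob(nw)

js <- c(nv_nu = js_div(p_nv, p_nu),
        nv_nw = js_div(p_nv, p_nw),
        nu_nw = js_div(p_nu, p_nw))
js
```

Like the cosine distance, this gives a measure of dissimilarity rather than a p-value.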