Hi, I have generated a distance matrix using this code, which I took from here:
plotDistances = function(p = GlobalPatterns, m = "wunifrac", s = "X.SampleID", d = "SampleType", plot = TRUE) {
  require("phyloseq")
  require("dplyr")
  require("reshape2")
  require("ggplot2")
  # calc distances
  wu = phyloseq::distance(p, m)
  wu.m = melt(as.matrix(wu))
  # remove self-comparisons
  wu.m = wu.m %>%
    filter(as.character(Var1) != as.character(Var2)) %>%
    mutate_if(is.factor, as.character)
  # get sample data (S4 error OK and expected)
  sd = sample_data(p) %>%
    select(s, d) %>%
    mutate_if(is.factor, as.character)
  # combine distances with sample data
  colnames(sd) = c("Var1", "Type1")
  wu.sd = left_join(wu.m, sd, by = "Var1")
  colnames(sd) = c("Var2", "Type2")
  wu.sd = left_join(wu.sd, sd, by = "Var2")
  # plot
  p = ggplot(wu.sd, aes(x = Type2, y = value)) +
    theme_bw() +
    geom_point() +
    geom_boxplot(aes(color = ifelse(Type1 == Type2, "red", "black"))) +
    scale_color_identity() +
    facet_wrap(~ Type1, scales = "free_x") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
    ggtitle(paste0("Distance Metric = ", m)) +
    ylab(m) +
    xlab(d)
  # return the plot, or the long-format distance table if plot = FALSE
  if (plot == TRUE) {
    return(p)
  } else {
    return(wu.sd)
  }
}
I have 52 samples in total (13 controls + 39 tests), so after removing the self-distances the function returns (52 x 52) - 52 = 2652 rows. The resulting table looks something like this (many lines omitted to save space):
wu.sd
Var1 Var2 value Type1 Type2
1 ERR260268_profile ERR275252_profile 0.6813452 control obese
2 ERR260265_profile ERR275252_profile 0.7162228 control obese
3 ERR260264_profile ERR275252_profile 0.6417904 control obese
4 ERR260263_profile ERR275252_profile 0.5717646 control obese
5 ERR260261_profile ERR275252_profile 0.5619948 obese obese
6 ERR260260_profile ERR275252_profile 0.5124622 control obese
7 ERR260259_profile ERR275252_profile 0.5812824 control obese
...
15 ERR260239_profile ERR275252_profile 0.5676038 obese obese
16 ERR260238_profile ERR275252_profile 0.6059405 obese obese
17 ERR260235_profile ERR275252_profile 0.5886723 obese obese
18 ERR260227_profile ERR275252_profile 0.5431291 control obese
19 ERR260226_profile ERR275252_profile 0.6558075 control obese
20 ERR260224_profile ERR275252_profile 0.5683788 obese obese
21 ERR260222_profile ERR275252_profile 0.7357532 obese obese
22 ERR260217_profile ERR275252_profile 0.4715252 control obese
23 ERR260216_profile ERR275252_profile 0.5456019 control obese
24 ERR260215_profile ERR275252_profile 0.4235138 control obese
25 ERR260209_profile ERR275252_profile 0.6692677 control obese
26 ERR260205_profile ERR275252_profile 0.5776058 control obese
27 ERR260204_profile ERR275252_profile 0.6388663 obese obese
28 ERR260194_profile ERR275252_profile 0.4795330 obese obese
29 ERR260193_profile ERR275252_profile 0.5681197 control obese
30 ERR260182_profile ERR275252_profile 0.6187430 obese obese
31 ERR260179_profile ERR275252_profile 0.4991601 obese obese
32 ERR260178_profile ERR275252_profile 0.5084147 obese obese
33 ERR260177_profile ERR275252_profile 0.3764679 obese obese
34 ERR260176_profile ERR275252_profile 0.7565849 obese obese
35 ERR260175_profile ERR275252_profile 0.5899777 obese obese
...
44 ERR260154_profile ERR275252_profile 0.5464272 obese obese
45 ERR260152_profile ERR275252_profile 0.5769312 obese obese
46 ERR260150_profile ERR275252_profile 0.6799476 obese obese
47 ERR260148_profile ERR275252_profile 0.5879797 obese obese
48 ERR260147_profile ERR275252_profile 0.7578910 control obese
49 ERR260145_profile ERR275252_profile 0.7650143 obese obese
50 ERR260140_profile ERR275252_profile 0.5465588 obese obese
51 ERR260136_profile ERR275252_profile 0.5901467 obese obese
52 ERR275252_profile ERR260268_profile 0.6813452 obese control
53 ERR260265_profile ERR260268_profile 0.6878742 control control
54 ERR260264_profile ERR260268_profile 0.6256261 control control
55 ERR260263_profile ERR260268_profile 0.6589539 control control
56 ERR260261_profile ERR260268_profile 0.6366749 obese control
57 ERR260260_profile ERR260268_profile 0.6534903 control control
58 ERR260259_profile ERR260268_profile 0.6027080 control control
59 ERR260258_profile ERR260268_profile 0.6506323 obese control
60 ERR260252_profile ERR260268_profile 0.4559215 control control
61 ERR260250_profile ERR260268_profile 0.5596617 obese control
62 ERR260245_profile ERR260268_profile 0.6294801 obese control
63 ERR260244_profile ERR260268_profile 0.6248457 obese control
64 ERR260242_profile ERR260268_profile 0.5400533 control control
65 ERR260240_profile ERR260268_profile 0.5882032 obese control
66 ERR260239_profile ERR260268_profile 0.7411252 obese control
67 ERR260238_profile ERR260268_profile 0.7181616 obese control
68 ERR260235_profile ERR260268_profile 0.7699306 obese control
...
80 ERR260193_profile ERR260268_profile 0.6769742 control control
81 ERR260182_profile ERR260268_profile 0.7567220 obese control
82 ERR260179_profile ERR260268_profile 0.6433029 obese control
83 ERR260178_profile ERR260268_profile 0.6629624 obese control
[ reached 'max' / getOption("max.print") -- omitted 2452 rows ]
As you can see, every combination of samples appears twice, once in each order (imagine a checkerboard of the 52 samples). I want to extract only the control-control distances and the obese-obese distances, each exactly once. How can I do that?
Thanks, dpc
But won't that also extract each distance twice? For example, the distance between X and Y appears as both X-Y and Y-X. How can I keep each pairwise comparison only once?
thanks
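One possible way to keep each within-group comparison only once is sketched below. It assumes the table has the same columns as wu.sd above (Var1, Var2, value, Type1, Type2) and that Var1 and Var2 are character columns. The toy data, the sample names A to D, and the object names toy and within_once are made up for illustration and are not from the original data. The idea is to keep rows where both types match, and then keep only one ordering of each pair by requiring Var1 < Var2 (alphabetical comparison of the sample names).

```r
library(dplyr)

# Hypothetical toy data: 4 samples, 2 per group (stand-in for the real 52 samples)
types <- c(A = "control", B = "control", C = "obese", D = "obese")

# Symmetric distance matrix built from made-up 1-D positions
m <- as.matrix(dist(c(A = 1, B = 2, C = 4, D = 8)))

# Long format with both orderings of every pair, self-comparisons removed,
# mimicking the structure of wu.sd
toy <- as.data.frame(as.table(m), responseName = "value",
                     stringsAsFactors = FALSE) %>%
  filter(Var1 != Var2) %>%
  mutate(Type1 = unname(types[Var1]),
         Type2 = unname(types[Var2]))

# Keep within-group pairs only, and only one ordering of each pair
within_once <- toy %>%
  filter(Type1 == Type2, Var1 < Var2)

print(within_once)
```

On the real table the same filter would be applied to wu.sd instead of toy. With 13 controls and 39 obese samples, that should leave choose(13, 2) + choose(39, 2) = 78 + 741 = 819 rows, which is a quick sanity check on the result.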