How to extract out specific distances from a distance matrix?
2
0
Entering edit mode
4.4 years ago
dpc ▴ 250

Hi I have generated a distance matrix using this code which has been collected from here:

plotDistances = function(p = GlobalPatterns, m = "wunifrac", s = "X.SampleID", d = "SampleType", plot = TRUE) {

  require("phyloseq")
  require("dplyr")
  require("reshape2")
  require("ggplot2")

  # calc distances
  wu = phyloseq::distance(p, m)
  wu.m = melt(as.matrix(wu))

  # remove self-comparisons
  wu.m = wu.m %>%
    filter(as.character(Var1) != as.character(Var2)) %>%
    mutate_if(is.factor, as.character)

  # get sample data (S4 error OK and expected)
  sd = sample_data(p) %>%
    select(s, d) %>%
    mutate_if(is.factor,as.character)

  # combined distances with sample data
  colnames(sd) = c("Var1", "Type1")
  wu.sd = left_join(wu.m, sd, by = "Var1")

  colnames(sd) = c("Var2", "Type2")
  wu.sd = left_joinwu.sd, sd, by = "Var2")

  # plot
  p = ggplotwu.sd, aes(x = Type2, y = value)) +
    theme_bw() +
    geom_point() +
    geom_boxplot(aes(color = ifelse(Type1 == Type2, "red", "black"))) +
    scale_color_identity() +
    facet_wrap(~ Type1, scales = "free_x") +
    theme(axis.text.x=element_text(angle = 90, hjust = 1, vjust = 0.5)) +
    ggtitle(paste0("Distance Metric = ", m)) +
    ylab(m) +
    xlab(d)

  # return
  if (plot == TRUE) {
    return(p)
  } else {
    returnwu.sd)
  }
}

I have total 52 samples (13 controls + 39 tests). So, it has generated [(52x52)-52] =2652 rows after removing the self distances. Now, the distance matrix looks something like (Omitted many lines for space issues):

wu.sd
                 Var1              Var2     value   Type1   Type2
1   ERR260268_profile ERR275252_profile 0.6813452 control   obese
2   ERR260265_profile ERR275252_profile 0.7162228 control   obese
3   ERR260264_profile ERR275252_profile 0.6417904 control   obese
4   ERR260263_profile ERR275252_profile 0.5717646 control   obese
5   ERR260261_profile ERR275252_profile 0.5619948   obese   obese
6   ERR260260_profile ERR275252_profile 0.5124622 control   obese
7   ERR260259_profile ERR275252_profile 0.5812824 control   obese
...
15  ERR260239_profile ERR275252_profile 0.5676038   obese   obese
16  ERR260238_profile ERR275252_profile 0.6059405   obese   obese
17  ERR260235_profile ERR275252_profile 0.5886723   obese   obese
18  ERR260227_profile ERR275252_profile 0.5431291 control   obese
19  ERR260226_profile ERR275252_profile 0.6558075 control   obese
20  ERR260224_profile ERR275252_profile 0.5683788   obese   obese
21  ERR260222_profile ERR275252_profile 0.7357532   obese   obese
22  ERR260217_profile ERR275252_profile 0.4715252 control   obese
23  ERR260216_profile ERR275252_profile 0.5456019 control   obese
24  ERR260215_profile ERR275252_profile 0.4235138 control   obese
25  ERR260209_profile ERR275252_profile 0.6692677 control   obese
26  ERR260205_profile ERR275252_profile 0.5776058 control   obese
27  ERR260204_profile ERR275252_profile 0.6388663   obese   obese
28  ERR260194_profile ERR275252_profile 0.4795330   obese   obese
29  ERR260193_profile ERR275252_profile 0.5681197 control   obese
30  ERR260182_profile ERR275252_profile 0.6187430   obese   obese
31  ERR260179_profile ERR275252_profile 0.4991601   obese   obese
32  ERR260178_profile ERR275252_profile 0.5084147   obese   obese
33  ERR260177_profile ERR275252_profile 0.3764679   obese   obese
34  ERR260176_profile ERR275252_profile 0.7565849   obese   obese
35  ERR260175_profile ERR275252_profile 0.5899777   obese   obese
...
44  ERR260154_profile ERR275252_profile 0.5464272   obese   obese
45  ERR260152_profile ERR275252_profile 0.5769312   obese   obese
46  ERR260150_profile ERR275252_profile 0.6799476   obese   obese
47  ERR260148_profile ERR275252_profile 0.5879797   obese   obese
48  ERR260147_profile ERR275252_profile 0.7578910 control   obese
49  ERR260145_profile ERR275252_profile 0.7650143   obese   obese
50  ERR260140_profile ERR275252_profile 0.5465588   obese   obese
51  ERR260136_profile ERR275252_profile 0.5901467   obese   obese
52  ERR275252_profile ERR260268_profile 0.6813452   obese control
53  ERR260265_profile ERR260268_profile 0.6878742 control control
54  ERR260264_profile ERR260268_profile 0.6256261 control control
55  ERR260263_profile ERR260268_profile 0.6589539 control control
56  ERR260261_profile ERR260268_profile 0.6366749   obese control
57  ERR260260_profile ERR260268_profile 0.6534903 control control
58  ERR260259_profile ERR260268_profile 0.6027080 control control
59  ERR260258_profile ERR260268_profile 0.6506323   obese control
60  ERR260252_profile ERR260268_profile 0.4559215 control control
61  ERR260250_profile ERR260268_profile 0.5596617   obese control
62  ERR260245_profile ERR260268_profile 0.6294801   obese control
63  ERR260244_profile ERR260268_profile 0.6248457   obese control
64  ERR260242_profile ERR260268_profile 0.5400533 control control
65  ERR260240_profile ERR260268_profile 0.5882032   obese control
66  ERR260239_profile ERR260268_profile 0.7411252   obese control
67  ERR260238_profile ERR260268_profile 0.7181616   obese control
68  ERR260235_profile ERR260268_profile 0.7699306   obese control
...
80  ERR260193_profile ERR260268_profile 0.6769742 control control
81  ERR260182_profile ERR260268_profile 0.7567220   obese control
82  ERR260179_profile ERR260268_profile 0.6433029   obese control
83  ERR260178_profile ERR260268_profile 0.6629624   obese control
[ reached 'max' / getOption("max.print") -- omitted 2452 rows ]

Here you can see there are all possible combinations of control and test samples each for two times (just imagine a checker board of 52 samples). But I want to extract out only the control-control distances and the obese-obese distances each for once. How can I do that?

Thanks, dpc

wgs phyloseq • 3.0k views
ADD COMMENT
1
Entering edit mode
4.4 years ago

This is an R programming question not a bioinformatics one. Anyway try something along the line of wu.sd[wu.sd$Type1=='control' & wu.sd$Type2 == 'control']

ADD COMMENT
0
Entering edit mode

But that also will extract each of the distances twice. Not? Suppose, distances between X & Y will be like X-Y and Y-X. How to take each of the distance comparison only once?

thanks

ADD REPLY
1
Entering edit mode
4.4 years ago

Hi @dpc,

To take the pairwise comparisons only once, you need to do something like this:

## get only pairwise comparisons when both samples belong to the same group, e.g., sample A and B, with both of them belonging to control condition
sub_wu.sd <- wu.sd[wu.sd$Type1==wu.sd$Type2,]

## loop over rows to get the 2 sample names, order them by alphabetic order and concatenate "-", and add this info into a new column
for ( i in seq( nrow( sub_wu.sd ) ) ) {
  samples <- c( unlist( sub_wu.sd[ i, c("Var1", "Var2") ] ) )
  new_ord_samp <- paste( sort( samples ), collapse = "-" )
  sub_wu.sd[ i, "Sample_ordered"] <- new_ord_samp
}

## remove duplicated pairwise samples based on the new column information
uniq_wu.sd <- sub_wu.sd[ !duplicated( sub_wu.sd$Sample_ordered ),]

I can't understand why you got a different no. of unique comparisons with my code than I was expecting using that formula. However, now you can see if the number of rows that you are getting is the same obtained with my code or not.

I hope this helps,

António

ADD COMMENT
0
Entering edit mode

Yes sir. Now, it's giving the same number of rows as of your function (i.e. 699). When I taken the duplicate values, it gives 1398 i.e., exactly the twice. So, the process of removing duplicates is fine I think. But, the p-value given by your function is 8.7e-06, while from tutorial function after removing duplicates is 0.52. High descrepency!!!

I have found that in both cases there are 171 values for controls and 528 values for test samples (171+528= 699). Just for checking I have compared the distance values from outputs of your function and tutorial function. The starnge thing is none of the values matches. So, only thing comes to my mind is the function in distance calculation. you have used distance(physeq = physeq, method = method) whereas tutorial uses phyloseq::distance(physeq, "bray"). What do you think?

EDIT: I have seen that in your function change from physeq::distance(physeq, "bray") to distance(physeq, method=bray) does not impart any change. The P_value remains the same. So, I think probably the problem is in some other part of the code.

ADD REPLY
1
Entering edit mode

Hi @dpc,

Sorry for the late reply. I've been very busy last days.

So, I decided to make a true comparison between my function and the tutorial function, and the result is the same with the GlobalPatterns data. Here is the code (you need to import and run my function and the tutorial function first):

## Comparing built-in function 'beta_boxplot()'   
#with online function 'plotDistances()' from: https://rdrr.io/github/jeffkimbrel/jakR/src/R/plotDistances.R
# built-in fun
beta_boxplot_result_test <- beta_boxplot(physeq = GlobalPatterns, method = "bray", group = "SampleType")
beta_boxplot_result_test$data

# online fun
online_fun_test <- plotDistances(p = GlobalPatterns, m = "bray", s = "X.SampleID", d = "SampleType", plot = FALSE)
sub_set_comp_online_test <- online_fun_test[online_fun_test$Type1==online_fun_test$Type2,] # filter result to show the same comparisons 
for ( i in seq(nrow(sub_set_comp_online_test))) {
  samples <- c( unlist(sub_set_comp_online_test[ i, c("Var1", "Var2")]))
  new_ord_samp <- paste( sort(samples), collapse = "-" )
  sub_set_comp_online_test[ i, "Sample_ordered"] <- new_ord_samp
}
sub_set_comp_online_test_uniq <- sub_set_comp_online_test[!duplicated(sub_set_comp_online_test$Sample_ordered),] 

order_samples <- beta_boxplot_result_test$data$sample_pair %>% 
strsplit("-") %>% sapply(X = ., FUN = function(x) order(x) )
samples2order <- strsplit(x = beta_boxplot_result_test$data$sample_pair, split = "-")  
samples_ordered <- lapply(seq(samples2order), function(x) paste( samples2order[x][[1]][order_samples[,x]], collapse = "-") )
beta_boxplot_result_test$data$Sample_ordered <- unlist(samples_ordered)

left_join(x = beta_boxplot_result_test$data, y = sub_set_comp_online_test_uniq, by = "Sample_ordered")

The result is this:

https://ibb.co/kSg9dW8

As you can see the values match. If I sum the beta-diversity values independently for both tables, I get: 15.83836

So, before when you said that you did not get the same result when summing the values from both analyses independently, I think that it might not have worked with your data due to some specificity or you did not subset the table in the right way, in order to compare the same comparisons.

António

ADD REPLY
0
Entering edit mode

Thanks a lot sir. I will let you know once I check it again.

dpc

ADD REPLY
0
Entering edit mode

Firstly, many many thanks for your constant kind support. I don't know why, but this time both the functions (your's and tutorial's) are running fine and showing the same output (P_value: 0.015). I apologise for the time you wasted for comparing the two functions.

Again, many many thanks, dpc

ADD REPLY
0
Entering edit mode

António, thank you for your contribution. I'd like to draw your attention to the post How to add images to a Biostars post - this explains how to add images. You need the direct link to the image (the one you get by right-clicking on image then hitting copy-link-to-image). The link you're using is the web page that contains the image and that takes away from the user experience on the website.

ADD REPLY
0
Entering edit mode

Thank you RamRS!

Next time I add a image, I'll follow these instructions.

António

ADD REPLY

Login before adding your answer.

Traffic: 2010 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6