Entering edit mode
4.4 years ago
dpc
▴
250
Hi !!! I am working with profiled metagenomic taxonomy abundance data. I want to generate beta diversity (bray-curtis) boxplot from a phyloseq object where two groups (control and test) will be shown. Something like this figure. Can anybody help by sharing some code or some useful tutorial?
Hi,
Phyloseq does not have built-in functions that allow you to produce a figure like that. You need to calculate the Bray-Curtis distance with
vegan
orphyloseq
, but then, you need to extract the information from the matrix to build a box plot like that with for instance,ggplot2
.António
I have used the following code after looking at a tutorial and getting the following plot. I don't know why two plots are coming. Can you tell some modifications to get a single one?
Remove the
facet_wrap(~ Type1, scales = "free_x") +
line from your code and try again.Let me know if this solved the issue,
António
It's giving this image.
Remove also
geom_point() +
!Let me know if this solved the issue,
António
Not working. Image
Did you remove both:
facet_wrap(~ Type1, scales = "free_x") +
andgeom_point() +
?António
I just wrote the function below. You might want to test it. I did not test it extensively, so it might have errors. If you use it you need to be very careful and test it on a dataset that you know well.
It takes as input the phyloseq object, the method that you want to use to perform beta-diversity (using one of the methods that you can pass to phyloseq beta-diversity analysis) and the group variable (on the sample_data frame):
Again, you should be very careful and test well first. I'll perform some tests later and I'll come back in case I found errors. The result is a list with a data frame and the box plot (from ggplot class). So, you can compare this data frame with all pairwise comparisons belonging to each group with the table obtained from phyloseq and see if everything looks fine.
I hope this helps,
António
Hi... I could not reply yesterday as I reached highest number of posts a day (as a new member). I am seeing it works nicely with "GlobalPatterns", but not with my data set. I don't understand what have you done here:
Why is there a loop?
Hi,
Probably there are many ways of getting that information, even without a loop, but here the idea is to loop over the different levels of the group variable, e.g., let's say that you have a variable called sex, where the levels can be male or female. The loop will go through female and, then, male and for each will retrieve the list of samples belonging to male or female. So, basically, group2samp, in this example, will contain two elements male or female, in which each of them will hold a vector of samples belonging to themselves.
The only reason that I can think that is not working for you is because
group
parameter needs to be afactor
class variable. If you give acharacter
class variable will not work. So, you can check which class is by doing:class(get_variable(sample_data(your_physeq_object), your_group_variable))
.If your variable is different from
factor
, you need to convert it tofactor
class before trying again.António
Yes sir... Finally It's running. I am getting an output like this after removing "jeom jitter" . Thanks a lot sir for your kind help. I am really indebted to you.
Also, I will like to know some basics about what you have done in this script. I think you have calculated "bray" index between the controls samples; and then, between the obese samples. After that, you have compared their median by box plot. Right?
(One little request: It would be great if you upvote my questions so that I can conversate for longer here. Yeasterday reached highest limit too early, only 5 messages )
Sorry, I up voted your question now!
So, if you want to remove the points without changing the original code function, you can because you are working with a plot from
ggplot
class, by doing:You can even change the
x-
andy-axis
labels by doing:You can write what ever you want in
xlab()
andylab()
for yourx-
andy-axis
labels. You can change the whole plot, background and so on, because the plot is fromggplot
class. Since you also have the data frame that was use to create the plot, you can even create the plot from scratch.Is this plot that you got equal to the plot that you did before following the tutorial?
António
No sir. They are different. This is yours and this is from tutorial.
So, I added the
geom_boxplot(outlier.shape=NA) +
to the code above. If you run again with the updated code, those outlier points should be removed.Regarding the differences, you can not compare the 2 plots because with the tutorial you're plotting twice, the type variable is being used as
x-axis
label and asfacet_wrap
label. You should remove facet, and then you compare them.António
Thanks sir. I think the problem there is not the facet rather the distance matrix is problem. Because, they have calculated distances among control samples, among obese samples and between control and obese samples. When I check
wu.sd
, this is the result (omitted some lines for space limitation):In your case you have calculated distances among the controls and among the obese samples. Right?
If you look at line 1 -
1 ERR260268_profile ERR275252_profile 0.6813452 control obese
- and line 52 -52 ERR275252_profile ERR260268_profile 0.6813452 obese control
, you see that they are the same. So, the thing is, when you do a beta-diversity analysis you get a distance matrix table, that is square, all the samples are compared against all the samples. This means that your upper and bottom part of the matrix are exactly the same, i.e., comparing, in terms of beta-diversity, sample A against B, or sample B against A, the value of beta-diversity is the same. What you have in your table is exactly this, all the comparisons. What you have in my data frame is just the upper part of the matrix, i.e., sample A against B, but you don't have sample B against A, because is the same comparison.Where did you find this tutorial code?
António
From here I've collected it.
or I should use this one?
Yes, I've also measured the distance among all possible combinations. Actually, the code that I've used is the same as it appears in that online script/function:
beta_div_dist <- distance(physeq = physeq, method = method)
(my) versuswu = phyloseq::distance(p, m)
(online).The way that I retrieve the values and plot differs. The difference is that I retrieved only the unique values for each pairwise comparison, for each group (sorry I forgot to mention this before). This means that there were many comparisons made for which I did not recover the pairwise comparisons. So, for instance in your case, you are only interested in comparisons made within control samples or within obese samples, and not between control and obese samples, right? So, in the case of the online function, you'll get all: within control samples, within obese samples, and between control and obese samples pairwise comparisons. In addition, for the online function will get twice the exact same no., as I explained you before, from the comparison between sample A and B, and between sample B and A. That's why you've such a big difference between the result of my function and the online function. So, assuming that n is the number of samples that you've, in order to get the no. of values using the online function you need to apply the formula:
(n * n) - n
(i.e., all the samples against all, minus the no. of samples that correspond to the same sample being compare against itself: sample A against sample A is equal to 0 beta-diversity). To calculate the no. of pairwise comparisons retrieved with my function, you need to do the following, considering g as the number of samples for one of the groups being compared:((g * g) - g)/2
(you need to sum this result to the next group). In your case it should be like:(((gc * gc) - gc)/2) + (((go * go) - go)/2) = 699
(where gc is the no. of samples for the control group, and go the no. of samples for the obese group).If you run my and the online functions (filtering the result of the online function to show only the comparisons retrieved with my function), you'll see that the result is the same:
So, The plot made by the online function gives you all possible sample group pairwise comparisons. The red boxplots are for within sample group comparisons. So, If you compare the boxplots obtained with my function with the red boxplots obtained with the online function, they should be the same.
I'll answer below to your second part.
António
Yes, I think your first option is Ok, regarding the Wilcoxon test:
I tested with the example data and it works fine:
I just subset to
"Feces" and "Ocean"
because otherwise I would not be able to do Wilcoxon with multiple sample groups.You might consider using ggpubr to do a plot similar to the one that you want:
Instead of
fake_data
use the data obtained with my function. The result of Wilcoxon should be the same obtained before with the functionpairwise.wilcox.test()
.Let me know if this worked,
António
Okay. I understand. But here I have 13 control and 39 obese samples. So, the calculation supposed to be: 78+741 = 819. But, it is 699 here. Doesn't match.
As far as the wilcoxon test is concerned, I am getting p value
8.7e-06
forp_values_types <- pairwise.wilcox.test(beta_boxplot_result$data$beta_div_value, beta_boxplot_result$data$group).
While pvalue0.67
forp_values_types <- pairwise.wilcox.test(beta_boxplot_result$data$beta_div_value, sample_data(physeq)$type)
So, the result from first option seems very unusual. Not? However, here in this tutorial they have used the group data from
Regarding the lack of match about
78+741 = 819
, I need to think about it.The p-values are different because you can not use
sample_data(physeq)$type
, the order does not match the order atbeta_boxplot_result$data$beta_div_value
. Probably the first value is ok, because you have a lot of replicates and there seems to be a difference between control and obese.You might open a new question, regarding the issue o p-value, since as @RamRS pointed out, this is too long.
António
Ok sir. Thanks... if you resolve the Issue regarding the sum difference, please write here. And, I will post a different question for the p value.... thanks....
I did the counts for the online version assuming that you've 52 samples (13 + 39), and based on
(52 * 52) - 52
you get 2652 and not 2852. So, are you sure about the no. of lines. To know the exact no. you need to donrow(data_frame)
, wheredata_frame
is your data table. This is quite strange and I can't understand why is happening.António
Sorry. It's 2652. That's fine. But, did you think about
nrows
coming from your script which was 699 in spite of 819? Also, I will draw your attention to this query related to this question only. Sorry, Sir @RamRS. Please, allow us for last a few converasations.Please do not ask for upvotes or hand out votes for every comment. That is bad etiquette. There is a reason these limits exist - this chain is an indication that either your premise or the level of effort you put in the initial question is not good enough. If there were sufficient detail in your question, you'd not need to have this prolonged back-and-forth.