Dear Biostars,
I have two populations, POP1 and POP2, each with 5 samples. Each sample has a number of variants, given in the POP1 and POP2 columns of the table below. I want to calculate a p-value for each sample.
Sample No.   Variants in POP1   Variants in POP2   P-value (chi-square)
1            181                143                ?
2            122                147                ?
3            148                118                ?
4            129                92                 ?
5            137                101                ?
Please suggest some methods or R-packages. Thanks!
Thanks for answering, Vijay; I will try it. Maybe you can explain, as I am pretty confused by the chi-square test: it requires an expected value of the outcome, but in this case I have only observed values. Suppose for sample 1 I want to compare the variants in POP1 and POP2; how can I calculate a p-value to find out whether the difference is significant or not? Hope you can help me. Thanks!
You're welcome! First things first: let's discuss the objective of this statistical test in your case. Let me know what your "null" and "alternative" hypotheses are, i.e. what is it that you want to prove or disprove through this test?
From this test, I want to show that POP1 has samples with more variants than POP2. It seems so from the table, but I want to get a p-value for each sample to be confident.
If I understand correctly, you want to check whether a given sample (e.g. sample 1) differs between POP1 and POP2 in terms of the number of variants. With only one category, the p-value estimate has no meaning. How can you say whether 181 is different than 143? Similarly, 122 vs 147. Blindly using R code and getting a p-value without understanding the underlying statistics is dangerous.
What you need is a minimum of 2 nominal categories (e.g. population type and variant type, i.e. variant vs non-variant). In your case you have only one (population type).
Solution
## Let's say total positions: 1000
## Variants: 181 (POP1) & 143 (POP2)
## Non-variants: 1000 - 181 = 819 (POP1) & 1000 - 143 = 857 (POP2)
So now you have a 2-way contingency table (population vs variant type).
Fisher's exact test or the chi-square test will now work well.
The same has to be applied to each sample.
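To make the steps above concrete, here is a minimal sketch in Python using only the standard library. It builds the 2x2 table (population vs variant type) for each sample and runs a Pearson chi-square test without continuity correction. The expected counts that the chi-square test needs are not supplied by you; they are derived from the row and column totals of the table. The total of 1000 positions is the same placeholder assumed above, not a real value, and the names `TOTAL_POSITIONS` and `chi_square_2x2` are made up for this illustration. In R, calling `chisq.test` or `fisher.test` on the same 2x2 matrix should serve the same purpose.

```python
import math

# Variant counts per sample, copied from the table in the question.
pop1 = [181, 122, 148, 129, 137]
pop2 = [143, 147, 118, 92, 101]

# ASSUMPTION: 1000 positions were examined per sample in each population.
# This is only the placeholder used in the answer; replace it with the real number.
TOTAL_POSITIONS = 1000


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence comes from the margins.
            expected = rows[i] * cols[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


results = []
for sample, (v1, v2) in enumerate(zip(pop1, pop2), start=1):
    # Rows: POP1, POP2. Columns: variant, non-variant.
    stat, p = chi_square_2x2(v1, TOTAL_POSITIONS - v1, v2, TOTAL_POSITIONS - v2)
    results.append((sample, stat, p))
    print(f"sample {sample}: chi-square = {stat:.3f}, p = {p:.4f}")
```

Note that the p-values depend directly on the assumed total number of positions, so they only mean something once the real total is plugged in.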
+1 for this. Precise answer! Can't agree more on that. IS IT VALID is more important than HOW IT IS DONE!!
That's the reason I asked "what's the null/alternate hypothesis". On top of that, the p-value is seriously misunderstood in such cases (I am one of the victims of the p-value ;) ). I strongly recommend going through this article; as motivation to read it, here is a statement from the same article:
"P values are commonly used to test (and dismiss) a ‘null hypothesis’, which generally states that there is no difference between two groups, or that there is no correlation between a pair of characteristics. The smaller the P value, the less likely an observed set of values would occur by chance — assuming that the null hypothesis is true. A P value of 0.05 or less is generally taken to mean that a finding is statistically significant and warrants publication. But that is not necessarily true"
I agree. What if I take one population as a control and the other as a case? Then is it the right way to calculate p-values, as we do in RNA-seq expression analysis?
This test only works for categorical data (data in categories), such as gender {Men, Women} or color {Red, Yellow, Green, Blue}, but not for numerical data such as height or weight. As pointed out by Persistent LABS, "How can you say whether 181 is different than 143", and similarly 122 vs 147; i.e. you cannot apply this concept to the raw counts in your case.