Question

How to calculate p-value to find significant difference between two populations based on the number of variants?

0

7.4 years ago

Hemant Gupta ▴ 10

Dear Biostars,

I have two populations, POP1 and POP2, each with 5 samples. The number of variants in each sample is given in the POP1 and POP2 columns of the table below. I want to calculate a p-value for each sample.

                      No. of Variants
Sample No.            POP1     POP2     P-value (chi-square)
1                     181      143         ?
2                     122      147         ?
3                     148      118         ?
4                     129      92          ?
5                     137      101         ?

Please suggest some methods or R-packages. Thanks!

SNP R variants chi square test significance test • 5.6k views

ADD COMMENT • link updated 7.4 years ago by Rajesh Detroja ▴ 200 • written 7.4 years ago by Hemant Gupta ▴ 10

score 1 · Answer 1 · 2017-06-27

Hi Hemant,

I think the following code will help you calculate a p-value with the chi-square test for the variant counts of each sample:

count_table.txt

Sr.No   POP1    POP2
1   181 143
2   122 147
3   148 118
4   129 92
5   137 101

R Code

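# Read the tab-separated count table (columns: Sr.No, POP1, POP2)
d <- read.table(file = "count_table.txt", header = TRUE, sep = "\t")
# Run one chi-square test per row on the pair of counts (POP1, POP2)
for (row in seq_len(nrow(d))) {
    pvalue <- chisq.test(c(d[row, 2], d[row, 3]))$p.value
    # Print sample number, both counts, and the p-value
    mydata <- c(d[row, 1], d[row, 2], d[row, 3], pvalue)
    print(mydata)
}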

Output

Sr.No   POP1    POP2    P-Value
1   181.00  143.00  0.03476276
2   122.00  147.00  0.1274396
3   148.00  118.00  0.06585373
4   129.00  92.00   0.01281428
5   137.00  101.00  0.01962017
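A table like the output above could also be built by computing all p-values at once and attaching them as a column. This is only a sketch, not part of the original answer: the inline data frame stands in for count_table.txt, and the column name P.Value is an assumption.

```r
# Counts from the question, entered inline so the example is self-contained
d <- data.frame(
    Sr.No = 1:5,
    POP1  = c(181, 122, 148, 129, 137),
    POP2  = c(143, 147, 118, 92, 101)
)

# One chi-square test per row on the pair of counts (POP1, POP2)
d$P.Value <- sapply(seq_len(nrow(d)), function(i) {
    chisq.test(c(d$POP1[i], d$POP2[i]))$p.value
})

print(d)

# Optionally save the table (the output file name is hypothetical):
# write.table(d, "count_table_pvalues.txt", sep = "\t", quote = FALSE, row.names = FALSE)
```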

You can adapt the code to change the output format.

Inspired by: https://stackoverflow.com/questions/39647580/how-to-loop-chi-sq-test-in-r?answertab=votes#tab-top

score 0 · Answer 2 · 2017-06-27

7.4 years ago

lakhujanivijay 5.9k

Base R can perform a chi-square test out of the box; no specialized packages are required. An example can be found here

Launch R and do:

?chisq.test
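As a rough sketch of what chisq.test does when it is given a plain vector of counts: it runs a goodness-of-fit test and, unless the p argument is supplied, uses equal proportions as the expected values. The counts below are sample 1 from the question; the manual calculation is for illustration only.

```r
# Observed variant counts for sample 1 (from the question)
observed <- c(POP1 = 181, POP2 = 143)

# With a vector input, chisq.test assumes equal expected proportions
res <- chisq.test(observed)
print(res$expected)   # the total count split evenly across the two populations

# The same test computed by hand
expected  <- rep(sum(observed) / length(observed), length(observed))
statistic <- sum((observed - expected)^2 / expected)
p_manual  <- pchisq(statistic, df = length(observed) - 1, lower.tail = FALSE)

# Both comparisons should report TRUE
print(all.equal(unname(res$statistic), statistic))
print(all.equal(unname(res$p.value), p_manual))
```

So the "expected value" here is not something you supply: it is the null hypothesis that both populations contribute the same count, which may or may not be the hypothesis you actually care about.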

ADD COMMENT • link 7.4 years ago by lakhujanivijay 5.9k

0

Thanks for answering, Vijay. I will try it. Could you explain a bit more? I am confused by the chi-square test because it requires expected values for the outcome, and in this case I have only observed values. For sample 1, for example, if I want to compare the variants in POP1 and POP2, how can I calculate a p-value to find out whether the difference is significant? Hope you can help me. Thanks!

ADD REPLY • link 7.4 years ago by Hemant Gupta ▴ 10

2

You're welcome! First things first: let's discuss the objective of this statistical test in your case. What are your "null" and "alternative" hypotheses, i.e. what do you want to prove or disprove with this test?

ADD REPLY • link 7.4 years ago by lakhujanivijay 5.9k

0

With this test, I want to show that the samples in POP1 have more variants than those in POP2. It looks that way in the table, but I want a p-value for each sample to be confident.

ADD REPLY • link 7.4 years ago by Hemant Gupta ▴ 10

3

If I understand correctly, you want to check whether a given sample (e.g. sample 1) differs between POP1 and POP2 in its number of variants. With only one category, the p-value has no meaning: how can you say whether 181 is different from 143, or 122 from 147? Blindly running R code and getting a p-value without understanding the underlying statistics is dangerous.

What you need is a minimum of two nominal categories (e.g. population type and variant type). In your case you have only one (population type).

Solution

## Let's say total positions: 1000
## Variants: 181 (POP1) & 143 (POP2)
## Non-variants: 1000-181 (POP1) & 1000-143 (POP2)

So now you have a two-way contingency table (population vs. variant type).

Fisher's exact test or the chi-square test will now work well.

Same has to be applied for each sample.
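The contingency-table approach described above might look like this in R. Treat it as a sketch under stated assumptions: the total of 1000 positions is the illustrative figure from this reply, not a real value, and the helper name test_sample is hypothetical. In practice the total would be the number of positions actually assayed in each sample.

```r
# Assumed placeholder: total number of positions examined per sample
total_positions <- 1000

# Variant counts from the question
pop1 <- c(181, 122, 148, 129, 137)
pop2 <- c(143, 147, 118, 92, 101)

# Build the 2x2 table (population vs. variant status) for one sample
# and return the p-values from both tests
test_sample <- function(v1, v2, total = total_positions) {
    tab <- matrix(
        c(v1, total - v1,
          v2, total - v2),
        nrow = 2, byrow = TRUE,
        dimnames = list(Population = c("POP1", "POP2"),
                        Status = c("Variant", "NonVariant"))
    )
    c(fisher = fisher.test(tab)$p.value,
      chisq  = chisq.test(tab)$p.value)
}

# Apply the test to each sample; rows are samples, columns are the two tests
results <- t(mapply(test_sample, pop1, pop2))
print(cbind(Sample = seq_along(pop1), POP1 = pop1, POP2 = pop2, results))
```

The p-values depend on the total used, so the placeholder has to be replaced with the real number of positions before drawing any conclusion.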

ADD REPLY • link 7.4 years ago by Persistent LABS ▴ 750

0

+1 for this. Precise answer! I couldn't agree more. IS IT VALID is more important than HOW IT IS DONE!!

That's the reason I asked "what's the null/alternative hypothesis". On top of that, the p-value is seriously misunderstood in such cases (I am one of the victims of the p-value ;) ). I strongly recommend going through this article; one motivation to read it is the statement below, from the same article:

"P values are commonly used to test (and dismiss) a ‘null hypothesis’, which generally states that there is no difference between two groups, or that there is no correlation between a pair of characteristics. The smaller the P value, the less likely an observed set of values would occur by chance — assuming that the null hypothesis is true. A P value of 0.05 or less is generally taken to mean that a finding is statistically significant and warrants publication. But that is not necessarily true"

ADD REPLY • link 7.4 years ago by lakhujanivijay 5.9k

0

I agree. What if I take one population as the control and the other as the case? Would that be a valid way to calculate p-values, as we do in RNA-Seq expression analysis?

ADD REPLY • link 7.4 years ago by Hemant Gupta ▴ 10

0

This test only works for categorical data (data in categories), such as gender {Men, Women} or color {Red, Yellow, Green, Blue}, and not for numerical data such as height or weight. As pointed out by Persistent LABS, "How can you say whether 181 is different than 143. similarly 122 vs 147"; i.e. you cannot apply this concept in your case.

ADD REPLY • link 7.4 years ago by lakhujanivijay 5.9k