Dear all,
I have a two-column data frame (its head is shown below), and both columns contain strings. The first column contains the location information of genes, which is either "5p" or "3p". The second column indicates whether a "Frameshift" or "No Frameshift" happens at that location.
head(df)
Location  type
3p        No FrameShift
5p        FrameShift
3p        No FrameShift
5p        No FrameShift
3p        No FrameShift
3p        FrameShift
In total, I have 80 rows. I created a barplot where the x-axis shows "Frameshift" or "No Frameshift" and the height of the bars shows the frequency of 5p and 3p in each category of the x-axis. I want to perform a statistical test to see whether the frequency difference between 5p and 3p within the "Frameshift" category, and within the "No Frameshift" category, is significant.
I came up with the chi-square test below. Do you think it is suitable for my data?
library(dplyr)
library(stats)
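# Cross-tabulate Location (rows: 3p, 5p) against type (columns sorted alphabetically,
# so the Frameshift column comes first and the No Frameshift column second)
contingency_table <- table(df$Location, df$type)
# Chi-square test for each comparison. Passing a single column (a vector of counts)
# makes chisq.test() run a goodness-of-fit test against equal expected frequencies:
# 5p freq vs 3p freq in Frameshift, and 5p freq vs 3p freq in No Frameshift
chi2_result_1 <- chisq.test(contingency_table[1:2, 1])  # Frameshift column
chi2_result_2 <- chisq.test(contingency_table[1:2, 2])  # No Frameshift column
# p-values
p_value_comparison_1 <- chi2_result_1$p.value
p_value_comparison_2 <- chi2_result_2$p.value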
Thank you in advance.
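For readers following along, here is a minimal sketch of how the table in the question is laid out. It uses only the six head(df) rows shown above as toy data, not the full 80 rows, and the names fs_counts and nofs_counts are hypothetical. It shows that table() sorts the labels, so column 1 is the frameshift column, and that selecting columns by name gives the same counts without relying on position.

```r
# Toy data copied from the six head(df) rows in the post (not the full 80 rows)
df <- data.frame(
  Location = c("3p", "5p", "3p", "5p", "3p", "3p"),
  type = c("No FrameShift", "FrameShift", "No FrameShift",
           "No FrameShift", "No FrameShift", "FrameShift")
)

# Rows are Location (3p, 5p); columns are type, sorted alphabetically
contingency_table <- table(df$Location, df$type)
print(contingency_table)

# Selecting columns by name (hypothetical variable names) avoids relying on order
fs_counts   <- contingency_table[, "FrameShift"]
nofs_counts <- contingency_table[, "No FrameShift"]

# The positional selection used in the question picks out the same counts
stopifnot(all(fs_counts == contingency_table[1:2, 1]))
stopifnot(all(nofs_counts == contingency_table[1:2, 2]))
```

With only six rows the counts are far too small for a chi-square test, so this sketch stops at the table itself.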
I do not see any problem in your comparisons.
A chi-square goodness-of-fit test can be used even for a single variable, comparing the observed and expected frequencies across the categories of that variable.
Thank you for your comment. My contingency table looks like this:
I aimed to calculate the p-value for the difference between the 5p and 3p categories within the Frameshift section (meaning 34 vs 16) and within the No Frameshift section (meaning 18 vs 12). Based on your comment, I think my approach is correct, right?
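As a rough sketch of the two comparisons using the counts quoted above: the post does not say which count in each pair is 5p and which is 3p, so the labels below are an assumption, and all variable names are hypothetical. The goodness-of-fit results do not depend on that labelling. The last step is a different question that the thread does not settle: a test of independence on the full 2x2 table asks whether the 5p/3p split differs between the two categories, rather than whether it departs from 50:50 within each one.

```r
# Counts quoted in the post; which count is 5p and which is 3p is an assumption
frameshift    <- c(`5p` = 34, `3p` = 16)
no_frameshift <- c(`5p` = 18, `3p` = 12)

# Goodness-of-fit within each category: by default chisq.test() compares a
# vector of counts against equal expected frequencies (here 50:50)
gof_fs   <- chisq.test(frameshift)
gof_nofs <- chisq.test(no_frameshift)

gof_fs$statistic    # X-squared = 6.48 on 1 df
gof_nofs$statistic  # X-squared = 1.2 on 1 df
gof_fs$p.value
gof_nofs$p.value

# Test of independence on the full 2x2 table (a different hypothesis):
# does the 5p/3p split differ between Frameshift and No Frameshift?
tab <- cbind(Frameshift = frameshift, `No Frameshift` = no_frameshift)
indep <- chisq.test(tab)  # Yates' continuity correction is applied to 2x2 tables
indep$p.value
```

Because two separate goodness-of-fit tests are run, a multiple-testing adjustment such as p.adjust(c(gof_fs$p.value, gof_nofs$p.value), method = "bonferroni") may be worth considering.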