Question

How to compare expression in gene list

8.4 years ago

Lila M ★ 1.3k

Hi everybody, I'm trying to analyze a data set of genes. I have three gene lists: P1 (n = 312), P2 (n = 7444) and P3 (n = 553). I compared each list with an additional one (Z, n = 9488) to find the number of genes they have in common, and built the following table:

#my_data
   overlap_Z non_overlap_Z
P1        51           261
P2       121          6232
P3        89           464

I also created another table comparing the overlaps in Z, as follows:

#my_data_Z
                 P1   P2     P3
overlap_Z       164   3030   341
non_overlap_Z   9324  6458   9174

To find out whether there are differences between the lists, I ran a chi-square test in R: chisq.test(my_data, correct = FALSE) and chisq.test(my_data_Z, correct = FALSE). In both cases the p-value is < 2.2e-16. I take these results to mean that the lists (P1, P2 and P3) are related to Z. Now I would like to know what this correlation looks like and which list is more correlated with Z. Which analysis would you recommend?
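A minimal sketch for reproducing the first test, using the counts exactly as posted above (the P2 row sums to 6353 rather than 7444, so the numbers are taken as given). `my_data` is the name from the post; `res` and `props` are names introduced here for illustration.

```r
# Rebuild the first table from the counts given above
my_data <- matrix(
  c( 51,  261,
    121, 6232,
     89,  464),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("P1", "P2", "P3"),
                  c("overlap_Z", "non_overlap_Z"))
)

# Chi-square test on the 3 x 2 table.
# correct = FALSE has no effect here: the continuity correction
# is only applied to 2 x 2 tables.
res <- chisq.test(my_data, correct = FALSE)
res$p.value

# Fraction of each list that overlaps with Z
props <- my_data[, "overlap_Z"] / rowSums(my_data)
round(props, 3)
```

Looking at `props` next to the p-value shows which rows drive the result, which the test alone does not report.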

Thanks!

expresion statistic R • 2.3k views

score 1 · Answer 1 · 2017-03-24

8.4 years ago

Annika Forsingdal ▴ 250

Since you write "expression in gene list", I'm assuming that you are looking at gene expression data. If that's the case, I would correlate the fold changes or log2(fold changes) of the genes in each of your gene lists (P1-3) with their fold changes in Z.
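As a rough sketch of what that could look like, assuming you had one log2 fold change per gene in a list and the matching value in Z. The data below are simulated and the names `lfc_list`, `lfc_Z` and `ct` are hypothetical, so only the calls themselves carry over to real data.

```r
set.seed(1)  # reproducible toy data

# Hypothetical log2 fold changes for 50 genes shared by one list and Z
lfc_list <- rnorm(50)
lfc_Z    <- 0.8 * lfc_list + rnorm(50, sd = 0.5)

# Pearson correlation with a test of H0: correlation = 0
ct <- cor.test(lfc_list, lfc_Z, method = "pearson")
ct$estimate  # correlation coefficient
ct$p.value

# Rank-based alternative, less sensitive to outliers
cor(lfc_list, lfc_Z, method = "spearman")
```

Repeating this for P1, P2 and P3 would give one coefficient per list, which can then be compared.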

By "expression in gene list" I mean the number of overlapping genes. If the genes in one of the lists (P1, P2, P3) overlap more with Z, I could say that this list (for example P1) is more expressed in Z. I don't have expression data, only numbers of genes (frequencies).

Reply • 8.4 years ago by Lila M ★ 1.3k

score 1 · Answer 2 · 2017-03-24

1) Using a chi-square test here is absolutely unwarranted. A chi-square test of independence checks whether two categorical variables are independent. The key words are 'category' and 'independence'. 'Category' means that there is one large population, and that each variable splits this population into categories. See this example from http://stattrek.com/chi-square-test/independence.aspx?Tutorial=AP

In an election survey, voters might be classified by gender (male or female) and voting preference (Democrat, Republican, or Independent). We could use a chi-square test for independence to determine whether gender is related to voting preference.

As you can see, the total population can be divided into categories in two ways (M/F and D/R/I). What you are interested in knowing is whether those two classifications are independent or associated; a small p-value (P < 0.05) is evidence against independence. Check the above URL again to see how the hypothesis testing (H0 and Ha) is set up for the chi-square test.
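To make the survey example concrete, here is a small sketch in R. The counts are hypothetical and only serve to show the shape of the input: one population, cross-classified by two categorical variables.

```r
# Hypothetical survey counts (for illustration only)
survey <- matrix(
  c(200, 150, 50,
    250, 300, 50),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("Male", "Female"),
                  preference = c("Democrat", "Republican", "Independent"))
)

# H0: gender and voting preference are independent
# Ha: gender and voting preference are associated
test <- chisq.test(survey)
test$statistic  # chi-square statistic
test$parameter  # degrees of freedom: (2 - 1) * (3 - 1) = 2
test$p.value    # a small p-value is evidence against H0
test$expected   # counts expected if H0 were true
```

Every respondent appears in exactly one cell of this table, which is the partition of the population that the test assumes.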

Now coming to your case: your P1, P2 and P3 are separate sets, and they do not form a partition of a sample space. What I mean is that you don't have one big population from which P1, P2 and P3 are drawn as three categories. In fact, the way you have posed the problem, I would assume that P1, P2 and P3 are independent. There is no point in checking their independence with a chi-square test (even if it were the right test!).

2)

Now I would like to know what this correlation looks like and which list is more correlated with Z. Which analysis would you recommend?

I am unable to understand your need. First you wanted to check for independence and then for a correlation, and these are mutually exclusive things: if the variables are independent, there is no correlation to measure. Also note that correlation means that you have many data points of one kind (say x) and many of another kind (say y), and you would like to know whether there is a pattern in the data, that is, whether one of them can be predicted from the other. Think of it like this: if you plot all the x-y pairs, do you see a visual pattern on the graph (for example, y increasing or decreasing as x increases)?

Now coming to your data: you have just 2 numbers (or 6, let's say, but P1, P2 and P3 are already independent), namely the numbers of genes shared with Z, and they measure different things. These 2 numbers form a single point on the graph I described above, and no correlation can be drawn from a single point.

OK, let me guess: you are trying to attach a p-value to the proportion of genes shared with Z. I'm afraid that is not possible. P-values are computed for distributions (that is, for a large number of observations), not for single numbers. HTH! I'm running out of time :)

score 1 · Answer 3 · 2017-03-24

This question can be answered with frequency-based statistics such as conditional probability and joint probability, somewhat like what Venn diagrams show. There are ways to approach this, but explaining how would take a lot of space, about as much as the previous answer on basic statistics. Basically, I think you want to test for effect modification, estimate the common odds ratio and then test that ratio. Look up the Mantel-Haenszel method and see whether it applies to you. From there you can bootstrap and run Fisher's exact test, but this may go beyond your question.
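As a starting point, here is a minimal sketch of the odds-ratio idea using Fisher's exact test on the counts from the question, two lists at a time. It treats the lists as independent samples, which is an assumption, and it does not use `mantelhaen.test`, because that needs stratified 2 x 2 x K counts that the question does not provide. The helper `compare_lists` and the object `results` are names made up for this example.

```r
# Counts as posted in the question
my_data <- matrix(
  c( 51,  261,
    121, 6232,
     89,  464),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("P1", "P2", "P3"),
                  c("overlap_Z", "non_overlap_Z"))
)

# Fisher's exact test on the 2 x 2 table formed by two of the lists;
# returns the estimated odds ratio and the p-value
compare_lists <- function(tab, a, b) {
  ft <- fisher.test(tab[c(a, b), ])
  c(odds_ratio = unname(ft$estimate), p_value = ft$p.value)
}

# All pairs of lists: P1/P2, P1/P3, P2/P3
pairs <- combn(rownames(my_data), 2)
results <- t(apply(pairs, 2, function(p) compare_lists(my_data, p[1], p[2])))
rownames(results) <- apply(pairs, 2, paste, collapse = "_vs_")

# Holm adjustment for the three comparisons
results <- cbind(results,
                 p_adj = p.adjust(results[, "p_value"], method = "holm"))
results
```

An odds ratio above 1 in a row such as `P1_vs_P2` means the first list of the pair has the higher odds of overlapping with Z.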