Gene list overlap - Null distribution
8.0 years ago
Wario • 0

Hello everyone,

This is probably a stupid question but I need help.

I want to calculate the null distribution for the gene overlap between 2 lists.

The first list comes from ChIP-seq data and the second from RNA-seq. The background genome is 20,000 genes. I have this data for 50 samples.

The first sample has a ChIP-seq list with 751 genes and an RNA-seq list with 590 genes.

I tried it with R but the result looks odd.

ts = replicate(5000,t.test(rnorm(751),rnorm(590))$statistic) 
range(ts)

pts = seq(-3.5, 3.5,length=100)
plot(pts,dt(pts,df=25),col='red',type='l') 
lines(density(ts))
RNA-Seq ChIP-Seq • 1.7k views

I formatted your code (using the 101010 button) for readability, but perhaps you should check I did it correctly.


Thanks, didn't know about that.

8.0 years ago

As was mentioned by @Lars Juhl Jensen, the standard null distribution for two gene lists is the hypergeometric distribution. However, this assumes that all genes are independent and equally likely to show up. There are several reasons why this might not be the case:

  • Longer genes are more likely to be called differentially expressed, because they collect more reads and so you have more power to detect a change.
  • You don't say how your ChIP-seq gene list is derived. If it is by overlapping peaks with the gene region, then again longer genes are more likely to overlap. If you are assigning peaks to genes based on a promoter region or gene territory, are all promoters/territories the same length?
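To make the two points above concrete, here is a rough simulation sketch in base R. It assumes gene lengths follow a log-normal distribution (`gene_length` is made up for illustration; real lengths would come from your annotation) and that the chance of a gene landing on either list is proportional to its length. The list sizes and background are the ones from the question.

```r
set.seed(1)

n_background <- 20000  # background size from the question
n_chip <- 751          # ChIP-seq list size from the question
n_rna <- 590           # RNA-seq list size from the question

# ASSUMPTION: hypothetical gene lengths, not real annotation data.
gene_length <- rlnorm(n_background, meanlog = 8, sdlog = 1)

# Draw two random gene lists and count how many genes they share.
# prob = NULL gives every gene the same chance of being picked.
simulate_overlap <- function(prob = NULL) {
  chip <- sample.int(n_background, n_chip, prob = prob)
  rna <- sample.int(n_background, n_rna, prob = prob)
  length(intersect(chip, rna))
}

n_sim <- 500
uniform_null <- replicate(n_sim, simulate_overlap())
length_biased_null <- replicate(n_sim, simulate_overlap(prob = gene_length))

mean(uniform_null)        # close to 751 * 590 / 20000
mean(length_biased_null)  # noticeably larger under this assumed bias
```

Under these assumptions the length-biased null sits well above the uniform one, so an overlap that looks significant against the uniform null may be unremarkable once the bias is accounted for.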

There are a couple of ways around this. First, the package goseq is designed to manage gene length bias in differential expression analysis. While you are not doing GO analysis, the problem is conceptually equivalent.

Alternatively, the program GAT (Genomic Association Tester) tests whether a set of intervals overlaps with another set of intervals more often than you would expect by chance, while accounting for length bias, GC content bias, etc.

8.0 years ago

You could model this with a simple hypergeometric distribution, if you make the assumption that all genes are equally likely to appear on the two lists.
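A minimal sketch of that in base R, using the list sizes from the question. The question does not give the observed overlap, so `observed_overlap` below is a made-up placeholder that you would replace with your own count.

```r
n_background <- 20000   # background size from the question
n_chip <- 751           # ChIP-seq list size from the question
n_rna <- 590            # RNA-seq list size from the question
observed_overlap <- 40  # ASSUMPTION: hypothetical value, not from the question

# Null distribution of the overlap: treat the ChIP-seq genes as the "marked"
# genes and the RNA-seq list as a random draw of n_rna genes from the background.
k <- 0:min(n_chip, n_rna)
null_pmf <- dhyper(k, m = n_chip, n = n_background - n_chip, k = n_rna)

# Expected overlap under the null.
expected_overlap <- n_rna * n_chip / n_background

# One-sided p-value: probability of an overlap at least as large as observed.
p_value <- phyper(observed_overlap - 1, m = n_chip, n = n_background - n_chip,
                  k = n_rna, lower.tail = FALSE)

plot(k[1:81], null_pmf[1:81], type = "h",
     xlab = "Overlap", ylab = "Probability under the null")
abline(v = observed_overlap, col = "red")
```

For the 50 samples you would repeat this per sample with that sample's list sizes and overlap, and then correct the p-values for multiple testing, for example with `p.adjust`.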
