Question

How to compute enrichment of one feature in the other across the genome (log fold, and pvalue)

0

Entering edit mode

6.2 years ago

User 7754 ▴ 290

Hi,

I have two bed files, and I would like to know if bed file A is enriched in bed file B compared to what is expected, given the genome sizes. I found how to get a p-value for this using Fisher, at least an idea of the baseline: bedtools fisher -a testA.bed -b testB.bed -g chrom_hg19.sizes,

Now I had three questions: Is this the usual way to do enrichment of features genome-wide? My feature in bed file A does not include the sex chromosomes, while bed file B and chrom_hg19.sizes do. Would this influence the estimates? Is there a way to get a log fold change?

Thanks very much!

genome enrichment • 3.9k views

ADD COMMENT • link 6.2 years ago by User 7754 ▴ 290

0

Entering edit mode

A log fold change compared to what? Sure, the sex chromosomes will impact the estimates, but you can remove them if you feel it's appropriate. I don't see anything wrong with this approach, it's more or less the same as any GO/Pathway enrichment analysis.

ADD REPLY • link 6.2 years ago by jared.andrews07 ★ 19k

1

Entering edit mode

Hi, thank you Jared. 1. Some bed files A have the sex chromosomes, some bed files A do not. So, in cases that do not include the sex chromosomes, I should compute enrichment of bed files A in bed files B after removing the sex files? Then the enrichments of bed files A in bed files B will be all comparable across multiple bed files? 2. What tool is normally used for enrichments of features across the genome (bedtools fisher)? 3. Log fold change is over expected, given the size of the genome. How to compute that? bedtools fisher does not give that, only the p-value...

ADD REPLY • link 6.2 years ago by User 7754 ▴ 290

score 2 · Accepted Answer · 2019-05-07

2

Entering edit mode

6.2 years ago

Friederike 9.0k

Is this the usual way to do enrichment of features genome-wide?

This is a generally acceptable way to calculate whether the number of features overlapping between A and B is greater than expected by chance.

My feature in bed file A does not include the sex chromosomes, while bed file B and chrom_hg19.sizes do. Would this influence the estimates?

The question you should be asking yourself is why are the sex chromosomes missing from one file but not the other? If both files represent peak regions from two different transcription factors and one just happens to not bind on any sex chromosome, while the other does, that would be fine.

Is there a way to get a log fold change?

Fold change of what? The Fisher test is not taking any scores that may be stored in either BED file into account. It simply intersects the regions of both files and determines how many regions are shared. This number is compared to the expected overlap of regions based on the size of the genome (if you have a tiny genome and loads of regions in both BED files it's probably not very surprising to see tons of overlaps. If you have a large genome, few regions in both BED files and they still manage to overlap quite a bit, that's usually more of an indication that there's something interesting going on.

ADD COMMENT • link 6.2 years ago by Friederike 9.0k

0

Entering edit mode

Thank you Friederike for your replies! I see the difference now, it is the number of features overlapping, but I would still think that if there is a p-value there must be a way to also say, e.g., it is 3 times more enriched than expected by chance (after taking account of the genome length)? The reason the sex chromosomes are missing in some cases is because they have not been covered (were removed from analysis) in the original data.

ADD REPLY • link 6.2 years ago by User 7754 ▴ 290

0

Entering edit mode

If one sample was blind to the sex chromosome, you should eliminate them from all the files.

it is 3 times more enriched than expected by chance

Sounds to me like you'd be interested in the odds ratio; I don't think that BEDtools offers that test, so you would have to do that by yourself, e.g. with R.

ADD REPLY • link 6.2 years ago by Friederike 9.0k

0

Entering edit mode

it's only a few datasets that don't have sex chromosome, not one sample, and since I am computing the enrichment for each dataset I don't see why I would need to remove them. Is there any reason I am not seeing? I found this other question that could help actually for enrichment (maybe it is a duplicate then?): Сalculating fold-enrichment of ChIP-seq peaks intersecting with promoters (vs. genome average)

ADD REPLY • link 6.2 years ago by User 7754 ▴ 290

0

Entering edit mode

If you've treated those few samples special before (by removing the sex chromosomes from the get-go), you should keep it that way. The universe of possible overlaps is smaller for those and that should be taken into account.

If you're interested in calculating differences between read coverage for specific regions (yet another measure!), the link you posted may give you some useful pointers, yes.

So far, you've suggested/asked about:

Do I have more regions from file A overlapping with the regions from file B than expected by chance? --> Answer with Fisher test
How many more regions to I have overlapping from file A overlapping with file B? --> odds ratio
What is the ratio of the number of reads in any set of specific regions (which ones?)? --> logFC of read coverages

All of these are different measures that come with different details and caveats (for example, for 3. how would you define the region for which you're computing the read coverage difference? Just the part that two regions overlap precisely? The union of the genome region that is covered by the two regions (very rarely will they have the exact same coordinates in files A and B)? Just the entire region from file A as long as some part of it overlaps with file B? ...)

ADD REPLY • link 6.2 years ago by Friederike 9.0k