Question

Selection Of Background Gene Set In Enrichment Analysis

23

12.8 years ago

Andrew Su 4.9k

Tools like GSEA are a great way to translate gene lists into higher-level processes by detecting enrichment of functional gene sets. Detection of "enrichment" depends on selecting an appropriate background set of genes as a baseline, and even web-based tools like DAVID allow users to customize the background. A colleague and I were just discussing our various approaches to selecting the background, and I thought I'd poll the community here on their preferred methods. Obviously pointers to the published literature would be ideal, but informal and unpublished thoughts are welcome too...

enrichment statistics • 29k views

ADD COMMENT • link updated 3.9 years ago by Sebastian Hesse ▴ 350 • written 12.8 years ago by Andrew Su 4.9k

score 5 · Answer 1 · 2012-02-23

12.8 years ago

Andrew Su 4.9k

Well, the general idea was first proposed by Maciej Jończyk in a comment to Will's answer, but I thought it deserved its own answer to see if others use this approach. And if so, how...

As Will states (and ahill also alludes to), the background should include 'any gene that COULD HAVE been positive'. Most interpret that to at least be limited to all the genes interrogated by the technology platform (e.g., microarray). But couldn't/shouldn't one also limit to genes that are expressed in the tissue or cell type being profiled?

For example, suppose I'm comparing thymus samples in two biological states. If I use a microarray- or genome-wide background, then very likely the enriched categories will mostly be gene lists related to T-cells. Doesn't it make sense to limit the background to only genes that are expressed in the thymus, in which case one might find enrichment of specific pathways that are altered between my conditions? My guess would be that such pathways would be obscured when using the broader background set.

If this makes sense, then can anyone point to a publication that explores such an approach?
Or, anyone care to describe how to determine what genes are expressed in the tissue or cell type of interest? (Personally, I usually just take the set of genes that are expressed above some arbitrary threshold of detection across any of my samples.)
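The "above some threshold in any of my samples" rule could be sketched roughly as follows. This is only an illustration under assumptions: the function name `expressed_background`, the toy expression values, and the threshold of 1.0 are all made up for the example, and a real analysis would pick a threshold suited to the platform and normalization.

```python
def expressed_background(expression, threshold):
    """Return genes whose value exceeds `threshold` in at least one sample."""
    return {
        gene
        for gene, values in expression.items()
        if any(v > threshold for v in values)
    }

# Toy data (hypothetical values, arbitrary units): gene -> value per sample.
expression = {
    "CD3E": [8.1, 7.9, 8.4],
    "PTPRC": [6.2, 6.0, 5.8],
    "ALB": [0.1, 0.0, 0.2],   # never above threshold
    "INS": [0.0, 1.4, 0.1],   # above threshold in one sample only
}

# Threshold of 1.0 is an arbitrary choice for this sketch.
background = expressed_background(expression, threshold=1.0)
print(sorted(background))
```

Note that "any sample" is the permissive choice: a gene detected in a single sample stays in the background. Requiring detection in several samples would be stricter, and which is appropriate is exactly the open question here.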

ADD COMMENT • link 12.8 years ago by Andrew Su 4.9k

0

But couldn't/shouldn't one also limit to genes that are expressed in the tissue or cell type being profiled?

In my RNA-seq experiment I can definitely confirm that without limiting the background gene sets to genes that are actually expressed in my tissue of interest, I see enrichment of tissue-specific genes even without enrichment among differentially expressed genes.

My tentative conclusion from this is that one should remove genes from the background gene set if they are not expressed in any sample of the experiment, but I am curious what other people think about that.
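A toy calculation can make this effect concrete. The sketch below uses a one-sided hypergeometric test (one common choice for over-representation; other tools use different statistics) and entirely invented numbers: 20,000 genes in the genome, 8,000 expressed in the tissue, a tissue-specific gene set of 400 genes (380 of them expressed), and 500 DE genes that overlap the set at about the rate expected by chance among expressed genes. The function name `enrichment_p` is hypothetical.

```python
from math import comb

def enrichment_p(de_genes, gene_set, background):
    """One-sided hypergeometric p-value for over-representation of
    `gene_set` among `de_genes`, with both restricted to `background`."""
    de = de_genes & background
    members = gene_set & background
    N, K, n = len(background), len(members), len(de)
    k = len(de & members)
    # P(X >= k) when drawing n genes without replacement from N,
    # of which K belong to the gene set.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)

# Invented toy universe: genes are just integer IDs.
genome = set(range(20000))
expressed = set(range(8000))
tissue_set = set(range(380)) | set(range(8000, 8020))   # 400 genes, 380 expressed
de_genes = set(range(24)) | set(range(1000, 1476))      # 500 DE genes, 24 in the set

p_genome = enrichment_p(de_genes, tissue_set, genome)
p_expressed = enrichment_p(de_genes, tissue_set, expressed)
print(f"genome-wide background: p = {p_genome:.2e}")
print(f"expressed-only background: p = {p_expressed:.2f}")
```

With these made-up numbers the same 24-gene overlap looks strongly enriched against the genome-wide background (where only about 10 would be expected) and unremarkable against the expressed-only background (where about 24 would be expected).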

ADD REPLY • link 9.4 years ago by Christian ★ 3.1k

3

This recent article covers this exact topic and argues to definitely remove non-expressed genes from the background: http://www.genomebiology.com/2015/16/1/186

ADD REPLY • link 9.3 years ago by Christian ★ 3.1k

0

Thanks for this link. Very helpful.

ADD REPLY • link 8.4 years ago by arezansoff ▴ 20

score 5 · Answer 2 · 2016-09-01

That always makes me wonder... I find it a very unsafe approach to eliminate genes as non-expressed in a particular tissue/cell type. What if we actually introduce further bias because of detection issues? I.e., can we really say that a certain set of genes is not expressed at all, under any conditions, in a particular tissue/cell type? NB: tissues and cells are dynamic and responsive; there is no static state or static signature that would hold under all conditions. That's why we do the experiments, after all. Therefore the argument that, because some genes might not be detected, we should remove even more genes from the background set doesn't really convince me.

Now, I can understand the point some people make that if we get a transcriptomic profile of a tissue and compare it to a "universe" background, all we'll learn is that we are studying that tissue. Yet, if I design experiments aiming at discovering an enriched/enhanced process, I would normally compare the same tissue/cell type, e.g. treated and untreated. This means that the tissue- or cell type-specific signature will be "filtered out" at the level of DE, as those genes should be at more or less the same expression level, and the enriched sets will contain genes regulated by the treatment. The exception is when the treatment also affects e.g. the differentiation rate of the tissue or its identity; then I would get terms relevant to that tissue phenotype, but in that case I would obviously want to know they are regulated.

So, with all the possible biases, I still feel that comparing against all the genes that could be expressed (hence all the genes) is more biologically relevant than comparing against an artificially/arbitrarily selected background.

But I would be very happy if somebody could suggest a thorough reading on the topic, especially related to NGS (RNA-seq and ChIP-seq data). I found the brief article cited above a bit disappointing.

score 3 · Answer 3 · 2012-02-21

3

12.8 years ago

Will 4.6k

I think the general idea is for the background to be 'any gene that COULD HAVE been positive'. So if you were doing a microarray/ChIP-seq/SNP/etc. analysis, the background would be all genes on the chip ... since only those genes could have been differentially expressed.

If you're doing some sort of computational analysis it would be all genes in the database you analyzed (which may or may not be all genes in the organism).

I'm still unclear (and I think the community in general is unclear) what to do with Next-Gen sequencing results. I've seen examples where people use all genes in the mapped-reference and instances where people put a cutoff and said they only considered genes with >X hits.

ADD COMMENT • link 12.8 years ago by Will 4.6k

1

I agree with Will

'any gene that COULD HAVE been positive'

should be used in the population set. At the same time, that implies that genes which don't show signal above background should be filtered out.

ADD REPLY • link 12.8 years ago by boczniak767 ▴ 870

0

Although I will stand corrected on the Next-Gen statements; it's been a year or so since I did something in that field and the community may have solidified since then.

ADD REPLY • link 12.8 years ago by Will 4.6k

0

Maciej, that's exactly the type of answer I was looking for. If filtering out genes below background is something you actually do, would you mind fleshing out your approach in its own answer?

ADD REPLY • link 12.8 years ago by Andrew Su 4.9k

score 1 · Answer 4 · 2012-02-21

From what experimental platform are you getting your gene lists? For experimental platforms like Affymetrix arrays, where the set of monitored genes is nominally known, we have sometimes defined the background to be the set of all genes monitored by the particular array design used in the experiment. A refinement of this method is to select the subset of genes for which the array could ever yield a signal, since some probesets will never detect a message (or differential expression) above background noise. This subset of genes might be approximated by interrogating a large number of diverse historical experiments. For other platforms like mass-spectrometry based proteomics, selection of an appropriate background that reflects the biases of the assay seems even more difficult.
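One way the "interrogate historical experiments" refinement might look in code is sketched below. Everything here is an assumption made for illustration: the function name `detectable_probesets`, the probeset IDs, the noise floor of 100, and the rule that a probeset counts as detectable if it exceeds the noise floor in at least a given fraction of past experiments.

```python
def detectable_probesets(history, noise_floor, min_fraction):
    """Return probesets whose signal exceeded `noise_floor` in at least
    `min_fraction` of the historical experiments recorded for them."""
    keep = set()
    for probeset, signals in history.items():
        if not signals:
            continue  # no historical data: cannot call it detectable
        hits = sum(1 for s in signals if s > noise_floor)
        if hits / len(signals) >= min_fraction:
            keep.add(probeset)
    return keep

# Hypothetical probeset IDs and signal intensities from four past experiments.
history = {
    "ps_A": [50, 400, 30, 900],   # above the floor in 2 of 4
    "ps_B": [20, 25, 18, 22],     # never above the floor
    "ps_C": [15, 10, 350, 12],    # above the floor in 1 of 4
}

background = detectable_probesets(history, noise_floor=100, min_fraction=0.25)
print(sorted(background))
```

The usefulness of such a filter depends on how diverse the historical experiments are: a probeset that is only ever active in a tissue absent from the collection would be dropped incorrectly.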

score 1 · Answer 5 · 2012-02-23

I deal with NGS data. Most papers I've read just flat out remove anything with fewer than X (usually 10-15) reads mapped. I don't think I've ever read why they chose that arbitrary number.

I think with most RNA-seq experiments, what we are interested in are differentially expressed genes. Most differential expression determinations are done by taking into account the number of reads and the variability among replicates. Usually genes with few mapped reads will have such abysmal p-values that they are never considered differentially expressed unless they are extremely consistent among replicates.

So I never remove genes with fewer than X reads. I just normalize my reads and run them through the DE analysis, letting the DE determination get rid of genes that are lowly expressed.

score 1 · Answer 6 · 2021-01-25

1

3.9 years ago

Sebastian Hesse ▴ 350

I would like to revive this topic and ask whether a statistical point could be made. Couldn't we calculate the effect of different background sizes?
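As a rough starting point, the effect of background size alone can be computed directly under a hypergeometric model. The sketch below holds the gene set size (K = 100), the DE list size (n = 200), and their overlap (k = 10) fixed while varying the background size N. These numbers are invented, and holding K and n fixed is a simplification, since in practice shrinking the background also removes genes from the gene set and possibly from the DE list.

```python
from math import comb

def hypergeom_tail(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(population N, successes K, draws n)."""
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)

# Invented, fixed quantities: gene set size, DE list size, observed overlap.
K, n, k = 100, 200, 10

p_by_size = {N: hypergeom_tail(k, N, K, n) for N in (5000, 10000, 20000)}
for N, p in p_by_size.items():
    expected = n * K / N  # overlap expected by chance for this background
    print(f"background={N:>6}  expected overlap={expected:.1f}  p={p:.2e}")
```

The same overlap becomes more significant as the background grows, because the overlap expected by chance shrinks with N. This is the mechanical reason an overly broad background can inflate enrichment.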

ADD COMMENT • link 3.9 years ago by Sebastian Hesse ▴ 350

0

This is an interesting topic, but I think it should really be a new question, especially given the age of this one.

ADD REPLY • link 3.9 years ago by i.sudbery 20k