Question

RNA-seq differential expression analysis: no significant genes by FDR but significant GO enrichment

1

5.6 years ago

guillaume.rbt ★ 1.0k

Hi all,

I'm currently doing an RNA-seq differential expression analysis in which no genes are significant at FDR < 0.05. (I'm working on human tumor biopsy data, with 111 samples.)
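For readers comparing raw p-values with FDR, here is a minimal Python sketch of the Benjamini-Hochberg adjustment. The p-values are made up for illustration and are not from this dataset; the function name `benjamini_hochberg` is my own, not a library call.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [1.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for 10 genes: four are below 0.05 before adjustment.
pvalues = [0.01, 0.02, 0.03, 0.04, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
fdr = benjamini_hochberg(pvalues)

nominal_hits = sum(p < 0.05 for p in pvalues)
fdr_hits = sum(q < 0.05 for q in fdr)
print(nominal_hits, fdr_hits)
```

In this toy case the four nominal hits each get an adjusted value of about 0.1, so none passes FDR < 0.05 even though all pass p < 0.05.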

However, when I perform GO enrichment analysis on the top hits (p-value < 0.05, logFC > 1 or < -1), I get significantly enriched pathways, which seem consistent from a biological point of view.
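As a rough illustration of what such an enrichment test computes, here is a sketch of a one-sided hypergeometric over-representation test in plain Python. This is an assumption about the kind of test used; the actual GO tool may apply a different statistic or correction. The function name and all the numbers are hypothetical.

```python
from math import comb


def enrichment_pvalue(universe, annotated, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `universe` genes, of which `annotated` carry the GO term."""
    upper = min(annotated, selected)
    total = comb(universe, selected)
    tail = sum(
        comb(annotated, k) * comb(universe - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 15000 tested genes, 200 annotated to the term,
# 300 top hits, 12 of which are annotated (about 4 expected by chance).
p_term = enrichment_pvalue(15000, 200, 300, 12)
print(p_term)
```

The `universe` argument matters: it should be the set of genes actually tested, not the whole genome, otherwise the p-values tend to be too optimistic.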

This brings me to two questions:

Could those results be relevant? Would the biological signal detected with GO enrichment in the top hits outweigh the fact that no significant genes were detected?
If so, how could I illustrate those findings? I've tried making heatmaps on a subset of genes belonging to a specific pathway, but as the difference in gene expression between the two studied conditions is rather small, the samples do not cluster by condition in the figure. (See below for an example of the kind of expression pattern that I get for one gene between the two studied conditions.)

[Image: expression pattern of one gene in the two studied conditions]

Thank you in advance for any input

RNA-Seq differential expression • 3.0k views

ADD COMMENT • link 5.6 years ago by guillaume.rbt ★ 1.0k

1

Could you explain how you did your DGE analysis, please? Also, could you explain your experimental design (how many groups, and how many samples per group)?

ADD REPLY • link 5.6 years ago by Nicolas Rosewick 11k

0

I am studying the difference in gene expression between a group of responders (n=60) and a group of non-responders (n=51) to a treatment. My dataset combines data from 3 different studies, so I corrected for between-study variation by including the study as a covariate in my design (I use limma/voom).

ADD REPLY • link 5.6 years ago by guillaume.rbt ★ 1.0k

1

Have you run and interpreted a PCA of your data?

I think this is important, since if your samples are not well separated, the DE analysis will fail. In some cases, it could be worth discarding some of your samples based on the PCA.

ADD REPLY • link 5.6 years ago by Antonio R. Franco ★ 5.2k

0

Yes, I ran a PCA before doing my differential expression analysis. There was no clustering by responder or non-responder group, but there was clustering by study.

ADD REPLY • link 5.6 years ago by guillaume.rbt ★ 1.0k

1

Did you combine independent datasets into the same statistical analysis? If so, what you see is normal and expected; it is called a batch effect. What do you mean by study?

ADD REPLY • link 5.6 years ago by ATpoint 88k

0

Yes, I mean that there is a batch effect, which should be corrected for by the design I've used.

ADD REPLY • link 5.6 years ago by guillaume.rbt ★ 1.0k

1

I do not think this is possible or a good approach. If you really have three independent studies and the studies are the same as the groups you use in your design, there is no way to distinguish biological from batch effects. You would need replicates of all conditions in each study. Are the three studies at least identical in terms of sample preparation (same RNA preparation regime, same sample prep kit, which is probably the most important factor, etc.), or are they completely different?

ADD REPLY • link 5.6 years ago by ATpoint 88k

0

Unfortunately, the details of RNA preparation are not given for two of the datasets that I've used. I know that a strong batch effect is present, which is why I'm cautious with the results.

I've tried other ways of correcting the batch effect (using the limma function removeBatchEffect before the differential expression test, and also analysing each dataset independently and then doing a meta-analysis of p-values with Stouffer's test). When I compare the results of each of these methods, I get similar results, with the same seemingly relevant biological signal.
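For reference, the per-gene combination step of a Stouffer meta-analysis can be sketched in Python as follows. This is a generic sketch, not the code used above: it assumes one-sided p-values that all test the same direction of effect, and the weights (here square roots of hypothetical per-study sample sizes) are an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist


def stouffer(pvalues, weights=None):
    """Combine one-sided p-values for one gene across studies.

    Each p-value must lie strictly between 0 and 1.
    """
    norm = NormalDist()
    if weights is None:
        weights = [1.0] * len(pvalues)
    # Convert each p-value to a z-score (small p gives large positive z).
    z_scores = [norm.inv_cdf(1.0 - p) for p in pvalues]
    combined_z = sum(w * z for w, z in zip(weights, z_scores)) / sqrt(
        sum(w * w for w in weights)
    )
    return 1.0 - norm.cdf(combined_z)


# Hypothetical per-study p-values for one gene and hypothetical sample sizes.
study_pvalues = [0.04, 0.10, 0.20]
study_sizes = [50, 40, 21]
combined = stouffer(study_pvalues, weights=[sqrt(n) for n in study_sizes])
print(combined)
```

Feeding two-sided p-values into this function would ignore the direction of the fold change, so a gene that is up in one study and down in another could still look significant.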

ADD REPLY • link 5.6 years ago by guillaume.rbt ★ 1.0k

0

So does each dataset contain both groups you are analyzing (responders and non-responders), or are the non-responders from one study and the responders from another study?

ADD REPLY • link 5.6 years ago by ATpoint 88k

1

Fortunately, all datasets contain both responder and non-responder samples.

ADD REPLY • link 5.6 years ago by guillaume.rbt ★ 1.0k

0

The PCA will show you how much dispersion you have in your data in general terms. If the data are not clearly separated into clusters, I would expect a weak DE result.

ADD REPLY • link 5.6 years ago by Antonio R. Franco ★ 5.2k

1

Hi,

Be aware that a DGE analysis looks at differences at the gene level. If, for instance, the severity of cancer development comes from a higher amount of erroneously spliced mRNA transcripts in one sample group, the summarized gene expression stays the same, because every read associated with the same gene is counted as a hit.
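A toy Python illustration of this point, with invented transcript-level counts for a single gene with two isoforms (the names and numbers are hypothetical):

```python
# Invented read counts per isoform of one gene in each sample group.
group_a = {"isoform_1": 900, "isoform_2": 100}
group_b = {"isoform_1": 400, "isoform_2": 600}


def gene_count(transcripts):
    """Gene-level count: every read is counted, whatever its isoform."""
    return sum(transcripts.values())


def isoform_fractions(transcripts):
    """Share of the gene's reads assigned to each isoform."""
    total = gene_count(transcripts)
    return {name: count / total for name, count in transcripts.items()}


print(gene_count(group_a), gene_count(group_b))
print(isoform_fractions(group_a))
print(isoform_fractions(group_b))
```

Both groups have the same gene-level count, so a gene-level test sees no difference, while the share of `isoform_2` goes from 10% to 60%.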

Generally, I would think that in most cases there probably are differentially expressed genes in a comparison between 2 groups. However, if your two groups are very heterogeneous, showing larger differences between individuals of the same group than between the sample groups, then you get no significantly differentially expressed genes in your analysis. If this could be the case, I would ask the provider of the data: Were all the samples prepared at the same facility? Do you have the same gender distribution in both groups? Do the donors come from the same region? Were the samples collected at roughly the same stage of cancer progression? If not, you have to include this information when defining the design for the DGE analysis software.

Edit: I didn't see your comment before posting. So I would only ask whether the difference between the groups could potentially come from alternative splicing.

ADD REPLY • link 5.6 years ago by caggtaagtat ★ 1.9k

0

I do indeed have larger differences between individuals of the same group than between the sample groups, hence I wasn't expecting highly significant differences in my results. I hadn't thought about the possibility of differential alternative splicing between the groups. Thanks for the idea, I will dig into that!

ADD REPLY • link 5.6 years ago by guillaume.rbt ★ 1.0k

1

I have encountered this situation quite a few times and tend to accept and report results based on GO/pathway enrichment significance. One argument is that DE p-value < 0.05 with FDR > 0.05 does not mean your full set of genes is insignificant. Rather, your set of genes is likely to include some proportion of false positives, which may actually be filtered out by the enrichment analysis. The second argument is more abstract: you gain "bits of small evidence" in your DE analysis, while enrichment provides you with the "big picture". Having said that, I would check the workflow by submitting several random sets from your gene list and making sure you do not always end up with enriched terms.
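One way to run that sanity check is sketched below in Python. It assumes a one-sided hypergeometric test for a single term and random sets of the same size as the top-hit list, drawn from the tested genes; the real enrichment tool, its test, and its multiple-testing correction may differ. All gene names, sizes, and function names are hypothetical.

```python
import random
from math import comb


def enrichment_pvalue(universe, annotated, selected, overlap):
    """One-sided hypergeometric p-value, P(X >= overlap)."""
    upper = min(annotated, selected)
    tail = sum(
        comb(annotated, k) * comb(universe - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(universe, selected)


def random_set_hit_rate(universe_genes, term_genes, set_size,
                        n_draws=200, alpha=0.05, seed=0):
    """Fraction of random gene sets that look 'enriched' for one term."""
    rng = random.Random(seed)
    term = set(term_genes)
    hits = 0
    for _ in range(n_draws):
        drawn = rng.sample(universe_genes, set_size)
        overlap = sum(1 for gene in drawn if gene in term)
        p = enrichment_pvalue(len(universe_genes), len(term), set_size, overlap)
        hits += p < alpha
    return hits / n_draws


# Hypothetical universe of 2000 tested genes, 100 of them in the term.
universe_genes = [f"gene_{i}" for i in range(2000)]
term_genes = universe_genes[:100]
rate = random_set_hit_rate(universe_genes, term_genes, set_size=50)
print(rate)
```

If random sets come out enriched far more often than `alpha`, the problem is in the workflow (for example, a wrong background universe) rather than in the biology.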

ADD REPLY • link 5.6 years ago by jomo018 ▴ 730

0

I had the same idea, that the GO enrichment could act as a filter for false positives. Thanks for the tip about testing with random sets of genes, I will check that.

ADD REPLY • link 5.6 years ago by guillaume.rbt ★ 1.0k