Question

How to do RNA-Seq metadata analysis?

2.5 years ago

Francois Piumi ▴ 70

Hi,

I downloaded raw RNA-Seq data (FASTQ files) from several different studies. I then ran them through an RNA-Seq pipeline that I developed and already use for my own studies. I ended up with several DESeq2 result files listing differentially expressed genes (fold changes and p-values).

What are the best solutions to compare these results?

For instance, I'd like to compare pathways that are found enriched in the different studies.

I'd like to use KEGG or IPA, but the first step is to apply a fold-change cutoff and a p-value cutoff to select differentially expressed genes. Do I have to use the same cutoffs for all the studies, or should I select approximately the same number of differentially expressed genes in each study? Or maybe there's another solution?

RNA-Seq metadata

updated 2.5 years ago by seidel 11k • written 2.5 years ago by Francois Piumi ▴ 70

score 1 · Answer 1 · 2022-05-21

I think you'll have to explore your data and make decisions based on the questions you have and what you're trying to determine. Many methods do not require a cutoff to define DE genes; they only require that your genes be ranked by some statistic. GSEA is one such method. You could also use the geneSetTest() function (it lives in limma, the package edgeR builds on) to test whether any gene set of your choice is highly ranked relative to other genes in your data set.
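To make the "ranked, no cutoff" idea concrete, here is a minimal Python sketch of a rank-based gene set test. It is not the limma or GSEA implementation, just a plain one-sided Wilcoxon rank-sum test with a normal approximation and no tie correction. The function name, the input layout (a dict of gene to per-gene statistic, e.g. a DESeq2 Wald statistic) and the gene names in the usage example are all assumptions made for illustration.

```python
import math

def rank_sum_gene_set_test(stats, gene_set):
    """One-sided Wilcoxon rank-sum test: is gene_set ranked higher than the rest?

    stats    : dict mapping gene -> ranking statistic (hypothetical input layout)
    gene_set : iterable of gene names; genes absent from stats are ignored
    Returns (z, p) using a normal approximation without tie correction.
    """
    genes = sorted(stats, key=lambda g: stats[g])  # ascending by statistic
    ranks = {}
    i = 0
    while i < len(genes):
        # Find the run of tied values and give each gene the average rank.
        j = i
        while j + 1 < len(genes) and stats[genes[j + 1]] == stats[genes[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[genes[k]] = avg_rank
        i = j + 1

    in_set = set(gene_set) & set(stats)
    n1 = len(in_set)
    n2 = len(genes) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    rank_sum = sum(ranks[g] for g in in_set)
    u = rank_sum - n1 * (n1 + 1) / 2           # Mann-Whitney U for the set
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = 0.5 * math.erfc(z / math.sqrt(2))      # upper-tail P(Z >= z)
    return z, p

# Toy usage with made-up genes: the set holds the five highest-ranked genes.
toy_stats = {f"g{i}": float(i) for i in range(1, 21)}
z, p = rank_sum_gene_set_test(toy_stats, ["g16", "g17", "g18", "g19", "g20"])
```

Because only the ranks matter, the same call can be run on each study's full result table without choosing a fold-change or p-value threshold first.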

On the other hand, if you're doing an enrichment analysis to see whether DE gene sets you've selected are enriched for pathways, it makes initial intuitive sense to apply the same cutoffs across your data sets. However, biological effects differ in size, and experimental error and sensitivity differ from lab to lab and system to system. "Several studies" implies a number small enough that you can look at each one and take in its important characteristics. For a crude answer you could apply the same cutoff to each study, but comparing what gets enriched in 50 genes from one experiment with what gets enriched in 2000 genes from a second doesn't seem that meaningful. If the sizes of the DE gene sets differ wildly between experiments and you want to determine whether a pathway is pushed to one side or another, it may make more sense to choose a consistent number of top-ranked genes for comparison.
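As a rough illustration of the "consistent number of top-ranked genes" option, the Python sketch below takes the N genes with the smallest p-values from a study and runs a simple hypergeometric over-representation test against one pathway. The function names, the dict-of-p-values input and the toy gene and pathway names are assumptions for the example. It also assumes genes with missing p-values have already been dropped, and that the universe is the set of genes tested in that study.

```python
from math import comb

def top_n_genes(pvalues, n):
    """Return the n genes with the smallest p-values (pvalues: gene -> p)."""
    return set(sorted(pvalues, key=pvalues.get)[:n])

def enrichment_p(selected, pathway, universe):
    """Hypergeometric upper tail: P(overlap >= observed) for one pathway."""
    universe = set(universe)
    selected = set(selected) & universe
    pathway = set(pathway) & universe
    n_universe = len(universe)
    n_pathway = len(pathway)
    n_selected = len(selected)
    observed = len(selected & pathway)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, i) * comb(n_universe - n_pathway, n_selected - i)
        for i in range(observed, min(n_pathway, n_selected) + 1)
    )
    return tail / total

# Toy usage with made-up data: same N for every study, then test one pathway.
study_pvalues = {f"g{i}": i / 100 for i in range(1, 21)}
hypothetical_pathway = {"g1", "g2", "g3", "g4", "g5"}
top = top_n_genes(study_pvalues, 5)
p = enrichment_p(top, hypothetical_pathway, study_pvalues.keys())
```

Using the same N in every study keeps the size of the tested gene list constant, so the enrichment p-values are at least comparable in that respect; it does not correct for differences in power or effect size between studies.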

It really depends on your data and your questions.