This is a continuation of this post I made, but I figured I would ask in a different and more explicit manner now that I have a deeper understanding of my dataset.
My experimental design:
I received counts data for different treatment conditions after transfection with shRNAs. The counts for each shRNA should serve as a proxy for whether the gene it targets is associated with preferential survival or death under the given treatment conditions. Lower counts for an shRNA would indicate that its target gene was important for the survival of the cell line under the experimental conditions.
Thus far, for each shRNA, I calculated its log2Fold change and adjusted p-value using both DESeq2 and edgeR. I then thresholded the results to select only those shRNAs that fell below a certain p-value.
From this list, I then selected, for each gene, the shRNA with the largest absolute log2Fold change (either positive or negative), since there was more than one shRNA per gene. I needed a single shRNA per gene because the GSEA methods I planned to pursue only allow one instance of a gene in a provided list. I chose the shRNA with the largest log2Fold change after p-value thresholding on the advice of my labmates.
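The selection step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the field names (`shrna`, `gene`, `log2fc`, `padj`), the 0.05 cutoff, and all values are hypothetical and do not come from my actual DESeq2 or edgeR output.

```python
# Hypothetical per-shRNA results; field names and values are illustrative only.
rows = [
    {"shrna": "sh1", "gene": "GENE_A", "log2fc": -2.1, "padj": 0.001},
    {"shrna": "sh2", "gene": "GENE_A", "log2fc": 0.4, "padj": 0.030},
    {"shrna": "sh3", "gene": "GENE_B", "log2fc": 1.5, "padj": 0.020},
    {"shrna": "sh4", "gene": "GENE_B", "log2fc": -3.0, "padj": 0.400},
    {"shrna": "sh5", "gene": "GENE_C", "log2fc": 2.2, "padj": 0.700},
]


def top_shrna_per_gene(rows, padj_cutoff=0.05):
    """Keep shRNAs below the adjusted p-value cutoff, then pick, for each
    gene, the one with the largest absolute log2 fold change."""
    best = {}
    for row in rows:
        if row["padj"] >= padj_cutoff:
            continue  # fails the (assumed) significance threshold
        current = best.get(row["gene"])
        if current is None or abs(row["log2fc"]) > abs(current["log2fc"]):
            best[row["gene"]] = row
    return best


selected = top_shrna_per_gene(rows)
for gene, row in sorted(selected.items()):
    print(gene, row["shrna"], row["log2fc"])
```

Note that a gene whose shRNAs all fail the threshold (GENE_C here) disappears from the list entirely, and that sh4 is ignored for GENE_B despite its larger fold change because it does not pass the cutoff.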
I have other information about the shRNA data in the following dataframe as well, although this likely is not relevant to GSEA methods (all information is from edgeR; percent_signif_shrna indicates the percentage of shRNAs intended to target a particular gene that fell below a specified p-value and above a certain log2Fold change).
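For clarity, here is a rough sketch of how a column like percent_signif_shrna could be computed. The thresholds, the field names, and the choice to compare the absolute log2Fold change against the cutoff are assumptions made for illustration, not a description of the exact code that produced my dataframe.

```python
from collections import defaultdict

# Hypothetical per-shRNA results; field names and values are illustrative only.
shrna_rows = [
    {"gene": "GENE_A", "log2fc": -2.1, "padj": 0.001},
    {"gene": "GENE_A", "log2fc": -1.4, "padj": 0.010},
    {"gene": "GENE_A", "log2fc": 1.8, "padj": 0.020},
    {"gene": "GENE_A", "log2fc": 0.2, "padj": 0.600},
    {"gene": "GENE_B", "log2fc": 0.5, "padj": 0.030},
    {"gene": "GENE_B", "log2fc": -3.0, "padj": 0.400},
]


def percent_signif_shrna(rows, padj_cutoff=0.05, min_abs_log2fc=1.0):
    """Percentage of shRNAs per gene passing both thresholds.

    Assumption: the fold-change threshold is applied to the absolute value.
    """
    total = defaultdict(int)
    signif = defaultdict(int)
    for row in rows:
        total[row["gene"]] += 1
        if row["padj"] < padj_cutoff and abs(row["log2fc"]) > min_abs_log2fc:
            signif[row["gene"]] += 1
    return {gene: 100.0 * signif[gene] / total[gene] for gene in total}


percentages = percent_signif_shrna(shrna_rows)
print(percentages)
```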
I now want to use tools to understand the values associated with my output. Thus far, I used PantherDB's Statistical Enrichment Test (using the Panther Pathways section) to gather some preliminary information. I used the Statistical Enrichment Test since I wanted to leverage both the gene names and the log2Fold change (since I have the quantitative data). The .txt file I put into PantherDB looked like the following tab-delimited file:
I now want to use other types of GSEA to analyze my output and figure out the biological meaning of these values. I have identified the following potential tools: DAVID, GAGE, MSigDB, Enrichr, and RTopper.
Questions:
Are there any tools that anyone believes I should consider given my experimental design?
Should I know anything about any of these tools in particular before I provide my values (e.g. is one considered 'better' or 'worse' for the RNA-Seq data that I intend to work with?)
It seems like many of the tools only take in lists of genes and do not work with the log2Fold changes (unlike PantherDB's 'Statistical Enrichment Test'). If this is the case, should I ONLY include the genes that have positive log2Fold values (i.e. only those that are enriched, and not those that are downregulated)? It seems that 'enrichment' is more the target of these tests than downregulation, so that seems to be the most appropriate course of action.
Do any of these tools leverage log2Fold in any capacity? I feel as if having that additional quantitative metric (in addition to thinking about the downregulation of certain pathways) may help with my analyses. I am, however, very new to this area and understand that in some cases the list of genes alone may be the only thing worth reporting.
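To make the two kinds of input I am weighing concrete, here is a small sketch: a plain gene list split by the sign of the log2Fold change (for tools that only accept gene names), and a ranked list that keeps the values (the shape I gave to PantherDB's Statistical Enrichment Test). The gene names and values are hypothetical, and this is only one way the inputs could be prepared.

```python
# Hypothetical one-value-per-gene table (gene -> log2 fold change).
gene_log2fc = {"GENE_A": -2.1, "GENE_B": 1.5, "GENE_C": 0.8, "GENE_D": -0.3}


def split_by_direction(gene_log2fc):
    """Return (up, down) gene lists for tools that take gene names only."""
    up = sorted(g for g, lfc in gene_log2fc.items() if lfc > 0)
    down = sorted(g for g, lfc in gene_log2fc.items() if lfc < 0)
    return up, down


def ranked_list(gene_log2fc):
    """Return (gene, log2fc) pairs sorted from most positive to most negative,
    for tools that use the quantitative value."""
    return sorted(gene_log2fc.items(), key=lambda kv: kv[1], reverse=True)


up, down = split_by_direction(gene_log2fc)
ranked = ranked_list(gene_log2fc)

print("up:", up)
print("down:", down)
for gene, lfc in ranked:
    print(f"{gene}\t{lfc}")  # tab-delimited gene / value pairs
```

Submitting the up and down lists separately would keep the depleted genes in the analysis instead of discarding them.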
Thank you for the comment! The post that you linked is going to be helpful as I navigate the rest of my analyses.
For my edification: why should we avoid 'filtering out' genes a priori with a p-value? I would imagine anything with a high p-value might not be helpful for the analyses anyway and instead bog down the results.
Filtering out genes with a low level of expression, or with low variance, from the input may reduce the statistical power of the enrichment analysis.
Here is what the GSEA user guide states about pre-processing: