Hi,
I am trying to do ssGSEA analysis on my dataset (RNA-seq data) using the GVSA package. I have a few questions:
For the input data, should I put in raw counts or normalised counts? At the moment, I am using deseq normalised counts - is that acceptable? The GenePattern ssGSEA module gave the options of "rank", "log" and "log.rank". How will the different normalisation method affect the result?
Second question is slightly long-winded. Please bear with me.
a) I've created my own GMT files for 29 gene sets based on a paper. However, after double checking, The gene sets contained 12 genes that is not available in my dataset. I am unsure how that happened and wanted to make sure that I have not filtered these genes out in my pre-processing steps. After checking, I realised that these genes were not present in my raw counts data (output from featureCounts). My FASTA files are aligned to GRCh37 and that is what I used for featureCounts too. Does anyone have this problem before? The missing genes contained pretty common genes so I am a bit perplexed why they are missing:
"MMP12" "MRC1" "CCL15" "IL4" "IFNA2" "CCL3" "C10orf54" "TRAC" "TRBC1" "TRBC2" "CCL4" "CCL5"
If anyone encountered this before and can tell me what went wrong that will be great.
b) I imagine missing a few genes will make a different to the ssGSEA result and some of my gene sets are pretty small (only have 8 genes). How does GSVA deal with a missing gene in the gene set? So I need to remove the gene from my GMT file before proceeding?
- My last question is on the minimum gene set size. As some of my gene set are very small (<10 genes) and the default is set as 10, will that affect the reliability of the ssGSEA analysis?
FYI I have input what I have at the moment (deseq normalised counts) and no errors were given from GSVA. But I would like to know how reliable is the results given I've had the issues above. Thank you!
gsva_results <- gsva(
deseq_norm_counts,
my_gmt,
method = "ssgsea",
# kcdf = "Gaussian" if we have expression values that are continuous such as log-CPMs, log-RPKMs or log-TPMs, kcdf = "Poisson" on integer counts.
kcdf = "Poisson",
# Minimum gene set size
min.sz = 1,
# Maximum gene set size
max.sz = 500,
# Compute Gaussian-distributed scores
mx.diff = TRUE,
# Don't print out the progress bar
verbose = T)
I read this https://bioconductor.org/packages/release/bioc/manuals/GSVA/man/GSVA.pdf but I couldn't find where its written how they treat missing genes.
My GMT files has 29 gene sets totaling around 250 genes. However, from my dataset, I only have 238/250 of the genes. Does this explanation makes more sense?
Thank you.