Question

Questions on ssGSEA

0

Entering edit mode

3.5 years ago

emi ▴ 80

Hi,

I am trying to do ssGSEA analysis on my dataset (RNA-seq data) using the GVSA package. I have a few questions:

For the input data, should I put in raw counts or normalised counts? At the moment, I am using deseq normalised counts - is that acceptable? The GenePattern ssGSEA module gave the options of "rank", "log" and "log.rank". How will the different normalisation method affect the result?
Second question is slightly long-winded. Please bear with me.

a) I've created my own GMT files for 29 gene sets based on a paper. However, after double checking, The gene sets contained 12 genes that is not available in my dataset. I am unsure how that happened and wanted to make sure that I have not filtered these genes out in my pre-processing steps. After checking, I realised that these genes were not present in my raw counts data (output from featureCounts). My FASTA files are aligned to GRCh37 and that is what I used for featureCounts too. Does anyone have this problem before? The missing genes contained pretty common genes so I am a bit perplexed why they are missing:

"MMP12" "MRC1" "CCL15" "IL4" "IFNA2" "CCL3" "C10orf54" "TRAC" "TRBC1" "TRBC2" "CCL4" "CCL5"

If anyone encountered this before and can tell me what went wrong that will be great.

b) I imagine missing a few genes will make a different to the ssGSEA result and some of my gene sets are pretty small (only have 8 genes). How does GSVA deal with a missing gene in the gene set? So I need to remove the gene from my GMT file before proceeding?

My last question is on the minimum gene set size. As some of my gene set are very small (<10 genes) and the default is set as 10, will that affect the reliability of the ssGSEA analysis?

FYI I have input what I have at the moment (deseq normalised counts) and no errors were given from GSVA. But I would like to know how reliable is the results given I've had the issues above. Thank you!

  gsva_results <- gsva(
deseq_norm_counts,
my_gmt,
method = "ssgsea",
# kcdf = "Gaussian" if we have expression values that are continuous such as log-CPMs, log-RPKMs or log-TPMs, kcdf = "Poisson" on integer counts. 
kcdf = "Poisson",
# Minimum gene set size
min.sz = 1,
# Maximum gene set size
max.sz = 500,
# Compute Gaussian-distributed scores
mx.diff = TRUE,
# Don't print out the progress bar
verbose = T)

RNA-seq gsva ssGSEA alignment • 3.3k views

ADD COMMENT • link updated 3.5 years ago by Brian Gudenas ▴ 90 • written 3.5 years ago by emi ▴ 80

score 0 · Answer 1 · 2021-06-06

0

Entering edit mode

3.5 years ago

Arsenal ▴ 160

1- Have you read GSVA manual?

2- Very confusing;

created my own GMT files for 29 gene sets based on a paper. However, after double checking, The gene sets contained 12 genes that is not available in my dataset

ADD COMMENT • link 3.5 years ago by Arsenal ▴ 160

0

Entering edit mode

I read this https://bioconductor.org/packages/release/bioc/manuals/GSVA/man/GSVA.pdf but I couldn't find where its written how they treat missing genes.
My GMT files has 29 gene sets totaling around 250 genes. However, from my dataset, I only have 238/250 of the genes. Does this explanation makes more sense?

Thank you.

ADD REPLY • link 3.5 years ago by emi ▴ 80

score 0 · Answer 2 · 2021-06-10

1.) ssGSEA is performed based on a ranked gene list (within sample comparisons) therefore it makes sense to use something that accounts for gene length bias in this case such as TPM or FPKM instead of normalized counts.

2.) Missing genes should not be a problem, they simply wont be counted in the test. But, I would not use ssGSEA for gene sets smaller than 10 genes, the results become more variable the fewer genes you have. With a small gene set (like 5 genes), changing the expression of a single gene can have a large impact on the enrichment score which defeats the purpose of doing GSEA which is meant to be robust to single gene deviations.