Question

Why GSVA returns a matrix with fewer gene-sets?

0

Entering edit mode

7.1 years ago

arronar ▴ 290

Hi.

I'm trying to run GSVA, using microarray data and using 186 KEGG gene-sets. While I was waiting the output matrix to be 186(pathways)xsamples , is 6(pathways)xsamples. It returns only enrichment values for only 6 of the provided gene-sets and not for all of them. Is that something expected or am I doing something wrong with the code?

Thanks.

microarray GSVA • 3.9k views

ADD COMMENT • link updated 3.3 years ago by Christopher Walker ▴ 80 • written 7.1 years ago by arronar ▴ 290

0

Entering edit mode

How many samples do you have on the original microarray data? If you had 6 samples then the expected output is 186(pathways provided)x6(samples)

ADD REPLY • link 7.1 years ago by Lluís R. ★ 1.2k

0

Entering edit mode

I'm having 72 samples. and I'm getting back 6(pathways)x72(samples) instead of 1866(pathways)x72(samples)

Here is the command I'm using

gsva_res <- gsva( data, GSC.C2.msigdb$gsc[1:186], mx.diff=FALSE, verbose=TRUE)

and in the console it returns :

Estimating GSVA scores for 6 gene sets.

Computing observed enrichment scores

Estimating ECDFs in microarray data with Gaussian kernels

The gene set collection was built by

GSC.C2.msigdb <- loadGSC(file="geneSets/c2.all.v6.1.symbols.gmt", type="gmt")

ADD REPLY • link 7.1 years ago by arronar ▴ 290

1

Entering edit mode

Ok, are you sure you are using 186 Gene sets?? It seems like it is subseting by genes and not gene sets (subseting by number in a GeneSetCollection produces this, while subseting by name subsets by GeneSet name) or something alike. Where does the loadGSC function come from? Which class is the GSC.C2.msigdb$gsc?

ADD REPLY • link 7.1 years ago by Lluís R. ★ 1.2k

0

Entering edit mode

loadGSC is from piano package. and class(GSC.C2.msigdb$gsc) returns "list"

Also I tried to provide it a plain list that created as KEGG <- GSC.C2.msigdb$gsc[1:186] which class(KEGG) is list and length(KEGG) is 186, but still the same results.

ADD REPLY • link 7.1 years ago by arronar ▴ 290

1

Entering edit mode

And which type of list is it: one with genes: pathways or pathways genes? Load it as a GeneSetCollection and post here the output of printing it. I think that there are some gene sets with empty genes, or could be that the genes are not mapped to your matrix of probes/genes per sample

ADD REPLY • link 7.1 years ago by Lluís R. ★ 1.2k

0

Entering edit mode

> GSC.C2.msigdb returns :

First 50 (out of 4738) gene set names:
 [1] "KEGG_GLYCOLYSIS..." "KEGG_CITRATE_CY..." "KEGG_PENTOSE_PH..." "KEGG_PENTOSE_AN..." "KEGG_FRUCTOSE_A..."
 [6] "KEGG_GALACTOSE_..." "KEGG_ASCORBATE_..." "KEGG_FATTY_ACID..." "KEGG_STEROID_BI..." "KEGG_PRIMARY_BI..."
[11] "KEGG_STEROID_HO..." "KEGG_OXIDATIVE_..." "KEGG_PURINE_MET..." "KEGG_PYRIMIDINE..." "KEGG_ALANINE_AS..."
[16] "KEGG_GLYCINE_SE..." "KEGG_CYSTEINE_A..." "KEGG_VALINE_LEU..." "KEGG_VALINE_LEU..." "KEGG_LYSINE_DEG..."
[21] "KEGG_ARGININE_A..." "KEGG_HISTIDINE_..." "KEGG_TYROSINE_M..." "KEGG_PHENYLALAN..." "KEGG_TRYPTOPHAN..."
[26] "KEGG_BETA_ALANI..." "KEGG_TAURINE_AN..." "KEGG_SELENOAMIN..." "KEGG_GLUTATHION..." "KEGG_STARCH_AND..."
[31] "KEGG_N_GLYCAN_B..." "KEGG_OTHER_GLYC..." "KEGG_O_GLYCAN_B..." "KEGG_AMINO_SUGA..." "KEGG_GLYCOSAMIN..."
[36] "KEGG_GLYCOSAMIN..." "KEGG_GLYCOSAMIN..." "KEGG_GLYCOSAMIN..." "KEGG_GLYCEROLIP..." "KEGG_INOSITOL_P..."
[41] "KEGG_GLYCOSYLPH..." "KEGG_GLYCEROPHO..." "KEGG_ETHER_LIPI..." "KEGG_ARACHIDONI..." "KEGG_LINOLEIC_A..."
[46] "KEGG_ALPHA_LINO..." "KEGG_SPHINGOLIP..." "KEGG_GLYCOSPHIN..." "KEGG_GLYCOSPHIN..." "KEGG_GLYCOSPHIN..."

First 50 (out of 21095) gene names:
 [1] "ACSS2"   "GCK"     "PGK2"    "PGK1"    "PDHB"    "PDHA1"   "PDHA2"   "PGM2"    "TPI1"    "ACSS1"   "FBP1"   
[12] "ADH1B"   "HK2"     "ADH1C"   "HK1"     "HK3"     "ADH4"    "PGAM2"   "ADH5"    "PGAM1"   "ADH1A"   "ALDOC"  
[23] "ALDH7A1" "LDHAL6B" "PKLR"    "LDHAL6A" "ENO1"    "PKM2"    "PFKP"    "BPGM"    "PCK2"    "PCK1"    "ALDH1B1"
[34] "ALDH2"   "ALDH3A1" "AKR1A1"  "FBP2"    "PFKM"    "PFKL"    "LDHC"    "GAPDH"   "ENO3"    "ENO2"    "PGAM4"  
[45] "ADH7"    "ADH6"    "LDHB"    "ALDH1A3" "ALDH3B1" "ALDH3B2"

Gene set size summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   5.00   18.00   39.00   93.86   88.00 1972.00 

First 10 gene sets with additional info:
      Gene set                  Additional info                                                  
 [1,] "KEGG_GLYCOLYSIS_GLUC..." "http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_GLYCOLY..."
 [2,] "KEGG_CITRATE_CYCLE_T..." "http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_CITRATE..."
 [3,] "KEGG_PENTOSE_PHOSPHA..." "http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_PENTOSE..."
 [4,] "KEGG_PENTOSE_AND_GLU..." "http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_PENTOSE..."
 [5,] "KEGG_FRUCTOSE_AND_MA..." "http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_FRUCTOS..."
 [6,] "KEGG_GALACTOSE_METAB..." "http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_GALACTO..."
 [7,] "KEGG_ASCORBATE_AND_A..." "http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_ASCORBA..."
 [8,] "KEGG_FATTY_ACID_META..." "http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_FATTY_A..."
 [9,] "KEGG_STEROID_BIOSYNT..." "http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_STEROID..."
[10,] "KEGG_PRIMARY_BILE_AC..." "http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_PRIMARY..."

while > GSC.C2.msigdb$gsc

$KEGG_GLYCOLYSIS_GLUCONEOGENESIS
 [1] "ACSS2"   "GCK"     "PGK2"    "PGK1"    "PDHB"    "PDHA1"   "PDHA2"   "PGM2"    "TPI1"    "ACSS1"   "FBP1"   
[12] "ADH1B"   "HK2"     "ADH1C"   "HK1"     "HK3"     "ADH4"    "PGAM2"   "ADH5"    "PGAM1"   "ADH1A"   "ALDOC"  
[23] "ALDH7A1" "LDHAL6B" "PKLR"    "LDHAL6A" "ENO1"    "PKM2"    "PFKP"    "BPGM"    "PCK2"    "PCK1"    "ALDH1B1"
[34] "ALDH2"   "ALDH3A1" "AKR1A1"  "FBP2"    "PFKM"    "PFKL"    "LDHC"    "GAPDH"   "ENO3"    "ENO2"    "PGAM4"  
[45] "ADH7"    "ADH6"    "LDHB"    "ALDH1A3" "ALDH3B1" "ALDH3B2" "ALDH9A1" "ALDH3A2" "GALM"    "ALDOA"   "DLD"    
[56] "DLAT"    "ALDOB"   "G6PC2"   "LDHA"    "G6PC"    "PGM1"    "GPI"    

$KEGG_CITRATE_CYCLE_TCA_CYCLE
 [1] "IDH3B"     "DLST"      "PCK2"      "CS"        "PDHB"      "PCK1"      "PDHA1"     "LOC642502" "PDHA2"    
[10] "LOC283398" "FH"        "SDHD"      "OGDH"      "SDHB"      "IDH3A"     "SDHC"      "IDH2"      "IDH1"     
[19] "ACO1"      "ACLY"      "MDH2"      "DLD"       "MDH1"      "DLAT"      "OGDHL"     "PC"        "SDHA"     
[28] "SUCLG1"    "SUCLA2"    "SUCLG2"    "IDH3G"     "ACO2"   
...
...
...

ADD REPLY • link 7.1 years ago by arronar ▴ 290

2

Entering edit mode

And the rownames of the count matrix you are using are the same symbol ids?

ADD REPLY • link 7.1 years ago by Lluís R. ★ 1.2k

1

Entering edit mode

Oh. my god. row.names are in lowercase while the gene-sets in uppercase. That is the cause of the problem.

ADD REPLY • link 7.1 years ago by arronar ▴ 290

1

Entering edit mode

I hope that while finding the solution to this problem you also learned how to track down your bugs and solve them. Good luck!

ADD REPLY • link 7.1 years ago by Lluís R. ★ 1.2k

0

Entering edit mode

If you aren't applying any filter by pathway size it might be a bug. Post it in support.bioconductor.org or https://github.com/rcastelo/GSVA/issues

ADD REPLY • link 7.1 years ago by Lluís R. ★ 1.2k

0

Entering edit mode

I'm not applying any filter

ADD REPLY • link 7.1 years ago by arronar ▴ 290

0

Entering edit mode

You may not be [applying filtering] but programs always perform filtering behind the scenes. You should review all possible parameters that can be used with gsva() and then modify those that could be producing the observed effect.

From what I gather, if no statistically significant enrichment can be performed for a sample, then it will not be included.

ADD REPLY • link 7.1 years ago by Kevin Blighe 89k

0

Entering edit mode

No, gsva doesn't have any filtering parameter, the problem is with the [1:186] part

ADD REPLY • link 7.1 years ago by Lluís R. ★ 1.2k

0

Entering edit mode

Did not read anywhere in the manual that it performs such kind of filtering

ADD REPLY • link 7.1 years ago by arronar ▴ 290

0

Entering edit mode

Old thread but you can use the gmtPathways command from the fgsea package to get the list of pathways as follows:

pathways.list <- gmtPathways("C:/Path/To/File/c2.all.v7.0.symbols.gmt")

then run for example:

gsva_results <- gsva(data, pathways.list, method = "ssgsea", kcdf = "Poisson", min.sz = 1, max.sz = 500, mx.diff = TRUE, verbose = T)

ADD REPLY • link 3.3 years ago by Christopher Walker ▴ 80