Why GSVA returns a matrix with fewer gene-sets?
0
0
Entering edit mode
6.7 years ago
arronar ▴ 290

Hi.

I'm trying to run GSVA, using microarray data and using 186 KEGG gene-sets. While I was waiting the output matrix to be 186(pathways)xsamples , is 6(pathways)xsamples. It returns only enrichment values for only 6 of the provided gene-sets and not for all of them. Is that something expected or am I doing something wrong with the code?

Thanks.

microarray GSVA • 3.4k views
ADD COMMENT
0
Entering edit mode

How many samples do you have on the original microarray data? If you had 6 samples then the expected output is 186(pathways provided)x6(samples)

ADD REPLY
0
Entering edit mode

I'm having 72 samples. and I'm getting back 6(pathways)x72(samples) instead of 1866(pathways)x72(samples)

Here is the command I'm using

gsva_res <- gsva( data, GSC.C2.msigdb$gsc[1:186], mx.diff=FALSE, verbose=TRUE)

and in the console it returns :

Estimating GSVA scores for 6 gene sets.

Computing observed enrichment scores

Estimating ECDFs in microarray data with Gaussian kernels

The gene set collection was built by

GSC.C2.msigdb <- loadGSC(file="geneSets/c2.all.v6.1.symbols.gmt", type="gmt")

ADD REPLY
1
Entering edit mode

Ok, are you sure you are using 186 Gene sets?? It seems like it is subseting by genes and not gene sets (subseting by number in a GeneSetCollection produces this, while subseting by name subsets by GeneSet name) or something alike. Where does the loadGSC function come from? Which class is the GSC.C2.msigdb$gsc?

ADD REPLY
0
Entering edit mode

loadGSC is from piano package. and class(GSC.C2.msigdb$gsc) returns "list"

Also I tried to provide it a plain list that created as KEGG <- GSC.C2.msigdb$gsc[1:186] which class(KEGG) is list and length(KEGG) is 186, but still the same results.

ADD REPLY
1
Entering edit mode

And which type of list is it: one with genes: pathways or pathways genes? Load it as a GeneSetCollection and post here the output of printing it. I think that there are some gene sets with empty genes, or could be that the genes are not mapped to your matrix of probes/genes per sample

ADD REPLY
0
Entering edit mode

> GSC.C2.msigdb returns :

First 50 (out of 4738) gene set names:
 [1] "KEGG_GLYCOLYSIS..." "KEGG_CITRATE_CY..." "KEGG_PENTOSE_PH..." "KEGG_PENTOSE_AN..." "KEGG_FRUCTOSE_A..."
 [6] "KEGG_GALACTOSE_..." "KEGG_ASCORBATE_..." "KEGG_FATTY_ACID..." "KEGG_STEROID_BI..." "KEGG_PRIMARY_BI..."
[11] "KEGG_STEROID_HO..." "KEGG_OXIDATIVE_..." "KEGG_PURINE_MET..." "KEGG_PYRIMIDINE..." "KEGG_ALANINE_AS..."
[16] "KEGG_GLYCINE_SE..." "KEGG_CYSTEINE_A..." "KEGG_VALINE_LEU..." "KEGG_VALINE_LEU..." "KEGG_LYSINE_DEG..."
[21] "KEGG_ARGININE_A..." "KEGG_HISTIDINE_..." "KEGG_TYROSINE_M..." "KEGG_PHENYLALAN..." "KEGG_TRYPTOPHAN..."
[26] "KEGG_BETA_ALANI..." "KEGG_TAURINE_AN..." "KEGG_SELENOAMIN..." "KEGG_GLUTATHION..." "KEGG_STARCH_AND..."
[31] "KEGG_N_GLYCAN_B..." "KEGG_OTHER_GLYC..." "KEGG_O_GLYCAN_B..." "KEGG_AMINO_SUGA..." "KEGG_GLYCOSAMIN..."
[36] "KEGG_GLYCOSAMIN..." "KEGG_GLYCOSAMIN..." "KEGG_GLYCOSAMIN..." "KEGG_GLYCEROLIP..." "KEGG_INOSITOL_P..."
[41] "KEGG_GLYCOSYLPH..." "KEGG_GLYCEROPHO..." "KEGG_ETHER_LIPI..." "KEGG_ARACHIDONI..." "KEGG_LINOLEIC_A..."
[46] "KEGG_ALPHA_LINO..." "KEGG_SPHINGOLIP..." "KEGG_GLYCOSPHIN..." "KEGG_GLYCOSPHIN..." "KEGG_GLYCOSPHIN..."

First 50 (out of 21095) gene names:
 [1] "ACSS2"   "GCK"     "PGK2"    "PGK1"    "PDHB"    "PDHA1"   "PDHA2"   "PGM2"    "TPI1"    "ACSS1"   "FBP1"   
[12] "ADH1B"   "HK2"     "ADH1C"   "HK1"     "HK3"     "ADH4"    "PGAM2"   "ADH5"    "PGAM1"   "ADH1A"   "ALDOC"  
[23] "ALDH7A1" "LDHAL6B" "PKLR"    "LDHAL6A" "ENO1"    "PKM2"    "PFKP"    "BPGM"    "PCK2"    "PCK1"    "ALDH1B1"
[34] "ALDH2"   "ALDH3A1" "AKR1A1"  "FBP2"    "PFKM"    "PFKL"    "LDHC"    "GAPDH"   "ENO3"    "ENO2"    "PGAM4"  
[45] "ADH7"    "ADH6"    "LDHB"    "ALDH1A3" "ALDH3B1" "ALDH3B2"

Gene set size summary:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   5.00   18.00   39.00   93.86   88.00 1972.00 

First 10 gene sets with additional info:
      Gene set                  Additional info                                                  
 [1,] "KEGG_GLYCOLYSIS_GLUC..." "http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_GLYCOLY..."
 [2,] "KEGG_CITRATE_CYCLE_T..." "http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_CITRATE..."
 [3,] "KEGG_PENTOSE_PHOSPHA..." "http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_PENTOSE..."
 [4,] "KEGG_PENTOSE_AND_GLU..." "http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_PENTOSE..."
 [5,] "KEGG_FRUCTOSE_AND_MA..." "http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_FRUCTOS..."
 [6,] "KEGG_GALACTOSE_METAB..." "http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_GALACTO..."
 [7,] "KEGG_ASCORBATE_AND_A..." "http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_ASCORBA..."
 [8,] "KEGG_FATTY_ACID_META..." "http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_FATTY_A..."
 [9,] "KEGG_STEROID_BIOSYNT..." "http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_STEROID..."
[10,] "KEGG_PRIMARY_BILE_AC..." "http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_PRIMARY..."

while > GSC.C2.msigdb$gsc

$KEGG_GLYCOLYSIS_GLUCONEOGENESIS
 [1] "ACSS2"   "GCK"     "PGK2"    "PGK1"    "PDHB"    "PDHA1"   "PDHA2"   "PGM2"    "TPI1"    "ACSS1"   "FBP1"   
[12] "ADH1B"   "HK2"     "ADH1C"   "HK1"     "HK3"     "ADH4"    "PGAM2"   "ADH5"    "PGAM1"   "ADH1A"   "ALDOC"  
[23] "ALDH7A1" "LDHAL6B" "PKLR"    "LDHAL6A" "ENO1"    "PKM2"    "PFKP"    "BPGM"    "PCK2"    "PCK1"    "ALDH1B1"
[34] "ALDH2"   "ALDH3A1" "AKR1A1"  "FBP2"    "PFKM"    "PFKL"    "LDHC"    "GAPDH"   "ENO3"    "ENO2"    "PGAM4"  
[45] "ADH7"    "ADH6"    "LDHB"    "ALDH1A3" "ALDH3B1" "ALDH3B2" "ALDH9A1" "ALDH3A2" "GALM"    "ALDOA"   "DLD"    
[56] "DLAT"    "ALDOB"   "G6PC2"   "LDHA"    "G6PC"    "PGM1"    "GPI"    

$KEGG_CITRATE_CYCLE_TCA_CYCLE
 [1] "IDH3B"     "DLST"      "PCK2"      "CS"        "PDHB"      "PCK1"      "PDHA1"     "LOC642502" "PDHA2"    
[10] "LOC283398" "FH"        "SDHD"      "OGDH"      "SDHB"      "IDH3A"     "SDHC"      "IDH2"      "IDH1"     
[19] "ACO1"      "ACLY"      "MDH2"      "DLD"       "MDH1"      "DLAT"      "OGDHL"     "PC"        "SDHA"     
[28] "SUCLG1"    "SUCLA2"    "SUCLG2"    "IDH3G"     "ACO2"   
...
...
...
ADD REPLY
2
Entering edit mode

And the rownames of the count matrix you are using are the same symbol ids?

ADD REPLY
1
Entering edit mode

Oh. my god. row.names are in lowercase while the gene-sets in uppercase. That is the cause of the problem.

ADD REPLY
1
Entering edit mode

I hope that while finding the solution to this problem you also learned how to track down your bugs and solve them. Good luck!

ADD REPLY
0
Entering edit mode

If you aren't applying any filter by pathway size it might be a bug. Post it in support.bioconductor.org or https://github.com/rcastelo/GSVA/issues

ADD REPLY
0
Entering edit mode

I'm not applying any filter

ADD REPLY
0
Entering edit mode

You may not be [applying filtering] but programs always perform filtering behind the scenes. You should review all possible parameters that can be used with gsva() and then modify those that could be producing the observed effect.

From what I gather, if no statistically significant enrichment can be performed for a sample, then it will not be included.

ADD REPLY
0
Entering edit mode

No, gsva doesn't have any filtering parameter, the problem is with the [1:186] part

ADD REPLY
0
Entering edit mode

Did not read anywhere in the manual that it performs such kind of filtering

ADD REPLY
0
Entering edit mode

Old thread but you can use the gmtPathways command from the fgsea package to get the list of pathways as follows:

pathways.list <- gmtPathways("C:/Path/To/File/c2.all.v7.0.symbols.gmt") 

then run for example:

gsva_results <- gsva(data, pathways.list, method = "ssgsea", kcdf = "Poisson", min.sz = 1, max.sz = 500, mx.diff = TRUE, verbose = T)
ADD REPLY

Login before adding your answer.

Traffic: 2836 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6