Question

GSVA R packages

2

Entering edit mode

4.9 years ago

Morris_Chair ▴ 370

Hello everyone, I'm trying to do a gene set varian analysis using R to detect a specific gene set signature of a specific pathway from 20 samples of RNA-seq. I have this files in BAM format but I don't know what to do in order to get a panel like this one in the pic! which R packages are required?

on the Y axis there is the score for a specific singling pathway while the x axis represents the different samples.

Thank you

GSVA • 7.2k views

ADD COMMENT • link updated 3.5 years ago by Kevin Blighe 89k • written 4.9 years ago by Morris_Chair ▴ 370

0

Entering edit mode

Hi Kevin,

I created the rlog matrix.txt and run GSVA, but I m doing something wrong, To run GSVA I m following this vignette:

https://bioconductor.org/packages/release/bioc/vignettes/GSVA/inst/doc/GSVA.pdf

library(GSEABase)
library(GSVAdata)
data(c2BroadSets)
c2BroadSets

canonicalC2BroadSets <-c2BroadSets[c(grep("^KEGG",names(c2BroadSets)),
                                     grep("^REACTOME", names(c2BroadSets)),
                                     grep("^BIOCARTA", names(c2BroadSets)))]

canonicalC2BroadSets
data(genderGenesEntrez)

MSY <- GeneSet(msYgenesEntrez, geneIdType=EntrezIdentifier(),collectionType=BroadCollection(category="c2"), setName="MSY")

XiE <- GeneSet(XiEgenesEntrez, geneIdType=EntrezIdentifier(),collectionType=BroadCollection(category="c2"), setName="XiE")

canonicalC2BroadSets <- GeneSetCollection(c(canonicalC2BroadSets, MSY, XiE))

mydata <- read.table(file= "mytable.csv",sep="",header=TRUE, row.names = 1)

esrnaseq <- gsva(mydata, canonicalC2BroadSets, min.sz=5, max.sz=500,
                 kcdf="Poisson", mx.diff=TRUE, verbose=FALSE, parallel.sz=1)

but at this point I have an error : Error in gsva(mydata, canonicalC2BroadSets, min.sz = 5, max.sz = 500, : could not find function "gsva"

I'm not familiar with this script, can you tell me how to use it?

Do I need of others libraries?

thank you very much

ADD REPLY • link 4.8 years ago by Morris_Chair ▴ 370

0

Entering edit mode

Hey Morris, I think that you also need to load the main package, like this:

require(GSVA)

ADD REPLY • link 4.8 years ago by Kevin Blighe 89k

score 4 · Answer 1 · 2020-04-09

4

Entering edit mode

4.9 years ago

Kevin Blighe 89k

Hello,

Your best approach would be this:

produce a raw counts matrix for your BAM files (HT-seq, featureCounts, or something else)
normalise the raw counts (EdgeR, DESeq2, etc)
transform the normalised counts via, e.g., regularised log, variance stabilisation, or log2 (CPM + 1)
Run GSVA with the 'C2' Broad Institute curated datasets

The output of GSVA will be what you can then use to generate bar plot for each enriched signature / pathway.

I will leave the specifics of each step to you.

Kevin

ADD COMMENT • link 4.8 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin, I have this error

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘gsva’ for signature ‘"data.frame", "GeneSetCollection"’

probably it's not right the format of the matrix that I'm using or I'm missing a step?

Thank you

ADD REPLY • link 4.8 years ago by Morris_Chair ▴ 370

0

Entering edit mode

Hey, perhaps it has to be a data matrix?

ADD REPLY • link 4.8 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin, I converted the file from data.frame to a "as.matrix" ... but another error occurs

esrnaseq <- gsva(mydata, canonicalC2BroadSets, min.sz=5, max.sz=500, + kcdf="Poisson", mx.diff=TRUE, verbose=FALSE, parallel.sz=1)

Error in .local(expr, gset.idx.list, ...) : 
  Less than two genes in the input expression data matrix
In addition: Warning messages:
1: In .local(expr, gset.idx.list, ...) :
  55046 genes with constant expression values throuhgout the samples.
2: In .local(expr, gset.idx.list, ...) :
  Since argument method!="ssgsea", genes with constant expression values are discarded.

thank you for your help

ADD REPLY • link 4.8 years ago by Morris_Chair ▴ 370

2

Entering edit mode

Oh, here is the worrying part:

  55046 genes with constant expression values throughout the samples.

What if you do this (before any conversion to data matrix):

table(apply(mydata, 2, function(x) var(x, na.rm = TRUE) == 0)
table(apply(mydata, 1, function(x) var(x, na.rm = TRUE) == 0)

That will tabulate samples and then genes of 0 variance, i.e., constant values

Also, make sure that you retry gsva() with data.matrix(), not as.matrix()

Did you pre-filter your data?

ADD REPLY • link 4.8 years ago by Kevin Blighe 89k

0

Entering edit mode

I filtered the matrix by removing empty rows and rows with values assigned with duplicated gene names because they were giving me error as well.

Here is the output seems that the error persists

> mydata <- read.table(file= "mytable_genename.csv",sep="",header=TRUE, row.names = 1)
> table(apply(mydata, 2, function(x) var(x, na.rm = TRUE) == 0))
< table of extent 0 >
> table(apply(mydata, 1, function(x) var(x, na.rm = TRUE) == 0))
< table of extent 0 >
> mydata<-data.matrix(mydata)


> esrnaseq <- gsva(mydata, canonicalC2BroadSets, min.sz=5, max.sz=500,
+                  kcdf="Poisson", mx.diff=TRUE, verbose=FALSE, parallel.sz=1)
Error in .local(expr, gset.idx.list, ...) : 
  Less than two genes in the input expression data matrix
In addition: Warning messages:
1: In .local(expr, gset.idx.list, ...) :
  55046 genes with constant expression values throuhgout the samples.
2: In .local(expr, gset.idx.list, ...) :
  Since argument method!="ssgsea", genes with constant expression values are discarded.

Thank you

ADD REPLY • link 4.8 years ago by Morris_Chair ▴ 370

0

Entering edit mode

Hey Morris, all good? Why is your separater an empty field - sep="" ? What is the delimiter used in your file, mytable_genename.csv ?

ADD REPLY • link 4.8 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin, I m ok for now thanks :)

I thought that maybe separator could be a problem so I removed it, now because you told me I converted the file from .csv to .txt without specifying the separator and I have a different result:

> mydata <- read.table(file= "mytable_genename.txt",sep="",header=TRUE, row.names = 1)
> table(apply(mydata, 2, function(x) var(x, na.rm = TRUE) == 0))

FALSE 
   18 
> table(apply(mydata, 1, function(x) var(x, na.rm = TRUE) == 0))

FALSE  TRUE 
50640  4406 
> mydata<-data.matrix(mydata)

> esrnaseq <- gsva(mydata, canonicalC2BroadSets, min.sz=5, max.sz=500,
+                  kcdf="Poisson", mx.diff=TRUE, verbose=FALSE, parallel.sz=1)
Error in .local(expr, gset.idx.list, ...) : 
  No identifiers in the gene sets could be matched to the identifiers in the expression data.
In addition: Warning messages:
1: In .local(expr, gset.idx.list, ...) :
  4406 genes with constant expression values throuhgout the samples.
2: In .local(expr, gset.idx.list, ...) :
  Since argument method!="ssgsea", genes with constant expression values are discarded.

Maybe I should be using the ensemble for the gene name..

ADD REPLY • link 4.8 years ago by Morris_Chair ▴ 370

0

Entering edit mode

Cool, oh, but, can you show the actual contents of the file? Using sep="", irrespective of the file extension, will cause problems.

How are the fields / columns separated inside the file? You can open it with Notepad or Gedit (Ubuntu), or some other text editor.

ADD REPLY • link 4.8 years ago by Kevin Blighe 89k

0

Entering edit mode

I loaded the file into dropbox, if it's ok for you, you can download it from here

https://www.dropbox.com/s/yqa8yeialsxrlme/mytable_genename.txt?dl=0

thank you

ADD REPLY • link 4.8 years ago by Morris_Chair ▴ 370

0

Entering edit mode

My time is quite limited - sorry. You literally just need to look in the file to see how the columns are separated. It is most likely a comma or tab-space. So, sep = ',' or sep = '\t' should work.

ADD REPLY • link 4.8 years ago by Kevin Blighe 89k

0

Entering edit mode

it was separated by tab-space but the result didn't changed after editing the code :/

> mydata <- read.table(file= "mytable_genename.txt",sep="\t",header=TRUE, row.names = 1)
> table(apply(mydata, 2, function(x) var(x, na.rm = TRUE) == 0))

FALSE 
   18 
> table(apply(mydata, 1, function(x) var(x, na.rm = TRUE) == 0))

FALSE  TRUE 
50640  4406 

> mydata<-data.matrix(mydata)

> esrnaseq <- gsva(mydata, canonicalC2BroadSets, min.sz=5, max.sz=500,
+                  kcdf="Poisson", mx.diff=TRUE, verbose=FALSE, parallel.sz=1)
Error in .local(expr, gset.idx.list, ...) : 
  No identifiers in the gene sets could be matched to the identifiers in the expression data.
In addition: Warning messages:
1: In .local(expr, gset.idx.list, ...) :
  4406 genes with constant expression values throuhgout the samples.
2: In .local(expr, gset.idx.list, ...) :
  Since argument method!="ssgsea", genes with constant expression values are discarded.

thank you

ADD REPLY • link 4.8 years ago by Morris_Chair ▴ 370

0

Entering edit mode

Yes, the problem is that you have 4406 genes with 0 variance, so, there is nothing that can be done with these. You will have to remove them prior to performing GSVA. With the apply() function, you can basically create a vector of boolean values for TRUE and FALSE.

For example:

keep <- apply(mydata, 1, function(x) var(x, na.rm = TRUE) > 0

mydata<-data.matrix(mydata[keep,])

Now I am going out for a jog, while social distancing. Hope that it helps!

ADD REPLY • link 4.8 years ago by Kevin Blighe 89k

1

Entering edit mode

I will work on this, thank you, lucky you that you can jog, enjoy :)

ADD REPLY • link 4.8 years ago by Morris_Chair ▴ 370

0

Entering edit mode

Hello Kevin Blighe,

I have one more Question, In the GSVA help page it is suggested to use tokcdf="Poisson" for a suitable modelisation of RNA-seq (count Data)

By default,kcdf="Gaussian"which is suitable wheninput expression values are continuous,
such as microarray fluorescent units inlogarithmic scale,  RNA-seq  log-CPMs,  log-RPKMs or log-TPMs.
When in-put expression values are integer counts, such as those derived from RNA-seqexperiments, 
then this argument should be set tokcdf="Poisson"

but in your comment you suggest a 2 step normalisation of raw counts (DESeq2) and log transformation in the DESeq2 package we can do

counts(dds, normalized=TRUE) # normalised count

vsd <- vst(dds, blind=FALSE) # variance stabilizing transformation

rld <- rlog(dds, blind=FALSE) # regularized log' transformation and ntd <- normTransform(dds) # Normalized counts transformation

Can you explain which one to use ? Should I use a diagnostic plot ?

ADD REPLY • link 3.8 years ago by Z_Wael ▴ 10

1

Entering edit mode

Hello, if you wish to use "Poisson", then you should use counts(dds, normalized=TRUE)

ADD REPLY • link 3.8 years ago by Kevin Blighe 89k

1

Entering edit mode

Hi Kevin Blighe thank you for your feed back.

ADD REPLY • link 3.6 years ago by Z_Wael ▴ 10

0

Entering edit mode

Hello Kevin Blighe, thanks for your reply, it works great with the argument kcdf="Poisson" and the counts formalisation with DESeq2. But what is for you the best use (kcdf option and DESeq2 normalisation method) for ssgsea ?

ADD REPLY • link 3.5 years ago by Z_Wael ▴ 10

1

Entering edit mode

Hi, I have no huge opinion on that. I think that you can choose any kcdf with any method?

ADD REPLY • link 3.5 years ago by Kevin Blighe 89k