I'm using the bioconductor package GSVA (http://bioconductor.org/packages/GSVA) to perform a GSVA analysis using RNA-seq data from GTEX and I'm confused about which dataset I need to use and if I need to perform some pre-processing steps.
According to the GSVA package vignette the input should be RNA-seq counts, does this means that I need to use the file "GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_reads.gct.gz" instead of "GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct.gz"? and should I do something before use it as input?
The parameter for gsva() about which you need to be aware is kcdf. kcdf can have the values "Gaussian" or "Poisson". If your input data is normalised RNA-seq counts, then choose "Poisson", i.e., Poisson distribution. If your data is regularised log, variance-stabilised, or other logged data, including microarray normalised expression values, then choose "Gaussian", i.e. Gaussin distribution.
You can check the distribution of your data with the hist() command.
If you are planning to use the RPKM counts, then choose "Poisson". First ensure that these are not logged by generating the histogram.
I previously used GSVA for GTEx data. I took the raw counts (_gene_reads.gct.gz) and normalised these in DESeq2. I then transformed them via regularised log transformation and used kcdf="Gaussian".
Hey, I have no major preference for filtering on samples. Usually a PCA bi-plot can ehlp to determine if a sample should be excluded or not. You could also look at similar metrics, like mean and max expression (per sample). Most people filter on the genes.
Just one last thing Kevin. I noticed that after the rlog step the samples are not exactly normally distributed (link below), so I scaled the values using gtex <- apply(gtex, 2, scale) but after doing that the results from the gsva are different. My question is: is it really meaningful to scale the values? or should I keep as they are after rlog
Hey, rlog never usually returns data that perfectly follows a normal distribution. In the past, I have often converted rlog data to Z-scale, like you have done. I think that this approach is fine.
apply(gtex, 2, scale) should produce the same output as t(scale(t(gtex))), I think, i.e., scaling by gene.
Edit: with apply(gtex, 2, scale), you are actually scaling each column. You may want to just try the t(scale(t(x))) approach.
Note that the scale function with default settings is essentially Z-transforming your data. It is very useful.
It was a difficult choice... I used Gaussian due to the fact that my input was the regularised log expression levels. The regularised log histogram profile can be difficult to interpret, though, but mostly follows a Gaussian.
Hey Kevin, This is very helpful. Many thanks.
Following your suggestions what I did was to filter genes with low count first from the raw counts file (.gene_reads.gct.gz) using:
Then I transformed the data to a DESeq object with:
And finally rlog transform:
Is this correct or there's something that I'm missing before use the data as input for gsva?
Hey, I believe you are missing one step just before the
rlog()
function, such as:Otherwise, you are performing rlog on the raw counts.
Also, your filter with
rowSums()
seems very high. You can try different filters, though. I usually go by mean <= 10 or 20.Ups! Thanks Kevin. May I ask you if I should apply some type of filtering for the samples (individuals)?
Hey, I have no major preference for filtering on samples. Usually a PCA bi-plot can ehlp to determine if a sample should be excluded or not. You could also look at similar metrics, like mean and max expression (per sample). Most people filter on the genes.
Just one last thing Kevin. I noticed that after the rlog step the samples are not exactly normally distributed (link below), so I scaled the values using
gtex <- apply(gtex, 2, scale)
but after doing that the results from the gsva are different. My question is: is it really meaningful to scale the values? or should I keep as they are after rloghttps://imgur.com/a/ltOdYSE
Hey, rlog never usually returns data that perfectly follows a normal distribution. In the past, I have often converted rlog data to Z-scale, like you have done. I think that this approach is fine.
apply(gtex, 2, scale)
should produce the same output ast(scale(t(gtex)))
, I think, i.e., scaling by gene.Edit: with
apply(gtex, 2, scale)
, you are actually scaling each column. You may want to just try thet(scale(t(x)))
approach.Note that the scale function with default settings is essentially Z-transforming your data. It is very useful.
Hi @Kevin. Is there a reason you used the rlog transformation (with kcdf="Gaussian") over the normalized count (with kcdf="Poisson") ?
It was a difficult choice... I used Gaussian due to the fact that my input was the regularised log expression levels. The regularised log histogram profile can be difficult to interpret, though, but mostly follows a Gaussian.