Question

GSVA on RNAseq data (GTEX)

1

Entering edit mode

6.7 years ago

kakukeshi ▴ 90

I'm using the bioconductor package GSVA (http://bioconductor.org/packages/GSVA) to perform a GSVA analysis using RNA-seq data from GTEX and I'm confused about which dataset I need to use and if I need to perform some pre-processing steps.

According to the GSVA package vignette the input should be RNA-seq counts, does this means that I need to use the file "GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_reads.gct.gz" instead of "GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct.gz"? and should I do something before use it as input?

Thanks

RNA-Seq rna-seq r R • 6.1k views

ADD COMMENT • link updated 6.7 years ago by Kevin Blighe 89k • written 6.7 years ago by kakukeshi ▴ 90

score 3 · Answer 1 · 2019-03-13

3

Entering edit mode

6.7 years ago

Kevin Blighe 89k

Hey kakukeshi,

This is a good question.

The parameter for gsva() about which you need to be aware is kcdf. kcdf can have the values "Gaussian" or "Poisson". If your input data is normalised RNA-seq counts, then choose "Poisson", i.e., Poisson distribution. If your data is regularised log, variance-stabilised, or other logged data, including microarray normalised expression values, then choose "Gaussian", i.e. Gaussin distribution.

You can check the distribution of your data with the hist() command.

If you are planning to use the RPKM counts, then choose "Poisson". First ensure that these are not logged by generating the histogram.

I previously used GSVA for GTEx data. I took the raw counts (_gene_reads.gct.gz) and normalised these in DESeq2. I then transformed them via regularised log transformation and used kcdf="Gaussian".

Kevin

ADD COMMENT • link 6.7 years ago by Kevin Blighe 89k

0

Entering edit mode

Hey Kevin, This is very helpful. Many thanks.

Following your suggestions what I did was to filter genes with low count first from the raw counts file (.gene_reads.gct.gz) using:

i.rm <- which(rowSums(subsetdat)<5000)
subsetdat <- subsetdat[-i.rm,]

Then I transformed the data to a DESeq object with:

subsetDES <- DESeqDataSetFromMatrix(countData = subsetdat,
                                  colData = ctab,
                                  design = ~ 1)

And finally rlog transform:

rld <- rlog(subsetDES)

Is this correct or there's something that I'm missing before use the data as input for gsva?

ADD REPLY • link 6.7 years ago by kakukeshi ▴ 90

0

Entering edit mode

Hey, I believe you are missing one step just before the rlog() function, such as:

subsetDES <- DESeq(subsetDES)

Otherwise, you are performing rlog on the raw counts.

Also, your filter with rowSums() seems very high. You can try different filters, though. I usually go by mean <= 10 or 20.

ADD REPLY • link 6.7 years ago by Kevin Blighe 89k

0

Entering edit mode

Ups! Thanks Kevin. May I ask you if I should apply some type of filtering for the samples (individuals)?

ADD REPLY • link 6.7 years ago by kakukeshi ▴ 90

0

Entering edit mode

Hey, I have no major preference for filtering on samples. Usually a PCA bi-plot can ehlp to determine if a sample should be excluded or not. You could also look at similar metrics, like mean and max expression (per sample). Most people filter on the genes.

ADD REPLY • link 6.7 years ago by Kevin Blighe 89k

0

Entering edit mode

Just one last thing Kevin. I noticed that after the rlog step the samples are not exactly normally distributed (link below), so I scaled the values using gtex <- apply(gtex, 2, scale) but after doing that the results from the gsva are different. My question is: is it really meaningful to scale the values? or should I keep as they are after rlog

https://imgur.com/a/ltOdYSE

ADD REPLY • link 6.7 years ago by kakukeshi ▴ 90

2

Entering edit mode

Hey, rlog never usually returns data that perfectly follows a normal distribution. In the past, I have often converted rlog data to Z-scale, like you have done. I think that this approach is fine.

apply(gtex, 2, scale) should produce the same output as t(scale(t(gtex))), I think, i.e., scaling by gene.

Edit: with apply(gtex, 2, scale), you are actually scaling each column. You may want to just try the t(scale(t(x))) approach.

Note that the scale function with default settings is essentially Z-transforming your data. It is very useful.

ADD REPLY • link 6.7 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi @Kevin. Is there a reason you used the rlog transformation (with kcdf="Gaussian") over the normalized count (with kcdf="Poisson") ?

ADD REPLY • link 5.8 years ago by Nicolas Rosewick 11k

0

Entering edit mode

It was a difficult choice... I used Gaussian due to the fact that my input was the regularised log expression levels. The regularised log histogram profile can be difficult to interpret, though, but mostly follows a Gaussian.

ADD REPLY • link 5.8 years ago by Kevin Blighe 89k