Hi all, this question has indirectly come up several times: what is the best kcdf setting to use for GSVA analysis on TPM data that has been neither log-transformed nor variance-normalized?
For GSVA analysis using RNA-seq data, the GSVA manual states:
"We calculate now GSVA enrichment scores for these gene sets using first the microarray data and then the RNA-seq integer count data. Note that the only requirement to do the latter is to set the argument kcdf="Poisson" which is "Gaussian" by default. Note, however, that if our RNA-seq derived expression levels would be continuous, such as log-CPMs, log-RPKMs or log-TPMs, the default value of the kcdf argument should remain unchanged."
I assume that TPM data without variance normalization should be handled with the "Poisson" argument. However, after length normalization, most TPM values end up non-integer. I realize that this is the result of a linear transformation, so the underlying structure of the data is unchanged, but the manual seems to imply that the Gaussian setting may be appropriate for non-integer data, which would include TPM without variance normalization.
I clearly don't understand the nuances of this setting, but I'm wondering what other people's thoughts, suggestions, or explanations are on this topic. For now, I'm just applying log1p to my TPM data and using the "Gaussian" argument, which runs much faster.
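To make the preprocessing concrete, here is a minimal sketch in plain Python of what I mean, not the GSVA call itself (that is done in R and is not shown). The counts and gene lengths are made up, and the helper names `counts_to_tpm` and `log1p_transform` are just for illustration. It shows that integer counts become non-integer after length normalization, and that log1p (natural log of 1 + x, whereas log-TPMs are often reported in log2) keeps zeros at zero and preserves the ordering of genes.

```python
import math

def counts_to_tpm(counts, lengths_kb):
    """Convert raw read counts to TPM for a single sample."""
    # Reads per kilobase: divide each count by its gene length in kb.
    rpk = [c / l for c, l in zip(counts, lengths_kb)]
    # Rescale so that the values in the sample sum to one million.
    scale = sum(rpk) / 1e6
    return [r / scale for r in rpk]

def log1p_transform(values):
    """Apply natural-log log1p, i.e. log(1 + x), to each value."""
    return [math.log1p(v) for v in values]

# Made-up example data: four genes in one sample.
counts = [10, 250, 0, 1200]        # integer read counts
lengths_kb = [1.5, 2.0, 0.8, 4.2]  # gene lengths in kilobases

tpm = counts_to_tpm(counts, lengths_kb)
log_tpm = log1p_transform(tpm)

for c, t, lt in zip(counts, tpm, log_tpm):
    print(f"count={c:>5}  TPM={t:12.3f}  log1p(TPM)={lt:7.3f}")
```

The log1p values are continuous, which is the case the manual describes for leaving kcdf at its default.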
Thank you Kevin - appreciate the response.