Hello all,
I am probably being incredibly naive but I've only studied bioinformatics for a year so forgive me for this question.
My work so far: I've performed differential gene expression analysis between Standard Risk and Low Risk AML patients using DESeq2 in R and successfully identified differentially expressed genes. Whilst running DESeq2 I've obtained variance stabilising transformed (vst) expression values for all genes. 30 of these differentially expressed genes matched with those on a list of genes of interest so I've subsetted vst values for those genes. I am expected to perform survival analysis based on the gene expression of these genes of interest which are differentially expressed and had found a question answered by Michael Love suggesting to perform survival analysis on vst data (sorry, unable to locate the post) but it didn't specify how this could be completed. It was suggested to me by my supervisor to use "quartile expression of genes" so I discretised these vst values by assigning them to either Q1, Q2, Q3 or Q4 for each gene. I found that for some genes the Q1, Q2, and Q3 thresholds were the same number leading to problems in discretising expression values.
My question: I now think that I shouldn't be using vst data for this purpose or I shouldn't be using quartiles, perhaps both. I had a thought of computing Z-scores from vst data (probably wrong about that one) or Z-scores from the raw counts and then discretising from there into perhaps Low/Medium/High expression categories and performing survival analysis - similar to this tutorial by Kevin Blighe.
Has anybody got any further advice on this?
Thanks so much in advance for any help given,
Lindsay
Hello Kevin, thanks for your reply - much appreciated!
If I stick to using quartiles with VST counts I still have the issue of not knowing which quartile some of the VST values fall into, eg. for one gene across numerous (51 out of 62) patients the VST value is the exact same - leading to the Q1, Q2 and Q3 thresholds being the exact same. I don't really know what to do about this. Even if I obtained Low/Mid/High tertiles (using Z-scores of VST counts using scale() perhaps?), I think this would mean that there would be no patient samples falling into some of these categories also.
By the way, my sample n is 31 for each group (Standard and Low Risk) so fairly low.
Any further comments would be more than welcome - very stuck with this analysis, unfortunately. Apologies for my naivety.
Thanks, Lindsay
**EDIT: I tried rlog transformation instead of VST and I no longer have an issue with sorting values into quartiles as no values seem to be repeated across lots of patients. Is this acceptable to use?
Hey, sorry, I was away overnight. Just wondering why many (?) of your genes end up having the same value when transformed via the variance-stabilising transformation? Are these non-coding genes or of low values, by any chance? Did you perform any pre-filtering for low-count genes prior to normalisation?
Using regularised log counts is fine, too - just merely means stating something different in your materials and methods! The 'genuine' genes of interest that are associated with your outcome / endpoint should be identified with either rlog or vst counts. The 'fringe' genes that may not have much to do with your outcome of interest may appear in one or the other, or neither.
Kevin
Hi Kevin,
No need to apologise - very grateful for any time you've spent reading my question/comments!
I've focused on 30 coding genes and most VST values look ok. Maybe ~5-10 of these genes have same VST value across many patient samples - after examining the raw counts, some of these genes have low counts but others don't. Unsure what could be causing it. I've also attempted to filter low-count genes prior to DESeq():
I think I'll perhaps stick to rlog transformation for now if I can't get around that issue (despite it taking forever to run).
Thanks so much, Lindsay
You are using all genes for the purpose of normalisation and the subsequent vst, right? Are you aware of the
blind=TRUE
/FALSE
functionality, too? Withblind=TRUE
, the transformation is performed independent of your design formula. Withdesign=FALSE
, the transformation will actually adjust your transformed counts based on the factors supplied in your design formula. For downstream processes,blind=FALSE
is recommended. This may give you what you need. Otherwise, try rlog!Hi Kevin,
I promise I won't spam you any more with this question.
I was unaware that
blind = FALSE
was recommended for downstream analyses, however I did try it and although the numbers changed slightly, I did have the same issue with repeated VST values across patient samples for the same gene.I think I'll stick with rlog!
Thanks once again and all the best.
Lindsay