I am carrying out an RNA-seq analysis on mesenchymal stem cells derived from diseased and control states. However, the experiment has no replicates (a bad idea, I know). The analysis is exploratory, to identify potential factors that can be validated in the lab. So far, I have performed transcript quantification using salmon and summarised the output to gene level using tximport, so now I have a counts matrix. Just to clarify: is my only real approach at this point to rank genes based on a transformed or regularised logFC?
Thanks for your feedback guys, much appreciated. Yeah, I have absolutely no replicates, and yes, they were processed in the same batch. I did try duplicating the data for DESeq2 analysis, but no joy. I think going off estimated p-values for an experiment without replicates is probably not a great idea anyway, so I am looking at the fold change for an indication of meaningful difference. I found that when I rank based on log2 fold change I get quite large values, up to as far as +/- 8 and 9, with many highly ranked genes having lower counts than lower-ranked genes. The ranking looks a lot better when I apply DESeq2's rlog for a regularised log fold change. From there I am applying a 0.6 cutoff to obtain genes with a fold change of roughly 1.5 or greater on the natural scale. This gives me ~200 up-regulated genes. I plan to use SignalP to pull out secreted proteins that may be validated in the lab. I'll also be using IPA for pathway analysis, causal network analysis, and building context around genes of interest! I am guessing this is the best I can do with what I have.
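The ranking-plus-cutoff step above can be sketched numerically. This is only an illustration with hypothetical counts and gene names; a simple pseudocount stands in for rlog's moderation of low-count genes, which in the real analysis would come from DESeq2:

```python
import math

# Hypothetical counts for a few genes: gene -> (control, case).
# In the real analysis these would be rlog-transformed values from DESeq2.
counts = {
    "GeneX": (2, 30),     # low counts: raw fold change is inflated
    "GeneY": (500, 900),  # moderate change at healthy expression
    "GeneZ": (100, 105),  # essentially unchanged
}

def log2fc(control, case, pseudocount=1):
    # The pseudocount dampens ratios driven by near-zero denominators
    return math.log2((case + pseudocount) / (control + pseudocount))

threshold = math.log2(1.5)  # ~0.585, i.e. the 0.6 cutoff ~ 1.5-fold change

up = {g: round(log2fc(ctrl, case), 2)
      for g, (ctrl, case) in counts.items()
      if log2fc(ctrl, case) >= threshold}
print(up)
```

Note that GeneX still clears the cutoff easily despite its tiny counts, which is exactly why the low-count genes dominate the raw log2FC ranking.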
You chose 0.6 presumably because
log2(1.5) = 0.5849625
? Did you try to apply the fold change shrinkage method of DESeq2, too? In any case, it seems like you know what you are doing.

That's exactly right Kevin, it's log2(1.5). After reading around, I gathered some ideas for what I might be able to do in this situation. I guess, being new to this kind of analysis, there is a degree of uncertainty on my behalf; I just wanted to see what people thought. Thanks a lot for your input. I'll be sure to check out DESeq2's LFC shrinkage method.
In case the shrinkage does not give satisfactory results and you have to use unshrunken FCs, I would suggest using a decent read-count cutoff to avoid high fold changes caused by small counts (the "mean-variance relationship").
It seems I can't run the LFC shrinkage without first running the DESeq function, which in turn requires replicates: "Treating samples as replicates was deprecated in v1.20" is the output from DESeq2. So I will stick with the regularised log fold change. I was considering a high read-count cutoff, but I have absolutely no idea how high or low to draw it, and I really hate asking for advice on choosing arbitrary cutoffs. However, I was thinking of summing the counts across the two samples and applying a cutoff of 50, so that genes with a summed read count greater than 50 across the two samples would be retained for downstream analysis. I saw this approach in this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4934518/ . Would a summed read count cutoff of 50 be "decent"? I really am in the dark with this one.
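For what it's worth, the summed-count filter described above is a one-liner. A minimal sketch with hypothetical genes and counts (the cutoff of 50 is the arbitrary value under discussion, not a recommendation):

```python
# Hypothetical raw counts matrix: gene -> (control, case).
counts = {
    "GeneA": (2, 1000),  # sum 1002 -> retained
    "GeneB": (2, 30),    # sum 32   -> filtered out
    "GeneC": (400, 800), # sum 1200 -> retained
}

CUTOFF = 50  # arbitrary summed-count threshold across the two samples

# Retain genes whose summed read count across both samples exceeds the cutoff
kept = {gene: c for gene, c in counts.items() if sum(c) > CUTOFF}
print(sorted(kept))
```

Note the side effect: the filter removes the 2-vs-30 style genes, but a gene like GeneA sails through on the strength of one sample alone, so the cutoff controls low-abundance noise rather than guaranteeing a trustworthy fold change.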
There is no standard cut-off at all. All you need to ensure is that you remain transparent about it in your methods. It could be 5, 20, 30, 50, 100, ...
Sounds like a good plan, especially with the log2FC shrinkage. Just remember: log2FC != p-values.
Older DESeq2 versions were able to do it; unfortunately, that ability was removed from recent versions. Still, as Kevin said, those results aren't recommended. You could try duplicating the data to get some results, but they would be bogus.
Don't ever duplicate data to get pseudoreplicates. See miRNA-seq DE analysis with duplicate replicates. Am I correct?
I didn't mean to suggest that as the correct method, but rather as a way to get top genes to validate by qPCR. That data is not publishable with pseudoreplicates.
Still, I do not think that creating pseudoreplicates gives you anything meaningful, because you artificially create data without variance and therefore get excessive false positives and probably distorted fold changes as well. For qPCR candidates, I would use genes that have reasonable log2FCs and proper mean expression, as discussed above.
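The two criteria above (reasonable log2FC and adequate mean expression) can be combined into one filter. A sketch with hypothetical candidate genes, counts, and thresholds; a pseudocount of 1 is assumed when computing the ratio:

```python
import math

# Hypothetical candidates: gene -> (control_count, case_count).
genes = {
    "CandA": (200, 700),  # well expressed and clearly changed -> keep
    "CandB": (2, 30),     # big ratio but tiny counts -> excluded
    "CandC": (300, 330),  # well expressed but barely changed -> excluded
}

MIN_MEAN = 50             # assumed mean-expression floor (arbitrary)
MIN_LFC = math.log2(1.5)  # the ~0.585 log2FC threshold from the thread

candidates = [
    g for g, (ctrl, case) in genes.items()
    if (ctrl + case) / 2 >= MIN_MEAN
    and abs(math.log2((case + 1) / (ctrl + 1))) >= MIN_LFC
]
print(candidates)
```

Only CandA survives both criteria, which is the point: each filter on its own lets a questionable candidate through.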
OK, something that I am trying to get my head around. Given this scenario: I have a read count for GeneA in both control and case samples (no replicates). GeneA (control) has a read count of 2; GeneA (case) has a read count of 1000. That is approximately a 9-fold increase for GeneA (case) on the log2 scale (which is massive). So when you say "I would suggest using a decent read-count cutoff to avoid high fold changes caused by small counts", you are not referring to the above scenario. Rather, you are referring to a scenario where GeneB (control) has a read count of 2 and GeneB (case) has a read count of 30, which returns a relatively high log2 fold change (~4) based on low counts?
Exactly. The situation with 2 vs. 1000 is probably/maybe a true biological effect, but 2 vs. 30 is more likely the usual noise that comes from low abundances.
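The two scenarios from the posts above can be checked numerically. A sketch using the counts quoted in the thread; the pseudocount of 1 is an assumption, used only to keep the ratio stable at very low counts:

```python
import math

# GeneA: 2 -> 1000. Raw ratio is 500x, i.e. ~9 on the log2 scale.
raw_a = math.log2(1000 / 2)                 # ~8.97

# Same comparisons with a pseudocount of 1 added to both counts:
gene_a = math.log2((1000 + 1) / (2 + 1))    # ~8.4: plausibly a real effect
gene_b = math.log2((30 + 1) / (2 + 1))      # ~3.4: likely low-count noise

print(round(raw_a, 2), round(gene_a, 2), round(gene_b, 2))
```

Note how much the pseudocount alone moves GeneA (from ~9 to ~8.4) when the control count is only 2; that sensitivity is another symptom of the low-count problem being discussed.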