Hi everybody,
I've been lurking for a month learning about RNASeq analysis but now it's time to post my first question!
Very briefly I have the following situation that I would love to have some insight about:
Setup
RNAseq (HiSeq, single reads, 50bp) on 4 cell lines, untreated or treated.
High-quality 44bp reads after fastqc filtering and trimming, 20-27 million mapped reads on human, tophat2-mapped using hg38.gtf strict.
Problem
We originally assumed that the cell lines would be sufficiently similar to each other to be considered "biological replicates" and we would thus see the effects of the treatment on the transcriptome. However, this turned out to be a wrong assumption and they are quite different globally, so the effects of the treatment are far smaller than being different cell lines (I've seen this by clustering normalized reads (from HTSeq) or FPKM values from Cufflinks, and running a PCA or doing heatmaps).
Question
What's the best way to analyze RNASeq without biological replicates?
So far I've tried
Cuffdiff using -blind
mode, doesn't seem to yield anything different than default setting.
DESeq2, seems to work a bit better but severely underestimates changes (that we already saw by QPCR on individual genes).
So I was wondering if you think it's possible / "allowed" to do fold-difference values on normalized reads (HTSeq and using sizeFactors from DESeq2) pair-wise of untreated and treated of the 4 samples but what statistics can I use for that to determine significant changes? Or is it better to use FPKM values from Cufflinks, since they already come with a built-in confidence interval and simply calculate folds pairwise and cluster the results?
Grateful for any kind of insight or advice!
What you say about poisson distribution is interesting, because when I use Cuffdiff with
--poisson-dispersion
I do get a sizeable amount of differentially expressed genes (~5000) (compared with only 50 or so using default dispersion estimation).So, you don't think anything can be gained by using reads or FPKM values (using a cutoff threshold obviously to minimize low expressed genes / noise) and just going for genes showing consistent up / dn regulation in the four different cell lines?
The reason I'm asking is that I've done this, and gene set enrichment analysis (GSEA) on the ups and downs seem as what we would expect from the biological question, but I am still not sure if my method is "kosher"..
I will take a look at GFold and DEGseq, thanks for the advice!
I'd forgotten that cuffdiff had the
--poisson-dispersion
option :) That should produce similar results to DEGseq. What you're doing now (GSEA on consistent fold changes) is probably a decent route given the data you have at the moment. Having said that, if you tried to publish the results and I were one of the reviewers then the paper would get immediately rejected (unless treating the cell-lines was somehow extremely difficult/expensive...and the bar for that would be relatively high).Sequencing is pretty cheap these days and cell lines are generally not that hard to come by. Before spending too much time on a questionable analysis, you might as well just redo the experiment in a publishable form and use non-questionable methods to analyze the results. You might end up with the same results either way, but this way you won't have problems later on.
Thumbs up for this advice. I get cold chills down my spine every time I see someone try to claim something with <3 (no, that's not a heart) biological replicates.
So, I think obviously everybody realizes that 'more biological replicates' is better than 'less biological replicates'. However, could you or Devon or someone else ELI5 (explain like I'm 5 yrs old) why a method like GFOLD is not to be trusted? I've gone through the GFOLD paper where the authors do quite comprehensive comparisons between the then available methods (deseq, edger, cufflinks, fold change with offset etc) and their results on no-biological-replicates seem pretty solid to be discarded without a more qualified argument.
Also, I'd like to remind that in our setup we are analyzing the effects of treatment on 4 distinct, but different-enough cell lines, and we're interested in common/consistent effects among the 4.
You're interested in measuring differences between two biologically variable groups. It is simply impossible to accurately gauge biological variability. No one will ever conceive of a perfect method to do this. The examples used in the GFOLD paper are..."interesting". It should be noted that when a Poisson model (aka DEGseq) performs better than edgeR that the dataset is very atypical.
So trust GFOLD for as much as it's worth. If you're intent is only to find some candidates that will be undergoing thorough validation then GFOLD is probably OK.
I disagree. I would argue that William Stanley Gosset solved this problem about a hundred years ago.
It's a paired t-test.
And test what with it? That two samples are different from each other? They will be, but that's not interesting. A paired t-test solves a completely different problem.
I should add that I assume the paired T-test would be pairing treated and untreated within cell-line. If the cell lines are indeed actually quite different, then this won't work due to cell-line specific changes due to the treatment (you'd miss all of them). If there are a reasonable number of consistent changes due to treatment then that'd work, but one could also just use a regular GLM (~line+treatment). If the goal in using multiple cell-lines was to find more robust changes then that would work OK, assuming the FDR correction doesn't turn out to be too brutal. Of course if the goal is now to find within cell-line changes (as suggested in the title), then that won't help at all.
They are interested in what genes change expression in response to the treatment. To do that they used four biological replicates (different cell lines) but because the cell lines had so much variability their experiment was under powered to detect differences due to the treatment effect rather than the differences in cell line. That's a textbook example of when you use a paired t test.
Using different cell lines actually blocks against cell-line specific effects and gets at the what the treatment is doing. Those changes, if the treatment is actually doing something, should be robust. There isn't a problem with lack of replication. They are blocking on a wide range of biological variables by using different cell lines. That is how experiments should designed.
I agree with Devon. paired t-test would not be appropriate