Hello,
I am working with RNA seq data which contains a date batch effect because an error was done in size selection before sequencing.
Data with no normalization:
I have been looking to correct this using the RUVseq approach. The most important effect is seen between the wrongly size filtered (red) and the normal sequences (green) (ideally they should overlap just like the 2013.08.08 purple technical replicates).
I first removed unwanted variance with RUVs by specifying these "strange technical" replicates and then removed the unwanted variance with empirical 5000 non DE transcripts (RUVg). I did this second normalisation because the RLE plots weren't similar after the RUVs normalisation. This didn't give good RLE plots so I abandoned that idea.
When I did this double normalisation starting with RUVg and then normalising with RUVs, this gave very satisfactory PCA and RLE plots (maybe to good as this really is how I expect my biological model I am using should behave).
PCA and p-values when no technical replicates are combined
When I do a DE transcript detection between my species with EdgeR without combining the technical replicates, the p-value distribution is more or less ok. Although it looks ok, it is wrong because there is pseudo-replication because the replicate counts weren't combined.
When I combine those counts by adding them together, everything comes crumbling down. The p-value distribution looks very strange, as if it was depleted in significant values.
p-values after combining the technical replicates:
I also simply removed those strange replicates and normalised with RUVg. This looks ok but not as nice as the more complicated way I have tried to explain above. I also realise that my approach is probably a bit flawed because I sort of know what to expect and I am trying to get to that ...
I need enlightenment, am I doing multiple testing somewhere, is it right to do these two normalisations and how come it doesn't give the same results when done in one or the other way (i.e not commutative I think would be the term).
Thank you all for the time in considering this issue, I realise it it kind of a vast question.
Thomas
have you looked at batch effects before using RUVseq? Is it proper to assume in your experimental setup that the 5000 transcripts would be non-DE? You can upload images to something like [imgur](http://imgur.com/) to show them in your question.
Thanks for your reply, I have uploaded the plots on imgur as suggested.
I am not quite sure about your first question "have you looked at batch effects before using RUVseq". If you mean "is there a batch effect before my normalisation", I think the first image should speak for itself. If you mean "is it the first time I deal with batch effect in any data", then yes it is.
For the 5000 non-DE transcripts I give as argument to RUVg, I find them "in silico" as proposed in the RUVseq tutorial. I do a first DE study to find the 5000 with the highest p-values (that have the less chances of satisfying H1: differentially expressed). These 5000 transcripts are about 1/10 of my total transcriptome and I do not expect to have that many DE genes. My species of interest appeared recently after two independent polyploidisation events with the same parents.
I should add that this strange p-values distribution appears when I try finding DE genes between three species. (noted as e, t and m in the plots). When I want to find the DE genes between 2 species (considering e and t as the same) then the p-values distribution looks ok. Could the model in EdgeR be the cause of this strange distribution?
you probably want to recreate the right plot in your first image also colored so that replicates are the same color.
yea the p-value distribution looks weird in the last image. I would either not pool the other two when doing the comparison (so its always a 1 species vs another species comparison) or in the design matrix, instead of modeling it as '1 * e vs 1 * (t and m)' do '1 * e vs 0.5 * t + 0.5 * m'. If you paste your design matrix I might be able to do a better example.