Background: I have to perform a differential gene expression analysis using RNA-seq data. We have two genotypes, with RNA-seq data for control and after treatment for both genotypes, and two biological replicates for each case. In short, we have eight samples: four before the treatment (2 genotypes x 2 replicates) and four after. Our goal is to identify genes that show differential expression between the genotypes after the treatment. Note: we actually have several treatments, but I have tried to keep the question simple.
Problem: The two biological replicates of each case were run on different sequencing platforms, Ion Proton and SOLiD Wildfire. Trust me, it wasn't my idea. Don't kill the messenger (bioinformatician) :-)
Now we see a huge difference in expression counts between biological replicates that is purely due to the batch (platform) effect. PCA clusters the samples by platform and not by treatment or strain. The samples from Ion Proton always show higher read counts. The same applies to RPKM values, so the problem is not caused by a difference in sequencing depth. The batch effect is not consistent across the pairs of biological replicates, and the correlation between counts from the two platforms (i.e., between biological replicates) ranges from 0.3 to 0.6 depending on the case.
I can use the batch as a covariate in my DESeq2 analysis, but:
A) Is there a better approach to remove the variation due to the different sequencing platforms? The reason I ask is that there are samples from multiple treatments, and we may later need to merge reads from nearly identical treatments into one. Scaling or correcting the values would therefore be preferable, so that the new counts from such treatments can be merged.
B) Should I perform the correction at the level of each biological replicate pair, or should I create two groups (Wildfire and Ion Proton) and perform batch correction using all the samples? (That would be 4 Wildfire and 4 Ion Proton here; I actually have many more samples on both platforms because of the multiple treatments, but I mentioned only two conditions to keep the question simple.)
C) I have never used ComBat, but I read that it doesn't work for small sample sizes, so I may need to carry out batch correction using all the samples even though the batch effect is inconsistent. Also, since ComBat takes log-transformed normalized data as input, I won't be able to use its output as input counts for DESeq2. I may have to use limma, right?
Please excuse me if I haven't used the correct terminology. I am new to this.
Thanks.
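For readers following along, here is a minimal sketch of what an additive, per-gene platform correction on the log scale amounts to. It is not ComBat, limma or DESeq2, only plain Python that centers each batch on the gene's overall mean. The function name and all numbers are hypothetical. It assumes a balanced design (every condition present once on each platform, as described above) and a constant shift per platform, which the question says may not hold for every replicate pair.

```python
def center_batches(log_expr, batches):
    """Shift each batch so its mean matches the gene's overall mean.

    log_expr: log-scale expression values for ONE gene, one per sample.
    batches:  batch (platform) label of each sample, in the same order.
    Only sensible when every batch contains the same mix of conditions;
    otherwise real biological differences would be removed as well.
    """
    overall = sum(log_expr) / len(log_expr)
    grouped = {}
    for value, batch in zip(log_expr, batches):
        grouped.setdefault(batch, []).append(value)
    batch_mean = {b: sum(v) / len(v) for b, v in grouped.items()}
    return [v - batch_mean[b] + overall for v, b in zip(log_expr, batches)]


# Hypothetical log2 values for one gene: four conditions, each sequenced
# once on Ion Proton and once on Wildfire (made-up numbers).
batches = ["IonProton"] * 4 + ["Wildfire"] * 4
log_expr = [8.0, 8.5, 9.0, 10.5,   # Ion Proton, conditions 1-4
            5.0, 5.5, 6.0, 7.5]    # Wildfire,   conditions 1-4

adjusted = center_batches(log_expr, batches)
ion_adjusted, wildfire_adjusted = adjusted[:4], adjusted[4:]
```

In this toy case the platform offset is identical for every condition, so the two replicates of each condition coincide after centering. Values adjusted this way are mostly useful for PCA and other plots; for the differential expression test itself, keeping the raw counts and putting the platform in the design as a covariate is the usual route.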
Hey Ashutosh,
How about using Surrogate Variable Analysis to remove the batch effect? The sva package from Bioconductor also provides the "ComBat" function for batch-effect removal.
Alternatively, I think quantile normalization of your log-transformed counts per million would also help in your case.
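To make the quantile normalization suggestion concrete, here is a rough sketch in plain Python. The log-CPM formula is a simplified one with a hypothetical pseudocount (`prior`), not the exact formula of any package, and the counts are made up. Ties are broken by position rather than averaged, which a real implementation would handle more carefully.

```python
import math


def log_cpm(counts, prior=0.5):
    """Simplified log2 counts-per-million for one sample.

    `prior` is a hypothetical pseudocount that avoids log(0).
    """
    total = sum(counts)
    return [math.log2((c + prior) / total * 1e6) for c in counts]


def quantile_normalize(samples):
    """Force all samples onto the same distribution of values.

    samples: list of equal-length lists, one per sample, genes in the
    same order. Each value is replaced by the across-sample mean of the
    values sharing its rank. Ties are broken by position, not averaged.
    """
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean of the k-th smallest value per sample.
    reference = [
        sum(s[k] for s in sorted_samples) / len(samples)
        for k in range(n_genes)
    ]
    normalized = []
    for s in samples:
        order = sorted(range(n_genes), key=lambda g: s[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]
        normalized.append(out)
    return normalized


# Made-up counts for four genes on the two platforms.
ion_proton = [500, 1200, 30, 8000]
wildfire = [40, 90, 5, 700]

normalized = quantile_normalize([log_cpm(ion_proton), log_cpm(wildfire)])
```

After this step both samples have exactly the same set of values and only the gene ranks within each sample are kept. That removes global distribution differences between platforms, but it cannot fix genes whose rank differs between platforms, so with replicate correlations of 0.3 to 0.6 it may only help partially.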
Thanks Manvendra.
RUVSeq worked very well on my dataset. Maybe you can give it a try.