Hi folks,
I am processing RNA-Seq data (2 plant genotypes collected over 3 time points). Each genotype has 2 conditions (control and treatment), and each condition has 3 biological replicates. For 2 of the 3 time points, I notice that one biological replicate, in either control or treatment, has a much lower read count than the other replicates of the same condition (e.g. 1-3 million vs. 9-12 million reads). I normalize the data with TMM before calling DEGs with edgeR, which I thought should handle the differences in read counts. However, the number of DEGs almost doubles when I repeat the analysis without the low-read-count sample (110 DEGs with it vs. 210 without it).
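To make the read-depth gap described above concrete, here is a minimal Python sketch that sums raw counts per sample, flags replicates that are much shallower than the rest of their group, and rescales counts to counts-per-million (CPM). The toy count values, the function names, and the 0.5 cutoff are all hypothetical and chosen only for illustration. Note that plain CPM is not TMM; in edgeR the TMM factors come from `calcNormFactors`, so this only illustrates the library-size part of the problem.

```python
from statistics import median

def library_sizes(counts):
    """Total raw counts per sample; counts maps gene -> list of per-sample counts."""
    n_samples = len(next(iter(counts.values())))
    return [sum(row[i] for row in counts.values()) for i in range(n_samples)]

def flag_low_depth(sizes, groups, ratio=0.5):
    """Indices of samples whose library size is below ratio * their group's median.

    The ratio of 0.5 is an arbitrary illustrative cutoff, not a recommended one.
    """
    flagged = []
    for group in set(groups):
        idx = [i for i, g in enumerate(groups) if g == group]
        group_median = median(sizes[i] for i in idx)
        flagged += [i for i in idx if sizes[i] < ratio * group_median]
    return sorted(flagged)

def cpm(counts, sizes):
    """Counts-per-million: scale each sample's counts by its library size."""
    return {gene: [c / s * 1e6 for c, s in zip(row, sizes)]
            for gene, row in counts.items()}

# Hypothetical toy data: 3 genes x 6 samples (samples 1-3 control, 4-6 treated).
counts = {
    "geneA": [900, 1000, 950, 1100, 150, 1050],
    "geneB": [90, 100, 95, 300, 40, 310],
    "geneC": [10, 12, 11, 9, 2, 10],
}
groups = ["control"] * 3 + ["treated"] * 3

sizes = library_sizes(counts)
low = flag_low_depth(sizes, groups)   # zero-based sample indices
scaled = cpm(counts, sizes)
print("library sizes:", sizes)
print("low-depth samples (1-based):", [i + 1 for i in low])
```

After CPM scaling every sample sums to one million, but the low-depth replicate still carries more sampling noise for lowly expressed genes, which normalization alone cannot remove.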
I am considering whether I should remove the samples with low read counts from the analysis. The downside is that I would be left with only 2 biological replicates in the affected condition for calling DEGs. Would you mind sharing your thoughts or suggestions?
Thanks a lot in advance!
Hi ATpoint,
Thanks a lot for sharing your suggestion and code. I agree that resequencing the low-read samples would be best, but sadly they are old samples and we no longer have them. I just have to try to get the best out of the existing data. :(
I tried your code with the data from the two time points in doubt. The samples in question are sample 5 for the first time point (PCA1=21.92%) and sample 3 for the second time point (PCA1=22.65%). For both time points, samples 1-3 are control and samples 4-6 are treated.
To my inexperienced eyes, they look OK for DEG calling. Would you mind sharing your opinion? Thanks again and have a great day!