I am comparing spermatid and sperm RNA-seq libraries. Sperm is a highly specialized cell that retains very little RNA, because most of it is expelled together with the rest of the cytoplasmic content at the final stage of spermatogenesis and during epididymal maturation (before ejaculation). Some RNA species with important regulatory functions may be retained, and those are what we are interested in studying. The differences between the two conditions are considerable, and we know that the majority of RNA molecules will be reduced in sperm. In other words, we expect far more down-regulated than up-regulated genes in sperm compared to spermatids. However, this is not what we observe. Because the sequencing depth is much lower in sperm (8M-17M reads) than in spermatids (20M-36M reads), normalization inflates the sperm counts considerably and many genes come out as up-regulated in sperm. We therefore believe that many of the genes called significantly up-regulated in sperm are false positives, arising where read counts are very low.
DESeq2's standard median-of-ratios normalization assumes that only a minority of genes are strongly affected by the condition, i.e. that few genes show considerable differences. When this assumption is not met, the resulting size factors, and the inference based on them, will not be correct. This is our situation, as we cannot make this assumption.
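To make the concern concrete, here is a minimal Python sketch of the median-of-ratios idea. It is a simplified illustration, not DESeq2's actual code: the function name `median_ratio_size_factors` and the toy counts are made up for this example, with four genes depleted 10-fold in the sperm-like sample and one gene retained at the same level.

```python
import math
from statistics import median


def median_ratio_size_factors(counts):
    """Simplified median-of-ratios size factors.

    counts: list of genes, each a list of raw counts, one per sample.
    Returns one size factor per sample.
    """
    n_samples = len(counts[0])
    # Only genes with non-zero counts in every sample have a defined
    # log geometric mean, so the others are skipped.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has non-zero counts in all samples")
    # Per-gene log geometric mean across samples (the pseudo-reference).
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        # The median ratio is taken as the sample's size factor.
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy data (invented): columns are [spermatid-like, sperm-like].
counts = [
    [100, 10],   # depleted 10-fold in sperm
    [200, 20],   # depleted 10-fold
    [400, 40],   # depleted 10-fold
    [800, 80],   # depleted 10-fold
    [50, 50],    # retained at the same level
]
size_factors = median_ratio_size_factors(counts)
normalized = [[c / sf for c, sf in zip(row, size_factors)] for row in counts]
for row in normalized:
    print([round(v, 1) for v in row])
```

In this toy input the median ratio is driven by the depleted majority, so after normalization the four depleted genes look unchanged and the single retained gene looks roughly 10-fold up in the sperm-like sample. That is the same pattern of apparent up-regulation we suspect in our data.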
Therefore, we would like to know how to analyze differential expression between samples that differ this much, and which normalization procedures are appropriate for comparing samples with such large differences in their RNA profiles.
Thanks!
In my experience, when library sizes differ strongly enough to influence gene detection (beyond simply affecting quantification), subsampling the FASTQs to similar library sizes can sometimes help.
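As a rough illustration of what subsampling could look like, here is a small Python sketch using reservoir sampling. The function names (`read_fastq`, `reservoir_sample`, `subsample_fastq`) are hypothetical, and it assumes uncompressed, single-end FASTQ files with standard 4-line records. It keeps the sampled reads in memory, so for libraries of millions of reads a dedicated subsampling tool is likely more practical.

```python
import random


def read_fastq(handle):
    """Yield FASTQ records as 4-line tuples (assumes unwrapped 4-line records)."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:
            return
        yield tuple(record)


def reservoir_sample(records, n, seed=0):
    """Pick up to n records uniformly at random in a single pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < n:
            reservoir.append((i, rec))
        else:
            # Replace a random slot with probability n / (i + 1).
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = (i, rec)
    # Restore the original read order in the output.
    reservoir.sort(key=lambda pair: pair[0])
    return [rec for _, rec in reservoir]


def subsample_fastq(in_path, out_path, n, seed=0):
    """Write a random subset of n reads from in_path to out_path."""
    with open(in_path) as fin:
        kept = reservoir_sample(read_fastq(fin), n, seed)
    with open(out_path, "w") as fout:
        for rec in kept:
            fout.writelines(rec)
    return len(kept)
```

The target `n` would be chosen so that all libraries end up at a similar depth, for example the size of the smallest library. For paired-end data, running this with the same seed on both mate files should select the same read positions, provided both files list the same reads in the same order.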