Question

Normalization of samples with very different transcriptomic profiles

0


6 months ago

paulanavarrete116 • 0

I am comparing spermatid and sperm RNA-seq libraries. Sperm is a highly specialized cell that retains little RNA, as most of it is expelled together with the rest of the cytoplasmic content at the final stage of spermatogenesis and during epididymal maturation (before ejaculation). Some RNA species with important regulatory functions may be retained, and those are what we are interested in studying. The differences between these conditions are considerable, and we know that the majority of RNA molecules will be reduced in sperm. In other words, we expect far more down-regulated than up-regulated genes in sperm compared to spermatids. However, this is not what we see. As the sequencing depth is much lower in sperm (8M-17M) than in spermatids (20M-36M), normalization inflates the sperm counts considerably and we obtain a lot of up-regulated genes in sperm. We therefore believe that we are getting many false positives: genes called significantly up-regulated in sperm even though their read counts are very low.

DESeq2 assumes that only a minority of genes are strongly affected by the condition, i.e. that few genes show considerable differences. When this assumption is not met, the standard median ratio normalization will not give correct inference. This is our case, as we cannot make this assumption.
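To make the concern concrete, here is a minimal Python sketch of the median ratio idea (not DESeq2's actual code, just the same principle written out by hand). The function name and the toy counts are made up for illustration: four genes drop 10-fold in the second sample while one gene is retained at the same level.

```python
import math
from statistics import median

def median_ratio_size_factors(counts):
    """counts: list of gene rows, each row holding one raw count per sample."""
    # Only genes with nonzero counts in every sample can be used (log of 0 is undefined).
    rows = [r for r in counts if all(c > 0 for c in r)]
    # Per-gene log geometric mean across samples acts as a pseudo-reference sample.
    log_ref = [sum(math.log(c) for c in r) / len(r) for r in rows]
    n_samples = len(counts[0])
    # Size factor of a sample = median over genes of its ratio to the reference.
    return [
        math.exp(median(math.log(r[j]) - ref for r, ref in zip(rows, log_ref)))
        for j in range(n_samples)
    ]

# Hypothetical toy data: columns are (spermatid, sperm).
counts = [
    [1000, 100],
    [1000, 100],
    [1000, 100],
    [1000, 100],
    [1000, 1000],  # the one gene assumed to be truly retained
]
size_factors = median_ratio_size_factors(counts)
normalized = [[c / sf for c, sf in zip(row, size_factors)] for row in counts]
# Ratio sperm / spermatid for the retained gene after normalization.
retained_fold_change = normalized[4][1] / normalized[4][0]
```

Because the median follows the majority of genes, the four reduced genes end up looking unchanged after normalization and the single retained gene looks strongly up-regulated, which is the pattern described above.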

Therefore, we would like to know how to analyze differential expression between samples that are this different, and which normalization procedures are adequate for comparing samples with large differences in their RNA profiles.

Thanks!

DESeq2 normalization RNA-seq edgeR DGE • 592 views

ADD COMMENT • link updated 6 months ago by ATpoint 86k • written 6 months ago by paulanavarrete116 • 0

2


In my experience, when library sizes differ so strongly that gene detection itself is affected (beyond simple differences in quantification), subsampling the FASTQs to similar library sizes can sometimes help.
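As a rough illustration of what subsampling means here, this is a small Python sketch that draws a fixed number of reads at random from FASTQ lines using reservoir sampling. The function name and the toy reads are hypothetical, and in practice a dedicated tool (for example seqtk sample) is probably the more sensible choice.

```python
import random
from itertools import islice

def subsample_fastq(lines, n_reads, seed=0):
    """Reservoir-sample n_reads four-line FASTQ records from an iterable of lines."""
    # A fixed seed means a second file with the same number of records
    # (e.g. the mate file of a pair) gets exactly the same records picked.
    rng = random.Random(seed)
    reservoir = []
    it = iter(lines)
    seen = 0
    while True:
        record = list(islice(it, 4))  # header, sequence, '+', qualities
        if len(record) < 4:
            break
        if seen < n_reads:
            reservoir.append(record)
        else:
            # Replace an existing entry with probability n_reads / (seen + 1).
            j = rng.randint(0, seen)
            if j < n_reads:
                reservoir[j] = record
        seen += 1
    return reservoir

# Made-up toy input: 10 reads, of which 4 are kept.
toy_lines = []
for i in range(10):
    toy_lines += [f"@read{i}", "ACGT", "+", "IIII"]
sampled = subsample_fastq(toy_lines, n_reads=4, seed=1)
```

With real files you would pass an open file handle (lines stripped or not, as long as both are handled consistently) and write the sampled records back out; the order of the kept reads is not preserved by this sketch.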

ADD REPLY • link 6 months ago by Papyrus ★ 3.0k

score 5 · Answer 1 · 2024-06-06

5


6 months ago

ATpoint 86k

You need to come up with genes that you think (or have evidence for, maybe from the literature) are not differential, and provide these to the controlGenes argument of estimateSizeFactors.
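To show what restricting the size factor estimation to control genes does, here is a hand-written Python sketch of the principle (in R the actual call would be estimateSizeFactors with controlGenes set to an index vector). The function name, the toy counts, and the choice of rows 3 and 4 as "stable" genes are all assumptions for illustration.

```python
import math
from statistics import median

def size_factors_from_controls(counts, control_rows):
    """Median ratio size factors computed from the control gene rows only."""
    # Restrict to the chosen control genes, dropping any with a zero count.
    rows = [counts[i] for i in control_rows if all(c > 0 for c in counts[i])]
    # Per-gene log geometric mean across samples (pseudo-reference).
    log_ref = [sum(math.log(c) for c in r) / len(r) for r in rows]
    n_samples = len(counts[0])
    return [
        math.exp(median(math.log(r[j]) - ref for r, ref in zip(rows, log_ref)))
        for j in range(n_samples)
    ]

# Hypothetical toy data: columns are (spermatid, sperm).
counts = [
    [1000, 100],
    [1000, 100],
    [1000, 100],
    [1000, 1000],  # assumed stable (control)
    [500, 500],    # assumed stable (control)
]
control_rows = [3, 4]
size_factors = size_factors_from_controls(counts, control_rows)
# Every gene is still normalized, but only the controls determined the factors.
normalized = [[c / sf for c, sf in zip(row, size_factors)] for row in counts]
```

In this toy case the control genes give equal size factors for both samples, so the three reduced genes stay visibly reduced after normalization instead of being pulled back to "unchanged" by the majority. Everything hinges on the control set really being stable.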

ADD COMMENT • link 6 months ago by ATpoint 86k

0


I think this is the best option, but I think there aren't any housekeeping genes whose expression is stably maintained from spermatid to sperm...

ADD REPLY • link 6 months ago by paulanavarrete116 • 0

1


If I have to go "blind" like this I usually make an MA-plot and see if there is any set of genes that could serve as reference. Typical MA-plots have an arrowhead-like shape, with the rightmost part (the "tip" of the arrow) usually being genes with high expression across all samples, and these are typically stable. You need to see whether that holds true for you or whether it is really a global change in everything.
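A minimal Python sketch of that check, with made-up counts: compute M (log2 fold change) and A (mean log2 expression) per gene, take the genes with the highest A as the "tip", and look at how tightly their M values agree. The function name, the pseudocount of 1, and the choice of the top two genes are assumptions for the example.

```python
import math

def ma_values(counts_a, counts_b, pseudocount=1.0):
    """Return per-gene M (log2 b/a) and A (mean log2 expression)."""
    m_vals, a_vals = [], []
    for a, b in zip(counts_a, counts_b):
        la = math.log2(a + pseudocount)
        lb = math.log2(b + pseudocount)
        m_vals.append(lb - la)
        a_vals.append((la + lb) / 2)
    return m_vals, a_vals

# Hypothetical per-gene counts, not real data.
spermatid = [10, 40, 200, 5000, 8000]
sperm = [1, 3, 15, 5100, 7900]
M, A = ma_values(spermatid, sperm)

# Indices of the two genes with the highest average expression (the "tip").
tip = sorted(range(len(A)), key=lambda i: A[i], reverse=True)[:2]
# How far the tip genes are from M = 0; small here, so they look stable.
tip_spread = max(abs(M[i]) for i in tip)
```

On unnormalized counts the tip may sit at some common non-zero M, which is the offset the size factors would absorb; what matters is that the tip genes cluster tightly. If even the highly expressed genes scatter widely in M, that would point towards a genuinely global change.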

ADD REPLY • link 6 months ago by ATpoint 86k