I understand that length normalization is not necessary for voom because the same gene is compared between groups, rather than comparing different genes of different lengths. I found a nice summary here: https://stat.ethz.ch/pipermail/bioconductor/2012-June/045919.html
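To convince myself of that reasoning, here is a minimal sketch with made-up numbers (the counts, library sizes and gene length are all hypothetical, and the helper functions are mine, not from limma). It uses plain log2-CPM and log2-RPKM without the small offsets voom adds, and it assumes the same effective gene length applies to both groups, which is exactly the assumption I'm unsure about below.

```python
import math

def log2_cpm(count, lib_size):
    # Counts per million on the log2 scale (no offsets; illustration only).
    return math.log2(count / lib_size * 1e6)

def log2_rpkm(count, lib_size, gene_length_bp):
    # Reads per kilobase per million on the log2 scale.
    return math.log2(count / lib_size * 1e6 / (gene_length_bp / 1e3))

# Hypothetical values for one gene measured in two groups.
count_a, lib_a = 200, 20_000_000
count_b, lib_b = 900, 30_000_000
gene_length_bp = 2500  # assumed identical for both groups

# Log fold change B vs A, with and without length normalization.
lfc_cpm = log2_cpm(count_b, lib_b) - log2_cpm(count_a, lib_a)
lfc_rpkm = (log2_rpkm(count_b, lib_b, gene_length_bp)
            - log2_rpkm(count_a, lib_a, gene_length_bp))

print(f"log2FC from CPM:  {lfc_cpm:.4f}")
print(f"log2FC from RPKM: {lfc_rpkm:.4f}")
```

The two fold changes agree because the length term is subtracted out when the same gene is compared across groups. If the effective length differed between groups, that term would no longer cancel.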
However, I’m in a different situation: the RNA from one condition (treatment B) was more degraded than the RNA from treatment A, so I had to fragment the treatment A RNA to bring it to an appropriate size for sequencing, whereas treatment B was not fragmented. Other than this, the library preparation protocols were the same (e.g. same strandedness, rRNA depletion, etc.). The number of final PCR cycles did vary slightly with the amount of input material in a given library, but reads were deduplicated prior to quantification.
There is a rational basis for the differential expression results I got without controlling for the difference in gene length: the GO terms for the DEGs are congruent with biological expectation. But I wanted to check whether gene length affected the results at all. I’m wondering whether this is necessary or whether I can leave things as-is. Even when I plot log-CPM expression for the genes I’m interested in (chosen from the biology of my system, not from the DEG list), expression is higher in the degraded treatment B than in the fragmented treatment A, which is consistent with my hypothesis for the experiment. I also wanted to avoid normalizing the degraded data by gene length, since that would inflate the counts for shorter genes.
I’m not sure whether this is something I can do. The voom guide doesn’t discuss it (e.g. feeding in raw counts only vs. normalized values), but the voom manuscript says that logged RPKM can be used in place of logged CPM. My library sizes vary a lot between samples (max lib size / min lib size ~= 10), so I need to use voomWithQualityWeights, but my understanding was that this function takes raw counts. I could run limma on RPKM values, but my limited understanding was that this is inappropriate when library sizes vary this much (page 71 of the latest online manual).
Any thoughts on whether and how I can control for gene length here? Thanks!
It's a little unclear from your description, but do you only have one sample for treatment B, and it was that one sample that was degraded? Or was it just one treatment B out of n treatment Bs that was degraded?
It's a paired design: 5 samples of treatment A (which required fragmentation) and 5 samples of treatment B (no fragmentation).