Question

comparing fragmented RNA-seq to non-fragmented RNA-seq

0

Entering edit mode

2.8 years ago

wiscoyogi ▴ 40

I understand that length normalization is not necessary for voom because the same gene is compared between groups, rather than comparing different genes of different lengths. I found a nice summary here: https://stat.ethz.ch/pipermail/bioconductor/2012-June/045919.html

However, I’m in a different situation where one condition of my RNA samples (treatment B) were more degraded than treatment A, so I had to fragment the treatment A RNA for it to be of appropriate size for sequencing, unlike treatment B. Other than this, the library preparation protocols were the same (e.g. same strandedness, rRNA depletion, etc). Final PCR cycles did vary slightly based on amount of input material in a given library but reads were deduplicated prior to quantification.

There is a rational basis for the differential expression results I got without controlling for the difference in gene length; the GO on the DEGs are congruent with biological expectation. But I wanted to run to see if gene length impacted things at all. I’m wondering if this is a necessary or if I can leave things as-is. Even if I plot log-CPM expression for the genes I’m interested in (genes are not from DEG, but from the biology of my system), there’s higher expression in the degraded treatment B relative to fragmented treatment A, which is consistent with my hypothesis for the experiment. Also I wanted to avoid normalizing the degraded data on the basis of gene length since it will inflate the counts for shorter genes.

I’m confused on if this is something I can do because there isn’t a discussion on this (e.g. feed in raw counts only vs. normalization) in the voom guide but there is in the voom manuscript that logged RPKM can be used to replace logged CPM. I have variance in the library sizes in my samples (max lib size / min lib size ~= 10) so I need to use voomWithQualityWeights, but my understanding was that this fn takes in raw counts. I could do limma with RPKM, but my limited understanding was that this is inappropriate on the basis of variance in library size (page 71 of the latest online manual).

Thoughts on if/how I can compare for gene length here? Thanks!

voom differential fragmentation normalization limma • 1.3k views

ADD COMMENT • link updated 2.8 years ago by Gordon Smyth ★ 7.8k • written 2.8 years ago by wiscoyogi ▴ 40

0

Entering edit mode

It's a little unclear from your description, but do you only have one sample for treatment B, and it was that one sample that was degraded? Or was it just one treatment B out of n treatment Bs that was degraded?

ADD REPLY • link 2.8 years ago by rpolicastro 13k

0

Entering edit mode

it's a paired design; 5 samples of treatment A (that required fragmentation) and 5 samples of treatment B (no fragmentation).

ADD REPLY • link 2.8 years ago by wiscoyogi ▴ 40

score 1 · Answer 1 · 2022-06-09

1

Entering edit mode

2.8 years ago

Gordon Smyth ★ 7.8k

I think you may have some misunderstandings about how RNA fragment size feeds into the analysis.

Whether or not you fragmented the RNA samples has nothing to do with gene length, and it does not cause genes to get systematically larger or smaller counts. The length of sequenced fragments, on one hand, and genomic gene length, on the other, are completely different things and are not even correlated except for a small minority of very short genes. Fragment length is just a technical issue for sequencing efficiency.
Fragmentation has nothing to do with the choice between log-CPM or log-RPKM. Nor does variation in library size. Indeed the log-CPM and logRPKM limma analyses are identical, it is just a presentation issue.
The variation in library sizes makes voom sensible instead of limma-trend, but it does not imply you need to use voomWithQualityWeights. Using voomWithQualityWeights might well be beneficial, but not for that reason. Indeed voomWithQualityWeights may attach lower weigths to the degraded samples, which may help with the analysis.

The degraded nature of the treatment B samples causes a batch effect into the study, but there is nothing that you can do about it, because it is intrinsic to your treatments. You have to proceed with a normal limma analysis and try to rule out unwanted DE effects on biological rather than statistical grounds. Library size normalization is already preventing any global increase in expression between samples, so that is not an issue. There are no other bioinformatics tricks that can be added to the analysis that will help.

On another topic, I am somewhat concerned that you describe the design as "paired". A few hours ago you asked a question about an unpaired analysis for a dataset that seems very similar ( limma voom -- all genes after filtering are differentially expressed ). Is this the same study or a different dataset?

ADD COMMENT • link 2.8 years ago by Gordon Smyth ★ 7.8k

0

Entering edit mode

Thank you for your thorough answers and for all this helpful background.

Regarding fragmentation and gene length: my understanding was that if longer RNA molecules are broken into multiple pieces, and that after sequencing when looking at the count data, systemic bias could be introduced for having more counts for longer genes. I understand that this has nothing to do with the experimental fragmentation -- I suppose that one could make the claim that the fragmentation in group B already happened in the system rather than the experimental protocol and that it'd still be more probable to get elevated expression from longer genes so it'd still be appropriate to normalize on the basis of gene length. Is this reasoning correct? RE: this DE post, I agree that there is a systematic batch effect introduced with treatment B and I did do TMM library normalization, so I addressed this to be the best of my ability and can just state the limitations.

Regarding the design: Experimentally, not all the samples were matched due to sample availability (e.g. I couldn't get ALL the sibling pairs). So what I did was:

run one DEG on just the paired samples, and did a design matrix as described in 9.4 of the guide (I had '5' sibling pairs, and I looked at the DEG results on the coefficient equivalent to 'TreatT'.

I could also do:

run another DEG on all the samples, where the unpaired samples just get their own new sample ID so that I could use all the samples

In my prior post, I ignored the paired part of the design since I originally couldn't get a topTable result that made any sense. Now that I get a volcano plot and a topTable that makes sense, I'm making more complex design matrices.

other question (can make separate post if necessary): I want to build more intuition behind the interaction terms in linear models besides what's in the limma voom guide, do you have any recommendations for where to go?

ADD REPLY • link 2.8 years ago by wiscoyogi ▴ 40

1

Entering edit mode

Regarding fragmentation and gene length, I don't know the details of your data, but I find it hard to imagine any biological treatment that you could apply that would stop longer genes getting more reads, assuming you're using a standard Illumina RNA-seq protocol. It seems reasonable to me to imagine that some sort of fragmentation must have occured in the B samples.

Regarding the design, if your samples are only partially paired, then use duplicate correlation to handle the pairing so you can use analyse all the data together and still respect the pairing. See Section 9.7.

Personally I do not use or recommend volcano plots. I find mean-difference plots much more informative (plotMD).

Regarding interaction terms, I don't know what you mean, but you might try https://f1000research.com/articles/9-1444 . I generally recommend a direct approach to comparing groups that avoids classical interaction terms.

ADD REPLY • link 2.8 years ago by Gordon Smyth ★ 7.8k