How to normalise RNAseq data without spike-ins?
Mo ▴ 50 • 19 days ago

Hi all,

I have done an mRNA stability experiment in which I inhibited global transcription in human cells at time point zero and collected the cells at different time points (0, 1, 2, 4, 6 and 8 hours). I then isolated total RNA from the cells at all time points and depleted the rRNA. From the remaining mRNA, I made double-stranded cDNA without PCR amplification. I then barcoded the samples, pooled them together and sequenced the cDNA on Nanopore.

For analysis, I quantified the total reads for each gene at each time point and normalised these by the total library sequencing reads at that time point to get reads per million.
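
For reference, here is a minimal sketch of that reads-per-million calculation, using a made-up gene-by-timepoint count table (the gene names and counts are purely illustrative):

```python
import pandas as pd

# Hypothetical raw counts: genes (rows) x time points (columns).
counts = pd.DataFrame(
    {"0h": [1500, 300, 80], "2h": [900, 60, 70], "8h": [400, 5, 60]},
    index=["GAPDH", "MYC", "GENE_X"],
)

# Reads per million: divide each column by its library total, scale to 1e6.
rpm = counts.div(counts.sum(axis=0), axis=1) * 1e6
print(rpm.round(1))
```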

As you can see in the first heatmap, there are ~200 genes (in rows) across the different time points (columns), and the data looks noisy. If there was 100% mRNA at time point zero, there should have been less mRNA at later time points (i.e., 1,2,4,6 & 8 hours), but the mRNA goes up and down. In terms of the drug treatment, the experiment has worked: as you can see in the controls, GAPDH, one of the most stable mRNAs, still has a good amount of mRNA left at the later time points, while the least stable mRNA, c-myc, has rapidly degraded. An ideal heatmap would look like the second figure (yellow and black heatmap, from a reference paper).

The only issue I see is that in this paper, they used spike-in RNA to normalise the counts at different time points, whereas I forgot to include the spike-in RNA. I have tried normalising my data to GAPDH and using other normalisation methods, such as FPKM, but it still looks the same.

For various reasons, I can't repeat this experiment, so I am looking for a way to normalise my data and get rid of this noise. Any suggestions would be really helpful.

Thanks a lot.

[First heatmap: ~200 genes (rows) across the time points (columns)]

[Second heatmap: reference figure (yellow/black) from the cited paper]

Nanopore spike-in RNAseq
LChart 4.5k • 19 days ago

The issue you're running into is that, even after halting mRNA production, there is still residual mRNA and you only select some fraction of it during the library preparation process. Differences in fragment lengths during library preparation mean that you're potentially sampling different fractions of the initial pool. Normalizing to housekeeping genes won't really work, for the same reason it wouldn't work for a qPCR readout: it captures relative rather than global changes in expression. This is the reason for using ERCC spike-ins, though experiences using them are mixed.

If there was 100% mRNA at time point zero, there should have been less mRNA at later time points (i.e., 1,2,4,6 & 8 hours), but the mRNA goes up and down

What you're seeing here is that some genes decay more rapidly ("down") than other genes ("up"). When you go to sequence these things and then normalize to the library size (or to a housekeeping gene), you're basically observing that fast-decaying genes are a smaller proportion of the library, or that they decay faster than your housekeeping genes.

One thing you should find is that your libraries are progressively less complex - that is, if you down-sample and compute the % duplication, these curves should be shallow at time 0 and grow steeper at the later time points. This indicates that your pool of RNA contained fewer molecules to start with [or: you lost a large fraction during library prep].
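
As an illustration (not part of the original answer), a minimal sketch of building such a duplication curve by down-sampling, assuming reads have already been collapsed to some molecule identifier such as alignment start/end coordinates:

```python
import numpy as np

rng = np.random.default_rng(0)

def duplication_curve(molecule_ids, depths, rng=rng):
    """Fraction of duplicate reads at each down-sampled read depth.

    molecule_ids: one identifier per read (e.g. mapped start/end coordinates),
    so reads sharing an identifier count as duplicates of one molecule.
    """
    rates = []
    for d in depths:
        sample = rng.choice(molecule_ids, size=d, replace=False)
        rates.append(1 - len(np.unique(sample)) / d)
    return np.array(rates)

# Toy libraries: a low-complexity (late timepoint) library saturates much faster.
many_molecules = rng.integers(0, 5_000_000, size=2_000_000)  # time 0
few_molecules = rng.integers(0, 200_000, size=2_000_000)     # late timepoint
depths = [100_000, 500_000, 1_000_000, 2_000_000]
print(duplication_curve(many_molecules, depths))  # shallow curve
print(duplication_curve(few_molecules, depths))   # steep curve
```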

Here's something you can do. I will have to think if it is statistically justified or not (others are welcome to chime in on this), but it will give you the result you desire, and if you had a negative control it would also give you basically a null result.

  1. For each timepoint and replicate, compute the duplication curve by down-sampling your reads. Hopefully you have sufficient depth to build this from 1M to 20M reads.
  2. For your baseline timepoint, set its duplication rate as the reference point (I will assume 80% here, but it could be different).
  3. For each timepoint and replicate, fit a (non-linear) model duplication.rate ~ read.depth, and compute the library size at 80% duplication as 0.8 * <read depth you need for 80% duplication>. Spend some time on this to make sure it's right - it may be easier to fit a logistic to (1-duplication.rate) ~ log(read.depth); see the sketch after this list.
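
A minimal sketch of step 3, assuming the logistic-in-log-depth parametrisation suggested above; the curve values, the 80% reference point and the starting parameters are all illustrative:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical duplication curve for one timepoint/replicate (from down-sampling).
depths = np.array([1e6, 2e6, 5e6, 10e6, 20e6])
dup_rate = np.array([0.35, 0.48, 0.66, 0.78, 0.87])

def unique_frac(log_depth, k, x0):
    # Logistic model for (1 - duplication rate) as a function of log(read depth).
    return 1.0 / (1.0 + np.exp(k * (log_depth - x0)))

(k, x0), _ = curve_fit(unique_frac, np.log(depths), 1 - dup_rate, p0=[1.0, np.log(5e6)])

# Invert the fit: read depth at which duplication reaches the 80% reference point.
target_unique = 1 - 0.80
depth_80 = np.exp(x0 + np.log(1 / target_unique - 1) / k)

# Estimated library size at the reference duplication rate (step 3).
library_size_80 = 0.8 * depth_80
print(f"depth at 80% duplication ~ {depth_80:,.0f}; library size ~ {library_size_80:,.0f}")
```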

You can then use a normalization factor NFA = timepoint_library_size / baseline_library_size, so your adjusted CPM/TPM would be CPM * NFA. For something like DESeq2 you would basically use the baseline library size as your size factor for everything, and you will have to perform the following up-sampling (a sketch follows the bullet below):

  • [Count up-sampling] For each timepoint and replicate, compute the adjusted count as library_size_at_80 * (count_gene / total_counts)
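
Here is a minimal sketch of that normalization factor and the count up-sampling, with made-up library-size estimates carried over from the previous step:

```python
import pandas as pd

# Hypothetical raw counts (genes x timepoints) and estimated library sizes
# at the 80% duplication reference point for each timepoint.
counts = pd.DataFrame(
    {"0h": [1500, 300], "4h": [900, 40], "8h": [500, 5]},
    index=["GAPDH", "MYC"],
)
library_size_80 = pd.Series({"0h": 8_000_000, "4h": 3_000_000, "8h": 1_200_000})

# Normalization factor relative to the baseline timepoint.
nfa = library_size_80 / library_size_80["0h"]

# Adjusted CPM: ordinary CPM scaled by the per-timepoint factor.
cpm = counts.div(counts.sum(axis=0), axis=1) * 1e6
adjusted_cpm = cpm.mul(nfa, axis=1)

# Count up-sampling: redistribute the estimated number of molecules in each
# library according to each gene's share of the reads.
adjusted_counts = counts.div(counts.sum(axis=0), axis=1).mul(library_size_80, axis=1)
print(adjusted_cpm.round(1))
print(adjusted_counts.round(0))
```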

This basically assumes that each library profiles the same proportion of the overall RNA, so you can use the estimated number of molecules in each library as a proxy for the estimated total number of RNA molecules. As such, this is only valid if the number of input cells was identical for all replicates; otherwise you need to further normalize by the number of input cells.

Obviously this relies on many assumptions about the equivalence of cDNA profiling between separate library preparations. ERCC spike-ins would largely free you from these assumptions.

In the future, where overall complexity is a concern, I would recommend using a barcoded RT, or indexing and pooling prior to any cDNA amplification, and preparing a single library from the pooled cDNA. You would have to sequence pretty deeply to capture fragments from the highly degraded RNA libraries, but it would free you from this nasty normalization question; you could normalize everything to the full library size.
