How does edgeR do ERCC spike-in normalization?
8.2 years ago
moxu ▴ 510

It seems calcNormFactors(..) only normalizes by library size. It would be a big loss of information if the ERCC spike-in information were not used.

Thanks

RNA-Seq R • 8.9k views
8.2 years ago
GZ1995 ▴ 410

You can simply calculate the norm factors on the ERCC spike-ins and pass them to the downstream analysis.

 # `spikes` indexes the ERCC rows of the DGEList `x`
 x$samples$norm.factors <- calcNormFactors(x[spikes, ])$samples$norm.factors

BTW, in RUV paper the authors suggest that ERCC spike-in does not behave like endogenous genes. Global normalisation based on ERCC spike-in can lead to poor normalised counts.
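For what it's worth, the one-liner above can be put in context with a small simulated example. This is only a sketch: the count matrix, the `ERCC-` row-name prefix used to find the spike-ins, and the group labels are all made up for illustration; only `DGEList` and `calcNormFactors` are actual edgeR functions.

```r
library(edgeR)

set.seed(1)
# Hypothetical data: 500 endogenous genes + 50 spike-ins, 6 samples
n_genes <- 500; n_spikes <- 50; n_samples <- 6
counts <- matrix(rnbinom((n_genes + n_spikes) * n_samples, mu = 100, size = 10),
                 ncol = n_samples)
rownames(counts) <- c(paste0("gene", seq_len(n_genes)),
                      paste0("ERCC-", seq_len(n_spikes)))
colnames(counts) <- paste0("S", seq_len(n_samples))

x <- DGEList(counts = counts, group = rep(c("ctrl", "trt"), each = 3))

# Assumption: spike-in rows can be identified by an "ERCC-" prefix
spikes <- grepl("^ERCC-", rownames(x))

# TMM factors estimated from the spike-in rows only, stored on the full object
x$samples$norm.factors <- calcNormFactors(x[spikes, ])$samples$norm.factors

# Downstream edgeR functions work with lib.size * norm.factors
eff_lib_size <- x$samples$lib.size * x$samples$norm.factors
eff_lib_size
```

After this, `x` can go into the usual edgeR steps (dispersion estimation, model fitting); whether the spike-in rows should be dropped first is a separate decision that this sketch does not make.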


Thanks for the reply. I read the abstract of the RUV paper as well. One researcher in our lab told me that he had some experience with ERCC normalization -- without it he had some ridiculous results.

BTW, how do I use RUV to do ERCC normalization?

browseVignettes("RUVSeq")
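
In case it helps, here is a minimal sketch of the RUVg route. It assumes the matrix interface of `RUVg()` from RUVSeq (counts, control-gene index `cIdx`, number of factors `k`) and uses made-up data and an assumed `ERCC-` row-name prefix; the vignette remains the authoritative description.

```r
library(RUVSeq)

set.seed(1)
# Hypothetical counts: 500 endogenous genes + 50 spike-ins, 6 samples
counts <- matrix(rpois(550 * 6, lambda = 100), ncol = 6,
                 dimnames = list(c(paste0("gene", 1:500), paste0("ERCC-", 1:50)),
                                 paste0("S", 1:6)))
group <- factor(rep(c("ctrl", "trt"), each = 3))

# Assumption: spike-ins are the rows whose names start with "ERCC-"
spikes <- grep("^ERCC-", rownames(counts), value = TRUE)

# Estimate k = 1 factor of unwanted variation from the spike-ins
ruv <- RUVg(counts, cIdx = spikes, k = 1)

# Use the estimated factor as a covariate in the design matrix
W_1 <- ruv$W[, 1]
design <- model.matrix(~ group + W_1)
```

The `design` matrix would then be passed to the differential expression fit on the original counts, so the factor is adjusted for in the model instead of being baked into the counts.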

The RUV method assumes the RLE distribution to be centered around 0, but I am not sure this should always be the case. In our experiment, the cells are treated with toxic chemicals, and gene expression is largely reduced due to the harmful effect of the treatment. IMHO, ERCC normalization is the most logically sound of all the kinds of normalization (library size, RUV, etc.)?

As you said above, it seems reasonable to me that I should normalize by ERCC spike-in and then feed the data to edgeR. Not sure why edgeR authors do not like any sort of normalization, though.


Normalisation methods like TMM (the default when calling calcNormFactors) and other global scaling methods also assume there is no global shift in gene expression. I agree with you that normalization on spike-ins is a good idea in your case.

The RUV method does not assume the RLE distribution to be centered around 0. The RLE plot is a diagnostic plot for checking whether your samples have similar distributions, which is informative if you believe most genes are not differentially expressed (not true in your case, as you have mentioned). There are other criteria to check (PCA, p-value distribution, positive controls, etc.). The assumption of RUVg is that the factors of unwanted variation estimated from the spike-ins span the same linear space as the factors of unwanted variation for all genes [1].
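
For reference, the RLE values behind that diagnostic plot are simple to compute: log-transform the counts, then subtract each gene's median across samples. A base-R sketch on made-up counts (the pseudocount of 1 and the simulated global drop in sample S4 are arbitrary choices for illustration):

```r
set.seed(1)
# Hypothetical counts: 200 genes, 4 samples
counts <- matrix(rpois(200 * 4, lambda = 50), ncol = 4,
                 dimnames = list(paste0("gene", 1:200), paste0("S", 1:4)))
# Simulate one sample with globally lower counts
counts[, 4] <- rpois(200, lambda = 25)

# Relative log expression: log count minus the gene's median log count
log_counts <- log2(counts + 1)
gene_medians <- apply(log_counts, 1, median)
rle_mat <- sweep(log_counts, 1, gene_medians, "-")   # genes x samples

# One box per sample; boxes far from 0 flag samples that differ globally
boxplot(rle_mat, ylab = "Relative log expression")
abline(h = 0, lty = 2)
```

In this toy data the box for S4 sits below zero. The plot alone cannot say whether such a shift is technical or a real global change in expression, which is why the other checks matter when a treatment is expected to reduce expression overall.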

You may also try other methods like supervised svaseq [2] or cyclic loess regression on spike-ins [3]. From my experience supervised svaseq behaves similarly with RUVg. I don't have much experience using cyclic loess on ERCC spike-ins, but from the RUV paper it seems that it does not perform very well on RNA-seq data.

Refs:

  1. Normalization of RNA-seq data using factor analysis of control genes or samples

  2. svaseq: removing batch effects and other unwanted noise from sequencing data

  3. Revisiting global gene expression analysis

5.8 years ago

While I think it is kind of hard to precisely prove, I don't think the ERCC spike-in information is really as crucial as assumed in the question.

However, I really like the comments from first-hand experience such as "From my experience supervised svaseq behaves similarly with RUVg," and I thought it was really important for eldronzhou to point out "BTW, in RUV paper the authors suggest that ERCC spike-in does not behave like endogenous genes. Global [normalization] based on ERCC spike-in can lead to poor [normalized] counts."

So, for what it is worth, here is my input:

1) My personal preference is to test multivariate models for differential expression with multiple methods (such as edgeR, DESeq2, limma-voom, etc.) rather than applying a corrected normalization upstream of differential expression. That said, I do test visualization with simple adjustments after differential expression, such as centering expression among the groups that you want to adjust.

2) You still need to critically assess the supervised normalization strategies. For example, if you use ComBat to adjust expression in a way that essentially makes your samples show the clustering that you want, you should be wary of over-correction (which may make results less robust and harder to reproduce in other studies). I would recommend checking expression before and after any sort of adjustment. For example, maybe it is helpful to center expression by batch, but can you check both versions of the expression values and see that your conclusions would be similar? Do you see similar trends in each batch? Or did your normalization do something like change the direction of the gene expression change within a batch, which would need to be examined more carefully?

3) While I'm sure you can find a variety of opinions, here are some other references that I believe indicate normalization with ERCC spike-ins can be problematic:

Paper #1 (Qing et al. 2013): "[ERCC] fluctuation may prevent the ERCC controls from being used for cross-sample normalization in RNA-Seq"

Paper #2 (SEQC 2014 Nature paper): "We observed, however, that the fraction of reads aligning to ERCC spike-ins for a given sample varied widely between libraries and platforms, with measured ERCC ranges of 1–2.5% for HiSeq 2000 and 2.5–4.7% for SOLiD, with a clear ‘library effect’ observed for all sites and platforms, affecting reproducibility"

--> My opinion on this is that you should probably try to define the most direct adjustment possible. In other words, if what you really want to adjust for is the total number of detected genes, then maybe something like TMM normalization (from the overall distribution for all quantified genes) is better than trying to normalize based upon manually added ERCC spike-ins. Even if it helps overall, I think it should be understood that TMM normalization may not be perfect and may still over-correct to some extent, which you should evaluate for each project.

(For #2, I should thank members of the Bioinformatics Forum at City of Hope for having a discussion that caused me to, at least briefly, read these citations in the context of the ERCC spike-ins recently. These represent my individual opinions, and agreement shouldn't be assumed for all members of the discussion group. I also think the Qing et al. 2013 paper has a relatively small number of citations for a good paper that provides some important and interesting points that I found fairly easy to understand.)

24 months ago
Bogdan ★ 1.4k

I would like to add a related question, similar to the one outlined in:

https://support.bioconductor.org/p/9135179/ (it is an old message, sorry for cross-posting)

In contrast to the question on the BioC mailing list, my question is: does anyone know whether this is a well-validated, peer-reviewed formula? Thank you!

    y <- DGEList(counts = counts_without_spike_in, samples = mapping_file)

    # Per-sample factor relative to the library size
    norm.factors <- spike_in_factor / y$samples$lib.size

    # Rescale so that the factors multiply to 1 (geometric mean of 1)
    norm.factors <- norm.factors / prod(norm.factors)^(1/length(norm.factors))

    y$samples$norm.factors <- norm.factors
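
One thing that can be checked directly, whatever the answer on peer review, is what the formula does arithmetically. Since edgeR uses `lib.size * norm.factors` as the effective library size, the snippet makes the effective library sizes proportional to `spike_in_factor`; the rescaling step only makes the factors multiply to 1. A base-R sketch with made-up numbers (here `spike_in_factor` is assumed to be a per-sample spike-in total, which the post does not specify):

```r
# Hypothetical per-sample values, for illustration only
lib.size        <- c(2.0e6, 2.5e6, 1.8e6, 3.0e6)   # endogenous library sizes
spike_in_factor <- c(4.0e4, 4.5e4, 6.0e4, 5.0e4)   # assumed spike-in totals

# Same two steps as in the snippet above
norm.factors <- spike_in_factor / lib.size
norm.factors <- norm.factors / prod(norm.factors)^(1 / length(norm.factors))

# Effective library size as used by edgeR
eff.lib.size <- lib.size * norm.factors

prod(norm.factors)               # 1, up to floating point error
eff.lib.size / spike_in_factor   # the same constant for every sample
```

So the formula amounts to scaling each sample by its spike-in factor instead of by its endogenous library size. Whether that is appropriate depends on how reliable the spike-in amounts are across samples, which is the concern raised elsewhere in this thread.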
