Normalizing RNA-Seq Data Using ERCC Spike-Ins
11.2 years ago
Eric Fournier ★ 1.4k

Hi,

I have sequencing data that was spiked with exogenous ERCC controls to allow normalization. I have aligned the reads using TopHat 2 and estimated transcript abundances using Cufflinks. To normalize those abundances and get a measure that is directly comparable between replicates, my strategy so far has been to take a transcript's FPKM and divide it by the sum of the FPKMs of all the exogenous ERCC controls.

This seems to work, but it leaves me with a question: by dividing FPKMs by FPKMs, am I not cancelling out the part of the FPKM calculation that accounts for the number of reads? That is, FPKM is Fragments Per Kilobase of exon per Million mapped reads. The "kilobase of exon" term is a property of each transcript and is identical for every ERCC transcript across all replicates, so all of my calculated values are scaled by the same constant; no issue there. However, the "per million reads" term is a per-replicate variable, and it is identical for the biological transcript and the exogenous ones, so I assume the two will cancel each other out. Is that right? And if it is, is that desirable (since I am, after all, seeking to normalize to the amount of ERCC transcript), or should I switch to normalizing by, say, the total number of reads aligned to the ERCC transcripts?
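To spell my reasoning out (the symbols are mine: $f_t$ is the number of fragments assigned to transcript $t$, $\ell_t$ its exonic length in kilobases, and $N$ the replicate's total mapped fragments in millions):

$$\mathrm{FPKM}_t = \frac{f_t}{\ell_t\,N}, \qquad \frac{\mathrm{FPKM}_t}{\sum_{e\in\mathrm{ERCC}}\mathrm{FPKM}_e} = \frac{f_t/\ell_t}{\sum_{e\in\mathrm{ERCC}} f_e/\ell_e},$$

so the per-replicate factor $N$ does cancel, and the ERCC lengths contribute only a constant shared by every replicate.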

normalization rna-seq fpkm

How will you use cuffdiff to evaluate significant changes in expression with your normalized FPKM files? So far I have only been successful running cuffdiff with .bam or .sam input. Is there a way to give cuffdiff spike-in-normalized FPKM files as input?


Does anyone have an opinion or answer to this question?

11.2 years ago

Remember that the "per million reads" part of FPKM is really a library-size normalization step. Undoing it with your spike-ins is fine, since you end up with a properly length-normalized abundance that is subsequently normalized to the spike-ins. If you're using cuffdiff next, be sure to change its default library-size normalization so it doesn't undo all your hard work!
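In code, that end result might look like this (a sketch; fpkm is a hypothetical transcript-by-sample FPKM matrix whose spike-in rows carry the standard ERCC- names):

```r
## "fpkm": hypothetical transcript-by-sample matrix of FPKMs, with
## spike-in rows named like "ERCC-00002"
spikes <- grepl("^ERCC-", rownames(fpkm))

## Divide each column (replicate) by that replicate's total ERCC FPKM
scaled <- sweep(fpkm[!spikes, ], 2, colSums(fpkm[spikes, ]), "/")
```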

I'm curious how well this works with cuffdiff (I assume that's what you're using) and how the results compare to the same data run through DESeq, where I find dealing with spike-ins more straightforward.


dpryan79 -- Can you provide more details on how you normalize for spike-ins with DESeq?


Read in the count data, subset the resulting matrix so that it includes only the spike-ins, create a DESeqDataSet from that, and then just run estimateSizeFactors() on it. The resulting size factors can then be placed in the appropriate slot of the DESeqDataSet for the full count matrix (make sure to remove the spike-ins from that one, since you no longer need them); see the sketch below.

Edit: The same procedure would work for edgeR or limma as well. This is also part of the modification that SAMstrt makes to SAMseq, if you're interested in just using that.
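A minimal sketch of those steps in DESeq2, assuming counts is a gene-by-sample integer matrix whose spike-in rows are named like "ERCC-00002", and coldata has a condition column (all hypothetical names):

```r
library(DESeq2)

spikes <- grepl("^ERCC-", rownames(counts))

## Size factors estimated from the spike-ins alone
dds_spike <- DESeqDataSetFromMatrix(countData = counts[spikes, ],
                                    colData   = coldata,
                                    design    = ~ condition)
dds_spike <- estimateSizeFactors(dds_spike)

## Full object without the spike-ins; drop in the spike-in-derived
## size factors so DESeq() won't re-estimate them from the genes
dds <- DESeqDataSetFromMatrix(countData = counts[!spikes, ],
                              colData   = coldata,
                              design    = ~ condition)
sizeFactors(dds) <- sizeFactors(dds_spike)

dds <- DESeq(dds)
res <- results(dds)
```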


Thank you. I only just found this response and am trying it today. I have not found a way to give cuffdiff anything other than .bam or .sam input, which precludes me from comparing how well cuffdiff works with normalized FPKM input versus DESeq and edgeR. I agree with your comment above that this would be an important comparison. Best,


There's likely some hacking of the source needed, since cuffdiff tries to re-estimate FPKMs from a merged annotation. You'd just need to get around that step, and then the remainder should work. Personally, I simply wouldn't use cuffdiff for this sort of task.


Hi Devon

(1) Could you post the R commands for the steps you mentioned (how you normalize for spike-ins with DESeq/DESeq2)? For example, you said to "subset the resulting matrix such that it includes only the spike-ins", but I don't know how to identify which rows are the spike-ins.

(2) If you do the same in edgeR, do we still need to filter the count table by CPM?

Thank you so much!


(1) You'd have to either know ahead of time which rows have the spike-ins or know the names that they go by.

(2) "Need" is a bit strong, but you'll probably benefit from doing so (simply for the sake of statistical power).


I haven't been using cuffdiff for differential expression testing, as it has been crashing when I try to feed it 30 multi-GB BAM files at a time.


Generally, we've stopped using the Tuxedo suite altogether for similar reasons.

11.2 years ago

If you are going to use the ERCC data, then (IMHO) you should loess-normalize your raw data with the ERCC controls factored in.

See here for a related question: question in normalizing with ERCC spike-in control
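One possible implementation (a sketch, not necessarily what was meant here): affy::normalize.loess() accepts a subset argument, so the loess curves can be fitted on the spike-in rows only and then applied to the whole matrix. mat is a hypothetical transcript-by-sample matrix of raw abundances:

```r
## Loess normalization anchored on the ERCC rows; "mat" is a
## hypothetical matrix with spike-in rows named like "ERCC-00002"
library(affy)

spikes <- which(grepl("^ERCC-", rownames(mat)))

## Curves are fitted from the "subset" rows only, then the correction
## is applied to every row; the +0.5 offset avoids log(0)
mat_norm <- normalize.loess(mat + 0.5, subset = spikes, log.it = TRUE)
```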


I've tried loess normalization, but the results have been pretty awful. I cannot say with certainty why, but I think it is because in my data the correlation between ERCC concentration and FPKM goes down the drain at lower concentrations, with FPKMs that sometimes vary a hundredfold at the same concentration.
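A quick way to see where that relationship breaks down (a sketch; ercc_conc is a hypothetical named vector of the known mix concentrations, fpkm the matrix from the sketches above):

```r
## Known ERCC mix concentration vs. observed FPKM for one replicate;
## both object names are hypothetical
ids <- intersect(names(ercc_conc), rownames(fpkm))
plot(log2(ercc_conc[ids]), log2(fpkm[ids, 1] + 0.5),
     xlab = "log2 known ERCC concentration",
     ylab = "log2 observed FPKM (replicate 1)")
```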
