Hi,
I have sequencing data that was spiked with exogenous ERCC controls to allow normalization. I aligned the reads with TopHat 2 and quantified transcript abundances with Cufflinks. To normalize abundances and obtain a measure that is directly comparable between replicates, my strategy so far has been to take each transcript's FPKM and divide it by the sum of the FPKMs of all exogenous ERCC controls in that replicate.
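For reference, here is a minimal sketch of what I mean, assuming a Cufflinks isoforms.fpkm_tracking file (which has tracking_id and FPKM columns) and ERCC transcripts whose IDs start with "ERCC-" (adjust the pattern and file name to your own annotation):

```python
import pandas as pd

# Load the Cufflinks FPKM tracking table for one replicate
# (assumed columns: tracking_id, FPKM, among others)
fpkm = pd.read_csv("isoforms.fpkm_tracking", sep="\t")

# Identify the exogenous ERCC spike-in transcripts by ID prefix
# (assumes IDs like "ERCC-00002"; change if your spike-ins are named differently)
is_ercc = fpkm["tracking_id"].str.startswith("ERCC-")
ercc_total = fpkm.loc[is_ercc, "FPKM"].sum()

# Divide every transcript's FPKM by the total ERCC FPKM for this replicate
fpkm["FPKM_ercc_norm"] = fpkm["FPKM"] / ercc_total

fpkm.to_csv("isoforms.ercc_norm.tsv", sep="\t", index=False)
```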
This seems to work, but it leaves me with a question: by dividing FPKMs by FPKMs, am I not cancelling out the part of the FPKM calculation that accounts for sequencing depth? That is, FPKM is Fragments Per Kilobase of exon per Million mapped reads. The "kilobase of exon" term is a property of each transcript and is the same in every replicate (the same ERCC transcripts are spiked into every sample), so it only scales all of my values by a constant and is not a problem. The "per million reads" term, however, is a per-replicate variable, and it is the same for the biological transcript and the exogenous ones within a replicate, so I assume it cancels out in the division. Is that right? And if it is, is that actually desirable (since I am, after all, seeking to normalize to the amount of ERCC transcripts), or should I switch to normalizing by, for example, the total number of reads aligned to the ERCC transcripts?
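To make the cancellation explicit, here is the algebra as I understand it, using the textbook FPKM formula and ignoring Cufflinks' effective-length and multi-mapping corrections ($q_i$ = fragments assigned to transcript $i$, $l_i$ = its exon length in kilobases, $N$ = total mapped fragments in millions):

$$
\mathrm{FPKM}_i = \frac{q_i}{l_i \, N},
\qquad
\frac{\mathrm{FPKM}_g}{\sum_{e \in \mathrm{ERCC}} \mathrm{FPKM}_e}
= \frac{q_g / (l_g N)}{\sum_{e} q_e / (l_e N)}
= \frac{q_g / l_g}{\sum_{e} q_e / l_e}.
$$

So the per-million term $N$ does drop out, leaving the gene's length-normalized fragment count expressed relative to the length-normalized ERCC fragment counts, unless I am missing something.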
How will you use Cuffdiff to evaluate significant changes in expression with your normalized FPKM values? So far I have only been able to run Cuffdiff with .bam or .sam files as input. Is there a way to give Cuffdiff spike-in-normalized FPKM files as input?
Does anyone have an opinion or answer to this question?