Hello folks,
I use TopHat and Cufflinks to process some NGS data (human genome). When I ran cuffdiff to obtain differential expression (DE) values, I got FPKM values for each group. Since I didn't provide a GTF file with known exons, I wondered how the FPKM values were calculated.
The formula for FPKM is 10^9 * C / (N * L), where C is the number of mappable reads that fall onto the gene's exons (how does the program know this?), N is the total number of mappable reads in the experiment, and L is the total number of base pairs in the gene's exons (again, where does this number come from?).
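To make the formula concrete, here is a minimal sketch that just evaluates 10^9 * C / (N * L) as written above. The function name `fpkm` and the example numbers are made up for illustration; this is not how Cufflinks computes its estimates internally.

```python
def fpkm(c: int, n: int, l: int) -> float:
    """Evaluate 10^9 * C / (N * L).

    c: reads (fragments) that fall onto the gene's exons
    n: total mappable reads (fragments) in the experiment
    l: total exon length of the gene in base pairs
    """
    if n <= 0 or l <= 0:
        raise ValueError("n and l must be positive")
    return 1e9 * c / (n * l)


# Hypothetical numbers: 500 fragments on a gene with 2,500 bp of exons,
# in an experiment with 20 million mapped fragments.
print(fpkm(500, 20_000_000, 2_500))  # 10.0
```

In words: 500 fragments on 2.5 kb is 200 fragments per kilobase, and dividing by 20 million mapped fragments gives 10.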
I used the pipeline provided in the Cufflinks tutorial: http://cufflinks.cbcb.umd.edu/tutorial.html (see below).
Then I read an answer in this forum that is closely related to my question, but I think it contradicts the Cufflinks tutorial. So now I'm confused :-)
In another run I used single-end data and also obtained FPKM values. Shouldn't I get RPKM values for single-end reads?
Thanks in advance, Oliver
Here is my pipeline (I hope I pooled the samples the right way):
Group A:
SRR027863_1.fastq SRR027863_2.fastq
SRR027864_1.fastq SRR027864_2.fastq
SRR027865_1.fastq SRR027865_2.fastq
Group B:
SRR027866_1.fastq SRR027866_2.fastq
SRR027867_1.fastq SRR027867_2.fastq
Pipeline:
tophat -r 160 -o top_SRR027863-65 ../../../reference/hg19 SRR027863_1.fastq,SRR027864_1.fastq,SRR027865_1.fastq SRR027863_2.fastq,SRR027864_2.fastq,SRR027865_2.fastq
tophat -r 160 -o top_SRR027866-67 ../../../reference/hg19 SRR027866_1.fastq,SRR027867_1.fastq SRR027866_2.fastq,SRR027867_2.fastq
cufflinks -o cuff_SRR027863-65 top_SRR027863-65/accepted_hits.bam
cufflinks -o cuff_SRR027866-67 top_SRR027866-67/accepted_hits.bam
cuffmerge -s ../../../reference/hg19.fa assemblies.txt
cuffdiff merged_asm/merged.gtf top_SRR027863-65/accepted_hits.bam top_SRR027866-67/accepted_hits.bam
assemblies.txt:
cuff_SRR027863-65/transcripts.gtf
cuff_SRR027866-67/transcripts.gtf
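On where an exon length can come from without a reference annotation: the merged.gtf passed to cuffdiff is itself a GTF, and GTF exon records carry 1-based, inclusive start and end coordinates. A rough sketch of summing exon lengths per transcript is below. The helper name `transcript_lengths` and the two example records are made up, and the lengths Cufflinks actually uses may differ from this plain sum.

```python
from collections import defaultdict


def transcript_lengths(gtf_lines):
    """Sum exon lengths per transcript_id from GTF-formatted lines."""
    lengths = defaultdict(int)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records contribute to the length
        start, end = int(fields[3]), int(fields[4])
        # Parse the attribute column: key "value"; key "value"; ...
        attrs = {}
        for part in fields[8].split(";"):
            part = part.strip()
            if part:
                key, _, value = part.partition(" ")
                attrs[key] = value.strip('"')
        tid = attrs.get("transcript_id")
        if tid is not None:
            # GTF coordinates are 1-based and inclusive
            lengths[tid] += end - start + 1
    return dict(lengths)


# Two made-up exon records for one hypothetical transcript.
example = [
    'chr1\tCufflinks\texon\t1001\t1200\t.\t+\t.\t'
    'gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1";',
    'chr1\tCufflinks\texon\t2001\t2300\t.\t+\t.\t'
    'gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2";',
]
print(transcript_lengths(example))  # {'TCONS_00000001': 500}
```

To try this on a real file, pass an open file handle, e.g. `transcript_lengths(open("merged_asm/merged.gtf"))`.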
If you have a reasonable mathematics background, I highly recommend reading the supplementary methods of the Cufflinks paper. It is very well-written and fully explains all the calculations in going from read alignments to FPKM values.