Meaning Of Fpkm Value Used By Cufflinks
1
8
Entering edit mode
13.3 years ago
Oliver ▴ 240

Hello folks,

I use TopHat and Cufflinks to process some NGS data (human genome). When using cuffdiff to obtain values for the DE I got FPKM values for each group. Since I didn't provide any gtf-file with known exons I wondered how the FPKM values were calculated.

The formula for FPKM is 10^9 * C / (N * L), with C is the number of mappable reads that fell onto the gene's exons (how did the program know this?), N the total number of mappable reads in the experiment and L the number of base pairs in the exon (again, where is this number coming from?).

I used the pipeline provided in the Cufflinks tutorial: http://cufflinks.cbcb.umd.edu/tutorial.html (see below).

Then I read in this forum an answer which is quite related to my current question but I thinks this is opposed to the cufflinks tutorial. So I'm confused now :-)

http://biostar.stackexchange.com/questions/10996/determining-fpkm-of-de-novo-assembly-of-transcriptome

In another try I used some single end data and also obtained FPKM values. Shouldn't I get a RPKM value for single end reads?

Thanks in advance, Oliver

Here my pipeline (I hope I pooled the samples the right way?):

Group A:

SRR027863_1.fastq SRR027863_2.fastq
SRR027864_1.fastq SRR027864_2.fastq
SRR027865_1.fastq SRR027865_2.fastq

Group B:

SRR027866_1.fastq SRR027866_2.fastq
SRR027867_1.fastq SRR027867_2.fastq

Pipeline:

tophat -r 160 -o top_SRR027863-65 ../../../reference/hg19 SRR027863_1.fastq,SRR027864_1.fastq,SRR027865_1.fastq SRR027863_2.fastq,SRR027864_2.fastq,SRR027865_2.fastq
tophat -r 160 -o top_SRR027866-67 ../../../reference/hg19 SRR027866_1.fastq,SRR027867_1.fastq SRR027866_2.fastq,SRR027867_2.fastq

cufflinks -o cuff_SRR027863-65 top_SRR027863-65/accepted_hits.bam
cufflinks -o cuff_SRR027866-67 top_SRR027866-67/accepted_hits.bam

cuffmerge -s ../../../reference/hg19.fa assemblies.txt

cuffdiff merged_asm/merged.gtf top_SRR027863-65/accepted_hits.bam top_SRR027866-67/accepted_hits.bam

assemblies.txt:

cuff_SRR027863-65/transcripts.gtf
cuff_SRR027866-67/transcripts.gtf
next-gen sequencing gene cufflinks cuffdiff tophat • 38k views
ADD COMMENT
2
Entering edit mode

If you have a reasonable mathematics background, I highly recommend reading the supplementary methods of the Cufflinks paper. It is very well-written and fully explains all the calculations in going from read alignments to FPKM values.

ADD REPLY
18
Entering edit mode
13.3 years ago
Karl ▴ 350

Tophat has aligned your reads to the reference genome, and marked those reads which align with and without splice junctions. It's intended for transcriptome analysis, or RNA-Seq. That way, anything that maps unspliced must be an exon and anything that jumps regions must span an intron.

Cufflinks then looks at the distribution of reads and estimates a transcript structure (or more than one at the same region).

It's got a fancy probabilistic assignment method and you should read their paper for details.

So each region(gene) has multiple transcripts and each transcript has multiple exons, but the transcripts in a region share exons and that's why the reads don't map precisely, but probabilistically. That's why instead of giving you read count numbers, it gives this fancy FPKM number.

Tophat might be the wrong tool if you arent doing RNA-Seq, but instead looking purely at genomic DNA. Try to get your data into a viewer like IGV to see if it looks right. IGV can load a SAM or BAM and show you where the reads are, it's nice for reviewing the aligner's work.

If you're after a simpler to interpret system or not interested in alternative splicing/transcripts. Try something like DESeq, a simple read count based system.

As to the question about FPKM vs RPKM, it's just a name, they switched from Reads per... to Fragments per... to clean up confusion regarding paired end reads, now its all about the number of cDNA fragments regardless of which way you do the measurement (SE or PE).

ADD COMMENT
1
Entering edit mode

Thanks for your answer, Karl. I forgot to mention that I use RNA-Seq data, paired end. The data I use is from the SRA http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP000698

In the IGV the data looks well. Using coverageBed with the gene.gtf from UCSC for hg19 I get 62% of exonic regions that are hit by reads with a coverage > 0.

I read now the supplementary material for the cufflinks paper and understand I think I know now how the calculate the FPKM basically.

ADD REPLY

Login before adding your answer.

Traffic: 2804 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6