Question

TCGA/GDC FPKM vs FPKM-UQ

8.4 years ago

igor 13k

GDC provides RNA-seq quantification in multiple forms:

For mRNA-Seq data, the GDC generates gene level and exon level quantification in Fragments Per Kilobase of transcript per Million mapped reads (FPKM). To facilitate cross-sample comparison and differential expression analysis, the GDC also provides Upper Quartile normalized FPKM (UQ-FPKM) values and raw mapping count.

Source: https://gdc.cancer.gov/about-data/data-harmonization-and-generation/genomic-data-harmonization/high-level-data-generation/rna-seq-quantification

I tried downloading both FPKM and FPKM-UQ data for the TCGA-GBM dataset. The distributions of FPKM-UQ values look more comparable across samples than those of the FPKM values, which makes sense.

The sums for each sample of FPKM values range from 200k to 318k, so the highest sample has about 60% more. For FPKM-UQ, the sums range from 4x10^9 to 9x10^9, so the highest sample is more than double the lowest. UQ normalization actually increases that difference. Does that imply that the total number of transcripts is 2x more in some samples compared to others?

gdc rna-seq • 10k views

ADD COMMENT • link updated 6.7 years ago by solo7773 ▴ 90 • written 8.4 years ago by igor 13k

Hi Igor!

Did you ever find a plausible answer to this issue? I know it has been a while, but I am facing the same problem and cannot find any good source on it.

Cheers

ADD REPLY • link 7.7 years ago by marviedemit • 0

score 2 · Answer 1 · 2018-11-07

First of all, let's look at how FPKM and FPKM-UQ are calculated (https://docs.gdc.cancer.gov/Encyclopedia/pages/HTSeq-FPKM-UQ/):

FPKM = [RMg * 10^9 ] / [RMt * L]

RMg: The number of reads mapped to the gene
RMt: The total number of reads mapped to protein-coding sequences in the alignment
L: The length of the gene in base pairs


FPKM-UQ = [RMg * 10^9 ] / [RM75 * L]

RMg: The number of reads mapped to the gene
RM75: The number of reads mapped to the 75th percentile gene in the alignment
L: The length of the gene in base pairs
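The two formulas above can be sketched in plain Python. This is only an illustration, not the GDC pipeline: the function names, the nearest-rank percentile rule, and the toy counts and lengths are all my own assumptions, and every gene passed in is assumed to be protein-coding.

```python
import math

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (an assumed rule; GDC may compute RM75 differently)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def fpkm(counts, lengths):
    """FPKM = RMg * 10^9 / (RMt * L), with RMt taken as the sum of all counts."""
    rm_t = sum(counts)
    return [c * 1e9 / (rm_t * l) for c, l in zip(counts, lengths)]

def fpkm_uq(counts, lengths):
    """FPKM-UQ = RMg * 10^9 / (RM75 * L); fails if RM75 is zero."""
    rm_75 = percentile_nearest_rank(counts, 75)
    return [c * 1e9 / (rm_75 * l) for c, l in zip(counts, lengths)]

# Hypothetical sample: read counts and gene lengths (bp) for four genes.
counts = [10, 20, 30, 40]
lengths = [1000, 2000, 1000, 4000]

fpkm_values = fpkm(counts, lengths)
uq_values = fpkm_uq(counts, lengths)

# Each gene's FPKM / FPKM-UQ ratio is RM75 / RMt, the same for every gene.
ratios = [f / u for f, u in zip(fpkm_values, uq_values)]
```

In this toy sample RMt is 100 and RM75 is 30, so every entry of `ratios` comes out at 0.3 up to floating-point rounding.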

Here we can see that the only difference is the divisor: RMt for FPKM and RM75 for FPKM-UQ. In reply to gabriel.rosser: the constant factor is 10^9 in both formulas, so it has not changed. Within each sample, every gene is divided by the same per-sample constant (RMt in the FPKM matrix, RM75 in the FPKM-UQ matrix). Dividing the FPKM matrix by the FPKM-UQ matrix therefore gives RM75 / RMt for every gene in a sample, so the values in each column of the quotient matrix are all the same, which is consistent with what gabriel.rosser observed.

To igor's question: RM75 can be much smaller than RMt, because RM75 is the read count of the 75th percentile gene within a sample, whereas RMt is a total over all protein-coding reads. Imagine a numerical vector of length 100 in which the first 75 elements are 1 and elements 76 to 100 are 1000000. Applied to our case, RM75 (the 75th-ranked value) is 1, while RMt (the sum of all 100 values) is about 25 million. As a result, FPKM and FPKM-UQ can be dramatically different. The ratio RMt / RM75 also varies from sample to sample, so when the genes of one sample are divided by a small RM75 and the genes of another sample are divided by a large RM75, the per-sample sums differ in the way you have seen.
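To make the toy vector concrete, here is a rough Python sketch with two made-up samples. Everything in it is hypothetical: the counts, the equal gene length of 1000 bp, and the nearest-rank rule for RM75. It only illustrates that the per-sample sum of FPKM-UQ scales with RMt / RM75, not with the amount of RNA.

```python
import math

def rm75(counts):
    """75th percentile by nearest rank (an assumed definition)."""
    ordered = sorted(counts)
    return ordered[math.ceil(0.75 * len(ordered)) - 1]

def column_sums(counts, length_bp=1000):
    """Return (sum of FPKM, sum of FPKM-UQ) for one sample, all genes the same length."""
    rm_t = sum(counts)
    q = rm75(counts)
    fpkm_sum = sum(c * 1e9 / (rm_t * length_bp) for c in counts)
    uq_sum = sum(c * 1e9 / (q * length_bp) for c in counts)
    return fpkm_sum, uq_sum

# Two hypothetical samples that differ only in their low-count genes.
sample_a = [1] * 75 + [1_000_000] * 25
sample_b = [100] * 75 + [1_000_000] * 25

a_fpkm, a_uq = column_sums(sample_a)
b_fpkm, b_uq = column_sums(sample_b)

print("FPKM sums:   ", a_fpkm, b_fpkm)
print("FPKM-UQ sums:", a_uq, b_uq)
print("FPKM-UQ sum ratio A/B:", a_uq / b_uq)
```

With equal gene lengths the FPKM sums are essentially identical for both samples, while the FPKM-UQ sums differ by roughly a factor of 100, purely because RM75 is 1 in one sample and 100 in the other.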

score 1 · Answer 2 · 2018-10-17

This blog post and this discussion are helpful when considering the difference. Summarising the former:

To compute FPKM (or RPKM) from raw counts, first divide by the total read count, then by a constant factor, then by gene size. Typically, the total read count is just the sum of all the reads. However, in the FPKM-UQ data, the total read count is estimated as the 75th percentile read count. This will be a lot smaller than the sum of the reads and more robust to outliers(?)

Given the factor of 10^6 difference, I also suspect they've changed the constant factor.

Having said that, I can't reproduce the results, because the FPKM values come from a different pipeline to the HT-Seq raw counts - so performing the aforementioned steps on the counts data does not reproduce the FPKM values.

However, dividing the FPKM matrix by the FPKM-UQ matrix returns values that are constant down the columns (i.e. a single value per sample for all genes). This is consistent with my explanation.
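The column-wise check described above could look something like this in Python. The matrices here are tiny made-up examples (rows are genes, columns are samples), not real GDC values, and the helper names are my own.

```python
import math

def column_ratios(fpkm, fpkm_uq):
    """Per sample (column), collect FPKM / FPKM-UQ over genes with nonzero FPKM-UQ."""
    n_samples = len(fpkm[0])
    ratios = [[] for _ in range(n_samples)]
    for row_f, row_u in zip(fpkm, fpkm_uq):
        for j, (f, u) in enumerate(zip(row_f, row_u)):
            if u != 0:  # skip unexpressed genes to avoid dividing by zero
                ratios[j].append(f / u)
    return ratios

def is_constant(values, rel_tol=1e-6):
    """True if all values agree with the first one within a relative tolerance."""
    return all(math.isclose(v, values[0], rel_tol=rel_tol) for v in values)

# Hypothetical matrices: 3 genes x 2 samples.
fpkm = [[2.0, 3.0],
        [4.0, 9.0],
        [0.0, 6.0]]
fpkm_uq = [[20000.0, 60000.0],
           [40000.0, 180000.0],
           [0.0, 120000.0]]

ratios = column_ratios(fpkm, fpkm_uq)
constant_per_sample = [is_constant(r) for r in ratios]
```

If the values really are constant down each column, that single per-sample value is the ratio of the two divisors used for that sample.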