Can I use TCGA FPKM-UQ values directly to compare across samples without any preprocessing?
7.1 years ago

Dear all,

This is a newbie question :)

I'm building a linear model to identify significant predictors of mutation count/types in tumours from TCGA. I want to include the expression levels of a couple of genes, but I am quite new to RNA-Seq analysis and best practices. TCGA provides RNA-Seq data at the gene level in three formats: HTSeq-counts, FPKM and FPKM-UQ. After reading tutorials and the questions here, and asking around, I have concluded that I can use FPKM-UQ values to compare across samples without any further pre-processing. Is this true? Or would you recommend pre-processing these values before comparing?

Thanks so much, Daniela

RNA-Seq FPKM-UQ TCGA • 10k views

Thanks so much both! I had seen that chart, Kevin, that is why I thought I could use FPKM-UQ directly. But given your advice and the paper Cindy sent over I will use HTSeq-counts and process through DESeq2 before doing any analyses. I will then compare with the results from using FPKM-UQ directly and post the results here when I have them.

Thanks both again! :)

Daniela


Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

If @Kevin's answer is acceptable you can mark it so (green check mark) to provide closure to this thread.

7.1 years ago

Hi Daniela,

I would highly recommend the HTSeq counts, actually, because these are raw counts. The normalisation method that produces FPKM expression levels has come under criticism in recent years and is no longer recommended by some sources. The main issue with the FPKM method is that it performs no cross-sample normalisation; using it is akin to comparing multiple batches without doing any correction for batch.

Use HTSeq counts and load these into DESeq2 or edgeR for downstream analyses.

I have recently analysed an entire TCGA RNAseq dataset (>500 samples) and I used HTSeq counts. They work very well.

Kevin

------------------------------

Update May 2, 2018:

The GDC, which hosts the TCGA data, states that "To facilitate cross-sample comparison and differential expression analysis, the GDC also provides Upper Quartile normalized FPKM (UQ-FPKM) values and raw mapping count." - https://gdc.cancer.gov/about-data/data-harmonization-and-generation/genomic-data-harmonization/high-level-data-generation/rna-seq-quantification

My original advice still stands, i.e., it is better to obtain the raw HTSeq counts (where available) and re-process those using a more recent normalisation method, like TMM (edgeR) or median-of-ratios, which is based on per-gene geometric means (DESeq2). Some TCGA datasets are only available as RSEM counts, which can also be used and imported into DESeq2 via tximport.
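To give a feel for what the DESeq2-style normalisation does, here is a minimal Python sketch of the median-of-ratios idea. It is an illustration only, not DESeq2's actual code: the function name and the toy counts are made up for this example, and in practice you would let DESeq2 or edgeR do this for you.

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """Return one size factor per sample for a genes x samples matrix of raw counts.

    Hypothetical helper for illustration; it mirrors the median-of-ratios idea only.
    """
    counts = np.asarray(counts, dtype=float)
    # Keep genes with a non-zero count in every sample so that the logs are finite.
    expressed = np.all(counts > 0, axis=1)
    log_counts = np.log(counts[expressed])
    # The per-gene log geometric mean across samples acts as a pseudo-reference sample.
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Size factor = median, over genes, of the ratio of each sample to the reference.
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Toy data (made up): sample 2 is sequenced twice as deeply as sample 1.
toy_counts = np.array([[10, 20],
                       [100, 200],
                       [50, 100],
                       [0, 3]])
size_factors = median_of_ratios_size_factors(toy_counts)
normalised = toy_counts / size_factors  # divide each sample (column) by its size factor
```

After dividing by the size factors, the three genes that differ only by sequencing depth have the same normalised value in both samples, which is the cross-sample comparability that plain FPKM does not give you.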


I agree with Kevin. A 2012 paper compared the state-of-the-art normalization techniques and concluded that RPKM/FPKM should not be used and that the DESeq and TMM methods worked best. Have a look at the paper: https://academic.oup.com/bib/article/14/6/671/189645/A-comprehensive-evaluation-of-normalization

Best,

Cindy


Dear Kevin, thanks so much for your answer! This is really helpful. I have one remaining question that I would be grateful if you could help me with: I thought FPKM-UQ was a modification of the FPKM normalisation precisely to allow cross-sample comparison. Is this not the case then?


Hi Daniela,

Yes, that is correct, and there are some other posts on Biostars about this topic, like: Differences between FPKM and FPKM-UQ files in gene expression analysis
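To make the difference concrete, here is a rough Python sketch of the two formulas as I understand them from the GDC documentation linked elsewhere in this thread. Treat the exact denominators as an assumption (the pipeline may have changed since), and note that the function names and toy numbers are made up; the sketch also pretends that every gene in the toy vector is protein-coding.

```python
import numpy as np

def fpkm(counts, lengths_bp):
    """FPKM for one sample: scale by the total protein-coding read count and gene length."""
    counts = np.asarray(counts, dtype=float)
    lengths_bp = np.asarray(lengths_bp, dtype=float)
    total = counts.sum()  # assumption: all genes in this toy vector are protein-coding
    return counts * 1e9 / (total * lengths_bp)

def fpkm_uq(counts, lengths_bp):
    """FPKM-UQ for one sample: same idea, but scale by the 75th percentile count instead."""
    counts = np.asarray(counts, dtype=float)
    lengths_bp = np.asarray(lengths_bp, dtype=float)
    upper_quartile = np.percentile(counts, 75)
    return counts * 1e9 / (upper_quartile * lengths_bp)

# Toy sample (made up): four genes with raw counts and gene lengths in base pairs.
toy_counts = [100, 200, 300, 400]
toy_lengths = [1000, 2000, 1000, 4000]
fpkm_values = fpkm(toy_counts, toy_lengths)
fpkm_uq_values = fpkm_uq(toy_counts, toy_lengths)
```

The only difference is the per-sample scaling term: the upper quartile is less sensitive than the total to a handful of very highly expressed genes, which is why FPKM-UQ is considered more suitable for cross-sample comparison than plain FPKM.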

My suggestion to use HTSeq raw counts is based on a few things:

  • by using raw counts, you have more control over the analysis (FPKM and FPKM-UQ should not be used with common differential expression analysis tools like DESeq2, edgeR, and limma, which expect raw counts). If you used FPKM, you would limit the number of tools/programs that you could use for downstream analyses
  • by not using anything derived from FPKM, you save yourself criticism that would undoubtedly come whenever you tried to publish your work (which could go so far as you having to re-analyse all data depending on the reviewers' comments and the journal involved)
  • by using raw counts, you will have a better opportunity to pick up new skills.

One golden rule in data analysis and bioinformatics is to always aim to get the data in its rawest possible form, so that you have the most control over how to analyse it. :)

All of this being said, if, at your institute, there are already defined pipelines for analysing FPKM-UQ data, then this may prove the best 'political' option for you.

I also find this very simple flow-chart quite useful in relation to your question:

[Image: gene_expression_quantification_pipeline]

[source: https://gdc.cancer.gov/about-data/data-harmonization-and-generation/genomic-data-harmonization/high-level-data-generation/rna-seq-quantification]


"by not using anything derived from FPKM, you save yourself criticism that would undoubtedly come whenever you tried to publish your work"

That's a little extreme. There are a lot of FPKM-based papers in prestigious journals. It's not ideal, but criticism is unlikely.

Of course, it really depends on exactly what you are doing with these FPKMs.


I agree with you, but only if the reviewers and journal editors are not up to speed with data analysis normalisation methods, which is probably going to be true for clinically-focused journals where the bioinformatics methods may not even be mentioned or may only appear in the supplementary.

It has been stated in published literature and from various sources that FPKM/RPKM is not ideal. It produces unreliable statistics from differential expression analysis.


An update (6th October 2018):

You should abandon RPKM / FPKM. They are not ideal where cross-sample differential expression analysis is your aim; indeed, they render samples incomparable via differential expression analysis:

Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis

"The Total Count and RPKM [FPKM] normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis."

Also, by Harold Pimentel: What the FPKM? A review of RNA-Seq expression units

"The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one."
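A tiny made-up example of what "relative measurement" means in practice: in the Python sketch below, genes A to D have identical counts in both samples and only gene E is strongly induced in sample 2, yet total-count scaling makes A to D look down-regulated. The numbers are hypothetical, and counts per million is used for simplicity; FPKM additionally divides by gene length, which cancels out when the same gene is compared across samples.

```python
import numpy as np

# Hypothetical toy counts for genes A, B, C, D, E in two samples.
# A-D are unchanged between the samples; only E is strongly induced in sample 2.
sample_1 = np.array([100, 100, 100, 100, 100], dtype=float)
sample_2 = np.array([100, 100, 100, 100, 1600], dtype=float)

def counts_per_million(counts):
    """Scale one sample by its own total count (no between-sample normalisation)."""
    return counts * 1e6 / counts.sum()

cpm_1 = counts_per_million(sample_1)
cpm_2 = counts_per_million(sample_2)

# Apparent fold change of each gene, sample 2 relative to sample 1.
fold_change = cpm_2 / cpm_1
```

Here the unchanged genes appear 4-fold lower in sample 2 and gene E appears only 4-fold higher instead of 16-fold, purely because gene E takes up a larger share of the total in sample 2.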
