I have a dataset of n=3 treated vs. control RNA-seq samples. I used HISAT2 for alignment, followed by StringTie to compute FPKMs, and then prepDE.py to convert the FPKMs to count data for differential expression (DE) analysis with DESeq2.
When I plot dendrograms and heatmaps using hclust() with Ward linkage, the results differ depending on whether the input is rlog counts or log2(FPKM).
You should abandon RPKM / FPKM. They are not ideal when cross-sample differential expression analysis is your aim; indeed, they render samples incomparable in differential expression analysis:
The Total Count and RPKM [FPKM] normalization methods, both of which are
still widely in use, are ineffective and should be definitively
abandoned in the context of differential analysis.
The first thing one should remember is that without between sample
normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.
It makes sense that the dendrograms would be different, for the following two main reasons:
1. FPKM values and the counts produced by DESeq2's normalisation method are different: FPKM values are not sufficiently adjusted for differences in library size across your samples, whereas DESeq2 does account for this difference.
2. log (base 2) is not the same transformation as the regularised log (rlog) transformation of DESeq2.
In addition, looking at your dendrograms, I do not see good separation between control and treated. For your FPKM data, I see a 'flat' and 'structureless' dendrogram. On the other hand, I see a well-structured dendrogram for your rlog data.
Something to keep in mind: when clustering, Euclidean distance should really only be used for normally (Gaussian) / binomially distributed data. If you want to cluster FPKM data, you should probably at least use 1 minus the Pearson/Spearman correlation as the distance.
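For instance, here is a minimal sketch in base R of what I mean, assuming a hypothetical FPKM matrix with genes in rows and samples in columns (the data here are simulated, not yours):

```r
# Minimal sketch: cluster samples from a hypothetical FPKM matrix
# (genes in rows, samples in columns) using 1 - Spearman correlation
# as the distance, rather than Euclidean distance.
set.seed(1)
fpkm <- matrix(rexp(60, rate = 0.1), nrow = 10,
               dimnames = list(paste0("gene", 1:10), paste0("sample", 1:6)))

# Pairwise sample-sample Spearman correlations, converted to a distance
cor.mat  <- cor(fpkm, method = "spearman")
cor.dist <- as.dist(1 - cor.mat)

# Ward's criterion (note: 'ward.D2', not 'ward' / 'ward.D' -- see below)
hc <- hclust(cor.dist, method = "ward.D2")
plot(hc)
```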
Thank you very much for your helpful response. Firstly, I need to apologize: it turns out the previous hclust dendrograms were not generated using the exact same set of data.
When repeated with the correct files, they still look different.
Now I understand from your answer that the inherent normalizations for FPKM and counts may contribute to it. Indeed, the samples don't separate well to start with, so that is perhaps the biggest reason the separations always look noisy when comparing data for >15,000 data points.
I always thought (going by the name FPKM) that it was well normalized for the number of reads, etc.
I was unsure whether, when we use prepDE to convert FPKM to counts, the same normalization applies as when running DESeq2 with its defaults.
May I ask: since rlog-transformed count data from DESeq2 should reflect a negative binomial distribution, is Ward's linkage OK to use with correlation distance?
Thanks for the updates. So, it's the correlation distance that contributes to the 'flat' dendrograms. As I think about it, that makes sense, because correlation distance computed on a small number of samples will not give very reliable values.
In saying that, it makes sense that correlation distance produces the same dendrogram between log2(FPKM) and the rlog counts because correlation, the statistical measure, is independent of the library size difference that I mentioned earlier (if you think about it, correlation just looks for what increases and decreases together). This is why FPKM is 'okay' for things like co-expression analysis, as in network analysis (think of WGCNA), but not differential expression analysis.
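A toy illustration of that point (all numbers here are simulated): doubling one sample's sequencing depth leaves its rank correlation with another sample untouched, but changes their Euclidean distance.

```r
# Toy demonstration: doubling a sample's sequencing depth changes its
# Euclidean distance to another sample, but not their correlation.
set.seed(42)
a  <- rexp(1000, rate = 0.1)           # expression profile of sample A
b  <- a * rlnorm(1000, sdlog = 0.2)    # sample B: similar profile, some noise
b2 <- b * 2                            # sample B re-sequenced at twice the depth

# Ranks are unchanged by scaling, so Spearman correlation is identical:
cor(a, b, method = "spearman") == cor(a, b2, method = "spearman")  # TRUE

# Euclidean distance, by contrast, shifts with the depth change:
dist(rbind(a, b)) == dist(rbind(a, b2))                            # FALSE
```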
I don't know much about the prepDE function of StringTie, so, I don't know how it re-produces counts suitable for DESeq2. My preference is always to get raw counts and use those for DESeq2. If I need to use HISAT2 / StringTie, it's only to generate a custom GTF file, which I then re-use on my FASTQs or BAMs for the purpose of deriving raw counts suitable for DESeq2.
The rlog data will actually mostly reflect a normal (Gaussian) distribution. However, it is produced from the normalised counts, which are modelled on a negative binomial distribution. In this sense, Euclidean distance and Ward's linkage are fine for rlog data. For clustering log2(FPKM), correlation distance (as Spearman) would perhaps be more reliable.
PS - make sure that you use ward.D2 and not ward or ward.D in the hclust() function. There's an issue with ward:
Two different algorithms are found in the literature for Ward
clustering. The one used by option "ward.D" (equivalent to the only
Ward option "ward" in R versions <= 3.0.3) does not implement Ward's
(1963) clustering criterion, whereas option "ward.D2" implements that
criterion (Murtagh and Legendre 2014).
PS - why do you have 15,000 data points? Are you plotting all transcripts that passed QC? Did you do any filtering for differentially expressed transcripts? If you plot the entire dataset in clustering, you will rarely see 'natural' segregation between your groups.
"In saying that, it makes sense that correlation distance produces the same dendrogram between log2FPKM and the rlog counts because correlation, the statistical measure, is 'mostly' independent of the library size difference that I mentioned earlier (if you think about it, correlation just looks for what increases and decreases together). This is why FPKM is fine for things like co-expression analysis, as in network analysis (think of WGCNA), but not differential expression analysis."
1. I guess the dendrograms are not 100% identical between FPKM and counts when Ward is used, but identical when Spearman is used. I am still surprised that FPKMs are not considered well normalized for library size. I thought that was the whole point of 'fragments per kilobase per million'. I'd be grateful if you could share more on that.
Thanks for letting me know about the issues with Ward. We will make the switch to ward.D2.
Yes, 15,000 data points because I am plotting all genes. The unbiased clustering is just to see how the data plays out to begin with. Indeed, there is more similarity across samples than the smaller changes between groups, so we don't see natural segregation. When I filter for DE genes, the heatmaps and clustering look very pretty, as one would expect.
Hi Nancy - FPKM values come from an earlier form of normalisation for RNA-seq data, from when sequencing was only being performed on one or a few samples. It is now considered inappropriate for differential expression comparisons across multiple samples. Here is the primary reason why:
In NGS, samples are always sequenced to different depths of coverage, even if they are sequenced to a specific target depth. Factors within the sequencer itself and differences in sample prep mean that one sample will always be sequenced to a higher or lower depth than another. What this translates to is more or fewer reads aligning for one sample than another when we conduct our analyses. This is critical in RNA-seq because we gauge the expression of a particular gene based on the number of reads aligning to it.
With FPKM, there is no adequate adjustment for this fact that I've mentioned above. It means that an expression value of, for example, 1000 in one sample is not equivalent to 1000 in another sample, due to the fact that they would have been sequenced to different depths of coverage. Thus, when conducting differential expression analysis, the samples remain incomparable (because statistical tests look at the mean or median, and other factors influence these tests, too).
Another issue in RNA-seq is that larger genes naturally accumulate more reads, which makes sense; however, it does not follow that such a gene is more highly expressed than, say, a smaller ncRNA. FPKM does adjust for this fact (the 'per kilobase' part).
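As a concrete illustration of the two adjustments FPKM makes (all numbers here are invented for the arithmetic, not from any real dataset):

```r
# Toy illustration of the FPKM formula:
# FPKM = fragments / ( (gene length in kb) * (total mapped fragments in millions) )
fragments    <- 500   # fragments mapped to the gene
gene.len.kb  <- 2     # gene length: 2,000 bp = 2 kb
total.frag.m <- 25    # library size: 25 million mapped fragments

fpkm <- fragments / (gene.len.kb * total.frag.m)
fpkm  # 10
```

Note that the 'per million' term only rescales each sample by its own total, which, as discussed above, is not a sufficient between-sample adjustment for differential expression.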
Methods that do take both of the phenomena mentioned above into account include TMM (implemented in edgeR) and DESeq2's 'median of ratios' method.
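The 'median of ratios' idea can be sketched in a few lines of base R on a toy count matrix (in practice you would just call DESeq2's estimateSizeFactors(); this is only to show the logic):

```r
# Rough base-R sketch of DESeq2's 'median of ratios' size factors
# (toy counts: sample s3 is sequenced to 4x the depth of s1).
counts <- matrix(c(100, 200, 400,
                    50, 100, 200,
                    30,  60, 120), nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:3), paste0("s", 1:3)))

# 1. Per-gene geometric mean across samples (a pseudo-reference sample)
geo.mean <- exp(rowMeans(log(counts)))

# 2. Ratio of each sample's counts to the reference, per gene
ratios <- counts / geo.mean

# 3. Size factor per sample = median of those ratios
size.factors <- apply(ratios, 2, median)   # 0.5, 1, 2

# 4. Normalised counts: divide each column by its size factor
norm.counts <- sweep(counts, 2, size.factors, "/")
```

After step 4, all three samples end up with identical normalised profiles here, because the toy samples differ only in depth, which is exactly the effect being removed.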
Hope that this helps. FPKM, despite the criticism it receives, is still widely used, but it should not be when the goal is to conduct differential expression analysis. If you search online, you will find lots of discussions about it.
Here's another interesting few lines from a very well cited published work back in 2013:
the presence of these high-count genes clearly results in an inflated
false-positive rate for five of the normalization methods (TC, UQ,
Med, Q and RPKM). Only DESeq and TMM are able to control the
false-positive rate while also maintaining the power to detect
differentially expressed genes.
Based on three real mRNA and one miRNA-seq datasets, we confirm
previous observations that RPKM and TC, both of which are still widely
in use [40, 41], are ineffective and should be definitively abandoned
in the context of differential analysis. The RPKM approach was
initially proposed to account for differences in gene length [19];
however, the relationship between gene length and DE actually varies
among the datasets considered here (Supplementary Figures S6–S8). Even
in cases where a strong positive association between gene counts and
length is observed, scaling counts by gene length with RPKM is not
sufficient for removing this bias [16, 20].
Thank you for your detailed reply. It will take me a while to assimilate and understand it in as great detail. Having been on the lab side for several years, managing an NGS core lab, the only thing that strikes me as odd is this part: "In NGS, samples are always sequenced to different depths of coverage... Factors within the sequencer itself and differences in sample prep." I must say that nowadays, if one is preparing standard bulk RNA-seq libraries and running standard Illumina platforms, it is quite easy to accurately measure library concentrations and ensure equal output, down to +/- 2-5 million reads for libraries being sequenced in a batch. I agree that if you are comparing samples from different runs/experiments, etc., it starts getting a little noisy, and in those cases I will now appreciate the accuracy of counts vs. FPKM even if I don't fully understand the math/stats just yet. :-)
On a related note, since you mentioned earlier that you are familiar with DESeq2, may I request your comment on another post I made, where I observed that (in a dataset of n=2) the p-values for data points showing huge variation between replicates are quite low and comparable to other events where the replicates behave well.
This seemed rather counter-intuitive to me; surely the DE estimated in the former case is likely to be random and not statistically significant? I would appreciate your thoughts on this. Thank you.
An update (6th October 2018):
Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis
Also, by Harold Pimentel: What the FPKM? A review of RNA-Seq expression units