How to do FPKM differential analysis?
2
5
Entering edit mode
5.4 years ago

Hello, I want to do diffrential analysis on rnaseq FPKM data (can't get count). DEseq and edgR need raw count as input, so how should I do? Is it suitable for limma package?

RNA-Seq FPKM DEseq limma • 17k views
ADD COMMENT
3
Entering edit mode

Yes you can use limma although It is not as good as if you had counts - so double check you cannot get the counts and try writing the people behind the data to see if they will provide the counts (most people are quite friendly in my experince). If you cannot get the counts you can log2 transform the FPKM values (use a pseudocount of 1) and use the limma trend approach described on page 71 of the limma vignette. For a comparison of limma-trend vs limma-voom take a look at this article.

Hope this helps

ADD REPLY
2
Entering edit mode

Page 71, limma, where do they state you can use rpm/fpkm/tpm for differential expression analysis.

15.4 Differential expression: limma-trend

If the sequencing depth is reasonably consistent across the RNA samples, then the simplest and most robust approach to differential exis to use limma-trend. This approach will usually work well if the ratio of the largest library size to the smallest is not more than about 3-fold. In the limma-trend approach, the counts are converted to logCPM values using edgeR’s cpm function:

> logCPM <- cpm(dge, log=TRUE, prior.count=3)

The prior count is used here to damp down the variances of logarithms of low counts. The logCPM values can then be used in any standard limma pipeline, using the trend=TRUE argument when running eBayes or treat. For example:

> fit <- lmFit(logCPM, design)
> fit <- eBayes(fit, trend=TRUE)
> topTable(fit, coef=ncol(design))

Or, to give more weight to fold-changes in the gene ranking, one might use:

> fit <- lmFit(logCPM, design)
> fit <- treat(fit, lfc=log2(1.2), trend=TRUE)
> topTreat(fit, coef=ncol(design))

15.5 Differential expression: voom

When the library sizes are quite variable between samples, then the voom approach is theoretically more powerful than limma-trend. In this approach, the voom transformation is applied to the normalized and filtered DGEList object:

v <- voom(dge, design, plot=TRUE)

The voom transformation uses the experiment design matrix, and produces an EList object. It is also possible to give a matrix of counts directly to voom without TMM normalization, by

> v <- voom(counts, design, plot=TRUE)

If the data are very noisy, one can apply the same between-array normalization methods as would be used for microarrays, for example:

> v <- voom(counts, design, plot=TRUE, normalize="quantile")

After this, the usual limma pipelines for differential expression can be applied, for example:

> fit <- lmFit(v, design)
> fit <- eBayes(fit)
> topTable(fit, coef=ncol(design))

Or, to give more weight to fold-changes in the ranking, one could use say:

> fit <- treat(fit, lfc=log2(1.2))
> topTreat(fit, coef=ncol(design))
ADD REPLY
1
Entering edit mode

That is wrong. If you read that closely you see they use cpm. That is a critical difference because CPM is the properly-normalized counts without length correction is fine for limma-trend while length-corrected data such as FPKM completely distort the mean-variance trend which is the whole point of methods such as limma which gain power by sharing information across genes. At no point they recommend FPKM for anything.

ADD REPLY
0
Entering edit mode

Thanks a lot. It really helps☺

ADD REPLY
0
Entering edit mode

Hi, You can get read count or No. of mapped reads by using samtools idxstats command, where the 3rd column in the output file is the read count.

The command is :

$ samtools idxstats input.bam > output.txt

(Note: The input bam is the alignment bam file generated by any alignment tool (Hisat2) or assembly tool (Spades) which is coordinate sorted and indexed)

Hope this will help you!!

ADD REPLY
3
Entering edit mode

I think that you may have mis-understood the question. The user does not want total and aligned read counts per sample. They need raw read counts per gene per sample.

ADD REPLY
4
Entering edit mode
5.4 years ago

I agree with Lila - you should not perform differential expression analysis on FPKM expression levels. If you don't believe me, then take it from the developer of limma, where some suggestions are also made: https://support.bioconductor.org/p/56275/#56299

Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis

The Total Count and RPKM [FPKM] normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.

Also, by Harold Pimental: What the FPKM? A review of RNA-Seq expression units

The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.

Remarkably, I still see publications coming out where people are comparing groups of samples based on FPKM counts, even though this makes no sense. As an example, FPKM of 10 in one sample may be the equivalent of 50 in another, due to the way that FPKM counts are produced, i.e., with no cross-library / sample normalisation.

ADD COMMENT
0
Entering edit mode

If you cannot get counts-limma trend it is a viable approach (see the voom article by the limma authors). Please also note that the Bioconductor post you refrence only states you should not use limma-voom on FPKM values. Actually in that exact post Gordon highlights (as option 3) if you only have FPKM you can use limma-trend via a log2 transformation with a pseudocount.

If you have different library sizes you are right they are not comparable (also what the voom article shows) but then you can just do a inter-library normalization of the FPKM counts first (exactly like you are doing for count data).

ADD REPLY
1
Entering edit mode

Yes, but, that is the third option in a list that is ordered "in decreasing order of desirability"

ADD REPLY
2
Entering edit mode
5.4 years ago
Lila M ★ 1.3k

Hi, I do not recommend to use FPKM for DE analysis. Is better to use the normalized counts (vst/vsd) generated by DESeq2. (For more info have a look in here). However, there is a function in DESeq2 called fpkm() that allow you to do that

ADD COMMENT
1
Entering edit mode

Is better to use the normalized counts (vst/vsd) generated by DESeq2.

Better for what? DE?

However, there is a function in DESeq2 called fpkm() that allow you to do that

This is only for calculation of FPKM not for DE.

ADD REPLY
0
Entering edit mode

In most published articles and studies is hardly recommended to use normalized counts rather than fpkm for DE analysis. You do not agree with that?

ADD REPLY
0
Entering edit mode

Absolutely not. Please list the respective articles. FPKM regularily fails in benchmarking studies towards differential analysis. Use the established tools to perform differential analysis. The only true advantage of per-million scaling is that you can later add new samples without that norm. counts of other samples change which is not the case with e.g. TMM from edgeR. This is why databases usually use TPM or RPKM/FPKM to illustrate counts.+

Still, it is not that per-million scaling is completely useless. If library compositions between samples do not dramatically change, a simple FPKM might be sufficient. If I am not wrong cuffdiff used FPKM as the standard normalization method and I've seen datasets where agreement between FPKM values and TMM was pretty good. Still, as there are advanced methods, why not using them. I always start from raw counts as this ensures that I can stand up for the lowlevel pipeline. if you are provided with FPKM you have no idea how they were created.

ADD REPLY
0
Entering edit mode

Hi, If I understand right, you mean I can merge two FPKM data from two batches without considering the batch effect? I have two batch rnaseq FPKM data now (counts not avaliable) and I want to merge them into a whole dataset. I know Deseq can deal with it by treat batch as variable for counts data. So what about FPKM? Thanks

ADD REPLY

Login before adding your answer.

Traffic: 2318 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6