Very Discordant Results Produced By Cuffdiff Vs Edger
11.0 years ago
sethugunja ▴ 60

Hi,

I have analysed differential gene expression in patient versus normal conditions using Cuffdiff and edgeR. I want to know why there is such a big difference in the number of differentially expressed genes reported by Cuffdiff and edgeR. Here are the details of the analysis. Our aims are:

  1. to measure differential expression in the total RNA (RZ) abundance of the genes
  2. to measure differential expression in the poly(A) RNA (PA) abundance of the genes

Samples: Patient (3 replicates), Normal1 (2 replicates), Normal2 (2 replicates), sequenced on the Illumina HiSeq 2000 platform. Sequencing was done in two ways (7 samples in total for each):

  1. using total RNA (to measure differential expression in the total RNA abundance of the genes)
  2. using poly(A)-selected RNA (to measure differential expression in the poly(A) RNA abundance of the genes)

Analysis: I got approx. 26 million reads (paired end).

  • I started the analysis by checking the quality of the reads and then mapped them to the human reference genome (GRCh.p11, Ensembl) using TopHat (2.0.8b).
  • I used the BAM file from each replicate to analyse differential expression with Cuffdiff (Cufflinks 2.1.1), treating Normal1 (2 replicates) and Normal2 (2 replicates) as 4 replicates of normal vs 3 replicates of patient.
  • The resulting Cuffdiff output file has 433 genes in RZ and 485 genes in PA that are significantly differentially expressed in normal vs patient (q-value < 0.05).

Then I wanted to evaluate this result using HTSeq and edgeR. For this:

  • I used the same BAM files for HTSeq and tested for differentially expressed genes in normal vs patient.
  • The edgeR results have 1169 genes in RZ and 938 genes in PA that are significantly differentially expressed in normal vs patient (FDR < 0.05).

Comparing these two results, 329 genes in RZ and 374 genes in PA were common.
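To put the overlap in perspective, the counts quoted above can be turned into agreement fractions. This is only a rough sketch in Python: the function name `overlap_summary` and the choice of the Jaccard index are illustrative assumptions, not part of either tool.

```python
def overlap_summary(n_cuffdiff, n_edger, n_common):
    """Summarise agreement between two lists of significant genes."""
    union = n_cuffdiff + n_edger - n_common
    return {
        "frac_of_cuffdiff": n_common / n_cuffdiff,  # share of Cuffdiff calls also made by edgeR
        "frac_of_edger": n_common / n_edger,        # share of edgeR calls also made by Cuffdiff
        "jaccard": n_common / union,                # shared calls / all calls by either tool
    }

# Counts quoted in the post (significant at q-value / FDR < 0.05).
libraries = {
    "RZ": overlap_summary(433, 1169, 329),
    "PA": overlap_summary(485, 938, 374),
}

for name, stats in libraries.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Read this way, roughly three quarters of the Cuffdiff calls in each library are also called by edgeR, so most of the discordance comes from genes that only edgeR reports.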

Could anyone clarify why these two tools behave so differently? Which results should I use for my further studies?

Thanks, Sethu

cuffdiff edger • 7.0k views

From what you wrote, it's a bit ambiguous whether the 3 patient replicates are biological or technical (mostly since you have "Normal1" and "Normal2", which could denote either two normal samples or 2 different comparison groups). Also, what does RZ mean? From context, I can only assume that this represents the total RNA library comparisons.

Finally, why would you ever expect these two different tools to give you very similar results? Cuffdiff has gotten better over time, but the whole idea of it is a bit different than edgeR (or DESeq, though cuffdiff has been using a lot of DESeq methods in the more recent versions).


Thanks for your prompt reply.

  • The 2 replicates from Normal1 and the 2 replicates from Normal2 were treated as 4 biological replicates of one comparison group, and the 3 patient replicates were treated as biological replicates of the other group.
  • RZ here represents total RNA and PA represents poly(A)-selected RNA.
  • "Why would you ever expect these two different tools to give you very similar results?" If I understand correctly, the main difference between these tools is that HTSeq gives read counts whereas Cuffdiff gives FPKM values (which are proportional to read count), so I thought these tools would give similar results.

So, you think the results from these two tools are not comparable? If so, can you please suggest tools for comparing the Cuffdiff and edgeR results, just to confirm the results from one tool?
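For reference, the relationship between a count and an FPKM value can be sketched as below. This is a minimal Python illustration of the standard FPKM definition only; the gene lengths and fragment counts are hypothetical, and it does not reproduce how Cuffdiff estimates fragment counts per transcript.

```python
def fpkm(fragments, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (gene_length_bp * total_mapped_fragments)

# Hypothetical numbers, for illustration only.
library_size = 26_000_000  # assumes every sequenced fragment maps

# Two genes with the same raw count but different lengths.
short_gene = fpkm(fragments=500, gene_length_bp=1_000, total_mapped_fragments=library_size)
long_gene = fpkm(fragments=500, gene_length_bp=4_000, total_mapped_fragments=library_size)

print(round(short_gene, 2), round(long_gene, 2))
```

FPKM is proportional to the count only for a fixed gene length and library size, so identical counts can correspond to different FPKM values. The statistical tests applied on top of each quantity also differ between the tools.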


There's a bigger difference between the two tools. htseq-count uses only uniquely aligned reads. The purpose behind cufflinks is to utilize ambiguously aligned reads as well. Early on, cufflinks gave completely nonsensical results, but it seems to be better now. Also, edgeR will give slightly different results even from DESeq due to differences in the various normalization algorithms and other implementation differences.
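A toy Python sketch of why the treatment of ambiguous reads matters: the read and gene names are hypothetical, and neither function reproduces htseq-count or cufflinks. Splitting a read equally across its targets is a simplifying assumption used here only to contrast with discarding it.

```python
from collections import Counter

# Hypothetical alignments: read ID -> genes the read aligns to.
alignments = {
    "read1": ["GENE_A"],
    "read2": ["GENE_A"],
    "read3": ["GENE_A", "GENE_B"],  # ambiguously aligned
    "read4": ["GENE_B"],
    "read5": ["GENE_B", "GENE_C"],  # ambiguously aligned
}

def count_unique_only(alignments):
    """Count a read only if it aligns to exactly one gene."""
    counts = Counter()
    for genes in alignments.values():
        if len(genes) == 1:
            counts[genes[0]] += 1
    return counts

def count_with_equal_split(alignments):
    """Split each read equally across every gene it aligns to."""
    counts = Counter()
    for genes in alignments.values():
        for gene in genes:
            counts[gene] += 1 / len(genes)
    return counts

unique_counts = count_unique_only(alignments)
split_counts = count_with_equal_split(alignments)

for gene in ("GENE_A", "GENE_B", "GENE_C"):
    print(gene, unique_counts[gene], split_counts[gene])
```

In this toy case GENE_C gets no signal at all when ambiguous reads are dropped, and GENE_B's abundance doubles when they are kept, so the two strategies feed different inputs to the statistical test.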


OK then, can you suggest which tool I should stick with for downstream analysis?


Knowing nothing else, I would suggest that edgeR will generally give better results. Presumably you're going to do some validations of these, so just pick a few candidates such that you can test which tool better fits your data.


That's a good one! Thank you.

11.0 years ago

I actually would say edgeR is one of my least favorite algorithms because it can sometimes give funky results that don't make sense. In fact, I compared popular tools for RNA-Seq analysis in a recent blog post:

http://cdwscience.blogspot.com/2013/11/rna-seq-differential-expression.html

That blog / paper also cites these other earlier comparisons:

http://genomebiology.com/2013/14/9/R95

http://www.biomedcentral.com/1471-2164/13/484

It is probably not what you want to hear, but I would probably choose DESeq over edgeR or cuffdiff. However, I agree with dpryan79 that you just need to select a few candidates for validation: for example, you could look at the overlapping genes between these two programs.
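One way to shortlist overlapping genes for validation, sketched in Python. The gene names and adjusted p-values are hypothetical, and `shared_candidates` is an illustrative helper rather than a function from any of these packages; in practice the values would come from the q-value column of the Cuffdiff output and the FDR column of the edgeR results.

```python
# Hypothetical adjusted p-values per gene from each tool.
cuffdiff_q = {"GENE_A": 0.001, "GENE_B": 0.030, "GENE_C": 0.200, "GENE_D": 0.004}
edger_fdr = {"GENE_A": 0.002, "GENE_B": 0.600, "GENE_D": 0.010, "GENE_E": 0.001}

def shared_candidates(first, second, alpha=0.05):
    """Return genes significant in both tools, most consistent first."""
    shared = [
        gene for gene in first
        if gene in second and first[gene] < alpha and second[gene] < alpha
    ]
    # Rank by the weaker of the two adjusted p-values for each gene.
    return sorted(shared, key=lambda gene: max(first[gene], second[gene]))

candidates = shared_candidates(cuffdiff_q, edger_fdr)
print(candidates)
```

Ranking by the weaker of the two values favours genes that both tools support strongly. Genes called by only one tool could be sampled separately to test which tool fits the data better.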


For the record, I generally prefer DESeq2 and limma, but neither of them were mentioned by the OP :o)
