Cufflinks/Cuffdiff giving incorrect statistics between ON/OFF genes
0
0
Entering edit mode
9.4 years ago

Hello all,

I've been using the Tuxedo suite to calculate differential gene expression for Drosophila RNAseq samples I have been analyzing. I've noticed that Cufflinks/Cuffdiff seems to give incorrect statistics between ON and OFF genes.

My pipeline for analyzing differential gene expression was: 1) Align reads with TopHat, 2) Cufflinks to estimate gene/isoform abundance, 3) Merge cufflinks files with Cuffmerge, and 4) Perform differential gene expression test with Cuffdiff (earlier version of cufflinks package).

For both cufflinks and cuffdiff I used the options -frag-bias-correct and -multi-read-correct to improve the accuracy of the abundance of the expressed genes.

Here is an example of how this pipeline seems to miscalculating the expression between ON and OFF genes:

test_id       gene_id       gene      locus                     sample_1   sample_2   status   value_1    value_2   log2(fold_change)   test_stat   p_value    q_value      significant
XLOC_000812   XLOC_000812   CR44143   chr2L:11721285-11722098   WT         mut        OK       0.721719   0         #NAME?              nan         5.00E-05   0.00148681   yes
XLOC_000933   XLOC_000933   CG31813   chr2L:13736139-13736792   WT         mut        OK       1.83633    0         #NAME?              nan         5.00E-05   0.00148681   yes
XLOC_001662   XLOC_001662   CG43165   chr2L:3694462-3694938     WT         mut        OK       1.07611    0         #NAME?              nan         5.00E-05   0.00148681   yes
XLOC_010147   XLOC_010147   CG14244   chr3R:22551862-22552246   WT         mut        OK       0.843113   0         #NAME?              nan         0.0055     0.0576938    no
XLOC_007888   XLOC_007888   CG43391   chr3L:11573928-11574119   WT         mut        OK       2.8848     0         #NAME?              nan         0.01215    0.101753     no
XLOC_001158   XLOC_001158   Arr1      chr2L:18078268-18081187   WT         mut        HIDATA   0          7600.17   0                   0           1          1            no
XLOC_011439   XLOC_011439   Rh3       chr3R:15906781-15908172   WT         mut        HIDATA   2034.74    0         0                   0           1          1            no

Why am I seeing this discrepancy in the differential gene expression between ON and OFF genes. Especially for the genes that get a status called HIDATA.

How can I resolve this discrepancy and improve the calculating of differential gene expression for these ON and OFF genes.

Thank you

Rodrigo

Cufflinks RNA-Seq Cuffdiff • 2.8k views
ADD COMMENT
0
Entering edit mode

Not really sure what you mean by ON/OFF genes, or, in your example, which gene(s) are "ON" and which gene(s) are "OFF"? Do you have p-values for the fold changes you found?

ADD REPLY
0
Entering edit mode

Hi Joe, If you scroll to the right in the table you will see that the p_value and q_value for the listed example genes.

ADD REPLY
0
Entering edit mode

Sorry, didn't see the scroll bar at first.

What does the raw file say? #NAME? is an excel problem. I'm assuming you have NaN values there, when there's no expression of a gene in one condition, the fold change is +/-infinity.

A few other people seem to have had this problem:

You could play around with the settings a bit more or check your reads to make sure you're not underestimating even a small amount of expression in the "OFF" cases. You could also just add a small psuedocount to all of the FPKM values. X/0 = infinity, so X/0.000...1 -> infinity, you'll just have massive fold changes instead of infinity.

A non-bioinformatics answer would be to double check these genes with qPCR.

ADD REPLY

Login before adding your answer.

Traffic: 1771 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6