Hello all,
I've been using the Tuxedo suite to calculate differential gene expression for Drosophila RNAseq samples I have been analyzing. I've noticed that Cufflinks/Cuffdiff seems to give incorrect statistics between ON and OFF genes.
My pipeline for analyzing differential gene expression was: 1) Align reads with TopHat, 2) Cufflinks to estimate gene/isoform abundance, 3) Merge cufflinks files with Cuffmerge, and 4) Perform differential gene expression test with Cuffdiff (earlier version of cufflinks package).
For both cufflinks and cuffdiff I used the options -frag-bias-correct and -multi-read-correct to improve the accuracy of the abundance of the expressed genes.
Here is an example of how this pipeline seems to miscalculating the expression between ON and OFF genes:
test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 log2(fold_change) test_stat p_value q_value significant
XLOC_000812 XLOC_000812 CR44143 chr2L:11721285-11722098 WT mut OK 0.721719 0 #NAME? nan 5.00E-05 0.00148681 yes
XLOC_000933 XLOC_000933 CG31813 chr2L:13736139-13736792 WT mut OK 1.83633 0 #NAME? nan 5.00E-05 0.00148681 yes
XLOC_001662 XLOC_001662 CG43165 chr2L:3694462-3694938 WT mut OK 1.07611 0 #NAME? nan 5.00E-05 0.00148681 yes
XLOC_010147 XLOC_010147 CG14244 chr3R:22551862-22552246 WT mut OK 0.843113 0 #NAME? nan 0.0055 0.0576938 no
XLOC_007888 XLOC_007888 CG43391 chr3L:11573928-11574119 WT mut OK 2.8848 0 #NAME? nan 0.01215 0.101753 no
XLOC_001158 XLOC_001158 Arr1 chr2L:18078268-18081187 WT mut HIDATA 0 7600.17 0 0 1 1 no
XLOC_011439 XLOC_011439 Rh3 chr3R:15906781-15908172 WT mut HIDATA 2034.74 0 0 0 1 1 no
Why am I seeing this discrepancy in the differential gene expression between ON and OFF genes. Especially for the genes that get a status called HIDATA
.
How can I resolve this discrepancy and improve the calculating of differential gene expression for these ON and OFF genes.
Thank you
Rodrigo
Not really sure what you mean by ON/OFF genes, or, in your example, which gene(s) are "ON" and which gene(s) are "OFF"? Do you have p-values for the fold changes you found?
Hi Joe, If you scroll to the right in the table you will see that the p_value and q_value for the listed example genes.
Sorry, didn't see the scroll bar at first.
What does the raw file say?
#NAME?
is an excel problem. I'm assuming you have NaN values there, when there's no expression of a gene in one condition, the fold change is +/-infinity.A few other people seem to have had this problem:
You could play around with the settings a bit more or check your reads to make sure you're not underestimating even a small amount of expression in the "OFF" cases. You could also just add a small psuedocount to all of the FPKM values. X/0 = infinity, so X/0.000...1 -> infinity, you'll just have massive fold changes instead of infinity.
A non-bioinformatics answer would be to double check these genes with qPCR.