I am running Cuffdiff on our server, as follows:
./cuffdiff -o diff_out -b Bowtie2Index/genome.fa -p 8 --library-type fr-firststrand -L control,LS -u genes.gtf \
./result/CTR/accepted_hits.bam ./result/LS/accepted_hits.bam
where the two BAM files are the outputs of two TopHat2 runs on hg19 (using the GENCODE annotation) with paired-end reads.
In “Calculating preliminary abundance estimates”, Cuffdiff processed 34053 loci.
At the step “Testing for differential expression and regulation in locus”, Cuffdiff became extremely slow: after 24 hours, it had only progressed from:
Processing Locus chr1:33772366-33896653 [ ] 1%
to:
Processing Locus chr1:108614103-108617141 [* ] 4%
Is this normal?
Is there a way to speed up this process?
I’d greatly appreciate any ideas and suggestions.
Thank you very much!
Thank you so much for your advice!
I looked into the new Cufflinks 2.2.0 workflow (http://cole-trapnell-lab.github.io/cufflinks/manual/), which says: “Cuffquant allows you to compute the gene and transcript expression profiles and save these profiles to files that you can analyze later with Cuffdiff or Cuffnorm. This can help you distribute your computational load over a cluster.”
And, in the Cufflinks 2.2.0 Release Notes, it says: “Cuffquant quantifies gene and transcript expression levels for a single BAM file. These levels are stored in a new binary file type, the CXB file… Because expression levels for each sample are quantified by Cuffquant, Cuffdiff doesn't have to perform this step, which speeds up Cuffdiff runs substantially and lowers their memory footprints.”
I am slightly confused about how running “Cuffquant + Cuffdiff” speeds up the process. Is the total processing time of “Cuffquant + Cuffdiff” significantly shorter than running Cuffdiff (with BAM inputs) alone? Or does the new workflow mean distributing the computational load over a cluster?
Cuffquant provides pre-calculation of gene expression levels for each sample. I can see this saves time for multiple Cuffdiff runs, since multiple Cuffdiff runs don’t have to re-calculate gene expression levels for the same sample.
The problem I am having right now is that a single Cuffdiff run is extremely slow at the step “Testing for differential expression” (4% progress per day). So does running "Cuffquant + Cuffdiff" also speed up the process for a single Cuffdiff run?
Thank you very much for your help!
Hi - What they mean by "distribute your computational load over a cluster" is that you can run cuffquant on each file individually and then use the resulting abundances.cxb files downstream, rather than estimating the abundances for all the files (and then doing the differential analysis) in a single run of cuffdiff. In my experience this not only lessens the computational load but also saves significant time. Running cuffquant + cuffdiff splits the "cuffdiff with BAM files" step into two smaller, more manageable steps.
Thank you very much for your further explanation!
So you mean, for a single Cuffdiff run with 2 BAM files, the total process time of “Cuffquant with BAM #1 + Cuffquant with BAM #2 + Cuffdiff” is significantly shorter than “Cuffdiff with 2 BAM files”, right?
When you say “trying to estimate the abundances for all the files”, do you mean the step of “Calculating preliminary abundance estimates” or something else?
In my case, the step of “Calculating preliminary abundance estimates” took just 19 min; but the step of “Testing for differential expression and regulation in locus” progressed only 4% after 24 hours. So I’m wondering if “trying to estimate the abundances for all the files” is actually a part of the step of “Testing for differential expression and regulation in locus”?
Thank you very much!
It is only shorter if you run Cuffquant for Bam1 and Bam2 in parallel (as separate jobs). It also depends entirely on the size of your BAM files (and whether they are comparable to each other, etc.). But if the abundance estimation hardly took any time, then maybe skshare's answer is right that learning the bias parameters is your choke point. Cuffquant will also take time at that step, but since you can run the files in parallel it should still save you time compared with running cuffdiff directly. I don't have the log of a successfully completed cuffdiff run to check whether there are any steps other than quantifying abundances, learning the bias parameters, and then testing for differential expression.
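For reference, the split workflow could look roughly like the sketch below. It reuses the paths and options from the original command; the output directory names (quant_CTR, quant_LS), the choice of 4 threads per cuffquant job, and the DRY_RUN switch are my own assumptions for illustration. As far as I understand the manual, bias correction (-b) and multi-read correction (-u) are applied at the cuffquant stage in this workflow, so I have put them there, but please check that against the documentation for your version.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Paths taken from the original cuffdiff command; adjust to your layout.
GTF=genes.gtf
GENOME=Bowtie2Index/genome.fa
DRY_RUN=${DRY_RUN:-1}   # assumption: 1 = only print the commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# quantify <sample>: one cuffquant run per BAM file, writing
# quant_<sample>/abundances.cxb (directory name is hypothetical).
quantify() {
  run cuffquant -o "quant_$1" -p 4 -b "$GENOME" -u \
      --library-type fr-firststrand "$GTF" "result/$1/accepted_hits.bam"
}

# compare: cuffdiff on the two CXB files instead of the BAM files.
compare() {
  run cuffdiff -o diff_out -p 8 -L control,LS "$GTF" \
      quant_CTR/abundances.cxb quant_LS/abundances.cxb
}

# Step 1: quantify both samples as separate background jobs.
quantify CTR &
quantify LS &
wait

# Step 2: differential expression testing on the precomputed abundances.
compare
```

On a cluster, each quantify call would normally be submitted as its own scheduler job rather than as a background process on one machine. Run with DRY_RUN=0 once the printed commands look right.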
Thank you very much for your reply!
In fact, the step of “Learning bias parameters” only took 6 min in my run.
It’s the step of “Testing for differential expression and regulation in locus” that is extremely slow. Three days have now passed, and this step is only 10% complete. It’s doing something like: Processing Locus ……………………
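To put the reported progress in perspective, here is a rough extrapolation. It assumes progress is linear in time, which is my simplification (loci can differ a lot in how long they take), and eta_days is just a hypothetical helper name.

```shell
# eta_days <elapsed_days> <percent_done>
# Projected total run time in days, assuming progress is linear in time.
eta_days() {
  awk -v d="$1" -v p="$2" 'BEGIN { printf "%.0f\n", d * 100 / p }'
}

eta_days 1 4    # 4% after 24 hours  -> 25
eta_days 3 10   # 10% after 3 days   -> 30
```

Either way, the projection is on the order of a month for this one step.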
From my log, it looks like there are only “Calculating preliminary abundance estimates” and “Learning bias parameters” before “Testing for differential expression and regulation in locus”.
Thanks a lot!
I am curious now. Why don't you run cuffquant + cuffdiff and tell us how long it takes?
O.K., I'll give it a try. I need to upgrade the Cufflinks package first, since our current version is old and does not include cuffquant.
Thanks a lot for the help!