Question

Cuffdiff takes an extremely long time at "Testing for differential expression and regulation in locus"

1

8.4 years ago

tunl ▴ 90

I am running Cuffdiff on our server, as follows:

./cuffdiff -o diff_out -b Bowtie2Index/genome.fa -p 8 --library-type fr-firststrand -L control,LS -u genes.gtf \
./result/CTR/accepted_hits.bam ./result/LS/accepted_hits.bam

where the two BAM files are the outputs from two Tophat2 runs on hg19 (using GENCODE annotation) with paired-end reads.

In “Calculating preliminary abundance estimates”, Cuffdiff processed 34053 loci.

At the step “Testing for differential expression and regulation in locus”, Cuffdiff became extremely slow: after 24 hours, it just progressed from:

Processing Locus chr1:33772366-33896653 [ ] 1%

to:

Processing Locus chr1:108614103-108617141 [* ] 4%

Is this normal?

Is there a way to speed up this process?

I’d greatly appreciate any ideas and suggestions.

Thank you very much!

RNA-Seq Cuffdiff • 5.6k views

ADD COMMENT • link updated 8.4 years ago by Satyajeet Khare ★ 1.6k • written 8.4 years ago by tunl ▴ 90

score 1 · Answer 1 · 2016-07-19

1

8.4 years ago

aditi.qamra ▴ 270

Try running cuffquant to generate an abundances.cxb file for each sample, then pass the .cxb files to cuffdiff instead of the BAM files to speed up the process.
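A rough sketch of this two-step workflow, reusing the paths and options from the question. The quant_CTR and quant_LS output directories and the CUFF_RUN dry-run switch are hypothetical names chosen for illustration, and passing the same -b and -u options to both steps is an assumption rather than something confirmed in this thread. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: quantify each sample with cuffquant, then run cuffdiff on the CXB files.
# File paths and labels come from the question; the quant_* directories are hypothetical.
set -eu

GTF=genes.gtf
GENOME=Bowtie2Index/genome.fa
# Options shared by both tools (left unquoted below on purpose, so they split into words).
COMMON_OPTS="-p 8 --library-type fr-firststrand -u -b $GENOME"

# Hypothetical dry-run switch: commands are printed unless CUFF_RUN=1.
run() {
  if [ "${CUFF_RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

pipeline() {
  for sample in CTR LS; do
    # cuffquant writes abundances.cxb into its output directory.
    run cuffquant -o "quant_$sample" $COMMON_OPTS "$GTF" "./result/$sample/accepted_hits.bam"
  done
  # cuffdiff accepts the CXB files in place of the BAM files.
  run cuffdiff -o diff_out $COMMON_OPTS -L control,LS "$GTF" \
    quant_CTR/abundances.cxb quant_LS/abundances.cxb
}

pipeline
```

Running it with CUFF_RUN=1 executes the three commands instead of printing them.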

ADD COMMENT • link 8.4 years ago by aditi.qamra ▴ 270

0

Thank you so much for your advice!

I looked into the new Cufflinks 2.2.0 workflow (http://cole-trapnell-lab.github.io/cufflinks/manual/). It says: “Cuffquant allows you to compute the gene and transcript expression profiles and save these profiles to files that you can analyze later with Cuffdiff or Cuffnorm. This can help you distribute your computational load over a cluster.”

And, in the Cufflinks 2.2.0 Release Notes, it says: “Cuffquant quantifies gene and transcript expression levels for a single BAM file. These levels are stored in a new binary file type, the CXB file… Because expression levels for each sample are quantified by Cuffquant, Cuffdiff doesn't have to perform this step, which speeds up Cuffdiff runs substantially and lowers their memory footprints.”

I am slightly confused about how running “Cuffquant + Cuffdiff” speeds up the process. Is the total process time of “Cuffquant + Cuffdiff” significantly shorter than running Cuffdiff (with BAM inputs) alone? Or does the new workflow mean distributing the computational load over a cluster?

Cuffquant provides pre-calculation of gene expression levels for each sample. I can see this saves time for multiple Cuffdiff runs, since multiple Cuffdiff runs don’t have to re-calculate gene expression levels for the same sample.

The problem I am having right now is that a single Cuffdiff run is extremely slow at the step “Testing for differential expression” (4% progress per day). So does running "Cuffquant + Cuffdiff" also speed up the process for a single Cuffdiff run?

Thank you very much for your help!

ADD REPLY • link 8.4 years ago by tunl ▴ 90

1

Hi. What they mean by "distribute your computational load over a cluster" is that you can run cuffquant on each file individually and then use the resulting abundances.cxb files downstream, rather than estimating the abundances for all the files (and then doing the differential analysis) in a single run of cuffdiff. Not only does this lessen the computational load, it also saves significant time in my experience. Running cuffquant + cuffdiff splits the cuffdiff-with-BAM-files step into two smaller, more manageable steps.

ADD REPLY • link 8.4 years ago by aditi.qamra ▴ 270

0

Thank you very much for your further explanation!

So you mean, for a single Cuffdiff run with 2 BAM files, the total process time of “Cuffquant with BAM #1 + Cuffquant with BAM #2 + Cuffdiff” is significantly shorter than “Cuffdiff with 2 BAM files”, right?

When you say “trying to estimate the abundances for all the files”, do you mean the step of “Calculating preliminary abundance estimates” or something else?

In my case, the step of “Calculating preliminary abundance estimates” took just 19 min; but the step of “Testing for differential expression and regulation in locus” progressed only 4% after 24 hours. So I’m wondering if “trying to estimate the abundances for all the files” is actually a part of the step of “Testing for differential expression and regulation in locus”?

Thank you very much!

ADD REPLY • link 8.4 years ago by tunl ▴ 90

1

It is only shorter if you run Cuffquant for Bam1 and Bam2 in parallel (as separate jobs). It also depends entirely on the size of your BAM files (and on whether they are comparable to each other, etc.). But if the abundance estimation hardly took any time, then maybe skshare's answer is right that learning the bias parameters is your choke point. Cuffquant will also take time at that step, but since you can run the files in parallel it should still save you time compared with running cuffdiff directly. I don't have the log of a successfully completed cuffdiff run to check whether there is any additional step besides quantifying abundances, learning the bias parameters, and then testing for differential expression.
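One way to run the two cuffquant jobs in parallel on a single machine is sketched below, assuming no cluster scheduler is available. The quant_* directory names, the CUFF_RUN dry-run switch, and the choice of 4 threads per job (so that two jobs together use the 8 threads from the question) are all assumptions. Without CUFF_RUN=1 the jobs only print their commands.

```shell
#!/bin/sh
# Sketch: run one cuffquant job per sample in the background, then wait for both.
set -eu

GTF=genes.gtf

quantify() {  # $1 = sample directory under ./result
  # Hypothetical dry-run switch: prefix the command with echo unless CUFF_RUN=1.
  if [ "${CUFF_RUN:-0}" = "1" ]; then runner=""; else runner="echo"; fi
  # 4 threads per job is an assumption, chosen so two jobs total the original -p 8.
  $runner cuffquant -o "quant_$1" -p 4 --library-type fr-firststrand -u \
    "$GTF" "./result/$1/accepted_hits.bam"
}

pids=""
for sample in CTR LS; do
  quantify "$sample" &      # start the job in the background
  pids="$pids $!"           # remember its process ID
done

# Wait on each job individually so a failed cuffquant run aborts the script (set -e).
for pid in $pids; do
  wait "$pid"
done
echo "all cuffquant jobs finished"
```

On a cluster, each call to quantify would instead be submitted as its own job through whatever scheduler is in use.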

ADD REPLY • link 8.4 years ago by aditi.qamra ▴ 270

0

Thank you very much for your reply!

In fact, the step of “Learning bias parameters” only took 6 min in my run.

It’s the step of “Testing for differential expression and regulation in locus” that is extremely slow. Three days have now passed, and this step has completed only 10%. It’s doing something like: Processing Locus ……………………
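To put that rate in perspective, here is a back-of-the-envelope projection. It assumes (perhaps wrongly) that the progress percentage advances linearly with time; the variable names are made up for this sketch.

```shell
#!/bin/sh
# Sketch: linear extrapolation of the total run time from the progress reported so far.
days_elapsed=3
percent_done=10

total_days=$(( days_elapsed * 100 / percent_done ))
remaining_days=$(( total_days - days_elapsed ))

echo "projected total: ${total_days} days, remaining: ${remaining_days} days"
# prints "projected total: 30 days, remaining: 27 days"
```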

From my log, it looks like there are only two steps, “Calculating preliminary abundance estimates” and “Learning bias parameters”, before “Testing for differential expression and regulation in locus”.

Thanks a lot!

ADD REPLY • link 8.4 years ago by tunl ▴ 90

0

I am curious now. Why don't you run cuffquant + cuffdiff and tell us how long it took?

ADD REPLY • link 8.4 years ago by aditi.qamra ▴ 270

0

O.K., I'll give it a try. I need to upgrade the Cufflinks package first, since our current version is old and does not have cuffquant.

Thanks a lot for the help!

ADD REPLY • link 8.4 years ago by tunl ▴ 90

score 1 · Answer 2 · 2016-07-19

1

8.4 years ago

Satyajeet Khare ★ 1.6k

If you do not use the -b option, it will be faster. I think -b is not required for cuffdiff. It's for cufflinks, and it does take time.
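For comparison, a sketch of the command from the question with bias correction made optional. BIAS_CORRECT, opts, and cmd are hypothetical names; the script only prints the resulting command so it can be inspected before being run.

```shell
#!/bin/sh
# Sketch: build the cuffdiff command from the question, with -b switched off by default.
set -eu

BIAS_CORRECT=0   # hypothetical toggle: set to 1 to restore -b Bowtie2Index/genome.fa

opts="-o diff_out -p 8 --library-type fr-firststrand -L control,LS -u"
if [ "$BIAS_CORRECT" = "1" ]; then
  opts="$opts -b Bowtie2Index/genome.fa"
fi

cmd="./cuffdiff $opts genes.gtf ./result/CTR/accepted_hits.bam ./result/LS/accepted_hits.bam"
echo "$cmd"
```

Timing the run both ways on the same data would show how much of the slowdown is actually due to -b.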

ADD COMMENT • link 8.4 years ago by Satyajeet Khare ★ 1.6k

0

Thank you very much for your advice!

Both Cuffdiff and Cuffquant have the -b option, and the manual says “it can significantly improve accuracy of transcript abundance estimates.”

So I’m wondering what the impact on the results may be if the -b option is not used.

Some people have said online that when they use the -b option, Cuffquant also runs forever in their case, but when they remove it, they get results quickly.

Thank you very much for your help!

ADD REPLY • link 8.4 years ago by tunl ▴ 90

1

Hi,

This reference tried cufflinks with and without the -b option and found that using -b makes the analysis much slower without a detectable improvement in results. It also mentions that -b is for cufflinks. Even the online manual says the following: "Providing Cufflinks with a multifasta file ... ". So I guessed that it's for cufflinks rather than cuffdiff.

ADD REPLY • link 8.4 years ago by Satyajeet Khare ★ 1.6k