Abyss genome assembly input
9.5 years ago

In the past when I've run Abyss, I've given it multiple sequencing libraries as multiple input fastq.gz files (~36 separate files).

In this current run, I've concatenated all my sequencing libraries into two huge fastq.gz files (one for each mate of the pair). The main reason is that I performed digital normalization, and its output was two huge concatenated files.

The issue now is that Abyss is taking a lot longer to assemble these normalized reads, even though they are only about 1/3 of the original size.

From reading the abyss logs, I noticed in the past there was parallel reading of multiple input fastq.gz files. Is it possible that since I just have two huge files now, I am not taking advantage of this parallel reading? And a significant amount of time is being used just to read in files?

In my past runs with multiple inputs, Abyss would finish reading in the reads in 6-8 hours (I determine this by checking when the pruning started to get logged). It has been almost a full day now with the current run and it still hasn't finished reading in the reads.

I am running Abyss on an ad hoc AWS cluster (starcluster) with 20 nodes. In past runs with multiple input files, I've used abyss 1.52. In this current run with two huge files, I am using abyss 1.9. Could it be a difference between versions also?

9.5 years ago
Shaun Jackman ▴ 420

From reading the abyss logs, I noticed in the past there was parallel reading of multiple input fastq.gz files. Is it possible that since I just have two huge files now, I am not taking advantage of this parallel reading? And a significant amount of time is being used just to read in files?

Yes. This issue is on our radar. Feel free to open a feature request issue on GitHub.

In past runs with multiple input files, I've used abyss 1.52. In this current run with two huge files, I am using abyss 1.9. Could it be a difference between versions also?

No.

Cheers,
Shaun


Just some extra info. I performed another assembly today at a different k-length, but this time I split the single big input into 30 parts (60 fastqs in total). I also did not gzip the fastqs. This gave me a significant speed increase.

As a comparison, the first run at k=70 with a single pair of files (2 x 40 GB fastq.gz) took ~30 hours for the coverage.hist to be generated. The second run at k=120 with the input split into 30 parts (60 fastqs, un-gzipped) took 1.5 hours for the coverage.hist to be generated. Both were done on a 20 node AWS cluster.
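
For anyone wanting to try the same thing, here is a rough sketch of how the split could be done in Python. It is not the script used for the runs above. It assumes plain 4-line FASTQ records and that both mate files list reads in the same order. The function names and the `partNNN_1.fastq` / `partNNN_2.fastq` naming are made up for illustration. Pairs are dealt out round-robin so that mates always land in the same numbered part, and the output is written uncompressed.

```python
import gzip
import itertools
from contextlib import ExitStack
from pathlib import Path


def open_text(path):
    """Open a FASTQ file as text, transparently handling .gz input."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def fastq_records(handle):
    """Yield FASTQ records as strings, assuming exactly 4 lines per record."""
    while True:
        lines = list(itertools.islice(handle, 4))
        if not lines:
            return
        if len(lines) != 4:
            raise ValueError("truncated FASTQ record")
        yield "".join(lines)


def split_pairs(r1_path, r2_path, n_parts, out_dir):
    """Split two mate files round-robin into n_parts uncompressed pairs.

    Returns a list of (mate1_part, mate2_part) paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: part000_1.fastq, part000_2.fastq, ...
    parts = [
        (out_dir / f"part{i:03d}_1.fastq", out_dir / f"part{i:03d}_2.fastq")
        for i in range(n_parts)
    ]
    with ExitStack() as stack:
        r1 = stack.enter_context(open_text(r1_path))
        r2 = stack.enter_context(open_text(r2_path))
        outs = [
            (stack.enter_context(open(p1, "w")),
             stack.enter_context(open(p2, "w")))
            for p1, p2 in parts
        ]
        pairs = itertools.zip_longest(fastq_records(r1), fastq_records(r2))
        for i, (rec1, rec2) in enumerate(pairs):
            # zip_longest pads the shorter file with None, which exposes
            # mate files that have drifted out of sync.
            if rec1 is None or rec2 is None:
                raise ValueError("mate files have different record counts")
            out1, out2 = outs[i % n_parts]
            out1.write(rec1)
            out2.write(rec2)
    return parts
```

Something like `split_pairs("reads_1.fastq.gz", "reads_2.fastq.gz", 30, "split")` would give 30 pairs (60 files), which can then all be listed as inputs to the assembly. Round-robin is just one choice here; contiguous chunks would work too, but they require knowing the record count up front.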
