I am trying to assemble a 30-35 Mbp diploid genome using ABySS from Illumina HiSeq runs. Computing an assembly for just one k-mer size with default settings takes a very long time (days) on a machine with 12 CPUs and a large amount of RAM.
Hence my questions:
- what is your experience with abyss-bwa and abyss-bowtie, in terms of both performance and quality of assembly/scaffolding?
- I use NFS-mounted partitions for both data and temp directories, which I guess slows down ABySS. How do I estimate how much local disk space I will need for a local temp directory?
- Compression settings. I have found this Biostar post about pbzip2 and speed. Has anyone done comparisons with gzip/pigz?
- scaffolding off. It seems that ABySS spends the majority of each run on mapping reads to the assembly and on scaffolding. Since I am exploring k-mer space, I want to get contigs, check N50s, compare the assembly with the genomes of related species, then pick a few good-looking k-mers and rerun the assembly with, e.g., differently filtered or base-error-corrected data sets. Can I switch off the whole scaffolding part?
- openmpi & ABySS: are there any requirements, e.g. a minimum amount of RAM per cluster node, to run ABySS without crashing?
Yes, I know there is an ABySS mailing list, but it takes a long time to get an answer from the overworked developer. Trawling through the archives has not given me clear answers so far.
Thanks a lot for your help.
EDIT (partial answers)
ad 1: according to the ABySS author, the default mapper/scaffolder gives better-quality results than abyss-bwa and abyss-bowtie
ad 4: the answer is: "abyss-pe pe-contigs other_switches_go_here"
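For comparing the contigs-only runs across k-mers, here is a minimal sketch of an N50 calculation in Python. It is not part of ABySS (which ships its own abyss-fac tool for assembly statistics); the function names and the `min_length` cutoff are my own assumptions for illustration, and the FASTA parsing is deliberately simple.

```python
def read_fasta_lengths(path):
    """Return the length of every sequence in a FASTA file."""
    lengths = []
    current = 0
    seen_header = False
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if seen_header:
                    lengths.append(current)
                seen_header = True
                current = 0
            elif seen_header:
                current += len(line)
    if seen_header:
        lengths.append(current)
    return lengths


def n50(lengths, min_length=0):
    """Length L such that contigs of length >= L hold at least half of all bases.

    min_length is a hypothetical cutoff for ignoring very short contigs.
    Returns 0 when no contigs pass the cutoff.
    """
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    total = sum(kept)
    running = 0
    for n in kept:
        running += n
        if running * 2 >= total:
            return n
    return 0
```

Calling `n50(read_fasta_lengths(path))` on the contigs file from each k-mer run gives one number per k to compare. Note that N50 depends on the length cutoff used, so the same `min_length` should be applied to every run.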
I think you can figure it out by yourself. To speed things up, the key is to identify the bottleneck. Just run ABySS normally and check "top" every half hour to see which steps take the most time. My guess is that graph construction and simplification take most of the time. As for scaffolding, if you assemble the reads as single-end, I guess scaffolding will be skipped.
I would also give SOAPdenovo and SGA a try.
Days seems excessive - I'd expect hours on my server (24 CPU, 100 GB RAM). Assembly can often be "held up" by a very small number of "rogue reads" which mess up the graph, so you may want to look at some quality filtering to reduce the number of input reads.
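As a rough illustration of the kind of quality filtering meant here, the following Python sketch drops FASTQ reads whose mean base quality is below a threshold. It is only a sketch under assumptions: the function names, the threshold of 20, and the Phred+33 offset are illustrative choices, not recommendations from ABySS, and a dedicated trimming tool will usually do a better job.

```python
def mean_phred(quality_line, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed by default)."""
    if not quality_line:
        return 0.0
    return sum(ord(c) - offset for c in quality_line) / len(quality_line)


def filter_fastq(lines, min_mean_quality=20.0, offset=33):
    """Yield 4-line FASTQ records whose mean quality meets the threshold.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Assumes strict 4-line records (no wrapped sequences).
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            if mean_phred(record[3], offset) >= min_mean_quality:
                yield tuple(record)
            record = []
```

For paired-end data the two files must stay in sync, so a pair should be kept or dropped as a unit; this sketch handles a single file only and would need extending for that.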