I have been using ABySS for many years and it seems like it is taking longer to read in the sequencing files and produce the first coverage.hist file. Or perhaps I am just getting bigger projects. My current project is 160 Gbase (250bp reads) of a mammalian genome (thus ~3 Gbase or ~50x coverage). I am using a fairly new Intel-based 20-CPU, 256 GB machine but also have access to older and slower machines. I am running ABySS v.1.9 with the v=-vv option (for logging purposes). Other parameters include s=202 n=10 l=40 plus k=XXX.
Extrapolating from the log file, it looks like it will take about 10 days to read in the paired-end files, and then comes all of the processing time it will take to produce scaffolds. A different mammalian project I did a couple of months ago with 100 Gbase of 100bp reads took, as I recall, about 5 to 6 days to read in the files, so the 10-day estimate from the log file appears to be valid.
I can run either using a single R1 and R2 file as input, in which case two CPUs are active while the others sit idle, or I can split up the R1/R2 files into multiple parts and get all CPUs to read in the files. Splitting up the files doesn't seem to help the speed issue very much. The files are on a fast NFS "scratch" space. Tests with a subset of the data on a RAMdisk also don't improve the speed much. Different compile options don't seem to make much difference either.
I usually do parameter sweeps with different k-values; however, it is hard to do so when running ABySS takes weeks. I am getting irritated by this. Anyway, my questions:
- Is a 10-day read time reasonable for 160 Gbase?
- Are there any parameters I can try tweaking in order to speed up the loading process?
From limited timing tests (done via hacking the code) it appears that the time is mostly being used up in adding kmers to the initial graph (seqCollection->add) and not in reading the file itself. Occasionally I will also get spikes in the seqCollection->pumpNetwork() call, which I find strange. However, given my limited C++ hacking skills, those tests are subject to question.
Hi Rick,
No, 10 days to read in 160 Gbp of sequencing data is not at all normal in my experience. I can understand why you would be annoyed. Roughly speaking, I might expect it to take ~ 12 hours to load the data and to be much faster than that if you split up the files.
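To put those two figures side by side, here is a rough back-of-the-envelope sketch in Python. It only uses numbers already mentioned in this thread (160 Gbase, 10 days, ~12 hours); the function name is made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def bases_per_second(total_bases: float, days: float) -> float:
    """Average loading rate implied by reading total_bases in the given number of days."""
    return total_bases / (days * SECONDS_PER_DAY)


# Figures quoted in the thread: 160 Gbase, ~10 days observed, ~12 hours expected.
observed_rate = bases_per_second(160e9, 10.0)
expected_rate = bases_per_second(160e9, 0.5)
slowdown = expected_rate / observed_rate

print(f"observed: {observed_rate / 1e6:.2f} Mbase/s")
print(f"expected: {expected_rate / 1e6:.2f} Mbase/s")
print(f"slowdown: {slowdown:.0f}x")
```

In other words, the observed load is roughly 20 times slower than what I would expect, which is too large a gap to explain by project size alone.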
Notwithstanding your experiment with the RAMFS, I would be highly suspicious of your filesystem. If I were in your position, I would try to find some way to benchmark your file I/O and compare those results to what is expected for a typical high-performance computing cluster.
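As one possible starting point, here is a minimal sketch of a sequential read benchmark in Python. It is not part of ABySS; the function name and the 8 MB chunk size are arbitrary choices for illustration. It simply times how long it takes to read a file from start to finish and reports the throughput in MB/s.

```python
import sys
import time


def measure_read_throughput(path, chunk_size=8 * 1024 * 1024):
    """Read a file sequentially and return (bytes_read, seconds, MB_per_second)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 avoids Python-level buffering; the OS page cache still applies.
    with open(path, "rb", buffering=0) as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    rate = (total / 1e6) / elapsed if elapsed > 0 else float("inf")
    return total, elapsed, rate


if __name__ == "__main__":
    # Usage: python read_bench.py reads_R1.fastq reads_R2.fastq
    for file_path in sys.argv[1:]:
        n_bytes, seconds, mb_per_s = measure_read_throughput(file_path)
        print(f"{file_path}: {n_bytes} bytes in {seconds:.2f} s ({mb_per_s:.1f} MB/s)")
```

Keep in mind that a file read recently may be served from the operating system's page cache rather than from NFS, so the first read of a large file that has not been touched for a while gives the more honest number. Running the same test against the NFS scratch space, the RAMdisk, and a local disk should show whether the filesystem is the bottleneck.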
Besides file splitting, I'm not aware of anything else you could do to speed things up.