We're sequencing some big insect genomes (>2 Gbp) and, as the first data comes out, I'm trying to find a way to tackle them.
MaSuRCA crashed on just one lane of unfiltered sequences (about 2 x 140 million paired-end reads), so I started looking for alternatives, since for some of our insects we have 10 lanes (paired-end and mate-pair).
One of the suggestions was to split our reads into smaller subsets and assemble each subset separately, then assemble the resulting assemblies, and so on, until we get to the final assembly. One problem I see with this approach, however, is that you may end up merging contigs from different assemblies that have very different sequencing coverage (hence different copy number).
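To make the splitting step concrete, here is a minimal sketch of what I have in mind. It assumes uncompressed FASTQ files with four lines per record and R1/R2 files in the same order; the function names and output file names are just placeholders I made up, not part of MaSuRCA or any other tool. Reads are dealt out round-robin so that each subset gets roughly the same number of pairs and mates stay together.

```python
import itertools
from pathlib import Path


def read_fastq(handle):
    """Yield FASTQ records as tuples of 4 lines (assumes 4-line records)."""
    while True:
        record = tuple(itertools.islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def split_paired_fastq(r1_path, r2_path, n_subsets, out_dir):
    """Deal read pairs round-robin into n_subsets pairs of FASTQ files.

    Returns the number of read pairs written to each subset.
    Output names (subset0_R1.fastq, ...) are hypothetical placeholders.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outs = [
        (open(out_dir / f"subset{i}_R1.fastq", "w"),
         open(out_dir / f"subset{i}_R2.fastq", "w"))
        for i in range(n_subsets)
    ]
    counts = [0] * n_subsets
    try:
        with open(r1_path) as r1, open(r2_path) as r2:
            pairs = itertools.zip_longest(read_fastq(r1), read_fastq(r2))
            for idx, (rec1, rec2) in enumerate(pairs):
                if rec1 is None or rec2 is None:
                    raise ValueError("R1 and R2 have different numbers of reads")
                k = idx % n_subsets  # round-robin keeps subsets evenly sized
                outs[k][0].writelines(rec1)
                outs[k][1].writelines(rec2)
                counts[k] += 1
    finally:
        for f1, f2 in outs:
            f1.close()
            f2.close()
    return counts
```

Something like `split_paired_fastq("lane1_R1.fastq", "lane1_R2.fastq", 4, "subsets")` (file names made up) would give four subsets of roughly equal size. That only makes the input coverage per subset similar; it does not by itself address my worry about merging contigs afterwards.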
What are your thoughts on this approach, and more generally, what assembly strategy should I follow? I know that some plant genomes are a lot bigger than our insects, so maybe there is already a solution!
Also, the machine I ran MaSuRCA on had 256 GB of RAM, which I think is not small; maybe I can find a machine with 512 GB, but definitely not more than that. So please keep that in mind when suggesting solutions!
Lastly, I saw that there's a very similar question, but it was asked more than a year ago, so some things may have changed since then...
Thanks!