I'm currently involved in assembling (Illumina data) a few species with genome sizes of 10-26 Gb. I'm using the ABySS assembler, mainly because of its excellent ability to scale on large compute clusters and, of course, because it has given good results in the past. To choose the k-mer size, I run the pipeline up to the unitig stage with several different k values and then evaluate which k will work best, since running the whole pipeline on all the data for every k is rather unfeasible. I've now started wondering whether this is a valid approach. More specifically: is performance at the unitig level a good proxy for the performance/result of the whole process (i.e. up to the contig or even scaffold level)?
Would I be better off running the whole pipeline but with, for example, only one pair of input files? (I think not, because then coverage, or rather the lack of it, would become an issue.)
Does anybody have an idea or experience with this (or perhaps a comparison of unitig vs. contig/scaffold performance)?
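For reference, the sweep I'm running looks roughly like this (the k values, library paths and np setting are just placeholders for my actual setup):

```
# Run abyss-pe up to the unitig stage only (abyss-pe is Makefile-driven,
# so naming the "unitigs" target stops it before contigging/scaffolding).
for k in 31 41 51 61 71; do
    mkdir -p k$k
    (cd k$k && abyss-pe k=$k name=asm np=64 \
        in='../reads/lib1_1.fq.gz ../reads/lib1_2.fq.gz' unitigs)
done

# Compare contiguity of the unitig-level assemblies across k.
abyss-fac k*/asm-unitigs.fa
```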
You might try Preqc (https://github.com/jts/sga/wiki/preqc) for choosing k, but I don't know whether it will be too slow given the size of your data. It generates reports that can be useful for picking a k.
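If you do give it a go, the preqc workflow is roughly the following (thread counts and file names are placeholders; the wiki has the details):

```
# Clean/interleave the paired-end reads, build the FM-index, run preqc,
# then turn the output into a PDF report.
sga preprocess --pe-mode 1 lib1_1.fq.gz lib1_2.fq.gz > lib1.fq
sga index -a ropebwt --no-reverse -t 16 lib1.fq
sga preqc -t 16 lib1.fq > lib1.preqc
sga-preqc-report.py lib1.preqc
```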
Which organisms' genomes are you trying to assemble?
I'd use an assembler, e.g. SPAdes, that builds graphs with multiple k-mer sizes and then consolidates the assemblies generated at those k values into longer contigs and scaffolds.
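Something along these lines (the k list and resource limits are only examples; if you omit -k, SPAdes chooses k values based on read length):

```
# Multi-k assembly: SPAdes builds a graph at each k and combines the results
# into a single set of contigs/scaffolds.
spades.py \
    -1 lib1_1.fq.gz -2 lib1_2.fq.gz \
    -k 21,33,55,77 \
    -t 32 -m 500 \
    -o spades_multi_k
```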
A number of conifer species, as well as a few invertebrates.
From experience, I don't see SPAdes handling these kinds of datasets within reasonable time/resource requirements. Or am I mistaken?
I would be interested to know, though, as I also have PacBio data to throw in, so that would be a plus for SPAdes (compared to ABySS).
I have never assembled a plant genome with SPAdes, but it is possible to subsample the reads, perform a hybrid assembly with SPAdes or IDBA-UD, and then extend the assembly with tools like SSPACE, which works with long as well as short reads.
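A rough sketch of what I mean, assuming seqtk for the subsampling (file names and the sampling fraction are placeholders):

```
# Subsample the Illumina reads; using the same seed (-s) keeps the pairs in sync.
seqtk sample -s100 lib1_1.fq.gz 0.25 > sub_1.fq
seqtk sample -s100 lib1_2.fq.gz 0.25 > sub_2.fq

# Hybrid assembly of the subsampled short reads plus the PacBio reads.
spades.py -1 sub_1.fq -2 sub_2.fq --pacbio pacbio_reads.fq.gz \
    -t 32 -m 500 -o hybrid_out

# hybrid_out/contigs.fasta (or scaffolds.fasta) can then be fed to
# SSPACE / SSPACE-LongRead together with the remaining reads to extend
# and scaffold the assembly further.
```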