I'm currently involved in assembling (Illumina data) a few species with genome sizes of 10-26 Gb. I'm using the ABySS assembler, mainly because of its excellent ability to scale on large compute clusters and, of course, because it has given good results in the past. To choose the k-mer size, I run the pipeline up to the unitig stage with several different k values and then evaluate which k will work best, since running the whole pipeline on all the data for every k is rather unfeasible. I've now started wondering whether this is a valid approach. More specifically: is performance at the unitig level a good proxy for the performance/result of the whole process (i.e. up to the contig or even scaffold level)?
Would I be better off running the whole pipeline but with, for example, only one pair of input files? (I think not, because then coverage, or rather the lack of it, would become an issue.)
Does anybody have an idea or experience with this (or perhaps a comparison of unitig vs. contig/scaffold performance)?
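For reference, the sweep I'm running looks roughly like this (the k values, library paths and np setting are just placeholders for my actual setup):

```
# Run abyss-pe up to the unitig stage only (abyss-pe is Makefile-driven,
# so naming the "unitigs" target stops it before contigging/scaffolding).
for k in 31 41 51 61 71; do
    mkdir -p k$k
    (cd k$k && abyss-pe k=$k name=asm np=64 \
        in='../reads/lib1_1.fq.gz ../reads/lib1_2.fq.gz' unitigs)
done

# Compare contiguity of the unitig-level assemblies across k.
abyss-fac k*/asm-unitigs.fa
```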
You might try Preqc (https://github.com/jts/sga/wiki/preqc) for choosing k, but I don't know whether it will be too slow given the size of your data. It generates reports that can be useful for picking a k.
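If you do give it a go, the preqc workflow is roughly the following (thread counts and file names are placeholders; the wiki has the details):

```
# Clean/interleave the paired-end reads, build the FM-index, run preqc,
# then turn the output into a PDF report.
sga preprocess --pe-mode 1 lib1_1.fq.gz lib1_2.fq.gz > lib1.fq
sga index -a ropebwt --no-reverse -t 16 lib1.fq
sga preqc -t 16 lib1.fq > lib1.preqc
sga-preqc-report.py lib1.preqc
```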
Which organisms' genomes are you trying to assemble?
I'd use an assembler, e.g. SPAdes, that builds graphs with multiple k-mer sizes and then consolidates the assemblies generated at those k values into longer contigs and scaffolds.
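Something along these lines (the k list and resource limits are only examples; if you omit -k, SPAdes chooses k values based on read length):

```
# Multi-k assembly: SPAdes builds a graph at each k and combines the results
# into a single set of contigs/scaffolds.
spades.py \
    -1 lib1_1.fq.gz -2 lib1_2.fq.gz \
    -k 21,33,55,77 \
    -t 32 -m 500 \
    -o spades_multi_k
```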
A number of conifer species, as well as a few invertebrates.
From experience, I don't see SPAdes handling these kinds of datasets within reasonable time/resource requirements. Or am I mistaken?
I would be interested to know, though, as I also have PacBio data to throw in, so that would be a plus for SPAdes (compared to ABySS).
I have never assembled a plant genome with SPAdes, but it is possible to subsample the reads, perform a hybrid assembly with SPAdes or IDBA-UD, and then extend the assembly with tools like SSPACE, which works with long as well as short reads.
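A rough sketch of what I mean, assuming seqtk for the subsampling (file names and the sampling fraction are placeholders):

```
# Subsample the Illumina reads; using the same seed (-s) keeps the pairs in sync.
seqtk sample -s100 lib1_1.fq.gz 0.25 > sub_1.fq
seqtk sample -s100 lib1_2.fq.gz 0.25 > sub_2.fq

# Hybrid assembly of the subsampled short reads plus the PacBio reads.
spades.py -1 sub_1.fq -2 sub_2.fq --pacbio pacbio_reads.fq.gz \
    -t 32 -m 500 -o hybrid_out

# hybrid_out/contigs.fasta (or scaffolds.fasta) can then be fed to
# SSPACE / SSPACE-LongRead together with the remaining reads to extend
# and scaffold the assembly further.
```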