choice of k-mer size for metagenomic assembly
qiyunzhu · 9.6 years ago

Dear community,

I am doing de novo assembly of some metagenomic datasets: Illumina NextSeq reads (paired-end, 150 bp per read). I have tried IDBA-UD and SPAdes so far. Both gave me a final N50 of a few thousand bases, which is not too bad but still below my expectation.

I noticed that I can manually set the k-mer sizes used in each iteration. In SPAdes, the recommended k-mer sizes are 21, 33, 55, and 77, and in IDBA-UD the default is 20 to 100 with an increment of 20. I changed IDBA-UD's maximum k-mer size from 100 to 240, and the final N50 is significantly higher. Below is a plot of the metrics per iteration (x-axis):

[plot: N50 and total length at each iteration]
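
For reference, the invocations look roughly like the sketch below (file names are placeholders; note that IDBA-UD's stock build caps k at 124, so larger values like 240 require recompiling with a bigger kMaxShortSequence in src/sequence/short_sequence.h):

    # SPAdes with an explicit k-mer list
    spades.py -1 reads_R1.fastq -2 reads_R2.fastq -k 21,33,55,77 -o spades_out

    # IDBA-UD takes a single interleaved FASTA; fq2fa ships with it
    fq2fa --merge reads_R1.fastq reads_R2.fastq reads.fa
    idba_ud -r reads.fa --mink 20 --maxk 240 --step 20 -o idba_out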

My questions are:

1) I feel that a larger maximum k-mer size performs better than smaller ones, since the N50 grows almost linearly without notably compromising total length. Am I right?

2) Based on the figure, is there any improvement I can possibly make (e.g., further increasing the max k-mer size, or decreasing the increment)?

3) What other parameters do you suggest I play with?

Thanks, and have a great day.

== update ==================

Here is another plot, of the distribution of resulting contig sizes at different maximum k-mer sizes with IDBA-UD. It looks to me like the performance is indeed 240 > 180 > 120, because the whole curve shifts right without changing shape much. Am I right?

[plot: contig size distributions at max k = 120, 180, 240]
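
In case anyone wants to reproduce the comparison, here is a minimal sketch of computing N50 from a contig FASTA (assumes standard awk and sort; contigs.fa is a placeholder):

    # Print each contig's length, longest first
    awk '/^>/ {if (len) print len; len = 0; next} {len += length($0)}
         END {if (len) print len}' contigs.fa | sort -rn > lens.txt

    # N50 = length of the contig where the running sum first reaches half the total
    total=$(awk '{s += $1} END {print s}' lens.txt)
    awk -v t="$total" '{c += $1; if (c >= t/2) {print "N50:", $1; exit}}' lens.txt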

Tags: assembly, genome, k-mer
5heikki · 9.6 years ago

Your N50 is growing with the increasing max k-mer setting mostly because reads shorter than the max k-mer are not included in the scaffold file produced by IDBA-UD (i.e., reads that could not be assembled into any contig). IMO the best max k-mer size for IDBA-UD is the maximum read length after trimming (or 2x that if the reads overlap and you merge them prior to assembly). In my opinion, an N50 larger than 1 kbp is good enough in metagenomics.
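
For example, merging the overlapping pairs beforehand could look roughly like this (a sketch assuming BBMerge from BBTools; file names are placeholders):

    # Merge overlapping pairs; unmerged pairs are written separately
    bbmerge.sh in1=reads_R1.fastq in2=reads_R2.fastq \
        out=merged.fastq outu=unmerged_R1.fastq outu2=unmerged_R2.fastq

The merged reads then behave like longer single-end reads, which is what makes a max k near 2x the read length reasonable.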

Reply:

Thanks! They are paired-end reads and are supposed to overlap (though not too many of them actually overlap well). I am trying to get something larger. What do you think I can do after assembly? E.g., binning the contigs by taxonomy and re-assembling within bins, using some more aggressive algorithms. Do you think that is reasonable? Thanks.

Reply:

I recently recovered multiple (about 30) near-complete prokaryote and virus genomes from a complex metagenomic assembly (~1.3 kbp N50) with MaxBin (paper), which bins contigs on the basis of coverage and tetranucleotide usage. Before this I used the ESOM approach, but MaxBin appears to do the job much better (no wonder, since there's the additional coverage information) and is completely automated (unlike ESOM, which needs an awful lot of parameters as user input and forces you to manually select bins on the basis of U-matrix visualization). I haven't gotten around to completing these bins yet, but I will likely attempt it with either PRICE or maybe IDBA-Hybrid. I would be interested in hearing what others think of optimal k-mer settings, though. I think the general consensus is that you just need to try many different ones to find the best one for your data.
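
For reference, a MaxBin run on an existing assembly looks roughly like this (a sketch; file names are placeholders, and the reads are only used to estimate per-contig coverage):

    # Bin contigs by coverage and tetranucleotide frequency
    run_MaxBin.pl -contig contigs.fa -reads reads.fastq -out maxbin_out -thread 8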

