Hi Biostar,
This is not a question, just a small blog post. While developing a new version of Minia, I was wondering how assemblers filter out short contigs from their results. I thought maybe someone else could benefit from these observations.
SPAdes 3.5:
- Return all contigs of length at least k+1, except:
- short (<= max(read length, 150 bp)) isolated contigs (those not connected to any other contig)
isolated ones with coverage at most 2actually that filter is disabled by default
(source: simplification.info and graph_simplification.hpp)
Velvet:
- Return all contigs of length at least 2*k
Minia v1 and v2: same as Velvet.
BTW not sure if this should be a Blog or Forum post. Or just not posted here :)
Interesting. Was your sequencing coverage slightly above 80x? It's curious that your coverages spread out that much between 20-100 with only a small peak at around 80.
2k is a special value: the length of each path in a one-nucleotide mutation bubble (SNP or sequencing error) is k nodes (if start and end nodes aren't included). The sequence length of each path is 2k-1. So for each bubble not removed by the assembler (for any reason), there is going to be a contig of length 2k-1 (or 2k, or 2k+1, depending on how the assembler includes branching nodes in contigs).
(source: https://raw.githubusercontent.com/redayounsi/wiki/master/bub2.png)