Hello,
I am assembling bacterial genomes (~6 Mb) using 250 bp paired-end MiSeq data. I have tried a number of assemblers (idba_ud, mira, ray, SOAPdenovo and ABySS, to name a few), but I am getting reasonably good results with good old velvet (~360 contigs, N50 = 40 kb). I have a question about how to set the velvetg parameter -max_coverage.
Its value has a large effect on the resulting number of contigs and on the total number of bases in the assembly (i.e. the assembled genome size). Am I correct in thinking that many of these high-coverage nodes are errors (or at least error-prone, such as repeat elements) and should be excluded for a better assembly?
I estimate the coverage distribution (in R using plotrix) from the stats.txt file after a preliminary run: velvetg velvet_big_127 -cov_cutoff auto -exp_cov auto. It is then easy to calculate the weighted mean coverage for -exp_cov and to set a reasonable value for -cov_cutoff, but there is often a long tail in the distribution, meaning that there is a small number of nodes with very high coverage.
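To make the calculation concrete, here is a minimal Python sketch of it (I do this in R, so this is only an illustration). It assumes stats.txt is tab-delimited with the column names lgth and short1_cov given in the Velvet manual, and a single short-read library. The function names are hypothetical, and the threshold passed to length_fraction_above is just a candidate value to inspect the tail, not a recommended setting.

```python
import csv
import math


def read_nodes(lines):
    """Yield (length, coverage) for each usable node in a Velvet stats.txt.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Assumes the columns 'lgth' and 'short1_cov' are present.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        try:
            length = int(row["lgth"])
            cov = float(row["short1_cov"])
        except (TypeError, ValueError):
            continue  # skip truncated rows or non-numeric fields
        if length > 0 and math.isfinite(cov):
            yield length, cov  # drop zero-length nodes and Inf/NaN coverage


def weighted_mean_coverage(nodes):
    """Mean node coverage, weighted by node length."""
    total_length = sum(length for length, _ in nodes)
    if total_length == 0:
        raise ValueError("no usable nodes")
    return sum(length * cov for length, cov in nodes) / total_length


def length_fraction_above(nodes, threshold):
    """Fraction of total node length in nodes with coverage above threshold."""
    total_length = sum(length for length, _ in nodes)
    if total_length == 0:
        raise ValueError("no usable nodes")
    above = sum(length for length, cov in nodes if cov > threshold)
    return above / total_length
```

For example, with nodes = list(read_nodes(open("velvet_big_127/stats.txt"))), weighted_mean_coverage(nodes) gives the weighted mean, and length_fraction_above(nodes, t) shows how much of the total node length sits in the tail above a candidate threshold t.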
Generally, what is a good way to determine a sensible value for -max_coverage?
Many thanks! Reuben