Question

How To Set -Max_Coverage In Velvetg

11.6 years ago

rwn ▴ 610

Hello,

I am assembling bacterial genomes (~6 Mb) using 250 bp paired-end MiSeq data. I have tried a bunch of assemblers (idba_ud, mira, ray, SOAPdenovo, ABySS to name a few...), but am getting reasonably good results using good old velvet (~360 contigs, N50 = 40 kb). I have a question about how to set the velvetg parameter -max_coverage. Its value has a large effect on the resulting number of contigs and the total number of bases in the assembly (i.e. the assembled genome size). Am I correct in thinking that many of these high-coverage nodes are errors (or at least error-prone, like repeat elements etc.) and should be excluded for a better assembly?

I estimate the coverage distribution (in R using plotrix) from the stats.txt file after a preliminary run: velvetg velvet_big_127 -cov_cutoff auto -exp_cov auto. It is then easy to calculate the weighted mean coverage (for -exp_cov) and to set a reasonable value for -cov_cutoff, but there is often a long tail in the distribution, meaning that there is a small number of nodes with very high coverage.
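For readers who would rather not use R, the weighted-mean step could be sketched in Python roughly as follows. This is only an illustration: it assumes stats.txt is tab-separated with `lgth` and `short1_cov` columns, the helper names (`read_nodes`, `weighted_mean_coverage`) are made up here, and the three-node table is invented data.

```python
import csv
import io
import math


def read_nodes(handle, cov_column="short1_cov"):
    """Yield (length, coverage) pairs from a stats.txt-style table.

    Assumes tab-separated columns including 'lgth' and `cov_column`.
    Nodes with zero length or non-finite coverage are skipped.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        length = int(row["lgth"])
        cov = float(row[cov_column])
        if length > 0 and math.isfinite(cov):
            yield length, cov


def weighted_mean_coverage(nodes):
    """Length-weighted mean coverage over (length, coverage) pairs."""
    nodes = list(nodes)
    total_length = sum(length for length, _ in nodes)
    if total_length == 0:
        raise ValueError("no usable nodes")
    return sum(length * cov for length, cov in nodes) / total_length


# Invented example table; a real run would open stats.txt instead.
EXAMPLE = (
    "ID\tlgth\tout\tin\tlong_cov\tshort1_cov\n"
    "1\t1000\t1\t1\t0\t30.0\n"
    "2\t3000\t1\t1\t0\t50.0\n"
    "3\t10\t2\t2\t0\t400.0\n"
)

nodes = list(read_nodes(io.StringIO(EXAMPLE)))
print(round(weighted_mean_coverage(nodes), 2))
```

Weighting by node length keeps a handful of very short, very high-coverage nodes (the long tail) from dominating the estimate.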

Generally, what is a good way to determine a sensible value for -max_coverage?

Many thanks! Reuben

[Image: cov_k131 (coverage distribution plot)]

velvet genomics • 3.0k views

updated 11.2 years ago by Torst ▴ 980 • written 11.6 years ago by rwn ▴ 610

score 2 · Answer 1 · 2013-09-28

You shouldn't usually set -max_coverage at all, unless you know there is "contamination" in your sample at a higher coverage than the true sample you are trying to recover. Then you would use it as a "low pass filter". Another scenario is a plasmid with high copy number relative to your chromosome: you could use -max_coverage to filter out the plasmid, and then use -cov_cutoff etc. to do the opposite and recover the plasmid on its own. But in general, setting -max_coverage will only remove repeat elements from your assembly. Although this may improve metrics like N and N50, the improvement is artificial: what is left over is the same as what was there before, just without the repeated contigs.
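To see how much sequence a given cutoff would actually throw away, one could tabulate the nodes above a candidate threshold before committing to it. A minimal Python sketch, assuming nodes are available as (length, coverage) pairs and that a node is dropped when its coverage is strictly greater than the threshold; the function name `summarize_removed` and the data are hypothetical, and velvetg's exact comparison may differ.

```python
def summarize_removed(nodes, max_coverage):
    """Return (nodes_removed, length_removed, fraction_of_total_length).

    `nodes` is an iterable of (length, coverage) pairs. A node counts as
    removed when coverage > max_coverage (an assumption of this sketch).
    """
    nodes = list(nodes)
    total_length = sum(length for length, _ in nodes)
    removed = [(length, cov) for length, cov in nodes if cov > max_coverage]
    removed_length = sum(length for length, _ in removed)
    fraction = removed_length / total_length if total_length else 0.0
    return len(removed), removed_length, fraction


# Invented nodes: two near the expected coverage, one roughly 2x
# (a possible two-copy repeat), and one tiny node in the long tail.
nodes = [(1000, 30.0), (3000, 50.0), (500, 95.0), (10, 400.0)]

for cutoff in (60, 120):
    count, length, fraction = summarize_removed(nodes, cutoff)
    print(f"max_coverage={cutoff}: {count} nodes, {length} bases, {fraction:.1%} of total")
```

If the nodes above the threshold sit at roughly integer multiples of the expected coverage, they are more likely collapsed repeats than contamination, which is the case where removing them only flatters the metrics.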