Question

What Happens When The K-Mer Size Is Larger Than The Trimmed Reads Size In Velvet Assembly?

0

11.7 years ago

Rahul Sharma ▴ 660

Hi all,

I am assembling a 120 Mb genome from 5 libraries with different insert sizes: 300 bp, 1 kb, 8 kb, 20 kb and singletons. The first two libraries are from the Illumina Genome Analyzer (read length: 76 bp) and the last two are from HiSeq (read length: 100 bp). After trimming, the mean read lengths are 55 bp for the GA runs and 87 bp for the HiSeq runs. I want to assemble with Velvet. Would k-mer sizes of 35, 45, 55, 65 and 75 create any issues, given that my trimmed read lengths vary so much? Is it fine to assemble the GA and HiSeq reads together, or should I assemble them separately and merge the assemblies later? I would appreciate any comments.

Regards

velvet • 6.6k views

updated 11.1 years ago by SES 8.6k • written 11.7 years ago by Rahul Sharma ▴ 660

score 1 · Answer 1 · 2013-03-31

I don't know first hand, but I recall people stating that it can't work: a read shorter than k cannot yield any k-mers of that length, so those reads contribute nothing to the graph.

Stated, for example, in a blog post from Homolog.us: http://www.homolog.us/blogs/2012/10/10/multi-kmer-de-bruijn-graphs/

More relevant overall information on k and other parameters can be found in Titus Brown's blog:

http://ivory.idyll.org/blog/the-k-parameter.html

In fact all pages tagged as assembly are worth consulting:

http://ivory.idyll.org/blog/tag/assembly.html
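The arithmetic behind the "k-mers that are long enough" point can be sketched in a few lines of Python. A read of length L contains L - k + 1 k-mers, and none at all when L < k. The helper names below are made up for illustration and are not part of Velvet. The lengths 55 and 87 are the mean trimmed lengths from the question, so individual reads will fall on either side of them.

```python
def kmers_per_read(read_length: int, k: int) -> int:
    """Number of k-mers in one read: L - k + 1, or 0 when the read is shorter than k."""
    return max(0, read_length - k + 1)


def usable_fraction(read_lengths, k: int) -> float:
    """Fraction of reads that are at least k bases long (hypothetical helper)."""
    if not read_lengths:
        return 0.0
    return sum(1 for n in read_lengths if n >= k) / len(read_lengths)


# Candidate k values and mean trimmed read lengths taken from the question.
for k in (35, 45, 55, 65, 75):
    print(f"k={k}: {kmers_per_read(55, k)} k-mers from a 55 bp read, "
          f"{kmers_per_read(87, k)} from an 87 bp read")
```

Under these assumptions a 55 bp read yields a single k-mer at k=55 and nothing at k=65 or k=75, while an 87 bp read still yields k-mers at every value listed. Running `usable_fraction` over the real trimmed read lengths would show how much of each library survives at a given k.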

score 1 · Answer 2 · 2013-10-14

I was curious about this because I use Velvet a lot, so I tested it. There is no explicit warning from velveth, but you can tell that no overlaps were found in a couple of ways. First, look at the Roadmaps file: if you choose a k-mer size larger than your read lengths, the number of roadmaps will equal the number of input sequences. Second, just run velvetg and look at the graph it produces. If it runs rather quickly and ends with something like:

...
[155.488308] EMPTY GRAPH
Final graph has 0 nodes and n50 of 0, max 0, total 0, using 0/20198538 reads

Then you have a clear indication that no overlaps were found for that hash length. Because read lengths vary, I think all the sequences would have to be processed before velveth could warn about this condition. Still, it would probably be helpful to warn about it after the pre-processing stage, or to fall back to a hash length shorter than the reads before working on the Roadmaps.
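Since Velvet does not warn about this itself, one option when looping over several k values is to check the velvetg output automatically. The following is a minimal sketch that assumes the velvetg output was captured as text and that its summary line has the same shape as the one quoted above. Other Velvet versions may print it differently. The function names are hypothetical.

```python
import re

# Pattern modelled on the summary line quoted above; assumed, not taken from Velvet's source.
FINAL_GRAPH_RE = re.compile(
    r"Final graph has (\d+) nodes and n50 of (\d+), max (\d+), total (\d+), "
    r"using (\d+)/(\d+) reads"
)


def parse_final_graph(log_text: str):
    """Return the last 'Final graph' summary in the log as a dict, or None if absent."""
    matches = FINAL_GRAPH_RE.findall(log_text)
    if not matches:
        return None
    keys = ("nodes", "n50", "max", "total", "reads_used", "reads_total")
    return dict(zip(keys, map(int, matches[-1])))


def is_empty_graph(log_text: str) -> bool:
    """True when the summary reports zero nodes, i.e. no overlaps at this hash length."""
    summary = parse_final_graph(log_text)
    return summary is not None and summary["nodes"] == 0


example_log = (
    "[155.488308] EMPTY GRAPH\n"
    "Final graph has 0 nodes and n50 of 0, max 0, total 0, using 0/20198538 reads\n"
)
print(is_empty_graph(example_log))
```

A wrapper script could then skip or flag any k for which `is_empty_graph` returns True, rather than carrying an empty assembly forward.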