I have been trying to assemble a nematode genome de novo for the past 6 months. I have read libraries with three insert sizes: 375 bp, 500 bp and 5 kbp. The average read length for all libraries is 150 bp. So far I have used the Velvet, SOAPdenovo, ALLPATHS-LG and MaSuRCA assemblers with a k-mer length of 75 (chosen after testing multiple k-mer sizes with Velvet). All assemblers except Velvet gave me a very low N50 and low total genome coverage. I therefore took the Velvet assembly forward to improve it and fill the gaps: I used SSPACE to scaffold the Velvet contigs, then ran FinisherSC with the 375 bp, 500 bp and 5 kbp insert library reads (three separate runs) to fill gaps. The following are my statistics from the gnx tool.
Results using 500bp insert library reads:
Total number of sequences: 17587
Total length of sequences: 120498292 bp
Shortest sequence length : 200 bp
Longest sequence length : 301717 bp
Total number of Ns in sequences: 858844
N50: 30511 (1051 sequences) (60270141 bp combined)
Results using 375bp insert library reads:
Total number of sequences: 17776
Total length of sequences: 121255487 bp
Shortest sequence length : 200 bp
Longest sequence length : 301717 bp
Total number of Ns in sequences: 861716
N50: 30683 (1044 sequences) (60637225 bp combined)
The FinisherSC run with the 5 kbp insert library is still in progress. I am in a hurry to move on to annotating this genome and improving it further. My genome size estimate is around 150 Mbp. I am pretty new to this kind of work, so I am writing here to ask whether my assembly is good enough to freeze and proceed to genome annotation and other analyses, or whether there is a way to improve it further before annotation. I would also like to know which de novo gene prediction and annotation methods are commonly used these days.
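As a rough sanity check on the numbers above, here is a minimal Python sketch that computes N50 from a list of sequence lengths, and NG50 when an estimated genome size is supplied. The function name `nx_stat` and the toy length list are made up for illustration; only the totals (120498292 bp, 858844 Ns, 150 Mbp estimate) come from the post.

```python
def nx_stat(lengths, fraction=0.5, reference_size=None):
    """Length L such that sequences of length >= L cover `fraction` of the
    assembly total (N50-style) or of `reference_size` (NG50-style)."""
    base = reference_size if reference_size is not None else sum(lengths)
    target = fraction * base
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return None  # the assembly is too short to reach the target


# Toy example, NOT the real assembly
toy_lengths = [10, 8, 6, 4, 2]
toy_n50 = nx_stat(toy_lengths)                        # relative to assembly size
toy_ng50 = nx_stat(toy_lengths, reference_size=40)    # relative to genome estimate

# Figures quoted for the 500 bp run
total_bp = 120_498_292
n_bases = 858_844
estimated_genome = 150_000_000

assembled_fraction = total_bp / estimated_genome  # share of the estimated genome
n_fraction = n_bases / total_bp                   # share of the assembly that is N

print(f"assembled: {assembled_fraction:.1%}, Ns: {n_fraction:.2%}")
```

With the quoted totals, the assembly spans roughly 80% of the estimated genome and well under 1% of it is Ns. NG50 will be lower than the reported N50 because the assembly is shorter than the genome estimate.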
I'm not sure what the bigger objective of assembling this genome is, but I would expect that if you could add some long reads (Nanopore or PacBio) you could definitely improve the assembly and reduce the number of contigs.
For now we cannot get long reads or send the sample for sequencing again. We have to proceed with these data and produce a draft. Our bigger goal is to annotate the genome and look for the specific genes responsible for disease progression in cattle, given that the same worm does not cause any disease when it infects other mammals.
As @WouterDeCoster said, using long reads is a good solution, but since you say that is not possible in your case: did you try GapFiller? It may help you get a better result.
Yes, I used the GapCloser tool for that. Only after running GapCloser did I go ahead and run FinisherSC. The results are above.
I would suggest assembling a list of highly conserved genes from closely related organisms, running blastx against the genome, and seeing how many of them you can find in your draft genome. Nanopore is a good suggestion as well; you should be able to get enough reads for less than $5k.
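Counting how many conserved genes are recovered could be sketched as below. This assumes the search results are saved in BLAST's default 12-column tabular format (`-outfmt 6`), with the conserved genes as queries (column 1) and the E-value in column 11. If the conserved proteins are the queries and the genome is the database, the matching BLAST program would be tblastn rather than blastx. The function name, the gene IDs, the contig names and the 1e-5 cutoff are all illustrative assumptions, not recommendations.

```python
def conserved_genes_found(blast_tab_lines, all_queries, max_evalue=1e-5):
    """Count the queries with at least one hit at or below `max_evalue`.

    `blast_tab_lines` is any iterable of lines in BLAST tabular format
    (qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore). Returns (number found, sorted list of missing IDs).
    """
    wanted = set(all_queries)
    found = set()
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        qseqid, evalue = fields[0], float(fields[10])
        if qseqid in wanted and evalue <= max_evalue:
            found.add(qseqid)
    return len(found), sorted(wanted - found)


# Hypothetical hits for three made-up conserved genes
example_hits = [
    "geneA\tscaffold12\t85.0\t300\t45\t0\t1\t300\t1000\t1900\t1e-50\t200",
    "geneB\tscaffold7\t40.0\t50\t30\t0\t1\t50\t200\t350\t2.0\t25",
    "geneA\tscaffold3\t60.0\t120\t48\t0\t10\t130\t50\t410\t3e-12\t70",
]
n_found, missing = conserved_genes_found(example_hits, ["geneA", "geneB", "geneC"])
print(f"{n_found} of 3 conserved genes found; missing: {missing}")
```

In a real run, the lines would come from the BLAST output file, e.g. `open("hits.tsv")` (a hypothetical file name). A high proportion of missing genes would suggest that the assembly is still incomplete in gene space.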