Hi folks. I am trying to assemble a marine fish genome. I have 120 G of reads from two short-insert paired-end libraries and one large-insert mate-pair library. The expected genome size is about 600 Mb, but I got 1200 Mb after gap closing with SOAPdenovo. I then read a paper (The oyster genome reveals stress adaptation and complexity of shell formation) and, searching online, found the phrase "Remove Redundancy From Assembly". Does anyone have an idea how to reduce this kind of error, or any other advice?
Thanks for your suggestions. I tried Platanus for the assembly and got the following: contig.fa is 1013 Mb, and there is an additional 370 Mb in contigBubble.fa.
So if Platanus and SOAPdenovo roughly agree on genome size, then I think your original genome size estimate is way off.
I'd run the k-mer abundance analysis Chris suggested. Here's an online tool which I found easiest to use: http://qb.cshl.edu/genomescope/
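For a quick sanity check before (or alongside) GenomeScope, the genome size can be roughly estimated from a k-mer histogram as total k-mers divided by the depth of the main peak. The sketch below assumes a two-column "depth count" histogram (the kind a k-mer counter typically writes); the function names and the `min_depth` cutoff are my own illustrative choices, and this is not the model GenomeScope fits.

```python
from typing import Dict, Tuple


def parse_histogram(text: str) -> Dict[int, int]:
    """Parse 'depth count' lines into {depth: number of distinct k-mers}."""
    hist: Dict[int, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        hist[int(parts[0])] = int(parts[1])
    return hist


def estimate_genome_size(hist: Dict[int, int], min_depth: int = 5) -> Tuple[int, int]:
    """Return (estimated genome size, peak depth).

    k-mers seen fewer than min_depth times are treated as sequencing
    errors and ignored (min_depth is an assumed cutoff; pick it from
    the valley in your own histogram).
    """
    usable = {d: c for d, c in hist.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)          # depth with most distinct k-mers
    total_kmers = sum(d * c for d, c in usable.items())
    return total_kmers // peak_depth, peak_depth


# Toy histogram, not real data
toy = parse_histogram("1 1000\n2 200\n19 50\n20 100\n21 50\n40 10\n")
size, peak = estimate_genome_size(toy)
print(size, peak)
```

One caveat that matters here: in a highly heterozygous genome the tallest peak can be the heterozygous (half-depth) peak, and dividing by it roughly doubles the estimate, so check the histogram by eye for two peaks.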
What do the basic overall stats look like when comparing the two assemblies, NG50 for example? I suggest using an arbitrarily high estimated genome size when calculating these (maybe 1-1.2 Gb), just for comparison purposes; the plain N50 will not be directly comparable between assemblies of different total length. Also, I recommend looking at MEGAHIT over SOAPdenovo2 (note the GitHub docs on SOAPdenovo2 also state this). Don't include the bubble file with the Platanus data; those are generally the redundant sequences (possible allelic variants).
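In case it helps, here is a minimal sketch of the NG50 calculation: sort the sequence lengths from longest to shortest and report the length at which the running total first reaches half of the chosen genome size. The helper names are mine, and the FASTA reader is deliberately simple (it just sums sequence line lengths per record).

```python
from typing import Iterable, Iterator


def fasta_lengths(lines: Iterable[str]) -> Iterator[int]:
    """Yield the length of each record from an iterable of FASTA lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line)
    if length is not None:
        yield length


def ng50(lengths: Iterable[int], genome_size: int) -> int:
    """Length at which the cumulative sum reaches 50% of genome_size.

    Returns 0 if the assembly covers less than half of genome_size.
    """
    target = genome_size / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0


def n50(lengths: Iterable[int]) -> int:
    """N50 is NG50 with the assembly's own total length as the genome size."""
    lengths = list(lengths)
    return ng50(lengths, sum(lengths))


# Toy lengths, not real data
toy = [100, 80, 60, 40, 20]
print(n50(toy), ng50(toy, 400))

# Usage on real files (paths are placeholders):
# with open("contig.fa") as handle:
#     print(ng50(fasta_lengths(handle), 1_200_000_000))
```

Run it on both assemblies with the same `genome_size` value so the two numbers are on the same scale.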
Also, like most assemblers, Platanus and SOAPdenovo2/MEGAHIT have options at the contig and scaffold steps. These can be used to reduce redundancy and to play with linkage parameters.
I am still working on the Platanus scaffolding. Here are the stats:

Total sequences          569886
Total bases              755283165
Min sequence length      100
Max sequence length      412374
Average sequence length  1325.32
Median sequence length   200.00
N25 length               47550
N50 length               12865
N75 length               2006
N90 length               561
N95 length               210
As                       29.04 %
Ts                       28.69 %
Gs                       20.22 %
Cs                       20.19 %
(A + T)s                 57.73 %
(G + C)s                 40.41 %
Ns                       1.86 %

Still not good enough.