I have assembled a genome from an interleaved FASTQ file using several assemblers (Velvet, ABySS, Minia, SPAdes, etc.) and have a "contigs.fasta" file from each of them. In total I have run over 50 assemblies with different parameters and options, and I have evaluated each "contigs.fasta" file with QUAST. I know that the genome I am trying to assemble is 200,000 bp long. However, for about 95% of my assemblies, QUAST reports a "Total length" and "Total length (>= 0 bp)" of around 390,000. What is the problem? Does "Total length" in QUAST refer to something different? Why can't I get a value anywhere near the expected 200,000? I have tried many combinations of k-mer size, coverage cutoff, and expected coverage.
Yes! It is simulated data (probably generated with MATLAB, though I am not sure about that), and all I know about it is that the original genome length is 200,000 and the coverage is 50.
Here are the values I got from running QUAST:
https://docs.google.com/document/d/1ElQsrC4qx8X-a6j-pZs6OPVsPKK_nRAOd0IMW3Xq-iw/edit?usp=sharing
Okay, I can see why you are in doubt about the 200 kbp :). If the data was simulated, some form of errors, heterogeneity, or possibly repeats must have been introduced into it; otherwise the assemblers would not have such a hard time with such a small data set. If you cannot find out exactly how the set was generated, you could run a k-mer analysis to a) estimate the genome size and b) determine the level of noise. Something along these lines: http://koke.asrc.kanazawa-u.ac.jp/HOWTO/kmer-genomesize.html
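To make the k-mer idea concrete, here is a minimal pure-Python sketch of such an estimate. It is not taken from the linked how-to; the function names, the valley/peak heuristic, and the error-free `simulate_reads` helper are all my own assumptions for illustration. On real data a dedicated k-mer counter such as Jellyfish would normally do the counting, and only the histogram step would look like this. The idea: count canonical k-mers, find the main peak of the depth histogram (skipping the low-depth k-mers that mostly come from errors), and divide the total number of k-mers by the peak depth. Note that the peak is the k-mer depth, which is lower than the base coverage by a factor of (L - k + 1) / L for reads of length L.

```python
import random
from collections import Counter

_COMP = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMP)[::-1]


def count_kmers(reads, k):
    """Count canonical k-mers (a k-mer and its reverse complement are merged)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" in kmer:
                continue  # skip k-mers containing ambiguous bases
            counts[min(kmer, revcomp(kmer))] += 1
    return counts


def kmer_histogram(counts):
    """Map depth -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


def estimate_genome_size(hist):
    """Return (estimated genome size, peak k-mer depth).

    Heuristic (an assumption of this sketch): walk up from depth 1 while the
    histogram keeps falling to find the valley after the error k-mers, then
    take the most populated depth at or above that valley as the main peak.
    """
    max_depth = max(hist)
    valley = 1
    while valley < max_depth and hist.get(valley + 1, 0) < hist.get(valley, 0):
        valley += 1
    peak = max((d for d in hist if d >= valley), key=lambda d: hist[d])
    total_kmers = sum(d * n for d, n in hist.items() if d >= valley)
    return round(total_kmers / peak), peak


def simulate_reads(genome_size, read_len, coverage, seed=0):
    """Hypothetical helper: error-free reads from a random genome, both strands."""
    rng = random.Random(seed)
    genome = "".join(rng.choice("ACGT") for _ in range(genome_size))
    reads = []
    for _ in range(genome_size * coverage // read_len):
        start = rng.randrange(genome_size - read_len + 1)
        read = genome[start:start + read_len]
        reads.append(revcomp(read) if rng.random() < 0.5 else read)
    return reads
```

For your data you would parse the sequences out of the FASTQ file and pass them to `count_kmers`, then compare the result with the expected 200,000. If the estimate also comes out near 390,000, the reads themselves contain more sequence than you were told; if it comes out near 200,000, the extra assembly length more likely comes from errors or heterogeneity splitting the assembly into redundant contigs.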