Generally speaking, longer contigs mean a better assembly. It is much better to have a small number of long contigs than a large number of short contigs, even if the total number of bases is identical between the two.
Longer contigs can be binned more reliably because they carry a stronger signal. Consider a 1000 bp contig, which contains a total of 997 overlapping tetranucleotides (4Ns). Since there are 256 possible 4Ns, a 1 Kb contig has on average ~4 of each 4N, which is not necessarily a strong enough signal to distinguish between different contigs. A 5 Kb contig has on average ~20 of each 4N, which is a much stronger signal. This is why there is not much point in trying to bin contigs smaller than 1 Kb, and I often toss out all contigs < 2 Kb.
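To make the arithmetic above concrete, here is a minimal sketch. The function name is made up for illustration, and it assumes every overlapping window counts and all 256 tetranucleotides are equally likely, which real genomes never quite satisfy.

```python
def mean_count_per_tetramer(contig_length, k=4):
    """Average number of occurrences per possible k-mer in a contig.

    Assumes overlapping windows and a uniform k-mer distribution.
    """
    windows = contig_length - k + 1   # e.g. 997 windows in a 1000 bp contig
    possible = 4 ** k                 # 256 distinct tetranucleotides
    return windows / possible


for length in (1000, 2000, 5000):
    print(length, round(mean_count_per_tetramer(length), 1))
```

For 1, 2 and 5 Kb contigs this gives about 3.9, 7.8 and 19.5 occurrences per 4N, which matches the ~4 and ~20 figures above.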
Binning is done by two parameters mainly which are tetranucleotide frequencies and 2nd is contig abundance or differential coverage.
Not sure if you don't understand this or are just using imprecise language. 4N frequencies are not a single parameter, but rather a vector of 256 numbers (or 136, if each 4N is counted together with its reverse complement). Average contig depth of coverage, on the other hand, is a single parameter.
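The 256 versus 136 difference comes from whether a 4N and its reverse complement are counted as one. Below is a rough sketch of both counts and of a 136-long frequency vector. The helper names are mine, and skipping windows with ambiguous bases is an assumption, not something any particular binning tool is guaranteed to do.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    # Represent a k-mer and its reverse complement by whichever sorts first.
    return min(kmer, reverse_complement(kmer))


all_tetramers = ["".join(p) for p in product("ACGT", repeat=4)]
canonical_tetramers = sorted({canonical(t) for t in all_tetramers})
print(len(all_tetramers), len(canonical_tetramers))


def tetramer_frequencies(seq):
    """Return a frequency vector over the canonical tetramers."""
    counts = dict.fromkeys(canonical_tetramers, 0)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if set(kmer) <= set("ACGT"):      # skip windows containing N etc.
            counts[canonical(kmer)] += 1
    total = sum(counts.values()) or 1     # avoid dividing by zero
    return [counts[t] / total for t in canonical_tetramers]
```

The 16 tetramers that are their own reverse complement (such as ACGT) are why the reduced count is 136 rather than 128.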
How does coverage help? Well, if you have two bins with average coverage of 10x and 100x, respectively, and you have another single contig with 100x coverage, which bin is more likely to be the correct one for that contig? In simple terms, it is expected that contigs that came from the same organism have the same abundance, and will therefore be sequenced at a similar coverage. There are several reasons why this doesn't always hold, but it is a good starting assumption.
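As a toy version of that reasoning, the sketch below picks the bin whose average coverage is closest to the contig's. The bin names are hypothetical, and comparing on a log scale (fold difference) is my own assumption. Real binners use more elaborate models.

```python
import math


def closest_bin_by_coverage(contig_cov, bin_covs):
    """Return the name of the bin whose mean coverage is closest in fold terms."""
    return min(
        bin_covs,
        key=lambda name: abs(math.log(contig_cov / bin_covs[name])),
    )


bins = {"bin_A": 10.0, "bin_B": 100.0}   # hypothetical bins at 10x and 100x
print(closest_bin_by_coverage(100.0, bins))
```

With the numbers from the example above, the 100x contig goes to the 100x bin.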
Now, this is where the exact number of parameters comes into play, which is why I felt it necessary to correct you above. 4N frequencies already comprise a 256-long vector, so adding a single value to it (the coverage) doesn't change the whole signal in a fundamental way. At least this has been the case in my experiments, which is why I don't use coverage for binning.
If you want to use coverage, it is calculated by mapping all raw reads to your contigs and then counting the mapped reads per contig, normalized by contig length so that long and short contigs are comparable.
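A raw read count grows with contig length, so a depth estimate usually divides by it. Here is a minimal sketch, assuming fixed-length reads that align over their full length. The function name and the numbers are invented for illustration, and tools that work from per-base depth will give somewhat different values.

```python
def mean_depth(mapped_reads, read_length, contig_length):
    """Approximate average depth of coverage for one contig.

    Assumes every mapped read contributes read_length aligned bases.
    """
    return mapped_reads * read_length / contig_length


# Hypothetical example: 2000 reads of 150 bp mapped to a 5 Kb contig.
print(mean_depth(2000, 150, 5000))
```

In this made-up case the contig would sit at roughly 60x.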
Hi Mensur Dlakic
Thanks for correcting me, sir! Thanks for your patience and for helping me understand the basics.
Now I have another question: I am binning the assemblies of individual samples, so does that affect the binning results? I am considering individual binning because each of my metagenome datasets is from a different site, so I think it would be wise not to use the co-assembly approach. Please correct me if I am wrong here. I would also like to know what binning tools you use for your metagenome analysis.
Thanks