Question

Metagenome Assembly

3.1 years ago

serene.s • 0

Hi Everyone!

I have two questions about the metagenome assembly and binning approach:

1. How does contig length affect the quality of the assembly and, later on, the quality of the bins?
2. Binning is done by two parameters mainly which are tetranucleotide frequencies and 2nd is contig abundance or differential coverage. I do not understand the meaning of the second parameter. How is contig coverage estimated, and how does it affect binning?

Thanks

taxonomy Assembly Binning Metagenomics • 1.0k views

score 0 · Answer 1 · 2022-07-08

Generally speaking, longer contigs mean better assembly. It is much better to have a small number of long contigs than a large number of short contigs, even if the number of bases is identical between the two.

Longer contigs can be binned more reliably because they carry a stronger signal. Consider a 1000 bp contig, which contains a total of 997 overlapping tetranucleotides (4Ns). Since there are 256 possible 4Ns, a 1 Kb contig has on average ~4 occurrences of each one, which is not necessarily a strong enough signal to distinguish between different contigs. A 5 Kb contig has on average ~20 occurrences of each 4N, which is a much stronger signal. This is why there is not much point in trying to bin contigs smaller than 1 Kb, and I often toss out all contigs < 2 Kb.
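The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the "average count" assumes all 256 tetranucleotides are equally likely, which real genomes do not satisfy.

```python
from collections import Counter


def tetranucleotide_counts(seq):
    """Count every overlapping 4-mer that consists only of A, C, G, T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if set(kmer) <= set("ACGT"):  # skip windows containing N or other symbols
            counts[kmer] += 1
    return counts


def mean_count_per_tetranucleotide(contig_length):
    """Average occurrences per 4-mer if all 256 were equally frequent (an assumption)."""
    windows = max(contig_length - 3, 0)  # a contig of length L has L - 3 overlapping 4-mers
    return windows / 256


for length in (1000, 2000, 5000):
    print(length, round(mean_count_per_tetranucleotide(length), 1))
# 1000 3.9
# 2000 7.8
# 5000 19.5
```

With only ~4 expected observations per tetranucleotide, chance fluctuations in a 1 Kb contig are large relative to the counts themselves, which is the reason short contigs bin poorly.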

"Binning is done by two parameters mainly which are tetranucleotide frequencies and 2nd is contig abundance or differential coverage."

I am not sure whether you misunderstand this or are just using imprecise language. 4N frequencies are not a single parameter, but rather a vector of 256 numbers (or 136, if each 4N is counted together with its reverse complement). Average contig depth of coverage, on the other hand, is a single parameter.
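The two vector lengths can be derived directly. The sketch below assumes the usual convention that a tetranucleotide and its reverse complement are merged into one "canonical" entry (here, the alphabetically smaller of the two); the variable names are illustrative only.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA k-mer."""
    return kmer.translate(COMPLEMENT)[::-1]


# All 4^4 possible tetranucleotides.
all_tetramers = ["".join(p) for p in product("ACGT", repeat=4)]

# Merge each tetranucleotide with its reverse complement.
canonical = {min(k, reverse_complement(k)) for k in all_tetramers}

# Tetranucleotides that are their own reverse complement (e.g. ACGT).
palindromes = [k for k in all_tetramers if k == reverse_complement(k)]

print(len(all_tetramers), len(canonical), len(palindromes))  # 256 136 16
```

The 16 palindromic tetranucleotides pair with themselves, and the remaining 240 collapse into 120 pairs, giving 120 + 16 = 136.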

How does coverage help? Well, if you have two bins with average coverage of 10x and 100x, respectively, and you have another single contig with 100x coverage, which bin is more likely to be the correct one for that contig? In simple terms, it is expected that contigs that came from the same organism have the same abundance, and will therefore be sequenced at a similar coverage. There are several reasons why this doesn't always hold, but it is a good starting assumption.
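As a toy illustration of that reasoning, the sketch below assigns a contig to whichever bin has the closest mean coverage. Comparing on a log scale (fold difference) is my assumption for the example, not something real binning tools are claimed to do, and the bin names and values are hypothetical.

```python
import math


def closest_bin_by_coverage(contig_coverage, bin_coverages):
    """Return the name of the bin whose mean coverage is closest in fold-difference terms."""
    return min(
        bin_coverages,
        key=lambda name: abs(math.log(contig_coverage / bin_coverages[name])),
    )


# Hypothetical bins with their average depth of coverage.
bins = {"bin_10x": 10.0, "bin_100x": 100.0}

print(closest_bin_by_coverage(100.0, bins))  # bin_100x
print(closest_bin_by_coverage(12.0, bins))   # bin_10x
```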

Now, this is where the exact number of parameters comes into play, which is why I felt it necessary to correct you above. 4N frequencies already comprise a 256-long vector, so adding a single value to it (the coverage) doesn't change the overall signal in a fundamental way. At least this has been the case in my experiments, which is why I don't use coverage for binning.

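If you want to use coverage, it is calculated by mapping all raw reads to your contigs and then counting the number of mapped reads per contig. Because longer contigs attract more reads at the same abundance, the count has to be normalized by contig length to give an average depth: roughly (mapped reads × read length) / contig length.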
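A minimal sketch of that calculation, assuming you already have the number of mapped reads and the length of each contig from your read mapper's output. The contig names, read counts, and the fixed 150 bp read length are hypothetical, and every read is assumed to align over its full length.

```python
def mean_depth(mapped_reads, read_length, contig_length):
    """Approximate average depth: total mapped bases divided by contig length.

    Assumes every mapped read aligns over its full length, which is a simplification.
    """
    return mapped_reads * read_length / contig_length


# Hypothetical per-contig summary: same read count, very different lengths.
contigs = {
    "contig_short": {"length": 2_000, "mapped_reads": 1_000},
    "contig_long": {"length": 20_000, "mapped_reads": 1_000},
}

READ_LENGTH = 150  # assumed read length in bp

for name, info in contigs.items():
    depth = mean_depth(info["mapped_reads"], READ_LENGTH, info["length"])
    print(f"{name}\t{depth:.1f}x")
# contig_short	75.0x
# contig_long	7.5x
```

The two contigs receive the same number of reads but differ tenfold in depth, which is why raw read counts alone should not be compared between contigs of different lengths.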