Metagenome Assembly
2.4 years ago
serene.s • 0

Hi Everyone!

I have two questions about the metagenome assembly and binning approach:

  1. How does contig length affect the quality of the assembly and, later on, the quality of the bins?

  2. Binning is done by two parameters mainly which are tetranucleotide frequencies and 2nd is contig abundance or differential coverage. I don't understand what the second parameter means. How is contig coverage estimated, and how does it affect binning?

Thanks

taxonomy Assembly Binning Metagenomics • 812 views
2.4 years ago
Mensur Dlakic ★ 28k

Generally speaking, longer contigs mean better assembly. It is much better to have a small number of long contigs than a large number of short contigs, even if the number of bases is identical between the two.

Longer contigs can be binned more reliably because they carry a stronger signal. Consider a 1000 bp contig, which contains a total of 997 overlapping tetranucleotides (4Ns). Since there are 256 possible 4Ns, a 1 kb contig has on average ~4 occurrences of each, which is not necessarily a strong enough signal to distinguish contigs from different organisms. A 5 kb contig has on average ~20 occurrences of each 4N, which is a much stronger signal. This is why there is not much point in trying to bin contigs shorter than 1 kb, and I often toss out all contigs < 2 kb.
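The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names `tetranucleotide_stats` and `count_kmers` are made up for illustration and do not come from any binning tool.

```python
from collections import Counter


def tetranucleotide_stats(contig_length: int, k: int = 4) -> tuple[int, float]:
    """Return (number of overlapping k-mer windows, mean count per possible k-mer)."""
    windows = max(contig_length - k + 1, 0)  # a 1000 bp contig has 997 windows
    possible = 4 ** k                        # 256 possible tetranucleotides
    return windows, windows / possible


def count_kmers(seq: str, k: int = 4) -> Counter:
    """Count overlapping k-mers, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


for length in (1000, 2000, 5000):
    windows, mean = tetranucleotide_stats(length)
    print(f"{length} bp: {windows} windows, ~{mean:.1f} per tetranucleotide")
```

For 1000, 2000 and 5000 bp this gives roughly 3.9, 7.8 and 19.5 expected occurrences per tetranucleotide, which is where the ~4 and ~20 figures come from.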

"Binning is done by two parameters mainly which are tetranucleotide frequencies and 2nd is contig abundance or differential coverage."

I'm not sure whether you don't understand this or are just using imprecise language. 4N frequencies are not a single parameter, but rather a vector of 256 numbers (or 136, if each 4N is counted together with its reverse complement). Average contig depth of coverage, on the other hand, is a single parameter.
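The 256 versus 136 figures can be reproduced by enumerating all tetranucleotides and merging each one with its reverse complement. This is a small sketch, and the helper names are hypothetical:

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Reverse complement of a DNA k-mer (uppercase ACGT only)."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Representative of a k-mer and its reverse complement: the lexicographically smaller one."""
    return min(kmer, reverse_complement(kmer))


all_4mers = ["".join(p) for p in product("ACGT", repeat=4)]
canonical_4mers = sorted({canonical(k) for k in all_4mers})

print(len(all_4mers))        # 256
print(len(canonical_4mers))  # 136
```

The count is 136 rather than 128 because 16 tetranucleotides (for example `ACGT`) are their own reverse complement and therefore do not pair up with a second sequence.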

How does coverage help? Well, if you have two bins with average coverage of 10x and 100x, respectively, and you have another single contig with 100x coverage, which bin is more likely to be the correct one for that contig? In simple terms, it is expected that contigs that came from the same organism have the same abundance, and will therefore be sequenced at a similar coverage. There are several reasons why this doesn't always hold, but it is a good starting assumption.
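The 10x versus 100x example can be written as a toy rule. This is a deliberately simplified sketch, not how any particular binning tool works: the bin names, the function `closest_bin_by_coverage`, and the choice to compare coverages on a log scale are all assumptions made for illustration.

```python
import math


def closest_bin_by_coverage(contig_cov: float, bin_covs: dict[str, float]) -> str:
    """Pick the bin whose mean coverage is closest to the contig's, as a fold difference."""
    return min(
        bin_covs,
        key=lambda name: abs(math.log(contig_cov / bin_covs[name])),
    )


bins = {"bin_A": 10.0, "bin_B": 100.0}  # hypothetical mean coverage per bin
print(closest_bin_by_coverage(100.0, bins))  # bin_B
print(closest_bin_by_coverage(12.0, bins))   # bin_A
```

Comparing fold differences rather than absolute differences is one possible design choice: a 12x contig is then treated as close to a 10x bin in the same way that a 120x contig would be close to a 100x bin.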

Now, this is where the exact number of parameters comes into play, which is why I felt it necessary to correct you above. 4N frequencies already make up a 256-long vector, so adding a single value (the coverage) doesn't change the overall signal in a fundamental way. At least this has been the case in my experiments, which is why I don't use coverage for binning.

If you want to use coverage, it is calculated by mapping all raw reads to your contigs, counting the mapped reads (or, more precisely, the mapped bases) per contig, and dividing by contig length to get the average depth.
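Raw read counts scale with contig length, so they are usually converted to an average depth before being compared between contigs. The sketch below assumes the reads have already been mapped and that per-contig counts are available as a tab-separated table with the columns name, length, mapped reads and unmapped reads, which is the layout printed by `samtools idxstats`. The fixed read length of 150 bp and the function names are assumptions for illustration. The result is approximate, because it treats every mapped read as fully aligned.

```python
def mean_depth(mapped_reads: int, read_length: int, contig_length: int) -> float:
    """Approximate average depth: mapped bases divided by contig length."""
    return mapped_reads * read_length / contig_length


def depth_from_idxstats(lines: list[str], read_length: int) -> dict[str, float]:
    """Parse idxstats-style lines (name, length, mapped, unmapped) into depths."""
    depths = {}
    for line in lines:
        name, length, mapped, _unmapped = line.rstrip("\n").split("\t")
        if name == "*" or int(length) == 0:
            continue  # the "*" row only reports reads that mapped nowhere
        depths[name] = mean_depth(int(mapped), read_length, int(length))
    return depths


# Hypothetical counts for two contigs, with 150 bp reads (assumed).
example = [
    "contig_1\t5000\t3000\t0",
    "contig_2\t2000\t200\t0",
    "*\t0\t0\t57",
]
print(depth_from_idxstats(example, read_length=150))
# {'contig_1': 90.0, 'contig_2': 15.0}
```

In this made-up example contig_1 has 15 times more mapped reads than contig_2 but only 6 times the depth, which is why the length normalization matters.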


Hi Mensur Dlakic

Thanks for correcting me, sir, and for your patience in explaining the basics.

Now I have another question: I am binning assemblies of individual samples, so does that affect the binning results? I chose per-sample binning because each of my metagenome datasets comes from a different site, so I think it would be wise not to use the co-assembly approach. Please correct me if I am wrong here. I would also like to know which binning tools you use for your metagenome analyses.

Thanks
