Question

How to calculate overall coverage in de novo metagenomics assembly?

9.6 years ago

scchess ▴ 640

Let's say I have a reference genome and I sequence it into short reads. Then I feed the reads to velvet to create a de novo assembly.

Let's say two or more contigs are assembled (but not the entire genome). velvet also reports k-mer coverage for each contig.

For example, if AGCGGCC is my reference genome, my two assembled contigs are AG (the first two bases) and GCC (the last three bases). I'm also given k-mer coverage for AG and GCC, 10.0 and 20.0 respectively.

How do I find the overall coverage for the genome? In RNA-seq, we can calculate something like RPKM abundance for a transcript, but is there anything like that in metagenomics? Does my question even make sense? I know everything about my reference genome, so can I report anything like coverage (or abundance) for the reference genome?

EDITED

The Ray assembler reports a biological abundances statistic. Is this the coverage that I'm trying to find?

https://github.com/sebhtml/ray/blob/master/Documentation/BiologicalAbundances.txt

metagenomics velvet • 7.7k views


9.6 years ago

Josh Herr 5.8k

Your question is confusing to me and is not very well communicated -- do you want to calculate coverage for a genome sequencing project or a metagenomic sequencing project?

Calculating coverage for a genome sequencing project is very straightforward -- there is plenty out there to help you figure it out.
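For the single-genome case, the usual back-of-the-envelope estimate is the Lander-Waterman expectation: mean coverage C = N × L / G, where N is the number of reads, L the read length, and G the genome size. A minimal sketch follows; the function name and the example numbers are hypothetical, chosen only for illustration.

```python
def expected_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Expected mean depth C = N * L / G for a single genome."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


# Hypothetical run: 1,000,000 reads of 100 bp against a 5 Mbp genome.
print(expected_coverage(1_000_000, 100, 5_000_000))  # 20.0
```

This only holds when every read comes from one genome of known size, which is exactly what cannot be assumed for a real metagenome.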

Calculating coverage for a metagenome assembly is not straightforward. First of all, you have no idea of the genome complexity and qualities of your "template" DNA. You'll have many different strains which represent distinct OTUs and which provide overlapping coverage. Because of this, coverage from k-mers is not an accurate measure of metagenome coverage. Even with mock communities, which barely approach the diversity of "real" metagenomic samples, you'll only be sequencing a small portion of your overall template -- in the best case, about 5 to 10% of metagenome sequencing reads will actually assemble. You therefore have to understand all the caveats of metagenome assembly and coverage when communicating any numbers relating to your research.

What I do is simple and perhaps not the best solution (but I am not aware of any others -- and I've looked -- most people do this): map reads with bwa or bowtie to your assembly (you won't get many, but you can see assembly "hot-spots") and communicate the caveats.
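Once the reads are mapped back, one number that can be reported alongside the caveats is the fraction of reads that align to the assembly. The sketch below assumes the mapper's output was saved as uncompressed SAM text; the function name and the toy records are hypothetical. It relies only on the SAM FLAG field: bit 0x4 marks an unmapped read, and bits 0x100 and 0x800 mark secondary and supplementary records, which are skipped so each read is counted once.

```python
def mapped_fraction(sam_lines):
    """Fraction of primary SAM records that are mapped (FLAG bit 0x4 unset)."""
    total = 0
    mapped = 0
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        # Skip secondary (0x100) and supplementary (0x800) alignments.
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        if not flag & 0x4:
            mapped += 1
    return mapped / total if total else 0.0


# Toy records with only the leading SAM columns filled in.
toy_sam = [
    "@SQ\tSN:contig1\tLN:1000",
    "read1\t0\tcontig1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "read3\t16\tcontig1\t5\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read3\t256\tcontig1\t9\t0\t4M\t*\t0\t0\tACGT\tIIII",
]
fraction = mapped_fraction(toy_sam)
```

For a real file, the same function can be fed an open file handle, for example `mapped_fraction(open("mapped.sam"))`, where `mapped.sam` is a placeholder name.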


Thanks for the link! I'm a bit confused, which is why I'm asking. I've checked the Ray assembler, and it reports something called biological abundances. Do you think this means coverage of a reference genome?

Please check: https://github.com/sebhtml/ray/blob/master/Documentation/BiologicalAbundances.txt

written 9.6 years ago by scchess ▴ 640

Still not sure if you're talking about genome or metagenome assembly -- this matters here, as they are not the same.

If you look at the link I posted from a previous question about Ray, you'll see it uses k-mers to measure coverage. Don't confuse k-mer coverage with actual read coverage, as strain diversity and similar OTUs will affect this.

Furthermore, you mention "reference genome" -- in a metagenomic sample how do you know what your reference genome is?
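On the k-mer versus read coverage distinction: for a single genome with reads of uniform length, the Velvet manual relates k-mer coverage Ck to base coverage C as Ck = C × (L − k + 1) / L, where L is the read length and k the hash length. A minimal sketch of inverting that relation follows; the function name and numbers are hypothetical, and the formula does not correct for strain diversity or shared k-mers between related organisms.

```python
def kmer_to_base_coverage(kmer_cov: float, read_length: int, k: int) -> float:
    """Invert Ck = C * (L - k + 1) / L to recover base coverage C."""
    if not 0 < k <= read_length:
        raise ValueError("k must satisfy 0 < k <= read_length")
    return kmer_cov * read_length / (read_length - k + 1)


# Hypothetical contig: k-mer coverage 10.0, 100 bp reads, k = 31.
base_cov = kmer_to_base_coverage(10.0, 100, 31)
```

The gap between the two values grows as k approaches the read length, which is one reason a k-mer coverage figure should not be quoted as read depth.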

written 9.6 years ago by Josh Herr 5.8k

Sorry my question was unclear; I'm struggling with the subject (it's quite technical). I actually have a known microbial community that I can use to simulate reads. The goal is to evaluate how each de novo assembler, such as velvet, performs relative to the community the reads come from. I know I can get k-mer coverage for a contig easily, but I'm struggling to understand whether I can also calculate k-mer coverage or actual read coverage for an organism. I asked because I'm not even sure my question makes sense. Everywhere, I see people talk about k-mer coverage for a contig, but what about the reference genome? Is it possible, and does it make sense, to calculate coverage for the genome?

written 9.6 years ago by scchess ▴ 640

One more time: Is this a metagenome (unknown reference) or a synthetic microbial community (known reference genomes)? This determines whether or not you can use k-mers to estimate coverage.

After your last comment here, I'm still confused about what exactly you are looking to do. What is your research question?

written 9.6 years ago by Josh Herr 5.8k

Ram · Accepted Answer · 2015-06-03

9.6 years ago

Brian Bushnell 20k

Since you have known references, coverage for the reference has nothing to do with an assembly, or assemblers, or kmers for that matter. Concatenate the references together and map all the reads to them to calculate coverage. For example, with BBMap:

bbmap.sh ref=concatenated.fasta in=reads.fq covstats=covstats.txt scafstats=scafstats.txt
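Because the concatenated reference can hold several sequences per organism (chromosomes, plasmids), per-sequence coverage still has to be rolled up to one number per genome. The sketch below computes a length-weighted mean. It assumes the covstats file is tab-separated with a header containing `#ID`, `Avg_fold`, and `Length` columns; that is an assumption to check against the header of your own file. The function name, the sequence-to-genome mapping, and the toy table are hypothetical.

```python
import csv
import io
from collections import defaultdict


def genome_coverage(covstats_text, seq_to_genome):
    """Length-weighted mean coverage per genome from a per-sequence table."""
    covered_bases = defaultdict(float)  # sum of Avg_fold * Length per genome
    total_length = defaultdict(int)     # sum of Length per genome
    reader = csv.DictReader(io.StringIO(covstats_text), delimiter="\t")
    for row in reader:
        # Use the first word of the FASTA header as the sequence name.
        seq_name = row["#ID"].split()[0]
        # Sequences missing from the mapping are reported under their own name.
        genome = seq_to_genome.get(seq_name, seq_name)
        length = int(row["Length"])
        covered_bases[genome] += float(row["Avg_fold"]) * length
        total_length[genome] += length
    return {g: covered_bases[g] / total_length[g]
            for g in covered_bases if total_length[g] > 0}


# Toy table: two sequences belonging to genomeA, one unassigned sequence.
toy_covstats = (
    "#ID\tAvg_fold\tLength\n"
    "chr1\t10.0\t2\n"
    "chr2\t20.0\t3\n"
    "plasmid\t5.0\t4\n"
)
result = genome_coverage(toy_covstats, {"chr1": "genomeA", "chr2": "genomeA"})
print(result)  # {'genomeA': 16.0, 'plasmid': 5.0}
```

Weighting by length matters: a plain average of 10.0 and 20.0 would give 15.0, whereas the longer sequence pulls the per-genome value to 16.0.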
