Let's say I have a reference genome and I sequence it into short reads. Then I feed the reads to Velvet to create a de novo assembly.
Let's say two or more contigs are assembled (but not the entire genome). Velvet also reports the k-mer coverage for each contig.
For example, if AGCGGCC is my reference genome, my two assembled contigs are AG (the first two bases) and GCC (the last three bases). I'm also given the k-mer coverage for AG and GCC: 10.0 and 20.0, respectively.
How do I find the overall coverage for the genome? In RNA-seq, we can calculate something like an RPKM abundance for a transcript, but is there anything like that in metagenomics? Does my question even make sense? I know everything about my reference genome, so can I report anything like coverage (or abundance) for the reference genome?
EDITED
The Ray assembler reports a biological abundances statistic. Is this the coverage that I'm trying to find?
https://github.com/sebhtml/ray/blob/master/Documentation/BiologicalAbundances.txt
Thanks for the link! I'm a bit confused, which is why I'm asking. I've checked the Ray assembler, and it has something called biological abundances. Do you think this means coverage of a reference genome?
Please check: https://github.com/sebhtml/ray/blob/master/Documentation/BiologicalAbundances.txt
Still not sure if you're talking about genome or metagenome assembly -- this matters here, as they are not the same.
If you look at the link I posted from a previous question about Ray, you'll see it uses k-mers to measure coverage. Don't confuse k-mer coverage with actual read coverage, as strain diversity and similar OTUs will affect this.
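To make the k-mer versus read coverage distinction concrete, here is a minimal sketch. It assumes the relation given in the Velvet manual, Ck = C * (L - k + 1) / L, where Ck is k-mer coverage, C is read (nucleotide) coverage, L is read length and k is the hash length. The function names and the contig lengths are hypothetical, chosen only to match the toy example in the question, and the length-weighted mean is just one reasonable way to summarize several contigs, not something Velvet or Ray reports.

```python
def kmer_to_read_coverage(kmer_cov, read_len, k):
    """Invert Ck = C * (L - k + 1) / L to recover read coverage C."""
    if not 0 < k <= read_len:
        raise ValueError("k must satisfy 0 < k <= read_len")
    return kmer_cov * read_len / (read_len - k + 1)


def length_weighted_mean(contigs):
    """Average coverage over contigs, weighting each contig by its length.

    `contigs` is a list of (length, coverage) pairs.
    """
    total_len = sum(length for length, _ in contigs)
    if total_len == 0:
        raise ValueError("total contig length is zero")
    return sum(length * cov for length, cov in contigs) / total_len


# Toy contigs from the question: AG (length 2, coverage 10.0)
# and GCC (length 3, coverage 20.0).
toy_contigs = [(2, 10.0), (3, 20.0)]
print(length_weighted_mean(toy_contigs))

# Assumed read length 100 and k = 31, purely for illustration.
print(kmer_to_read_coverage(70.0, read_len=100, k=31))
```

Note that this only summarizes the assembled part of the genome. Regions that were never assembled contribute nothing, and in a mixed community, k-mers shared between similar strains can inflate the per-contig numbers.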
Furthermore, you mention "reference genome" -- in a metagenomic sample how do you know what your reference genome is?
Sorry, I made my question unclear because I'm struggling with the subject (it's quite technical). I actually have a known microbial community that I can use to simulate reads. The goal is to evaluate how each de novo assembler, such as Velvet, performs relative to the community the reads come from. I know I can get k-mer coverage for a contig easily, but I'm struggling to understand whether I can also calculate k-mer coverage or actual read coverage for an organism. I asked because I'm not even sure my question makes sense. Everywhere, I see people talk about k-mer coverage for a contig, but what about the reference genome? Would it be possible, or even make sense, to calculate coverage for the genome?
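For a simulated community where the source genome of every read is known, one simple definition of per-organism read coverage is the standard depth formula C = N * L / G (N reads of length L drawn from a genome of length G). The sketch below assumes that definition and fixed-length reads. The function name, organism names and numbers are hypothetical placeholders, not output from any simulator.

```python
def expected_coverage(num_reads, read_len, genome_len):
    """Average read depth C = N * L / G for one genome."""
    if genome_len <= 0:
        raise ValueError("genome_len must be positive")
    return num_reads * read_len / genome_len


# Hypothetical community: organism -> (reads simulated, genome length in bp).
community = {
    "organism_A": (10_000, 1_000_000),
    "organism_B": (40_000, 2_000_000),
}

READ_LEN = 100  # assumed fixed read length

for name, (num_reads, genome_len) in community.items():
    depth = expected_coverage(num_reads, READ_LEN, genome_len)
    print(f"{name}: {depth:.2f}x")
```

Comparing these known depths with the coverage values an assembler reports for the contigs that map back to each genome is one possible way to frame the evaluation.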
One more time: is this a metagenome (unknown reference) or a synthetic microbial community (known reference genomes)? This determines whether you can use k-mers to estimate coverage.
After your last comment here, I'm just confused about what exactly you are looking to do. What is your research question?