Hi,
So I ran SPAdes with a k-mer size of 77, and my read length is 100 bp.
This is one of my contigs:
>NODE_86_length_345_cov_0.615672
TCCTTTGACTCCTTTGACACTGACAAATTGGCTTCCATATTTTATACCTTAATCATCTAATTGGCTTCCATATTTTATACCTTAATCATCT...
So this contig has a length of 345 bp, and the last value is, from what I understand, the k-mer coverage, which is 0.61.
Based on this definition:
https://groups.google.com/forum/#!topic/abyss-users/47xOSCy9uQc
the k-mer coverage of a contig is the number of k-mers that map (with perfect identity) to that contig.
Now I am confused by that definition. In my case, how can I have a k-mer coverage of 0.61 when my contig must have been built from more than one k-mer (its total length is 345 bp)?
Thanks for the link.
So k-mer coverage is not really interpretable? Is it better to convert it to base coverage?
And is the definition I quoted from the Google Groups link wrong, then?
Of course k-mer coverage is interpretable.
The assembler will be happy if you provide it with an approximate value of the genome size, because it will use that value to check the assembly procedure.
You can supply that information as either the nucleotide coverage or the k-mer coverage. Since this assembler uses de Bruijn graphs with 77-mer nodes, you can express the nucleotide coverage as k-mer coverage.
The actual formula is Ck = C * (L - k + 1) / L, where:
Ck = k-mer coverage
C = nucleotide coverage
L = the length of your reads
k = the k-mer size (hash length) you select for the de Bruijn graph, in this case 77
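To make the formula concrete, here is a minimal Python sketch that converts between the two coverages using the numbers from this thread (L = 100, k = 77). The function names are my own and are not part of SPAdes or any other tool, and it assumes all reads have the same length.

```python
def kmer_coverage(base_cov: float, read_len: int, k: int) -> float:
    """Ck = C * (L - k + 1) / L: nucleotide coverage -> k-mer coverage."""
    return base_cov * (read_len - k + 1) / read_len


def base_coverage(kmer_cov: float, read_len: int, k: int) -> float:
    """C = Ck * L / (L - k + 1): k-mer coverage -> nucleotide coverage."""
    return kmer_cov * read_len / (read_len - k + 1)


if __name__ == "__main__":
    L, k = 100, 77  # read length and k-mer size from this thread
    # Each 100 bp read contains only L - k + 1 = 24 distinct 77-mers,
    # so the k-mer coverage is 0.24 times the nucleotide coverage.
    print(kmer_coverage(1.0, L, k))
    # The contig above reports cov_0.615672; converted back to
    # nucleotide coverage this is roughly 2.57x.
    print(base_coverage(0.615672, L, k))
```

With k this close to the read length, the k-mer coverage is always much lower than the nucleotide coverage, so a value below 1 does not by itself contradict the formula.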
Thanks for your explanation, that's very useful.
However, sorry, but I still don't have my answer. As I said in my post, the definition I found for k-mer coverage is the one quoted above: the number of k-mers that map (with perfect identity) to the contig.
This is not consistent with what I see. I have a k-mer coverage of 0.61, which should not be possible because my contig size is 345 bp and my k-mer size is 77.
It seems that there are two k-mer coverage definitions involved in this thread:
A k-mer coverage that takes the whole genome into account and depends on the total number of bases you sequenced.
A contig coverage. It is likely that this value tells you what proportion of the total k-mer coverage belongs to this particular contig, relative to all the contigs the assembler produced. To validate this, you should parse the header line of each contig in the FASTA file, extract the coverage values, sum them, and check whether the total is close to 100.
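The check described above can be sketched in a few lines of Python. This assumes headers of the form shown earlier in the thread (>NODE_86_length_345_cov_0.615672); the function names are made up for illustration, and the second header in the demo is a hypothetical record, not real output.

```python
import re

# Matches headers such as ">NODE_86_length_345_cov_0.615672".
HEADER_RE = re.compile(r"^>NODE_(\d+)_length_(\d+)_cov_([0-9.]+)")


def parse_header(line):
    """Return (node_id, length, coverage) or None if the line is not a matching header."""
    match = HEADER_RE.match(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


def coverage_values(lines):
    """Collect the coverage value from every matching header line."""
    values = []
    for line in lines:
        parsed = parse_header(line)
        if parsed is not None:
            values.append(parsed[2])
    return values


if __name__ == "__main__":
    # In practice, pass an open file instead: coverage_values(open("contigs.fasta"))
    demo = [
        ">NODE_86_length_345_cov_0.615672",
        "TCCTTTGACTCCTTTGACACTGACAAATTGG",
        ">NODE_1_length_5000_cov_200.0",  # hypothetical header for illustration
        "ACGTACGTACGT",
    ]
    covs = coverage_values(demo)
    print(len(covs), "contigs, coverage sum =", sum(covs))
```

If the per-contig values were percentages of a total, the sum over all contigs would come out close to 100.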
The sum of the contig coverage (or k-mer coverage) values is not equal to 100. For instance, some individual contigs have a coverage of 200.