I'm submitting a bunch of genomes to NCBI and they want genome coverage:
The estimated base coverage across the genome, eg 12x. This can be calculated by dividing the number of bases sequenced by the expected genome size and multiplying that by the percentage of bases that were placed in the final assembly. More simply it is the number of bases sequenced divided by the expected genome size.
Is there a tool that does this or should I just write a method myself?
Are you suggesting something like this?
Wouldn't this over estimate the coverage because it would imply that a read covers the entire contig? I just ran it for my genomes and in some cases I'm getting
5623X
which seems like a lot.In simplest terms, the coverage is all the bases sequenced (that belong to that genome) divided by genome size.
Your formula is correct. What you calculated is high coverage, but not unheard of if you are sequencing pure cultures. The easiest way to see whether that makes sense is to look at coverage of individual contigs. If they are in such high numbers, then the coverage of the whole genome will be high as well.
The coverage is not a simple sum of all the reads that cover any part of the contig, because you are right that would overestimate everything. The script I mentioned takes care of calculating the average coverage per base properly, which will be in file
coverage.tab
and look like this:By the way, you may have to get the script from old Autometa release, as it may have been discontinued in most recent versions:
https://github.com/KwanLab/Autometa/releases/tag/1.0
Oh I think that is probably the culprit, I was using counts from featurecounts. How is contig coverage calculated in the example above?
It is all in the script I referenced above, and I am fairly confident you know your way around the python code.
Briefly, the raw reads are mapped to your assembly using
bowtie2
, followed bysamtools
and a couple of custom perl scripts that come bundled in that version of Autometa I referenced above. If you already have a sorted.bam
file by mapping the raw reads using some other program, you can bypass several steps. If you look through the code it should be clear.Thanks, will do. I appreciate your help.