Hi all. I'm in the process of uploading a draft bacterial genome assembly to NCBI. NCBI asks for a coverage estimate based on (number of bases sequenced / expected genome size) x (% of bases placed in the final assembly). As this is a de novo project, I calculated this using the KmerGenie estimate for the expected genome size. The numbers are as follows:
Forward read FASTQ file: 5261180 reads, 1575030702 bases
Reverse read FASTQ file: 5049184 reads, 1511690223 bases
Total bases sequenced: 1575030702 + 1511690223 = 3086720925
KmerGenie genome size estimate: 4727586
Actual assembly size: 4706279
This gave a coverage calculation of: (3086720925 / 4727586) x ((4706279 / 3086720925) x 100) = 99.549304867
I am inexperienced, but this seems a high coverage. Does this calculation seem sensible?
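A quick arithmetic check on the formula as posted, offered as a sketch that uses only the numbers quoted above (the variable names are illustrative). The total number of bases sequenced appears once in a numerator and once in a denominator, so it cancels, and the result is just the assembly size as a percentage of the expected genome size rather than a fold coverage:

```python
# Numbers quoted in the post above.
total_bases = 1575030702 + 1511690223  # forward + reverse reads
expected_genome_size = 4727586         # KmerGenie estimate
assembly_size = 4706279                # final assembly length

# The formula exactly as written in the post.
as_posted = (total_bases / expected_genome_size) * (
    (assembly_size / total_bases) * 100
)

# total_bases cancels, leaving assembly size as a % of expected genome size.
simplified = assembly_size / expected_genome_size * 100

print(f"as posted:  {as_posted:.4f}")
print(f"simplified: {simplified:.4f}")
```

Both lines print the same value, which is why the figure lands near 99.5 regardless of how much sequencing was done.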
For bacterial genomes, high-coverage sequencing is easily achievable. Your genome size estimate (4.7 Mb) is also in line with the size of a typical bacterial genome. Is that number similar to a reference genome already in NCBI (or that of a closely related species)?
Hi, yes, this genome size is similar to that of other closely related species on NCBI. It was more the method for calculating coverage that I was concerned about.
I don't understand your math. The coverage is (bases sequenced) / (genome size), which is 3086720925 / 4706279 = approximately 656. If you want the coverage of reads placed in the final assembly, you'll have to map them and then use the total number of mapped bases as the numerator instead (BBMap will print the coverage after mapping if you include the flag "covstats=covstats.txt").
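The calculation described above can be sketched in a few lines. This is only an illustration using the figures from this thread; the function name `mean_coverage` is made up for the example:

```python
def mean_coverage(bases: int, genome_size: int) -> float:
    """Average fold coverage: bases sequenced divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return bases / genome_size


total_bases = 3086720925

# Assembly size as the denominator, as in the reply above.
print(round(mean_coverage(total_bases, 4706279)))

# KmerGenie estimate as the denominator, for comparison.
print(round(mean_coverage(total_bases, 4727586)))
```

To get the coverage of placed reads only, the same function would take the number of mapped bases as `bases` instead of the total.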
Hi Brian, thank you. I misunderstood what was meant by the % of bases placed in the final assembly (hence calculating it as the number of bases in the assembly as a % of the total number of bases sequenced). I have now mapped the reads as below:
$ bowtie2 -p 6 -f -x Str113_genomeidx -1 Strain113_S52_R1_001.fasta -2 Strain113_S52_R2_001.fasta --very-sensitive -X 1000 -I 200 | samtools view -bS - > Str113.bam
So I can use bedtools genomecov to estimate the coverage.
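One possible way to turn per-base genomecov output into a single mean depth, as a rough sketch. It assumes the three-column, tab-separated output of `bedtools genomecov -d` (sequence name, position, depth) and that the BAM has been coordinate-sorted first, which bedtools expects for `-ibam`. The function name and the file names in the comments are hypothetical:

```python
from typing import Iterable


def mean_depth(lines: Iterable[str]) -> float:
    """Mean per-base depth from `bedtools genomecov -d` style lines.

    Each non-empty line is assumed to be: name <tab> position <tab> depth.
    """
    total = 0
    positions = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _name, _pos, depth = line.split("\t")
        total += int(depth)
        positions += 1
    if positions == 0:
        raise ValueError("no depth records found")
    return total / positions


# Hypothetical usage, assuming the depth table was written to a file first:
#   samtools sort -o Str113.sorted.bam Str113.bam
#   bedtools genomecov -d -ibam Str113.sorted.bam > Str113.depth.txt
#
# with open("Str113.depth.txt") as handle:
#     print(mean_depth(handle))
```

Positions with zero depth are included in the average, so the result is the mean over the whole assembly rather than over covered bases only.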
Hi again Brian. This is what NCBI asks for: "The estimated base coverage across the genome, eg 12x. This can be calculated by dividing the number of bases sequenced by the expected genome size and multiplying that by the percentage of bases that were placed in the final assembly. More simply it is the number of bases sequenced divided by the expected genome size." I think this is depth, as opposed to coverage? Bowtie2 mapped 98.55% of the reads back to the assembly. For this NCBI calculation, would it be correct to do: (3086720925 / 4706279) x 0.99 = 656 x 0.99 = 649.44? This seems an extraordinarily large figure for coverage!
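The quoted NCBI wording can be written out as a small function. This is a sketch under two assumptions: the Bowtie2 read mapping rate (98.55%) stands in for the "percentage of bases placed", which is an approximation, and the function name `ncbi_coverage` is invented for the example:

```python
def ncbi_coverage(
    bases_sequenced: int,
    expected_genome_size: int,
    fraction_placed: float = 1.0,
) -> float:
    """(bases sequenced / expected genome size) x fraction of bases placed."""
    if expected_genome_size <= 0:
        raise ValueError("expected_genome_size must be positive")
    if not 0.0 <= fraction_placed <= 1.0:
        raise ValueError("fraction_placed must be between 0 and 1")
    return bases_sequenced / expected_genome_size * fraction_placed


total_bases = 3086720925
mapped_fraction = 0.9855  # assumption: read mapping rate used as a proxy

# Assembly size as the denominator, as in the post above.
print(f"{ncbi_coverage(total_bases, 4706279, mapped_fraction):.1f}x")

# KmerGenie estimate as the denominator ("expected genome size").
print(f"{ncbi_coverage(total_bases, 4727586, mapped_fraction):.1f}x")

# The simpler form from the quoted text: no placement correction.
print(f"{ncbi_coverage(total_bases, 4727586):.1f}x")
```

Whichever denominator is used, the result stays in the mid-600s, so the choice between the two genome size figures makes little practical difference here.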
Not really. Coverage can be anything; 600x for a bacterial genome is not unusual.