So I am trying to estimate the size of a genome, and getting confusing and inconsistent results. So far I have:
- assembly size. The assemblies all come out at a cumulative size of 600-700 Mbp, depending on the assembler and which sequence data are used, with the higher numbers coming from the assemblies considered better quality. Mapping back reads, DNA, fosmids, etc. shows that the assembly is pretty much complete, N50 is good, and there are no obvious problems.
- k-mer estimates on the inbred strain used in the assembly. Using the jellyfish recipe here: https://bioinformatics.uconn.edu/genome-size-estimation-tutorial/, I get an estimate of 1 Gbp.
- Edit: k-mer estimate on pooled wild strains. With a k-mer coverage of 55x, the estimated size is slightly above 1 Gbp, consistent with the inbred strain.
- mapping estimates. Calculating coverages with samtools depth, integrating (summing coverages times counts) and dividing by modal coverage gives between 800 and 900 Mbp, with the larger libraries tending towards the lower end.
- lab methods (not sure which¹, but not based on sequencing) consistently report a genome size of 1.6 Gbp.
Questions: What am I doing wrong here? Are discrepancies of this kind to be expected (i.e., are all these methods close to worthless)? Are there other methods I can (easily) use to get more estimates?
¹ Edit: I checked. We have used staining densitometry and flow cytometry, and both apparently give a size of 1.5-1.6 Gbp, using human and chicken as controls.
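For reference, the mapping estimate described above (summing coverage times counts, then dividing by the modal coverage) can be sketched roughly as follows. This is a minimal sketch, not the exact script I ran: the function names are made up for illustration, the input is assumed to be the three-column, tab-separated output of samtools depth (reference, position, depth), and the `min_depth` cutoff used when picking the mode is my own assumption.

```python
from collections import Counter


def depth_histogram(lines):
    """Count how many positions were reported at each depth.

    `lines` is any iterable of text lines in the three-column,
    tab-separated format written by `samtools depth`.
    """
    hist = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        _ref, _pos, depth = line.rstrip("\n").split("\t")[:3]
        hist[int(depth)] += 1
    return hist


def genome_size_from_depth(hist, min_depth=1):
    """Estimate genome size as (total mapped bases) / (modal coverage).

    `min_depth` (an assumption of this sketch) keeps uncovered or
    barely covered positions from being picked as the mode.
    """
    total_bases = sum(depth * count for depth, count in hist.items())
    candidates = {d: n for d, n in hist.items() if d >= min_depth}
    modal_depth = max(candidates, key=candidates.get)
    return total_bases / modal_depth, modal_depth
```

To run it on real data, the output of samtools depth could be piped into a small wrapper that calls `depth_histogram(sys.stdin)`. Note that the result depends heavily on how repeats and multi-mapping reads were handled by the aligner.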
Do you have any other prior knowledge, such as ploidy state, heterozygosity level, ...?
And yes, discrepancies are to be expected here (they are all estimates), but ideally they should all point to roughly the same ballpark number.
As far as I can tell, it is a regular diploid animal (copepod). One hypothesis is that the inbred specimens used for sequencing have lost parts of the genome, while the genome size experiments were done on wild specimens. I have sequences from wild individuals too, but currently they are unavailable, thanks to our IT department. Will keep you posted :-)
Count the k-mers. The k-mer counts may shed some light on the genome size.
As mentioned, I used jellyfish to do this. With k-mer sizes of 21, 25, 29, and 31, I get k-mer coverages of 20, 19, 18 and 17, and genome size estimates of 975, 981, 986 and 1017 Mbp respectively.
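In case it helps to see what these numbers are based on, here is a rough sketch of the calculation: total number of k-mers divided by the k-mer coverage at the main peak. It assumes a two-column histogram (multiplicity, number of distinct k-mers) as written by jellyfish histo. The function names and the simple "first local minimum" rule for cutting off low-multiplicity error k-mers are my own illustrative choices, not necessarily what the tutorial does.

```python
def parse_histo(lines):
    """Parse whitespace-separated (multiplicity, count) pairs."""
    hist = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            hist.append((int(parts[0]), int(parts[1])))
    return hist


def kmer_genome_size(hist):
    """Estimate genome size from a k-mer histogram.

    Low-multiplicity k-mers (mostly sequencing errors) are dropped by
    cutting at the first local minimum; this cutoff rule is an
    assumption of the sketch and should be checked against a plot.
    """
    valley_idx = 0
    for i in range(1, len(hist)):
        if hist[i][1] > hist[i - 1][1]:
            valley_idx = i - 1  # counts start rising again here
            break
    trusted = hist[valley_idx:]
    # k-mer coverage: multiplicity with the most distinct k-mers
    peak_multiplicity = max(trusted, key=lambda mc: mc[1])[0]
    total_kmers = sum(mult * count for mult, count in trusted)
    return total_kmers / peak_multiplicity, peak_multiplicity
```

With a heterozygous sample the histogram can show two peaks (heterozygous and homozygous), and picking the wrong one as the coverage changes the estimate by about a factor of two, so it is worth plotting the histogram rather than trusting the automatic peak.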
Sounds like it's a real biological difference between your inbred and wild type strains. 50% of the expected size sounds a bit low.
I have almost always seen assemblies making up about 70-90% of the expected genome size (e.g. 0.550 / 0.75 Gbp, 2.2 / 2.9 Gbp). Remember that manually assembled "finished" assemblies contain a lot of Ns of unknown sequence (telomeres, centromeres, large repeats etc.). That said, I have only ever worked with diploid, inbred or doubled haploid samples (no highly heterozygous samples).
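To put the numbers from this thread side by side, the ratios can be worked out with a few lines like the ones below. This is only an illustrative sketch: the helper name `assembly_fraction` is made up, and 650 Mbp is simply the midpoint of the 600-700 Mbp assembly range quoted in the question.

```python
def assembly_fraction(assembly_size, expected_size):
    """Return assembly size as a fraction of the expected genome size.

    Both sizes must be in the same unit (e.g. Mbp or Gbp).
    """
    return assembly_size / expected_size


# The two examples above, in Gbp
examples = [(0.550, 0.75), (2.2, 2.9)]
example_fractions = [assembly_fraction(a, e) for a, e in examples]

# The copepod assembly (midpoint of 600-700 Mbp, an assumption)
# against the k-mer estimate and the lab estimate, in Mbp
vs_kmer = assembly_fraction(650, 1000)
vs_lab = assembly_fraction(650, 1600)

for label, frac in [("vs k-mer estimate", vs_kmer), ("vs lab estimate", vs_lab)]:
    print(f"{label}: {frac:.0%}")
```

Against the k-mer estimate the assembly is in the same general range as the examples above, while against the lab estimate it falls well below it.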
I found some wild sequences and ran the k-mer analysis. This gives slightly over 1 Gbp. Will do the mapping coverage estimate when I get the BAM files.