Hi folks,
I tried Genome Size Estimation using 2 resources:
Instructions at Kanazawa's K-mer and genome size estimation page
A Perl script from Joseph Ryan
For both of these, I used the same Jellyfish k-mer histogram as input, generated with the following commands:
jellyfish count -t 24 -C -m 25 -s 5G -o EthFoc-2_jellyfish --min-qual-char=5 EthFoc-2.*.txt_val*.fq # paired end file pair
and
jellyfish histo -o EthFoc-2_jellyfish.histo EthFoc-2_jellyfish
But the genome size and coverage estimates from each of these calculations are quite different.
Kanazawa's method results:
Genome size estimate - 56.3 Mb ; Coverage estimate - 50X? (also the peak of the histogram)
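For reference, a histogram-based estimate of this kind usually divides the total number of non-error k-mers by the k-mer depth at the main peak. The sketch below is a minimal illustration of that general idea, not the exact procedure from Kanazawa's page. It assumes the two-column "multiplicity count" text format that `jellyfish histo` writes, and the function names (`parse_histo`, `estimate_from_histo`) and the toy numbers are hypothetical.

```python
def parse_histo(text):
    """Parse 'multiplicity count' pairs, one pair per line."""
    pairs = []
    for line in text.strip().splitlines():
        mult, count = line.split()
        pairs.append((int(mult), int(count)))
    return pairs


def estimate_from_histo(pairs):
    """Return (peak_depth, genome_size) from a k-mer histogram.

    Assumption: low-multiplicity k-mers before the first local minimum
    are sequencing errors and are excluded from the total.
    """
    counts = [c for _, c in pairs]
    # Index of the first local minimum (the valley after the error spike);
    # falls back to 0 if the histogram has no such valley.
    valley = next(
        (i for i in range(1, len(counts) - 1)
         if counts[i] <= counts[i - 1] and counts[i] < counts[i + 1]),
        0,
    )
    # Main peak: the most populated multiplicity at or after the valley.
    peak_idx = max(range(valley, len(counts)), key=lambda i: counts[i])
    peak_depth = pairs[peak_idx][0]
    # Total k-mer occurrences, excluding the presumed error k-mers.
    total_kmers = sum(m * c for m, c in pairs[valley:])
    return peak_depth, total_kmers / peak_depth


# Toy histogram (made-up numbers, not the EthFoc-2 data).
toy = """
1 1000
2 200
3 50
4 100
5 400
6 100
7 20
10 40
"""
peak_depth, genome_size = estimate_from_histo(parse_histo(toy))
print(peak_depth, genome_size)
```

On real data the result is sensitive to where the error cutoff is placed and to whether the main peak is the homozygous one, so the valley detection here should be checked against a plot of the histogram.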
Ryan's method results:
estimate_genome_size.pl --kmer=25 --peak=50 --fastq=EthFoc-2.S282_L007.1.txt_val_1.fq EthFoc-2.S282_L007.2.txt_val_2.fq
Genome size estimate - 36.8 Mb ; Coverage estimate - ~60X
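One thing that may explain part of the difference in the coverage figures: the histogram peak is a k-mer depth, while read-based estimates typically report base coverage. A commonly used relation is N = M * L / (L - k + 1), where M is the k-mer peak depth, L the mean read length, and k the k-mer size, with genome size then taken as total sequenced bases divided by N. Whether Ryan's script applies exactly this formula is an assumption here. The sketch below only shows the arithmetic; the read length and total base count are hypothetical placeholders, not values from these libraries.

```python
def base_coverage(kmer_peak, read_length, k):
    """Convert k-mer peak depth to base coverage: N = M * L / (L - k + 1)."""
    return kmer_peak * read_length / (read_length - k + 1)


def genome_size_from_reads(total_bases, kmer_peak, read_length, k):
    """Genome size = total sequenced bases / base coverage."""
    return total_bases / base_coverage(kmer_peak, read_length, k)


# Peak and k come from the post; the read length of 100 and the
# total of 2.5e9 bases are hypothetical values for illustration only.
coverage = base_coverage(kmer_peak=50, read_length=100, k=25)
size = genome_size_from_reads(2.5e9, kmer_peak=50, read_length=100, k=25)
print(round(coverage, 2), round(size))
```

Because trimmed reads vary in length, the mean read length used for L changes the result, which is one reason the two approaches can disagree even when they start from the same peak.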
My guess is that Kanazawa's is a good 1st approximation, but the Perl script is more accurate? - Yes / No / Maybe?
I like how Kanazawa's demo helped me deduce that repeat content is ~10% of the genome, but it seems like that would also be just a 1st approximation?
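A repeat estimate of this kind can be sketched as the share of non-error k-mer occurrences that fall outside a window around the single-copy peak. The window bounds (`low`, `high`) are hypothetical parameters that have to be chosen by eye from the histogram, and the numbers below are toy values rather than the EthFoc-2 data, so this is only an illustration of why the figure is approximate.

```python
def repeat_fraction(pairs, low, high):
    """Fraction of k-mer occurrences above the single-copy window.

    pairs: list of (multiplicity, count) tuples from a k-mer histogram.
    low:   smallest multiplicity treated as real (error cutoff).
    high:  largest multiplicity still counted as single-copy.
    """
    total = sum(m * c for m, c in pairs if m >= low)
    single_copy = sum(m * c for m, c in pairs if low <= m <= high)
    return 1 - single_copy / total


# Toy histogram: error k-mers at multiplicity 1-2, single-copy peak
# near 5, and a small high-multiplicity (repeat) tail at 10.
toy_pairs = [(1, 1000), (2, 200), (3, 50), (4, 100),
             (5, 400), (6, 100), (7, 20), (10, 40)]
print(round(repeat_fraction(toy_pairs, low=3, high=7), 4))
```

Shifting `high` by even a few multiplicities moves k-mers between the single-copy and repeat bins, so the resulting percentage depends noticeably on that choice.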
Has anyone encountered such a discrepancy before, or know whether one is more reliable than the other? I am interested in deducing repeat content, but due to the discrepancy, I'm not sure if my ~10% inference is even remotely accurate. Thoughts, anyone? Thank you!
Thanks for following up and posting this informative reply.