I have filtered paired-end genomic DNA sequence of non-model plant, sequenced through Hiseq 2000.Now I want to do k-mer analysis and want to estimate genome size. I am have tried to use jellyfish for my analysis. Please let me know the steps and commands to do k-mer and genome size estimation. I just tried with basic command from jellyfish manual
jellyfish count -m 21 -s 100M -t 10 -C reads.fasta
Wait, why are we dividing by the multiplicity of the camel hump peak?
Multiplicity of the hump = depth of coverage, so divide to determine 1X depth.
Wait what? I'm now even more confused.
This whole time, I've been taking the point at which we reach the first minima (dip), and just summing up everything after that. I've been getting numbers that are spot on to the correct genome size of my organism. If I start dividing by like 8 or 13, which is the multiplicity of most of my local maxima, then I will radically throw off my numbers. I've never needed to divide by anything, and the method seems largely the same as that of kmergenie's estimations.
Something else must be going on. My reads are of high coverage (30x). yet, my peak is about 13.
Can you give my a citation? I've literally never heard of this business with the multiplicity at the hump before. It wouldn't make any sense that the local minima's multiplicity is equal to the coverage because everything after the noise peak is just actual kmers. Adding more reads can't result in any higher numbers because the summation caps out at some asymptote
You must be describing a different value, perhaps the number of unique true kmers. That number will approach a limit with increasing data, as you saturate your sampling. And, if most of the genome is unique, then it's a reasonable approximation of genome size.
On the other hand, counting the frequency of each kmer is ENTIRELY dependent upon the read depth. Think about it: if you double the number of reads, then (on average) you double the number of times you count each true kmer in the genome.
A step-by-step explanation can be found at this link.
Strange, this never occurred to me before.I decided to test this experiment by taking a raw read fastq file, and concatenating it with itself in order to simulate twice the number of reads that I have.
I proceeded to do a jellyfish analysis
jellyfish count -m 21 -s 100M -t 10 -C file.fastq
. And when I proceeded to model the k-mer multiplicity histogram, I got the same exact results. Can you explain what's going on? I was expecting all my bars to move up by double