Question

Jellyfish-Kmer, Genome Size Estimation

3

Entering edit mode

11.4 years ago

jackuser1979 ▴ 890

I have filtered paired-end genomic DNA sequence of non-model plant, sequenced through Hiseq 2000.Now I want to do k-mer analysis and want to estimate genome size. I am have tried to use jellyfish for my analysis. Please let me know the steps and commands to do k-mer and genome size estimation. I just tried with basic command from jellyfish manual

jellyfish count -m 21 -s 100M -t 10 -C reads.fasta

illumina • 22k views

ADD COMMENT • link 11.3 years ago by jackuser1979 ▴ 890

Ram · Answer 1 · 2014-02-14

8

Entering edit mode

11.4 years ago

Joseph Hughes ★ 3.0k

After counting the kmers, the next step is to use the histo command to get the counts for each kmer, use the -f option to also get the counts of 0s: jellyfish count – m 21 –o output –s 100M –t 10 -C reads.fasta

jellyfish histo –f output > histogram.txt

Plot this histogram. Get the number of distinct kmers and the number of distinct error kmers. The error kmers are those between zero and the dip before the hump.

Determine the multiplicity of the second kmer peak (the hump). Divide the total number of good kmers (distinct kmers minus error kmers) by the multiplicity to get an estimate of the total genome size.

ADD COMMENT • link 11.4 years ago by Joseph Hughes ★ 3.0k

0

Entering edit mode

Wait, why are we dividing by the multiplicity of the camel hump peak?

ADD REPLY • link 9.6 years ago by Tom ▴ 20

0

Entering edit mode

Multiplicity of the hump = depth of coverage, so divide to determine 1X depth.

ADD REPLY • link 9.6 years ago by harold.smith.tarheel ★ 5.0k

0

Entering edit mode

Wait what? I'm now even more confused.

This whole time, I've been taking the point at which we reach the first minima (dip), and just summing up everything after that. I've been getting numbers that are spot on to the correct genome size of my organism. If I start dividing by like 8 or 13, which is the multiplicity of most of my local maxima, then I will radically throw off my numbers. I've never needed to divide by anything, and the method seems largely the same as that of kmergenie's estimations.

Something else must be going on. My reads are of high coverage (30x). yet, my peak is about 13.

Can you give my a citation? I've literally never heard of this business with the multiplicity at the hump before. It wouldn't make any sense that the local minima's multiplicity is equal to the coverage because everything after the noise peak is just actual kmers. Adding more reads can't result in any higher numbers because the summation caps out at some asymptote

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.6 years ago by Tom ▴ 20

0

Entering edit mode

You must be describing a different value, perhaps the number of unique true kmers. That number will approach a limit with increasing data, as you saturate your sampling. And, if most of the genome is unique, then it's a reasonable approximation of genome size.

On the other hand, counting the frequency of each kmer is ENTIRELY dependent upon the read depth. Think about it: if you double the number of reads, then (on average) you double the number of times you count each true kmer in the genome.

A step-by-step explanation can be found at this link.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.6 years ago by harold.smith.tarheel ★ 5.0k

0

Entering edit mode

Strange, this never occurred to me before.I decided to test this experiment by taking a raw read fastq file, and concatenating it with itself in order to simulate twice the number of reads that I have.

I proceeded to do a jellyfish analysis jellyfish count -m 21 -s 100M -t 10 -C file.fastq. And when I proceeded to model the k-mer multiplicity histogram, I got the same exact results. Can you explain what's going on? I was expecting all my bars to move up by double

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.6 years ago by Tom ▴ 20

score 5 · Answer 2 · 2014-02-14

+1 on both answers from Joseph Hughes and Zev.Kronenberg. Jellyfish is a really robust tool for counting k-mers.

If you're just looking for the estimated genome size, I would use Rayan Chikhi's KmerGenie which -- with a single command -- will output lots of graphs and give you the estimated genome size in the output file.

I was a little disappointed with your question as I found it to be a bit lazy -- with any tool you should always read the documentation (and the relevant literature) to figure out how to use it. We're here to help you, but people will be much more eager to help you when they see you have put some time into a question before asking how to do something.

score 4 · Answer 3 · 2014-02-14

4

Entering edit mode

11.4 years ago

Zev.Kronenberg 12k

I think this page has a great explanation :

https://banana-slug.soe.ucsc.edu/bioinformatic_tools:jellyfish

ADD COMMENT • link 11.4 years ago by Zev.Kronenberg 12k

score 0 · Answer 4 · 2014-04-01

0

Entering edit mode

11.3 years ago

jackuser1979 ▴ 890

To answer my own question, there is available perl script to get k-mer distribution from jellyfish with nice plot developed by Joesphryan.

ADD COMMENT • link 11.3 years ago by jackuser1979 ▴ 890