Question

Jellyfish -C option explanation

4

Entering edit mode

9.7 years ago

pbigbig ▴ 250

Hi,

When working with Jellyfish, I used sample command in their guide:

$ jellyfish count -m 22 -s 100M -t 16 -C merged.fq.gz

and I am not really understand the meaning of -C, which is described as "canonical", or (as in Jellyfish 1) "Count both strand, canonical representation". Doesn't it should be a default option (no need to put in)? What is the consequence if I do not include this -C in my jellyfish command?

I also wonder that is this option also pre-implemented in kmergenie?

Thank you in advance for your clarification!

jellyfish • 9.9k views

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 9.7 years ago by pbigbig ▴ 250

Ram · Answer 1 · 2015-08-04

8

Entering edit mode

9.7 years ago

Rob 7.1k

When counting k-mers in sequencing reads, there is really no way to differentiate between k-mers and their reverse complement. What I mean by this is that seeing e.g. ACGGT is equivalent to seeing ACCGT, since the latter is the reverse complement of the former and the sequenced reads don't originate from a prescribed strand of the DNA. The -C command in jellyfish considers both a k-mer and its reverse complement as equivalent, and associates the count for both (the sum of the count of a kmer and its reverse complement) with the k-mer among the two that is lexicographically smaller. So, for example, above only ACCGT would be stored and its count would be equal to the number of occurrences of both ACCGT and ACGGT. If you don't include -C in your jellyfish options, these k-mers will be treated separately. There's nothing "wrong" with this, per-se, but it may not be what you want.

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 9.7 years ago by Rob 7.1k

1

Entering edit mode

EDIT: I misread Robs explanation. As he stated -C collapses k-mers with its reverse complement.

>test1
AAAAATTTTT

jellyfish count -s 1M -k 5 -o withoutC.jf test.fa && jellyfish dump test.fa
>1
AAAAA
>1
AAAAT
>1
TTTTT
>1
AATTT
>1
AAATT
>1
ATTTT

jellyfish count -C -s 1M -k 5 -o withC.jf test.fa && jellyfish dump

>2
AAAAA
>2
AAAAT
>2
AAATT

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 9.7 years ago by thackl ★ 3.0k

0

Entering edit mode

That's what I said. -C considers both a kmer and its reverse "complement as equivalent". This means it collapses them.

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 9.7 years ago by Rob 7.1k

1

Entering edit mode

Yeah, proper reading definitely helps ;) - Sorry about the that!

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 9.7 years ago by thackl ★ 3.0k

0

Entering edit mode

No problem ;P

ADD REPLY • link 9.7 years ago by Rob 7.1k

0

Entering edit mode

Thanks a lot Rob!

ADD REPLY • link 9.7 years ago by pbigbig ▴ 250

0

Entering edit mode

Hi,

I just have experimented with ~30Gbyte of Miseq data, the final merged (by $ cat command) fastq file contains 3 sets of pair-end data (that means totally 6 fastq files). When run for 22mer with Jellyfish 2, with counting command included -C, I obtained ~700Mb genome size and coverage was about 27X. However, when using same command minus the -C, I obtained ~1.4Gb genome size and coverage was about 13X.

So which result is true? my data was haploid pair-end reads

Additionally, I 'm not really sure but kmergenie seem to be pre-determined to treat kmer and its reverse complement counterpart as a single block, therefore, is kmergenie always process in the same way as jellyfish with -C does?

Any suggestions and ideas are greatly welcome! Thanks

ADD REPLY • link updated 2.5 years ago by Ram 45k • written 9.7 years ago by pbigbig ▴ 250

0

Entering edit mode

The result with -C (~700Mb) is true. Treating reverse complement kmers independently would not make sense in a shotgun data set because you expect to have 50/50 fragments from both strands.

ADD REPLY • link 9.7 years ago by thackl ★ 3.0k

0

Entering edit mode

Thanks a lot

ADD REPLY • link 9.7 years ago by pbigbig ▴ 250

0

Entering edit mode

When counting k-mers in sequencing reads, there is really no way to differentiate between k-mers and their reverse complement. What I mean by this is that seeing e.g. ACGGT is equivalent to seeing ACCGT, since the latter is the reverse complement of the former and the sequenced reads don't originate from a prescribed strand of the DNA.

Hi Rob could you clarify this for me please? Is this just a consequence of the fact that a k-mer and its reverse complement would have arisen from the same coordinates on the genome? But wouldn't this lead to undercounting when a k-mer that isn't another's reverse complement incorrectly gets classified as such? Or is this resolved by keeping track of the sequence of origin and checking whether those sequences themselves are reverse complements of one another?

ADD REPLY • link 3.3 years ago by Dunois ★ 2.9k

1

Entering edit mode

Hi Dunois ,

The most common use of k-mer counters is to count k-mers from raw sequencing reads. The issues is that you often don’t know what strand of the DNA the sequencing read itself was draw from. In this case, there are really only three options. First, you could count k-mers in the orientation in which they are observed. This doesn’t distort the observed strand information, but then e.g. when you have reads drawn from the same region but different strands, they won’t overlap under “naive” string comparison. Second, you could keep track of only canonical k-mer counts. This is essentially equivalent to erasing the information about the strand on which the k-mer was observed. Both a k-mer and its reverse complement are associated with the same token in this case. This means that if e.g. I have a count of 3 for a k-mer k, I don’t know if I saw it 3 times in the fw orientation, or 2 times fw, 1 time rc, or 1 time fw, 2 times rc, or 3 times rc. The third option is, when you see a k-mer k, you increment the count of both k and rc(k) — I am not aware of any tool that actually does this, as it’s just an inefficient way to get equivalent information to the second approach.

The above discussion is with respect to counting k-mers from sequencing reads. On the other hand, if you are counting or indexing k-mers from known reference sequences, one may make a different decision. In that case, one may count k-mers in the observed orientation, since the ambiguity doesn’t exist, or they may, as you suggest, count k-mers canonically, but then track the orientation of the sequence from which this k-mer originated. The latter approach is what is done in many k-mer indices, where canonical k-mers are indexed, but the orientation of the k-mer with respect to the underlying reference sequence is stored as part of the index.

ADD REPLY • link 3.3 years ago by Rob 7.1k

1

Entering edit mode

Thanks a lot for detailed explanation Rob . This has really clarified things for me!!

ADD REPLY • link 3.3 years ago by Dunois ★ 2.9k