Hi,
While working with Jellyfish, I used the sample command from their guide:
$ jellyfish count -m 22 -s 100M -t 16 -C merged.fq.gz
and I do not really understand the meaning of -C, which is described as "canonical", or (as in Jellyfish 1) "Count both strand, canonical representation". Shouldn't it be the default option (with no need to specify it)? What is the consequence if I do not include -C in my jellyfish command?
I also wonder whether this option is already built into kmergenie.
Thank you in advance for your clarification!
EDIT: I misread Rob's explanation. As he stated, -C collapses each k-mer with its reverse complement.
That's what I said. -C considers both a k-mer and its reverse complement "as equivalent". This means it collapses them.
Yeah, proper reading definitely helps ;) - Sorry about that!
No problem ;P
Thanks a lot Rob!
Hi,
I have just experimented with ~30 GB of MiSeq data. The final merged FASTQ file (merged with the cat command) contains 3 sets of paired-end data (6 FASTQ files in total). When I counted 22-mers with Jellyfish 2 with -C included in the counting command, I obtained a genome size of ~700 Mb and a coverage of about 27X. However, when I used the same command minus the -C, I obtained a genome size of ~1.4 Gb and a coverage of about 13X.
So which result is correct? My data are haploid paired-end reads.
Additionally, I'm not really sure, but kmergenie seems to be set up to treat a k-mer and its reverse complement as a single unit. Does kmergenie therefore always behave the same way as jellyfish with -C does?
Any suggestions and ideas are greatly welcome! Thanks
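The result with -C (~700Mb) is the correct one. Treating reverse-complement k-mers independently would not make sense in a shotgun data set, because you expect roughly 50/50 fragments from both strands. Without -C, each genomic k-mer is counted as two distinct k-mers (itself and its reverse complement), each at about half the depth, which is why the estimated genome size doubles and the coverage halves.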
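The doubling and halving reported in this thread can be reproduced on toy data. The following is a minimal sketch, not Jellyfish's implementation: the random genome, the read sampler, and all function names are made up for illustration, and "canonical" is assumed to mean the lexicographically smaller of a k-mer and its reverse complement.

```python
import random
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def count_kmers(reads, k, canonical):
    """Count k-mers; with canonical=True, collapse each k-mer with its reverse complement."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if canonical:
                kmer = min(kmer, revcomp(kmer))
            counts[kmer] += 1
    return counts

def simulate_reads(genome, n_reads, read_len, rng):
    """Sample reads uniformly; each read comes from either strand with probability 0.5."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = genome[start:start + read_len]
        if rng.random() < 0.5:
            read = revcomp(read)
        reads.append(read)
    return reads

rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(5000))  # hypothetical toy genome
reads = simulate_reads(genome, n_reads=2000, read_len=100, rng=rng)
k = 15  # odd k, so no k-mer is its own reverse complement

stranded = count_kmers(reads, k, canonical=False)   # like counting without -C
collapsed = count_kmers(reads, k, canonical=True)   # like counting with -C

ratio_distinct = len(stranded) / len(collapsed)
mean_stranded = sum(stranded.values()) / len(stranded)
mean_collapsed = sum(collapsed.values()) / len(collapsed)

print("distinct k-mers (stranded / canonical):", len(stranded), len(collapsed))
print("mean count (stranded / canonical):", round(mean_stranded, 1), round(mean_collapsed, 1))
```

The total number of counted k-mers is identical in both modes. Only the number of distinct k-mers (close to 2x without collapsing) and the depth per k-mer (close to half) change, which is the same pattern as ~1.4 Gb at 13X versus ~700 Mb at 27X.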
Thanks a lot
Hi Rob, could you clarify this for me please? Is this just a consequence of the fact that a k-mer and its reverse complement would have arisen from the same coordinates on the genome? But wouldn't this lead to undercounting when a k-mer that isn't another's reverse complement incorrectly gets classified as such? Or is this resolved by keeping track of the sequence of origin and checking whether those sequences themselves are reverse complements of one another?
Hi Dunois,
The most common use of k-mer counters is to count k-mers from raw sequencing reads. The issue is that you often don’t know which strand of the DNA the sequencing read itself was drawn from. In this case, there are really only three options. First, you could count k-mers in the orientation in which they are observed. This doesn’t distort the observed strand information, but then, e.g., when you have reads drawn from the same region but different strands, they won’t overlap under “naive” string comparison. Second, you could keep track of only canonical k-mer counts. This is essentially equivalent to erasing the information about the strand on which the k-mer was observed. Both a k-mer and its reverse complement are associated with the same token in this case. This means that if, e.g., I have a count of 3 for a k-mer k, I don’t know if I saw it 3 times in the fw orientation, or 2 times fw and 1 time rc, or 1 time fw and 2 times rc, or 3 times rc. The third option is, when you see a k-mer k, to increment the count of both k and rc(k). I am not aware of any tool that actually does this, as it’s just an inefficient way to get information equivalent to the second approach.
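To make the three options concrete, here is a small sketch on a toy fragment. It is only an illustration: the function names are hypothetical, and the canonical form is assumed to be the lexicographically smaller of k and rc(k), which is one common convention.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def kmers(seq, k):
    """List all k-mers of seq in the orientation in which they appear."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def count_observed(reads, k):
    """Option 1: count each k-mer in its observed orientation."""
    return Counter(km for read in reads for km in kmers(read, k))

def count_canonical(reads, k):
    """Option 2: count only the canonical form, erasing strand information."""
    return Counter(min(km, revcomp(km)) for read in reads for km in kmers(read, k))

def count_both(reads, k):
    """Option 3: increment both k and rc(k) for every observation.
    Note: a k-mer equal to its own reverse complement (possible for even k)
    would be incremented twice here; k is odd in this example."""
    counts = Counter()
    for read in reads:
        for km in kmers(read, k):
            counts[km] += 1
            counts[revcomp(km)] += 1
    return counts

fragment = "GATTACA"
# Two reads from the forward strand and one from the reverse strand.
reads = [fragment, fragment, revcomp(fragment)]
k = 3

observed = count_observed(reads, k)
canonical = count_canonical(reads, k)
both = count_both(reads, k)

print(observed["GAT"], observed["ATC"])  # fw and rc forms are counted separately
print(canonical["ATC"])                  # one token for GAT and ATC together
print(both["GAT"], both["ATC"])          # same total, stored twice
```

In this example the forward read and its reverse complement share no 3-mers as plain strings, yet their canonical 3-mers are identical. Option 3 stores the canonical count under both orientations, which is why it carries no extra information over option 2.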
The above discussion is with respect to counting k-mers from sequencing reads. On the other hand, if you are counting or indexing k-mers from known reference sequences, you may make a different decision. In that case, you may count k-mers in the observed orientation, since the ambiguity doesn’t exist, or you may, as you suggest, count k-mers canonically, but then track the orientation of the sequence from which each k-mer originated. The latter approach is what is done in many k-mer indices, where canonical k-mers are indexed, but the orientation of the k-mer with respect to the underlying reference sequence is stored as part of the index.
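The index idea can be sketched as follows. This is a toy layout under stated assumptions, not the structure used by any particular tool: `build_index` and `lookup` are hypothetical names, the index is a plain dictionary, and the orientation is stored as a boolean next to each reference position.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def build_index(reference, k):
    """Map each canonical k-mer to (position, ref_is_canonical) pairs.
    ref_is_canonical is True when the reference contains the canonical
    form itself at that position, False when it contains the reverse complement."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        kmer = reference[pos:pos + k]
        rc = revcomp(kmer)
        if kmer <= rc:
            index[kmer].append((pos, True))
        else:
            index[rc].append((pos, False))
    return index

def lookup(index, query):
    """Return (position, strand) hits for a query k-mer.
    The strand is '+' when the query matches the reference as written,
    '-' when it matches the reverse complement of the reference."""
    rc = revcomp(query)
    query_is_canonical = query <= rc
    canon = query if query_is_canonical else rc
    hits = []
    for pos, ref_is_canonical in index.get(canon, []):
        strand = "+" if ref_is_canonical == query_is_canonical else "-"
        hits.append((pos, strand))
    return hits

reference = "GATTACA"  # hypothetical toy reference
index = build_index(reference, k=3)

print(lookup(index, "GAT"))  # present in the reference as written
print(lookup(index, "ATC"))  # reverse complement of GAT
print(lookup(index, "CCC"))  # absent from the reference
```

Only canonical k-mers are used as keys, so a query in either orientation finds the same entry, and the stored flag recovers the strand relative to the reference.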
Thanks a lot for the detailed explanation, Rob. This has really clarified things for me!!