I know that in Jellyfish -C stands for canonical kmers, but I'm a little unsure how this is implemented. Does Jellyfish take into account whether the reads are paired-end or not? I'm working on my own kmer software for internal use and want the results to be equivalent to what Jellyfish would produce.
So far, my understanding is that -C does not take into account which strand a read came from. Instead, it automatically computes the reverse complement of every kmer it sees and treats a kmer and its reverse complement as the same kmer.
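To make that understanding concrete, here is a minimal Python sketch of reducing a kmer to a canonical form. The helper names `reverse_complement` and `canonical` are made up for illustration and are not part of Jellyfish. It assumes uppercase A/C/G/T only and uses plain string comparison to pick the representative, so treat it as a sketch rather than a description of Jellyfish's internals.

```python
# Hypothetical helpers for illustration; these are not Jellyfish functions.
# Assumes kmers contain only uppercase A, C, G, T.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement each base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return the kmer or its reverse complement, whichever sorts first."""
    rc = reverse_complement(kmer)
    return kmer if kmer <= rc else rc


print(canonical("CAT"))  # ATG
print(canonical("ATG"))  # ATG
```

Because the choice depends only on the kmer itself, the strand a read came from does not affect the result in this sketch.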
OK, I think I understand most of this, but let me give a specific example. Suppose we have ATG occurring 3 times and CAT occurring 2 times. What does Jellyfish output for this: is it CAT> 5 or ATG> 5?
According to the Jellyfish manual (https://github.com/gmarcais/Jellyfish/blob/master/doc/jellyfish.pdf), the representative is "whichever comes first lexicographically". So, in your case, it would be ATG> 5.
OK, so it's actually simple: they just count both, group together the kmer and rc(kmer), and select the lexicographically first one as the "name" for that set. Thanks for all the help!
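For anyone writing their own counter, the grouping described above can be sketched in Python as follows. The function name `merge_canonical` is hypothetical, the input is assumed to be a dict of raw per-strand counts over uppercase A/C/G/T kmers, and this only illustrates the "count both, then merge" idea; it is not how Jellyfish itself is implemented.

```python
from collections import Counter

# Assumes kmers contain only uppercase A, C, G, T.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def merge_canonical(raw_counts):
    """Merge the counts of each kmer and its reverse complement.

    Hypothetical helper: the merged count is stored under whichever
    of the two strings comes first lexicographically.
    """
    merged = Counter()
    for kmer, count in raw_counts.items():
        rc = kmer.translate(_COMPLEMENT)[::-1]
        merged[min(kmer, rc)] += count
    return merged


# The example from this thread: ATG seen 3 times, CAT seen 2 times.
raw = {"ATG": 3, "CAT": 2}
print(dict(merge_canonical(raw)))  # {'ATG': 5}
```

A kmer that equals its own reverse complement (for example ACGT) appears only once in the raw counts, so it is not double counted here.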