Hi all,
I am using jellyfish to count kmers of length 31 that appear in a viral genome sequence. These were the commands I ran:
jellyfish count -m 31 -s 154675 -t 10 NC_001798.fa # the -s parameter is based on the size of the NC_001798 genome
jellyfish dump mer_counts.jf > mer_counts_dumps.fa
I then take a random 31mer from mer_counts_dumps.fa (e.g. GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG, which is output to have a count of 13) and grep for that same 31mer in the original input NC_001798.fa file. This is the command I run (to account for 31mers that might go across a line break):
However, this only returns 5, which tells me that the 31mer does not appear 13 times in the fasta file (only 5 times). Does anyone know what may be causing the discrepancy? I also tried using kmercountexact.sh from the BBMap suite and it also outputs a count of 13 for this specific 31mer, so I'm wondering if my method of grepping for the 31mer in the fasta file is erroneous. I have this problem for multiple 31mers with a count greater than 1 in mer_counts_dumps.fa.
In sequencing reads, it is unknown which strands of the DNA is
sequenced. As a consequence, a k-mer or its reverse complement are
essentially equivalent.
did you count up the rev-comp of your test sequence too?
Also, I'm not totally sure if grep will find every instance of the k-mer if they overlap, I'm guessing the jellyfish authors did think of that.
Since I am counting kmers in a genome, the reverse-complement of a kmer shouldn't be considered equivalent to it, so when I ran kmercountexact.sh, I used the rcomp=f parameter. Not considering a kmer and its reverse-complement equivalent is the default behavior for jellyfish (you would need to use the -C parameter to set that behavior). This information is from section 1.1.2 of the manual (about counting kmers in a genome).
Even when I grep for the reverse complement, I get a count of 5. So adding the count for the kmer and the reverse complement would give me 10, which is still different from the count of 13 output by jellyfish.
I stand corrected that an overlap is unlikely. That statement was based only on the k-mer length (31), while a simple visual inspection shows that you have a perfect duplication of GGGCGGGGGTCGGGC in your 31-mer (plus an extra G). It seems like your random 31-mer is not so random. I suggest you test a different 31-mer with your grep command, or try counting this repetitive 15-mer.
If you do your grep command on this sequence (14 repeats of the 15-mer plus an extra G at the end):
Oh, this is great! Exactly the tool I need. Thank you very much.