jellyfish kmer counting discrepancy
2
0
Entering edit mode
5.3 years ago
7ehuang • 0

Hi all, I am using jellyfish to count kmers of length 31 that appear in a viral genome sequence. These were the commands I ran:

jellyfish count -m 31 -s 154675 -t 10 NC_001798.fa  # the -s parameter is based on the size of the NC_001798 genome 
jellyfish dump mer_counts.jf > mer_counts_dumps.fa

I then take a random 31mer from mer_counts_dumps.fa (e.g. GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG, which is output to have a count of 13) and grep for that same 31mer in the original input NC_001798.fa file. This is the command I run (to account for 31mers that might go across a line break):

cat NC_001798.fa | tr -d " \t\n\r" | grep -o GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG | wc -l

However, this only returns 5, which tells me that the 31mer does not appear 13 times in the fasta file (only 5 times). Does anyone know what may be causing the discrepancy? I also tried using kmercountexact.sh from the BBMap suite and it also outputs a count of 13 for this specific 31mer, so I'm wondering if my method of grepping for the 31mer in the fasta file is erroneous. I have this problem for multiple 31mers with a count greater than 1 in mer_counts_dumps.fa.

Thanks!

Best, Elaine

kmer count genome RNA-Seq jellyfish • 3.1k views
ADD COMMENT
2
Entering edit mode
5.3 years ago

The better tool to verify patterns that will include overlapping matches is the dreg tool from emboss

cat NC_001798.fa | dreg -filter -pattern GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG

produces 13 matches

#=======================================
#
# Sequence: NC_001798.2     from: 1   to: 154675
# HitCount: 13
#
# Pattern: GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
#
#=======================================

  Start     End  Strand Pattern                               Sequence
 153769  153799       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
 153784  153814       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
 153799  153829       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
 153814  153844       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
 153829  153859       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
 153844  153874       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
 153859  153889       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
 153874  153904       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
 153889  153919       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
 153904  153934       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
 153919  153949       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
 153934  153964       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG
 153949  153979       + regex:GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG

#---------------------------------------
#---------------------------------------

#---------------------------------------
# Total_sequences: 1
# Total_length: 154675
# Reported_sequences: 1
# Reported_hitcount: 13
#---------------------------------------
ADD COMMENT
0
Entering edit mode

Oh, this is great! Exactly the tool I need. Thank you very much.

ADD REPLY
0
Entering edit mode
5.3 years ago

From the manual:

In sequencing reads, it is unknown which strands of the DNA is sequenced. As a consequence, a k-mer or its reverse complement are essentially equivalent.

did you count up the rev-comp of your test sequence too?

Also, I'm not totally sure if grep will find every instance of the k-mer if they overlap, I'm guessing the jellyfish authors did think of that.

ADD COMMENT
0
Entering edit mode

Does anyone know what may be causing the discrepancy?

Reverse complement is the likely explanation. Instead of reverse-complementing your genome file, you can do it to the 31-mer:

cat NC_001798.fa | tr -d " \t\n\r" | grep -o CGCCCGACCCCCGCCCGCCCGACCCCCGCCC | wc -l

grep counts overlapping patterns as one, but it is unlikely that k-mers of this length overlap.

ADD REPLY
0
Entering edit mode

Thank you both for your responses!

Since I am counting kmers in a genome, the reverse-complement of a kmer shouldn't be considered equivalent to it, so when I ran kmercountexact.sh, I used the rcomp=f parameter. Not considering a kmer and its reverse-complement equivalent is the default behavior for jellyfish (you would need to use the -C parameter to set that behavior). This information is from section 1.1.2 of the manual (about counting kmers in a genome).

Even when I grep for the reverse complement, I get a count of 5. So adding the count for the kmer and the reverse complement would give me 10, which is still different from the count of 13 output by jellyfish.

ADD REPLY
2
Entering edit mode

I stand corrected that an overlap is unlikely. That statement was based only on the k-mer length (31), while a simple visual inspection shows that you have a perfect duplication of GGGCGGGGGTCGGGC in your 31-mer (plus an extra G). It seems like your random 31-mer is not so random. I suggest you test a different 31-mer with your grep command, or try counting this repetitive 15-mer.

If you do your grep command on this sequence (14 repeats of the 15-mer plus an extra G at the end):

GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCGGGCGGGGGTCGGGCGGGCGGGGGTCGGGCGGGCGGGGGTCGGGCGGGCGGGGGTCGGGCGGGCGGGGGTCGGGCGGGCGGGGGTCGGGCGGGCGGGGGTCGGGCGGGCGGGGGTCGGGCGGGCGGGGGTCGGGCGGGCGGGGGTCGGGCGGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG

the count of your 31-mer will be 5 even though the actual count is 7.

ADD REPLY
0
Entering edit mode

Now that you mention it, I see that the 31mers I'm running into this problem for all have duplications of some kmer within them. Thank you very much!

ADD REPLY

Login before adding your answer.

Traffic: 2073 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6