jellyfish count discrepancy between results obtained with and without -C option
3.0 years ago
ashenflower ▴ 30

I ran jellyfish on my Illumina reads, first with the -C option and then without it, on the same data. These are the commands I used:

With -C option:

time jellyfish count -m 21 -s 100M -t 15 -C <(zcat file_1.fastq.gz file_2.fastq.gz)

Without -C option:

time jellyfish count -m 21 -s 100M -t 15 <(zcat file_1.fastq.gz file_2.fastq.gz)

In the first case I obtained 1,684,382,436 distinct k-mers, while in the second case I obtained 2,205,041,740, so only 520,659,304 more.

How is this possible? I was expecting the number of distinct k-mers in the second case (no -C) to be around twice the number in the first case (coverage is approximately 53x).

3.0 years ago
Rob 6.9k

This is not at all unexpected. Without -C, jellyfish counts each k-mer exactly as it appears in the input, so a k-mer and its reverse complement are treated as two distinct k-mers. For that number to be twice what you get with -C, every k-mer would have to appear in both its forward and reverse-complement orientations. With ~50x coverage, you might expect this for most (not all) k-mers in the true underlying genome. However, many erroneous k-mers (from sequencing errors) are likely to occur only once, so you won't see them in both orientations, even at high coverage.
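
To make this concrete, here is a small Python sketch. It is not how jellyfish is implemented; the function names and the toy reads are made up for illustration. It counts distinct k-mers with and without canonicalization, and then applies the same bookkeeping to the two totals from the question. The bookkeeping assumes k is odd (true for k = 21), so no k-mer is its own reverse complement and each canonical k-mer contributes either 1 or 2 to the non-canonical total.

```python
from typing import Iterable, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(_COMPLEMENT)[::-1]


def distinct_kmers(reads: Iterable[str], k: int, canonical: bool) -> int:
    """Count distinct k-mers; with canonical=True, a k-mer and its reverse
    complement collapse to the lexicographically smaller of the two."""
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if canonical:
                kmer = min(kmer, reverse_complement(kmer))
            seen.add(kmer)
    return len(seen)


def split_by_orientation(canonical_total: int, stranded_total: int) -> Tuple[int, int]:
    """Assuming odd k: stranded_total = canonical_total + both, where 'both'
    is the number of canonical k-mers observed in both orientations.
    Returns (both, single_orientation_only)."""
    both = stranded_total - canonical_total
    return both, canonical_total - both


# Toy data (hypothetical): one "genomic" fragment sequenced from both strands,
# plus one erroneous read that was only ever seen in a single orientation.
genomic = ["AACGG", reverse_complement("AACGG")]
with_error = genomic + ["AATGG"]

if __name__ == "__main__":
    for name, reads in [("genomic only", genomic), ("with error read", with_error)]:
        c = distinct_kmers(reads, 3, canonical=True)
        n = distinct_kmers(reads, 3, canonical=False)
        print(f"{name}: canonical={c} non-canonical={n} ratio={n / c:.2f}")

    # The two totals reported in the question.
    both, single = split_by_orientation(1_684_382_436, 2_205_041_740)
    print(f"seen in both orientations: {both:,}")
    print(f"seen in one orientation only: {single:,}")
```

On the toy data the ratio is exactly 2 for the genomic reads alone and drops to 1.5 once the single-orientation error read is added. Applied to the totals in the question, the same arithmetic gives 520,659,304 canonical k-mers seen in both orientations and 1,163,723,132 seen in only one, which fits the picture of a large pool of k-mers that never show up on both strands.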

Thank you very much!

Well said!
