I have a comparison that's been puzzling me. Suppose I have a FASTQ file for a bacterium: rawread.fastq
I do:
cat rawread.fastq | awk 'NR%4==2{print}' | tr -d '\n' | wc -c
to count the exact number of nucleotides in the rawread file.
I get 121916338
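(As a side note, the same count can be obtained by summing the sequence-line lengths inside awk, without concatenating the whole file through tr. A minimal sketch; tiny.fastq here is a made-up two-read file used only to sanity-check the one-liner, not your data:)

```shell
# Build a throwaway two-read FASTQ (5 nt + 7 nt = 12 nt) to check the one-liner
printf '@r1\nACGTA\n+\nIIIII\n@r2\nACGTACG\n+\nIIIIIII\n' > tiny.fastq
# Sum the lengths of line 2 of every 4-line record (the sequence lines)
awk 'NR%4==2{sum+=length($0)} END{print sum}' tiny.fastq
```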
I proceed to do a jellyfish analysis:
jellyfish count -m 21 -s 5000 -t 10 -C rawread.fastq
jellyfish stats mer_count.jf
Output is:
Unique: 553597
Distinct: 5700265
Total: 110466529
Max_count: 1832
Shouldn't the number of distinct k-mers multiplied by 21 equal the exact nucleotide count? If I've counted every k-mer in the file, I should be able to multiply that by the k-mer size to get how much nucleotide information was fed into the system. There seems to be a deficit. What's wrong?
If you have 110466529 total k-mers reported for k = 21, then the length of the genome should satisfy n - 21 + 1 = 110466529, that is n = 110466549. The length reported by your bash script is 121916338, which differs by far too much. Does your file contain masked regions? Does jellyfish count those as part of a k-mer, or does it break the mer there? Is your bash script counting newline characters? (I suppose it is NOT, since you have that tr -d '\n'...) Any of these could be the reason behind the difference. Hope it helps!
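(For a single contiguous sequence, the n = total + k - 1 relation in this answer can be checked with plain shell arithmetic; this just restates the answer's numbers:)

```shell
total_kmers=110466529  # "Total" from jellyfish stats
k=21
# A single sequence of length n yields n - k + 1 k-mers, so n = total + k - 1
echo $((total_kmers + k - 1))   # prints 110466549
```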
His FASTQ is of sequenced reads, not a genomic FASTA.
(These are raw reads not assemblies.)
Think of the problem like this: Suppose you were ONLY given the .jf file. Derive how many nucleotides the original rawread file had.
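To make the comments concrete: each read of length L contributes L - 21 + 1 k-mers, so every single read loses k - 1 = 20 bases' worth of k-mers, not just the file as a whole. A sketch of the arithmetic, assuming every read is at least 21 nt long and contains no Ns (Jellyfish skips k-mers containing Ns, so the read-count estimate below is only approximate):

```shell
total_nt=121916338      # wc -c result from the question
total_kmers=110466529   # "Total" from jellyfish stats
k=21
deficit=$((total_nt - total_kmers))
echo "deficit: $deficit nt"
# With R reads (each >= k nt, no Ns): total_kmers = total_nt - R*(k-1),
# so the deficit implies roughly this many reads:
echo "implied read count: $((deficit / (k - 1)))"
```

This is why the .jf file alone is not enough to answer the challenge exactly: you would also need the number of reads R to recover total_nt = total_kmers + R*(k - 1), and short reads or Ns would still perturb that figure.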