Question

Estimate total number of KMERs from a FASTQ file

0

Entering edit mode

8.2 years ago

sumerc • 0

Hi,

I would like to be able to estimate the total number(it is not necessary to be unique) of KMERs that can be generated from a given FASTQ file with only knowing its file size beforehand. For example BFCounter currently accepts that as an command line input that is the estimated number of KMERs before starting reading the file. I am wondering if that can be done automatically.

Is there something you can suggest? For example, can we say DNA Sequence line length will be between 80-120 chars or the average byte count per-line in a fastq file is like XXX bytes.

Do you have any kind of statistical knowledge like above?

kmer dna counting • 3.6k views

ADD COMMENT • link 8.2 years ago by sumerc • 0

score 0 · Answer 1 · 2016-08-25

If you have the number of reads and the length of reads, it is trivial to work out the number of kmers of size k. As the identifiers for reads differ and the quality string sometimes has and other times doesn't have an identifier, getting the number of kmers from the file size will only be a very rough estimate.

score 0 · Answer 2 · 2016-08-25

kmercountexact.sh from BBMap suite will do this.

Description: Counts the number of unique kmers in a file. Generates a kmer frequency histogram and genome size estimate (in peaks output), and prints a file containing all kmers and their counts. Supports K=1 to infinity, though not all values are allowed. SEE ALSO: bbnorm.sh/khist.sh which have similar functionality.

Usage: kmercountexact.sh in=<file> khist=<file> peaks=<file> out=<file>

score 0 · Answer 3 · 2016-08-25

0

Entering edit mode

8.2 years ago

sumerc • 0

Thanks for the replies.

Actually what I am trying to do is to implement something like kmercountexact.sh with using different data structures and play with algorithms. So, if I read all the file then I lose some chances to preallocate some buffers that is why I want to get an estimate on the possible KMER count. (I want to initialize a Bloom Filter to be exact) So even a very rough estimate can be useful in my situation.

ADD COMMENT • link 8.2 years ago by sumerc • 0

2

Entering edit mode

The total number of kmers is R*(max(0, L-K+1)), where R is the number of reads, L is the read length, and K is the kmer length. If L is variable you could use the average. You can't estimate the number of unique kmers without processing at least part of the file, though.

KmerCountExact stores the kmers to get an exact count of the number of unique kmers, which (can) take a lot of memory. BBNorm is capable of an approximation, using Bloom filters, in an arbitrarily small amount of memory (though less memory means lower accuracy). But BBTools also has an implementation of another technique called LogLog, which can estimate the number of unique kmers using only a tiny, fixed amount of memory (a few kb). As opposed to KmerCountExact and BBNorm, LogLog also runs at about the same speed you can read a file. For example:

loglog.sh in=mruber_60x.fq.gz k=31
Cardinality:   51725986
Time:   6.885 seconds.

That's a 500MB file, so it's limited by gzip decompression speed. Anyway, once you know the approximate number of unique kmers (which is accurate to within a few percent) you can allocate the bloom filters appropriately.

P.S. LogLog by default uses a different random seed on each run, giving slightly different results. On 4 consecutive runs, I got:

51725986, 54242725, 54205118, 56469179

The correct number, from KmerCountExact, is 54913615. BBNorm with 1GB RAM gave 54912534, and with 100MB it gave 54902886.

ADD REPLY • link 8.2 years ago by Brian Bushnell 20k

0

Entering edit mode

Good answer, I will inspect the loglog.sh to see how they have achieved the estimation exactly. I am thinking to use that estimation to init a couting Bloom filter to count KMERs that occur more than X and use that information to finally allocate my final hash table that consists the KMER counts. That way, app. will hopefully use less memory both because of the bloom filter and more importantly preallocation will decrease heap fragmentation. Anyway, it seems it is worth trying to push the limits.

Thanks for the information:)

ADD REPLY • link 8.2 years ago by sumerc • 0

0

Entering edit mode

@Brian is the author of BBMap suite hence the insider information on kmercountexact.sh and loglog.sh :-)

ADD REPLY • link 8.2 years ago by GenoMax 147k