Hi,
I would like to estimate the total number of k-mers (they need not be unique) that can be generated from a given FASTQ file, knowing only its file size beforehand. For example, BFCounter currently accepts the estimated number of k-mers as a command-line input before it starts reading the file. I am wondering if that can be determined automatically.
Is there something you can suggest? For example, can we say that the DNA sequence line length will be between 80 and 120 characters, or that the average byte count per line in a FASTQ file is XXX bytes?
Do you have any statistics like that?
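For what it's worth, a back-of-the-envelope version of this idea can be sketched in Python. Every average below (read length, bytes per uncompressed record) is an assumption you would want to tune per dataset, not a known constant:

```python
# Rough k-mer total from FASTQ file size alone.
# Assumptions (made up, tune per dataset): a FASTQ record is 4 lines
# (header, sequence, '+', qualities), and with ~100 bp reads one
# record occupies roughly 250 bytes uncompressed.

def estimate_total_kmers(file_size_bytes, k,
                         avg_read_len=100, avg_record_bytes=250):
    reads = file_size_bytes // avg_record_bytes      # estimated R
    kmers_per_read = max(0, avg_read_len - k + 1)    # max(0, L - K + 1)
    return reads * kmers_per_read
```

Under these assumptions, a 1 GB uncompressed file would give about 4 million reads and 280 million 31-mers; compressed files would need a guessed compression ratio on top of that.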
The total number of k-mers is
R * max(0, L - K + 1)
where R is the number of reads, L is the read length, and K is the k-mer length. If L is variable, you can use the average. For example, 1,000,000 reads of length 100 with K=31 yield 1,000,000 x 70 = 70,000,000 k-mers.

You can't estimate the number of unique k-mers without processing at least part of the file, though. KmerCountExact stores the k-mers to get an exact count of the number of unique k-mers, which can take a lot of memory. BBNorm can produce an approximation using Bloom filters in an arbitrarily small amount of memory (though less memory means lower accuracy). But BBTools also has an implementation of another technique, called LogLog, which can estimate the number of unique k-mers using only a tiny, fixed amount of memory (a few KB). Unlike KmerCountExact and BBNorm, LogLog runs at about the same speed at which you can read the file. For example:
That's a 500 MB file, so it's limited by gzip decompression speed. Anyway, once you know the approximate number of unique k-mers (which is accurate to within a few percent), you can allocate the Bloom filters appropriately.
P.S. LogLog by default uses a different random seed on each run, giving slightly different results. On 4 consecutive runs, I got:
The correct number, from KmerCountExact, is 54913615. BBNorm with 1GB RAM gave 54912534, and with 100MB it gave 54902886.
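For the curious, the LogLog family of estimators works by hashing each item and tracking, per bucket, the maximum number of leading zero bits ever seen; rare long runs of zeros imply many distinct hashes. A minimal HyperLogLog-style sketch in Python, for illustration only (this is not BBTools' implementation, and the small/large-range corrections of the full algorithm are omitted):

```python
import hashlib

def _hash64(s):
    # Deterministic 64-bit hash via SHA-1 (illustrative only).
    return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")

def estimate_cardinality(items, b=8):
    # m = 2^b buckets; each stores the max "rank" (position of the
    # leftmost 1-bit) observed in the non-bucket bits of the hash.
    m = 1 << b
    buckets = [0] * m
    for item in items:
        h = _hash64(item)
        j = h >> (64 - b)               # first b bits pick the bucket
        w = h & ((1 << (64 - b)) - 1)   # remaining 64 - b bits
        rank = (64 - b) - w.bit_length() + 1
        buckets[j] = max(buckets[j], rank)
    # Standard HyperLogLog raw estimate (bias-correction constant alpha).
    alpha = 0.7213 / (1 + 1.079 / m)
    return alpha * m * m / sum(2.0 ** -x for x in buckets)
```

With 256 buckets the relative error is around 6-7%, and memory stays fixed no matter how many items stream through, which matches the behavior described above.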
Good answer, I will inspect loglog.sh to see exactly how they achieve the estimation. I am thinking of using that estimate to initialize a counting Bloom filter that counts k-mers occurring more than X times, and then using that information to allocate my final hash table holding the k-mer counts. That way the app will hopefully use less memory, both because of the Bloom filter and, more importantly, because preallocation will decrease heap fragmentation. Anyway, it seems worth trying to push the limits.
Thanks for the information:)
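The counting-Bloom-filter step described above might look roughly like this; a hypothetical Python sketch with made-up sizing, not anyone's actual implementation:

```python
import hashlib

class CountingBloomFilter:
    """Like a Bloom filter, but with small counters instead of bits,
    so it can answer "seen at least c times?" -- counts may be
    overestimated by collisions, but never underestimated."""

    def __init__(self, size, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _indexes(self, item):
        # Derive num_hashes slot indexes from salted SHA-1 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for idx in self._indexes(item):
            self.counters[idx] += 1

    def count(self, item):
        # The minimum of the item's counters is an upper bound on
        # its true count.
        return min(self.counters[idx] for idx in self._indexes(item))
```

The LogLog estimate would then drive the sizing, e.g. `size = 10 * estimated_unique_kmers` (the factor is a guess to keep the false-positive rate low), and only k-mers whose count reaches X would be promoted to the preallocated final hash table.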
@Brian is the author of the BBMap suite, hence the insider information on kmercountexact.sh and loglog.sh :-)