I am trying to find all k-mers in my FASTA files. For example, if the sequence is AAATTCCGGGGGAAAA and I want all k-mers with k=3, I expect it to return 64 values. I can write a simple Python script to do this, but I am dealing with a lot of data and wanted to use Jellyfish. It works well but doesn't include and evaluate all possible combinations. I am interested in k-mers of size 1-5. Would anyone know about this?
Jellyfish (and most k-mer counters I know of, for that matter) only returns results for k-mers that are actually present in the data. If there are only M distinct k-mers in your data, the counts of all other 4^k - M k-mers are implicitly 0. This helps keep the size of intermediate files manageable, since for larger k-mer sizes most data is very sparse, and reporting all absent k-mers with a count of 0 would waste an enormous amount of space. While Jellyfish and other k-mer counters are designed for speed and will scale well to large files and large k-mer sizes, if you're only interested in 1-5-mers, a simple solution with a direct lookup table (array) mapping the k-mer ID to an atomic integer of counts should be fairly fast (if implemented in C/C++ with multiple threads, it may even be faster than some existing counters, since it's more specific in scope). Of course, you could always just run Jellyfish for these values and write a simple Python script to expand its resulting file format (which lists only k-mers that are present) into a format listing the results for all 4^k k-mers. That should also be sufficiently fast, and should be somewhat simpler to set up.
--Rob
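To illustrate the second option in the answer above, here is a minimal sketch of expanding a sparse count listing into all 4^k k-mers. It assumes the counts have already been dumped to plain text with one "KMER COUNT" pair per line (the kind of two-column output that `jellyfish dump -c` is expected to produce) and that counting was not canonical, so reverse complements are kept separate. The function name `expand_counts` is made up for this example.

```python
from itertools import product


def expand_counts(lines, k, alphabet="ACGT"):
    """Return {kmer: count} for every possible k-mer, with 0 for absent ones.

    `lines` is any iterable of strings of the form "KMER COUNT"
    (assumed format; adjust the parsing if your dump looks different).
    """
    # Start with every possible k-mer at count 0 (4**k entries for DNA).
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        kmer, count = fields[0].upper(), int(fields[1])
        if kmer in counts:  # ignore k-mers of the wrong length or alphabet
            counts[kmer] += count
    return counts


if __name__ == "__main__":
    # The ten distinct 3-mers of AAATTCCGGGGGAAAA, written by hand.
    sparse = ["AAA 3", "AAT 1", "ATT 1", "TTC 1", "TCC 1",
              "CCG 1", "CGG 1", "GGG 3", "GGA 1", "GAA 1"]
    full = expand_counts(sparse, 3)
    for kmer in sorted(full):
        print(kmer, full[kmer])
```

For k up to 5 the full table has at most 1024 entries, so holding it in a dictionary is not a concern.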
updated 4.9 years ago by Ram • written 10.5 years ago by Rob
Another suggestion is to use DSK (a k-mer counter with a low memory footprint). It now uses HDF5 as its output format, so you can use HDF5 tools to extract the k-mer counts. You can find several examples in the README file. It also provides a dsk2ascii binary that dumps [k-mer, count] pairs.
Thanks Rob. Eventually I just used a simple Python script :)
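The script itself was not posted in the thread. For readers who want a starting point, the following is a rough sketch of the direct lookup-table idea from the answer, written in plain single-threaded Python rather than C/C++. The names `count_kmers` and `index_to_kmer` are invented for this example, and the choice to skip any window containing a non-ACGT character (such as N) is an assumption, not something stated in the thread.

```python
def count_kmers(seq, k):
    """Count k-mers of `seq` in a flat table of 4**k integers.

    Each base is encoded in 2 bits (A=0, C=1, G=2, T=3), so a k-mer
    maps directly to an index in the table. Absent k-mers stay at 0.
    """
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    table = [0] * (4 ** k)
    mask = 4 ** k - 1      # keeps only the last k bases in the index
    idx = 0
    valid = 0              # consecutive valid bases seen so far
    for base in seq.upper():
        c = code.get(base)
        if c is None:      # assumption: ambiguous bases break the window
            idx = 0
            valid = 0
            continue
        idx = ((idx << 2) | c) & mask
        valid += 1
        if valid >= k:
            table[idx] += 1
    return table


def index_to_kmer(idx, k):
    """Decode a table index back into its k-mer string."""
    bases = "ACGT"
    return "".join(bases[(idx >> (2 * (k - 1 - i))) & 3] for i in range(k))


if __name__ == "__main__":
    seq = "AAATTCCGGGGGAAAA"
    for k in range(1, 6):
        table = count_kmers(seq, k)
        present = sum(1 for n in table if n > 0)
        print(f"k={k}: {len(table)} possible, {present} present")
    # Print the full 3-mer table, zeros included.
    for idx, n in enumerate(count_kmers(seq, 3)):
        print(index_to_kmer(idx, 3), n)
```

For FASTA input, each record's sequence would be passed to `count_kmers` separately and the tables summed, so that k-mers never span two records.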
Just for clarification: I think it should not return 64 k-mers. It should return the following.