Hi :)
A few weeks ago I was looking for a tool that would help me get "DNA composition" statistics for my sequencing data. Something that would give me a dataset with which I could ask questions about GC bias, over-represented sequences, motifs, etc. There are tools to answer each question specifically, but I was looking for something more general from which many analyses could be built. This led me to k-mer counting, and all the downstream tools which leverage k-mer result files.
k-mer counting tools are pretty cool, but all the ones I tried had some drawbacks: high RAM requirements, very long run-times (although the latest bloom-filter-based tools seem to mitigate this somewhat), and most importantly requiring a specific k-mer size to be chosen up front. I really wanted all mers in the dataset, so I could look at 'GC' and 'GCG' and 'GCCGACGGACGAC' without having to re-run any analyses. I couldn't find a tool like this after a brief search, so I gave up and wrote my own in NumPy, based on suffix arrays.
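To sketch the idea (a toy illustration, not my actual NumPy code): once the suffixes are sorted, every mer of every length occupies one contiguous block of that ordering, so a single binary search answers 'GC' and 'GCCGACGGACGAC' from the same index:

```python
# Toy version of counting any-length mers from one suffix array.
# Real reads would be concatenated and indexed; this naive sort of
# full suffix strings is O(n^2 log n) and for illustration only.
from bisect import bisect_left

seq = "GCGCCGACGGACGACGCG"  # stand-in for one long sequence

suffixes = sorted(seq[i:] for i in range(len(seq)))

def count_mer(mer):
    """Occurrences of `mer` = size of the suffix block starting with it."""
    lo = bisect_left(suffixes, mer)
    hi = bisect_left(suffixes, mer + "\x7f")  # '\x7f' sorts after A/C/G/T
    return hi - lo

print(count_mer("GC"), count_mer("GCG"))  # any mer length, no re-counting
```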
Two weeks later I have a functional program, in the sense that I get results, but before I invest any time making it usable for others, I thought I should check whether tools already exist that do this. Making a suffix array was a nice learning experience for me, so I haven't lost anything if such tools already exist - and if they do, I'd love to compare performance characteristics - but if not, I might consider tidying up the code and writing proper documentation. Does anyone know of such tools?
Thank you so much, and happy Diwali :)
It seems like no, there are no tools with the same functionality as yours. What range of k-mer sizes does your tool cover?
Well, theoretically all ranges; but practically a row of data (a node in the trie) can only store so much DNA before it gets split into two or more rows/nodes. So with 64 bits for storing DNA, the sweet spot for DNA length is around 50-60bp. For longer DNA fragments you'd have to increase that, but I mean, you could store a 1000mer even with 32-bit rows - it would just take 1000/16 ≈ 63 rows to store it :)
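Roughly like this (a toy sketch of the 2-bits-per-base packing, not ACGTrie's actual node layout):

```python
# 2 bits per base: a 64-bit word holds up to 32 bases, a 32-bit word 16,
# so a 1000mer at 32 bits per row needs ceil(1000 / 16) = 63 chained rows.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(frag, bits=64):
    """Pack a fragment into one fixed-width word, 2 bits per base."""
    if len(frag) > bits // 2:
        raise ValueError("too long for one word - chain extra rows/nodes")
    word = 0
    for base in frag:
        word = (word << 2) | CODE[base]
    return word

print(bin(pack("GCCGACGGACGAC")))
```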
Did you finally share your code or not? :D How and where can I use it? Thanks
Yes and no. Yes, it's out there (https://github.com/JohnLonginotto/ACGTrie), but it isn't really ready for anyone else to use unless they know some Python. It will make the trie and give you the tools to query it, but right now you would have to write your own scripts to do anything more than that (like compare two samples, etc.).
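For example (purely hypothetical - this is not ACGTrie's actual API, just the shape such a script might take): once you can pull per-mer counts out of each sample, comparing two samples is only a few lines:

```python
# Hypothetical two-sample comparison on top of per-mer counts.
# The dicts stand in for whatever query function the trie exposes.
from math import log2

def log2_ratio(a, b, pseudo=1):
    """Log2 fold-change with a pseudocount to avoid division by zero."""
    return log2((a + pseudo) / (b + pseudo))

sample1 = {"GC": 900, "GCG": 120, "GCCGACGGACGAC": 3}  # made-up counts
sample2 = {"GC": 400, "GCG": 310, "GCCGACGGACGAC": 0}

for mer in sorted(set(sample1) | set(sample2)):
    print(mer, round(log2_ratio(sample1.get(mer, 0), sample2.get(mer, 0)), 2))
```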
Thanks - seems I need to learn Python to survive these days :))
This seems like a very specific application. Could you just write a bash script that launches Jellyfish several times with different k?
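Something like this (sketched in Python rather than bash; the file names, k range, and hash size are placeholders, and the flags are Jellyfish 2's usual count options):

```python
# One `jellyfish count` run per k; each k lands in its own output file.
import subprocess

for k in range(2, 21):  # placeholder range of mer sizes
    subprocess.run(
        ["jellyfish", "count",
         "-m", str(k),         # mer length for this pass
         "-s", "100M",         # initial hash size
         "-t", "4",            # worker threads
         "-C",                 # count canonical mers
         "-o", f"mers_{k}.jf",
         "reads.fasta"],       # placeholder input
        check=True,
    )
```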
This would probably work. My only concern is that you can't merge files of different k-mer sizes in Jellyfish, so working with 100+ files per sample would be a pain.
I also worry that it would be kind of slow. I tried to see how long it would take to process 12,800,000 fragments (all < 20bp), which is roughly 0.25% of one seq file, and by the time Jellyfish had done the 4mers, the (much slower) Python script had already finished everything. It's not really a fair comparison for Jellyfish, because it had to initialize and read the whole file over and over again, but when it gave a memory leak at the 6mer and 7mer sizes I gave up on the approach.
Also, I don't think the n-mer question is a specific application - rather, I think it is the more general problem, of which k-mers for graph building etc. is the more specific application. The problem with the specific question I asked here on Biostars is that if no one posts a tool that can do it, that doesn't mean such a tool doesn't exist - just that no one posted :/