I'm currently extracting a list of k-mers from a FASTQ file using jellyfish. In addition to the k-mers, I would also like, for each k-mer, a list of the sequence ids (which are actually the ids of the MiSeq reads) in which it occurs.
Is this something jellyfish is capable of doing? Unfortunately, I couldn't find any description of this in the docs.
If not, is there a tool that is able to perform this task?
Thanks, Rob. I think the best approach is to iterate over these k-mers and keep the list of reads per k-mer off-memory (SQLite is probably the easiest method).
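For what it's worth, here is a minimal sketch of what the SQLite idea could look like in Python, using only the standard-library `sqlite3` module. The table name `kmer_reads`, the function `index_kmers`, and the toy reads are made up for illustration. It stores one row per distinct (k-mer, read id) pair and does not treat a k-mer and its reverse complement as the same k-mer.

```python
import sqlite3


def index_kmers(records, k, db_path=":memory:"):
    """Store every distinct (k-mer, read id) pair from `records` in SQLite.

    `records` is any iterable of (read_id, sequence) pairs. Pass a file
    path as `db_path` to keep the index on disk instead of in memory.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kmer_reads ("
        " kmer TEXT NOT NULL,"
        " read_id TEXT NOT NULL,"
        " PRIMARY KEY (kmer, read_id))"
    )
    with conn:  # one transaction for all inserts; commits on success
        for read_id, seq in records:
            # A set removes k-mers that repeat within the same read.
            rows = {(seq[i:i + k], read_id) for i in range(len(seq) - k + 1)}
            conn.executemany(
                "INSERT OR IGNORE INTO kmer_reads VALUES (?, ?)", rows
            )
    return conn


# Toy example with two hypothetical reads.
conn = index_kmers([("r1", "ACGTA"), ("r2", "CGTAC")], k=3)
reads = [
    read_id
    for (read_id,) in conn.execute(
        "SELECT read_id FROM kmer_reads WHERE kmer = ? ORDER BY read_id",
        ("CGT",),
    )
]
```

The primary key on (kmer, read_id) doubles as the lookup index for queries by k-mer, but for a full MiSeq run the table holds roughly one row per k-mer occurrence, so it can get large.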
If you know which k-mers you're interested in ahead of time, and it's a reasonably sized set, then an approach like this would work well. You keep your set of k-mers in a hash, do a linear scan of the file, and for each k-mer of interest you encounter, you maintain a list of the reads where it occurred. If you want to do this for all k-mers, then building e.g. an SQLite database should "work"; it just may end up being slow and/or huge. The benefit of the indices I mentioned above is that they are relatively compact with respect to the amount of information they contain (and the queries they can answer), so they should work well even for very large read sets. However, if your FASTQ files aren't too huge, a simpler approach should work just fine.
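To make the hash-plus-linear-scan idea concrete, here is a rough Python sketch. The function names (`read_fastq`, `reads_per_kmer`) and the toy data are invented for illustration. It assumes plain four-line FASTQ records, that all k-mers of interest have the same length, and it does not consider reverse complements.

```python
import io
from collections import defaultdict


def read_fastq(handle):
    """Yield (read_id, sequence) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        if not header.strip():  # tolerate stray blank lines
            continue
        seq = handle.readline().strip()
        handle.readline()       # '+' separator line
        handle.readline()       # quality line
        # Read id = text after '@' up to the first whitespace.
        yield header[1:].split()[0], seq


def reads_per_kmer(handle, kmers_of_interest):
    """Map each k-mer of interest to the ids of the reads containing it."""
    k = len(next(iter(kmers_of_interest)))
    hits = defaultdict(list)
    for read_id, seq in read_fastq(handle):
        seen = set()  # report each read at most once per k-mer
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in kmers_of_interest and kmer not in seen:
                seen.add(kmer)
                hits[kmer].append(read_id)
    return hits


# Toy example; for a real file, pass open("reads.fastq") instead.
demo = io.StringIO("@r1 x\nACGTA\n+\nIIIII\n@r2\nCGTAC\n+\nIIIII\n")
hits = reads_per_kmer(demo, {"CGT", "TAC"})
```

Memory use here grows with the number of (k-mer, read) hits, which is why this only works comfortably for a limited set of k-mers or modest FASTQ files.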