Question

Getting sequence id from k-mers using jellyfish

8.7 years ago

Protostome ▴ 50

I'm currently extracting a list of k-mers from a FASTQ file using Jellyfish. In addition to the k-mers, I would also like, for each k-mer, a list of the sequence IDs (which are actually the IDs of the MiSeq reads) in which it occurs.

Is this something Jellyfish is capable of doing? Unfortunately, I couldn't find any description of this in the docs.

If not, is there a tool that is able to perform this task?

next-gen jellyfish alignment • 3.1k views

score 3 · Accepted Answer · 2016-03-26

8.7 years ago

Rob 6.9k

No, neither Jellyfish nor any other standard k-mer counter of which I am aware will provide this type of information. Remembering the record in which each k-mer occurred would require a huge amount of extra resources (specifically, memory) during k-mer counting. The tools that do this are those that actually build an index on the read set (which, you should be forewarned, is typically a time- and memory-consuming task). You might want to look at Gk-Arrays and BEETL. These tools build an index on a set of reads that allows you to query for a specific k-mer and get a list of all of the reads in which it occurs.

Thanks Rob. I think the best approach is to iterate over these k-mers and keep the list of reads per k-mer out of memory, on disk (SQLite is probably the easiest method).
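A minimal sketch of that SQLite idea, using Python's standard `sqlite3` module. The table name `kmer_reads`, the helper names `build_kmer_db` and `reads_for_kmer`, and the `(read_id, sequence)` input format are all made up for illustration. It matches k-mers on the forward strand only, so it would need adapting if the k-mers were counted in canonical form.

```python
import sqlite3


def build_kmer_db(records, k, db_path=":memory:"):
    """Store every (k-mer, read id) pair in an SQLite table.

    `records` is any iterable of (read_id, sequence) tuples; pass a file
    path as `db_path` to keep the table on disk instead of in memory.
    """
    conn = sqlite3.connect(db_path)
    # The composite primary key doubles as an index on `kmer`
    # and prevents duplicate (kmer, read_id) rows.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kmer_reads ("
        "kmer TEXT NOT NULL, read_id TEXT NOT NULL, "
        "PRIMARY KEY (kmer, read_id))"
    )
    with conn:  # one transaction for all inserts
        for read_id, seq in records:
            rows = {(seq[i:i + k], read_id) for i in range(len(seq) - k + 1)}
            conn.executemany(
                "INSERT OR IGNORE INTO kmer_reads VALUES (?, ?)", rows
            )
    return conn


def reads_for_kmer(conn, kmer):
    """Return the sorted ids of all reads that contain `kmer`."""
    cur = conn.execute(
        "SELECT read_id FROM kmer_reads WHERE kmer = ? ORDER BY read_id",
        (kmer,),
    )
    return [row[0] for row in cur]


if __name__ == "__main__":
    toy_reads = [("r1", "ACGTAC"), ("r2", "TTACGT")]
    conn = build_kmer_db(toy_reads, k=4)
    print(reads_for_kmer(conn, "ACGT"))
```

Each read contributes up to one row per k-mer position, so the table grows roughly with the total number of bases in the read set.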

8.7 years ago by Protostome ▴ 50

If you know which k-mers you're interested in ahead of time, and it's a reasonably sized set, then an approach like this would work well. You keep your set of k-mers in a hash, do a linear scan of the file, and for each k-mer of interest you encounter, maintain a list of the reads in which it occurred. If you want to do this for all k-mers, then building e.g. an SQLite database should "work"; it just may end up being slow / huge. The benefit of the indices I mentioned above is that they are relatively compact w.r.t. the amount of information they contain (and the queries they can answer), so they should work well even for very large read sets. However, if your FASTQ files aren't too huge, a simpler approach should work just fine.
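As a rough sketch of the hash-plus-linear-scan approach, here is some plain Python. It assumes a simple four-line-per-record FASTQ file and exact forward-strand matches. The function names `read_fastq` and `reads_per_kmer` are invented for illustration and are not part of Jellyfish or any other tool.

```python
from collections import defaultdict
from io import StringIO


def read_fastq(handle):
    """Yield (read_id, sequence) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        header = header.strip()
        if not header:
            continue  # skip stray blank lines
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        handle.readline()  # quality line
        # Read id = first whitespace-delimited token after the leading '@'.
        yield header[1:].split()[0], seq


def reads_per_kmer(handle, kmers_of_interest, k):
    """Map each k-mer of interest to the ids of the reads that contain it."""
    hits = defaultdict(list)
    for read_id, seq in read_fastq(handle):
        seen = set()  # record each read at most once per k-mer
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in kmers_of_interest and kmer not in seen:
                seen.add(kmer)
                hits[kmer].append(read_id)
    return hits


if __name__ == "__main__":
    toy_fastq = StringIO(
        "@r1\nACGTAC\n+\nIIIIII\n"
        "@r2 extra\nTTACGT\n+\nIIIIII\n"
    )
    result = reads_per_kmer(toy_fastq, {"ACGT", "GGGG"}, k=4)
    for kmer, read_ids in result.items():
        print(kmer, read_ids)
```

For a real file you would pass an open file handle instead of the `StringIO` toy input. Memory use is driven by the number of hits, not by the size of the file.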

8.7 years ago by Rob 6.9k