Getting sequence id from k-mers using jellyfish
8.7 years ago
Protostome ▴ 50

I'm currently extracting a list of k-mers from a FASTQ file using Jellyfish. In addition to the k-mers, I would also like, for each k-mer, a list of the sequence IDs (that is, the IDs of the MiSeq reads) in which it occurs.

Is this something Jellyfish is capable of doing? Unfortunately, I couldn't find any description of this in the docs.

If not, is there a tool that is able to perform this task?

next-gen jellyfish alignment • 3.1k views
8.7 years ago
Rob 6.9k

No, neither Jellyfish nor any other standard k-mer counter of which I am aware will provide this type of information. Remembering the record in which each k-mer occurred would require a huge amount of extra resources (specifically, memory) during k-mer counting. The tools that do this are those that actually build an index on the read set (which, you should be forewarned, is typically a time- and memory-consuming task). You might want to look at Gk-Arrays and BEETL. These tools build an index on a set of reads that allows you to query for a specific k-mer and get a list of all of the reads in which it occurs.


Thanks Rob. I think the best approach is to iterate over these k-mers and keep the list of reads per k-mer out of memory, on disk (SQLite is probably the easiest method).
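A minimal sketch of that SQLite idea, using Python's standard sqlite3 module. The table name kmer_reads, the helper names build_kmer_db and reads_for_kmer, and the (read_id, sequence) input format are all assumptions made for illustration; nothing here comes from Jellyfish itself. Pass a file path as db_path to keep the table on disk rather than in memory.

```python
import sqlite3


def build_kmer_db(records, k, db_path=":memory:"):
    """Store every (k-mer, read ID) pair from `records` in an SQLite table.

    `records` is any iterable of (read_id, sequence) tuples (hypothetical
    input format); `db_path` may be a file path for on-disk storage.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kmer_reads ("
        " kmer TEXT NOT NULL,"
        " read_id TEXT NOT NULL,"
        " PRIMARY KEY (kmer, read_id))"
    )
    with conn:  # one transaction for the whole load, committed on exit
        for read_id, seq in records:
            # A set removes k-mers repeated within the same read.
            rows = {(seq[i:i + k], read_id) for i in range(len(seq) - k + 1)}
            conn.executemany(
                "INSERT OR IGNORE INTO kmer_reads VALUES (?, ?)", rows
            )
    return conn


def reads_for_kmer(conn, kmer):
    """Return the sorted read IDs in which `kmer` occurs."""
    cur = conn.execute(
        "SELECT read_id FROM kmer_reads WHERE kmer = ? ORDER BY read_id",
        (kmer,),
    )
    return [row[0] for row in cur]
```

The composite primary key doubles as the lookup index on kmer, so no separate index is needed. Storing one row per (k-mer, read) pair is simple, but the table grows with the total number of k-mers in the read set.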


If you know which k-mers you're interested in ahead of time, and it's a reasonably sized set, then an approach like this would work well. You keep your set of k-mers in a hash, do a linear scan of the file, and, for each k-mer of interest you encounter, maintain a list of the reads where it occurred. If you want to do this for all k-mers, then building e.g. an SQLite database should "work"; it just may end up being slow / huge. The benefit of the indices I mentioned above is that they are relatively compact with respect to the amount of information they contain (and the queries they can answer), so they should work well even for very large read sets. However, if your FASTQ files aren't too huge, a simpler approach should work just fine.
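The hash-plus-linear-scan approach could look roughly like the following sketch. It assumes a plain four-line-per-record FASTQ file and takes the read ID to be the first whitespace-delimited token of the header line; the function names read_fastq and reads_per_kmer are hypothetical.

```python
from collections import defaultdict


def read_fastq(handle):
    """Yield (read_id, sequence) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:  # end of file
            return
        header = header.strip()
        if not header:  # tolerate stray blank lines between records
            continue
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        handle.readline()  # quality line
        yield header[1:].split()[0], seq


def reads_per_kmer(handle, kmers_of_interest, k):
    """Map each k-mer of interest to the IDs of the reads containing it."""
    hits = defaultdict(list)
    for read_id, seq in read_fastq(handle):
        seen = set()  # record each read at most once per k-mer
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in kmers_of_interest and kmer not in seen:
                seen.add(kmer)
                hits[kmer].append(read_id)
    return dict(hits)
```

For a real file this would be called as reads_per_kmer(open("reads.fastq"), kmers, k), where the file name is a placeholder. The sketch matches k-mers exactly as they appear in the read; if the k-mers were counted in canonical form (Jellyfish's -C option), each k-mer from the read would also need to be compared via its reverse complement.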
