Question

k-mer counters - presence/absence matrix

0

3.4 years ago

lizabe ▴ 10

Hi all, I need to compute a presence/absence matrix (binary) of k-mers present in a set of genomes (fasta files).

Could you please suggest a tool? I have tried Jellyfish, but the --matrix option described in the tutorial (https://raw.githubusercontent.com/gmarcais/Jellyfish/master/doc/jellyfish.pdf) didn't work.

Thanks.

k-mers matrix • 2.8k views

ADD COMMENT • link 3.4 years ago by lizabe ▴ 10

0

Can you elaborate on "didn't work"? Jellyfish was the first thing that sprang to mind when reading your title, so I would say it's probably worth persisting with, since it's one of the best tools, if not the best, for k-mer work.

ADD REPLY • link 3.4 years ago by Joe 21k

0

Hi Joe, thanks for your answer. Sorry I didn't explain my problem in the first message.

I installed jellyfish 2.3.0 and ran the command:

jellyfish count -m 256 -o jellyoutput -c 1 -s 100000000 -t 32 --matrix file.fasta

This was the error: count: unrecognized option '--matrix' Use --usage or --help for some help

The tutorial probably corresponds to an older version of the program. Do you know the correct command to generate a matrix like the one I need?

ADD REPLY • link 3.4 years ago by lizabe ▴ 10

0

I am not entirely sure if Jellyfish can be readily used to carry out such comparative analysis. You may have to generate k-mer profiles for each sample/genome and then carry out comparisons separately.

ADD REPLY • link 3.4 years ago by Sej Modha 5.3k

0

Thanks for the answers!

ADD REPLY • link 3.4 years ago by lizabe ▴ 10

score 4 · Answer 1 · 2021-08-06

Hi lizabe,

You're right that this tutorial is out of date. The --matrix option is no longer valid as an option to jellyfish count. However, I don't think its original intent was to do what you wanted anyway. It doesn't write out a binary presence/absence matrix. Rather, it specifies the binary matrix that is used to generate the universal hash function for hashing the k-mers. Jellyfish relies on a universal hash function, which can be generated using a random binary matrix. If you want to use the exact same hash function for other purposes, you need to know what that matrix is.
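To illustrate the general idea of hashing with a binary matrix, here is a minimal Python sketch. It is not Jellyfish's actual implementation: the 2-bit base encoding, the matrix dimensions, the seed, and all function names are assumptions chosen for the example. The k-mer is packed into a bit vector, and each output bit is the parity of one matrix row ANDed with that vector, i.e. a matrix-vector product over GF(2).

```python
import random

# Assumed 2-bit encoding of bases (illustrative, not taken from Jellyfish).
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_kmer(kmer):
    """Pack a k-mer into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value


def random_binary_matrix(n_rows, n_cols, seed):
    """Return a random binary matrix; each row is stored as an n_cols-bit integer."""
    rng = random.Random(seed)
    return [rng.getrandbits(n_cols) for _ in range(n_rows)]


def matrix_hash(matrix, value):
    """Multiply the matrix by the bit vector `value` over GF(2).

    Each output bit is the parity of (row AND value).
    """
    result = 0
    for row in matrix:
        parity = bin(row & value).count("1") & 1
        result = (result << 1) | parity
    return result


k = 5
matrix = random_binary_matrix(n_rows=8, n_cols=2 * k, seed=42)
hash_value = matrix_hash(matrix, encode_kmer("ACGTA"))
```

Two programs only produce the same hash values if they share the same matrix, which is why the matrix has to be known if the hash function is to be reused.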

Anyway, to achieve what you want, I'm afraid you'll need to take a different approach. Essentially, what you want to do is to count k-mers in a collection of different fasta files / genomes, and then determine which k-mers are present in each. With jellyfish, you could do this by running jellyfish separately on each input genome, then using the dump command to get the k-mer list for each in plain text, and then merging across the files to get the matrix. Alternatively, you could use a tool like mantis (disclosure: I'm a senior author of this method) or metagraph, which are designed explicitly to answer k-mer presence/absence queries over a large collection of k-mers coming from different sources (among other things).
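The merging step could be sketched in Python roughly as follows. This assumes one plain-text file per genome in column format, with one "kmer count" pair per line (as written by jellyfish dump -c), and that all files were produced with the same k. The function names and the tab-separated output layout are hypothetical choices for the example, and every k-mer set is held in memory, so this only suits modest inputs.

```python
import csv
from pathlib import Path


def read_kmers(dump_path):
    """Return the set of k-mers listed in a 'kmer count' column file."""
    kmers = set()
    with open(dump_path) as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines
                kmers.add(fields[0])
    return kmers


def presence_absence(dump_paths):
    """Build a binary matrix: one row per k-mer, one column per input file."""
    samples = [Path(p).stem for p in dump_paths]
    per_sample = [read_kmers(p) for p in dump_paths]
    all_kmers = sorted(set().union(*per_sample))
    rows = [[kmer] + [int(kmer in s) for s in per_sample] for kmer in all_kmers]
    return samples, rows


def write_matrix(samples, rows, out_path):
    """Write the matrix as a tab-separated table with a header row."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["kmer"] + samples)
        writer.writerows(rows)
```

For example, calling presence_absence(["genomeA.txt", "genomeB.txt"]) and passing the result to write_matrix would give a table with a kmer column followed by one 0/1 column per genome (the file names here are placeholders).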

score 2 · Answer 2 · 2021-08-09

2

3.4 years ago

Alex Reynolds 36k

Perhaps kmer-counter or kmer-boolean would be of use for kmers shorter than 31 characters.

The kmer-counter repo contains a script to demonstrate Python integration for quick filtering/querying. You could easily write out a presence/absence matrix from this result.

For kmers that are 32 characters and longer, a tool like Jellyfish would be appropriate.
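For small genomes and short kmers, the per-genome kmer sets can also be collected directly in Python before writing out a matrix. The sketch below does not use the kmer-counter or kmer-boolean code or API; the function names are hypothetical. It also assumes that a kmer and its reverse complement should be treated as the same kmer (canonical form) and that windows containing characters other than A, C, G, T are skipped.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
VALID_BASES = set("ACGT")


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def canonical_kmers(sequence, k):
    """Return the set of canonical kmers (min of kmer and reverse complement)."""
    kmers = set()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= VALID_BASES:  # skip windows with N or other symbols
            reverse_complement = kmer.translate(COMPLEMENT)[::-1]
            kmers.add(min(kmer, reverse_complement))
    return kmers


def genome_kmers(path, k):
    """Collect canonical kmers across all records of one FASTA file."""
    kmers = set()
    for _, sequence in read_fasta(path):
        kmers |= canonical_kmers(sequence, k)
    return kmers
```

With one such set per genome, a kmer is present in a genome exactly when it is a member of that genome's set, so each matrix cell reduces to a set membership test.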