Question

k-mer finding and counting

2

Entering edit mode

4.3 years ago

sami ▴ 40

I have a fastq file. I want to have k-mers and their abundance numbers. can you give me some tools to do this! thank you

k-mer • 6.6k views

ADD COMMENT • link updated 3.4 years ago by Alex Reynolds 35k • written 4.3 years ago by sami ▴ 40

0

Entering edit mode

What were some tools you came across when you googled how to do this?

ADD REPLY • link 4.3 years ago by Ram 44k

1

Entering edit mode

i have not google it , i ll try

ADD REPLY • link 4.3 years ago by sami ▴ 40

3

Entering edit mode

4.3 years ago

kristina.mahan ▴ 170

You can use Jellyfish, KMC3, kmerfreq, BFCounter, ntCard . . .

ADD COMMENT • link 4.3 years ago by kristina.mahan ▴ 170

3

Entering edit mode

4.3 years ago

wulj2 ▴ 50

i would prefer this one: https://github.com/lh3/yak by Heng Li

ADD COMMENT • link 4.3 years ago by wulj2 ▴ 50

0

Entering edit mode

why they use always 31 as maximum size of kmer ?

ADD REPLY • link 4.3 years ago by sami ▴ 40

1

Entering edit mode

A canonical 31-mer can be represented with a 64-bit unsigned integer. Larger word lengths could be used at the expense of more memory and time, so that's a common cutoff when using that data structure to store a kmer key for presence/absence or counts.

If you use a larger k, tools like Jellyfish use a data structure called a Bloom filter (https://en.wikipedia.org/wiki/Bloom_filter) to be able to manage performance and storage, with the trade-off of giving an inexact count.

ADD REPLY • link 4.3 years ago by Alex Reynolds 35k

0

Entering edit mode

Hey Alex do you have a example in how to use your kmer count in a banch of files in different directories? Thanks

ADD REPLY • link 3.4 years ago by schlogl ▴ 160

1

Entering edit mode

Just point the binary at the file. See https://github.com/alexpreynolds/kmer-counter/blob/master/test/makefile#L19 for a basic example. If you have a lot of files, iterate through a bash array or similar scripted approach.

ADD REPLY • link 3.4 years ago by Alex Reynolds 35k

0

Entering edit mode

Where is that specified?

ADD REPLY • link 4.3 years ago by Ram 44k

1

Entering edit mode

https://github.com/lh3/yak/blob/5c97fb5b12c1d4b4a088e4919f3628c6649a559e/yak.h#L8

ADD REPLY • link 4.3 years ago by Alex Reynolds 35k

score 3 · Accepted Answer · 2020-08-09

If you're working with small values of k, I wrote a command-line k-mer counter called kmer-counter that will output results in a form that a Python script can consume:

https://github.com/alexpreynolds/kmer-counter

You can grab, build and install it like so:

$ git clone https://github.com/alexpreynolds/kmer-counter.git
$ cd kmer-counter
$ make
$ cp kmer-counter /usr/local/bin

Once the binary is in your path, you might use it in Python like so:

k = 6
fastaFile = '/path/to/some/seqs.fa'
kmerCmd = 'kmer-counter --fasta --k=%d %s' % (k, fastaFile)
try:
    output = subprocess.check_output(kmerCmd, shell=True)
    result = {}
    for line in output.splitlines():
        (header, counts) = line.strip().split('\t')
        header = header[1:]
        kmers = dict((k,int(v)) for (k,v) in [d.split(':') for d in counts.split(' ')])
        result[header] = kmers
    sys.stdout.write("%s" % (str(result)))
except subprocess.CalledProcessError as error:
    sys.stderr.write("%s" % (str(error)))

Given example FASTA like this:

>foo
TTAACG
>bar
GTGGAAGTTCTTAGGGCATGGCAAAGAGTCAGAATTTGAC

For k=6, you would get an iterable Python dictionary like this:

{'foo': {'TTAACG': 1, 'CGTTAA': 1}, 'bar': {'GTTCTT': 1, 'AGAACT': 1, 'GAGTCA': 1, 'ATGGCA': 1, 'GAACTT': 1, 'ATTCTG': 1, 'CTAAGA': 1, 'CTTCCA': 1, 'ATTTGA': 1, 'GGAAGT': 1, 'AGGGCA': 1, 'CCTAAG': 1, 'CTCTTT': 1, 'AATTTG': 1, 'TCTGAC': 1, 'TTTGCC': 1, 'CTTAGG': 1, 'TTTGAC': 1, 'GAAGTT': 1, 'CCCTAA': 1, 'AGAATT': 1, 'AGTCAG': 1, 'CTGACT': 1, 'TCTTAG': 1, 'CGTTAA': 1, 'GTGGAA': 1, 'TGCCAT': 1, 'ACTCTT': 1, 'GGGCAT': 1, 'TTAGGG': 1, 'CTTTGC': 1, 'TGGAAG': 1, 'GACTCT': 1, 'CATGCC': 1, 'GCAAAG': 1, 'AAATTC': 1, 'GTCAAA': 1, 'TGACTC': 1, 'TAGGGC': 1, 'AAGTTC': 1, 'ATGCCC': 1, 'TCAAAT': 1, 'CAAAGA': 1, 'AACTTC': 1, 'GTCAGA': 1, 'CAAATT': 1, 'TAAGAA': 1, 'CATGGC': 1, 'AAGAAC': 1, 'AAGAGT': 1, 'TCTTTG': 1, 'TTCCAC': 1, 'TGGCAA': 1, 'GGCAAA': 1, 'AGTTCT': 1, 'AGAGTC': 1, 'TCAGAA': 1, 'GAATTT': 1, 'AAAGAG': 1, 'TGCCCT': 1, 'CCATGC': 1, 'GGCATG': 1, 'TTGCCA': 1, 'CAGAAT': 1, 'AATTCT': 1, 'GCATGG': 1, 'ACTTCC': 1, 'TTCTTA': 1, 'GCCATG': 1, 'GCCCTA': 1, 'TTCTGA': 1}}

You can use standard Python calls to manipulate this dictionary object and get sums of counts per record, for sequence, etc. which seems to answer your question. Please feel free to clarify what you're looking for if this object representation is not clear.

For a fully-fleshed out demonstration, see: https://github.com/alexpreynolds/kmer-counter/blob/master/test/kmer-test.py

If you are working with large values of k, then you might look at other tools, like Jellyfish. They use data structures to give an approximate count for large values of k that are not tenable for strict word counts.