Question

How to count frequency of kmers for different values of k from fasta file

0

Entering edit mode

4.6 years ago

Ashi ▴ 20

Hi is there any way to count the frequency of kmers from fasta files for different values of k? My objective is to produce a plot of the kmer distribution. Thank you so much.

fasta kmers count frequency genome • 4.0k views

ADD COMMENT • link updated 4.6 years ago by GenoMax 153k • written 4.6 years ago by Ashi ▴ 20

0

Entering edit mode

Finding Kmers From Fasta File

Counting k-mers in a multifasta file

specific k-mer count in multi fasta file

It is three times the same tools being suggested though :)

ADD REPLY • link 4.6 years ago by ATpoint 89k

0

Entering edit mode

@ATpoint Thank you for these. I had seen these. For some reason I am failing to install Jellyfish on my linux server. it is saying command not found whenever I am giving jellyfish command. kmer-counter is working but it is pretty slow. For ecoli short reads it took more than 6 hours and the server where I run the program can run for 6 hours at a stretch so the job got killed before producing any result. That's why I was looking for some other tool or script maybe awk or anythig.

ADD REPLY • link 4.6 years ago by Ashi ▴ 20

score 0 · Answer 1 · 2021-01-21

0

Entering edit mode

4.6 years ago

Mensur Dlakic ★ 29k

This one is fast and has lots of other functions in addition to k-mer counting:

https://github.com/dib-lab/khmer

ADD COMMENT • link 4.6 years ago by Mensur Dlakic ★ 29k

GenoMax · Answer 2 · 2021-01-21

0

Entering edit mode

4.6 years ago

GenoMax 153k

From BBMap suite.

$ kmercountexact.sh

Written by Brian Bushnell
Last modified November 29, 2018

Description:  Counts the number of unique kmers in a file.
Generates a kmer frequency histogram and genome size estimate (in peaks output),
and prints a file containing all kmers and their counts.
Supports K=1 to infinity, though not all values are allowed.
SEE ALSO: bbnorm.sh/khist.sh loglog.sh, and kmercountmulti.sh.

Usage:   kmercountexact.sh in=<file> khist=<file> peaks=<file>

Input may be fasta or fastq, compressed or uncompressed.
Output may be stdout or a file.  out, khist, and peaks are optional.

or

$ kmercountmulti.sh

Written by Brian Bushnell
Last modified December 30, 2016

Description:  Estimates cardinality of unique kmers in sequence data.
Processes multiple kmer lengths simultaneously to produce a histogram.

Usage:  kmercountmulti.sh in=<file> sweep=<20,100,8> out=<histogram output>

ADD COMMENT • link 4.6 years ago by GenoMax 153k

0

Entering edit mode

@GenoMax Is this from BBmaptool? Do I need to install BBmap for this?

ADD REPLY • link 4.6 years ago by Ashi ▴ 20

0

Entering edit mode

Yes. There is no installation needed. Just download the software and unzip the file. You are ready to go as long as you have a java run time installed.

ADD REPLY • link 4.6 years ago by GenoMax 153k

0

Entering edit mode

@Genomax I know this is not the proper post to ask this so I can delete it if needed but I am having this error when I tried to run kmercountermulti.sh on my linux server

/bbmap//calcmem.sh: line 75: [: -v: unary operator expected
java -ea -Xmx500m -cp /home/bbmap/current/ jgi.KmerCountMulti in=/home/eColi_LR_SL.fasta sweep=16,32,64 out=hist1_eColi_LR
Executing jgi.KmerCountMulti [in=/home/eColi_LR_SL.fasta, sweep=16,32,64, out=hist1_eColi_LR]

Exception in thread "main" java.lang.ArrayStoreException: cardinality.LogLog2
        at cardinality.MultiLogLog.<init>(MultiLogLog.java:30)
        at cardinality.MultiLogLog.<init>(MultiLogLog.java:11)
        at jgi.KmerCountMulti.<init>(KmerCountMulti.java:149)
        at jgi.KmerCountMulti.main(KmerCountMulti.java:49)

My java version is :

openjdk version "1.8.0_272"
OpenJDK Runtime Environment (build 1.8.0_272-b10)
OpenJDK 64-Bit Server VM (build 25.272-b10, mixed mode)

ADD REPLY • link updated 4.6 years ago by GenoMax 153k • written 4.6 years ago by Ashi ▴ 20

0

Entering edit mode

Just tried with E. coli genome.

$ kmercountmulti.sh in=GCF_000005845.2_ASM584v2_genomic.fna sweep=20,100,8

Input is being processed as unpaired

#K  Count
20  4645838
28  4483466
36  4645838
44  4577093
52  4541564
60  4365048
66  4597298
75  4669509
84  4475877
90  4626988
100 4387290

Time:                           1.603 seconds.
Reads Processed:           1    0.00k reads/sec
Bases Processed:       4641k    2.90m bases/sec

ADD REPLY • link 4.6 years ago by GenoMax 153k

0

Entering edit mode

Unfortunately I am having the same error. Did you do any modifications to the calcmem.sh as it is showing some error at the 75th line?

ADD REPLY • link 4.6 years ago by Ashi ▴ 20

0

Entering edit mode

No I did not make any changes. In the sweep= option use the order start,stop,step (in my example go from 20 to 100 in steps of 8). Wonder if you have an extra comma at the end of sweep=16,32,64,.

ADD REPLY • link 4.6 years ago by GenoMax 153k

0

Entering edit mode

No I do not have any extra comma there:

./kmercountmulti.sh in=/home/eColi_LR_SL.fasta sweep=16,64,16 out=hist1_eColi_LR

This is the command I am using and getting the error for. This is a fasta file for long reads of e Coli that I have simulated. I am wondering if the error if because of the input then.