How to count frequency of kmers for different values of k from fasta file
2
0
Entering edit mode
3.8 years ago
Ashi ▴ 20

Hi is there any way to count the frequency of kmers from fasta files for different values of k? My objective is to produce a plot of the kmer distribution. Thank you so much.

fasta kmers count frequency genome • 3.1k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode

@ATpoint Thank you for these. I had seen these. For some reason I am failing to install Jellyfish on my linux server. it is saying command not found whenever I am giving jellyfish command. kmer-counter is working but it is pretty slow. For ecoli short reads it took more than 6 hours and the server where I run the program can run for 6 hours at a stretch so the job got killed before producing any result. That's why I was looking for some other tool or script maybe awk or anythig.

ADD REPLY
0
Entering edit mode
3.8 years ago
Mensur Dlakic ★ 28k

This one is fast and has lots of other functions in addition to k-mer counting:

https://github.com/dib-lab/khmer

ADD COMMENT
0
Entering edit mode
3.8 years ago
GenoMax 147k

From BBMap suite.

$ kmercountexact.sh

Written by Brian Bushnell
Last modified November 29, 2018

Description:  Counts the number of unique kmers in a file.
Generates a kmer frequency histogram and genome size estimate (in peaks output),
and prints a file containing all kmers and their counts.
Supports K=1 to infinity, though not all values are allowed.
SEE ALSO: bbnorm.sh/khist.sh loglog.sh, and kmercountmulti.sh.

Usage:   kmercountexact.sh in=<file> khist=<file> peaks=<file>

Input may be fasta or fastq, compressed or uncompressed.
Output may be stdout or a file.  out, khist, and peaks are optional.

or

$ kmercountmulti.sh

Written by Brian Bushnell
Last modified December 30, 2016

Description:  Estimates cardinality of unique kmers in sequence data.
Processes multiple kmer lengths simultaneously to produce a histogram.

Usage:  kmercountmulti.sh in=<file> sweep=<20,100,8> out=<histogram output>
ADD COMMENT
0
Entering edit mode

@GenoMax Is this from BBmaptool? Do I need to install BBmap for this?

ADD REPLY
0
Entering edit mode

Yes. There is no installation needed. Just download the software and unzip the file. You are ready to go as long as you have a java run time installed.

ADD REPLY
0
Entering edit mode

@Genomax I know this is not the proper post to ask this so I can delete it if needed but I am having this error when I tried to run kmercountermulti.sh on my linux server

/bbmap//calcmem.sh: line 75: [: -v: unary operator expected
java -ea -Xmx500m -cp /home/bbmap/current/ jgi.KmerCountMulti in=/home/eColi_LR_SL.fasta sweep=16,32,64 out=hist1_eColi_LR
Executing jgi.KmerCountMulti [in=/home/eColi_LR_SL.fasta, sweep=16,32,64, out=hist1_eColi_LR]

Exception in thread "main" java.lang.ArrayStoreException: cardinality.LogLog2
        at cardinality.MultiLogLog.<init>(MultiLogLog.java:30)
        at cardinality.MultiLogLog.<init>(MultiLogLog.java:11)
        at jgi.KmerCountMulti.<init>(KmerCountMulti.java:149)
        at jgi.KmerCountMulti.main(KmerCountMulti.java:49)

My java version is :

openjdk version "1.8.0_272"
OpenJDK Runtime Environment (build 1.8.0_272-b10)
OpenJDK 64-Bit Server VM (build 25.272-b10, mixed mode)
ADD REPLY
0
Entering edit mode

Just tried with E. coli genome.

$ kmercountmulti.sh in=GCF_000005845.2_ASM584v2_genomic.fna sweep=20,100,8

Input is being processed as unpaired

#K  Count
20  4645838
28  4483466
36  4645838
44  4577093
52  4541564
60  4365048
66  4597298
75  4669509
84  4475877
90  4626988
100 4387290

Time:                           1.603 seconds.
Reads Processed:           1    0.00k reads/sec
Bases Processed:       4641k    2.90m bases/sec
ADD REPLY
0
Entering edit mode

Unfortunately I am having the same error. Did you do any modifications to the calcmem.sh as it is showing some error at the 75th line?

ADD REPLY
0
Entering edit mode

No I did not make any changes. In the sweep= option use the order start,stop,step (in my example go from 20 to 100 in steps of 8). Wonder if you have an extra comma at the end of sweep=16,32,64,.

ADD REPLY
0
Entering edit mode

No I do not have any extra comma there:

./kmercountmulti.sh in=/home/eColi_LR_SL.fasta sweep=16,64,16 out=hist1_eColi_LR

This is the command I am using and getting the error for. This is a fasta file for long reads of e Coli that I have simulated. I am wondering if the error if because of the input then.

ADD REPLY
0
Entering edit mode

Possibly. How did you simulate the reads?

ADD REPLY

Login before adding your answer.

Traffic: 2712 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6