I wrote a C++14 program called kmer-counter, which is up on GitHub. It does k-mer counting on FASTA input and other types of files.
It will do what you want to do (count k-mers per sequence) without the need to split your multi-FASTA file into single-sequence files. Splitting is a horrible idea, btw, as the I/O overhead would be ridiculous.
I think this should run decently fast for your needs, but of course you'd need to try it and decide that for yourself.
Assuming you have reasonably modern development tools installed, you could build it locally via something like:
$ git clone https://github.com/alexpreynolds/kmer-counter.git
$ cd kmer-counter
$ make
Say you have a FASTA file like this, where headers and sequences are on alternating lines:
$ more sequences.fa
>foo
TTAACG
>bar
GTGGAAGTTCTTAGGGCATGGCAAAGAGTCAGAATTTGAC
Then you can count 6-mers with kmer-counter by specifying --fasta and --k=6 as options, followed by your sequences.fa FASTA file:
$ ./kmer-counter --fasta --k=6 sequences.fa
>foo CGTTAA:1 TTAACG:1
>bar TTCTTA:1 TAGGGC:1 AAATTC:1 GTGGAA:1 AACTTC:1 AGTTCT:1 GCAAAG:1 AAAGAG:1 AAGAGT:1 TCAAAT:1 TGGAAG:1 GTTCTT:1 GTCAGA:1 TCTGAC:1 CATGGC:1 CGTTAA:1 GCATGG:1 TTTGCC:1 CTTTGC:1 TCAGAA:1 CTTAGG:1 TTAGGG:1 TGCCAT:1 TGACTC:1 ACTTCC:1 CAAATT:1 TTCTGA:1 GTCAAA:1 AGAACT:1 TCTTAG:1 CCTAAG:1 GCCATG:1 AGAATT:1 GGAAGT:1 AGTCAG:1 AATTTG:1 CCCTAA:1 ATTCTG:1 GAACTT:1 GAGTCA:1 CTCTTT:1 ATTTGA:1 CAGAAT:1 CCATGC:1 GGCAAA:1 ATGGCA:1 TTCCAC:1 ATGCCC:1 TGGCAA:1 CAAAGA:1 AAGAAC:1 AATTCT:1 TGCCCT:1 TAAGAA:1 GCCCTA:1 CTGACT:1 GAATTT:1 TTTGAC:1 CTAAGA:1 AGAGTC:1 GAAGTT:1 AAGTTC:1 GGGCAT:1 CATGCC:1 TTGCCA:1 GGCATG:1 AGGGCA:1 GACTCT:1 CTTCCA:1 TCTTTG:1 ACTCTT:1
The output is two-column text. The header is in the first column and the k-mer counts, written as kmer:count pairs, are in the second column.
If you want this redirected to a file, just use the redirection operator:
$ ./kmer-counter --fasta --k=6 sequences.fa > counts.txt
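If you want to work with those counts downstream, here is a minimal Python sketch of reading one output line back into a dictionary. It is not part of kmer-counter: the function name parse_counts_line is made up for illustration, and it assumes the layout shown above, i.e. a header with no whitespace in it, followed by whitespace-separated kmer:count pairs.

```python
def parse_counts_line(line):
    """Split one output line into (header, {kmer: count}).

    Assumes the header contains no whitespace and every remaining
    field looks like KMER:COUNT, as in the example output above.
    """
    fields = line.split()              # splits on spaces or tabs
    header = fields[0].lstrip(">")     # drop the leading '>'
    counts = {}
    for pair in fields[1:]:
        kmer, _, count = pair.rpartition(":")
        counts[kmer] = int(count)
    return header, counts


# Example using the first line of the output shown above.
header, counts = parse_counts_line(">foo CGTTAA:1 TTAACG:1")
print(header, counts)
```

To process a whole counts.txt, you would call this once per line while iterating over the open file.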
If you have a FASTA file where headers and sequences do not alternate lines (that is, there are two or more lines of sequence between headers), there are lots of scripts on BioStars that show how to use awk or similar to preprocess your FASTA into single-line form, and they run pretty quickly.
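As one example of the "or similar" route, here is a rough Python sketch of that preprocessing step. It joins wrapped sequence lines so that each header is followed by exactly one sequence line. The function name linearize_fasta is hypothetical, and the sketch assumes every record starts with a '>' header line and that blank lines can be dropped.

```python
def linearize_fasta(lines):
    """Yield header lines unchanged and each record's sequence as one line."""
    chunks = []                        # pieces of the current sequence
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                   # skip blank lines
        if line.startswith(">"):
            if chunks:                 # flush the previous record's sequence
                yield "".join(chunks)
                chunks = []
            yield line
        else:
            chunks.append(line)
    if chunks:                         # flush the final record
        yield "".join(chunks)


# Example: the "bar" record from above, wrapped over two lines.
wrapped = """>bar
GTGGAAGTTCTTAGGGCATG
GCAAAGAGTCAGAATTTGAC
""".splitlines()

for out in linearize_fasta(wrapped):
    print(out)
```

For a real file, you could pass an open file handle instead of the list and write the yielded lines to a new FASTA file, then run kmer-counter on that.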
Can you share a glimpse of your sequence file? Is it a multi-FASTA? I think I already have a ready script that I could share. Looking forward to hearing from you soon.