Question

Counting k-mers in a multifasta file

2

Entering edit mode

7.6 years ago

ldyer2006 ▴ 50

Hi Biostars!

I'm currently trying to create a feature space of some ChIP-Seq sequence data which I will use to practice some machine learning. The first step I'm taking is to define the feature space by counting the number of k-mers which appear in each of the results from the ChIP-Seq run of a specific TF (RING1B).

There are around 3 million results, each of 36bp, and I'm looking to get a separate k-mer counts (Probably 6bp or 7bp) for each of those results. I've taken a look at using KMC2 or JellyFish, but these programs both seem to count the total number of 6-mers in the whole file, rather than counting individually for each separate sequence. My current attempt at a solution is to use faidx to split the file and then run a script that will use KMC2 on each of the 3 million or so outputs.

While faidx is running, I would like to ask, is there a better way to do this? Splitting the files is very intensive, and I'd much rather run a program initially that would give me separate results for each sequence - Is this possible?

Thanks in advance for any help.

k-mer counting k-mers • 5.9k views

ADD COMMENT • link updated 7.6 years ago by Alex Reynolds 36k • written 7.6 years ago by ldyer2006 ▴ 50

0

Entering edit mode

Can you share a glimpse of your sequence file? Is it a multi-fasta ? I think I already a ready script which I could share. Looking forward to hear from you soon.

ADD REPLY • link 7.6 years ago by lakhujanivijay 5.9k

score 6 · Answer 1 · 2017-04-22

I wrote a C++14 program that's up on Github called kmer-counter. It does k-mer counting on FASTA input and other types of files.

It will do what you want to do — count k-mers per sequence — without the need to split your multifasta file into single sequence files. Which is a horrible idea, btw, as the I/O overhead would be ridiculous.

I think this should run decently fast for your needs, but of course you'd need to try it and decide that for yourself.

Assuming you have reasonably modern development tools installed, you could build it locally via something like:

$ git clone https://github.com/alexpreynolds/kmer-counter.git
$ cd kmer-counter
$ make

Say you have a FASTA file like this, where headers and sequences are on alternating lines:

$ more sequences.fa
>foo
TTAACG
>bar
GTGGAAGTTCTTAGGGCATGGCAAAGAGTCAGAATTTGAC

Then you can count 6-mers with kmer-counter by specifying --fasta and --k=6 as options to your sequences.fa FASTA file:

$ ./kmer-counter --fasta --k=6 sequences.fa 
>foo    CGTTAA:1 TTAACG:1
>bar    TTCTTA:1 TAGGGC:1 AAATTC:1 GTGGAA:1 AACTTC:1 AGTTCT:1 GCAAAG:1 AAAGAG:1 AAGAGT:1 TCAAAT:1 TGGAAG:1 GTTCTT:1 GTCAGA:1 TCTGAC:1 CATGGC:1 CGTTAA:1 GCATGG:1 TTTGCC:1 CTTTGC:1 TCAGAA:1 CTTAGG:1 TTAGGG:1 TGCCAT:1 TGACTC:1 ACTTCC:1 CAAATT:1 TTCTGA:1 GTCAAA:1 AGAACT:1 TCTTAG:1 CCTAAG:1 GCCATG:1 AGAATT:1 GGAAGT:1 AGTCAG:1 AATTTG:1 CCCTAA:1 ATTCTG:1 GAACTT:1 GAGTCA:1 CTCTTT:1 ATTTGA:1 CAGAAT:1 CCATGC:1 GGCAAA:1 ATGGCA:1 TTCCAC:1 ATGCCC:1 TGGCAA:1 CAAAGA:1 AAGAAC:1 AATTCT:1 TGCCCT:1 TAAGAA:1 GCCCTA:1 CTGACT:1 GAATTT:1 TTTGAC:1 CTAAGA:1 AGAGTC:1 GAAGTT:1 AAGTTC:1 GGGCAT:1 CATGCC:1 TTGCCA:1 GGCATG:1 AGGGCA:1 GACTCT:1 CTTCCA:1 TCTTTG:1 ACTCTT:1

This is a two-column text file. The header is in the first column and the mer-counts are in the second column.

If you want this redirected to a file, just use the redirection operator:

$ ./kmer-counter --fasta --k=6 sequences.fa > counts.txt

If you have a FASTA file where headers and sequences do not alternate lines (there are two or more lines of sequences between headers), there a lots of scripts on BioStars that show how to use awk or similar to preprocess your FASTA in single-line form, which run pretty quickly.

score 1 · Answer 2 · 2017-04-13

1

Entering edit mode

7.6 years ago

GenoMax 147k

Instead of pre-splitting the data can you not write a script that reads the sequences in and does the k-mer counts on the fly (kmercountexact.sh from BBMap suite would be another option).

ADD COMMENT • link 7.6 years ago by GenoMax 147k

score 0 · Answer 3 · 2017-04-14

Hi! I actually found a very useful script that does what I want, although it's going to take a while:

https://noble.gs.washington.edu/proj/nucsvm/fasta2matrix.py

My sequence file is just the result of a bedtools getfastafrombed command on my chosen ChIP-Seq dataset using hg18 as the reference genome, as I believe this is what they used for their original alignment.

score 0 · Answer 4 · 2017-04-14

0

Entering edit mode

7.6 years ago

ldyer2006 ▴ 50

Hi! I actually found a very useful script that does what I want, although it's going to take a while:

https://noble.gs.washington.edu/proj/nucsvm/fasta2matrix.py

Pre-splitting the data was definitely a poor decision..!!

ADD COMMENT • link 7.6 years ago by ldyer2006 ▴ 50

score 0 · Answer 5 · 2018-11-01

My fastatuples program will dump tuple information from fasta files, optionally also emitting sequence number and positions, into either text or binary formats:

http://saf.bio.caltech.edu/pub/software/molbio/fastatuples.c

If the number or tuples is small just use text mode and then "sort" and "uniq -d -c" to obtain the counts. For larger sets dump as binary then use "binorder" from the drm_tools package on sourceforge

https://sourceforge.net/projects/drmtools/

to do the same thing. Binorder isn't multithreaded but it isn't hard to break the data up into chunks (perhaps with

http://saf.bio.caltech.edu/pub/software/molbio/fastasplitn.c

). The data can be sorted and merged into the final form using binorder on the tip nodes doing -sorts feeding into a FIFO tree with a binorder doing -merge at each node. It needs a lot of memory to work that way though. If you want it back into text format after that there is "binformat" in the same package but the output will likely be gigantic. However, if you are going that long binary route with binorder then jellyfish is probably what you really want. Even though the counts are not exact (as they would be with binorder) they are quite close to exact, and will suffice 99% of the time.