Question

specific k-mer count in multi fasta file

0

Entering edit mode

7.0 years ago

jr5731173 • 0

Hello, I'm trying to count a specific kmer in multifasta files contained in a directory with approximately 65 thousand multifasta files, I have tried with Bioconductor but not it's possible to save all data in memory, and I would like to do this task with a perl or python script. I would like to modify the next script to count the kmer occurrences in the fasta file.

#!/usr/bin/perl -w 
# Searching for motifs 
# Ask the user for the filename of the file containing 
# the protein sequence data, and collect it from the 
keyboard 
print "Please type the filename of the DNA sequence 
data: "; 
$dnafilename = <STDIN>; 
# Remove the newline from the DNA filename 
chomp $dnafilename; 
# open the file, or exit 
unless ( open(DNAFILE, $dnafilename) ) { 
    print "Cannot open file \"$dnafilename\"\n\n"; 
    exit; 
} 
# Read the dna sequence data from the file, and store 
it 
# into the array variable @protein 
@dna = <DNAFILE>; 
# Close the file - we've read all the data into @dna 
now. 
close DNAFILE; 
# Put the DNA sequence data into a single string, as 
it's easier 
# to search for a motif in a string than in an array of 
# lines (what if the motif occurs over a line break?) 
$dna = join( '', @dna); 
# Remove whitespace 
$dna =~ s/\s//g; 
# In a loop, ask the user for a motif, search for the motif, 
# and report if it was found. 
# Exit if no motif is entered. 
do { 
    print "Enter a motif to search for: "; 
    $motif = <STDIN>; 
    # Remove the newline at the end of $motif 
    chomp $motif; 
    # Look for the motif 
    if ( $dna =~ /$motif/ ) { 
        print "I found it!\n\n"; 
    } else { 
        print "I couldn\'t find it.\n\n"; 
    } 
# exit on an empty user input 
} until ( $motif =~ /^\s*$/ ); 
# exit the program 
exit;

sequence • 4.3k views

ADD COMMENT • link updated 4.2 years ago by schlogl ▴ 180 • written 7.0 years ago by jr5731173 • 0

0

Entering edit mode

Isn't it better to use Jellyfish, KMC, KmerCountExact, or another ready-made solution?

ADD REPLY • link 7.0 years ago by h.mon 35k

score 2 · Answer 1 · 2018-08-15

2

Entering edit mode

7.0 years ago

Alex Reynolds 36k

You could use kmer-counter and do something like:

$ ./kmer-counter --fasta --k=8 sequences.fa | grep -wF 'ACCCTGGC' > answer.txt

Adjust for your k-length and mer of interest.

If you want to use Python, there's a test script that demonstrates how to render mer-counts from kmer-counter into a Python dictionary, over which you can iterate and access counts directly for your kmer of interest.

ADD COMMENT • link 7.0 years ago by Alex Reynolds 36k

0

Entering edit mode

Do you have a conda installer? Thanks

ADD REPLY • link 4.2 years ago by schlogl ▴ 180

1

Entering edit mode

No, but installation is easy:

$ git clone https://github.com/alexpreynolds/kmer-counter.git
$ cd kmer-counter
$ make
$ cp kmer-counter /usr/bin/local
$ kmer-counter --help
...

Hope this is still useful. If you need something via conda, there is jellyfish: https://anaconda.org/conda-forge/jellyfish