Hello, I'm trying to count a specific k-mer in multi-FASTA files. They are in a directory that contains approximately 65 thousand such files. I have tried Bioconductor, but it is not possible to hold all the data in memory, so I would like to do this task with a Perl or Python script. I would like to modify the following script so that it counts the k-mer occurrences in the FASTA file.
#!/usr/bin/perl -w
# Searching for motifs
# Ask the user for the filename of the file containing
# the DNA sequence data, and collect it from the keyboard
print "Please type the filename of the DNA sequence data: ";
$dnafilename = <STDIN>;
# Remove the newline from the DNA filename
chomp $dnafilename;
# open the file, or exit
unless ( open(DNAFILE, $dnafilename) ) {
    print "Cannot open file \"$dnafilename\"\n\n";
    exit;
}
# Read the DNA sequence data from the file, and store it
# into the array variable @dna
@dna = <DNAFILE>;
# Close the file - we've read all the data into @dna now.
close DNAFILE;
# Put the DNA sequence data into a single string, as it's easier
# to search for a motif in a string than in an array of
# lines (what if the motif occurs over a line break?)
$dna = join( '', @dna);
# Remove whitespace
$dna =~ s/\s//g;
# In a loop, ask the user for a motif, search for the motif,
# and report if it was found.
# Exit if no motif is entered.
do {
    print "Enter a motif to search for: ";
    $motif = <STDIN>;
    # Remove the newline at the end of $motif
    chomp $motif;
    # Look for the motif
    if ( $dna =~ /$motif/ ) {
        print "I found it!\n\n";
    } else {
        print "I couldn\'t find it.\n\n";
    }
# exit on an empty user input
} until ( $motif =~ /^\s*$/ );
# exit the program
exit;
Isn't it better to use Jellyfish, KMC, KmerCountExact, or another ready-made solution?
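If a ready-made counter is not an option, here is a minimal Python sketch of the streaming approach. It reads one FASTA record at a time, so memory use is bounded by the longest single sequence rather than by the whole directory. The function names (`read_fasta`, `count_kmer`, `count_in_directory`) and the list of file extensions are my own assumptions, not taken from any library. It also assumes uncompressed text files, counts overlapping matches, ignores case, and looks at the forward strand only (no reverse complement).

```python
import os
import sys


def read_fasta(path):
    """Yield (header, sequence) pairs from a multi-FASTA file, one record at a time."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line and header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_kmer(sequence, kmer):
    """Count occurrences of kmer in sequence, including overlapping ones."""
    if not kmer:
        raise ValueError("kmer must not be empty")
    sequence, kmer = sequence.upper(), kmer.upper()
    count = 0
    start = sequence.find(kmer)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are also counted.
        start = sequence.find(kmer, start + 1)
    return count


def count_in_directory(directory, kmer, extensions=(".fa", ".fasta", ".fna")):
    """Yield (filename, total count) for each FASTA file in directory.

    The extension list is an assumption; adjust it to match your files.
    """
    for entry in sorted(os.scandir(directory), key=lambda e: e.name):
        if entry.is_file() and entry.name.lower().endswith(extensions):
            total = sum(count_kmer(seq, kmer) for _, seq in read_fasta(entry.path))
            yield entry.name, total


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python count_kmer.py /path/to/fasta_dir ACGT
    for name, total in count_in_directory(sys.argv[1], sys.argv[2]):
        print(f"{name}\t{total}")
```

Because the sequence lines of each record are joined before searching, a k-mer that spans a line break inside a record is still found, and the header lines are never searched. If you need per-sequence counts instead of per-file totals, print inside the `read_fasta` loop rather than summing.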