You can solve this with R/Bioconductor. For example, using the first 10 sequences of one of my FASTA files:
library(seqinr)      # read.fasta()
library(Biostrings)  # PDict(), DNAStringSet(), vcountPDict()
fasta_file <- read.fasta("mm10_refGene.fa", as.string = TRUE) # read the FASTA file; each sequence becomes one string
pattern <- "attag" # the pattern to look for
dict <- PDict(pattern, max.mismatch = 0) # build a dictionary from the pattern(s) you want to look for
seqs <- DNAStringSet(unlist(fasta_file)[1:10]) # make a DNAStringSet from the sequences (only the first ten for this example)
result <- vcountPDict(dict, seqs) # count the pattern in each of the sequences
result
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    1    1    0    2    3    1    0    1     4
The result is a matrix with one column per sequence. If you have several patterns, pass them to PDict() as a character vector; the result then has one row per pattern. With PDict() used as above, all patterns must have the same length.
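To illustrate the multi-pattern case, here is a minimal sketch with toy data. The three sequences, their names and the second pattern are made up for this example and stand in for your FASTA records; both patterns have the same length.

```r
library(Biostrings)

# Toy sequences standing in for the FASTA records (hypothetical data)
seqs <- DNAStringSet(c(seq1 = "ATTAGCCATTAG",
                       seq2 = "GGGGCATGC",
                       seq3 = "CATGCATTAG"))

# Several patterns of equal length, given as a character vector
patterns <- c("ATTAG", "CATGC")
dict <- PDict(patterns, max.mismatch = 0)

# One row per pattern, one column per sequence
counts <- vcountPDict(dict, seqs)
rownames(counts) <- patterns
colnames(counts) <- names(seqs)
counts
```

Labelling the rows with the patterns and the columns with the sequence names makes the matrix easier to read once it has more than one row.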
You can add the annotation of your fasta sequences as column names:
colnames(result) <- attr(fasta_file, "names")[1:10]
And save the matrix as a .csv:
write.csv2(result, "result.csv")
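Note that write.csv2() writes semicolon-separated values with a comma as the decimal mark, while write.csv() writes comma-separated values. If you would rather have one row per sequence in the file, something like the following sketch should work. The toy matrix, the sequence names and the column name `sequence` are assumptions made for illustration.

```r
# Toy count matrix in the same shape as `result`: one row per pattern,
# one column per sequence (hypothetical values and names)
result <- matrix(c(1L, 0L, 2L), nrow = 1,
                 dimnames = list("attag", c("seqA", "seqB", "seqC")))

# Transpose so that each sequence becomes a row, and keep the names as a column
df <- data.frame(sequence = colnames(result), t(result),
                 row.names = NULL, check.names = FALSE)

# Write a comma-separated file to a temporary location and read it back
out <- tempfile(fileext = ".csv")
write.csv(df, out, row.names = FALSE)
back <- read.csv(out)
back
```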
There is just one problem: if your sequences contain any characters that are not part of the DNA alphabet, DNAStringSet() will throw an error. The allowed letters are:
DNA_ALPHABET
[1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"
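One way around this is to check the raw strings before building the DNAStringSet. The sketch below uses toy strings (made up for this example) and shows two options: dropping the offending sequences, or replacing every disallowed character with "N". Whether replacing with "N" is appropriate depends on your data, so treat this as an assumption and not a recommendation.

```r
library(Biostrings)

# Toy input as plain strings (hypothetical data); "X" and "*" are not DNA letters
raw <- c(ok = "attagnnattag", bad1 = "attagXattag", bad2 = "acgt*acgt")

# TRUE for sequences whose characters are all in DNA_ALPHABET
is_valid <- vapply(strsplit(toupper(raw), ""),
                   function(chars) all(chars %in% DNA_ALPHABET),
                   logical(1))

# Option A: keep only the valid sequences
valid_seqs <- DNAStringSet(raw[is_valid])

# Option B: replace every character outside the IUPAC letters, "+" and "-" with "N"
cleaned <- gsub("[^ACGTMRWSYKVHDBN+-]", "N", toupper(raw))
names(cleaned) <- names(raw)
cleaned_seqs <- DNAStringSet(cleaned)

# Counting now works on the cleaned set
n_hits <- vcountPattern("ATTAG", cleaned_seqs)
n_hits
```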
Farhat's Perl script (A: 7n motif search over the genome) should solve your problem.