I'm looking for an automated way to align multiple FASTA files (~50,000) to a custom-made reference (FASTA) and return the number of mismatches for each sequence. Each sequence and the reference are about 295 bp long. I have used bwa and bowtie to align sequences in the past, but I'm not sure whether they can return the number of mismatches. I'm looking for the simplest way to accomplish this task.
A while back I had a similar problem and I ended up writing SequenceMatcher, a wrapper around the alignment tools in BioJava:
Basically:
java -jar SequenceMatcher.jar match -a reference.fa -b queries.fa
The output gives you the number of mismatches (edit distance in fact) in column NM plus a number of similarity distances (Levenshtein, Hamming, Jaro-Winkler) and other metrics:
The advantage over other aligners is that every sequence in queries.fa is optimally aligned to every sequence in reference.fa (many to one, in your case) regardless of the number of mismatches. I.e. there is no filtering on alignment score or similar, and no drop in alignment quality to favour speed. (This comes at the expense of computational efficiency, of course, but for ~50,000 sequences vs 1 of length ~300 bp it should be fast enough.)
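To give a feel for what "optimally aligned, no filtering" means in practice, here is a minimal Python sketch of an all-vs-one edit distance calculation. It is not how SequenceMatcher is implemented; it is just the textbook Levenshtein recurrence, and the names `edit_distance` and `all_vs_one` are made up for this illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a base from a
                            curr[j - 1] + 1,      # insert a base into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def all_vs_one(reference: str, queries: dict) -> dict:
    """Map each query name to its edit distance from the single reference.

    `queries` is assumed to be a {name: sequence} dict (hypothetical layout).
    """
    return {name: edit_distance(seq, reference) for name, seq in queries.items()}
```

Every query gets a full comparison, so the cost is roughly (query length x reference length) per sequence. In pure Python that is slow but workable for ~300 bp reads; a compiled aligner will be much faster.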
PS: bwa and bowtie should give the edit distance (mismatches plus inserted and deleted bases) in the NM tag of their SAM output.
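If you go the bwa/bowtie route, pulling the NM value out of the SAM output could look something like the sketch below. It assumes plain-text SAM (not BAM) and that the aligner wrote an `NM:i:` tag; the function name `nm_per_read` is just something I made up here. Reads without the tag come back with `None`.

```python
def nm_per_read(sam_lines):
    """Yield (read_name, NM) for every mapped alignment line in SAM text."""
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x4:
            continue  # read is unmapped, so there is no edit distance
        nm = None
        # Optional tags start at column 12 and look like TAG:TYPE:VALUE.
        for tag in fields[11:]:
            if tag.startswith("NM:i:"):
                nm = int(tag[5:])
                break
        yield fields[0], nm
```

You would feed it an open file handle, e.g. `for name, nm in nm_per_read(open("aligned.sam")): print(name, nm)`, where `aligned.sam` is a placeholder for whatever your aligner produced.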
You are looking for an 'all-vs-one' pairwise alignment?
Off the top of my head I can't think of one that I know for sure gives the number of mismatches, but there are plenty of aligners to choose from: https://en.wikipedia.org/wiki/List_of_sequence_alignment_software#Pairwise_alignment
It may just be a case of using a heavy-duty program to do the aligning, and a small script to count the mismatches after the fact (but I'm sure there's a program that will have what you need).
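For the "small script" part, something along these lines might do, assuming your aligner can give you the two aligned sequences as equal-length strings with `-` for gaps. The function name `count_differences` and the gap character are my assumptions, so adjust them to whatever your aligner outputs.

```python
def count_differences(aligned_ref: str, aligned_query: str, gap: str = "-"):
    """Return (mismatches, gap_columns) for one gapped pairwise alignment."""
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have the same length")
    mismatches = 0
    gaps = 0
    for r, q in zip(aligned_ref, aligned_query):
        if r == gap or q == gap:
            gaps += 1          # insertion or deletion column
        elif r.upper() != q.upper():
            mismatches += 1    # substitution, ignoring case
    return mismatches, gaps
```

Counting gaps separately lets you choose between reporting substitutions only or an edit-distance-style total (mismatches plus gap columns).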