We have a 23-nucleotide CRISPR target sequence, and I would like to find out whether it is also present at other locations in the genome.
The sequence directs a CRISPR RNA construct to introduce an indel mutation in the genome, and we would like to make sure that there is only one target locus. There is also one N in the nucleotide sequence.
Let's say the 23-nucleotide sequence is:
GGAGCGAGCGGAGCGGTACANGG
How do I find all the loci in a genome where this sequence matches, either exactly (well, allowing any base at the N) or with an edit distance of, say, 2 or 3?
I tried BWA aln with a short 23 bp sequence taken from the human genome, using the parameters -l 23 -k 2, but it didn't find the location of the 23 bp again. Does BWA work with sequences of this length?
I tried BLAST, but I get back a lot of results and I can't control the maximum edit distance.
PatMatch lets you control the number of mismatches and whether they may include insertions, deletions, and/or substitutions. A stand-alone version of the software is available, as posted about here in response to a related question. (In fact, at the referenced resource you can run it right in your browser via a Jupyter environment served by MyBinder.org.) As far as I can tell, it cannot break that number down further, for example to a maximum of 2 substitutions and 1 deletion.
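To make the mismatch idea concrete, here is a minimal pure-Python sketch of a substitution-only search that treats N in the target as a wildcard and checks both strands. This is not how PatMatch works internally; the function names (`find_sites`, `count_mismatches`, `reverse_complement`) are made up for illustration, and it assumes the genome (or one chromosome) is already loaded as a plain string. It does not handle insertions or deletions, and a brute-force scan like this will be slow on a whole human genome, so take it as a way to sanity-check hits on small sequences rather than a replacement for a dedicated tool.

```python
# Substitution-only (Hamming distance) search; N in the pattern matches any base.
# All names here are hypothetical helpers, not part of PatMatch or BWA.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_mismatches(pattern, window, limit):
    """Count mismatches, stopping early once the count exceeds `limit`."""
    mismatches = 0
    for p, w in zip(pattern, window):
        # An N in the pattern is a wildcard; an N in the genome still
        # counts as a mismatch against a concrete pattern base.
        if p != "N" and p != w:
            mismatches += 1
            if mismatches > limit:
                break
    return mismatches


def find_sites(genome, pattern, max_mismatches=0):
    """Return sorted (start, strand, mismatches) tuples, 0-based starts.

    For '-' strand hits, `start` is the leftmost position of the site
    on the forward strand.
    """
    genome = genome.upper()
    pattern = pattern.upper()
    hits = []
    for strand, pat in (("+", pattern), ("-", reverse_complement(pattern))):
        for start in range(len(genome) - len(pat) + 1):
            window = genome[start:start + len(pat)]
            mm = count_mismatches(pat, window, max_mismatches)
            if mm <= max_mismatches:
                hits.append((start, strand, mm))
    return sorted(hits)
```

Calling `find_sites(chromosome_sequence, "GGAGCGAGCGGAGCGGTACANGG", max_mismatches=2)` would then list every position on either strand that differs from the target by at most two substitutions, not counting the N.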
But it looks like PatMatch only works for Arabidopsis.
@chahat_u PatMatch definitely isn't limited to Arabidopsis. Look at the other post I pointed at here. There are several web sites offering PatMatch working as a web tool for quite a few organisms beyond Arabidopsis. I list the ones I could find here. Additionally, as long as you have the sequence and go to https://github.com/fomightez/patmatch-binder and launch a binder session there, you can follow along with the example I set up and use another genome.