How To Find The Locations Of A Short Specific Sequence In A Genome With 1 Or 2 Mismatches Allowed?
11.4 years ago
William ★ 5.3k

We have a 23-nucleotide CRISPR target sequence, and I would like to find out whether it is also present at other locations in the genome.

The sequence directs a CRISPR RNA construct to introduce an indel mutation in the genome, and we would like to make sure that there is only one target locus. There is also one N in the nucleotide sequence.

Let's say the 23-nucleotide sequence is:

GGAGCGAGCGGAGCGGTACANGG

How do I find all the loci in a genome where this sequence matches, either exactly (apart from 1 mismatch at the N) or with an edit distance of, say, 2 or 3?

I tried BWA aln with a short 23 bp sequence taken from the human genome, using the parameters -l 23 -k 2, but it didn't find the location of that sequence. Does bwa work with sequences of this length?

I tried BLAST, but I get back a lot of results and I can't control the maximum edit distance.

bwa blast sequence • 6.7k views

PatMatch lets you control the number of mismatches and whether that number includes insertions, deletions, and/or substitutions. There is a stand-alone version of the software available as posted about here in response to a related question. (In fact, at the referenced resource you can run it right in your browser via a Jupyter environment served by MyBinder.org.) As far as I can tell, it cannot break that number down further, for example to allow at most 2 substitutions and 1 deletion.
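
To illustrate why it matters whether insertions and deletions count toward the allowed differences, here is a minimal Python sketch of approximate matching by edit distance (the classic dynamic-programming approach often attributed to Sellers). This is not PatMatch's implementation; the function name `approx_find` and the toy sequences are hypothetical, and the code is only practical for short texts.

```python
def approx_find(text, pattern, max_edits):
    """Return (end_position, edits) pairs, with 1-based end positions, for every
    place in `text` where some substring ending there is within `max_edits`
    substitutions, insertions, or deletions of `pattern`."""
    m = len(pattern)
    # prev[i] = cost of aligning pattern[:i] against an empty piece of text
    prev = list(range(m + 1))
    hits = []
    for j, base in enumerate(text, start=1):
        cur = [0]  # a match may start at any position in the text
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == base else 1
            cur.append(min(prev[i - 1] + cost,  # match or substitution
                           prev[i] + 1,         # extra base in the text
                           cur[i - 1] + 1))     # base missing from the text
        if cur[m] <= max_edits:
            hits.append((j, cur[m]))
        prev = cur
    return hits


if __name__ == "__main__":
    # The text contains GATACA, i.e. the pattern with one T deleted.
    for end, edits in approx_find("CCGATACACC", "GATTACA", max_edits=1):
        print(end, edits)
```

A substitution-only scan would miss the site in this toy example, because a single deleted base shifts every downstream position.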


But it looks like PatMatch only works for Arabidopsis.


@chahat_u PatMatch definitely isn't limited to Arabidopsis. Look at the other post I pointed at here. There are several web sites offering PatMatch working as a web tool for quite a few organisms beyond Arabidopsis. I list the ones I could find here. Additionally, as long as you have the sequence and go to https://github.com/fomightez/patmatch-binder and launch a binder session there, you can follow along with the example I set up and use another genome.

9.9 years ago

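Yes, bwa will find it, but you need to change the parameters. Do not use the seeded mode; use the slower -N mode, which disables the iterative search so that all hits within the allowed number of differences are reported: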

bwa aln -n 4 -o 0 -k 4 -N

The Sanger CRISPR site uses more or less these parameters.


Hi, I tried your method to find the genomic location of a DNA sequence in the hg19 genome, and I ran the following command -

bwa aln -n 4 -o 0 -k 4 -N hg19.fasta testmotif.fq > out.sai

But the out.sai file seemed to only have illegible stuff in it -

SAI  ÄÑÄø ˇˇˇ

Do you have some idea as to what could be going wrong?

5.9 years ago
Johan Zicola ▴ 70

Using Bowtie (for example v1.2.2 here) to find off-targets for defined CRISPR-Cas9 target sequences:

Make the Bowtie index for your genome (FASTA file format):

bowtie-build -f genome.fa  genome_prefix

Search for your target sequence, allowing 1 mismatch (for your N) with the flag -n 1:

bowtie genome_prefix -n 1 -c GGAGCGAGCGGAGCGGTACANGG

It should find your original sequence even with 1 mismatch (your N in this case). To allow 2 mismatches, use -n 2. Even though the -n argument accepts up to 3 mismatches, only 2 mismatches will be tolerated (I wrote an issue in their GitHub repository). The seed length is 28 by default, so you don't need to change it when working with CRISPR-Cas9 target sequences (typically 20 bp). See the Bowtie documentation for more.

Note: I use Bowtie because Bowtie2 allows at most 1 mismatch (in the seed, via -N), which is a drawback in this case. Note also that while you can search for a sequence containing Ns, Bowtie does not allow alignment to Ns contained in your reference genome (but Bowtie2 does). I think it would be nice to have the flexibility of Bowtie regarding the number of mismatches allowed together with the ability of Bowtie2 to align to sequences containing Ns. Despite this, Bowtie is used to identify off-targets in the most common web tools for sgRNA design, such as CHOPCHOP or CCTop.
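
For a small reference (a single contig or a test region), aligner hits can be cross-checked with a brute-force scan. The following is only a sketch in Python, not what Bowtie does internally. It assumes substitutions only (no indels), treats an N in the query as matching any base without counting it as a mismatch, and uses hypothetical names (`find_hits`, `toy_genome`). It is far too slow for a whole genome.

```python
# Brute-force scan of both strands for near-matches of a short query.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string (N stays N)."""
    return seq.translate(COMPLEMENT)[::-1]


def count_mismatches(pattern, window):
    """Count differing positions; an N in the pattern matches any base."""
    return sum(1 for p, w in zip(pattern, window) if p != "N" and p != w)


def find_hits(genome, query, max_mismatches=2):
    """Return sorted (start, strand, mismatches) tuples with 0-based starts
    on the forward strand of `genome`."""
    genome = genome.upper()
    query = query.upper()
    hits = []
    for strand, pattern in (("+", query), ("-", reverse_complement(query))):
        for start in range(len(genome) - len(pattern) + 1):
            window = genome[start:start + len(pattern)]
            mismatches = count_mismatches(pattern, window)
            if mismatches <= max_mismatches:
                hits.append((start, strand, mismatches))
    return sorted(hits)


if __name__ == "__main__":
    target = "GGAGCGAGCGGAGCGGTACANGG"
    # Toy sequence: one perfect site (N = T) and one site with a single substitution.
    toy_genome = ("TTTT" + "GGAGCGAGCGGAGCGGTACATGG" + "AAAA"
                  + "GGAGCGTGCGGAGCGGTACACGG" + "CC")
    for start, strand, mismatches in find_hits(toy_genome, target, max_mismatches=1):
        print(start, strand, mismatches)
```

Because the scan also checks the reverse complement of the query, sites on the minus strand are reported with their forward-strand start coordinate.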

11.4 years ago

vmatch is an excellent general aligner

The Vmatch large scale sequence analysis software
