string matching with mismatches
9.6 years ago

I want to search for a string such as "ACCTGGATTTC" while allowing mismatches (substitutions, insertions, and deletions). Can I use grep for this?

alignment sequence • 3.2k views
9.6 years ago
h.mon 35k

I don't believe you can do this with plain grep, but you could use agrep (approximate grep). See this thread and this one for several other suggestions.


agrep sounds promising, but I am unable to install it from GitHub. Are there any alternative sources? Thanks.


On Debian/Ubuntu (you need to be root or use sudo):

apt-get install agrep
9.6 years ago
grep -E 'A.*C.*C.*T.*G.*G.*A.*T.*T.*T.*C'

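Good luck: this pattern tolerates only insertions (any number of extra characters between the letters), not substitutions or deletions. You'd better have a look at the Needleman-Wunsch algorithm: http://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm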
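
To make the dynamic-programming idea concrete, here is a minimal Python sketch of approximate substring search. It is not Needleman-Wunsch itself, which aligns two whole sequences, but a semi-global variant of the same recurrence in which a match may start anywhere in the text. The function name find_approx and the unit cost for every substitution, insertion, and deletion are assumptions made for illustration.

```python
def find_approx(pattern, text, max_edits):
    """Return (end, edits) pairs: 'end' is a 1-based position in text where
    some substring ending there matches pattern within max_edits edits."""
    m = len(pattern)
    # prev[i] = fewest edits needed to align pattern[:i] with a substring
    # of the text ending at the previous text position.
    prev = list(range(m + 1))  # before any text: i pattern characters missing
    hits = []
    for end, ch in enumerate(text, start=1):
        cur = [0] * (m + 1)  # cur[0] = 0: a match may start at any position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(
                prev[i - 1] + cost,  # match or substitution
                prev[i] + 1,         # extra character in the text
                cur[i - 1] + 1,      # pattern character missing from the text
            )
        if cur[m] <= max_edits:
            hits.append((end, cur[m]))
        prev = cur
    return hits


if __name__ == "__main__":
    # One substitution (G -> C) inside the motif.
    print(find_approx("ACCTGGATTTC", "GGGACCTGCATTTCGGG", 1))
```

The running time is proportional to len(pattern) * len(text), which is fine for a short motif but much slower than the dedicated tools mentioned in this thread when scanning large files.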

9.6 years ago

You can use BBDuk to match that pattern allowing substitutions or indels:

bbduk.sh in=sequences.fasta literal=ACCTGGATTTC edist=1 k=11 outm=matched.fasta out=unmatched.fasta

That will allow an edit distance of 1, and will be extremely fast.
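
For readers wondering what "an edit distance of 1" covers, the following Python sketch enumerates every string reachable from the 11-mer by at most one substitution, insertion, or deletion, and then checks whether a sequence contains any of them. This is only an illustration of the concept and makes no claim about how BBDuk is implemented; the function names and the ACGT alphabet are assumptions.

```python
def neighbors_within_one_edit(kmer, alphabet="ACGT"):
    """Return the set of strings within edit distance 1 of kmer."""
    variants = {kmer}
    for i in range(len(kmer) + 1):
        # Insertion of one base before position i (or at the end).
        for base in alphabet:
            variants.add(kmer[:i] + base + kmer[i:])
        if i < len(kmer):
            # Deletion of the base at position i.
            variants.add(kmer[:i] + kmer[i + 1:])
            # Substitution of the base at position i.
            for base in alphabet:
                variants.add(kmer[:i] + base + kmer[i + 1:])
    return variants


def contains_within_one_edit(sequence, kmer):
    """True if sequence contains kmer or any one-edit variant of it."""
    return any(v in sequence for v in neighbors_within_one_edit(kmer))


if __name__ == "__main__":
    motif = "ACCTGGATTTC"
    print(len(neighbors_within_one_edit(motif)))
    print(contains_within_one_edit("GGGACCTGCATTTCGGG", motif))
```

The number of variants grows quickly with the allowed distance, so this brute-force approach is only practical for very small edit distances.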


How can I install bbduk.sh? Thanks.


Download the BBMap package here, then unzip and untar it (tar xzf BBMap_34.94.tar.gz). It will then work as long as you have Java installed. You will need to either add the directory containing bbduk.sh to your PATH or type out the full path, e.g.

/user/bin/bbmap/bbduk.sh ... (other parameters).

...where the exact path is simply the location where you unpacked it.
