Question

How To Find Sequences With A Given Number Of Mismatches

5

Entering edit mode

14.6 years ago

Emmecola ▴ 180

Hi! Is there any script or tool that allows to find similar sequences given the maximum acceptable number of mismatches? What I want is something similar to a Blastn search, but instead of filtering with the e-value I want to filter with the number of mismatches. I've tried with Perl regular expressions but I couldn't find a solution. Thanks!

blast similarity sequence perl • 10k views

ADD COMMENT • link updated 14.6 years ago by Hanif Khalak ★ 1.3k • written 14.6 years ago by Emmecola ▴ 180

Ram · Answer 1 · 2010-09-29

9

Entering edit mode

14.6 years ago

Larry_Parnell 16k

You could parse the BLASTN results and examine those results for the information you desire. Specifically, the "identities" and "Gaps" values can be used to tell you the size of the HSP (high-scoring pair) and the number of unaligned or mismatching positions. To do this correctly, turn OFF the low-complexity filter because low-complexity seq in a BLASTN HSP will be counted as "not matching.".

Then, filter these tabularized results for those that are within your threshold of acceptance - say no more than X mismatches per 100 or 1000 residues aligned.

This can get problematic if you are aligning unlike sequences - say genes (i.e, exons and introns) from two species or across a gene family. Introns lengths are more variable than exon sizes and there is lower similarity between introns over exons, but it may be exactly what you wish to legitimately compare.

ADD COMMENT • link 14.6 years ago by Larry_Parnell 16k

1

Entering edit mode

Bio::SearchIO from bioperl will allow you to do this. In fact there are examples in the How-to that can easily be adapted for this functionality.

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 14.6 years ago by Alastair Kerr 5.3k

0

Entering edit mode

Thanks, Alastair!

ADD REPLY • link 14.6 years ago by Larry_Parnell 16k

Ram · Answer 2 · 2010-09-29

A simple-and-stupid solution in 'C': a simple loop of the query over the object sequence.

#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <ctype.h>
static const int MAX_MISMATCH=5;
#define SEQMAX 10000
static void align(
    const char* query,int query_len,
    const char* object,int object_len)
    {
    int i;
    const int shift=object_len-query_len;
    if(shift<0) return;
    for(i=0;i<=shift;++i)
        {
        int n_mismatch=0;
        int j;
        for(j=0;j< query_len;++j)
            {
            if(toupper(query[j])!=toupper(object[j+i]))
                {
                n_mismatch++;
                if(n_mismatch>=MAX_MISMATCH) break;
                }
            }
        if(n_mismatch<MAX_MISMATCH)
            {
            printf("%s vs %s: %d\n",query,object,n_mismatch);
            return;
            }
        }
    }

int main(int argc,char** argv)
    {
    char object[SEQMAX];
    char* query=argv[1];
    int query_len=strlen(argv[1]);
    while(fgets(object,SEQMAX,stdin)!=NULL)
        {
        int object_len=strlen(object);
        if(object[object_len-1]=='\n')
            {
            object[object_len-1]=0;
            object_len--;
            }
        if(object[0]==0) continue;
        align(query,query_len,object,object_len);
        }
    return 0;
    }

test:

gcc file.c
echo "AAAAATAAAAATCAAAAA" | ./a.out AAAAAAAAAAA
AAAAAAAAAAA vs AAAAATAAAAATCAAAAA: 1
echo "GGGGAGGAAAGGGAAGGA" | ./a.out AAAAAAAAAAA
(nothing)

Ram · Answer 3 · 2010-10-19

3

Entering edit mode

14.5 years ago

Hanif Khalak ★ 1.3k

A simple and useful command-line tool is agrep, an "approximate grep" which is available as the original or as part of the TRE package. To find hits within 5 mismatches, you would use something like:

agrep -5 patternstring target.file

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 14.5 years ago by Hanif Khalak ★ 1.3k

0

Entering edit mode

Always nice to discover good Unix tools, thanks!

ADD REPLY • link 14.5 years ago by Eric Normandeau 11k

score 3 · Answer 4 · 2010-10-19

3

Entering edit mode

14.5 years ago

Dave Lunt ★ 2.0k

I think the motu_define.pl script from Mark Blaxter's lab does exactly this, it first uses BLAST then filters according to a user determined maximum permissible number of mismatches. http://www.nematodes.org/bioinformatics/MOTU/

There is a newer (java) version here: http://www.nematodes.org/bioinformatics/jMOTU/index.shtml

Also some very impressive solutions coming from Robert Edgar, I think UCLUST will do what you want (and VERY quickly). http://www.drive5.com/usearch/

Both of these you can set number of permitted mismatches per unit length.

ADD COMMENT • link 14.5 years ago by Dave Lunt ★ 2.0k

0

Entering edit mode

Thanks for both links! I still have to try UBLAST. Has anyone used it? Is it very fast and sensitive? Cheers.

ADD REPLY • link 14.5 years ago by Eric Normandeau 11k