How To Find Sequences With A Given Number Of Mismatches
4
5
Entering edit mode
14.2 years ago
Emmecola ▴ 180

Hi! Is there any script or tool that allows to find similar sequences given the maximum acceptable number of mismatches? What I want is something similar to a Blastn search, but instead of filtering with the e-value I want to filter with the number of mismatches. I've tried with Perl regular expressions but I couldn't find a solution. Thanks!

blast similarity sequence perl • 10.0k views
ADD COMMENT
9
Entering edit mode
14.2 years ago

You could parse the BLASTN results and examine those results for the information you desire. Specifically, the "identities" and "Gaps" values can be used to tell you the size of the HSP (high-scoring pair) and the number of unaligned or mismatching positions. To do this correctly, turn OFF the low-complexity filter because low-complexity seq in a BLASTN HSP will be counted as "not matching.".

Then, filter these tabularized results for those that are within your threshold of acceptance - say no more than X mismatches per 100 or 1000 residues aligned.

This can get problematic if you are aligning unlike sequences - say genes (i.e, exons and introns) from two species or across a gene family. Introns lengths are more variable than exon sizes and there is lower similarity between introns over exons, but it may be exactly what you wish to legitimately compare.

ADD COMMENT
1
Entering edit mode

Bio::SearchIO from bioperl will allow you to do this. In fact there are examples in the How-to that can easily be adapted for this functionality.

ADD REPLY
0
Entering edit mode

Thanks, Alastair!

ADD REPLY
4
Entering edit mode
14.2 years ago

A simple-and-stupid solution in 'C': a simple loop of the query over the object sequence.

#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <ctype.h>
static const int MAX_MISMATCH=5;
#define SEQMAX 10000
static void align(
    const char* query,int query_len,
    const char* object,int object_len)
    {
    int i;
    const int shift=object_len-query_len;
    if(shift<0) return;
    for(i=0;i<=shift;++i)
        {
        int n_mismatch=0;
        int j;
        for(j=0;j< query_len;++j)
            {
            if(toupper(query[j])!=toupper(object[j+i]))
                {
                n_mismatch++;
                if(n_mismatch>=MAX_MISMATCH) break;
                }
            }
        if(n_mismatch<MAX_MISMATCH)
            {
            printf("%s vs %s: %d\n",query,object,n_mismatch);
            return;
            }
        }
    }

int main(int argc,char** argv)
    {
    char object[SEQMAX];
    char* query=argv[1];
    int query_len=strlen(argv[1]);
    while(fgets(object,SEQMAX,stdin)!=NULL)
        {
        int object_len=strlen(object);
        if(object[object_len-1]=='\n')
            {
            object[object_len-1]=0;
            object_len--;
            }
        if(object[0]==0) continue;
        align(query,query_len,object,object_len);
        }
    return 0;
    }

test:

gcc file.c
echo "AAAAATAAAAATCAAAAA" | ./a.out AAAAAAAAAAA
AAAAAAAAAAA vs AAAAATAAAAATCAAAAA: 1
echo "GGGGAGGAAAGGGAAGGA" | ./a.out AAAAAAAAAAA
(nothing)
ADD COMMENT
3
Entering edit mode
14.1 years ago
Hanif Khalak ★ 1.3k

A simple and useful command-line tool is agrep, an "approximate grep" which is available as the original or as part of the TRE package. To find hits within 5 mismatches, you would use something like:

agrep -5 patternstring target.file
ADD COMMENT
0
Entering edit mode

Always nice to discover good Unix tools, thanks!

ADD REPLY
3
Entering edit mode
14.1 years ago
Dave Lunt ★ 2.0k

I think the motu_define.pl script from Mark Blaxter's lab does exactly this, it first uses BLAST then filters according to a user determined maximum permissible number of mismatches. http://www.nematodes.org/bioinformatics/MOTU/

There is a newer (java) version here: http://www.nematodes.org/bioinformatics/jMOTU/index.shtml

Also some very impressive solutions coming from Robert Edgar, I think UCLUST will do what you want (and VERY quickly). http://www.drive5.com/usearch/

Both of these you can set number of permitted mismatches per unit length.

ADD COMMENT
0
Entering edit mode

Thanks for both links! I still have to try UBLAST. Has anyone used it? Is it very fast and sensitive? Cheers.

ADD REPLY

Login before adding your answer.

Traffic: 2337 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6