Question

Degenerate Nucleotide Sequences

14.4 years ago

Max ▴ 20

Are there any programs available (Python, Perl, whatever), that can take a degenerate nucleotide sequence and translate it into its multiple possible oligos?

Any help would be appreciated.

sequence perl • 12k views

updated 14.0 years ago by Pierre Lindenbaum 166k • written 14.4 years ago by Max ▴ 20

Can you give us a typical example? Even the simple sequence ATGCNNNNATGC would generate 256 oligos...

14.4 years ago by Pierre Lindenbaum 166k
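
The size of the expansion can be checked before generating anything: it is the product of the number of bases each position stands for. A minimal Python sketch, assuming the standard IUPAC codes and Python 3.8+ for math.prod; the names DEGENERACY and count_oligos are made up for illustration:

```python
from math import prod

# Number of plain bases each IUPAC code stands for.
DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}

def count_oligos(seq):
    """Return how many oligos a degenerate sequence expands to."""
    # An unknown character raises KeyError rather than being silently skipped.
    return prod(DEGENERACY[base] for base in seq.upper())

print(count_oligos("ATGCNNNNATGC"))  # 4 * 4 * 4 * 4 = 256
```

Running this first makes it easy to refuse or truncate an expansion that would be too large to be useful.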

Ram · Accepted Answer · 2011-03-08

This program should do the job:

#include <ctype.h>  /* toupper */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define RUN(base) copy[index]=base; recurs(seq,copy,index+1,len)

static void recurs(const char *seq,char* copy,int index,const int len)
   {
   if(index==len)
      {
      fwrite(copy,sizeof(char),len,stdout);
      fputc('\n',stdout);
      }
   else
      {
      switch(toupper(seq[index]))
           {
           case 'A':case 'T': case 'G': case 'C':  RUN(seq[index]); break;
           case 'N':RUN('A');RUN('T');RUN('G');RUN('C');break;
           case 'W':RUN('A');RUN('T');break;
           case 'S':RUN('G');RUN('C');break;
           case 'B':RUN('T');RUN('G');RUN('C');break;
           case 'D':RUN('A');RUN('T');RUN('G');break;
           case 'H':RUN('A');RUN('T');RUN('C');break;
           case 'V':RUN('A');RUN('G');RUN('C');break;
           case 'K':RUN('G');RUN('T');break;
           case 'M':RUN('A');RUN('C');break;
           case 'R':RUN('A');RUN('G');break;
           case 'Y':RUN('C');RUN('T');break;
           default: fprintf(stderr,"Bad base in %s (%c)\n",seq,seq[index]); exit(EXIT_FAILURE);break;
           }
      }
   }

int main(int argc,char** argv)
   {
   char* seq;
   int len,i;
   if(argc!=2)
      {
      fprintf(stderr,"Usage : %s <dna>\n",argv[0]);
      return EXIT_FAILURE;
      }
   seq=argv[1];
   len=strlen(seq);
   char* copy=malloc((len+1)*sizeof(char));
   if(copy==NULL)
      {
      fprintf(stderr,"Out of memory\n");
      exit(EXIT_FAILURE);
      }
   copy[len]='\0';
   recurs(seq,copy,0,len);
   free(copy);
   return 0;
   }

Compilation:

gcc -O3 -o prg prg.c

Test:

> ./prg ATGCTGATCGAGCTANATCGATCGGACTACY
ATGCTGATCGAGCTAAATCGATCGGACTACC
ATGCTGATCGAGCTAAATCGATCGGACTACT
ATGCTGATCGAGCTATATCGATCGGACTACC
ATGCTGATCGAGCTATATCGATCGGACTACT
ATGCTGATCGAGCTAGATCGATCGGACTACC
ATGCTGATCGAGCTAGATCGATCGGACTACT
ATGCTGATCGAGCTACATCGATCGGACTACC
ATGCTGATCGAGCTACATCGATCGGACTACT

Ram · Accepted Answer · 2011-03-08

Like Pierre said, this is a combinatorial problem that can quickly blow up in your face. That aside, it is easy to generate all the combinations using recursion. Here is a compact implementation in Perl:

#!/usr/bin/perl -w

# Lookup table of degenerate IUPAC nucleotide codes.
my %deg2nuc = (
    "R" => ["A", "G"],
    "Y" => ["C", "T"],
    "S" => ["G", "C"],
    "W" => ["A", "T"],
    "K" => ["G", "T"],
    "M" => ["A", "C"],
    "B" => ["C", "G", "T"],
    "D" => ["A", "G", "T"],
    "H" => ["A", "C", "T"],
    "V" => ["A", "C", "G"],
    "N" => ["A", "C", "G", "T"]
);

# Recursive function that replaces degenerate nucleotides with all combinations.
sub generate
{
    if ($_[0] =~ /(.*)([RYSWKMBDHVN])(.*)/) {
        my $head = $1;
        my $tail = $3;
        my @seqs;
        foreach my $nuc (@{$deg2nuc{$2}}) {
            push @seqs, generate($head.$nuc.$tail);
        }
        return @seqs;
    }
    else {
        return $_[0];
    }
}

# Demo: print all sequences generated from ANCRG.
print join("\n", generate("ANCRG")), "\n";
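
Since the question also mentions Python, the same expansion can be sketched there without recursion, using itertools.product from the standard library. This is only an illustrative sketch, not a port of the script above: the names IUPAC and expand are made up here, and the oligos may come out in a different order than from the Perl version.

```python
from itertools import product

# IUPAC code -> plain bases it can stand for (plain bases map to themselves).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def expand(seq):
    """Lazily yield every oligo encoded by a degenerate sequence."""
    # One string of candidate bases per position; KeyError on an unknown code.
    choices = [IUPAC[base] for base in seq.upper()]
    for combo in product(*choices):
        yield "".join(combo)

# Demo: print all sequences generated from ANCRG.
for oligo in expand("ANCRG"):
    print(oligo)
```

Because expand is a generator, the oligos are produced one at a time instead of being collected in a list, which helps when the sequence contains many degenerate positions.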

EDIT:

On second thought, that was not at all compact by Perl standards. Here is the really compact version that ventures into code golf territory:

#!/usr/bin/perl -w

my %A = ("R"=>1, "W"=>1, "M"=>1, "D"=>1, "H"=>1, "V"=>1, "N"=>1);
my %C = ("Y"=>1, "S"=>1, "M"=>1, "B"=>1, "H"=>1, "V"=>1, "N"=>1);
my %G = ("R"=>1, "S"=>1, "K"=>1, "B"=>1, "D"=>1, "V"=>1, "N"=>1);
my %T = ("Y"=>1, "W"=>1, "K"=>1, "B"=>1, "D"=>1, "H"=>1, "N"=>1);

$_ = "ANCRG\n";
while (s/(.*)([RYSWKMBDHVN])(.*)\n/(exists $A{$2} ? "$1A$3\n" : "").(exists $C{$2} ? "$1C$3\n" : "").(exists $G{$2} ? "$1G$3\n" : "").(exists $T{$2} ? "$1T$3\n" : "")/e) {}
print;
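
The four hashes %A, %C, %G and %T are the %deg2nuc table from the first script turned inside out: each lists the degenerate codes that include that base. A small Python sketch of that inversion, handy for checking the hand-written hashes; DEG2NUC and invert are hypothetical names used only here:

```python
# Same code-to-bases table as %deg2nuc in the first Perl script.
DEG2NUC = {
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def invert(table):
    """Map each plain base to the set of degenerate codes that include it."""
    by_base = {base: set() for base in "ACGT"}
    for code, bases in table.items():
        for base in bases:
            by_base[base].add(code)
    return by_base

# Print one line per base, with its codes in alphabetical order.
for base, codes in sorted(invert(DEG2NUC).items()):
    print(base, "".join(sorted(codes)))
```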