Question

Seeking tool for identifying hybridization probes in gDNA -- more like Southern blotting than PCR/microarray

0

Entering edit mode

10.9 years ago

cwhooker • 0

I need an app, something like a primer design app, that will scan a sequence (genome) and identify sections that are likely to work as hybridization probes in an application we are developing. We think the probes should be:

around 200bp long
not less than 55% GC
devoid of runs of more than 3 T's
and a few more similar "rules"

but wherever you see a number there, I need to be able to input that as a variable, because I need to use the software together with my physical method to refine the algorithm. That is, vary the parameters to generate multiple probes, test the probes in my application, and based on their performance go back and re-tweak the parameters; lather, rinse, repeat.

I also need the probes to have minimal repetitive/low-complexity sequence and to be, in practical terms, unique to the target organism in question, but I can deal with the last two requirements using other software, e.g. BLAST, if necessary.

So I have two specific questions:

Does this thing already exist, and if so how do I get my hands on it?
If it doesn't already exist, does that sound like something you would be interested in building? It would be a paying contract, so depending on answers perhaps I should post separately under "jobs".

probe-design sequence-scanning-algorithm • 3.5k views

ADD COMMENT • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by cwhooker • 0

Ram · Answer 1 · 2014-06-05

1

Entering edit mode

10.9 years ago

Danielk ▴ 640

Primer3 can design hyb probes.

ADD COMMENT • link updated 5.3 years ago by Ram 45k • written 10.9 years ago by Danielk ▴ 640

0

Entering edit mode

Thanks, Daniel. I looked at Primer3, but unless I'm too dumb to figure out the relevant settings (possible!) it won't do the sequence-based rules like "no runs of >3 T's" or "no more than 20 bp between adjacent G's" that I need.

ADD REPLY • link 10.9 years ago by cwhooker • 0

0

Entering edit mode

I ended up writing a little script that applies such filters after Primer3 designed six candidate pairs over each target. Your 20bp rule will probably need manual processing no matter the tool you use. As a command line program, Primer3 is the easiest to script batch runs for.

ADD REPLY • link 10.9 years ago by karl.stamm 4.1k

Ram · Answer 2 · 2014-06-05

Here's something to get you (or anyone else) started. It requires BioPerl. You supply the sequence file (fasta format), minimum GC %, k-mer length and maximum consecutive "T" as command-line arguments, e.g.

perl getkmers.pl --seqfile myfile.fa --max_t 3 --min_gc 55 --length 50 > outfile

The script prints out the k-mer, GC % and length of the maximum consecutive "T". You could rewrite the code so as to use the values supplied to filter before printing the output. Or you could print everything to a file and filter later, in a spreadsheet or whatever. Of course for large files (genomes), you probably do not want to write out every k-mer.

head outfile
GTAGGGTCGAGTGGCCAGCTCTGCCGATTTCAACAGGGCTAGGGGTAGGT    60    3
TAGGGTCGAGTGGCCAGCTCTGCCGATTTCAACAGGGCTAGGGGTAGGTT    58    3
AGGGTCGAGTGGCCAGCTCTGCCGATTTCAACAGGGCTAGGGGTAGGTTT    58    3
GGGTCGAGTGGCCAGCTCTGCCGATTTCAACAGGGCTAGGGGTAGGTTTA    58    3
GGTCGAGTGGCCAGCTCTGCCGATTTCAACAGGGCTAGGGGTAGGTTTAT    56    3
GTCGAGTGGCCAGCTCTGCCGATTTCAACAGGGCTAGGGGTAGGTTTATG    56    3
TCGAGTGGCCAGCTCTGCCGATTTCAACAGGGCTAGGGGTAGGTTTATGT    54    3
CGAGTGGCCAGCTCTGCCGATTTCAACAGGGCTAGGGGTAGGTTTATGTT    54    3
GAGTGGCCAGCTCTGCCGATTTCAACAGGGCTAGGGGTAGGTTTATGTTT    52    3
AGTGGCCAGCTCTGCCGATTTCAACAGGGCTAGGGGTAGGTTTATGTTTT    50    4

Here's the code (rough and ready without validation, checks, tests etc.).

#!/usr/bin/perl -w

use strict;
use Getopt::Long;
use Bio::SeqIO;
use Bio::Tools::SeqStats;

GetOptions("seqfile=s" => \my $seqfile, # fasta file
           "length=i"  => \my $length,  # kmer length
           "min_gc=i"  => \my $min_gc,  # >= this GC%
           "max_t=i"   => \my $max_t);  # max consecutive T

my $seqio = Bio::SeqIO->new(-file => $seqfile, -format => "fasta");
while(my $seq = $seqio->next_seq) {
    my $seqlen = $seq->length;
    for(my $i = 1; $i <= $seqlen - ($length-1); $i++) {
        my $kmer  = $seq->subseq($i, $i+($length-1));
        my $stats = Bio::Tools::SeqStats->new(-seq => Bio::Seq->new(-seq => $kmer));
        my $mers  = $stats->count_monomers(); 
        my $gc    = 100 * ($mers->{'G'} + $mers->{'C'}) / $length;
        my $t     = 0;
        while($kmer =~ /T{$max_t,}/g) {
            $t = length($&) if length($&) > $t; # longest consecutive run of T
        }
        print $kmer, "\t", $gc, "\t", $t, "\n";
    }
}

score 0 · Answer 3 · 2014-06-05

0

Entering edit mode

10.9 years ago

pld 5.1k

http://www.ncbi.nlm.nih.gov/tools/primer-blast/

You can always check the GC content after you get your primers back.

ADD COMMENT • link 10.9 years ago by pld 5.1k

0

Entering edit mode

Thanks, Joe. As with Primer3 alone, adding the "BLAST for unique sequences" doesn't get me the parameterizable rules. I could try to get fancy with degenerate (mixed) bases but I can't see a way to include rules like "no more than X bp between A's" by doing that.

ADD REPLY • link 10.9 years ago by cwhooker • 0