Question

Split A Multiple Sequence Alignment (Possibly A Programming Challenge)

2

Entering edit mode

13.9 years ago

Lee Katz ★ 3.2k

I have a very large multiple sequence alignment from MAUVE, which represents a large homologous region between several genomes. I want to fine-tune this alignment using MUSCLE or ClustalW. However, both of these programs cannot handle the multiple sequence alignment because it is so large. Is there a way to split this multiple sequence alignment into smaller pieces?

The split should be safe. In other words, split the multiple sequence alignment along homologous regions. I'm not sure if there is a standard algorithm or program to do this already, or if someone needs to write an innovative subroutine/function for it. Thank you for your help.

* Edit * Is it possible that the MAUVE command line option --max-backbone-gap or the option --max-gapped-aligner-length is my answer? --max-gapped-aligner-length? That MAUVE has a way of limiting the size of each LCB in the first place? The documentation doesn't explain it well though...

MAUVE CLI documentation: http://asap.ahabs.wisc.edu/mauve-aligner/mauve-user-guide/using-mauvealigner-from-the-command-line.html

multiple clustalw programming • 6.0k views

ADD COMMENT • link updated 13.9 years ago by Jarretinha 3.4k • written 13.9 years ago by Lee Katz ★ 3.2k

2

Entering edit mode

How large is your sequence? It will be nice to know the round figure so people here can know your challenges.

ADD REPLY • link 13.9 years ago by Thaman ★ 3.3k

0

Entering edit mode

How large is your sequence?

ADD REPLY • link 13.9 years ago by Thaman ★ 3.3k

0

Entering edit mode

I guess you should split on syntenic regions. How large is very large?

ADD REPLY • link 13.9 years ago by Dave Gerrard ▴ 190

0

Entering edit mode

Have you tried to use Exonerate instead of MUSCLE or Clustal?

ADD REPLY • link 13.9 years ago by Jarretinha 3.4k

0

Entering edit mode

One of my large alignments is 86636 nucleotides, 1.4 megabytes. Another one is 861 kilobytes.

I have not tried Exonerate yet but guessing from "exon" it is not exactly what I want.

ADD REPLY • link 13.9 years ago by Lee Katz ★ 3.2k

0

Entering edit mode

Exonerate is a general purpose aligner aimed at genome projects. You can even use it for gene models. By the way, the sizes of your aligments are small. MUSCLE should work . . .

ADD REPLY • link 13.9 years ago by Jarretinha 3.4k

0

Entering edit mode

Well, not so small . . .

ADD REPLY • link 13.9 years ago by Jarretinha 3.4k

0

Entering edit mode

I guess you should split on syntenic regions. How large is very large?

ADD REPLY • link 13.9 years ago by Dave Gerrard ▴ 190

Ram · Answer 1 · 2011-01-13

My solution:

# split a large MSA into smaller chunks

use Bio::Perl;
use Bio::AlignIO;
use Data::Dumper;
use List::Util qw[min max];
use strict;
use warnings;

exit(main(@ARGV));

sub main{
  my($lcbFile,$format)=@_;
  my @aln=splitLcb($lcbFile,$format);

  # write the split alignment
  for(my $i=0;$i<@aln;$i++){
    my $outFile="$lcbFile.part$i";
    print "Writing $outFile\n";
    my $out=Bio::AlignIO->new(-format=>"fasta",-file=>">$outFile");
    $out->write_aln($aln[$i]);
  }
  return 0;
}

# only looks at a single MSA in the input file (does not accept more than one aln in an aln file)
sub splitLcb{
  my($lcb,$format)=@_;
  # look for a set of xx nt that are homologous and split in the middle.
  my $minHomologousNucleotides=499;

  my @slicedAln; # result
  my $in=Bio::AlignIO->new(-file=>$lcb,-format=>$format);
  my $aln=$in->next_aln;

  # walk along the LCB to find a region of 100% identity
  my $alnLength=$aln->length;
  my $matchLine=$aln->match_line;

  # use the match line to find regions of homology
  SEARCH_FOR_HOMOLOGY: for(my $col=$minHomologousNucleotides*2;$col<$alnLength;$col++){
    if(is_col_identical($matchLine,$col)){
      # now see if the next xx nts are identical too
      my $length=min($alnLength,$col+$minHomologousNucleotides-1);
      for($col=$col;$col<$length;$col++){
        if(!is_col_identical($matchLine,$col)){
          next SEARCH_FOR_HOMOLOGY;
        }
      }
      my $homologyStart=($col-$minHomologousNucleotides+1);
      my $homologyStop=$col-1;
      print "Region of homology is $homologyStart - $homologyStop, and the minHomologousNucleotides=$minHomologousNucleotides, alignmentLength=$alnLength\n";

      # slice the alignment into two chunks
      $homologyStart++;$homologyStop++; # base-1 coordinates
      my $sliceCoordinate=int ($minHomologousNucleotides/2+$homologyStart);
      $slicedAln[0]=$aln->slice(1,$sliceCoordinate);
      $slicedAln[1]=$aln->slice($sliceCoordinate+1,$alnLength);
      last SEARCH_FOR_HOMOLOGY;
    }
  }
  return (@slicedAln);
}

sub is_col_identical{
  my($matchLine,$col)=@_;
  my $t=substr($matchLine,$col,1);
  if($t eq '*'){
    return 1;
  }
  return 0;
}

score 0 · Answer 2 · 2011-01-13

0

Entering edit mode

13.9 years ago

Jarretinha 3.4k

Hi Lee,

I don't see the point in splitting your MSA. MUSCLE and CLUSTAL can be compiled to run in 64 bit machines. So, your limit will be your memory (RAM + swap). Right now I'm running MUSCLE on a dozen of 150 kbp Herpesviridae genomes. It consumed all my RAM and a large piece of my swap. But it's running!

Nevertheless, I think you are in a much more delicate situation. Mauve is quite good in detecting indels/inversions/translocations and similar genome variations. You shouldn't use a progressive MSA program to refine your MSA at the cost of introducing much more gaps than necessary.

ADD COMMENT • link 13.9 years ago by Jarretinha 3.4k

0

Entering edit mode

I want to use MUSCLE on a MAUVE LCB because MAUVE gives a coarse alignment, and MUSCLE (or clustal) can give a more accurate alignment on a local level.

I need to split the LCB (mauve output) because it is extremely slow right now. It's been running since last week on two out of 70 of the LCBs. However, if I split the larger LCBs, then MUSCLE should run faster.

I don't believe I am introducing any gaps by splitting the LCB based on a region of homology, and so I don't think that I am introducing any inaccuracies.

ADD REPLY • link 13.9 years ago by Lee Katz ★ 3.2k

0

Entering edit mode

You should have said that you needed a speed optimization. In my experience with mauve (and progressiveMauve) if you are certain that you don't have large scale events in your genome the --collinear option with the right matrix will give results as good as MUSCLE. On the other hand, if do have inversions or similar MUSCLE will mess up your alignment. Anyway, if you genome segment is all collinear the solution is trivial: just split equally or randomly. If not, you'll to find collinear boundaries(your anchors). You're using homology. A region could be homologous without being collinear.

ADD REPLY • link 13.9 years ago by Jarretinha 3.4k