I have a very large multiple sequence alignment from MAUVE, which represents a large homologous region between several genomes. I want to fine-tune this alignment using MUSCLE or ClustalW. However, both of these programs cannot handle the multiple sequence alignment because it is so large. Is there a way to split this multiple sequence alignment into smaller pieces?
The split should be safe. In other words, split the multiple sequence alignment along homologous regions. I'm not sure if there is a standard algorithm or program to do this already, or if someone needs to write an innovative subroutine/function for it. Thank you for your help.
* Edit * Is it possible that the MAUVE command line option --max-backbone-gap or the option --max-gapped-aligner-length is my answer? --max-gapped-aligner-length? That MAUVE has a way of limiting the size of each LCB in the first place? The documentation doesn't explain it well though...
Exonerate is a general purpose aligner aimed at genome projects. You can even use it for gene models. By the way, the sizes of your aligments are small. MUSCLE should work . . .
# split a large MSA into smaller chunks
use Bio::Perl;
use Bio::AlignIO;
use Data::Dumper;
use List::Util qw[min max];
use strict;
use warnings;
exit(main(@ARGV));
sub main{
my($lcbFile,$format)=@_;
my @aln=splitLcb($lcbFile,$format);
# write the split alignment
for(my $i=0;$i<@aln;$i++){
my $outFile="$lcbFile.part$i";
print "Writing $outFile\n";
my $out=Bio::AlignIO->new(-format=>"fasta",-file=>">$outFile");
$out->write_aln($aln[$i]);
}
return 0;
}
# only looks at a single MSA in the input file (does not accept more than one aln in an aln file)
sub splitLcb{
my($lcb,$format)=@_;
# look for a set of xx nt that are homologous and split in the middle.
my $minHomologousNucleotides=499;
my @slicedAln; # result
my $in=Bio::AlignIO->new(-file=>$lcb,-format=>$format);
my $aln=$in->next_aln;
# walk along the LCB to find a region of 100% identity
my $alnLength=$aln->length;
my $matchLine=$aln->match_line;
# use the match line to find regions of homology
SEARCH_FOR_HOMOLOGY: for(my $col=$minHomologousNucleotides*2;$col<$alnLength;$col++){
if(is_col_identical($matchLine,$col)){
# now see if the next xx nts are identical too
my $length=min($alnLength,$col+$minHomologousNucleotides-1);
for($col=$col;$col<$length;$col++){
if(!is_col_identical($matchLine,$col)){
next SEARCH_FOR_HOMOLOGY;
}
}
my $homologyStart=($col-$minHomologousNucleotides+1);
my $homologyStop=$col-1;
print "Region of homology is $homologyStart - $homologyStop, and the minHomologousNucleotides=$minHomologousNucleotides, alignmentLength=$alnLength\n";
# slice the alignment into two chunks
$homologyStart++;$homologyStop++; # base-1 coordinates
my $sliceCoordinate=int ($minHomologousNucleotides/2+$homologyStart);
$slicedAln[0]=$aln->slice(1,$sliceCoordinate);
$slicedAln[1]=$aln->slice($sliceCoordinate+1,$alnLength);
last SEARCH_FOR_HOMOLOGY;
}
}
return (@slicedAln);
}
sub is_col_identical{
my($matchLine,$col)=@_;
my $t=substr($matchLine,$col,1);
if($t eq '*'){
return 1;
}
return 0;
}
I don't see the point in splitting your MSA. MUSCLE and CLUSTAL can be compiled to run in 64 bit machines. So, your limit will be your memory (RAM + swap). Right now I'm running MUSCLE on a dozen of 150 kbp Herpesviridae genomes. It consumed all my RAM and a large piece of my swap. But it's running!
Nevertheless, I think you are in a much more delicate situation. Mauve is quite good in detecting indels/inversions/translocations and similar genome variations. You shouldn't use a progressive MSA program to refine your MSA at the cost of introducing much more gaps than necessary.
I want to use MUSCLE on a MAUVE LCB because MAUVE gives a coarse alignment, and MUSCLE (or clustal) can give a more accurate alignment on a local level.
I need to split the LCB (mauve output) because it is extremely slow right now. It's been running since last week on two out of 70 of the LCBs. However, if I split the larger LCBs, then MUSCLE should run faster.
I don't believe I am introducing any gaps by splitting the LCB based on a region of homology, and so I don't think that I am introducing any inaccuracies.
You should have said that you needed a speed optimization. In my experience with mauve (and progressiveMauve) if you are certain that you don't have large scale events in your genome the --collinear option with the right matrix will give results as good as MUSCLE. On the other hand, if do have inversions or similar MUSCLE will mess up your alignment. Anyway, if you genome segment is all collinear the solution is trivial: just split equally or randomly. If not, you'll to find collinear boundaries(your anchors). You're using homology. A region could be homologous without being collinear.
How large is your sequence? It will be nice to know the round figure so people here can know your challenges.
How large is your sequence?
I guess you should split on syntenic regions. How large is very large?
Have you tried to use Exonerate instead of MUSCLE or Clustal?
One of my large alignments is 86636 nucleotides, 1.4 megabytes. Another one is 861 kilobytes.
I have not tried Exonerate yet but guessing from "exon" it is not exactly what I want.
Exonerate is a general purpose aligner aimed at genome projects. You can even use it for gene models. By the way, the sizes of your aligments are small. MUSCLE should work . . .
Well, not so small . . .
I guess you should split on syntenic regions. How large is very large?