I need to implement a quality trimming algorithm for capillary sequence reads (yes, they still exist!) in a Perl package of my own, so I am trying to implement the algorithm used by the PHRED software as well as other tools. There is a description of the Mott algorithm here: https://www.codoncode.com/support/faqs/trimming.htm (under "More about trimming") and in the PHRED manual (see --trim_alt) here: http://www.phrap.org/phredphrap/phred.html
There is an implementation of the algorithm in BioPyhton here: http://biopython.org/DIST/docs/api/Bio.SeqIO.AbiIO-pysrc.html#_abi_trim
But that confuses me: while the original algorithm tries to identify the highest scoring sub-sequence, the BioPython algorithm starts the trimmed sequence at the first base that reaches the quality threshold. That seems strange to me because that could well be followed by a run of bases with very low quality, which should be trimmed.
I have interpreted the Mott algorithm as follows in Perl, which seems to produce good results when I manually check the trimming of some sequence reads I have got. But I am slightly worried that I am missing something because the BioPhyton interpretation is so different. Another thing that I don't get is that the description of the original algorithm says "the score can have non-negative values only" but if I set all negative scores to zero, there is no penalty for low quality scores anymore or am I missing something?
sub quality_trimmed_seq {
my $seq = shift or croak("need a Bio::Seq object");
if ( !blessed( $seq ) or !$seq->isa('Bio::Seq::SequenceTrace') ){
croak("input to quality_trimmed_seq must be a 'Bio::Seq::SequenceTrace object, not a".
(ref $seq||'scalar')
)
}
my $cutoff = shift || 0.05;
my $segment_len = shift || 20; # minimum length of a segment;
die "segment_len must be > 1" if $segment_len < 1;
# if the sequence is too short, return undef
return undef if $seq->length < $segment_len;
# transform the Q values into error probabilities
my @scores;
foreach my $qual ( @{ $seq->qual } ){
my $score = $cutoff - ( 10 ** ( $qual / -10.0 ) ) ;
push @scores, $score;
}
# traverse all possible sub-segments of the sequence
# and calculate a sum score
my $max_score = 0;
my $trim_start = 1;
my $trim_end = $seq->length;
foreach my $start ( 1 .. $seq->length - 1 ){
my $first_end = $start + $segment_len - 1;
next if $first_end > $seq->length;
foreach my $end ( $first_end .. $seq->length){
my $score_sum = 0;
foreach my $base_i ( $start - 1 .. $end - 1 ){
$score_sum += $scores[$base_i];
if ( $score_sum > $max_score ){
$max_score = $score_sum;
$trim_start = $start;
$trim_end = $end;
}
}
}
}
if ( $max_score > 0 ){
return $seq->sub_trace_object( $trim_start, $trim_end);
} else {
return undef;
}
}