Question

Capillary sequence trimming algorithm

1

Entering edit mode

6.8 years ago

tospo ▴ 50

I need to implement a quality trimming algorithm for capillary sequence reads (yes, they still exist!) in a Perl package of my own, so I am trying to implement the algorithm used by the PHRED software as well as other tools. There is a description of the Mott algorithm here: https://www.codoncode.com/support/faqs/trimming.htm (under "More about trimming") and in the PHRED manual (see --trim_alt) here: http://www.phrap.org/phredphrap/phred.html

There is an implementation of the algorithm in BioPyhton here: http://biopython.org/DIST/docs/api/Bio.SeqIO.AbiIO-pysrc.html#_abi_trim

But that confuses me: while the original algorithm tries to identify the highest scoring sub-sequence, the BioPython algorithm starts the trimmed sequence at the first base that reaches the quality threshold. That seems strange to me because that could well be followed by a run of bases with very low quality, which should be trimmed.

I have interpreted the Mott algorithm as follows in Perl, which seems to produce good results when I manually check the trimming of some sequence reads I have got. But I am slightly worried that I am missing something because the BioPhyton interpretation is so different. Another thing that I don't get is that the description of the original algorithm says "the score can have non-negative values only" but if I set all negative scores to zero, there is no penalty for low quality scores anymore or am I missing something?

sub quality_trimmed_seq {
  my $seq = shift or croak("need a Bio::Seq object");
  if ( !blessed( $seq ) or !$seq->isa('Bio::Seq::SequenceTrace') ){
    croak("input to quality_trimmed_seq must be a 'Bio::Seq::SequenceTrace object, not a".
      (ref $seq||'scalar')
    )
  }
  my $cutoff = shift || 0.05;
  my $segment_len = shift || 20;  # minimum length of a segment;
  die "segment_len must be > 1" if $segment_len < 1;

  # if the sequence is too short, return undef
  return undef if $seq->length < $segment_len;

  # transform the Q values into error probabilities
  my @scores;
  foreach my $qual ( @{ $seq->qual } ){
    my $score = $cutoff - ( 10 ** ( $qual / -10.0 ) ) ;
    push @scores, $score;
  }

  # traverse all possible sub-segments of the sequence
  # and calculate a sum score
  my $max_score = 0;
  my $trim_start = 1;
  my $trim_end = $seq->length;
  foreach my $start ( 1 .. $seq->length - 1 ){
    my $first_end = $start + $segment_len - 1;
    next if $first_end > $seq->length;
    foreach my $end ( $first_end .. $seq->length){
      my $score_sum = 0;
      foreach my $base_i ( $start - 1 .. $end - 1  ){
        $score_sum += $scores[$base_i];
        if ( $score_sum > $max_score ){
          $max_score = $score_sum;
          $trim_start = $start;
          $trim_end = $end;
        }
      }
    }
  }
  if ( $max_score > 0 ){
    return $seq->sub_trace_object( $trim_start, $trim_end);
  } else {
    return undef;
  }
}

sequencing capillary • 1.2k views

ADD COMMENT • link 6.8 years ago by tospo ▴ 50