Hi
I am currently doing research on speeding up of sequence alignment. Could you let me know which algorithm(or the base algorithm) is used for sequence ALIGNMENT (not assembly) in phred phrap.?
Could you also let me know what are the common sequence alignment algorithms in use now-a-days?
Phred doesn't do alignment, and phrap was the cutting edge back in...the late nineties? Current algorithms typically focus on short reads using compressed suffix arrays (bfindex et al), but for many purposes, old workhorses like BLAST or BLAT are still commonly used.
My understanding is that Phrap is based on Cross_match/SWAT, which are implementations of the Smith-Waterman local alignment algorithm. Cross_match is based on SWAT with banding constraints.
Swat and cross_match are programs for rapid protein and nucleic acid sequence comparison and database search, based on an efficient implementation of the Smith-Waterman-Gotoh algorithm (Smith & Waterman 1981, Gotoh 1982). By revising the recursion relations slightly and making efficient use of word-packing, we have been able to significantly reduce the number of machine instructions executed per Smith-Waterman matrix cell, resulting in a speed enhancement of 2-3 fold over the previous best implementation (Pearson and Miller, 1992). Swat is about 1 / 10 as fast as BLAST, making it quite adequate for performing database searches on a workstation. Statistical significance of the hits is evaluated based on the empirical score distribution for the search, taking into account dependence of the score variance and mean on the length of the database sequence.
Cross_match uses the same algorithm as swat, but in addition allows the comparison of a pair of sequences to be constrained to bands of the Smith-Waterman matrix that surround one or more matching words in the sequences. This substantially increases speed for large-scale nucleotide sequence comparisons without significantly compromising sensitivity. We have found cross_match useful for the following tasks: comparing a set of reads to a set of vector sequences to produce vector-masked versions of the reads; comparing contig sequences found by two alternative assembly procedures (e.g. phrap and xbap) to each other; and comparing assembled contigs to the final edited cosmid sequence, in retrospective studies of the accuracy of fully automated assembly. We also now routinely use cross_match to find repeats in finished cosmid sequences and mask them out prior to database searches.
Clustal and Muscle are common sequence alignment algorithms implemented in many other programs. These are particularly good for multiple sequence alignments containing many sequences.
http://www.clustal.org/http://www.drive5.com/muscle/
Have you already searched on the Web and in the literature for relevant information? That's an important part of "doing research".
Phred doesn't do alignment, and phrap was the cutting edge back in...the late nineties? Current algorithms typically focus on short reads using compressed suffix arrays (bfindex et al), but for many purposes, old workhorses like BLAST or BLAT are still commonly used.
thanks Ketil. Could you put this as an answer?