Question

Normalized Smith-Waterman similarity score based on pairwise amino acids sequence alignments

0

Entering edit mode

9.3 years ago

' ▴ 330

I have 100 protein sequences and I wish to compute similarities between them. What's the most efficient way to get the normalized Smith-Waterman similarity scores?

alignment • 4.9k views

ADD COMMENT • link 7.1 years ago by ' ▴ 330

1

Entering edit mode

Given the number of pre-existing alignment tools, the most efficient method would be to not write anything and just use someone elses (likely online and easily findable via google) tool.

ADD REPLY • link 9.3 years ago by Devon Ryan 105k

0

Entering edit mode

I ended up using R which was reasonably fast for my limited computational resources. I used Biostrings::pairwiseAlignment.

ADD REPLY • link 7.6 years ago by ' ▴ 330

0

Entering edit mode

is there any answer?

ADD REPLY • link 7.1 years ago by y2607570419 • 0

score 0 · Answer 1 · 2018-07-26

Here's a naive approach:

library("Biostrings")
library("seqinr")    


## Load protein sequences
#Protein_Seq <- read.fasta("Protein_sequneces.fa", seqtype="AA", as.string="TRUE")
#Protein_Sequences_Only <- getSequence(Protein_Seq, as.string=TRUE)
data("BLOSUM62")
protein_dat <- read.table("Protein_sequneces.dat")

Smith_Waterman_Scores <- data.frame(Seq1=as.numeric(),
                                    Seq2=as.numeric(), 
                                    Score=as.numeric(), 
                                    stringsAsFactors=FALSE) 

## Perform alignment
for (i in 1:length(protein_dat)) {
  for (j in 1:length(protein_dat))
  {

    t <- pairwiseAlignment(protein_dat[i,], protein_dat[j,],
                                         substitutionMatrix=BLOSUM62,type="local")
    Smith_Waterman_Scores <- rbind(Smith_Waterman_Scores, c(i,j,t@score))

  }
}

names(Smith_Waterman_Scores)[1] <- "First.Protein"
names(Smith_Waterman_Scores)[2] <- "Second.Protein"
names(Smith_Waterman_Scores)[3] <- "Score"



### Normalize Smith-Waterman similarity scores
dt <- data.table(Smith_Waterman_Scores)
dt.lookup <- dt[First.Protein == Second.Protein]
setkey(dt,"First.Protein" )
setkey(dt.lookup,"First.Protein" )
colnames(dt.lookup) <- c("First.Protein","Second.Protein","Score1")
dt <- dt[dt.lookup]
setkey(dt,"Second.Protein" )
setkey(dt.lookup,"Second.Protein")
colnames(dt.lookup) <- c("First.Protein","Second.Protein","Score2")
dt <- dt[dt.lookup][
  , Normalized :=  Score / (sqrt(Score1) * sqrt(Score2))][
    , .(First.Protein, Second.Protein, Normalized)]
dt <- dt[order(dt$First.Protein),]

Smith_Waterman_Scores <- as.data.frame(dt)