Normalized Smith-Waterman similarity score based on pairwise amino acids sequence alignments
1
0
Entering edit mode
8.7 years ago
' ▴ 330

I have 100 protein sequences and I wish to compute similarities between them. What's the most efficient way to get the normalized Smith-Waterman similarity scores?

alignment • 4.5k views
ADD COMMENT
1
Entering edit mode

Given the number of pre-existing alignment tools, the most efficient method would be to not write anything and just use someone elses (likely online and easily findable via google) tool.

ADD REPLY
0
Entering edit mode

I ended up using R which was reasonably fast for my limited computational resources. I used Biostrings::pairwiseAlignment.

ADD REPLY
0
Entering edit mode

is there any answer?

ADD REPLY
0
Entering edit mode
6.5 years ago
' ▴ 330

Here's a naive approach:

library("Biostrings")
library("seqinr")    


## Load protein sequences
#Protein_Seq <- read.fasta("Protein_sequneces.fa", seqtype="AA", as.string="TRUE")
#Protein_Sequences_Only <- getSequence(Protein_Seq, as.string=TRUE)
data("BLOSUM62")
protein_dat <- read.table("Protein_sequneces.dat")

Smith_Waterman_Scores <- data.frame(Seq1=as.numeric(),
                                    Seq2=as.numeric(), 
                                    Score=as.numeric(), 
                                    stringsAsFactors=FALSE) 

## Perform alignment
for (i in 1:length(protein_dat)) {
  for (j in 1:length(protein_dat))
  {

    t <- pairwiseAlignment(protein_dat[i,], protein_dat[j,],
                                         substitutionMatrix=BLOSUM62,type="local")
    Smith_Waterman_Scores <- rbind(Smith_Waterman_Scores, c(i,j,t@score))

  }
}

names(Smith_Waterman_Scores)[1] <- "First.Protein"
names(Smith_Waterman_Scores)[2] <- "Second.Protein"
names(Smith_Waterman_Scores)[3] <- "Score"



### Normalize Smith-Waterman similarity scores
dt <- data.table(Smith_Waterman_Scores)
dt.lookup <- dt[First.Protein == Second.Protein]
setkey(dt,"First.Protein" )
setkey(dt.lookup,"First.Protein" )
colnames(dt.lookup) <- c("First.Protein","Second.Protein","Score1")
dt <- dt[dt.lookup]
setkey(dt,"Second.Protein" )
setkey(dt.lookup,"Second.Protein")
colnames(dt.lookup) <- c("First.Protein","Second.Protein","Score2")
dt <- dt[dt.lookup][
  , Normalized :=  Score / (sqrt(Score1) * sqrt(Score2))][
    , .(First.Protein, Second.Protein, Normalized)]
dt <- dt[order(dt$First.Protein),]

Smith_Waterman_Scores <- as.data.frame(dt)
ADD COMMENT
0
Entering edit mode

hi, I am confused about the dt.lookup object. It seems that dt.lookup <- dt[First.Protein == Second.Protein] is the wrong code. Would you please fix it. Thank you

ADD REPLY

Login before adding your answer.

Traffic: 2629 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6