Finding Protein Sequence Coverage With R Or Python
1
Hello,
I am trying to find protein sequence coverage after pairwise sequence alignment by using the following formula
sequence coverage = (length of aligned residues/length of longest sequence)*100
How can I use this formula in R or in bio python? Is there any package available to calculate proein sequence coverage ?
protein
r
alignment
biopython
• 5.0k views
This function will give the sequence coverage
seq.cov <- function(x){
if(!class(x)=="PairwiseAlignedFixedSubject"){
message('Only supports objects from class PairwiseAlignedFixedSubject')
stop()
}
length.string1 <- x@subject@unaligned@ranges@width
length.string2 <- x@pattern@unaligned@ranges@width
align.length1 <- x@pattern@range@width
align.length2 <- x@subject@range@width
if(length.string1 >= length.string2){
longest.length <- length.string1
} else {
longest.length <- length.string2
}
if(align.length1 >= align.length2){
align.length <- align.length2
} else {
align.length <- align.length1
}
x.seq.cov <- ( align.length / longest.length ) * 100
return(x.seq.cov)
}
Test function:
library(Biostrings )
s1 <- DNAString("AGTATAGATGATAGAT")
s2 <- DNAString("AGTAGATAGATGGATGATAGATA")
palign1 <- pairwiseAlignment(s1, s2)
palign2 <-
pairwiseAlignment(s1, s2,
substitutionMatrix =
nucleotideSubstitutionMatrix(match = 2, mismatch = 10, baseOnly = TRUE))
s3 <- DNAString("AGTATAGATGATAGATGGTCC")
s4 <- DNAString("AGTAGATAGATGGATGATAGATA")
palign3 <- pairwiseAlignment(s3, s4)
palign4 <-
pairwiseAlignment(s3, s4,
substitutionMatrix =
nucleotideSubstitutionMatrix(match = 2, mismatch = 10, baseOnly = TRUE))
seq.cov(palign1)
seq.cov(palign2)
seq.cov(palign3)
seq.cov(palign4)
This should do it.
Login before adding your answer.
Traffic: 1879 users visited in the last hour
How did you generate those alignments? For some programs (e.g. BLAST but also others) this kind of information is provided in the output.
@philippe Thanks for your comment. I used R to do pairwise sequence alignment using Pairwisealignment function(seqinr package) and calculated sequence identity by using pid function(Biostrings package)