You could calculate pairwise identity with Biostrings::pid()
in R
. Comparing hashes will be way faster, so you should go with 5heikki's suggestion.
Just because I was a bit curious about how much faster hashing would be compared to calculating pairwise identities (PID), I benchmarked pid
vs. comparing hashes (using hash()
from insect
).
tl;dr: hashing is ~100x faster in comparison to PID.
Code and benchmarks are below.
#Packages.
library(microbenchmark)
library(insect)
library(Biostrings)
#Custom functions.
#Wrapper function around Biostrings::pid which
#checks if the sequences are identical and returns
#TRUE/FALSE.
#Biostrings::pid itself needs a Biostrings::pairwiseAlignment
#object as input.
mypid <- function(x1, x2){
Biostrings::pid(Biostrings::pairwiseAlignment(x1, x2)) == 100
}
#Wrapper function around insect::hash which
#checks if the hashes of the two input sequences
#are identical.
myhash <- function(x1, x2){
insect::hash(x1) == insect::hash(x2)
}
#Creating two random character strings of length 200
#to simulate two random protein sequences.
seqs <- replicate(n = 2, paste0(sample(LETTERS, size = 200, replace = TRUE), collapse = ""))
#BENCHMARKS.
#PID vs. comparing hashes of identical sequences.
bm1 <- microbenchmark(mypid(seqs[1], seqs[1]),
myhash(seqs[1], seqs[1]))
#PID vs. comparing hashes of non-identical sequences.
bm2 <- microbenchmark(mypid(seqs[1], seqs[2]),
myhash(seqs[1], seqs[2]))
bm1
#Unit: microseconds
# expr min lq mean median uq max neval
# mypid(seqs[1], seqs[1]) 496883.628 513732.0455 530269.2377 525232.7875 540688.9640 623581.62 100
# myhash(seqs[1], seqs[1]) 137.264 147.8555 248.8959 233.0835 256.7435 3942.56 100
bm2
#Unit: microseconds
# expr min lq mean median uq max neval
# mypid(seqs[1], seqs[2]) 311092.567 529674.820 556387.5045 553213.35 586489.2800 788800.004 100
# myhash(seqs[1], seqs[2]) 138.495 156.273 239.4025 240.71 278.9925 457.665 100
If this is about exact matching then I would recommend that instead of sequence alignment you just hash the sequences. Identical hash == identical sequence. Obviously you would exclude headers and make sure that the formatting is otherwise the same..
Thank you so much for your idea!