You could calculate pairwise identity with Biostrings::pid() in R. Comparing hashes will be way faster, so you should go with 5heikki's suggestion.
I was curious how much faster hashing would be than calculating pairwise identities (PID), so I benchmarked pid() against comparing hashes (using hash() from insect).
tl;dr: hashing is ~100x faster than PID.
Code and benchmarks are below.
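library(microbenchmark)
library(insect)
library(Biostrings)

# Note: in newer Bioconductor releases pairwiseAlignment() and pid() may
# live in the pwalign package rather than Biostrings.

# TRUE if the pairwise alignment of x1 and x2 has 100% identity
mypid <- function(x1, x2) {
  Biostrings::pid(Biostrings::pairwiseAlignment(x1, x2)) == 100
}

# TRUE if x1 and x2 produce the same hash
myhash <- function(x1, x2) {
  insect::hash(x1) == insect::hash(x2)
}

# Two random 200-character test strings
seqs <- replicate(n = 2, paste0(sample(LETTERS, size = 200, replace = TRUE), collapse = ""))

# bm1: a sequence compared with itself (identical)
bm1 <- microbenchmark(mypid(seqs[1], seqs[1]),
                      myhash(seqs[1], seqs[1]))

# bm2: two different sequences
bm2 <- microbenchmark(mypid(seqs[1], seqs[2]),
                      myhash(seqs[1], seqs[2]))

bm1
bm2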
If this is about exact matching, then I would recommend that instead of sequence alignment you just hash the sequences. Identical hash == identical sequence. Obviously you would exclude headers and make sure that the formatting is otherwise the same.
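For what it's worth, here is a minimal sketch of what "exclude headers and make sure the formatting is the same" could look like in R. It assumes the digest package is installed (any hashing function would do), and read_fasta_lines() and hash_seq() are hypothetical helper names made up for this example, not functions from any package. Sequences are normalised by joining wrapped lines, removing whitespace, and upper-casing before hashing.

```r
library(digest)  # assumption: the digest package is installed

# Hypothetical helper: turn FASTA-formatted lines into a character vector
# of sequences, dropping headers and joining wrapped sequence lines.
read_fasta_lines <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]        # drop blank lines
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)                  # record index of every line
  groups <- split(lines[!is_header], record[!is_header])
  seqs <- vapply(groups, paste0, character(1), collapse = "")
  # Normalise formatting: strip whitespace, upper-case
  unname(toupper(gsub("\\s+", "", seqs)))
}

# Hypothetical helper: MD5 hash of each sequence string
hash_seq <- function(x) {
  vapply(x, digest, character(1), algo = "md5", serialize = FALSE,
         USE.NAMES = FALSE)
}

fasta <- c(">seq1 sample A", "ACGTAC", "GTACGT",
           ">seq2 sample B", "acgtacgtacgt",
           ">seq3 sample C", "ACGTACGTACGA")

seqs   <- read_fasta_lines(fasta)
hashes <- hash_seq(seqs)

hashes[1] == hashes[2]  # same sequence despite line wrapping and case
hashes[1] == hashes[3]  # last base differs
duplicated(hashes)      # flags later copies of an already-seen sequence
```

With many sequences, duplicated() or split() on the hash vector gives you the groups of identical sequences without any pairwise comparisons.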
Thank you so much for your idea!