Question

Sequence Alignment

1

Entering edit mode

4.0 years ago

science03 ▴ 50

Hello everyone,

I have the fasta files of a few isoforms that under different platforms are annotated differently. In order for me to see which isoform (in platform X) is the equivalent of one isoform (in platform Y) I was thinking of doing some sort of sequence alignment in R. Can anyone please suggest any package or a suitable script that I could use?

Sequence Alignment R Similarity • 1.9k views

ADD COMMENT • link updated 4.0 years ago by Dunois ★ 2.9k • written 4.0 years ago by science03 ▴ 50

2

Entering edit mode

If this is about exact matching then I would recommend that instead of sequence alignment you just hash the sequences. Identical hash == identical sequence. Obviously you would exclude headers and make sure that the formatting is otherwise the same..

ADD REPLY • link 4.0 years ago by 5heikki 11k

0

Entering edit mode

Thank you so much for your idea!

ADD REPLY • link 4.0 years ago by science03 ▴ 50

score 2 · Answer 1 · 2021-11-07

You could calculate pairwise identity with Biostrings::pid() in R. Comparing hashes will be way faster, so you should go with 5heikki's suggestion.

Just because I was a bit curious about how much faster hashing would be compared to calculating pairwise identities (PID), I benchmarked pid vs. comparing hashes (using hash() from insect).

tl;dr: hashing is ~100x faster in comparison to PID.

Code and benchmarks are below.

#Packages.
library(microbenchmark)
library(insect)
library(Biostrings)


#Custom functions.

#Wrapper function around Biostrings::pid which 
#checks if the sequences are identical and returns
#TRUE/FALSE.
#Biostrings::pid itself needs a Biostrings::pairwiseAlignment
#object as input.
mypid <- function(x1, x2){
  Biostrings::pid(Biostrings::pairwiseAlignment(x1, x2)) == 100
}

#Wrapper function around insect::hash which
#checks if the hashes of the two input sequences
#are identical.
myhash <- function(x1, x2){
  insect::hash(x1) == insect::hash(x2)
}



#Creating two random character strings of length 200
#to simulate two random protein sequences.
seqs <- replicate(n = 2, paste0(sample(LETTERS, size = 200, replace = TRUE), collapse = ""))


#BENCHMARKS.

#PID vs. comparing hashes of identical sequences.
bm1 <- microbenchmark(mypid(seqs[1], seqs[1]), 
                      myhash(seqs[1], seqs[1]))

#PID vs. comparing hashes of non-identical sequences.
bm2 <- microbenchmark(mypid(seqs[1], seqs[2]), 
                      myhash(seqs[1], seqs[2]))


bm1
#Unit: microseconds
#                     expr        min          lq        mean      median          uq       max neval
#  mypid(seqs[1], seqs[1]) 496883.628 513732.0455 530269.2377 525232.7875 540688.9640 623581.62   100
# myhash(seqs[1], seqs[1])    137.264    147.8555    248.8959    233.0835    256.7435   3942.56   100

bm2
#Unit: microseconds
#                     expr        min         lq        mean    median          uq        max neval
#  mypid(seqs[1], seqs[2]) 311092.567 529674.820 556387.5045 553213.35 586489.2800 788800.004   100
# myhash(seqs[1], seqs[2])    138.495    156.273    239.4025    240.71    278.9925    457.665   100