Sequence Alignment
1
1
Entering edit mode
3.1 years ago
science03 ▴ 50

Hello everyone,

I have the fasta files of a few isoforms that under different platforms are annotated differently. In order for me to see which isoform (in platform X) is the equivalent of one isoform (in platform Y) I was thinking of doing some sort of sequence alignment in R. Can anyone please suggest any package or a suitable script that I could use?

Sequence Alignment R Similarity • 1.3k views
ADD COMMENT
2
Entering edit mode

If this is about exact matching then I would recommend that instead of sequence alignment you just hash the sequences. Identical hash == identical sequence. Obviously you would exclude headers and make sure that the formatting is otherwise the same..

ADD REPLY
0
Entering edit mode

Thank you so much for your idea!

ADD REPLY
2
Entering edit mode
3.1 years ago
Dunois ★ 2.8k

You could calculate pairwise identity with Biostrings::pid() in R. Comparing hashes will be way faster, so you should go with 5heikki's suggestion.

Just because I was a bit curious about how much faster hashing would be compared to calculating pairwise identities (PID), I benchmarked pid vs. comparing hashes (using hash() from insect).

tl;dr: hashing is ~100x faster in comparison to PID.

Code and benchmarks are below.

#Packages.
library(microbenchmark)
library(insect)
library(Biostrings)


#Custom functions.

#Wrapper function around Biostrings::pid which 
#checks if the sequences are identical and returns
#TRUE/FALSE.
#Biostrings::pid itself needs a Biostrings::pairwiseAlignment
#object as input.
mypid <- function(x1, x2){
  Biostrings::pid(Biostrings::pairwiseAlignment(x1, x2)) == 100
}

#Wrapper function around insect::hash which
#checks if the hashes of the two input sequences
#are identical.
myhash <- function(x1, x2){
  insect::hash(x1) == insect::hash(x2)
}



#Creating two random character strings of length 200
#to simulate two random protein sequences.
seqs <- replicate(n = 2, paste0(sample(LETTERS, size = 200, replace = TRUE), collapse = ""))


#BENCHMARKS.

#PID vs. comparing hashes of identical sequences.
bm1 <- microbenchmark(mypid(seqs[1], seqs[1]), 
                      myhash(seqs[1], seqs[1]))

#PID vs. comparing hashes of non-identical sequences.
bm2 <- microbenchmark(mypid(seqs[1], seqs[2]), 
                      myhash(seqs[1], seqs[2]))


bm1
#Unit: microseconds
#                     expr        min          lq        mean      median          uq       max neval
#  mypid(seqs[1], seqs[1]) 496883.628 513732.0455 530269.2377 525232.7875 540688.9640 623581.62   100
# myhash(seqs[1], seqs[1])    137.264    147.8555    248.8959    233.0835    256.7435   3942.56   100

bm2
#Unit: microseconds
#                     expr        min         lq        mean    median          uq        max neval
#  mypid(seqs[1], seqs[2]) 311092.567 529674.820 556387.5045 553213.35 586489.2800 788800.004   100
# myhash(seqs[1], seqs[2])    138.495    156.273    239.4025    240.71    278.9925    457.665   100
ADD COMMENT
0
Entering edit mode

Indeed hashing is the most suitable option considering this sort of specific analysis. Thank you for your feedback!

ADD REPLY
0
Entering edit mode

I wonder whether straight-up string comparison would be faster still. After all, for a hash, it needs to traverse the entire string and compute the hash. Whereas exact string comparison can bail out at the first nonmatching character.

In Python straight string comparison is five times faster than hashing (the gain is probably proportional with the length of the string):

$ python -m timeit '"ATAAGAGAAGATAGAGAGA" == "ATATAGAGAGATTATA"'
10000000 loops, best of 5: 22.2 nsec per loop

$ python -m timeit 'hash("ATAAGAGAAGATAGAGAGA") == hash("ATATAGAGAGATTATA")'
2000000 loops, best of 5: 116 nsec per loop
ADD REPLY

Login before adding your answer.

Traffic: 2016 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6