I am looking at a simple, fast and vectorized way to compare two Bioconductor Biostring objects (DNAstrings to be exact) and check for each position whether the nucleotides are the same. Currently, I use following code, but I doubt (I hope...) this is the most convenient way:
compare.DNA <- function(x,y){
sapply(seq(length(x)),
function(i){
x[i] == y[i]
}
)
}
This gives:
require(Biostring)
DNAseq1 <- DNAString('ACCTCAGCT')
DNAseq2 <- DNAString('ATCTCATCT')
compare.DNA(DNAseq1,DNAseq2)
[1] TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
I've been trying to access the values directly, but there's no slot in Biostrings that contain the actual information, it just contains a link to a cached object. The problem is I have to do this on +3 million sequences, and this code runs slow:
> system.time(replicate(10000,compare.DNA(DNAseq1,DNAseq2)))
user system elapsed
51.05 0.16 51.21
Any ideas on how to do this a whole lot faster?
This is true in the artificial case that the sequences are in single DNAString objects, but you cannot convert a DNAStringSet which you will likely have to integer using as.integer.
@Michael Dondrup. I know. That's why I specifically asked for DNAstring methods, as I don't have a DNAStringSet.
ok, was just wondering how you read your sequences into R, or do you generate them in R? Because in most cases I would expect read.DNAStringSet be used to read a fasta file.
@Michael Dondrup I get them from another lab, so I have a (pretty odd) list of DNAStrings. Converting that to a DNAStringSet isn't that obvious.
as.raw
should be faster still; the folks on the Bioconductor mailing list http://bit.ly/x1U1ux might have other suggestions (especially if the 'real' data has some particular characteristics, e.g., a few long or many short sequences).@MartinMorgan I tested it, and that's actually not the case. as.raw first calls
as.integer
to get the data out...@MartinMorgan I tested it, and that's actually not the case. as.raw first calls as.integer to get the data out... Thanks for the tip about the mailing list though.