Conditionally apply Biostrings reverseComplement on a DNAStringSet
1
0
Entering edit mode
5.1 years ago
hersh ▴ 60

Hi,

I am using the Biostrings package for its really fast reverseComplement() function. Here's a basic tutorial with the functions and objects I'm using: link

I have a DNAStringSet object, and reverseComplement() can operate really quickly in a vectorized way across the entire set. However, I only want to conditionally take the RC of sequences in the set. I have a logical vector of the same length as my string set which indicates which sequences I want to flip.

For speed, is there any way to avoid using a for loop to loop over the vector and set, and just flip the sequences I want to?

An example of the data:

A DNAStringSet instance of length 10
 width seq
 [1]    30 ATGCACGCCAGCTGGAGTCAGAGTGAGGGG
 [2]    30 ATGCACGCCAGCTGGAGTCAGAGTGAGGGG
 [3]    30 GAGCTGAGTCGTGAGGATGCGAGGCAGACC
 [4]    30 GAGCTGAGTCGTGAGGATGCGAGGCAGACC
 [5]    30 AAAGGGAGCCCAGTGGGGATGAATGAGGGG
 [6]    30 AAAGGGAGCCCAGTGGGGATGAATGAGGGG
 [7]    31 AAAGCGCGGAGAGCCCCAGAGCATTTGAGGG
 [8]    31 AAAGCGCGGAGAGCCCCAGAGCATTTGAGGG
 [9]    30 ACCTAAGGGTCCTCGGAGCCCGGACTCAGG
[10]    30 ACCTAAGGGTCCTCGGAGCCCGGACTCAGG

With a corresponding vector:

[1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE

Thank you!

R sequence • 966 views
ADD COMMENT
0
Entering edit mode
5.1 years ago
Mark ★ 1.6k

Yes. If you have a vector:

tmp <- LETTERS[1:15]

and a corresponding logical vector of equal length* [see footnote]:

l <- sample(rep(c(TRUE, FALSE), 15), 15)

You can subset your tmp vector by the logical vector: tmp[l].

reverseComplement() can handle vectors so you can simply do: reverseComplement(tmp[l]).


Putting it all together:

library("Biostrings")
myFasta <- DNAStringSet(c("CCGAAAACCATGATGGTTGCCAG", "CCGAAAACCATGATGGTTGCCAG", "CCGAAAACCATGATGGTTGCCAG"))
myFasta_want <- c(TRUE, FALSE, TRUE)

reverseComplement(myFasta[myFasta_want])

Hope that helps! In R, if you ever think for loop stop and consider vectorisation. there's probably a functional programming method that utilises vectorisation that's more applicable. Most packages have been designed with this setup in mind. Especially these "core" bioconductor packages. The people behind them are amazing programmers.

footnote: You can also have a vector not of equal length. but don't worry about this for now.

ADD COMMENT

Login before adding your answer.

Traffic: 1904 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6