Hello,
I am wondering how to subset a FASTA file in R using Biostrings. Specifically, I would like to keep only the sequences whose names appear in a list. For example, if my data were like the following:
afastafile <- DNAStringSet(c("GCAAATGGG", "CCCGGGTT", "AAAGGGTT", "TTTGGGCC"))
names(afastafile) <- c("ABC1_1", "ABC2_1", "ABC3_1", "ABC4_1")
and my list was something like this:
list <- as.data.frame(c("ABC1_1", "ABC4_1"))
Note: I made my list a data frame because my actual list is an external file that I load into R with "read.table".
I tried the following loop, but it doesn't work:
final.table <- NULL
for (i in 1:nrow(list)) {
  a <- afastafile[grep(list[i,], afastafile, perl=TRUE)]
  final.table <- rbind(final.table,a)
}
I believe that subsetting an XStringSet by name should be an easy task in R, but I have been struggling with it for a long time. Any help would be really appreciated. Thank you!
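For readers following along, here is a minimal sketch of name-based subsetting. It is not necessarily the answer given in this thread. It assumes the names in the list match names(afastafile) exactly; the variable names wanted, ids, missing_ids and subset_seqs are made up for illustration. Note that the loop above passes the sequence set itself to grep rather than its names, and that rbind is meant for matrices and data frames, not XStringSet objects.

```r
library(Biostrings)

# Example data from the question
afastafile <- DNAStringSet(c("GCAAATGGG", "CCCGGGTT", "AAAGGGTT", "TTTGGGCC"))
names(afastafile) <- c("ABC1_1", "ABC2_1", "ABC3_1", "ABC4_1")

# Stand-in for the data frame that read.table() would return
# (hypothetical one-column table of sequence names)
wanted <- data.frame(V1 = c("ABC1_1", "ABC4_1"), stringsAsFactors = FALSE)

# Pull the first column out as a plain character vector
ids <- as.character(wanted[[1]])

# Names requested but absent from the FASTA; worth checking before subsetting
missing_ids <- setdiff(ids, names(afastafile))

# Keep only the sequences whose names are in the list (FASTA order is preserved)
subset_seqs <- afastafile[names(afastafile) %in% ids]
```

The result is itself a DNAStringSet, so it can be written back out with writeXStringSet(subset_seqs, "subset.fasta"), where the output file name is just a placeholder.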
Thank you so much, James Ashmore! I realized later that my list of sequence names (from BLAST) only partially matched the names in my DNAStringSet object. I wrote a little loop to recover the complete sequence names and, once that was corrected, your solution worked great!
Here is the loop I wrote if anyone is interested:
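(The loop itself does not appear in this thread. The following is not the poster's code, only a sketch of one way to resolve partial names to full names. It assumes each partial name is a prefix of the full name, which the post does not state; partial_ids, full_ids and matched_seqs are hypothetical names.)

```r
library(Biostrings)

# Example data from the question
afastafile <- DNAStringSet(c("GCAAATGGG", "CCCGGGTT", "AAAGGGTT", "TTTGGGCC"))
names(afastafile) <- c("ABC1_1", "ABC2_1", "ABC3_1", "ABC4_1")

# Hypothetical truncated identifiers, e.g. as reported by another tool
partial_ids <- c("ABC1", "ABC4")

# For each partial ID, collect every full name that starts with it
full_ids <- unique(unlist(lapply(partial_ids, function(p) {
  names(afastafile)[startsWith(names(afastafile), p)]
})))

# Subset with the recovered full names
matched_seqs <- afastafile[names(afastafile) %in% full_ids]
```

Prefix matching can over-match: "ABC1" would also pick up a name such as "ABC10_1", so it may be safer to append the delimiter (here "_") to each partial ID before matching.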
Cheers!
No worries, happy you solved your problem!