Question

Subset XStringSet by Sequence Name

0

Entering edit mode

7.2 years ago

pedrorodrigues ▴ 10

Hello,

I am wondering how to subset a fasta file in R, using biostrings. Specifically, I would like to only get sequences that match names in a list. For example, if my data was like the following:

afastafile <- DNAStringSet(c("GCAAATGGG", "CCCGGGTT", "AAAGGGTT", "TTTGGGCC")) 
names(afastafile) <- c("ABC1_1", "ABC2_1", "ABC3_1", "ABC4_1")

and my list was something like this:

list <- as.data.frame(c("ABC1_1", "ABC4_1"))

note: I made my list as a data frame because my actual list is an external file and I use "read.table" to load it in R

I tried using the following loop expression, but it doesn't work

final.table <- NULL
for (i in 1:nrow(list)) {
  a <- afastafile[grep(list[i,], afastafile, perl=TRUE)]
  final.table <- rbind(final.table,a)
}

I believe that subsetting an XStringSet by name should be an easy task to do in R, but I have been struggling for very long. Any help would be really appreciated. Thank you!

Bioconductor • 6.4k views

ADD COMMENT • link updated 7.2 years ago by James Ashmore ★ 3.5k • written 7.2 years ago by pedrorodrigues ▴ 10

score 1 · Answer 1 · 2017-09-24

1

Entering edit mode

7.2 years ago

James Ashmore ★ 3.5k

You can subset the list of sequences this way:

afastafile <- DNAStringSet(c("GCAAATGGG", "CCCGGGTT", "AAAGGGTT", "TTTGGGCC")) 
names(afastafile) <- c("ABC1_1", "ABC2_1", "ABC3_1", "ABC4_1")
list <- data.frame(name = c("ABC1_1", "ABC4_1"))
selected_sequences <- afastafile[list$name]

The only change I made to your example was to add a column name to your dataframe.

ADD COMMENT • link 7.2 years ago by James Ashmore ★ 3.5k

0

Entering edit mode

Thank you so much, James Ashmore! I realized later that my list of sequence names (from blast) only partially matched the names in my DNAStringSet object. But I wrote a little loop to get complete sequence names and, once that was corrected, your solution worked greatly!

Here is the loop I wrote if anyone is interested:

#select names from your XStringSet
names_in_myXStringSet <- names(myXStringSet)
names_in_myXStringSet  <- as.data.frame(names_in_myXStringSet )
colnames(names_in_myXStringSet ) <- "Names"

#Extract complete sequence names and make a new list
final.list <- NULL
for (i in 1:nrow(your.list)) {
  sel0 <- your.list$name[[i]] #assuming your list has "name" as the column name
  sel1 <- dplyr::filter(names_in_myXStringSet , grepl(sel0, Names))
  final.list <- rbind(final.list, sel1)
}

Cheers!

ADD REPLY • link 7.2 years ago by pedrorodrigues ▴ 10

0

Entering edit mode

No worries, happy you solved your problem!

ADD REPLY • link 7.2 years ago by James Ashmore ★ 3.5k