Subset XStringSet by Sequence Name
1
0
Entering edit mode
7.2 years ago

Hello,

I am wondering how to subset a fasta file in R, using biostrings. Specifically, I would like to only get sequences that match names in a list. For example, if my data was like the following:

afastafile <- DNAStringSet(c("GCAAATGGG", "CCCGGGTT", "AAAGGGTT", "TTTGGGCC")) 
names(afastafile) <- c("ABC1_1", "ABC2_1", "ABC3_1", "ABC4_1")

and my list was something like this:

list <- as.data.frame(c("ABC1_1", "ABC4_1"))

note: I made my list as a data frame because my actual list is an external file and I use "read.table" to load it in R

I tried using the following loop expression, but it doesn't work

final.table <- NULL
for (i in 1:nrow(list)) {
  a <- afastafile[grep(list[i,], afastafile, perl=TRUE)]
  final.table <- rbind(final.table,a)
}

I believe that subsetting an XStringSet by name should be an easy task to do in R, but I have been struggling for very long. Any help would be really appreciated. Thank you!

Bioconductor • 6.4k views
ADD COMMENT
1
Entering edit mode
7.2 years ago
James Ashmore ★ 3.5k

You can subset the list of sequences this way:

afastafile <- DNAStringSet(c("GCAAATGGG", "CCCGGGTT", "AAAGGGTT", "TTTGGGCC")) 
names(afastafile) <- c("ABC1_1", "ABC2_1", "ABC3_1", "ABC4_1")
list <- data.frame(name = c("ABC1_1", "ABC4_1"))
selected_sequences <- afastafile[list$name]

The only change I made to your example was to add a column name to your dataframe.

ADD COMMENT
0
Entering edit mode

Thank you so much, James Ashmore! I realized later that my list of sequence names (from blast) only partially matched the names in my DNAStringSet object. But I wrote a little loop to get complete sequence names and, once that was corrected, your solution worked greatly!

Here is the loop I wrote if anyone is interested:

#select names from your XStringSet
names_in_myXStringSet <- names(myXStringSet)
names_in_myXStringSet  <- as.data.frame(names_in_myXStringSet )
colnames(names_in_myXStringSet ) <- "Names"

#Extract complete sequence names and make a new list
final.list <- NULL
for (i in 1:nrow(your.list)) {
  sel0 <- your.list$name[[i]] #assuming your list has "name" as the column name
  sel1 <- dplyr::filter(names_in_myXStringSet , grepl(sel0, Names))
  final.list <- rbind(final.list, sel1)
}

Cheers!

ADD REPLY
0
Entering edit mode

No worries, happy you solved your problem!

ADD REPLY

Login before adding your answer.

Traffic: 2135 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6