Question

Subsetting a fasta file using seqinr in R

1

9.9 years ago

henrikkjeldal ▴ 10

I am trying to subset a fasta file as described in another post (How To Extract Multiple Fasta Sequences At A Time From A File Containing Sequences Ids Using R). I simply want to extract a subset of sequences from a fasta file.

I load the package and the data as described in the other thread:

library("seqinr")

fastafile <- read.fasta(file = "proteins.fasta",
                        seqtype = "AA", as.string = TRUE, set.attributes = FALSE)

I also load the list of IDs I want to use for subsetting (the column in subsetlist is called id):

subsetlist <- read.table("~/scripts/subsetfasta/test.txt", header = TRUE)

When I attempt to use the solution from the previous thread:

fastafile[names(fastafile) %in% subsetlist$id]

I get the following:

named list()

What am I doing wrong or missing?

best regards

Henrik

sequence R fasta • 22k views

ADD COMMENT • link updated 18 months ago by GenoMax 147k • written 9.9 years ago by henrikkjeldal ▴ 10

0

Dear all,

Does anyone know how to do partial matching against a column in a data frame (here, column id in the data frame subsetlist), for cases where id contains only part of each name in the fasta file? (%in% requires a complete match.)

Thanks! Ellen

ADD REPLY • link 18 months ago by Ellen ▴ 20

1

Please open a new question and add a reproducible example.

ADD REPLY • link 18 months ago by ATpoint 85k
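
For the partial-matching case raised above, a minimal sketch in base R might look like the following. It assumes each entry in subsetlist$id occurs as a literal substring of a fasta name. The list fastafile below is a hand-made stand-in for what read.fasta returns, and the names, sequences, and the helper matches_any_id are hypothetical.

```r
# Stand-in for the named list returned by read.fasta(); names and sequences are made up.
fastafile <- list(
  "sp|P12345|ALBU_HUMAN" = "MKWVTFIS",
  "sp|Q67890|TRFE_HUMAN" = "MRLAVGAL",
  "sp|A11111|INS_HUMAN"  = "MALWMRLL"
)

# Hypothetical ID table holding only part of each fasta name.
subsetlist <- data.frame(id = c("P12345", "A11111"), stringsAsFactors = FALSE)

# TRUE if any ID occurs in the name as a literal substring.
# fixed = TRUE stops characters such as "|" or "." being read as regex syntax.
matches_any_id <- function(seq_name, ids) {
  any(vapply(ids, function(id) grepl(id, seq_name, fixed = TRUE), logical(1)))
}

# One TRUE/FALSE per sequence name, then subset the list with it.
partial_hit <- vapply(names(fastafile), matches_any_id, logical(1),
                      ids = subsetlist$id)
subset_fasta <- fastafile[partial_hit]
print(names(subset_fasta))
```

Substring matching can over-match (an ID such as P1234 would also hit P12345), so anchoring the pattern on the delimiters used in the fasta headers is usually safer when the header layout is known.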

4

9.8 years ago

Tanvir Ahamed ▴ 350

R solution:

fastafile[c(which(names(fastafile) %in% subsetlist$id))]

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 9.8 years ago by Tanvir Ahamed ▴ 350

0

R can subset using a boolean vector, so the which is unnecessary here. I don't even get the point of the c. I wonder how this answer got 4 upvotes.

fastafile[names(fastafile) %in% subsetlist$id]
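
As a quick check of that point, the sketch below compares the three forms of indexing on a small hand-made list. The list and the IDs are made up and only stand in for the output of read.fasta.

```r
# Stand-in for the named list returned by read.fasta(); contents are made up.
fastafile <- list(seqA = "MKT", seqB = "MSD", seqC = "MGS")
ids <- c("seqA", "seqC")

# Logical vector: one TRUE/FALSE per sequence name.
keep <- names(fastafile) %in% ids

by_logical <- fastafile[keep]            # index with the logical vector directly
by_which   <- fastafile[which(keep)]     # convert to integer positions first
by_c_which <- fastafile[c(which(keep))]  # c() around a single vector changes nothing

# Compare the three results.
print(identical(by_logical, by_which))
print(identical(by_logical, by_c_which))
```

So if the logical form returns an empty list, wrapping it in which() or c() will not change the result; the IDs themselves do not match.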

ADD REPLY • link 2.7 years ago by Ram 44k

0

9.8 years ago

Tanvir Ahamed ▴ 350

Not an R solution:

UCSC utilities

$ ./faSomeRecords main.fasta id.txt output.fa

The -exclude option instead outputs the sequences in main.fasta that are not listed in id.txt.
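
For comparison, the same keep/exclude split can be sketched in base R by negating the membership test. This is an illustration on made-up data, not a description of how faSomeRecords works internally; the list stands in for the output of read.fasta and ids for the contents of id.txt.

```r
# Stand-in for the sequences in main.fasta; names and sequences are made up.
fastafile <- list(seqA = "MKT", seqB = "MSD", seqC = "MGS")

# Stand-in for the IDs listed in id.txt.
ids <- c("seqB")

listed <- names(fastafile) %in% ids

kept     <- fastafile[listed]   # sequences whose names are in the ID list
excluded <- fastafile[!listed]  # everything else, analogous to -exclude

print(names(kept))
print(names(excluded))
```

Negating the logical vector with ! is preferable to -which(listed) here: when nothing matches, which() returns an empty vector and the negative index selects nothing instead of everything.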

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 9.8 years ago by Tanvir Ahamed ▴ 350

Ram · Accepted Answer · 2014-12-29

2

9.9 years ago

David W 4.9k

Looks like none of your fasta record names are in the subset list. You can check by just calling

names(fastafile) %in% subsetlist$id

This will be a vector of FALSEs if that's the case.

You probably want to check exactly how the subset list and fasta file IDs are formatted, and use (g)sub, paste or related functions to get them to match.
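
A minimal sketch of that check, using base R only. The list below stands in for the output of read.fasta, and the two formatting problems shown (a leading ">" and trailing whitespace in the ID column) are assumptions chosen for illustration; the actual mismatch in any given file may be different.

```r
# Stand-in for the named list returned by read.fasta(); contents are made up.
fastafile <- list(sp_P12345 = "MKTAYIAK", sp_Q67890 = "MSDNELLQ", sp_A11111 = "MGSSHHHH")

# Hypothetical ID table with a leading ">" and a trailing space in the IDs.
subsetlist <- data.frame(id = c(">sp_P12345 ", ">sp_A11111"),
                         stringsAsFactors = FALSE)

# Step 1: test the IDs as they are. All FALSE explains the empty "named list()".
raw_hits <- names(fastafile) %in% subsetlist$id
print(table(raw_hits))

# Step 2: inspect both sides to see how they differ.
print(head(names(fastafile)))
print(head(subsetlist$id))

# Step 3: normalise the IDs (drop a leading ">" and surrounding whitespace), then retry.
clean_ids <- trimws(sub("^>", "", subsetlist$id))
subset_fasta <- fastafile[names(fastafile) %in% clean_ids]
print(names(subset_fasta))
```

It is also worth confirming that subsetlist$id is not NULL (for example, when the file has no header line or the column has a different name), since %in% against NULL returns FALSE for every name as well.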

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by David W 4.9k