Subsetting a fasta file using seqinr in R
9.9 years ago

I am trying to subset a fasta file as described in another post (How To Extract Multiple Fasta Sequences At A Time From A File Containing Sequences Ids Using R). I simply want to extract a subset of sequences from a fasta file.

I load the package and the data as described in the other thread:

library("seqinr")

fastafile <- read.fasta(file = "proteins.fasta",
                        seqtype = "AA", as.string = TRUE, set.attributes = FALSE)

I also load the list of IDs I want to use for subsetting (the column in subsetlist is called id):

subsetlist<-read.table("~/scripts/subsetfasta/test.txt", header=TRUE)

When I attempt to use the solution from the previous thread:

fastafile[names(fastafile) %in% subsetlist$id]

I get the following:

named list()

What am I doing wrong or missing?

best regards

Henrik

sequence R fasta • 22k views

Dear all,

Does anyone know how to do partial matching based on a column in a dataframe (here the column id in the dataframe subsetlist), for the case where id contains only part of each name in the fasta file? (%in% needs a complete match.)

Thanks! ellen


Please open a new question and add a reproducible example.
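
For the partial-matching case raised in the comment above, one possible approach is substring matching with grepl instead of %in%. This is only a minimal sketch: the record names and IDs are hypothetical, and a plain named list stands in for what read.fasta returns.

```r
# Hypothetical stand-in for the result of read.fasta(): a named list of sequences.
fastafile <- list(
  "sp|P12345|PROT1_HUMAN" = "MKTAYIAKQR",
  "sp|Q67890|PROT2_HUMAN" = "MSDNELLKQA",
  "sp|A11111|PROT3_HUMAN" = "MALWMRLLPL"
)
# Hypothetical ID list; each id is only a fragment of a record name.
subsetlist <- data.frame(id = c("P12345", "PROT3"), stringsAsFactors = FALSE)

ids <- as.character(subsetlist$id)

# For each record name, check whether any id occurs in it as a literal substring.
# fixed = TRUE stops characters such as "|" or "." being read as regex syntax.
keep <- vapply(
  names(fastafile),
  function(nm) any(vapply(ids, grepl, logical(1), x = nm, fixed = TRUE)),
  logical(1)
)

partialsubset <- fastafile[keep]
names(partialsubset)
```

Substring matching can over-match (an id of "P1234" would also hit "P12345"), so extracting the exact field and using %in% is safer whenever the name format is regular.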

9.9 years ago
David W 4.9k

Looks like none of your fasta record names are in the subset list. You can check by just calling

names(fastafile) %in% subsetlist$id

Which will be a vector of FALSEs if that's the case.

You probably want to check exactly how the subset list and fasta file IDs are formatted, and use (g)sub, paste or related functions to get them to match.
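
A minimal sketch of that check-and-fix workflow follows. The name format (accession as the second "|"-separated field) and the trailing whitespace in the ID list are assumptions made up for illustration, and a plain named list stands in for the read.fasta result; the regular expression will need adapting to the real headers.

```r
# Hypothetical stand-in for the result of read.fasta(): a named list of sequences.
fastafile <- list(
  "sp|P12345|PROT1_HUMAN" = "MKTAYIAKQR",
  "sp|Q67890|PROT2_HUMAN" = "MSDNELLKQA",
  "sp|A11111|PROT3_HUMAN" = "MALWMRLLPL"
)
# Hypothetical ID list; note the stray trailing space in the second id.
subsetlist <- data.frame(id = c("P12345", "A11111 "), stringsAsFactors = FALSE)

# Step 1: look at both sides and confirm that nothing matches as-is.
head(names(fastafile))
head(subsetlist$id)
any(names(fastafile) %in% subsetlist$id)

# Step 2: normalize both sides so they are comparable.
# Pull out the second "|"-separated field, and trim whitespace from the ids.
accessions <- sub("^[^|]*\\|([^|]*)\\|.*$", "\\1", names(fastafile))
wanted <- trimws(as.character(subsetlist$id))

# Step 3: subset the list with the resulting logical vector.
subsetfasta <- fastafile[accessions %in% wanted]
names(subsetfasta)
```

The original record names are kept on the subset, since only the comparison uses the normalized accessions.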

9.8 years ago

R solution:

fastafile[c(which(names(fastafile) %in% subsetlist$id))]

R can subset using a boolean vector, so the which is unnecessary here. I don't even get the point of the c. I wonder how this answer got 4 upvotes.

fastafile[names(fastafile) %in% subsetlist$id]
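
A quick way to convince yourself that the two forms are equivalent, using a small hypothetical named list in place of the read.fasta result:

```r
# Hypothetical records and ids, for illustration only.
fastafile <- list(seqA = "MKT", seqB = "MSD", seqC = "MAL")
subsetlist <- data.frame(id = c("seqA", "seqC"), stringsAsFactors = FALSE)

# Logical vector with one entry per record.
keep <- names(fastafile) %in% subsetlist$id

by_logical <- fastafile[keep]            # subset directly with the logical vector
by_which   <- fastafile[c(which(keep))]  # same subset via integer positions

# Both routes should select the same records.
identical(by_logical, by_which)
```

Neither form helps if the names and ids do not match in the first place; in that case both return an empty named list.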
9.8 years ago

Not an R solution:

UCSC utilities

$ ./faSomeRecords main.fasta id.txt output.fa

The -exclude option outputs the sequences from main.fasta that are not listed in id.txt.
