how to keep reads in a fasta file based on a seq id list in R?
2
0
Entering edit mode
3.6 years ago
mthm ▴ 50

I have seen many tools of filtering a fasta file based on a sequence list, which excludes the provided list of ids from the original file however, I have a list of ids to keep and I want to remove all the rest from the fasta file. I prefer to do that in R, had a look at the functions of the package "ShortRead" but couldn't figure it out which function to use, can anyone give me a hint?

R filter fasta • 3.3k views
ADD COMMENT
0
Entering edit mode

I haven't tried this but you can have a look at it.

subset.fasta function form FastaUtils package.

ADD REPLY
0
Entering edit mode

ah it doesn't want to be installed on my version of R, however, I tried this:

ref <- readFasta("XXX.fa")
list <- read.csv("*.txt")
filter <- ref[names(ref) %in% list$id]

but I have problem extracting it:

write.fasta(names= filter, sequences= filter, file.out= "subset.fa", nbchar = 60, as.string = FALSE)
Error in as.character.default(new("ShortRead", sread = new("DNAStringSet",  : 
  no method for coercing this S4 class to a vector
ADD REPLY
2
Entering edit mode
3.6 years ago

I am not sure if R is the best way to do this if you are doing following (unless you have a valid reason):

ref <- readFasta("XXX.fa")
list <- read.csv("*.txt")
filter <- ref[names(ref) %in% list$id]

I would suggest seqkit, seqtk for sub setting fasta files with names from another list. Resultant fasta file can be loaded in R for further processing. For generic fasta processing, CLI based, fasta specific tools are better IMO.

do this way if you still want to do in R:

library(Biostrings)
ref=readDNAStringSet("test.fa")
list = read.csv ("*.txt") # check field separators, white spaces, case and other formatting marks
filtered_ref=ref[names(ref) %in% list$seq]

list$seq assumes that the sequence names are stored in seq column in list object

ADD COMMENT
0
Entering edit mode

thanks, it worked.

ADD REPLY
2
Entering edit mode
3.6 years ago
Ram 44k

In your search, you'd have come across seqkit. seqkit grep has an --invert-match option that should be of use to you.

seqkit grep -n -v -f ids_file.txt sequences.fasta #not tested but should work
ADD COMMENT
0
Entering edit mode

I have used samtools faidx, but I haven't used seqkit, I'll try it thanks.

ADD REPLY

Login before adding your answer.

Traffic: 1810 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6