Question

how to keep reads in a fasta file based on a seq id list in R?

0

Entering edit mode

3.6 years ago

mthm ▴ 50

I have seen many tools of filtering a fasta file based on a sequence list, which excludes the provided list of ids from the original file however, I have a list of ids to keep and I want to remove all the rest from the fasta file. I prefer to do that in R, had a look at the functions of the package "ShortRead" but couldn't figure it out which function to use, can anyone give me a hint?

R filter fasta • 3.3k views

ADD COMMENT • link updated 3.6 years ago by cpad0112 21k • written 3.6 years ago by mthm ▴ 50

0

Entering edit mode

I haven't tried this but you can have a look at it.

subset.fasta function form FastaUtils package.

ADD REPLY • link 3.6 years ago by Nitin Narwade ★ 1.6k

0

Entering edit mode

ah it doesn't want to be installed on my version of R, however, I tried this:

ref <- readFasta("XXX.fa")
list <- read.csv("*.txt")
filter <- ref[names(ref) %in% list$id]

but I have problem extracting it:

write.fasta(names= filter, sequences= filter, file.out= "subset.fa", nbchar = 60, as.string = FALSE)
Error in as.character.default(new("ShortRead", sread = new("DNAStringSet",  : 
  no method for coercing this S4 class to a vector

ADD REPLY • link 3.6 years ago by mthm ▴ 50

2

Entering edit mode

3.6 years ago

Ram 44k

In your search, you'd have come across seqkit. seqkit grep has an --invert-match option that should be of use to you.

seqkit grep -n -v -f ids_file.txt sequences.fasta #not tested but should work

ADD COMMENT • link 3.6 years ago by Ram 44k

0

Entering edit mode

I have used samtools faidx, but I haven't used seqkit, I'll try it thanks.

ADD REPLY • link 3.6 years ago by mthm ▴ 50

score 2 · Accepted Answer · 2021-05-31

I am not sure if R is the best way to do this if you are doing following (unless you have a valid reason):

ref <- readFasta("XXX.fa")
list <- read.csv("*.txt")
filter <- ref[names(ref) %in% list$id]

I would suggest seqkit, seqtk for sub setting fasta files with names from another list. Resultant fasta file can be loaded in R for further processing. For generic fasta processing, CLI based, fasta specific tools are better IMO.

do this way if you still want to do in R:

library(Biostrings)
ref=readDNAStringSet("test.fa")
list = read.csv ("*.txt") # check field separators, white spaces, case and other formatting marks
filtered_ref=ref[names(ref) %in% list$seq]

list$seq assumes that the sequence names are stored in seq column in list object