Question

Subset FASTA file using partial matching of headers with list of taxon names

24 months ago

Ellen ▴ 20

Dear all,

I am trying to subset a FASTA file by comparing its headers (which contain the full taxonomy) with a list of taxon names (genus level only; in the example below this is the "list" data frame). The output of dput() for part of my FASTA file is in a Google Doc (see the link below), because I could not find a suitable example FASTA file shipped with R or its packages and the dput() output was too large to paste here. For ease of reference I will call this object fasta_new.

Example list of taxon names here:

list <- c("Ripella_1217112", "Vannella_95228")
list <- as.data.frame(list)

Based on [this post][1], we can compare FASTA file headers with a list of values using

fasta_new[names(fasta_new) %in% list$list]

However, this only works when the values in the list exactly match the headers in the FASTA file (fasta_new). My list data frame contains only part of each FASTA header, so how can I look for a partial match between the names of the FASTA object (i.e. the headers) and the values in my list data frame (held in its single variable, named "list" in this example)?

I am not sure whether I am explaining this clearly.

Thank you!

Ellen

https://docs.google.com/document/d/1Z85bgh6W1WWG1NzaMU9ufMiH8uh4lnCs_I84n-FzsX4/edit?usp=sharing

FASTA • 1.8k views

ADD COMMENT • link updated 24 months ago by barslmn ★ 2.4k • written 24 months ago by Ellen ▴ 20

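If doing this on the command line is sufficient, you can use seqkit. match.txt is a one-column file containing the patterns (here, the genus names) you want to match, one per line. The -n flag matches against the full header rather than just the ID, -r treats each pattern as a regular expression (which allows partial matches), and -f reads the patterns from the file.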

seqkit grep -nrf match.txt input.fa > filtered.fa

If you need to use R, FASTA manipulation is generally done via Biostrings objects.

library("Biostrings")

fasta <- readDNAStringSet("in.fa")

genus <- c("Ripella_1217112", "Vannella_95228")
# A DNAStringSet is one-dimensional, so index it with a logical vector and no trailing comma
subset_fasta <- fasta[grepl(paste(genus, collapse="|"), names(fasta))]
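To see why indexing by a logical vector keeps each sequence together with its header, here is a minimal base R sketch. It uses a named character vector as a stand-in for the DNAStringSet, and the headers and sequences are invented for illustration. A DNAStringSet read with readDNAStringSet should respond to names() and [ in the same way.

```r
# Named character vector standing in for a DNAStringSet (made-up data).
fasta_demo <- c(
  "taxA;taxB;Ripella_1217112;sp1" = "ACGTACGT",
  "taxA;taxB;Vannella_95228;sp2"  = "TTGACCAA",
  "taxA;taxC;Othergenus_1;sp3"    = "GGCATTAC"
)
genus <- c("Ripella_1217112", "Vannella_95228")

# One regular expression that matches any of the genus names.
pattern <- paste(genus, collapse = "|")

# Logical vector: TRUE where a header contains one of the names.
keep <- grepl(pattern, names(fasta_demo))

# Subsetting keeps both the header (name) and its sequence (value).
subset_demo <- fasta_demo[keep]
print(subset_demo)
```

The third record is dropped because its header contains neither genus name.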

ADD REPLY • link 24 months ago by rpolicastro 13k

score 0 · Answer 1 · 2023-05-22

24 months ago

barslmn ★ 2.4k

You can use grepl for partial matching.

my_vector1 <- c("abc", "def", "ghi", "prs", "tuv", "xyz")
my_vector2 <- c("ab", "de", "gh")

my_vector1[
  grepl(paste(my_vector2, collapse="|"), my_vector1)
]

Outputs:

'abc' 'def' 'ghi'
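One caveat with pasting the names into a single pattern is that grepl treats it as a regular expression, so a name containing a character such as "." can match more than intended. A possible alternative, sketched here with made-up data, is to test each name literally with fixed = TRUE and combine the results. The helper name match_any_fixed is hypothetical and not part of any package.

```r
# Hypothetical helper: TRUE where x contains at least one pattern, matched literally.
match_any_fixed <- function(patterns, x) {
  # One logical vector per pattern, each of length(x).
  hits <- lapply(patterns, function(p) grepl(p, x, fixed = TRUE))
  # Element-wise OR across patterns; all FALSE if there are no patterns.
  Reduce(`|`, hits, logical(length(x)))
}

headers <- c("abc", "def", "ghi", "prs", "a.c", "axc")
wanted  <- c("de", "a.c")

# Literal matching: "a.c" only matches the header that really contains "a.c".
literal_hits <- headers[match_any_fixed(wanted, headers)]

# Regex matching for comparison: "." matches any character.
regex_hits <- headers[grepl(paste(wanted, collapse = "|"), headers)]

print(literal_hits)
print(regex_hits)
```

For names made only of letters, digits, and underscores, as in the question, both approaches should select the same headers.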

It would be much better if you shared your code in code blocks instead of a Google Doc.

ADD COMMENT • link 24 months ago by barslmn ★ 2.4k

Thanks!

Yes, I understand it is better to put the code in code blocks in the post, but unfortunately the code was too long and exceeded the maximum number of characters when I tried to include it, hence the Google Doc. I could not think of any other way to share it, or find an example FASTA file within R or any of its packages to showcase my issue.
Concerning your solution: the FASTA file is not a vector, so I don't understand how I could subset the whole FASTA file (not only the headers) using your code?

ADD REPLY • link 24 months ago by Ellen ▴ 20

You can subset a data frame in a similar way. The only difference is the comma inside the square brackets: the expression to the left of the comma selects rows, and leaving the right side empty keeps all columns.

my_df[
  grepl(paste(my_vector2, collapse="|"), my_df$names),
]
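As a self-contained sketch of the data frame case, assuming a made-up my_df whose names column holds the headers and whose seq column holds the sequences (both invented for illustration):

```r
# Made-up data frame: one row per FASTA record.
my_df <- data.frame(
  names = c("abc", "def", "ghi", "prs", "tuv", "xyz"),
  seq   = c("ACGT", "TTGA", "GGCA", "CCAT", "ATAT", "GCGC"),
  stringsAsFactors = FALSE
)
my_vector2 <- c("ab", "de", "gh")

# TRUE for rows whose name contains any of the partial names.
keep <- grepl(paste(my_vector2, collapse = "|"), my_df$names)

# Rows are filtered; the empty slot after the comma keeps every column.
my_subset <- my_df[keep, ]
print(my_subset)
```

Each retained row still carries its sequence, so the headers and sequences stay paired.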

ADD REPLY • link 24 months ago by barslmn ★ 2.4k