Subset FASTA file using partial matching of headers with list of taxon names
18 months ago
Ellen ▴ 20

Dear all,

I am trying to subset a FASTA file by comparing its headers (which contain the full taxonomy) with a list of taxon names (genus level only; in the example below this is the "list" data frame). The link below points to a Google Doc with the output of dput() for part of my FASTA file, which I will call fasta_new for ease of reference. I could not find a suitable example FASTA file bundled with R or its packages, and the dput() output was too large to paste here.

Example list of taxon names here:

list <- c("Ripella_1217112", "Vannella_95228")
list <- as.data.frame(list)

Based on [this post][1], we can compare FASTA file headers with a list of values using

fasta_new[names(fasta_new) %in% list$list]

but this only works when the values in the list exactly match the headers in the FASTA file (fasta_new). My data frame contains only part of each FASTA header, so how can I look for a partial match between the names of the FASTA object (i.e. the headers) and the values in my list data frame (contained in its single variable, named "list" in this example)?

Not sure whether I am explaining it clearly..

Thank you!

Ellen

https://docs.google.com/document/d/1Z85bgh6W1WWG1NzaMU9ufMiH8uh4lnCs_I84n-FzsX4/edit?usp=sharing


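If doing this on the command line is sufficient, you can use seqkit. match.txt is a one-column file containing the IDs you want to match. Here -n matches against the full header rather than just the sequence ID, -r treats each pattern as a regular expression (which allows partial matches), and -f reads the patterns from a file.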

seqkit grep -nrf match.txt input.fa > filtered.fa

If you need to use R, FASTA manipulation is generally done via Biostrings objects.

library("Biostrings")

fasta <- readDNAStringSet("in.fa")

genus <- c("Ripella_1217112", "Vannella_95228")
# A DNAStringSet is one-dimensional, so no trailing comma is needed in the index
subset_fasta <- fasta[grepl(paste(genus, collapse="|"), names(fasta))]
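
One possible refinement, sketched under assumptions: if the taxon names could contain regex metacharacters, each name can be matched as a literal substring instead of building one regex. In the sketch below a named character vector stands in for the DNAStringSet so it runs without Biostrings; the headers and sequences are made up for illustration, and the same logical index should be usable on `fasta` from the code above.

```r
# Toy stand-in for a DNAStringSet: names are FASTA headers, values are sequences.
# Headers and sequences are hypothetical, not taken from the original data.
fasta_toy <- c(
  "Eukaryota;Amoebozoa;Vannellida;Ripella_1217112;sp1" = "ACGTACGT",
  "Eukaryota;Amoebozoa;Vannellida;Vannella_95228;sp2"  = "GGCCTTAA",
  "Eukaryota;Alveolata;Ciliophora;Paramecium_5884;sp3" = "TTTTAAAA"
)

genus <- c("Ripella_1217112", "Vannella_95228")

# For each genus, test every header for a literal (non-regex) substring match,
# then combine the per-genus results with "or".
keep <- Reduce(`|`, lapply(genus, function(g) {
  grepl(g, names(fasta_toy), fixed = TRUE)
}))

# Logical indexing keeps each header together with its sequence.
subset_toy <- fasta_toy[keep]
print(names(subset_toy))
```

Note that substring matching of this kind would also keep a header containing, say, "Vannella_952281"; if that matters, the pattern needs a delimiter that fits the actual header format.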
18 months ago
barslmn ★ 2.3k

You can use grepl for partial matching.

my_vector1 <- c("abc", "def", "ghi", "prs", "tuv", "xyz")
my_vector2 <- c("ab", "de", "gh")

my_vector1[
  grepl(paste(my_vector2, collapse="|"), my_vector1)
]

Outputs:

'abc' 'def' 'ghi'
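
To see why this works, it may help to look at the intermediate values. A small sketch using the same two vectors (the names `pattern` and `is_match` are introduced here only for illustration):

```r
my_vector1 <- c("abc", "def", "ghi", "prs", "tuv", "xyz")
my_vector2 <- c("ab", "de", "gh")

# paste() joins the partial names into one regex, with "|" meaning "or"
pattern <- paste(my_vector2, collapse = "|")
print(pattern)   # "ab|de|gh"

# grepl() returns one TRUE/FALSE per element of my_vector1
is_match <- grepl(pattern, my_vector1)
print(is_match)  # TRUE TRUE TRUE FALSE FALSE FALSE
```

The logical vector is what does the subsetting, so it can index any object of the same length, not only the vector it was computed from.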

It would be much better to share your code in code blocks rather than in a Google Doc.


Thanks!

  • Yes, I understand it is better to put the code in code blocks in the post, but the code was too long and I exceeded the maximum number of characters when I tried to include it, hence the Google Doc. I could not think of another way to share the code, or find an example FASTA file within R or any of its packages to showcase my issue.
  • Concerning your solution: the FASTA object is not a vector, so I don't understand how I could subset the whole FASTA file (not only the headers) using your code.

You can subset a data frame in the same way. The only difference is the comma inside the square brackets: the expression to the left of the comma selects rows.

my_df[
  grepl(paste(my_vector2, collapse="|"), my_df$names),
]
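
A runnable version of the snippet above, as a minimal sketch: `my_df` is a made-up data frame with a `names` column and a hypothetical `value` column standing in for whatever else each row carries.

```r
# Hypothetical data frame: one row per record, "names" holds the headers
my_df <- data.frame(
  names = c("abc", "def", "ghi", "prs", "tuv", "xyz"),
  value = 1:6,
  stringsAsFactors = FALSE
)
my_vector2 <- c("ab", "de", "gh")

# Build the "or" pattern once, then use the logical vector as a row index
pattern <- paste(my_vector2, collapse = "|")
my_df_subset <- my_df[grepl(pattern, my_df$names), ]

# All columns of the matching rows are kept, not just the names
print(my_df_subset)
```

The comma matters only for two-dimensional objects such as data frames; a one-dimensional object such as a vector is indexed without it.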