Dear all,
I am trying to subset a FASTA file by comparing its headers (which contain the full taxonomy) with a list of taxon names (genus level only; in the example below this is the "list" data frame). The output of dput() for part of my FASTA file is in a Google Doc (see link below); I did not find a suitable example FASTA file embedded in R or its packages, and the output of dput() was too large to copy-paste here. For ease of reference, I will call this object fasta_new.
Example list of taxon names here:
list <- c("Ripella_1217112", "Vannella_95228")
list <- as.data.frame(list)
Based on [this post][1], we can compare FASTA file headers with a list of values using
fasta_new[names(fasta_new) %in% list$list]
However, this only works when the values in list$list exactly match the headers in the FASTA file (fasta_new), and my list data frame contains only part of each FASTA header. How can I look for a partial match between the names of the FASTA file (i.e. the headers) and the values in my list data frame (contained in its one variable, named "list" in this example)?
I am not sure whether I am explaining this clearly.
Thank you!
Ellen
https://docs.google.com/document/d/1Z85bgh6W1WWG1NzaMU9ufMiH8uh4lnCs_I84n-FzsX4/edit?usp=sharing
If doing this on the command line is sufficient, you can use seqkit.
match.txt is a one-column file containing the IDs you want to match.
If you need to use R, FASTA manipulation is generally done via Biostrings objects.
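For the R route, a partial match can be sketched with grepl() on the headers. This is a minimal sketch rather than a tested solution for the real data: the headers and sequences below are made up, a plain named character vector stands in for fasta_new, and taxa is a hypothetical name for the values held in list$list. A Biostrings XStringSet (e.g. one read with readDNAStringSet()) also supports names() and logical subsetting, so the last two statements should carry over unchanged.

```r
# Stand-in for fasta_new: a named character vector.
# Headers and sequences are invented for illustration only.
fasta_new <- c(
  "Eukaryota;Amoebozoa;Discosea;Vannellida;Ripella_1217112;sp1" = "ACGTACGT",
  "Eukaryota;Amoebozoa;Discosea;Vannellida;Vannella_95228;sp2"  = "GGCCTTAA",
  "Eukaryota;Amoebozoa;Tubulinea;Euamoebida;Amoeba_5774;sp3"    = "TTTTAAAA"
)

# Taxon names to look for (the same values as list$list in the question).
taxa <- c("Ripella_1217112", "Vannella_95228")

# For each taxon, test every header for a literal substring match,
# then combine the per-taxon logical vectors with OR.
keep <- Reduce(`|`, lapply(taxa, function(p) {
  grepl(p, names(fasta_new), fixed = TRUE)
}))

# Keep only the sequences whose header contains at least one taxon name.
fasta_subset <- fasta_new[keep]
```

fixed = TRUE treats each taxon name as a literal substring, so "Vannella_95228" would also match a header containing "Vannella_952281". If that matters for your data, include the delimiter that surrounds the genus field in your headers in the pattern.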