Hello all,
I'm trying to fine-tune a BLAST search based on taxonomy IDs, searching within a genus-level group (e.g. Mycobacterium, taxid: 1763) but excluding specific subtypes within that group (e.g. leprae, taxid: 1769) and anything more distal to that. The options -taxids and -negative_taxids cannot be used in conjunction with each other, so I'm a bit stumped about how to continue.
This is just an example; in reality there are a dozen or so subtypes to remove from Mycobacterium, and then the whole thing needs to be replicated for a handful of other bugs, so curating a list manually is doable but a massive job (and prone to error). The only other way I could think of would be to use -outfmt 6 with the staxid option and remove those entries with a script, but that would only account for exact matches and would not catch nodes below those IDs (e.g. removing leprae 1769 would not remove the subtype leprae Kyoto, taxid: 1288821).
I hope that makes sense...any suggestions greatly appreciated!
Have you actually checked this, i.e. does the taxonomy file included with BLAST include this level of detail about subspecies?
Yes - taxonomy IDs can be specified at non-leaf nodes since a couple of years ago, I think. The blast output will only return one taxid, so filtering the output manually runs the risk of missing those at subtype level.
Have you considered creating a subset database upfront using whichever IDs you need? The other option is the one Pierre noted below: get the child IDs for the top-level ones you want to remove, and use them to filter after the search.
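As a rough sketch of the filter-after-search route (not a tested pipeline): assuming the search was run with -outfmt "6 std staxids", so the taxid field is the 13th column, and assuming you already have the full set of excluded IDs including their children, something like this would drop the unwanted hits. The function name and the column index are my own choices for illustration.

```python
import csv

def filter_hits(lines, excluded_taxids, staxids_col=12):
    """Yield tabular BLAST rows whose subject taxids are all outside the excluded set.

    Assumes tab-separated output with the taxid field at index `staxids_col`
    (index 12 for -outfmt "6 std staxids"). The field may hold several IDs
    separated by ';'. Rows with no numeric taxid (e.g. "N/A") are kept.
    """
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue
        hit_taxids = {int(t) for t in row[staxids_col].split(";") if t.isdigit()}
        # Drop the hit if ANY of its taxids is in the excluded set.
        if hit_taxids.isdisjoint(excluded_taxids):
            yield row
```

Usage would be along the lines of `kept = list(filter_hits(open("hits.tsv"), excluded))`, where `excluded` has to contain the child IDs as well, otherwise you are back to exact matches only.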
It's certainly an idea. The only issue there is (there's always a 'but'!) that one aspect of an ongoing project is to periodically monitor changes to the main precompiled databases with a scheduled process. Using update_blastdb.pl is nice and simple (and fairly reliable). Creating a subset database can be done using esearch -> efetch -> makeblastdb, but my experience is that doing this entry by entry just isn't as reliable, especially if it's not being actively monitored.
So I think taking the child IDs from somewhere would be the most reproducible way of curating the taxon list. If memory serves, there is some way to interrogate the taxonomy4blast.sqlite3 database included with the precompiled versions with a one-liner or two.
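Something along these lines is what I have in mind, as a minimal sketch rather than a finished tool. It assumes the sqlite file exposes a table called TaxidInfo with columns taxid and parent; that is from memory, so check it with `.schema` in sqlite3 first. The function names are made up for illustration.

```python
import sqlite3
from collections import defaultdict
from contextlib import closing

def load_parent_pairs(db_path):
    """Read (taxid, parent) pairs. Table and column names are an assumption."""
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute("SELECT taxid, parent FROM TaxidInfo").fetchall()

def build_children(pairs):
    """Turn (taxid, parent) pairs into a parent -> [children] map."""
    children = defaultdict(list)
    for taxid, parent in pairs:
        if taxid != parent:  # a root node may list itself as its parent
            children[parent].append(taxid)
    return children

def subtree(children, root):
    """Return root plus every node below it (iterative depth-first walk)."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children.get(node, ()))
    return seen

def taxids_to_search(pairs, include_root, exclude_roots):
    """Everything under include_root, minus everything under each excluded root."""
    children = build_children(pairs)
    keep = subtree(children, include_root)
    for root in exclude_roots:
        keep -= subtree(children, root)
    return keep

# Toy parent table for illustration only, not the real NCBI tree.
toy_pairs = [(1763, 1), (1769, 1763), (1288821, 1769), (1773, 1763), (83332, 1773)]
print(sorted(taxids_to_search(toy_pairs, 1763, [1769])))
```

With the toy table, removing 1769 also removes 1288821, which is the behaviour the exact-match filter was missing. Against the real file you would pass `load_parent_pairs("taxonomy4blast.sqlite3")` instead of `toy_pairs`.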
You could extract the sequences from the nt/nr preformatted BLAST databases using blastdbcmd and the list of taxIDs you are interested in. That would be reasonably foolproof and fast. It will require you to download the entire nt/nr database though. But then, once you have a list of taxIDs, you could simply limit your BLAST+ search to just those.
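To make both routes concrete, here is a hedged sketch that writes the taxID list to a file and builds the two command lines without running them. It assumes a BLAST+ release whose blastdbcmd and blastn accept -taxidlist against a version 5 database (check `-help` on your install); the helper names and file names are hypothetical.

```python
from pathlib import Path

def write_taxid_list(taxids, path):
    """Write one taxid per line, sorted, and return the path as a string."""
    Path(path).write_text("\n".join(str(t) for t in sorted(taxids)) + "\n")
    return str(path)

def blastdbcmd_extract_cmd(db, taxid_file, out_fasta):
    """Command to dump FASTA for the listed taxids from a preformatted database."""
    return ["blastdbcmd", "-db", db, "-taxidlist", taxid_file,
            "-outfmt", "%f", "-out", out_fasta]

def blastn_restricted_cmd(db, query, taxid_file, out_tsv):
    """Command to search the full database but only against the listed taxids."""
    return ["blastn", "-db", db, "-query", query, "-taxidlist", taxid_file,
            "-outfmt", "6 std staxids", "-out", out_tsv]
```

Either list can be handed to `subprocess.run(cmd, check=True)`. The first gives you a FASTA file to feed to makeblastdb; the second skips the subset database entirely and restricts the search at run time.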
Just seen this...and it's probably a more elegant solution than mine (see below). Doh!
I might implement this in future versions of the code though because, as you say, it's super fast to retrieve the information (and it also works on the much smaller precompiled prokaryote database, so that's nice). On top of that, the resulting dataset will be considerably more memory efficient to compile into, and run against as, a standalone BLAST database.
Thanks as ever for all your help!