Question

Fine-tuning NCBI taxid / taxidlist

18 months ago

theclubstyle ▴ 40

Hello all,

I'm trying to fine-tune a BLAST search based on taxonomy IDs, searching within a genus-level group (e.g. Mycobacterium, taxid: 1763) but excluding specific subtypes within that group (e.g. leprae, taxid: 1769) and anything more distal to that. The options -taxids and -negative_taxids cannot be used in conjunction with each other, so I'm a bit stumped about how to continue.

This is just an example; in reality there are a dozen or so subtypes to remove from Mycobacterium, and then the whole thing needs to be replicated for a handful of other bugs, so curating a list manually is doable but a massive job (and prone to error). The only other way I could think of would be to use -outfmt 6 with the staxid option and remove those entries with a script, but that would only account for exact matches and not take into account nodes more distal to those IDs (e.g. removing leprae 1769 would not remove the subtype leprae Kyoto, taxid: 1288821).

I hope that makes sense...any suggestions greatly appreciated!

BLAST taxonomy • 2.1k views

e.g. removing leprae 1769 would not remove the subtype leprae Kyoto, taxid: 1288821

Have you actually checked this, i.e. does the taxonomy file included with BLAST include this level of detail about subspecies?

ADD REPLY • link 18 months ago by GenoMax 153k

Yes - taxonomy IDs can be specified at non-leaf nodes since a couple of years ago, I think. The blast output will only return one taxid, so filtering the output manually runs the risk of missing those at subtype level.

ADD REPLY • link 18 months ago by theclubstyle ▴ 40

Have you considered creating a subset database upfront using whichever IDs you need? The other option is the one Pierre noted below: get the child IDs for the top-level ones you want to remove, and use them to filter after the search.

ADD REPLY • link 18 months ago by GenoMax 153k

It's certainly an idea. The only issue there is (there's always a 'but'!) that one aspect of an ongoing project is to periodically monitor changes to the main precompiled databases with a scheduled process. Using update_blastdb.pl is nice and simple (and fairly reliable). Creating a subset database can be done using esearch -> efetch -> makeblastdb, but my experience is that doing this entry-by-entry just isn't as reliable, especially if it's not being actively monitored.

So I think taking the child IDs from somewhere would be the most reproducible way of curating the taxon list. If memory serves, there is some way to interrogate the taxonomy4blast.sqlite3 database included with the precompiled versions with a one-liner or two.
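One way to do that kind of lookup in a single query is a recursive common table expression, which SQLite supports. The sketch below assumes a table TaxidInfo with taxid and parent columns (the layout queried in the answer further down the thread); the function name descendants is made up for illustration, so check the schema of your own copy before relying on it.

```python
import sqlite3


def descendants(conn, root_taxid):
    """Return root_taxid plus every taxid below it in the tree.

    Assumes a table TaxidInfo(taxid, parent); this is an assumption
    about taxonomy4blast.sqlite3, not a documented guarantee.
    """
    query = """
        WITH RECURSIVE subtree(taxid) AS (
            SELECT ?
            UNION
            SELECT t.taxid
            FROM TaxidInfo AS t
            JOIN subtree AS s ON t.parent = s.taxid
        )
        SELECT taxid FROM subtree
    """
    # UNION (rather than UNION ALL) drops repeats, so a root node that
    # lists itself as its own parent cannot cause an endless loop.
    return {row[0] for row in conn.execute(query, (int(root_taxid),))}
```

Called as descendants(sqlite3.connect('/path_to/taxonomy4blast.sqlite3'), 1769), this should return 1769 together with everything beneath it, provided the schema assumption holds.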

ADD REPLY • link 18 months ago by theclubstyle ▴ 40

You could extract the sequences from the nt/nr preformatted BLAST databases using blastdbcmd and the list of taxIDs you are interested in. That would be reasonably foolproof and fast. It will require you to download the entire nt/nr database though.

Then again, once you have a list of taxIDs you could simply limit your blast+ search to just those.
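A minimal sketch of how such a positive list could be built: take everything under the group you want, subtract everything under each subtype you don't, and write the remainder one ID per line. The children mapping (parent taxid to child taxids) and both function names are hypothetical; how you populate the mapping is up to you.

```python
def build_include_list(children, include_root, exclude_roots):
    """Taxids under include_root, minus the subtree of each excluded root.

    children: dict mapping a parent taxid to a list of its child taxids
    (a hypothetical in-memory form of the taxonomy tree).
    """
    def subtree(root):
        found, stack = set(), [root]
        while stack:
            node = stack.pop()
            if node in found:
                continue
            found.add(node)
            stack.extend(children.get(node, []))
        return found

    keep = subtree(include_root)
    for root in exclude_roots:
        keep -= subtree(root)
    return sorted(keep)


def write_taxidlist(taxids, path):
    """Write one taxid per line, the layout a taxid list file uses."""
    with open(path, "w") as handle:
        handle.writelines(f"{taxid}\n" for taxid in taxids)
```

The resulting file is the sort of thing the -taxidlist option expects. One caveat: if your BLAST+ version expands non-leaf IDs to their descendants, as mentioned earlier in the thread, an ID such as 1763 that sits above an excluded node could pull the excluded entries back in, so it is worth checking how your version treats them.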

ADD REPLY • link 18 months ago by GenoMax 153k

Just seen this...and it's probably a more elegant solution than mine (see below). Doh!

I might implement this in future versions of the code though because, as you say, it's super fast to retrieve the information (and it also works on the much smaller precompiled prokaryote database, so that's nice). On top of that, the resulting dataset will be considerably more memory efficient to compile as, and run against as, a standalone BLAST database.

Thanks as ever for all your help!

ADD REPLY • link 18 months ago by theclubstyle ▴ 40

score 0 · Answer 1 · 2024-03-20

18 months ago

Pierre Lindenbaum 166k

You can use the UniProt SPARQL endpoint to get all the descendants of a taxon ID: example query (descendants of Homo sapiens, 9606).

Put your own taxon ID in the URL, then use the IDs it returns to filter your BLAST output.
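The linked query itself is not reproduced here, so the following is only a guess at its general shape, not Pierre's original. It assumes the public endpoint at https://sparql.uniprot.org/sparql and assumes that rdfs:subClassOf there already covers indirect descendants (if it does not, the property-path form rdfs:subClassOf+ is the usual alternative). Both function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumed public UniProt SPARQL endpoint
ENDPOINT = "https://sparql.uniprot.org/sparql"


def build_descendant_query(taxid):
    """Build a SPARQL query asking for taxa classified under taxid."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "PREFIX taxon: <http://purl.uniprot.org/taxonomy/>\n"
        "SELECT ?taxon WHERE { ?taxon rdfs:subClassOf taxon:%d . }" % int(taxid)
    )


def fetch_descendants(taxid):
    """Run the query over HTTP and return the numeric taxids found."""
    params = urllib.parse.urlencode({"query": build_descendant_query(taxid)})
    request = urllib.request.Request(
        ENDPOINT + "?" + params,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        results = json.load(response)
    # Each binding holds a taxon URI ending in the numeric ID
    return [
        int(binding["taxon"]["value"].rsplit("/", 1)[-1])
        for binding in results["results"]["bindings"]
    ]
```

fetch_descendants needs network access and is untested against rate limits or large result sets; for a genus-sized subtree the response should be small.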

score 0 · Answer 2 · 2024-03-21

Thanks GenoMax and @Pierre, your help has been really useful!

Got there in the end; I used a local SQL approach using the taxonomy4blast.sqlite3 database to retrieve child taxon IDs in a recursive lookup (i.e. node > child > child > child, until no more entries exist), appending those to a list. Then it's just a case of running local BLAST within Python with -outfmt 6 including the staxids field, then removing any entries that contain the specified parent or child negative taxon IDs:

import sqlite3

# Assumed input file: one parent taxid to exclude per line (placeholder name)
negative_taxidfilename = 'negative_taxids.txt'

# Query sqlite3 to find the full list of taxa to ignore should a parent taxon be specified
def find_child_taxa(conn, excluded_taxa):
    cursor = conn.cursor()
    seen = set(excluded_taxa)      # fast membership test
    to_visit = list(excluded_taxa)  # taxa whose children still need looking up

    while to_visit:
        taxon_ID = to_visit.pop()
        cursor.execute("SELECT taxid FROM TaxidInfo WHERE parent = ?", (taxon_ID,))
        for (next_ID,) in cursor.fetchall():
            if next_ID not in seen:
                # New child found: record it and search beneath it as well
                seen.add(next_ID)
                excluded_taxa.append(next_ID)
                to_visit.append(next_ID)
    cursor.close()

# Define negative taxa from file (as integers, to match the values sqlite3 returns)
excluded_taxa = []
with open(negative_taxidfilename, 'r') as neg_taxfile:
    for line in neg_taxfile:
        if line.strip():
            excluded_taxa.append(int(line.strip()))

# Connect to sqlite3 dbase
conn = sqlite3.connect('/path_to/taxonomy4blast.sqlite3')

# Find and add corresponding child IDs until no new IDs are found
find_child_taxa(conn, excluded_taxa)

# close dbase
conn.close()

print("Excluded taxa:", excluded_taxa)

# Continue code, filter excluded_taxa from blastout results using -outfmt 6 staxids
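
The filtering step left as a comment above could look something like the sketch below. It assumes tab-separated output with the staxids field as the last column (for example from -outfmt "6 std staxids") and allows for several IDs in that field separated by semicolons. The function name and the column default are illustrative, not part of the original script.

```python
def filter_blast_rows(lines, excluded_taxa, staxids_column=-1):
    """Keep tabular BLAST rows whose subject taxids avoid excluded_taxa.

    lines: iterable of tab-separated rows (an open file works).
    staxids_column: index of the staxids field; assumed to be last.
    """
    excluded = {str(taxid) for taxid in excluded_taxa}
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        # A hit may carry several taxids separated by semicolons
        hit_taxa = set(fields[staxids_column].split(";"))
        if hit_taxa.isdisjoint(excluded):
            kept.append(fields)
    return kept
```

A hit is dropped here if any one of its taxids is excluded. That is the stricter choice; keeping hits that have at least one non-excluded taxid would be the alternative.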