Question

Creating pathogen test database from nt NCBI

6 weeks ago

r.d.jongh ▴ 20

We are trying to build a comprehensive, up-to-date (plant) pathogen database (viruses, viroids, fungi, oomycetes, bacteria) in FASTA format. To that end we downloaded the nt database from Index of /blast/db/v5, and we are now trying to filter it with blastdbcmd from the BLAST+ suite, using the -taxids option (or -taxidlist with a file of taxids):

blastdbcmd -db /nak2_nanopore/nt/nt -taxids 4751,4762,2,10239 -outfmt '>W%g|tid|%T|%a|%t\n%s' | sed  's/\\n/\n/'  | awk  '/^>/ {gsub(/[^a-zA-Z0-9_|. ]/, "", $0); gsub(/ /, "_", $0); print ">"$0; next} {print}' > pathogens.fasta

Unfortunately, the output of this command contains plant, human, insect, and other non-target sequences, which are exactly what we are trying to exclude.

For example, when we test on a single volume of the database (nt.109) with the following command:

blastdbcmd -db /nak2_nanopore/nt/nt.109 -taxidlist ~/test_taxidfile.txt -outfmt '>W%g|tid|%T|%a|%t\n%s' | sed  's/\\n/\n/'  | awk  '/^>/ {gsub(/[^a-zA-Z0-9_|. ]/, "", $0); gsub(/ /, "_", $0); print ">"$0; next} {print}' > nt_109_retry.fasta

we find the first hit to be:

>W2639371760|tid|82600|XM_061863963.1|PREDICTED_Cydia_pomonella_protein_groucholike_LOC133527073_mRNA

That is a moth (Cydia pomonella), not a pathogen. Can you help us with this issue, or do you know an alternative way to create such a database?
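To gauge how widespread the contamination is, it may help to count records per taxid in the extracted FASTA. The sketch below assumes the header layout produced by the commands above (`>W<gi>|tid|<taxid>|<accession>|<title>`), so the taxid is the third `|`-separated field; `tally_taxids` is a hypothetical helper name, not part of BLAST+.

```shell
# Count how many FASTA records each taxid contributes.
# Assumes headers look like: >W<gi>|tid|<taxid>|<accession>|<title>
tally_taxids() {
    awk -F'|' '/^>/ { count[$3]++ }
               END  { for (t in count) print count[t] "\t" t }' "$1" |
        sort -k1,1nr -k2,2n   # most frequent taxid first, ties by taxid
}

# Usage (file name taken from the test command above):
#   tally_taxids nt_109_retry.fasta | head
```

Each output line is a record count, a tab, and a taxid. Any taxid in that list that does not belong to fungi, oomycetes, bacteria, or viruses points to an entry in the taxid list that should not be there.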

Blast NCBI Metagenomics • 248 views

Updated 6 weeks ago by GenoMax 148k • written 6 weeks ago by r.d.jongh ▴ 20

score 1 · Answer 1 · 2024-11-04

You have to obtain the species-level taxIDs for this to work correctly. BLAST+ includes a utility script for this purpose called get_species_taxids.sh. You will also need EntrezDirect installed.

get_species_taxids.sh usage:
        -t <taxonomy ID>
                Get taxonomy IDs at or below input taxonomy ID level
        -n <Scientific Name, Common Name or Keyword>
                Get taxonomy information for organism
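As a rough sketch of how this could be wired together, the function below calls `get_species_taxids.sh -t` once per higher-level taxID and merges the results into one de-duplicated list suitable for `-taxidlist`. The function name `expand_taxids` and the `GET_TAXIDS` override variable are hypothetical, introduced only for illustration; the taxIDs and paths in the usage comment are the ones from the question. It assumes the script prints one taxID per line.

```shell
# Expand each higher-level taxID into species-level taxIDs, one per line.
# GET_TAXIDS is a hypothetical override hook; by default the BLAST+ helper
# script get_species_taxids.sh is called (it needs EntrezDirect on PATH).
expand_taxids() {
    for taxid in "$@"; do
        "${GET_TAXIDS:-get_species_taxids.sh}" -t "$taxid"
    done | sort -n -u   # numeric sort, duplicates removed
}

# Usage sketch (needs BLAST+ and EntrezDirect installed):
#   expand_taxids 4751 4762 2 10239 > pathogen_species.txids
#   blastdbcmd -db /nak2_nanopore/nt/nt -taxidlist pathogen_species.txids \
#       -outfmt '>W%g|tid|%T|%a|%t\n%s' | sed 's/\\n/\n/' > pathogens.fasta
```

Expect the lookup for a very broad group such as Bacteria (taxID 2) to return a long list and take a while, so it is probably worth saving the resulting file and reusing it.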