Creating pathogen test database from nt NCBI
6 weeks ago
r.d.jongh ▴ 20

We are trying to build a comprehensive, up-to-date (plant) pathogen database (virus/viroid/fungi/oomycetes/bacteria) in FASTA format. To that end we downloaded the nt database from Index of /blast/db/v5, and we are now trying to filter it with blastdbcmd from the blast+ suite, using the -taxids and -taxidlist options:

blastdbcmd -db /nak2_nanopore/nt/nt -taxids 4751,4762,2,10239 -outfmt '>W%g|tid|%T|%a|%t\n%s' | sed  's/\\n/\n/'  | awk  '/^>/ {gsub(/[^a-zA-Z0-9_|. ]/, "", $0); gsub(/ /, "_", $0); print ">"$0; next} {print}' > pathogens.fasta

Unfortunately, the output of this command still contains plant, human, insect and other non-target sequences, which we are trying to filter out.

For example, when we test on a single volume of the database, nt.109, with the following command:

blastdbcmd -db /nak2_nanopore/nt/nt.109 -taxidlist ~/test_taxidfile.txt -outfmt '>W%g|tid|%T|%a|%t\n%s' | sed  's/\\n/\n/'  | awk  '/^>/ {gsub(/[^a-zA-Z0-9_|. ]/, "", $0); gsub(/ /, "_", $0); print ">"$0; next} {print}' > nt_109_retry.fasta

we find the first hit to be:

>W2639371760|tid|82600|XM_061863963.1|PREDICTED_Cydia_pomonella_protein_groucholike_LOC133527073_mRNA

That is a moth (Cydia pomonella). Can you help us with this issue, or do you know an alternative way to create such a database?
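To gauge how many off-target sequences slip through, one option is to tally the taxids found in the FASTA headers. The following is a minimal sketch, assuming the header layout produced by the -outfmt string above (taxid in the third '|'-separated field); count_taxids is a hypothetical helper name, not part of blast+.

```shell
#!/usr/bin/env bash
# Sketch: count sequences per taxid in a FASTA file whose headers look like
# '>W<gi>|tid|<taxid>|<accession>|<title>' (the layout used in this post).
# count_taxids is a hypothetical helper, not a blast+ tool.
count_taxids() {
    # Usage: count_taxids FILE.fasta
    # Prints "<count> <taxid>" lines, most frequent taxid first.
    awk -F'|' '/^>/ { n[$3]++ } END { for (t in n) print n[t], t }' "$1" \
        | sort -k1,1nr -k2,2n
}
```

Running count_taxids nt_109_retry.fasta would list every taxid present in the file; any taxid outside the intended groups (such as 82600 above) marks sequences that passed the filter unexpectedly.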

Blast NCBI Metagenomics • 247 views
6 weeks ago
GenoMax 148k

You have to obtain the species-level taxIDs for this to work correctly. blast+ includes a utility script for this purpose, get_species_taxids.sh. You will also need EntrezDirect installed.

get_species_taxids.sh usage:
        -t <taxonomy ID>
                Get taxonomy IDs at or below input taxonomy ID level
        -n <Scientific Name, Common Name or Keyword>
                Get taxonomy information for organism
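As a rough sketch of how this could fit together: expand each high-level taxID into the taxIDs at or below it, write them to a file, and pass that file to blastdbcmd via -taxidlist. The function name expand_taxids, the TAXID_TOOL override variable and the output file name are assumptions made for illustration; only get_species_taxids.sh -t and blastdbcmd -taxidlist come from the thread.

```shell
#!/usr/bin/env bash
# Sketch only: expand_taxids and TAXID_TOOL are hypothetical names.
# By default the blast+ helper get_species_taxids.sh is called, which
# needs EntrezDirect and network access.
expand_taxids() {
    # Usage: expand_taxids TAXID [TAXID ...]
    # Prints a numerically sorted, de-duplicated taxid list, one per line.
    local tool="${TAXID_TOOL:-get_species_taxids.sh}"
    local id
    for id in "$@"; do
        "$tool" -t "$id"
    done | sort -n -u
}

# Example usage (not executed here; file name is an assumption):
#   expand_taxids 4751 4762 2 10239 > pathogen_species.txids
#   blastdbcmd -db /nak2_nanopore/nt/nt -taxidlist pathogen_species.txids \
#       -outfmt '>W%g|tid|%T|%a|%t\n%s' | sed 's/\\n/\n/' > pathogens.fasta
```

Expanding very broad groups such as Bacteria (taxID 2) may take a while and produce a large list, so it can be worth testing with a narrower taxID first.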