We are trying to build a comprehensive, up-to-date (plant) pathogen database (viruses, viroids, fungi, oomycetes, bacteria) in FASTA format. To that end we downloaded the nt database from the /blast/db/v5 directory index and are now trying to filter it by taxonomy with blastdbcmd from the BLAST+ suite, using the -taxids and -taxidlist options:
blastdbcmd -db /nak2_nanopore/nt/nt -taxids 4751,4762,2,10239 -outfmt '>W%g|tid|%T|%a|%t\n%s' | sed 's/\\n/\n/' | awk '/^>/ {gsub(/[^a-zA-Z0-9_|. ]/, "", $0); gsub(/ /, "_", $0); print ">"$0; next} {print}' > pathogens.fasta
The output of this command unfortunately contains plant, human, insect, etc. sequences, which we are trying to filter out.
For example, when we test on a single volume of the database, nt.109, with the following command:
blastdbcmd -db /nak2_nanopore/nt/nt.109 -taxidlist ~/test_taxidfile.txt -outfmt '>W%g|tid|%T|%a|%t\n%s' | sed 's/\\n/\n/' | awk '/^>/ {gsub(/[^a-zA-Z0-9_|. ]/, "", $0); gsub(/ /, "_", $0); print ">"$0; next} {print}' > nt_109_retry.fasta
we find the first hit to be:
>W2639371760|tid|82600|XM_061863963.1|PREDICTED_Cydia_pomonella_protein_groucholike_LOC133527073_mRNA
This is a moth (Cydia pomonella, the codling moth). Can you help us with this issue, or do you have an alternative way to create such a database?
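To show how widespread the problem is, the taxids in the output headers can be tallied. This is only a rough sketch: it assumes the header layout produced by our -outfmt string above (W<gi>|tid|<taxid>|<accession>|<title>, so the taxid is the third pipe-separated field), and the function name tally_taxids is ours, not part of BLAST+.

```shell
# Count how many sequences carry each taxid in a FASTA file whose headers
# look like: >W<gi>|tid|<taxid>|<accession>|<title>
# (tally_taxids is a hypothetical helper name, not a BLAST+ tool.)
tally_taxids() {
    # Field 3 of each header line is the taxid; print "count<TAB>taxid",
    # most frequent taxid first.
    awk -F'|' '/^>/ { count[$3]++ }
               END  { for (t in count) print count[t] "\t" t }' "$1" |
        sort -k1,1nr -k2,2n
}

# Usage (file name taken from the test command above):
#   tally_taxids nt_109_retry.fasta | head
```

Any taxid in that list that does not fall under the requested groups (such as 82600 in the hit above) is an unwanted sequence.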