Hi there,
there are several threads about creating a local BLAST database filtered by organism. With:
blastdbcmd -db nr -entry all -outfmt "%g %T" | awk ' { if ($2 == 9606) { print $1 } } ' | blastdbcmd -db nr -entry_batch - -out human_sequences.txt
it is possible to filter the DB for only human entries (txid: 9606). Nice!
But has anyone actually done this? Because of the file sizes, I split the job into 60 separate jobs, one per part of the NR database. This works really well except for parts 08, 15 and 34: those jobs run ridiculously long, and the output files were already 12 times bigger than the original DB file at the point where I stopped the script. The created files also seem to contain redundant copies of some entries, which is what inflates the file size. Is this intended? Why would multiple copies of one FASTA entry be needed to build the 'human NR' database later?
Any suggestions?
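For context, splitting per NR volume presumably amounts to running the same pipeline once per part, roughly like this sketch (nr.08 as an example volume; the output file name is a placeholder), repeated for each of the 60 parts:

blastdbcmd -db nr.08 -entry all -outfmt "%g %T" | awk '$2 == 9606 {print $1}' | blastdbcmd -db nr.08 -entry_batch - -out human_sequences_08.fasta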
Why do you want to filter when you can restrict your search to certain organisms or IDs?
That requires the -remote option, which again queries the NCBI servers. Does that outsource the complete BLAST search, or is it only the organism restriction that is handled remotely in this case?
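For reference, the remote variant of that restriction would be something along these lines (a sketch; query.fasta and results.txt are placeholders, and -entrez_query only works together with -remote):

blastp -remote -db nr -query query.fasta -entrez_query "txid9606[ORGN]" -out results.txt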
Using options like -gilist, -seqidlist, -negative_gilist etc. from the standalone BLAST tools, you can achieve the restricted search.

So I'll create a GI list with the first two thirds of the command above
and run with -gilist gi_list.list? That sounds so uncomplicated :) Thanks!
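That would amount to something like the following sketch (query.fasta and results.txt are placeholders):

blastdbcmd -db nr -entry all -outfmt "%g %T" | awk '$2 == 9606 {print $1}' > gi_list.list
blastp -db nr -query query.fasta -gilist gi_list.list -out results.txt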
You could remove duplicate GI IDs based on the protein name, except for unnamed/unknown ones. For example, 119597083, 119597084, 119597085 and 119597086 all code for actin related protein 2/3 complex, subunit 1B, 41kDa, isoform CRA_a and are 100% identical end-to-end. [I was surprised by the existence of multiple copies of the same sequence in the NR database.]
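A rough sketch of that deduplication, assuming a gi_list.list built as above and assuming blastdbcmd's "%g %t" output gives one GI plus title per line; the first GI seen for each title is kept, and unnamed/unknown entries are all passed through:

# dump GI plus title for every GI on the list
blastdbcmd -db nr -entry_batch gi_list.list -outfmt "%g %t" > gi_titles.txt
# keep one GI per title; unnamed/unknown entries are printed untouched
awk 'tolower($0) ~ /unnamed|unknown/ {print $1; next}
     { title = substr($0, index($0, " ") + 1)
       if (!(title in seen)) { seen[title] = 1; print $1 } }' gi_titles.txt > gi_list_dedup.list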