I'm using a local installation of BLAST with the nr database. I'm mostly looking for viral species, but the viral RefSeq database misses some hits compared to a full nr search, and the nr search takes a long time. I'd like to reduce the search time by removing species I don't care about from the database. For example, I'd like to remove all eukaryote DNA from nr. Is this possible? How would I go about doing that?
Look up the taxonomy ID of the organism you're studying, then search for that ID at NCBI.
Next, choose the type of sequence you need (nucleotide or protein) and browse all sequences from organisms under that taxonomy ID. Then download the entire GI list.
Use the GI list file with the blastdbcmd tool to retrieve all matching sequences from nr in FASTA format. With this FASTA file, you can rebuild your database filtered by taxonomy.
I've only done this once and don't remember every detail, but these are the steps.
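The steps above can be sketched roughly as follows. This is only a minimal sketch, assuming BLAST+ is installed and on your PATH, that nr is a protein database, and that the ID list has already been downloaded from NCBI. The file names (`ids.txt`, `subset.fasta`, `nr_subset`) and the helper function names are made up for illustration.

```python
import subprocess


def extract_cmd(db, id_list, fasta_out):
    """Build the blastdbcmd call that dumps the listed entries as FASTA."""
    return [
        "blastdbcmd",
        "-db", db,                # source database, e.g. nr
        "-entry_batch", id_list,  # text file with one sequence ID per line
        "-outfmt", "%f",          # %f = FASTA output
        "-out", fasta_out,
    ]


def rebuild_cmd(fasta_in, db_out, dbtype="prot"):
    """Build the makeblastdb call that turns the FASTA file into a new database."""
    return [
        "makeblastdb",
        "-in", fasta_in,
        "-dbtype", dbtype,        # "prot" for nr; use "nucl" for nucleotide data
        "-out", db_out,
    ]


def build_filtered_db(db, id_list, fasta_out, db_out):
    """Run both steps in order; stops with an error if either command fails."""
    for cmd in (extract_cmd(db, id_list, fasta_out), rebuild_cmd(fasta_out, db_out)):
        subprocess.run(cmd, check=True)


# Example call (hypothetical file names; needs BLAST+ and a local nr):
# build_filtered_db("nr", "ids.txt", "subset.fasta", "nr_subset")
```

The new database can then be searched with `-db nr_subset` in place of `-db nr`.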
Thanks! I'm actually trying to do the opposite of this. I'm interested in a lot of species so I want to remove sequences from the database that I know I'm not looking for so that I can hopefully speed up my search.
You can keep using the same database by passing the following command: