Hi there,
there are several threads about creating a local BLAST database filtered by organism. With:
blastdbcmd -db nr -entry all -outfmt "%g %T" | awk ' { if ($2 == 9606) { print $1 } } ' | blastdbcmd -db nr -entry_batch - -out human_sequences.txt
it is possible to filter the DB for only human entries (txid: 9606). Nice!
But has anyone actually done this? Because of the file sizes, I split the job into 60 separate jobs, one per part of the NR database. This works really well except for parts 08, 15 and 34: those jobs run ridiculously long, and the output files were already 12 times bigger than the original DB file at the point where I stopped the script. The created files also seem to contain redundant copies of some entries, which is what inflates the file size. Is this intended? Why would multiple copies of one FASTA entry be needed to build the 'human NR' database later?
Any suggestions?
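For context, splitting per NR volume presumably amounts to running the same pipeline once per part, roughly like this sketch (nr.08 as an example volume; the output file name is a placeholder), repeated for each of the 60 parts:

blastdbcmd -db nr.08 -entry all -outfmt "%g %T" | awk '$2 == 9606 {print $1}' | blastdbcmd -db nr.08 -entry_batch - -out human_sequences_08.fasta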
Why do you want to filter when you can restrict your search to certain organisms or IDs?
That requires the -remote option, which again queries the NCBI servers. Does that outsource the complete BLAST search, or is it only the organism restriction that is handled remotely in this case?
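For reference, the remote variant of that restriction would be something along these lines (a sketch; query.fasta and results.txt are placeholders, and -entrez_query only works together with -remote):

blastp -remote -db nr -query query.fasta -entrez_query "txid9606[ORGN]" -out results.txt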
Using options like -gilist, -seqidlist, -negative_gilist etc. from the standalone BLAST tools, you can achieve the restricted search.

So I'll create a GI list with the first two thirds of the command above
and run with -gilist gi_list.list? That sounds so uncomplicated :) Thanks!
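That would amount to something like the following sketch (query.fasta and results.txt are placeholders):

blastdbcmd -db nr -entry all -outfmt "%g %T" | awk '$2 == 9606 {print $1}' > gi_list.list
blastp -db nr -query query.fasta -gilist gi_list.list -out results.txt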
You could remove duplicate GI IDs based on the protein name, except for unnamed/unknown ones. For example, 119597083, 119597084, 119597085 and 119597086 all code for actin related protein 2/3 complex, subunit 1B, 41kDa, isoform CRA_a and are 100% identical end-to-end. [I was surprised by the existence of multiple copies of the same sequence in the NR database.]
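A rough sketch of that deduplication, assuming a gi_list.list built as above and assuming blastdbcmd's "%g %t" output gives one GI plus title per line; the first GI seen for each title is kept, and unnamed/unknown entries are all passed through:

# dump GI plus title for every GI on the list
blastdbcmd -db nr -entry_batch gi_list.list -outfmt "%g %t" > gi_titles.txt
# keep one GI per title; unnamed/unknown entries are printed untouched
awk 'tolower($0) ~ /unnamed|unknown/ {print $1; next}
     { title = substr($0, index($0, " ") + 1)
       if (!(title in seen)) { seen[title] = 1; print $1 } }' gi_titles.txt > gi_list_dedup.list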