Can I use a list of accessions to make a bacteria subset Nr database?
Or is there any other idea for making a bacteria subset Nr database?
Or where can I get a GI-format Nr database?
I think the answer is no as things stand now, since blastdb_aliastool has no option to accept a list of accession numbers.
Let us hope NCBI plans to address these sorts of associated needs in new releases of its software (BLAST, E-utilities, etc.), with the GI numbers going away soon.
Did you ever find an answer? I am trying to do the same thing and I am close. I was able to write a Python script to download all the bacterial GIs and I made an alias database, but when I search against it, it doesn't work. I am trying to figure out why now. Were you successful?
The latest version of BLAST+ (v2.6.0) has a new blastdb_aliastool with these options (I have not tried them):
Application to create BLAST database aliases, version 2.6.0+
This application has three modes of operation:
1) GI file conversion:
Converts a text file containing GIs (one per line) to a more efficient
binary format. This can be provided as an argument to the -gilist option
of the BLAST search command line binaries or to the -gilist option of
this program to create an alias file for a BLAST database (see below).
2) Alias file creation (restricting with GI List or Sequence ID List):
Creates an alias for a BLAST database and a GI or ID list which restricts
this database. This is useful if one often searches a subset of a database
(e.g., based on organism or a curated list). The alias file makes the
search appear as if one were searching a regular BLAST database rather
than the subset of one.
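The alias-creation mode described in the help text could be driven from Python roughly as follows. This is only a sketch: the file names (bacteria.gis, nr_bacteria) and the helper build_alias_command are hypothetical, and it assumes a local copy of the nr database plus a blastdb_aliastool that accepts -gilist (and, in 2.6.0+, -seqidlist). The command is printed rather than run, so it can be checked first.

```python
import shlex


def build_alias_command(db, id_file, out_name, title, use_seqids=False):
    """Return the blastdb_aliastool argument list for a restricted alias database."""
    # -gilist restricts by GI number; -seqidlist (assumed to be available in
    # BLAST+ 2.6.0 and later) restricts by sequence ID / accession instead.
    list_flag = "-seqidlist" if use_seqids else "-gilist"
    return [
        "blastdb_aliastool",
        "-db", db,
        "-dbtype", "prot",
        list_flag, id_file,
        "-out", out_name,
        "-title", title,
    ]


# Hypothetical file names, for illustration only.
cmd = build_alias_command("nr", "bacteria.gis", "nr_bacteria", "nr restricted to Bacteria")
print(" ".join(shlex.quote(part) for part in cmd))

# To actually run it (requires BLAST+ on PATH and a local nr database):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing use_seqids=True swaps in -seqidlist, which is the option to try if only accessions are available.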
And then you make the db from all.faa.NR100. I'm not sure if 80 GB is actually enough RAM for the clustering part, though. Also, you should probably curate the cluster representative headers. Overall, a really bad solution. A few of the downloads will probably fail.
I think I got it to work by making an alias database. The problem was downloading all the GI numbers for all bacteria; it is a 3 GB file. I wrote a Biopython script (see below) that does it in chunks of 10,000 GIs. Then I used cat to put them all together, and I used that GI list to make the alias database. It works. Also, the new BLAST can convert the GI list to binary. I have not tried that, but it might be a good option.
from Bio import Entrez
import pandas as pd

Entrez.email = "email@here"
start = 0  # last  # double check, as I think there is a cutoff at 9938 to 10966
stop = 318437911
skip = 10000
for i in range(start, stop, skip):
    last = i  # remember the last offset attempted, so a failed run can be resumed
    if i % 1000000 == 0:
        print(i)  # progress report every million records
    filename = 'downloadArch/python_archaea' + str(i) + '.gis'
    try:
        handle = Entrez.esearch(db="protein", retmax=10000, retstart=i, term="Archaea[organism]")
        record = Entrez.read(handle)
    except Exception:
        # retry once if the request or the parsing fails
        handle = Entrez.esearch(db="protein", retmax=10000, retstart=i, term="Archaea[organism]")
        record = Entrez.read(handle)
        print('except', i)
    handle.close()
    # write this chunk of IDs, one per line
    gis = pd.DataFrame(record['IdList'][:])
    gis.to_csv(filename, sep=' ', index=False, header=False)
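The chunk files written by the script above were combined with cat. A rough Python equivalent is sketched below, assuming each chunk file holds one ID per line. The helper merge_gi_chunks and the example paths are hypothetical. It also drops blank lines and duplicates, but it keeps every ID in memory to do so, which may not be practical for the full list (plain cat or sort -u would be lighter).

```python
import glob


def merge_gi_chunks(pattern, out_path):
    """Concatenate chunk files into one GI list, skipping blanks and duplicates.

    Returns the number of unique IDs written.
    """
    seen = set()  # all IDs are held in memory; fine for tests, heavy for huge lists
    with open(out_path, "w") as out:
        for path in sorted(glob.glob(pattern)):
            with open(path) as chunk:
                for line in chunk:
                    gi = line.strip()
                    if gi and gi not in seen:
                        seen.add(gi)
                        out.write(gi + "\n")
    return len(seen)


# Example call (hypothetical paths, matching the naming used in the script):
# total = merge_gi_chunks("downloadArch/python_archaea*.gis", "archaea_all.gis")
# print(total, "unique GIs written")
```

The merged file can then be passed to blastdb_aliastool as the GI list.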
Curious as to how you managed to get GI numbers from NCBI (can you post a few examples?). Even though they are still in internal use, they are not available in any public-facing NCBI services, AFAIK.
I think I am still getting the GI numbers all right. My script got them, and when I follow the directions to get them from the website, it also works. I used this page as a starting point: "Vertebrate Subset Nr Database? Build My Own?" Unless I am not getting all of them...
Here are some random ones from one of the files.
751382397
751382396
751382395
751382394
751382393
751382392
751382391
751382390
751382389
751382388
751382387
751382386
751382385
751382384
751382383
751382382
751382381
751382380
Is there any way to solve this problem? I really need a bacteria subset Nr database. Thanks.
NCBI no longer uses GI numbers.