Can I Use blastdb_aliastool and Accessions List Make a Bacteria Subset Nr Database
1
0
Entering edit mode
8.3 years ago

I have know that NCBI has (or is about to) stopped supporting gi numbers. It change to use Accessions.

I have use "Question: Vertebrate Subset Nr Database? Build My Own?" but the out put is "BLASTDB alias file creation failed. Some referenced files may be missing" So I think the problem is that nr change to use Accessions.

Can I use accessions list to make a bacteria subset Nr database? Or is there any ideal to make a bacteria subset Nr database? Or where can I get a gi format Nr database?

Thanks a lot.

genome blast next-gen • 3.5k views
ADD COMMENT
0
Entering edit mode

I think the answer is no as things stand now since there is no option for blastdb_aliastool to accept a list of accession numbers.

Let us hope NCBI has plans for addressing these sort of associated needs with new releases of various software (blast, eutils etc) with the gi numbers going away soon.

USAGE
  blastdb_aliastool [-h] [-help] [-gi_file_in input_file]
    [-gi_file_out output_file] [-db dbname] [-dbtype molecule_type]
    [-title database_title] [-gilist input_file] [-out database_name]
    [-dblist database_names] [-dblist_file file_name] [-vdblist vdb_names]
    [-vdblist_file file_name] [-num_volumes positive_integer]
    [-logfile File_Name] [-version]
ADD REPLY
0
Entering edit mode

Is there any way to solev this problem? I really need a Bacteria Subset Nr Database. Thanks.

ADD REPLY
0
Entering edit mode

Did you ever find an answer. I am trying to do the same thing and I am close. I was able to write a python script to download all the bacterial gi's and I made an alias database but when I search against it, it doesn't work. I am trying to figure out why now. Were you successful?

ADD REPLY
0
Entering edit mode

NCBI no longer uses gi numbers.

Latest version of blast+ (v.2.6.0) has the new aliastool that has these options (I have not tried them).

   Application to create BLAST database aliases, version 2.6.0+

   This application has three modes of operation:

   1) GI file conversion:
      Converts a text file containing GIs (one per line) to a more efficient
      binary format. This can be provided as an argument to the -gilist option
      of the BLAST search command line binaries or to the -gilist option of
      this program to create an alias file for a BLAST database (see below).

   2) Alias file creation (restricting with GI List or Sequence ID List):
      Creates an alias for a BLAST database and a GI or ID list which
   restricts
      this database. This is useful if one often searches a subset of a
   database
      (e.g., based on organism or a curated list). The alias file makes the
      search appear as if one were searching a regular BLAST database rather
      than the subset of one.
ADD REPLY
0
Entering edit mode
7.7 years ago
5heikki 11k

Here's a really impractical solution:

#!/bin/bash
mkdir all_prot
cd all_prot
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
awk 'BEGIN{FS="\t"}{print $20}' assembly_summary.txt \
    | awk 'BEGIN{OFS=FS="/"}{print "wget "$0,$NF"_protein.faa.gz"}' \
    | sh
gunzip *.gz
find . -maxdepth 1 -type f -name "*.faa" \
    | xargs cat > all.faa
cd-hit -i all.faa -o all.faa.NR100 -c 1.00 -M 80000 -T 16

And then you make the db from all.faa.NR100. I'm not sure if 80 GB is actually enough RAM for the clustering part though. Also, you should probably curate cluster representative headers. Overall a really bad solution. A few of the downloads will probably fail..

ADD COMMENT
0
Entering edit mode

I think I got it to work with making an alias database. The problem was downloading all the gi numbers for all bacteria. It is a 3 gb file. I wrote a biopython script (see below) that does it in 10,000 gi chunks. Then I used cat to put them all together. Then I use that gilist to make the alias database. It works. Also, the new blast does make the gilist in binary. I have not tried using that but that might be a good option.

Entrez.email = "email@here"
start=0 # last  #double check as I think there is a cutoff at 9938 to 10966
stop= 318437911
skip=10000
for i in range(start,stop,skip):
    last=i
    if i%1000000==0:
        print i
    filename='downloadArch/python_archaea'+str(i)+'.gis'
    try:
        handle = Entrez.esearch(db="protein", retmax=10000, retstart=i,term="Archaea[organism]")
        record = Entrez.read(handle)
    except:
        handle = Entrez.esearch(db="protein", retmax=10000, retstart=i,term="Archaea[organism]")
        record = Entrez.read(handle)
        print 'except',i


    handle.close()
    gis=pd.DataFrame(record['IdList'][:])
    gis.to_csv(filename,sep=' ', index=False, header=False)
ADD REPLY
0
Entering edit mode

Curious as to how you managed to get GI numbers from NCBI (can you post a few examples). Even though they are still in internal use they are not available in any public facing services from NCBI AFAIK.

ADD REPLY
0
Entering edit mode

I think I am still getting the GI numbers alright. My script got them and when I follow the directions to get them from the website it also works. I used this page as a starter Vertebrate Subset Nr Database? Build My Own? Unless I am not getting all of them....

ADD REPLY
0
Entering edit mode

I am wondering if what you are getting back are GI numbers or some other identifier. If you can post a few examples we can check.

ADD REPLY
0
Entering edit mode

751382397 751382396 751382395 751382394 751382393 751382392 751382391 751382390 751382389 751382388 Here are some random ones from one of the files. 751382387 751382386 751382385 751382384 751382383 751382382 751382381 751382380

ADD REPLY

Login before adding your answer.

Traffic: 2657 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6