How to best get ALL Bacterial proteins from NCBI
5.8 years ago

Hey all,

I already have a head start on this question (following this tutorial). However, that method is taking a _really_ long time, since I have a list of ~0.5 billion sequences to get. Additionally, some of my threads are throwing errors during sequence filtering, and I'm afraid this method might not work.

So! I'm asking whether you have a better idea for getting every bacterial protein sequence from NCBI. I don't think EDirect will work (I'd be blocked). One idea I had was to run esearch and efetch against a local copy of all protein records (nr.fa). However, EDirect doesn't support local queries out of the box (at least to my knowledge).

Any advice on how to wrangle EDirect into doing local queries, or any other ideas, would be much appreciated.

protein big data • 2.1k views
ADD COMMENT
0
Entering edit mode

You can also download the .faa.gz files for every bacterium in RefSeq; check another tutorial.

ADD REPLY
0
Entering edit mode

how to get every bacterial protein sequence from NCBI

That requirement, if absolute, will not be satisfied by these two things.

ADD REPLY
0
Entering edit mode

Yes, I know. I guess the proteins of bacteria in RefSeq are enough for his/her purpose, though we don't yet know what he/she will use the data for.

Anyway, one can try

# download
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt

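# reformat: drop the leading comment line, strip the "#" (with or without
# a following space) from the header line, and replace double quotes with "$"
cat assembly_summary.txt | sed 1d | sed '1s/^# *//' \
    | sed 's/"/$/g' > assembly_summary.tsv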

# where to download
dir=download
mkdir -p $dir

cat assembly_summary.tsv \
    | csvtk cut -t -f ftp_path | sed 1d \
    | rush -v prefix='{}/{%}' -v dir=$dir \
        ' \
            wget -c {prefix}_protein.faa.gz -O {dir}/{%}_protein.faa.gz \
        ' \
        -j 10 -c -C download.rush
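
If csvtk and rush are not installed, the same list of URLs can probably be built with awk alone. This is only a minimal sketch, not the method from the tutorial: the function name protein_urls is made up for illustration, and it assumes the summary file has a header line starting with "#" that names an ftp_path column, with "na" marking missing paths.

```shell
# Read an assembly summary on stdin and print one *_protein.faa.gz URL per row.
# protein_urls is a hypothetical helper name, not part of any NCBI tool.
protein_urls() {
    awk -F '\t' '
        # Comment/header lines start with "#"; find the ftp_path column there.
        /^#/ {
            for (i = 1; i <= NF; i++) if ($i == "ftp_path") col = i
            next
        }
        # Skip rows without a usable path, then build <ftp_path>/<basename>_protein.faa.gz.
        col && $col != "" && $col != "na" {
            n = split($col, parts, "/")
            print $col "/" parts[n] "_protein.faa.gz"
        }
    '
}
```

Something like `protein_urls < assembly_summary.txt | xargs -n 1 -P 10 wget -c -P download` should then fetch the files ten at a time, assuming an xargs that supports -P.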

ADD REPLY
0
Entering edit mode

"all protein" sequences is a moving target, anyway...

ADD REPLY
2
Entering edit mode
5.8 years ago
GenoMax 147k

You could download the nr BLAST indexes and then use blastdbcmd from the BLAST+ (v. 2.8.1) package to do something like this:

 blastdbcmd -db /path_to/nr_v5 -taxids 2 -outfmt %f -out file.fa

This may not be completely foolproof but should mostly work.

Note: You will need to get the new v5 BLAST indexes for this to work.
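
One possible reason this is not foolproof: as far as I know, BLAST+ releases from around that time match -taxids against the taxid recorded for each sequence and do not automatically include everything below a high-level node such as 2 (Bacteria). BLAST+ ships a get_species_taxids.sh script (it needs EDirect) that expands a higher-level taxid into species-level ones, and blastdbcmd accepts such a file through -taxidlist. A sketch under those assumptions; the function name and the file names are hypothetical:

```shell
# Extract FASTA records for every taxid listed in a file (one taxid per line).
# Assumes blastdbcmd from BLAST+ is on PATH and the database is a v5 index.
# extract_by_taxidlist is a made-up wrapper name for illustration.
extract_by_taxidlist() {
    # $1 = database path, $2 = taxid list file, $3 = output FASTA
    blastdbcmd -db "$1" -taxidlist "$2" -outfmt %f -out "$3"
}
```

The list could be produced with `get_species_taxids.sh -t 2 > bacteria.txids` and then used as `extract_by_taxidlist /path_to/nr_v5 bacteria.txids file.fa`. It is worth checking how your BLAST+ version handles this before relying on it.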

ADD COMMENT
0
Entering edit mode

I may try this. I am looking for the most sequences possible right now, not just RefSeq.

ADD REPLY
0
Entering edit mode

Just occurred to me to ask: What would be the difference between the blast index filtered for bacteria and all of the RefSeq bacterial protein faa files?

ADD REPLY
1
Entering edit mode

The BLAST index will have data for all bacteria, whereas RefSeq will likely be restricted to well-characterized, manually curated datasets.

ADD REPLY
2
Entering edit mode
5.8 years ago
Carambakaracho ★ 3.3k

From blast/db/README

  1. Contents of the /blast/db/FASTA directory

    [...]

    nr.gz* | non-redundant protein sequence database with entries from GenPept, Swissprot, PIR, PDF, PDB, and RefSeq

From README.genbank

Protein sequences

The protein sequences present in GenBank releases, via coding regions annotated on GenBank records, are made available via files located elsewhere at the NCBI FTP site:

These files replace the single, comprehensive protein FASTA which used to be provided in this directory ( relNNN.fsa_aa.gz ).

Please see the README in the /protein_fasta directory for further information.

This is what it points to: ftp://ftp.ncbi.nih.gov/ncbi-asn1/protein_fasta/ and its README

Is this what you're looking for?

ADD COMMENT
0
Entering edit mode

The gbbct* files in this directory would work, but there is going to be a lot of redundancy. It may still be worth using the nr database to avoid this issue, but that is something the original poster will have to decide.

ADD REPLY
0
Entering edit mode

This may be a good backup to using the nr_v5 database.

ADD REPLY
0
Entering edit mode

I didn't believe it wasn't there anymore:

ftp://ftp.ncbi.nih.gov/blast/db/FASTA/

specifically the nr.gz file (links to 45GB file). Still requires a filter on the bacterial entries, though...
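
For the filtering step, one option is to stream the FASTA through awk and keep only records whose accession appears in a list. This is a rough sketch: filter_fasta_by_ids is a made-up name, and the list of bacterial accessions is assumed to exist already (it could perhaps be derived from NCBI's accession-to-taxid mapping files, which is not shown here). It also only looks at the first accession on each header line, so merged nr headers carrying several accessions would need extra handling.

```shell
# Keep FASTA records (read from stdin) whose first header token, minus ">",
# is listed in the file given as $1 (one accession per line).
filter_fasta_by_ids() {
    awk -v ids="$1" '
        # Load the wanted accessions into a lookup table.
        BEGIN { while ((getline line < ids) > 0) keep[line] = 1 }
        # On each header, decide whether to print this record.
        /^>/ { printing = (substr($1, 2) in keep) }
        # Print header and sequence lines while the current record is wanted.
        printing
    '
}
```

Usage would be along the lines of `gzip -dc nr.gz | filter_fasta_by_ids bacterial_accessions.txt > bacteria.faa`. With hundreds of millions of accessions the lookup table will need a lot of memory, so this is more of a starting point than a finished solution.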

ADD REPLY
