Question

Species coverage in the NCBI protein NR database ?

0

Entering edit mode

18 months ago

HERMAN ▴ 10

Hi Biostars,

I am currently trying to build a Eukaryote version of the NCBI NR database and I am not really sure that I fully understand how the NR is implemented.

Here is the code that I'm using to do so :

#!/usr/bin/bash 

##############
# DOWNLOAD FULL NR
##############

baseURL="https://ftp.ncbi.nlm.nih.gov/blast/db/" 

for i in $(seq -f "%02g" 0 82); do
     echo "$baseURL/nr.$i.tar.gz"
done > download_list.txt       

aria2c -i download_list.txt --max-concurrent-downloads=10

# Not shown here but I also download and check the md5 files

for file in nr.*.tar.gz; do
    tar -I pigz -xf "$file"
done

###########################
# FETCH ALL SEQID FOR EUKARYOTE SPECIES
###########################

# From https://bioinf.shenwei.me/taxonkit/tutorial/#making-nr-blastdb-for-specific-taxids

id=2759 # Eukaryotes taxid

taxonkit list --ids $id --indent "" > $id.taxid.txt

# prot.accession2taxid.gz is an NCBI file, name pretty straightforward

pigz -dc prot.accession2taxid.gz \
    | csvtk grep -t -f taxid -P $id.taxid.txt \
    | csvtk cut -t -f accession.version,taxid \
    | sed 1d \
    > $id.acc2taxid.txt

cut -f 1 $id.acc2taxid.txt > $id.acc.txt

#######################
# EXTRACT THE EUKARYOTE PROTEINS
#######################

blastdbcmd -db NR/nr -entry_batch $id.acc.txt -out NR_EUK/nr_euk.faa -dbtype prot -logfile extract.log

My problem is as following, for some fasta sequences I have this kind of header :

>A0A067SLB9.1 RecName: Full=Alpha-amanitin proprotein 1; Contains: RecName: Full=Alpha-amanitin; AltName: Full=Amatoxin; AltName: Full=Gamma-amanitin; Flags: Precursor [Galerina marginata CBS 339.88] >5N4C_E Chain E, Alpha-amanitin proprotein [Galerina marginata] >5N4C_F Chain F, Alpha-amanitin proprotein [Galerina marginata] >5N4C_G Chain G, Alpha-amanitin proprotein [Galerina marginata] >5N4C_H Chain H, Alpha-amanitin proprotein [Galerina marginata] >5N4E_C Chain C, Alpha-amanitin proprotein [Galerina marginata] >5N4E_D Chain D, Alpha-amanitin proprotein [Galerina marginata] >AYM46699.1 alpha-amanitin [Galerina sulciceps] >AEX26935.1 alpha-amanitin proprotein [Galerina marginata] >AYM46698.1 alpha-amanitin [Galerina marginata] >KDR68474.1 hypothetical protein GALMADRAFT_1387421 [Galerina marginata CBS 339.88] >QKM76215.1 alpha-amanitin [Galerina marginata]
MFDTNATRLPIWGIGCNPWTAEHVDQTLASGNDIC

As I understand it, the NIH " compress " sequences that look exactly alike into one representative sequence ? The thing is I would like every species that have this protein ( or other proteins, this one is just an example ) to be represented in the NR since I'm working with a datatation pipeline. One thing that confuses me aswell is that when I run a website BLASTP with the MFDTNATRLPIWGIGCNPWTAEHVDQTLASGNDIC peptide, I find every protein present in the header but as unique entries.

Sorry for the very long post and thank you for the ones of you that will read this until the end !

Best regard, Simon

NR • 1.3k views

ADD COMMENT • link updated 18 months ago by Istvan Albert 102k • written 18 months ago by HERMAN ▴ 10

1

Entering edit mode

as an answer to your question on how do they represent multiple entries with a single sequence

The header of the fasta file contains all headers that match identical sequences and these headers are separated with the /r (carriage return) character.

Hence each individual header contains multiple headers for each identical sequence. In your example if you keep scrolling right you will see

>A0A067SLB9.1  [...]  >5N4C_E Chain E, Alpha-amanitin [...] >AEX26935.1

it is a weird hack actually, but as most bioinformatics it kind of works

ADD REPLY • link 18 months ago by Istvan Albert 102k

0

Entering edit mode

As I understand it, the NIH " compress " sequences that look exactly alike into one representative sequence ? The thing is I would like every species that have this protein ( or other proteins, this one is just an example ) to be represented in the NR since I'm working with a datatation pipeline.

Then you need to follow the recommended option a in the tutorial.

From nr.fa exported from pre-formated blastdb (faster, smaller output file, recommended). DO NOT directly download nr.gz from ncbi ftp, in which the FASTA headers are not well formated.

ADD REPLY • link 18 months ago by shenwei356 8.7k

0

Entering edit mode

You might also check the new Blast+ version: NCBI BLAST+ 2.2.15 now available for download

New taxonomy limit option

You can now also limit your search to sequences from a group of organisms without first using a custom script to get all species-level Taxonomy IDs (taxids) for the group.

You just need to run

blastp -db nr -taxids 2759 -query ...

ADD REPLY • link 18 months ago by shenwei356 8.7k

0

Entering edit mode

Thank you for your quick response ! ( And for all the softwares haha ). I indeed overlooked this comment on the headers format.

Have a great day !

ADD REPLY • link 18 months ago by HERMAN ▴ 10