Hi Biostars,
I am currently trying to build a Eukaryote version of the NCBI NR database and I am not really sure that I fully understand how the NR is implemented.
Here is the code that I'm using to do so :
#!/usr/bin/bash
##############
# DOWNLOAD FULL NR
##############
baseURL="https://ftp.ncbi.nlm.nih.gov/blast/db/"
for i in $(seq -f "%02g" 0 82); do
echo "$baseURL/nr.$i.tar.gz"
done > download_list.txt
aria2c -i download_list.txt --max-concurrent-downloads=10
# Not shown here but I also download and check the md5 files
for file in nr.*.tar.gz; do
tar -I pigz -xf "$file"
done
###########################
# FETCH ALL SEQID FOR EUKARYOTE SPECIES
###########################
# From https://bioinf.shenwei.me/taxonkit/tutorial/#making-nr-blastdb-for-specific-taxids
id=2759 # Eukaryotes taxid
taxonkit list --ids $id --indent "" > $id.taxid.txt
# prot.accession2taxid.gz is an NCBI file, name pretty straightforward
pigz -dc prot.accession2taxid.gz \
| csvtk grep -t -f taxid -P $id.taxid.txt \
| csvtk cut -t -f accession.version,taxid \
| sed 1d \
> $id.acc2taxid.txt
cut -f 1 $id.acc2taxid.txt > $id.acc.txt
#######################
# EXTRACT THE EUKARYOTE PROTEINS
#######################
blastdbcmd -db NR/nr -entry_batch $id.acc.txt -out NR_EUK/nr_euk.faa -dbtype prot -logfile extract.log
My problem is as following, for some fasta sequences I have this kind of header :
>A0A067SLB9.1 RecName: Full=Alpha-amanitin proprotein 1; Contains: RecName: Full=Alpha-amanitin; AltName: Full=Amatoxin; AltName: Full=Gamma-amanitin; Flags: Precursor [Galerina marginata CBS 339.88] >5N4C_E Chain E, Alpha-amanitin proprotein [Galerina marginata] >5N4C_F Chain F, Alpha-amanitin proprotein [Galerina marginata] >5N4C_G Chain G, Alpha-amanitin proprotein [Galerina marginata] >5N4C_H Chain H, Alpha-amanitin proprotein [Galerina marginata] >5N4E_C Chain C, Alpha-amanitin proprotein [Galerina marginata] >5N4E_D Chain D, Alpha-amanitin proprotein [Galerina marginata] >AYM46699.1 alpha-amanitin [Galerina sulciceps] >AEX26935.1 alpha-amanitin proprotein [Galerina marginata] >AYM46698.1 alpha-amanitin [Galerina marginata] >KDR68474.1 hypothetical protein GALMADRAFT_1387421 [Galerina marginata CBS 339.88] >QKM76215.1 alpha-amanitin [Galerina marginata]
MFDTNATRLPIWGIGCNPWTAEHVDQTLASGNDIC
As I understand it, the NIH " compress " sequences that look exactly alike into one representative sequence ? The thing is I would like every species that have this protein ( or other proteins, this one is just an example ) to be represented in the NR since I'm working with a datatation pipeline. One thing that confuses me aswell is that when I run a website BLASTP with the MFDTNATRLPIWGIGCNPWTAEHVDQTLASGNDIC peptide, I find every protein present in the header but as unique entries.
Sorry for the very long post and thank you for the ones of you that will read this until the end !
Best regard, Simon
as an answer to your question on how do they represent multiple entries with a single sequence
The header of the fasta file contains all headers that match identical sequences and these headers are separated with the /r (carriage return) character.
Hence each individual header contains multiple headers for each identical sequence. In your example if you keep scrolling right you will see
it is a weird hack actually, but as most bioinformatics it kind of works
Then you need to follow the recommended option a in the tutorial.
You might also check the new Blast+ version: NCBI BLAST+ 2.2.15 now available for download
You just need to run
Thank you for your quick response ! ( And for all the softwares haha ). I indeed overlooked this comment on the headers format.
Have a great day !