Hi,
I would like to make a custom database (nr) for blast local running but containing only arthropods sequences.
Any way to do this ?
Thanks a lot.
Hi,
I would like to make a custom database (nr) for blast local running but containing only arthropods sequences.
Any way to do this ?
Thanks a lot.
Data:
Hardware in this tutorial
Tools:
Steps:
Listing all taxids below $id
using taxonkit.
id=6656
# 6656 is the phylum Arthropoda
# echo 6656 | taxonkit lineage | taxonkit reformat
# 6656 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Protostomia;Ecdysozoa;Panarthropoda;Arthropoda Eukaryota;Arthropoda;;;;;
# time: 3s
taxonkit list --ids $id --indent "" > $id.taxid.txt
wc -l $id.taxid.txt
# 518373 6656.taxid.txt
Retrieving target accessions. There are two options:
From prot.accession2taxid.gz (faster, recommended). Note that some accessions are not in nr
.
# time: 4min
pigz -dc prot.accession2taxid.gz \
| csvtk grep -t -f taxid -P $id.taxid.txt \
| csvtk cut -t -f accession.version \
| sed 1d \
> $id.acc.txt
wc -l $id.acc.txt
# 8174609 6656.acc.txt
From pre-formated nr
blastdb
# time: 40min
blastdbcmd -db nr -entry all -outfmt "%a %T" | pigz -c > nr.acc2taxid.txt.gz
pigz -dc nr.acc2taxid.txt.gz | wc -l
# 555220892
# time: 3min
pigz -dc nr.acc2taxid.txt.gz \
| csvtk grep -d ' ' -D ' ' -f 2 -P $id.taxid.txt \
| cut -d ' ' -f 1 \
> $id.acc.txt
wc -l $id.acc.txt
# 6928021 6656.acc.txt
Retrieving FASTA sequences from pre-formated blastdb. There are two options:
From nr.fa
exported from pre-formated blastdb (faster, smaller output file, recommended).
DO NOT directly download nr.gz
from ncbi ftp,
in which the FASTA headers are not well formated.
# time: 117min
blastdbcmd -db nr -dbtype prot -entry all -outfmt "%f" -out - | pigz -c > nr.fa.gz
# time: 80min
# perl one-liner is used to unfold records having mulitple accessions
cat <(echo) <(pigz -dc nr.fa.gz) \
| perl -e 'BEGIN{ $/ = "\n>"; <>; } while(<>){s/>$//; $i = index $_, "\n"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ />/) { print ">$_"; next; }; $h = ">$h"; while($h =~ />([^ ]+ .+?) ?(?=>|$)/g){ $h1 = $1; $h1 =~ s/^\W+//; print ">$h1\n$s";} } ' \
| seqkit grep --delete-matched -f $id.acc.txt -o nr.$id.fa.gz
# counting sequences
#
# ls -lh nr.$id.fa.gz
# -rw-r--r-- 1 shenwei shenwei 902M 9月 13 01:42 nr.6656.fa.gz
#
pigz -dc nr.$id.fa.gz | grep '^>' -c
# 6928017
# Here 6928017 ~= 6928021 ($id.acc.txt)
Directly from pre-formated blastdb
# time: 5h20min
blastdbcmd -db nr -entry_batch $id.acc.txt -out - | pigz -c > nr.$id.fa.gz
# counting sequences
#
# Note that the headers of outputed fasta by blastdbcmd are "folded"
# for accessions from different species with same sequences, so the
# number may be small than $(wc -l $id.acc.txt).
pigz -dc nr.$id.fa.gz | grep '^>' -c
# 1577383
# counting accessions
#
# ls -lh nr.$id.fa.gz
# -rw-r--r-- 1 shenwei shenwei 2.1G 9月 13 03:38 nr.6656.fa.gz
#
pigz -dc nr.$id.fa.gz | grep '^>' | sed 's/>/\n>/g' | grep '^>' -c
# 288415413
makeblastdb
pigz -dc nr.$id.fa.gz > nr.$id.fa
# time: 3min ($nr.$id.fa from step 3 option 1)
#
# building $nr.$id.fa from step 3 option 2 with -parse_seqids would produce error:
#
# BLAST Database creation error: Error: Duplicate seq_ids are found: SP|P29868.1
#
makeblastdb -parse_seqids -in nr.$id.fa -dbtype prot -out nr.$id
# rm nr.$id.fa
test: blastp
# blastdb from step 3 option 2
#
time blastp -num_threads 16 -db nr.$id -query t4.fa > t4.fa.blast
# real 0m20.866s
# $ cat t4.fa.blast | grep Query= -A 10
# Query= A0A0J9X1W9.2 RecName: Full=Mu-theraphotoxin-Hd1a; Short=Mu-TRTX-Hd1a
#
# Length=35
Score E
# Sequences producing significant alignments: (Bits) Value
# 2MPQ_A Chain A, Solution structure of the sodium channel toxin Hd1a 72.4 2e-17
# A0A0J9X1W9.2 RecName: Full=Mu-theraphotoxin-Hd1a; Short=Mu-TRTX-... 72.4 2e-17
# ADB56726.1 HNTX-IV.2 precursor [Haplopelma hainanum] 66.6 9e-15
# D2Y233.1 RecName: Full=Mu-theraphotoxin-Hhn1b 2; Short=Mu-TRTX-H... 66.6 9e-15
# ADB56830.1 HNTX-IV.3 precursor [Haplopelma hainanum] 66.6 9e-15
Do not follow this answer.
Here's the whole commands (replacing pigz
with gzip
if pigz
not installed).
id=6656
# Step 1) list all taxid below $id
taxonkit list --ids $id --indent "" > $id.taxid.txt
# Step 2) get all accessions
# 3 min for me
pigz -d -c prot.accession2taxid.gz \
| csvtk -t grep -f taxid -P $id.taxid.txt \
| csvtk -t cut -f accession.version \
> $id.taxid.acc.txt
# Step 3) Option A (very slow)
blastdbcmd -db nr -entry all -outfmt "%a\t%T" | \
csvtk -t grep -f 2 -P $id.taxid.acc.txt | \
csvtk -t cut -f 1 | \
blastdbcmd -db nr -entry_batch - -out nr.$id.fa
# Step 3) Option B (relative fast but buggy)
# reformat nr.gz and retrieve sequences (buggy)
# 16 min for me
cat <(echo) <(pigz -d -c nr.gz) \
| perl -e 'BEGIN{ $/ = "\n>"; <>; } while(<>){s/>$//; $i = index $_, "\n"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ /\[.+\]/) { print ">$_"; next; } while($h =~ /(.+? \[.+?\])/g){ $h1 = $1; $h1 =~ s/^\W+//; print ">$h1\n$s";} } '\
| seqkit grep --delete-matched -f $id.taxid.acc.txt -o nr.$id.fa.gz
# checking sequences amount (with Option B)
seqkit stats nr.$id.fa.gz
# file format type num_seqs sum_len min_len avg_len max_len
# nr.6656.fa.gz FASTA Protein 6,913,200 2,860,805,704 6 413.8 28,516
wc -l $id.taxid.acc.txt
# 8174610 $id.taxid.acc.txt
# Step 4) makeblastdb
gzip -d -c nr.$id.fa.gz > nr.$id.fa
makeblastdb -in nr.$id.fa -dbtype prot -out nr.$id
Well, as for 6,913,200 < 8,174,610
:
Since many genes in different species have same sequences, they are recorded as one record in
nr.gz
.
The intuitional way is matching FASTA record heads with all target accesions using regular expression matching, however, it's very slow for such a big dataset:
$ pigz -d -c nr.gz | seqkit stats | tee stats.txt
file format type num_seqs sum_len min_len avg_len max_len
- FASTA Protein 169,199,080 61,790,996,688 6 365.2 74,488
So a better solution is "unfolding" these FASTA records.
But the FASTA headers in nr.gz
are not well formated, it's hard to rightly unfold
all records shared by muliple species. Here are some cases:
Normal. accession product [species]accession2 product2 [species2]
(easy to solve)
>XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]EAL68957.1 hypothetical protein DDB_G0276911 [Dictyosteli
um discoideum AX4]
Mxxxxxxxx
Some FASTA headers contain 0x01
(shown as ^
below) in front of accession (easy to solve)
$ pigz -d -c nr.gz | grep '^>' | grep CAH60339.1 | cat -A
>CAH60339.1 histone h3, partial [Copelatus sp. 028TA588]^AABG79397.1 histone H3, partial [Lumbrineris magnidentata]^AABG79414.1 histone H3, partial [Polygordius lacteus]^AABX89916.1 histone 3, partial [Akalyptoischion atrichos]
xxxx
>^accesion product
xxxxxx
Mix 1. Some headers contains species information in []
, but thelast one does not. (there are ways)
>xxxxx [xxxx]aaaa bbbbbb
Mix 2. Where B3N7H1.1
does not end with ]
, so the last character 4
and the later accession A1Z7U3.1
lay together -_-! (HARD to solve)
>NP_001027396.1 uncharacterized protein Dmel_CG33774 [Drosophila melanogaster]XP_001970317.1 uncharacterized protein Dere_GG10558 [Drosophila erecta]B3N7H1.1 RecName: Full=Dolichyl-diphosphooligosaccharide--protein glycosyltransferase subunit 4A1Z7U3.1 RecName: Full=Dolichyl-diphosphooligosaccharide--protein glycosyltransferase subunit 4AAZ52820.1 uncharacterized protein Dmel_CG33774 [Drosophila melanogaster]EDV59376.1 uncharacterized protein Dere_GG10558 [Drosophila erecta]
Thanks I will try.
1) For the option A: How do you create your database (nr) and how much time is very slow ?
Because I got this error:
[ERRO] field (2) out of range (1) in file: -
2) I am not sure to understand the option B, because you mentioned "buggy", but the makeblastdb is working ?
Get the pre-made blast indexes for nr
directly from NCBI.
May want to try and see if the solution I posted here works for nr
.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
I think you should be able to modify the taxonomy ID in tutorials provided in the following post to suit your requirements.
Vertebrate Subset Nr Database? Build My Own?
Thanks for you answer !
But this topic is 8 years old and seems like those gi methods don't work anymore (message from genomax)
Yes, you're right, I completely forgot about that. I guess you could either use
-seqidlist
which is equivalent of-gilist
in the current version of BLAST+ 2.7 or you could use the alpha release of the newer version of BLAST+ that allows taxonomy ID based filter: https://ftp.ncbi.nlm.nih.gov/blast/db/v5/blastdbv5.pdfExtract all protein sequences of specific taxons from the NCBI nr database, and
makeblastdb
.I have used your tutorial but it seems that the fasta at the end (made with
seqkit
) is not complete:I have downloaded the last
prot.accession2taxid.gz
andnr.gz
.do you have a way to fix ?
what's the taxid are you using?
Using Arthropods: 6656
change
to
Where the Perl script is for unfolding the records with same sequences in different species
to
done.
But I still got low number of sequence compared to the accession number:
is it normal ?
Possibly because shenwei356 is discarding records with same sequence.
Ok.
Anyway. I have tried to make a database with this new fasta, with this command:
and I got this error:
by looking at this line, I have something indeed a bit weird:
I'll check it after a full download of nr.gz. And have to go to work now.
yes so apparently, some sequences don't start with '>'
It's the bug of the perl one-liner, because of the not well formated
nr.gz
.I gave up after few hours. May be we should blast on whole
nr
and filter the result.