Hello everyone. I want to extract all animal sequences from the NR database. Can anyone suggest an easy method for doing this?
The BBMap package has a tool called "filterbytaxa" which will accomplish this. Unfortunately, NCBI does not label sequences with their taxID, so the taxID has to be looked up from separate accession (or gi) tables, which makes everything a little more difficult.
Following Manu's suggestion for using Metazoa, the usage would be like this:
filterbytaxa.sh in=nr.faa out=metazoa.faa ids=33208 include=t tree=tree.taxtree.gz gi=gitable.int1d.gz accession=prot.accession2taxid.gz,pdb.accession2taxid.gz,dead_prot.accession2taxid.gz
But you first need to get the accession files, taxonomic tree, and potentially gi tables like this:
wget 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/*.gz'
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz
unzip taxdmp.zip
taxtree.sh names.dmp nodes.dmp tree.taxtree.gz -Xmx16g
gitable.sh gi_taxid_nucl.dmp.gz,gi_taxid_prot.dmp.gz gitable.int1d.gz -Xmx16g
filterbytaxa.sh, taxtree.sh, and gitable.sh are part of the BBMap package. wget and unzip are included in most Linux distributions. It's easiest if you add the BBMap shell scripts to your PATH before running this. If you have the most recent copy of nr, you shouldn't need the gi numbers.
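Since the filtering step depends on several tools and downloaded files, it may help to check that everything is in place before starting a long run. The following is only a rough sketch: missing_commands and missing_files are hypothetical helper names I made up for illustration (they are not part of BBMap), and the file names in the usage comment are simply the ones from the commands above.

```shell
# Hypothetical pre-flight helpers; not part of BBMap.

# Print every command from the argument list that is not found on the PATH.
missing_commands() {
    for mc_cmd in "$@"; do
        command -v "$mc_cmd" >/dev/null 2>&1 || printf '%s\n' "$mc_cmd"
    done
}

# Print every file from the argument list that is missing or empty.
missing_files() {
    for mf_file in "$@"; do
        [ -s "$mf_file" ] || printf '%s\n' "$mf_file"
    done
}

# Example usage (file names taken from the commands above):
#   missing_commands filterbytaxa.sh taxtree.sh gitable.sh wget unzip
#   missing_files nr.faa tree.taxtree.gz prot.accession2taxid.gz
```

If either call prints anything, that item still needs to be installed, downloaded, or rebuilt before running filterbytaxa.sh.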
"Animals" is a rather vague definition. You may want to narrow it to, say, "vertebrates". Although the specific commands have changed somewhat with the BLAST+ package, this post should give you an idea of how to go about it: Vertebrate Subset Nr Database? Build My Own?
Are you eventually looking to build a blast database or just need the sequence data?
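Either way, here is a minimal sketch of what the next step might look like once you have a filtered FASTA file. It assumes the file is called metazoa.faa (as in the filterbytaxa.sh example in this thread) and that BLAST+ is installed if you want a database. count_seqs and build_blast_db are hypothetical helper names, not existing tools.

```shell
# Hypothetical helpers; the names are made up for illustration.

# Count FASTA records by counting header lines (lines starting with ">").
count_seqs() {
    awk '/^>/ { n++ } END { print n + 0 }' "$1"
}

# Build a protein BLAST database from a FASTA file.
# Assumes makeblastdb from BLAST+ is on the PATH; returns 1 if it is not.
# $1 = input FASTA, $2 = database name/prefix
build_blast_db() {
    if ! command -v makeblastdb >/dev/null 2>&1; then
        echo "makeblastdb not found on PATH" >&2
        return 1
    fi
    makeblastdb -in "$1" -dbtype prot -out "$2"
}

# Example usage:
#   count_seqs metazoa.faa
#   build_blast_db metazoa.faa metazoa
```

Comparing the count for the filtered file against the count for nr.faa is a quick sanity check that the taxonomic filter actually removed something.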
Metazoa (Taxonomy ID: 33208) could be a good candidate for "animals".