Hi, is there a quick way to download bacterial and archaeal genomes from NCBI using a list of taxids? (I got them from the GOLD database.) Mainly complete genomes, scaffolds and contigs.
Thanks,
So here is a shell script I have used that you could adapt for your needs. In its current form it downloads the genome sequence and protein sequences, makes a BLAST database out of them, then joins all BLAST databases in the directory into one alias db called "SelectedArthropods". It works, mostly, and when it breaks it breaks hard; there is no real error checking :) You need
wget in your path, along with the tools the script calls: esearch, efetch and xtract (NCBI EDirect), and makeblastdb and blastdb_aliastool (BLAST+).
usage:
fetchAllGenomesByTaxon.sh "Alcanivorax borkumensis"
The genome is downloaded to the current directory. Copies of the esearch results are saved as
Alcanivorax borkumensis.assembly.esearch.docsum,
Alcanivorax borkumensis.genome.esearch.docsum
in the current directory for your reference.
#!/bin/bash
# bash is required: the script uses arrays, which plain sh does not have
set -u # exit on unbound variable
#TAXLIST=("Daphnia pulex" "Drosophila melanogaster" "Anopheles gambiae" "Pediculus humanus"
#"Ixodes scapularis" "Apis mellifera" "Bombyx mori")
#TAXLIST=("Strigamia maritima")
TAXLIST=("$@") # provide multiple taxa on the cmd-line, one quoted name per argument
for TAX in "${TAXLIST[@]}" ; do
echo "getting genome for: $TAX"
# look up the genome record and keep a copy of the docsum
GENOME=$(esearch -db genome -query "${TAX}[orgn]" |
efetch -format docsum | tee "${TAX}.genome.esearch.docsum")
ACC=$(echo "$GENOME" | xtract -pattern DocumentSummary -element Assembly_Accession)
NAME=$(echo "$GENOME" | xtract -pattern DocumentSummary -element Assembly_Name)
echo "authoritative genome: $ACC $NAME"
# look up the assembly record for that accession
RESULT=$(esearch -db assembly -query "$ACC" |
efetch -format docsum | tee "${TAX}.assembly.esearch.docsum")
FTPP=$(echo "$RESULT" | xtract -pattern DocumentSummary -element FtpPath_GenBank)
TAXID=$(echo "$RESULT" | xtract -pattern DocumentSummary -element Taxid)
echo "FtpPath: $FTPP"
BASENAME=$(basename "$FTPP")
FTPPATHG="$FTPP/${BASENAME}_genomic.fna.gz"
FTPPATHP="$FTPP/${BASENAME}_protein.faa.gz"
echo "Downloading $FTPPATHG ..."
## get genome data
wget "$FTPPATHG"
BASENAME=$(basename "$FTPPATHG")
gunzip -f "$BASENAME"
BASENAME=${BASENAME%.gz} # strip only the trailing .gz
makeblastdb -in "$BASENAME" -dbtype nucl -parse_seqids -taxid "$TAXID" -title "$TAX $NAME genomic"
echo "Downloading $FTPPATHP ..."
## get protein data
wget "$FTPPATHP" # this may throw an error if there is no proteome file
BASENAME=$(basename "$FTPPATHP")
gunzip -f "$BASENAME"
BASENAME=${BASENAME%.gz} # strip only the trailing .gz
makeblastdb -in "$BASENAME" -dbtype prot -parse_seqids -taxid "$TAXID" -title "$TAX $NAME proteins"
done
### Make an alias blast db of all fasta files in the directory
ls *.fna > dblist_nuc.txt
blastdb_aliastool -dblist_file dblist_nuc.txt -out SelectedArthropods -title 'selection of arth genomes' -dbtype nucl
ls *.faa > dblist_prot.txt
blastdb_aliastool -dblist_file dblist_prot.txt -out SelectedArthropods -title 'selection of arth proteins' -dbtype prot
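Since the script has no error checking and depends on several external programs, one possible safeguard is to confirm up front that every tool is on the PATH. This is only a minimal sketch: the function name require_tools is made up for illustration, and the tool list in the usage comment is inferred from the commands the script calls.

```shell
#!/bin/bash
# Sketch: report every required program that is missing from PATH.
# require_tools is a hypothetical helper, not part of the original script.
require_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing required tool: $tool" >&2
            missing=1
        fi
    done
    # returns 0 when everything was found, 1 otherwise
    return "$missing"
}

# Possible use at the top of the script:
# require_tools wget esearch efetch xtract makeblastdb blastdb_aliastool || exit 1
```

Checking all tools before the loop means a missing program is reported once, instead of failing halfway through a download.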
Get this file. Look up the taxid you need. The last column of the file has the directory with the FTP location of the genome assembly. Get the .fna file from there.
That's cool~ thank you
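The lookup described in that answer (find each taxid in the summary file, then take the assembly directory) could be scripted roughly as follows. This is a sketch under assumptions: the thread does not preserve which file was linked, so the file is assumed to be tab-separated with comment lines starting with '#', and the taxid and directory column numbers are passed in rather than hard-coded. The function name urls_for_taxids is made up; the _genomic.fna.gz naming follows the script above.

```shell
#!/bin/bash
# Sketch: print the genomic FASTA URL for every assembly whose taxid is listed.
# Arguments: taxid list (one per line), summary file, taxid column, directory column.
# Column numbers are assumptions; check them against the header of your file.
urls_for_taxids() {
    local taxid_file=$1 summary_file=$2 taxid_col=$3 ftp_col=$4
    awk -F '\t' -v tc="$taxid_col" -v fc="$ftp_col" '
        FILENAME == ARGV[1] { if ($1 != "") wanted[$1] = 1; next }  # load taxid list
        /^#/ { next }                                               # skip comment lines
        ($tc in wanted) && $fc != "" && $fc != "na" {
            n = split($fc, parts, "/")      # last path element is the assembly name
            print $fc "/" parts[n] "_genomic.fna.gz"
        }
    ' "$taxid_file" "$summary_file"
}

# Possible use (file names and column numbers are placeholders):
# urls_for_taxids taxids.txt summary.txt 6 20 | wget -i -
```

Printing URLs instead of downloading directly makes it easy to inspect the list before handing it to wget.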
Hi, I have noticed that the viral database assembly file only contains 3 viruses???
Viral genomes are here.
RefSeq viral genomes are in this file.
Dear Michael, thank you for providing a script for this task, I really appreciate it! I am trying to run your script, but I am getting the following error message:
I am running this on a computer cluster. I was wondering if you could help me figure out what is going wrong. Thank you in advance!
For future reference, please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. Your response should have gone directly under @Michael's answer.
As for your question, do you have NCBI eUtils installed and in your path? What do you get when you issue this command?
which xtract
Thank you, genomax! I am new to Biostars and I appreciate your comment. I was indeed not sure how to get my message posted in the right place.
Regarding my question, I am using NCBI eUtils, but I wasn't able to add the utilities to my path (.bashrc). So I just added the eUtils folder path in every line where Michael's script called for them, like in this example:
I also added #!/bin/bash to the script. I think the issue is that the script is not finding the FtpPath variable, because perhaps it is missing in the records that are extracted at this step:
Can you run the esearch command with an example accession number from above to see if you get anything back?
Hi genomax! I ran esearch with the first accession number from above and I got the following result:
I think the problem is that your query
is returning multiple genome accession numbers, whereas @Michael's script is expecting only one.
The search works fine with one accession.
Also, you can do
export PATH=$PATH:/path_to/e-utils/edirect
to get the eUtils in your $PATH (replace path_to with the real path).
I figured out my mistake: I am using the taxID of an order of insects (Lepidoptera = 7088), and therefore the script can't find a unique FTP address for all the different genomes. I am going to make a list of taxon IDs for the specific genomes that I need and use it as input for the script. Thank you, genomax, for all the help!
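An alternative to narrowing the query is to let the download step accept several assemblies at once. The sketch below assumes xtract prints one FtpPath_GenBank value per line when a query matches multiple records, and builds one genomic URL per path using the same naming as the script. The function name genomic_urls is made up for illustration.

```shell
#!/bin/bash
# Sketch: read assembly directories on stdin (one per line) and print
# the matching *_genomic.fna.gz URL for each. Blank lines are skipped.
genomic_urls() {
    local path
    while IFS= read -r path; do
        [ -n "$path" ] || continue
        printf '%s/%s_genomic.fna.gz\n' "$path" "$(basename "$path")"
    done
}

# Possible use inside the loop, replacing the single-path download:
# echo "$RESULT" | xtract -pattern DocumentSummary -element FtpPath_GenBank |
#     genomic_urls | wget -i -
```

A query at the order level can match many assemblies, so it is worth printing and reviewing the URL list before starting the downloads.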