Hi,
Could you please suggest how I can get all 82,697 sequences from this website using Linux commands:
http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=1
Thanks
(edited) To download the nucleotide sequences in 4 parts simultaneously:
API_KEY="REPLACE_WITH_YOUR_NCBI_EUTILS_API_KEY"
for p in {1..83}; do
curl -s "http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=${p}" \
| grep "gbnucdata" \
| sed -r "s|.+seqAccno=(.+)\&format=Genbank.+|http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide\&retmode=text\&rettype=fasta\&id=\1\&api_key=${API_KEY}|" \
>> nucl_urls.txt
done
split -d -l 21000 nucl_urls.txt part
ls part0* | parallel -j 4 wget -O {}.fasta -i {}
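A quick sanity check once everything finishes (82,697 is the count shown on the FunGene page):
wc -l nucl_urls.txt               # one URL per sequence: expect 82697
cat part0*.fasta | grep -c '^>'   # one FASTA record per URL: expect 82697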
(original answer) You can download the sequences this way (here, for example, all the protein sequences):
# First get all the accession numbers of protein sequences in all the pages
for p in {1..83}; do
curl -s "http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=${p}" \
| grep "gpprotdata.jsp?seqAccno" \
| sed -r 's|.+>(.+)<\/a><\/td>|\1|' \
>> prot_acc.txt
done
for acc in $(cat prot_acc.txt); do
wget -O ${acc}.fasta "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&retmode=text&rettype=fasta&id=${acc}"
done
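Two small additions may help at this scale (a sketch; the 0.4 s pause is a guess at staying under NCBI's documented limit of about 3 requests/second without an API key, and hmm_id_721_prot.fa is just an illustrative name):
# download each accession, pausing briefly to respect NCBI's rate limit
for acc in $(cat prot_acc.txt); do
wget -O ${acc}.fasta "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&retmode=text&rettype=fasta&id=${acc}"
sleep 0.4   # ~3 requests/second is the limit without an API key
done
# combine the per-accession files into one FASTA afterwards
cat *.fasta > hmm_id_721_prot.fa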
Thanks SMK.
I need to download the nucleotide sequences according to gene ID, so I tried to make some changes, but it is showing an error:
#!/bin/bash
# the accession numbers of protein sequences in all the pages
for p in {1..83}; do
curl -s "http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=${p}" \
| grep "gpnucldata.jsp?seqgi” \
| sed -r 's|.+>(.+)<\/a><\/td>|\1|' \
>> gi_nucl.txt
done
# Then download the nucleotide sequences, for example:
for gi in $(cat gi_nucl.txt); do
i wget -O ${gi}.fasta "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=${gi}"
done
error:
$ bash download.sh
download.sh: line 16: unexpected EOF while looking for matching `"'
download.sh: line 18: syntax error: unexpected end of file
First thing I saw is the line | grep "gpnucldata.jsp?seqgi” \ — try changing the closing character ” to a plain ASCII ".
Also, I'd suggest adding set -euo pipefail on the line after #!/bin/bash, so the script stops immediately when it hits an error.
And you've got a stray i at the line i wget -O?
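Putting those fixes together, the top of the script would look like this (same logic as yours, just with an ASCII closing quote and without the stray i):
#!/bin/bash
set -euo pipefail
# collect the GI numbers of nucleotide sequences from all the pages
for p in {1..83}; do
curl -s "http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=${p}" \
| grep "gpnucldata.jsp?seqgi" \
| sed -r 's|.+>(.+)</a></td>|\1|' \
>> gi_nucl.txt
done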
$ bash download.sh
download.sh: line 13: unexpected EOF while looking for matching `"'
#!/bin/bash
set -euo pipefail
# the accession numbers of nucleotide sequences in all the pages
for p in {1..83}; do
curl -s "http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=${p}" \
| grep "gpnucldata.jsp?seqgi”
| sed -r 's|.+>(.+)<\/a><\/td>|\1|' \
>> gi_nucl.txt
done
# Then download the nucleotide sequences, for example:
for gi in $(cat gi_nucl.txt); do
wget -O ${gi}.fasta "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=${gi}"
done
For nucleotide, you have to consider seq_start and seq_stop. For example, entry AOJK01000067 will be downloaded from http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&seq_start=53467&seq_stop=54312&strand=1&id=AOJK01000067.
Try this one:
#!/bin/bash
set -euo pipefail
for p in {1..83}; do
curl -s "http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=${p}" \
| grep "gbnucdata" \
| sed -r 's|.+seqAccno=(.+)&format=Genbank.+|http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide\&retmode=text\&rettype=fasta\&id=\1|' \
>> nucl_urls.txt
done
wget -O hmm_id_721_nucl.fa -i nucl_urls.txt
Where nucl_urls.txt contains:
$ head nucl_urls.txt
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AM774418&seq_start=280304&seq_stop=281149&strand=1
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AOJK01000067&seq_start=53467&seq_stop=54312&strand=1
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AF016485&seq_start=135862&seq_stop=137037&strand=1
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AOIT01000069&seq_start=1403&seq_stop=2242&strand=1
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=LOPV01000533&seq_start=1144&seq_stop=1992&strand=1
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AOLJ01000027&seq_start=50569&seq_stop=51417&strand=1
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AOLH01000006&seq_start=26854&seq_stop=27702&strand=1
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=BASG01000071&seq_start=6373&seq_stop=7167&strand=1
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AUXP01000135&seq_start=5869&seq_stop=6663&strand=1
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=MDCP01000151&seq_start=2694&seq_stop=3488&strand=1
When you copy the code, make sure the closing double quotation mark is a plain ASCII one. It should work.
Indeed. Bioinfonext, you can check how many sequences have been downloaded with: grep -c '^>' hmm_id_721_nucl.fa
This would work only for the nr database (proteins). For nucleic acids you don't have this option, since you need to consider a start/stop (as shown by @SMK), so that is the only solution.
$ blastdbcmd -db /path_to/nr_v5 -entry AAC82905 -outfmt %f
>O52025.1 RecName: Full=Arsenite methyltransferase [Halobacterium salinarum NRC-1] >AAC82905.1 unknown [Halobacterium salinarum NRC-1]
MELWTHPTPAAPRLATSTRTRWRRTSRCSQPWATTPGTNSSDASRTPTTASASATSKPQSASARARSVRRSPDCTPRAWS
RGARKDRGATTNRPRRPKFCSKRSTTCEATMSNDNETMVADRDPEETREMVRERYAGIATSGQDCCGDVGLDVSGDGGCC
SDETEASGSERLGYDADDVASVADGADLGLGCGNPKAFAAMAPGETVLDLGSGAGFDCFLAAQEVGPDGHVIGVDMTPEM
ISKARENVAKNDAENVEFRLGEIGHLPVADESVNVVISNCVVNLAPEKQRVFDDTYRVLRPGGRVAISDVVQTAPFPDDV
QMDPDSLTGCVAGASTVDDLKAMLDEAGFEAVEIAPKDESTEFISDWDADRDLGEYLVSATIEARKPARDD
Hi,
so far this many sequences have been retrieved:
grep -c '^>' hmm_id_721_nucl.fa
31021
but now it just keeps showing messages like this: Reusing existing connection to www.ncbi.nlm.nih.gov:443. HTTP request sent, awaiting response... 200 OK
and no further gene sequences are being retrieved.
I ran it as a bash script on the HPC. Should I submit it as a job on the HPC instead?
tmode=text&rettype=fasta&id=JXLL01000019&seq_start=24218&seq_stop=24844&strand=1
Reusing existing connection to www.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘hmm_id_721_nucl.fa’
hmm_id_721_nucl.fa [ <=> ] 732 --.-KB/s in 0s
2019-07-02 18:38:35 (13.6 MB/s) - ‘hmm_id_721_nucl.fa’ saved [732]
URL transformed to HTTPS due to an HSTS policy
--2019-07-02 18:38:35-- https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=FOBO01000014&seq_start=86078&seq_stop=87025&strand=1
Reusing existing connection to www.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... No data received.
Retrying.
Thanks Bioinfonext
You are most likely running afoul of the number of connections/queries allowed by NCBI per IP address. You may want to split nucl_urls.txt into smaller chunks past the point where downloads have been successful, and then run those pieces sequentially, allowing one download to complete before starting the next.
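For example, a minimal sketch of that idea (the chunk size of 21000 is arbitrary):
split -d -l 21000 nucl_urls.txt part
# one chunk at a time; each wget must finish before the next one starts
for f in part0*; do
wget -O ${f}.fasta -i ${f}
done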
Thanks genomax,
I split nucl_urls.txt into four parts:
nucl_urls1.txt, nucl_urls2.txt, nucl_urls3.txt, nucl_urls4.txt
but now I'm not sure how to change the script. Or should I just change the page range in the script, e.g. first download pages 1-25, then pages 26-50, and so on, like this:
#!/bin/bash
set -euo pipefail
for p in {1..25}; do
curl -s "http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=${p}" \
| grep "gbnucdata" \
| sed -r 's|.+seqAccno=(.+)&format=Genbank.+|http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide\&retmode=text\&rettype=fasta\&id=\1|' \
>> nucl_urls.txt
done
wget -O hmm_id_721_nucl.fa -i nucl_urls.txt
Thanks for all the help and your valuable time!
Hey Bioinfonext,
Got an idea: (1) Apply for an API_KEY (2) Split and download each chunk simultaneously. Have a read at https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/. It seems:
your key will increase the limit to 10 requests/second for all activity from that key
Thus the approach can be extended as follows (remember to remove nucl_urls.txt if it already exists, and note that the sed line has changed, with &api_key=${API_KEY} added):
API_KEY="REPLACE_WITH_YOUR_API_KEY"
for p in {1..83}; do
curl -s "http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=${p}" \
| grep "gbnucdata" \
| sed -r "s|.+seqAccno=(.+)\&format=Genbank.+|http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide\&retmode=text\&rettype=fasta\&id=\1\&api_key=${API_KEY}|" \
>> nucl_urls.txt
done
split -d -l 21000 nucl_urls.txt part
ls part0* | parallel -j 4 wget -O {}.fasta -i {}
This will download 4 parts at a time. Also make sure that there are 82,697 entries in nucl_urls.txt:
$ wc -l nucl_urls.txt
82697 nucl_urls.txt
Hi, it is showing an error:
$ bash download.sh
download.sh: line 12: parallel: command not found
It generated a few files:
$ ls
1 nucl_urls.txt part01 part03 part05 part07
download.sh part00 part02 part04 part06 part08
$ wc -l nucl_urls.txt
82697 nucl_urls.txt
and the script is:
#!/bin/bash
API_KEY="REPLACE_WITH_YOUR_API_KEY"
for p in {1..83}; do
curl -s "http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=${p}" \
| grep "gbnucdata" \
| sed -r "s|.+seqAccno=(.+)\&format=Genbank.+|http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide\&retmode=text\&rettype=fasta\&id=\1\&api_key=${API_KEY}|" \
>> nucl_urls.txt
done
split -d -l 10000 nucl_urls.txt part
ls part0* | parallel -j 4 wget -O {}.fasta -i {}
Thanks
Hey Bioinfonext,
You have to download and install the program called parallel, from https://savannah.gnu.org/projects/parallel/.
And I hope that in the script you executed, API_KEY was set to your own key.
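If installing parallel on the HPC is not straightforward, GNU xargs -P gives a rough equivalent (a sketch, assuming the part0* files from the split step already exist):
# run up to 4 wget processes at a time, one per chunk file
ls part0* | xargs -P 4 -I{} wget -q -O {}.fasta -i {}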
Hey Bioinfonext,
Read that blog; it's for the NCBI E-utilities API.
Got an idea: (1) Apply for an API_KEY. (2) Split and download each chunk simultaneously. Have a read of https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/.
I just added this part to script:
#!/bin/bash
#SBATCH --job-name=DOWNLOAD
#SBATCH --output=/users/3052771/sharedscratch/arsenic_amplicon
#SBATCH --ntasks=20
#SBATCH --time=80:00:00
Should I use 4 instead of 20 for ntasks?
Do I also need to add the current working directory here? I am not sure how to do that; our HPC just upgraded to SLURM, so I'm not sure about it!
You can explicitly save the output files to the directory you want; otherwise they will be saved in the directory you run this script from.
Here is some food for thought:
You could basically submit a separate job to download each link in the URL file via sbatch. That way a certain number of jobs (depending on the job slot limit on your account) will start and the rest will pend; as one job completes, the next in line gets pulled in. You may need to save the output files separately and then cat them into a big file later. You may want to weigh this option depending on if/how you are charged for use of compute resources.
In theory you could do something like this.
Note: I can't get this to work (I'm getting an HTTP error with the links that @SMK's script generates). You will need to adjust the SLURM options as needed.
$ num=0;for i in `cat nucl_urls.txt`; do echo sbatch -t 1-0 -p htsf --wrap=\"wget -O ${num}.fa ${i}\"; num=$((num+1)); done
sbatch -t 1-0 -p partition --wrap="wget -O 0.fa http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AM774418&seq_start=280304&seq_stop=281149&strand=1"
sbatch -t 1-0 -p partition --wrap="wget -O 1.fa http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AOJK01000067&seq_start=53467&seq_stop=54312&strand=1"
sbatch -t 1-0 -p partition --wrap="wget -O 2.fa http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AF016485&seq_start=135862&seq_stop=137037&strand=1"
Edit: You will need to remove echo and the \ before " to actually submit the jobs.
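Once all the per-link jobs have finished, the pieces can be stitched back together and checked, e.g.:
# combine the numerically named per-job files and verify the record count
cat [0-9]*.fa > hmm_id_721_nucl.fa
grep -c '^>' hmm_id_721_nucl.fa   # should match wc -l nucl_urls.txt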
First script, to get nucl_urls.txt:
#!/bin/bash
API_KEY="REPLACE_WITH_YOUR_API_KEY"
for p in {1..83}; do
curl -s "http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=${p}" \
| grep "gbnucdata" \
| sed -r "s|.+seqAccno=(.+)\&format=Genbank.+|http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide\&retmode=text\&rettype=fasta\&id=\1\&api_key=${API_KEY}|" \
>> nucl_urls.txt
done
and then a second script to do the partitioning and downloading. Do we need API_KEY="REPLACE_WITH_YOUR_API_KEY" here? I am not sure where to insert it if needed. Do we also need to give the input nucl_urls.txt as a variable?
#!/bin/bash
#SBATCH --job-name=DOWNLOAD
#SBATCH --output=/users/3052771/sharedscratch/arsenic_amplicon
#SBATCH --ntasks=20
#SBATCH --time=80:00:00
num=0;for i in `cat nucl_urls.txt`; do echo sbatch -t 1-0 -p htsf --wrap=\"wget -O ${num}.fa ${i}\"; num=$((num+1)); done
sbatch -t 1-0 -p partition --wrap="wget -O 0.fa http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AM774418&seq_start=280304&seq_stop=281149&strand=1"
sbatch -t 1-0 -p partition --wrap="wget -O 1.fa http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AOJK01000067&seq_start=53467&seq_stop=54312&strand=1"
sbatch -t 1-0 -p partition --wrap="wget -O 2.fa http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AF016485&seq_start=135862&seq_stop=137037&strand=1"
Bioinfonext: The method I demonstrated above was a way to submit individual SLURM jobs directly on the command line. You either submit jobs this way or create a SLURM script like the one you originally posted above.
You can't combine the two methods.
You would need to use your API_KEY no matter which option you use.
Just in case it's confusing for Bioinfonext: in the example script provided, his/her API key is saved in a variable called API_KEY and appended to the end of the URL (...&rettype=fasta&id=\1&api_key=${API_KEY}).
Hi SMK,
Thanks for your all help and time!
It finished downloading, but the total number of sequences is not 82,697:
FINISHED --2019-07-03 17:39:10--
Total wall clock time: 45m 55s
Downloaded: 2696 files, 2.2M in 1.5s (1.48 MB/s)
$ ls
1 part01 part03.fasta part06 part08.fasta
download.sh part01.fasta part04 part06.fasta
nucl_urls.txt part02 part04.fasta part07
part00 part02.fasta part05 part07.fasta
part00.fasta part03 part05.fasta part08
Total counts:
$ grep -c '^>' part00.fasta
8746
$grep -c '^>' part01.fasta
8751
$ grep -c '^>' part02.fasta
8753
$ grep -c '^>' part03.fasta
8750
$ grep -c '^>' part04.fasta
10000
$ grep -c '^>' part05.fasta
10000
$ grep -c '^>' part06.fasta
10000
$ grep -c '^>' part07.fasta
9998
$ grep -c '^>' part08.fasta
2696
These add up to 77,694. Could it be that the other sequences are not available in the NCBI database?
Thanks
Since I was unable to get @SMK's method to work, I reused a part of his code to come up with a method using Entrez Direct. This example uses just the first page of the original website.
$ for p in {1..1}; do curl -s "http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=${p}" | grep "gbnucdata" | awk -F "=|&" '{print $3,$5,$7}' | xargs -n 3 sh -c ' efetch -db nuccore -id "$0" -seq_start "$1" -seq_stop "$2" -format fasta' > sequences.fa ;done
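For readability, here is the same pipeline split across lines with comments, and with the redirect moved after done so that, if extended to all 83 pages, later pages append to the same file instead of overwriting it (a sketch based on @genomax's one-liner):
for p in {1..83}; do
curl -s "http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=${p}" \
| grep "gbnucdata" \
| awk -F "=|&" '{print $3, $5, $7}' \
| xargs -n 3 sh -c 'efetch -db nuccore -id "$0" -seq_start "$1" -seq_stop "$2" -format fasta'
done > sequences.fa
Splitting on '=' and '&' puts the accession, seq_start and seq_stop in fields 3, 5 and 7 of the href line; with sh -c, the three arguments from xargs arrive as $0, $1 and $2.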
It looks like this:
--2019-07-04 11:03:14-- http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=MPQZ01000001&seq_start=109958&seq_stop=110776&strand=1&api_key=redacted
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=MPQZ01000001&seq_start=109958&seq_stop=110776&strand=1&api_key=redacted [following]
--2019-07-04 11:03:14-- https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=MPQZ01000001&seq_start=109958&seq_stop=110776&strand=1&api_key=redacted
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘previously_failed.fa’
[ <=> ] 953 --.-K/s in 0s
2019-07-04 11:03:15 (9.84 MB/s) - ‘previously_failed.fa’ saved [953]
--2019-07-04 11:03:15-- http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=LK031773&seq_start=53141&seq_stop=54004&strand=1&api_key=redacted
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=LK031773&seq_start=53141&seq_stop=54004&strand=1&api_key=redacted [following]
--2019-07-04 11:03:15-- https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=LK031773&seq_start=53141&seq_stop=54004&strand=1&api_key=redacted
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘previously_failed.fa’
[ <=> ] 971 --.-K/s in 0s
2019-07-04 11:03:15 (4.81 MB/s) - ‘previously_failed.fa’ saved [971]
--2019-07-04 11:03:15-- http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AOIL01000062&seq_start=38464&seq_stop=39354&strand=1&api_key=redacted
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AOIL01000062&seq_start=38464&seq_stop=39354&strand=1&api_key=redacted
--2019-07-04 11:03:16-- https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AOIL01000062&seq_start=38464&seq_stop=39354&strand=1&api_key=redacted
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘previously_failed.fa’
[ <=> ] 1,006 --.-K/s in 0s
2019-07-04 11:03:16 (16.7 MB/s) - ‘previously_failed.fa’ saved [1006]
--2019-07-04 11:03:16-- http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleo
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleo [following]
--2019-07-04 11:03:17-- https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleo
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:443... connected.
HTTP request sent, awaiting response... 400 Bad Request
2019-07-04 11:03:17 ERROR 400: Bad Request.
FINISHED --2019-07-04 11:03:17--
Total wall clock time: 3m 51s
Downloaded: 251 files, 234K in 0.03s (7.86 MB/s)
$ grep -c '^>' previously_failed.fa
251
Edit: API keys redacted by @GenoMax.
It seems the last one is broken?
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleo
Can you trace back and fix the URLs?
And remember to edit the post to remove your actual API KEY so that other people won't see it here (c0ebfXXXXXX)...
It seems the last one is broken?
Please check the content of the URLs! As I said before, judging from your log, some of the URLs are broken:
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleo
If it is broken like that, then you need to find out why and fix the URLs before re-running the script.
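One quick way to spot truncated lines before re-running (a sketch, assuming every well-formed URL carries a seq_stop parameter, as in the head output above):
# print only URLs that lack a seq_stop parameter, i.e. the broken ones
grep -v 'seq_stop=' nucl_urls.txt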
Hi SMK and genomax,
Thanks a lot. Using the commands below I successfully downloaded 82,692 sequences. In this case I moved the part*.fasta files to another folder, so all 82,697 links ended up in previously_failed.txt, and finally I got 82,692 sequences in previously_failed.fa.
grep -w -v -f <(grep '^>' part0*.fasta | awk -F":" '{gsub(">", "", $2); gsub("\\.[0-9]+", "", $2); print $2}') nucl_urls.txt > previously_failed.txt
wget -O previously_failed.fa -i previously_failed.txt
Now I will try to grab the remaining 5 sequences by comparing all the part*.fasta files and previously_failed.fa based on gene ID.
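For example, one way to list the missing IDs (a sketch along the same lines as the grep -v -f command above; the file names have.txt and want.txt are just illustrative):
# accessions present in the downloaded FASTA files (version suffix stripped)
grep -h '^>' part0*.fasta previously_failed.fa | sed -r 's/^>([A-Za-z0-9_]+)\..*/\1/' | sort -u > have.txt
# accessions requested in the URL list
sed -r 's/.*&id=([^&]+).*/\1/' nucl_urls.txt | sort -u > want.txt
# IDs that were requested but never downloaded
comm -23 want.txt have.txt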
Thanks, Bioinfonext
The end of previously_failed.txt looks like this (note the broken URL at the end):
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=LK031773&seq_start=53141&seq_stop=54004&strand=1&api_key=redacted
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AOIL01000062&seq_start=38464&seq_stop=39354&strand=1&api_key=redacted
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleo
Bioinfonext: While it is tempting to post every error you get, these are things you need to work on and fix yourself. This is part of the learning process.
Not exactly sure what you are asking. If you select all sequences and click begin analysis, it takes you to a new page with a download button to get "protein" or "nucleotide" sequence downloads.
Edit: You can only download 10000 sequences at a time, so you will need to chunk through this multiple times.
Thanks a lot for all your help, SMK. Should I run the whole script like this again? Do I also need to delete the previous nucl_urls.txt file?
Hi,
I used the above script and it was only able to download 251 sequences, finishing with an error:
Thanks
I would check whether the URLs are correct by pasting the problematic ones into a browser and seeing what it returns.
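A non-interactive version of that check, looping over the failing file and printing the HTTP status code next to each URL (a sketch; previously_failed.txt is the file from above):
while read -r url; do
# -o /dev/null discards the body; -w prints just the status code
status=$(curl -s -o /dev/null -w '%{http_code}' "$url")
echo "$status $url"
done < previously_failed.txt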