How to download all expressed sequence tags (ESTs) for a particular species from NCBI using the command line?
8 months ago
Sony ▴ 10

Hello everyone,

I am going to annotate genes in my de novo assembled sequence using the Maker2 pipeline, and I need to download expressed sequence tags (ESTs). I already know how to download them manually from the NCBI web page, but that takes a long time because there are 1277677 EST records. How can I download all ESTs for Oryza sativa from NCBI using a Linux command? Thank you all so much.

NCBI EST • 668 views
8 months ago
GenoMax 147k

Using Entrez Direct (EDirect):

$ esearch -db nuccore -query "Oryza sativa [Organism]  AND biomol_mrna[PROP]" | efetch -format fasta > oryza_est.fa

This should generate 1363754 sequences.

Or you could do

$ esearch -db nuccore -query "Oryza sativa [Organism]  AND is_est [FILTER]" | efetch -format fasta > fle.fa

if you want the 1255251 sequences you see above.

This will still take time since the search is equivalent to what you are doing on the web site above.
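
Since a download this large can be cut short, it may be worth checking the record count once it finishes. Below is a minimal sketch, not part of the original answer: the function names `count_fasta_records` and `check_fasta_count` are made up for illustration, and the expected count is whatever the search reported at the time, which can drift as the database changes.

```shell
#!/bin/sh
# Count FASTA records by counting header lines (lines starting with ">").
count_fasta_records() {
    # grep -c prints 0 and exits 1 when nothing matches; "|| true" keeps
    # the function usable under "set -e".
    grep -c '^>' "$1" || true
}

# Compare the number of records in a file with an expected count.
# Returns 0 when they match, 1 otherwise.
check_fasta_count() {
    file=$1
    expected=$2
    actual=$(count_fasta_records "$file")
    if [ "$actual" -eq "$expected" ]; then
        echo "OK: $actual records in $file"
    else
        echo "MISMATCH: $actual records in $file, expected $expected" >&2
        return 1
    fi
}

# Example (file name and count taken from the answer above):
# check_fasta_count oryza_est.fa 1363754
```

A mismatch does not necessarily mean a failed download, only that the file and the expected number disagree and the run deserves a second look.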

Thank you so much for your assistance. Based on your suggestion, I installed entrez-direct via Conda and tried to download the ESTs of a Brassica species with this command:

esearch -db nuccore -query "Brassica oleracea [Organism] AND biomol_mrna[PROP]" | efetch -format fasta > brassica_oleracea_est.fa

And I got this warning:

(entrez-direct) sony@hpz6:~/Brassica_practice/Ca1_annotation_dataset$ esearch -db nuccore -query "Brassica oleracea [Organism] AND biomol_mrna[PROP]" | efetch -format fasta > brassica_oleracea_est.fa
curl: (28) Failed to connect to eutils.ncbi.nlm.nih.gov port 443 after 10203 ms: Connection timed out
 ERROR:  curl command failed ( Fri Mar  8 12:14:00 AM CST 2024 ) with: 28
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi -d query_key=1&WebEnv=MCID_65e9e7878ef23d35490460ad&retstart=1550&retmax=50&db=nuccore&rettype=fasta&retmode=text&tool=edirect&edirect=16.2&edirect_os=Linux&email=sony%40hpz6
 WARNING:  FAILURE ( Fri Mar  8 12:14:00 AM CST 2024 )
nquire -url https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ efetch.fcgi -query_key 1 -WebEnv MCID_65e9e7878ef23d35490460ad -retstart 1550 -retmax 50 -db nuccore -rettype fasta -retmode text -tool edirect -edirect 16.2 -edirect_os Linux -email sony@hpz6
EMPTY RESULT
SECOND ATTEMPT

I am currently using a GPU Linux server, and my internet connection is good. Do you have any suggestions about this? Thank you once again.

I see the following

$ esearch -db nuccore -query "Brassica oleracea [Organism] AND biomol_mrna[PROP]"
<ENTREZ_DIRECT>
  <Db>nuccore</Db>
  <QueryKey>1</QueryKey>
  <Count>236694</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>

That means there are 236694 sequences that meet those criteria. Are you behind a firewall of some kind? It may not be allowing your request to go through.
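
If you want that count in a script (for example, to compare against the number of records you end up downloading), here is a rough sketch that pulls it out of the esearch XML with plain sed. The function name `esearch_count` is hypothetical, and it assumes the `<Count>` element sits on one line as in the output above. EDirect's own `xtract` should be able to do the same job.

```shell
#!/bin/sh
# Read esearch XML on stdin and print the value of the first <Count> element.
esearch_count() {
    sed -n 's:.*<Count>\([0-9][0-9]*\)</Count>.*:\1:p' | head -n 1
}

# Example (requires network access and EDirect):
# esearch -db nuccore -query "Brassica oleracea [Organism] AND biomol_mrna[PROP]" | esearch_count
```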

You can see that the top 5 sequences indeed belong to the right species

$ esearch -db nuccore -query "Brassica oleracea [Organism] AND biomol_mrna[PROP]" | efetch -format fasta | grep ">" | head -5
>OR866439.1 Brassica oleracea var. italica cultivar BOP04-28-6 MY (MY) mRNA, complete cds
>OR866438.1 Brassica oleracea var. italica cultivar BOP04-28-6 SMT (SMT) mRNA, complete cds
>OR876376.1 Brassica oleracea var. italica cultivar BOP04-28-6 myrosinase (MY) mRNA, complete cds
>OR876375.1 Brassica oleracea var. italica cultivar BOP04-28-6 selenocysteine methyltransferase (SMT) mRNA, complete cds
>OR865333.1 Brassica oleracea response to low sulfur 2 mRNA, complete cds
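
If the timeouts turn out to be intermittent rather than caused by a firewall, one option is to wrap the download in a generic retry loop. This is only a sketch under assumptions: `retry` and `fetch_brassica` are names invented here, and it assumes the pipeline exits non-zero when the fetch fails, which I have not verified for efetch. The log above shows EDirect already makes a second attempt on its own, so this only helps with longer outages.

```shell
#!/bin/sh
# Run a command up to $1 times, sleeping $2 seconds between failed attempts.
# Returns 0 on the first success, or 1 if every attempt fails.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max_attempts failed" >&2
        attempt=$((attempt + 1))
        # Only sleep when another attempt is coming.
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Hypothetical wrapper around the pipeline from this thread. The output file
# is rewritten from scratch on each attempt.
fetch_brassica() {
    esearch -db nuccore -query "Brassica oleracea [Organism] AND biomol_mrna[PROP]" |
        efetch -format fasta > brassica_oleracea_est.fa
}

# Example: up to 3 attempts, 60 seconds apart.
# retry 3 60 fetch_brassica
```

Even when the loop reports success, it is worth comparing the number of headers in the output file with the Count reported by esearch.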