I don't understand how to use E-UTILITIES. I'm trying to download the records associated with the REGene DB. Is there a simple to use GUI based application for this? TIA
Using EntrezDirect:
$ more id2
468509
711526
480169
12558
30291
$ for i in `cat id2`; do esearch -db gene -query "${i}[GeneID] AND ALIVE [PROP]" | elink -db gene -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta >> gene_seq; done
The file gene_seq will contain the sequences.
Note: There may be more than one sequence per gene ID even though your table has only one row.
$ for i in `cat id2`; do esearch -db gene -query "${i}[GeneID] AND ALIVE [PROP]" | elink -db gene -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta | grep ">"; done
>XM_016933484.2 PREDICTED: Pan troglodytes cadherin 2 (CDH2), transcript variant X2, mRNA
>XM_523898.6 PREDICTED: Pan troglodytes cadherin 2 (CDH2), transcript variant X1, mRNA
>XM_028838057.1 PREDICTED: Macaca mulatta cadherin 2 (CDH2), transcript variant X3, mRNA
>XM_028838055.1 PREDICTED: Macaca mulatta cadherin 2 (CDH2), transcript variant X2, mRNA
>XM_015121712.2 PREDICTED: Macaca mulatta cadherin 2 (CDH2), transcript variant X1, mRNA
>NM_001287156.2 Canis lupus familiaris cadherin 2 (CDH2), mRNA
>NM_007664.5 Mus musculus cadherin 2 (Cdh2), mRNA
>XM_006525553.2 PREDICTED: Mus musculus cadherin 2 (Cdh2), transcript variant X1, mRNA
>NM_131081.2 Danio rerio cadherin 2, type 1, N-cadherin (neuronal) (cdh2), mRNA
$ for i in `cat regen.csv`; do esearch -db gene -query "${i}[GeneID] AND ALIVE [PROP]" | elink -db gene -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta >> gene_seq; done
curl: (3) URL using bad/illegal format or missing URL
ERROR: curl command failed ( Tue 23 Nov 2021 11:37:57 AM EST ) with: 3
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?query_key=1&WebEnv=MCID_619d18e5a83b5f73a76527d4&retstart=0&retmax=1&db=gene&rettype=uilist&retmode=text&api_key=ca78f0a08d593f73292dbfbd65c103e96b08&tool=edirect&edirect=16.2&edirect_os=Linux&email=
WARNING: FAILURE ( Tue 23 Nov 2021 11:37:56 AM EST )
nquire -get https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ esearch.fcgi -query_key 1 -WebEnv MCID_619d18e5a83b5f73a76527d4 -retstart 0 -retmax 1 -db gene -rettype uilist -retmode text -api_key ca78f0a08d593f73292dbfbd65c103e96b08 -tool edirect -edirect 16.2 -edirect_os Linux -email
EMPTY RESULT
SECOND ATTEMPT
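One possible cause of the curl failure above (an assumption; the thread does not confirm it) is that each line of regen.csv carries more than a bare GeneID, such as a header row, extra comma-separated columns, or Windows line endings, all of which end up inside the query string. A minimal sketch that reduces such a file to clean numeric IDs first, assuming the GeneID is in the first comma-separated column; extract_gene_ids is a hypothetical helper name:

```shell
# Hypothetical helper: print the numeric values found in the first
# comma-separated column of a file, one per line, sorted and de-duplicated.
# Header rows, blank lines, quotes, spaces and carriage returns are dropped.
extract_gene_ids() {
    tr -d '\r' < "$1" | cut -d, -f1 | tr -d ' "' | grep -E '^[0-9]+$' | sort -un
}

# Usage (not run here because it needs network access and EDirect):
#   extract_gene_ids regen.csv > ids.txt
#   while read -r i; do
#       esearch -db gene -query "${i}[GeneID] AND ALIVE [PROP]" |
#           elink -db gene -target nuccore -name gene_nuccore_refseqrna |
#           efetch -format fasta >> gene_seq
#   done < ids.txt
```

Reading the cleaned list with while read instead of `cat` also avoids word splitting on any stray whitespace.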
Based on your post, I got the lists of genes from here: http://regene.bioinfo-minzhao.org/download.cgi
I downloaded both lists and created a single list of gene-ids using this command:
cut -f1 *.txt | grep "^[0-9]" | sort -u > regen.txt
From here, you can use this list to retrieve the gene sequences using datasets. You can install datasets using conda:
conda install -c conda-forge ncbi-datasets-cli
To download the genes, you can type:
datasets download gene gene-id --inputfile regen.txt --exclude-protein --exclude-rna --filename regen.zip
This command downloads only the gene FASTA file, excluding the protein and RNA sequences that are included in the data package by default.
Another option is to use the NCBI Datasets web interface (NCBI Datasets Gene).
You can upload a list of genes (like the one created using the first command) or enter your list manually.
When you get to the gene table, select all rows by clicking the box to the left of Gene ID, and then you can download the dataset.
Let me know if you have any questions. :)
What kind of records? Do you have IDs/gene names? Post examples.
If you want a GUI-based alternative, you may want to check out NCBI Datasets. It may or may not give you access to exactly what you want.
Yes, I would like to build a local copy of the database that was used in this paper (https://www.nature.com/articles/srep23167), and I have the GeneID, GeneSymb, GI, RefSeq and Organism; the number of records is 8460. Apparently there's a way to download just these records (as FASTA) via NCBI E-Utilities, but I can't figure it out (mainly the format of the commands).
My understanding is that I should be able to upload this list of GIs and then recursively download in 500 chunks the NCBI GenBank FASTA records. Some combination of EPost and EFetch, but I don't know how to structure the URLs and I'm unfamiliar with PERL.
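The EPost plus EFetch idea described above can be sketched with EDirect rather than hand-built URLs or Perl. This is only a sketch under assumptions: the GIs are stored one per line in a file, they are valid nuccore UIDs, and EDirect (epost, efetch) is installed. fetch_fasta_in_batches is a hypothetical helper name, not part of EDirect.

```shell
# Sketch: fetch FASTA records for a list of nuccore UIDs (e.g. GIs) in batches.
# Arguments: <ids_file> <out_file> [batch_size, default 500]
fetch_fasta_in_batches() {
    ids_file=$1      # one UID per line
    out_file=$2      # FASTA records from every batch end up here
    batch_size=${3:-500}

    tmp_dir=$(mktemp -d) || return 1
    # split writes chunk files named batch_aa, batch_ab, ...
    split -l "$batch_size" "$ids_file" "$tmp_dir/batch_"

    : > "$out_file"   # start from an empty output file
    for chunk in "$tmp_dir"/batch_*; do
        [ -e "$chunk" ] || continue   # nothing to do for an empty ID list
        # epost uploads the UIDs read from stdin to the Entrez history server;
        # efetch then retrieves the posted records as FASTA.
        epost -db nuccore < "$chunk" | efetch -format fasta >> "$out_file"
    done
    rm -rf "$tmp_dir"
}

# Usage (needs network access; gi_list.txt is an assumed file name):
#   fetch_fasta_in_batches gi_list.txt regene.fasta 500
```

Keeping each batch in its own chunk file makes it easy to rerun a single failed batch by hand.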
Example of the first 10 rows of the file:
I'll try this...
It looks as if this (NCBI DATASETS) worked - I've requested a ZIPed DATASET download; we'll see if that works. If not, I'll try your commands below. One would have to install EDirect via the bash script so that one can use it in the terminal, correct?
You can install entrez-direct using conda.
Doesn't work. NCBI DATASETs barfs when you try to download the entire dataset.
Retrieve in smaller batches. cat locally after download.
Can you describe the command and error you're getting? Thanks!
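For the "smaller batches" suggestion, one way it might look with the datasets command quoted earlier in the thread is sketched below. The batch size of 500 and the helper name download_gene_batches are assumptions, and the --exclude-protein and --exclude-rna flags are copied from the command above; they may differ in other datasets versions.

```shell
# Sketch: download a large GeneID list with NCBI Datasets in smaller batches.
# Arguments: <ids_file> <out_dir> [batch_size, default 500]
download_gene_batches() {
    ids_file=$1            # one GeneID per line, e.g. regen.txt
    out_dir=$2             # receives batch_aa, batch_aa.zip, ...
    batch_size=${3:-500}   # assumed batch size; tune as needed

    mkdir -p "$out_dir" || return 1
    split -l "$batch_size" "$ids_file" "$out_dir/batch_"

    for chunk in "$out_dir"/batch_*; do
        case $chunk in *.zip) continue ;; esac   # skip archives from earlier runs
        [ -e "$chunk" ] || continue
        datasets download gene gene-id --inputfile "$chunk" \
            --exclude-protein --exclude-rna --filename "$chunk.zip"
    done
}

# Usage (needs network access):
#   download_gene_batches regen.txt regen_batches 500
# Then concatenate locally. The path inside each archive is an assumption;
# check it first with: unzip -l regen_batches/batch_aa.zip
#   for z in regen_batches/*.zip; do unzip -p "$z" 'ncbi_dataset/data/*.fna'; done > regen_genes.fna
```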