Is there a faster way to download gene sequences from NCBI via E-utils?
3.6 years ago

Hello,

I am trying to download gene sequences from NCBI via E-utils like this:

esearch -db gene -q "1496[Taxonomy ID] AND proC[Gene Name] AND alive[prop]"  | elink -db gene -target nuccore | efetch -db nuccore -format gene_fasta > proC_1496_all.fasta
./fasta-unfold.sh proC_1496_all.fasta | egrep -A 1 "\[gene=proC\]" > proC_1496.fasta

Here fasta-unfold.sh is my own script that unfolds the FASTA file so that each record is one header line followed by one sequence line. I want to download the sequence of the proC gene for a particular species. Unfortunately, the gene links to more than 100 records in the Nucleotide database, and the download takes a long time.
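For readers who want to reproduce the unfolding step, here is a minimal sketch of what a script like fasta-unfold.sh could look like. The original script is not shown in this post, so the function name `fasta_unfold` and the awk logic are assumptions based only on the description above (one header line, then one sequence line).

```shell
# Hypothetical stand-in for fasta-unfold.sh (the real script is not shown).
# Joins wrapped sequence lines so that every record becomes exactly
# one header line followed by one sequence line.
fasta_unfold() {
    awk '
        /^>/ {                          # header line starts a new record
            if (seq != "") print seq    # flush the previous sequence first
            print                       # print the header unchanged
            seq = ""
            next
        }
        { seq = seq $0 }                # append a wrapped sequence line
        END { if (seq != "") print seq }  # flush the last sequence
    ' "$@"
}
```

It reads from the files given as arguments, or from standard input when none are given, so it can be used either as `fasta_unfold proC_1496_all.fasta` or at the end of a pipe.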

After some basic comparisons (such as the one shown below), it turns out that only 8 of the more than 100 sequences are unique.

cat proC_1496.fasta | egrep -A 1 "\[gene=proC\]" | grep -v '>' | sort -u | wc -l
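If the goal is to keep one record per distinct sequence rather than just count them, something like the following sketch might help. It assumes the input is already unfolded (one header line, one sequence line), and the function name `fasta_unique` is made up for illustration. Note that GNU grep prints `--` separator lines between non-adjacent groups of matches when `-A` is used, so the sketch skips those lines.

```shell
# Hypothetical helper: keep only the first record for each distinct sequence.
# Assumes unfolded FASTA input (header line followed by one sequence line).
fasta_unique() {
    awk '
        /^--$/ { next }                 # skip group separators left by grep -A
        /^>/   { header = $0; next }    # remember the most recent header
        !seen[$0]++ {                   # first time this sequence is seen
            print header
            print
        }
    ' "$@"
}
```

For example, `fasta_unique proC_1496.fasta > proC_1496_unique.fasta` would write the deduplicated records to a new file (the output file name is only an example).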

I thought it might be possible to download just the sequence by its coordinates, but the output of

esearch -db gene -q "1496[Taxonomy ID] AND proC[Gene Name] AND alive[prop]"  | esummary

does not contain the sequence ID or the start and stop coordinates of the gene in any sequence.

So I wonder: is there a way to download only the sequence of a particular gene using E-utils, without downloading all the related sequences from the Nucleotide database?

Thank you in advance for any suggestions.

NCBI E-utils Unix