Hello,
I am trying to download gene sequences from NCBI via E-utils like this:
esearch -db gene -q "1496[Taxonomy ID] AND proC[Gene Name] AND alive[prop]" | elink -db gene -target nuccore | efetch -db nuccore -format gene_fasta > proC_1496_all.fasta
./fasta-unfold.sh proC_1496_all.fasta | egrep -A 1 "\[gene=proC\]" > proC_1496.fasta
Here fasta-unfold.sh is my own script that unfolds the FASTA file so that each record takes exactly two lines: the header on one line and the whole sequence on the next. I would like to download the sequence of the proC gene for a particular species. Unfortunately, there are more than 100 records in the Nucleotide database, so the file takes a long time to download.
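For reference, the unfolding step is roughly equivalent to the sketch below. It assumes a plain awk one-pass join is enough; the function name fasta_unfold is only illustrative and this is not the exact content of fasta-unfold.sh.

```shell
# Illustrative stand-in for fasta-unfold.sh (hypothetical name):
# join wrapped sequence lines so that every record becomes
# one header line followed by one sequence line.
fasta_unfold() {
  awk '
    /^>/ {                          # header line starts a new record
      if (seq != "") print seq      # flush the previous sequence, if any
      print                         # print the header unchanged
      seq = ""
      next
    }
    { seq = seq $0 }                # append a wrapped sequence line
    END { if (seq != "") print seq }  # flush the last sequence
  ' "$@"
}
```

It reads the named files, or standard input when no file is given, e.g. `fasta_unfold proC_1496_all.fasta`.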
After doing some basic comparisons (as shown below), it turns out that only 8 of the more than 100 sequences are unique.
cat proC_1496.fasta | egrep -A 1 "\[gene=proC\]" | grep -v '>' | sort -u | wc -l
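One caveat about this kind of count: GNU grep prints a `--` separator line between non-adjacent match groups when `-A` is used, and such a line would be counted as one extra "sequence". A slightly more defensive variant, assuming the input is already unfolded to one header line plus one sequence line per record (count_unique_seqs is just an illustrative name):

```shell
# Count distinct sequence lines in unfolded FASTA, ignoring header lines
# and the "--" group separators that grep -A may have left in the file.
count_unique_seqs() {
  grep -v -e '^>' -e '^--$' "$@" | sort -u | wc -l
}
```

It can be called as `count_unique_seqs proC_1496.fasta`.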
I thought it might be possible to download only the gene sequence by its coordinates, but the output of
esearch -db gene -q "1496[Taxonomy ID] AND proC[Gene Name] AND alive[prop]" | esummary
does not contain the sequence ID or the start and stop coordinates of the gene in any sequence.
So I wonder: is there a way to download only the sequence of a particular gene using E-utils, without downloading all the related sequences from the Nucleotide database?
Thank you in advance for any suggestions.