I have a bunch of EC numbers based on which I would like to download the corresponding genes from several species.
An example of a search would be "Bacillus[ORGN] AND 1.7.7.2[EC/RN Number]" and if I use:
esearch -db nuccore -query "Bacillus[ORGN] AND 1.7.7.2[EC/RN Number]"
I can see 7 matches (the same that I see using the search on the NCBI website). These are all genomic sequences including genes/CDS with the EC number I am looking for.
My question is: how can I retrieve the sequences of the CDS, not the full genome?
I have tried to use efetch -format gene_fasta but I get the sequences from all CDS and all genomes as a single fasta. It would be okay if the EC number was in the name of the sequences but this info is not displayed (only have the locus_tag, location, sometines gene).
I don't think this strategy is going work since the result from your query is not the specific gene but the genome accession. Trying to think about a way to do this.
The thing is that if I search the gene database, I do not get any results.
I am now pulling the features from the genomes using efetch -format ft, retrieving the locus_tag associated with the EC number I am looking for and the using that locus_tag to get the fasta sequence from the sequences obtained using efetch -format gene_fasta. Probably not the most effective way but it appears to be working.
This does not appear to be downloading specific EC numbers. The EC number above is for ferredoxin nitrate reductaseenzyme and you seem to be recovering DnaAgene. Looks like your query may simply be recovering all CDS.
The command, indeed, retrieves all genes from Bacillus sequences that have the required EC# -- 1 gene sequence per line. You just need to grep the output, e.g., here I got 19 sequences:
time esearch -db nuccore -query "Bacillus[ORGN] AND 1.7.7.2[EC/RN Number]" | efetch -format gbc |
xtract -insd CDS gene product feat_location sub_sequence | sed 's/ /_/g' | grep -i ferredoxin > file**
real 1m55.001s
user 0m5.944s
sys 0m2.172s
$ wc -l file
19 file
$ head -1 file
CP017247.1 - Ferredoxin 2511566..2511814 ATGCCAAAATATACAATTGTAGACAAAGATACGTGCATCGCATGCGGAGCTTGTGGTGCTGCGGCTCCTGATATTTATGATTACGACGATGAGGGAATCGCATTTGTCACCCTTGACGACAATCAGGGTGTCGTCGAAGTCCCTGACGTCTTAGAAGAAGACATGATGGACGCGTTTGAAGGCTGTCCTACAGATTCGATCAAAGTTGCGGATGAGCCGTTCGAAGGCGACCCGCTTAAACACGAATAA
I don't think this strategy is going work since the result from your query is not the specific gene but the genome accession. Trying to think about a way to do this.
The thing is that if I search the gene database, I do not get any results. I am now pulling the features from the genomes using efetch -format ft, retrieving the locus_tag associated with the EC number I am looking for and the using that locus_tag to get the fasta sequence from the sequences obtained using efetch -format gene_fasta. Probably not the most effective way but it appears to be working.