Extract genes within a genome using eUtils
1
0
Entering edit mode
20 months ago

Hi,

I have a bunch of EC numbers based on which I would like to download the corresponding genes from several species.

An example of a search would be "Bacillus[ORGN] AND 1.7.7.2[EC/RN Number]" and if I use:

esearch -db nuccore -query "Bacillus[ORGN] AND 1.7.7.2[EC/RN Number]"

I can see 7 matches (the same that I see using the search on the NCBI website). These are all genomic sequences including genes/CDS with the EC number I am looking for.

My question is: how can I retrieve the sequences of the CDS, not the full genome?

I have tried to use efetch -format gene_fasta but I get the sequences from all CDS and all genomes as a single fasta. It would be okay if the EC number was in the name of the sequences but this info is not displayed (only have the locus_tag, location, sometines gene).

Thanks for the help!

ncbi eutils entrez • 1.0k views
ADD COMMENT
0
Entering edit mode

I don't think this strategy is going work since the result from your query is not the specific gene but the genome accession. Trying to think about a way to do this.

ADD REPLY
0
Entering edit mode

The thing is that if I search the gene database, I do not get any results. I am now pulling the features from the genomes using efetch -format ft, retrieving the locus_tag associated with the EC number I am looking for and the using that locus_tag to get the fasta sequence from the sequences obtained using efetch -format gene_fasta. Probably not the most effective way but it appears to be working.

ADD REPLY
0
Entering edit mode
15 months ago
Eugene • 0

Try:

$ time esearch -db nuccore -query "Bacillus[ORGN] AND 1.7.7.2[EC/RN Number]" | efetch -format gbc | xtract -insd CDS gene product feat_location sub_sequence | sed 's/ /_/g' > file
1m53s
$ head -2 file
CP017247.1  -   Chromosomal_replication_initiator_protein_DnaA  34..1341    TTGGGGCAGATCGAAAAAAAATTGAGCAAGCCCAGCTTTGAAACATGGATGAAATCGACAAAGGCCCATTCATTGCAGGGCGATACGCTGATCATCACCGCACCGAACGAGTTCGCCAGAGACTGGCTTGAATCAAGATACCTGCACCTGATCGCCGATACGATCTATGATCTGACAGGAGAAGAATTGAGCATTAAATTTGTCATTCCTCAGAATCAAAATGAAGAAGATTTTATGCCAAAGTCTCCAATCAAAAAAATGTCGAAAGAAGAACCGGCTGATTTTCCGCAAAACATGCTGAATCCCAAATATACATTTGATACGTTCGTTATCGGTTCAGGAAACCGATTCGCCCACGCAGCGTCTTTGGCAGTGGCTGAAGCCCCGGCGAAAGCTTACAATCCGCTGTTTATTTACGGGGGAGTCGGACTTGGAAAGACTCACTTAATGCATGCGATCGGGCACTATGTCATCGATCACAATCCATCTGCAAAAGTGGTTTATTTGTCATCTGAGAAATTTACAAATGAGTTCATTAACTCGATCCGTGACAATAAAGCTGTCGATTTTCGCAATCGCTATCGAAATGTTGACGTTCTTTTAATAGACGATATTCAATTTTTAGCCGGAAAAGAACAGACGCAAGAGGAATTTTTCCATACGTTTAATACGCTGCATGAAGAAACAAAGCAGATTGTCATTTCCAGCGACCGGCCTCCAAAAGAGATCCCAACGCTTGAAGACCGTTTGCGCTCCCGTTTTGAATGGGGATTGATCACTGACATCACGCCTCCTGATCTGGAAACAAGAATTGCGATTTTAAGAAAGAAAGCAAAAGCAGAAGGACTTGATATCCCGAATGAAGTCATGCTTTATATTGCCAATCAGATCGACAGCAATATCAGGGAGCTGGAAGGGGCATTAATCAGGGTTGTCGCATATTCTTCCTTGATCAATAAAGACATTAACGCCGATCTGGCTGCTGAAGCTTTGAAAGATATCATTCCTTCTTCAAAGCCGAAAGTCATTACGATCAAAGACATCCAAAGAATCGTCGGCCAGCAGTTCAATATCAAGCTGGAGGATTTCAAGGCGAAGAAACGGACAAAATCGGTGGCTTTTCCGCGGCAGATCGCTATGTATCTATCAAGAGAAATGACAGATTCTTCTCTTCCGAAGATCGGCGAAGAATTTGGCGGACGCGACCACACGACGGTCATCCATGCCCATGAGAAAATATCAAAACTGCTGAGCGATGATGAACAGCTTCAGCAGCAGATTAAAGAAATTAAAGAGCAGCTGAGATAA
CP017247.1  dnaN    DNA-directed_DNA_polymerase 1518..2654  ATGAAGTTTACAATTCAAAAAGACCGCCTGGTCGAAAGTGTCCAAGATGTGTTAAAAGCCGTTTCTTCAAGAACGATGATTCCGATCTTAACCGGTATTAAAATCGTGGCCTCTGATGAAGGGGTCTCTCTGACAGGCAGCGATTCCGATATTTCGATTGAATCGTTTATCCCGAAAGAAGACGGCGATTTAGAGATCGTGACAATTGAACAGCCCGGCAGCATTGTGCTTCAAGCCCGTTTTTTCAGTGAAATTGTCAAAAAGCTGCCGATGTCAACAGTGGAAATCGAGGTTCAAAATCAATACTTAACGATCATCCGCTCCGGCAAAGCAGAGTTTAACTTAAACGGTTTGGATGCAAGCGAATATCCGCTTTTGCCGCAAATTGAAGAGCATCACGCTTTTCAAATTCCGACCGATCTGCTGAAAAACCTGATCCGCCAAACCGTTTTTGCAGTGTCCACCTCAGAAACACGCCCAATCTTGACAGGTGTAAACTGGAATGTCACTGGCGGTGAATTAATATGCACTGCAACGGATAGTCATCGTCTTGCGCTAAGGAAAGCTAAGCTCGACATTAACGAAGACAGTTCATACAATGTCGTCATCCCAGGAAAAAGCTTAACCGAGCTCAGCAAAATCCTTGATGACCATCAGGAGCTTGTCGATATTGTGATTACCGAAACACAAGTGCTGTTTAAAACAAAAAACGTTCTGTTTTTCTCCAGGCTTCTTGACGGAAACTATCCGGATACGAACCGCCTGATTCCTCAGGAAAGCAAAACGAACTTGATTGTCAATACTAAGGAATTTCTCCAGGCGATCGACAGGGCTTCGCTTTTGGCGAGAGAAGGACGAAACAACGTGGTGAAACTTTCCGCAGCCGCCAATGAGTCGATCGAAATTTCTTCAAACTCTCCGGAAATCGGAAAGGTTGTTGAAACGGTGAATGCCGAGCAGATCGAAGGGGAAGACTTAAAGATATCCTTTAGTCCGAAATATATGCTGGATGCCCTTAAAGTTCTTGAAGGAGAGGACATTCATGTAAGCTTCACAGGCGCTATGAGGCCTTTTCTGATCCGTACGCCGAATGACGATTCGATCGTCCAATTAATTCTTCCTGTCCGGACGTATTAA

You will get a table with 5 columns with -- one row per gene.

Gene

ADD COMMENT
0
Entering edit mode

This does not appear to be downloading specific EC numbers. The EC number above is for ferredoxin nitrate reductase enzyme and you seem to be recovering DnaAgene. Looks like your query may simply be recovering all CDS.

ADD REPLY
0
Entering edit mode

The command, indeed, retrieves all genes from Bacillus sequences that have the required EC# -- 1 gene sequence per line. You just need to grep the output, e.g., here I got 19 sequences:

time esearch -db nuccore -query "Bacillus[ORGN] AND 1.7.7.2[EC/RN Number]" | efetch -format gbc | 
xtract -insd CDS gene product feat_location sub_sequence | sed 's/ /_/g' | grep -i ferredoxin > file**
real    1m55.001s
user    0m5.944s
sys 0m2.172s
$ wc -l file
19 file
$ head -1 file
CP017247.1  -   Ferredoxin  2511566..2511814    ATGCCAAAATATACAATTGTAGACAAAGATACGTGCATCGCATGCGGAGCTTGTGGTGCTGCGGCTCCTGATATTTATGATTACGACGATGAGGGAATCGCATTTGTCACCCTTGACGACAATCAGGGTGTCGTCGAAGTCCCTGACGTCTTAGAAGAAGACATGATGGACGCGTTTGAAGGCTGTCCTACAGATTCGATCAAAGTTGCGGATGAGCCGTTCGAAGGCGACCCGCTTAAACACGAATAA
ADD REPLY

Login before adding your answer.

Traffic: 1208 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6