Question

From EC number to nucleotide sequence with biopython

2

Entering edit mode

8.4 years ago

simone.moro.2 ▴ 20

Hi,

if I have a list of EC numbers, is it possible to get the nucleotide sequence of the enzymes genes using biopython?

Thanks

sequence gene • 2.8k views

ADD COMMENT • link updated 4.6 years ago by Biostar 20 • written 8.4 years ago by simone.moro.2 ▴ 20

score 3 · Answer 1 · 2016-07-27

I doubt that this can be done in general: the E.C. number is classifying an enzymatic reaction and is not an unique ID to a certain enzyme nor its gene. E.g. one E.C. number will fit on several enzymes (either from several organisms and/or the same organism; and these enzymes can be totally different regarding their sequence). So there exist not a gene for a E.C. number.

What you can do is to find GenBank sequences which contain one or several E.C. numbers, and then try to extract the corresponding nucleotide sequence:

Search the nucleotide database for entries which contain your search term, e.g. E.C. 3.5.1.1. Of course, many GenBank entries don't have this information.
Fetch the sequences of these entries.

Here is an example for E.C. 3.5.1.1:

from Bio import Entrez, SeqIO

Entrez.email = 'Your.Name.Here@example.org'

# First, find entries with the contain the E.C. number
handle = Entrez.esearch(db='nucleotide', term='E.C. 3.5.1.1')
entries = Entrez.read(handle)
handle.close()

# Second, fetch these entries
handle = Entrez.efetch(db='nucleotide', id=entries['IdList'], rettype='gb',
                       retmode='xml') 
records = Entrez.parse(handle)

# Now, we go through the records and look for a feature with name 'EC_number'
for record in records:
    for feature in record['GBSeq_feature-table']:
        for subfeature in feature['GBFeature_quals']:
            if (subfeature['GBQualifier_name'] == 'EC_number' and
                subfeature['GBQualifier_value'] == '3.5.1.1'):

                    # If we found it, we extract the seq's start and end
                    accession = record['GBSeq_primary-accession']
                    interval = feature['GBFeature_intervals'][0]
                    interval_start = interval['GBInterval_from']
                    interval_end = interval['GBInterval_to']
                    location = feature['GBFeature_location']
                    if location.startswith('complement'):
                        strand = 2
                    else:
                        strand = 1

                    # Now we fetch the nucleotide sequence
                    handle = Entrez.efetch(db="nucleotide", id=accession,
                                           rettype="fasta", strand=strand,
                                           seq_start = interval_start,
                                           seq_stop = interval_end)
                    seq = SeqIO.read(handle, "fasta")

                    print('GenBank Accession:{}'.format(accession))
                    print(seq.seq)

handle.close()

The output of this program looks like this:

GenBank Accession:NZ_FLQQ01000003
ATGGAAAGAAAACACATCTATATCGCGTATACG....
GenBank Accession:NZ_MATF01000018
ATGGAGTTTTTCAGGAAAACGGCATTAGCTGCT....
GenBank Accession:NZ_MATG01000008
ATGGAGTTTTTCAGGAAAACGGCATTAGCTGCT....

The output contains only 6 sequences, although there are much more 'asparaginase' (= E.C. 3.5.1.1) sequences available, but those are not annotated with the E.C. number. Of course, if you are interested in finding asparaginase e.g. from human, you can use one of these sequences to do a BLAST search in the human genome. Biopython has also the possibility to search the KEGG database, but currently I don't know if you can use this to extract a nucleotide sequence.