Question

Feature type locations overlap in Biopython. Which number is correct?

1

9.6 years ago

Good Gravy ▴ 20

In Biopython, feature.location.end can be equal to next_feature.location.start. For example:

type: TRANSMEM
location: [187:208]
qualifiers:
    Key: description, Value: Helical. {ECO:0000255}.
type: TOPO_DOM
location: [208:411]
qualifiers:
    Key: description, Value: Extracellular. {ECO:0000255}.

Although there is some biological ambiguity in this example (residue 208), in other cases there is none. So which domain do the residues at the shared boundary actually belong to?

python biopython • 2.1k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by Good Gravy ▴ 20

1

9.6 years ago

Devon Ryan 104k

Both. There's no reason that a given residue can't belong to more than one domain, particularly where they meet. An extracellular domain is a good example of that, since you could have helices/sheets that are extracellular (or intracellular, for that matter).

In any case, this is less a Biopython question than one for whoever did the annotation that you're looking at. Biopython is typically just parsing that and presenting it to you in a convenient manner.

ADD COMMENT • link 9.6 years ago by Devon Ryan 104k

Ram · Accepted Answer · 2015-04-29

3

9.6 years ago

Peter 6.0k

Devon is right that a protein might well be annotated with overlapping domains; however, the domains in your example do NOT overlap. Biopython uses Python-style slicing notation, where the start index is included and the end index is excluded, so [187:208] and [208:411] do NOT overlap. For example:

>>> example = "0123456789"
>>> example[3:6]
'345'
>>> example[6:9]
'678'

Also beware that Biopython and Python use zero-based counting, rather than the one-based counting you may be more used to. Note SwissProt/UniProt annotation files use one-based counting in their plain text and XML file formats.
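
To make the two conventions concrete, here is a minimal sketch in plain Python (no Biopython needed). The helper names `overlaps` and `to_one_based` are made up for illustration and are not part of any library; the coordinates are the ones from the question.

```python
def overlaps(a, b):
    """Return True if two half-open intervals (start, end) share any index."""
    return a[0] < b[1] and b[0] < a[1]


def to_one_based(start, end):
    """Convert a zero-based, end-exclusive slice to a one-based, inclusive range."""
    return start + 1, end


# Coordinates as Biopython reports them (zero-based, end-exclusive).
transmem = (187, 208)
topo_dom = (208, 411)

print(overlaps(transmem, topo_dom))  # False: index 208 belongs only to TOPO_DOM
print(to_one_based(*transmem))       # (188, 208)
print(to_one_based(*topo_dom))       # (209, 411)
```

Read this way, [187:208] corresponds to residues 188 to 208 in one-based counting and [208:411] to residues 209 to 411, so the two ranges sit side by side without sharing a residue.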

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by Peter 6.0k

0

Very informative, thanks. If I have understood you correctly, this only needs to be taken into account when converting the location integer into a human-readable position, and is not a concern for the amino acid sequence itself? For example, would any amino acid be incorrectly printed twice by print(TRANSMEM_domain.extract(record.seq), TOPO_DOM_domain.extract(record.seq))? The cookbook isn't very clear on this.

ADD REPLY • link updated 2.4 years ago by Ram 44k • written 9.6 years ago by Good Gravy ▴ 20

1

Yes, in this example you'd need to be careful about "position 208" (Python zero-based counting) versus "position 209" (more human-friendly one-based counting), which is the first amino acid in the TOPO_DOM feature.

The .extract(...) method knows about the slicing so would do the right thing.
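
A small sketch of that behaviour, assuming Biopython is installed. The 12-residue sequence and the coordinates below are a toy stand-in for record.seq, made up for illustration; they are not taken from the record in the question.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature

# Toy sequence standing in for record.seq (hypothetical data).
seq = Seq("ACDEFGHIKLMN")

# Two adjacent features: the end of the first equals the start of the second.
first = SeqFeature(FeatureLocation(2, 5), type="TRANSMEM")
second = SeqFeature(FeatureLocation(5, 9), type="TOPO_DOM")

first_part = first.extract(seq)
second_part = second.extract(seq)

print(first_part)   # DEF
print(second_part)  # GHIK
```

The residue at index 5 (G) appears only in the second feature, and joining the two extracted pieces gives the same residues as seq[2:9], with nothing repeated.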

ADD REPLY • link 9.6 years ago by Peter 6.0k