Question

In the GFF3 format, can one eukaryotic mRNA contain more than one protein-coding sequence (i.e. be polycistronic)?

2

7.6 years ago

I0110 ▴ 160

Below is a simple example of a GFF3 file:

1   T1  gene    3631    4605    .   +   .   ID=ATNG01010
1   T1  mRNA    3631    4605    .   +   .   ID=ATNG01010.1;Parent=ATNG01010
1   T1  exon    3631    3913    .   +   .   ID=ATNG01010:exon:1;Parent=ATNG01010.1
1   T1  CDS     3860    3913    .   +   0   ID=ATNG01010:CDS:1;Parent=ATNG01010.1
1   T1  exon    3996    4276    .   +   .   ID=ATNG01010:exon:2;Parent=ATNG01010.1
1   T1  CDS     3996    4260    .   +   2   ID=ATNG01010:CDS:2;Parent=ATNG01010.1
1   T1  exon    4486    4605    .   +   .   ID=ATNG01010:exon:3;Parent=ATNG01010.1

My question is: if we found another coding sequence (encoding a different protein) ranging from 3752 to 3904, what should the GFF3 file look like? It seems to me that a GFF3 file allows only one protein-coding gene per mRNA. If that is not the case, could anyone show me an example? Thank you!

gff3 annotation • 3.0k views

score 4 · Accepted Answer · 2018-01-04

4

7.6 years ago

mbens ▴ 100

In principle, you can define an arbitrary number of CDS features per mRNA. The Parent attribute of each CDS indicates which mRNA it belongs to. If a CDS spans multiple lines (a discontinuous feature), every one of those lines must carry the same ID so that they are read together as a single CDS. Read strictly, your example therefore already contains two separate protein-coding sequences for mRNA 'ATNG01010.1', namely 'ATNG01010:CDS:1' and 'ATNG01010:CDS:2', because their IDs differ. You could add a third one using the same pattern.
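
To make the ID/Parent grouping concrete, here is a minimal Python sketch. It is only an illustration, not part of the GFF3 specification: the helper names `parse_attributes`, `cds_by_parent` and `make_line` are made up, and the shared ID `ATNG01010.1:CDS` is a hypothetical alternative to the IDs in the question. It counts how many distinct CDS features each mRNA carries, treating lines with the same ID as one feature.

```python
from collections import defaultdict


def parse_attributes(field):
    """Turn 'ID=x;Parent=y' into {'ID': 'x', 'Parent': 'y'}."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def cds_by_parent(gff_text):
    """Map each parent mRNA to {CDS ID: [(start, end), ...]}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")  # GFF3 columns are tab-separated
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        # A feature may list several parents, separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            grouped[parent][attrs.get("ID")].append((int(cols[3]), int(cols[4])))
    return {parent: dict(cds) for parent, cds in grouped.items()}


def make_line(ftype, start, end, phase, attrs):
    """Build one tab-separated line in the style of the question's example."""
    return "\t".join(["1", "T1", ftype, str(start), str(end), ".", "+", phase, attrs])


# CDS lines as written in the question: two different IDs.
separate_ids = "\n".join([
    make_line("CDS", 3860, 3913, "0", "ID=ATNG01010:CDS:1;Parent=ATNG01010.1"),
    make_line("CDS", 3996, 4260, "2", "ID=ATNG01010:CDS:2;Parent=ATNG01010.1"),
])

# Hypothetical variant: both lines share one ID, i.e. one discontinuous CDS.
shared_id = "\n".join([
    make_line("CDS", 3860, 3913, "0", "ID=ATNG01010.1:CDS;Parent=ATNG01010.1"),
    make_line("CDS", 3996, 4260, "2", "ID=ATNG01010.1:CDS;Parent=ATNG01010.1"),
])

print(len(cds_by_parent(separate_ids)["ATNG01010.1"]))  # two CDS features
print(len(cds_by_parent(shared_id)["ATNG01010.1"]))     # one CDS feature, two segments
```

Whether a given tool interprets different IDs as different proteins depends on the tool, so it is worth checking with whatever parser you use downstream.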

GFF Specification: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md

0

Hi mbens, thanks for your help. What I actually meant is: can one gene/mRNA contain more than one open reading frame, rather than more than one CDS line? I updated the question. My apologies.

ADD REPLY • link 7.6 years ago by I0110 ▴ 160

1

I don't understand how that is different. Do you mean polycistronic transcripts? Or maybe uORFs (upstream open reading frames)?

In case of polycistronic transcripts:

define both genes (and assign different ID attributes, e.g. ID=geneA and ID=geneB)
define a single mRNA feature (e.g. ID=mrnaX) and list all comprised genes in its Parent attribute (e.g. Parent=geneA,geneB)
define both ORFs/CDS (e.g. ID=CDSx and ID=CDSy) and assign the mRNA as Parent (e.g. Parent=mrnaX)
add a "Derives_from" attribute to each ORF/CDS to indicate its origin (e.g. Derives_from=geneA and Derives_from=geneB)
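
The four steps above can be sketched as follows. This is only a sketch under assumptions: the coordinates are invented, the exon line and the `gff_line` helper are hypothetical additions, and the IDs (geneA, geneB, mrnaX, CDSx, CDSy) are the placeholder names from the list. The script prints a small polycistronic GFF3 fragment.

```python
def gff_line(seqid, source, ftype, start, end, strand, phase, **attrs):
    """Join the nine GFF3 columns with tabs; attributes become 'key=value;...'."""
    attr_field = ";".join(f"{key}={value}" for key, value in attrs.items())
    return "\t".join(
        [seqid, source, ftype, str(start), str(end), ".", strand, phase, attr_field]
    )


records = [
    # Step 1: both genes, each with its own ID.
    gff_line("1", "T1", "gene", 1000, 1600, "+", ".", ID="geneA"),
    gff_line("1", "T1", "gene", 1700, 2300, "+", ".", ID="geneB"),
    # Step 2: one mRNA that lists both genes as parents.
    gff_line("1", "T1", "mRNA", 1000, 2300, "+", ".", ID="mrnaX", Parent="geneA,geneB"),
    gff_line("1", "T1", "exon", 1000, 2300, "+", ".", ID="mrnaX:exon:1", Parent="mrnaX"),
    # Steps 3 and 4: one CDS per ORF, each pointing to the mRNA and to its gene.
    gff_line("1", "T1", "CDS", 1100, 1399, "+", "0",
             ID="CDSx", Parent="mrnaX", Derives_from="geneA"),
    gff_line("1", "T1", "CDS", 1800, 2099, "+", "0",
             ID="CDSy", Parent="mrnaX", Derives_from="geneB"),
]

print("##gff-version 3")
print("\n".join(records))
```

Both CDS features share the parent mrnaX, and Derives_from is what tells them apart at the gene level.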

In the case of uORFs, I am not aware of a special GFF3 definition. I would add two CDS features and use the 'Note' attribute to indicate that one of them is a uORF.

EDIT: According to the Sequence Ontology, you could use 'five_prime_open_reading_frame' as the type (3rd column) for upstream open reading frames.

ADD REPLY • link 7.6 years ago by mbens ▴ 100

0

Brilliant, mbens! Indeed, "polycistronic" is exactly the term I was looking for and should have used in this question. The gene I have been working on is annotated as one gene with one transcript, but Ribo-seq data suggest two possible ORFs with different peptide sequences. I just found a similar case in an Arabidopsis gene model. They apparently define it as one gene with different transcripts, even though the two transcripts are identical and only the CDS part differs (see below). I guess both your suggestion and their method would work for creating a GFF file. Thanks again.

From their GFF file:

Chr5    TAIR10  gene             758374  760382  .   +   .   ID=AT5G03190;Note=protein_coding_gene;Name=AT5G03190
Chr5    TAIR10  mRNA             758374  760382  .   +   .   ID=AT5G03190.1;Parent=AT5G03190;Name=AT5G03190.1;Index=1
Chr5    TAIR10  protein          758793  760148  .   +   .   ID=AT5G03190.1-Protein;Name=AT5G03190.1;Derives_from=AT5G03190.1
Chr5    TAIR10  exon             758374  760382  .   +   .   Parent=AT5G03190.1
Chr5    TAIR10  five_prime_UTR   758374  758792  .   +   .   Parent=AT5G03190.1
Chr5    TAIR10  CDS              758793  760148  .   +   0   Parent=AT5G03190.1,AT5G03190.1-Protein;
Chr5    TAIR10  three_prime_UTR  760149  760382  .   +   .   Parent=AT5G03190.1
Chr5    TAIR10  mRNA             758374  760382  .   +   .   ID=AT5G03190.2;Parent=AT5G03190;Name=AT5G03190.2;Index=1
Chr5    TAIR10  protein          758539  760148  .   +   .   ID=AT5G03190.2-Protein;Name=AT5G03190.2;Derives_from=AT5G03190.2
Chr5    TAIR10  exon             758374  758660  .   +   .   Parent=AT5G03190.2
Chr5    TAIR10  five_prime_UTR   758374  758538  .   +   .   Parent=AT5G03190.2
Chr5    TAIR10  CDS              758539  758660  .   +   0   Parent=AT5G03190.2,AT5G03190.2-Protein;
Chr5    TAIR10  exon             758843  760382  .   +   .   Parent=AT5G03190.2
Chr5    TAIR10  CDS              758843  760148  .   +   1   Parent=AT5G03190.2,AT5G03190.2-Protein;
Chr5    TAIR10  three_prime_UTR  760149  760382  .   +   .   Parent=AT5G03190.2
Chr5    TAIR10  mRNA             758374  760382  .   +   .   ID=AT5G03190.3;Parent=AT5G03190;Name=AT5G03190.3;Index=1
Chr5    TAIR10  protein          758539  758676  .   +   .   ID=AT5G03190.3-Protein;Name=AT5G03190.3;Derives_from=AT5G03190.3
Chr5    TAIR10  exon             758374  760382  .   +   .   Parent=AT5G03190.3
Chr5    TAIR10  five_prime_UTR   758374  758538  .   +   .   Parent=AT5G03190.3
Chr5    TAIR10  CDS              758539  758676  .   +   0   Parent=AT5G03190.3,AT5G03190.3-Protein;
Chr5    TAIR10  three_prime_UTR  758677  760382  .   +   .   Parent=AT5G03190.3
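
As a quick check of which transcripts in this excerpt share exons but differ in CDS, here is a small Python sketch. The exon and CDS coordinates are copied from the lines above; the function name `same_exons_different_cds` and the tuple layout are just my own choices for illustration, not anything TAIR provides.

```python
from collections import defaultdict

# (transcript, feature type, start, end), taken from the exon and CDS lines above
rows = [
    ("AT5G03190.1", "exon", 758374, 760382),
    ("AT5G03190.1", "CDS", 758793, 760148),
    ("AT5G03190.2", "exon", 758374, 758660),
    ("AT5G03190.2", "CDS", 758539, 758660),
    ("AT5G03190.2", "exon", 758843, 760382),
    ("AT5G03190.2", "CDS", 758843, 760148),
    ("AT5G03190.3", "exon", 758374, 760382),
    ("AT5G03190.3", "CDS", 758539, 758676),
]


def same_exons_different_cds(rows):
    """Return groups of transcripts with identical exons but different CDS."""
    exons = defaultdict(list)
    cds = defaultdict(list)
    for transcript, ftype, start, end in rows:
        target = exons if ftype == "exon" else cds
        target[transcript].append((start, end))

    # Group transcripts by their full exon structure.
    by_structure = defaultdict(list)
    for transcript, segments in exons.items():
        by_structure[tuple(sorted(segments))].append(transcript)

    groups = []
    for transcripts in by_structure.values():
        distinct_cds = {tuple(sorted(cds[t])) for t in transcripts}
        if len(transcripts) > 1 and len(distinct_cds) > 1:
            groups.append(sorted(transcripts))
    return groups


print(same_exons_different_cds(rows))
# [['AT5G03190.1', 'AT5G03190.3']]
```

So in this excerpt it is AT5G03190.1 and AT5G03190.3 that have the same single exon and differ only in their CDS, while AT5G03190.2 is spliced differently.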

ADD REPLY • link 7.6 years ago by I0110 ▴ 160