Below is a simple example of a GFF3 file:
1 T1 gene 3631 4605 . + . ID=ATNG01010
1 T1 mRNA 3631 4605 . + . ID=ATNG01010.1;Parent=ATNG01010
1 T1 exon 3631 3913 . + . ID=ATNG01010:exon:1;Parent=ATNG01010.1
1 T1 CDS 3860 3913 . + 0 ID=ATNG01010:CDS:1;Parent=ATNG01010.1
1 T1 exon 3996 4276 . + . ID=ATNG01010:exon:2;Parent=ATNG01010.1
1 T1 CDS 3996 4260 . + 2 ID=ATNG01010:CDS:2;Parent=ATNG01010.1
1 T1 exon 4486 4605 . + . ID=ATNG01010:exon:3;Parent=ATNG01010.1
My question is: if we found another coding sequence (encoding a different protein) ranging from 3752 to 3904, what should the GFF3 file look like? It seems to me that a GFF3 file allows only one protein-coding gene per mRNA. If not, could anyone show me an example? Thank you!
Hi, mbens, thanks for your help. What I actually meant is: can one gene/mRNA contain more than one open reading frame? Not CDS. I updated the question. My apologies.
I don't understand how that is different. Do you mean polycistronic transcripts? Or maybe uORFs (upstream open reading frames)?
In case of polycistronic transcripts:
In the case of uORFs, I am not aware of a special GFF3 definition. I would add two CDS features and use the 'Note' attribute to indicate that one is a uORF.
EDIT: According to the Sequence Ontology, you could use 'five_prime_open_reading_frame' as the type (3rd column) for upstream open reading frames.
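As a rough sketch of these two options (not an official recipe), the Python snippet below formats GFF3 lines for the second ORF from the question (3752 to 3904) as a child of the same mRNA. The helper `gff3_line`, the ID `ATNG01010:uORF:1` and the note text are made up for illustration. The attribute is written as `Note`, the capitalised spelling that GFF3 reserves for free-text notes.

```python
def gff3_line(seqid, source, ftype, start, end, strand, phase, attrs):
    """Format one tab-separated GFF3 record; attrs is a dict of attribute pairs."""
    attr_field = ";".join(f"{key}={value}" for key, value in attrs.items())
    columns = [seqid, source, ftype, str(start), str(end), ".", strand, phase, attr_field]
    return "\t".join(columns)

# Illustrative attributes for the second ORF (the ID and note text are assumptions).
uorf_attrs = {"ID": "ATNG01010:uORF:1", "Parent": "ATNG01010.1", "Note": "uORF"}

# Option 1: a second CDS feature under the same mRNA, flagged through Note.
# Its own ID keeps it apart from the CDS lines of the main ORF.
uorf_as_cds = gff3_line("1", "T1", "CDS", 3752, 3904, "+", "0", uorf_attrs)

# Option 2: the Sequence Ontology term as the feature type (3rd column).
# Phase is only required for CDS features, so "." is used here.
uorf_as_so = gff3_line(
    "1", "T1", "five_prime_open_reading_frame", 3752, 3904, "+", ".", uorf_attrs
)

print(uorf_as_cds)
print(uorf_as_so)
```

Either line would be added next to the existing children of `ATNG01010.1`; whether a downstream tool understands the second CDS or the SO type depends on the tool, so it is worth checking with the parser you plan to use.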
Brilliant, mbens! Indeed, "polycistronic" is exactly what I was looking for and should be used for this question. The gene I have been working on is annotated as one gene/transcript, but Ribo-seq data suggests 2 possible ORFs with different peptide sequences. I just found a similar case in an Arabidopsis gene model. They apparently define it as one gene with different transcripts, although the two transcripts are identical and only the CDS part differs (see below). I guess both your suggestion and their method would work to create a GFF3 file. Thanks again.
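To make the "one gene, two identical transcripts, different CDS" layout concrete, here is a small Python sketch applied to the example gene from the question. It is only an illustration of the idea, not the content of the Arabidopsis file; the transcript ID `ATNG01010.2`, the feature IDs and the `record` helper are assumptions.

```python
# Exon coordinates copied from the example gene; the second ORF is the one
# from the question (3752 to 3904, which lies inside the first exon).
exons = [(3631, 3913), (3996, 4276), (4486, 4605)]
second_orf = (3752, 3904)

def record(ftype, start, end, phase, attributes):
    """Format one tab-separated GFF3 record on chromosome 1, plus strand."""
    columns = ["1", "T1", ftype, str(start), str(end), ".", "+", phase, attributes]
    return "\t".join(columns)

# Hypothetical second transcript: same span and exons as ATNG01010.1.
lines = [record("mRNA", exons[0][0], exons[-1][1], ".",
                "ID=ATNG01010.2;Parent=ATNG01010")]
for number, (start, end) in enumerate(exons, start=1):
    lines.append(record("exon", start, end, ".",
                        f"ID=ATNG01010.2:exon:{number};Parent=ATNG01010.2"))

# Only the CDS differs between the two transcripts.
lines.append(record("CDS", *second_orf, "0",
                    "ID=ATNG01010.2:CDS:1;Parent=ATNG01010.2"))

print("\n".join(lines))
```

The printed records would be appended to the original file, so the gene `ATNG01010` ends up with two mRNA children that share exon coordinates but carry different CDS features.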
From their gff file: