Duplicated entries + joining of introns in EMBL file created using EMBLmyGFF3
5.1 years ago
standonn ▴ 20

Dear all,

I have a gff3 file that looks like this:

# start gene g1
scaf00001   AUGUSTUS        gene    6504    8593    .       +       .       ID=g1
scaf00001   AUGUSTUS        transcript      6504    8593    .       +       .       ID=g1.t1;Parent=g1;Ontology_term=GO:0055085,GO:0016021
scaf00001   AUGUSTUS        intron  6625    6675    .       +       .       Parent=g1.t1,g1
scaf00001   AUGUSTUS        intron  6797    6841    .       +       .       Parent=g1.t1,g1
scaf00001   AUGUSTUS        intron  6924    6966    .       +       .       Parent=g1.t1,g1
scaf00001   AUGUSTUS        intron  7119    7161    .       +       .       Parent=g1.t1,g1
scaf00001   AUGUSTUS        intron  7245    7286    .       +       .       Parent=g1.t1,g1
scaf00001   AUGUSTUS        intron  7423    7476    .       +       .       Parent=g1.t1,g1
scaf00001   AUGUSTUS        intron  7630    7673    .       +       .       Parent=g1.t1,g1
scaf00001   AUGUSTUS        intron  7750    7962    .       +       .       Parent=g1.t1,g1
scaf00001   AUGUSTUS        intron  8110    8158    .       +       .       Parent=g1.t1,g1
scaf00001   AUGUSTUS        intron  8225    8265    .       +       .       Parent=g1.t1,g1
scaf00001   AUGUSTUS        intron  8365    8407    .       +       .       Parent=g1.t1,g1
scaf00001   AUGUSTUS        exon    6504    6624    .       +       0       Parent=g1.t1,g1
scaf00001   AUGUSTUS        exon    6676    6796    .       +       2       Parent=g1.t1,g1
scaf00001   AUGUSTUS        exon    6842    6923    .       +       1       Parent=g1.t1,g1
scaf00001   AUGUSTUS        exon    6967    7118    .       +       0       Parent=g1.t1,g1
scaf00001   AUGUSTUS        exon    7162    7244    .       +       1       Parent=g1.t1,g1
scaf00001   AUGUSTUS        exon    7287    7422    .       +       2       Parent=g1.t1,g1
scaf00001   AUGUSTUS        exon    7477    7629    .       +       1       Parent=g1.t1,g1
scaf00001   AUGUSTUS        exon    7674    7749    .       +       1       Parent=g1.t1,g1
scaf00001   AUGUSTUS        exon    7963    8109    .       +       0       Parent=g1.t1,g1
scaf00001   AUGUSTUS        exon    8159    8224    .       +       0       Parent=g1.t1,g1
scaf00001   AUGUSTUS        exon    8266    8364    .       +       0       Parent=g1.t1,g1
scaf00001   AUGUSTUS        exon    8408    8593    .       +       0       Parent=g1.t1,g1

I want to convert it to the EMBL flat file format.

To do that I have been using EMBLmyGFF3 in the following way:

EMBLmyGFF3 -i XXX -m "genomic DNA" -p XXX --rg "XXX" -t linear -x "INV" -s "XXX" -r 1 -o all-annotations.embl all-annotations.gff3 genome.fna

The output contains the following errors:

The exons are duplicated in the EMBL output file. Here is an example:

FT   exon            6504..6624
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"
FT   exon            6676..6796
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"
FT   exon            6842..6923
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"
FT   exon            6967..7118
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"
FT   exon            7162..7244
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"
FT   exon            7287..7422
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"
FT   exon            7477..7629
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"
FT   exon            7674..7749
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"
FT   exon            7963..8109
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"
FT   exon            8159..8224
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"
FT   exon            8266..8364
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"
FT   exon            8408..8593
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"
FT   exon            6504..6624
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"
FT   exon            6676..6796
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"
FT   exon            6842..6923
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"
FT   exon            6967..7118
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"
FT   exon            7162..7244
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"
FT   exon            7287..7422
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"
FT   exon            7477..7629
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"
FT   exon            7674..7749
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"
FT   exon            7963..8109
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"
FT   exon            8159..8224
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"
FT   exon            8266..8364
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"
FT   exon            8408..8593
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"

The introns are joined into a single feature, which does not make any biological sense. Here is an example:

FT   intron          join(6625..6675,6797..6841,6924..6966,7119..7161,
FT                   7245..7286,7423..7476,7630..7673,7750..7962,8110..8158,
FT                   8225..8265,8365..8407)
FT                   /locus_tag="LOCUS1"
FT                   /note="source:AUGUSTUS"

Does someone know why I'm getting these errors?

Many thanks, Sophie


Sorry, for some reason the EMBL lines all run together in the submitted question. I tried formatting them as code, but it didn't work.


Now formatted properly.


Thanks! It looks much better!

5.1 years ago
Juke34 8.9k

The duplicates are due to an oddity in the GFF file: all intron and exon features have two parents, one being the transcript and the other the gene:

Parent=g1.t1,g1

You should remove ,g1 from these lines. Having multiple parents is allowed, but it is usually used when an exon is shared by several transcripts, in which case all parents are transcript features.
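One way to remove the gene-level parent across the whole file is a small two-pass script. This is only a sketch, not part of EMBLmyGFF3: the function name drop_gene_parents is made up for illustration, and it assumes the real file is tab-delimited with nine columns, that gene features have the type gene, and that transcript features are typed transcript or mRNA.

```python
def drop_gene_parents(lines):
    """Return GFF3 lines with gene IDs removed from the Parent lists
    of sub-features such as exon and intron (hypothetical helper)."""
    lines = list(lines)

    # First pass: collect the IDs of all gene features.
    gene_ids = set()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "gene":
            for attr in cols[8].split(";"):
                if attr.startswith("ID="):
                    gene_ids.add(attr[len("ID="):])

    # Second pass: rewrite the Parent attribute of sub-features only.
    fixed = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Comments, malformed lines, genes and transcripts are left untouched
        # (a transcript legitimately has the gene as its parent).
        if (line.startswith("#") or len(cols) != 9
                or cols[2] in ("gene", "transcript", "mRNA")):
            fixed.append(line)
            continue
        attrs = []
        for attr in cols[8].split(";"):
            if attr.startswith("Parent="):
                parents = attr[len("Parent="):].split(",")
                kept = [p for p in parents if p not in gene_ids]
                # Only rewrite if at least one non-gene parent remains.
                if kept:
                    attr = "Parent=" + ",".join(kept)
            attrs.append(attr)
        cols[8] = ";".join(attrs)
        newline = "\n" if line.endswith("\n") else ""
        fixed.append("\t".join(cols) + newline)
    return fixed


if __name__ == "__main__":
    # Tab-delimited excerpt modelled on the file in the question.
    sample = [
        "scaf00001\tAUGUSTUS\tgene\t6504\t8593\t.\t+\t.\tID=g1\n",
        "scaf00001\tAUGUSTUS\ttranscript\t6504\t8593\t.\t+\t.\tID=g1.t1;Parent=g1\n",
        "scaf00001\tAUGUSTUS\texon\t6504\t6624\t.\t+\t0\tParent=g1.t1,g1\n",
    ]
    print("".join(drop_gene_parents(sample)), end="")
```

To run it on a real file, read the lines with open(...).readlines() and write the returned list to a new GFF3 file, keeping the original as a backup.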

The introns being joined together sounds like a bug; you should open an issue in the EMBLmyGFF3 GitHub repository.

In general, exons and introns are not useful to submit (they can be deduced from the mRNA locations), so if you run into problems with them, you can simply skip them, as described here, by modifying the translation_gff_feature_to_embl_feature.json file.
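If editing the JSON file is not an option, a different workaround is to filter the unwanted feature types out of the GFF3 before conversion. Note that this is not the JSON-based method mentioned above and is not equivalent to it: the features are removed from the input itself, so anything the converter derives from them would be affected. For that reason the sketch below only drops introns by default. The function name drop_feature_types and the tab-delimited nine-column layout are assumptions.

```python
def drop_feature_types(lines, skipped=frozenset({"intron"})):
    """Return GFF3 lines whose feature type (column 3) is not in `skipped`.

    Hypothetical preprocessing helper; comments and malformed lines are kept.
    """
    kept = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        is_feature = len(cols) == 9 and not line.startswith("#")
        if is_feature and cols[2] in skipped:
            continue  # drop this feature line
        kept.append(line)
    return kept


if __name__ == "__main__":
    sample = [
        "scaf00001\tAUGUSTUS\tgene\t6504\t8593\t.\t+\t.\tID=g1\n",
        "scaf00001\tAUGUSTUS\tintron\t6625\t6675\t.\t+\t.\tParent=g1.t1\n",
        "scaf00001\tAUGUSTUS\texon\t6504\t6624\t.\t+\t0\tParent=g1.t1\n",
    ]
    print("".join(drop_feature_types(sample)), end="")
```

Passing skipped={"exon", "intron"} would drop both types, but check the converted mRNA locations afterwards, since they may depend on the exon lines being present.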


Thank you so much for your answer!

I have removed the second parent, as it is indeed not necessary. I also followed your link to skip the introns and exons, and validated the EMBL file. Worked great!

I posted the issue about the joining of the intron features on the EMBLmyGFF3 GitHub page.

Again, thanks a lot!
