Convert Braker2 gff3 to EMBL flat file for ENA submission
5.7 years ago
m.eitel • 0

Hi!

How can I convert the GFF3 output of the BRAKER2 ab initio gene annotation pipeline into a valid EMBL flat file that I can submit to ENA?

I tried using EMBLmyGFF3 (https://github.com/NBISweden/EMBLmyGFF3). The tool seems to work fine, but the BRAKER GFF3 seems to be non-standard and I always get error messages like:

13:59:52 WARNING feature: Partial CDS. The CDS with ID=g5848.t1.braker.CDS2 not a multiple of three.

This is the part of the braker2 gff3 it refers to:

scaffold_001 AUGUSTUS gene 2081205 2082079 1 + . ID=g5848.braker;
scaffold_001 AUGUSTUS mRNA 2081205 2082079 1 + . ID=g5848.t1.braker;Parent=g5848.braker
scaffold_001 AUGUSTUS start_codon 2081205 2081207 . + 0 Parent=g5848.t1.braker;
scaffold_001 AUGUSTUS CDS 2081205 2081252 1 + 0 ID=g5848.t1.braker.CDS1;Parent=g5848.t1
scaffold_001 AUGUSTUS exon 2081205 2081252 . + . ID=g5848.t1.braker.exon1;Parent=g5848.t1;
scaffold_001 AUGUSTUS intron 2081253 2081594 1 + . Parent=g5848.t1.braker;
scaffold_001 AUGUSTUS CDS 2081595 2081656 1 + 0 ID=g5848.t1.braker.CDS2;Parent=g5848.t1
scaffold_001 AUGUSTUS exon 2081595 2081656 . + . ID=g5848.t1.braker.exon2;Parent=g5848.t1;
scaffold_001 AUGUSTUS intron 2081657 2081747 1 + . Parent=g5848.t1.braker;
scaffold_001 AUGUSTUS CDS 2081748 2081820 1 + 1 ID=g5848.t1.braker.CDS3;Parent=g5848.t1
scaffold_001 AUGUSTUS exon 2081748 2081820 . + . ID=g5848.t1.braker.exon3;Parent=g5848.t1;
scaffold_001 AUGUSTUS intron 2081821 2081890 1 + . Parent=g5848.t1.braker;
scaffold_001 AUGUSTUS CDS 2081891 2082079 1 + 0 ID=g5848.t1.braker.CDS4;Parent=g5848.t1
scaffold_001 AUGUSTUS exon 2081891 2082079 . + . ID=g5848.t1.braker.exon4;Parent=g5848.t1;
scaffold_001 AUGUSTUS stop_codon 2082077 2082079 . + 0 Parent=g5848.t1.braker;

I am basically getting this error for all genes...

Any suggestions are highly appreciated.

Michael

genome assembly gene Assembly • 2.8k views

You might also try GAG (https://github.com/genomeannotation/GAG) followed by tbl2asn (https://www.ncbi.nlm.nih.gov/genbank/tbl2asn2/) to try and submit to NCBI, but you will probably still get the same warnings


Thank you very much. I will give it a try on the latest annotation...

3.9 years ago
Juke34 9.0k

You should first standardize your BRAKER annotation with AGAT before processing it with EMBLmyGFF3.

agat_convert_sp_gxf2gxf.pl --gff input.gff -o output_standardized.gff

5.7 years ago

This is a warning, not an error. A CDS should be divisible by three because codons are 3bp, and CDS consist of codons.

I am not sure if CDS will always be annotated in 3bp codons, for example when lncRNAs are annotated.

Have you checked the sequences and looked at the annotation, e.g. in IGV? Do the CDS sequences look valid? Are the first and last codons always found and displayed correctly?

I hope it is also not a 0 based vs 1 based error, but it should not be.


It's not uncommon for BRAKER to output partial CDS, but they indeed do not make any biological sense. If they are not pseudogenes, you will need to amend them before submitting, as they will not be accepted by the public repositories.

lncRNAs should also not have a CDS assigned to them: they are non-coding (hence the name), so they will not produce a protein and a CDS is pointless for them.


Some lncRNAs contain (micro)ORFs which can make perfect biological sense.

For example: https://www.sciencedirect.com/science/article/pii/S0968000416300317


True, that one is a warning. However, I also got a bunch of ERROR messages:

13:59:49 ERROR feature: >>stop_codon<< is not a valid EMBL feature type. You can ignore this message if you don't need the feature.

13:59:49 ERROR feature: >>start_codon<< is not a valid EMBL feature type. You can ignore this message if you don't need the feature.

13:59:51 ERROR feature: >>inferred_parent<< is not a valid EMBL feature type. You can ignore this message if you don't need the feature.


Ah, then you'll have to check the EMBL feature types which are allowed. Also check a few existing EMBL files for examples.


Hi, I'm a developer of EMBLmyGFF3. You can ignore those features (stop_codon, start_codon, ...); they are not useful. inferred_parent is created by the bcbio-gff Python GFF parser when a parent feature is missing. This is generally not a good sign. Do you have many of those inferred_parent warnings?
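
To see how many features point at a parent that does not exist, a small standalone check can help. The following is only a rough sketch in plain Python: it does not use bcbio-gff or EMBLmyGFF3, the function names are made up for illustration, and it assumes simple `key=value;` attributes as in the excerpt from the question.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict, ignoring empty fields."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def find_dangling_parents(gff_lines):
    """Return the sorted Parent values that match no ID in the same lines."""
    ids, parents = set(), set()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        # Whitespace split so the sketch tolerates tabs or spaces.
        columns = line.split(None, 8)
        if len(columns) < 9:
            continue
        attrs = parse_attributes(columns[8])
        if "ID" in attrs:
            ids.add(attrs["ID"])
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                parents.add(parent)
    return sorted(parents - ids)


# Four lines taken from the excerpt in the question.
example = [
    "scaffold_001 AUGUSTUS gene 2081205 2082079 1 + . ID=g5848.braker;",
    "scaffold_001 AUGUSTUS mRNA 2081205 2082079 1 + . ID=g5848.t1.braker;Parent=g5848.braker",
    "scaffold_001 AUGUSTUS CDS 2081205 2081252 1 + 0 ID=g5848.t1.braker.CDS1;Parent=g5848.t1",
    "scaffold_001 AUGUSTUS intron 2081253 2081594 1 + . Parent=g5848.t1.braker;",
]
print(find_dangling_parents(example))  # ['g5848.t1']
```

Any name printed here is a Parent value with no matching ID, which is the situation in which a parser has to invent a parent feature.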


When I load the GFF into visualization software (Geneious in my case), the CDS seem normal.

I am just wondering whether this is a BRAKER/AUGUSTUS bug or a non-standard GFF3.


Can you check at the nucleotide level that the start position indicated in the GFF is really ATG and that the stop codon is one of the accepted stop codons? If it's a 0-based vs 1-based problem, you should be able to spot it easily.
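
A minimal sketch of such a check, assuming the scaffold sequence is already loaded as a plain string and the standard genetic code applies. The toy sequence and the helper name `codon_at` are invented for illustration. GFF3 coordinates are 1-based and inclusive, so they have to be shifted for Python slicing:

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code only (assumption)


def reverse_complement(seq):
    """Reverse-complement a DNA string (A/C/G/T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def codon_at(scaffold_seq, start, end, strand="+"):
    """Extract a GFF3 interval (1-based, inclusive) as an uppercase string."""
    sub = scaffold_seq[start - 1:end]  # 1-based inclusive -> 0-based half-open
    if strand == "-":
        sub = reverse_complement(sub)
    return sub.upper()


# Toy scaffold: start codon at positions 3-5, stop codon at positions 9-11.
toy = "CCATGAAATAGCC"
print(codon_at(toy, 3, 5))                  # ATG
print(codon_at(toy, 9, 11) in STOP_CODONS)  # True
# An off-by-one shift shows up immediately as a wrong triplet:
print(codon_at(toy, 2, 4))                  # CAT
```

For the gene in the question, the intervals to test would be the start_codon (2081205-2081207) and stop_codon (2082077-2082079) lines on scaffold_001.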

As @lieven.sterck said, it's not uncommon to get fragmented predicted genes, but if you have many of them it's really suspicious. Was your assembly highly fragmented?

5.7 years ago
Juke34 9.0k

In your example the CDS is definitely a multiple of three. The problem could come from something else, possibly a bug in the output format.
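
The arithmetic can be checked directly from the four CDS lines in the question (GFF3 coordinates are inclusive, so each segment is end - start + 1 bases long):

```python
# (start, end) of CDS1..CDS4 for g5848.t1, copied from the GFF3 excerpt.
cds_segments = [
    (2081205, 2081252),
    (2081595, 2081656),
    (2081748, 2081820),
    (2081891, 2082079),
]

lengths = [end - start + 1 for start, end in cds_segments]
total = sum(lengths)

print(lengths)           # [48, 62, 73, 189]
print(total, total % 3)  # 372 0
```

Note that an individual segment such as CDS2 (62 bp) is not a multiple of three on its own, which is normal for a spliced gene; only the total over all segments of the transcript has to be.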

I mean that the level-3 CDS and exon features refer to a parent feature g5848.t1, but this feature doesn't exist. The mRNA ID is actually g5848.t1.braker (the intron, start_codon and stop_codon lines already point to it).
So either add .braker to the Parent of those sub-features or remove it from the mRNA ID (and from the lines that already refer to it).
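
The first option could be scripted roughly as follows. This is just a sketch under a few assumptions: the whole file fits in memory, attributes are simple `key=value;` pairs, and the only defect is the missing `.braker` suffix. The function name `repair_parents` is made up for illustration.

```python
import re

ID_RE = re.compile(r"\bID=([^;\s]+)")
PARENT_RE = re.compile(r"\bParent=([^;\s]+)")


def repair_parents(gff_lines, suffix=".braker"):
    """Append `suffix` to Parent values that only match an ID once it is added."""
    # First pass: collect every declared ID.
    ids = set()
    for line in gff_lines:
        if line.startswith("#"):
            continue
        match = ID_RE.search(line)
        if match:
            ids.add(match.group(1))

    def fix(match):
        fixed = []
        for parent in match.group(1).split(","):
            # Only touch parents that are unknown as written but known with the suffix.
            if parent not in ids and parent + suffix in ids:
                parent += suffix
            fixed.append(parent)
        return "Parent=" + ",".join(fixed)

    # Second pass: rewrite Parent attributes, leaving comment lines untouched.
    return [
        line if line.startswith("#") else PARENT_RE.sub(fix, line)
        for line in gff_lines
    ]


example = [
    "scaffold_001 AUGUSTUS gene 2081205 2082079 1 + . ID=g5848.braker;",
    "scaffold_001 AUGUSTUS mRNA 2081205 2082079 1 + . ID=g5848.t1.braker;Parent=g5848.braker",
    "scaffold_001 AUGUSTUS CDS 2081205 2081252 1 + 0 ID=g5848.t1.braker.CDS1;Parent=g5848.t1",
]
for line in repair_parents(example):
    print(line)
```

Lines whose Parent already resolves are left exactly as they were, so running it twice changes nothing further.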

That at least explains why you get inferred_parent features appearing from nowhere.

Try to fix that first, maybe it will solve the other problem too.
