Question

Converting Braker2 gtf output to gff then genbank

0

Entering edit mode

3.6 years ago

robert.murphy ▴ 90

After running the Braker2 pipeline how would one go about converting the braker.gtf output plus fasta into a genbank format?

I found this post suggesting EMBOSS seqret. So I set about converting to from .gtf to .gff. To do this I used AGAT as suggested in this post. However when doing this I get a lot of gff3 reader errors:

For example:

The feature type (3rd column) is constrained to be either a term from the Sequence Ontology or an SO accession number. The latter alternative is distinguished using the syntax SO:000000. In either case, it must be sequence_feature (SO:00
We filter the ontology to apply this rule.              We found 1757 terms that are sequence_feature or is_a child of it.
-------------------------------- parse features --------------------------------
=> GFF version parser used: 2.5
gff3 reader error level1: No ID attribute found @ for the feature: IC0001_1     AUGUSTUS        gene    3320954 3321577 1       -       .
gff3 reader error level2: No ID attribute found @ for the feature: IC0001_1     AUGUSTUS        transcript      3320954 3321577 1       -       .
WARNING level2: No Parent attribute found @ for the feature: IC0001_1   AUGUSTUS        transcript      3320954 3321577 1       -       .       ID "transcript-1"
WARNING gff3 reader: Hmmm, be aware that your feature doesn't contain any Parent and locus tag. No worries, we will handle it by considering it as strictly sequential. If you disagree, please provide an ID or a comon tag by locus. @ the
IC0001_1        AUGUSTUS        transcript      3320954 3321577 1       -       .       ID "transcript-1"
gff3 reader error level1: No ID attribute found @ for the feature: IC0001_2468  AUGUSTUS        gene    2       442     0.78    -       .
gff3 reader error level2: No ID attribute found @ for the feature: IC0001_2468  AUGUSTUS        transcript      2       442     0.78    -       .
WARNING level2: No Parent attribute found @ for the feature: IC0001_2468        AUGUSTUS        transcript      2       442     0.78    -       .       ID "transcript-2"
WARNING gff3 reader: Hmmm, be aware that your feature doesn't contain any Parent and locus tag. No worries, we will handle it by considering it as strictly sequential. If you disagree, please provide an ID or a comon tag by locus. @ the
IC0001_2468     AUGUSTUS        transcript      2       442     0.78    -       .       ID "transcript-2"
gff3 reader error level1: No ID attribute found @ for the feature: IC0001_1     AUGUSTUS        gene    11900730        11901159        1       -       .
gff3 reader error level2: No ID attribute found @ for the feature: IC0001_1     AUGUSTUS        transcript      11900730        11901159        1       -       .
WARNING level2: No Parent attribute found @ for the feature: IC0001_1   AUGUSTUS        transcript      11900730        11901159        1       -       .       ID "transcript-3"
WARNING gff3 reader: Hmmm, be aware that your feature doesn't contain any Parent and locus tag. No worries, we will handle it by considering it as strictly sequential. If you disagree, please provide an ID or a comon tag by locus. @ the
IC0001_1        AUGUSTUS        transcript      11900730        11901159        1       -       .       ID "transcript-3"
gff3 reader error level1: No ID attribute found @ for the feature: IC0001_180   AUGUSTUS        gene    3084    5230    0.23    +       .
gff3 reader error level2: No ID attribute found @ for the feature: IC0001_180   AUGUSTUS        transcript      3084    5230    0.23    +       .
WARNING level2: No Parent attribute found @ for the feature: IC0001_180 AUGUSTUS        transcript      3084    5230    0.23    +       .       ID "transcript-4"
WARNING gff3 reader: Hmmm, be aware that your feature doesn't contain any Parent and locus tag. No worries, we will handle it by considering it as strictly sequential. If you disagree, please provide an ID or a comon tag by locus. @ the
IC0001_180      AUGUSTUS        transcript      3084    5230    0.23    +       .       ID "transcript-4"
gff3 reader error level1: No ID attribute found @ for the feature: IC0001_1386  AUGUSTUS        gene    494     973     0.65    -       .
gff3 reader error level2: No ID attribute found @ for the feature: IC0001_1386  AUGUSTUS        transcript      494     973     0.65    -       .
WARNING level2: No Parent attribute found @ for the feature: IC0001_1386        AUGUSTUS        transcript      494     973     0.65    -       .       ID "transcript-5"

I don't understand what the tools is unable to identify features when they are there. Is it perhaps because gtf allows for features names that gff3 does not? My worry is this will effect the final genbank files?

EDIT: Here is a sample of the gtf https://pastebin.com/CEyfqR1H

annotation • 5.5k views

ADD COMMENT • link 3.6 years ago by robert.murphy ▴ 90

0

Entering edit mode

Could you provide a sample of your GTF file? It sounds the last column (the 9th) is missing. So AGAT is creating relationships (parent/id) for the feature sequentialy (reading line by line a mRNA coming after a gene will be linked to it etc...)

ADD REPLY • link 3.6 years ago by Juke34 8.9k

0

Entering edit mode

I have added a sample of the gtf.

ADD REPLY • link 3.6 years ago by robert.murphy ▴ 90

0

Entering edit mode

Ok I have updated my answer. No worries, your conversion went well ^^

ADD REPLY • link 3.6 years ago by Juke34 8.9k

0

Entering edit mode

3.6 years ago

Juke34 8.9k

BTW something is wierd in your GTF file:

ANC-D_8901  AUGUSTUS    gene    1   783 0.7 +   .   g4417
ANC-D_8901  AUGUSTUS    transcript  1   783 0.7 +   .   g4417.t1
ANC-D_8901  AUGUSTUS    exon    670 783 .   +   .   transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";
ANC-D_8901  AUGUSTUS    stop_codon  781 783 .   +   0   transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417";

transcript_id "file_1_file_1_g4417.t1"; gene_id "file_1_file_1_g4417"; must be transcript_id "g4417.t1"; gene_id "g4417"; AGAT will not group the features together because they do not have proper relationships. In top of that I don't get where is the CDS, as you have a stop_codon feature defined.

If you didn't modify the file, and this is the original BRAKER output you should get in touch with them to show that. There are issues

ADD COMMENT • link 3.6 years ago by Juke34 8.9k

0

Entering edit mode

I have posted to their github. Do you see this as a problem for my use case?

ADD REPLY • link 3.6 years ago by robert.murphy ▴ 90

0

Entering edit mode

You will have twice more gene and transcript features after AGAT correction. So it depends what you plan to do do and which tool you use downstream.

You can fix the problem by first filtering out gene and transcript from the original file. Then run AGAT, it will recreate those features properly.

ADD REPLY • link 3.6 years ago by Juke34 8.9k

0

Entering edit mode

The plan is to use them for antiSMASH. Do you mean to filter out any line the the feature gene and transcript string?

ADD REPLY • link 3.6 years ago by robert.murphy ▴ 90

0

Entering edit mode

Yes with an awk command throw out all lines with gene and transcript feature type (3rd column). Then reconstruct them by runing agat_convert_sp_gxf2gxf.pl

ADD REPLY • link 3.6 years ago by Juke34 8.9k

0

Entering edit mode

Thank you, I will give this a try. Do you have any tools you would recomend for assesing the quallity of an annotation?

ADD REPLY • link 3.6 years ago by robert.murphy ▴ 90

0

Entering edit mode

stat comparison against other close relative species (nb gene, mean size UTR, etc), BUSCO values, functional annotation retrieved if you do functional annotation and mainly human expertise looking at annotation in a genome browser against other data e.g protein alignemnet, transcript alignment, etc...

ADD REPLY • link 3.6 years ago by Juke34 8.9k

0

Entering edit mode

I will try the first few suggestions thanks :) The last one it where I will lack! I am certainly not expert enough to do that. Would looking at values such as :

Sample, Number of gene models, Min gene length, Max gene length, Average gene length, Number of exons, Average number of exons per gene model, Average exon length, Number of gene models less than 200bp length

also be effective?

ADD REPLY • link 3.6 years ago by robert.murphy ▴ 90

0

Entering edit mode

Yes but you will need a referential to compare against

ADD REPLY • link 3.6 years ago by Juke34 8.9k

0

Entering edit mode

Okay that makes sense. How phylogenetically close do you recommend looking? I have annotated about 30 genomes of the genus I am looking into so can compare within that also?

ADD REPLY • link 3.6 years ago by robert.murphy ▴ 90

score 2 · Accepted Answer · 2021-04-09

2

Entering edit mode

3.6 years ago

Juke34 8.9k

gff3 reader errors means that the line is not as expected, here is missing the 9th column apparently

The braker output (augustus output) is wrong for gene and transcript features:

ANC-D_12    AUGUSTUS    gene    45173   47264   0.23    -   .   g9575
ANC-D_12    AUGUSTUS    transcript  45173   47264   0.23    -   .   g9575.t1

Indeed the attribute of the 9th column must be key/value pairs. The current 9th column looks more like a GFF1 version style.

The others features are correct:

ANC-D_12    AUGUSTUS    intron  46736   46788   1   -   .   transcript_id "file_1_file_1_g9575.t1"; gene_id "file_1_file_1_g9575";
ANC-D_12    AUGUSTUS    CDS 46789   46848   1   -   0   transcript_id "file_1_file_1_g9575.t1"; gene_id "file_1_file_1_g9575";
ANC-D_12    AUGUSTUS    exon    46789   46848   .   -   .   transcript_id "file_1_file_1_g9575.t1"; gene_id "file_1_file_1_g9575";

So the bioperl parser used by AGAT do not succeed to catch key/value pair from gene and attribute feature (so drops the g9575 and g9575.t1 information from the case described above), consequently AGAT inform you about this problem and fix it automatically by creating ID and Parent attributes.

ADD COMMENT • link 3.6 years ago by Juke34 8.9k

0

Entering edit mode

Thank you very much for the detailed response. In the future should I set --gvi 1 then for AGAT?

Secondly what it your suggested best course of action for converting this gff to genbank? The EMBOSS seqret tool seems to just go on forever with no output as though it is stalled or stuck somewhere.

ADD REPLY • link 3.6 years ago by robert.murphy ▴ 90

0

Entering edit mode

--gvi 1 is not a good solution beeause it will works for the gene and transcript features and then fails for the others...

For the conversion for NCBI submission look at GAG
For the conversion for EBI submission look at EMBLmyGFF3

One or the other; once submitted the data will be available in the other database within 24h, because this is a redundancy of the same DB. It is just the way to get in that is different, either the NCBI door or EBI door.

ADD REPLY • link 3.6 years ago by Juke34 8.9k

0

Entering edit mode

In EMBLmyGFF3 script should I concern myself much with the topology option? If yes do you have suggestions for good places to educate (I can't seem to find a good one) myself about it and how would I go about identifying which option is the best?

ADD REPLY • link 3.6 years ago by robert.murphy ▴ 90

0

Entering edit mode

You can get some information calling the EMBLmyGFF3 adanced help:

EMBLmyGFF3  --advanced_help topology

Otherwise you can find verbose information from the EMBL user manual here: ftp://ftp.ebi.ac.uk/pub/databases/embl/doc/usrman.txt

ADD REPLY • link 3.6 years ago by Juke34 8.9k

0

Entering edit mode

Oh it is simply whether the DNA is circular or linear, I thought it was something more complicated like chromatin organisation! Thank you.

ADD REPLY • link 3.6 years ago by robert.murphy ▴ 90

0

Entering edit mode

I have successfully converted to EMBL but I have multiple CDS features with the same locus_tag. Do you know what could be causing that? From the ones I have checked it is only 2 CDS per tag so one on the forward strand and one on the reverse strand?

ADD REPLY • link 3.6 years ago by robert.murphy ▴ 90