GFF to TBL file conversion for annotation submission
6.8 years ago
arsilan324 ▴ 90

Hi everyone,

I am wondering if someone can help me convert an annotated GFF3 file to the .tbl file required for submission to NCBI. I tried GAG, but the output is useless: it contains no information. Then I used EMBLmyGFF3 to generate an .embl file, but it lacks the annotation descriptions. I also tried seqret, which did not help either. I am curious (and surprised) how people submit their annotations to NCBI. Expert opinion would be of great help!

Thanks

GFF Annotation NCBI Genome • 7.0k views

Hi, did you manage to resolve this issue?

By the way, how did it work out with EMBLmyGFF3 in the end?

6.8 years ago

I think NCBI accepts GFF3 files as well (see "Annotating Genomes with GFF3 or GTF files"), but the .tbl (or GenBank) format is indeed preferred for this.

There are a number of requirements and rules you will have to comply with, so you will need to check those.

I wish they accepted GFF3 files as annotation, but they have made this step really difficult. I have tried several scripts and tools for this purpose, but nothing helps. This includes GAG, gff2tbl, Python scripts by different researchers, and seqret. All in vain. Ridiculous and hectic.

OK, they don't directly accept GFF files, but they do offer a toolset to convert GFF files to a format accepted by the submission portal, as explained on the webpage I mentioned.

Yes, they do offer that toolset, but it is not working at all. I am getting 100,00+ errors, which I am unable to fix manually.

GFF is the worst "standard" to use ;-) I think it's probably not the toolset that is to blame but rather your GFF file.

Can you post an extract of your GFF file (and perhaps the errors/warnings you get)? Perhaps it can be easily 'fixed'.

I agree. The formatting of GFF3 sucks. Here is an excerpt of my file.

##gff-version 3.2.1
Transcript_100000   LAST    translated_nucleotide_match 194 331 3.200000e-04    +   .   ID=homology:205519;Name=N1RML9_FUSC4;Target=N1RML9_FUSC4 109 155 +;database=OrthoDB
Transcript_100000   LAST    translated_nucleotide_match 278 685 8.800000e-45    -   .   ID=homology:273349;Name=UniRef90_A0A199V823;Target=UniRef90_A0A199V823 123 260 +;database=uniref90
Transcript_100001   LAST    translated_nucleotide_match 111 535 1.100000e-45    +   .   ID=homology:273350;Name=UniRef90_M0RZL0;Target=UniRef90_M0RZL0 71 216 +;database=uniref90
Transcript_100002   HMMER   protein_hmm_match   1045    1848    3.900000e-48    .   .   ID=homology:50536;Name=Pkinase;Target=Pkinase 2 256 +;Note=Protein kinase domain;accuracy=0.79;env_coords=1042 1860;Dbxref="Pfam:PF00069.21"
Transcript_100002   HMMER   protein_hmm_match   1051    1857    6.000000e-49    .   .   ID=homology:50535;Name=Pkinase_Tyr;Target=Pkinase_Tyr 4 259 +;Note=Protein tyrosine kinase;accuracy=0.84;env_coords=1042 1860;Dbxref="Pfam:PF07714.13"
Transcript_100002   HMMER   protein_hmm_match   418 696 5.000000e-12    .   .   ID=homology:50538;Name=Stress-antifung;Target=Stress-antifung 7 92 +;Note=Salt stress response/antifungal;accuracy=0.76;env_coords=406 699;Dbxref="Pfam:PF01657.13"
Transcript_100002   HMMER   protein_hmm_match   85  372 1.700000e-14    .   .   ID=homology:50537;Name=Stress-antifung;Target=Stress-antifung 2 93 +;Note=Salt stress response/antifungal;accuracy=0.8;env_coords=82 372;Dbxref="Pfam:PF01657.13"
Transcript_100002   HMMER   protein_hmm_match   868 906 8.300000e+03    .   .   ID=homology:50539;Name=Stress-antifung;Target=Stress-antifung 46 58 +;Note=Salt stress response/antifungal;accuracy=0.77;env_coords=808 915;Dbxref="Pfam:PF01657.13"
Transcript_100002   LAST    translated_nucleotide_match 82  1905    4.600000e-168   -   .   ID=homology:273351;Name=UniRef90_UPI00057AFEB3;Target=UniRef90_UPI00057AFEB3 38 638 +;database=uniref90
Transcript_100002   LAST    translated_nucleotide_match 988 1644    2.000000e-45    -   .   ID=homology:205520;Name=ENSMGAP00000009823;Target=ENSMGAP00000009823 116 338 +;database=OrthoDB
Transcript_100002   transdecoder    CDS 141 2174    .   -   .   ID=cds.Transcript_100002|m.119955;Parent=Transcript_100002|m.119955
Transcript_100002   transdecoder    exon    1   2174    .   -   .   ID=Transcript_100002|m.119955.exon1;Parent=Transcript_100002|m.119955
Transcript_100002   transdecoder    gene    1   2174    .   -   .   ID=Transcript_100002|g.119955;Name=ORF%20Transcript_100002%7Cg.119955%20Transcript_100002%7Cm.119955%20type%3A5prime_partial%20len%3A678%20%28-%29
Transcript_100002   transdecoder    mRNA    1   2174    .   -   .   ID=Transcript_100002|m.119955;Parent=Transcript_100002|g.119955;Name=ORF%20Transcript_100002%7Cg.119955%20Transcript_100002%7Cm.119955%20type%3A5prime_partial%20len%3A678%20%28-%29
Transcript_100002   transdecoder    three_prime_UTR 1   140 .   -   .   ID=Transcript_100002|m.119955.utr3p1;Parent=Transcript_100002|m.119955

And the error is the poor feature table output, as can be seen below:

>Feature Transcript_1
8   85  gene
            locus_tag   homology:583549__conditional_reciprocal_best_LAST
8   85  conditional_reciprocal_best_LAST
            locus_tag   homology:583549__conditional_reciprocal_best_LAST
            protein_id  homology:583549__conditional_reciprocal_best_LAST
            note    hypothetical protein
            note    IMG_locus='homology:583549__conditional_reciprocal_best_LAST'
51  >257    gene
            locus_tag   homology:205516__translated_nucleotide_match
51  >257    translated_nucleotide_match
            locus_tag   homology:205516__translated_nucleotide_match
            protein_id  homology:205516__translated_nucleotide_match
            note    hypothetical protein
            note    IMG_locus='homology:205516__translated_nucleotide_match'
21  >257    gene
            locus_tag   homology:273345__translated_nucleotide_match
21  >257    translated_nucleotide_match
            locus_tag   homology:273345__translated_nucleotide_match
            protein_id  homology:273345__translated_nucleotide_match
            note    hypothetical protein
            note    IMG_locus='homology:273345__translated_nucleotide_match'
7   85  gene
            locus_tag   homology:388081__conditional_reciprocal_best_LAST
7   85  conditional_reciprocal_best_LAST
            locus_tag   homology:388081__conditional_reciprocal_best_LAST
            protein_id  homology:388081__conditional_reciprocal_best_LAST
            note    hypothetical protein
            note    IMG_locus='homology:388081__conditional_reciprocal_best_LAST'
>Feature Transcript_2
67  300 gene
            locus_tag   homology:56023__protein_hmm_match
67  300 protein_hmm_match
            locus_tag   homology:56023__protein_hmm_match
            protein_id  homology:56023__protein_hmm_match
            note    hypothetical protein
            note    IMG_locus='homology:56023__protein_hmm_match'

As mentioned on the NCBI webpage I linked previously, 'translated_nucleotide_match' and 'protein_hmm_match' are not among the recognized standard SO types. Do you need them in the submission files? If not, try removing them, e.g. with:

awk '$2 !~ /LAST/ && $2 !~ /HMMER/' < gff-file > gff-out-file

and try to run the parser again.
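
A variant of the filter above, as a minimal sketch: it matches on the feature type (column 3) rather than on the source (column 2), splits strictly on tabs, and always keeps `#` directive lines. The file names and the four sample records are made up for illustration, and the list of types to drop is an assumption based only on the two types named in this thread.

```shell
# Build a tiny tab-separated sample (hypothetical file name: sample.gff).
printf '##gff-version 3\n' > sample.gff
printf 'T1\tLAST\ttranslated_nucleotide_match\t194\t331\t.\t+\t.\tID=h1\n' >> sample.gff
printf 'T1\tHMMER\tprotein_hmm_match\t85\t372\t.\t.\t.\tID=h2\n' >> sample.gff
printf 'T1\ttransdecoder\tgene\t1\t2174\t.\t-\t.\tID=g1\n' >> sample.gff
printf 'T1\ttransdecoder\tCDS\t141\t2174\t.\t-\t.\tID=c1;Parent=m1\n' >> sample.gff

# Keep directive/comment lines, and drop records whose type (column 3)
# is one of the two homology-match types discussed above.
awk -F'\t' '/^#/ || ($3 != "translated_nucleotide_match" && $3 != "protein_hmm_match")' \
    sample.gff > sample.filtered.gff

cat sample.filtered.gff
```

Filtering on the type column should be a little more robust if other tools in the pipeline also emit match features under a different source name, but whether that applies depends on your file.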

You may also need to sort the GFF file on coordinates (and type, etc.); you can use this command line to do that:

sort -k1,1V -k4,4n -k5,5rn -k3,3r some.gff > some.sorted.gff

That's right. Thanks for sharing this. The first command worked perfectly, but the second didn't. It gives this error:

sort: stray character in field spec: invalid field specification `1,1V'

Can you please comment on it? Thanks

Most likely this is due to a different version of sort.

V is for 'natural' (version) sorting, but it's not critical, so you can try omitting it from the sort command line.
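
For a sort that lacks the V modifier, a possible fallback looks like the sketch below. It uses the same keys minus V, sets the field separator to a tab explicitly, and writes the `#` directive lines first so they are not sorted in among the records. The file names and the sample records are hypothetical, and `LC_ALL=C` is only there to make the ordering reproducible across machines.

```shell
# Build a small unsorted sample (hypothetical file name: some.gff).
printf '##gff-version 3\n' > some.gff
printf 'T2\tx\tgene\t5\t50\t.\t+\t.\tID=g2\n' >> some.gff
printf 'T1\tx\texon\t1\t40\t.\t+\t.\tID=e1\n' >> some.gff
printf 'T1\tx\tmRNA\t1\t100\t.\t+\t.\tID=m1\n' >> some.gff
printf 'T1\tx\tgene\t1\t100\t.\t+\t.\tID=g1\n' >> some.gff

tab=$(printf '\t')

{
  # Directive/comment lines go first, in their original order.
  grep '^#' some.gff
  # Records: sequence ID (plain lexical order instead of V), then start
  # ascending, end descending, and type in reverse alphabetical order.
  grep -v '^#' some.gff | LC_ALL=C sort -t "$tab" -k1,1 -k4,4n -k5,5rn -k3,3r
} > some.sorted.gff

cat some.sorted.gff
```

Without V, sequence names are ordered as plain strings, so `Transcript_10` sorts before `Transcript_2`. That affects only the order of the sequences, not the order of features within each one.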

6.2 years ago
Shaun Jackman ▴ 420

NCBI now accepts annotations submitted in GFF3 format. See https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/

First, this link was already posted by lieven sterck 7 months ago in this thread (see above). Second, I think this is a shortcut. NCBI and EBI do NOT accept direct submission in GFF format. Reading carefully, the page explains that the submission itself is in ASN format. As explained in the EMBLmyGFF3 article, and as you can see when you read the link you are sharing, they just give a recipe for preparing your GFF3 (manually, if it is not already perfect) so that it can be converted into ASN, which is then submitted to the database. For EBI, the accepted format is .embl.
