Question

Seqret gff to genbank error: features didn't transfer to output genbank properly

0

Entering edit mode

2.3 years ago

James ▴ 20

I have a fasta DNA sequence, and I have a gff-formated file that I made myself. I want to convert these two into a single genbank file with the features and the sequence. It seems like seqret is the best way to do this, however, I'm not getting the result I want and I can't find other posts with the same issue.

My input fasta looks like this:

>NODE_1_length_44443_cov_52.250767_cutoff_20_type_circular
GCAAGTAAAGTTATTGATGCAATAGTTGATATAGCATCAGAAAATGGAGAAGTTAGTGAA
AAAGAACTTTACTTTGCCGTCGAAGTAGCATCAGTTCTATGTGGAGAACCACATAGAAAA
ATCTTTAATAGAGAACTAGTAGATATGATAGTTAAAGATGCAGTAAAAGAATTAACTCAG...

and my input gff looks like this:

##gff-version 3
##sequence-region NODE_1_length_44443_cov_52.250767_cutoff_20_type_circular 1 44443
NODE_1_length_44443_cov_52.250767_cutoff_20_type_circular   CRISPR_BLAST    alignment   4017    4055    84.61538462 +   .   ID=Desulfofustis_sp_PB-SRB1_repeat_14_spacer_7_target
NODE_1_length_44443_cov_52.250767_cutoff_20_type_circular   CRISPR_BLAST    alignment   16670   16686   82.35294118 +   .   ID=Desulfofustis_sp_PB-SRB1_repeat_03_spacer_241_target
NODE_1_length_44443_cov_52.250767_cutoff_20_type_circular   tBLASTx alignment   31110   31358   100 -   .   ID=Desulfofustis_phage_MD01_head_protein_tBLASTx_hit
NODE_1_length_44443_cov_52.250767_cutoff_20_type_circular   tBLASTx alignment   31111   31173   100 +   .   ID=Desulfofustis_phage_MD01_head_protein_tBLASTx_hit...

I am using this command with seqret:

seqret -sequence fasta.fna -feature -fformat gff -fopenfile features.gff -osformat genbank -ofdirectory_outseq gbk_file -auto

And the result looks like this:

LOCUS       NODE_1_length_44443_cov_52.250767_cutoff_20_type_circular      44443 bp    DNA     linear   UNC 24-AUG-2022
FEATURES             Location/Qualifiers
 misc_feature    4017..4055
 misc_feature    16670..16686
 misc_feature    join(complement(31110..31358),31111..31173)
ORIGIN
    1 GCAAGTAAAG TTATTGATGC AATAGTTGAT ATAGCATCAG AAAATGGAGA AGTTAGTGAA
   61 AAAGAACTTT ACTTTGCCGT CGAAGTAGCA TCAGTTCTAT GTGGAGAACC ACATAGAAAA...

Instead of the feature type "alignment" showing, it's being replaced by "misc_feature". Also, I'm not sure why the "join(complement(..." part is appearing.

I considered the possibility that my gff file is not properly formatted, so I uploaded my fasta and gff files into IGV to see if it would display my features with their proper annotations, and it worked fine. The correct sequence was displayed with its correct features, coordinates, and annotations. I also tried the same seqret command but with a different fasta and gff file that I obtained from a sample online, and it converted properly.

What could be causing my problem here? Thank you in advance for the help.

*Note: I edited the text displayed in the code sample blocks so they would display the areas that are most messed up rather than the entire files. To view or download my actual input and output files, I've put them here.

gff annotation genbank seqret emboss • 1.1k views

ADD COMMENT • link 2.3 years ago by James ▴ 20

score 1 · Answer 1 · 2022-08-24

1

Entering edit mode

2.3 years ago

Istvan Albert 102k

I suspect that seqret wants to use valid feature types in the genbank file. And alignment may not be one of the valid types.

ADD COMMENT • link 2.3 years ago by Istvan Albert 102k

0

Entering edit mode

That's a good idea. I did a find and replace in my gff file from "alignment" to CDS to see if it worked. "misc_feature" was replaced by "CDS" which is a good sign, but the "join(complement(..." section still appeared and none of the feature IDs were shown.

ADD REPLY • link 2.3 years ago by James ▴ 20

1

Entering edit mode

the join(complement is how the reverse strand is indicated in GenBank

GenBank is an awkward and antiquated format without a proper specification, seqret is also an antiquated tool without a proper explanation of what it does ... so there

Try to use the words/terms that you see here

https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

for example, instead of ID call it protein_id or gene

ADD REPLY • link 2.3 years ago by Istvan Albert 102k

0

Entering edit mode

That worked! Everything is as it should be now. I changed "alignment" to "CDS" and "ID" to "note" and all of my annotations and features were in the genbank result. Thanks for your help!

ADD REPLY • link 2.3 years ago by James ▴ 20