How to parse protein and DNA sequences from Prokka-generated gbk or gff files based on locus_tag?
3.8 years ago
Kumar ▴ 120

I need to extract accessory gene sequences (both DNA and amino acid) from Roary pangenome output. I have the list of locus_tags and the corresponding gbk and gff files, which were generated by the Prokka pipeline. Is there a tool or method to extract both the amino acid and DNA sequences from the gbk or gff files based on these locus_tags? Samples of the Roary accessory gene locus_tag list and the corresponding strain's gbk and gff files are shown below.

locus_tag list.csv

             locus_tag/Pcissicola19
    xynB_1   BGDHLHFA_02833
    smpB     BGDHLHFA_01427

Pcissicola19.gbk

gene            complement(39965..40852)
                     /gene="xynB_3"
                     /locus_tag="BGDHLHFA_02833"
     CDS             complement(39965..40852)
                     /gene="xynB_3"
                     /locus_tag="BGDHLHFA_02833"
                     /EC_number="3.2.1.37"
                     /inference="ab initio prediction:Prodigal:002006"
                     /inference="similar to AA sequence:UniProtKB:P36906"
                     /codon_start=1
                     /transl_table=11
                     /product="Beta-xylosidase"
                     /protein_id="Prokka:BGDHLHFA_02833"
                     /translation="MPELLAFVAKHKLPIDFVTTHTYGVDGGFLDENGKQDTKLSASL
                     DAIVGDVRRVRAQIQASPFPNLPLYFTQWSSSYTPRDFVHDSYISAPYILTKLKQVQG
                     LVQGMSYWTYTDLFEEPGPPPTPFHGGFGLMNREGIRKPAWFAYKYLHALKGRDVPLS
                     DAHSLAAVDGTRVAALVWNWQQPMQAVSNTPFYTKQVPATDSAPLRMRMTHVPAGTYQ
                     LQVRKTGYRRNDPLSLYIDMGMPKDLAPRQLTQLRQATHDAPEQDRRVRVGADGVVEI
                     NVPMRSNDVVLLTLEPAAR"

Pcissicola19.gff

ID=BGDHLHFA_02833_gene;Name=xynB_3;gene=xynB_3;locus_tag=BGDHLHFA_02833
gnl|Prokka|BGDHLHFA_249 Prodigal:002006 CDS 39965   40852   .   -   0   ID=BGDHLHFA_02833;Parent=BGDHLHFA_02833_gene;eC_number=3.2.1.37;Name=xynB_3;gene=xynB_3;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P36906;locus_tag=BGDHLHFA_02833;product=Beta-xylosidase;protein_id=gnl|Prokka|BGDHLHFA_02833

For reference, my dataset includes both draft and complete genomes.

The expected DNA and amino acid sequence outputs are shown below, respectively.

>BGDHLHFA_02833
tcagcgcgccgccggctccagcgtcagcagcaccacatcgttgctgcgcatcggcacgttgatctcgaccacgccatcggcgcccacacgcacacgccgatcctgttcgggcatcgtgcgtggcctgtcgcagctgcgtcaactggcgcggcgccaggtccttgggcatgcccatgtcgatgtacagcgacaacgggtcgttacgccgatagccggtcttgcgcacctgcagctggtacgtgccggcaggcacatgggtcatgcgcatgcgcagcggcgcgctgtcggtggcgggcacctgtttggtgtagaacggcgtattgctcaccgcctgcatgggctgctgccaattccacaccagtgcggcgacgcgcgtgccgtccactgcggcgagggaatgtgcgtcgctcagcggcacatcgcggcccttgagcgcatgcaagtacttgtaagcgaaccaggccggtttgcgaatgccttcgcgattcatcagcccaaacccgccgtggaagggcgtgggcggtgggccgggttcttcgaacagatcggtatagtccagtaactcatgccctgcaccaggccctgcacctgcttgagcttggtcaggatgtacggcgcgctgatgtaactgtcgtggacgaaatcgcgcggcgtatagctgctgctccactgggtgaagtacagcggcaggttgggaaatggcgaggcctggatctgcgcgcgcacgcgtcgcacatcgccgacgatggcatccagagatgcggacagcttggtgtcctgcttgccgttctcatcgagaaacccgccatccacgccataggtatgcgtggtgacgaagtcgatcggcagtttgtgcttggcaacgaaggccagcagttccggcac

>BGDHLHFA_02833
MPELLAFVAKHKLPIDFVTTHTYGVDGGFLDENGKQDTKLSASLDAIVGDVRRVRAQIQASPFPNLPLYFTQWSSSYTPRDFVHDSYISAPYILTKLKQVQGLVQGMSYWTYTDLFEEPGPPPTPFHGGFGLMNREGIRKPAWFAYKYLHALKGRDVPLSDAHSLAAVDGTRVAALVWNWQQPMQAVSNTPFYTKQVPATDSAPLRMRMTHVPAGTYQLQVRKTGYRRNDPLSLYIDMGMPKDLAPRQLTQLRQATHDAPEQDRRVRVGADGVVEINVPMRSNDVVLLTLEPAAR
genome perl python bash R • 2.4k views

Please post example files or lines so the issue is easier to understand, and do not post images of the data.

@cpad0112 I have revised my question. Please go through it.

I recently posted how to extract aa and nt sequences from gbk (C: How to extract all gene nucleotide sequences separately from multiple Genbank fi). What you need to do is read your list of locus_tags, loop over those tags, and extract only the matching sequences from the gbk.
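The loop described above could look roughly like the sketch below. It is not the script from the linked post; it is a minimal standard-library illustration that assumes a Prokka-style GenBank layout (feature keys indented by 5 spaces, qualifiers by 21) and handles only simple locations of the form `a..b` or `complement(a..b)`. The function name `extract_cds` and the demo record are hypothetical. A dedicated parser such as Biopython's `SeqIO` is likely more robust for real data.

```python
import re

# Maps each base to its complement; used for features on the minus strand.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")
# A CDS feature line followed by its indented qualifier lines.
CDS_BLOCK = re.compile(r"^ {5}CDS[ \t]+(\S+)\n((?: {21}\S.*\n)+)", re.M)
# Only simple locations are handled: a..b or complement(a..b).
SIMPLE_LOC = re.compile(r"(complement\()?(\d+)\.\.(\d+)\)?")


def extract_cds(gbk_text, wanted):
    """Return {locus_tag: (dna, protein)} for CDS features listed in `wanted`."""
    wanted = set(wanted)
    found = {}
    # A draft genome has one record per contig, each terminated by '//'.
    for record in re.split(r"^//[ \t]*$", gbk_text, flags=re.M):
        parts = re.split(r"^ORIGIN.*$", record, maxsplit=1, flags=re.M)
        if len(parts) < 2:
            continue  # no sequence block in this chunk
        # Strip position numbers and whitespace from the ORIGIN block.
        genome = re.sub(r"[^A-Za-z]", "", parts[1])
        for loc, quals in CDS_BLOCK.findall(parts[0]):
            tag = re.search(r'/locus_tag="([^"]+)"', quals)
            where = SIMPLE_LOC.fullmatch(loc)
            if not tag or tag.group(1) not in wanted or not where:
                continue  # unwanted tag, or a location this sketch cannot handle
            dna = genome[int(where.group(2)) - 1:int(where.group(3))]
            if where.group(1):
                # Reverse-complement so the DNA reads along the coding strand.
                dna = dna.translate(COMPLEMENT)[::-1]
            aa = re.search(r'/translation="([^"]+)"', quals)
            protein = re.sub(r"\s+", "", aa.group(1)) if aa else ""
            found[tag.group(1)] = (dna, protein)
    return found


# Made-up single-contig record, only to exercise the function.
demo_gbk = (
    "LOCUS       demo_contig\n"
    "FEATURES             Location/Qualifiers\n"
    + " " * 5 + "CDS" + " " * 13 + "complement(1..9)\n"
    + " " * 21 + '/locus_tag="DEMO_00001"\n'
    + " " * 21 + '/translation="MK"\n'
    + "ORIGIN\n"
    + "        1 ttactt catccc\n"
    + "//\n"
)

result = extract_cds(demo_gbk, ["DEMO_00001"])
for tag, (dna, protein) in result.items():
    print(f">{tag}\n{dna}\n>{tag}\n{protein}")
```

The DNA is returned on the coding strand, so `complement(...)` features are reverse-complemented; remove that step if the plus-strand slice is preferred. Features with `join(...)` or partial (`<`, `>`) locations are skipped here.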

3.8 years ago
Mensur Dlakic ★ 28k

Prokka also writes .ffn and .faa files, which contain the nucleotide sequences of the predicted genes and their protein translations, respectively. They should carry the same annotations as the .gbk files. In this case you don't need to parse anything: just extract the sequences of interest directly from these files.
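Pulling the sequences of interest out of those files can be sketched as follows, using only the Python standard library. It assumes that the first whitespace-delimited token of each FASTA header is the locus_tag, which is how Prokka output typically looks but is worth checking on your own files. The function names and the demo records (illustrative product names, shortened sequences) are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_by_locus_tag(lines, wanted):
    """Return {locus_tag: sequence} for records whose ID is in `wanted`.

    Assumption: the record ID (first token of the header) is the locus_tag.
    """
    wanted = set(wanted)
    picked = {}
    for header, seq in read_fasta(lines):
        parts = header.split(None, 1)
        if parts and parts[0] in wanted:
            picked[parts[0]] = seq
    return picked


# Tiny in-memory stand-in for a .faa file; the sequences are shortened
# placeholders, not real Prokka output.
demo_faa = """>BGDHLHFA_01427 illustrative product
MAKKT
LIV
>BGDHLHFA_02833 Beta-xylosidase
MPELLAFVAK
""".splitlines()

hits = select_by_locus_tag(demo_faa, ["BGDHLHFA_02833"])
for tag, seq in hits.items():
    print(f">{tag}\n{seq}")
```

With real data, an open file handle can be passed in place of `demo_faa`, and the same function works for the .ffn file to get the DNA sequences. The locus_tags themselves would come from the second column of the Roary-derived list.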

Thank you @Mensur Dlakic.
