I need to extract accessory gene sequences (both DNA and amino acid sequences) from Roary pangenome output. I have the locus_tag list and the corresponding gbk and gff files, which were generated with the Prokka pipeline. Is there a way, or an existing tool, to extract both the amino acid and DNA sequences from the gbk or gff files?
Samples of the Roary accessory gene locus_tag list and the corresponding strain gbk and gff files are shown below.
locus_tag list.csv
locus_tag/Pcissicola19
xynB_1 BGDHLHFA_02833
smpB BGDHLHFA_01427
Pcissicola19.gbk
gene complement(39965..40852)
/gene="xynB_3"
/locus_tag="BGDHLHFA_02833"
CDS complement(39965..40852)
/gene="xynB_3"
/locus_tag="BGDHLHFA_02833"
/EC_number="3.2.1.37"
/inference="ab initio prediction:Prodigal:002006"
/inference="similar to AA sequence:UniProtKB:P36906"
/codon_start=1
/transl_table=11
/product="Beta-xylosidase"
/protein_id="Prokka:BGDHLHFA_02833"
/translation="MPELLAFVAKHKLPIDFVTTHTYGVDGGFLDENGKQDTKLSASL
DAIVGDVRRVRAQIQASPFPNLPLYFTQWSSSYTPRDFVHDSYISAPYILTKLKQVQG
LVQGMSYWTYTDLFEEPGPPPTPFHGGFGLMNREGIRKPAWFAYKYLHALKGRDVPLS
DAHSLAAVDGTRVAALVWNWQQPMQAVSNTPFYTKQVPATDSAPLRMRMTHVPAGTYQ
LQVRKTGYRRNDPLSLYIDMGMPKDLAPRQLTQLRQATHDAPEQDRRVRVGADGVVEI
NVPMRSNDVVLLTLEPAAR"
Pcissicola19.gff
ID=BGDHLHFA_02833_gene;Name=xynB_3;gene=xynB_3;locus_tag=BGDHLHFA_02833
gnl|Prokka|BGDHLHFA_249 Prodigal:002006 CDS 39965 40852 . - 0 ID=BGDHLHFA_02833;Parent=BGDHLHFA_02833_gene;eC_number=3.2.1.37;Name=xynB_3;gene=xynB_3;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P36906;locus_tag=BGDHLHFA_02833;product=Beta-xylosidase;protein_id=gnl|Prokka|BGDHLHFA_02833
For reference, my dataset includes both draft and complete genomes.
The expected DNA and amino acid sequence outputs are given below, respectively:
>BGDHLHFA_02833
tcagcgcgccgccggctccagcgtcagcagcaccacatcgttgctgcgcatcggcacgttgatctcgaccacgccatcggcgcccacacgcacacgccgatcctgttcgggcatcgtgcgtggcctgtcgcagctgcgtcaactggcgcggcgccaggtccttgggcatgcccatgtcgatgtacagcgacaacgggtcgttacgccgatagccggtcttgcgcacctgcagctggtacgtgccggcaggcacatgggtcatgcgcatgcgcagcggcgcgctgtcggtggcgggcacctgtttggtgtagaacggcgtattgctcaccgcctgcatgggctgctgccaattccacaccagtgcggcgacgcgcgtgccgtccactgcggcgagggaatgtgcgtcgctcagcggcacatcgcggcccttgagcgcatgcaagtacttgtaagcgaaccaggccggtttgcgaatgccttcgcgattcatcagcccaaacccgccgtggaagggcgtgggcggtgggccgggttcttcgaacagatcggtatagtccagtaactcatgccctgcaccaggccctgcacctgcttgagcttggtcaggatgtacggcgcgctgatgtaactgtcgtggacgaaatcgcgcggcgtatagctgctgctccactgggtgaagtacagcggcaggttgggaaatggcgaggcctggatctgcgcgcgcacgcgtcgcacatcgccgacgatggcatccagagatgcggacagcttggtgtcctgcttgccgttctcatcgagaaacccgccatccacgccataggtatgcgtggtgacgaagtcgatcggcagtttgtgcttggcaacgaaggccagcagttccggcac
>BGDHLHFA_02833
MPELLAFVAKHKLPIDFVTTHTYGVDGGFLDENGKQDTKLSASLDAIVGDVRRVRAQIQASPFPNLPLYFTQWSSSYTPRDFVHDSYISAPYILTKLKQVQGLVQGMSYWTYTDLFEEPGPPPTPFHGGFGLMNREGIRKPAWFAYKYLHALKGRDVPLSDAHSLAAVDGTRVAALVWNWQQPMQAVSNTPFYTKQVPATDSAPLRMRMTHVPAGTYQLQVRKTGYRRNDPLSLYIDMGMPKDLAPRQLTQLRQATHDAPEQDRRVRVGADGVVEINVPMRSNDVVLLTLEPAAR
Please post example files/lines so the issue is easier to understand, and do not post images of the data.
@cpad0112 I have revised my question. Please go through it.
I recently posted here on how to extract aa and nt sequences from gbk files (C: How to extract all gene nucleotide sequences separately from multiple Genbank fi). What you need to do is read your locus_tag list, loop over those tags, and extract only the matching sequences from the gbk.
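The same loop-over-locus_tags idea can be sketched with the Python standard library alone. Note that this is a swapped-in variant, not the code from the linked post: it reads the Prokka gff rather than the gbk, assuming the gff is tab-separated and carries the contig sequences after a `##FASTA` line, as Prokka normally writes it. The function names (`read_locus_tags`, `read_prokka_gff`, `extract_genes`, `write_fasta`) are hypothetical, and the list file is assumed to have one header line followed by lines whose last field is the locus_tag.

```python
import re
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G). Translation table 11
# assigns the same amino acids; it differs in which codons may start a CDS.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDDDEEEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
ALT_STARTS = {"GTG", "TTG", "CTG", "ATT", "ATC", "ATA"}  # table 11 starts besides ATG
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(cds):
    """Translate a complete CDS (assumption: starts at a start codon)."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    protein = "".join(CODON_TABLE.get(c, "X") for c in codons)
    if protein.endswith("*"):
        protein = protein[:-1]          # drop the terminal stop
    if codons and codons[0] in ALT_STARTS:
        protein = "M" + protein[1:]     # alternative starts are read as Met
    return protein


def read_locus_tags(path):
    """Assumed layout: one header line, then lines ending in a locus_tag."""
    with open(path) as handle:
        lines = [ln.strip() for ln in handle if ln.strip()]
    return {re.split(r"[,\s]+", ln)[-1] for ln in lines[1:]}


def read_prokka_gff(path):
    """Return (CDS features, contig sequences) from a gff with a ##FASTA part."""
    features, contigs = [], {}
    in_fasta, name = False, None
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith("##FASTA"):
                in_fasta = True
            elif in_fasta:
                if line.startswith(">"):
                    name = line[1:].split()[0]
                    contigs[name] = []
                elif name is not None:
                    contigs[name].append(line.strip())
            elif line and not line.startswith("#"):
                cols = line.split("\t")
                if len(cols) == 9 and cols[2] == "CDS":
                    attrs = dict(kv.split("=", 1)
                                 for kv in cols[8].split(";") if "=" in kv)
                    features.append((attrs.get("locus_tag"), cols[0],
                                     int(cols[3]), int(cols[4]), cols[6]))
    return features, {k: "".join(v) for k, v in contigs.items()}


def extract_genes(gff_path, wanted):
    """Map each wanted locus_tag to (nucleotide sequence, protein sequence)."""
    features, contigs = read_prokka_gff(gff_path)
    records = {}
    for tag, contig, start, end, strand in features:
        if tag in wanted:
            nt = contigs[contig][start - 1:end]   # gff is 1-based, inclusive
            if strand == "-":
                nt = reverse_complement(nt)       # report the coding strand
            records[tag] = (nt, translate(nt))
    return records


def write_fasta(records, nt_path, aa_path):
    with open(nt_path, "w") as nt_out, open(aa_path, "w") as aa_out:
        for tag, (nt, aa) in records.items():
            nt_out.write(f">{tag}\n{nt}\n")
            aa_out.write(f">{tag}\n{aa}\n")
```

A call along the lines of `write_fasta(extract_genes("Pcissicola19.gff", read_locus_tags("locus_tag list.csv")), "accessory.ffn", "accessory.faa")` would then write one nucleotide and one protein FASTA file (the output names are placeholders). Genes on the minus strand are reverse-complemented so that each sequence reads in the coding direction; the DNA shown in the question for BGDHLHFA_02833 appears to be the forward-strand slice instead, so drop that step if that is the form you want. The protein can also be read directly from the /translation qualifier in the gbk rather than re-translated.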