Hi. This question is pertaining to parsing a genbank file. I would like to parse through the file using SeqIO and output specific feature information to a CSV file (one column for each piece of information). My example input looks like the following:
SBxxxxxx.LargeContigs.gbk
:
LOCUS scaffold_31 38809 bp DNA UNK 01-JAN-1980
DEFINITION scaffold_31.
ACCESSION scaffold_31
VERSION scaffold_31
KEYWORDS .
SOURCE .
ORGANISM .
COMMENT ##antiSMASH-Data-START##
Version :: 6.1.1
Run date :: 2022-09-21 11:09:55
##antiSMASH-Data-END##
FEATURES Location/Qualifiers
protocluster 26198..38809
/aStool="rule-based-clusters"
/category="terpene"
/contig_edge="True"
/core_location="[36197:37079](-)"
/cutoff="20000"
/detection_rule="(Terpene_synth or Terpene_synth_C or
phytoene_synt or Lycopene_cycl or terpene_cyclase or NapT7
or fung_ggpps or fung_ggpps2 or trichodiene_synth or TRI5)"
/neighbourhood="10000"
/product="terpene"
/protocluster_number="1"
/tool="antismash"
Now for the output file, I want to create a csv with 3 columns. one column will have the Scaffold information (i.e., scaffold_31
), the second column will have the category value in the protocluster
feature (i.e., /category = "terpene"
) and the third column will have the product value in the protocluster
feature (i.e., /product="terpene"
)
This is what I have so far for code. I know it is incomplete but any guidance much appreciated. I am completely new to parsing through gene bank files so have little knowledge in this domain. Thanks!
import Bio
from Bio import SeqIO
import os
input = "/Path to SBxxxxxx.LargeContigs.gbk"
output = open("output.csv", "w")
if not os.path.exists(output):
for record in SeqIO.parse(input, "genbank")
for feature in record.features:
if feature.type == "protocluster" and "category" and "product" in feature.qualifiers:
outfile = feature.qualifiers["category"][0] + "," + feature.qualifiers["product"][0] + "\n"
output.write(outfile)
Thank you, this is very helpful! I will go ahead and try implementing some of these changes. Now quick follow up questions. Will this print the scaffold/contig as well? I'm assuming that is what the record.id that you added is for, correct me if I'm wrong though. Also, could you give me a brief overview as to what exactly this addition is doing:
Yes, the name of the scaffold/contig will also be present in the CSV file. You are right – the scaffold/contig name is under
record.id
. As for the additional if statement, I added it to just handle potential cases of GenBank records where theproteocluster
feature is present, but it doesn’t havecategory
orproduct
qualifiers. If this is the case, thecategory
and/orproduct
variables will be set to’Unknown’
.