Parsing a genbank file and outputting specific feature information to a csv using BioPython
1
0
Entering edit mode
2.2 years ago
Zach G • 0

Hi. This question is pertaining to parsing a genbank file. I would like to parse through the file using SeqIO and output specific feature information to a CSV file (one column for each piece of information). My example input looks like the following:

SBxxxxxx.LargeContigs.gbk:

LOCUS       scaffold_31            38809 bp    DNA              UNK 01-JAN-1980
DEFINITION  scaffold_31.
ACCESSION   scaffold_31
VERSION     scaffold_31
KEYWORDS    .
SOURCE      .
      ORGANISM  .
COMMENT     ##antiSMASH-Data-START##
            Version      :: 6.1.1
            Run date     :: 2022-09-21 11:09:55
            ##antiSMASH-Data-END##
FEATURES             Location/Qualifiers
            protocluster    26198..38809
                            /aStool="rule-based-clusters"
                            /category="terpene"
                            /contig_edge="True"
                            /core_location="[36197:37079](-)"
                            /cutoff="20000"
                            /detection_rule="(Terpene_synth or Terpene_synth_C or
                            phytoene_synt or Lycopene_cycl or terpene_cyclase or NapT7
                            or fung_ggpps or fung_ggpps2 or trichodiene_synth or TRI5)"
                            /neighbourhood="10000"
                            /product="terpene"
                            /protocluster_number="1"
                            /tool="antismash"

Now for the output file, I want to create a csv with 3 columns. one column will have the Scaffold information (i.e., scaffold_31), the second column will have the category value in the protocluster feature (i.e., /category = "terpene") and the third column will have the product value in the protocluster feature (i.e., /product="terpene")

This is what I have so far for code. I know it is incomplete but any guidance much appreciated. I am completely new to parsing through gene bank files so have little knowledge in this domain. Thanks!

import Bio
from Bio import SeqIO
import os

input = "/Path to SBxxxxxx.LargeContigs.gbk"
output = open("output.csv", "w")

if not os.path.exists(output):
     for record in SeqIO.parse(input, "genbank")
          for feature in record.features:
              if feature.type == "protocluster" and "category" and "product" in feature.qualifiers:
                  outfile = feature.qualifiers["category"][0] + "," + feature.qualifiers["product"][0] + "\n"
                  output.write(outfile)
Bio.SeqIO genbank biopython • 1.1k views
ADD COMMENT
2
Entering edit mode
2.2 years ago

Overall, you're on the right track with your code. I made some minor changes.

import os
from Bio import SeqIO

infile = 'my_scaffold.gb'

oh = open('output.csv', 'w')
for record in SeqIO.parse(infile, 'genbank'):
    for feature in record.features:
        if feature.type == "protocluster":
            category = 'Unknown'
            product = 'Unknown'
            if 'category' in feature.qualifiers:
                category = feature.qualifiers['category'][0]
            if 'product' in feature.qualifiers:
                product = feature.qualifiers['product'][0]
            line = f'{record.id},{category},{product}\n'
            oh.write(line)
oh.close()
ADD COMMENT
0
Entering edit mode

Thank you, this is very helpful! I will go ahead and try implementing some of these changes. Now quick follow up questions. Will this print the scaffold/contig as well? I'm assuming that is what the record.id that you added is for, correct me if I'm wrong though. Also, could you give me a brief overview as to what exactly this addition is doing:

if feature.type == "protocluster":
    category = 'Unknown'
    product = 'Unknown'
ADD REPLY
0
Entering edit mode

Yes, the name of the scaffold/contig will also be present in the CSV file. You are right – the scaffold/contig name is under record.id. As for the additional if statement, I added it to just handle potential cases of GenBank records where the proteocluster feature is present, but it doesn’t have category or product qualifiers. If this is the case, the category and/or product variables will be set to ’Unknown’.

ADD REPLY

Login before adding your answer.

Traffic: 1879 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6