First time posting, I'll try keep it concise:
After running Antismash, a 'final.gbk' file is generated. Using the following python script I can generate a file listing each Biosynthetic Gene Cluster (BGC) type:
#!/usr/bin/env python
from Bio import SeqIO
# Set input file, output file to write to
gbk_file = "contig_1.final.gbk"
tsv_file = "test2_output.tsv" # can be .txt, .csv etc..
cluster_out = open(tsv_file, "w")
# Extract Cluster info, write to file
for seq_record in SeqIO.parse(gbk_file, "genbank"):
for seq_feat in seq_record.features:
if seq_feat.type == "cluster":
cluster_number = seq_feat.qualifiers["note"][0].replace(" ","_").replace(":","")
cluster_type = seq_feat.qualifiers["product"][0]
cluster_out.write("#"+cluster_number+"\tCluster Type:"+cluster_type+"\n")
The script gives the output1:
#Cluster_number_1 Cluster Type: other
#Cluster_number_2 Cluster Type: terpene
#Cluster_number_3 Cluster Type: cf_putative
#Cluster_number_4 Cluster Type: cf_putative
#Cluster_number_5 Cluster Type: t1pks-nrps
and so on.
Using Python, I would like to read in the above file, and count the number of occurrences for "other", "terpene" "t1pks-nrps" etc. Ideally the output file would be 2 columns like so:
BGC Class Count
other 1
terpene 1
cf_putative 2
t1pks-nrps 1
This file will then be fed into R to generate a basic histogram plot showing BGC quantities.
My initial thought is read in output1 as a panda dataframe, and to check if each row contains an element of a list, which contains the BGC Class names.
Any help for this python beginner would be greatly appreciated.
Update for Antismash v5:
(I have not used antismash since v3, there is more BGC info stored under protocluster
, proto_core
and cond_cluster
[i.e seq_feat.type
] that you may want to extract?)
The below code is based on the sample output file given by the website for Aspergillus fumigatus Af293
from Bio import SeqIO
genbank = "/home/barry/Downloads/AAHF00000000.1.gbk"
tsv_out = "/home/barry/test.tsv"
out_file = open(tsv_out, "w")
for seq_record in SeqIO.parse(genbank, "genbank"):
for seq_feat in seq_record.features:
if seq_feat.type == "source":
chromosome = seq_feat.qualifiers["chromosome"][0]
chromosome = str("Chromosome number: " + chromosome)
if seq_feat.type == "protocluster":
number = seq_feat.qualifiers["protocluster_number"][0]
cluster = seq_feat.qualifiers["product"][0]
number = str("Cluster number: " + number)
cluster = str("BGC type: " + cluster)
out_file.write(chromosome + "\t" + number + "\t" + cluster +"\n")
Gives:
Chromosome number: 1 Cluster number: 1 BGC type: T1PKS
Chromosome number: 1 Cluster number: 2 BGC type: NRPS
Chromosome number: 1 Cluster number: 3 BGC type: terpene
Chromosome number: 1 Cluster number: 4 BGC type: betalactone
Chromosome number: 1 Cluster number: 5 BGC type: NRPS
Chromosome number: 1 Cluster number: 6 BGC type: T1PKS
Chromosome number: 2 Cluster number: 1 BGC type: T1PKS
Chromosome number: 2 Cluster number: 2 BGC type: T1PKS
Chromosome number: 2 Cluster number: 3 BGC type: indole
Chromosome number: 3 Cluster number: 1 BGC type: T1PKS
Chromosome number: 3 Cluster number: 2 BGC type: T1PKS
Chromosome number: 3 Cluster number: 3 BGC type: NRPS-like
Cheers to Valentin Berrios FarÃas for reaching out on LinkedIn and pointing this out
Did this output provide the dataframe that you used in order to count the occurrence of the lists elements for each row in the dataframe, this being the BGC class? Thanks and cheers
Check the stackoverflow link below. If you are using antismash, and need help parsing the data, I would reccomend posting a new question and figure out how to tag Joe (https://www.biostars.org/u/20598/). He helped me out in the past using BioPython.
thanks :D I will check it out
Hello,
I want to count Biosynthetic Gene Cluster (BGC) using the above python command but I get the empty output file (no error during the run). Can you please help me how can I count the BGC type in my .gbk file?
Thanks,