Question

How to extract nucleotide sequence of select genes from a text file with Genbank-like format

2

Entering edit mode

5.6 years ago

mac03pat ▴ 30

I uploaded a single FASTA file with multiple gene clusters from different organisms to an online program called antiSMASH, or fungiSMASH in my case. Most clusters have a ketosynthase (KS) gene in them. antiSMASH identified the putative KS genes and provided me with an output in a text file. Can anyone assist me in extracting (parsing may be the correct term?) the genes of interest (nucleotide sequences) with associated accession number and definition, or just the taxon name from the definition?

I believe all of the genes of interest will say:

/aSDomain="PKS_KS"

And if this is present then I will want the range of nucleotides indicated adjacent to the heading aSDomain. For example:

aSDomain        2527..3816

However, sometimes the adjacent numbers will says something like:

aSDomain        join(1610..1702,1756..2109,2165..3010)

In which case I believe I would want to concatenate each range indicated.

Or:

aSDomain        complement(join(10640..10648,10717..11439))

In which case I believe I would want to concatenate all ranges and then take the complementary sequence.

I would like to do an alignment and then make a phylogenetic tree based on the extracted KS genes. I believe FASTA format would be a good output to have my KS genes in, but I can convert if necessary.

This is my first time using antiSMASH and I'm new to coding so I apologies for any obvious blunders and I would have preferred to attach a file of my output data but I didn't see that as an option! Thanks in advance for any help!

Here's a link to my output:

https://drive.google.com/open?id=1KWbh3D7jY7u5AytlGCLxC65MR_KKUY7X

If someone has a better way of attaching a large text file (~1.7 million characters), I'm all ears.

Here's a link all of the output that antiSMASH generated (not just text file of all annotated genes):

https://drive.google.com/open?id=1KWbh3D7jY7u5AytlGCLxC65MR_KKUY7X

genbank antismash parse biopython nucleotide • 2.5k views

ADD COMMENT • link updated 4.0 years ago by Biostar 20 • written 5.6 years ago by mac03pat ▴ 30

score 2 · Accepted Answer · 2019-05-23

2

Entering edit mode

5.6 years ago

AK ★ 2.2k

Hi mac03pat,

This should work:

#!/usr/bin/env python

from Bio import SeqIO
from Bio.SeqFeature import SeqFeature

gbk = "antiSMASH_processed_geneclusters.txt"
fa = "antiSMASH_processed_geneclusters.fa"
input_handle = open(gbk, "r")
output_handle = open(fa, "w")

for record in SeqIO.parse(input_handle, "genbank"):
    features = [feature for feature in record.features if feature.type == "aSDomain"]
    for feature in features:
        if feature.qualifiers["aSDomain"][0] == "PKS_KS":
            output_handle.write(">%s %s\n%s\n" % (
                record.id,
                feature.location,
                SeqFeature(feature.location).extract(record.seq)))

output_handle.close()
input_handle.close()

ADD COMMENT • link 5.6 years ago by AK ★ 2.2k

0

Entering edit mode

Thank you for your answer SMK! This code worked very well. There's one more piece of information I'd like included, the genus and species names included under "DEFINITION". For example, from this line I'd like just the words Cladonia grayi so that later on my final phylogenetic tree can include the taxa.

DEFINITION  GU930713.1 Cladonia grayi metabolic gene cluster, complete sequence.

ADD REPLY • link 5.6 years ago by mac03pat ▴ 30

1

Entering edit mode

You can change the codes to:

from Bio import SeqIO
from Bio.SeqFeature import SeqFeature

gbk = "antiSMASH_processed_geneclusters.txt"
fa = "antiSMASH_processed_geneclusters.fa"
input_handle = open(gbk, "r")
output_handle = open(fa, "w")

for record in SeqIO.parse(input_handle, "genbank"):
    features = [feature for feature in record.features if feature.type == "aSDomain"]
    for feature in features:
        if feature.qualifiers["aSDomain"][0] == "PKS_KS":
            output_handle.write(">%s %s\n%s\n" % (
                record.description,
                feature.location,
                SeqFeature(feature.location).extract(record.seq)))

output_handle.close()
input_handle.close()

Which gives me:

$ grep '^>' antiSMASH_processed_geneclusters.fa | head
>DQ660910.3 Xanthoria elegans strain gbM10 polyketide synthase type I (PKSI) mRNA, complete cds [2526:3816](+)
>HM180409.1 Peltigera membranacea non-reducing type I polyketide synthase (PKS6) gene, complete cds join{[1258:1268](+), [1286:1342](+), [1402:1756](+), [1812:2658](+)}
>HQ823619.1 Cladonia grayi isolate PKS13 putative polyketide synthase gene, complete cds [1362:2646](+)
>HQ823620.1 Cladonia grayi isolate PKS14 putative polyketide synthase gene, partial cds [1509:2805](+)
>HQ823621.1 Cladonia grayi isolate PKS15 putative polyketide synthase gene, complete cds join{[1609:1702](+), [1755:2109](+), [2164:3010](+)}
>JN408682.1 Usnea longissima strain CH050148 non-reducing polyketide synthase (PKS1) mRNA, complete cds [1161:2454](+)
>JQ340775.1 Cladonia macilenta voucher Hur CH050136 type I polyketide synthase (PKS1) mRNA, complete cds [1374:2448](+)
>JX067626.1 Hypogymnia physodes non-reducing polyketide synthase 1 (pks1) gene, complete cds [501:1743](+)
>JX232188.1 Usnea longissima type I polyketide synthase PKS6 (PKS6) mRNA, complete cds [864:2178](+)
>KJ501919.1 Xanthoparmelia substrigosa strain XsmPKSI polyketide synthase type I (PKSI) mRNA, complete cds [1161:2454](+)

ADD REPLY • link 5.6 years ago by AK ★ 2.2k

0

Entering edit mode

This worked great, thank you again SMK.

One more question if you don't mind. The previous results were generated by the online version of antiSMASH. I am now running a package version of antiSMASH from bash commandline and the output is different. I tried to change your code to process this slightly different output but I was unable to find a good solution. Would you mind helping me to make these changes?

Here's the output that I am now working with: https://drive.google.com/open?id=1FT5r6QPAutEcDCgEvgu9O141H3w6L_5W

ADD REPLY • link 5.6 years ago by mac03pat ▴ 30

0

Entering edit mode

You're welcome. Changing from if feature.qualifiers["aSDomain"][0] == "PKS_KS": to if feature.qualifiers["domain"][0] == "PKS_KS": should work again.

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted. Upvote|Bookmark|Accept

ADD REPLY • link 5.6 years ago by AK ★ 2.2k