Slicing Genbank File by 'gene

Slicing Genbank File by 'gene_id' range

0

Entering edit mode

6.9 years ago

Barry Digby ★ 1.3k

Hi guys,

I have searched quite extensively for this, to no avail.

I have no problem extracting Gene_ids from a genbank file when given a list of gene_ids to extract:

for seq_record in SeqIO.parse(gbk_file, "genbank"):
    for feat in seq_record.features:
        if feat.type == "CDS":
            for gene in genes:
                if gene in feat.qualifiers['gene_id']:
                    seq = feat.qualifiers['translation']
                    print(gene)
                    print(seq)

Where "gene in genes" is a variable storing each desired Gene_id in a list.

However

I would like to provide a start gene, and an end gene as my range so I can extract a Gene Cluster in fasta format. Is this possible?

I will be looping through the start and end gene_id's in a for loop as follows:

for x, y, z in zip(start_list, end_list, cluster_list):

Once the gene_id range is captured (along with its Amino Acid sequence), I will output them to a unique file using:

with open("%s" % (z), "w") as outfile:
outfile.write(">%s\n%s" % (gene, sequence))

Any help in capturing a range of gene_ids would be greatly appreciated.

genbank python slicing • 5.2k views

ADD COMMENT • link updated 6.9 years ago by Joe 22k • written 6.9 years ago by Barry Digby ★ 1.3k

0

Entering edit mode

Would it be possible to use start and end coordinates range instead of gene IDs to extract genes within a particular slice?

ADD REPLY • link 6.9 years ago by Sej Modha 5.3k

0

Entering edit mode

yep thats my plan, use the gene_id to then capture the corresponding location :)

Check out my answer to jrj.healey

ADD REPLY • link 6.9 years ago by Barry Digby ★ 1.3k

5

Entering edit mode

6.9 years ago

Joe 22k

It shouldn't be too difficult to do. BioPython allows you to slice SeqRecords using standard python notation.

The important bits are that sequence features have an accessible location you can use, and then chopping out the bit you need is as simple as subrecord = record[start:end]. Biopython does extra magic, by keeping the subrecord in the same format as the original record, so when you come to write out the data, you still have all the features/translations etc. If you wanted to write out the genes only, you could just loop over the features in the subrecord and write them as fastas.

This essentially borrows the same idea that i used for a script some time ago: https://github.com/jrjhealey/bioinfo-tools/blob/master/Genbank_slicer.py

(except that that script is designed to use either known indices, or a complementary FASTA sequence to figure out what sequence span to subset)

Edit: Updated code, now handles multigenbanks, and the reverse strand logic should be correct.

Edit II: Code updated again. Realised the strand switching logic actually wasn't needed as biopython stores everything relative to the forward strand anyway. Added argparse, and also fixed an error in how multigenbanks are processed (it was slicing all records not just the record under iteration at that moment).

	#!/usr/bin/env python

	# Slice subregions out of a genbank between 2 genes.
	# Usage:
	# slice_genes.py infile.gbk gene1:gene2:strand

	from Bio import SeqIO
	import sys, argparse

	def get_args():
	"""Parse command line arguments"""

	try:
	parser = argparse.ArgumentParser(
	description='Subset genbanks between 2 genes.')
	parser.add_argument('-i', '--infile', action='store',
	help='Input genbank file to slice.')
	parser.add_argument('-r', '--range', action='store',
	help='The 2 gene LOCUS TAGS to slice between')
	if len(sys.argv) == 1:
	parser.print_help(sys.stderr)
	exit(1)
	except:
	sys.stderr.write("An exception occurred with argument parsing. Check your provided options.")

	return parser.parse_args()


	def main():
	args = get_args()
	genes = args.range.split(':')
	sys.stderr.write('Fetching subsequence betweeen {} and {}.\n'.format(genes[0], genes[1]))

	records = SeqIO.parse(args.infile, 'genbank')

	for record in records:
	sys.stderr.write('Operating on {}.\n'.format(record.id))
	loci = [feat for feat in record.features if feat.type == "CDS"]
	try:
	start = min([int(l.location.start) for l in loci if l.qualifiers['locus_tag'][0] in genes])
	end = max([int(l.location.end) for l in loci if l.qualifiers['locus_tag'][0] in genes])
	except ValueError:
	sys.stderr.write('No indices returned for those loci, assume they don\'t feature in this record, moving on...\n.')
	continue
	try:
	(start and end)
	subrecord = record[start:end]
	print subrecord.format('genbank')
	except NameError:
	sys.stderr.write('Didn\'t get any indices even though the genes seemed to match. Couldn\'t slice.\n')
	pass

	if __name__ == '__main__':
	main()

view raw slice_genes.py hosted with ❤ by GitHub

Invoke the code as per the usage comments:

$ python slice_genes.py -i infile.gbk -r GeneA:GeneB

I was using locus tags to test this, you will want to change the 4 occurrences of 'locus_tag' back to 'gene_id' (or whatever other qualifier you like

I'm fairly sure this is behaving correctly now. The script will print to screen though, so I'll leave writing a file to you. I've used sys.stderr specifically so you can pipe/redirect STDOUT if you'd prefer to do that instead of writing a file

ADD COMMENT • link 6.9 years ago by Joe 22k

0

Entering edit mode

import pandas as pd 
from Bio import SeqIO
from Bio import SeqFeature

df = pd.read_csv('27SEPTtest.txt', sep='\t', names=['Cluster', 'Start', 'Stop'])
start_gene = df["Start"].tolist()
end_gene = df["Stop"].tolist()
cluster_name = df["Cluster"].tolist()

gbk = "antiSMASH.gbk"
record = next(SeqIO.parse("antiSMASH.gbk", "genbank"))

for x, y, z in zip(start_gene, end_gene, cluster_name):
    for seq_record in SeqIO.parse(gbk, "genbank"):
        for feat in seq_record.features:
            if feat.type == "CDS":
                if x in feat.qualifiers["gene_id"]:

                    cluster_start = feat.location.start

                else:

                    if y in feat.qualifiers["gene_id"]:

                        cluster_end = feat.location.end

                        subrecord = record[cluster_start:cluster_end]

                        filename = "%s.gbk" % z

                        SeqIO.write(subrecord, filename, "gb")

Here is my attempt. I'd place myself at "just above noob level" and not used to using your style of for loops -- so I used this way instead.

I'm at a loss because the "cluster_start" and "cluster_end" strings correspond to the correct location in the genbank file. However when I call "record[cluster_start:cluster_end]" and write it to a file, it generates a genbank file where the captured genes are about 100 genes downstream of my actual target.

I feel like I am very close to solving this >:(

Output can be genbank or Fasta (protein seq) because its going to be used for multigeneblast.

Perhaps useful to note, I am parsing a Genbank file created from antiSMASH. Yes I know they have cluster'XX'.gbk files that can be downloaded, but I am trying to extract different Clusters (predicted by SMIPS/CASSIS).

ADD REPLY • link 6.9 years ago by Barry Digby ★ 1.3k

1

Entering edit mode

Yep I’m familiar with both tools. Give me a little more time and I’ll make something properly functional :)

Is your input file a multi-genbank?

ADD REPLY • link 6.9 years ago by Joe 22k

0

Entering edit mode

I believe it is a multigenbank file. It has each Contig in it seperated by //.

ADD REPLY • link 6.9 years ago by Barry Digby ★ 1.3k

1

Entering edit mode

Ok, I thought that might be the case, that does complicate things slightly but should still be doable.

(Alternatively you could merge the genbanks)

ADD REPLY • link 6.9 years ago by Joe 22k

0

Entering edit mode

Sorry I got confused by the terminology. I am parsing the "final.gbk" file generated by antismash. One file, that I'm guessing is merged together by antismash during analysis.

ADD REPLY • link 6.9 years ago by Barry Digby ★ 1.3k

1

Entering edit mode

I've updated my post with some properly functional code. I still didn't test it very thoroughly, but the slicing seems to work for both strands, and it handled a multigenbank of 2 sequences.

You might have to tweak it to read from your dataframe and/or write a file for each record (I haven't done this because you will have to append an identifier or iterator to the file name (or allow appending) else the loop will cause your file to be overwritten for each input sequence).

ADD REPLY • link 6.9 years ago by Joe 22k

1

Entering edit mode

Also just be aware, if you copy and paste the code, that there is a known formatting bug on the forum that stops some brackets rendering properly.

The line:

sys.stderr.write('Operating on {}.\n'.formatrecord.id))

Should have a ( between format and record.id.

ADD REPLY • link 6.9 years ago by Joe 22k

1

Entering edit mode

Thanks a million for your time. I'll give it a whirl when I get home. Get off the laptop and enjoy a pint 🤙🏻

ADD REPLY • link 6.9 years ago by Barry Digby ★ 1.3k

1

Entering edit mode

Comment content no longer relevant, main post updated.

ADD REPLY • link 6.9 years ago by Joe 22k

0

Entering edit mode

FYI, I've updated the code again as I realised the strand logic actually isnt necessary, and there actually was an error in how multigenbanks were processed still.

I think the code is now fully correct.

ADD REPLY • link 6.9 years ago by Joe 22k

Login before adding your answer.