Slicing Genbank File by range
0
0
Entering edit mode
3.9 years ago
kamel ▴ 70

Dear colleagues, I have a genbank file annotated from a genome draft (contig_1 to contig_89). I'm interested in extracting only a region 20,500 to 21,500 from contig_2 and I would like to keep the gbk format as output. Can someone help me with this.

Thanks in advance !

sequence extract gbk python • 1.2k views
ADD COMMENT
1
Entering edit mode
ADD REPLY
1
Entering edit mode

Related to the above, the code that was derived from is here: https://github.com/jrjhealey/bioinfo-tools/blob/master/Genbank_slicer.py

The linked thread was an adaptation to slice between features, but if you want to use indices, its actually even easier.

ADD REPLY
0
Entering edit mode

Thanks @Joe for your feedback. I used this script on the genbank of the reference strain but got the error below. Please also find the command I used.

./Genbank_slicer.py -g sequence.gbk -s 45948 -e 92378

 Traceback (most recent call last):
  File "./Genbank_slicer.py", line 283, in <module>
    main()
  File "./Genbank_slicer.py", line 271, in main
    print(subRecord.format("genbank"))
  File "/home/anaconda3/lib/python3.7/site-packages/Bio/SeqRecord.py", line 719, in format
    return self.__format__(format)
  File "/home/anaconda3/lib/python3.7/site-packages/Bio/SeqRecord.py", line 757, in __format__
    SeqIO.write(self, handle, format_spec)
  File "/home/anaconda3/lib/python3.7/site-packages/Bio/SeqIO/__init__.py", line 530, in write
    count = writer_class(handle).write_file(sequences)
  File "/home/anaconda3/lib/python3.7/site-packages/Bio/SeqIO/Interfaces.py", line 244, in write_file
    count = self.write_records(records, maxcount)
  File "/home/anaconda3/lib/python3.7/site-packages/Bio/SeqIO/Interfaces.py", line 218, in write_records
    self.write_record(record)
  File "/home/anaconda3/lib/python3.7/site-packages/Bio/SeqIO/InsdcIO.py", line 981, in write_record
    self._write_the_first_line(record)
  File "/home/anaconda3/lib/python3.7/site-packages/Bio/SeqIO/InsdcIO.py", line 744, in _write_the_first_line
    raise ValueError("missing molecule_type in annotations")
ValueError: missing molecule_type in annotations
ADD REPLY
0
Entering edit mode

I solved this problem by installing biopython: 1.77

pip3 install biopython == 1.77

Do you have any idea how to specify the contig number on which I can extract the part that interests us. For example I want to target contig 2: from 20,500 to 21,500 ??

ADD REPLY
1
Entering edit mode

Is your input file a multi-genbank in that case?

If you know which of these records you want (2) that makes it a little easier, but still needs quite a lot of code refactoring. Is it a safe assumption that your contigs are listed in order within the file?

I don't have time right now to write the full code, so there is some pseudo-code below. If I get chance I'll try and work this up properly.

contig_wanted = 2    # pass this in from a dedicated argparse argument
# Iterate each record in order
for i, record in enumerate(SeqIO.parse("/path/to/genbanks.gb", "genbank")):
    #check index against requested (remember python starts at 0 so you need the i-1th index
    # alternatively you can just adjust enumerate to start from 1 instead of 0.
    if i == contig_wanted:
         # do slicing
   #write file etc
ADD REPLY

Login before adding your answer.

Traffic: 2247 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6