Hello again,
based on the task you described in this post here is a workflow that could help.
The following python script takes the genebank file as input and creates these 3 files:
- a
fasta
file with the genome sequence
- a
bed
file with the coordinates of the genes
- a dummy
bed
file for the whole genome, which we use later for getting the intergenic sequences
Run the script like this:
$ python extract_files.py KC912694.gb
extract_files.py
looks like this:
import sys
import os
from Bio import SeqIO
def main(input):
file_name = os.path.basename(input)
outname = os.path.splitext(file_name)[0]
SeqIO.convert(input, "genbank", outname+".fasta", "fasta")
for record in SeqIO.parse(input, "genbank"):
record_name = record.annotations["accessions"][0]+"."+str(record.annotations["sequence_version"])
with open(record_name+".bed", "w") as bed, open("genome.bed", "w") as genome:
for feature in record.features:
if feature.type == "source":
start = feature.location.start.position
stop = feature.location.end.position
bed_line = "{0}\t{1}\t{2}\n".format(record_name, start, stop)
genome.write(bed_line)
if feature.type == 'gene':
start = feature.location.start.position
stop = feature.location.end.position
try:
name = feature.qualifiers['gene'][0]
except:
name = feature.qualifiers['locus_tag'][0]
bed_line = "{0}\t{1}\t{2}\t{3}\n".format(record_name, start, stop, name)
bed.write(bed_line)
if __name__ == '__main__':
main(sys.argv[1])
To get the sequence of the genes we can use bedtools getfasta:
$ bedtools getfasta -fi KC912694.fasta -bed KC912694.1.bed -fo genes.fa
To get the intergenic sequences we have to create a bed file containing the regions
$ bedtools subtract -a genome.bed -b KC912694.1.bed > intergenic.bed
Use bedtools getfasta
again to get intergenic sequences:
$ bedtools getfasta -fi KC912694.fasta -bed intergenic.bed -fo intergenic.fa
fin swimmer
I would run fast and far from that script, it’s a mess. Looking at the Github page, much of the indentation appears to be messed up, and it uses a lot of outdated python syntax. This task can probably be achieved in under 100 lines with BioPython.
Please just edit your question to describe the task you want to achieve other than “make this script run”, and someone may have an even better solution.
Example fasta would have helped addressing the issue @ raveendars
Hi Thank you kindly look at the article https://doi.org/10.1016/j.compbiolchem.2017.09.003
If there is no specific reason to use this script, then please explain what do you want to achieve at the end and what files you already have for that. Additionally, we want to know your inputs for the same i.e. what efforts did you put in?
Hi Thank you for your reply I would like to use the method mentioned in the following article https://doi.org/10.1016/j.compbiolchem.2017.09.003
In siligo analysis with chloroplast genome to differentiate Triticum species.
I would like to extract the intergenic region from the chloroplast genome for that I have downloaded the fasta and gb file from NCBI. So, I followed the instruction as the author mentioned but got the error while running with windows python 2.7