I created a file that takes a GFF3 file and a FASTA genome file and writes each exon from the GFF3 file into a new file and each gene group of exons into a new CDS file. The problem is that the exons write to a separate file but the CDS files have only a header and no CDS sequence.
I suspect that the problem begins at line
if Gene_Name != previous:
when I try to see whether the parent gene of the exon is the same as the previous exon. If it's not, then the current CDS file is written and closed before opening a new CDS file.
Any suggestions on fixing the if else clause beginning at
if Gene_Name != previous:
I think it's because I set the value previous inside the scope of the program.
How do I fix this paradox/casual loop in a program splitting a file into multiple files?
Input files:
ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.49_FB2013_01/fasta/dmel-all-chromosome-r5.49.fasta.gz This is the FASTA file
ftp://ftp.ensemblgenomes.org/pub/metazoa/release-32/gff3/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.32.chromosome.2L.gff3.gz This is the gff3 file.
I have a gff3 file and a FASTA genome file
The gff3 file is like this:
20 protein2genome exon 12005 12107 . - . ID=Blah_exon:1;Name=Blah:1;Parent=Blah
20 protein2genome exon 12108 12200 . - . ID=Blah_exon:2;Name=Blah:1;Parent=Blah
20 protein2genome exon 12005 12107 . - . ID=Blah_exon:3;Name=Blah:1;Parent=Blah
20 protein2genome exon 12342 12542 . - . ID=Blah_exon:4;Name=Blah:1;Parent=Blah
20 protein2genome exon 13005 13107 . - . ID=ABC_exon:1;Name=ABC:1;Parent=ABC
For each exon, I need to write a new file named gene_name_exon_number.fa
The header inside the file is this:
gene_name exon_number chromosome_name exon_start exon_end
followed by the nucleotide bases in its chromosome using the start and end numbers
So the first exon file would look like:
Blah 1 20 12005 12107 ACGGGCCTCAAAGCCGCGACACGACGGCTGTCGGCCGGTAACAGTAACCCCGGAGTGAACTCCTAT
Each exon will be a new file.
Also, I need to create a CDS file for each gene containing all the exons named gene_name_cds.fa
The header inside the file is this:
gene_name number_of_exons chromosome_name cds_start cds_end cds_total_length
So the first CDS file would look like:
Blah 4 20 12005 12542 ACGGGCCTCAAAGCCGCGACACGACGGCTGTCGGCCGGTAACAGTAACCCCGGAGTGAACTCCTAT ACGGGCCTCAAAGCCGCGACACGACGGCTGTCGGCCGGTAACAGTAACCCCGGAGTGAACTCCTAT
Note: the nucleotide bases are just all of the nucleotide bases from the exons pasted together.
In the end, there will be several exon files and one CDS file for EACH gene in a GFF3 file containing several hundred genes and thousands of exons.
For the exon, I need this:
gene_name exon_number chromosome_name exon_start exon_end
For the CDS, I need this:
gene_name number_of_exons chromosome_name cds_start cds_end cds_total_length
cds_total_length should be cds_end-cds_start
from collections import defaultdict
from Bio import SeqIO
Exon_file = "exon.gff3"
FASTA_file = "melanogaster.fasta"
handle = open(FASTA_file, "rU")
FASTA_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta"))
handle.close()
Exon_dict = {}
with open(Exon_file) as f:
for line in f:
Exon_dict[line.split()[-1].rsplit(";")[0].replace(":","")[3:]] = line.split()
previous = ""
CDS_sequence = ""
CDS_number_of_exons = int(0)
CDS_start = int(0)
CDS_end = int(0)
CDS_file = open (str("testfolder/" + "DUMMY" +"_CDS" + ".fas"),"w+")
for key, value in Exon_dict.iteritems():
for k,v in FASTA_dict.iteritems():
chromosome = value[0]
direction = value[6]
Gene_Name = value[-1].split()[0].rsplit(';')[0][3:].rsplit("_")[0]
if chromosome == k:
start = value[3]
end = value[4]
exon_file = open (str("testfolder/" + key + ".fas"),"w+")
exon_file.write(">" + key + " " + k + " " + start + " " + end + "\n")
if direction == "+":
exon_file.write(str(v.seq[int(int(start)-1):(int(end))]))
CDS_sequence += (str(v.seq[int(int(start)-1):(int(end))]))
elif direction == "-":
exon_file.write(str(v.seq[int(int(start)-1):(int(end))].reverse_complement()))
CDS_sequence += (str(v.seq[int(int(start)-1):(int(end))].reverse_complement()))
if Gene_Name != previous:
CDS_file.write(">" + str(Gene_Name) + " " + str(CDS_number_of_exons) + " " + str(chromosome) + " " + str(CDS_start) + " " + str(CDS_end) + " " + str(int(CDS_end)-int(CDS_start)) + "\n")
CDS_file.write(CDS_sequence)
CDS_file.close()
CDS_file = open (str("testfolder/" + Gene_Name +"_CDS" + ".fas"),"w+")
previous = Gene_Name
CDS_sequence = ""
CDS_number_of_exons = int(0)
CDS_start = int(0)
CDS_end = int(0)
else:
CDS_number_of_exons += 1
CDS_end += int(end)
if start > CDS_start:
pass
else:
CDS_start = start
BTW, the
for k,v in FASTA_dict.iteritems():
loop is completely unnecessary. You made a dictionary, use it as such.Thanks! += fixed the CDS_sequence. However, I'm still having trouble with the count for CDS_start, CDS_end, and CDS_number_of_exons. Any suggestions? Also, why do you say this?
It's completely unclear from your code exactly what
NMID
is getting set to, so put someprint("NMID {}, previous {}").format(NMID, previous))
in there.Regarding the loop, you already know what the chromosome is, so:
Also, delete all instances of the following:
Those are just wasting CPU.
Thanks! NMID is the name of a parent gene. I've updated the code to reflect some edits and deleted all instance of.
I'll look at the for loop you suggested. The exon loop isn't the problem since it prints all of the bases. It's these lines: