NOTE: I already tried using bedtools and bcbio-gff but they were not useful
I have a gff3 file and a FASTA genome file
The gff3 file is like this:
20 protein2genome exon 12005 12107 . - . ID=Blah_exon:1;Name=Blah:1;Parent=Blah
20 protein2genome exon 12108 12200 . - . ID=Blah_exon:2;Name=Blah:1;Parent=Blah
20 protein2genome exon 12005 12107 . - . ID=Blah_exon:3;Name=Blah:1;Parent=Blah
20 protein2genome exon 12342 12542 . - . ID=Blah_exon:4;Name=Blah:1;Parent=Blah
20 protein2genome exon 13005 13107 . - . ID=ABC_exon:1;Name=ABC:1;Parent=ABC
For each exon, I need to write a new file named gene_name_exon_number.fa
The header inside the file is this:
gene_name exon_number chromosome_name exon_start exon_end
followed by the nucleotide bases in its chromosome using the start and end numbers
So the first exon file would look like:
Blah 1 20 12005 12107 ACGGGCCTCAAAGCCGCGACACGACGGCTGTCGGCCGGTAACAGTAACCCCGGAGTGAACTCCTAT
Each exon will be a new file.
Also, I need to create a CDS file for each gene containing all the exons named gene_name_cds.fa
The header inside the file is this:
gene_name number_of_exons chromosome_name cds_start cds_end cds_total_length
So the first CDS file would look like:
Blah 4 20 12005 12542 ACGGGCCTCAAAGCCGCGACACGACGGCTGTCGGCCGGTAACAGTAACCCCGGAGTGAACTCCTAT ACGGGCCTCAAAGCCGCGACACGACGGCTGTCGGCCGGTAACAGTAACCCCGGAGTGAACTCCTAT
Note: the nucleotide bases are just all of the nucleotide bases from the exons pasted together.
In the end, there will be several exon files and one CDS file for EACH gene in a GFF3 file containing several hundred genes and thousands of exons.
However, the file is generating CDS files in the gigabtye range when the entire genome file is only 130mb. So the problem must be for loop order. Can someone help me fix the loop order?
I tried to use several for loops and prepender. My code looks like this so far:
import fileinput
input_file = "dummy.gf3"
search_file = "dummy.fa"
def line_pre_adder(filename, line_to_prepend):
f = fileinput.input(filename, inplace=1)
for xline in f:
if f.isfirstline():
print line_to_prepend.rstrip('\r\n') + '\n' + xline,
else:
print xline,
def exon_extractor():
exon_file = open(input_file)
for exon_line in exon_file:
previous = ""
CDS_exon_number = 1
CDS_start = int(0)
CDS_end = int(0)
NMID = exon_line.split()[-1].split()[0].rsplit(';')[0][3:].rsplit("_")[0]
exon_number = str(exon_line.split()[-1].split()[0].rsplit(';')[0][-1])
filename = str(exon_line.split()[-1].split()[0].rsplit(';')[0][3:-2]) + "_" + str(exon_line.split()[-1].split()[0].rsplit(';')[0][-1])
chromosome = exon_line.split()[0]
start = exon_line.split()[3]
end = str(int(exon_line.split()[4])+1)
exon_header = ">" + NMID + " " + exon_number + " " + start + " " + str(int(end)-1)
exon_file = open (str("exon_folder" + filename + ".fas"),"w+")
cds_file = open (str("test_folder" + NMID +"_CDS" + ".fas"),"a+")
exon_file.write(exon_header + '\n')
desired = int(end) - int(start)
with open (search_file,'r') as genome_file:
for genome_line in genome_file:
if str(chromosome + " ") in genome_line:
for genome_line in genome_file:
for i in range (int(desired)/60):
exon_file.write(genome_line.rstrip())
if NMID != previous:
cds_file.close()
line_pre_adder(str("test_folder" + NMID +"_CDS" + ".fas"), str(NMID + " " + str(CDS_exon_number) + " " + str(CDS_start) + " " + str(CDS_end) + " " + str(CDS_end-CDS_start)))
cds_file = open (str("test_folder" + NMID +"_CDS" + ".fas"),"a+")
cds_file.write(genome_line.rstrip())
previous = NMID
CDS_exon_number = exon_number
CDS_start += int(start)
CDS_end += (int(end)-1)
else:
cds_file.write(genome_line.rstrip())
CDS_exon_number = exon_number
CDS_start += int(start)
CDS_end += (int(end)-1)
if (int(desired) % 60) != 0:
exon_file.write(genome_line[0:(int(desired) % 60)+1].rstrip())
cds_file.write(genome_line[0:(int(desired) % 60)+1].rstrip())
break
else:
pass
else:
pass
Not useful in what way? At first glance
bedtools getfasta
should do what you need. You can pull out (and split) sequences you need after you get all the intervals out first (many threads about this request on Biostars).The problem is that I got only one file like this which only contains the exons. I still need another file which contains the CDS for each gene and the headers were wrong for the file. Also, I couldn't split the FASTA file using the "split" command in bedtools.
If I could get two files, one with the exons and one with the CDS, and each file had the correct headers, then bedtools would be useful.
The exon file looked like this and the headers did not include the gene name so they weren't in the format: gene_name exon_number chromosome_name exon_start exon_end
Just some remarks, no complete solution:
pass
, you can remove that. Just an if clause works fineI heard this code has such an interesting backstory they are thinking about making a cinematic adaptation.