Hello everybody!!
I have a serious problem with my script on python.
I have to count the appearances of each dinucleotide for each cDNA, Write it into table.
Then Calculate the ratio between some specific dinucleotides and write this data into another table.
I wrote Python script that does all these.
But the problem is that on big files (500 MB) it takes hours for the algorithm to run...
Can anybody help me improve the run time - by changing the script or translate it to another language like Perl or bash.
Unfortunately I am not close with these languages...
And I understood that python is quite slow on this stuff...
Here is my script:
"""This program gets an organism's cDNA fasta file
and generates a table for the frequency of each dinucleotide for all the cDNAs
and a table for the ratio between each dinucleotides - sense/antisense."""
from Bio import SeqIO
import csv
#Initializes a list of all the posibile dinucleotides
dinucleotides = ['AA','AT','AC','AG',
'TT','TA','TC','TG',
'CC','CA','CT','CG',
'GG','GA','GT','GC']
ratio = ['AA/TT','AC/GT','AG/CT',
'TT/AA','TC/GA','TG/CA',
'CC/GG','CA/TG','CT/AG',
'GG/CC','GA/TC','GT/AC']
def Counting(seq):
"""This function gets a cDNA and returns the frequency of each dinucleotide"""
counting = {} #A dictionary where the dinucleotides are the keys and the counts are the values.
#Scans all the dinucleotides and checks how many time they appear in the sequence.
for dinuc in dinucleotides:
for i in range (len(seq)-2):
pos = i+1,i+2
if(seq[i:i+2] == dinuc):
if(counting.get(dinuc,None)!= None): #If the dictionary at the specific dinucleotide is not empty.
counting[dinuc] += 1 #Adds the number of times that the dinucleotide appears.
else:
counting[dinuc] = 1 #initializes the dinucleotide with 1 instance.
elif (seq[i:i+2] != dinuc and counting.get(dinuc,None)== None): #If the dinucleotide doesn't appear at all.
counting[dinuc] = 0 #initializes the dinucleotide with 0 instances.
return counting
def Ratio(c):
"""This function calculates the ratio between each dinucleotide on the sense and antisense."""
di = dict() #A dictionary for the ratio.
for ra in ratio:
r = ra.split('/') #Splits the header for 2 dinucleotides.(r is a list)
first = r[0] #The first dinucleotide.
second = r[1] #The second dinucleotide.
if(c[second] == 0): #In case of division at 0.
di[ra] = None
else:
di[ra] = ((float)(c[first])/c[second]) #The ratio beetween the first dinucleotide and the second.
return di
def cDNA(Id,seq):
"""This function gets a sequence and it's id and writes into csv file it's details"""
# A dictionary with all the details: id, length, counting and ratio.
data = {'Id': Id , 'length': len(seq) , 'dinucleotides' : Counting(seq),'ratio':Ratio(Counting(seq))}
# build row
list1 = list() #A list for all the data of table1.
list2 = list() #A list for all the data of table2.
list1.append(data['Id']) #The sequence id.
list1.append(data['length']) #The length of the sequence.
#Scans all the "counting" dictionary
for dinuc in dinucleotides:
list1.append(data['dinucleotides'][dinuc]) #The counting of all the dinucleotides in the sequence.
list2.append(data['Id']) #The sequence's id.
for r in ratio:
list2.append(data['ratio'][r]) #The ratio between each dinucleotides in the sequence.
# write row to CSV handle
writer1.writerow(list1)
writer2.writerow(list2)
def main(path_to_file):
with open(path_to_file, mode='r') as handle:
# Use Biopython's parse function to process individual FASTA records.
for record in SeqIO.parse(handle, 'fasta'):
# Extract individual parts of the FASTA record
identifier = record.id #The sequence's Id.
sequence = record.seq #The sequence itself.
sequence = sequence.upper() #Turns all the nucleotides to upper case.
cDNA(identifier,sequence)
handle.close()
table1.close()
table2.close()
print "well done\n"
#Generates a csv file for all the information
path_to_file = raw_input()
name = path_to_file[:path_to_file.find('.')] #The organism's name.
table1 = (open(name + ".csv", "wb+")) #A file for the counting table.
writer1 = csv.writer(table1)
table2 = (open(name + "-ratio.csv", "wb+")) #A file for the ratio table.
writer2 = csv.writer(table2)
header_keys1 = ["Id", "length" ] + dinucleotides #The header of table 1.
header_keys2 = ["Id"] + ratio #The header of table 2
writer1.writerow(header_keys1) #Writes the header into file.
writer2.writerow(header_keys2) #Writes the header into file.
main(path_to_file)
For example (just a tiny file)
>CCE57618 cdna:annotated plasmid:HUSEC2011CHR1:pHUSEC2011-2:166:1143:1 gene:HUS2011_pII0001 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:repA description:replication initiation protein RepFIB
GTGGATAAGTCGTCCGGTGAGCTGGTGACACTGACACCAAACAATAACAACACCGTACAA
CCTGTGGCGCTGATGCGTCTGGGCGTCTTTGTACCGACCCTTAAGTCACTGAAGAACAGT
AAAAAAAATACACTGTCACGCACTGATGCCACGGAAGAACTGACGCGTCTTTCCCTGGCC
CGTGCAGAAGGATTCGATAAGGTTGAGATCACCGGTCCCCGGCTGGATATGGATAACGAT
TTCAAGACCTGGGTGGGGATCATTCATTCCTTTGCCCGCCATAACGTGACTGGTGACAAA
GTTGAACTGCCTTTTGTCGAGTTTGCAAAACTGTGTGGTATACCTTCAAGCCAGTCATCA
CGCAGGCTGCGTGAGCGCATCAGCCCTTCCCTGAAACGCATTGCCGGTACCGTGATCTCC
TTTTCCCGCACCGATGAGAAGCACACCCGGGAATATATCACCCATCTGGTACAGTCAGCC
TACTACGATACTGAACGGGATATTGTTCAGTTACAGGCTGATCCCCGCCTGTTTGAACTG
TACCAGTTTGACAGAAAGGTCCTTCTTCAGCTTAAGGCGATTAATGCCCTGAAGCGACGG
GAGTCCGCCCAGGCACTTTACACCTTTATAGAGAGCCTGCCCCGGGATCCGGCACCGATA
TCGCTGGCGCGGCTACGTGCCCGCCTCAATCTGAAGTCTCCGGTATTTTCCCAGAACCAG
ACGGTCAGACGGGCAATGGAGCAGTTACGTGAGATTGGATATCTTGATTACACGGAGATC
CAGCGGGGGCGAACAAAACTCTTCTGTATTCACTACCGACGTCCCCGGTTAAAAGCGCCG
AATGATGAGAGTAAGGAAAATCCGTTGCCACCTTCACCTGTGGAAAAAGTCAGTCCGGAG
ATGGCGGAGAAACTTGCCCTGCTTGAAAAACTGGGCATCACGCTGGATGACCTGGAAAAA
CTCTTCAAATCCCGCTGA
>CCE57619 cdna:annotated plasmid:HUSEC2011CHR1:pHUSEC2011-2:1219:1338:-1 gene:HUS2011_pII0002 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein
ATGTTTTATGAAGGGAGCAATGCCTCAGCATCAGGTTACGGGGTGACTCACGTAAGGGAC
AGGCAGATGGCAGCTCAGCCACAGGCAGCACTGCAGGAAACTGAATATAAACTGCAGTGA
>CCE57620 cdna:annotated plasmid:HUSEC2011CHR1:pHUSEC2011-2:1422:2162:-1 gene:HUS2011_pII0003 gene_biotype:protein_coding transcript_biotype:protein_coding description:site-specific recombinase
ATGAACAATGTCATTCCCCTGCAGAATTCACCAGAACGCGTCTCCCTGTTACCCATTGCG
CCGGGGGTGGATTTTGCAACAGCGCTCTCCCTGAGAAGAATGGCCACTTCCACGGGGGCC
ACACCGGCCTACCTGCTGGCCCCGGAAGTGAGTGCCCTTCTTTTCTATATGCCGGATCAG
CGTCACCATATGCTGTTCGCCACCCTCTGGAATACCGGAATGCGTATTGGCGAAGCCCGG
ATGCTGACACCGGAATCATTTGACCTGGATGGAGTAAGACCGTTTGTGCGGATCCAGTCC
GAAAAAGTGCGTGCGCGACGCGGACGCCCGCCAAAAGATGAAGTGCGCCTGGTTCCGCTG
ACAGATATAAGCTATGTCAGGCAGATGGAAAGCTGGATGATCACCACCCGGCCCCGTCGT
CGTGAACCATTATGGGCCGTGACCGACGAAACCATGCGCAACTGGCTGAAGCAGGCTGTC
AGACGGGCCGAAGCTGACGGGGTACACTTTTCGATTCCGGTAACACCACACACTTTCCGG
CACAGCTATATCATGCACATGCTCTATCACCGCCAGCCCCGGAAAGTCATCCAGGCACTG
GCTGGTCACAGGGATCCACGTTCGATGGAGGTCTATACACGGGTGTTTGCGCTGGATATG
GCTGCCACGCTGGCAGTGCCTTTCACAGGTGACGGACGGGATGCTGCAGAGATCCTGCGT
ACACTGCCTCCCCTGAAGTAA
My output is these 2 csv file:
counting.csv
Id,length,AA,AT,AC,AG,TT,TA,TC,TG,CC,CA,CT,CG,GG,GA,GT,GC
CCE57618,978,74,52,70,52,55,38,55,73,80,61,61,60,60,75,53,57
CCE57619,120,8,7,8,13,4,5,4,9,2,15,6,2,12,7,5,11
CCE57620,741,34,44,50,35,31,22,38,62,68,55,43,52,56,52,35,62
ratio.csv
Id,AA/TT,AC/GT,AG/CT,TT/AA,TC/GA,TG/CA,CC/GG,CA/TG,CT/AG,GG/CC,GA/TC,GT/AC
CCE57618,1.34545454545,1.32075471698,0.852459016393,0.743243243243,0.733333333333,1.19672131148,1.33333333333,0.835616438356,1.17307692308,0.75,1.36363636364,0.757142857143
CCE57619,2.0,1.6,2.16666666667,0.5,0.571428571429,0.6,0.166666666667,1.66666666667,0.461538461538,6.0,1.75,0.625
CCE57620,1.09677419355,1.42857142857,0.813953488372,0.911764705882,0.730769230769,1.12727272727,1.21428571429,0.887096774194,1.22857142857,0.823529411765,1.36842105263,0.7
Sorry for the mess, if anyone can explain how to upload an Excel file it will be very helpful....
Thanks!!!
How are you executing this? Where are you specifying the input file?
On this line: main(path_to_file) And then the main function calls the other functions.
Working with a interpreted language like python will always make your code slower. It seems that your problem could be easily paralellized. Give a look at the multiprocessing package or to GNU Parallel.
Is this package available on python?
import multithreading
is intended here. "GNU Parallel" will be of no help to you, ignore that. But really, get the core script to run as quickly as possible before making things more complicated with threading.O.K. I will.
any tips for that?
Or maybe I'll try to write it in another language?
Unless you are writing software which is for production or for using multiple times, python is perfectly fine. With some parallelization, you can easily improve on runtime. The task you are writing for is not too hard.
How many cores does your machine have available?
Perl or bash is not going to help (or not a lot). If people talk about faster languages it's about compiled languages such as java or C.