Entering edit mode
4.5 years ago
Caroline
•
0
Hi everyone! I'm a freshman in Python. I was trying to extract the alternative splicing sites that overlapped with SNP sites. I wrote a code to do this, and it works, but too slow and costs much memory. I post it here and wonder anyone could give me some advices to improve it. Thanks.
import re
path = "D:/work_dir/"
fin1 = open(path + "AS_event_list.txt",'r')
fin2 = open(path + "SNP_site.txt",'r')
fot = open(path + "SNP_AS.txt",'w')
dic1={}
with fin1 as file:
for line in file:
#e.g.,G19254_scaffold_5_-_4353064_4354635_4353064_4354330_4354479_4354635
_line = line.strip().split('_')
chr1 = '_'.join(_line[1:3])
_line1 = _line[0] + '\t' + chr1 + '\t'+ '\t'.join(_line[3:10])
dic1[_line1] = chr1
dic2={}
with fin2 as file2:
for line2 in file2:
#e.g., scaffold_1 1876 . G C . . EFF=intergenic_region (seperated by tab)
_line2 = line2.strip().split('\t')
chr2 = _line2[0]
pos = _line2
dic2[line2] = chr2
for key in dic2:
newpos = key.split('\t')
_chr0 = newpos[0]
pos2 = newpos[1]
for k in dic1:
newpos3 = k.split('\t')
chr0 = newpos3[1]
poss = '\t'.join(newpos3[3:9])
pos0 = poss.split('\t')
for _pos in pos0:
if(str(pos2) == str(_pos) and chr0 == _chr0):
fot.write( key + '\t' + k)
else:
continue
fin1.close()
fin2.close()
fot.close()
print("Done.")
It is unfortunate to only show a block of code. Please illustrate the problem by showing input and expected outputs. The answer is probably anyway "use
bedtools intersect
".^This exactly. Don't re-invent the wheel. Writing code to solve problems that have already been solved efficiently is not worth the effort. No amount of optimization a beginner could do will yield a better result than an expert's program written in an optimized language.
Ok, while I agree that you should probably use
bedtools intersect
, I am always happy to see that people like to learn how things work. So, here's what comes to mind in your script:You're storing everything from file2 in a dictionary, which you only iterate over once. You could easily merge blocks 2 and 3, reading each line from file2 (your SNPs) and then do your comparisons against the alternative splicing data in file1. Looking at your code fragment, there is no need to actually store the SNP data. (-> this will already save a huge amount of memory, depending how many SNPs you got)
In your comparison of AS vs SNPs (block 3), you're redoing the extraction of chr0 and pos0 over and over again, resulting in extra runtime. It would make more sense to preprocess that and store the data differently (e.g. in a list of tuples (chr, pos, full_line)).
In the last for loop you're comparing each position before checking the seq (scaffold/contig/chromosome). It would make more sense to compare the seq first, and then only iterate over the positions if the seqs match. In addition, you're using str() casts, which should not be necessary, since you're never casting to int() before. (also adds up in terms of runtime)
If I read this correctly, you're storing all data in dictionaries (with using the whole data line as key, which is strange), but never actually use the values. Dictionaries in Python are (or used to be?) huge memory hogs, so for this case you might be better off using lists (s. 1.)