I have pasted my code below. It runs over FASTQ file entries, compares specific parts of each read, and removes redundant sequences (based on the miRNA + umi_seq combination). I save the entry IDs of the reads to keep and then write a new FASTQ file based on them.
The problem is that it takes a very long time to produce the output file for a FASTQ file with 72 million entries (~18 GB). How can I optimize the code so it runs faster? Thank you.
from Bio import SeqIO
import sys

def QIAseq_UMI_correction():
    script = sys.argv[0]
    file_name = sys.argv[1]
    dicts1 = {}
    dicts2 = {}
    lst = []
    with open(file_name, "r") as Fastq:
        for record in SeqIO.parse(Fastq, 'fastq'):
            #print(record.id)
            #print(record.seq)
            # looking for the 3' adapter
            if "AACTGTAGGCACCATCAAT" in record.seq:
                adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
                # slicing the whole record (not just record.seq) keeps all attributes, e.g. the Phred scores
                miRNAseq = record[:adapter_pos]
                adapter_seq = record[adapter_pos:adapter_pos+19]
                umi_seq = record[adapter_pos+19:adapter_pos+19+12]
                i = record.id
                x = miRNAseq.seq + umi_seq.seq
                #print((miRNAseq+umi_seq).format("fastq"))
                dicts1[i] = x
    # write ids and seqs into a dictionary and keep only one id per miRNA + UMI combination
    for k, v in dicts1.items():
        if v not in dicts2.values():
            dicts2[k] = v
    # making a list of the ids to keep
    for keys in dicts2:
        lst.append(keys)
    #print(dicts1)
    #print(dicts2)
    #print(lst)
    with open(file_name, "r") as Fastq:
        for record in SeqIO.parse(Fastq, 'fastq'):
            # based on the saved ids in the list, print the entries (miRNA + 3' adapter + UMI)
            if record.id in lst:
                adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
                miRNAseq = record[:adapter_pos]
                adapter_seq = record[adapter_pos:adapter_pos+19]
                umi_seq = record[adapter_pos+19:adapter_pos+19+12]
                #print(record.seq)
                #print(miRNAseq.seq)
                #print(adapter_seq.seq)
                #print(umi_seq.seq)
                #print("@"+record.id)
                if len(miRNAseq.seq + adapter_seq.seq + umi_seq.seq) <= 50:
                    print((miRNAseq + adapter_seq + umi_seq).format("fastq"), end='')
                if len(miRNAseq.seq + adapter_seq.seq + umi_seq.seq) > 50:
                    cut = len(miRNAseq.seq + adapter_seq.seq + umi_seq.seq) - 50
                    print((miRNAseq + adapter_seq + umi_seq)[:-cut].format("fastq"), end='')

if __name__ == '__main__':
    QIAseq_UMI_correction()
The obvious "non-simplistic" solution is to parallelise the code; I suspect this is running on a single core. Beyond that, there are a number of key improvements visible in the code itself:

- `if v not in dicts2.values()` rescans every stored value for every record, which is quadratic over 72 million entries.
- `if record.id in lst` is a linear search through a list; a set gives constant-time membership tests.
- The file is parsed twice. The deduplication can be done in a single pass, writing each first-seen miRNA + UMI combination straight to the output as it is encountered.

The list goes on, and it is not simplistic work. Done properly this is a full week or more of effort, or a couple of solid days with expertise, minimum - well, okay, there is not much code here, so maybe less. A minimal single-pass sketch follows.
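Something along these lines (a rough, untested sketch, not a drop-in replacement: the `dedup` helper name and the separate output-file argument are mine, the 19 nt adapter / 12 nt UMI layout and the 50 nt cap mirror the question's logic, and the plain 4-line parsing is my substitution for SeqIO, so it assumes an uncompressed, well-formed FASTQ):

import sys

ADAPTER = "AACTGTAGGCACCATCAAT"
ADAPTER_LEN = len(ADAPTER)   # 19 nt
UMI_LEN = 12
MAX_LEN = 50

def dedup(file_name, out_name):
    seen = set()                      # miRNA + UMI combinations already written
    with open(file_name) as fq, open(out_name, "w") as out:
        while True:
            header = fq.readline()
            if not header:            # end of file
                break
            seq = fq.readline().rstrip("\n")
            plus = fq.readline()
            qual = fq.readline().rstrip("\n")

            adapter_pos = seq.find(ADAPTER)
            if adapter_pos == -1:     # no 3' adapter: skip, as in the original
                continue

            end = adapter_pos + ADAPTER_LEN + UMI_LEN
            # deduplication key: miRNA part + UMI part (adapter excluded)
            key = seq[:adapter_pos] + seq[adapter_pos + ADAPTER_LEN:end]
            if key in seen:           # already wrote a read with this combination
                continue
            seen.add(key)

            end = min(end, MAX_LEN)   # cap the output read at 50 nt, like the original
            out.write(header)
            out.write(seq[:end] + "\n")
            out.write(plus)
            out.write(qual[:end] + "\n")

if __name__ == "__main__":
    dedup(sys.argv[1], sys.argv[2])

This removes the quadratic value scans, the list lookups and the second pass over the file. The set of up to 72 million keys still needs a fair amount of RAM; hashing each key (e.g. storing an MD5 digest instead of the string) would shrink it further.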
An easier solution is to shift to GCP or Amazon and provision enough RAM to avoid bottlenecks. There is also the behaviour of supervisord, which seems (?) to spread the job across more than one CPU automatically, at least on GCP; I never understood how that worked.
... all of this, of course, assumes you stick with your own code, which always reaps rewards.
cross-posted: https://stackoverflow.com/questions/71212442/