Analyzing and slicing FASTQ file entries using Python
2.8 years ago
Apex92 ▴ 300

I have the code pasted below, which runs over the entries of a FASTQ file, compares specific parts of each read, and removes redundant sequences (duplicates of the same miRNA + umi_seq combination). I save the IDs of the entries to keep and then write a new FASTQ file based on them.

The problem is that it takes a very long time to produce the output file for a FASTQ file with 72 million entries (~18 GB). How can I optimize the code so that it runs reasonably fast? Thank you.

from Bio import SeqIO
import sys


def QIAseq_UMI_correction():
    script=sys.argv[0]
    file_name=sys.argv[1]
    dicts1 = {}
    dicts2 = {}
    lst = []
    with open(file_name, "r") as Fastq:
        for record in SeqIO.parse(Fastq,'fastq'):
            #print(record.id)
            #print(record.seq)
            #looking for the 3' adapter
            if "AACTGTAGGCACCATCAAT" in record.seq:
                adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
                #slicing the record (not record.seq) keeps all attributes, like the Phred scores, in the FASTQ output
                miRNAseq = record[:adapter_pos]
                adapter_seq=record[adapter_pos:adapter_pos+19]
                umi_seq = record[adapter_pos+19:adapter_pos+19+12]
                i = record.id
                x = miRNAseq.seq+umi_seq.seq
                #print((miRNAseq+umi_seq).format("fastq"))
                dicts1[i]=x

        #store IDs and sequences in a dictionary, keeping only one ID per unique miRNA + UMI sequence
        for k,v in dicts1.items():
            if v not in dicts2.values():
                dicts2[k] = v

        #making a list
        for keys in dicts2:
            lst.append(keys)

    #print(dicts1)
    #print(dicts2)
    #print(lst)


    with open(file_name, "r") as Fastq:
        for record in SeqIO.parse(Fastq,'fastq'):
            #for the IDs saved in the list, print the corresponding entries (miRNA + 3' adapter + UMI)
            if record.id in lst:
                adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
                miRNAseq = record[:adapter_pos]
                adapter_seq=record[adapter_pos:adapter_pos+19]
                umi_seq = record[adapter_pos+19:adapter_pos+19+12]
                #print(record.seq)
                #print(miRNAseq.seq)
                #print(adapter_seq.seq)
                #print(umi_seq.seq)
                #print("@"+record.id)
                if len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) <= 50:
                    print((miRNAseq+adapter_seq+umi_seq).format("fastq"),end='')

                if len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) > 50:
                    cut = len(miRNAseq.seq+adapter_seq.seq+umi_seq.seq) - 50
                    print((miRNAseq+adapter_seq+umi_seq)[:-cut].format("fastq"), end='')

if __name__ == '__main__':
    QIAseq_UMI_correction()
biopython sequence performance FASTQ

The first 'non-simplistic' solution is to parallelise the code; I suspect it is running on a single processor. Beyond that, there are a number of key improvements:

  • checking and timing code efficiency
  • getting the data beneath the RAM ceiling and rewriting the code to accommodate this
  • shifting to faster functions (see the sketch below)

The list goes on, and it is not simple. This is a full week or more of work; even with expertise it is a couple of days of solid work, minimum - well, okay, there is not much code here, so maybe not.

An easier solution is to move to GCP or Amazon and load in enough RAM to prevent bottlenecks. There is also the behaviour of supervisord, which seems (?) to spread the job automatically across more than one CPU, at least on GCP; I never understood how that worked.

... of course this assumes you will stick with your own code, which always reaps rewards.
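To illustrate the 'shifting to faster functions' point: SeqIO.parse builds a full SeqRecord object for every read, and for a simple filter like this a plain four-line FASTQ reader is usually much faster. A minimal sketch, assuming a well-formed, uncompressed FASTQ (the function name is mine):

def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from a plain FASTQ file."""
    while True:
        header = handle.readline().rstrip()
        if not header:              # end of file
            return
        seq = handle.readline().rstrip()
        handle.readline()           # the '+' separator line
        qual = handle.readline().rstrip()
        yield header, seq, qual

Trimming and deduplication can then be done on plain strings, with the four FASTQ lines written back out manually, which avoids the SeqRecord construction and formatting overhead.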

2.8 years ago
Apex92 ▴ 300

I just solved it by removing the second pass: if x is not already present in the dict, the script writes miRNAseq + adapter_seq + umi_seq to the output straight away. I only use one dict now, and the code is already generating output. Here is the updated version.

from Bio import SeqIO
import sys

def QIAseq_UMI_correction():
    script=sys.argv[0]
    file_name=sys.argv[1]
    dicts = {}
    lst = []

    with open(file_name, "r") as Fastq:
        for record in SeqIO.parse(Fastq,'fastq'):
            #print(record.id)
            #print(record.seq)
            #looking for the 3' adapter
            if "AACTGTAGGCACCATCAAT" in record.seq:
                adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
                #slicing the record (not record.seq) keeps all attributes, like the Phred scores, in the FASTQ output
                miRNAseq = record[:adapter_pos]
                adapter_seq=record[adapter_pos:adapter_pos+19]
                umi_seq = record[adapter_pos+19:adapter_pos+19+12]
                i = record.id
                x = miRNAseq.seq+umi_seq.seq
                y = miRNAseq.seq+adapter_seq.seq+umi_seq.seq

                if x not in dicts.values():
                    dicts[i]=x

                    if len(y) <= 50:
                        print((miRNAseq+adapter_seq+umi_seq).format("fastq"),end='')

                    if len(y) > 50:
                        cut = len(y) - 50
                        print((miRNAseq+adapter_seq+umi_seq)[:-cut].format("fastq"), end='')

if __name__ == '__main__':
    QIAseq_UMI_correction()
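One small simplification: the two length branches can be collapsed into a single slice, because slicing a SeqRecord past its end simply returns the whole record, and the per-letter quality scores are sliced along with the sequence. A sketch of the final write step under that assumption:

# both branches in one step: keep at most the first 50 bases
trimmed = (miRNAseq + adapter_seq + umi_seq)[:50]
print(trimmed.format("fastq"), end='')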

That's great. The data set might be smaller than I imagined.


Thank you - actually, the files I am dealing with are more than 18 GB, which is very painful.


Good to know for the future - that's definitely on the big side.


Also one important thing that I changed was

if x not in dicts.values():
     dicts[i]=x

to

if x not in dicts.keys():
     dicts[x]=i

and this increased the speed a lot. Apparently searching the keys is much faster than searching the values: a key check is a constant-time hash lookup, whereas x in dicts.values() scans every stored value on each read.
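Since only membership is being tested, a plain set gives the same constant-time lookup without keeping the read IDs around at all. A minimal sketch of the idea (using the same adapter and offsets as above):

seen = set()                                   # one entry per unique miRNA + UMI combination
for record in SeqIO.parse(Fastq, 'fastq'):
    adapter_pos = record.seq.find('AACTGTAGGCACCATCAAT')
    if adapter_pos == -1:                      # adapter not found, skip the read
        continue
    key = str(record.seq[:adapter_pos]) + str(record.seq[adapter_pos+19:adapter_pos+31])
    if key not in seen:                        # set membership is an O(1) hash lookup
        seen.add(key)
        # trim and write the record as in the answer above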
