I have a sequence file where for each read, I want to check for instances of a pattern(s). I will demultiplex the file into several files based on which pattern is found in each read. I am trying out the Biopython Motif module as -
motifs=Motif.Motif(alphabet=IUPAC.unambiguous_dna)
motif.add_instance(Seq('ATAGCATAG',motifs.alphabet))
motif.add_instance(Seq('AAGCATAAG',motifs.alphabet))
motif.add_instance(Seq('CTAGCAGGG',motifs.alphabet))
etc. Then I loop over the file, and print out to the files as -
for read in SeqIO.parse(r_file,"fastq"):
for motif in motifs.search_instances(read.seq):
motif_seq= str(motif[1])
out_handle=dict_out_file_handles[motif_seq]
out_handle.write(read.format("fastq"))
This is turning out to be rather slow. Is there are faster way to do this? Should I use plain regex?
Thanks very much.
You're doing a lot of IO during the loops! Perhaps you could read the FASTQ file into memory first, push any identified sequences with the motifs to a list and then output to disk?