Hello everyone!
I developed a little script that searches for two primers in reads (converted into .fa format) using bbduk, so that it finds both primers within a single read.
The problem is that the script is not as efficient as it could be and takes a long time to run. As the (Python) code below shows, it first searches for the first primer and keeps the reads that contain it, then deletes the initial file that contains all the reads; after that it searches those hits for the second primer, which produces empty output files for read sets that do not contain both primers.
My question is: how can I improve the processing speed, either by changing the code or by using a different approach? Or does something similar already exist that I could use as an example?
#!/usr/bin/env python3
import csv
import os

# Tool location
bbduk = '/home/vladislav.shevtsov/miniconda3/envs/mlvapython3/bin/bbduk.sh'

# Data locations
data_in_dir = '/home/vladislav.shevtsov/bbduk_search/reads'
data_out_dir = '/home/vladislav.shevtsov/bbduk_search/output/'

fastq_gz_row_list = sorted([os.path.join(data_in_dir, i) for i in os.listdir(data_in_dir) if i.endswith('.gz')])

def generate_bbduk_output_name(name, suffix='OUT'):
    # Insert a suffix before the file extension to build the output name
    a, b = os.path.splitext(name)
    new_name = a + '_' + suffix + b
    return new_name

doc = '/home/vladislav.shevtsov/bbduk_search/Primers_Pseudo_7fix_run.txt'  # path to Primers.txt
bash_file = 'bbdu1.sh'

with open(bash_file, 'w') as f, open(doc, 'r') as primer:
    file_reader = csv.reader(primer, delimiter='\t')
    f.write('#!/bin/bash \n')
    for lines in file_reader:
        for sample in fastq_gz_row_list:
            name = os.path.basename(sample)
            name, _ = os.path.splitext(name)
            out_sample = os.path.join(data_out_dir, name)
            out_sample = generate_bbduk_output_name(out_sample, lines[0])
            out_sample2 = generate_bbduk_output_name(out_sample)
            #print(out_sample)
            if not os.path.isfile(out_sample):
                f.write(f'{bbduk} in={sample} outm={out_sample} literal={lines[1]} mm=f k=22 minlen=1 hdist=1 forbidn=t threads=18 \n')  # searches for primer 1 ... run 1
                f.write(f'{bbduk} in={out_sample} outm={out_sample2} literal={lines[2]} mm=f k=10 minlen=1 hdist=1 forbidn=t threads=18 \n')  # searches for primer 2 in the run-1 hits
                f.write(f'rm {out_sample} \n')  # removes the run-1 intermediate file
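For comparison, a single pass in plain Python can keep only the reads that contain both primers, without writing or deleting any intermediate files. This is a minimal sketch, assuming exact substring matches only (no hdist-style mismatch tolerance, which bbduk does provide) and gzipped FASTQ input; the function names are mine, not part of the script above:

```python
import gzip

def read_fastq(handle):
    """Yield (header, sequence, plus, quality) records from a FASTQ handle."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        plus = handle.readline().rstrip()
        qual = handle.readline().rstrip()
        yield header, seq, plus, qual

def find_reads_with_both_primers(fastq_gz, primer1, primer2):
    """Return records whose sequence contains both primers (exact match only)."""
    hits = []
    with gzip.open(fastq_gz, 'rt') as handle:
        for record in read_fastq(handle):
            seq = record[1]
            if primer1 in seq and primer2 in seq:
                hits.append(record)
    return hits
```

Because each .fastq.gz is read only once per primer pair, this avoids the write/re-read/delete cycle of the intermediate bbduk files, at the cost of losing bbduk's k-mer matching and mismatch handling.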
Please add a representative input and output example. People are typically reluctant to go through many lines of code just to understand the problem. A good illustration always helps.
Thank you for the advice. The input file Primers.txt looks like this:
INPUT FILE 1
Ft-M1_3bp_83bp_3u TTAGCAGCCGCGATTACATCTATCAG TAATATTAAAATCATAGAATTTTAAAC
Ft-M2_6bp_90bp_4u TTTAATTCTTTTCATAATAAATTTTGT TTTCAATAACATTCATTTTAAAAAGA
.........
INPUT FILE 2
The script also needs reads.fastq.gz as an input in input folder.
OUTPUT FILES
The output files contain all the reads that carry both primers in a single read:
@SRR942045.1900911 1900911/1
AAATTCAGTAATTGATCTGATCTTACCAGTAACTTTATCACCTGGTTTGTAGTTTTTCTCAAACTCGCTCCAAGGATTAGGTCTGCATTGCTTGATACCAAGAGATATTCTGTGGTTATCAGCATCTAGTTCAAGTACGATAACTTCAACTTCTTGACCAATAGATACAGCTTTATGAGGGTTAACGTTTTTGTTTGTCCAATCCATTTCAGCTGTATGAACAAGACCTTCGACACCTTCTTTTAACTTAA
+
?????BBB?BBBBBBBF;??CC>CFFHFCAEEGHHHHFFAEFFFHHFHEHHHGHHHFFF=??EFDEE@DDC=CE;,CFGGHHHHHFGFHGHBFGDHHEHHFFF?FDGHDFFHBGF7...
@SRR942045.1908205 1908205/1
ATGTAACTACTTGACCGCCAGTATATGCTTGACCTTGACTCCATGCCTCACCTGAAACAAAAGGTTTCTTCCATGCGCCACCATCAGCCGGATTATTACCTTGTGTCCACCATTGAGCTTCATAAGTCACACCATTAACAATAACTTTATCACCCTTATTGTAGAC
+
??A??BBBEEDDDDDDGGGGGGIIIIIIHIHIIIIIHIIIIIIIHIIIIIIHIHIIIIIIIIIHHHGGHIHIIIHIHHEEHHHGHFHGHHHCHHHHHHHHHHHHFHHHHHHHHGGGGGGGGFGGGGGGGGGGGGEGGGGGEGGGGGGGGGGGEGGGGEEGEEGGGG
@SRR942045.1917232 1917232/1
TCTTCTAGTCAATCACTAGTAAGCTCAATAACTTTATCATGATATGCTTTAACTCTATCATTATGACTAACTGAGATATAAGCTAGACCTTTATCCTTAAGTAGTTGATAGATTTTTTCTCTGATTTTGCATCTATATTTGAAGTTGCCTCATCAAGCAAAATCAAGTCATAAGCTTTGGTAAATATTTTACACAAGCGTAGCT
+
AAAAABBBDDDEDEDDGGGGGGIIHHIIIIIIIIIIHIIIHIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIHHIIHIIIIIDGHIIIHIIHHHHFHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGGGGGGGEEGGGGEEGBEGGGEGGGGGGGGGCEEGCEEEAACACEGG8CC
.........
Depending on how many of these sequences you have in your primer file and the size of your dataset, that is how long it is going to take (assuming your Python code is not adding any additional overhead; I am not sure why you are not running
bbduk.sh
directly, since it will take a file of FASTA-format sequences as input. The way you are doing it, each sequence is run individually, so the input files have to be read over and over again). You could also try giving bbduk.sh
some additional memory with -Xmx10g
to see if that helps.
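Following that suggestion, the tab-separated primer table could be converted once into a FASTA file and passed to bbduk.sh via ref= in a single run per sample (note that reads matching any single primer would then land in outm=, so selecting reads that carry both primers would still need a follow-up step). A sketch, with the function name and file paths as assumptions:

```python
import csv

def primers_table_to_fasta(table_path, fasta_path):
    """Write each primer pair from the tab-separated table as two FASTA records."""
    with open(table_path) as table, open(fasta_path, 'w') as fasta:
        for row in csv.reader(table, delimiter='\t'):
            name, fwd, rev = row[0], row[1], row[2]
            fasta.write(f'>{name}_F\n{fwd}\n')
            fasta.write(f'>{name}_R\n{rev}\n')
```

A single call such as `bbduk.sh -Xmx10g in=reads.fastq.gz outm=hits.fq ref=primers.fa k=22 hdist=1` would then replace one bbduk run per primer.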