I am looking forward to getting valuable suggestions for a bioinformatics problem.
Background:
Currently, I am performing a de novo whole-genome assembly. At the barcode-correction stage, I lost nearly half of all the reads due to erroneous barcodes. During library construction, 18-base molecular barcodes were used for linked-read generation, and the barcodes have a high incidence of "N" at positions 12 through 18.
Now, the company has asked me to do one of two things to recover the lost reads:
- Either replace the barcodes (barcode fastq file from the sequencing core) with the correct ones from a pool of barcodes (a text file I received from the company that contains all 18-base barcode sequences)
- Or, since most "N"s are seen at positions 12 to 18 of the 18-base barcode sequences (barcode fastq file from the sequencing core), match the first 11-mer of each barcode against the 11-mers of the correct barcodes (text file) and replace each erroneous barcode with its matching correct one.
I have decided to use Python code to execute the first method.
#Define the paths where the input files are stored.
import os
from Bio.SeqIO.QualityIO import FastqGeneralIterator

path = os.path.join(os.path.expanduser("~"), "Desktop", "SVA1_S1_R2_001.fastq")
#path of the new file to write the corrected records to
path_new = os.path.join(os.path.expanduser("~"), "Desktop", "trimmedfastq.fastq")
path_ust_barcode = os.path.join(os.path.expanduser("~"), "Desktop", "UST40MBarcodes.txt")

#open the text file containing the correct pool of barcodes;
#strip the trailing newlines so the Hamming comparison sees bare sequences
with open(path_ust_barcode, "r") as fh:
    barcodes = [line.strip() for line in fh]

#===================================================
#Return the Hamming distance between string1 and string2.
def hamming_distance(string1, string2):
    #start with a distance of zero, and count up
    distance = 0
    #loop over the indices of the string
    for i in range(len(string1)):
        #add 1 to the distance if these two characters are not equal
        if string1[i] != string2[i]:
            distance += 1
    #return the final count of differences
    return distance
#===================================================

#open the fastq file for reading and the output file for writing, once each
with open(path, "r") as handle, open(path_new, "w") as wf:
    for title, sequence, qual in FastqGeneralIterator(handle):
        #reset the best match for every record
        corr_barcode = None
        min_distance = None
        for read in barcodes:
            h_distance = hamming_distance(sequence, read)
            #keep the barcode with the smallest distance seen so far
            if min_distance is None or h_distance < min_distance:
                min_distance = h_distance
                corr_barcode = read
        #write the corrected record in FASTQ format
        wf.write("@%s\n%s\n+\n%s\n" % (title, corr_barcode, qual))
Problem
I would like to know whether this is the only Pythonic way to tackle this problem. The code does generate the corrected barcode (fastq) file, but it is extremely slow: my erroneous barcode fastq file is 30+ GB, and the program is producing the new corrected file at a rate of about 60 KB every ~24 hours. I am looking forward to getting valuable suggestions here, with respect to the code, and especially on how I can make the process faster.
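One shortcut I have been considering, sketched below under the assumption that many reads carry an error-free barcode (the names correct_barcode, barcode_set, and barcode_list are illustrative, not part of my pipeline): test for an exact hit in a set first, so only reads that actually miss ever reach the 40-million-entry Hamming scan.

#a minimal sketch of an exact-match fast path (illustrative names)
def hamming_distance(s1, s2):
    #count positions where the two strings differ
    return sum(a != b for a, b in zip(s1, s2))

def correct_barcode(sequence, barcode_set, barcode_list):
    #O(1) exact lookup: error-free barcodes never reach the slow scan
    if sequence in barcode_set:
        return sequence
    #fall back to the full scan only when the exact lookup misses
    best, best_distance = None, None
    for candidate in barcode_list:
        d = hamming_distance(sequence, candidate)
        if best_distance is None or d < best_distance:
            best, best_distance = candidate, d
    return best

#barcode_set = set(barcodes) would be built once, before the main loop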
NB: I was told by the company that the second method is better than the first one, but nonetheless the code above corresponds to the first method. A sketch of that second approach follows.
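For reference, here is a minimal sketch of what that second method might look like, under the assumption that the first 11 bases uniquely identify each barcode in the pool (an assumption worth verifying; if several pool barcodes share an 11-mer prefix, the dictionary below would need to handle collisions). The names prefix_to_barcode and correct_by_prefix are illustrative.

#a minimal sketch of the 11-mer prefix lookup (assumes unique prefixes)
prefix_to_barcode = {}
with open(path_ust_barcode, "r") as fh:
    for line in fh:
        barcode = line.strip()
        #key each correct barcode by its first 11 bases
        prefix_to_barcode[barcode[:11]] = barcode

def correct_by_prefix(sequence):
    #positions 1-11 are the reliable part, so use them as the key;
    #returns None when the prefix is not found in the pool
    return prefix_to_barcode.get(sequence[:11])

Because this is a single dictionary lookup per read instead of a scan over ~40 million candidates, it would also sidestep the speed problem described above.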
Why not split the fastq file into chunks and run this in parallel on each one separately?
@benformatics Did you mean to split the text file into multiple component files and perform the process on each of them? I think there are more than 40 million sequences in the text file.
You could, for example, split it into 40 files, each with one million reads.
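If it helps, a minimal sketch of that splitting step in Python (the chunk size and output file names are illustrative); each parallel job would then scan one chunk, and the overall best hit per read is the minimum across the 40 partial results:

#a minimal sketch: split the barcode pool into chunks of one million lines
def split_barcode_pool(path, lines_per_chunk=1000000):
    out = None
    chunk_no = 0
    with open(path) as fh:
        for i, line in enumerate(fh):
            #start a new chunk file every lines_per_chunk lines
            if i % lines_per_chunk == 0:
                if out:
                    out.close()
                out = open("barcodes_chunk_%02d.txt" % chunk_no, "w")
                chunk_no += 1
            out.write(line)
    if out:
        out.close()

split_barcode_pool("UST40MBarcodes.txt")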
@i.sudbery, Thank you! I can do that. I would also like to know whether a bash script could do the work faster than Python.
Was this an error on the part of whoever designed the barcodes, i.e. were they low-complexity in this region? Having low-complexity sequence (and not including enough phiX to compensate for it) will lead to Ns showing up in the sequencing, especially if the run was borderline overloaded.
If they are not low-complexity, then this seems to be a run-time reagent/sequencer problem that should be addressed by the company that did the sequencing. They should re-run this pool at no cost. Asking you to try to salvage this data informatically is bad.