Question

I want to correct the erroneous barcode file, and the Python code that I've written, using Biopython, is very slow. How can I make this process faster, and is there any alternative method?

2

Entering edit mode

21 months ago

Vijith ▴ 90

I am looking forward to getting a valuable suggestion for a bioinformatic problem.

Background:

Currently, I am performing a de novo whole genome assembly. At the stage of barcode correction, I lost nearly half of all the reads due to erroneous barcodes. Tellread report While library construction, 18-base molecular barcodes were used for the linked-read generation, and the barcodes have a high incidence of "N" in positions between 12 and 18, as you may see below. enter image description here

Now, I have been asked by the company to do one of two things, to recover the lost reads:

Either replace the barcodes(barcode fastq file from sequencing core) with the correct ones from a pool of barcodes (This is a text file I received from the company that contains the pool of all 18 base barcode sequences)
Since most "N"s are seen in positions 12 to 18 of the 18-base barcode sequences (barcode fastq file from the sequencing core), replace the 11-mers in the barcode file, and replace them with the 11-mers from the correct barcode file (text file).

I have decided to use Python code to execute the first method.

#Define the path where fastq file is stored.
import os
from Bio.SeqIO.QualityIO import FastqGeneralIterator 
path=os.path.join(os.path.expanduser("~"), "Desktop", "SVA1_S1_R2_001.fastq")

#create a new file handle to write the file
path_new=os.path.join(os.path.expanduser("~"), "Desktop", "trimmedfastq.fastq")
path_ust_barcode=os.path.join(os.path.expanduser("~"), "Desktop", "UST40MBarcodes.txt")
write_file=open(path_new, "w")

#open the text file containg the correct pool of barcodes
with open(path_ust_barcode, "r")as fh:
    barcodes=fh.readlines()


#===================================================
# Return the Hamming distance between string1 and string2.
def hamming_distance(string1, string2): 
    # Start with a distance of zero, and count up
    distance = 0
    # Loop over the indices of the string
    L = len(string1)
    for i in range(L):
        # Add 1 to the distance if these two characters are not equal
        if string1[i] != string2[i]:
            distance += 1
    # Return the final count of differences
    return distance
#===================================================

#set correct barcode to "None"
corr_barcode=None
#set the minimum hamming distance to "None"
min_distance=None
#open the fastq file
with open(path, "r") as handle:
    for title, sequence, qual in FastqGeneralIterator(handle):
        for read in barcodes:
            h_distance=hamming_distance(sequence, read)

            #checking of the h_distance ==None or <min_distance
            if min_distance == None:
                min_distance=h_distance
            elif h_distance <= min_distance:
                min_distance=h_distance 
                corr_barcode=read
        #write a new file
        with open(path_new, "a") as wf:
            wf.write("@%s\n%s+\n%s\n" % (title, corr_barcode, qual))
            continue

Problem

I would like to know if this is the only Pythonic way to tackle this problem. This method is dead slow. In fact, the Python code is generating the corrected barcode (fastq) file, but the procedure is very very slow. My erroneous barcode fastq file is 30+ Gb, and the program is generating the new corrected file at a rate of 60Kb evey ~24 hours. I am looking forward to getting valuable suggestions here, with respect to the code, especially how I can make the process faster.

NB: I was told by the company that the second method is better than the first one, but nonethless the code corresponds to the first method.

NGS biopython illumina WGS • 5.7k views

ADD COMMENT • link updated 20 months ago by Ram 44k • written 21 months ago by Vijith ▴ 90

1

Entering edit mode

why not split the BAM file into chunks and run this in parallel separately on each one

ADD REPLY • link 21 months ago by benformatics 4.1k

0

Entering edit mode

@benformatics Did you mean, to split the text file into multiple component files and perform the process on each of them? I think there are about more than 40 million sequences in the text file

ADD REPLY • link 21 months ago by Vijith ▴ 90

1

Entering edit mode

You could, for example, split it into 40 files, each with one million reads.

ADD REPLY • link 21 months ago by i.sudbery 21k

1

Entering edit mode

@i.sudbery, Thank you! I can do that. Also, I would like to know if a bash script file can do the work faster than Python.

ADD REPLY • link 21 months ago by Vijith ▴ 90

1

Entering edit mode

and the barcodes have a high incidence of "N" in positions between 12 and 18,

Was this an error on the part of whoever designed the barcodes i.e. were they low-complexity in this region? Having low-complexity sequence (and not including enough phiX to compensate for this will lead to N's showing up in the sequencing especially if the run was borderline overloaded).

If they are not low complexity then this seems to be a run time reagent/sequencer problem that should be addressed by the company that did the sequencing. They should re-run this pool at no cost. Asking you to try and salvage this data informatically is bad.

ADD REPLY • link 21 months ago by GenoMax 148k

score 4 · Answer 1 · 2023-04-19

With the disclaimer that I just skimmed through the question and code, I notice this:

    for title, sequence, qual in FastqGeneralIterator(handle):
        ...
        with open(path_new, "a") as wf:
            wf.write("@%s\n%s+\n%s\n" % (title, corr_barcode, qual))

it looks like you are opening a file for each read and that could be the issue. I would move open(path_new, "a") outside the main loop.

score 3 · Answer 2 · 2023-04-19

3

Entering edit mode

21 months ago

Jeremy Leipzig 23k

    if min_distance == None:
        min_distance=h_distance

no need to run this 350 million times

in fact if you have a min_distance you should stop comparing sequences once it falls below your threshold

ADD COMMENT • link 21 months ago by Jeremy Leipzig 23k

score 2 · Answer 3 · 2023-04-19

2

Entering edit mode

21 months ago

i.sudbery 21k

The length you are going to want to go to optimise this depend on whether you are creating a tool to do this over and over, or its just a one off. There are more efficient ways to solve this, but they are almost certainly not worth the effort, if this is a one off.

If its taking 24hrs to process 60kb of data, then something is definately wrong. With speeds this low, the first place I'd look is memory usage (are you running out of memory, and thus causing disk thrashing). The second place I'd look is IO - in this case look to @dariober's suggestion. Definately take this open statement outside of any loops. Just open the file once at the top of the script.

After that, here are some ideas in order of easiness to implement:

You continue looking, even if you've found a perfect match. Check whether your edit distanct is 0, and if it is immediately break out of your for loops, because you've found your answer. Actaully, you can skip the for loop all together for perfect matches by checking if your read sequence is in the barcodes list with a simple in statement. - if sequence in barcodes:.
I don't know what your barcodes are, but many designed barcodes are designed to have a minimum distance between them. If thats the case, its a fair assumption that if edit distance is 1, then you are not going to find a better match, and so can immediately use that and break out of the loop.
Given that most of your errors are in the last 7 bases, you only really need to search barcodes that share the same first 11 bases. Build a dictionary of all of the used 11mers, where the content of the dict is a list of barcodes that share those first 11 bases. Use this to only search barcodes that have the same first 11 bases are your query sequence.
Pre-compute all possible barcodes that are 2 errors away from a correct barcode. Create a dictionary with these as the keys, and the correct barcode as the value. Look up into this dictionary with your read sequence.
Cythonise your edit distance function. In fact, you could probably cythonise the whole thing.
Implement Daniel Liu's bit-shifting method for calculating hamming distasnces in constant time.
Use n-grams or BK-trees as a better algorithm than comparing every read to every barcode. These are described in Daniel Liu's paper. There is code implementing n-grams in UMI-tools network.py

ADD COMMENT • link 21 months ago by i.sudbery 21k

0

Entering edit mode

@i.sudbery, I would love to quote what the sequencing company says.

" Let’s say in your run, you have reads 1,2,3,… for a total of N reads, and among these N reads, you have barcodes 1,2,3,… for a total of M barcodes. Each barcode is a kmer of 18 bases that contain A,C,G,T,N. Each read has a specific barcode from M assigned to that read. Because positions 12-18 in the barcode are error-prone, I recommend removing them, leaving you barcode positions 1-11 and m possible barcodes. Each read now has a specific barcode from m assigned to that read. I provided you a set of 40M UST barcodes that constitute the whitelist. These barcodes are correct and good to use. Let’s call this set U. For each barcode from m, pick a random barcode from U. Now, each read has a specific barcode from U assigned to that read."

They recommend that I remove the 7-mers, positioned between and including 12 & 18, and for each barcode from "m", pick a barcode from "U" (40M UST barcode). Here,I am afraid if I can use if sequence in barcodes:, removing the for loop altogether because, there are single base substitutions and Ns even at postions 1 -11.

ADD REPLY • link 21 months ago by Vijith ▴ 90

1

Entering edit mode

My suggestion to use "if sequence in barcode" was not to remove the loop altogether, but to check if first, and not execute the loop if a prefect match exists. You would still execute the loop if a perfect match wasn't present.

ADD REPLY • link 21 months ago by i.sudbery 21k

1

Entering edit mode

By the way, if your barcode whitelist has 40M barcodes in it, then the reason you are going slowly is almost certainly a memory problem. I just profiled creating a list of 40M random 18mers, and this used 12GB of RAM just to do that. How much memory does the machine you are working on have free?

ADD REPLY • link 21 months ago by i.sudbery 21k

0

Entering edit mode

I am using an IBM server with 197 GB RAM. Every time I run, I make sure no other (internet browser, etc.) application is running in the background. That being said, I can monitor its memory usage while the process is running.

ADD REPLY • link 21 months ago by Vijith ▴ 90

1

Entering edit mode

Hmm.... I you are on a big server, I guess thats not likely to be the problem (sorry, your references to "~/Desktop" made me think you were on a personal workstation). Another thing to consider is where your storage is located if its a big server. Is the storage local?

ADD REPLY • link 21 months ago by i.sudbery 21k

0

Entering edit mode

@i.suberi, I've modified my Python code as you see below, making necessary changes as suggested by @Jeremy Leipzig and @dariober, and now the code seems to be generating the file significantly quicker: for example, in a few seconds it generated 2Gb of data.

#Define the path where fastq file is stored.
import os
from Bio.SeqIO.QualityIO import FastqGeneralIterator 
path=os.path.join(os.path.expanduser("~"), "Desktop", "SVA1_S1_R2_001.fastq")

#create a new file handle to write the file
path_new=os.path.join(os.path.expanduser("~"), "Desktop", "trimmedfastq.fastq")
path_ust_barcode=os.path.join(os.path.expanduser("~"), "Desktop", "UST40MBarcodes.txt")
write_file=open(path_new, "w")

#open the text file containg the correct pool of barcodes
# with open(path_ust_barcode, "r")as fh:
fh=open(path_ust_barcode, "r") 
wf=open(path_new, "a")
handle=open(path, "r")  

#===================================================
# Return the Hamming distance between string1 and string2.
def hamming_distance(string1, string2): 
    # Start with a distance of zero, and count up
    distance = 0
    # Loop over the indices of the string
    L = len(string1)
    for i in range(L):
        # Add 1 to the distance if these two characters are not equal
        if string1[i] != string2[i]:
            distance += 1
    # Return the final count of differences
    return distance
#===================================================

#set correct barcode to "None"
corr_barcode=None
#set the minimum hamming distance to "None"
min_distance=None
#open the fastq file

for title, sequence, qual in FastqGeneralIterator(handle):
    for read in fh:
        h_distance=hamming_distance(sequence, read)

        #checking of the h_distance ==None or <min_distance

        min_distance=h_distance
        if h_distance <= min_distance:
            min_distance=h_distance 
            corr_barcode=read
    #write a new file
    #with open(path_new, "a") as wf:
    wf.write("@%s\n%s+\n%s\n" % (title, corr_barcode, qual))
    continue

fh.close()
wf.close()
handle.close()
print("Done")

ADD REPLY • link 21 months ago by Vijith ▴ 90

1

Entering edit mode

Have you checked the output? As far as I can tell, this is going ro assign the same barcode to every read, because it always assigns h_distance to min_distance, so your if condition will always be true.

ADD REPLY • link 21 months ago by i.sudbery 21k

0

Entering edit mode

Yes, it always assigns the h_distance to min_distance unless it is in an if conditional. Now, I have split the large UST10x barcode file into 1000 component files. 1000 is an arbitrary number I chose: in fact, the large file can be split using different parameters. I've monitored the memory consumption as you suggested yesterday, and it slightly exceeds 10 GB. The server has an Intel octa-core processor and two cores were at 100% while the program was running.

ADD REPLY • link 21 months ago by Vijith ▴ 90

1

Entering edit mode

is this TELL-seq data?

ADD REPLY • link 21 months ago by benformatics 4.1k

0

Entering edit mode

Yes it's TELL-Seq data. The modified Index file will be used for TELL-Read pipeline.

ADD REPLY • link 21 months ago by Vijith ▴ 90

0

Entering edit mode

doesn't part of the pipeline provided by UST correct the barcodes? or was that not working

ADD REPLY • link 21 months ago by benformatics 4.1k

1

Entering edit mode

There appears to be a problem with the sequencing run with N's in several positions of barcodes.

ADD REPLY • link 21 months ago by GenoMax 148k

0

Entering edit mode

Yes, most barcodes have Ns in positions between and including 12 and 18. An initial run using Supernova 2.0 estimated that the genome size is about 2Gb (This is a plant DNA); previous to this the first assembly was performed using TELL-link, and the final scaffold file was just 23 Mb. This was due to the fact that we lost nearly half of all reads during the barcode correction process.

ADD REPLY • link 21 months ago by Vijith ▴ 90

0

Entering edit mode

You ideally should re-sequence this sample since the entire premise of TELLseq seems to be based on barcodes.

ADD REPLY • link 21 months ago by GenoMax 148k

0

Entering edit mode

Since h_distance is always assigned to min_distance, the condition in your if statement will never be false. Thus the barcode assigned to every read will always be the last barcode in the list. Hence every read will be assigned the same barcode.

ADD REPLY • link 21 months ago by i.sudbery 21k

0

Entering edit mode

Yes, I have corrected that mistake in the code, and now the program is generating the .fastq file as expected, but as slowly as before. Now, the "yet-to-be-tested" method is, to split the barcode file into multiple smaller files. Also, I am totally a beginner in "Bioinformatics" and Linux, and I have seen people write bash scripts to process and work with .fastq files. Do you have expertise in that? Or based on your observation, do you think a bash script file can perform it much faster?

ADD REPLY • link 21 months ago by Vijith ▴ 90

1

Entering edit mode

I can't think of a way to do this with bash. Others might be able to.

I'm surprised that just returning your "if h_distances is None" to the script slows it down so much!

If I understand correctly, you are splitting your input into 1000 input files, and then running the bash script on each of them simulataneously? But, despite running 1000 jobs simulataneously, you are only maxing out 2 of the 8 cores? This suggests to me that your task is IO limited. This is not particularly suprising - read processing jobs in python often are IO limited, as python is well known for having fairly slow IO routines. It strikes me however, that if you runn 1000 IO intensive jobs simulatenously, you may be choking the disk controller. Try reducing the number of tasks so you are only running, say 8 at once.

ADD REPLY • link 21 months ago by i.sudbery 21k

0

Entering edit mode

Could you please explain what you mean by task IO limited, and Python being well-known for fairly slow IO routines? I will reduce the number of tasks to 8 at once while running simultaneously. Is it true that iterating through each of the 1000 files and looking for the matching barcode is computationally as slow as the previous method?

ADD REPLY • link 21 months ago by Vijith ▴ 90

0

Entering edit mode

By IO limited, I mean that the bottle neck might not be python running the calcualtions fast enough, but reading the data off the disk and writing the data to the disk. Because you have 8 CPU cores, they can be doing 8 things at once. But if you only have one disk, it can only be reading or writing one thing at once. If the limiting factor was the speed of the CPUs, then you would see 100% utilisation of all your CPUs if you were running many tasks simulataneously. But because you only see 200%, that suggests that the other 6 cores are waiting for the disk to catch up. Given the amount of memory you have, it might even be worth writing to memory, and then dumping that to disk in one go, rather than writing each record to disk as you go. Lets see how quick it is when you dial back the number of simulataneous jobs.

Something like:

#Define the path where fastq file is stored.
import os
from Bio.SeqIO.QualityIO import FastqGeneralIterator 
path=os.path.join(os.path.expanduser("~"), "Desktop", "SVA1_S1_R2_001.fastq")

#create a new file handle to write the file
path_new=os.path.join(os.path.expanduser("~"), "Desktop", "trimmedfastq.fastq")
path_ust_barcode=os.path.join(os.path.expanduser("~"), "Desktop", "UST40MBarcodes.txt")
write_file=open(path_new, "w")

#open the text file containg the correct pool of barcodes

with open(path_ust_barcode, "r") as fh:
   barcodes = fh.readlines()

handle=open(path, "r")  
output = []

#===================================================
# Return the Hamming distance between string1 and string2.
def hamming_distance(string1, string2): 
    # Start with a distance of zero, and count up
    distance = 0
    # Loop over the indices of the string
    L = len(string1)
    for i in range(L):
        # Add 1 to the distance if these two characters are not equal
        if string1[i] != string2[i]:
            distance += 1
    # Return the final count of differences
    return distance
#===================================================

#set correct barcode to "None"
corr_barcode=None
#set the minimum hamming distance to "None"
min_distance=19
#open the fastq file

for title, sequence, qual in FastqGeneralIterator(handle):

    # if a perfect match exists, just output as is
    if sequence in barcodes:
        output.append("@%s\n%s+\n%s\n" % (title, corr_barcode, qual))
        continue

    for read in barcodes:

        h_distance=hamming_distance(sequence, read)

        # since if there was a perfect match we won't be here, 
        # if distance is 1, we'll never do better
        if h_distance == 1:
            corr_barcode = read
            break

        if  h_distance <= min_distance:
            min_distance=h_distance 
            corr_barcode=read

    output.append("@%s\n%s+\n%s\n" % (title, corr_barcode, qual))


with open(path_new, "a") as wf:
    for line in output:
        wf.write(line)

handle.close()
print("Done")

Because this saves all the output to a list before outputting it all to disk at the end, this is going to use a lot more memory. But it should go way faster. The other thing to note is that i've loaded all the barcodes in at the start, as opposed to reading the one at a time for every read. Again, this will increase memory usage, but increase speed by a very large amount. If you decide to go with something like the above, I would start by running it with just one of your 1000 files, and use time -v to measure how long it takes, how much memory it uses, and what % of the time is CPU time.

Another thing that crossed my mind is that one thing that is happening here, is that all similarly ambigous reads are being assigned the same barcode. So, for example all reads with AAAAAAAAAAANNNNNNN would be assigned the same barcode. Is this the correct behavoir?

ADD REPLY • link 21 months ago by i.sudbery 21k

0

Entering edit mode

Thank you so much for the help, sir. Today, I ran your code with one of my 1000 files. A single file is ~750 Kb (the original file ~780Mb). The program has been running without outputting any results for the past 2.5 hours. Around 5Gb memory was being used throughout the run until my observation. Regarding the question that you raised, my answer is our sample DNA was sequenced using NovSeq 6000 system, using the TELL technology. So all short reads, generated from DNA molecules in a GEM (Gel Emulsion in Bead), received the same barcode from a million copies of a specific one.

ADD REPLY • link 21 months ago by Vijith ▴ 90

1

Entering edit mode

It won't output anything until it is finished. It stores all the output up, and then only outputs it when it is finished. This minimizes the usage of the disk. However, if you are using only 750kb files, I'm surprised it is taking so long.

Hmmmm...

I think I might have spotted a bug in the code. Try replacing

with open(path_ust_barcode, "r") as fh:
   barcodes = fh.readlines()

with

barcodes = list()
with open(path_ust_barcode, "r") as fh:
   for b in fh:
      barcodes.append(b.rstrip())

I can think of ways to make this go faster but we start getting progressively more complex.

ADD REPLY • link 21 months ago by i.sudbery 21k

0

Entering edit mode

@i.sudbery, I've made the changes you suggested, but still, the process is running very slowly. I let the program run for two days (calculated using "time cat"), and it was still running. There are ~40,000 lines per text file (one of 1000 files created). Also, when two programs are running simultaneously, all the processor cores are at 100%. Is there any other method I can follow to make it go faster?

ADD REPLY • link 20 months ago by Vijith ▴ 90

0

Entering edit mode

Try substituting:

barcodes = list()
with open(path_ust_barcode, "r") as fh:
   for b in fh:
      barcodes.append(b.rstrip())

with

barcodes = set()
with open(path_ust_barcode, "r") as fh:
   for b in fh:
      barcodes.add(b.rstrip())

By the way, I didn't come up with this via some great store of knowledge. I tried out various things in a jupyter notebook, using %timeit to work out where the bottleneck might be and to improve it.

This works because if barcodes is a list, then python has to compare your candidate barcode against each of the 40M barcodes to determine if it correct. However, if barcodes is a set, it will hash the barcode (create a reduced representation of it), and can then immediately know if the test barcode is in the set without having to compare it to each and every one.

I also noted that the curr_barcode and min_distance is not being reset with each read.

#set correct barcode to "None"
corr_barcode=None
#set the minimum hamming distance to "None"
min_distance=19

needs to be inside the

for title, sequence, qual in FastqGeneralIterator(handle):

loop

ADD REPLY • link 20 months ago by i.sudbery 21k

0

Entering edit mode

Actually, this I noticed, and I reset min_distance = 19 under the second

output.append("@%s\n%s+\n%s\n" % (title, corr_barcode, qual))
min_distance = 19

ADD REPLY • link 20 months ago by Vijith ▴ 90

0

Entering edit mode

Okay, if the above is still not working well enough, you can try this:

#Define the path where fastq file is stored.
import os
from Bio.SeqIO.QualityIO import FastqGeneralIterator 
from time import localtime, strftime
from collections import defaultdict
from collections import Counter

path=os.path.join(os.path.expanduser("~"), "Desktop", "SVA1_S1_R2_001.fastq")
stats = counter()

#create a new file handle to write the file
path_new=os.path.join(os.path.expanduser("~"), "Desktop", "trimmedfastq.fastq")
path_ust_barcode=os.path.join(os.path.expanduser("~"), "Desktop", "UST40MBarcodes.txt")
write_file=open(path_new, "w")

#open the text file containg the correct pool of barcodes
now = strftime("%H:%M:%S", localtime())
print ("#%s\tLoading barcodes")
barcodes = set()
barcode_tree = defaultdict(set)
with open(path_ust_barcode, "r") as fh:
   for bc in fh:
       barcodes.add(bc.rstrip())
       barcode_tree[bc[:11]].add(bc[11:])
now = strftime("%H:%M:%S", localtime())
print ("#%s\tBarcodes loaded\n#%s\tMatching reads" % (now, now)))
handle=open(path, "r")  
output = []

#===================================================
# Return the Hamming distance between string1 and string2.
def hamming_distance(string1, string2): 
    # Start with a distance of zero, and count up
    distance = 0
    # Loop over the indices of the string
    L = len(string1)
    for i in range(L):
        # Add 1 to the distance if these two characters are not equal
        if string1[i] != string2[i]:
            distance += 1
    # Return the final count of differences
    return distance
#===================================================

def find_best_match(sequence, barcode_list):
    '''Scans the barcodes in barcode list and returns
    the barcode with the lowest hamming distance, and that
    distance. Use min_distance if you have already scanned
    a different list of barcodes.'''

    # if a perfect match exists, just output as is
    if sequence in barcode_list:
        return (sequence)

    curr_barcode = None
    min_distance = len(sequence) + 1

    for read in barcode_list:

        h_distance=hamming_distance(sequence, read)

        # since if there was a perfect match we won't be here, 
        # if distance is 1, we'll never do better
        if h_distance == 1:
            return (read)

        if  h_distance <= min_distance:
            min_distance=h_distance 
            corr_barcode=read

    return (corr_barcode)


#open the fastq file

for title, sequence, qual in FastqGeneralIterator(handle):

    stats["reads"] += 1

    if sequence in barcodes:
        stats["perfect_match"] += 1
        output.append("@%s\n%s+\n%s\n" % (title, sequence, qual))
        continue

    if sequence[:11] in barcode_tree:
        corr_barcode = sequence[:11] + find_best_match(sequence[11:], barcode_tree[sequence[:11]])
        stats["First 11 match"] += 1
    else:
        stats["Full search"] += 1
        corr_barcode = find_best_match(sequence, barcodes)

    output.append("@%s\n%s+\n%s\n" % (title, corr_barcode, qual))
    if stats["reads"] % 100 == 0:
        now = strftime("%H:%M:%S", localtime())
        print ("#%s\t%i reads processed, %i perfect matches, %i first 11 matches, %i full searches" % 
               (now, stats["reads"], stats["perfect_match"], stats["First 11 match"], stats["Full search"]))

now = strftime("%H:%M:%S", localtime())
print ("#%s\tAll Reads Processed\n#%s\tOutputting Results" % (now, now)))
with open(path_new, "a") as wf:
    for line in output:
        wf.write(line)

handle.close()
now = strftime("%H:%M:%S", localtime())
print("#%s\tDone" % now)

This will probably double your memory usage again. What it does is to split the barcode into the first 11 and the rest. If first checks if the barcode you are examining matches to the first 11 bp of any barcodes in the set. If it does, then it only compares the barcode to those others. If does this by creating a dictionary that uses the first 11nt as keys, and the value is a list of all the possible last 7 nucleotides. This means that each barcode is checked against a far smaller number of candidates. If the first 11 don't match anything, then the full list is searched, which will be very slow.

I've also added some logging code so that you can see what is happening.

ADD REPLY • link 20 months ago by i.sudbery 21k

0

Entering edit mode

i.sudbery, I haven't run the recent code that you suggested; meanwhile, I would like to draw your attention to the previous code and I would like to know if it will generate the required output. In this screenshot, it is the code that you've suggested, and just to see if items are being populated in the list to a final list length of 39945 (This is the number of items in one of 1000 ustbarcode.txt split-files), I added a print function to display the list length. I assume, 39945 sequences from the fastq file will match with barcodes in the text file. So, the length of the list() shouldn't exceed 39945. But, as you can see while printing, the said value exceeds 39945; It is in lakhs. I don't know if the "for loop" is doing its job correctly. What I'm trying to say is, if h_distance is 18 (this can happen if the sequence and read are totally dissimilar), it is still less than min_distance (19), and if none of the barcodes holds an h_distance less than 18, in the end of the iteration a dissimilar read will be appended to the list. This means, for every sequence (of 3.7 core barcodes in the fastq), there is going to be a barcode appended to the list. I think the program is going to take time until it creates a list of size 3.7 crore - the number of barcodes in the fastq.

ADD REPLY • link 20 months ago by Vijith ▴ 90

0

Entering edit mode

In your code, the position you've put the count+=1 means that it doesn't get executed if there is a perfect match. If you want to count the number of incoming reads, you should put the count += 1 at the top of the loop. If you put the count at the top and the len(output) at the end, then you can see reads in vs reads out.

There is only one way in which the number of reads in the output could be larger than the number of reads in the input, and that is if append is called more than once for each input read. And as far as I can see, the only way that could happen would be if the loop isn't correctly advancing if a perfect match is found. I can't see that bit of the code thought, and really I'd need the code in my hands, along with the data to see if that was what was happening.

Is this after changing to use a set rather than a list for the barcodes?

ADD REPLY • link 20 months ago by i.sudbery 21k

0

Entering edit mode

I have taken a screenshot of the process. Here, I am using the raw barcode fastq file and the original UST40mbarcode.txt file. I have quick doubt: any prospects of this process getting a bit faster if I run the program with one of the split raw barcode fastq files (instead of splitting the barcode text file), and the complete original barcode text file? It looks like it is taking ~ 3-7 minutes for processing every batch of 100 reads. Also, I was just wondering if you can also try it out if I share with you the files (via a drive link)?

ADD REPLY • link 20 months ago by Vijith ▴ 90

0

Entering edit mode

Something is wrong here, it shouldn't be this slow. If you share the UST40mbarcode file and one of your 1000 fastq files, I'll try to take a quite look at it.

ADD REPLY • link 20 months ago by i.sudbery 21k

0

Entering edit mode

i.sudbery, thank you so much. This is a drive link, and you can access the barcode files, here. https://drive.google.com/drive/folders/1XQRybcD0Lk6uaBQEYwZ47lrKyS5-4GW8?usp=share_link

ADD REPLY • link 20 months ago by Vijith ▴ 90

0

Entering edit mode

Here you can find the files. https://drive.google.com/drive/folders/1XQRybcD0Lk6uaBQEYwZ47lrKyS5-4GW8?usp=share_link

ADD REPLY • link 20 months ago by Vijith ▴ 90

0

Entering edit mode

@1.sudbery, could you gain access to the files I've shared?

ADD REPLY • link 20 months ago by Vijith ▴ 90

0

Entering edit mode

@i.sudbery, could you run the program?

ADD REPLY • link 20 months ago by Vijith ▴ 90