I have a list of 90,000 barcodes, each 15 bp long.
Now I have to compare another list of millions of 14-16 bp sequences against the barcodes. I want to calculate the Levenshtein distance (edit distance) and keep matches with a distance of 0 or 1.
What tool would you ideally use to do this?
I guess it has to be a fast implementation, since we have about 90,000 x 3,000,000 = 270 billion comparisons. So it has to be a method that works with some type of index of the barcodes.
Wouldn't it make sense to generate, from your 90k file, all possible sequences that are within that edit distance, then filter the second list for these (even if just with the awk 'seen' trick), keeping track of which original barcode each variant came from? That way you avoid the billions of pairwise comparisons, and it comes down to keeping the entries that match a line in a file; see the sketch below.
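For the record, a minimal sketch of that idea in awk (barcodes.txt, reads.txt, and the output names are placeholders; one sequence per line is assumed). Each 15 bp barcode yields at most 140 printed variants (itself, 15 deletions, 60 substitutions, 64 insertions), so the whole neighborhood is roughly 12-13 million strings and fits comfortably in memory:

    # Step 1: expand every barcode into its Levenshtein-distance-<=1
    # neighborhood (the barcode itself plus all single-base deletions,
    # substitutions, and insertions), keeping the originating barcode.
    awk 'BEGIN { split("A C G T", b, " ") }
    {
        bc = $1
        print bc, bc                                  # distance 0
        for (i = 1; i <= length(bc); i++) {
            pre = substr(bc, 1, i - 1)
            print pre substr(bc, i + 1), bc           # deletion at position i
            for (j = 1; j <= 4; j++) {
                print pre b[j] substr(bc, i + 1), bc  # substitution at position i
                print pre b[j] substr(bc, i), bc      # insertion before position i
            }
        }
        for (j = 1; j <= 4; j++)
            print bc b[j], bc                         # insertion after the last base
    }' barcodes.txt | sort -u > neighborhood.txt

    # Step 2: the awk "seen" lookup -- keep reads whose sequence occurs in the
    # neighborhood, printing the read and the barcode it belongs to.
    awk 'NR == FNR { seen[$1] = $2; next }
         ($1 in seen) { print $1, seen[$1] }' neighborhood.txt reads.txt > matches.txt

One caveat: a variant that lies within distance 1 of two different barcodes is assigned to only one of them by the seen[] array, so if your barcodes are not guaranteed to be mutually distant, you may want to check neighborhood.txt for such collisions first.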
I think bowtie2 can be persuaded to do that with some tweaking of the parameters. I'm not sure if the edit distance it reports is Levenshtein, but maybe you are not too strict about that.
I used bowtie2 to align FASTQ reads against a library of guide RNAs with settings along the lines of the sketch below, where my90k.fa would be your indexed file of the 90,000 barcodes. In your case you should reduce -L (the seed length), and if you want all matches for each read, add the -a option. Check also whether the --local option is necessary. If your 3 million reads are in FASTA, also add the -f option.
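The exact command isn't shown here, but a minimal sketch consistent with those flags, assuming the reads are in reads.fa (FASTA) and picking -L 10 and -N 1 as illustrative values rather than the original settings, could be:

    # Build the index from the 90k barcodes, then align.
    bowtie2-build my90k.fa my90k
    # -f: FASTA input; -U: unpaired reads; -L 10: seed shorter than the
    # default; -N 1: allow one mismatch in the seed; -a: report all alignments.
    bowtie2 -x my90k -f -U reads.fa -L 10 -N 1 -a -S hits.sam

Bowtie2 reports the edit distance of each alignment in the NM tag of the SAM output, so you could post-filter hits.sam for NM:i:0 and NM:i:1.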
You could try bbduk.sh in filter mode. A guide is available at https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/. It can use multiple threads and should be plenty fast, as long as you can make it work for your use case.
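A hedged sketch of what such a call might look like, assuming the barcodes are in barcodes.fa and the reads in reads.fq (both names are placeholders); k=15 matches the barcode length and edist=1 allows one edit, but check the guide above for the exact matching semantics:

    # Reads containing a barcode k-mer within edit distance 1 go to matched.fq.
    bbduk.sh in=reads.fq outm=matched.fq out=unmatched.fq ref=barcodes.fa k=15 edist=1 threads=8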