Rapid lookup of barcodes
1
2
Entering edit mode
2.4 years ago
Michael ▴ 270

I have a list of 90,000 barcodes with length of 15bp.

Now I have to compare another list with millions of 14-16bp long sequences against the barcodes. I want to calculate the levenshtein-distance (edit distance) and keep the matches of distance 0 or 1.

What tool would you ideally use to do this?

I guess it has to be a fast implementation since we have about 90,00 x 3,000,000 = 270 Billion comparisons. So it has to be a method which works with some type of index of the barcodes

levenshtein-distance NGS barcodes edit-distance • 917 views
ADD COMMENT
1
Entering edit mode

You could try bbduk.sh in filter mode. A guide is available: https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/

It can use multiple threads and will be plenty performant as long as you can make it work for your use case.

ADD REPLY
0
Entering edit mode

Wouldn't it make sense to generate all possible barcodes based on your 90k file that are within that edit distance, then filter the second list for these (and be it with awk/seen) and then retain those entries that were part of the original 90k file? That way you avoid making these billions of comparisons and it comes down to keeping entries that are matched in a file.

ADD REPLY
1
Entering edit mode
2.4 years ago

I think bowtie2 can be persuaded to do that with some tweaking of the parameters. I'm not sure if the edit distance it reports is Levenshtein, but maybe you are not too strict about that.

I used bowtie2 to align fastq reads against a library of guide RNAs using these settings:

bowtie2 -x my90k.fa -U my3M.reads.fq -N 1 -L 18 --local -p 8 --score-min L,10,0

Where my90k.fa would be your indexed file of the 90,000 barcodes. In your case, you should reduce -L (seed length) and if you want all matches for each read, add the -a option. Check also the --local option is necessary. If your 3 million reads are in fasta also add the -f option.

(I'd be interested in which solution you go for)

ADD COMMENT

Login before adding your answer.

Traffic: 2697 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6