I have a FASTQ file of 400,000 reads, so speed is important. Each read should contain the same barcode twice. Given a barcode, I want to find the reads in which that barcode occurs twice, with <= 2 mismatches allowed per occurrence. For example, with the barcode 'ATTCGACCGATAGG', I would like to retrieve all of the following sequences:
TATCTTGTGGAAAGGACGAAACACCGAACACAAAGCATAGATGCGTTTAAGAGCTATGCTGGAAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTT**ATTCGACCGATAGG**GGTGGCAGGGGAGGCCGAGGAGGAAGAAGGGGAGGTGGCAG**ATTCGACCGATAGG**TGGCGTAACTAGATCTTGAGACAAA
TATCTTGTGGAAAGGACGAAACACCGGTCCGAGCAGAAGAAGAAGTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTT**ATTCGACCGATAGG**GGTGGCAGGGGAGGCCGAGGAGGAAGAAGGGGAGGTGGCAG**ATTCGACCGATAGG**TGGCGTAACTAGATCTTGAGACAAA
TATCTTGTGGAAAGGACGAAACACCGAGTCCGAGCAGAAGAAGAAGTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTT**ATTCGACCGATAGG**GGTGGCAGGGGAGGCCGAGGAGGAAGAAGGGGAGGTGGCAG**ATTCGACCGATAGG**TGGCGTAACTAGATCTTGAGACAAA
TATCTTGTGGAAAGGACGAAACACCGAGTCCGAGCAGAAGAAGAAGTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTT**ATTCGACGATAGG**GGTGGCAGGGGAGGCCGAGGAGGAAGAAGGGGAGGTGGCAG**ATTCGACCGATAGG**TGGCGTAACTAGATCTTGAGACAAA
Note that the first barcode in the fourth sequence is one base short (a deletion, not a substitution). I have tried Biopython and regular expressions, but both are too slow given that I have 5k barcodes. Is there a fast solution in Python, or with something like grep, awk, or any other tool? Thanks.
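To pin down the matching criterion, here is a minimal plain-Python sketch, assuming that "<= 2 mismatches" should also cover insertions and deletions (as in the fourth sequence). It uses a standard semi-global edit-distance recurrence and counts non-overlapping hits greedily; the function names `count_barcode_hits` and `has_barcode_twice` are made up for illustration. It is far too slow for 5k barcodes against 400,000 reads and is only meant as a reference to check a faster tool against.

```python
def count_barcode_hits(read: str, barcode: str, max_errors: int = 2) -> int:
    """Count non-overlapping approximate occurrences of barcode in read.

    Errors are edits (substitutions, insertions, deletions). A hit is
    reported greedily at the first position where some substring ending
    there is within max_errors edits of the barcode.
    """
    m = len(barcode)
    fresh = list(range(m + 1))  # column for "no read bases consumed yet"
    col = fresh[:]
    hits = 0
    for base in read:
        new = [0] * (m + 1)  # new[0] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if barcode[i - 1] == base else 1
            new[i] = min(col[i - 1] + cost,  # match or substitution
                         col[i] + 1,         # extra base in the read
                         new[i - 1] + 1)     # base missing from the read
        if new[m] <= max_errors:
            hits += 1
            col = fresh[:]  # restart so the next hit cannot overlap this one
        else:
            col = new
    return hits


def has_barcode_twice(read: str, barcode: str, max_errors: int = 2) -> bool:
    """True if the barcode occurs at least twice within max_errors edits."""
    return count_barcode_hits(read, barcode, max_errors) >= 2
```

Because hits are taken greedily, a hit can be reported a base or two before the true end of the barcode; that does not matter for a yes/no "present twice" filter, but the reported count should not be used to locate barcode boundaries.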
Use cutadapt and control the error rate. Please read the cutadapt manual for an explanation of the parameters.
Edit: updated for FASTQ input instead of FASTA.
Thanks for the reply. Does cutadapt allow for <=n mismatches on the barcodes?
cutadapt lets you set either a maximum error rate or a maximum number of errors (n) per matched index sequence. Please read the section on error rate in the cutadapt manual.
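As a rough illustration of how an error rate translates into a number of allowed errors: as I understand the cutadapt manual, the allowed error count for a match is the matched length times the maximum error rate, rounded down. The helper below is a hypothetical function that only reproduces that arithmetic; it is not part of cutadapt, so check the manual for your version before relying on it.

```python
import math


def allowed_errors(match_length: int, max_error_rate: float) -> int:
    """Errors permitted for a match of the given length (assumed floor rule)."""
    return math.floor(match_length * max_error_rate)


barcode = "ATTCGACCGATAGG"  # 14 nt barcode from the question
for rate in (0.10, 0.15, 0.20):
    print(f"rate {rate:.2f}: {allowed_errors(len(barcode), rate)} error(s) allowed")
```

Under that assumption, a 14 nt barcode needs a rate of at least 2/14 (about 0.143) to permit 2 errors, so a value such as 0.15 would be a reasonable starting point, whereas the commonly used 0.1 would permit only 1.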