Find a pattern that is present twice, allowing <=2 mismatches in each occurrence
3.8 years ago
nafizh • 0

I have a FASTQ file of 400,000 reads, so speed is important. Each sequence should contain the same barcode twice. Given a barcode, I want to find the sequences in which the barcode is present twice, with <= 2 mismatches per occurrence. For example, with the barcode 'ATTCGACCGATAGG', I would like to retrieve all of the following sequences:

>TATCTTGTGGAAAGGACGAAACACCGAACACAAAGCATAGATGCGTTTAAGAGCTATGCTGGAAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTT**ATTCGACCGATAGG**GGTGGCAGGGGAGGCCGAGGAGGAAGAAGGGGAGGTGGCAG**ATTCGACCGATAGG**TGGCGTAACTAGATCTTGAGACAAA
TATCTTGTGGAAAGGACGAAACACCGGTCCGAGCAGAAGAAGAAGTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTT**ATTCGACCGATAGG**GGTGGCAGGGGAGGCCGAGGAGGAAGAAGGGGAGGTGGCAG**ATTCGACCGATAGG**TGGCGTAACTAGATCTTGAGACAAA
TATCTTGTGGAAAGGACGAAACACCGAGTCCGAGCAGAAGAAGAAGTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTT**ATTCGACCGATAGG**GGTGGCAGGGGAGGCCGAGGAGGAAGAAGGGGAGGTGGCAG**ATTCGACCGATAGG**TGGCGTAACTAGATCTTGAGACAAA
TATCTTGTGGAAAGGACGAAACACCGAGTCCGAGCAGAAGAAGAAGTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTT**ATTCGACGATAGG**GGTGGCAGGGGAGGCCGAGGAGGAAGAAGGGGAGGTGGCAG**ATTCGACCGATAGG**TGGCGTAACTAGATCTTGAGACAAA

Note that the first barcode in the fourth sequence is one character short. I have tried Biopython and regex, but both are too slow given that I have 5k barcodes. Is there a fast solution available in Python, or in something like grep or awk? Thanks.

fastq grep python awk barcode • 936 views

Use cutadapt and control the error rate. Please read the cutadapt manual for an explanation of the parameters:

$ cutadapt --action=none --trimmed-only -g ATTCGACCGATAGG...ATTCGACCGATAGG input.fq

edit: edited for fastq, instead of fasta
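
For comparison, here is a minimal pure-Python sketch of the matching rule in the question: a read is kept if the barcode occurs at two non-overlapping positions with at most 2 substitutions each. The function names `hamming_hits` and `has_barcode_twice` are made up for illustration. The sketch only counts substitutions, so it would not catch the one-base deletion in the fourth example sequence, and it makes no claim to be fast enough for 5k barcodes.

```python
def hamming_hits(seq, barcode, max_mismatches=2):
    """Return start positions where barcode matches seq with
    at most max_mismatches substitutions (no insertions/deletions)."""
    k = len(barcode)
    hits = []
    for i in range(len(seq) - k + 1):
        mismatches = 0
        for a, b in zip(seq[i:i + k], barcode):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches, give up on this window
        else:
            # loop finished without break: window is within the limit
            hits.append(i)
    return hits


def has_barcode_twice(seq, barcode, max_mismatches=2):
    """True if seq holds two non-overlapping approximate copies of barcode."""
    hits = hamming_hits(seq, barcode, max_mismatches)
    # first and last hit must be at least one barcode length apart
    return bool(hits) and hits[-1] - hits[0] >= len(barcode)


if __name__ == "__main__":
    barcode = "ATTCGACCGATAGG"
    # second copy carries two substitutions (positions 0 and 10)
    read = "NNNN" + barcode + "NNNNNNNN" + "CTTCGACCGAAAGG" + "NNNN"
    print(has_barcode_twice(read, barcode))
```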


Thanks for the reply. Does cutadapt allow for <=n mismatches on the barcodes?


Cutadapt lets you set a maximum error rate, or a maximum number of mismatches (n), per matched index sequence. Please read the cutadapt manual on error rate.
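
As a rough illustration of how a rate maps to a count: to my understanding of the cutadapt manual, when the limit is given as a rate (the `-e` option), the number of allowed errors for a match is the matched length times the rate, rounded down. The helper `allowed_errors` below is a hypothetical name that only reproduces that arithmetic; it is not part of cutadapt. Also note that cutadapt counts insertions and deletions as errors unless told otherwise, so check the manual if you need strict mismatches only.

```python
import math


def allowed_errors(match_length, max_error_rate):
    """Errors tolerated for a match of the given length, assuming the
    rule is floor(length * rate) as described in the cutadapt manual."""
    return math.floor(match_length * max_error_rate)


if __name__ == "__main__":
    barcode = "ATTCGACCGATAGG"  # 14 nt
    # the default rate of 0.1 would allow only one error on 14 nt
    print(allowed_errors(len(barcode), 0.1))
    # any rate of at least 2/14 (about 0.143) allows two errors
    print(allowed_errors(len(barcode), 0.15))
```

So for this 14-nt barcode, a rate such as 0.15 should correspond to the "<= 2" limit asked for in the question.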
