My goal is to split a sequence at a specific site into two separate sequences. The search for the site should be somewhat fuzzy, to tolerate errors from the sequencing pipeline (basecalling on MinION).
Example:
Assume the sequence below. X, Y, Q and Z stand in for nucleotides that are not relevant to the problem but are useful for demonstration purposes.
XXXXXXXXXXXXXXYYYYYACTCATAQQQQQQQQQZZZZZZZZZZZZZ
                   |-----|
I would like to find the site ACTCATA (with fuzzy matching) and split the sequence into
XXXXXXXXXXXXXXYYYYY
and
QQQQQQQQQZZZZZZZZZZZZZ
optionally discarding the matched site.
Bonus points if this works on FASTQ files, so that the per-base quality string is split along with the sequence.
This could probably be accomplished the pedestrian way in Biopython, but I was wondering whether I missed a tool that does what I describe above.
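For reference, a minimal sketch of the pedestrian approach, written in plain Python rather than Biopython. It assumes edit distance (substitutions, insertions and deletions) is an acceptable notion of "fuzzy", and that the read and its quality string are already available as two equal-length strings; reading and writing the FASTQ records is left out. The names `find_fuzzy_site`, `split_at_site`, `max_edits` and `discard_site` are made up for illustration.

```python
def find_fuzzy_site(seq, site, max_edits=2):
    """Return (start, end, edits) for the best approximate match of `site`
    in `seq`, or None if no match has at most `max_edits` edits.

    Standard dynamic programming for approximate substring matching:
    a match may start anywhere in `seq` at no cost.
    """
    m = len(site)
    # prev[i] = (edits, start) for aligning site[:i] to a substring of seq
    # that ends at the previous position and begins at `start`.
    prev = [(i, 0) for i in range(m + 1)]
    best = None
    for j, base in enumerate(seq, start=1):
        cur = [(0, j)]  # empty site prefix: free start at position j
        for i in range(1, m + 1):
            candidates = [
                # match or substitution
                (prev[i - 1][0] + (site[i - 1] != base), prev[i - 1][1]),
                # extra base in the read
                (prev[i][0] + 1, prev[i][1]),
                # base of the site missing from the read
                (cur[i - 1][0] + 1, cur[i - 1][1]),
            ]
            cur.append(min(candidates, key=lambda c: c[0]))
        edits, start = cur[m]
        # Keep the first match with the fewest edits.
        if edits <= max_edits and (best is None or edits < best[2]):
            best = (start, j, edits)
        prev = cur
    return best


def split_at_site(seq, qual, site, max_edits=2, discard_site=True):
    """Split a read and its quality string at the best fuzzy match of `site`.

    Returns ((left_seq, left_qual), (right_seq, right_qual)), or None if the
    site is not found. If `discard_site` is False, the matched bases stay
    attached to the left fragment (an arbitrary choice for this sketch).
    """
    hit = find_fuzzy_site(seq, site, max_edits)
    if hit is None:
        return None
    start, end, _ = hit
    left_end = start if discard_site else end
    return (seq[:left_end], qual[:left_end]), (seq[end:], qual[end:])


if __name__ == "__main__":
    read = "XXXXXXXXXXXXXXYYYYY" + "ACTGATA" + "QQQQQQQQQZZZZZZZZZZZZZ"
    qual = "I" * len(read)  # dummy quality string
    print(split_at_site(read, qual, "ACTCATA", max_edits=1))
```

The run time is proportional to read length times site length, which is fine for a short site but slow in pure Python on a full MinION run.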
Are you looking for adapter sequences? If so: https://github.com/rrwick/Porechop
Thank you @WouterDeCoster. I may end up using this in another part of the pipeline.