Hello, my lab is working on SSR sequencing: we have designed specific SSR primers and are trying to capture the regions between consecutive SSR primers. So far I have been using "seqkit locate" to find exact matches to the primer+anchor sequences. I have not yet done any QC on the demultiplexed data, so this is run on the raw reads.
zcat 221027_MN01111_0087_A000H535FM.XXXX.R1.fastq.gz | seqkit locate -f pattern.fa >221027_MN01111_0087_A000H535FM_XXX_R1_locate.txt
However, I noticed that we pick up partial repeat primer sequences, or even a partial primer followed by a complete primer sequence, at the beginning of the read (like a primer dimer). An example:
Here, what I thought was the ISSR (the region between two repeats) is actually another partial repeat primer from my list. How can I make the search pattern more flexible? Are there any tools I could try? Thanks.
Please post some actual text format data.
bbduk.sh would be another tool to try.
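To make "more flexible" concrete, here is a minimal Python sketch of the two checks the question seems to need: locating a primer while tolerating a few mismatches, and flagging reads that begin with only the tail end of a primer. The primer and read sequences, the function names, and the thresholds (max_mismatch, min_len) are all made up for illustration and are not taken from the original data. As far as I recall, seqkit locate also has a -m/--max-mismatch option for the first part, so check its help output before writing anything custom.

```python
def hamming_hits(read, primer, max_mismatch=1):
    """Return (start, mismatches) for each read window within max_mismatch of primer."""
    hits = []
    m = len(primer)
    for start in range(len(read) - m + 1):
        mismatches = 0
        for a, b in zip(read[start:start + m], primer):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatch:
                    break  # too many mismatches, give up on this window
        else:
            # inner loop finished without break: window is close enough
            hits.append((start, mismatches))
    return hits


def partial_primer_at_start(read, primer, min_len=8):
    """Length of the longest proper primer suffix (>= min_len) the read starts with, else 0."""
    for k in range(len(primer) - 1, min_len - 1, -1):
        if read.startswith(primer[-k:]):
            return k
    return 0


# Hypothetical primer and reads, for illustration only.
primer = "ACGTTGCAAC"
read_with_mismatch = "TTTACGTTGCTACGGG"   # primer at offset 3 with one substitution
read_with_partial = "TGCAACGGGGTTTT"      # starts with the last 6 bases of the primer

print(hamming_hits(read_with_mismatch, primer, max_mismatch=1))
print(partial_primer_at_start(read_with_partial, primer, min_len=5))
```

Reads where partial_primer_at_start returns a non-zero length for any primer in the list are candidates for the primer-dimer-like cases described in the question and could be trimmed or set aside before extracting the inter-primer region. This only handles substitutions; indels would need a proper aligner or an edit-distance search.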