I am attempting to infer the identity of an unknown 5-mer present in amplified fragments after first-strand synthesis using the SMART-Seq v4 kit.
I want to amplify fragments using this oligo from the original reverse-transcribed products before Illumina library preparation.
I am using the ShortRead Bioconductor package to sample ~1e6 reads from a few hundred untrimmed single-cell FASTQ pairs, then filtering to exclude poly-A or poly-T sequences and listing the most frequent 5-mers immediately following the known oligo sequence, AAGCAGTGGTATCAACGCAGAGTAC. I am finding an overrepresentation of GGGNN sequences. Is there some explanation for this pattern? Could it be related to GC content or repetitive elements that this naive approach does not account for?
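
To make the counting step concrete, here is a minimal sketch of it in plain Python rather than ShortRead. It assumes the reads are already sampled and held as strings, and the function name `downstream_kmers` and the `homopolymer_run` threshold are hypothetical choices for illustration. The poly-A/poly-T filter here only checks the 5-mer itself, which is simpler than filtering whole reads.

```python
from collections import Counter

# Known oligo sequence from the question.
OLIGO = "AAGCAGTGGTATCAACGCAGAGTAC"
K = 5  # length of the unknown downstream sequence


def downstream_kmers(reads, oligo=OLIGO, k=K, homopolymer_run=4):
    """Count the k-mers found immediately after the first oligo match in each read.

    `homopolymer_run` is an assumed threshold: k-mers containing that many
    consecutive A or T bases are treated as poly-A/poly-T and skipped.
    """
    counts = Counter()
    for seq in reads:
        pos = seq.find(oligo)
        if pos == -1:
            continue  # oligo not present in this read
        start = pos + len(oligo)
        kmer = seq[start:start + k]
        if len(kmer) < k or "N" in kmer:
            continue  # read ends too early, or contains an ambiguous base call
        if "A" * homopolymer_run in kmer or "T" * homopolymer_run in kmer:
            continue  # crude poly-A / poly-T exclusion
        counts[kmer] += 1
    return counts


if __name__ == "__main__":
    # Toy reads for illustration only, not real data.
    example_reads = [
        OLIGO + "GGGATCCC",
        "TT" + OLIGO + "GGGAT",
        OLIGO + "TTTTT",
        "ACGTACGT",
    ]
    for kmer, n in downstream_kmers(example_reads).most_common(10):
        print(kmer, n)
```

Only exact matches to the oligo on the forward strand are counted here; reads carrying a mismatch in the oligo, or the reverse complement, would need extra handling.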