Hello all, I am processing reads from a BS/oxBS (bisulfite / oxidative bisulfite) sequencing run and have observed some contamination from short (~60 bp) reads. I suspect these are the spike-in sequences, which are also 60 bp long. I added them as controls during library prep to estimate how well the oxidation and conversion worked. I would like to remove these reads before alignment. The problem is that the spike-in reads are also bisulfite converted, at various positions and to varying degrees.
For example, the SQ6hmC spike-in is: TACGATCACGGCGAATCCGATCGAATCAGTCAAGCGCTTTACGAAGTGCGACAGCCTTAG. Within this sequence, some Cs are unmethylated, some are methylated (5mC), and some are hydroxymethylated (5hmC). After the BS reaction, all unmethylated Cs are converted to Ts. After the oxBS reaction, both unmethylated Cs and 5hmCs are converted to Ts.
I've attached the pic here for all spike-in sequences. Green=5hmC; Red=5mC; Grey=C.
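To make the conversion rules above concrete, here is a minimal Python sketch. The function name `bisulfite_convert` and the modified positions are hypothetical placeholders: the real 5mC/5hmC positions are in the image and are not reproduced here. It only models the forward strand as described in this post.

```python
def bisulfite_convert(seq, protected=()):
    """Convert every C to T except at the 0-based positions in `protected`."""
    protected = set(protected)
    return "".join(
        "T" if base == "C" and i not in protected else base
        for i, base in enumerate(seq)
    )

SQ6HMC = "TACGATCACGGCGAATCCGATCGAATCAGTCAAGCGCTTTACGAAGTGCGACAGCCTTAG"

# Placeholder positions for illustration only (NOT the real spike-in design).
HYPOTHETICAL_5MC = {2}
HYPOTHETICAL_5HMC = {8}

# BS: 5mC and 5hmC are both protected, unmethylated C reads as T.
bs_read = bisulfite_convert(SQ6HMC, HYPOTHETICAL_5MC | HYPOTHETICAL_5HMC)

# oxBS: only 5mC is protected, unmethylated C and 5hmC read as T.
oxbs_read = bisulfite_convert(SQ6HMC, HYPOTHETICAL_5MC)

# Upper bound on conversion: every C becomes T.
fully_converted = bisulfite_convert(SQ6HMC)
```

Because conversion efficiency is never exactly 100%, real reads can fall anywhere between the original sequence and `fully_converted`, which is why matching against one fixed pre-converted sequence is fragile.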
What would be the best way to go about removing these spike-in reads? Thank you!
Thank you for your answer! Running FastQ_Screen did indeed show that the spike-in sequences are distinct from the zebrafish genome I'm working with. The program estimated about 30% zebrafish sequences and 70% something else (not zebrafish, mammal, E. coli, or phiX).
One idea I've been considering is to pre-convert the spike-in sequences manually and then remove any reads that match them from the FASTQ files. I just don't know of any program that can remove reads containing specific sequences from a FASTQ file.
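A rough Python sketch of that idea follows. Instead of enumerating every possible conversion pattern, it collapses both the read and the spike-in to a three-letter alphabet (all C to T, or all G to A for the complementary strands), so any partially converted copy collapses to the same string. The function names are hypothetical, and the sketch assumes uncompressed four-line FASTQ records, no sequencing errors, and that the full 60 bp spike-in is present in the read.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def build_keys(spikes):
    """Collapsed spike-in sequences for C->T and G->A matching."""
    ct_keys, ga_keys = set(), set()
    for spike in spikes:
        for s in (spike.upper(), revcomp(spike.upper())):
            ct_keys.add(s.replace("C", "T"))
            ga_keys.add(s.replace("G", "A"))
    return ct_keys, ga_keys

def is_spike_read(read, ct_keys, ga_keys):
    """True if the collapsed read contains a collapsed spike-in."""
    ct = read.replace("C", "T")
    ga = read.replace("G", "A")
    return any(k in ct for k in ct_keys) or any(k in ga for k in ga_keys)

def filter_fastq(src, dst, spikes):
    """Copy non-spike-in records from src to dst; return (kept, removed)."""
    ct_keys, ga_keys = build_keys(spikes)
    kept = removed = 0
    while True:
        record = [src.readline() for _ in range(4)]
        if not record[0]:  # end of file
            break
        if is_spike_read(record[1].strip().upper(), ct_keys, ga_keys):
            removed += 1
        else:
            dst.writelines(record)
            kept += 1
    return kept, removed

SQ6HMC = "TACGATCACGGCGAATCCGATCGAATCAGTCAAGCGCTTTACGAAGTGCGACAGCCTTAG"
```

For gzipped files, the handles could be opened with `gzip.open(path, "rt")` and `gzip.open(path, "wt")`, then passed as `filter_fastq(src, dst, [SQ6HMC])` with the remaining spike-in sequences added to the list. Reads that were trimmed or contain errors would slip through this exact-match filter, and paired-end files would need both mates dropped together.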
P.S. I'm not sure why the image isn't showing here, but here is the link: https://ibb.co/bwW8sc