Question

NuGEN Ovation RRBS Methyl-Seq System: how to use nudup.py to remove PCR duplicates (RRBS)

2.9 years ago

BB • 0

Hi,

I've used the NuGEN Ovation RRBS Methyl-Seq System. I'm trying to follow the instructions in https://github.com/nugentechnologies/NuMetRRBS. The only thing I don't understand is how to remove duplicates.

I have FASTQ files of reads that I can trim and align, but I don't have an index.fq file (according to the docs: "FASTQ file containing the molecular tag sequence for each read name in the corresponding SAM/BAM file"). Am I supposed to generate this file, or is it supposed to be provided by the sequencing service?
Do I need a 12-nucleotide index read to remove duplicates?

The user guide states: "If you wish to utilize this PCR duplicate marking feature, increase the index read from 6 to 12 nucleotides, then use the Tecan-provided Duplicate Marking tool, NuDup, to identify and discard any PCR duplicates found".

On the other hand, the nudup.py documentation (https://github.com/nugentechnologies/NuMetRRBS) states "If the index FASTQ read length is 6, 8, 12, 14, or 16nt long as expected for Tecan products, the molecular tag sequence to be extracted from the read according to -s and -l parameters, otherwise the molecular tag will be extracted from the header of the FASTQ entry."

As far as I can tell my reads seem to have a 6 base index. Take for example a read in one of my fastq files named FGC1866_s_4_2_GTCGTA.fastq.gz:

@K00315:137:HVJ5KBBXX:4:1101:1560:1490 2:N:0:GTCGTA
TTCGATTTCCAACGTATATATTTTTTTTTTTTTCTCACTCATATAAAATATTCTACAATATAATTTTCGTCATTTTCCATGTTTTTGATTATACCTCATTAATATACACTATTCTAAAATACCGAATTATCAAAAAAATACACATTTAAA
+
AAAFFJJJJJJJJJJJJJJJJJJJJ---<-A7--<AJ-7FJ7F<AFFA--AAA--AAFF-<-A<FAFJ-<--<-AA<----7<<FA--7--7--77--7A<AF-A----<-<FJFJ-77<--7------77---------7-----7---

For future experiments, should I ask the sequencing core to use a 12-base index? Should my FASTQ header format then be something like @K00315:137:HVJ5KBBXX:4:1101:1560:1490 2:N:0:GTCGTACTCACT?

Now that you have seen the files I have, can I remove PCR duplicates from these after alignment or not?

I'm confused and would appreciate any help you can provide. Thanks!

PCR_duplicates RRBS NuGEN • 1.0k views

score 1 · Answer 1 · 2022-06-09

"If you wish to utilize this PCR duplicate marking feature, increase the index read from 6 to 12 nucleotides"

If that is the case, then you are out of luck: your data appear to have been sequenced with only a 6 bp index.
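If you want to confirm the index length across a whole file rather than from a single read, a minimal Python sketch along these lines could help. It assumes Illumina-style headers like the one you posted, where the index is the last colon-separated field after the space. The function names and the 10,000-read sample size are my own choices for illustration and are not part of nudup.

```python
import gzip
from collections import Counter
from itertools import islice


def index_from_header(header):
    """Return the index sequence from an Illumina-style FASTQ header line.

    Assumes the form "@<read id> <read>:<filtered>:<control>:<index>".
    Returns an empty string if the header has no comment field.
    """
    parts = header.strip().split(" ", 1)
    if len(parts) < 2:
        return ""
    return parts[1].rsplit(":", 1)[-1]


def index_length_counts(fastq_path, max_reads=10000):
    """Count how often each index length occurs in the first max_reads reads."""
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    counts = Counter()
    with opener(fastq_path, "rt") as handle:
        # In a four-line-per-record FASTQ file, every 4th line is a header.
        for header in islice(handle, 0, 4 * max_reads, 4):
            counts[len(index_from_header(header))] += 1
    return counts


if __name__ == "__main__":
    example = "@K00315:137:HVJ5KBBXX:4:1101:1560:1490 2:N:0:GTCGTA"
    print(index_from_header(example))  # -> GTCGTA
```

Running `index_length_counts("FGC1866_s_4_2_GTCGTA.fastq.gz")` on your file should show whether every sampled read carries a 6-base index, as the example read suggests.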

"For future experiments, should I ask the sequencing core to use a 12-base index?"

Yes. This run will need to be repeated if you wish to use that tool.

That said, if you are not able to re-sequence, you may be able to use clumpify.sh (part of BBTools) to achieve a similar effect with your current data. See "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates."
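As a rough sketch of how the clumpify.sh call might be assembled from Python, assuming BBTools is installed and on your PATH: the helper name and output file names below are hypothetical, and the only clumpify.sh arguments used are `in=`, `out=`, `in2=`, `out2=` and `dedupe`. Keep in mind that this removes reads with duplicate sequences and does not use molecular tags, so it is not equivalent to what nudup does.

```python
import subprocess


def clumpify_dedupe_command(fastq_in, fastq_out, fastq_in2=None, fastq_out2=None):
    """Build a clumpify.sh argument list for sequence-based duplicate removal.

    Hypothetical helper for illustration; pass the second pair of paths
    only for paired-end data.
    """
    cmd = ["clumpify.sh", f"in={fastq_in}", f"out={fastq_out}", "dedupe"]
    if fastq_in2 and fastq_out2:
        cmd += [f"in2={fastq_in2}", f"out2={fastq_out2}"]
    return cmd


if __name__ == "__main__":
    # Output name is a placeholder; adjust to your own file layout.
    cmd = clumpify_dedupe_command(
        "FGC1866_s_4_2_GTCGTA.fastq.gz",
        "FGC1866_s_4_2_GTCGTA.dedup.fastq.gz",
    )
    print(" ".join(cmd))
    # Uncomment to actually run it once BBTools is available:
    # subprocess.run(cmd, check=True)
```

This would be run on the raw or trimmed FASTQ files before alignment, since clumpify.sh works on reads rather than on SAM/BAM files.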