NovaSeq Duplicates
3.7 years ago • trio.qbm ▴ 20

Dear community,

Our group switched to NovaSeq, and since then we have seen very strange data: extremely high read counts (over 200 million reads in the BAM files for paired-end data), while the number of reads remaining after Picard MarkDuplicates drops drastically. In the most extreme case we went from 270 million to 9 million reads. This is ChIP-seq data (transcription factors).

I am not on the experimental side, but as far as I know the library prep protocol was not changed. Is there something we should know, or a way to solve this issue? Can the data be used as it is now? Before NovaSeq (on the HiSeq) we did not have this kind of problem.
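For reference, the deduplication step is an ordinary MarkDuplicates run along these lines (only a sketch; the file names are placeholders, not our exact command):

    # Mark duplicates, remove them from the output BAM, and write the metrics file
    java -jar picard.jar MarkDuplicates \
        I=sample.bam \
        O=sample.dedup.bam \
        M=sample.dup_metrics.txt \
        REMOVE_DUPLICATES=true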

Thank you for any advice.

Regards from our mini group QBM

duplicationrate ChIP-seq
3.7 years ago

It may be that, because of the much higher throughput of the NovaSeq machines, your libraries are not complex enough, so you end up sequencing the same molecules over and over, which shows up as what look like duplicates.

3.7 years ago

200-300 million reads is much higher than what is normally done for ChIP-seq samples. Since ChIP-seq can start with a fairly low number of molecules going into PCR, your initial library complexity is probably low, meaning a low number of unique molecules to sequence per sample. Since PCR duplicates tend to be fairly uniformly distributed, you have probably sequenced most of the unique molecules by 10 million or so reads, and for the remainder you are just sequencing PCR duplicates of those molecules.
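If you want to sanity-check this on your own data, the metrics file that Picard MarkDuplicates writes already reports PERCENT_DUPLICATION and ESTIMATED_LIBRARY_SIZE (a rough estimate of how many unique molecules the library contains); if that estimate is only a few tens of millions, adding more reads cannot add new information. A minimal sketch, assuming the metrics file from your MarkDuplicates run is called sample.dup_metrics.txt:

    # Print the metrics header line and the data row below it
    grep -A1 '^LIBRARY' sample.dup_metrics.txt | column -t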

3.7 years ago • GenoMax 147k

In addition to what others have said, my recommendation is to run clumpify.sh (A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files) on this dataset. This will let you identify duplicates without doing any alignment, and find out how many optical duplicates you have as opposed to other types of duplicates. It is possible that your facility is overloading these flowcells, in addition to the fact that these are low-complexity libraries to begin with.
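A minimal sketch of the kind of clumpify.sh call meant here (file names and the dupedist value are assumptions; check the BBTools documentation for the distance recommended for your instrument):

    # Remove only optical/co-located duplicates, straight from the FASTQ files.
    # dupedist sets how far apart clusters may be and still count as optical;
    # patterned flowcells such as NovaSeq need a much larger value than the default.
    # Dropping optical=t makes dedupe remove all duplicates, so running it both
    # ways lets you compare optical vs. total duplicates.
    clumpify.sh in=sample_R1.fastq.gz in2=sample_R2.fastq.gz \
        out=clumped_R1.fastq.gz out2=clumped_R2.fastq.gz \
        dedupe=t optical=t dupedist=12000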


Thank you all for the very fast replies! I will try out clumpify for sure.

I would still like to hear your opinion: would you redo the sequencing or not?


Personally, I would not redo the sequencing; in most cases that will not give a "better" result. If you redo the whole experiment (e.g. starting from the sample and library prep), you might be able to run a more efficient sequencing, but in the end the result you have now is not wrong; it is merely an artifact of the improved sequencing technology.

