NovaSeq Duplicates
3.7 years ago • trio.qbm ▴ 20

Dear community,

Our group switched to NovaSeq, and since then we have seen very strange data: extremely high read counts (over 200 million reads in the BAM files for paired-end data), while the number of reads remaining after Picard MarkDuplicates drops drastically. In the most extreme case we went from 270 million to 9 million reads. This is ChIP-seq data (transcription factors).

I am not on the experimental side, but as far as I know the library prep protocol was not changed. Is there something we should know, or a way to solve this issue? Can the data be used as it is now? Before NovaSeq (on the HiSeq) we did not have this kind of problem.
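For reference, the deduplication step is an ordinary MarkDuplicates run along these lines (only a sketch; the file names are placeholders, not our exact command):

    # Mark duplicates, remove them from the output BAM, and write the metrics file
    java -jar picard.jar MarkDuplicates \
        I=sample.bam \
        O=sample.dedup.bam \
        M=sample.dup_metrics.txt \
        REMOVE_DUPLICATES=true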

Thank you for any advice.

Regards from our mini group QBM

duplicationrate ChIP-seq
3.7 years ago

It may be that, because of the much higher throughput of the NovaSeq machines, your libraries are not complex enough, so you end up sequencing the same molecules over and over, which shows up as what look like duplicates.

3.7 years ago

200-300 million reads is much higher than what is normally done for ChIP-seq samples. Since ChIP-seq can start with a fairly low number of molecules going into PCR, your initial library complexity is probably low, meaning a low number of unique molecules to sequence per sample. Since PCR duplicates tend to be fairly uniformly distributed, you have probably sequenced most of the unique molecules by 10 million or so reads, and for the remainder you are just sequencing PCR duplicates of those molecules.
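If you want to sanity-check this on your own data, the metrics file that Picard MarkDuplicates writes already reports PERCENT_DUPLICATION and ESTIMATED_LIBRARY_SIZE (a rough estimate of how many unique molecules the library contains); if that estimate is only a few tens of millions, adding more reads cannot add new information. A minimal sketch, assuming the metrics file from your MarkDuplicates run is called sample.dup_metrics.txt:

    # Print the metrics header line and the data row below it
    grep -A1 '^LIBRARY' sample.dup_metrics.txt | column -t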

3.7 years ago • GenoMax 147k

In addition to what others have said, my recommendation is to run clumpify.sh (A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files) on this dataset. This will let you identify duplicates without doing any alignment, and find out how many optical duplicates you have as opposed to other types of duplicates. It is possible that your facility is overloading these flowcells, in addition to the fact that these are low-complexity libraries to begin with.
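A minimal sketch of the kind of clumpify.sh call meant here (file names and the dupedist value are assumptions; check the BBTools documentation for the distance recommended for your instrument):

    # Remove only optical/co-located duplicates, straight from the FASTQ files.
    # dupedist sets how far apart clusters may be and still count as optical;
    # patterned flowcells such as NovaSeq need a much larger value than the default.
    # Dropping optical=t makes dedupe remove all duplicates, so running it both
    # ways lets you compare optical vs. total duplicates.
    clumpify.sh in=sample_R1.fastq.gz in2=sample_R2.fastq.gz \
        out=clumped_R1.fastq.gz out2=clumped_R2.fastq.gz \
        dedupe=t optical=t dupedist=12000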


Thank you all for the very fast replies! I will try out clumpify for sure.

I would still like to hear your opinion: would you redo the sequencing or not?


Personally, I would not redo the sequencing; in most cases that will not give a "better" result. If you redo the whole experiment (e.g. starting from the sample and library prep), you might be able to run a more efficient sequencing, but in the end the result you have now is not wrong; it is merely an artifact of the improved sequencing technology.

