We have performed whole exome sequencing on cell-free DNA from pancreatic cancer patients with both the HiSeq 4000 and the NovaSeq 6000 machines. The HiSeq run was 150bp paired-end, and the NovaSeq was 100bp paired-end, on DNA fragments up to 400bp.
When running these samples through the GATK mapping pipeline, we are getting vastly different duplication rates: the HiSeq data show about 20-30% duplication (as reported and flagged by MarkDuplicates), but the NovaSeq data show anywhere from 60% to 98% (!) duplication, meaning that in the worst case our aimed-for 1000X sequencing is reduced to an effective 20X.
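As a quick check on that arithmetic, here is a minimal sketch. It assumes that duplicates are simply discarded and that coverage scales linearly with the fraction of reads kept; the function name `effective_coverage` is only illustrative.

```python
def effective_coverage(target_coverage, duplication_fraction):
    """Coverage left after discarding duplicates.

    Assumes coverage scales linearly with the fraction of reads retained,
    which is a simplification of what happens in a real capture experiment.
    """
    if not 0.0 <= duplication_fraction <= 1.0:
        raise ValueError("duplication_fraction must be between 0 and 1")
    return target_coverage * (1.0 - duplication_fraction)


if __name__ == "__main__":
    # Duplication rates quoted above for the two instruments.
    for label, dup in [("HiSeq low", 0.20), ("HiSeq high", 0.30),
                       ("NovaSeq low", 0.60), ("NovaSeq high", 0.98)]:
        print(f"{label}: {effective_coverage(1000, dup):.0f}X")
```

Under that assumption, 98% duplication leaves roughly 20X of the 1000X target, while 60% still leaves about 400X.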
We are running our pipeline both with and without the MarkDuplicates step to check whether it materially affects any of our downstream analyses (variant calling/copy number variation), and the massive drop in coverage in the NovaSeq data does appear to affect the number and quality of the variants/CNAs called.
Has anyone experienced this level of duplication before? Is it an indication of low complexity in the input library (100bp paired-end compared to 150bp paired-end), so that many DNA fragments are being falsely called duplicates, or is it an issue with NovaSeq sequencing in general?
If it's either of these possibilities, then we will just have to ignore the MarkDuplicates flags (as was done here: https://www.ncbi.nlm.nih.gov/pubmed/30404863, where the duplicates are marked but not removed, and so contribute to the downstream analysis).
You should check how many of these are optical duplicates. These are a known issue with patterned flowcells (see "Duplicates on Illumina"), and if the libraries do not meet fairly narrow criteria (defined insert sizes and loading concentration), you can end up with a problem.
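To make "optical duplicate" concrete: among reads already identified as duplicates, tools treat a pair as optical when both come from the same lane and tile and lie within some pixel distance of each other. The sketch below illustrates that idea under a few assumptions: read names follow the seven-field colon-separated Illumina layout (instrument:run:flowcell:lane:tile:x:y), the reads are on the same flowcell, and a simple box test on x and y is used. The 2500 default mirrors the value commonly suggested for patterned flowcells via Picard's `OPTICAL_DUPLICATE_PIXEL_DISTANCE`; the function names and read names are hypothetical.

```python
def parse_coords(read_name):
    """Return (lane, tile, x, y) from an Illumina-style read name.

    Assumes the layout instrument:run:flowcell:lane:tile:x:y; anything
    after the first whitespace (e.g. a FASTQ comment) is ignored.
    """
    fields = read_name.split()[0].split(":")
    if len(fields) < 7:
        raise ValueError(f"unexpected read name layout: {read_name!r}")
    return fields[3], fields[4], int(fields[5]), int(fields[6])


def is_optical_pair(name_a, name_b, pixel_distance=2500):
    """True if two duplicate reads sit close together on the same tile."""
    lane_a, tile_a, xa, ya = parse_coords(name_a)
    lane_b, tile_b, xb, yb = parse_coords(name_b)
    if (lane_a, tile_a) != (lane_b, tile_b):
        return False
    # Box test: both coordinates must be within the pixel distance.
    return abs(xa - xb) <= pixel_distance and abs(ya - yb) <= pixel_distance


if __name__ == "__main__":
    # Hypothetical read names, for illustration only.
    a = "A00123:45:HXXXXXXXX:1:1101:1000:2000"
    b = "A00123:45:HXXXXXXXX:1:1101:1500:2100"
    print(is_optical_pair(a, b))
```

This only makes sense for reads that already share alignment coordinates; proximity on the flowcell alone does not make two reads duplicates.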
Check this thread on how to identify all vs. optical duplicates: "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files".
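Since MarkDuplicates has already been run, its metrics file also gives a first look at the optical share, through the `READ_PAIR_DUPLICATES` and `READ_PAIR_OPTICAL_DUPLICATES` columns. Below is a minimal sketch of reading that table. It assumes the usual tab-separated layout with a `## METRICS CLASS` line followed by a header row; the sample values are made up, and the optical count depends on the pixel distance MarkDuplicates was run with.

```python
import io


def read_duplication_metrics(handle):
    """Parse the per-library rows of a MarkDuplicates metrics file."""
    rows, header, in_metrics = [], None, False
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("## METRICS CLASS"):
            in_metrics = True
            continue
        if not in_metrics:
            continue
        if not line.strip():
            if header is not None:
                break  # blank line ends the metrics table
            continue
        if header is None:
            header = line.split("\t")
        else:
            rows.append(dict(zip(header, line.split("\t"))))
    return rows


def optical_fraction(row):
    """Fraction of duplicate read pairs that were flagged as optical."""
    duplicates = int(row["READ_PAIR_DUPLICATES"])
    optical = int(row["READ_PAIR_OPTICAL_DUPLICATES"])
    return optical / duplicates if duplicates else 0.0


if __name__ == "__main__":
    # Made-up example content; in practice, open the real metrics file.
    example = (
        "## METRICS CLASS\tpicard.sam.DuplicationMetrics\n"
        "LIBRARY\tREAD_PAIRS_EXAMINED\tREAD_PAIR_DUPLICATES\t"
        "READ_PAIR_OPTICAL_DUPLICATES\n"
        "lib1\t1000\t800\t600\n"
        "\n"
        "## HISTOGRAM\tjava.lang.Double\n"
    )
    for row in read_duplication_metrics(io.StringIO(example)):
        print(row["LIBRARY"], optical_fraction(row))
```

If most duplicates turn out to be optical, the problem points towards loading and the flowcell rather than library complexity.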