Question

Duplication and deduplication in FASTQC report


3.0 years ago

tea.vuki ▴ 20

Hello,

I have read a lot of instructions on analyzing duplication and deduplication, including this amazing explanation (Revisiting the FastQC read duplication report), which helped me a lot. However, I still have some technical questions (they might be very basic, but I am new to this, so I apologize): what do the numbers on the X and Y axes of the plot actually mean?

In my report I have 85988702 total sequences with a length of 76, and in Sequence Duplication Levels I got these results: sequences remaining after deduplication: 79.57% (correct me if I am wrong, but I assume in simple terms that this means that initially 20.43% of my sequences were duplicates?). I also have a peak in the blue line at >10 on the X axis. Does that mean that I have sequences with between 10 and 50 copies, or that I have 10 sequences with duplicates?

I hope that was clear enough. I will post a picture of my results so that you understand what I want to ask. Basically, I need someone to explain to me in detail the meaning of the numbers on the X and Y axes. Thank you in advance!

fastqc • 1.1k views

updated 23 months ago by Ram 44k • written 3.0 years ago by tea.vuki ▴ 20


This should help: https://sequencing.qcfail.com/articles/libraries-can-contain-technical-duplication/

Your intuition/reasoning is correct for all questions you have posed.

"I assume in simple terms that this means that initially I had 20.43% of the sequences that were duplicates?"

You still have those duplicates in your data (not just initially). If you were to deduplicate the data, you would lose about 20% of the reads. For RNA-seq data, deduplication is not warranted unless you have an independent means of identifying PCR duplicates (e.g. unique molecular identifiers, UMIs).
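To make the axes concrete, here is a rough Python sketch of the bookkeeping behind the plot. It is a toy illustration, not FastQC's actual code: the names `BINS`, `bin_label` and `duplication_summary` are made up for this example, and it counts every read exactly, whereas FastQC estimates these figures from a subset of the file. The X axis is the number of copies of a sequence (the ">10" bin covers 10 to 49 copies), and the blue line's Y value is the percentage of all reads that fall in that bin.

```python
from collections import Counter

# Lower bound and label of each duplication-level bin on the X axis.
BINS = [(n, str(n)) for n in range(1, 10)] + [
    (10, ">10"), (50, ">50"), (100, ">100"), (500, ">500"),
    (1000, ">1k"), (5000, ">5k"), (10000, ">10k"),
]

def bin_label(copies):
    """Return the label of the bin with the largest lower bound <= copies."""
    label = BINS[0][1]
    for lower, name in BINS:
        if copies >= lower:
            label = name
    return label

def duplication_summary(reads):
    """Return (percent remaining after deduplication, percent of reads per bin)."""
    counts = Counter(reads)              # copies of each distinct sequence
    total = len(reads)
    percent_remaining = 100.0 * len(counts) / total
    percent_of_total = Counter()         # the "blue line" values
    for copies in counts.values():
        percent_of_total[bin_label(copies)] += 100.0 * copies / total
    return percent_remaining, dict(percent_of_total)

# Toy data: one sequence seen 12 times, one seen twice, six seen once.
toy_reads = ["A"] * 12 + ["B"] * 2 + ["C", "D", "E", "F", "G", "H"]
remaining, blue_line = duplication_summary(toy_reads)

# Applying the reported 79.57% to the 85,988,702 reads in the question:
total_reads = 85_988_702
estimated_reads_lost = round(total_reads * (100.0 - 79.57) / 100.0)
```

In the toy data, 8 of 20 reads are distinct, so 40% would remain after deduplication, and the read seen 12 times puts 60% of all reads into the ">10" bin. For the report in the question, the same arithmetic suggests that roughly 17.6 million of the ~86 million reads would be removed by deduplication.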