Hello,
I have read a lot of instructions on analyzing duplication and deduplication, including this (Revisiting the FastQC read duplication report) amazing explanation, which helped me a lot. However, I still have some technical questions (they might be very basic, but I am new to this, so I apologize): what do the numbers on the X and Y axes of the plot actually mean? In my report I have 85988702 total sequences with a length of 76, and in Sequence Duplication Levels I got this result: sequences remaining after deduplication: 79.57%. Correct me if I am wrong, but I assume that in simple terms this means 20.43% of my sequences are duplicates? I also have a peak in the blue line at >10 on the X axis. Does that mean I have sequences that are present in between 10 and 50 copies, or that I have 10 sequences with duplicates? I hope that was clear enough. I will post a picture of my results so that you can see what I am asking. Basically, I need someone to explain in detail the meaning of the numbers on the X and Y axes. Thank you in advance!
This should help: https://sequencing.qcfail.com/articles/libraries-can-contain-technical-duplication/
Your intuition/reasoning is correct for all questions you have posed.
You still have those duplicates in your data (not just initially). If you were to deduplicate the data, you would lose about 20% of the reads. For RNA-seq data, deduplication is not warranted unless you have an independent means of identifying PCR duplicates (e.g. unique molecular indexes, UMIs).
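To make the axes concrete, here is a minimal Python sketch on made-up toy reads. It is not FastQC's actual code, and the function name `duplication_profile` is hypothetical. The X axis is the duplication level (how many copies of a given sequence exist), and the blue line's Y value is the percentage of all reads that sit at that level. FastQC additionally groups higher levels into bins (the one labelled >10 should cover roughly 10 to 49 copies) and estimates everything from a subsample, so treat its percentages as approximate. The last lines apply the same arithmetic to the numbers in the question.

```python
from collections import Counter


def duplication_profile(reads):
    """Return (percent of total reads per duplication level,
    percent of reads remaining after deduplication)."""
    copies_per_sequence = Counter(reads)  # distinct sequence -> number of copies
    total = len(reads)

    # X axis: duplication level. Y axis (blue line): % of all reads at that level.
    reads_at_level = Counter()
    for copies in copies_per_sequence.values():
        reads_at_level[copies] += copies
    percent_by_level = {
        level: 100.0 * n / total for level, n in sorted(reads_at_level.items())
    }

    # Deduplication keeps exactly one copy of every distinct sequence.
    percent_remaining = 100.0 * len(copies_per_sequence) / total
    return percent_by_level, percent_remaining


# Toy data (made up): two singletons, one sequence seen twice, one seen six times.
toy_reads = ["ACGT", "TTGA"] + ["GGCA"] * 2 + ["CATG"] * 6
toy_levels, toy_remaining = duplication_profile(toy_reads)
print(toy_levels)      # share of reads at duplication levels 1, 2 and 6
print(toy_remaining)   # share of reads left after deduplication

# Same arithmetic with the numbers from the question (FastQC's 79.57% is an estimate).
total_sequences = 85_988_702
percent_remaining = 79.57
kept = round(total_sequences * percent_remaining / 100)
removed = total_sequences - kept
print(kept, removed)   # roughly 68.4 million kept, 17.6 million removed
```

In the toy data, 60% of the reads sit at duplication level 6 even though only one distinct sequence is duplicated that heavily. A peak in the blue line at >10 is read the same way: it is the share of reads whose sequence occurs that many times, not a count of 10 duplicated sequences.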
Thank you so much! I will check the link now.