I have RNA-seq data (Illumina 1.9 encoding). I ran QC with FastQC after removing over-represented sequences and adapters, and FastQC flagged failures in the Kmer content, per-sequence GC content, and sequence duplication modules. Several blog posts suggested this is a normal occurrence for RNA-seq, so I proceeded: alignment to the reference genome with STAR, read counting with HTSeq, and differential expression with DESeq2. The RNA was isolated from the heavy polysomal fraction, so the samples essentially contained ribosome-bound messenger transcripts, and the sequencing company used poly(A) selection for library preparation. After the RNA-seq analysis I validated the DESeq2 findings by RT-qPCR and obtained very similar results.
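For reference, here is a minimal sketch of my alignment and counting steps (the sample names, STAR index, and GTF path are placeholders, and the htseq-count strandedness flag has to match the library prep):

```python
# Sketch of the STAR -> htseq-count steps described above.
# Paths (star_index/, genes.gtf) and sample names are placeholders.
import subprocess

def align_and_count(sample: str) -> None:
    # Align paired-end reads with STAR, writing a coordinate-sorted BAM.
    subprocess.run([
        "STAR",
        "--runThreadN", "8",
        "--genomeDir", "star_index",
        "--readFilesIn", f"{sample}_R1.fq.gz", f"{sample}_R2.fq.gz",
        "--readFilesCommand", "zcat",
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", f"{sample}_",
    ], check=True)

    # Count reads per gene with htseq-count on the sorted BAM;
    # counts go to stdout, so redirect them into a file.
    with open(f"{sample}_counts.txt", "w") as out:
        subprocess.run([
            "htseq-count",
            "-f", "bam",   # input format
            "-r", "pos",   # BAM is position-sorted
            "-s", "no",    # set to match the library strandedness
            f"{sample}_Aligned.sortedByCoord.out.bam",
            "genes.gtf",
        ], check=True, stdout=out)

for s in ["ctrl_rep1", "ctrl_rep2", "treat_rep1", "treat_rep2"]:
    align_and_count(s)
```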
The percentage of unique reads after deduplication, as reported by FastQC, is as low as 8% for some of my samples. My validation suggests the libraries were fine, but I have read differing opinions online and am now thoroughly confused: some suggest removing duplicates before proceeding, while others say that is a definite no-no.
Is it normal for RNA-seq data to have such a low percentage of unique reads as reported by FastQC?
High sequence duplication levels in RNA-seq are normal and expected. Do not remove duplicates: doing so would artificially cap the counts of highly expressed genes and therefore underestimate their true expression.
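To see why, consider the saturation effect: with single-end reads, a transcript of length L offers at most L - r + 1 distinct start positions per strand for reads of length r, so position-based deduplication can never return a count above that ceiling, no matter how highly the gene is expressed. A toy calculation (transcript length and read counts are made up for illustration, assuming uniform coverage):

```python
# Toy illustration (made-up numbers, uniform-coverage assumption):
# why removing duplicates flattens counts for highly expressed genes.
read_len = 100
tx_len = 2000
# Distinct (start, strand) positions available for single-end reads.
positions = 2 * (tx_len - read_len + 1)

for name, true_count in [("moderate", 5_000), ("high", 500_000)]:
    # Expected number of distinct positions hit if read starts are
    # uniform at random; this is what survives deduplication.
    expected_after_dedup = positions * (1 - (1 - 1 / positions) ** true_count)
    print(f"{name:8s} true={true_count:>7,d}  after dedup ~ {expected_after_dedup:,.0f}")
```

With these made-up numbers, the moderate gene keeps ~2,800 reads while the highly expressed gene is capped at ~3,800: a 100-fold expression difference collapses to roughly 1.4-fold, which is exactly the downscaling described above.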
What does it mean that "examination of the distribution of duplicates in a specific genomic region will allow the distinction between over-sequencing and general technical duplication"? What should I be looking for? Is it that over-sequencing will appear as a large number of overlapping reads, some of which are exact duplicates by chance, while technical (PCR) duplication will appear as individual stacks of multiple copies of the exact same read?
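To check this myself, I was planning to tally reads per (start, strand) position in one region of a highly expressed gene with pysam, along these lines (the BAM path and region are placeholders, and the BAM needs an index for fetch()):

```python
# Sketch (placeholder BAM path/region): tally reads per (start, strand)
# in one region to see how duplicates are distributed.
# Over-sequencing -> many positions, each with a few chance duplicates;
# PCR duplication -> a few positions with very tall identical stacks.
from collections import Counter

import pysam

bam = pysam.AlignmentFile("sample_Aligned.sortedByCoord.out.bam", "rb")
stacks = Counter(
    (read.reference_start, read.is_reverse)
    for read in bam.fetch("chr1", 1_000_000, 1_010_000)
    if not read.is_secondary and not read.is_unmapped
)

depths = sorted(stacks.values(), reverse=True)
if depths:
    print(f"distinct (start, strand) positions: {len(depths)}")
    print(f"top 10 stack depths: {depths[:10]}")
    print(f"median stack depth: {depths[len(depths) // 2]}")
```

Keying on (start, strand) is, as far as I understand, roughly what position-based deduplicators mark as duplicates for single-end reads, so this should show whether the duplicates are spread across many positions or piled onto a few.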