Hi all,
I'm trying to assess the quality of some public RNA-seq data from granulosa cells, and I'm running into some odd QC problems in data from 2 separate studies.
Study 1: a lot of duplicate reads, with poly-T as the overrepresented sequence (sample treatment: RNA-seq libraries were prepared using the KAPA Stranded RNA-Seq Library Preparation Kit from KAPA®; sequencing: Illumina HiSeq 2000, paired-end).
Study 2: a similar problem, but with a LOT of overrepresented sequences that mostly BLAST to rRNAs (sample treatment: before construction of the RNA-seq library, rRNA was removed from the total RNA samples using the RiboMinus Eukaryote Kit; the resulting RNA-seq library was quantified using an Agilent 2100 Bioanalyzer and run on the HiSeq PE150 platform (Illumina, CA, USA) for paired-end 150 bp RNA sequencing).
My main questions here are:
- Does it look like a problem with the rRNA depletion process?
- Can I use this data in the analysis (for example, after filtering out rRNA reads), or should I discard it?
I've encountered different opinions about filtering rRNA reads, and I still hold the view that it can bias expression measurements, but the authors of these datasets themselves filter rRNA reads as part of their data processing.
You can access the full FastQC reports here: my FastQC reports
Thank you in advance!
Have you seen this blog post from the authors of FastQC? https://sequencing.qcfail.com/articles/libraries-can-contain-technical-duplication/
Instead of getting bogged down in QC details, you may want to make a note of this observation and proceed with the rest of your analysis. If this is public data, you don't have much control over what was done or reported. If you are planning a meta-analysis, then add the relevant metadata columns to your PCA plots etc., especially if you plan to compare or combine data from multiple kits.
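If you want to record the observation as a number rather than just a FastQC flag, something along these lines could work. It is only a rough sketch, not a replacement for FastQC or a proper rRNA screen: it counts how many reads in a FASTQ file are mostly one base (poly-T by default). The function names and the 0.8 threshold are my own illustrative choices, and it assumes standard 4-line FASTQ records with no wrapped sequence lines.

```python
import gzip
import io


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def base_fraction(seq, base="T"):
    """Fraction of positions in seq equal to base (0.0 for an empty read)."""
    if not seq:
        return 0.0
    return seq.upper().count(base) / len(seq)


def count_homopolymer_reads(handle, base="T", min_fraction=0.8):
    """Return (flagged, total) for the FASTQ records read from handle.

    A read is flagged when at least min_fraction of its bases equal base.
    The 0.8 default is an arbitrary illustrative threshold (assumption).
    Assumes 4 lines per record: header, sequence, '+', qualities.
    """
    flagged = 0
    total = 0
    for i, line in enumerate(handle):
        if i % 4 != 1:  # the sequence is the second line of each record
            continue
        total += 1
        if base_fraction(line.strip(), base) >= min_fraction:
            flagged += 1
    return flagged, total


# Tiny in-memory example with three made-up reads.
example = (
    "@r1\nTTTTTTTTTT\n+\nIIIIIIIIII\n"
    "@r2\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@r3\nTTTTTTTTAT\n+\nIIIIIIIIII\n"
)
flagged, total = count_homopolymer_reads(io.StringIO(example))
print(f"{flagged} of {total} reads are >= 80% T")
```

For a real file you would pass `open_fastq("sample_R1.fastq.gz")` (a placeholder file name) instead of the in-memory example. The resulting per-sample fraction can then go into your metadata table next to the kit and study columns.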