Hi,
I have obtained paired-end RNA-seq data from an Illumina HiSeq run.
An initial FastQC check of the raw data showed quite good quality in terms of Q values and length distribution. However, the FastQC duplication-level module showed a high level of duplication (10-20% of sequences have duplication levels in the >10 to >5k range), and the percentage of sequences remaining after deduplication is only around 15%. Is this normal for RNA-seq data? Could you please give me some advice about this problem?
In addition, should I use Prinseq's "-derep" parameter to filter out duplicated reads from these raw data (for example, -derep 24)? Will this filtering step affect the downstream differential-expression analysis?
Thank you very much!
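For context on the -derep question, this is roughly what an exact-duplicate filter does - a minimal Python sketch, not Prinseq's actual implementation, with placeholder file names (for paired-end data you would dedupe on the read pair, not each mate independently):

```python
# Minimal sketch of a "-derep"-style exact-duplicate filter (illustration
# only, NOT Prinseq's code; file names are placeholders). Keeps the first
# occurrence of each read sequence and drops later identical copies.
seen = set()

with open("sample_R1.fastq") as fin, open("sample_R1.derep.fastq", "w") as fout:
    while True:
        record = [fin.readline() for _ in range(4)]  # header, seq, "+", quals
        if not record[0]:   # end of file
            break
        seq = record[1].rstrip()
        if seq not in seen:
            seen.add(seq)
            fout.writelines(record)
```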
Hi Ryan, thank you very much for your answer. I'm still a bit confused: for some samples checked with FastQC, the percentage of sequences remaining after deduplication is 40-50%, but for others it is below 15%. How could this difference happen? Could it be because of the quality of the input library?
Do you have poly-A enriched samples or were they ribo-depleted? Particularly for the latter you'll see large changes like this. In general, any time you have a small number of really highly expressed genes then the percentage remaining after deduplication (according to FastQC) will be very low.
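To see why, here is a toy simulation in Python (all numbers are invented, and this is not FastQC's actual estimator): when a handful of genes soak up most of the reads, every possible start position in those genes gets sampled many times over, so nearly all of their reads look like duplicates.

```python
# Toy simulation: draw read start positions from a skewed expression
# distribution and count how many reads would survive exact deduplication.
# All numbers below are made up for illustration.
import random

random.seed(42)
N_READS = 1_000_000
GENE_LEN = 2_000                 # assume ~2 kb of usable start positions per gene

positions = []
for _ in range(N_READS):
    if random.random() < 0.9:    # 90% of reads come from 10 hot genes
        gene = random.randrange(10)
    else:                        # the remaining 10% spread over 10,000 genes
        gene = 10 + random.randrange(10_000)
    positions.append((gene, random.randrange(GENE_LEN)))

unique = len(set(positions))
print(f"{100 * unique / len(positions):.1f}% of reads remain after deduplication")
# -> roughly 12%: the hot genes saturate every possible start position.
```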
Thank you Ryan, I have poly-A enriched samples. I think the cDNA input quality and/or quantity may have been suboptimal or too low (I am not the one who directly performed the library preparation steps), so only the highly expressed genes were enriched and they swamped the other low-abundance genes, which leads to this duplication-rate issue.
This came up in my lab meeting the other day, Devon, and we were unsure whether removing optical duplicates might still be relevant for RNA-seq (although I don't know of a tool that will mark ONLY optical duplicates). Even better, if you have two technical replicates (same biological sample, different PCR libraries), is there some way to model the PCR duplication in each and remove it? Perhaps RNA is too flaky for that to work..?
As Steffen mentioned in said meeting, the solution is to do random-hexamer priming so all PCR dupes can be eliminated without statistical trickery - maybe with some spike-in controls ;) But for now, I'm not sure what's best for old data...
You can remove optical duplicates, though it's rarely worthwhile (it's faster to just use enough replicates and exclude outlier samples on a given gene...this is automagically done by DESeq2).
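For anyone curious, optical-duplicate detection typically works off the cluster coordinates embedded in the read names: identical sequences on the same tile within a small pixel radius are assumed to be optical rather than PCR duplicates. A rough Python sketch (assumes Casava 1.8+ style headers, and the distance threshold is a made-up value - real tools expose it as an option):

```python
# Rough sketch of optical-duplicate detection (illustration only). Assumes
# Casava 1.8+ read names: "@instrument:run:flowcell:lane:tile:x:y ...".
from itertools import combinations

OPTICAL_DIST = 100  # pixel radius (hypothetical value; tool defaults vary)

def tile_and_xy(name):
    """Parse (lane, tile) and (x, y) out of an Illumina read name."""
    f = name.split()[0].lstrip("@").split(":")
    return (f[3], f[4]), (int(f[5]), int(f[6]))

def optical_duplicates(reads):
    """reads: iterable of (name, sequence); returns names flagged as optical dups."""
    groups = {}
    for name, seq in reads:
        tile, xy = tile_and_xy(name)
        groups.setdefault((tile, seq), []).append((xy, name))
    flagged = set()
    for group in groups.values():
        for ((x1, y1), _), ((x2, y2), n2) in combinations(group, 2):
            if abs(x1 - x2) <= OPTICAL_DIST and abs(y1 - y2) <= OPTICAL_DIST:
                flagged.add(n2)  # keep the earlier read, flag the later one
    return flagged
```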
BTW, Steffen meant that one could use UMIs rather than random-hexamer priming :) That should be done with your single-cell data (at least the couple of samples that I've seen).
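The reason UMIs work: PCR duplicates of one molecule carry the same UMI, while independent fragments from the same locus (the common case in a highly expressed gene) carry different UMIs. A minimal sketch of the grouping logic after alignment (the idea behind tools like UMI-tools, not their actual algorithm - real tools also collapse UMIs that differ by sequencing errors):

```python
# Minimal sketch of UMI-based deduplication after alignment (illustration
# only). Each read is assumed to be reduced to (chrom, pos, strand, umi, id).
def dedup_by_umi(reads):
    kept = {}
    for chrom, pos, strand, umi, read_id in reads:
        key = (chrom, pos, strand, umi)
        if key not in kept:          # first read seen for this molecule wins
            kept[key] = read_id
    return list(kept.values())

reads = [
    ("chr1", 1000, "+", "ACGT", "r1"),  # molecule A
    ("chr1", 1000, "+", "ACGT", "r2"),  # PCR duplicate of A -> dropped
    ("chr1", 1000, "+", "TTAG", "r3"),  # same position, new molecule -> kept
]
print(dedup_by_umi(reads))  # ['r1', 'r3']
```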
Ah, he probably said UMI and I just misspoke :) I haven't seen the single-cell data yet - I gave my presentation last week, and next week's is Steffen's, so I'm sure this will all be a lot clearer to me then :)
As for not removing optical dupes - yeah, that makes sense if you're plugging it into DESeq anyway. replicates4life, yo. :)
Am I right that if we have random hexamers in the reads (which is pretty common for Illumina libraries), we can remove the real duplicates from the FASTQ files before trimming?