Hi,
I have two very straightforward questions.
- What exactly is FastQC's definition of a "duplicated read"? (Reads having similar sequences, reads mapping to similar genomic locations, or reads that have similar genomic start and end positions?)
- What should one do on seeing a high duplication level, such as ~86%? (Re-sequence the samples, remove the duplicate reads and continue the analysis, or keep all the reads and continue with the analysis?)
Many thanks in advance.
I do not have any experience with RNA-seq, so I will not comment on 2). For 1), FastQC does not do any mapping at all, so "reads having similar sequences" is the correct option (more precisely, reads with identical sequences).
EDIT: See the FastQC documentation on this module: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/9%20Duplicate%20Sequences.html
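To make the sequence-based definition concrete, here is a minimal Python sketch that estimates a duplication percentage purely from exact sequence matches, with no mapping involved. It is not FastQC's implementation: as far as I know FastQC only tracks sequences from the start of the file and truncates long reads before comparing, and this sketch does neither. The function names and the in-memory example are made up for illustration, and it assumes a plain 4-line-per-record FASTQ.

```python
from collections import Counter
import io


def read_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # second line of every record is the sequence
            yield line.strip()


def duplication_percent(handle):
    """Percentage of reads that are extra copies of an already-seen sequence."""
    counts = Counter(read_sequences(handle))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every read beyond the first copy of each distinct sequence is a duplicate.
    duplicates = total - len(counts)
    return 100.0 * duplicates / total


# Hypothetical toy input: three identical reads and one unique read.
example_fastq = (
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nACGT\n+\nIIII\n"
    "@r4\nTTTT\n+\nIIII\n"
)
print(duplication_percent(io.StringIO(example_fastq)))
```

For a real file you would pass an open file handle (for example from `open("sample.fastq")`, or `gzip.open(..., "rt")` for compressed input) instead of the `io.StringIO` object. Note that this keeps every distinct sequence in memory, so it may need subsampling on large runs.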
Try to extract the reads that have the highest duplication levels and then see which transcripts they map to. The duplication might not be an artifact; it may simply reflect a few transcripts that are very highly expressed in your sample. Also see the threads "How To Extract And Quantify Duplicated Rna-Seq Reads?" and "Duplicated Reads In Rna-Seq Experiment".
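One possible way to do that extraction, as a rough sketch: count identical sequences in the FASTQ file, keep the most frequent ones, and write them out as FASTA so they can be aligned or BLASTed against transcripts. The function names, the FASTA header format, and the toy input are my own assumptions for illustration, and the code assumes a plain 4-line-per-record FASTQ.

```python
from collections import Counter
import io


def top_duplicated(handle, n=10):
    """Return up to n (sequence, count) pairs for sequences seen more than once."""
    counts = Counter(
        line.strip() for i, line in enumerate(handle) if i % 4 == 1
    )
    return [(seq, c) for seq, c in counts.most_common(n) if c > 1]


def to_fasta(ranked):
    """Format (sequence, count) pairs as FASTA, with rank and count in the header."""
    return "".join(
        f">dup{rank}_count{count}\n{seq}\n"
        for rank, (seq, count) in enumerate(ranked, start=1)
    )


# Hypothetical toy input: ACGT three times, GGGG twice, TTTT once.
example_fastq = "".join(
    f"@r{i}\n{seq}\n+\nIIII\n"
    for i, seq in enumerate(["ACGT", "GGGG", "ACGT", "TTTT", "GGGG", "ACGT"])
)
ranked = top_duplicated(io.StringIO(example_fastq), n=2)
print(to_fasta(ranked), end="")
```

The resulting FASTA can then be mapped with whatever aligner you already use. If the top sequences hit a handful of highly expressed genes or rRNA, that points to biology or incomplete depletion rather than a PCR artifact.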
@Isran: This RNA-seq library is polyA+, so it is very unlikely that most of the duplicate reads map to only a few genes. I think this duplication problem is much more common in non-polyA+ libraries, and those threads also refer to that case.
OK, then I misunderstood the matter :-( I know rRNA does not have a polyA tail, but mRNAs do, right? Then why is it still not possible?