In genomic analyses such as variant discovery, it is advisable to ignore or remove duplicate reads, because a replication error introduced during amplification could easily be mistaken for a SNP. So we usually keep only reads with unique start positions.
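To make the "unique start positions" idea concrete, here is a minimal sketch of that kind of deduplication, assuming a BAM file as input and the pysam library (the file names are hypothetical; in practice tools like Picard MarkDuplicates or samtools markdup do this more carefully, also using mate position and base qualities to pick the best copy):

```python
import pysam

def dedup_by_start(in_bam, out_bam):
    """Keep one read per (reference, start position, strand) combination.
    A rough sketch of duplicate removal, not a replacement for real tools."""
    seen = set()
    with pysam.AlignmentFile(in_bam, "rb") as inp, \
         pysam.AlignmentFile(out_bam, "wb", template=inp) as out:
        for read in inp:
            if read.is_unmapped:
                continue
            key = (read.reference_id, read.reference_start, read.is_reverse)
            if key not in seen:
                seen.add(key)
                out.write(read)

# dedup_by_start("sample.bam", "sample.dedup.bam")  # hypothetical file names
```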
The library preparation protocol for RNA-seq involves amplification at some point. I know amplification is required for sequencing, and I also assume that amplification is carried out uniformly across the transcriptome (at least in theory). For example, if Gene A has 4 RNA copies in the sample and Gene B has 10 copies, then after amplification, if Gene A has 16 copies, Gene B should have 40 copies. I also know that if you sequence a library very deeply, many duplicate reads are expected that are not due to amplification.
a) Does anyone have an idea what percentage (range) of duplicate reads out of total reads is considered normal for RNA-seq data?
Also, do we need to discard duplicate reads in RNA-seq experiments where we compare gene expression between two samples and the number of duplicate reads differs significantly between them (one sample was amplified more, the other less)? I know that dividing the read counts by the total number of mapped reads in a sample removes the bias when samples have unequal numbers of reads, but will this normalization step also take care of duplicates, so that we get a true representation of the transcriptome after normalization?
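To be clear about what I mean by that scaling, here is a minimal counts-per-million style sketch (gene names and counts are invented purely for illustration):

```python
# Scale raw counts by the total mapped reads in each sample (counts per million).
counts_sample1 = {"GeneA": 400, "GeneB": 1000, "GeneC": 50}    # made-up counts
counts_sample2 = {"GeneA": 900, "GeneB": 2300, "GeneC": 120}   # made-up counts

def cpm(counts):
    total = sum(counts.values())
    return {gene: c * 1e6 / total for gene, c in counts.items()}

print(cpm(counts_sample1))
print(cpm(counts_sample2))
# This corrects for different library sizes, but it cannot distinguish
# a PCR duplicate from a genuine extra copy of the same transcript.
```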
For some protocols that deal with a very small quantity of RNA as starting material, a lot of amplification is required before sequencing. I have RNA-seq data for two such samples, where the first sample has only 30% unique (non-duplicate) reads and the second has around 50% unique reads. Can I still carry out RPKM normalisation and use tools like DEGseq and edgeR to get a list of differentially expressed genes?
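For reference, by RPKM I mean reads per kilobase of transcript per million mapped reads; a minimal sketch with made-up numbers:

```python
def rpkm(gene_count, gene_length_bp, total_mapped_reads):
    """RPKM = (gene_count * 1e9) / (gene_length_bp * total_mapped_reads)."""
    return gene_count * 1e9 / (gene_length_bp * total_mapped_reads)

# Made-up example values for illustration only.
print(rpkm(gene_count=500, gene_length_bp=2000, total_mapped_reads=20_000_000))
# -> 12.5
```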
Thanks
I think we DON'T need to discard duplicate reads from RNA-seq experiments
I agree with that too