Question

What if your polyA+, 50 bp, Illumina HiSeq paired-end RNA-seq data has ~86% duplication (FastQC)?

1

12.2 years ago

biorepine ★ 1.5k

Hi,

I have two very straightforward questions.

1) What exactly is FastQC's definition of a "duplicated read"? (Reads having similar sequences? Reads mapping to similar genomic locations? Or reads that have similar genomic starts and ends?)
2) What should one do on seeing a duplication level as high as ~86%? (Re-sequence the samples? Remove the duplicate reads and continue the analysis? Or keep all the reads and continue with the analysis?)

Many thanks in advance.

rna-seq fastqc • 4.0k views

ADD COMMENT • link updated 12.2 years ago by Sean Davis 27k • written 12.2 years ago by biorepine ★ 1.5k

2

I do not have any experience with RNA-seq, so I will not comment on 2). As for 1): FastQC does not do any mapping at all, so of your three options the correct one is "reads having similar sequences" (strictly speaking, identical sequences; see the link below).

EDIT : Read this http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/9%20Duplicate%20Sequences.html
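
To make the sequence-only definition concrete, here is a minimal Python sketch that counts reads with identical sequences and never looks at mapping positions. It is a simplification and not FastQC's actual implementation (FastQC works on a subset of the file and truncates long reads, as the linked help page describes); the function names `read_sequences` and `duplication_percent` are made up for this illustration.

```python
from collections import Counter


def read_sequences(fastq_lines):
    """Yield the sequence line (2nd of every 4 lines) from FASTQ text."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:
            yield line.strip()


def duplication_percent(sequences):
    """Percentage of reads that are extra copies of an already-seen sequence.

    Only exact sequence identity is considered; no alignment is involved.
    """
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every distinct sequence contributes one "original" read;
    # all remaining reads are counted as duplicates.
    return 100.0 * (total - len(counts)) / total


if __name__ == "__main__":
    # Hypothetical four-read FASTQ snippet: three identical reads, one unique.
    fastq = [
        "@r1", "ACGT", "+", "IIII",
        "@r2", "ACGT", "+", "IIII",
        "@r3", "ACGT", "+", "IIII",
        "@r4", "TTTT", "+", "IIII",
    ]
    print(duplication_percent(read_sequences(fastq)))
```

On a real file you would pass an open file handle instead of the list, e.g. `duplication_percent(read_sequences(open("sample_R1.fastq")))`, where the file name is a placeholder.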

ADD REPLY • link 12.2 years ago by toni ★ 2.2k

0

Try to extract the reads that have the highest duplication levels and then see which transcripts they map to. The duplication might not be an artifact; it may simply mean that a few transcripts are very highly expressed in your sample. Also see the threads "How To Extract And Quantify Duplicated Rna-Seq Reads?" and "Duplicated Reads In Rna-Seq Experiment".
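
One way to do the extraction step is sketched below in Python: count identical sequences, take the most frequent ones, and write them as FASTA so they can be aligned or BLASTed against transcripts. The function name `top_duplicated_fasta`, the record naming scheme, and the default thresholds are assumptions for illustration only.

```python
from collections import Counter


def top_duplicated_fasta(sequences, n=10, min_count=2):
    """Return FASTA text for the n most frequent sequences seen >= min_count times."""
    counts = Counter(sequences)
    records = []
    # most_common() sorts by count, highest first.
    for rank, (seq, count) in enumerate(counts.most_common(n), start=1):
        if count < min_count:
            break  # everything after this point is rarer still
        records.append(f">dup{rank}_count{count}\n{seq}")
    return "\n".join(records)


if __name__ == "__main__":
    # Hypothetical reads: one sequence seen 3 times, one twice, one once.
    reads = ["AAAA"] * 3 + ["CCCC"] * 2 + ["GGGG"]
    print(top_duplicated_fasta(reads, n=5))
```

The resulting FASTA can then be mapped with whatever aligner you already use; if the top sequences land on a handful of highly expressed transcripts, the duplication is more likely biological than technical.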

ADD REPLY • link 12.2 years ago by Irsan ★ 7.8k

0

@Irsan: This RNA-seq library is polyA+, so it is very unlikely that most of the duplicate reads map to just a few genes. I think this duplication problem is common mainly in non-polyA+ libraries, and the linked threads refer to that case as well.

ADD REPLY • link 12.2 years ago by biorepine ★ 1.5k

0

OK, then I misunderstood the matter :-( I know rRNA does not have a polyA tail, but all mRNAs do, right? Then why is it still not possible?

ADD REPLY • link 12.2 years ago by Irsan ★ 7.8k

score 2 · Answer 1 · 2013-02-21

2

12.2 years ago

Sean Davis 27k

Toni dealt with question #1 in his comment. As for #2, check to see if your sequences are enriched in adapters (I'm guessing they are). If that is the case, you may need to go back to the lab folks to see if there were problems on that side of things (too little sample, degraded sample, etc.). You might still be able to use the data if necessary, but your coverage will be very low, I imagine. Be sure to adapter-trim if you go ahead with using the sample.
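
A quick way to check for adapter enrichment is to measure what fraction of reads contain an adapter sequence. The Python sketch below does this with an exact substring search. The default adapter, `AGATCGGAAGAGC`, is the shared start of the common Illumina TruSeq adapters and is only an assumption here; use the sequence that matches your library prep kit. The function name `adapter_fraction` is likewise made up for illustration.

```python
def adapter_fraction(sequences, adapter="AGATCGGAAGAGC"):
    """Fraction of reads containing the adapter as an exact substring.

    The default adapter is an assumption (common TruSeq prefix);
    mismatches and partial adapters at read ends are not detected.
    """
    total = 0
    hits = 0
    for seq in sequences:
        total += 1
        if adapter in seq:
            hits += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical reads: two carry the adapter, two do not.
    reads = ["ACGTAGATCGGAAGAGCTT", "ACGTACGT", "AGATCGGAAGAGC", "TTTT"]
    print(adapter_fraction(reads))
```

A high fraction supports the adapter-contamination explanation; a low one suggests looking elsewhere, for example at highly expressed transcripts.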

ADD COMMENT • link 12.2 years ago by Sean Davis 27k

0

Thanks! But could you please elaborate on how it affects coverage? And which part of the sequence should one trim?

ADD REPLY • link 12.2 years ago by biorepine ★ 1.5k

1

If I am correct that the duplication is adapter contamination, then roughly 86% of your reads are technical (they do not come from the sample itself) and probably will not align to the genome. Adapter trimming can be done with any number of trimming tools; I'll leave it to you to try a couple. The next step, after trimming, is to align to the genome. That will give you the most definitive answer regarding coverage.
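
To illustrate which part of the read gets trimmed: the adapter appears at the 3' end when the insert is shorter than the read length, so everything from the start of the adapter onward is removed. Below is a minimal Python sketch of that idea. It uses exact matching only, the adapter sequence is an assumption (common TruSeq prefix), and `trim_adapter` is a hypothetical helper; dedicated tools such as cutadapt or Trimmomatic also tolerate mismatches and use base qualities, so prefer them for real data.

```python
def trim_adapter(seq, adapter="AGATCGGAAGAGC", min_overlap=3):
    """Remove a 3' adapter from a read using exact matching.

    Handles a full adapter anywhere in the read, and a partial adapter
    (at least min_overlap bases of its start) at the very end of the read.
    """
    # Case 1: the whole adapter is present; keep what precedes it.
    idx = seq.find(adapter)
    if idx != -1:
        return seq[:idx]
    # Case 2: the read ends partway into the adapter; try longest overlap first.
    longest = min(len(adapter), len(seq)) - 1
    for k in range(longest, min_overlap - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[:-k]
    # No adapter found; return the read unchanged.
    return seq


if __name__ == "__main__":
    print(trim_adapter("ACGTACGTAGATCGGAAGAGCTTT"))  # full adapter inside the read
    print(trim_adapter("ACGTACGTAGATC"))             # partial adapter at the 3' end
```

When trimming FASTQ records, the quality string has to be cut to the same length as the trimmed sequence, and for paired-end data both mates should be kept in sync.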

ADD REPLY • link 12.2 years ago by Sean Davis 27k