What if your polyA+, 50 bp, Illumina HiSeq paired-end RNA-seq data has ~86% duplication (FastQC)?
11.8 years ago
biorepine ★ 1.5k

Hi,

I have two very straightforward questions.

  1. What exactly is FastQC's definition of a "duplicated read"? (Reads having similar sequences? Reads mapping to similar genomic locations? Or reads that have similar genomic starts and ends?)
  2. What should one do on seeing a high duplication level like ~86%? (Re-sequence the samples? Remove the duplicate reads and continue the analysis? Or keep all the reads and continue with the analysis?)

Many thanks in advance.

rna-seq fastqc • 3.8k views

I have no experience with RNA-seq, so I will not comment on 2). As for 1), FastQC does not do any mapping at all, so the correct option is "reads having similar sequences" (more precisely, identical sequences).

EDIT: See the FastQC documentation on this module: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/9%20Duplicate%20Sequences.html
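
To make the "identical sequences" idea concrete, here is a minimal Python sketch that estimates a duplication percentage purely from exact sequence matches, with no mapping involved. It is a simplification and not FastQC's actual implementation: FastQC only tracks sequences that first appear early in the file and truncates long reads, so its number will differ. The function names `read_sequences` and `duplication_percent` are made up for illustration.

```python
from collections import Counter

def read_sequences(fastq_lines):
    """Yield the sequence line (2nd line) of each 4-line FASTQ record."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:
            yield line.strip()

def duplication_percent(sequences):
    """Percentage of reads that would be lost by exact-sequence deduplication."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every distinct sequence is kept once; all extra copies are duplicates.
    return 100.0 * (total - len(counts)) / total

# Toy FASTQ content: three copies of ACGT and one TTTT.
example = [
    "@r1", "ACGT", "+", "IIII",
    "@r2", "ACGT", "+", "IIII",
    "@r3", "TTTT", "+", "IIII",
    "@r4", "ACGT", "+", "IIII",
]
print(duplication_percent(read_sequences(example)))  # 50.0
```

For a real file you would pass an open file handle instead of the toy list, e.g. `duplication_percent(read_sequences(open("sample_R1.fastq")))`, where the file name is a placeholder.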

Try extracting the reads with the highest duplication levels and check which transcripts they map to. The duplication might not be an artifact; it could simply reflect a few transcripts that are very highly expressed in your sample. Also see the threads "How To Extract And Quantify Duplicated Rna-Seq Reads?" and "Duplicated Reads In Rna-Seq Experiment".
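
One possible way to do that extraction, sketched in Python: count exact sequences, take the most frequent ones, and write them as FASTA so they can be aligned or BLASTed against the transcriptome. The helper names `top_duplicated` and `to_fasta` and the header format are assumptions for illustration, not part of any existing tool.

```python
from collections import Counter

def top_duplicated(sequences, n=10):
    """Return the n most frequent read sequences as (sequence, count) pairs."""
    return Counter(sequences).most_common(n)

def to_fasta(ranked):
    """Format (sequence, count) pairs as FASTA text, one record per sequence."""
    return "".join(
        f">dup{rank}_count{count}\n{seq}\n"
        for rank, (seq, count) in enumerate(ranked, start=1)
    )

# Toy input: read sequences already pulled out of a FASTQ file.
reads = ["AAAA", "CCCC", "AAAA", "GGGG", "AAAA", "CCCC"]
print(to_fasta(top_duplicated(reads, n=2)), end="")
```

If the top sequences turn out to be a handful of abundant transcripts, the duplication is probably biological; if they look like adapter or primer sequence, it is more likely technical.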

@Isran: This RNA-seq library is polyA+, so it is very unlikely that most of the duplicate reads map to just a few genes. I think this duplication problem is much more common in non-polyA+ libraries, and those threads refer to that case as well.

OK, then I misunderstood the matter :-( I know rRNA does not have a polyA tail, but all mRNAs do, right? Then why is it still not possible?

11.8 years ago

Tony dealt with question #1 in his comment. As for #2, check to see if your sequences are enriched in adapters (I'm guessing they are). If that is the case, you may need to go back to the lab folks to see if there were problems on that side of things (too little sample, degraded sample, etc.). You might still be able to use the data if necessary, but your coverage will be very low, I imagine. Be sure to adapter-trim if you go ahead with using the sample.
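
A quick way to check for adapter enrichment before going back to the lab is to count how many reads contain the start of the adapter. The sketch below assumes the common Illumina TruSeq adapter prefix `AGATCGGAAGAGC`; your library kit may use a different sequence, so treat that constant as a placeholder. The function name `adapter_fraction` is made up for illustration.

```python
# Assumption: the library used standard Illumina TruSeq adapters.
ADAPTER_PREFIX = "AGATCGGAAGAGC"

def adapter_fraction(sequences, adapter=ADAPTER_PREFIX):
    """Fraction of reads containing an exact copy of the adapter prefix."""
    total = 0
    hits = 0
    for seq in sequences:
        total += 1
        if adapter in seq:
            hits += 1
    return hits / total if total else 0.0

# Toy input: two of the four reads contain the adapter prefix.
reads = [
    "ACGTAGATCGGAAGAGCAC",
    "ACGTACGT",
    "AGATCGGAAGAGCTTTT",
    "GGGG",
]
print(adapter_fraction(reads))  # 0.5
```

This only finds exact, complete matches, so it underestimates contamination when the adapter is partially cut off at the read end or contains sequencing errors. FastQC's own overrepresented-sequences and adapter-content reports are worth checking alongside it.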

Thanks! But could you please elaborate on how it affects coverage, and on which part of the sequence one should trim?

If I am correct about the duplication being adapter contamination, then ~86% of your reads are technical (they do not come from the sample itself) and probably will not align to the genome. Adapter trimming can be done with any number of trimming tools; I'll leave it to you to try a couple. The next step after trimming is to align to the genome, which will give you the most definitive answer regarding coverage.
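
On which part to trim: adapter sequence shows up at the 3' end of a read when the insert is shorter than the read length, so the adapter and everything after it gets removed. A bare-bones Python sketch of that idea follows, again assuming the TruSeq prefix `AGATCGGAAGAGC` (an assumption, check your kit). The function `trim_adapter` is hypothetical and is not a substitute for a real trimmer, which also handles partial adapters at the read end, mismatches, and minimum-length filtering.

```python
def trim_adapter(seq, qual, adapter="AGATCGGAAGAGC"):
    """Cut a read (and its quality string) at the first exact adapter match."""
    pos = seq.find(adapter)
    if pos == -1:
        # No adapter found: return the read unchanged.
        return seq, qual
    # Keep only the bases upstream (5') of the adapter.
    return seq[:pos], qual[:pos]

# Toy read: 6 bases of insert, then the adapter, then 2 more bases.
seq = "ACGTAC" + "AGATCGGAAGAGC" + "TT"
qual = "I" * len(seq)
print(trim_adapter(seq, qual))  # ('ACGTAC', 'IIIIII')
```

Reads that trim down to nothing or to a few bases are adapter dimers and should be discarded before alignment; the fraction that survives gives a first idea of how much usable coverage is left.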
