MarkDuplicates in RNA-seq
6.9 years ago

Dear all,

I am working on RNA-seq data from multiple samples, each with multi-lane sequencing data of variable read length (35-75 bp per sample). My goal is to call SNPs using GATK (widely used). For the multiple lanes (L1, L2, L3 and L4) of each sample, I used the following approach:

  1. Mapped each lane individually (using STAR two-pass mapping; the uniquely mapped read percentage was >80% for each of lanes L1, L2, L3 and L4 belonging to sample A)
  2. Added read-group information with Picard for each lane L1, L2, L3 and L4 of sample A
  3. Merged the per-lane BAMs into a single BAM file per sample
  4. Ran MarkDuplicates on the merged BAM (a sketch of these commands follows this list)
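
In shell terms, the commands look roughly like this for sample A, lane L1 (file names, read-group values and the `picard` wrapper invocation are simplified placeholders, not my exact settings):

    # 1. Map one lane with STAR in two-pass mode
    STAR --twopassMode Basic \
         --genomeDir star_index/ \
         --readFilesIn sampleA_L1_R1.fastq.gz \
         --readFilesCommand zcat \
         --outSAMtype BAM SortedByCoordinate \
         --outFileNamePrefix sampleA_L1.

    # 2. Add a lane-specific read group (picard = wrapper for java -jar picard.jar)
    picard AddOrReplaceReadGroups \
        I=sampleA_L1.Aligned.sortedByCoord.out.bam \
        O=sampleA_L1.rg.bam \
        RGID=sampleA.L1 RGLB=libA RGPL=ILLUMINA RGPU=flowcell.L1 RGSM=sampleA

    # 3. Merge the per-lane BAMs into one BAM per sample
    picard MergeSamFiles \
        I=sampleA_L1.rg.bam I=sampleA_L2.rg.bam \
        I=sampleA_L3.rg.bam I=sampleA_L4.rg.bam \
        O=sampleA.merged.bam

    # 4. Mark (not remove) duplicates on the merged BAM
    picard MarkDuplicates \
        I=sampleA.merged.bam \
        O=sampleA.markdup.bam \
        M=sampleA.markdup_metrics.txt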

Likewise, I repeated this procedure for all samples.

But after marking duplicates, I am getting PERCENT_DUPLICATION > 70%. Should I remove the duplicate reads? And why is this percentage so high even though I got more than 80% uniquely mapped reads?

Do I need to change something in the mapping step?

Waiting for your reply.

Thank you in advance

RNA-Seq SNP • 4.6k views
6.9 years ago
Ido Tamir 5.2k

Duplication measures how often one read (sequence) is present multiple times in the library. Uniquely mapping / multiple mapping concerns how many times one read's sequence is present in the genome (transcriptome). The two are independent of each other: a duplicated read can map to one region or to multiple regions, and a multi-mapping read can be unique in the library, i.e. sequenced only once in the whole library, yet come from a highly conserved region of a gene family and therefore map to many locations in the genome.
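
You can check the two independently on your duplicate-marked BAM, e.g. with samtools (the BAM name is a placeholder; 1024/0x400 is the standard SAM duplicate flag, and STAR assigns MAPQ 255 to uniquely mapped reads by default):

    # Reads flagged as duplicates by MarkDuplicates (SAM flag 0x400 = 1024)
    samtools view -c -f 1024 sampleA.markdup.bam

    # Uniquely mapped reads (STAR default: MAPQ 255 for unique alignments)
    samtools view -c -q 255 sampleA.markdup.bam

    # Reads that are both uniquely mapped AND duplicates -- perfectly possible
    samtools view -c -q 255 -f 1024 sampleA.markdup.bam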

The amount of duplication depends on the organism and the read depth. In plant libraries, for example, I see a large percentage of duplicated reads coming from chloroplasts (>1000x duplicated), but we also have good mouse SR50 libraries with an 11% duplication rate at 40M reads. You should have a look at how the duplication levels are distributed to get a better picture, i.e. are many sequences mildly repetitive (5x, 6x, <10x), or is there a subpopulation of reads that is present >1000x? Also, if you have PE data, the duplication rate is estimated more accurately than for SR data.
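
A quick-and-dirty way to look at that distribution straight from the FASTQ (the file name is a placeholder; subsample very large files first, e.g. with seqtk):

    # NR%4==2 keeps only the sequence line of each FASTQ record;
    # the first sort|uniq -c counts identical sequences, the second
    # builds a histogram of how many sequences occur 1x, 2x, ... >1000x.
    zcat sampleA_L1_R1.fastq.gz \
      | awk 'NR % 4 == 2' \
      | sort \
      | uniq -c \
      | awk '{print $1}' \
      | sort -n \
      | uniq -c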

See also: Revisiting the FastQC read duplication report

dupRadar, run after mapping and MarkDuplicates, gives you the best picture: https://bioconductor.org/packages/release/bioc/vignettes/dupRadar/inst/doc/dupRadar.html

Then you can go through your data with and without deduplication and think about which one gives you the more accurate view of your data, and whether they are really different at all (although 70% sounds quite high, but again - read depth ...).
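
Since MarkDuplicates only flags reads, you can derive both views from the same BAM, for example (file names are placeholders):

    # Keep the duplicate-marked BAM as the "with duplicates" version, and
    # drop everything carrying the duplicate flag (0x400) for the other one.
    samtools view -b -F 1024 sampleA.markdup.bam > sampleA.dedup.bam
    samtools index sampleA.dedup.bam

Then run your downstream step once on sampleA.markdup.bam and once on sampleA.dedup.bam and compare the results (e.g. the two VCFs with bcftools isec). Keep in mind that GATK's HaplotypeCaller ignores duplicate-flagged reads by default, so for variant calling the two versions should behave very similarly unless you disable that read filter.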
