MarkDuplicates in RNA-seq
6.9 years ago

Dear all,

I am working on RNA-seq data from multiple samples, each with multi-lane sequencing data of variable read length (35-75 bp per sample). My goal is to call SNPs using GATK (widely used). For the multiple lanes (L1, L2, L3 and L4) of each sample, I used the following approach:

  1. Mapped each lane individually (using STAR two-pass mapping; the uniquely mapped read percentage was >80% for each of lanes L1, L2, L3 and L4 belonging to sample A)
  2. Added read-group information with Picard for each lane L1, L2, L3 and L4 of sample A
  3. Merged the per-lane BAMs into a single BAM file per sample
  4. Ran MarkDuplicates on the merged BAM (a sketch of these commands follows this list)
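
In shell terms, the commands look roughly like this for sample A, lane L1 (file names, read-group values and the `picard` wrapper invocation are simplified placeholders, not my exact settings):

    # 1. Map one lane with STAR in two-pass mode
    STAR --twopassMode Basic \
         --genomeDir star_index/ \
         --readFilesIn sampleA_L1_R1.fastq.gz \
         --readFilesCommand zcat \
         --outSAMtype BAM SortedByCoordinate \
         --outFileNamePrefix sampleA_L1.

    # 2. Add a lane-specific read group (picard = wrapper for java -jar picard.jar)
    picard AddOrReplaceReadGroups \
        I=sampleA_L1.Aligned.sortedByCoord.out.bam \
        O=sampleA_L1.rg.bam \
        RGID=sampleA.L1 RGLB=libA RGPL=ILLUMINA RGPU=flowcell.L1 RGSM=sampleA

    # 3. Merge the per-lane BAMs into one BAM per sample
    picard MergeSamFiles \
        I=sampleA_L1.rg.bam I=sampleA_L2.rg.bam \
        I=sampleA_L3.rg.bam I=sampleA_L4.rg.bam \
        O=sampleA.merged.bam

    # 4. Mark (not remove) duplicates on the merged BAM
    picard MarkDuplicates \
        I=sampleA.merged.bam \
        O=sampleA.markdup.bam \
        M=sampleA.markdup_metrics.txt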

Likewise, I repeated this procedure for all samples.

But after marking duplicates, I am getting PERCENT_DUPLICATION > 70%. Should I remove the duplicate reads? And why is this percentage so high even though I got more than 80% uniquely mapped reads?

Do I need to change something in the mapping step?

Waiting for your reply.

Thank you in advance

RNA-Seq SNP • 4.6k views
6.9 years ago
Ido Tamir 5.2k

Duplication measures how often one read (sequence) is present multiple times in the library. Uniquely mapping / multiple mapping concerns how many times one read's sequence is present in the genome (transcriptome). The two are independent of each other: a duplicated read can map to one region or to multiple regions, and a multi-mapping read can be unique in the library, i.e. sequenced only once in the whole library, yet come from a highly conserved region of a gene family and therefore map to many locations in the genome.
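
You can check the two independently on your duplicate-marked BAM, e.g. with samtools (the BAM name is a placeholder; 1024/0x400 is the standard SAM duplicate flag, and STAR assigns MAPQ 255 to uniquely mapped reads by default):

    # Reads flagged as duplicates by MarkDuplicates (SAM flag 0x400 = 1024)
    samtools view -c -f 1024 sampleA.markdup.bam

    # Uniquely mapped reads (STAR default: MAPQ 255 for unique alignments)
    samtools view -c -q 255 sampleA.markdup.bam

    # Reads that are both uniquely mapped AND duplicates -- perfectly possible
    samtools view -c -q 255 -f 1024 sampleA.markdup.bam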

The amount of duplication depends on the organism and the read depth. In plant libraries, for example, I see a large percentage of duplicated reads coming from chloroplasts (>1000x duplicated), but we also have good mouse SR50 libraries with an 11% duplication rate at 40M reads. You should have a look at how the duplication levels are distributed to get a better picture, i.e. are many sequences mildly repetitive (5x, 6x, <10x), or is there a subpopulation of reads that is present >1000x? Also, if you have PE data, the duplication rate is estimated more accurately than for SR data.
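
A quick-and-dirty way to look at that distribution straight from the FASTQ (the file name is a placeholder; subsample very large files first, e.g. with seqtk):

    # NR%4==2 keeps only the sequence line of each FASTQ record;
    # the first sort|uniq -c counts identical sequences, the second
    # builds a histogram of how many sequences occur 1x, 2x, ... >1000x.
    zcat sampleA_L1_R1.fastq.gz \
      | awk 'NR % 4 == 2' \
      | sort \
      | uniq -c \
      | awk '{print $1}' \
      | sort -n \
      | uniq -c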

See also: Revisiting the FastQC read duplication report

dupRadar, run after mapping and MarkDuplicates, gives you the best picture: https://bioconductor.org/packages/release/bioc/vignettes/dupRadar/inst/doc/dupRadar.html

Then you can go through your data with and without deduplication and think about which one gives you the more accurate view of your data, and whether they are really different at all (although 70% sounds quite high, but again - read depth ...).
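
Since MarkDuplicates only flags reads, you can derive both views from the same BAM, for example (file names are placeholders):

    # Keep the duplicate-marked BAM as the "with duplicates" version, and
    # drop everything carrying the duplicate flag (0x400) for the other one.
    samtools view -b -F 1024 sampleA.markdup.bam > sampleA.dedup.bam
    samtools index sampleA.dedup.bam

Then run your downstream step once on sampleA.markdup.bam and once on sampleA.dedup.bam and compare the results (e.g. the two VCFs with bcftools isec). Keep in mind that GATK's HaplotypeCaller ignores duplicate-flagged reads by default, so for variant calling the two versions should behave very similarly unless you disable that read filter.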
