Question

RNA-seq dedupe PCR contamination before or after mapping

8.8 years ago

umn_bist ▴ 390

Will deduping (marking duplicates) with Picard before mapping affect my variant calling? Or is it common practice to dedupe after mapping?

I ask because my FastQC reports still flag k-mer content, overrepresented sequences, and skewed GC content. I figured these could be corrected by removing PCR contamination. This is after trimming adapters and low-quality bases (quality threshold 10) with BBDuk.

RNA-Seq • 3.6k views

ADD COMMENT • link updated 8.8 years ago by Carlo Yague 8.9k • written 8.8 years ago by umn_bist ▴ 390

It depends on what data you have, but a slightly bimodal GC content distribution seems to be the norm in whole-exome data. I haven't figured out why, but it appears to be commonplace.

ADD REPLY • link 8.8 years ago by andrew.j.skelton73 6.6k

I'm working with tumor/normal paired-end RNA-seq samples from TCGA. The GC distribution varies across samples: in some the skew is slight, in others drastic. I fear that mapping my reads without correcting the GC and k-mer bias may muddle my variant calling downstream.

http://p08i.imgup.net/ScreenShot4fe0.png

http://i86i.imgup.net/ScreenShot1738.png

http://t38i.imgup.net/ScreenShote90f.png

ADD REPLY • link 8.8 years ago by umn_bist ▴ 390

I highly recommend looking at the GATK Best Practices, which include caveats for using RNA-seq data (provided the samples have suitable depth): https://www.broadinstitute.org/gatk/guide/best-practices.php

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by andrew.j.skelton73 6.6k

score 2 · Answer 1 · 2016-02-09

Overrepresented sequences and skewed GC content are expected in RNA-seq data. They usually come from the most highly expressed transcripts (such as rRNA). However, they can also come from PCR duplicates, and those can completely skew variant calling. For this reason, although people usually don't dedupe RNA-seq data for differential expression analysis, deduping is still recommended for variant calling. Note that Picard's MarkDuplicates works on aligned reads, so it is run after mapping.

A reference: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0058815
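
To illustrate why duplicate marking is done on mapped reads, here is a toy Python sketch of the general idea used by coordinate-based tools such as Picard's MarkDuplicates: read pairs are grouped by where their 5' ends align (and on which strand), and within each group the pair with the highest summed base quality is kept. This is a simplification, not Picard's actual implementation. The `AlignedPair` structure, the `mark_duplicates` function, and the example reads are all hypothetical names and data made up for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class AlignedPair(NamedTuple):
    """Toy stand-in for a mapped read pair (hypothetical, not a Picard type)."""
    name: str
    chrom: str
    pos5: int         # 5' alignment position of read 1
    strand: str       # strand of read 1
    mate_pos5: int    # 5' alignment position of read 2
    mate_strand: str  # strand of read 2
    quals: tuple      # base qualities of both mates


def mark_duplicates(pairs):
    """Return the names of pairs flagged as duplicates.

    Pairs sharing chromosome, both 5' positions, and both strands are
    treated as copies of one fragment; the copy with the highest summed
    base quality is kept and the others are flagged.
    """
    groups = defaultdict(list)
    for p in pairs:
        key = (p.chrom, p.pos5, p.strand, p.mate_pos5, p.mate_strand)
        groups[key].append(p)

    duplicates = set()
    for members in groups.values():
        best = max(members, key=lambda p: sum(p.quals))
        duplicates.update(p.name for p in members if p is not best)
    return duplicates


# Made-up example data: r1 and r2 share all coordinates; r3 and r4 do not.
pairs = [
    AlignedPair("r1", "chr1", 100, "+", 350, "-", (30, 30, 30)),
    AlignedPair("r2", "chr1", 100, "+", 350, "-", (35, 35, 35)),
    AlignedPair("r3", "chr1", 100, "+", 420, "-", (20, 20, 20)),
    AlignedPair("r4", "chr2", 100, "+", 350, "-", (30, 30, 30)),
]
flagged = mark_duplicates(pairs)
print(sorted(flagged))
```

Because the grouping key is built from alignment coordinates, this step needs a BAM of mapped reads as input. Reads that merely share a start position on one end (like r3 here) are not flagged, which is why paired-end information helps avoid over-collapsing reads from highly expressed transcripts.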