Question

GATK; Variant calling RNA-Seq best practices: bam processing steps

7.2 years ago

user230613 ▴ 380

Hi folks,

I have been looking for the best approach to call variants in RNA-Seq data. GATK proposes the following workflow:

Map the reads using STAR.
Remove duplicates
Split N Trim
Indel realignment
Base recalibration
HaplotypeCaller

My questions are:

Is there any study showing the improvement in variant calling after applying all the BAM processing steps? For example, the accuracy of the calls when you run HaplotypeCaller (or another variant caller) on a "raw" BAM file versus a post-processed BAM file.
Split N Trim: Is this step only required when using STAR as the mapper (to fix the MAPQ 255 issue)?

Thank you in advance,

variant-calling RNA-Seq gatk • 3.3k views

updated 14 months ago by Ram 44k • written 7.2 years ago by user230613 ▴ 380

score 4 · Answer 1 · 2017-09-27

Is there any study showing the improvement in variant calling after applying all the BAM processing steps? For example, the accuracy of the calls when you run HaplotypeCaller (or another variant caller) on a "raw" BAM file versus a post-processed BAM file.

The GATK RNA-Seq best practices were never published as a paper; as far as I remember, they were only turned into a poster. However, the steps are completely logical and make sense:

Map using STAR - STAR is arguably the best performing aligner around, and it certainly was at the time this protocol was conceived.
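Remove Duplicates - RNA-Seq data will contain a lot of duplicate reads, as it's a quantification experiment after all, so removing duplicates is useful to reduce the dataset.
Split N Trim - An RNA-specific step that splits each read into exon segments at the N operators in its CIGAR string, hard-clips bases that overhang into introns (this reduces the FP rate a lot), and fixes the MAPQ 255 issue (STAR gives uniquely mapped reads a MAPQ of 255, which GATK treats as "unknown", so it is reassigned to 60).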
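
To make the Split N Trim step a bit more concrete, here is a minimal Python sketch of its two core ideas: cutting an alignment into exon segments wherever the CIGAR string contains an N (a skipped intron), and remapping a MAPQ of 255 to 60. The function names are hypothetical and this is only an illustration; it does not reproduce the exact records GATK writes (for instance, it does not hard-clip intronic overhangs).

```python
import re

# One CIGAR operation: a length followed by an operator from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operators that advance the position on the reference.
REF_CONSUMING = set("MDN=X")


def split_at_introns(pos, cigar):
    """Split an alignment at N operations.

    Returns a list of (reference_start, cigar) tuples, one per exon segment.
    `pos` is the 1-based leftmost aligned position, as in the SAM POS field.
    """
    segments = []
    seg_start = pos
    seg_ops = []
    ref_pos = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            # Close the current exon segment and jump over the intron.
            if seg_ops:
                segments.append((seg_start, "".join(seg_ops)))
            ref_pos += length
            seg_start = ref_pos
            seg_ops = []
        else:
            seg_ops.append(f"{length}{op}")
            if op in REF_CONSUMING:
                ref_pos += length
    if seg_ops:
        segments.append((seg_start, "".join(seg_ops)))
    return segments


def reassign_mapq(mapq, from_value=255, to_value=60):
    """Map STAR's 'unique' MAPQ (255) to a value GATK interprets normally."""
    return to_value if mapq == from_value else mapq


if __name__ == "__main__":
    print(split_at_introns(100, "50M1000N50M"))
    print(reassign_mapq(255), reassign_mapq(3))
```

A read spanning one splice junction comes out as two segments, each of which looks like an ordinary DNA-style alignment to the variant caller.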

When compared to a "raw alignment", the extra steps certainly improve things. Check out this page, which in some cases shows the consequence of not performing a step (see section 3, for example). If you didn't do duplicate marking, you would most likely get a very large spike in your FP rate. If you didn't run Split N Trim, you would hit a lot of downstream issues with the N operators in the CIGAR strings, the MAPQ 255 values, etc.

Indel realignment is kind of the black sheep here: since the introduction of the HaplotypeCaller and the deprecation of the UnifiedGenotyper, it may not improve things drastically. The HaplotypeCaller performs graph-based local assembly, which should fix the issues that the UnifiedGenotyper would miss. So this step is optional.

BQSR will improve things by correcting systematic errors in the base quality scores.
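
As a rough illustration of the idea behind BQSR, the sketch below compares the quality the sequencer reported for a group of bases with an empirical quality estimated from how often those bases actually mismatch the reference (at sites not in the known-variants set). The add-one smoothing and the function names are assumptions for illustration only; GATK's real model is more involved and uses several covariates.

```python
import math


def phred(p_error):
    """Convert an error probability into a Phred-scaled quality score."""
    return -10 * math.log10(p_error)


def empirical_quality(mismatches, observations):
    """Estimate the quality of a bin of bases from observed mismatches.

    Add-one smoothing (an assumption of this sketch) avoids log10(0)
    when a bin has no mismatches at all.
    """
    return phred((mismatches + 1) / (observations + 2))


if __name__ == "__main__":
    reported_q = 30  # what the sequencer claimed for this bin of bases
    q_emp = empirical_quality(mismatches=99, observations=9998)
    print(f"reported Q{reported_q}, empirical Q{q_emp:.1f}")
```

In this made-up bin, bases reported as Q30 mismatch about once in a hundred, so their empirical quality is about Q20; recalibration would lower the scores for that bin accordingly.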

Overall, yes, these steps certainly help and there are consequences for not following them. See the link above for extra detail; it's a very well explained protocol.
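
For reference, the workflow discussed in this thread could be sketched as the command sequence below. This is a hedged sketch, not the official protocol: it assumes GATK 4 command-line syntax (the thread dates from the GATK 3 era, where the invocations look different), it leaves out indel realignment since that step is optional with HaplotypeCaller and the realignment tools are not part of GATK 4, and every file name is hypothetical. The script only builds and prints the commands; check each flag against the documentation for your installed versions before running anything.

```python
import shlex


def build_commands(sample, fastq1, fastq2, genome_dir, ref, known_sites):
    """Return the pipeline as a list of argv lists (nothing is executed)."""
    # STAR appends this suffix to --outFileNamePrefix for sorted BAM output.
    star_bam = f"{sample}.Aligned.sortedByCoord.out.bam"
    dedup_bam = f"{sample}.dedup.bam"
    split_bam = f"{sample}.split.bam"
    recal_table = f"{sample}.recal.table"
    recal_bam = f"{sample}.recal.bam"
    vcf = f"{sample}.vcf.gz"

    return [
        # 1. Map the reads with STAR (two-pass mode, with a read group).
        ["STAR", "--genomeDir", genome_dir,
         "--readFilesIn", fastq1, fastq2,
         "--twopassMode", "Basic",
         "--outSAMtype", "BAM", "SortedByCoordinate",
         "--outSAMattrRGline", f"ID:{sample}", f"SM:{sample}", "PL:ILLUMINA",
         "--outFileNamePrefix", f"{sample}."],
        # 2. Mark duplicates.
        ["gatk", "MarkDuplicates", "-I", star_bam, "-O", dedup_bam,
         "-M", f"{sample}.dup_metrics.txt"],
        # 3. Split N Trim.
        ["gatk", "SplitNCigarReads", "-R", ref, "-I", dedup_bam,
         "-O", split_bam],
        # 4. Base recalibration: build the model, then apply it.
        ["gatk", "BaseRecalibrator", "-R", ref, "-I", split_bam,
         "--known-sites", known_sites, "-O", recal_table],
        ["gatk", "ApplyBQSR", "-R", ref, "-I", split_bam,
         "--bqsr-recal-file", recal_table, "-O", recal_bam],
        # 5. Call variants.
        ["gatk", "HaplotypeCaller", "-R", ref, "-I", recal_bam, "-O", vcf,
         "--dont-use-soft-clipped-bases",
         "--standard-min-confidence-threshold-for-calling", "20"],
    ]


if __name__ == "__main__":
    # All paths below are placeholders.
    for cmd in build_commands("sampleA", "sampleA_R1.fastq", "sampleA_R2.fastq",
                              "star_index", "genome.fa", "known_sites.vcf.gz"):
        print(shlex.join(cmd))
```

Keeping the commands as data makes it easy to drop a step (say, duplicate marking) and compare the resulting calls against the full pipeline, which is essentially the comparison asked about in the question.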