Is there any study showing how much variant calling improves after applying all the BAM processing steps? For example, the accuracy of the calls when you run HaplotypeCaller (or another variant caller) on a "raw" BAM file versus a post-processed BAM file.
The GATK RNA-seq best practices were never published as a paper; as far as I remember, they were only turned into a poster. Even so, the steps are completely logical and make sense:
- Map using STAR - STAR is arguably the best performing aligner around,
and it certainly was at the time this protocol was conceived.
- Remove Duplicates - RNA-seq data will contain a lot of duplicated reads, as it's
a quantification experiment after all, so removing duplicates is
useful for reducing the size of the dataset.
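- Split N Trim - An RNA-specific step that hard clips bases overhanging into intronic regions (reduces the FP rate a lot), fixes the Q255 issue (STAR gives uniquely mapped reads a MAPQ of 255, which GATK does not treat as a meaningful quality, so it is reassigned to 60), and splits each read into exon segments at the N operations in its CIGAR string.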
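
To make the Split N Trim step more concrete, here is a minimal sketch of its two simplest parts: cutting a spliced alignment into exon segments at the `N` operations of its CIGAR string, and remapping MAPQ 255 to 60. This is not GATK's implementation. The function names are made up for illustration, and the real tool also hard clips overhangs and records the removed part of each split read as hard clips, which this sketch ignores.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def split_at_introns(pos, cigar):
    """Split a spliced alignment into exon segments at each N operation.

    pos is the 1-based leftmost reference position of the alignment.
    Returns a list of (segment_start, segment_cigar) pairs.
    Hypothetical helper: a simplification of what the real tool does.
    """
    segments = []
    current = []      # CIGAR operations of the segment being built
    seg_start = pos   # reference start of the segment being built
    ref_pos = pos     # running reference coordinate
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            # An N marks a skipped region (intron): close the current segment.
            if current:
                segments.append((seg_start, "".join(current)))
            current = []
            ref_pos += length
            seg_start = ref_pos
            continue
        current.append(f"{length}{op}")
        if op in "MD=X":
            # Only these operations consume reference bases.
            ref_pos += length
    if current:
        segments.append((seg_start, "".join(current)))
    return segments


def reassign_mapq(mapq, from_q=255, to_q=60):
    """Remap the aligner's special MAPQ value to an ordinary one."""
    return to_q if mapq == from_q else mapq


if __name__ == "__main__":
    print(split_at_introns(100, "50M1000N25M"))
    print(reassign_mapq(255))
```

A read aligned as `50M1000N25M` spans one intron, so it comes out as two segments, each of which can then be treated like an ordinary unspliced alignment by the variant caller.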
Compared to a "raw" alignment, the extra steps certainly improve things. Check out this page, which in some cases shows the consequence of not performing a step (see section 3, for example). If you skip duplicate marking, you'll most likely get a very large spike in your FP rate. If you skip Split N Trim, you'll run into a lot of downstream issues with the CIGAR strings, Q255 values, etc.
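
If you want to measure the effect yourself, one option is to score the calls from each BAM against a truth set for the same sample. Below is a minimal sketch, assuming variants are represented as `(chrom, pos, ref, alt)` tuples and matched exactly. The function name and the toy values are placeholders made up to exercise the code, not real results, and real benchmarking tools also normalise variant representation, which this does not.

```python
def score_calls(calls, truth):
    """Count TP/FP/FN for a call set against a truth set and derive
    precision and recall. Variants are matched by exact tuple equality."""
    calls, truth = set(calls), set(truth)
    tp = len(calls & truth)   # called and in the truth set
    fp = len(calls - truth)   # called but not in the truth set
    fn = len(truth - calls)   # in the truth set but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}


if __name__ == "__main__":
    # Toy placeholder data, NOT real calls.
    truth = [("chr1", 100, "A", "G"),
             ("chr1", 250, "C", "T"),
             ("chr2", 75, "G", "A")]
    example_calls = [("chr1", 100, "A", "G"),
                     ("chr1", 250, "C", "T"),
                     ("chr3", 10, "T", "C")]
    print(score_calls(example_calls, truth))
```

Running the same function on the call sets from the raw and the post-processed BAM gives directly comparable FP counts and precision values.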
Indel realignment is the odd one out here. Since the introduction of the HaplotypeCaller and the deprecation of the UnifiedGenotyper, it may not improve things drastically: the HaplotypeCaller performs graph-based local assembly, which should fix the issues that the UnifiedGenotyper would miss. So this step is optional.
BQSR will improve things by correcting systematic artefacts in the base quality scores.
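
For orientation, the steps discussed above could be chained roughly as follows. This is a sketch under several assumptions, not the official workflow: it uses GATK4-style command lines (the protocol described here dates from the GATK3 era), it leaves out indel realignment, and the file names, thread count, read group values and confidence threshold are all placeholders. It assumes uncompressed paired-end FASTQ files, an existing STAR index, and an indexed reference and known-sites VCF. The script only builds and prints the commands; it does not run them.

```python
import shlex


def build_commands(sample, fastq1, fastq2, star_index, ref_fasta, known_sites):
    """Return the pipeline as a list of argv lists, one per step.

    Hypothetical helper: every output file name below is a placeholder.
    """
    prefix = f"{sample}."
    # STAR writes coordinate-sorted BAM output under this fixed suffix.
    aligned = f"{prefix}Aligned.sortedByCoord.out.bam"
    dedup = f"{sample}.dedup.bam"
    split = f"{sample}.split.bam"
    table = f"{sample}.recal.table"
    recal = f"{sample}.recal.bam"
    vcf = f"{sample}.vcf.gz"
    return [
        # 1. Map with STAR (two-pass mode, read group added for GATK).
        ["STAR", "--runThreadN", "4", "--genomeDir", star_index,
         "--readFilesIn", fastq1, fastq2,
         "--twopassMode", "Basic",
         "--outSAMtype", "BAM", "SortedByCoordinate",
         "--outSAMattrRGline", f"ID:{sample}", f"SM:{sample}", "PL:ILLUMINA",
         "--outFileNamePrefix", prefix],
        # 2. Mark duplicates.
        ["gatk", "MarkDuplicates", "-I", aligned, "-O", dedup,
         "-M", f"{sample}.dup_metrics.txt", "--CREATE_INDEX", "true"],
        # 3. Split N Trim.
        ["gatk", "SplitNCigarReads", "-R", ref_fasta, "-I", dedup, "-O", split],
        # 4. BQSR: build the recalibration table, then apply it.
        ["gatk", "BaseRecalibrator", "-R", ref_fasta, "-I", split,
         "--known-sites", known_sites, "-O", table],
        ["gatk", "ApplyBQSR", "-R", ref_fasta, "-I", split,
         "--bqsr-recal-file", table, "-O", recal],
        # 5. Call variants (threshold of 20 is an assumed placeholder).
        ["gatk", "HaplotypeCaller", "-R", ref_fasta, "-I", recal, "-O", vcf,
         "--dont-use-soft-clipped-bases",
         "--standard-min-confidence-threshold-for-calling", "20"],
    ]


if __name__ == "__main__":
    for cmd in build_commands("sample1", "sample1_R1.fastq", "sample1_R2.fastq",
                              "star_index", "ref.fasta", "known_sites.vcf.gz"):
        print(shlex.join(cmd))
```

Keeping each step as a separate command makes it easy to drop one (for example duplicate marking) and call variants again, which is the simplest way to see what that step contributes on your own data.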
Overall, yes, these steps certainly help, and there are consequences for not following them. See the link above for extra detail; it's a very well explained protocol.