This question does not have a single answer, but as you suggested, I can share our experience.
For our whole-exome pipeline, we used to run Novoalign + GATK Indel Realigner + GATK BQSR + Unified Genotyper for both SNPs and indels; it works well for indels up to 30 bp. The problem with this pipeline is that GATK Indel Realigner + GATK BQSR took about 40% of the total running time, which can be huge for whole genomes. Also, according to the GATK website, GATK Haplotype Caller (version 2.6+) works better and faster for both SNPs and indels, and it can detect larger indels as well. I compared four pipelines using the GCAT tool (Novoalign 3.01 and GATK 2.6-5; duplicates were marked with Picard after alignment in all four):
- Novoalign + GATK Indel Realigner + GATK BQSR + GATK Unified Genotyper
- Novoalign + GATK Haplotype Caller
- Novoalign + GATK Unified Genotyper
- Novoalign + GATK Indel Realigner + GATK BQSR + GATK Haplotype Caller
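The four variants above are just the two callers crossed with "realign + recalibrate" on or off. As a rough sketch (the names `PREPROCESSING`, `CALLERS` and `build_pipelines` are mine for illustration, not part of any tool, and the strings are labels rather than real commands), the comparison matrix can be written as:

```python
from itertools import product

# Steps shared by every variant in the comparison.
ALIGNER = "Novoalign"
DEDUP = "Picard MarkDuplicates"

# The two things that were varied: optional preprocessing and the caller.
PREPROCESSING = [(), ("GATK Indel Realigner", "GATK BQSR")]
CALLERS = ["GATK Unified Genotyper", "GATK Haplotype Caller"]


def build_pipelines():
    """Return each pipeline variant as an ordered list of step labels."""
    pipelines = []
    for pre, caller in product(PREPROCESSING, CALLERS):
        pipelines.append([ALIGNER, DEDUP, *pre, caller])
    return pipelines


if __name__ == "__main__":
    for steps in build_pipelines():
        print(" + ".join(steps))
```

Laying it out this way makes it easy to see that each variant differs from another by exactly one factor, which is what lets you attribute a change in the results to the caller or to the preprocessing.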
You can view the report of the comparison here. Based on this comparison, we chose Novoalign V3 followed directly (after marking duplicates, of course) by GATK Haplotype Caller (version 2.6+) for both SNPs and small indels. Novoalign performs base quality recalibration and guarantees optimal alignment, so in our comparison GATK BQSR and Indel Realigner had no significant impact on the results; given how slow they are, we removed them from our pipeline.
As we moved to whole genomes, we surveyed a number of structural variation callers, including Pindel, Breakdancer, Delly, Lumpy and CNVnator. We chose a combination of Delly (delly, jumpy, invy, duppy) and CNVnator for large insertions. All comparisons were done by comparing 60x whole-genome data against CNV array data from the same sample, and only the combination of these two tools was able to call the most complex structural variants in the sample we used for comparison.
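If you want to run a similar check of SV calls against array CNVs, one common way to decide whether a call "matches" an array segment is reciprocal overlap. The sketch below is only an illustration of that idea, not the exact procedure we used: the function names, the `(chrom, start, end)` tuple layout, half-open coordinates and the 50% threshold are all assumptions.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Smallest fraction of either interval covered by their intersection.

    Intervals are assumed half-open with end > start.
    """
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def match_calls(calls, truth, min_ro=0.5):
    """Return the truth intervals confirmed by at least one call.

    calls and truth are lists of (chrom, start, end) tuples; min_ro is an
    assumed threshold, adjust it to your array resolution.
    """
    matched = []
    for chrom_t, t_start, t_end in truth:
        for chrom_c, c_start, c_end in calls:
            if chrom_c != chrom_t:
                continue
            if reciprocal_overlap(c_start, c_end, t_start, t_end) >= min_ro:
                matched.append((chrom_t, t_start, t_end))
                break  # one confirming call is enough
    return matched
```

The nested loop is fine for a few thousand intervals; for larger sets you would sort by chromosome and position or use an interval tree.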
I should mention that for all these comparisons, we were looking for the best balance of speed, sensitivity and specificity. Some of the structural variation tools run far too slowly, which makes them impractical for a pipeline that processes hundreds of whole-genome samples (at least within our infrastructure).
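For anyone repeating this kind of comparison, the accuracy side boils down to counting true positives, false positives and false negatives against a truth set. Here is a minimal sketch; `call_metrics` is a hypothetical helper, and I use precision as a stand-in for specificity because true negatives are not well defined for variant calls. The counts in the usage lines are placeholders, not our results.

```python
def call_metrics(tp, fp, fn):
    """Compute sensitivity and precision from variant-call counts.

    tp: calls present in the truth set
    fp: calls absent from the truth set
    fn: truth-set variants that were not called
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {"sensitivity": sensitivity, "precision": precision}


if __name__ == "__main__":
    # Placeholder counts for illustration only.
    metrics = call_metrics(tp=90, fp=10, fn=30)
    print(metrics)
```

Running time then has to be weighed against these two numbers separately, since a tool with slightly better sensitivity may still be unusable at the scale of hundreds of genomes.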
You might want to break this up into 4 questions; this is too much for one post.
I'll change it, but points 1 to 4 are discussion points about SNP/SV software and options, which all fall under the general umbrella question.
Is your data whole-genome or exome? PM me if it is exome.
Whole genome data