Indel Discovery Delly, Pindel, Samtools, Gatk
11.3 years ago
rob234king ▴ 610

I'm testing samtools vs GATK for SNP and indel calling, and I'm looking at using Pindel for SVs, focusing in particular on insertions, and Delly for the other SV types. What experience do people have with SNP calling and SV tools, including filtering options?

Some possible discussion points: 1. Has anyone used Delly and found the results to be comparable with other SV detectors? 2. What filtering parameters do people use for SNP calling and indels?

I'm reading through papers on indel calling. Samtools looks good for SNPs, and GATK is better than samtools at small indels, but for larger indels and SVs the picture doesn't seem as clear cut, I presume because the detection mechanism is more complicated.

Does anybody have recommendations for software they would use in a SNP/indel/SV calling pipeline? I've seen an older Biostars post, but it made no mention of larger indels and SVs.

indel snp gatk samtools pindel • 9.2k views

you might want to break this up into 4 questions - this is too much for one post


I'll change it, but points 1 to 4 are discussion points about SNP/SV software and options, which fall under the general umbrella question.


Is your data whole-genome or exome? PM me if it is exome.


Whole genome data

11.3 years ago
Mahdi Sarmady ▴ 310

This is not a question with a single answer, but as you asked for experience, I can share our story with you.

For our whole exome pipeline, we used to run Novoalign + GATK Indel Realigner + GATK BQSR + Unified Genotyper for both indels and SNPs; it works great for indels up to 30 bp. The problem with this pipeline is that GATK Indel Realigner + GATK BQSR take about 40% of the total running time, which for whole genomes can be huge. Also, according to the GATK website, GATK Haplotype Caller (version 2.6+) works better and faster for both SNPs and indels, and it can detect larger indels as well. I compared four pipelines using the GCAT tool (Novoalign version 3.01, GATK version 2.6-5, with duplicates marked using Picard after alignment in all four):

  1. Novoalign + GATK Indel Realigner + GATK BQSR + GATK Unified Genotyper
  2. Novoalign + GATK Haplotype Caller
  3. Novoalign + GATK Unified Genotyper
  4. Novoalign + GATK Indel Realigner + GATK BQSR + GATK Haplotype Caller

You can view the report of the comparison here. Based on this comparison, we chose Novoalign V3 followed directly (after marking duplicates, of course) by GATK Haplotype Caller (version 2.6+) for both SNPs and small indels. Novoalign does its own base quality recalibration and guarantees optimal alignment, so in our comparison GATK BQSR and Indel Realigner had no significant impact on the results. Given how slow they are, we removed them from our pipeline.
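As a rough illustration of what such a comparison measures, here is a minimal sketch of scoring one pipeline's calls against a truth set. It is not how GCAT works internally; the function name `compare_calls`, the `(chrom, pos, ref, alt)` key, and all variants below are made-up assumptions. Exact-key matching like this also ignores differences in how the same indel can be represented, so real comparisons normalise variants first.

```python
# Hypothetical sketch: score a call set against a truth set by exact match.
# Variants are keyed as (chrom, pos, ref, alt); all data here is invented.

def compare_calls(calls, truth):
    """Return TP/FP/FN counts plus sensitivity and precision."""
    calls, truth = set(calls), set(truth)
    tp = len(calls & truth)   # called and in the truth set
    fp = len(calls - truth)   # called but not in the truth set
    fn = len(truth - calls)   # in the truth set but missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "sensitivity": sensitivity, "precision": precision}

truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "CT", "C"),
    ("chr2", 75, "G", "GA"),
    ("chr2", 900, "T", "C"),
}
pipeline_calls = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "CT", "C"),
    ("chr2", 500, "C", "T"),
}

result = compare_calls(pipeline_calls, truth)
print(result)
```

Running the same function over each pipeline's output gives numbers that can be weighed against running time.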

As we moved to whole genome, we surveyed a number of structural variation callers, including Pindel, Breakdancer, Delly, Lumpy and CNVnator. We chose combination of Delly (delly, jumpy, invy, duppy) and CNVnator for large insertions. All comparisons were done by comparing 60x whole genome data against CNV array data from the same sample, and only the combination of these two tools was able to call the most complex structural variants in the sample we used for comparison.

I should mention that for all these comparisons, we were looking for the best balance of speed, sensitivity and specificity. Some of the structural variation tools run far too slowly, which makes them impractical in a pipeline used to process hundreds of whole genome samples (at least within our infrastructure).
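One common way to check sequencing-based SV calls against array CNV segments is reciprocal overlap. The sketch below assumes that approach for illustration only; the answer does not say how the comparison was actually done, and the 50% threshold, the function names, and all coordinates are hypothetical.

```python
# Hypothetical sketch: match SV calls to array CNV segments by reciprocal overlap.
# Intervals are (chrom, start, end) with half-open coordinates; data is invented.

def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions, or 0.0 if the intervals are disjoint."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))

def matched(calls, reference, min_ro=0.5):
    """Calls that reach min_ro reciprocal overlap with any reference segment."""
    return [c for c in calls
            if any(reciprocal_overlap(c, r) >= min_ro for r in reference)]

array_cnvs = [("chr1", 10_000, 20_000), ("chr3", 50_000, 52_000)]
sv_calls = [
    ("chr1", 11_000, 19_500),   # sits inside the first array segment
    ("chr1", 19_000, 40_000),   # only clips the edge of the first segment
    ("chr3", 50_100, 51_900),   # sits inside the second array segment
]

hits = matched(sv_calls, array_cnvs)
print(hits)
```

Requiring the overlap to be large relative to both intervals stops one very long call from "matching" many small array segments.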


Interesting, thanks for posting. Novoalign maps fewer reads than BWA but with better specificity, though it requires a licence for multithreading. Why did you go with Novoalign? Is it because SNP discovery specificity is more important to you than sensitivity? I'm interested in your opinion on using Novoalign over BWA-MEM; for instance, I can map 99.8% of a tomato genome with BWA and 90% using Novoalign.


We chose Novoalign because, according to Heng Li's BWA-MEM manuscript (and many other benchmarks), "On accuracy, NovoAlign is the best". We do have the license, and it handles multithreading and MPI very well. Note that we work only with human data, so everything in my answer is based on experience with human data.


This has been very helpful, thanks. Are you allowed to say roughly how much your licence cost?


I don't think it is allowed, but contact them; the price is very reasonable.


I agree that IndelRealigner and BQSR take up a large percentage of the running time in the pipeline you mentioned, but for people who process fewer samples and are determined to get the most accurate results, I think it may be worth following GATK's best practices and keeping IndelRealigner and BQSR. I like to believe we do not suffer through those extra 25 hours for nothing (also with human data), although you have clearly followed a systematic approach that suggests they may not be necessary.


"We chose combination of Delly (delly, jumpy, invy, duppy) and CNVnator for large insertions." What about deletions? I guess this is a typo and you meant to say deletions?

11.3 years ago

"Does the fact that BWA-MEM trims the reads at Q15 by default affect Pindel, because some reads will be shorter than expected and the insert size will be slightly off for some paired reads?"

No, 3' trimming should not affect where the reads map, so the observed insert size should be the same.
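The reasoning can be checked with a small sketch. It assumes a standard forward/reverse pair where the outer ends of the fragment are the 5' ends of the two reads; `fragment_span` and the coordinates are hypothetical, purely for illustration.

```python
# Hypothetical sketch: the outer span of a read pair is set by the 5' ends,
# so shortening reads from the 3' end leaves it unchanged.

def fragment_span(fwd_start, fwd_len, rev_end, rev_len):
    """Outer span of a forward/reverse pair on the reference.

    The forward read covers [fwd_start, fwd_start + fwd_len); its 5' end is fwd_start.
    The reverse read covers [rev_end - rev_len, rev_end); its 5' end is rev_end.
    """
    fwd = (fwd_start, fwd_start + fwd_len)
    rev = (rev_end - rev_len, rev_end)
    return max(fwd[1], rev[1]) - min(fwd[0], rev[0])

# Untrimmed 100 bp reads from a 400 bp fragment.
untrimmed = fragment_span(fwd_start=1000, fwd_len=100, rev_end=1400, rev_len=100)

# Same pair after 3' quality trimming: the reads get shorter,
# but the 5' anchors (fwd_start and rev_end) do not move.
trimmed = fragment_span(fwd_start=1000, fwd_len=80, rev_end=1400, rev_len=70)

print(untrimmed, trimmed)
```

Trimming from the 5' end would be different, because it moves the anchors that define the outer span.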
