let me share some thoughts with you regarding variant calling on Ion data
we have been using TVC for a couple of years as the only source of variants for our Ion PGM and Ion Proton machines, since we weren't able to find any Ion-specific variant caller that would give us better results. we tried GATK in the past because we already use it on SOLiD data, running it in parallel with LifeScope (we end up with 2 variant sets that we can merge, letting the researcher decide whether to use just one or both), but the number of false positives was too high at that time. in fact, GATK called almost 10x as many variants as TVC.
but we wanted to review this process, so we are currently running very basic tests to compare those TVC results, which we treat as our gold standard, with the 2 variant callers that we know best: GATK and samtools+bcftools. we used 8 samples that we have Sanger sequenced and played around with several configurations: fully preprocessing the bam files as the GATK best practices suggest vs. calling without preprocessing, GATK's HaplotypeCaller vs. UnifiedGenotyper, and hard filtering vs. no filtering. although our tests are still preliminary, I can tell you that the best results so far come from the complete GATK pipeline that we run on SOLiD data: full preprocessing of the bam files, then GATK's HaplotypeCaller, then hard filtering of the resulting variants.
but we don't feel quite confident using software that isn't recommended for this data, because the GATK developers have always stated that their software was not designed to be used with Ion data. in fact, Geraldine stated a couple of weeks ago that "I'm afraid this has been tabled definitively; we will not produce best practices for dealing with this data type." but the fact is that we have an 83%-87% overlap between HC's passed variants and TVC's (other configurations never reached overlaps above 75%) and, although we haven't yet compared all this data with the Sanger results to confirm true and false positive rates, it seems that the current GATK best practices generate a fairly robust variant set, one that is even better than the samtools+bcftools set. (that combination, without even preprocessing the bam files, surprisingly outperforms everything except the full GATK best practices, reaching overlaps of 80% while being much faster: ~30min vs. ~6h.) in fact, the GATK pipeline detects some single-base indels that we knew about and checked for manually, indels that neither samtools+bcftools nor even TVC (which should be optimized to distinguish technology-related indels from real ones) is able to detect.
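for anyone who wants to reproduce this kind of overlap figure, here is a minimal Python sketch. it rests on assumptions that are not in the post: "overlap" is taken to mean the fraction of TVC calls that also appear among the other caller's passed variants, variants are matched on exact (chrom, pos, ref, alt), and the helper names and toy records are purely illustrative. exact matching will miss the same indel written in two different ways, so normalizing both files first (for example with `bcftools norm`) is probably worth it.

```python
import io


def read_variant_keys(handle, passed_only=True):
    """Collect (chrom, pos, ref, alt) keys from VCF text, one key per ALT allele.

    With passed_only=True, records whose FILTER column is neither "PASS"
    nor "." are skipped (an assumption about what "passed variants" means).
    """
    keys = set()
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts, flt = fields[0], int(fields[1]), fields[3], fields[4], fields[6]
        if passed_only and flt not in ("PASS", "."):
            continue
        for alt in alts.split(","):  # split multi-allelic records
            keys.add((chrom, pos, ref, alt))
    return keys


def overlap_fraction(candidate, reference):
    """Fraction of reference calls that are also present in candidate."""
    if not reference:
        return 0.0
    return len(candidate & reference) / len(reference)


def toy_vcf(records):
    """Build an in-memory VCF from (chrom, pos, ref, alt, filter) tuples (made-up data)."""
    header = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    body = "".join(
        "\t".join([c, str(p), ".", r, a, ".", f, "."]) + "\n" for c, p, r, a, f in records
    )
    return io.StringIO(header + body)


# Hypothetical records, only to show the mechanics.
tvc = read_variant_keys(toy_vcf([
    ("chr1", 100, "A", "G", "PASS"),
    ("chr1", 200, "C", "T", "PASS"),
    ("chr2", 50, "G", "GA", "PASS"),
    ("chr2", 300, "T", "C", "PASS"),
]))
hc = read_variant_keys(toy_vcf([
    ("chr1", 100, "A", "G", "PASS"),
    ("chr1", 200, "C", "T", "PASS"),
    ("chr2", 50, "G", "GA", "PASS"),
    ("chr2", 300, "T", "C", "LowQual"),  # hard-filtered, so not counted
    ("chr3", 10, "A", "C", "PASS"),
]))

print(f"overlap with TVC: {overlap_fraction(hc, tvc):.0%}")  # prints "overlap with TVC: 75%"
```

note that the denominator matters: dividing by the TVC count answers "how much of TVC does the other caller recover", while dividing by the other caller's count would say how many of its calls TVC confirms.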
so we are still torturing these variants, because we are not sure whether to stick with TVC variants only or to merge them with a completely unsupported yet promising variant set. also, if computing resources were critical we could be tempted to go for the samtools+bcftools set, but since we'd like to reduce the false positive rate to a minimum, we may have to invest more time in the variant calling process. it seems like a lose-lose situation, but we really want to make sure that we are generating the best possible results.
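if the sets do get merged, one way to keep the decision in the researcher's hands is to record which caller reported each variant. the sketch below is only an illustration under my own assumptions: `merge_call_sets` is a hypothetical helper, the caller labels and variants are made up, and each call set is assumed to be already reduced to (chrom, pos, ref, alt) keys.

```python
def merge_call_sets(call_sets):
    """Map each variant key to the sorted list of callers that reported it.

    call_sets: dict of caller name -> set of (chrom, pos, ref, alt) keys.
    """
    merged = {}
    for caller, keys in call_sets.items():
        for key in keys:
            merged.setdefault(key, set()).add(caller)
    # Sort by variant key so the output order is stable.
    return {key: sorted(callers) for key, callers in sorted(merged.items())}


# Hypothetical call sets, only to show the mechanics.
merged = merge_call_sets({
    "TVC": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
    "HC": {("chr1", 200, "C", "T"), ("chr2", 50, "G", "GA")},
})

for (chrom, pos, ref, alt), callers in merged.items():
    print(chrom, pos, ref, alt, ",".join(callers), sep="\t")
```

with the callers listed per variant, a researcher can keep only the calls supported by both sets, or take the union and treat single-caller variants with more suspicion.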
Thank you Jorge. This is very useful input. Would you mind sharing the TMAP/TVC and GATK commands and parameters you used, as a starting point so that others may potentially save lots of time?
No big deal, but at least all the decisions made can be easily referenced and reproduced.