Hello everyone,
I'm comparing 3 sets of variants from 3 different NGS pipeline runs, all of which operate on a common set of 150 samples. I see large differences in the variants found in each run.
The three runs are structured as follows:
1. The 150 samples are run as a single batch
2. The 150 samples are run as part of a larger cohort (total cohort size = 250 samples, say)
3. The 150 samples are split into 5 batches of 30 samples each
For #2, GATK SelectVariants is used to extract variants found in the 150 samples. For #3, CombineVariants is used to combine the VCF files.
When I look at a Venn diagram of these 3 sets, I see only around a 40% overlap in variants. For reference, 100% is the set of all unique variants discovered across all 3 runs.
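For anyone wanting to reproduce the overlap figure, here is a minimal sketch of the calculation described above: variants shared by all runs divided by the union of variants across runs. The helper names (`variant_keys`, `overlap_fraction`) and the toy records are hypothetical, and keying a variant on (CHROM, POS, REF, ALT) is an assumption; if the VCFs represent multi-allelic sites or indels differently, that alone could lower the apparent overlap.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF text lines.

    Multi-allelic records are split into one key per ALT allele.
    Header lines (starting with '#') are skipped.
    """
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            keys.add((chrom, int(pos), ref, alt))
    return keys


def overlap_fraction(*call_sets):
    """Fraction of the union of all call sets that is present in every set."""
    union = set().union(*call_sets)
    if not union:
        return 0.0
    shared = set.intersection(*(set(s) for s in call_sets))
    return len(shared) / len(union)


# Toy data (made up for illustration, not from the real runs).
run1 = variant_keys([
    "chr1\t100\t.\tA\tG",
    "chr1\t200\t.\tC\tT",
    "chr2\t50\t.\tG\tA",
])
run2 = variant_keys([
    "chr1\t100\t.\tA\tG",
    "chr2\t50\t.\tG\tA",
    "chr3\t10\t.\tT\tC",
])
run3 = variant_keys([
    "chr1\t100\t.\tA\tG",
    "chr2\t50\t.\tG\tA",
    "chr4\t5\t.\tA\tT",
])

fraction = overlap_fraction(run1, run2, run3)
print(f"{fraction:.0%} of all unique variants are shared by all runs")
```

In the toy data, 2 of the 5 unique variants appear in all three runs, which mirrors the roughly 40% overlap reported above.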
To exclude pipeline quirks (AKA "this is the default behavior"), I compared 2 runs that were run 6 months apart on the exact same 25+ samples, and the variants discovered were identical. So, we can confidently say that only the cohort size difference could have caused this gap in variant discovery.
Can we discuss what could be the case here please? I wish to understand why I see 3/5 of the dataset not being called in at least one of the runs.
EDIT: These 3 runs are only computational NGS pipeline runs, they're not sequencing runs. In other words, I'm using the same BAM files across the board.
Are these samples barcoded somehow, so that you totally, completely, 100% don't have an extra 100 samples' worth of variants in #2? That would be my first guess, if the QC from the samples otherwise looks normal (sequencing depth, coverage, etc.).
I'm sorry, I should've mentioned - these 3 runs are only on the NGS pipeline. I'm using the same BAM files.
Are these exomes with tumor/normal samples? There are multiple papers that compare analyses from different tools, such as GATK, VarScan, SomaticSniper, and so on. At a lower filter threshold the overlap is in fact small, and the amount of overlap varies with the filtering threshold. If you take the top 50 variants (p-value based) from each tool, what is the overlap?
These are whole exomes. The tool used for all runs is GATK's HaplotypeCaller, so I don't understand what you mean by "from each tool".
Are the positions that are not called as variants in set A mostly reference calls or no-calls in sets B/C?
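One way to approach this check is sketched below, under stated assumptions: for each site that is variant in another run, look up the sample's GT string in the run where it is missing and bucket it as reference, no-call, variant, or absent. The function names and the toy genotypes are hypothetical. Note that a VCF containing only variant sites cannot separate reference from no-call when the record is absent entirely; that distinction would need GVCF or all-sites output.

```python
import re
from collections import Counter


def classify_genotype(gt):
    """Bucket a VCF GT string such as '0/0', './.', '0|1' or '.'."""
    alleles = re.split(r"[/|]", gt)
    if any(a not in ("0", ".") for a in alleles):
        return "variant"   # at least one non-reference allele
    if any(a == "." for a in alleles):
        return "no_call"   # fully or partially missing genotype
    return "hom_ref"       # every allele is the reference allele


def summarize_missing(sites_variant_elsewhere, genotypes_in_run):
    """Count how a run genotyped sites that were variant in another run.

    genotypes_in_run maps (chrom, pos) -> GT string for one sample;
    sites with no record at all are counted as 'absent'.
    """
    counts = Counter()
    for site in sites_variant_elsewhere:
        gt = genotypes_in_run.get(site)
        counts["absent" if gt is None else classify_genotype(gt)] += 1
    return counts


# Toy data (made up for illustration).
sites = [("chr1", 100), ("chr1", 200), ("chr2", 50), ("chr3", 10)]
genotypes_in_set_a = {
    ("chr1", 100): "0/0",
    ("chr1", 200): "./.",
    ("chr2", 50): "0|1",
}

summary = summarize_missing(sites, genotypes_in_set_a)
for category in ("hom_ref", "no_call", "variant", "absent"):
    print(category, summary[category])
```

A large "no_call" or "absent" share would point toward coverage or genotyping-confidence differences between cohort configurations, whereas a large "hom_ref" share would mean the runs actively disagree on the genotype.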
I'll check that and get back to you, @WouterDeCoster.