I have a VCF file with 24 control samples and 24 treated samples, all called jointly with GATK. I am not interested in the differences between all my samples and the Ensembl reference used for mapping; I am interested in control vs treated. So I would like to use my control samples as the reference and regenotype my treated samples against it.
To explain further, here is a simplified example:
var ref alt t1 t2 t3 c1 c2 c3
1 A T A/T A/T A/T A/T A/T A/T
2 A T A/T A/T A/T A/A A/A A/A
3 G C G/G G/G G/G G/C G/C G/C
var1 would not be interesting as it does not differ between controls and treated. var2 and var3 are interesting.
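As a rough sketch of that filter in Python, using only the toy table above (the dictionary layout and the helper names `normalise` and `differs_between_groups` are made up for illustration, and nothing here reads a real VCF):

```python
# Toy genotype table copied from the example above; not parsed from a real VCF.
genotypes = {
    "1": {"t": ["A/T", "A/T", "A/T"], "c": ["A/T", "A/T", "A/T"]},
    "2": {"t": ["A/T", "A/T", "A/T"], "c": ["A/A", "A/A", "A/A"]},
    "3": {"t": ["G/G", "G/G", "G/G"], "c": ["G/C", "G/C", "G/C"]},
}


def normalise(gt):
    """Treat A/T, T/A and phased A|T as the same unordered genotype."""
    return tuple(sorted(gt.replace("|", "/").split("/")))


def differs_between_groups(treated, controls):
    """True if the two groups do not show the same set of genotypes."""
    return {normalise(g) for g in treated} != {normalise(g) for g in controls}


interesting = [
    var for var, g in genotypes.items()
    if differs_between_groups(g["t"], g["c"])
]
print(interesting)  # ['2', '3']
```

With 24 samples per group the genotypes will rarely be perfectly uniform within a group, so a real version would probably need a tolerance or a statistical test rather than a strict set comparison.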
I am thinking of an approach where I would pick variants that have identical genotypes across controls and find a consensus, then use that as the reference and regenotype my treated samples against it. That brings us to an interesting question: which allele do I pick as the reference for heterozygous positions? Not sure how they do that for the reference genomes...
In this example, I am just going to pick the most common allele for each variant and set that as the reference. var1 was skipped earlier and the new reference looks like:
var ref
2 A
3 G
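A minimal sketch of this "most common allele" step in Python. The function name `consensus_allele` is hypothetical, and the tie-break (fall back to the original REF allele when the controls are all heterozygous) is my own assumption, not something taken from an existing tool:

```python
from collections import Counter


def consensus_allele(control_genotypes, original_ref):
    """Return the most common allele across the control genotypes.

    Ties (e.g. every control is heterozygous) fall back to the original
    REF allele when it is among the tied alleles. This tie-break is an
    assumption made for this sketch.
    """
    counts = Counter(
        allele
        for gt in control_genotypes
        for allele in gt.replace("|", "/").split("/")
    )
    top = max(counts.values())
    tied = sorted(a for a, n in counts.items() if n == top)
    return original_ref if original_ref in tied else tied[0]


# Control genotypes for var2 and var3 from the toy table, with the original REF.
controls = {
    "2": ("A", ["A/A", "A/A", "A/A"]),
    "3": ("G", ["G/C", "G/C", "G/C"]),
}
new_ref = {var: consensus_allele(gts, ref) for var, (ref, gts) in controls.items()}
print(new_ref)  # {'2': 'A', '3': 'G'}
```

Missing calls such as `./.` are not handled here; they would be counted as an allele named `.` and would need to be dropped first.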
Now if we regenotype the treated samples against the new reference, we get:
var ref alt c1 c2 c3
2 A A A/A A/A A/A
3 G C G/C G/C G/C
var2 can be skipped because it is no longer polymorphic (only because A was chosen as ref). Then we have:
var ref alt c1 c2 c3
3 G C G/C G/C G/C
Is this correct? Is anything like this implemented in any workflow/software? Ultimately the aim is to pick only the differences caused by the experimental condition.
Not clear to me: what do you want to do with your t* samples?
From original
Well, the point is that I have 3 variants when considering all samples (top code block), but I have only 1 variant if I regenotype my T against C (bottom code block). I would expect this to make a big difference in downstream variant effect predictions and so on.