Question

Regenotype vcf using some samples as reference

6.8 years ago

firestar ★ 1.6k

I have a VCF file with 24 control samples and 24 treated samples, all called jointly with GATK. I am not interested in the differences between my samples and the Ensembl reference used for mapping. I am interested in control vs treated. So I would like to use my control samples as the reference and regenotype my treated samples against it.

To explain further, here is a simplified example:

var ref alt t1   t2   t3   c1   c2   c3
1   A   T   A/T  A/T  A/T  A/T  A/T  A/T
2   A   T   A/T  A/T  A/T  A/A  A/A  A/A
3   G   C   G/G  G/G  G/G  G/C  G/C  G/C

var1 would not be interesting as it does not differ between controls and treated. var2 and var3 are interesting.
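The selection rule in this example can be written as a minimal Python sketch. It assumes unphased genotype strings like those in the table, and it treats a variant as interesting only when all controls share one genotype and no treated sample carries it. The dictionary layout and the function names `normalise` and `differs_between_groups` are illustrative assumptions, not part of any existing tool.

```python
# Toy genotype table copied from the example (not real data).
variants = {
    "var1": {"treated": ["A/T", "A/T", "A/T"], "control": ["A/T", "A/T", "A/T"]},
    "var2": {"treated": ["A/T", "A/T", "A/T"], "control": ["A/A", "A/A", "A/A"]},
    "var3": {"treated": ["G/G", "G/G", "G/G"], "control": ["G/C", "G/C", "G/C"]},
}


def normalise(gt):
    """Sort alleles so that A/T and T/A compare equal (genotypes treated as unphased)."""
    return "/".join(sorted(gt.replace("|", "/").split("/")))


def differs_between_groups(groups):
    """True if all controls share one genotype and no treated sample carries it."""
    control = {normalise(g) for g in groups["control"]}
    treated = {normalise(g) for g in groups["treated"]}
    return len(control) == 1 and control.isdisjoint(treated)


interesting = [name for name, groups in variants.items() if differs_between_groups(groups)]
print(interesting)  # ['var2', 'var3']
```

With 24 samples per group, a strict rule like this would probably be too brittle in the presence of genotyping errors, so a frequency-based comparison may be more realistic.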

I am thinking of an approach where I would pick variants that have identical genotypes across controls and find a consensus, then use that as the reference and regenotype my treated samples against it. That brings us to an interesting question: which allele do I pick as the reference at heterozygous positions? I am not sure how this is done for the reference genomes.

In this example, I am just going to pick the most common allele for each variant and set that as the reference. var1 was skipped earlier and the new reference looks like:

var ref
2   A
3   G

Now if we regenotype the treated samples against the new reference, we get:

var ref alt c1   c2   c3
2   A   A   A/A  A/A  A/A
3   G   C   G/C  G/C  G/C

var2 can be skipped because it is no longer polymorphic (only because A was chosen as ref). Then we have:

var ref alt c1   c2   c3
3   G   C   G/C  G/C  G/C

Is this correct? Is anything like this implemented in a workflow or software? Ultimately, the aim is to pick only the differences caused by the experimental condition.

vcf variant-calling SNP • 2.1k views

Not clear to me: what do you want to do with your t* samples?

6.8 years ago by Pierre Lindenbaum 164k

From the original post:

I would pick variants that have identical genotypes across controls and find a consensus. Then use that as the reference and regenotype my treated samples against it.

6.8 years ago by GenoMax 147k

Well, the point is that I have 3 variants when considering all samples (top code block), but only 1 variant if I regenotype my treated samples against the controls (bottom code block). I would expect this to make a big difference in downstream variant effect predictions and so on.

6.8 years ago by firestar ★ 1.6k

6.8 years ago

pfs ▴ 280

If you care about comparing cases vs controls, then you still have two differences. For var2, the controls are homozygous for the major allele while the treated samples are heterozygous; for var3 it is the opposite. Whether an allele is the major (reference) or the minor allele does not always determine whether gene expression is optimal or not. You should take the genotype calls as they are and annotate the variants with an annotation program to determine their effect.

score 1 · Accepted Answer · 2018-01-26

6.8 years ago

firestar ★ 1.6k

I used PLINK to run an association test between the control and treated samples and find significant sites. I then filtered my VCF to keep only those sites.
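For anyone wanting to reproduce the filtering step, here is a rough Python sketch rather than the exact commands I ran. It assumes a whitespace-delimited PLINK association table with a header containing `CHR`, `BP` and `P` columns (as in PLINK 1.x `.assoc` output) and an uncompressed, tab-delimited VCF. The function names, the 0.05 cutoff and the toy records are made up for illustration.

```python
import io


def significant_sites(assoc_handle, p_threshold=0.05):
    """Collect (chrom, pos) pairs whose P value is below the (assumed) threshold."""
    header = assoc_handle.readline().split()
    chr_i, bp_i, p_i = header.index("CHR"), header.index("BP"), header.index("P")
    sites = set()
    for line in assoc_handle:
        fields = line.split()
        if not fields or fields[p_i] == "NA":
            continue  # skip blank lines and sites without a P value
        if float(fields[p_i]) < p_threshold:
            sites.add((fields[chr_i], int(fields[bp_i])))
    return sites


def filter_vcf(vcf_handle, sites):
    """Yield VCF header lines plus records whose (CHROM, POS) is in sites."""
    for line in vcf_handle:
        if line.startswith("#"):
            yield line
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if (chrom, int(pos)) in sites:
            yield line


# Toy inputs, invented purely to exercise the functions.
assoc_text = (
    " CHR SNP BP A1 F_A F_U A2 CHISQ P OR\n"
    " 1 var1 50 T 0.5 0.5 A 0.0 1.0 1.0\n"
    " 1 var2 100 T 0.5 0.0 A 4.0 0.04 NA\n"
)
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "1\t50\tvar1\tA\tT\n"
    "1\t100\tvar2\tA\tT\n"
)

sites = significant_sites(io.StringIO(assoc_text))
kept = list(filter_vcf(io.StringIO(vcf_text), sites))
print("".join(kept), end="")
```

Matching on chromosome and position only works if the chromosome names in the PLINK output and the VCF agree (for example `1` vs `chr1`), so check that first. A multiple-testing correction would normally replace the plain 0.05 cutoff.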
