Question

Dealing with Multiallelic in GWAS

1

4.5 years ago

godth13teen ▴ 70

Hi, I'm quite new to GWAS. Based on my understanding so far, I have some questions.

The model often cares only about the zygosity of the SNP, not the actual nucleotide change: "Quantitative genetic trait prediction is usually represented as linear regression models which require quantitative encodings for the genotypes: the three distinct genotype values, corresponding to one heterozygous and two homozygous allele". So how should a multiallelic SNP be handled, where the nucleotide can change in more than one way? For example, given a SNP at position chr1:240072043, T>C, I can encode it as 0, 1, 2 for homozygous major, heterozygous, and homozygous minor, but if a different sample has T>A at the same position, how can I treat this in the model?
Also, when I care about epistasis, where one gene/SNP can interact with others, how can I modify the GWAS model to convey this information?

Thank you for answering my question!

SNP GWAS • 2.7k views

ADD COMMENT • link updated 4.5 years ago by chrchang523 11k • written 4.5 years ago by godth13teen ▴ 70

score 2 · Answer 1 · 2020-06-14

2

4.5 years ago

chrchang523 11k

You include n-1 genotype columns in your regression, where n is the number of alleles. (One allele, usually the highest-frequency one, must be omitted to avoid linear dependence in the regression.)
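A minimal sketch of that encoding, assuming genotypes arrive as strings such as "C/T". The function name `allele_count_columns` and the alphabetical tie-breaking rule are illustrative choices, not something taken from a particular GWAS tool.

```python
from collections import Counter


def allele_count_columns(genotypes, sep="/"):
    """Return (omitted_allele, {allele: [copy count per sample]}).

    The highest-frequency allele is omitted, so n alleles yield n-1 columns.
    """
    calls = [g.split(sep) for g in genotypes]
    freq = Counter(allele for call in calls for allele in call)
    # Omit the most frequent allele; ties are broken alphabetically (an assumption).
    omitted = min(freq, key=lambda a: (-freq[a], a))
    columns = {
        allele: [call.count(allele) for call in calls]
        for allele in sorted(freq)
        if allele != omitted
    }
    return omitted, columns


# Hypothetical genotypes for four samples at one triallelic site.
omitted, columns = allele_count_columns(["T/T", "C/T", "C/C", "A/T"])
print(omitted)  # T
print(columns)  # {'A': [0, 0, 0, 1], 'C': [0, 1, 2, 0]}
```

Each remaining column counts copies of one non-omitted allele, so a sample carrying only the omitted allele is all zeros.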

ADD COMMENT • link 4.5 years ago by chrchang523 11k

0

Hi, I'm not so clear about your answer. Could you please explain a bit more? Thank you.

ADD REPLY • link 4.5 years ago by godth13teen ▴ 70

1

Suppose you have 4 samples; let's label them A, B, C, and D. Sample A has genotype T/T at this SNP, and phenotype value 175. Sample B has genotype C/T and phenotype value 160; sample C has genotype C/C and phenotype value 155; and sample D has genotype T/T and phenotype value 173.

A standard GWAS is based on [phenotype] ~ [genotype, intercept, other predictors] regressions. Ignoring "other predictors" for now, the data matrices for the regression at this SNP would look like

phenotype        intercept  #C
      175                1   0
      160                1   1
      155                1   2
      173                1   0

I've labeled the single genotype column "#C" here, representing "number of copies of the C allele".

Now change sample D's genotype to A/T. This would leave the original data matrices unchanged: neither A/T nor T/T has any copies of C. That may actually be fine for detecting whether the C allele has a noticeable effect, but we're now also interested in whether the A allele does. We investigate that by adding a #A column:

phenotype        intercept  #A  #C
      175                1   0   0
      160                1   0   1
      155                1   0   2
      173                1   1   0

Of course, with only 4 samples, we can't conclude much. But (with a good choice of "other predictors") this approach becomes quite effective as your sample size increases.
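As a rough illustration, the regression on the four-sample matrix with the #A and #C columns can be fitted with ordinary least squares. The sketch below solves the normal equations in plain Python with exact fractions; the helper name `ols` is hypothetical, and a real analysis would normally rely on a dedicated GWAS tool or numerical library and would also report standard errors and p-values.

```python
from fractions import Fraction


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y.

    Uses Gauss-Jordan elimination with exact rational arithmetic.
    Raises StopIteration if X'X is singular (linearly dependent columns).
    """
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    aug = [
        [Fraction(sum(row[i] * row[j] for row in X)) for j in range(k)]
        + [Fraction(sum(row[i] * yi for row, yi in zip(X, y)))]
        for i in range(k)
    ]
    for col in range(k):
        # Find a usable pivot row and move it into place.
        pivot = next(r for r in range(col, k) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot entry becomes 1.
        aug[col] = [v / aug[col][col] for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][k] for i in range(k)]


# Columns: intercept, #A, #C (samples A, B, C, D from the example above).
X = [[1, 0, 0], [1, 0, 1], [1, 0, 2], [1, 1, 0]]
y = [175, 160, 155, 173]

intercept, beta_a, beta_c = ols(X, y)
print(float(intercept), float(beta_a), float(beta_c))
```

Because sample D is the only carrier of A, the #A coefficient is determined entirely by that one sample, which is another way of seeing why four samples cannot support any conclusion.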

ADD REPLY • link 4.5 years ago by chrchang523 11k

0

Ah, it's clear to me now, thank you

ADD REPLY • link 4.5 years ago by godth13teen ▴ 70

score 1 · Answer 2 · 2020-06-14

The model is usually linear, so 0, 1, 2 is the number of minor alleles in the genotype (0 = homozygous major, 1 = heterozygous, 2 = homozygous minor), and the assumption is that two minor alleles have twice the effect of one. It doesn't have to hold for every test and tool, but this is what I've seen. If there are alternative minor alleles, they could be treated as two different SNPs, assumed to have the same effect, or avoided altogether.
One way of dealing with epistasis could be to multiply the two SNPs' values and divide by 2 (to stay in the 0-2 range). I don't know a tool that can do this, but statistically it should be valid (assuming linear interaction and additive effects).
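A minimal sketch of that product term, using made-up 0/1/2 codes for two SNPs in five samples. The function name `interaction_column` is hypothetical, and keeping both main-effect columns next to the interaction column is an assumption based on common regression practice rather than something stated above.

```python
def interaction_column(snp1, snp2, scale=0.5):
    """Element-wise product of two 0/1/2 genotype codes, scaled into the 0-2 range."""
    return [scale * a * b for a, b in zip(snp1, snp2)]


# Made-up minor-allele counts for two SNPs across five samples.
snp1 = [0, 1, 2, 1, 2]
snp2 = [2, 1, 2, 0, 1]

inter = interaction_column(snp1, snp2)
print(inter)  # [0.0, 0.5, 2.0, 0.0, 1.0]

# Design matrix rows: intercept, SNP1, SNP2, SNP1 x SNP2.
design = [[1, a, b, ab] for a, b, ab in zip(snp1, snp2, inter)]
for row in design:
    print(row)
```

Dividing by 2 only rescales the interaction coefficient; it does not change whether the term is significant.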