Is local ancestry inference typically run with array genotypes instead of imputed genotypes?
3.4 years ago
curious ▴ 820

Local ancestry inference is usually done with RFMix. For large cohorts (UK Biobank etc.), I get the strong impression that it is usually run on array genotypes (a few hundred thousand markers) instead of imputed genotypes (millions). Is this the case?

ancestry • 1.6k views
3.4 years ago
LauferVA 4.5k

This is a very difficult question to answer precisely. The theoretical argument is clear (based on the information content literature), but in practice there are a lot of ways to muddy the waters. Let me give a theoretical argument first, then make several practical arguments afterwards. I hope that gets at the theory while still addressing practical concerns.

Theory: There is information about the ancestry of an individual that can be extracted from correctly genotyped markers. If the local ancestry algorithm (hereafter, LAA) and the imputation algorithm (hereafter, IA) are equally good at extracting the information content from the source data (and if they are provided the same data to begin with), then there is no reason that local ancestry estimates built on the IA's output should outperform those from the LAA run on the genotyped data alone. This is why most authors (that I am aware of) use only the genotyped data: there is no information gain, and you are only providing additional redundant information. However, if either assumption doesn't hold, then using the imputed output could actually be better. Let me try to clarify by breaking down the input data portion.

Input Data:

  1. Your sample genotypes
  2. All reference genotypes provided
  3. Any additional information provided to an algorithm

To reiterate, in theory it should not matter what you call the algorithm (LAA vs IA). As long as it is implemented appropriately and receives the same data, you should not get any better estimate from one than from the other. Ideally, the algorithm will generate a "sufficient statistic" for the local ancestry estimate, which you can think of as the estimate you'd get if you correctly extracted all the information from 1. - 3.

Praxis: However, there are a number of problems and issues that may or may not apply to a particular project that could influence the process. I'll try to divide these into a sort of "pro" and "con" type list, here:

Potential reasons why including imputed markers could increase accuracy of local ancestry estimates (LAEs):

  • Many imputation algorithms run "in the cloud", so to speak. If you do not have access to all the background/reference genotypes (data type "2" in the list above) used to make the imputation estimates, and cannot get access to them, then it might be possible to generate a better local ancestry estimate using the imputed data than using the genotyped data alone.
  • Some imputation algorithms conduct pre-phasing, etc. using specialized panels and software. Again, if you do not have access to all of that, then it might be possible to generate better LAEs using the imputed, phased genotypes than using your raw genotyping data.

Potential reasons why including imputed markers could decrease accuracy of LAEs:

  • If the local ancestry estimation software regards any variant you provide as "ground truth" and does not allow for the possibility that a genotype was imputed incorrectly, then you will likely decrease the accuracy of your LAEs by including imputed markers (specifically, imputed markers with low quality metrics).
  • Assuming your LAA is no better and no worse at retaining information than the IA, it may run more slowly than it would on the genotyped markers only (because you are inputting far more markers) while gaining no actual information (because of the assumption at the beginning of this sentence).

Things you should do no matter what you decide: data preparation and QC will have the biggest impact on the final results.

  • 1A. You should include only genotyped variants that have high quality metrics (low missingness, no differential missingness between groups, not very far out of HWE, etc.)
  • 1B. In just the same way, if you do decide to use imputed markers, you should remove all imputed variants with low imputation quality. Personally, I would be pretty stringent about this (only include variants you are quite sure are imputed with very high accuracy).
  • 1C. If possible, run the analysis twice, once on genotyped + imputed (G + I) and once on genotyped only, and see how (dis)similar the results are.

If I can be of further help, let me know.
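To make 1A-1C a bit more concrete, here is a minimal sketch of how the two marker sets could be assembled before running local ancestry inference. Everything in it is an assumption for illustration: the `Variant` record, the field names, and especially the thresholds (2% missingness, HWE p >= 1e-6, imputation quality >= 0.95), which you would need to tune for your own cohort. It does not check differential missingness and is not tied to any particular tool's file format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    """Hypothetical per-variant QC summary (field names are assumptions)."""
    vid: str
    typed: bool                   # True if directly genotyped on the array
    missing_rate: float = 0.0     # fraction of samples with a missing call
    hwe_p: float = 1.0            # Hardy-Weinberg equilibrium test p-value
    info: Optional[float] = None  # imputation quality score, if imputed


def passes_qc(v, max_missing=0.02, min_hwe_p=1e-6, min_info=0.95):
    """Apply 1A to typed variants and 1B to imputed variants.

    All default thresholds are illustrative assumptions, not recommendations.
    """
    if v.typed:
        return v.missing_rate <= max_missing and v.hwe_p >= min_hwe_p
    return v.info is not None and v.info >= min_info


def build_marker_sets(variants, **thresholds):
    """Return (typed_only_ids, typed_plus_imputed_ids) for the two runs in 1C."""
    kept = [v for v in variants if passes_qc(v, **thresholds)]
    typed_only = [v.vid for v in kept if v.typed]
    typed_plus_imputed = [v.vid for v in kept]
    return typed_only, typed_plus_imputed
```

The two returned lists correspond to the "genotyped only" and "G + I" runs in 1C, so the same QC decisions feed both analyses and any difference in the ancestry calls comes from the imputed markers alone.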

If you do not have access to all the background/reference genotypes (data type "2" in the list above) used to make the imputation estimates, and cannot get access to them, then it might be possible to generate a better local ancestry estimate using the imputed data than the genotyped data alone

This is getting closer to what I am wondering, because I am mostly interested in this recent paper, where they use the inferred local ancestry for each variant in association testing frameworks for admixed individuals. They apply this approach to performing association tests on about 4K admixed African Europeans. I can't find it explicitly mentioned in the paper, but based on their code I think they are probably doing this on array genotype data (I might ask the authors though).

If they were limited to array data, it seems they could test more sites for association by applying LAI to well-imputed genotypes rather than just directly typed sites. What I wonder is whether 1) this is not computationally tractable, 2) it is known that LAI quality is degraded when performed on imputed genotypes compared to directly typed sites, or 3) I totally misunderstand the paper.

1C. If possible, run the analysis twice, once on G + I and once on genotyped only, and see how (dis)similar they are.

I do wonder a bit if this is a legit researchable question. I suppose I could mask directly typed sites, impute them, run LAI, and then compare the output to an analysis where those sites were not masked.
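A rough sketch of the bookkeeping for that masking experiment might look like the following. The imputation and LAI steps themselves are not shown; they would be run with whatever tools you already use. The function names, the 10% masking fraction, and the choice to compare per-site diploid ancestry dosage (0/1/2 copies of one ancestry, which sidesteps phase-switch differences between runs) are all assumptions made for illustration.

```python
import random


def choose_masked_sites(typed_ids, fraction=0.1, seed=0):
    """Randomly pick a fraction of directly typed sites to hide before imputation.

    Returns (kept_ids, masked_ids). The fraction and seed are illustrative.
    """
    rng = random.Random(seed)
    n_mask = int(len(typed_ids) * fraction)
    masked = set(rng.sample(sorted(typed_ids), n_mask))
    kept = [v for v in typed_ids if v not in masked]
    return kept, masked


def dosage_concordance(run_a, run_b):
    """Fraction of (sample, site) pairs where two LAI runs agree.

    Each run maps sample ID -> list of ancestry dosages (0, 1 or 2 copies of
    a given ancestry) at the same ordered set of sites.
    """
    agree = 0
    total = 0
    for sample, dos_a in run_a.items():
        dos_b = run_b[sample]
        if len(dos_a) != len(dos_b):
            raise ValueError(f"site lists differ in length for sample {sample}")
        agree += sum(a == b for a, b in zip(dos_a, dos_b))
        total += len(dos_a)
    return agree / total if total else float("nan")
```

Here `run_a` would hold the calls from the unmasked (typed) analysis and `run_b` the calls from the masked-then-imputed analysis, restricted to the same sites. Overall concordance is a blunt summary; breaking it down by distance to the nearest ancestry switch point would probably be more informative.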


Regarding the recent paper: Yeah. There are all sorts of ways to do it, and there are key papers here and there that demonstrate which approaches have better statistical power and under what circumstances.

It is my experience (with my own data) that, assuming the null hypotheses are framed appropriately, random effects / mixed modeling almost always outperforms more "traditional" fixed effects modeling.

There are a couple cute benefits of (RE/Mixed) modeling; for example, because the genotype matrix is included, you do not need to exclude more distantly related samples to avoid inflating type I error. There are a few other reasons, but they are fairly technical to try to describe in a text format. We could set up a call if you wanted.

Ultimately, the most quantitative and (debatably?) the most scientific way to address this question is to perform a power analysis of the various approaches (Tractor being one of them) with the specifics of your cohort in mind, then to select the approach that you believe to be the best powered.

However, really implementing such a comparative analysis is very time consuming, so I think most researchers elect to choose what seems like the best option based on heuristics.
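For a sense of what the simplest building block of such a power analysis looks like, here is a toy simulation. To be clear about what it is not: it is not Tractor and not a mixed model. It is a plain single-variant linear regression of a simulated quantitative trait on genotype dosage, with a nominal two-sided threshold of |z| > 1.96 standing in for whatever significance threshold you would really use (a genome-wide threshold would be far stricter). The function name, parameters, and defaults are all assumptions. A real comparison would swap in each candidate method and simulate admixed genotypes that resemble your cohort.

```python
import math
import random


def simulate_power(n, maf, effect, n_sims=200, z_crit=1.96, seed=0):
    """Estimate power of a simple linear-regression association test by simulation.

    n       : number of individuals per simulated study
    maf     : allele frequency of the tested variant
    effect  : per-allele effect on a trait with unit residual variance
    z_crit  : rejection threshold on |beta / se| (normal approximation)
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Genotype dosage: sum of two independent Bernoulli(maf) alleles.
        g = [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]
        y = [effect * gi + rng.gauss(0.0, 1.0) for gi in g]
        mg = sum(g) / n
        my = sum(y) / n
        sxx = sum((gi - mg) ** 2 for gi in g)
        if sxx == 0:
            continue  # monomorphic replicate: nothing to test
        sxy = sum((gi - mg) * (yi - my) for gi, yi in zip(g, y))
        beta = sxy / sxx
        rss = sum((yi - my - beta * (gi - mg)) ** 2 for gi, yi in zip(g, y))
        se = math.sqrt(rss / (n - 2) / sxx)
        if abs(beta / se) > z_crit:
            hits += 1
    return hits / n_sims
```

Calling it with `effect=0.0` gives a quick check that the false positive rate is near the nominal level, and sweeping `n`, `maf`, and `effect` gives a crude power curve. Even this toy version takes a while to run at realistic sample sizes, which is part of why the full comparison is so time consuming.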


Regarding your response to 1C, that approach sounds like it could get published. I think that to make it high impact, you'd need to propose your own solution based on the results of your study, then show that it's better. I think that is a useful research question, but I am wondering what your immediate goals are. For instance, if you are a PhD candidate in a statistics department and you really want to dig deeper into imputation, mixed modeling, etc. in the context of admixed populations, I'd say knock yourself out. If you're mostly a biologist by training, though, you might ultimately generate more impact by doing the association study as well as possible and as quickly as possible using any of the current state-of-the-art algorithms, then doing functional follow-up on the results you generate.

Good luck, my curious friend. Let me know if I can be of further help.
