Question

What is a bad lowest-CV value for Admixture?


10.0 years ago

devenvyas ▴ 770

I am running Admixture analyses. In order to recapitulate some IACs from the literature, I took the intersection of the larger dataset I am working with, which is based on the Affymetrix Human Origins array, with data from some populations of interest, which were genotyped on the Illumina Omni 1M chip. I had about 140,000 SNPs after merging, and then about 111,000 or 120,000 after pruning (--indep-pairwise 200 25 0.4 or --indep-pairwise 50 5 0.5, respectively).

When I use only the Affymetrix Human Origins data (about 280,000 SNPs after pruning), the CV errors reach a minimum or plateau at around 0.33 or 0.35-0.36.

With the overlap dataset, my CV errors are much larger. For example, with the 50 5 0.5 pruning settings, here are my CV errors:

CV error (K=1): 0.58226
CV error (K=2): 0.54319
CV error (K=3): 0.53868
CV error (K=4): 0.53628
CV error (K=5): 0.53454
CV error (K=6): 0.53349
CV error (K=7): 0.53230
CV error (K=8): 0.53179
CV error (K=9): 0.53115
CV error (K=10): 0.53091
CV error (K=11): 0.53074
CV error (K=12): 0.53059
CV error (K=13): 0.53086
CV error (K=14): 0.53057
CV error (K=15): 0.53094
CV error (K=16): 0.53102
CV error (K=17): 0.53136
CV error (K=18): 0.53161
CV error (K=19): 0.53186
CV error (K=20): 0.53243
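
For anyone wanting to pick the minimum out of a list like this, here is a minimal sketch. It assumes the log lines look exactly like the ones above; the function names `parse_cv_errors` and `best_k` are made up for illustration and are not part of Admixture.

```python
import re

# Matches lines such as "CV error (K=3): 0.53868", the format shown above.
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")


def parse_cv_errors(text):
    """Return {K: cv_error} for every CV line found in the given log text."""
    return {int(k): float(err) for k, err in CV_LINE.findall(text)}


def best_k(cv_errors):
    """Return the K with the smallest CV error."""
    return min(cv_errors, key=cv_errors.get)


# A few of the values from the list above, pasted in as text.
log_text = """
CV error (K=11): 0.53074
CV error (K=12): 0.53059
CV error (K=13): 0.53086
CV error (K=14): 0.53057
CV error (K=15): 0.53094
"""

errors = parse_cv_errors(log_text)
k = best_k(errors)
print(k, errors[k])  # prints: 14 0.53057
```

On these values the minimum falls at K=14, but K=12 through K=14 differ by less than 0.0003, so the curve is essentially flat in that range.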

I am wondering why these are so high compared with the original dataset. Are they too high, or is this reasonable?

Thanks!

admixture SNP CV • 5.1k views

updated 2.9 years ago by Ram 45k • written 10.0 years ago by devenvyas ▴ 770

Ram · Answer 1 · 2015-08-21

It is very likely that the CV values are higher because of the additional variability introduced by merging two different datasets. The two datasets could differ in many ways: they may have been generated at different times, on two different companies' arrays, by different people, and on different individuals from potentially different populations. All of those things can affect the cross-validation error.

The array difference is the only one of these I can confirm, because you mention it, but from your description the rest seems plausible.

The values do seem ok to me.

One way to see whether you can lower them is to generate the admixture estimates, look for SNPs that have very different allele frequencies between the two groups / chips despite similar ancestry estimates, remove those SNPs, and then regenerate the estimates.
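
The allele-frequency comparison step could look roughly like the sketch below. This is only an illustration under several assumptions: genotypes are coded as 0/1/2 copies of one allele with `None` for missing calls, the samples passed in have already been restricted to individuals with similar ancestry estimates, the chip labels "A" and "B" and the function names are hypothetical, and the 0.2 cutoff is arbitrary and would need tuning.

```python
def allele_freq(genotypes):
    """Allele frequency from 0/1/2 genotype codes, skipping missing (None) calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    return sum(called) / (2 * len(called))


def flag_discordant_snps(geno_by_snp, chip_of_sample, max_diff=0.2):
    """Return SNP IDs whose allele frequency differs by more than max_diff between chips.

    geno_by_snp: {snp_id: [genotype per sample]}
    chip_of_sample: list of "A" or "B", one per sample (hypothetical labels)
    max_diff: arbitrary cutoff chosen for illustration only
    """
    flagged = []
    for snp, genos in geno_by_snp.items():
        freq_a = allele_freq([g for g, c in zip(genos, chip_of_sample) if c == "A"])
        freq_b = allele_freq([g for g, c in zip(genos, chip_of_sample) if c == "B"])
        # Skip SNPs with no called genotypes on one of the chips.
        if freq_a is None or freq_b is None:
            continue
        if abs(freq_a - freq_b) > max_diff:
            flagged.append(snp)
    return flagged


# Toy data: three samples per chip, made-up SNP IDs.
chips = ["A", "A", "A", "B", "B", "B"]
genos = {
    "rs_toy1": [0, 1, 1, 0, 1, 1],           # same frequency on both chips
    "rs_toy2": [0, 0, 0, 2, 2, 2],           # 0.0 vs 1.0, e.g. a possible strand flip
    "rs_toy3": [1, None, 1, 1, 1, None],     # 0.5 vs 0.5 after dropping missing calls
}

print(flag_discordant_snps(genos, chips))  # prints: ['rs_toy2']
```

The flagged IDs could then be written to a file and dropped from the merged dataset (for example with PLINK's --exclude) before rerunning Admixture.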