Question

Why Does PCA Mirror the Geographic Map

2

14.0 years ago

jvijai ★ 1.2k

Why does population stratification using SNPs mirror geography?

It is not intuitively clear why allele frequencies and geographical distances (latitude and longitude) should even have an equi-dimensional relationship, but apparently allele frequencies can predict a person's geographic ancestry. There is even the famous article by Novembre et al. (titled "Genes mirror geography within Europe") that has demonstrated this very well.

The overall geographic pattern .. fits the theoretical expectation for models in which genetic similarity decays with distance in a two-dimensional habitat, as opposed to expectations for models involving discrete well-differentiated populations.[1]

The success of PCA-based correction is not unexpected here, because the PCs are excellent predictors of latitude and longitude, and we used only linear functions of latitude and longitude to determine the means of our simulated phenotypes. [1]

But any ideas why this occurs?
East Asian pops

European pops

This is probably a question for evolutionary biologists, but I think bioinformatics probably has an answer on PCA/MDS/SVD methodologies.

1: http://www.nature.com/nature/journal/v456/n7218/full/nature07331.html
2: http://bit.ly/hHXiOY
3: http://www.nature.com/nature/journal/v456/n7218/fig_tab/nature07331_F2.html

snp pca • 4.8k views

updated 14.0 years ago by Qdjm 1.9k • written 14.0 years ago by jvijai ★ 1.2k

0

Interesting topic, but not really bioinformatics, IMHO. This pattern is in the data, and not caused by the analysis methods.

14.0 years ago by Egon Willighagen 5.4k

0

I think it is an interesting and thoughtfully posed question. Importantly, it indirectly addresses one of the key analysis issues for any genetic association problem: population sub-structure confounding with endpoints. I think this is something that bioinformatics analysts should be aware of if they're going to work with GWAS data.

14.0 years ago by David Quigley 11k

Ram · Answer 1 · 2010-12-30

Warning: I am not an evolutionary biologist. That said:

You're more likely to mate with someone who lives near you. As you raise the coefficient of inbreeding (loosely, the degree to which two individuals share alleles that are identical by descent), you lower genetic diversity, which shows up as decreased variation between two SNP chip measurements. Looking at different countries on the same continent, Spaniards are most likely to mate with other Spaniards, but they're more likely to mate with someone from France than with someone from Russia, and if they migrate they're more likely to move somewhere nearby than far away. PCA picks up the biggest contributions to variation in the genotypes, which here track physical location. Interesting that they note more variation on the North-South axis, perhaps reflecting the Northern-Southern Europe cultural divide?
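
A toy simulation can make the "mate and migrate nearby" argument concrete. The sketch below is an illustration only, not the analysis from Novembre et al.: it assumes a square grid of demes that exchange migrants with their four nearest neighbours, with random drift inside each deme. The grid size, migration rate, deme size, number of generations and all function names are hypothetical choices made for the example. It then runs a plain PCA on sampled genotypes and asks how well the first two PCs predict each individual's grid position.

```python
import numpy as np

def simulate_allele_freqs(rng, grid=6, n_snps=2000, deme_size=100,
                          migration=0.1, generations=200):
    """Stepping-stone model: drift within demes, migration between neighbours."""
    p = np.full((grid, grid, n_snps), 0.5)  # every deme starts at frequency 0.5
    for _ in range(generations):
        # Migration: mix each deme with the mean of its four neighbours
        # (edge padding means border demes partly exchange with themselves).
        pad = np.pad(p, ((1, 1), (1, 1), (0, 0)), mode="edge")
        nb = (pad[:-2, 1:-1] + pad[2:, 1:-1] + pad[1:-1, :-2] + pad[1:-1, 2:]) / 4
        p = np.clip((1 - migration) * p + migration * nb, 0.0, 1.0)
        # Drift: binomial resampling of the 2N gene copies in each deme.
        p = rng.binomial(2 * deme_size, p) / (2 * deme_size)
    return p

def sample_individuals(rng, p, per_deme=5):
    """Draw diploid genotypes (0/1/2) and record each individual's (x, y) deme."""
    grid = p.shape[0]
    rows, cols = np.divmod(np.arange(grid * grid), grid)
    coords = np.repeat(np.column_stack([cols, rows]), per_deme, axis=0).astype(float)
    freqs = np.repeat(p.reshape(grid * grid, -1), per_deme, axis=0)
    return rng.binomial(2, freqs), coords

def top_pcs(genotypes, k=2):
    """PCA via SVD of the column-centred genotype matrix."""
    x = genotypes - genotypes.mean(axis=0)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :k] * s[:k]

def r_squared(pcs, target):
    """Fraction of variance in `target` explained by a linear function of the PCs."""
    design = np.column_stack([np.ones(len(pcs)), pcs])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return 1.0 - resid.var() / target.var()

rng = np.random.default_rng(0)
p = simulate_allele_freqs(rng)
genotypes, coords = sample_individuals(rng, p)
pcs = top_pcs(genotypes)
r2_x = r_squared(pcs, coords[:, 0])
r2_y = r_squared(pcs, coords[:, 1])
print(f"R^2 for x: {r2_x:.2f}, R^2 for y: {r2_y:.2f}")
```

Both coordinates are regressed on PC1 and PC2 together because, on a symmetric grid, the two leading axes can come out as any rotation of "east-west" and "north-south". Under these assumptions the two PCs should jointly explain most of the variance in each coordinate, even though geography was never given to the PCA.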

score 2 · Answer 2 · 2011-01-01

Warning: I'm at home without access to Nature and I don't have the methods in front of me, so I can only answer based on what is in your question.

Are you sure that you accept this result? Just because something is published in Nature, it doesn't mean that it is true. You need to look at everything with a very critical eye. I don't know the authors or the paper, though they have good reputations and I'm sure that they are careful scientists, but in complex data analysis that involves a lot of somewhat arbitrary choices, there's a lot of opportunity for confirmation bias to sneak in.

Here are a few things to think about when interpreting Figure 2 from Novembre et al.:

The distribution of simulated phenotypes was (linearly) scaled, rotated and flipped to make it correspond as closely as possible to the map of Europe. Even so, there are areas where this correspondence breaks down, for example, the ES/PT distributions are almost completely overlapping. Also, why are only ~1,400 individuals shown in the figure, not the full 3,000 individuals mentioned in the abstract? How was this subset selected?
How was the genetic distance between two individuals calculated? Is it simply the covariance of allelic frequencies across all 500k SNPs, or is there some pre-selection of SNPs and/or some sort of unusual distance measure? If it is the latter, before you accept Figure 2, you need to convince yourself that there was no feedback between the result and the choice of distance measure.
Where was the phenotyping done? Was it all in the same lab, or were there different labs depending on country of origin of the sample? Batch effects can depend on geographically distributed quantities like humidity.
The last thing to look closely at is how the "country of origin" was originally assigned to each sample. Was there some sort of pre-selection for exemplar samples?
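
For the second point above, it may help to see what a "simple covariance" version looks like. The sketch below is one common choice, not necessarily what the paper did: each SNP is centred on its mean and scaled by the binomial standard deviation sqrt(2p(1-p)), and the covariance between individuals is averaged over SNPs. The function names and the tiny genotype matrix are hypothetical.

```python
import numpy as np

def genetic_covariance(genotypes):
    """Covariance between individuals from an (individuals x SNPs) 0/1/2 matrix."""
    g = np.asarray(genotypes, dtype=float)
    freq = g.mean(axis=0) / 2.0              # sample allele frequency per SNP
    keep = (freq > 0) & (freq < 1)           # monomorphic SNPs carry no signal
    g, freq = g[:, keep], freq[keep]
    # Assumed standardisation: centre, then scale by sqrt(2p(1-p)).
    z = (g - 2 * freq) / np.sqrt(2 * freq * (1 - freq))
    return z @ z.T / z.shape[1]

def genetic_distance(cov):
    """Euclidean distance implied by the covariance: d_ij^2 = K_ii + K_jj - 2 K_ij."""
    diag = np.diag(cov)
    d2 = diag[:, None] + diag[None, :] - 2 * cov
    return np.sqrt(np.maximum(d2, 0.0))      # guard against tiny negative values

# Four individuals, four SNPs; the first two individuals are identical.
toy = np.array([[0, 1, 2, 0],
                [0, 1, 2, 0],
                [2, 1, 0, 1],
                [1, 0, 1, 2]])
cov = genetic_covariance(toy)
dist = genetic_distance(cov)
```

The eigenvectors of `cov` are the principal components of the individuals, so any pre-selection of SNPs or change of scaling at this step feeds straight into the PCA plot.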

In terms of PCA, if genetic distance does scale with geographic distance, and this is the major source of variation among the samples, then (as the authors point out) it is not at all surprising that the first two principal components are linear functions of latitude and longitude. This might even work if the relationship between geographic and genetic distance were non-linear but monotonically increasing, so long as the genetic distances were not very large: most manifolds are locally linear.
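
Since principal components are only defined up to rotation, reflection and scale, comparing a PC plot with a map always involves some alignment step like the scaling, rotating and flipping mentioned above. A minimal sketch of one way to do that is orthogonal Procrustes, shown below. This is a swapped-in illustration, not the procedure from the paper, and `procrustes_align` and the synthetic data are hypothetical.

```python
import numpy as np

def procrustes_align(pcs, coords):
    """Best scale + rotation/reflection mapping centred PCs onto centred coords."""
    p = pcs - pcs.mean(axis=0)
    c = coords - coords.mean(axis=0)
    # The orthogonal matrix q maximising trace(q.T @ p.T @ c) comes from the SVD.
    u, s, vt = np.linalg.svd(p.T @ c)
    q = u @ vt
    scale = s.sum() / (p ** 2).sum()
    aligned = scale * p @ q + coords.mean(axis=0)
    return aligned, q, scale

# Synthetic check: "PCs" that are a shrunk, rotated and flipped copy of the map.
rng = np.random.default_rng(1)
coords = rng.uniform(0, 10, size=(20, 2))        # stand-in for longitude/latitude
theta = 0.7
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
flip = np.diag([1.0, -1.0])
pcs = 0.01 * (coords - coords.mean(axis=0)) @ rotation @ flip

aligned, q, scale = procrustes_align(pcs, coords)
is_reflection = np.linalg.det(q) < 0             # True when a flip was needed
```

The point for interpreting the figure is that this alignment has only a few free parameters (one angle, one flip, one scale), so it cannot by itself manufacture a detailed resemblance to the map. Whether the SNP and sample selection could do so is a separate question.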