Question

PCA plot interpretation (single population)

1

14 months ago

Shane ▴ 20

Hi everyone,

I am new to the bioinformatics field, so I would appreciate your help. I have SNP data (Axiom array) for ~4,300 individuals from one population in Europe. I am doing a GWAS on hypertension (case/control) and have followed the QC tutorial from Marees et al., 2017.

After pruning (--indep-pairwise 100 10 0.2) and filtering (--maf 0.1), I was left with 71,139 variants for the PCA.

Most of the articles I have seen use at least two populations. Does it make sense to plot just one population, or should I be plotting alongside other publicly available population data (e.g. 1000 Genomes)? When I run plink --bfile pruned_data --pca --out data_pca, the proportion of variance explained by each PC is always equal (10 PCs: 10% each; 5 PCs: 20% each). In the GWAS tutorial, I see that you can control for population stratification. Would this be necessary in my case, given that I know the data are from one population?

[Image: PCA plot]

Thank you.

PCA GWAS pruning • 1.5k views

updated 6 months ago by AH • 0 • written 14 months ago by Shane ▴ 20

2

I'm slightly concerned about the variance your PCs explain when you change the number of PCs. I've never seen PC10 explain 10% of the variance, nor equal variance across more than two axes. PCs should be ranked, with PC1 explaining the most variance.

And generally, a PCA should result in the same PC1 and PC2 regardless of the number of PCs calculated unless your data is too homogeneous.
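One possible explanation for the equal shares, offered as an assumption rather than a diagnosis: if each reported eigenvalue is divided by the sum of only the reported eigenvalues, the shares depend on how many PCs were requested. When the top eigenvalues are nearly equal (a fairly homogeneous sample), k PCs then each come out at about 1/k. The sketch below uses made-up eigenvalues and a hypothetical helper, variance_shares, to illustrate the effect.

```python
def variance_shares(eigenvalues, total=None):
    """Return each eigenvalue as a fraction of `total`.

    If `total` is None, the sum of the supplied eigenvalues is used, which is
    what happens when only the top-k reported eigenvalues are normalised
    against each other.
    """
    denom = sum(eigenvalues) if total is None else total
    return [ev / denom for ev in eigenvalues]


# Made-up, nearly flat eigenvalues (illustrative only, not real output).
top10 = [1.05, 1.04, 1.03, 1.02, 1.01, 1.00, 0.99, 0.98, 0.97, 0.96]

shares_10 = variance_shares(top10)       # each close to 1/10
shares_5 = variance_shares(top10[:5])    # each close to 1/5

# Normalising by an assumed total variance (hypothetical value) gives shares
# that do not change with the number of PCs requested.
assumed_total = 4300.0
stable_10 = variance_shares(top10, total=assumed_total)
stable_5 = variance_shares(top10[:5], total=assumed_total)

print([round(s, 3) for s in shares_10])
print([round(s, 3) for s in shares_5])
print(stable_10[:5] == stable_5)
```

If this is what is happening, the shares say little on their own; the raw eigenvalues and how quickly they decay are more informative.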

I don't think you should add 1000G data as it's not something you plan to use elsewhere in the experiment, and PC1 and PC2 (and maybe more) would likely just show the difference between your data and 1000G's populations.

I would look through the other PCs, but it doesn't look like population stratification is a serious issue with your data, though you can also check for stratification with an admixture plot.

14 months ago by dthorbur ★ 2.9k

2

"In the GWAS tutorial, I see that you can control for population stratification. Would this be necessary in my case, given that I know the data are from one population?"

Absolutely. It is wise to determine the population structure of your data genetically rather than assuming from phenotypic information that everyone shares the same ancestry. I would suggest performing PCA together with the 1KG populations and seeing where your subjects fall. From the PCA you have shown here, there is clearly more than one ancestry.

14 months ago by bk11 ★ 3.0k

0

Just an update for anyone interested: I followed this post to download the 1KG data, found the variants common to my data and the 1KG data, merged the two datasets, and performed PCA on the merged data.

[Image: PCA plot of my data + 1KG]
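The step of finding variants shared with the 1KG data can be sketched as follows. This is a minimal illustration, assuming both datasets are in PLINK binary format with .bim files (six whitespace-separated columns: chromosome, variant ID, genetic distance, base-pair position, allele 1, allele 2) and that variants are matched on chromosome, position, and unordered allele pair. The helper names and toy records are hypothetical, and strand flips or differing genome builds are not handled.

```python
def read_bim(lines):
    """Map (chrom, bp, unordered alleles) -> variant ID for .bim-style lines."""
    variants = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        chrom, var_id, _cm, bp, a1, a2 = fields
        key = (chrom, int(bp), frozenset((a1, a2)))
        variants[key] = var_id
    return variants


def common_variant_ids(study_bim, ref_bim):
    """Return (study_id, ref_id) pairs for variants present in both datasets."""
    study = read_bim(study_bim)
    ref = read_bim(ref_bim)
    shared = sorted(study.keys() & ref.keys(), key=lambda k: (k[0], k[1]))
    return [(study[k], ref[k]) for k in shared]


# Toy records (made up for illustration).
study_lines = [
    "1 AX-001 0 1000 A G",
    "1 AX-002 0 2000 C T",
    "2 AX-003 0 1500 G T",
]
ref_lines = [
    "1 rs111 0 1000 G A",  # same site, alleles listed in the other order
    "1 rs222 0 2000 C A",  # same position, different alleles: not matched
    "2 rs333 0 1500 G T",
]

pairs = common_variant_ids(study_lines, ref_lines)
print(pairs)
```

The matched IDs from each side could then be written to a text file and passed to PLINK's --extract before merging, since the array and 1KG IDs for the same site often differ.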

14 months ago by Shane ▴ 20

1

It is hard to judge the significance of the distance between these two blobs of points; use 1KG to see where your samples fall compared with other populations. Also, plot other PCs: PC1 vs PC2, PC1 vs PC3, etc.
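A minimal sketch of pulling PC pairs out of the PCA output for plotting, assuming the PLINK 1.9 .eigenvec layout (FID, IID, then one column per PC, whitespace separated, no header line). The helper names and toy rows are hypothetical, and the plotting itself is left to whichever library is preferred.

```python
from itertools import combinations


def parse_eigenvec(lines):
    """Return (sample_ids, pcs) where pcs[i] holds the values for PC i+1."""
    sample_ids, rows = [], []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        sample_ids.append((fields[0], fields[1]))
        rows.append([float(x) for x in fields[2:]])
    pcs = [list(col) for col in zip(*rows)]  # transpose rows into columns
    return sample_ids, pcs


def pc_pairs(pcs, max_pc=3):
    """Yield (label_x, label_y, xs, ys) for each pair among the first max_pc PCs."""
    n = min(max_pc, len(pcs))
    for i, j in combinations(range(n), 2):
        yield f"PC{i + 1}", f"PC{j + 1}", pcs[i], pcs[j]


# Toy rows (made up): FID IID PC1 PC2 PC3
toy = [
    "F1 S1 0.010 -0.020 0.005",
    "F2 S2 -0.015 0.012 0.001",
    "F3 S3 0.004 0.009 -0.007",
]

ids, pcs = parse_eigenvec(toy)
labels = [(x, y) for x, y, _, _ in pc_pairs(pcs)]
print(labels)
```

Each yielded pair can be handed to a scatter-plot call, coloured by population label once the 1KG samples are included.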

14 months ago by zx8754 12k

0

I have plotted PC1 vs PC2 and PC1 vs PC3; the two plots are identical.

[Image: PCA plot 2]
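Because principal components are eigenvectors of a symmetric matrix, distinct PC columns should be mutually orthogonal. Identical PC1 vs PC2 and PC1 vs PC3 plots may therefore point to the same column being plotted twice rather than to a property of the data. A quick sanity check, sketched with hypothetical helper names and toy vectors:

```python
def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def same_column(a, b, tol=1e-12):
    """True if two PC columns are element-wise equal within `tol`."""
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))


# Toy orthonormal vectors standing in for PC2 and PC3 (made up).
pc2 = [0.5, 0.5, -0.5, -0.5]
pc3 = [0.5, -0.5, 0.5, -0.5]

print(same_column(pc2, pc3))  # distinct columns
print(same_column(pc2, pc2))  # what a plotting mix-up would look like
print(dot(pc2, pc3))          # orthogonal columns give a dot product near 0
```

If same_column returns True for the real PC2 and PC3 columns, the plotting code or the column indexing is the first thing to check.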

14 months ago by Shane ▴ 20

1

There is no such thing as "a single population". Humans exist along a continuum, and the discontinuities within it are multi-level. If you take just Europeans, you will find sub-populations. If you take just one of those sub-populations, you will find sub-sub-populations. In particular, the group of individuals who call themselves "European" is definitely not a homogeneous group.

In general, the fully mixed, perfectly randomly interbreeding populations that are used in population biological models don't exist.

14 months ago by i.sudbery 21k

0

If you really only have European samples, I would say you can ignore the PCA. But if you want to be sure, doing a PCA is a good idea.

14 months ago by DBScan ▴ 470

0

Which of these three options is best?

1. Run PCA with 1000 Genomes, identify and remove outliers, then rerun PCA on your own study samples only and include those PCs as covariates.
2. Run PCA with the relevant 1000 Genomes population (in this case Europeans, or in other cases SAS), identify and remove outliers, then rerun PCA on the same combined set (study samples + relevant 1000 Genomes population) and use those PCs as covariates.
3. Run PCA with 1000 Genomes and remove any identified outliers, rerun PCA on the study population only, identify and remove outliers again, then rerun PCA after outlier removal and add those PCs as covariates.

Shane, DBScan
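All three options share an "identify outliers, remove them, rerun PCA" step. One common convention, used here purely as an assumption, is to flag samples lying more than a chosen number of standard deviations from the mean on any of the leading PCs. The function name, the threshold, and the toy values below are hypothetical.

```python
from statistics import mean, pstdev


def flag_outliers(sample_ids, pcs, n_sd=6.0):
    """Return IDs of samples more than n_sd SDs from the mean on any PC.

    `pcs` is a list of columns: pcs[i][j] is the PC i+1 value for sample j.
    """
    flagged = set()
    for column in pcs:
        centre, spread = mean(column), pstdev(column)
        if spread == 0:
            continue  # a constant column cannot define outliers
        for sample, value in zip(sample_ids, column):
            if abs(value - centre) > n_sd * spread:
                flagged.add(sample)
    return sorted(flagged)


# Toy data (made up): 100 tightly clustered samples plus one distant sample.
ids = [f"S{i}" for i in range(1, 102)]
pc1 = [0.01 * ((i % 5) - 2) for i in range(100)] + [1.0]
pc2 = [0.01 * ((i % 3) - 1) for i in range(101)]

outliers = flag_outliers(ids, [pc1, pc2])
print(outliers)
```

Whichever option is chosen, the flagged IDs would be removed and the PCA rerun on the remaining samples; the options differ only in which samples go into each run.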

6 months ago by AH • 0