Question

PCA from plink2 for SGDP using a pangenome and DeepVariant

14 months ago

Matteo Ungaro ▴ 110

Hi there,

I'm doing my first experiments with PCA and UMAP as dimensionality-reduction methods to visualize a dataset I've been working on. Basically, I took the samples from the SGDP, mapped them to the human pangenome and, finally, called small variants with DeepVariant.

I moved on to some population-genetics analyses and, as a preliminary inspection of the groups in this panel, I'm doing a PCA with Plink2. Starting from the joint callset for these ~300 samples, I removed genomic regions that could be troublesome, e.g. repeats, centromeres and satellites, low-mappability regions and segmental duplications (SDs). After this I attempted my first PCA but, for some reason, the samples are smeared all over the plot... (see figure below)

Searching around, I found this old but very useful post on how this should be done: I should have removed INDELs and kept only bi-allelic SNPs. So my next step was to run the following on my VCF file:

bcftools norm -m+ "$VCF" | bcftools view -m2 -M2 -v snps -Oz -o "$new_file_name"

However, the result didn't change significantly. The smearing issue persists and there are no defined clusters/groups in the plot...
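One way to rule out the filtering step as the cause is to confirm that the output really contains only bi-allelic SNPs. Below is a minimal sketch, assuming VCF text is piped in on stdin; `count_non_biallelic_snps` is a hypothetical helper written for illustration, not part of bcftools or plink2.

```shell
# Count VCF records that are NOT bi-allelic SNPs (reads VCF text on stdin).
# Hypothetical helper for illustration; not part of bcftools or plink2.
count_non_biallelic_snps() {
  awk -F'\t' '
    /^#/ { next }                      # skip meta and header lines
    {
      # A bi-allelic SNP has a single-base REF (column 4) and a
      # single-base ALT (column 5); anything else is counted.
      if ($4 !~ /^[ACGTNacgtn]$/ || $5 !~ /^[ACGTNacgtn]$/) bad++
    }
    END { print bad + 0 }              # prints 0 when nothing was flagged
  '
}
```

It could be used as `bcftools view "$new_file_name" | count_non_biallelic_snps`; a result of 0 would suggest the filter behaved as intended and the smearing comes from somewhere else.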

For reference, these are the Plink2 commands I'm using to generate the eigenvec and eigenval files used for plotting:

./plink2 --vcf $VCF --set-missing-var-ids @:#:\$r:\$a --rm-dup --indep-pairwise 200kb 0.5 --not-chr X,Y,MT --vcf-half-call m --out SGDP_snps_bi_norm

./plink2 --vcf $VCF --set-missing-var-ids @:#:\$r:\$a --not-chr X,Y,MT --vcf-half-call m --maf 0.05 --extract SGDP_snps_bi_norm.prune.in --make-pgen --pca --out SGDP_snps_bi_norm

which I double-checked with the author of the tool. I'm kind of lost as to what's going wrong. If anyone has more experience with this type of analysis, any help is much appreciated. Thanks in advance!
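When judging whether a plot is "smeared", it can help to see how much each plotted component contributes. Below is a minimal sketch, assuming the `.eigenval` file holds one eigenvalue per line; `pc_share` is a hypothetical helper name. Note that the percentages are relative to the sum of the reported eigenvalues only, not to the total genetic variance.

```shell
# Print each PC's share of the summed eigenvalues in a plink2 .eigenval file.
# Hypothetical helper; shares are relative to the reported PCs only.
pc_share() {
  awk '
    NF { ev[++n] = $1; total += $1 }   # store eigenvalues, skip blank lines
    END {
      if (total <= 0) exit 1           # nothing usable to report
      for (i = 1; i <= n; i++)
        printf "PC%d\t%.1f%%\n", i, 100 * ev[i] / total
    }
  ' "$1"
}
```

With the `--out` prefix used above, the call would be `pc_share SGDP_snps_bi_norm.eigenval`.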

umap DeepVariant pangenome plink2 pca • 1.1k views

updated 14 months ago by DBScan ▴ 470 • written 14 months ago by Matteo Ungaro ▴ 110

What is the actual goal of your PCA? What do you want to visualize? Is it just the ancestry of your samples?

14 months ago by DBScan ▴ 470

DBScan, not so much ancestry, but rather how the populations in this dataset cluster based on the reported place of origin where the samples were sequenced. In theory, individuals from the same place who belong to the same population in the dataset should stick together in the plot, reflecting their greater genetic similarity.

14 months ago by Matteo Ungaro ▴ 110

In theory the PCA plot looks good; maybe you just made an error in assigning populations to the samples? Basically, the samples on the right side should belong to AFR, and the ones at the bottom should be EUR.

14 months ago by DBScan ▴ 470

That was my first thought too, even before looking for solutions here; however, I cross-referenced the dataset's metadata multiple times and there are no errors in the population-to-sample assignment...
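A cross-check of this kind can also be scripted rather than done by eye. Below is a minimal sketch, assuming a tab-delimited `.eigenvec` file whose header contains an `IID` column, and a hypothetical two-column metadata table (sample ID, then population) with no header; `unlabeled_samples` and the file layout are illustrative assumptions, not something taken from the SGDP metadata itself.

```shell
# List samples in a plink2 .eigenvec file that have no population label.
# $1: .eigenvec file (tab-delimited, header starting with "#FID" or "#IID")
# $2: hypothetical two-column TSV without header: sample ID <tab> population
unlabeled_samples() {
  awk -F'\t' '
    NR == FNR { pop[$1] = $2; next }   # first file read: the metadata table
    FNR == 1 {                         # eigenvec header: locate the IID column
      col = 1
      for (i = 1; i <= NF; i++) {
        h = $i; sub(/^#/, "", h)
        if (h == "IID") col = i
      }
      next
    }
    !($col in pop) { print $col }      # sample ID missing from the metadata
  ' "$2" "$1"
}
```

Empty output from `unlabeled_samples SGDP_snps_bi_norm.eigenvec metadata.tsv` (where `metadata.tsv` is a placeholder name) would mean every sample in the PCA has a label. It would not catch labels that are present but swapped between samples.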

14 months ago by Matteo Ungaro ▴ 110

Maybe try a different program for PCA then? For a quick ancestry classification, you can use somalier. https://github.com/brentp/somalier

14 months ago by DBScan ▴ 470