Help with PCA plot from SNPRelate
1
0
Entering edit mode
2.5 years ago

Hello, BioStars community! I am working on a datase of 50 samples(between cases and controls) with genotyping data for ~200 SNPs. I ran SNPRelate PCA analysis without adding population data and tried to plot its results. My PCA plot seems a bit "strange" since the observations did not come together to form "groups" PCA PLOT
Also, I plotted principal component pairs for the first four PCs, but the dots did not appear on my figure PCA PLOT 2.

The commands I used for this analysis are down here. Would anyone help me to understand my output figures? What can one conclude from such a "non-grouped" PCA plot in terms of population stratification?

SNP pruning based on LD:

  set.seed(1000)
 snpset <- snpgdsLDpruning(genofile, autosome.only=FALSE, ld.threshold=0.2)

Chromosome 1: 45.16%, 14/31

Chromosome 2: 53.57%, 15/28

Chromosome 3: 73.33%, 11/15

 Chromosome 4: 62.50%, 5/8 

Chromosome 5: 58.33%, 7/12 

Chromosome 6:31.11%, 14/45 

Chromosome 7: 63.64%, 7/11 

Chromosome 8: 62.50%, 5/8

Chromosome 9: 50.00%, 1/2 

Chromosome 10: 26.67%, 4/15 

Chromosome 11:66.67%, 6/9 

Chromosome 12: 60.00%, 6/10 

Chromosome 14: 37.50%, 3/8 

Chromosome 15: 66.67%, 2/3 

Chromosome 16: 50.00%, 9/18 

Chromosome 17:100.00%, 4/4 

Chromosome 18: 21.43%, 3/14 

Chromosome 19: 66.67%, 4/6 

Chromosome 20: 40.00%, 2/5 

Chromosome 21: 100.00%, 1/1 

Chromosome 22: 25.00%, 3/12 

Chromosome X: 28.57%, 2/7 

128 markers are selected in total.

> str(snpset)

List of 22

 $ chr1 : int [1:14] 1 5 8 10 11 12 16 18 19 20 ...

 $ chr2 : int [1:15] 32 34 35 38 40 41 43 44 48 49 ...

 $ chr3 : int [1:11] 60 64 66 67 68 69 70 71 72 73 ...

 $ chr4 : int [1:5] 76 78 79 80 81

 $ chr5 : int [1:7] 83 85 87 88 90 92 94

 $ chr6 : int [1:14] 98 99 101 105 110 118 131 132 133 134 ...

 $ chr7 : int [1:7] 141 142 143 144 146 147 150

 $ chr8 : int [1:5] 152 153 154 155 156

 $ chr9 : int 160

 $ chr10: int [1:4] 161 166 167 172

 $ chr11: int [1:6] 177 179 180 181 182 184

 $ chr12: int [1:6] 187 189 190 192 193 194

 $ chr14: int [1:3] 195 196 199

 $ chr15: int [1:2] 203 205

 $ chr16: int [1:9] 206 208 211 214 215 220 221 222 223

 $ chr17: int [1:4] 224 225 226 227

 $ chr18: int [1:3] 228 238 241

 $ chr19: int [1:4] 242 244 245 246

 $ chr20: int [1:2] 249 251

 $ chr21: int 253

 $ chr22: int [1:3] 255 263 264

 $ chrX : int [1:2] 266 271

**#PCA ANALYSIS**

> pca <- snpgdsPCA(genofile, autosome.only=FALSE, snp.id=snpset.id, num.thread=2)


Principal Component Analysis (PCA) on genotypes:

Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)

    # of samples: 50

    # of SNPs: 128

    using 2 threads

    # of principal components: 32

PCA:    the sum of all selected genotypes (0,1,2) = 2377

CPU capabilities: Double-Precision SSE2

Fri May 27 17:39:12 2022    (internal increment: 7124)

[==================================================] 100%, completed, 0s  
Fri May 27 17:39:12 2022    Begin (eigenvalues and eigenvectors)
Fri May 27 17:39:12 2022    Done.

> pc.percent <- pca$varprop*100

> head(round(pc.percent, 2))

[1] 6.96 5.65 5.56 5.19 4.94 4.63

> tab <- data.frame(sample.id = pca$sample.id,
   EV1 = pca$eigenvect[,1],    # the first eigenvector
   EV2 = pca$eigenvect[,2],    # the second eigenvector
   stringsAsFactors = FALSE)

> head(tab)

  sample.id         EV1          EV2

1      Z100 -0.15640307  0.033644598

2      Z110  0.05745959  0.104254360

3      Z112 -0.11022784  0.007529606

4      Z113  0.17145391 -0.259216687

5      Z114 -0.14967317  0.056786244

6      Z125 -0.04177276  0.197420240

#PLOTTING PCA 

> plot(tab$EV2, tab$EV1, xlab="eigenvector 2", ylab="eigenvector 1")

**#Plot the principal component pairs for the first four PCs:**

> lbls <- paste("PC", 1:4, "\n", format(pc.percent[1:4], digits=2), "%", sep="")
> pairs(pca$eigenvect[,1:4], col=tab$pop, labels=lbls)
PLOT SNPRelate PCA • 1.6k views
ADD COMMENT
1
Entering edit mode
2.5 years ago
Mensur Dlakic ★ 28k

Would anyone help me to understand my output figures? What can one conclude from such a "non-grouped" PCA plot in terms of population stratification?

I don't know how many you were expecting, but there are two groups of points.

If I interpret you graphs correctly, your first two PCs explain only 12.6% of data variance. If so, it is not surprising that you don't get clear separation into groups. That may mean that there is no distinct grouping to begin with. Or it may mean that linear methods such as PCA can't separate your samples. You can try feeding top 50 or so PCs (or even your raw data) into t-SNE or UMAP, and see if non-linear dimensionality methods do a better job of separating your data.

By the way, I assume you normalized the data before PCA.

ADD COMMENT
1
Entering edit mode

You may get some idea as to what kind of separation can be obtained from t-SNE compared to PCA from the following notebook:

https://colab.research.google.com/drive/1_yo5kJqoG1amRnI4rh4IjNeC1peJyGVs

This kind of improvement is not to be expected for all datasets. Besides, not all datasets have clear underlying structure after dimensionality reduction.

ADD REPLY
0
Entering edit mode

Thank you Mr. Dlakic!

I agree with your point regarding the low data variance explained by the top 2 PCs. Since I am working on a genetic association study, my main doubt was to check for population stratification to include PCs as covariates in logistic regression association. I am not quite sure if these results suggest there is no need to include PCs as covariates for further association analysis... the genotyping data I used in this specif example was filtered for LD before running PCA , but I also tested the unfiltered vcf (adding just the 0.2 LD filter on the LD-based SNP pruning stage) and the result was about the same (~14.7% of variance explained by the top2 PCs).

I appreciate your help!!

ADD REPLY
1
Entering edit mode

12.6% of data variance. If so, it is not surprising that you don't get clear separation into groups.I'm not sure I totally agree with this - you can get structure which top 2 PCs that explain less than 2-3% of the variance. The proportion of variance explained varies a lot based on e.g. the number of SNPs and individuals used.

I think it's more likely you simply don't have enough SNPs/samples to uncover proper structure, 128 is barely anything for a human population.

ADD REPLY
0
Entering edit mode

That's a point, indeed. Do you think I should try to run it again without a LD treshold on the SNP prunning step? For this example, I used 0.2 as cuttoff.

ADD REPLY
0
Entering edit mode

I don't know how many you were expecting, but there are two groups of points.

I am expecting 2 groups, indeed (cases and controls)!!

ADD REPLY

Login before adding your answer.

Traffic: 2365 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6