Hello, BioStars community! I am working on a datase of 50 samples(between cases and controls) with genotyping data for ~200 SNPs. I ran SNPRelate PCA analysis without adding population data and tried to plot its results. My PCA plot seems a bit "strange" since the observations did not come together to form "groups"
Also, I plotted principal component pairs for the first four PCs, but the dots did not appear on my figure .
The commands I used for this analysis are down here. Would anyone help me to understand my output figures? What can one conclude from such a "non-grouped" PCA plot in terms of population stratification?
SNP pruning based on LD:
set.seed(1000)
snpset <- snpgdsLDpruning(genofile, autosome.only=FALSE, ld.threshold=0.2)
Chromosome 1: 45.16%, 14/31
Chromosome 2: 53.57%, 15/28
Chromosome 3: 73.33%, 11/15
Chromosome 4: 62.50%, 5/8
Chromosome 5: 58.33%, 7/12
Chromosome 6:31.11%, 14/45
Chromosome 7: 63.64%, 7/11
Chromosome 8: 62.50%, 5/8
Chromosome 9: 50.00%, 1/2
Chromosome 10: 26.67%, 4/15
Chromosome 11:66.67%, 6/9
Chromosome 12: 60.00%, 6/10
Chromosome 14: 37.50%, 3/8
Chromosome 15: 66.67%, 2/3
Chromosome 16: 50.00%, 9/18
Chromosome 17:100.00%, 4/4
Chromosome 18: 21.43%, 3/14
Chromosome 19: 66.67%, 4/6
Chromosome 20: 40.00%, 2/5
Chromosome 21: 100.00%, 1/1
Chromosome 22: 25.00%, 3/12
Chromosome X: 28.57%, 2/7
128 markers are selected in total.
> str(snpset)
List of 22
$ chr1 : int [1:14] 1 5 8 10 11 12 16 18 19 20 ...
$ chr2 : int [1:15] 32 34 35 38 40 41 43 44 48 49 ...
$ chr3 : int [1:11] 60 64 66 67 68 69 70 71 72 73 ...
$ chr4 : int [1:5] 76 78 79 80 81
$ chr5 : int [1:7] 83 85 87 88 90 92 94
$ chr6 : int [1:14] 98 99 101 105 110 118 131 132 133 134 ...
$ chr7 : int [1:7] 141 142 143 144 146 147 150
$ chr8 : int [1:5] 152 153 154 155 156
$ chr9 : int 160
$ chr10: int [1:4] 161 166 167 172
$ chr11: int [1:6] 177 179 180 181 182 184
$ chr12: int [1:6] 187 189 190 192 193 194
$ chr14: int [1:3] 195 196 199
$ chr15: int [1:2] 203 205
$ chr16: int [1:9] 206 208 211 214 215 220 221 222 223
$ chr17: int [1:4] 224 225 226 227
$ chr18: int [1:3] 228 238 241
$ chr19: int [1:4] 242 244 245 246
$ chr20: int [1:2] 249 251
$ chr21: int 253
$ chr22: int [1:3] 255 263 264
$ chrX : int [1:2] 266 271
**#PCA ANALYSIS**
> pca <- snpgdsPCA(genofile, autosome.only=FALSE, snp.id=snpset.id, num.thread=2)
Principal Component Analysis (PCA) on genotypes:
Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
# of samples: 50
# of SNPs: 128
using 2 threads
# of principal components: 32
PCA: the sum of all selected genotypes (0,1,2) = 2377
CPU capabilities: Double-Precision SSE2
Fri May 27 17:39:12 2022 (internal increment: 7124)
[==================================================] 100%, completed, 0s
Fri May 27 17:39:12 2022 Begin (eigenvalues and eigenvectors)
Fri May 27 17:39:12 2022 Done.
> pc.percent <- pca$varprop*100
> head(round(pc.percent, 2))
[1] 6.96 5.65 5.56 5.19 4.94 4.63
> tab <- data.frame(sample.id = pca$sample.id,
EV1 = pca$eigenvect[,1], # the first eigenvector
EV2 = pca$eigenvect[,2], # the second eigenvector
stringsAsFactors = FALSE)
> head(tab)
sample.id EV1 EV2
1 Z100 -0.15640307 0.033644598
2 Z110 0.05745959 0.104254360
3 Z112 -0.11022784 0.007529606
4 Z113 0.17145391 -0.259216687
5 Z114 -0.14967317 0.056786244
6 Z125 -0.04177276 0.197420240
#PLOTTING PCA
> plot(tab$EV2, tab$EV1, xlab="eigenvector 2", ylab="eigenvector 1")
**#Plot the principal component pairs for the first four PCs:**
> lbls <- paste("PC", 1:4, "\n", format(pc.percent[1:4], digits=2), "%", sep="")
> pairs(pca$eigenvect[,1:4], col=tab$pop, labels=lbls)
You may get some idea as to what kind of separation can be obtained from t-SNE compared to PCA from the following notebook:
https://colab.research.google.com/drive/1_yo5kJqoG1amRnI4rh4IjNeC1peJyGVs
This kind of improvement is not to be expected for all datasets. Besides, not all datasets have clear underlying structure after dimensionality reduction.
Thank you Mr. Dlakic!
I agree with your point regarding the low data variance explained by the top 2 PCs. Since I am working on a genetic association study, my main doubt was to check for population stratification to include PCs as covariates in logistic regression association. I am not quite sure if these results suggest there is no need to include PCs as covariates for further association analysis... the genotyping data I used in this specif example was filtered for LD before running PCA , but I also tested the unfiltered vcf (adding just the 0.2 LD filter on the LD-based SNP pruning stage) and the result was about the same (~14.7% of variance explained by the top2 PCs).
I appreciate your help!!
12.6% of data variance. If so, it is not surprising that you don't get clear separation into groups
.I'm not sure I totally agree with this - you can get structure which top 2 PCs that explain less than 2-3% of the variance. The proportion of variance explained varies a lot based on e.g. the number of SNPs and individuals used.I think it's more likely you simply don't have enough SNPs/samples to uncover proper structure, 128 is barely anything for a human population.
That's a point, indeed. Do you think I should try to run it again without a LD treshold on the SNP prunning step? For this example, I used 0.2 as cuttoff.
I am expecting 2 groups, indeed (cases and controls)!!