Background: I am a grad student doing eQTL analysis and just starting to dip my toes into PLINK. From what I understand, LD pruning is typically done with the '--indep-pairwise' option. Additionally, I can use the '--show-tags all' option to keep track of the pruned SNPs.
Thing is, I came across a tutorial highlighting why clumping is preferred over pruning (https://privefl.github.io/bigsnpr/articles/pruning-vs-clumping.html). I believe I understand the limitation of pruning (e.g. a situation may arise where many SNPs are pruned, creating larger-than-intended regions with no SNP representation). That being said, I am quite confused about how clumping works and am simply looking for more material to read. For instance, I don't understand what the association test is calculating and how that is used in the clumping procedure (according to the tutorial, some MAF statistic is used, but that statistic isn't present in the association file I created). I'm also having difficulty understanding how the index variant and clump variant are used.
Perhaps I am going down a rabbit hole that I shouldn't be concerned with given my eventual goal (eQTL), but I was hoping someone could recommend some resources comparing the two approaches.
I can explain the algorithms for you:
pruning: it starts with the first SNP (in genome order) and computes its correlation with the following ones (e.g. the next 50). When it finds a large correlation, it removes one SNP from the correlated pair, keeping the one with the larger minor allele frequency (MAF), thus possibly removing the first SNP. Then it moves on to the next SNP that has not yet been removed. So, in a worst-case scenario, this algorithm may in fact remove all SNPs of the genome (except one).
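To make the pruning steps concrete, here is a minimal Python sketch. It follows the description above, not PLINK's actual implementation; the function names (r2, maf, prune), the window counted in number of SNPs, and the toy genotypes are all assumptions made up for illustration.

```python
def r2(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP: treat as uncorrelated
    return sxy * sxy / (sxx * syy)


def maf(g):
    """Minor allele frequency from 0/1/2 allele counts."""
    p = sum(g) / (2 * len(g))
    return min(p, 1 - p)


def prune(genotypes, window=50, threshold=0.2):
    """Return indices of SNPs kept by the pruning scheme described above.

    genotypes: one genotype vector per SNP, in genome order.
    window:    how many following SNPs each SNP is compared with.
    """
    m = len(genotypes)
    mafs = [maf(g) for g in genotypes]
    keep = [True] * m
    for i in range(m):
        if not keep[i]:
            continue
        for j in range(i + 1, min(i + 1 + window, m)):
            if not keep[j]:
                continue
            if r2(genotypes[i], genotypes[j]) > threshold:
                if mafs[i] >= mafs[j]:
                    keep[j] = False   # drop the later SNP
                else:
                    keep[i] = False   # drop the current SNP itself
                    break
    return [i for i in range(m) if keep[i]]


# Toy chain of 4 SNPs in 10 individuals: each SNP is correlated with the
# next one (r2 > 0.5) and has a slightly smaller MAF than it.
toy = [[1] * k + [0] * (10 - k) for k in (2, 3, 4, 5)]
print(prune(toy, window=50, threshold=0.5))  # -> [3]
```

On this toy chain each SNP loses to its right-hand neighbour, so only the last SNP survives, even though the first and last SNPs are only weakly correlated (r2 = 0.25). This is a small-scale version of the worst case mentioned above.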
clumping: it uses some statistic (usually the p-value in the case of GWAS/PRS) to sort the SNPs by importance (e.g. most significant first). It takes the first one (e.g. the most significant SNP) and removes SNPs that are too correlated with it in a window around it. As opposed to pruning, this procedure makes sure that this SNP is never removed, keeping at least one representative SNP per region of the genome. Then it moves on to the next most significant SNP that has not been removed yet. In the case of computing principal components, there is no p-value available, so I propose to use the MAF instead as the statistic to rank SNPs (in decreasing order). Using MAFs makes clumping very similar to pruning, but without any worst-case scenario.
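Here is a minimal Python sketch of clumping as described above. It is only an illustration, not the code used by PLINK or bigsnpr; the names (r2, clump), the convention that a larger score means a more important SNP (e.g. -log10 of the p-value, or the MAF), the window counted in number of SNPs rather than in physical distance, and the toy genotypes are all assumptions.

```python
def r2(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP: treat as uncorrelated
    return sxy * sxy / (sxx * syy)


def clump(genotypes, scores, window=50, threshold=0.2):
    """Return indices of the index SNPs kept by clumping.

    genotypes: one genotype vector per SNP, in genome order.
    scores:    importance of each SNP, larger = more important.
    window:    number of SNPs examined on each side of an index SNP.
    """
    m = len(genotypes)
    order = sorted(range(m), key=lambda i: scores[i], reverse=True)
    removed = [False] * m
    kept = []
    for i in order:
        if removed[i]:
            continue
        kept.append(i)  # the index SNP itself is never removed
        for j in range(max(0, i - window), min(m, i + window + 1)):
            if j != i and not removed[j]:
                if r2(genotypes[i], genotypes[j]) > threshold:
                    removed[j] = True  # j joins the clump of i
    return sorted(kept)


# Toy chain of 4 SNPs in 10 individuals; neighbours are correlated.
toy = [[1] * k + [0] * (10 - k) for k in (2, 3, 4, 5)]

# No p-values (e.g. before PCA): rank by MAF, in decreasing order.
mafs = [min(sum(g) / 20, 1 - sum(g) / 20) for g in toy]
print(clump(toy, mafs, threshold=0.5))          # -> [1, 3]

# With association results: SNP 0 has the strongest signal, so it stays.
print(clump(toy, [4, 1, 1, 1], threshold=0.5))  # -> [0, 2]
```

In the first call, SNP 3 (largest MAF) becomes an index SNP and only removes SNP 2; SNP 1 is then kept as a second index SNP and removes SNP 0. Every removed SNP is therefore correlated above the threshold with a SNP that is kept, which is the property pruning does not guarantee.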
If I remember correctly, that blog post was written with polygenic score analysis in mind, where clumping is preferred. The reason clumping is preferred in polygenic score analysis is that we want to retain the SNPs that have the strongest signal (lowest p-value). With pruning, SNPs are removed without regard to their association signal (effectively at random with respect to the p-value), whereas with clumping we preferentially retain the SNPs with the stronger signal, which allows us to construct a more predictive polygenic risk score.
Of course clumping should be preferred in Polygenic Score analysis.
In the document, the author (me) refers to the case of computing Principal Components, where pruning is typically used.
If this document is not clear enough, please mention which parts and I'll try to improve it.