Hi,
I have a list of genes, for which I need to compare LD plots of the gene across all subpopulations in 1000 Genome. I have written a perl script to do so but have run into a few challenges.
My script does the following, given a gene name:
- extract the list of variants (maf 0.05) present in 1000G data from the VCF file
- For each subpopulation in 1000G, pairwise LD is calculated for all variants using PLINK
- Plot the LD for all subpopulations in one PDF file using R.
However, I run into a couple of challenges, as I am new to population genetics.
- From the image below, you can see the list of variants is not the same for all subpopulations. So can I just take the common subset of variants so that I can compare the LD among them?
2) For some genes, the number of variants are too many (>100, sometimes 200-300) and hence the LD plot does not appear or is uninformative (see below). How can I subset the list of variants WITHOUT LOSING LD structure? (NOTE: --indep option in PLINK is not suitable for me, I am NOT looking for independent SNPS)
Thank you for the response Dr. Blighe! To answer your queries
I am new to population genetics, but what I understand so far is that to compare LD between 2 population, I should use the same set of variants. Is that not so? My aim is to see if there is a difference in gene structure among the different populations.
The MAF filter is applied globally - I filtered the VCF file for each chromosome using vcftools
Maybe I should increase my canvas size and the plots might display. I will try that.
In the plink forum, when I asked this question, I was adviced to use this option. But my understanding of this option is that it will give a list of variants that are independent and NOT in LD with each other, which is opposite of what I am looking for.
Yes, it identifies a set of variants that are in 'approximate' linkage equilibrium, and this is achieved by identifying and filtering out other variants that are in disequilibrium, based on a threshold, like r-squared or VIF.
I have not used gaston, unfortunately.
For haploblock analysis, you don't have to filter out any SNPs based on LD; however, you could still apply the Global MAF filter, if you wish. This being said, the haploblocks are very much defined based on r-squared or some other metric.
What you are doing we did previously for comparing 1000 Genomes populations, i.e., comparing their haplo-blocks and -types across a specific gene. If you can take a look at my part 3, here: Manhattan plots and linkage disequilibrium heatmap, and also here for a more complete walkthrough: How do I compute ld blocks from the hapmap ld_data?
The actual comparisons were then just visual, e.g., "this 'haplotype' was observed at higher frequency in South-East Asians when compared to Europeans".
Thank you so much for your insights! I have a clearer picture of what is to be done now. I had read the part 3 of your tutorial earlier but understood its significance to my problem only now. The ld blocks tutorial makes it clearer.