Question

GWAS data from an Illumina Omni express Array and Illumina 660 W Quad Array

9.2 years ago

Sheila ▴ 460

Is it possible to run a GWAS analysis where half of the subjects have data from an Illumina Omni express array and the other half have data from an Illumina 660 W Quad Array?

What steps are required to combine these two groups so that both datasets can be included in a single, complete analysis?

Thanks!

illumina data gwas • 3.7k views

updated 2.8 years ago by Ram 45k • written 9.2 years ago by Sheila ▴ 460

Ram · Answer 1 · 2016-02-10

This is a very large question with no simple answer.

Here is what you should do:

Google "GWAS quality control"
Start reading papers like this one from Stephen Turner: "Quality Control Procedures for GWAS" http://www.ncbi.nlm.nih.gov/pubmed/21234875
As you read these papers (there are a couple dozen that will help you) start to take notes on what kinds of things they recommend. For instance, you will want to do QC by variant, by sample (individual person), by batch or plate, and by chip. Take notes on each of those.

Once you have a command of the literature, construct something like this:

I. Initial processing of new data

Genotype Calling (Illuminus)
X and Y probe intensity, Structural Variation (Illumina Bead Studio)
Conversion to bed bim fam (Custom, PLINK)

II. Sample QC

Sex Check (PLINK)
Missingness Outliers (PLINK)
Heterozygosity Rate Outliers (PLINK)
1. Calculate observed heterozygosity per individual
2. Plot Missingness on X axis, Heterozygosity on Y. Decide reasonable thresholds for exclusion
Relatedness Checks
1. Prune out high LD regions (e.g., HLA)
2. Prune down to 50,000 high quality, LD-independent SNPs
3. Check for IBD > 0.185, visualize (PLINK, R (turner))
4. Mark or exclude
Ancestry Checks (PLINK, smartPCA, R scripts)
1. Extract SNPs featured in the HapMap 3 Rel. 2 four ancestral populations
2. Merge with HapMap data, flipping HapMap strand
3. PCA on merged file
4. Plot PC loadings
5. Determine all PCs having significant correlation to ancestry (R)
6. Exclude ancestry outliers (R)
Per Chip comparisons on the first four checks above: sex, missingness, heterozygosity, relatedness (Custom)
Exclude or mark all sample outliers
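
To make the missingness, heterozygosity and relatedness steps concrete, here is a minimal Python sketch. It assumes you have already pulled per-sample missing rates, non-missing genotype counts and observed homozygote counts out of your PLINK reports, plus pairwise IBD estimates. The function names, the 3% missingness cutoff and the 3 SD heterozygosity rule are illustrative assumptions, not fixed recommendations; only the 0.185 IBD cutoff comes from the outline above. Pick the real thresholds from your own plots.

```python
from statistics import mean, stdev


def heterozygosity_rate(n_nonmissing, n_hom_observed):
    """Fraction of non-missing genotypes that are heterozygous."""
    return (n_nonmissing - n_hom_observed) / n_nonmissing


def flag_samples(samples, max_missing=0.03, n_sd=3.0):
    """Flag samples on missingness and heterozygosity.

    samples: dict of sample_id -> (missing_rate, n_nonmissing, n_hom_observed)
    max_missing and n_sd are assumed example thresholds.
    Returns a dict of sample_id -> list of reasons.
    """
    het = {sid: heterozygosity_rate(n, hom) for sid, (_, n, hom) in samples.items()}
    rates = list(het.values())
    mu, sd = mean(rates), stdev(rates)
    flagged = {}
    for sid, (missing_rate, _, _) in samples.items():
        reasons = []
        if missing_rate > max_missing:
            reasons.append("missingness")
        if abs(het[sid] - mu) > n_sd * sd:
            reasons.append("heterozygosity")
        if reasons:
            flagged[sid] = reasons
    return flagged


def related_pairs(pairs, pi_hat_cutoff=0.185):
    """Return sample pairs whose estimated IBD proportion exceeds the cutoff.

    pairs: iterable of (sample_a, sample_b, pi_hat)
    """
    return [(a, b) for a, b, pi_hat in pairs if pi_hat > pi_hat_cutoff]
```

With two arrays in play, it is worth running this once per chip and once on the combined set, since a chip-specific shift in heterozygosity will distort a pooled mean and SD.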

III. Marker QC

Excessive Missingness (PLINK)
1. Select threshold based on visual inspection of histogram
HWE (PLINK)
1. If a higher threshold is chosen, manually inspect cluster plot
Differential Missingness Check (PLINK)
1. Informative Missingness - CNV
2. Consecutive Missingness in a stretch
Low MAF (PLINK)
Internal Sample Reproducibility (Between Chips) (PLINK)
External Sample Reproducibility (HapMap Concordance) (PLINK)
Per Chip Call Rate, AF, GF, comparisons on the first four checks above: missingness, HWE, differential missingness, MAF (Custom)
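
As a rough illustration of the per-marker checks, the sketch below computes call rate, MAF and a Hardy-Weinberg p-value from genotype counts for one SNP. Note that it uses the simple 1-df chi-square approximation rather than an exact test, and the function name and the default thresholds (95% call rate, 1% MAF, HWE p of 0.0001) are assumptions for the example only.

```python
import math


def marker_stats(n_aa, n_ab, n_bb, n_missing):
    """Return (call_rate, maf, hwe_p) for one SNP from genotype counts.

    hwe_p comes from a 1-df chi-square goodness-of-fit test, which is an
    approximation and is unreliable for rare variants.
    """
    n_called = n_aa + n_ab + n_bb
    call_rate = n_called / (n_called + n_missing)
    p = (2 * n_aa + n_ab) / (2 * n_called)  # frequency of allele A
    q = 1.0 - p
    maf = min(p, q)
    observed = (n_aa, n_ab, n_bb)
    expected = (p * p * n_called, 2 * p * q * n_called, q * q * n_called)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square with 1 degree of freedom
    hwe_p = math.erfc(math.sqrt(chi2 / 2))
    return call_rate, maf, hwe_p


def passes_marker_qc(n_aa, n_ab, n_bb, n_missing,
                     min_call_rate=0.95, min_maf=0.01, min_hwe_p=1e-4):
    """Apply example thresholds (assumed values, tune from your histograms)."""
    call_rate, maf, hwe_p = marker_stats(n_aa, n_ab, n_bb, n_missing)
    return call_rate >= min_call_rate and maf >= min_maf and hwe_p >= min_hwe_p
```

Running the same function separately on the samples from each chip gives you the per-chip call rate and allele frequency comparisons listed in the last item.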

IV. Batch Effects

Average MAF (PLINK, Custom)
Average call rates (PLINK, Custom)
Association Testing by plate (remove MAF <5%) (Custom, PLINK)
Correction via population stratification techniques if necessary

V. Dataset Merging and Harmonization

Sample Checks
1. Must perform same checks as before on merged set.
2. Results should confirm previous relationships, find new related pairs.
HWE - after merging, high number of SNPs out of HWE due to differences in ancestry.
1. Need to stratify by ethnicity, then look for HWE outliers p < 0.0001.
Population Stratification
1. Use AIMs from Dumitrescu 2010
Marker Checks
1. After filtering at a 95% call rate within each single study, apply a second check at 99% overall.
Batch Effects
1. Test independence of AF with plate membership, and compare the distribution of chi-square statistics to the null distribution.
Merging
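
For the two-array situation in the question, the batch-effect test amounts to checking, per SNP, whether allele frequency depends on which chip a sample was genotyped on. Below is a minimal sketch of that check as a 2x2 chi-square test on allele counts. The function name and argument layout are made up for the example, and it only handles two batches; more plates would need a test with more degrees of freedom.

```python
import math


def allele_freq_by_chip_test(a1_chip1, a2_chip1, a1_chip2, a2_chip2):
    """Chi-square test of independence between allele and chip membership.

    Arguments are allele counts (not genotype counts) for alleles 1 and 2
    on each chip. Returns (chi2, p_value) with 1 degree of freedom.
    """
    table = [[a1_chip1, a2_chip1], [a1_chip2, a2_chip2]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of a chi-square with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

If cases and controls are unevenly split across the two chips, SNPs that fail this test are prime candidates for spurious association, so it is usually safer to drop them than to try to correct for them.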

VI. Integrated imputation, phasing, and strand flipping

Genotype Harmonizer
1. Across Study-Side Hapmap sample Concordance (GH)
2. Inspect original source file designation (GH)
3. MAF comparisons (GH)
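
Strand is where merging two different arrays most often goes wrong, so here is a small sketch of the logic involved: given the allele pair for a SNP in a reference and in your study, decide whether the study alleles match, need swapping, need strand flipping, or cannot be resolved from alleles alone. This is a hypothetical helper written for illustration and not how Genotype Harmonizer is implemented; in particular it just marks A/T and C/G SNPs as ambiguous instead of trying to resolve them.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(a1, a2):
    """A/T and C/G SNPs look identical on both strands."""
    return COMPLEMENT.get(a1) == a2


def harmonize(ref_alleles, study_alleles):
    """Classify how study alleles relate to reference alleles for one SNP.

    Both arguments are (allele1, allele2) pairs of single bases.
    Returns one of: "match", "swap", "flip", "flip_swap",
    "ambiguous", "mismatch", "unsupported".
    """
    ref = tuple(ref_alleles)
    study = tuple(study_alleles)
    if any(a not in COMPLEMENT for a in ref + study):
        return "unsupported"  # indels, missing codes, etc.
    if is_strand_ambiguous(*ref):
        return "ambiguous"
    flipped = tuple(COMPLEMENT[a] for a in study)
    if study == ref:
        return "match"
    if study[::-1] == ref:
        return "swap"  # same strand, allele order reversed
    if flipped == ref:
        return "flip"  # opposite strand
    if flipped[::-1] == ref:
        return "flip_swap"  # opposite strand and reversed order
    return "mismatch"
```

SNPs that come back as "ambiguous" or "mismatch" are the ones to check with the MAF comparison step, or simply to exclude before merging.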

VII. Association Testing

Post QC PCA
Decide between Logistic Regression and Mixed Modelling
1. Degree of Relatedness

VIII. Evaluation of QC Quality after Association Analysis

Calculation of Lambda
Examination of Intensity Plots
Replicate SNPs of interest on a DIFFERENT Technology
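
The lambda calculation can be sketched in a few lines of Python, assuming you have the association p-values from 1-df tests in a list. Each p-value is converted back to a chi-square statistic, and the median is divided by the expected median of a 1-df chi-square under the null (about 0.4549). The function name is an assumption for the example.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(p_values):
    """Genomic inflation factor (lambda) from 1-df association p-values."""
    nd = NormalDist()
    # For a 1-df test, chi2 = z^2 where z is the normal quantile at p/2
    chi2 = [nd.inv_cdf(p / 2) ** 2 for p in p_values if 0 < p <= 1]
    return median(chi2) / CHI2_1DF_MEDIAN
```

A value close to 1 suggests the test statistics follow the null distribution overall; a value clearly above 1 after merging the two arrays is a hint that stratification or a chip effect is still in the data.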