Is it possible to run a GWAS analysis where half of the subjects have GWAS data from an Illumina OmniExpress array and the other half have GWAS data from an Illumina 660W-Quad array?
What steps are required to combine these two groups into a single, complete analysis?
As you read these papers (there are a couple dozen that will help you), start taking notes on what they recommend. For instance, you will want to do QC by variant, by sample (individual person), by batch or plate, and by chip. Take notes on each of those.
Once you have a command of the literature, construct something like this:
I. Initial processing of new data
  a. Genotype Calling (Illuminus)
  b. X and Y probe intensity, Structural Variation (Illumina BeadStudio)
  c. Conversion to bed/bim/fam (Custom, PLINK)
II. Sample QC
  a. Sex Check (PLINK)
  b. Missingness Outliers (PLINK)
  c. Heterozygosity Rate Outliers (PLINK)
    1. Calculate observed heterozygosity per individual
    2. Plot missingness on the X axis, heterozygosity on the Y axis. Decide reasonable thresholds for exclusion
  d. Relatedness Checks
    1. Prune out high-LD regions (e.g., HLA)
    2. Prune down to 50,000 high-quality, LD-independent SNPs
    3. Check for IBD proportion (PI_HAT) > 0.185, visualize (PLINK, R (turner))
    4. Mark or exclude
  e. Ancestry Checks (PLINK, smartPCA, R scripts)
    1. Remove SNPs not featured in the four ancestral populations of HapMap 3 Rel. 2
    2. Merge with HapMap data, flipping HapMap strand
    3. PCA on merged file
    4. Plot PC loadings
    5. Determine all PCs having significant correlation to ancestry (R)
    6. Exclude ancestry outliers (R)
  f. Per-chip comparisons on a.-d. (Custom)
  g. Exclude or mark all sample outliers
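
As a rough illustration of the missingness, heterozygosity, relatedness and per-chip steps of the sample QC, here is a minimal Python sketch. It is not part of any published pipeline. The function names, the 3% missingness cutoff and the 3 SD heterozygosity cutoff are all assumptions made for the example (you should pick your own thresholds from the plot), and only the 0.185 relatedness cutoff comes from the outline. The inputs are plain Python values that you would parse yourself from, for example, PLINK's --missing, --het and --genome output.

```python
from statistics import mean, stdev


def het_rate(n_nonmissing, n_hom_observed):
    """Observed heterozygosity: fraction of non-missing genotypes that are heterozygous."""
    return (n_nonmissing - n_hom_observed) / n_nonmissing


def flag_sample_outliers(samples, max_missing=0.03, n_sd=3.0):
    """samples: dict sample_id -> (missing_rate, het_rate); needs at least two samples.

    Returns dict sample_id -> list of reasons. Both thresholds are
    illustrative assumptions, not recommended values.
    """
    rates = [het for _, het in samples.values()]
    mu, sd = mean(rates), stdev(rates)
    flagged = {}
    for sid, (f_miss, het) in samples.items():
        reasons = []
        if f_miss > max_missing:
            reasons.append("missingness")
        if abs(het - mu) > n_sd * sd:
            reasons.append("heterozygosity")
        if reasons:
            flagged[sid] = reasons
    return flagged


def related_pairs(pairs, pi_hat_cutoff=0.185):
    """pairs: iterable of (id1, id2, pi_hat). Keeps pairs above the cutoff."""
    return [(a, b, p) for a, b, p in pairs if p > pi_hat_cutoff]


def summarize_by_chip(samples, chip_of):
    """Mean missingness and mean heterozygosity per chip.

    chip_of: dict sample_id -> chip label (labels are whatever you choose).
    """
    by_chip = {}
    for sid, (f_miss, het) in samples.items():
        by_chip.setdefault(chip_of[sid], []).append((f_miss, het))
    return {
        chip: (mean([m for m, _ in vals]), mean([h for _, h in vals]))
        for chip, vals in by_chip.items()
    }
```

If the per-chip means differ noticeably between the two arrays, that is a hint to look at chip-specific thresholds or at the SNP content driving the difference before pooling the samples.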
III. Marker QC
  a. Excessive Missingness (PLINK)
    1. Select threshold based on visual inspection of histogram
  b. HWE (PLINK)
    1. If a higher threshold is chosen, manually inspect cluster plots
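
The marker QC steps can be sketched the same way. The example below is a hedged illustration only: it uses a simple 1-df chi-square test for Hardy-Weinberg equilibrium, which is not the exact test PLINK reports, and the function names and both thresholds (5% missingness, HWE p < 1e-6) are assumptions chosen for the example. Genotype counts and missing rates would come from your own parsing of the PLINK output.

```python
from math import erfc, sqrt


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) test of Hardy-Weinberg equilibrium from genotype counts.

    This is an approximation for illustration, not PLINK's exact test.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic marker: no departure can be tested
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2))


def flag_markers(markers, max_missing=0.05, hwe_p=1e-6):
    """markers: dict snp_id -> (missing_rate, (n_aa, n_ab, n_bb)).

    Returns dict snp_id -> list of reasons. Thresholds are illustrative
    assumptions; choose yours from the histogram.
    """
    flagged = {}
    for snp, (f_miss, counts) in markers.items():
        reasons = []
        if f_miss > max_missing:
            reasons.append("missingness")
        if hwe_chi2_pvalue(*counts) < hwe_p:
            reasons.append("hwe")
        if reasons:
            flagged[snp] = reasons
    return flagged
```

Markers flagged for HWE are the ones whose cluster plots are worth inspecting by hand before deciding whether to drop them.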