I have used PLINK to produce an MDS plot from human GWAS data with the following commands (pruning SNPs in LD first, as recommended):
plink --bfile pFiles/geno --indep-pairwise 50 10 0.2 --out prune1
plink --bfile pFiles/geno --extract prune1.prune.in --genome --out ibs1
plink --bfile pFiles/geno --read-genome ibs1.genome --cluster cc --ppc 1e-3 --mds-plot 2 --out strat1
I then opened the mds file (strat1.mds) in R and plotted C1 vs. C2 - it seems clear that there are some outlier samples. Plot looks something like this:
Am I justified in removing these outliers from further analysis purely on the basis of this plot, essentially just saying "we took the clump in the middle for further analysis"?
Or should I use something more objective (e.g. PPC) to exclude samples that look a bit out there from downstream analysis?
So this would be a valid approach to removing outliers from a GWAS? i.e. calculate the mean and SD for C1 and remove 'outliers', then repeat for C2? Is there a reference? I ask because I can't find anyone explicitly using such stats - I've just seen people sort of eyeball the figure and remove the ones that look funny, e.g. http://onlinelibrary.wiley.com/doi/10.1002/ejp.560/full
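For concreteness, the mean/SD idea above could be sketched roughly like this in Python. This is only an illustration, not an established method: the 3 SD cutoff is an assumption, the function names `parse_mds` and `flag_outliers` are made up for the example, and it flags samples on C1 and C2 in a single pass over all samples rather than removing C1 outliers first and then recomputing for C2. It assumes the usual whitespace-delimited `.mds` layout with a header line of `FID IID SOL C1 C2`.

```python
import statistics


def parse_mds(text):
    """Parse whitespace-delimited PLINK .mds text into a list of dicts.

    Assumes the first non-empty line is the header (FID IID SOL C1 C2 ...).
    """
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def flag_outliers(samples, components=("C1", "C2"), n_sd=3.0):
    """Return the set of (FID, IID) pairs lying more than n_sd standard
    deviations from the mean on any of the given MDS components.

    The n_sd=3.0 default is an arbitrary choice for illustration.
    """
    flagged = set()
    for comp in components:
        values = [float(s[comp]) for s in samples]
        mean = statistics.mean(values)
        sd = statistics.stdev(values)  # sample SD; needs at least 2 samples
        if sd == 0:
            continue  # no spread on this component, nothing to flag
        for s, v in zip(samples, values):
            if abs(v - mean) > n_sd * sd:
                flagged.add((s["FID"], s["IID"]))
    return flagged
```

To use it, read `strat1.mds` into a string, pass it through `parse_mds` and `flag_outliers`, and write the flagged FID/IID pairs one per line to a file that can be given to `plink --remove`. Note that a few extreme samples inflate the SD, so a single pass can miss milder outliers; it is still worth checking the result against the plot.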
"All subjects with an identity-by-state (IBS) genetic distance from the sample mean of more than 3 standard deviations were considered outliers with respect to genetic ancestry and were pruned from the sample. This was also confirmed through visual inspection of Multidimensional scaling (MDS) plots."
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3213201/
Just checking: is this human data?
Yes - human data and passed all basic QC, call rates high in all samples etc. so I don't think it's some dodgy batch effect - certainly the outliers aren't all on a specific plate or a specific phenotype/gender etc.
The article looks good, thanks. It appears they use IBS to remove outliers and visual inspection of the MDS plot only to confirm, rather than any specific calculation based on MDS values, although I'm not 100% sure how they calculated a per-sample mean IBS given that IBS is a pairwise measure.
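One plausible reading, and this is only a guess since the article does not spell it out, is that each sample gets the average of its IBS values against every other sample, and samples whose average lies more than 3 SD from the mean of those averages are dropped. A minimal Python sketch of that interpretation, using the `DST` column of the `.genome` file (function names are hypothetical):

```python
import statistics
from collections import defaultdict


def mean_ibs_per_sample(genome_text):
    """Average the pairwise DST value over all pairs each sample appears in.

    Expects whitespace-delimited PLINK .genome text with a header line that
    includes FID1, IID1, FID2, IID2 and DST.
    """
    lines = [ln.split() for ln in genome_text.strip().splitlines() if ln.strip()]
    col = {name: i for i, name in enumerate(lines[0])}
    per_sample = defaultdict(list)
    for row in lines[1:]:
        dst = float(row[col["DST"]])
        # Each pair contributes to the average of both of its members.
        per_sample[(row[col["FID1"]], row[col["IID1"]])].append(dst)
        per_sample[(row[col["FID2"]], row[col["IID2"]])].append(dst)
    return {sample: statistics.mean(v) for sample, v in per_sample.items()}


def ibs_outliers(per_sample, n_sd=3.0):
    """Return samples whose mean DST is more than n_sd SDs from the
    mean of all per-sample means (n_sd=3.0 follows the quoted threshold)."""
    values = list(per_sample.values())
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return set()
    return {s for s, v in per_sample.items() if abs(v - mean) > n_sd * sd}
```

Here `ibs1.genome` would be read into a string and passed to `mean_ibs_per_sample`. Despite being labelled a distance, PLINK's `DST` is a similarity-style proportion, so samples with an unusually low mean are the ones least like the rest of the cohort.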