Hi all,
I am working with ancient DNA and am currently running a PCA analysis on my data. Everything seemed fine until I ran a positive control with one low-coverage sample: it was placed somewhere completely inconsistent with common sense and with its known origin.
Digging around, I learned that when working with low-coverage data with lots of missing SNPs, I am supposed to take every heterozygous site in my PED file (PLINK format, across all individuals), randomly select one of the two alleles, and make the site homozygous.
I see this as the main step I skipped, and a possible explanation for what I am seeing.
Has anyone already tackled this problem, or am I left with writing my own tool for this?
Kind regards,
me
Yeah, I wrote my own thing; I'll look for it and send it to you.
2016-12-09 3:41 GMT+01:00 zf1992lss on Biostar mailer@biostars.org:
So this code works on PED files where the alleles are coded numerically (0 = missing). It mainly converts heterozygotes to random homozygotes. This doesn't really affect the PCA results all that much, but for the formal statistics it is a desirable preprocessing step.
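The script mentioned above isn't attached to the thread, so here is a minimal sketch of the het-to-random-homozygote step. It assumes a PED file with alleles recoded to 1/2 (e.g. via `plink --recode 12`, where 0 means missing); the function name and seed parameter are my own, not from the original tool:

```python
import random

def pseudo_homozygize_ped(in_path, out_path, seed=None):
    """Replace each heterozygous genotype in a PLINK PED file with a
    randomly chosen homozygote ("pseudo-haploid" calling).

    Assumes: 6 metadata columns (FID, IID, father, mother, sex,
    phenotype) followed by allele pairs coded 1/2, with 0 = missing.
    """
    rng = random.Random(seed)
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            fields = line.split()
            meta, alleles = fields[:6], fields[6:]
            for i in range(0, len(alleles), 2):
                a, b = alleles[i], alleles[i + 1]
                # Only touch genuine heterozygotes; leave missing
                # ("0 0") and homozygous sites as they are.
                if a != b and a != "0" and b != "0":
                    chosen = rng.choice((a, b))
                    alleles[i] = alleles[i + 1] = chosen
            fout.write(" ".join(meta + alleles) + "\n")
```

Fixing `seed` makes the random draws reproducible across runs, which helps when you want to regenerate the exact same pseudo-haploid dataset later.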
When your samples end up somewhere completely nonsensical, it is most likely one of two things: either you have messed up the strandedness of the SNPs, so nothing matches any more, or the low coverage is causing something called the "shrinkage problem".

The way to work around shrinkage and get a really good PCA is to run smartpca with projection, computing the axes from your high-coverage reference samples. When selecting samples to be projected, include both the low-coverage ancient individuals (your samples) and reference samples for which you artificially reduce the number of SNPs <- this worked for me at least.

As for the missing SNPs: if they are only missing in your samples because the coverage is poor, that's fine, you just need to run smartpca and do the projection. However, if the genotyping rate for those SNPs is generally low across your reference samples too, you are better off deleting them.

And about LD pruning <- it turns out that it's better to keep more SNPs than to delete half of them just because they are in LD.
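The "artificially reduce the number of SNPs" trick for reference samples can be sketched as randomly setting a fraction of genotypes to missing. The helper below is entirely my own invention (not a smartpca or PLINK feature) and assumes genotypes are held as allele pairs in PED-style 1/2 coding:

```python
import random

def downsample_genotypes(genos, keep_frac, seed=None):
    """Mimic low coverage for a reference sample by setting a random
    subset of its genotypes to missing ("0", "0").

    genos:     list of (allele1, allele2) pairs, PED-style coding
    keep_frac: fraction of sites to keep called (0.0 - 1.0)
    """
    rng = random.Random(seed)
    out = []
    for pair in genos:
        if rng.random() < keep_frac:
            out.append(pair)        # keep the original call
        else:
            out.append(("0", "0"))  # mark the site as missing
    return out
```

Running the downsampled reference samples through the same projection as the ancient individuals lets you check how much of their displacement on the PCA is a coverage artifact rather than real ancestry.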