To do the population comparison between cohort of patient and 1KG. So, I have converted their VCF file to PED format;
vcftools --gzvcf 1000g/1000g_myvariants.vcf.gz --plink --out 1000g
vcftools --vcf myvariants.vcf --plink --out myvariants
and then, I took variant with snp ID
grep -o 'rs[0-9]*' 1000g_myvariants.map > rs.snplist.raw
and sorted and removed those were duplicated
sort rs.snplist.raw | uniq > rs.snplist.dedup
then I removed those were not matched allele codes
plink --file myvariants --extract rs.snplist.dedup --exclude all.missnp --recode --out myvariants.subset
plink --file 1000g_myvariants --extract rs.snplist.dedup --exclude all.missnp --recode --out 1000g_myvariants.subset
and finally I merged them
plink --file 1000g_myvariants.subset --merge myvariants.subset.ped myvariants.subset.map --recode --out all
and I created MDS plot
plink --file all --read-genome all.genome --cluster --mds-plot 2 --out all_mds_2
and plotted component 2 versus component 1
tab = read.table("plink.mds", h = T)
tab$pop = factor(c(rep("1KG", 1212), rep("mycohort", 285)))
plot(tab$C1, tab$C2, col=as.integer(tab$pop),xlab="eigenvector 2", ylab="eigenvector 1")
and here is how the result look like,
basically, there is a shift which I am curious what could be the reason ? do I have to filter more SNP to get the right match ? is there any other tools to run PCA rather than PLINK?
Is the 1000 genome variants some how normalized while the other cohort is not ?
Are the black dots individuals from 1000genomes? Which dataset are you using, exactly? Check if the separate groups are due to different sequencing technology.
yes, there are 1212 individuals in 1KG which are represented by black in the plot. mm, the 1KG sequencing has been done both with Illumina and ABI sequencing; I feel I should have normalize these two cohort separately somehow before hand.
Have you tried doing factor analysis to see which SNPs are underlying this? You could also just look at the rotated data (if you were to do the PCA with the
prcomp()
function in R, this would beoutput$x
). That's the next thing I would try.