PLINK can do tons of things, so it's no wonder that you're feeling a bit lost. I remember being completely overwhelmed when I started out with it.
The files seem to have originated from phased data. Each ID appears twice, e.g. GSM0001_A, GSM0001_B. Does that mean they are the same sample but different chromosomes? I don't know what to do with it. Or is it better if I start with unphased data?
Not sure if I understand correctly - PED files need a family ID and an individual ID. In the absence of a family ID, most people just repeat the individual ID twice. Maybe this is what you see?
The samples only have an IID (repeated twice in the PED files). There is no FID, maternal or paternal ID, or sex. I am also given a separate Excel file that contains the sex of each sample. Can I include the sex in my PED files (see the question above for some info)?
Yes, definitely! So it looks like this right now?
GSM0001_A GSM0001_B A A G G A C .....
In that case, I'd add sex to all of them and add 0s for maternal and paternal ID, here with 1 (male) for sex (2 is female, everything else is unknown) and -9 (unknown) for the one phenotype. I always give the phenotypes in an additional file as a normal spreadsheet.
GSM0001_A GSM0001_B 0 0 1 -9 A A G G A C .....
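That way, you can correct for sex when you run a regression using the --sex flag. For example, to run a logistic regression, correcting for sex, on all of the phenotypes in a file called 'your_pheno_file.csv' (whatever the extension, PLINK expects a whitespace-delimited text file whose first two columns are the family ID and individual ID):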
plink --file your_files --pheno your_pheno_file.csv --sex --logistic --adjust --out your_results --all-pheno
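The PED edit described above (0s for the parents, the sex code, and -9 for the phenotype, inserted after the two ID columns) can be scripted rather than done by hand. Here is a minimal Python sketch, assuming the PED lines currently hold only the two IDs followed by the genotypes, and that the Excel sheet has been exported into a dict keyed by the individual ID with 'M' or 'F' as values. The function name and the M/F coding are my own assumptions, not something PLINK prescribes.

```python
# Hypothetical helper: insert paternal ID, maternal ID, sex and phenotype
# after the two ID columns of a PED line that lacks them.
SEX_CODES = {"M": "1", "F": "2"}  # PLINK: 1 = male, 2 = female, anything else = unknown


def add_ped_columns(line, sex_by_id):
    fields = line.split()
    fid, iid, genotypes = fields[0], fields[1], fields[2:]
    # Assumption: the spreadsheet is keyed by the individual ID (second column).
    sex = SEX_CODES.get(sex_by_id.get(iid, ""), "0")
    # 0 0 = unknown parents, -9 = missing phenotype
    return " ".join([fid, iid, "0", "0", sex, "-9"] + genotypes)


if __name__ == "__main__":
    sex_by_id = {"GSM0001_B": "M"}  # made-up example entry
    print(add_ped_columns("GSM0001_A GSM0001_B A A G G A C", sex_by_id))
```

Samples that are missing from the spreadsheet get sex 0 (unknown), so nothing is guessed silently.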
The MAP files don't come with SNP identifiers, only the location (e.g. 2:112774105). Is it possible for me to include rs# in the MAP file? I would like to check if the existing SNPs on the genes I'm interested in are also found in the dataset. What can I do if I couldn't include the rs# in MAP file?
That happened to me just recently. I used KAVIAR to get the rs# for all my SNPs: just copy and paste the chromosome and position from Excel into here: http://db.systemsbiology.net/kaviar/cgi-pub/Kaviar.pl
Then you can use Excel or a small script to insert your rsids. Keep in mind that not all SNPs have rsids!
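As a rough idea of what such a small script could look like, here is a Python sketch. It assumes a standard four-column MAP file (chromosome, SNP identifier, genetic distance, base-pair position) and that the rsIDs you looked up have been loaded into a dict keyed by (chromosome, position). The function name and the rsID in the example are placeholders, not real annotations.

```python
# Hypothetical helper: replace the SNP identifier (column 2 of a MAP line)
# with an rsID when a lookup table has one for that chromosome and position.
def annotate_map_line(line, rsid_by_position):
    chrom, snp_id, cm, bp = line.split()
    # Keep the original identifier when no rsID is known for this position.
    new_id = rsid_by_position.get((chrom, int(bp)), snp_id)
    return "\t".join([chrom, new_id, cm, bp])


if __name__ == "__main__":
    lookup = {("2", 112774105): "rs0000000"}  # placeholder rsID for illustration only
    print(annotate_map_line("2 2:112774105 0 112774105", lookup))
    print(annotate_map_line("2 2:112774200 0 112774200", lookup))
```

SNPs without an rsID keep their chr:position name, so the MAP file still has one unique identifier per line.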
I need to clean up my data before I could run any analysis, as there are duplicated samples, samples which are closely related and samples without geographical information. All I have with me, besides the PLINK files, is an Excel spreadsheet which contains the information for all the samples. What would you suggest I do?
If you have many SNPs, clean them using minor allele frequency and Hardy-Weinberg equilibrium (HWE), at the very least. Note that --hwe takes a p-value threshold; the 0.001 here is only an example value:
plink --file your_files --maf 0.05 --hwe 0.001 --recode --out your_cleaned_files
To also remove empty individuals:
plink --file your_files --maf 0.05 --hwe 0.001 --mind 0.8 --recode --out your_cleaned_files
This will remove individuals with more than 80% missing genotypes.
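None of these filters drops specific samples, such as the duplicates or the ones without geographical information. PLINK's --remove option does that: it reads a text file with one family ID and individual ID per line. Below is a minimal Python sketch for building such a list, assuming the Excel sheet has been exported to rows with the hypothetical keys 'fid', 'iid' and 'region' (your column names will differ).

```python
# Hypothetical helper: list samples lacking geographic information in the
# two-column "FID IID" layout that PLINK's --remove option reads.
def build_remove_list(samples):
    lines = []
    for sample in samples:
        region = (sample.get("region") or "").strip()
        if not region:
            lines.append(f"{sample['fid']} {sample['iid']}")
    return lines


if __name__ == "__main__":
    samples = [  # made-up rows standing in for the Excel sheet
        {"fid": "GSM0001_A", "iid": "GSM0001_B", "region": "Europe"},
        {"fid": "GSM0002_A", "iid": "GSM0002_B", "region": ""},
    ]
    for line in build_remove_list(samples):
        print(line)
```

Save the output to a file, e.g. remove_list.txt, and add --remove remove_list.txt to the plink command. The IDs must match the first two columns of the PED file exactly.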
If you have population stratification, you can use PLINK's own IBS clustering to correct for that:
plink --file your_cleaned_files --cluster --ppc 0.01
This will create 4 files with clusters. Check them manually to see whether they conform to what you expect. Then to use these clusters in a different GWAS:
plink --file your_cleaned_files --mh --within plink.cluster2
I'm currently unsure whether it was cluster2 or cluster3 - just run it and have a look at the log, it should say 'X individuals assigned to Y clusters', where X and Y make sense.
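To get a quick feel for whether the clusters make sense before using them, you could tally how many individuals landed in each one. This is a minimal Python sketch, assuming the cluster file has one individual per line in the "FID IID cluster" layout that --within reads. If your file looks different, it is probably the wrong one of the four.

```python
from collections import Counter


# Assumption: each line of the cluster file reads "FID IID cluster",
# the same layout that --within expects.
def cluster_sizes(lines):
    sizes = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:  # skip blank or malformed lines
            sizes[fields[2]] += 1
    return dict(sizes)


if __name__ == "__main__":
    example = [  # made-up lines for illustration
        "GSM0001_A GSM0001_B 0",
        "GSM0002_A GSM0002_B 0",
        "GSM0003_A GSM0003_B 1",
    ]
    print(cluster_sizes(example))
```

With a real file, pass the open file handle instead of the example list, e.g. cluster_sizes(open("plink.cluster2")).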
You can also use STRUCTURE or EIGENSTRAT to correct for population stratification. I personally prefer the latter because the pictures are prettier :) EIGENSTRAT also takes your PED files. You can feed the resulting principal components into PLINK as covariates, have a look here: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#covar
You can also play with GAPIT or TASSEL, which run analyses similar to PLINK, but are a bit easier to use.
I might have typos in the above commands, as I haven't tested them just now.
Dear Philipp,
Thanks for the reply...
The PED files looked like this:
That's why I am confused...
That looks good! Here's the manual for the PED format: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped
Like I wrote above, most people just use the same ID for family and individual, since you rarely get well-defined families. The paternal and maternal IDs are set to missing, which is what most people do. Sex is set to missing too - since you have the sex in another table, you might want to fix that. The phenotype is set to 1, unaffected (2 is affected, -9 and 0 are missing). Like I wrote above, I usually set that phenotype to missing and make my own additional table of phenotypes as described here: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#pheno
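If you want to double-check how PLINK will read the first six columns of each line, a small decoder can help. Here is a minimal Python sketch using the codes listed above. The function name and the returned keys are my own choice, and it assumes a case/control phenotype (any other value is passed through unchanged).

```python
SEX_LABELS = {"1": "male", "2": "female"}  # anything else is unknown
PHENO_LABELS = {"1": "unaffected", "2": "affected", "0": "missing", "-9": "missing"}


# Hypothetical helper: spell out the six leading PED columns of one line.
def describe_ped_line(line):
    fid, iid, father, mother, sex, pheno = line.split()[:6]
    return {
        "family_id": fid,
        "individual_id": iid,
        "father": None if father == "0" else father,  # 0 = missing parent
        "mother": None if mother == "0" else mother,
        "sex": SEX_LABELS.get(sex, "unknown"),
        "phenotype": PHENO_LABELS.get(pheno, pheno),
    }


if __name__ == "__main__":
    print(describe_ped_line("GSM0001_A GSM0001_B 0 0 1 -9 A A G G A C"))
```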
Looks good to me!
Thanks! I've been reading the PLINK documentation. I just find it tough since there's no "search" function on the website.
Just wondering: I have phased data, so does that mean each of my samples is analysed twice? I'm trying to figure out how PLINK works...
thanks!
This older thread on Biostars has several good explanations of phased vs unphased data, better than I could explain: "What Are Phased And Unphased Genotypes?"