Question

Problems with MAP/PED files manipulation

10.3 years ago

Cindy Chan ▴ 20

Hi all,

I'm very new to GWAS. I have been given a set of PLINK files covering the gene coordinates I'm interested in, but the data need cleaning up first. The points I'm currently confused about are:

The files seemed to have originated from phased data. Each ID is repeated twice, e.g. GSM0001_A, GSM0001_B. Does that mean they are the same samples but different chromosome? I don't know what to do with it. Or is it better if I start with unphased data?
The samples only have an IID (repeated twice in the PED files). There is no FID, maternal or paternal ID, or sex. I have also been given a separate Excel file that contains the sex of each sample. Can I include the sex in my PED files (refer to the question above for some info)?
The MAP files don't come with SNP identifiers, only the location (e.g. 2:112774105). Is it possible for me to include rs# in the MAP file? I would like to check if the existing SNPs on the genes I'm interested in are also found in the dataset. What can I do if I couldn't include the rs# in MAP file?
What other file formats are used in GWAS that would give me more control over what I want to analyse in future? I'm very confused right now and feel constrained in what I can do...
I need to clean up my data before I could run any analysis, as there are duplicated samples, samples which are closely related and samples without geographical information. All I have with me, besides the PLINK files, is an Excel spreadsheet which contains the information for all the samples. What would you suggest I do?

Any form of advice will be greatly appreciated.

Thanks!

GWAS SNP PED MAP PLINK • 7.4k views


Ram · Answer 1 · 2014-08-26

PLINK can do tons of things, so it's no wonder that you're feeling a bit lost. I remember being completely overwhelmed when I started out with it.

The files seemed to have originated from phased data. Each ID is repeated twice, e.g. GSM0001_A, GSM0001_B. Does that mean they are the same samples but different chromosome? I don't know what to do with it. Or is it better if I start with unphased data?

Not sure if I understand correctly. PED files need a family ID and an individual ID; in the absence of a family ID, most people just repeat the individual ID twice. Maybe this is what you see?

The samples only have an IID (repeated twice in the PED files). There is no FID, maternal or paternal ID, or sex. I have also been given a separate Excel file that contains the sex of each sample. Can I include the sex in my PED files (refer to the question above for some info)?

Yes, definitely! So it looks like this right now?

GSM0001_A GSM0001_B  A A  G G  A C .....

In that case, I'd add the four missing columns to every row, right after the two IDs: 0 for the paternal ID, 0 for the maternal ID, the sex code (1 is male, 2 is female, everything else is unknown), and -9 (missing) for the single phenotype. I always keep the actual phenotypes in an additional file, laid out like a normal spreadsheet. For a male sample the row becomes:

GSM0001_A GSM0001_B 0 0 1 -9 A A  G G  A C .....
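If there are too many rows to edit by hand, the same change can be scripted. Here is a minimal Python sketch, assuming each PED row currently holds only the two ID columns followed by the genotypes, and that the sexes from the Excel sheet have been loaded into a dict keyed by the second ID column. The function name and the dict are made up for illustration.

```python
def add_pedigree_columns(ped_line, sex_by_iid):
    """Insert paternal ID, maternal ID, sex and phenotype after the two IDs.

    Assumes the input row is: FID IID allele allele allele allele ...
    sex_by_iid is a hypothetical lookup (e.g. built from the Excel sheet)
    mapping IID -> 1 (male) or 2 (female).
    """
    fields = ped_line.split()
    fid, iid, alleles = fields[0], fields[1], fields[2:]
    sex = sex_by_iid.get(iid, 0)  # 0 = unknown sex
    # 0 0 = no paternal/maternal ID, -9 = missing phenotype
    return " ".join([fid, iid, "0", "0", str(sex), "-9"] + alleles)


row = "GSM0001_A GSM0001_B A A G G A C"
print(add_pedigree_columns(row, {"GSM0001_B": 1}))
```

Samples that are missing from the lookup get sex 0 (unknown) rather than raising an error, so it is worth checking afterwards how many of those there are.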

That way, you can correct for sex when you run a regression by adding the --sex flag. For example, to run a logistic regression, corrected for sex, on every phenotype in a file called 'your_pheno_file.txt' (PLINK expects this file to be whitespace-delimited, with FID and IID as the first two columns, so a CSV export needs converting first):

plink --file your_files --pheno your_pheno_file.txt --sex --logistic --adjust --out your_results --all-pheno
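PLINK reads phenotype files as whitespace-delimited text with FID and IID in the first two columns, so a comma-separated export from Excel has to be converted before it is passed to --pheno. A rough Python sketch of that conversion follows, assuming the spreadsheet already has FID and IID as its first two columns; the function name and the example column are hypothetical.

```python
import csv
import io


def csv_to_plink_pheno(csv_text, missing="-9"):
    """Turn comma-separated text into tab-delimited PLINK phenotype text.

    Assumes the first two columns are already FID and IID.
    Empty cells are replaced by the missing-phenotype code.
    """
    out_lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:  # skip blank lines
            continue
        out_lines.append("\t".join(cell.strip() or missing for cell in row))
    return "\n".join(out_lines) + "\n"


example = "FID,IID,height\nS1,S1,1.7\nS2,S2,\n"
print(csv_to_plink_pheno(example), end="")
```

Values that themselves contain spaces would still break the result, so those need cleaning up in the spreadsheet first.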

The MAP files don't come with SNP identifiers, only the location (e.g. 2:112774105). Is it possible for me to include rs# in the MAP file? I would like to check if the existing SNPs on the genes I'm interested in are also found in the dataset. What can I do if I couldn't include the rs# in MAP file?

That happened to me just recently. I used KAVIAR to get the rs# for all my SNPs: just copy the chromosome and position columns from Excel and paste them in here: http://db.systemsbiology.net/kaviar/cgi-pub/Kaviar.pl

Then you can use Excel or a small script to insert your rsids. Keep in mind that not all SNPs have rsids!
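As an illustration of the "small script" route, here is a Python sketch that swaps the position-style identifier in a MAP line for an rsid when one is known. It assumes the standard four-column MAP layout (chromosome, SNP ID, genetic distance, base-pair position) and a lookup dict built from whatever the annotation tool returned. The function name, the dict and the rsid in the example are placeholders, not real annotations.

```python
def annotate_map_line(map_line, rsid_by_pos):
    """Replace the SNP ID in a 4-column MAP line with an rsid if one is known.

    rsid_by_pos is a hypothetical lookup: (chromosome, position) -> rsid,
    with both key parts stored as strings.
    SNPs without an rsid keep their original identifier.
    """
    chrom, snp_id, genetic_dist, bp = map_line.split()
    new_id = rsid_by_pos.get((chrom, bp), snp_id)
    return "\t".join([chrom, new_id, genetic_dist, bp])


lookup = {("2", "112774105"): "rsEXAMPLE1"}  # placeholder rsid
print(annotate_map_line("2 2:112774105 0 112774105", lookup))
print(annotate_map_line("2 2:112774200 0 112774200", lookup))  # no rsid: unchanged ID
```

Only the SNP ID column changes, so the line order stays in step with the genotype columns of the PED file.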

I need to clean up my data before I could run any analysis, as there are duplicated samples, samples which are closely related and samples without geographical information. All I have with me, besides the PLINK files, is an Excel spreadsheet which contains the information for all the samples. What would you suggest I do?

If you have many SNPs, filter them on minor allele frequency (MAF) and Hardy-Weinberg equilibrium (HWE), at the very least. Note that --hwe needs a p-value threshold; the 0.001 used here is only an example value.

plink --file your_files --maf 0.05 --hwe 0.001 --recode --out your_cleaned_files

To also remove nearly empty individuals:

plink --file your_files --maf 0.05 --hwe 0.001 --mind 0.8 --recode --out your_cleaned_files

This removes individuals with more than 80% missing genotypes.
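To get a feel for what the missingness filter measures, the per-individual missing rate can be computed by hand from a PED row. This is only a sketch of the idea, not PLINK's own implementation. It assumes a six-column header (FID, IID, paternal ID, maternal ID, sex, phenotype) and the default missing allele code 0; the function name is made up.

```python
def missing_rate(ped_line, missing_code="0"):
    """Fraction of genotypes in a PED row that contain a missing allele.

    Assumes six leading columns followed by two alleles per SNP.
    """
    alleles = ped_line.split()[6:]
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    if not genotypes:
        return 0.0
    n_missing = sum(1 for a, b in genotypes if missing_code in (a, b))
    return n_missing / len(genotypes)


row = "F1 I1 0 0 1 -9 A A 0 0 G G 0 0"
print(missing_rate(row))  # 2 of the 4 genotypes are missing
```

An individual would be dropped by --mind 0.8 only if this fraction were above 0.8, so the example row would be kept.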

If you have population stratification, you can use PLINK's own IBS (identity-by-state) clustering to correct for it:

plink --file your_cleaned_files --cluster --ppc 0.01

This will create 4 cluster files. Check them manually to see whether the clusters match what you expect. Then, to use these clusters in a stratified (Cochran-Mantel-Haenszel) association test:

plink --file your_cleaned_files --mh --within plink.cluster2

I'm currently unsure whether it was cluster2 or cluster3 - just run it and have a look at the log, it should say 'X individuals assigned to Y clusters', where X and Y make sense.
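A quick way to do that sanity check is to tally how many individuals fall into each cluster. The Python sketch below assumes the file passed to --within has one individual per line in the form FID, IID, cluster label; check this against your own output before relying on it. The function name is hypothetical.

```python
from collections import Counter


def cluster_sizes(cluster_lines):
    """Count individuals per cluster from lines of the form: FID IID cluster."""
    sizes = Counter()
    for line in cluster_lines:
        parts = line.split()
        if len(parts) >= 3:  # ignore blank or malformed lines
            sizes[parts[2]] += 1
    return dict(sizes)


example = ["F1 I1 0", "F2 I2 0", "F3 I3 1"]
print(cluster_sizes(example))
```

In practice the lines would come from something like open("plink.cluster2"). A large number of clusters that each contain a single individual is usually a sign that the wrong file or the wrong settings were used.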

You can also use STRUCTURE or EIGENSTRAT to correct for population stratification. I personally prefer the latter because the pictures are prettier :) EIGENSTRAT also takes your PED files. You can feed the resulting principal components into PLINK as covariates; have a look here: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#covar

You can also play with GAPIT or TASSEL, which run analyses similar to PLINK, but are a bit easier to use.

There may be typos in the commands above; I haven't tested them just now.