Hi! I have a PLINK GWAS data set. All files are in binary format (.bed, .bim, .fam).
Before I get into details, here is my question: How can I subset the subjects I am interested in using PLINK?
I want to subset a group of individuals for my analysis, but when I run the command below, PLINK removes every individual (see the log output further down). I'm also including an example of what the include-subjects file looks like:
PLINK COMMAND:
plink --bfile /path/to/plink/files --keep /path/to/includesubjects.txt --make-bed --out /path/to/subsetted/plink/files
EXAMPLE OF TEXT FILE WITH INDIVIDUALS I WANT TO INCLUDE:
The '0' in the first column is the family ID. These are unrelated subjects, so I did not assign a family ID. This matches what is in the fam file.
0 Subject_001
0 Subject_002
0 Subject_003
0 Subject_004
0 Subject_005
0 Subject_006
PLINK OUTPUT:
Writing this text to log file [ /path/to/log/file.log ]
Analysis started: Wed Apr 19 15:36:22 2017
Options in effect:
--bfile /path/to/plink/files
--keep /path/to/includesubjects.txt
--make-bed
--out /path/to/subsetted/plink/files
Reading map (extended format) from [ /path/to/plink/files.bim ]
5 markers to be included from [ /path/to/plink/files.bim ]
Reading pedigree information from [ /path/to/plink/files.fam ]
940 individuals read from [ /path/to/plink/files.fam ]
0 individuals with nonmissing phenotypes
Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
Missing phenotype value is also -9
0 cases, 0 controls and 940 missing
559 males, 381 females, and 0 of unspecified sex
Reading genotype bitfile from [ /path/to/plink/files.bed ]
Detected that binary PED file is v1.00 SNP-major mode
Reading individuals to keep [ path/to/includesubjects.txt ] ... 0 read
940 individuals removed with --keep option
Before frequency and genotyping pruning, there are 5 SNPs
0 founders and 0 non-founders found
Total genotyping rate in remaining individuals is 0
0 SNPs failed missingness test ( GENO > 1 )
0 SNPs failed frequency test ( MAF < 0 )
After frequency and genotyping pruning, there are 5 SNPs
The only explanation I can think of is that the file format (e.g. the spacing) in the keep file is different from what PLINK expects, or that PLINK cannot find the file. Try extracting a few lines from your fam file, make a new includesubjects.txt with some random samples based on them, and run the command again.
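One way to test the formatting idea is to compare the keep file against the fam file directly. The following is only a sketch on toy data: the temporary directory, the toy file contents and the name unmatched.txt are made up for illustration, and it assumes the usual suspects are Windows line endings or IDs that do not match the fam file exactly.

```shell
# Toy stand-ins for the real files (hypothetical contents).
workdir=$(mktemp -d)
printf '0 Subject_001 0 0 1 -9\n0 Subject_002 0 0 2 -9\n' > "$workdir/files.fam"
# Keep file deliberately written with Windows (CRLF) line endings,
# plus one ID that does not exist in the fam file.
printf '0 Subject_001\r\n0 Subject_999\r\n' > "$workdir/includesubjects.txt"

# Strip carriage returns, then print every keep-file row whose
# family ID / individual ID pair is absent from the fam file.
tr -d '\r' < "$workdir/includesubjects.txt" |
  awk '{ key = $1 " " $2 }
       NR == FNR { fam[key]; next }
       !(key in fam)' "$workdir/files.fam" - > "$workdir/unmatched.txt"

cat "$workdir/unmatched.txt"
```

If unmatched.txt is empty after stripping the carriage returns, the IDs themselves agree with the fam file and the problem is more likely the line endings or the path given to --keep.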
Hi Floris! Thanks for your note. I actually figured it out :) I subsetted the existing fam file to the subjects I was interested in and then used that as the input for the --keep option. Thanks for your help though!
Hi Sheila, could you please explain how you did it (with command lines)? Thanks!
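For reference, the fam-file approach described above might look roughly like the sketch below. This is not necessarily what was actually run: the toy fam file, the temporary directory and the file name wanted_ids.txt are assumptions for illustration. The idea is to copy the family ID and individual ID columns straight from the fam file so that the keep file matches it exactly.

```shell
# Toy fam file (hypothetical): FID IID PAT MAT SEX PHENO.
workdir=$(mktemp -d)
printf '0 Subject_001 0 0 1 -9\n0 Subject_002 0 0 2 -9\n0 Subject_003 0 0 1 -9\n' \
  > "$workdir/files.fam"

# Hypothetical list of the individual IDs to keep, one per line.
printf 'Subject_001\nSubject_003\n' > "$workdir/wanted_ids.txt"

# For every fam row whose individual ID is in the list,
# write its family ID and individual ID to the keep file.
awk 'NR == FNR { want[$1]; next }
     ($2 in want) { print $1, $2 }' \
  "$workdir/wanted_ids.txt" "$workdir/files.fam" > "$workdir/includesubjects.txt"

cat "$workdir/includesubjects.txt"

# The keep file would then be passed to PLINK as in the original command:
# plink --bfile /path/to/plink/files --keep "$workdir/includesubjects.txt" \
#       --make-bed --out /path/to/subsetted/plink/files
```

Because the two columns come from the fam file itself, spacing and ID spelling cannot drift from what PLINK reads.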
Hi Sheila, I am trying to remove some patients from my data by ID, and I am using the original PED file to do that. I created a .txt file with the family IDs and patient IDs that I want to remove, in two columns, but it still doesn't work. The analysis seems to run until the end of the process (creating temporary files), when this message appears: Error: duplicates ID.
My command is: $ ./plink --file name --remove IDlist.txt --out subset2 --make-bed
And my IDlist.txt is:
1 2204
2 1146
So I know I have a few duplicates, but I don't understand why their presence prevents the removal.
How did you sort out your problem? Do you mind explaining here?
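A quick way to see which IDs are duplicated is to list the family ID / individual ID pairs that occur more than once in the PED file. This is only a sketch on toy data (the temporary directory, the toy genotypes and the name duplicate_ids.txt are made up), and it assumes the error is caused by the same pair appearing on more than one row.

```shell
# Toy PED file (hypothetical): FID IID PAT MAT SEX PHENO, then genotypes.
workdir=$(mktemp -d)
printf '1 2204 0 0 1 -9 A A\n2 1146 0 0 2 -9 A G\n1 2204 0 0 1 -9 A A\n3 3001 0 0 1 -9 G G\n' \
  > "$workdir/name.ped"

# Take the first two columns, sort them, and keep only the pairs
# that appear more than once.
awk '{ print $1, $2 }' "$workdir/name.ped" | sort | uniq -d \
  > "$workdir/duplicate_ids.txt"

cat "$workdir/duplicate_ids.txt"
```

If the duplicate check happens while the file is being loaded, before --remove is applied, the duplicated rows would have to be renamed or dropped from the PED file itself first.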
Hi @Vale, there are a couple of things I'd check/try:
I hope that helps!