Using Plink, I would like to calculate allele frequencies for a subset of individuals (cases) from a total cohort of 188 individuals (92 cases + 96 controls). I have proper .ped and .map files.
I have tried multiple options but I am not able to subset data. These are two typical ways I have tried:
1)
./plink --file /path/in_data --chr 1-22 --allow-extra-chr --filter /path/cases.raw 1 --freq --make-bed --out out_data_cases
The .raw file has this format:
CAV-001 CAV-001 1
CAV-002 CAV-002 1
CAV-003 CAV-003 0
CAV-004 CAV-004 1
where the first column is the Family ID, the second column is the Individual ID, and in the third column 1 marks cases and 0 marks controls. I want to subset the cases. For every sample, Family ID = Individual ID.
2)
./plink --file /path/in_data --chr 1-22 --allow-extra-chr --keep /path/cases.txt --freq --make-bed --out out_data_cases
The cases.txt file includes columns 1 and 2 from the .raw file.
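For reference, a two-column keep file of this kind can be derived from the three-column file by filtering on the third column. This is a minimal sketch, assuming whitespace-delimited columns; the file names `example.raw` and `example_cases.txt` are hypothetical and hold only the four example rows shown above.

```shell
# Toy copy of the four example rows shown above (FID, IID, case status).
printf 'CAV-001 CAV-001 1\nCAV-002 CAV-002 1\nCAV-003 CAV-003 0\nCAV-004 CAV-004 1\n' > example.raw

# Keep rows whose third column is 1 (cases) and print FID and IID only.
# tr strips carriage returns in case the file was saved with Windows line endings.
tr -d '\r' < example.raw | awk '$3 == 1 { print $1, $2 }' > example_cases.txt

cat example_cases.txt
```

The resulting file has one "FID IID" pair per line, which is the layout `--keep` expects.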
This is what I get in the .log file (some paths are not shown):
16384 MB RAM detected; reserving 8192 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (868263 variants, 188 people).
....
868263 variants loaded from .bim file.
188 people (0 males, 0 females, 188 ambiguous) loaded from .fam.
Ambiguous sex IDs written to xxxx.nosex
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 188 founders and 0 nonfounders present.
Calculating allele frequencies... done.
--freq: Allele frequencies (founders only) written to xxxx.frq
868263 variants and 188 people pass filters and QC.
Note: No phenotypes present.
--make-bed to ....
My question is similar to "Cannot remove subjects from Plink files", but I have tried what is suggested there without success. Please help!
Thank you @Kevin Blighe. Plink should recognize the hyphens in the individuals' names from the .ped file, as I am able to get the allele frequencies for the whole cohort but not for the cases subset. I did a trial by copying and pasting a few individuals from the .fam file (and also from the .ped file), and I get the same output.
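One check that can be done outside PLINK is to count how many FID/IID pairs in the keep file match the .fam file exactly, since invisible characters (for example Windows carriage returns) would prevent a match. This is only a sketch of one possible cause, not a diagnosis; the file names `toy.fam`, `toy_keep.txt` and `toy_keep.clean.txt` are hypothetical toy files.

```shell
# Toy .fam file (FID IID father mother sex phenotype) and toy keep file.
printf 'CAV-001 CAV-001 0 0 0 -9\nCAV-002 CAV-002 0 0 0 -9\nCAV-003 CAV-003 0 0 0 -9\n' > toy.fam
# The second keep line deliberately ends with a carriage return (\r).
printf 'CAV-001 CAV-001\nCAV-002 CAV-002\r\n' > toy_keep.txt

# Count .fam rows whose FID/IID pair appears verbatim in the keep file.
count_matches() {
  awk 'NR == FNR { ids[$1 " " $2]; next } ($1 " " $2) in ids' "$1" toy.fam | wc -l | tr -d ' '
}

echo "matches with raw keep file:     $(count_matches toy_keep.txt)"

# Strip carriage returns and count again.
tr -d '\r' < toy_keep.txt > toy_keep.clean.txt
echo "matches with cleaned keep file: $(count_matches toy_keep.clean.txt)"
```

If the two counts differ on real data, the keep file contains characters that stop its IDs from matching the .fam entries.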
Yes, you're correct that it should accept hyphens.
I realise that I have never actually used --keep in PLINK. For sample filtering, I have always conducted it outside of PLINK when I have my data in BCF or VCF format, and I filter with a command like this:
I then convert these BCFs to PLINK format (> PLINK 1.9 required)
KeepCases.list is just a single-column file of sample IDs.
KeepCases.IDSort.list is the standard 2-column format, as per your own file. It instructs PLINK to read in the data and maintain the sample ordering as per KeepCases.IDSort.list.
I don't know if this is an option for you or not!
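A minimal sketch of this kind of workflow, assuming bcftools and PLINK 1.9 are installed. The input `cohort.bcf` and the outputs `cases.bcf` and `cases` are hypothetical names, the toy ID list below stands in for a real one, and the flags are one reasonable choice rather than the exact commands used in the answer.

```shell
# Toy two-column list (FID IID); replace with your own list of cases.
printf 'CAV-001 CAV-001\nCAV-002 CAV-002\nCAV-004 CAV-004\n' > KeepCases.IDSort.list

# Single-column sample list for bcftools, taken from the IID column.
awk '{ print $2 }' KeepCases.IDSort.list > KeepCases.list

# Run the tools only if they and the (hypothetical) input file are present.
if command -v bcftools >/dev/null 2>&1 && command -v plink >/dev/null 2>&1 && [ -f cohort.bcf ]; then
  # Subset the BCF to the listed samples.
  bcftools view --samples-file KeepCases.list -Ob -o cases.bcf cohort.bcf

  # Convert to PLINK binary format, set FID = IID from the sample ID,
  # order samples as in the two-column list, and compute allele frequencies.
  plink --bcf cases.bcf --double-id --keep-allele-order \
        --indiv-sort file KeepCases.IDSort.list \
        --freq --make-bed --out cases
fi
```

Because the subsetting happens before PLINK reads the data, the frequencies reported by `--freq` are computed on the cases only.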
Thank you @Kevin Blighe, I will try that. I have Plink 1.9