Yes, this caught me off guard long ago, too, and I went crazy with it trying to figure it out.
You can use an individual sort file in order to control the ordering of your samples as it's converted from VCF / BCF to plink format, by using --indiv-sort file
:
plink --noweb --bcf Caucasian.bcf --keep-allele-order --indiv-sort file CaucasianIDSort.list --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out Caucasian ;
The file, CaucasianIDSort.list, contains 2 columns with FID and IID:
0 17442
0 16427
0 17423
0 17443
0 16451
0 17427
0 17447
0 16456
0 17428
0 17448
0 17412
This ordering should obviously match whatever FAM file you are using, too. I always use a custom FAM and specify it in all analyses with
--fam CaucasianCustom.fam
--------------------------------
If you need to worry about family structure, then these could be encoded in your VCF / BCF as FID_IID (you can modify VCF headers with bcftools reheader
). You then read these into PLINK, maintaining family information, with:
plink --noweb --bcf Families.bcf --keep-allele-order --indiv-sort file FamiliesIDsort.list --vcf-idspace-to _ --id-delim _ --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out Families ;
FamiliesIDsort.list looks like:
Fam1 16367
Fam1 16407
Fam2 16402
Fam2 16382
Fam3 16392
Fam3 16362
Fam4 16372
Fam4 16377
Fam5 16397
Fam5 16387
Fam6 16396
Fam6 16366
Fam7 16405
Fam7 16400
Fam8 16410
Fam8 16376
The sample IDs in the input VCF / BCF look like:
Fam1_16367
Fam1_16407
Fam2_16402
Fam2_16382
Fam3_16392
et cetera
Kevin
Hi Kevin! Many thanks for your detailed response! It's very useful. I'm wondering wether string order in
.ped
file somehow linked to string order of my.map
file? Probably not, but i'm not 100% sure. Am i right? Besides how is it possible to convert binary files to plain text ones. I'm just learningPlink
and would prefer to work with text files. Thank you again! DenisHey dude. The map file just contains information on the variants in the PED file. You could re-order the samples (rows) in the PED file but it would be wrong to re-order the columns (variants), because those are inextricably linked to the MAP file.
Hi Kevin! Thank you so much!
Hi Kevin! I'd like to clarify one more thing related to my post. I input into analysis phenotype data for all samples as a separate file. I followed by your advises and indicated
--indiv-sort
in my command line. But because of high missing genotype rate for a number of individuals, they were filtered out. As a result.ped
file has a different IIDs number and its order in comparison to my phenotype data file. Will it be a problem forPlink
to correctly merge genotype and phenotype data for the association test in that case?If you filter out variants, then it is no problem. If you filter out samples... then I would recommend updating your phenotype information, i.e., in your FAM file, and ensure that it matches the same order of FID and IID as your
--indiv-sort
fileIt may involve going back and forward a bit...
Generally, always ensure that your phenotype files and sample sort files have the same order and content (for FID and IID). Plink does not check that the ordering in the PED is the same as per your custom files.
Thank you so much for your help Kevin, I finally managed to use plink assoc to do the test, and this time I did it for 16 samples: 8 control and 8 case. (just for testing). This is what I got:
This means there is no real association between disease and mutations when P values are 1 and that there is an association when P values are 0? and why is chisq always 0? maybe I need to tune something else? Thanks a lot!
Yes, each of these p-values appears to be 1. Also, with 8 versus 8, you will find it difficult to get many statistically significant findings from this. In fact, from what I can see, all of the variants listed above are only found in 1 or 2 individuals (look at the
C_A
andC_U
columns). An NA will be returned for the p-value if the variant / SNP is missing in a simple (these would have been encoded as./.
in the VCF, whereas homozygous ref would be encoded0/0
).