Question

How to add phenotypic data to .fam in plink (specifically, assign 2 to case and 1 to control samples in the phenotype column)

2.9 years ago

Michelle • 0

I have a large vcf file with both case and control samples in the file. I am planning to input the vcf into the --assoc function of plink with the --fam parameter that contains a .fam file that specifies which samples are case and which are control. I want to make a .fam file with the case and control samples labelled in the phenotype value column. How can I assign case/control to different samples for the .fam of my merged vcf? My big vcf file doesn't indicate whether the samples are case or control. I have two tsv files, one with a list of control samples and one with a list of case samples. Can I use these two files to specify in the .fam file which samples are case and which are control?

Could I also set the specific gene mutation as the case and samples without the mutation as the control, and run an association test based on those parameters?

plink vcf phenotype. • 4.4k views

ADD COMMENT • link updated 2.9 years ago by Sam ★ 4.8k • written 2.9 years ago by Michelle • 0

score 1 · Answer 1 · 2021-12-21

2.9 years ago

Sam ★ 4.8k

Assuming you have two files, case.tsv and control.tsv, each containing the FID and IID of the case and control samples respectively.

You can do

awk 'NR == FNR{print $1,$2,"2"} NR != FNR{print $1, $2, "1"}' case.tsv control.tsv > Phenotype

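NR is the number of rows read so far across all input files and FNR is the row number within the current file, so NR==FNR is only true while awk is reading the first file (case.tsv). Rows from that file get phenotype 2 (case), and rows from the second file get 1 (control).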
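To see the NR/FNR logic in action, here is a minimal sketch on toy data. The sample IDs and the temporary directory are made up for illustration; the awk program has the same effect as the one above, written with `next` instead of a second condition.

```shell
# Toy inputs with hypothetical sample IDs, one "FID IID" pair per line.
workdir=$(mktemp -d)
printf 'F1 S1\nF2 S2\n' > "$workdir/case.tsv"
printf 'F3 S3\nF4 S4\nF5 S5\n' > "$workdir/control.tsv"

# While awk reads the first file, FNR (row number within the file) equals
# NR (row number overall), so those rows are labelled 2 (case).
# Every later row comes from the second file and is labelled 1 (control).
awk 'NR == FNR {print $1, $2, 2; next} {print $1, $2, 1}' \
    "$workdir/case.tsv" "$workdir/control.tsv" > "$workdir/Phenotype"

cat "$workdir/Phenotype"
# F1 S1 2
# F2 S2 2
# F3 S3 1
# F4 S4 1
# F5 S5 1
```

Note that if the real tsv files have a header row, that row would be copied into the output as well, so it is worth checking the first line of each file.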

You can then use this file for association with

plink \
    --assoc \
    --pheno Phenotype \
  ....

There might be a more efficient way to do this, but this is what I have come up with for now.


Thanks, this looks pretty good. The thing is, I only have the IID listed in the case and control .tsv. I think the FID and IID in my .fam are the same for each sample.

ADD REPLY • link 2.9 years ago by Michelle • 0

Also, I need the three fields between IID and Phenotypical value (all should be 0). How can I add 3 columns of 0 between IID and Phenotypical value?

ADD REPLY • link 2.9 years ago by Michelle • 0

I tried this, but plink log tells me that all the samples detected are control samples. Is it because I don't have the values for ID of father and mother in the file?

ADD REPLY • link 2.9 years ago by Michelle • 0

You don't need a .fam file when you are working from a vcf; you only need a phenotype file with the columns FID, IID and Phenotype.

If your tsv only contains the IID, you can do

awk 'NR == FNR{print $1,$1,"2"} NR != FNR{print $1, $1, "1"}' case.tsv control.tsv > Phenotype

Or have you already converted your vcf into a binary plink file?

In that case, you can do

plink \
   --bfile <prefix> \
   --pheno Phenotype \
   ...
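If the labels really do need to live in the .fam itself (six columns: FID, IID, father ID, mother ID, sex, phenotype), one option is to rewrite only column 6 of the existing .fam so that the row order still matches the .bed file. The following is a rough sketch on toy data, assuming the tsv files hold one IID per line; the file names study.fam and study.pheno.fam and the sample IDs are hypothetical.

```shell
workdir=$(mktemp -d)
# Hypothetical existing .fam with unknown parents/sex (0) and missing phenotype (-9).
printf 'S1 S1 0 0 0 -9\nS2 S2 0 0 0 -9\nS3 S3 0 0 0 -9\n' > "$workdir/study.fam"
printf 'S2\n' > "$workdir/case.tsv"
printf 'S1\nS3\n' > "$workdir/control.tsv"

# "file" counts which input awk is on (assumes neither tsv is empty).
# Files 1 and 2 fill a lookup table keyed by IID; file 3 is the .fam,
# whose 6th column is replaced while the row order is left untouched.
awk 'FNR == 1 {file++}
     file == 1 {pheno[$1] = 2; next}
     file == 2 {pheno[$1] = 1; next}
     {$6 = ($2 in pheno) ? pheno[$2] : -9; print}' \
    "$workdir/case.tsv" "$workdir/control.tsv" "$workdir/study.fam" \
    > "$workdir/study.pheno.fam"

cat "$workdir/study.pheno.fam"
# S1 S1 0 0 0 1
# S2 S2 0 0 0 2
# S3 S3 0 0 0 1
```

Samples that appear in neither list keep -9 (missing). Writing to a new file rather than overwriting the original .fam makes it easy to compare the two before using the result.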

ADD REPLY • link 2.9 years ago by Sam ★ 4.8k

Yeah, I had converted the vcf into a bfile prior. Thanks.

So it doesn't matter if the Phenotype file lists the samples in a different order than the bfile?
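As far as I know, plink matches the rows of a --pheno file to samples by FID and IID rather than by position, so the order should not matter; what does matter is that the IDs agree exactly. A quick sketch for checking that on toy data (file names and sample IDs are hypothetical) lists every sample in the .fam that has no row in the phenotype file:

```shell
workdir=$(mktemp -d)
# Hypothetical .fam and a phenotype file in a different order, with S2 missing.
printf 'S1 S1 0 0 0 -9\nS2 S2 0 0 0 -9\nS3 S3 0 0 0 -9\n' > "$workdir/study.fam"
printf 'S3 S3 1\nS1 S1 2\n' > "$workdir/Phenotype"

# First file: remember every FID/IID pair that has a phenotype.
# Second file: print the pairs from the .fam that were never seen.
awk 'NR == FNR {seen[$1 " " $2] = 1; next}
     {key = $1 " " $2; if (!(key in seen)) print key}' \
    "$workdir/Phenotype" "$workdir/study.fam" > "$workdir/unmatched.txt"

cat "$workdir/unmatched.txt"
# S2 S2
```

An empty unmatched.txt means every sample in the .fam has a phenotype row. Any IDs listed there would most likely end up with a missing phenotype in the association test.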