I'm trying to convert a VCF of ~15K samples (~4K families) into PLINK binary format to check for any pedigree errors using KING.
The input files need to be in PLINK binary format, e.g., ex.bed, ex.fam, and ex.bim.
There are multiple tools out there to convert a VCF to PLINK binary format, namely the .bed, .bim. and .fam
files of the 'binary fileset'.
However, GATK3's VariantsToBinaryPed is the only tool I know of that requires as input a file stating family relationships among samples a priori, in order to create the above mentioned .fam
file (the first 6 columns of PLINK's .ped
file).
My question is, how are all the other VCF -> PLINK
conversion tools inferring relationships in order to output the .fam
file of the PLINK binary fileset?
Does the information contained in the .fam file, or any pedigree/sample metadata information, actually go into the .bed file?
Because if it doesn't... then I don't really need to worry about all of this - I could just create the .binary fileset from the VCF as instructed in the KING tutorial (without specifying a custom .fam), and then just supply my own .fam when running KING:
king -b ex.bed --fam ex.fam --bim ex.bim --related
My sense, though, is that pedigree info is somehow encoded in the .bed file, though, as other tools for converting VCF to PLINK, like the aforementioned VariantsToBinaryPed from GATK, specifically require a metadata file to create the PLINK binary fileset.
And, can one pass as input a .fam
file (or similar) to these tools if the sample relationships are actually known (from the pedigree), as is my case?
Thanks!
UPDATE:
For some reason, the number of samples in my VCF is smaller than those in my .fam file (which includes all the samples that were sequenced.) I inherited these files, so I'm not sure what the criteria was or why these ~700 samples didn't make it into the VCF.
Is there any way to tell PLINK2 to ignore this discrepancy, or do I need to go back and figure out which samples were filtered in the VCF and remove them from the .fam?
UPDATE 2
Stepping back - ultimately, my goal is to use this for pedigree checking using KING - In their tutorial, it is stated that one should first convert the VCF to PLINK binary format, and a command is given:
The VCF file of the sequence data can be easily converted into a PLINK binary format using PLINK2:
plink2 --vcf example.vcf.gz --make-bed --out ex
I understand that this command will output a PLINK binary fileset (ex.fam , ex.bed, ex.bim) -- however, I already have a .fam with the known relationships between samples, which is in fact what I want to check for errors with KING. Surely there must be a way to read the VCF into PLINK, and take the FIDs from the supplied .fam, in order to create the binary fileset with a known .fam file?
Thanks again.
plink 2.0 lets you use --fam with --vcf, and verifies that the sample IDs are consistent before proceeding.
Hello again, I guess I'm still confused about the right way to go about this -- have not been able to get past the VCF/.fam mismatch error.
I'm not sure that the options suggested for interpreting the VCF sampleIDs will help my particular case (
--double-id
,--id-delim
,--const-fid
) - The trouble is that the SampleIDs in the VCF are an exact match to the IID column of the .fam file.I've triple-checked that these two lists contain the same elements, and I'm passing the
--indiv-sort file
parameter, with a sort file that is an exact copy of the first two columns (FID and IID) of the .fam file, so the VCF sampleIDs should indeed be read by PLINK in the exact same order as the .fam IIDs.My VCF does not contain FIDs in the SampleIDs (just the IIDs), but the FIDs are stated in the .fam -- surely there must be a way to read in the VCF and take the FIDs from the supplied .fam to create the binary fileset?
Stepping back - ultimately, my goal is to use this for KING - In their tutorial, it is stated that one should first convert the VCF to PLINK binary format, and a command is given:
I understand that this command will output a PLINK binary fileset (
ex.fam , ex.bed, ex.bim
) -- however, I already have a .fam with the known relationships between samples, which is in fact what I want to check for errors with KING.Apologies if I'm being too silly - I think I'm just very confused at this point!
Thanks for all your help.
Then you will have to code the sample IDs in your VCF as FID_IID, and use
--id-delim _
when converting to plink. Data conversion steps are rarely streamlined and form a large part of our work, at least from what I can see.Yes, the developers of KING obviously didn't realise the glaring issue of losing family relationships when just using
plink2 --vcf
Hi Kevin Blighe, @gaelgarcia and others - I have been using this chat to help get me started with PLINK and Im running into an error when running. I am also using plink to convert a VCF to binary bed for input to KING. I am also using it to confirm family relationships as well, so I have a pedigree file with known families.
The error is: "Error: No '_' in sample ID." I converted all sample IDs to FID_IID as you suggested, and when I run bcftools query -l I see all of the updated sample IDs. When I run grep, it recognizes the "_". Im not sure why this would be occurring. Does anyone have any suggestions?
Does the information contained in the
.fam
file, or any pedigree/sample metadata information, actually go into the.bed
file?Because if it doesn't... then I don't really need to worry about all of this - I could just create the .binary fileset from the VCF as instructed in the KING tutorial (without specifying a custom .fam), and then just supply my own .fam when running KING:
My sense, though, is that pedigree info is somehow encoded in the
.bed
file, though, as other tools for converting VCF to PLINK, like the aforementioned VariantsToBinaryPed from GATK, specifically require a metadata file to create the PLINK binary fileset.No, the .bed file does not contain any pedigree/sample information; it's nothing but a matrix of {0, 1, 2, missing} allele count values. (Even the matrix dimensions aren't in the file!)
Ahh, that explains so much! Thanks!
You can use plink 1.9 —update-ids + —make-bed to fill in the initially-missing FIDs (https://www.cog-genomics.org/plink/1.9/data#update_indiv ). This requires you to create a four-column file describing how you want to change the IDs.
Thank you! I do with all these options were clearly listed in the documentation... 😔
@chrchang: So in this case, if I run :
./plink2 --make-bed --vcf ~/path/to/mips.vcf.bgz --fam ~/path/to/mips_pedigree.fam --out mips
,then (if I understand correctly),
--make-bed
will output amips.fam
file identical to the.fam
file passed with--fam
?If you do not specify any FAM file when you are converting from VCF/BCF to plink format, then plink will just create a 'dummy' FAM file with the same name as your dataset. It will look something like this:
When you read your data into plink, I strongly encourage you to make the use of a 'sort file'. For example:
It does not matter how your samples are positioned in the BCF/VCF file - the use of the
--indiv-sort file
parameter instructs plink to read samples into the plink object in that order. The file IDsort.list contains (FID
IID
):Why is this important? It's important because, without specifying the order, plink sometimes inputs the samples seemingly randomly, and may not match the order in your custom FAM file. So, when you now want to create your FAM file, ensure 100% that the sample order in the FAM file matches the order in the IDsort.list file.
Note also the
--id-delim _
parameter. This instructs plink that, in my VCF/BCF, the FID and IID are split by an underscore. So, in your VCF/BCF, you should rename your sample IDs to something like:You can do this with
bcftools reheader
----------------------------------------------
After that, use the
--fam
parameter in every command that you run with plink. For example:Thank you for this extremely valuable heads-up!
Kevin - Do the sample IDs in the VCF have to include both FID and IID?
My VCF Sample IDs consist of only the IID -- I'm still getting an error message after filtering the .fam file to contain only samples found in the VCF, in addition to specifying the order with an IDsort file.
Where
IDsort.list
is the first two columns oftargeted_seq_subsetVCF.fam
.No, you can leave it as just the sample ID on its own - sorry not to make that clear. However, if you then try to use a custom FAM with FIDs set, then plink will throw an error / not function as expected. In most of my work, I had no family data, so, in my custom FAM the FIDs were all zeros.
As chrchang523 mentions you can then later update the FIDs later within plink with
--update-parents
.Having to change IDs in a VCF is indeed a nuisance.
Hi, would like to confirm, for FID which are zeros, it would not be considered as one family in relationship inference? I was worried the same FID might lead samples being regarded as same family, and wonder if we need unique FID for all?
The command may error out if the sample IDs, or their order, aren’t identical between the .fam and the .vcf; you may need to use —double-id/—id-delim/—const-fid to change how the VCF IDs are interpreted. But if it works, the output .fam should be the same.
Thanks @Kevin and @chrchang -
For some reason, the number of samples in my VCF is smaller than those in my .fam file (which includes all the samples that were sequenced.) I inherited these files, so I'm not sure what the criteria was or why these ~700 samples didn't make it into the VCF.
Is there any way to tell PLINK2 to ignore this discrepancy, or do I need to go back and figure out which samples were filtered in the VCF and remove them from the .fam?
No, you can't tell plink2 to ignore the discrepancy, but in that case it's pretty straightforward to just import the VCF without pedigree info, and then patch in parental/sex(/phenotype?) information afterward with plink 1.9 --update-parents/--update-sex/--pheno.
Is it giving an error due to that discrepancy? Perhaps chrchang523 knows of a specific parameter that can be used.
Personally, I would indeed ensure that missing samples were not included in the FAM. You could output the row in your VCF that has the sample names and then convert them into a list format (plain text). Then use
grep -f
in oder to filter only those samples from the FAM.Actually, I jst tried that and it seems that it would be a quick solution. Here's how you could get your list of samples in the VCF:
Obviously then just remove the unnecessary first few rows (
#CHROM
POS
, etc)). After that, usegrep -f
to filter the FAMThank you, Kevin! I'll give this a try. In fact, it does return an error because of this discrepancy:
I'm pretty sure this is because of the missing samples in the VCF - the IIDs are identical between .fam and VCF.
Just for the sake of saving other newbies some time, here's how I extracted the sampleIDs from the VCF on Mac:
And then subset the original
.fam
to filter out samples NOT found in the VCF.Conversely, to get the sampleIDs missing from the VCF:
I just realized there's a problem with this logic -- the command gets every line from the
.fam
file where's a match to the sampleIDs extracted above. This means that samples that are not in the VCF, but, for example, their parents are, will have a match, and still be included. I reckon I should restrictgrep
to look for matches only in the 2nd column of the.fam
file.UPDATE:
This seems to do the trick -
Could could try awk instead. Match on the IID column and, if there's a match, print the entire line