I computed allele frequencies for a list of SNPs for a set of individuals in PLINK using the following line:
./plink --bfile Pop_Imputed_filtered --keep Pop.txt --freq --make-bed --out Pop
Now, I want to compute allele frequencies for the same set of SNPs but for each individual, so it should calculate individual frequencies instead of the average for the set of individuals contained in the Pop.txt file. I know that the output values would be 0, 0.5 or 1 (no alternate allele present/ heterozygous, homozygous). VCFTOOLS has got the --indv command to achieve this but I cannot find anything similar for Plink. This enabled me to write a loop for Vcftools to compute frequencies by individuals contained in the Pop.txt file, and it works. I tried to adapt it to PLINK by changing --indv to --keep but it doesn't find any individuals:
for pop in $(ls ~/genetics/hgdp/pops/*.txt); do ./plink2 --bfile ~/genetics/hgdp_hg38 --extract ~/genetics/rsID.txt --keep $pop --freq --make-bed --out freq-$(basename ${pop}); done
Where /pops is the folder containing the Pop.txt files for each population.
PLINK v2.00a3LM 64-bit Intel (20 Jan 2022) Options in effect: --bfile /home/davide/genetics_2020/hgdp_hg38 --extract /home/davide/genetics_2020/rsID.txt --freq --keep /home/davide/genetics_2020/hgdp/pops/ADYGEI.txt --make-bed --out freq-ADYGEI.txt
Hostname: davide Working directory: /home/davide/genetics_2020 Start time: Mon Mar 21 10:14:10 2022
Random number seed: 1647854050 128723 MiB RAM detected; reserving 64361 MiB for main workspace. Using up to 28 threads (change this with --threads). 929 samples (0 females, 0 males, 929 ambiguous; 929 founders) loaded from /home/davide/genetics_2020/hgdp_hg38.fam. 75310370 variants loaded from /home/davide/genetics_2020/hgdp_hg38.bim. Note: No phenotype data present. --extract: 8493 variants remaining. --keep: 0 samples remaining. Error: No samples remaining after main filters.
Looks like there is a mismatch between how the sample ID is represented in the --keep file vs. the .fam file. What are the contents of
/home/davide/genetics_2020/hgdp/pops/ADYGEI.txt
, and what are the first two lines of/home/davide/genetics_2020/hgdp_hg38.fam
?The first line of the . fam file:
The first line of the pop.txt file:
I can confirm that I used the pop.txt file for filtering frequencies by population and it works fine. Not sure why it's not working for freqs by individual using the loop.
Yes, the
pop.txt
file is fine. But the failing run is looking atADYGEI.txt
. What's the first line of that file?This one:
I changed the .pop file to be structure like the .fam file, and it recognized the individuals but it grouped them together, still not parsing frequencies by individual, only grouping them together. So must be something wrong with the loop.
Hmm, that also looks fine; I am unable to replicate your
--keep
failure on my end. If you can send me a command and a specific set of input files where--keep
fails on your end, I will take a look, but otherwise I don't know what's going on.Thanks a lot for your help! This is the link to download the files.
This is the command:
With the posted files, I'm getting a different error: --keep is working, but --extract is failing because chrposhg19_list.tab has chrom:pos variant IDs but the .bim file has chrom:pos:allele:allele variant IDs.
Yea that's an odd format returned by the imputation software. I guess I will have to edit the bim file to make it in the standard chr:pos format.
I have parsed the bim file through R and changed the chr:pos format to be the same as the chrposhg19_list.tab file. Please let me know if you get any errors.
I attached the file to run with the loop in the pcloud link.
The basic plink command now works on my end.
Yea but I am still not able to fetch the frequencies by individual. I have to run the --keep command each time for each individual using a different file containing only 1 line because the --keep command by defaults computes average frequencies for all individuals in the pop.txt file. Plink should have an --indv command like VCFTOOLS so we can extract frequencies for each individual with the pop.txt file.
As mentioned in this thread's initial post, plink 2.0 has a
--loop-cats
flag to generate per-population results in a single run, and the plink 1.x equivalent is to combine--freq
with--within
.With that said, I will add an
--indv
flag to the next plink 2.0 build.The 28 Mar 2022 build of plink2 has an --indv flag.
Oh that was quick. Impressive & thanks!
I was able to run the --indv command successfully on a file with a bed file by simply listing the IDs in single column. However, with another file, plink won't find any individuals, neither with single or double column in the keep/indv samples file. The .fam file (1KG) has this format:
HG00097 HG00097 0 0 0 -9
No individuals are found when using a sample file (--indv IDlist.txt) either in one column or double columns
--indv: 0 samples remaining.
What is strange is that the same ID file (with double columns) works when used with the -keep command:
--indv
takes a single ID, not a filename. So--indv IDlist.txt
shouldn't work.Sorry I forgot to mention I used it in this loop, which worked with another plink file, but not with this 1KG file.
Specifically, the log file is like this:
If 1kgtot.bim contains doubled IDs (e.g. "HG00096 HG00096", you need to provide a doubled ID to
--indv
. The log file shows that your script is passing a single-part ID ("--indv HG00096 --out ...").My recommendation is to zero out the first column of 1kgtot.bim (so the first line starts with "0 HG00096", etc.). Then plink2 will accept either "HG00096" or "0 HG00096" as referring to the first sample.
Thanks and yes I had tried that but failed. I used the --const-fid (converts sample IDs to within-family IDs while setting all family IDs to a single value (default '0')) plink command but the .fam file is unchanged. Not sure which other ways to zero out the first column there are (that is, using PLINK)? Otherwise I will just do it in R or shell
During VCF import, something needs to be done to convert the single-part sample IDs in the input to PLINK's two-part IDs.
--const-fid 0
is one of the possible rules, and has been the PLINK 2.0 default for the last several years. (In retrospect, this should always have been the default rule...)That's the only thing
--const-fid
does.It's possible to zero out the first .fam column "with PLINK", but frankly it's simpler to use a shell one-liner.