Computing allele frequencies per individual for a set of SNPs in PLINK
1
0
Entering edit mode
2.7 years ago
pifferdavide ▴ 110

I computed allele frequencies for a list of SNPs for a set of individuals in PLINK using the following line:

./plink --bfile Pop_Imputed_filtered --keep Pop.txt --freq --make-bed --out  Pop

Now, I want to compute allele frequencies for the same set of SNPs but for each individual, so it should calculate individual frequencies instead of the average for the set of individuals contained in the Pop.txt file. I know that the output values would be 0, 0.5 or 1 (no alternate allele present/ heterozygous, homozygous). VCFTOOLS has got the --indv command to achieve this but I cannot find anything similar for Plink. This enabled me to write a loop for Vcftools to compute frequencies by individuals contained in the Pop.txt file, and it works. I tried to adapt it to PLINK by changing --indv to --keep but it doesn't find any individuals:

for pop in $(ls ~/genetics/hgdp/pops/*.txt); do ./plink2 --bfile ~/genetics/hgdp_hg38 --extract ~/genetics/rsID.txt --keep $pop --freq --make-bed --out freq-$(basename ${pop}); done

Where /pops is the folder containing the Pop.txt files for each population.

plink frequencies allele • 4.0k views
ADD COMMENT
0
Entering edit mode
2.7 years ago

plink 2.0’s —loop-cats flag (https://www.cog-genomics.org/plink/2.0/other#loop_cats ) provides one way to do this without an external for-loop.

I’d need to see a full .log file from one of the failed runs to tell you why your current for-loop doesn’t work, though.

ADD COMMENT
0
Entering edit mode

PLINK v2.00a3LM 64-bit Intel (20 Jan 2022) Options in effect: --bfile /home/davide/genetics_2020/hgdp_hg38 --extract /home/davide/genetics_2020/rsID.txt --freq --keep /home/davide/genetics_2020/hgdp/pops/ADYGEI.txt --make-bed --out freq-ADYGEI.txt

Hostname: davide Working directory: /home/davide/genetics_2020 Start time: Mon Mar 21 10:14:10 2022

Random number seed: 1647854050 128723 MiB RAM detected; reserving 64361 MiB for main workspace. Using up to 28 threads (change this with --threads). 929 samples (0 females, 0 males, 929 ambiguous; 929 founders) loaded from /home/davide/genetics_2020/hgdp_hg38.fam. 75310370 variants loaded from /home/davide/genetics_2020/hgdp_hg38.bim. Note: No phenotype data present. --extract: 8493 variants remaining. --keep: 0 samples remaining. Error: No samples remaining after main filters.

ADD REPLY
0
Entering edit mode

Looks like there is a mismatch between how the sample ID is represented in the --keep file vs. the .fam file. What are the contents of /home/davide/genetics_2020/hgdp/pops/ADYGEI.txt, and what are the first two lines of /home/davide/genetics_2020/hgdp_hg38.fam?

ADD REPLY
0
Entering edit mode

The first line of the . fam file:

0   ITC-MAR.G_MarABG010D    0   0   0   -9

The first line of the pop.txt file:

ITC-MAR.G_MarABG010D

I can confirm that I used the pop.txt file for filtering frequencies by population and it works fine. Not sure why it's not working for freqs by individual using the loop.

ADD REPLY
0
Entering edit mode

Yes, the pop.txt file is fine. But the failing run is looking at ADYGEI.txt. What's the first line of that file?

ADD REPLY
0
Entering edit mode

This one:

0   ITC-MAR.G_MarABG010D    0   0   0   -9
ADD REPLY
0
Entering edit mode

I changed the .pop file to be structure like the .fam file, and it recognized the individuals but it grouped them together, still not parsing frequencies by individual, only grouping them together. So must be something wrong with the loop.

ADD REPLY
0
Entering edit mode

Hmm, that also looks fine; I am unable to replicate your --keep failure on my end. If you can send me a command and a specific set of input files where --keep fails on your end, I will take a look, but otherwise I don't know what's going on.

ADD REPLY
0
Entering edit mode

Thanks a lot for your help! This is the link to download the files.

This is the command:

for pop in $(ls ~/genetics/pops/*.txt); do ./plink2 --bfile Italy_Imputed_filtered --extract ~/genetics/chrposhg19_list.tab --keep $pop --freq --make-bed --out freq-$(basename ${pop}); done
ADD REPLY
0
Entering edit mode

With the posted files, I'm getting a different error: --keep is working, but --extract is failing because chrposhg19_list.tab has chrom:pos variant IDs but the .bim file has chrom:pos:allele:allele variant IDs.

ADD REPLY
0
Entering edit mode

Yea that's an odd format returned by the imputation software. I guess I will have to edit the bim file to make it in the standard chr:pos format.

ADD REPLY
0
Entering edit mode

I have parsed the bim file through R and changed the chr:pos format to be the same as the chrposhg19_list.tab file. Please let me know if you get any errors.

ADD REPLY
0
Entering edit mode

I attached the file to run with the loop in the pcloud link.

ADD REPLY
0
Entering edit mode

The basic plink command now works on my end.

ADD REPLY
0
Entering edit mode

Yea but I am still not able to fetch the frequencies by individual. I have to run the --keep command each time for each individual using a different file containing only 1 line because the --keep command by defaults computes average frequencies for all individuals in the pop.txt file. Plink should have an --indv command like VCFTOOLS so we can extract frequencies for each individual with the pop.txt file.

ADD REPLY
0
Entering edit mode

As mentioned in this thread's initial post, plink 2.0 has a --loop-cats flag to generate per-population results in a single run, and the plink 1.x equivalent is to combine --freq with --within.

With that said, I will add an --indv flag to the next plink 2.0 build.

ADD REPLY
0
Entering edit mode

The 28 Mar 2022 build of plink2 has an --indv flag.

ADD REPLY
0
Entering edit mode

Oh that was quick. Impressive & thanks!

ADD REPLY
0
Entering edit mode

I was able to run the --indv command successfully on a file with a bed file by simply listing the IDs in single column. However, with another file, plink won't find any individuals, neither with single or double column in the keep/indv samples file. The .fam file (1KG) has this format:

HG00096 HG00096 0   0   0   -9
HG00096 HG00096 0   0   0   -9

HG00097 HG00097 0 0 0 -9

No individuals are found when using a sample file (--indv IDlist.txt) either in one column or double columns

HG00096 HG00096 
HG00097 HG00097

--indv: 0 samples remaining.

What is strange is that the same ID file (with double columns) works when used with the -keep command:

./plink2 --bfile 1kgtot --keep ~/genetics_2020/1kg/grch37/pops/prova.txt --freq --make-bed --out prova
ADD REPLY
0
Entering edit mode

--indv takes a single ID, not a filename. So --indv IDlist.txt shouldn't work.

ADD REPLY
0
Entering edit mode

Sorry I forgot to mention I used it in this loop, which worked with another plink file, but not with this 1KG file.

for popfile in $(ls pops/AFR/*.txt);do
    for indvrow in $(grep 'HG' $popfile);
    do
            for pop in $(ls pops/AFR/*.txt);
                    do ./plink2 --indv $indvrow --bfile 1kgtot --extract ~/genetics_2020/rsID.txt --freq --out ~/genetics_2020/1kg/grch37/afr/freq-$(basename ${pop})-$indvrow
            done
    done done

Specifically, the log file is like this:

PLINK v2.00a3LM 64-bit Intel (28 Mar 2022) Options in effect: --bfile 1kgtot --extract /home/davide/genetics_2020/rsID.txt --freq --indv HG00096 --out /home/davide/genetics_2020/1kg/grch37/afr/freq-prova.txt-HG00096 Hostname: davide Working directory: /home/davide/genetics_2020/1kg/grch37 Start time: Sun Jun 26 11:20:56 2022 Random number seed: 1656235256 128723 MiB RAM detected; reserving 64361 MiB for main workspace. Using up to 28 threads (change this with --threads). 2504 samples (0 females, 0 males, 2504 ambiguous; 2504 founders) loaded from 1kgtot.fam. 81271745 variants loaded from 1kgtot.bim. Note: No phenotype data present. --extract: 8576 variants remaining. --indv: 0 samples remaining. Error: No samples remaining after main filters.

ADD REPLY
0
Entering edit mode

If 1kgtot.bim contains doubled IDs (e.g. "HG00096 HG00096", you need to provide a doubled ID to --indv. The log file shows that your script is passing a single-part ID ("--indv HG00096 --out ...").

My recommendation is to zero out the first column of 1kgtot.bim (so the first line starts with "0 HG00096", etc.). Then plink2 will accept either "HG00096" or "0 HG00096" as referring to the first sample.

ADD REPLY
0
Entering edit mode

Thanks and yes I had tried that but failed. I used the --const-fid (converts sample IDs to within-family IDs while setting all family IDs to a single value (default '0')) plink command but the .fam file is unchanged. Not sure which other ways to zero out the first column there are (that is, using PLINK)? Otherwise I will just do it in R or shell

ADD REPLY
0
Entering edit mode

During VCF import, something needs to be done to convert the single-part sample IDs in the input to PLINK's two-part IDs. --const-fid 0 is one of the possible rules, and has been the PLINK 2.0 default for the last several years. (In retrospect, this should always have been the default rule...)

That's the only thing --const-fid does.

It's possible to zero out the first .fam column "with PLINK", but frankly it's simpler to use a shell one-liner.

ADD REPLY

Login before adding your answer.

Traffic: 2910 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6