Question

How to correct for allele frequency mismatches?

0

Entering edit mode

3.4 years ago

optimistsso4co3 ▴ 140

By comparing my plink frequencies with reference panel i get the following graph:

enter image description here

As you can see, vast majority of genotypes align well with reference, however, some are directly inverted. I have already excluded all variants with mismatched genotypes (strand flip & allele switch).

What is the correct way to fix it? I could think of just remove variants with large fequency deviation from reference, however, that leaves lot of incorrect snps in the middle, or, if i exclude all from the middle region it leaves a big empty range of frequency (0.4-0.6).

plink • 1.3k views

ADD COMMENT • link 3.4 years ago by optimistsso4co3 ▴ 140

score 2 · Accepted Answer · 2022-01-05

2

Entering edit mode

3.4 years ago

Pierre Lindenbaum 166k

bcftools norm -c s

ADD COMMENT • link 3.4 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

This worked!

Here is a full code to normalise plink genotypes:

in=MyPlinkPrefix
refFa=Homo_sapiens_assembly38.fasta #this has to be downloaded from here https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0;tab=objects?prefix=&forceOnObjectsSortingFiltering=false

#rename SNP
awk -v OFS="\t"  '{$2=$1":"$4":"$6":"$5; print $0}' $in.bim > temp; mv temp $in.bim
#rename duplicated samples as IID_dup
awk 'BEGIN{OFS="\t"} cnt[$0]++{$2=$2"dup"cnt[$0]-1} 1' $in.fam > temp.fam; mv temp.fam $in.fam
#filter
plink2 --bfile $in --rm-dup force-first --maf 0.01 --snps-only just-acgt --output-chr chrM --chr 1-26 --set-hh-missing --max-alleles 2 --make-bed --out $in.filtered
#convert PLINK to VCF
plink2 --bfile $in.filtered --output-chr chrM --recode vcf --out $in.filtered
#bgzip and index
bgzip -c $in.filtered.vcf > $in.filtered.vcf.gz
tabix -f $in.filtered.vcf.gz 
#convert VCF to BCF
bcftools norm -Ou -m -any $in.filtered.vcf.gz > $in.bcf
#fix alleles
bcftools +fixref $in.bcf -Ob -o $in.reffix.bcf -- -f $refFa -m top
bcftools annotate -Ob -x ID -I +'%CHROM:%POS:%REF:%ALT' $in.reffix.bcf > $in.norm.bcf
plink --bcf $in.norm.bcf \
--keep-allele-order \
--vcf-idspace-to _ \
--const-fid \
--output-chr chrM \
--allow-extra-chr 0 \
--split-x b38 no-fail \
--make-bed \
--out $in.filtered

#after conversion to VCF FID and IID are glued together, so we will continue with pre-conversion .fam
mv temp.fam $in.filtered.fam

If PLINK genotypes are in hg19 build, liftOver it to hg38:

in=MyPlinkPrefix
chain=hg19ToHg38.over.chain #this has to be downloaded from here https://hgdownload.soe.ucsc.edu/gbdb/hg19/liftOver/

#remove duplicated SNP
plink2 --bfile $in --rm-dup force-first --output-chr chrM --make-bed --out $in.filtered
#create original BED file of you genotypes
awk 'OFS="\t"{print "chr"$1, $4-1, $4, $2}' $in.filtered.bim > $in.filtered.hg19.bed
liftOver $in.filtered.hg19.bed $chain $in.filtered.hg38.bed unMapped
plink --bfile $in.filtered --keep-allele-order --extract <(cut -f4  $in.filtered.hg38.bed) --update-map <(awk '{print $4,$3}' $in.filtered.hg38.bed) --make-bed --out $in.filtered; rm *~

I used tools automatically through conda (plink, plink2, samtools, ucsc-liftover). Thanks to Pierre Lindenbaum and Freeseek apol1 blogpost http://apol1.blogspot.com/2016/10/1000-genomes-project-phase-3-principal.html

ADD REPLY • link 3.4 years ago by optimistsso4co3 ▴ 140