How to correct for allele frequency mismatches?
1
0
Entering edit mode
2.9 years ago

By comparing my plink frequencies with reference panel i get the following graph:

enter image description here

As you can see, vast majority of genotypes align well with reference, however, some are directly inverted. I have already excluded all variants with mismatched genotypes (strand flip & allele switch).

What is the correct way to fix it? I could think of just remove variants with large fequency deviation from reference, however, that leaves lot of incorrect snps in the middle, or, if i exclude all from the middle region it leaves a big empty range of frequency (0.4-0.6).

plink • 1.0k views
ADD COMMENT
2
Entering edit mode
2.9 years ago
bcftools norm -c s
ADD COMMENT
0
Entering edit mode

This worked!

Here is a full code to normalise plink genotypes:

in=MyPlinkPrefix
refFa=Homo_sapiens_assembly38.fasta #this has to be downloaded from here https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0;tab=objects?prefix=&forceOnObjectsSortingFiltering=false

#rename SNP
awk -v OFS="\t"  '{$2=$1":"$4":"$6":"$5; print $0}' $in.bim > temp; mv temp $in.bim
#rename duplicated samples as IID_dup
awk 'BEGIN{OFS="\t"} cnt[$0]++{$2=$2"dup"cnt[$0]-1} 1' $in.fam > temp.fam; mv temp.fam $in.fam
#filter
plink2 --bfile $in --rm-dup force-first --maf 0.01 --snps-only just-acgt --output-chr chrM --chr 1-26 --set-hh-missing --max-alleles 2 --make-bed --out $in.filtered
#convert PLINK to VCF
plink2 --bfile $in.filtered --output-chr chrM --recode vcf --out $in.filtered
#bgzip and index
bgzip -c $in.filtered.vcf > $in.filtered.vcf.gz
tabix -f $in.filtered.vcf.gz 
#convert VCF to BCF
bcftools norm -Ou -m -any $in.filtered.vcf.gz > $in.bcf
#fix alleles
bcftools +fixref $in.bcf -Ob -o $in.reffix.bcf -- -f $refFa -m top
bcftools annotate -Ob -x ID -I +'%CHROM:%POS:%REF:%ALT' $in.reffix.bcf > $in.norm.bcf
plink --bcf $in.norm.bcf \
--keep-allele-order \
--vcf-idspace-to _ \
--const-fid \
--output-chr chrM \
--allow-extra-chr 0 \
--split-x b38 no-fail \
--make-bed \
--out $in.filtered

#after conversion to VCF FID and IID are glued together, so we will continue with pre-conversion .fam
mv temp.fam $in.filtered.fam

If PLINK genotypes are in hg19 build, liftOver it to hg38:

in=MyPlinkPrefix
chain=hg19ToHg38.over.chain #this has to be downloaded from here https://hgdownload.soe.ucsc.edu/gbdb/hg19/liftOver/

#remove duplicated SNP
plink2 --bfile $in --rm-dup force-first --output-chr chrM --make-bed --out $in.filtered
#create original BED file of you genotypes
awk 'OFS="\t"{print "chr"$1, $4-1, $4, $2}' $in.filtered.bim > $in.filtered.hg19.bed
liftOver $in.filtered.hg19.bed $chain $in.filtered.hg38.bed unMapped
plink --bfile $in.filtered --keep-allele-order --extract <(cut -f4  $in.filtered.hg38.bed) --update-map <(awk '{print $4,$3}' $in.filtered.hg38.bed) --make-bed --out $in.filtered; rm *~

I used tools automatically through conda (plink, plink2, samtools, ucsc-liftover). Thanks to Pierre Lindenbaum and Freeseek apol1 blogpost http://apol1.blogspot.com/2016/10/1000-genomes-project-phase-3-principal.html

ADD REPLY

Login before adding your answer.

Traffic: 2028 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6