Quality control of VCFs generated from different genotyping arrays
15 months ago
Shane ▴ 20

I have three VCFs. Two of these VCFs were generated using the Precision Medicine Research Array (PMRA) and refer to SNPs as AX numbers. I was able to merge the two PMRA VCFs together.

Merged PMRA VCFs (Total genotyping rate is 0.924427):

1   AX-150343089    0   837711  T   C
1   AX-149471710    0   837756  T   G
1   AX-40234919 0   844647  G   T
1   AX-114086366    0   846320  A   G

However, my Global Screening Array (GSA) VCF uses rsIDs:

1   rs6605059   0   984039  T   C
1   rs4970414   0   984121  G   T
1   rs116781904 0   984475  A   G
1   GSA-rs61770779  0   984547  A   G

I was able to merge the two with:

    bcftools merge pmra.vcf.gz gsa.vcf.gz -o gsa_pmra.vcf --force-samples

gsa_pmra.vcf (Total genotyping rate is 0.534432)

1   AX-29796323 0   1117607 C   G
1   GSA-rs61766344  0   1118711 T   C
1   AX-29797251 0   1120162 A   G
1   AX-29797373 0   1120521 A   C
1   AX-38925889;rs9442373   0   1127258 C   A
1   AX-29801021;rs4072537   0   1129916 T   C
1   AX-29801231;GSA-rs11260598  0   1130346 C   T
1   AX-29801717 0   1131207 T   C
1   AX-107792172    0   1133289 GT  G
1   AX-107792172    0   1133290 G   TT
1   rs61766346  0   1133503 A   G
1   AX-29803231 0   1134155 A   G

I have 4681 samples genotyped on PMRA and 40 samples genotyped on GSA.

Total Positions: 944562 (PMRA), 692246 (GSA)

  • PMRA Only: 805321
  • GSA only: 553005
  • Overlap: 139241

My issue is that the merged genotyping rate (0.534432) is low, and I am afraid that any quality control filtering will remove too many SNPs/samples. Does anyone have any advice or comments?
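For context, here is a rough back-of-the-envelope check of how much of the missingness could be structural, i.e. caused only by positions that exist on one array and not the other. It is a sketch under two assumptions that may not hold exactly: that the merged file contains exactly the union of the positions counted above, and that every sample is fully called at every position on its own array.

```shell
# Counts taken from the post above.
pmra_n=4681; gsa_n=40
pmra_sites=944562; gsa_sites=692246; overlap=139241

# Positions in the merged file, assuming it is exactly the union of both arrays.
union=$((pmra_sites + gsa_sites - overlap))

# Best-case call rate: every sample fully called on its own array,
# and missing at every position found only on the other array.
max_rate=$(awk -v pn="$pmra_n" -v gn="$gsa_n" \
               -v ps="$pmra_sites" -v gs="$gsa_sites" -v u="$union" \
  'BEGIN { printf "%.4f", (pn * ps + gn * gs) / ((pn + gn) * u) }')

echo "union positions: $union"
echo "upper bound on merged call rate: $max_rate"
```

Under those assumptions the bound comes out at roughly 0.63, so a merged rate well below 1 is expected even before any real genotyping failures. The observed 0.534432 sits below the bound, which is consistent with the PMRA-only merge already having a rate of 0.924427.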

bcftools VCF • 795 views

Are all of your genotype data on the same genome build? Having AX numbers in one dataset and rsIDs in the other is not a problem here: you can replace both with chr:position and annotate with rsIDs later. I would recommend using PLINK for quality control before you merge your datasets. Provided that your two datasets are on the same build, you can follow these steps:

  1. Convert your VCF files into PLINK binary files:

    plink --vcf pmra.vcf.gz --make-bed --out pmra

    plink --vcf gsa.vcf.gz --make-bed --out gsa

  2. Quality control your gsa.bed/bim/fam and pmra.bed/bim/fam files: I would subset the data to chromosomes 1-23, get rid of AT, CG, GC, and TA SNPs, and replace the AX numbers and rsIDs with chrom:position.

  3. Find the common SNPs and merge your QCed data: I would find the SNPs shared between your GSA and PMRA data and merge on those. While merging, you should make sure that the alleles match for each SNP.
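Steps 2 and 3 can be sketched directly on the .bim files with awk. This is only an illustration, not a complete QC pipeline: the two tiny .bim files are hypothetical stand-ins built from a few records in the question, and the names `keep_keys`, `pmra_demo.bim`, `gsa_demo.bim`, and the output files are made up for the example.

```shell
# Hypothetical miniature .bim files (columns: chr, id, cM, bp, allele1, allele2),
# standing in for the real pmra.bim and gsa.bim.
work=$(mktemp -d)
printf '1\tAX-29796323\t0\t1117607\tC\tG\n1\tAX-38925889\t0\t1127258\tC\tA\n1\tAX-29801021\t0\t1129916\tT\tC\n' > "$work/pmra_demo.bim"
printf '1\trs9442373\t0\t1127258\tC\tA\n1\trs4072537\t0\t1129916\tT\tC\n1\trs61766346\t0\t1133503\tA\tG\n' > "$work/gsa_demo.bim"

# Print chr:pos keys for variants on chromosomes 1-23 whose allele pair
# is not one of AT, TA, CG, GC.
keep_keys() {
  awk '$1 ~ /^([1-9]|1[0-9]|2[0-3])$/ {
         pair = toupper($5) toupper($6)
         if (pair != "AT" && pair != "TA" && pair != "CG" && pair != "GC") { print $1 ":" $4 }
       }' "$1" | sort -u
}

keep_keys "$work/pmra_demo.bim" > "$work/pmra.keys"
keep_keys "$work/gsa_demo.bim"  > "$work/gsa.keys"

# Keys that survive the filter in both datasets.
comm -12 "$work/pmra.keys" "$work/gsa.keys" > "$work/common.keys"

# Two-column map (old ID, new chr:pos ID) for renaming the PMRA variants.
awk '{ print $2 "\t" $1 ":" $4 }' "$work/pmra_demo.bim" > "$work/pmra.rename.txt"

cat "$work/common.keys"
```

On real data, a map like `pmra.rename.txt` is in the two-column form that plink's `--update-name` reads by default, and a list like `common.keys` could be passed to `--extract` once both datasets carry chr:position IDs, before merging with `--bmerge`. Positions that appear more than once in a .bim file (for example several probes for one site) would need to be resolved first, which this sketch does not attempt.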


Thanks for your reply. Yes, the genotyping data are on the same build.

I am trying to understand two things: "get rid of AT,CG GC and TA SNPs" and "While merging you should make sure that each alleles match for the SNPs". For the second one, do you mean that I should ensure that the ref and alt alleles of the GSA and PMRA SNPs match?
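One possible reading of both phrases, sketched with awk: for each position shared by two .bim files, compare the allele pair directly and after complementing both alleles (A<->T, C<->G). If the pairs agree directly, the alleles match; if they agree only after complementing, one array reports the opposite strand. For A/T and C/G SNPs the pair is its own complement, so the two cases cannot be told apart from the alleles alone, which is the usual reason for dropping them. The function name `classify` and both input files are hypothetical, and the second file's alleles are invented to show each outcome.

```shell
work=$(mktemp -d)
# Hypothetical records: the second file's alleles are made up for illustration.
printf '1\tAX-38925889\t0\t1127258\tC\tA\n1\tAX-29801021\t0\t1129916\tT\tC\n1\tAX-29796323\t0\t1117607\tC\tG\n' > "$work/first.bim"
printf '1\trs9442373\t0\t1127258\tC\tA\n1\trs4072537\t0\t1129916\tA\tG\n1\tdemo_cg\t0\t1117607\tG\tC\n' > "$work/second.bim"

# Usage: classify first.bim second.bim
# Prints "chr:pos <tab> status" for every position present in both files.
classify() {
  awk '
    function comp(b) {                      # complement of a single base
      if (b == "A") return "T"
      if (b == "T") return "A"
      if (b == "C") return "G"
      if (b == "G") return "C"
      return "N"                            # anything else (e.g. indels) is not handled
    }
    function pair(x, y) {                   # order-independent allele pair
      return (x < y) ? x "/" y : y "/" x
    }
    NR == FNR { first[$1 ":" $4] = pair($5, $6); next }
    ($1 ":" $4) in first {
      key = $1 ":" $4
      same = pair($5, $6)
      flipped = pair(comp($5), comp($6))
      if (first[key] == same && same == flipped) status = "ambiguous"
      else if (first[key] == same)               status = "match"
      else if (first[key] == flipped)            status = "strand_flip"
      else                                       status = "mismatch"
      print key "\t" status
    }' "$1" "$2"
}

result=$(classify "$work/first.bim" "$work/second.bim")
printf '%s\n' "$result"
```

On real data, variants reported as `strand_flip` could be handed to plink's `--flip` in one dataset before merging, and `mismatch` variants excluded. The sketch ignores which allele is ref and which is alt; it only checks that the same two alleles are present.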
