How to perform Allele balance on a VCF file from GATK?
1
0
Entering edit mode
2.7 years ago

Hello,

I have a VCF file containing SNP data from 100 individuals. I followed the GATK best practices with necessary filtering options. I want to perform another filtration based on Allele-balance. So, I used the following code for doing that:

bcftools view  -i 'AB > 0.25 && AB < 0.75 | AB < 0.01' only_SNPs_filter_removed_maxmiss_0.5.recode_dustmasked_header.vcf > AB.vcf

But it throws an error message stating that the AB tag is not found in the VCF header. I think unless specified, GATK does not produce that in the output VCF file. I also used Vcflib to do so but there I received another error message stating

cannot compare (>) objects of dissimilar types
3 4

Can you please suggest how to perform that AB filtering on my current VCF file? Thank you.

bcftools SNP VCFtools GATK • 2.2k views
ADD COMMENT
0
Entering edit mode

Can you please suggest how to perform that AB filtering on my current VCF file?

what INFO and FORMAT are present in your VCF ?

ADD REPLY
0
Entering edit mode

Here are the INFO in the VCF header

##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele not already represented at this location by REF and ALT">
##FILTER=<ID=DP3,Description="DP < 3">
##FILTER=<ID=FS10,Description="FS > 10.0">
##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=MQ30,Description="MQ < 30.0">
##FILTER=<ID=MQRankSum-5.0,Description="MQRankSum < -5.0">
##FILTER=<ID=QD2,Description="QD < 10.0">
##FILTER=<ID=QUAL30,Description="QUAL < 30.0">
##FILTER=<ID=ReadPosRankSum-8,Description="ReadPosRankSum < -8.0">
##FILTER=<ID=SOR3,Description="SOR > 3.0">
##FILTER=<ID=nSamples,Description="AN < 5">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine=<ID=CombineGVCFs,CommandLine="CombineGVCFs --output /gxfs_work1/cau/suaph275/Chilense_Hiseq/combined_gvcf/Chilense_99.g.vcf --variant /gxfs_work1/cau/suaph275/Chilense_Hiseq/Chilense_99.g.vcf.list --reference /gxfs_work1/cau/suaph275/Chilense_Hiseq/S_chilense_Hirise.fasta --convert-to-base-pair-resolution false --break-bands-at-multiples-of 0 --input-is-somatic false --drop-somatic-filtering-annotations false --ignore-variants-starting-outside-interval false --combine-variants-distance 0 --max-distance 2147483647 --ref-padding 1 --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays  --disable-tool-default-read-filters false --disable-tool-default-annotations false --enable-all-annotations false --allow-old-rms-mapping-quality-annotation-data false",Version="4.1.8.1",Date="March 13, 2022 3:20:05 PM UTC">
##GATKCommandLine=<ID=SelectVariants,CommandLine="SelectVariants --output only_SNPs_filter_removed.vcf --exclude-filtered true --variant only_SNPs_filters_added.vcf --reference /gxfs_work1/cau/suaph275/Bowtie2_results/S_chilense_Hirise.fasta --invertSelect false --exclude-non-variants false --preserve-alleles false --remove-unused-alternates false --restrict-alleles-to ALL --keep-original-ac false --keep-original-dp false --mendelian-violation false --invert-mendelian-violation false --mendelian-violation-qual-threshold 0.0 --select-random-fraction 0.0 --remove-fraction-genotypes 0.0 --fully-decode false --max-indel-size 2147483647 --min-indel-size 0 --max-filtered-genotypes 2147483647 --min-filtered-genotypes 0 --max-fraction-filtered-genotypes 1.0 --min-fraction-filtered-genotypes 0.0 --max-nocall-number 2147483647 --max-nocall-fraction 1.0 --set-filtered-gt-to-nocall false --allow-nonoverlapping-command-line-samples false --suppress-reference-path false --genomicsdb-use-bcf-codec false --genomicsdb-shared-posixfs-optimizations false --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays  --disable-tool-default-read-filters false",Version="4.1.8.1",Date="March 21, 2022 10:32:47 AM UTC">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=RAW_MQandDP,Number=2,Type=Integer,Description="Raw data (sum of squared MQ and total depth) for improved RMS Mapping Quality calculation. Incompatible with deprecated RAW_MQ formulation.">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
ADD REPLY
0
Entering edit mode

Have you tried vcffilter?

Also, if the issue is that the AB header is missing in the VCF, maybe you could add it manually yourself and see if it works?

ADD REPLY
0
Entering edit mode

Yes, I tried VCFfilter and then I got the above error. I posted it.

ADD REPLY
0
Entering edit mode

Hi, I manually added the AB header in the VCF file and then the vcffilter ran successfully. But using the above parameters the number of SNPs did not change after filtering for AB.

ADD REPLY
0
Entering edit mode
2.7 years ago
iraun 6.2k

Have you tried to run some manual awk? Something like this could work:

awk -F'\t' '{if ( $1 ~ /^#/ ){print}else{split($8,l1,";");split(l1[1],l2,"="); ab=l2[2];if (ab<0.01){print}; if (ab>0.25 && ab<0.75){print}}}' INPUT_FILE

This one liner assumes that the AB field in INFO column is the first field. Modify the l1[1] part (to l1[2], l1[3] ... etc), if AB field is not the first one.

ADD COMMENT
0
Entering edit mode

OK I will give it a try and let you know. Thank you.

ADD REPLY

Login before adding your answer.

Traffic: 1691 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6