10 months ago
avelarbio46
Hello everyone!
I'm trying to reduce the FORMAT fields in my VCF file to per-variant summary statistics. To do this, I'm using:
MYVCF=my_multisample_vcf_path
paste <(bcftools view "$MYVCF" |
awk -F"\t" 'BEGIN {print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"} !/^#/ {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9}') <(bcftools query -f '[\t%SAMPLE=%GT]\n' "$MYVCF" |
awk 'BEGIN {OFS="\t"; print "nHomAlt\tnHet\tnHomRef"} {nHet=gsub(/0\|1|1\|0|0\/1|1\/0/, ""); nHomAlt=gsub(/1\|1|1\/1/, ""); nHomRef=gsub(/0\|0|0\/0/, ""); print nHomAlt,nHet,nHomRef}') |
sed 's/,\t/\t/g' | sed 's/,$//g' >> out_put.vcf
This is generating 3 columns with the name of the samples that are Het, HomAlt and HomRef for each variant.
I want to do the same thing for DP4, but instead of printing the names of samples, print the mean across all samples for each variant.
##FORMAT=<ID=DP4,Number=4,Type=Integer,Description="ref forward, ref reverse, alt forward, alt reverse">
Obviously, DP4 is a somewhat more complex field than GT.
Is there any way to do this with AWK or any other tool?
So, basically, add 4 columns to the VCF:
DP4_ref_forward_mean DP4_ref_reverse_mean DP4_alt_forward_mean DP4_alt_reverse_mean
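For illustration, here is a minimal sketch of how the four means might be computed with awk. It assumes that `bcftools query -f '[\t%DP4]\n'` prints one tab-separated, comma-delimited DP4 value per sample, with `.` for missing values. The function name `dp4_means` and the two-decimal rounding are hypothetical choices, not part of the original command.

```shell
# Hypothetical helper: reads per-sample DP4 values on stdin, one variant per line,
# in the form "<TAB>rf,rr,af,ar<TAB>rf,rr,af,ar..." and prints the four per-variant means.
dp4_means() {
  awk -F'\t' '
    BEGIN {
      OFS = "\t"
      print "DP4_ref_forward_mean", "DP4_ref_reverse_mean", "DP4_alt_forward_mean", "DP4_alt_reverse_mean"
    }
    {
      n = 0
      s[1] = s[2] = s[3] = s[4] = 0
      for (i = 1; i <= NF; i++) {
        # Skip the empty leading field and samples with missing DP4 (".")
        if (split($i, d, ",") != 4 || $i ~ /\./) continue
        n++
        for (k = 1; k <= 4; k++) s[k] += d[k]
      }
      # No sample with a usable DP4 value: print missing values
      if (n == 0) { print ".", ".", ".", "."; next }
      printf "%.2f\t%.2f\t%.2f\t%.2f\n", s[1] / n, s[2] / n, s[3] / n, s[4] / n
    }'
}

# Assumed usage (requires bcftools and the VCF from the question):
#   bcftools query -f '[\t%DP4]\n' "$MYVCF" | dp4_means
```

Wrapped in `<(...)`, that pipeline could be passed to `paste` as a further argument alongside the existing ones. Samples with a missing DP4 are left out of the denominator here; whether that is the right choice depends on how missing data should be treated.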