Hi all,
I was trying to get stats on # HOM VAR #HET etc from my vcf file using VariantsToTable
. As below:
gatk VariantsToTable
-R /hg38/Homo_sapiens_assembly38.fasta \
-V all_jointcalls_annRegion.vcf \
-F CHROM -F ID -F POS -F REF -F ALT -F QUAL -F DP \
-GF GT -GF GQ -GF HET -GF HOM-REF \
-O all_jointcalls_trial.table
But no HET or HOM etc is called across board !
My output:
CHROM ID POS REF ALT QUAL DP sample1.GT sample1.GQ sample1.HET sample1.HOM-REF
chr2 rs2303425 47403074 T C 11431.45 4041 T/T 99 NA NA
However when I used Dave Tang's way: https://github.com/davetang/learning_vcf_file#extracting-info-fields
bcftools stats -s - all_jointcalls_annRegion.vcf | grep -A 169 > "Per-sample counts"
I get counts:
# PSC, Per-sample counts. Note that the ref/het/hom counts include only SNPs, for indels see PSI. The rest include both SNPs and indels.
# PSC [2]id [3]sample [4]**nRefHom** [5]**nNonRefHom** [6]**nHets** [7]nTransitions [8]nTransversions [9]nIndels [10]average depth [11]nSingletons [12]nHapRef
[13]nHapAlt [14]nMissing
PSC 0 sample1 **658** **38** **65** 63 40 0 3.6 8 0 0 303
This is true across all variants across all samples that I have in my multi sample vcf file. bcftools gives me numbers whereas GATK VariantsToTable is a NA
.
What is not right here with VariantsToTable`? I read somewhere on GATK posts on their site that they do not recommend other ways of parsing vcf generated by GATK, so a little hesitant using bcftools. Anything wrong in parsing it with bcftools?
Thankyou Kevin. Maybe GATK variantsToTable works well with single sample vcf rather than multi sample vcf. I did go through your code from the links posted and that was great ! Thankyou so much!
Hi Kevin, I tried your code from the link above A: How to get sample names and genotype for SNP in multi-sample VCF file and it works well and does what I need. However I noticed a small discrepancy. I have total of 2931 variants across all my samples in my cohort across 8 genes I am interested in. I broke it down to the genes I am interested in and I get totals per gene as follows:
Problem : I looked at distribution of homRef, homALT and het across a subset of 10 individuals for gene 1. Per sample their homRef, homALt, Het, nocalls must add to 1065. However in couple samples I see they add to 1063 and in 1 sample it comes to 1062. Why is it missing the counts in those cases? Code was run as is without change.I normalised my vcf as you had mentioned in your previous link.
Could it be that in those samples with lower counts, that there are multiallelics that are just counted once? hence the discrepancy?whereas in the samples that counted to exact 1065, they all were biallelic? I am not quite understanding this part. Any insight would help me better understand. Thankyou in advance !
Hi and thanks for testing that code. With no way for me to reproduce this observed behaviour, I am not to know what could be the issue. Multi-allelic sites are common causes of inconsistencies, and also sites that are missing, like
./.
. In this case, it is likely missing sites because I do not seem to account for those in my code.If you isolate the 'problematic' samples, can you observe any missing genotypes?
Kevin, my apologies. Your code works fine. Too many things happening simultaneously for me ! going crazy ! The output I was mentioning was from using bcftools stats and that is where I am getting variations.
I cant understand why bcftools would miss a couple counts. I dont have much experience with bcftools, I implementing it and learning about what all it can do simultaneously. From your experience ( or anyone else reading the forum) any suggestions?