VariantToTable not calling HET HOM etc
1
0
Entering edit mode
4.2 years ago
geneart$$ ▴ 50

Hi all,

I was trying to get stats on # HOM VAR #HET etc from my vcf file using VariantsToTable. As below:

gatk VariantsToTable
 -R /hg38/Homo_sapiens_assembly38.fasta \
-V all_jointcalls_annRegion.vcf \
-F CHROM -F ID -F POS -F REF -F ALT -F QUAL -F DP \
-GF GT -GF GQ  -GF HET -GF HOM-REF \
-O all_jointcalls_trial.table

But no HET or HOM etc is called across board !

My output:

CHROM   ID      POS     REF     ALT     QUAL    DP      sample1.GT   sample1.GQ     sample1.HET    sample1.HOM-REF 
chr2    rs2303425    47403074        T       C       11431.45        4041    T/T     99    NA      NA

However when I used Dave Tang's way: https://github.com/davetang/learning_vcf_file#extracting-info-fields

bcftools stats -s - all_jointcalls_annRegion.vcf | grep -A 169 > "Per-sample counts"

I get counts:

# PSC, Per-sample counts. Note that the ref/het/hom counts include only SNPs, for indels see PSI. The rest include both SNPs and indels.
# PSC   [2]id   [3]sample       [4]**nRefHom**      [5]**nNonRefHom**   [6]**nHets**      [7]nTransitions [8]nTransversions       [9]nIndels      [10]average depth       [11]nSingletons [12]nHapRef
     [13]nHapAlt     [14]nMissing
PSC     0       sample1        **658**    **38**     **65**     63      40      0       3.6     8       0       0       303

This is true across all variants across all samples that I have in my multi sample vcf file. bcftools gives me numbers whereas GATK VariantsToTable is a NA.

What is not right here with VariantsToTable`? I read somewhere on GATK posts on their site that they do not recommend other ways of parsing vcf generated by GATK, so a little hesitant using bcftools. Anything wrong in parsing it with bcftools?

GATK VariantsToTable • 2.1k views
ADD COMMENT
1
Entering edit mode
4.2 years ago

Can you confirm that you have first posted this on GATK Support Forum? That would be the better place to post.

Similar / related answers:

Kevin

ADD COMMENT
1
Entering edit mode

Thankyou Kevin. Maybe GATK variantsToTable works well with single sample vcf rather than multi sample vcf. I did go through your code from the links posted and that was great ! Thankyou so much!

ADD REPLY
0
Entering edit mode

Hi Kevin, I tried your code from the link above A: How to get sample names and genotype for SNP in multi-sample VCF file and it works well and does what I need. However I noticed a small discrepancy. I have total of 2931 variants across all my samples in my cohort across 8 genes I am interested in. I broke it down to the genes I am interested in and I get totals per gene as follows:

gene 1: 1065
gene 2: 190
gene 3: 107
gene 4: 305
gene 5: 84
gene 6: 946
gene 7: 125
gene 8: 109
Total add to 2931 and I am good !

Problem : I looked at distribution of homRef, homALT and het across a subset of 10 individuals for gene 1. Per sample their homRef, homALt, Het, nocalls must add to 1065. However in couple samples I see they add to 1063 and in 1 sample it comes to 1062. Why is it missing the counts in those cases? Code was run as is without change.I normalised my vcf as you had mentioned in your previous link.

Could it be that in those samples with lower counts, that there are multiallelics that are just counted once? hence the discrepancy?whereas in the samples that counted to exact 1065, they all were biallelic? I am not quite understanding this part. Any insight would help me better understand. Thankyou in advance !

ADD REPLY
1
Entering edit mode

Hi and thanks for testing that code. With no way for me to reproduce this observed behaviour, I am not to know what could be the issue. Multi-allelic sites are common causes of inconsistencies, and also sites that are missing, like ./.. In this case, it is likely missing sites because I do not seem to account for those in my code.

If you isolate the 'problematic' samples, can you observe any missing genotypes?

ADD REPLY
1
Entering edit mode

Kevin, my apologies. Your code works fine. Too many things happening simultaneously for me ! going crazy ! The output I was mentioning was from using bcftools stats and that is where I am getting variations.

I cant understand why bcftools would miss a couple counts. I dont have much experience with bcftools, I implementing it and learning about what all it can do simultaneously. From your experience ( or anyone else reading the forum) any suggestions?

ADD REPLY

Login before adding your answer.

Traffic: 1882 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6