Hello everyone,
Following the GATK germline short variant discovery instructions, I have written a script that runs the pipeline from fastq pre-processing through to creating a gvcf with HaplotypeCaller. On inspection I have noticed that my GVCF:
1) Does not have a ##reference entry in the header even though i did pass a reference in the command
--reference /NGS/musRefs_10/gatk/ref/ensembl92/Mus_musculus.GRCm38.dna.toplevel.fa
2) Every record except 1 contains "END=" for example:
9 36758659 . C NON_REF . . END=36758689 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,38
The 1 record does not contain "END=":
"9 36758714 rs45725377 T C,<non_ref> 22.58 . DB;DP=2;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=7200.00 GT:AD:DP:GQ:PL:SB 1/1:0,2,0:2:6:49,6,0,49,6,49:0,0,1,1"
I have read What is a GVCF and how is it different from a 'regular' VCF? which states:
The second thing to look for is the END tag in the INFO field of non-variant block records.
Is it normal that every entry except for 1 is a non-variant block record? No errors are being returned. The only thing I can think of is that I ran the script on too small a test subset of the whole data. The test data is 2 fastq files of 10,000 lines.
I'm really not sure where I went wrong here. I appreciate any guidance.
Regards, Kenneth
Ok finswimmer...thanks for that. It is a whole exome dataset and I use gene coords as the intervals option.
I just created a larger subset of 1,000,000 lines and that should be done in a few hours so hopefully. Ill see a proportional increase.
EDIT: That didnt take as long as I thought - 30min (nice optimizing!). 6142 variants returned from 250000 reads. I think there is no problem.