I want to impute missing genotype data. I downloaded Beagle from here and ran it on the test file below:
##fileformat=VCFv4.1
##medaka_version=1.0.3
##contig=<ID=chr1>
##INFO=<ID=pos1,Number=.,Type=Integer,Description="POS of incorporated variants from haplotype 1">
##INFO=<ID=q1,Number=1,Type=Float,Description="Combined qual score for haplotype 1">
##INFO=<ID=pos2,Number=.,Type=Integer,Description="POS of incorporated variants from haplotype 2">
##INFO=<ID=q2,Number=1,Type=Float,Description="Combined qual score for haplotype 2">
##FORMAT=<ID=GT,Number=G,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=G,Type=Integer,Description="Genotype quality score">
##CL=medaka_variant -U -o chr1 -m r941_prom_variant_g360 -s r941_prom_snp_g360 -i PAD65442_3.6.1_pass.bam -f GCA_000001405.15_GRCh38_no_alt_analysis_set.fna -r chr1:0-10000000 -t 4; Fri 3 Jul 21:15:23 BST 2020
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2 SAMPLE3 SAMPLE4 SAMPLE5
chr1 10108 . C CT 14.91 PASS pos1=10108;pos2=10108;q1=10.99;q2=18.83 GT:GQ 1|1:15 1|1:15 1|1:15 1|1:15 1|1:15
chr1 10177 . A AC 4.852 PASS pos2=10177;q2=4.852 GT:GQ 0|1:5 1|1:15 1|1:15 1|1:13 1|1:16
chr1 10257 . A C 0.799 PASS pos1=10257;q1=0.799 GT:GQ 1/1:1 1|1:15 1|1:15 0|1:15 1|1:15
chr1 10291 . C T 8.544 PASS pos2=10291;q2=8.544 GT:GQ 0|1:9 1|1:15 1|1:15 0/1:12 1|1:15
chr1 10297 . C T 8.215 PASS pos2=10297;q2=8.215 GT:GQ 0|1:8 1|1:15 1|0:15 1|1:14 1|1:16
chr1 10303 . C T 0.246 PASS pos2=10303;q2=0.246 GT:GQ ./. 1|1:15 1|0:15 1|1:14 1|1:15
chr1 10309 . C T 2.7155 PASS pos1=10309;pos2=10309;q1=1.046;q2=4.385 GT:GQ 1|0:3 0|1:15 1|1:15 1|1:15 1|1:15
chr1 10315 . C T 4.8525 PASS pos1=10315;pos2=10315;q1=3.083;q2=6.622 GT:GQ 1|1:5 0|1:15 1|1:15 1|1:15 1|1:15
chr1 10321 . C T 0.562 PASS pos2=10321;q2=0.562 GT:GQ 0|1:1 1|1:15 1|1:15 0|1:15 1|1:15
It generated output with no problems.
I then ran it on my real data, which is also a VCF file and looks almost identical to the above (with a lot of header lines at the start, more than 3,500 of them). The line below is only an example of the structure of the lines; I can't post actual lines, both for confidentiality and because each line has more than 200 entries:
chr10 11182636 . AT A 103.3 PASS AC=1;AF=2 ./. 0/0:7 0/1:35 ./. 0/1:22
I get the error attached (sorry that it's an image; I have to work through a remote desktop, so I can't copy and paste):
I can't understand how my data differs from the example in a way that stops it working. I appreciate that it may be difficult to point me in the right direction without seeing the full file, but there are more than 3,500 header lines (starting with '##'), the data is on a remote desktop so it can't be copied and pasted, and it can't be shared because it is patient data. If anyone has an idea of a direction I could go in, I'd appreciate it.
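One check that can be run without sharing any data is whether every data line has the shape the #CHROM header line implies. Below is a minimal Python sketch of such a check. It is not part of Beagle, and the function name `check_vcf_structure` is hypothetical. It assumes the FORMAT column should list GT as its first key, and it splits on any whitespace so that it also works on the lines as pasted in this post (a real VCF file is tab-delimited). The example line from the real data, as written above, has no FORMAT column (such as GT:GQ) between INFO and the first sample, so this check flags it. Whether the real file has the same gap or only the simplified example does is unknown.

```python
import io


def check_vcf_structure(handle):
    """Return (line_number, message) pairs for data lines with an unexpected shape."""
    problems = []
    n_columns = None  # number of columns declared by the #CHROM header line
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#CHROM"):
            n_columns = len(line.split())
            continue
        fields = line.split()
        # Every data line should have as many columns as the #CHROM line.
        if n_columns is not None and len(fields) != n_columns:
            problems.append(
                (line_no, f"expected {n_columns} columns, found {len(fields)}")
            )
            continue
        # Genotype data needs 8 fixed columns, FORMAT, and at least one sample.
        if len(fields) < 10:
            problems.append((line_no, "fewer than 10 columns, so no genotype data"))
            continue
        # Assumption: GT should be the first key in the FORMAT column.
        if fields[8].split(":")[0] != "GT":
            problems.append(
                (line_no, f"FORMAT column is {fields[8]!r}; GT is not its first key")
            )
    return problems


# Small demonstration using lines taken from this post.
demo = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2 SAMPLE3 SAMPLE4 SAMPLE5\n"
    "chr1 10108 . C CT 14.91 PASS pos1=10108;pos2=10108;q1=10.99;q2=18.83 GT:GQ 1|1:15 1|1:15 1|1:15 1|1:15 1|1:15\n"
    "chr10 11182636 . AT A 103.3 PASS AC=1;AF=2 ./. 0/0:7 0/1:35 ./. 0/1:22\n"
)
for line_no, message in check_vcf_structure(demo):
    print(f"line {line_no}: {message}")
```

On the real file, the same function could be called on an open file handle, for example `check_vcf_structure(open(path))`, where `path` stands for wherever the VCF is stored on the remote desktop.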