Question

Beagle: Index -1 out of bounds for length 2

2.7 years ago

Tom ▴ 50

I want to impute missing genotype data. I downloaded Beagle from here and ran it on the test file below:

##fileformat=VCFv4.1
##medaka_version=1.0.3
##contig=<ID=chr1>
##INFO=<ID=pos1,Number=.,Type=Integer,Description="POS of incorporated variants from haplotype 1">
##INFO=<ID=q1,Number=1,Type=Float,Description="Combined qual score for haplotype 1">
##INFO=<ID=pos2,Number=.,Type=Integer,Description="POS of incorporated variants from haplotype 2">
##INFO=<ID=q2,Number=1,Type=Float,Description="Combined qual score for haplotype 2">
##FORMAT=<ID=GT,Number=G,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=G,Type=Integer,Description="Genotype quality score">
##CL=medaka_variant -U -o chr1 -m r941_prom_variant_g360 -s r941_prom_snp_g360 -i PAD65442_3.6.1_pass.bam -f GCA_000001405.15_GRCh38_no_alt_analysis_set.fna -r chr1:0-10000000 -t 4; Fri  3 Jul 21:15:23 BST 2020
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE1 SAMPLE2 SAMPLE3 SAMPLE4 SAMPLE5
chr1    10108   .   C   CT  14.91   PASS    pos1=10108;pos2=10108;q1=10.99;q2=18.83 GT:GQ   1|1:15  1|1:15  1|1:15  1|1:15  1|1:15  
chr1    10177   .   A   AC  4.852   PASS    pos2=10177;q2=4.852 GT:GQ   0|1:5   1|1:15  1|1:15  1|1:13  1|1:16
chr1    10257   .   A   C   0.799   PASS    pos1=10257;q1=0.799 GT:GQ   1/1:1   1|1:15  1|1:15  0|1:15  1|1:15
chr1    10291   .   C   T   8.544   PASS    pos2=10291;q2=8.544 GT:GQ   0|1:9   1|1:15  1|1:15  0/1:12  1|1:15
chr1    10297   .   C   T   8.215   PASS    pos2=10297;q2=8.215 GT:GQ   0|1:8   1|1:15  1|0:15  1|1:14  1|1:16
chr1    10303   .   C   T   0.246   PASS    pos2=10303;q2=0.246 GT:GQ   ./. 1|1:15  1|0:15  1|1:14  1|1:15
chr1    10309   .   C   T   2.7155  PASS    pos1=10309;pos2=10309;q1=1.046;q2=4.385 GT:GQ   1|0:3   0|1:15  1|1:15  1|1:15  1|1:15
chr1    10315   .   C   T   4.8525  PASS    pos1=10315;pos2=10315;q1=3.083;q2=6.622 GT:GQ   1|1:5   0|1:15  1|1:15  1|1:15  1|1:15
chr1    10321   .   C   T   0.562   PASS    pos2=10321;q2=0.562 GT:GQ   0|1:1   1|1:15  1|1:15  0|1:15  1|1:15

It generated output without any problem.

I then ran it on my real data, which is also a VCF file and looks almost identical to the test file above, apart from having a lot of header lines at the start (more than 3,500). Below is only an example of the structure of the data lines; I can't post actual lines, both for confidentiality and because each line has more than 200 entries:

chr10   11182636    .   AT  A   103.3   PASS    AC=1;AF=2   ./. 0/0:7   0/1:35  ./. 0/1:22

And I get the error attached (sorry that it's an image, I have to work through a remote desktop so I can't copy/paste):

[Image: screenshot of the error message]

I can't see how my data differs from the test file in a way that would stop it working. I appreciate that it may be difficult to point me in the right direction without seeing the full file, but there are more than 3,500 header lines (starting with '##'), the data is on a remote desktop so it can't be copied and pasted, and it can't be shared because it's patient data. If anyone has an idea of a direction I could go in, I'd appreciate it.
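Since the file can't leave the remote machine, one way to compare its structure against the working test file is a small script run in place. The following is only a minimal sketch, not a diagnosis of the Beagle error: it assumes an uncompressed, tab-delimited VCF, and the function name `check_vcf_lines` is made up for illustration. It reports data lines whose column count differs from the `#CHROM` header, whose FORMAT column does not start with `GT`, or whose genotype is not a list of allele indices (or `.`) separated by `/` or `|`. For instance, the example line as typed above has no FORMAT column between INFO and the first genotype, which may just be a slip in retyping, but is the kind of thing this check would flag.

```python
import re
import sys

# Allele indices (or '.') separated by '/' or '|', e.g. 0/1, 1|1, ./.
GT_PATTERN = re.compile(r"^(\.|\d+)([/|](\.|\d+))*$")


def check_vcf_lines(lines):
    """Return (line_number, message) pairs for structurally odd data lines."""
    problems = []
    n_columns = None  # set when the #CHROM header line is seen
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            n_columns = len(fields)
            continue
        if n_columns is None:
            problems.append((number, "data line before #CHROM header"))
            continue
        if len(fields) != n_columns:
            problems.append(
                (number, f"{len(fields)} columns, header has {n_columns}")
            )
            continue
        if n_columns < 10:
            continue  # no sample columns, nothing more to check
        if fields[8].split(":")[0] != "GT":
            problems.append(
                (number, f"FORMAT is {fields[8]!r}, expected it to start with GT")
            )
            continue
        for sample in fields[9:]:
            genotype = sample.split(":")[0]
            if not GT_PATTERN.match(genotype):
                problems.append((number, f"unexpected genotype {genotype!r}"))
                break  # one report per line is enough
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_vcf.py my_data.vcf
    with open(sys.argv[1]) as handle:
        for number, message in check_vcf_lines(handle):
            print(f"line {number}: {message}")
```

The script prints only line numbers and short messages, so its output should be easy to retype from the remote desktop without exposing any patient data.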

beagle genotyping • 543 views
