I'm phasing SNP data from an Illumina microarray with Beagle 5.2.
I start from an unphased VCF with around 600,000 SNPs.
I also trio-phased the same VCF, so I have a phased control VCF.
I run a simple pipeline:
java -Xmx4g -jar beagle.28Jun21.220.jar impute=false gt=source.vcf map=./hapmap/plink.chr1.GRCh37.map out=out iterations=40 ref=./chr1.1kg.phase3.v5a.b37.bref3 chrom=1
The genetic map is from HapMap, and the reference panel is from 1000 Genomes.
The resulting "phased" VCF from Beagle differs greatly from the one I got from trio phasing. Does anyone know of any parameter tuning that would give properly phased output? I tried a larger window (up to 100 cM), more iterations (up to 120), and a larger overlap (up to 5 cM), with no good results.
I also tried to reduce the reference panel, extracting only the positions that are present in the source VCF, using bedtools.
First, I uncompressed the bref3:
java -jar unbref3.28Jun21.220.jar chr1.1kg.phase3.v5a.b37.bref3 > chr1.1kg.phase3.v5a.b37.vcf
Then I extracted the intersection between this VCF and the source file:
bedtools intersect -b source.vcf -a chr1.1kg.phase3.v5a.b37.vcf > reduced.chr1.1kg.phase3.v5a.b37.vcf
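The same position-based reduction can be sketched in Python, which also makes it easy to keep the reference VCF header in the output (bedtools intersect only prints the header of the -a file when -header is given). This is a minimal illustration, not part of the pipeline above: the function names `read_positions` and `filter_reference` are hypothetical, matching is on chromosome and position only (alleles are not compared), and the inputs are assumed to be uncompressed VCF lines.

```python
def read_positions(vcf_lines):
    """Collect (chrom, pos) pairs from the data lines of a VCF."""
    positions = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos = line.split(None, 2)[:2]
        positions.add((chrom, int(pos)))
    return positions


def filter_reference(ref_lines, positions):
    """Yield header lines plus the records whose (chrom, pos) is in positions."""
    for line in ref_lines:
        if line.startswith("#"):
            yield line  # keep the full header so the output is a valid VCF
        elif line.strip():
            chrom, pos = line.split(None, 2)[:2]
            if (chrom, int(pos)) in positions:
                yield line
```

With real files, the lines would come from open file handles, for example `read_positions(open("source.vcf"))`, and the filtered lines would be written to the reduced VCF.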
As a last step, I ran Beagle again:
java -Xmx4g -jar beagle.28Jun21.220.jar impute=false gt=source.vcf map=./hapmap/plink.chr1.GRCh37.map out=out iterations=40 ref=./reduced.chr1.1kg.phase3.v5a.b37.bref3 chrom=1
but the "phased" output still differs from the phased data confirmed by the trio.
Any clues or suggestions? Thank you in advance.
jp
PS: This is an extract from the source VCF:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT arivcf
1 82154 rs4477212 A . . . . GT 0/0
1 752566 rs3094315 G . . . . GT ./.
1 752721 rs3131972 A . . . . GT 0/0
1 768448 rs12562034 G . . . . GT ./.
1 776546 rs12124819 A . . . . GT ./.
1 798959 rs11240777 G A . . . GT 1/0
1 800007 rs6681049 T . . . . GT ./.
1 838555 rs4970383 C . . . . GT 0/0
1 846808 rs4475691 C T . . . GT 0/1
1 854250 rs7537756 A . . . . GT 0/0
1 861808 rs13302982 A G . . . GT 0/1
1 873558 rs1110052 G T . . . GT 0/1
1 882033 rs2272756 G A . . . GT 1/0
When you say 'The resulting "phased" VCF from beagle differs greatly from the one I got from the trio phasing' - how different are we talking? What kind of differences are there?
One option would be to use SHAPEIT4 and integrate trio phasing with reference-based phasing in one step.
The differences are phase flips every few consecutive heterozygous positions (every 2 to 10 positions). The genotypes are correct, but the phase flips relative to my phased information from the trio.
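Flips of this kind can be counted as switch errors between the trio-phased and the Beagle-phased genotypes. The sketch below is only an illustration: the function names are hypothetical, the inputs are assumed to be two lists of GT strings for the same sample at the same sites in the same order, and only sites that are phased, heterozygous, and agree in genotype are compared.

```python
def phase_orientation(truth_gt, test_gt):
    """Return 0 if both GTs have the same phase, 1 if flipped, None if not comparable."""
    if "|" not in truth_gt or "|" not in test_gt:
        return None  # unphased or missing call such as ./.
    truth = truth_gt.split("|")
    test = test_gt.split("|")
    if len(truth) != 2 or len(test) != 2 or truth[0] == truth[1]:
        return None  # malformed or homozygous: carries no phase information
    if truth == test:
        return 0
    if truth == test[::-1]:
        return 1
    return None  # the two files disagree on the genotype itself


def switch_errors(truth_gts, test_gts):
    """Return (switches, opportunities) over consecutive comparable het sites."""
    orientations = [
        o
        for o in (phase_orientation(a, b) for a, b in zip(truth_gts, test_gts))
        if o is not None
    ]
    # A switch is a change of orientation between neighbouring het sites.
    switches = sum(
        1 for prev, cur in zip(orientations, orientations[1:]) if prev != cur
    )
    return switches, max(len(orientations) - 1, 0)
```

Dividing switches by opportunities gives a switch error rate. Counting changes of orientation rather than raw mismatches matters here, because a single switch makes every following heterozygous site look wrong until the next switch.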
I can use SHAPEIT4 for this particular case because I have the trio, but I need to tune the pipeline for standalone samples with no pedigree information, so that I can later run IBD detection with Refined IBD.
I also tried processing the sample with the Michigan Imputation Server (against the 1000G and HRC reference panels), and the output was even worse (they use Eagle 2.4).