Hi everyone,
I'm trying to phase a multi-sample VCF file (12 samples) for the first chromosome. I got this VCF after pruning with plink and recoding the result back to VCF. The file looks like this:
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLES
1 112 . C T . . PR GT ./. 0/1 ./. ./. ./. ./. ./. ./. ./. ./. ./. ./.
1 170 . T G . . PR GT 0/0 0/1 ./. 0/1 ./. 0/1 ./. 0/1 ./. 0/1 ./. ./.
1 370 . G A . . PR GT ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. 0/1 ./.
1 482 . T C . . PR GT ./. 0/1 ./. ./. ./. 0/1 ./. 0/1 ./. 0/1 ./. ./.
1 555 . C G . . PR GT ./. ./. ./. 0/1 ./. ./. ./. 0/1 0/1 ./. ./. ./.
1 1268 . G A . . PR GT ./. ./. ./. 0/1 0/0 0/1 ./. 0/1 ./. 0/1 ./. ./.
1 1946 . C G . . PR GT ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. 0/1 ./.
1 3014 . G T . . PR GT ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. 0/1 ./.
1 3392 . G A . . PR GT ./. ./. ./. ./. ./. ./. ./. ./. ./. 0/1 ./. ./.
1 3430 . C T . . PR GT ./. ./. ./. ./. ./. ./. ./. ./. ./. 0/1 0/1 ./.
1 3966 . G A . . PR GT ./. ./. ./. ./. ./. ./. ./. ./. ./. ./. 0/1 ./.
1 3982 . C T . . PR GT ./. 0/0 ./. 0/1 0/0 0/1 ./. 0/1 0/1 0/1 0/0 ./.
1 4036 . A G . . PR GT ./. 0/1 ./. ./. ./. ./. ./. ./. ./. ./. ./. ./.
Now I'm trying to phase this file using Beagle 5.2. My command line looks like this:
java -jar /path-to-file/beagle.21Apr21.304.jar gt=file_pruned.vcf out=file_pruned_beagle_phased iterations=10
But I'm getting an error message that I think has to do with MAF, and I don't really know what I'm doing wrong. Any suggestions are welcome!! :)
Exception in thread "main" java.lang.IllegalArgumentException: invalid array
    at vcf.LowMafRefGTRec.throwArrayError(LowMafRefGTRec.java:149)
    at vcf.LowMafRefGTRec.checkIndicesAndReturnMajorAllele(LowMafRefGTRec.java:143)
    at vcf.LowMafRefDiallelicGTRec.<init>(LowMafRefDiallelicGTRec.java:129)
    at vcf.RefGTRec.hapCodedInstance(RefGTRec.java:113)
    at phase.Stage2Haps.recs(Stage2Haps.java:167)
    at phase.Stage2Haps.lambda$stage2Haps$1(Stage2Haps.java:140)
    at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:271)
    at java.base/java.util.stream.IntPipeline$1$1.accept(IntPipeline.java:180)
    at java.base/java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:104)
    at java.base/java.util.Spliterator$OfInt.forEachRemaining(Spliterator.java:699)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.Nodes$CollectorTask.doLeaf(Nodes.java:2191)
    at java.base/java.util.stream.Nodes$CollectorTask.doLeaf(Nodes.java:2157)
    at java.base/java.util.stream.AbstractTask.compute(AbstractTask.java:327)
    at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:746)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
    at java.base/java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:408)
    at java.base/java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:736)
    at java.base/java.util.stream.Nodes.collect(Nodes.java:336)
    at java.base/java.util.stream.ReferencePipeline.evaluateToNode(ReferencePipeline.java:109)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:545)
    at java.base/java.util.stream.AbstractPipeline.evaluateToArrayNode(AbstractPipeline.java:260)
    at java.base/java.util.stream.ReferencePipeline.toArray(ReferencePipeline.java:517)
    at phase.Stage2Haps.stage2Haps(Stage2Haps.java:141)
    at phase.PhaseLS.runStage2(PhaseLS.java:269)
    at main.Main.phaseStage2Variants(Main.java:209)
    at main.Main.phaseTarg(Main.java:182)
    at main.Main.phaseAndImpute(Main.java:171)
    at main.Main.main(Main.java:126)
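As a quick sanity check on the MAF hunch, per-site genotype counts can be tallied directly from the VCF. This is only a rough awk sketch and was not part of the original run: site_counts is a made-up helper name, and it assumes diploid, biallelic calls with GT as the first FORMAT field and header lines that start with #.

```shell
# Print: CHROM POS called_genotypes missing_genotypes alt_allele_count
# Assumptions (not from the original post): header lines start with '#',
# sample columns begin at field 10, GT is the first FORMAT subfield.
site_counts() {
  awk '
    /^#/ { next }                          # skip VCF header lines
    NF > 9 {
      called = 0; missing = 0; alt = 0
      for (i = 10; i <= NF; i++) {
        split($i, fmt, ":")                # GT is the first colon-separated subfield
        n = split(fmt[1], a, "[/|]")       # alleles separated by / or |
        if (n != 2 || a[1] == "." || a[2] == ".") { missing++; continue }
        called++
        alt += (a[1] != "0") + (a[2] != "0")
      }
      print $1, $2, called, missing, alt
    }
  ' "$@"
}

# Example call on the file from this post:
# site_counts file_pruned.vcf | head
```

Sites where almost every sample is missing, or where the alternate allele is seen only once, stand out immediately in the last three columns.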
Thanks,
Pedro
First - there's no need to prune SNPs when you are phasing - you are just throwing away the LD information that beagle needs to phase your samples. Did someone tell you to prune them? You also aren't likely to get any kind of decent results phasing 12 samples without a reference panel - this could potentially be the cause of the error. Is this data for humans? If so, then you should use something like the 1000 genomes as a reference. Also - wrap your code and output with the code tags - it'll make it much easier to diagnose any problems.
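If a phased reference panel is available, Beagle takes it through the ref= argument alongside gt=. A minimal sketch, assuming the jar path from the question; the wrapper name phase_with_reference and the file names in the example call are hypothetical. As far as I know, the reference panel has to be phased and free of missing genotypes.

```shell
# Jar path copied from the question; adjust to your installation.
BEAGLE_JAR=/path-to-file/beagle.21Apr21.304.jar

# Hypothetical wrapper: phase target genotypes against a reference panel.
phase_with_reference() {
  ref_panel=$1    # phased reference panel (VCF or bref3)
  target=$2       # unphased target VCF
  out_prefix=$3   # prefix for Beagle output files
  java -jar "$BEAGLE_JAR" ref="$ref_panel" gt="$target" out="$out_prefix"
}

# Example call (file names are placeholders):
# phase_with_reference reference_phased.vcf.gz file.vcf file_beagle_phased
```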
Well, I tried phasing the unpruned data with shapeit and it took around 3 days. Hence the pruning idea: lowering the number of SNPs by removing redundant ones might make it faster. These are pig samples, so I haven't found a reference file that I can use. If you have any suggestions, I'd appreciate it.
OK. How many samples / SNPs do you have and which version of shapeit did you use? Pruning really harms accuracy so I think you should avoid doing that when possible. Do you have access to a computing cluster?
So we have 12 samples (1,108,008 SNPs). I used shapeit v2.r904 and ran an applet on DNAnexus using a mem3_ssd1_v2_x96 instance.
OK. Well, without a reference panel of pigs there's realistically no point in trying to phase or impute 12 samples. Perhaps try to access the one available here? https://gsejournal.biomedcentral.com/articles/10.1186/s12711-019-0445-y
Did you end up finding a solution? I am in a similar situation with 9 samples and no reference panel.
Hi @mglasena, so I had the same problem over and over again for a while. I thought I had to pool samples from the same populations. In fact, I had about 80 samples from 5 different populations. I created a VCF file with all samples (bcftools merge), and I tried with that file, but still had the same problem. So we thought it might be a problem with our Linux server. We downloaded our data to a private server and ran beagle, and it worked fine. Our problem was definitely due to incompatibilities between beagle and our server. Hope this helps solve your problem!
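The merge step mentioned above can be sketched roughly as follows. These are not necessarily the exact commands that were run: merge_vcfs and the file names are hypothetical, and the sketch assumes the per-population VCFs are already bgzip-compressed and contain distinct sample names.

```shell
# Hypothetical helper: merge several bgzip-compressed VCFs into one multi-sample VCF.
merge_vcfs() {
  merged=$1
  shift
  for f in "$@"; do
    bcftools index -f -t "$f"            # build (or rebuild) a tabix index for each input
  done
  bcftools merge -Oz -o "$merged" "$@"   # write a compressed multi-sample VCF
  bcftools index -t "$merged"            # index the merged file
}

# Example call (file names are placeholders):
# merge_vcfs all_samples.vcf.gz pop1.vcf.gz pop2.vcf.gz pop3.vcf.gz
```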
You can't use Beagle with so few samples and no reference panel. It won't give meaningful results.
How did you determine this?
Just a bit of experience working with this kind of software and speaking with the authors. I would strongly advise against proceeding unless you have a reference panel or more individuals.