Hello!
I am currently trying to impute paths through a built Practical Haplotype Graph, i.e. using the -ImputePipelinePlugin -imputeTarget command. I am using PHG version 1.2. I populated the database from assemblies using the built-in AnchorWave plugin, and I have fastq files as input for imputation.
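Concretely, I invoke this step roughly as follows (the image name and -B mounts are placeholders from my own setup, and the imputeTarget value is just my reading of the docs):

# roughly my invocation; image name and mount points are specific to my setup
singularity exec -B /local/phg_data:/PHG phg_latest.sif \
    /tassel-5-standalone/run_pipeline.pl -Xmx50G -debug \
    -configParameters /PHG/config.txt \
    -ImputePipelinePlugin -imputeTarget pathToVCF -endPlugin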
I am having trouble setting the pangenomeHaplotypeMethod/pathHaplotypeMethod parameters correctly. The error I get says: "CreateGraphUtils: methodId: no method name assembly_by_anchorwave". I do not quite understand the documentation here and here. Are these parameters not user-defined?
Or are they perhaps set in a previous step? If so, it may be relevant that I skipped the "Create Consensus Haplotypes" step, because it was marked as optional and I specifically wanted as many versions of each haplotype as the pangenome could contain. However, I cannot find the pangenomeHaplotypeMethod/pathHaplotypeMethod parameters in the documentation of the "Create Consensus Haplotypes" step either. Can I find the correct method names in the liquibase database itself? If so, how?
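For example, I would have guessed at something like the sqlite3 query below, assuming the method names are stored in a table called methods (the table and column names are unverified guesses on my part):

# my guess at listing the stored methods; table/column names are assumptions
sqlite3 /PHG/phg_run1.db "SELECT method_id, method_type, name, description FROM methods;"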
If needed, here is my first error message:
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - createHaplotypeNodes: haplotype method: assembly_by_anchorwave range group method: null
[pool-1-thread-1] DEBUG net.maizegenetics.pangenome.api.CreateGraphUtils - CreateGraphUtils: methodId: no method name assembly_by_anchorwave
java.lang.IllegalArgumentException: CreateGraphUtils: methodId: no method name assembly_by_anchorwave
at net.maizegenetics.pangenome.api.CreateGraphUtils.methodId(CreateGraphUtils.java:1242)
at net.maizegenetics.pangenome.api.CreateGraphUtils.createHaplotypeNodes(CreateGraphUtils.java:408)
at net.maizegenetics.pangenome.api.CreateGraphUtils.createHaplotypeNodes(CreateGraphUtils.java:1009)
at net.maizegenetics.pangenome.api.HaplotypeGraphBuilderPlugin.processData(HaplotypeGraphBuilderPlugin.java:84)
at net.maizegenetics.plugindef.AbstractPlugin.performFunction(AbstractPlugin.java:111)
at net.maizegenetics.pangenome.pipeline.ImputePipelinePlugin.runImputationPipeline(ImputePipelinePlugin.kt:191)
at net.maizegenetics.pangenome.pipeline.ImputePipelinePlugin.processData(ImputePipelinePlugin.kt:151)
at net.maizegenetics.plugindef.AbstractPlugin.performFunction(AbstractPlugin.java:111)
at net.maizegenetics.plugindef.AbstractPlugin.dataSetReturned(AbstractPlugin.java:2017)
at net.maizegenetics.plugindef.ThreadedPluginListener.run(ThreadedPluginListener.java:29)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[pool-1-thread-1] DEBUG net.maizegenetics.pangenome.api.HaplotypeGraphBuilderPlugin - CreateGraphUtils: methodId: Problem getting id for method: assembly_by_anchorwave
And here is my config file for this step:
# Imputation Pipeline parameters for fastq or SAM files
# Required Parameters!!!!!!!
#--- Database ---
host=localHost
user=xxx
password=xxx
DB=/PHG/phg_run1.db
DBtype=sqlite
#--- Used by liquibase to check DB version ---
liquibaseOutdir=/PHG/outputDir/
#--- Used for writing a pangenome reference fasta (not needed when inputType=vcf) ---
pangenomeHaplotypeMethod=assembly_by_anchorwave
pangenomeDir=/PHG/outputDir/pangenome
indexKmerLength=21
indexWindowSize=11
indexNumberBases=90G
#--- Used for mapping reads ---
inputType=fastq
readMethod=20230213_run1
keyFile=/PHG/readMapping_key_file.txt
fastqDir=/PHG/inputDir/imputation/fastq/
samDir=/PHG/inputDir/imputation/sam/
lowMemMode=true
maxRefRangeErr=0.25
outputSecondaryStats=false
maxSecondary=20
fParameter=f15000,16000
minimapLocation=minimap2
#--- Used for path finding ---
pathHaplotypeMethod=assembly_by_anchorwave
pathMethod=20230213_run1
maxNodes=1000
maxReads=10000
minReads=1
minTaxa=1
minTransitionProb=0.0005
numThreads=4
probCorrect=0.99
removeEqual=false
splitNodes=true
splitProb=0.99
maxParents=1000000
minCoverage=1.0
#parentOutputFile = **OPTIONAL**
# used by haploid path finding only
usebf=true
minP=0.8
# used by diploid path finding only
maxHap=11
maxReadsKB=100
algorithmType=classic
#--- Used to output a vcf file for pathMethod ---
outVcfFile=/PHG/outputDir/align/20230213_run1_variants.vcf
#~~~ Optional Parameters ~~~
#pangenomeIndexName=**OPTIONAL**
#readMethodDescription=**OPTIONAL**
#pathMethodDescription=**OPTIONAL**
debugDir=/PHG/debugDir/
#bfInfoFile=**OPTIONAL**
# added because the error message demanded it
localGVCFFolder=/PHG/outputDir/align/gvcfs
Ah, thank you very much!! That is on me; for some reason I read the documentation as though that step would also load the haplotypes into the db.
I have now loaded the gvcfs into the db, but I still have some questions.
For one, is it possible to impute the paths separately for every set of WGS data I have, or is there some benefit, apart from automation, to submitting them all together?
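If separate runs are sensible, I would script them roughly like this; the run names, the sed edits, and the image name are placeholders I made up:

# sketch: one impute run per WGS set, swapping the method names, keyfile,
# and output vcf in a per-run copy of the config
for run in 20230213_run1 20230214_run2; do
    sed -e "s|^readMethod=.*|readMethod=${run}|" \
        -e "s|^pathMethod=.*|pathMethod=${run}|" \
        -e "s|^keyFile=.*|keyFile=/PHG/readMapping_${run}.txt|" \
        -e "s|^outVcfFile=.*|outVcfFile=/PHG/outputDir/align/${run}_variants.vcf|" \
        /local/phg_data/config.txt > /local/phg_data/config_${run}.txt
    singularity exec -B /local/phg_data:/PHG phg_latest.sif \
        /tassel-5-standalone/run_pipeline.pl -Xmx50G -debug \
        -configParameters /PHG/config_${run}.txt \
        -ImputePipelinePlugin -imputeTarget pathToVCF -endPlugin
done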
Then, when I try to build the pangenome fasta this time, I get the error below, and I am wondering whether I made another mistake earlier; perhaps you could help me spot it. I am not working across different servers; instead, all files and the db are in folders mounted into the Singularity container with the -B option. For that reason I only ever wrote paths without a server address. However, I also cannot find the point in the pipeline where I would have defined the path to the reference genome gvcf. I copied the gvcf and its index into the folder named in the error message, but before that it was simply located in /PHG/inputDir/reference/. Can I not use that location without a server address? Or have I made another mistake earlier in the pipeline? Thank you very much in advance!!
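For completeness, this is the kind of sanity check I can run to confirm the container sees my files where the pipeline expects them (image name and mount are again from my setup):

# check that the paths visible inside the container match what the db expects
singularity exec -B /local/phg_data:/PHG phg_latest.sif \
    ls -l /PHG/inputDir/reference/ /PHG/outputDir/align/gvcfs/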