Here are my PHG scripts and the required files. An error message appeared in step 6 (imputation):
ERROR net.maizegenetics.plugindef.AbstractPlugin - Index: 2, Size: 1
Sorry for the lengthy post. My suspicion is that there is a problem with the key file or the VCF index file; please point out anything that looks wrong.
1. Create Default Directory
$ docker run --name create_directory --rm \
-v /hpvol/user/jysong/phg/:/phg/ \
-t maizegenetics/phg:latest \
/tassel-5-standalone/run_pipeline.pl -debug -MakeDefaultDirectoryPlugin -workingDir /phg/ -endPlugin
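Not part of the pipeline, but a quick way to confirm the plugin created the expected directory skeleton before copying inputs in:
$ find /hpvol/user/jysong/phg/ -maxdepth 3 -type d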
2. Create a bed file to define genome intervals
$ docker run --name test_assemblies --rm \
-v /hpvol/user/jysong/phg/:/phg/ \
-t maizegenetics/phg \
/tassel-5-standalone/run_pipeline.pl -Xmx100G -debug -configParameters /phg/configSQLite.txt \
-CreateValidIntervalsFilePlugin -intervalsFile /phg/Gmax_275_Wm82.a2.v1.gene.bed \
-referenceFasta /phg/inputDir/reference/Gmax_275_v2.0.fa.gz \
-mergeOverlaps true \
-generatedFile /phg/Gmax_validBedFile.bed -endPlugin
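Since -mergeOverlaps true should have collapsed any overlapping intervals, a quick awk pass (my own check, which assumes the BED file is coordinate-sorted) can confirm the generated file is clean before it goes into the database:
$ awk -F'\t' '$1==chr && $2<end {print "overlap at line " NR ": " $0} {chr=$1; end=$3}' /hpvol/user/jysong/phg/Gmax_validBedFile.bed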
3. Create Initial DataBase
$ docker run --name create_initial_db --rm \
-v /hpvol/user/jysong/phg/:/phg/ \
-t maizegenetics/phg \
/tassel-5-standalone/run_pipeline.pl -Xmx100G -debug -configParameters /phg/configSQLite.txt \
-MakeInitialPHGDBPipelinePlugin -endPlugin
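If this step succeeds, the SQLite database named by DB= in the config below should exist at /hpvol/user/jysong/phg/gvcf on the host. Listing its tables is a cheap sanity check (sqlite3 is not part of the pipeline; run it wherever it is installed):
$ sqlite3 /hpvol/user/jysong/phg/gvcf ".tables"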
(1) configSQLite.txt
host=localHost
user=sqlite
password=sqlite
DB=/phg/gvcf
DBtype=sqlite
# Load genome intervals parameters
referenceFasta=/phg/inputDir/reference/Gmax_275_v2.0.fa.gz
anchors=/phg/Gmax_validBedFile.bed
genomeData=/phg/inputDir/reference/load_genome_data.txt
refServerPath=/hpvol/user/jysong/phg/inputDir/assemblies
#liquibase results output directory, general output directory
outputDir=/phg/outputDir
liquibaseOutdir=/phg/outputDir
CreateIntervalBedFilesPlugin.refRangeMethods=FocusRegion,FocusComplement
(2) load_genome_data.txt
Genotype Hapnumber Dataline ploidy genesPhased chromsPhased confidence Method MethodDetails
Wm82 0 core_collection 1 true true 1 Ref_Wm82_method corecoll_test version for a2
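PHG key/data files like this one must be tab-delimited, and an "Index: N, Size: M" error can come from a line that splits into fewer fields than a plugin expects. A quick field-count check (9 columns assumed from the header above):
$ awk -F'\t' 'NF!=9 {print "line " NR " has " NF " tab-delimited fields"}' /hpvol/user/jysong/phg/inputDir/reference/load_genome_data.txt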
4. Create Haplotype From GVCF
$ docker run --name load_Haplotype --rm \
-v /hpvol/user/jysong/phg/:/phg/ \
-t maizegenetics/phg \
./CreateHaplotypesFromGVCF.groovy -config /phg/LoadHaploconfig.txt
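Before loading, it may be worth confirming that every GVCF in the input directory parses and is non-empty; a minimal shell check (my addition, counting non-header records):
$ for f in /hpvol/user/jysong/phg/inputDir/loadDB/gvcf/*.g.vcf; do echo "$f: $(grep -vc '^#' "$f") records"; done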
(3) LoadHaploconfig.txt
#!
### config file.
### Anything marked with UNASSIGNED needs to be set for at least one of the steps
### If it is marked as OPTIONAL, it will only need to be set if you want to run specific steps.
host=localHost
user=sqlite
password=sqlite
DB=/phg/gvcf
DBtype=sqlite
# Load genome intervals parameters
referenceFasta=/phg/inputDir/reference/Gmax_275_v2.0.fa.gz
anchors=/phg/Gmax_validBedFile.bed
genomeData=/phg/inputDir/reference/load_genome_data.txt
refServerPath=/hpvol/user/jysong/phg/inputDir/assemblies
#liquibase results output directory, general output directory
outputDir=/phg/outputDir
liquibaseOutdir=/phg/outputDir
### Align WGS fastq files to reference genome parameters
# File Directories
gvcfFileDir=/phg/inputDir/loadDB/gvcf/
tempFileDir=/phg/inputDir/loadDB/temp/
filteredBamDir=/phg/inputDir/loadDB/bam/filteredBAMs/
dedupedBamDir=/phg/inputDir/loadDB/bam/DedupBAMs/
# TASSEL parameters
Xmx=10G
tasselLocation=/tassel-5-standalone/run_pipeline.pl
# PHG CreateHaplotypes Parameters
referenceFasta=/phg/inputDir/reference/Gmax_275_v2.0.fa.gz
wgsKeyFile=/phg/load_wgs_genome_key_file.txt
LoadHaplotypesFromGVCFPlugin.gvcfDir=/phg/inputDir/loadDB/gvcf/
LoadHaplotypesFromGVCFPlugin.referenceFasta=/phg/inputDir/reference/Gmax_275_v2.0.fa.gz
CreateIntervalBedFilesPlugin.refRangeMethods=FocusRegion,FocusComplement
LoadHaplotypesFromGVCFPlugin.haplotypeMethodName=GATK_PIPELINE
LoadHaplotypesFromGVCFPlugin.haplotypeMethodDescription="GATK_PIPELINE"
extendedWindowSize = 1000
mapQ = 48
# GATK and Sentieon Parameters
gatkPath = /gatk/gatk
numThreads=10
sentieon_license
sentieonPath=/sentieon/bin/sentieon
(4) load_wgs_genome_key_file.txt
sample_name sample_description files type chrPhased genePhased phasingConf libraryID
CMJ001 CMJ001 Aligned using BWA CMJ001_srt_dedup_mapQ_output.g.vcf GVCF true true .99 null
CMJ002 CMJ002 Aligned using BWA CMJ002_srt_dedup_mapQ_output.g.vcf GVCF true true .99 null
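The sample_description field here contains spaces, which is only safe if the column separators are real tabs; the same field-count check as above, with 8 columns assumed from this header:
$ awk -F'\t' 'NF!=8 {print "line " NR " has " NF " tab-delimited fields"}' /hpvol/user/jysong/phg/load_wgs_genome_key_file.txt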
5. Create consensus haplotypes
$ docker run --name phg_container_consensus --rm \
-v /hpvol/user/jysong/PHG_gvcf/:/phg/ \
-t maizegenetics/phg:latest \
/CreateConsensi.sh /phg/consensusconfig.txt Gmax_275_v2.0.fa.gz GATK_PIPELINE CONSENSUS -collapseMethod CONSENSUS
(5) consensusconfig.txt
##!
### config file.
### Anything marked with UNASSIGNED needs to be set for at least one of the steps
### If it is marked as OPTIONAL, it will only need to be set if you want to run specific steps.
host=localHost
user=sqlite
password=sqlite
DB=/phg/gvcf
DBtype=sqlite
# Load genome intervals parameters
referenceFasta=/phg/inputDir/reference/Gmax_275_v2.0.fa.gz
anchors=/phg/Gmax_validBedFile.bed
genomeData=/phg/inputDir/reference/load_genome_data.txt
refServerPath=/hpvol/user/jysong/phg/inputDir/assemblies
#liquibase results output directory, general output directory
outputDir=/phg/outputDir
liquibaseOutdir=/phg/outputDir
### Align WGS fastq files to reference genome parameters
# File Directories
gvcfFileDir=/phg/inputDir/loadDB/gvcf/
tempFileDir=/phg/inputDir/loadDB/temp/
filteredBamDir=/phg/inputDir/loadDB/bam/filteredBAMs/
dedupedBamDir=/phg/inputDir/loadDB/bam/DedupBAMs/
# TASSEL parameters
# Xmx=10G
tasselLocation=/tassel-5-standalone/run_pipeline.pl
# PHG CreateHaplotypes Parameters
referenceFasta=/phg/inputDir/reference/Gmax_275_v2.0.fa.gz
wgsKeyFile=/phg/load_wgs_genome_key_file.txt
LoadHaplotypesFromGVCFPlugin.gvcfDir=/phg/inputDir/loadDB/gvcf/
LoadHaplotypesFromGVCFPlugin.referenceFasta=/phg/inputDir/reference/Gmax_275_v2.0.fa.gz
LoadHaplotypesFromGVCFPlugin.haplotypeMethodName=GATK_PIPELINE
LoadHaplotypesFromGVCFPlugin.haplotypeMethodDescription="GATK_PIPELINE"
extendedWindowSize = 1000
mapQ = 48
# GATK and Sentieon Parameters
gatkPath = /gatk/gatk
numThreads=10
sentieon_license
sentieonPath=/sentieon/bin/sentieon
# CreateConsensi parameters
haplotypeMethod = GATK_PIPELINE
consensusMethod = CONSENSUS
mxDiv = 0.005
seqErr = 0.02
minSites = 20
minTaxa = 2
#rankingFile = null
clusteringMode = upgma
# Graph Building Parameters
includeVariants = true
#FilterGVCF Parameters. Adding any of these will add more filters.
#exclusionString=**UNASSIGNED**
#DP_poisson_min=0.0
#DP_poisson_max=1.0
#DP_min=**UNASSIGNED**
#DP_max=**UNASSIGNED**
#GQ_min=**UNASSIGNED**
#GQ_max=**UNASSIGNED**
#QUAL_min=**UNASSIGNED**
#QUAL_max=**UNASSIGNED**
#filterHets=**UNASSIGNED**
exportMergedVCF=/tempFileDir/data/outputs/mergedVCFs/
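Once consensi are built, one way to catch method-name mismatches between steps is to list what is actually stored in the database's methods table; the table/column names here are assumptions based on the PHG SQLite schema, and the DB path follows from the -v mount above:
$ sqlite3 /hpvol/user/jysong/PHG_gvcf/gvcf "SELECT name, description FROM methods;"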
6. Imputation
$ docker run --name pipeline_container --rm \
-v /hpvol/user/jysong/phg_test/:/phg/ \
-t maizegenetics/phg \
/tassel-5-standalone/run_pipeline.pl -Xmx10G -debug -configParameters /phg/imputevcfconfig.txt \
-ImputePipelinePlugin -imputeTarget pathToVCF -endPlugin
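The same command can be run with stdout and stderr captured to a file, which makes it easier to share the full log (plain shell redirection; docker's -t flag can interleave the streams, so it is dropped here):
$ docker run --name pipeline_container --rm \
-v /hpvol/user/jysong/phg_test/:/phg/ \
maizegenetics/phg \
/tassel-5-standalone/run_pipeline.pl -Xmx10G -debug -configParameters /phg/imputevcfconfig.txt \
-ImputePipelinePlugin -imputeTarget pathToVCF -endPlugin 2>&1 | tee /hpvol/user/jysong/phg_test/imputation.log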
(6) imputevcfconfig.txt
# Imputation Pipeline parameters for VCF files
#!!! Required Parameters !!!
#--- Database ---
host=localHost
user=sqlite
password=sqlite
DB=/phg/gvcf
DBtype=sqlite
#--- Used by liquibase to check DB version ---
liquibaseOutdir=/phg/outputDir
#--- Used for indexing SNP positions ---
# pangenomeHaplotypeMethod is the database method or methods for the haplotypes to which SNPs will be indexed
# the index file lists the SNP allele to haplotype mapping and is used for mapping reads
pangenomeHaplotypeMethod=CONSENSUS
pangenomeDir=/phg/outputDir/pangenome
indexFile=/phg/outputDir/vcfIndexfile
vcfFile=/phg/inputDir/imputation/vcf/SoyHapMap.SNP.GT.fixed.vcf.414accession.KASP.gz
#--- Used for mapping reads
# readMethod is the method name for storing the resulting read mappings
# countAlleleDepths=true means allele depths will be used for haplotype counts, which is almost always a good choice
inputType=vcf
keyFile=/phg/readMapping_key_file.txt
readMethod=GATK_PIPELINE
vcfDir=/phg/inputDir/loadDB/gvcf/
countAlleleDepths=true
#--- Used for path finding
# pathHaplotypeMethod determines which haplotypes will be considered for path finding
# pathHaplotypeMethod should be the same as pangenomeHaplotypeMethod, but could be a subset
# pathMethod is the method name used for storing the paths
pathHaplotypeMethod=GATK_PIPELINE
pathMethod=GATK_PIPELINE
maxNodes=1000
maxReads=10000
minReads=1
minTaxa=20
minTransitionProb=0.001
numThreads=3
probCorrect=0.99
removeEqual=true
splitNodes=true
splitProb=0.99
usebf=false
maxParents = 1000000
minCoverage = 1.0
#parentOutputFile = **OPTIONAL**
#--- used by haploid path finding only
# usebf - if true, use the Forward-Backward algorithm; otherwise Viterbi
usebf=false
minP=0.8
#--- used by diploid path finding only
maxHap=11
maxReadsKB=100
algorithmType=efficient
#--- Used to output a vcf file for pathMethod
outVcfFile=/phg/outputDir/Result.vcf
#~~~ Optional Parameters ~~~
#readMethodDescription=**OPTIONAL**
#pathMethodDescription=**OPTIONAL**
#bfInfoFile=**OPTIONAL**
#~~~ providing a value for outputDir will write read mappings to file rather than the PHG db ~~~
outputDir=/phg/
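Given my suspicion about the VCF index: the vcfFile above ends in .gz, and tools that need random access require bgzip compression plus a tabix index, not plain gzip. A recompress-and-index sketch with standard htslib tools, run in /hpvol/user/jysong/phg_test/inputDir/imputation/vcf/ (the output filename is my own; vcfFile would need updating to match):
$ zcat SoyHapMap.SNP.GT.fixed.vcf.414accession.KASP.gz | bgzip -c > SoyHapMap.414accession.KASP.vcf.gz
$ tabix -p vcf SoyHapMap.414accession.KASP.vcf.gz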
(7) readMapping_key_file.txt
cultivar flowcell_lane filename PlateID
CMJ001 wgsFlowcell CMJ001_srt_dedup_mapQ_output.g.vcf wgs
CMJ002 wgsFlowcell CMJ002_srt_dedup_mapQ_output.g.vcf wgs
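"Index: 2, Size: 1" is the standard Java IndexOutOfBoundsException message produced when code asks for the third token of a line that only split into one, which would be consistent with a key file delimited by spaces instead of tabs. A field-count check on this file (4 columns assumed from the header):
$ awk -F'\t' 'NF!=4 {print "line " NR " has " NF " tab-delimited fields"}' /hpvol/user/jysong/phg_test/readMapping_key_file.txt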
Thank you for your time.
Did you check the log files after each step to verify that nothing failed before the imputation step? I would start there. If output logs from each step look good with no errors, then please attach the log file from the imputation step and we'll take a look.
Log files for each step have been uploaded to this Drive folder:
https://drive.google.com/drive/folders/1kLUgeF--w_EdIohIWPqmlwll7lHFn86f?usp=sharing
A warning also appears in the logs for step 4 (Create Haplotype From GVCF):
WARNING: ConvertVariantContextToVariantInfo:determineASMINfo has empty variant list, creating default ASMVariantInfo object
Please see pjb39's response. Let us know if you have problems after you've addressed the issues he noted.