Question

PHG Imputation

2.4 years ago

지용 ▴ 20

Below are my PHG scripts and the files they require. An error message appeared at step 6 (imputation):

ERROR net.maizegenetics.plugindef.AbstractPlugin - Index: 2, Size: 1

Sorry for the long post. I suspect the problem is in the key file or the VCF index file. Please point out what is wrong.

1. Create Default Directory

$ docker run --name create_directory --rm \
-v /hpvol/user/jysong/phg/:/phg/ \
-t maizegenetics/phg:latest \
/tassel-5-standalone/run_pipeline.pl -debug -MakeDefaultDirectoryPlugin -workingDir /phg/ -endPlugin

2. Create a bed file to define genome intervals

$ docker run --name test_assemblies --rm  \
-v /hpvol/user/jysong/phg/:/phg/ \
-t maizegenetics/phg \
/tassel-5-standalone/run_pipeline.pl -Xmx100G -debug -configParameters /phg/configSQLite.txt \
-CreateValidIntervalsFilePlugin -intervalsFile /phg/Gmax_275_Wm82.a2.v1.gene.bed \
-referenceFasta /phg/inputDir/reference/Gmax_275_v2.0.fa.gz \
-mergeOverlaps true \
-generatedFile /phg/Gmax_validBedFile.bed -endPlugin

3. Create Initial DataBase

$ docker run --name create_initial_db --rm \
-v /hpvol/user/jysong/phg/:/phg/ \
-t maizegenetics/phg \
/tassel-5-standalone/run_pipeline.pl -Xmx100G -debug -configParameters /phg/configSQLite.txt \
-MakeInitialPHGDBPipelinePlugin -endPlugin

(1). configSQLite.txt

host=localHost
user=sqlite
password=sqlite
DB=/phg/gvcf
DBtype=sqlite

# Load genome intervals parameters
referenceFasta=/phg/inputDir/reference/Gmax_275_v2.0.fa.gz
anchors=/phg/Gmax_validBedFile.bed
genomeData=/phg/inputDir/reference/load_genome_data.txt
refServerPath=/hpvol/user/jysong/phg/inputDir/assemblies
#liquibase results output directory, general output directory
outputDir=/phg/outputDir
liquibaseOutdir=/phg/outputDir
CreateIntervalBedFilesPlugin.refRangeMethods=FocusRegion,FocusComplement

(2) load_genome_data.txt

    Genotype    Hapnumber   Dataline    ploidy  genesPhased chromsPhased    confidence  Method  MethodDetails
Wm82    0   core_collection 1   true    true    1   Ref_Wm82_method corecoll_test version for a2

4. Create Haplotype From GVCF

$ docker run --name load_Haplotype --rm -v /hpvol/user/jysong/phg/:/phg/ -t maizegenetics/phg ./CreateHaplotypesFromGVCF.groovy -config /phg/LoadHaploconfig.txt

(3) LoadHaploconfig.txt

    #!

### config file. 
### Anything marked with UNASSIGNED needs to be set for at least one of the steps
### If it is marked as OPTIONAL, it will only need to be set if you want to run specific steps. 
host=localHost
user=sqlite
password=sqlite
DB=/phg/gvcf
DBtype=sqlite

# Load genome intervals parameters
referenceFasta=/phg/inputDir/reference/Gmax_275_v2.0.fa.gz
anchors=/phg/Gmax_validBedFile.bed
genomeData=/phg/inputDir/reference/load_genome_data.txt
refServerPath=/hpvol/user/jysong/phg/inputDir/assemblies
#liquibase results output directory, general output directory
outputDir=/phg/outputDir
liquibaseOutdir=/phg/outputDir


### Align WGS fastq files to reference genome parameters

# File Directories
gvcfFileDir=/phg/inputDir/loadDB/gvcf/
tempFileDir=/phg/inputDir/loadDB/temp/
filteredBamDir=/phg/inputDir/loadDB/bam/filteredBAMs/
dedupedBamDir=/phg/inputDir/loadDB/bam/DedupBAMs/

# TASSEL parameters
Xmx=10G
tasselLocation=/tassel-5-standalone/run_pipeline.pl

# PHG CreateHaplotypes Parameters
referenceFasta=/phg/inputDir/reference/Gmax_275_v2.0.fa.gz
wgsKeyFile=/phg/load_wgs_genome_key_file.txt
LoadHaplotypesFromGVCFPlugin.gvcfDir=/phg/inputDir/loadDB/gvcf/
LoadHaplotypesFromGVCFPlugin.referenceFasta=/phg/inputDir/reference/Gmax_275_v2.0.fa.gz
CreateIntervalBedFilesPlugin.refRangeMethods=FocusRegion,FocusComplement
LoadHaplotypesFromGVCFPlugin.haplotypeMethodName=GATK_PIPELINE
LoadHaplotypesFromGVCFPlugin.haplotypeMethodDescription=”GATK_PIPELINE”
extendedWindowSize = 1000
mapQ = 48

# GATK and Sentieon Parameters
gatkPath = /gatk/gatk
numThreads=10
sentieon_license
sentieonPath=/sentieon/bin/sentieon

(4) load_wgs_genome_key_file.txt

sample_name sample_description  files   type    chrPhased   genePhased  phasingConf libraryID
CMJ001  CMJ001 Aligned using BWA    CMJ001_srt_dedup_mapQ_output.g.vcf  GVCF    true    true    .99 null
CMJ002  CMJ002 Aligned using BWA    CMJ002_srt_dedup_mapQ_output.g.vcf  GVCF    true    true    .99 null

5. Create consensus haplotypes

$ docker run --name phg_container_consensus --rm -v /hpvol/user/jysong/PHG_gvcf/:/phg/ -t maizegenetics/phg:latest /CreateConsensi.sh /phg/consensusconfig.txt Gmax_275_v2.0.fa.gz GATK_PIPELINE CONSENSUS -collapseMethod CONSENSUS

(5) consensusconfig.txt

    ##!

### config file. 
### Anything marked with UNASSIGNED needs to be set for at least one of the steps
### If it is marked as OPTIONAL, it will only need to be set if you want to run specific steps. 
host=localHost
user=sqlite
password=sqlite
DB=/phg/gvcf
DBtype=sqlite

# Load genome intervals parameters
referenceFasta=/phg/inputDir/reference/Gmax_275_v2.0.fa.gz
anchors=/phg/Gmax_validBedFile.bed
genomeData=/phg/inputDir/reference/load_genome_data.txt
refServerPath=/hpvol/user/jysong/phg/inputDir/assemblies
#liquibase results output directory, general output directory
outputDir=/phg/outputDir
liquibaseOutdir=/phg/outputDir


### Align WGS fastq files to reference genome parameters

# File Directories
gvcfFileDir=/phg/inputDir/loadDB/gvcf/
tempFileDir=/phg/inputDir/loadDB/temp/
filteredBamDir=/phg/inputDir/loadDB/bam/filteredBAMs/
dedupedBamDir=/phg/inputDir/loadDB/bam/DedupBAMs/

# TASSEL parameters
# Xmx=10G
tasselLocation=/tassel-5-standalone/run_pipeline.pl

# PHG CreateHaplotypes Parameters
referenceFasta=/phg/inputDir/reference/Gmax_275_v2.0.fa.gz
wgsKeyFile=/phg/load_wgs_genome_key_file.txt
LoadHaplotypesFromGVCFPlugin.gvcfDir=/phg/inputDir/loadDB/gvcf/
LoadHaplotypesFromGVCFPlugin.referenceFasta=/phg/inputDir/reference/Gmax_275_v2.0.fa.gz
LoadHaplotypesFromGVCFPlugin.haplotypeMethodName=GATK_PIPELINE
LoadHaplotypesFromGVCFPlugin.haplotypeMethodDescription=”GATK_PIPELINE”
extendedWindowSize = 1000
mapQ = 48

# GATK and Sentieon Parameters
gatkPath = /gatk/gatk
numThreads=10
sentieon_license
sentieonPath=/sentieon/bin/sentieon


# CreateConsensi parameters
haplotypeMethod = GARK_PIPELINE
consensusMethod = CONSENSUS
mxDiv = 0.005
seqErr = 0.02
minSites = 20
minTaxa = 2
#rankingFile = null
clusteringMode = upgma


# Graph Building Parameters
includeVariants = true

#FilterGVCF Parameters.  Adding any of these will add more filters.
#exclusionString=**UNASSIGNED**
#DP_poisson_min=0.0
#DP_poisson_max=1.0
#DP_min=**UNASSIGNED**
#DP_max=**UNASSIGNED**
#GQ_min=**UNASSIGNED**
#GQ_max=**UNASSIGNED**
#QUAL_min=**UNASSIGNED**
#QUAL_max=**UNASSIGNED**
#filterHets=**UNASSIGNED**

exportMergedVCF=/tempFileDir/data/outputs/mergedVCFs/

6. Imputation

$ docker run --name pipeline_container --rm -v /hpvol/user/jysong/phg_test/:/phg/ -t maizegenetics/phg /tassel-5-standalone/run_pipeline.pl -Xmx10G -debug -configParameters /phg/imputevcfconfig.txt -ImputePipelinePlugin -imputeTarget pathToVCF -endPlugin

(6) imputevcfconfig.txt

    # Imputation Pipeline parameters for VCF files

#!!! Required Parameters !!!
#--- Database ---
host=localHost
user=sqlite
password=sqlite
DB=/phg/gvcf
DBtype=sqlite

#--- Used by liquibase to check DB version ---
liquibaseOutdir=/phg/outputDir

#--- Used for indexing SNP positions ---
#   pangenomeHaplotypeMethod is the database method or methods for the haplotypes to which SNPs will be indexed
#   the index file lists the SNP allele to haplotype mapping and is used for mapping reads
 pangenomeHaplotypeMethod=CONSENSUS
 pangenomeDir=/phg/outputDir/pangenome
 indexFile=/phg/outputDir/vcfIndexfile
 vcfFile=/phg/inputDir/imputation/vcf/SoyHapMap.SNP.GT.fixed.vcf.414accession.KASP.gz

#--- Used for mapping reads
#   readMethod is the method name for storing the resulting read mappings
#   countAlleleDepths=true means allele depths will be used for haplotype counts, which is almost always a good choice
inputType=vcf
keyFile=/phg/readMapping_key_file.txt
readMethod=GATK_PIPELINE
vcfDir=/phg/inputDir/loadDB/gvcf/
countAlleleDepths=true

#--- Used for path finding
#   pathHaplotypeMethod determines which haplotypes will be consider for path finding
#   pathHaplotypeMethod should be the same as pangenomeHaplotypeMethod, but could be a subset
#   pathMethod is the method name used for storing the paths
pathHaplotypeMethod=GATK_PIPELINE
pathMethod=GATK_PIPELINE
maxNodes=1000
maxReads=10000
minReads=1
minTaxa=20
minTransitionProb=0.001
numThreads=3
probCorrect=0.99
removeEqual=true
splitNodes=true
splitProb=0.99
usebf=false
maxParents = 1000000
minCoverage = 1.0
#parentOutputFile = **OPTIONAL**

#--- used by haploid path finding only
#   usebf - if true use Forward-Backward algorithm, other Viterbi
usebf=false
minP=0.8

#--- used by diploid path finding only
maxHap=11
maxReadsKB=100
algorithmType=efficient

#--- Used to output a vcf file for pathMethod
outVcfFile=/phg/outputDir/Result.vcf

#~~~ Optional Parameters ~~~
#readMethodDescription=**OPTIONAL**
#pathMethodDescription=**OPTIONAL**
#bfInfoFile=**OPTIONAL**
#~~~ providing a value for outputDir will write read mappings to file rather than the PHG db ~~~
outputDir=/phg/

(7) readMapping_key_file.txt

    cultivar    flowcell_lane   filename    PlateID
CMJ001  wgsFlowcell CMJ001_srt_dedup_mapQ_output.g.vcf  wgs
CMJ002  wgsFlowcell  CMJ002_srt_dedup_mapQ_output.g.vcf  wgs

Thank you for your time.

Did you check the log files after each step to verify that nothing failed before the imputation step? I would start there. If output logs from each step look good with no errors, then please attach the log file from the imputation step and we'll take a look.
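One way to do that check in a single pass is sketched below. It assumes the console output of each step was saved to a `*.log` file in one directory; the function name `scan_phg_logs` and that layout are hypothetical, not something PHG provides.

```bash
# Sketch: list ERROR/Exception lines from every *.log file in one directory.
# The directory layout and file names are assumptions, not PHG conventions.
scan_phg_logs() {
  local log_dir=$1 found=0 f
  for f in "$log_dir"/*.log; do
    [ -e "$f" ] || continue          # skip the unexpanded glob when no logs exist
    # /dev/null as a second operand makes grep print the file name on each hit
    if grep -n -E 'ERROR|Exception' "$f" /dev/null; then
      found=1
    fi
  done
  return "$found"                    # 0 = nothing flagged, 1 = at least one hit
}
```

Calling `scan_phg_logs /path/to/logs` prints each matching line with its file name and line number, so the first step that failed is easy to spot.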

Reply 2.4 years ago by lcj34 ▴ 420

https://drive.google.com/drive/folders/1kLUgeF--w_EdIohIWPqmlwll7lHFn86f?usp=sharing

I have uploaded the log files for each step to the folder linked above.

The following warning appears in step 4 (Create Haplotype From GVCF):

WARNING: ConvertVariantContextToVariantInfo:determineASMINfo has empty variant list, creating default ASMVariantInfo object

Reply 2.4 years ago by 지용 ▴ 20

Please see pjb39's response. Let us know if you have problems after you've addressed the issues he noted.

Reply 2.4 years ago by lcj34 ▴ 420

score 1 · Answer 1 · 2022-07-08

I can see a couple of problems in the config file. Getting the method names right can be confusing. First, as the comments state, pathHaplotypeMethod should be the same as pangenomeHaplotypeMethod, so use pathHaplotypeMethod=CONSENSUS. Second, use different names for readMethod and pathMethod, and do not reuse GATK_PIPELINE for either. In this case, though, the problem is most certainly pathHaplotypeMethod. When you indexed the VCF, you told the software to use the CONSENSUS haplotypes. Later you tried to use SNPs indexed to the CONSENSUS haplotypes to tag the GATK_PIPELINE haplotypes, which failed because there is no overlap.
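The naming rules above can be checked mechanically before rerunning the pipeline. The following is a minimal sketch, not part of PHG: `get_param` and `check_phg_methods` are hypothetical helper names, and the check requires pathHaplotypeMethod to equal pangenomeHaplotypeMethod exactly, which is stricter than the config comment (it also allows a subset).

```bash
# Print the value of KEY from a key=value config file. Comment lines are
# skipped, blanks around the key and value are trimmed, last assignment wins.
get_param() {
  awk -F= -v key="$1" '
    /^[ \t]*#/ { next }
    index($0, "=") > 0 {
      k = $1; gsub(/^[ \t]+|[ \t\r]+$/, "", k)
      if (k == key) {
        v = substr($0, index($0, "=") + 1)
        gsub(/^[ \t]+|[ \t\r]+$/, "", v)
        val = v
      }
    }
    END { print val }
  ' "$2"
}

# Usage: check_phg_methods CONFIG [HAPLOTYPE_METHOD]
# HAPLOTYPE_METHOD is the name used when loading haplotypes (GATK_PIPELINE in
# the question); readMethod and pathMethod should not reuse it.
check_phg_methods() {
  local cfg=$1 hap_method=${2:-} status=0
  local pangenome path_hap read_m path_m
  pangenome=$(get_param pangenomeHaplotypeMethod "$cfg")
  path_hap=$(get_param pathHaplotypeMethod "$cfg")
  read_m=$(get_param readMethod "$cfg")
  path_m=$(get_param pathMethod "$cfg")

  if [ "$path_hap" != "$pangenome" ]; then
    echo "pathHaplotypeMethod=$path_hap differs from pangenomeHaplotypeMethod=$pangenome"
    status=1
  fi
  if [ "$read_m" = "$path_m" ]; then
    echo "readMethod and pathMethod share the name $read_m"
    status=1
  fi
  if [ -n "$hap_method" ] && [ "$read_m" = "$hap_method" ]; then
    echo "readMethod reuses the haplotype method name $hap_method"
    status=1
  fi
  if [ -n "$hap_method" ] && [ "$path_m" = "$hap_method" ]; then
    echo "pathMethod reuses the haplotype method name $hap_method"
    status=1
  fi
  return "$status"
}
```

Run against the posted imputevcfconfig.txt with `GATK_PIPELINE` as the second argument, this should flag every point listed above. After setting pathHaplotypeMethod=CONSENSUS and giving readMethod and pathMethod two distinct new names of your choosing, it should print nothing and return 0.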