Question

PHG Load haplotype and create consensus

1

Entering edit mode

22 months ago

지용 ▴ 20

Here, presented my PHG scripts, config, wgs_keyfile.

1. Create valid intervals

docker run --name test_assemblies --rm -v /DATA/jysong/PHG/ver1.0_phg/:/phg/ -t maizegenetics/phg:1.0 /tassel-5-standalone/run_pipeline.pl -Xmx100G -debug -configParameters /phg/Masterconfig.txt -CreateValidIntervalsFilePlugin -intervalsFile /phg/inputDir/reference/glyma.Wm82.gnm4.ann1.T8TQ.gene_models_main.bed -referenceFasta /phg/inputDir/reference/glyma.Wm82.gnm4.4PTR.genome_main.fixed.fna.gz -mergeOverlaps true -generatedFile /phg/validBedFile.bed -endPlugin &> Log/1.Create_validinterval.txt &

2. Create initial DB

docker run --name create_initial_db --rm -v /DATA/jysong/PHG/ver1.0_phg/:/phg/ -t maizegenetics/phg:1.0 /tassel-5-standalone/run_pipeline.pl -Xmx100G -debug -configParameters /phg/Masterconfig.txt -MakeInitialPHGDBPipelinePlugin -endPlugin &> Log/2.Create_InitialDB.txt &

3. check plugin update

docker run --name create_directory --rm -v /DATA/jysong/PHG/ver1.0_phg/:/phg/ -t maizegenetics/phg:1.0 /tassel-5-standalone/run_pipeline.pl -debug -configParameters /phg/Masterconfig.txt -CheckDBVersionPlugin -outputDir /phg/outputDir -endPlugin -LiquibaseUpdatePlugin -outputDir /phg/outputDir -endPlugin &> Log/3.Check_Plugin.txt &

4. Load haplotype (In docker)

./CreateHaplotypesFromFastq.groovy -config phg/Masterconfig.txt &> phg/Log/4.CreateHAplotypeFromFastq.txt &

5. Create consensus (In docker)

/tassel-5-standalone/run_pipeline.pl -Xmx100G -debug -configParameters Masterconfig.txt -HaplotypeGraphBuilderPlugin -configFile Masterconfig.txt -methods GATK_PIPELINE1:CONSENSUS -includeVariantContexts true -endPlugin -RunHapConsensusPipelinePlugin -referenceFasta /inputDir/reference/glyma.Wm82.gnm4.4PTR.genome_main.fixed.fna.gz -dbConfigFile Masterconfig.txt -collapseMethod CONSENSUS -collapseMethodDetails G.max_test -rankingFile rankingFile.txt -clusteringMode kmer_assembly -isTestMethod true -endPlugin &> phg/Log/5.Concensus.txt &

Config file

### config file. 
### Anything marked with UNASSIGNED needs to be set for at least one of the steps
### If it is marked as OPTIONAL, it will only need to be set if you want to run specific steps. 
host=localHost
user=sqlite
password=sqlite
DB=/phg/Gmax430
DBtype=sqlite

# Load genome intervals parameters
referenceFasta=/phg/inputDir/reference/glyma.Wm82.gnm4.4PTR.genome_main.fna.gz
anchors=/phg/validBedFile.bed
genomeData=/phg/inputDir/reference/load_genome_data.txt
refServerPath=localHost;/DATA/jysong/ver1.0_phg/ref
#liquibase results output directory, general output directory
outputDir=/phg/outputDir
liquibaseOutdir=/phg/outputDir


### Align WGS fastq files to reference genome parameters

# File Directories
gvcfFileDir=/phg/inputDir/loadDB/gvcf/
tempFileDir=/phg/inputDir/loadDB/temp/
filteredBamDir=/phg/inputDir/loadDB/bam/filteredBAMs/
dedupedBamDir=/phg/inputDir/loadDB/bam/DedupBAMs/

# TASSEL parameters
Xmx=100G
tasselLocation=/tassel-5-standalone/run_pipeline.pl

# PHG CreateHaplotypes Parameters
wgsKeyFile=/phg/wgs_KeyFile.txt
LoadHaplotypesFromGVCFPlugin.gvcfDir=/phg/inputDir/loadDB/gvcf/
LoadHaplotypesFromGVCFPlugin.referenceFasta=/phg/inputDir/reference/glyma.Wm82.gnm4.4PTR.genome_main.fna.gz
LoadHaplotypesFromGVCFPlugin.haplotypeMethodName=GATK_PIPELINE
LoadHaplotypesFromGVCFPlugin.haplotypeMethodDescription=GATK_PIPELINE
extendedWindowSize = 1000
mapQ = 48


# GATK and Sentieon Parameters
gatkPath = /gatk/gatk
numThreads=35
sentieon_license
sentieonPath=/sentieon/bin/sentieon


# CreateConsensi parameters
haplotypeMethod = GARK_PIPELINE
consensusMethod = CONSENSUS
mxDiv = 0.005
seqErr = 0.02
minSites = 20
minTaxa = 2
maxThreads = 60
#rankingFile = null
#clusteringMode = upgma


# Graph Building Parameters
includeVariants = true

#FilterGVCF Parameters.  Adding any of these will add more filters.#exclusionString=**UNASSIGNED**
#DP_poisson_min=0.0
#DP_poisson_max=1.0
#DP_min=100
#DP_max=**UNASSIGNED**
#GQ_min=10
#GQ_max=**UNASSIGNED**
#QUAL_min=30
#QUAL_max=**UNASSIGNED**
#filterHets=**UNASSIGNED**

# Imputation Pipeline parameters for VCF files

#--- Used by liquibase to check DB version ---
liquibaseOutdir=/phg/outputDir

#--- Used for indexing SNP positions ---
#   pangenomeHaplotypeMethod is the database method or methods for the haplotypes to which SNPs will be indexed
#   the index file lists the SNP allele to haplotype mapping and is used for mapping reads
 pangenomeHaplotypeMethod=CONSENSUS
 pangenomeDir=/phg/outputDir/pangenome
 indexFile=/phg/outputDir/vcfIndexFile
 vcfFile=/phg/inputDir/imputation/vcf/SoyHapMap.SNP.GT.fixed.vcf.414accession.KASP.gz

#--- Used for mapping reads
#   readMethod is the method name for storing the resulting read mappings
#   countAlleleDepths=true means allele depths will be used for haplotype counts, which is almost always a good choice
inputType=vcf
keyFile=/phg/readMapping_key_file.txt
readMethod=GBD_readMethod
vcfDir=/phg/inputDir/loadDB/gvcf/
countAlleleDepths=true

#--- Used for path finding
#   pathHaplotypeMethod determines which haplotypes will be consider for path finding
#   pathHaplotypeMethod should be the same as pangenomeHaplotypeMethod, but could be a subset
#   pathMethod is the method name used for storing the paths
pathHaplotypeMethod=CONSENSUS
pathMethod=GBD_pathMethod
maxNodes=1000
maxReads=10000
minReads=1
minTaxa=20
minTransitionProb=0.001
numThreads=3
probCorrect=0.99
removeEqual=true
splitNodes=true
splitProb=0.99
usebf=false
maxParents = 1000000
minCoverage = 1.0
#parentOutputFile = **OPTIONAL**

#--- used by haploid path finding only
#   usebf - if true use Forward-Backward algorithm, other Viterbi
usebf=false
minP=0.8

#--- used by diploid path finding only
maxHap=11
maxReadsKB=100
algorithmType=efficient

#--- Used to output a vcf file for pathMethod
outVcfFile=/phg/outputDir/Result.vcf

#~~~ Optional Parameters ~~~
#readMethodDescription=**OPTIONAL**
#pathMethodDescription=**OPTIONAL**
#bfInfoFile=**OPTIONAL**
#~~~ providing a value for outputDir will write read mappings to file rather than the PHG db ~~~
outputDir=/phg/

wgs_keyfile enter image description here

The output file of the filtered gvcf file did not show any sequences. And in create consensus step, sever;/path/to/file error occurred.

enter image description here

After changing the colon of gvcfserverpath to a semicolon, executing CreateFromGVCF.groovy and proceeding with the create Consensus step, the same error occurred. Is there something wrong with the config file or keyfile?

haplotype Consensus PHG • 1.3k views

ADD COMMENT • link updated 21 months ago by lcj34 ▴ 420 • written 22 months ago by 지용 ▴ 20

score 1 · Answer 1 · 2023-01-26

1

Entering edit mode

22 months ago

lcj34 ▴ 420

I'm a little confused on your steps. You loaded the haplotypes using the CreateHaplotypesFromFastq.groovy script, then attempted to reload the same files using the CreateHaplotypesFromGVCF.groovy? I would have expected an error if the same data was already loaded to the db

In terms of the config file: yes, the server address and path to file must be separated with a semi-colon. This data is stored to the database when the haplotypes are loaded. Changing the config file after loading will not fix the error as the data is already in the database. It would be nice if the code caught the lack of a semi-colon prior to loading the db, but that is difficult as the user may indicate a path without a server, in which case there is legitimately no semi-colon.

Is this the only error you saw? Did you check logs after each step ? How much trouble is it to start over and re-create your db? This will be your best bet right now. Check each step along the way for errors.

Also curious why you entered the docker to run your steps 4 and 5 vs calling the docker from the command line as you did in steps 1-3. Note - Step 3 should not be necessary as the MakeInitialPHGDBPIpelinePlugin will call liquibase to mark all change sets as "executed".

ADD COMMENT • link 22 months ago by lcj34 ▴ 420

0

Entering edit mode

If using CreateHaplotypesFromGVCF.groovy after CreateHaplotypeFromFastq.groovy, the DB I thought it would be updated with the changed gvcfSeverPath in wgs_keyfile.

If running it with the command line in step 4-5, the program of samtools or bwa can be checked in process (ps -a), but There was a case where the activation of the core was lost, so I ran it inside Docker.

Thank you for your reply.

Create a new DB and try again.

ADD REPLY • link 22 months ago by 지용 ▴ 20

0

Entering edit mode

docker run --name consensus_container --rm \
    -v ${WORKING_DIR}/:/phg/ \
    -t maizegenetics/phg:1.0 \
    /tassel-5-standalone/run_pipeline.pl -Xmx100G -debug -configParameters ${DOCKER_CONFIG_FILE} \
        -HaplotypeGraphBuilderPlugin -configFile ${DOCKER_CONFIG_FILE}.txt -methods method1:method2 \ 
                -includeVariantContexts true -endPlugin \
        -RunHapConsensusPipelinePlugin \
                -referenceFasta /phg/inputDir/reference/Zm-B73-REFERENCE-NAM-5.0.fa \
                -dbConfigFile ${DOCKER_CONFIG_FILE}  \
                -collapseMethod NAM_CONSENSUS_mxDiv_10ToNeg4 \
                -collapseMethodDetails NAM_CONSENSUS_mxDiv_10ToNeg4 \
                -rankingFile /phg/rankingFile.txt \
                -mxDiv 0.0001 \
                -clusteringMode kmer_assembly -endPlugin

Among the commands of the Create Consensus haplotype step, which parameters should be entered in the method of -method method1:method2?

ADD REPLY • link 21 months ago by 지용 ▴ 20

score 0 · Answer 2 · 2023-03-02

When you created haplotypes for the graph, a method was used and that method is associated with each haplotype. You may have multiple haplotype methods, if you loaded multiple sets of haplotypes each with a different method.

The HaplotypeGraphBuilderPlugin -methods parameter is used to identify which haplotypes should be included in the graph when creating consensus haplotypes The values for this parameter are pairs of methods, each pair consisting of a haplotype method name and range group method name. The method name, range group name pair is separated by a comma, and pairs are separated by a colon. The range group is optional.

For example, if you have range groups identified as "FocusRegion" and "FocusComplement" (the defaults if you did not specify group names in your reference_ranges bed file), and you loaded haplotypes with method names "NAM" and "ref_method" and wanted to include both haplotype methods and only the FocusRegion ranges, you would have a method parameter that looks like:

methods NAM,FocusRegion:ref_method,FocusRegion

If you wanted both NAM and ref_method but all range groups, you could leave off the range method and use a parameter e.g.

methods NAM:ref_method

The values for the -methods parameter for the HaplotypeGraphBuilderPlugin depend on the method names you have in your database, and which haplotypes (identified by the method name) you want included in the graph to be used for consensus creation.

I hope this helps