Hi everyone,
I am trying to use the maize NAM founder PHG database (phg_v5Assemblies_20200608.db) to impute additional SNPs into a VCF file that has sparse SNPs. I have read the documentation on the wiki and older posts here, and I believe I only need to run step 3, "Impute variants or haplotypes". Following the wiki, I installed Docker and set up the config file below:
# Imputation Pipeline parameters for VCF files
#!!! Required Parameters !!!
#--- Database ---
host=localHost
user=sqlite
password=sqlite
DB=/phg/phg_v5Assemblies_20200608.db
DBtype=sqlite
#--- Used by liquibase to check DB version ---
forceDBUpdate=true
liquibaseOutdir=/phg/outputDir
#--- Used for indexing SNP positions ---
# pangenomeHaplotypeMethod is the database method or methods for the haplotypes to which SNPs will be indexed
# the index file lists the SNP allele to haplotype mapping and is used for mapping reads
pangenomeHaplotypeMethod=mummer4
indexFile=/phg/outputDir/vcfIndexFile
vcfFile=/phg/inputDir/imputation/vcf/B73.GenoOutput1-10.vcf
#--- Used for mapping reads
# readMethod is the method name for storing the resulting read mappings
# countAlleleDepths=true means allele depths will be used for haplotype counts, which is almost always a good choice
lowMemMode=true
inputType=vcf
keyFile=/phg/readMapping_key_file.txt
readMethod=MT1
vcfDir=/phg/inputDir/imputation/vcf/
countAlleleDepths=true
maxRefRangeErr=0.25
outputSecondaryStats=true
maxSecondary=50
fParameter=f1000,5000
minimapLocation=minimap2
#--- Used for path finding
# pathHaplotypeMethod determines which haplotypes will be considered for path finding
# pathHaplotypeMethod should be the same as pangenomeHaplotypeMethod, but could be a subset
# pathMethod is the method name used for storing the paths
pathHaplotypeMethod=mummer4
pathMethod=MDefaultTest
maxNodes=1000
maxReads=10000
minReads=1
minTaxa=1
minTransitionProb=0.001
numThreads=3
probCorrect=0.99
removeEqual=true
splitNodes=true
splitProb=0.99
usebf=false
maxParents = 1000000
minCoverage = 1.0
#parentOutputFile = **OPTIONAL**
#--- used by haploid path finding only
# usebf - if true use Forward-Backward algorithm, otherwise Viterbi
#usebf=false
#minP=0.8
#--- used by diploid path finding only
#maxHap=11
#maxReadsKB=100
#algorithmType=efficient
#--- Used to output a vcf file for pathMethod
outVcfFile=/phg/outputDir/ImputedResult.vcf
#~~~ Optional Parameters ~~~
#readMethodDescription=**OPTIONAL**
#pathMethodDescription=**OPTIONAL**
#bfInfoFile=**OPTIONAL**
#~~~ providing a value for outputDir will write read mappings to file rather than the PHG db ~~~
outputDir=/phg/
However, when I run the following command, it hangs at the createHaplotypeNodes step and then prints "Killed":
docker run --name pipeline_container --rm -v ${WORKING_DIR}/:/phg/ -t maizegenetics/phg /tassel-5-standalone/run_pipeline.pl -Xmx48G -debug -configParameters /phg/docker_vcf_imputation_config.txt -ImputePipelinePlugin -forceDBUpdate -imputeTarget pangenome -endPlugin > 3a.log
Here is the relevant part of the log output (the forum won't let me post the entire log, possibly because it is too long):
[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - first connection: dbName from config file = /phg/phg_v5Assemblies_20200608.db host: localHost user: sqlite type: sqlite
[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - Database URL: jdbc:sqlite:/phg/phg_v5Assemblies_20200608.db
[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - Connected to database:
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - referenceRangesAsMap: query statement: select reference_ranges.ref_range_id, chrom, range_start, range_end, methods.name from reference_ranges INNER JOIN ref_range_ref_range_method on ref_range_ref_range_method.ref_range_id=reference_ranges.ref_range_id INNER JOIN methods on ref_range_ref_range_method.method_id = methods.method_id AND methods.method_type = 7 ORDER BY reference_ranges.ref_range_id
methods size: 1
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - referenceRangesAsMap: number of reference ranges: 71354
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - referenceRangesAsMap: time: 0.301803638 secs.
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - taxaListMap: query statement: SELECT gamete_haplotypes.gamete_grp_id, genotypes.line_name FROM gamete_haplotypes INNER JOIN gametes ON gamete_haplotypes.gameteid = gametes.gameteid INNER JOIN genotypes on gametes.genoid = genotypes.genoid ORDER BY gamete_haplotypes.gamete_grp_id;
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - taxaListMap: number of taxa lists: 27
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - taxaListMap: time: 0.010250027 secs.
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - createHaplotypeNodes: haplotype method: mummer4 range group method: null
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - createHaplotypeNodes: query statement: SELECT haplotypes_id, gamete_grp_id, haplotypes.ref_range_id, asm_contig, asm_start_coordinate, asm_end_coordinate, asm_strand, genome_file_id, sequence, seq_has...
Killed
If the entire log is needed to identify the issue, I can try to post it in a reply below or provide a link. I believe I need to run the PHG pipeline all the way through to the 3E VCF output, but for now I am just trying to get the 3A pangenome step to work. I thought the problem might be that phg_v5Assemblies_20200608.db was built with an earlier version of PHG (0.0.40 or earlier), so I tried running it with phg:0.0.20, 0.0.22, and 0.0.40, but the process is killed in every case. I believe "Killed" indicates a memory problem, but the Mac I'm running this on has 64 GB of RAM, and Activity Monitor never shows a significant spike in memory use while the command is running in the terminal. Any help or direction would be appreciated!
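One thing I have not ruled out: as far as I understand, Docker Desktop on macOS runs containers inside a Linux VM with its own memory limit, so the container may see far less than the Mac's 64 GB, which could also explain why Activity Monitor stays flat. Below is a minimal sketch of how that could be checked, comparing the memory reported by `docker info` against the 48 GB heap requested with -Xmx48G. The helper name `bytes_to_gib` and the variable names are my own, and I have not confirmed that this is the cause.

```shell
#!/bin/sh
# JVM heap requested in the run_pipeline.pl call (-Xmx48G).
heap_gib=48

# Hypothetical helper: convert a byte count to whole GiB (integer division).
bytes_to_gib() {
  echo $(( $1 / 1073741824 ))
}

# `docker info` reports the total memory visible to the Docker engine, in bytes.
if command -v docker >/dev/null 2>&1; then
  mem_bytes=$(docker info --format '{{.MemTotal}}' 2>/dev/null)
  case "$mem_bytes" in
    ''|*[!0-9]*)
      # Empty or non-numeric output, e.g. the Docker daemon is not running.
      echo "Could not read MemTotal from docker info"
      ;;
    *)
      engine_gib=$(bytes_to_gib "$mem_bytes")
      echo "Docker engine memory: ${engine_gib} GiB; requested JVM heap: ${heap_gib} GiB"
      if [ "$engine_gib" -lt "$heap_gib" ]; then
        echo "Engine memory is below -Xmx, so the container could be killed before the heap fills."
      fi
      ;;
  esac
fi
```

If the reported engine memory is well below 48 GiB, raising the memory limit in the Docker Desktop resource settings (or lowering -Xmx to fit) would be the next thing I would try.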