I have successfully set up a PHG using version 1.2 of the Docker tool through step 1, and am now attempting to load in BAMs as part of step 2. When I run:
singularity exec -B ${WORKING_DIR}/phg $DOCKER /CreateHaplotypesFromBAM.groovy -config $CONFIG_FILE
I get:
ERROR net.maizegenetics.plugindef.AbstractPlugin - Error Loading in Bed file, file is empty. Please double check: phg/inputDir/loadDB/bam/temp/intervals.bed
When I look at the intervals.bed file manually, it is indeed empty. However, running -CreateValidIntervalsFilePlugin by itself (command below) produces a populated intervals.bed file with no errors. I can't figure out how to make CreateHaplotypesFromBAM.groovy use this intervals file, or how to get CreateHaplotypesFromBAM.groovy working any other way.
singularity exec -B ${WORKING_DIR}/phg $DOCKER /tassel-5-standalone/run_pipeline.pl -Xmx100G -debug -configParameters $CONFIG_FILE \
-CreateValidIntervalsFilePlugin -intervalsFile ${WORKING_DIR}/phg/anchors.bed \
-referenceFasta ${WORKING_DIR}/phg/inputDir/reference/iwgsc_refseqv2.1_assembly_chr_split.fa \
-mergeOverlaps true \
-generatedFile "$INTERVAL.bed" -endPlugin
I have also tried running CreateHaplotypesFromBAM.groovy with all the flags from my -CreateValidIntervalsFilePlugin and get the same result: an unpopulated intervals.bed file.
Below is a more complete error report and my config file:
[pool-1-thread-1] DEBUG net.maizegenetics.pangenome.db_loading.CreateIntervalBedFilesPlugin - Getting the db connection from the file
[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - first connection: dbName from config file = phg/srww_phg_v2dot1.db host: localHost user: sqlite type: sqlite
[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - Database URL: jdbc:sqlite:phg/srww_phg_v2dot1.db
[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.DBLoadingUtils - Connected to database:
[pool-1-thread-1] DEBUG net.maizegenetics.pangenome.db_loading.CreateIntervalBedFilesPlugin - Pulling the reference ranges from the graph stored in the database
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - referenceRanges: query statement: select reference_ranges.ref_range_id, chrom, range_start, range_end, methods.name from reference_ranges INNER JOIN ref_range_ref_range_method on ref_range_ref_range_method.ref_range_id=reference_ranges.ref_range_id INNER JOIN methods on ref_range_ref_range_method.method_id = methods.method_id AND methods.method_type = 7 ORDER BY reference_ranges.ref_range_id
methods size: 1
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - referenceRanges: number of reference ranges: 27978
[pool-1-thread-1] INFO net.maizegenetics.pangenome.api.CreateGraphUtils - referenceRanges: time: 0.133820701 secs.
[pool-1-thread-1] DEBUG net.maizegenetics.pangenome.db_loading.CreateIntervalBedFilesPlugin - Writing out the BED files using the reference ranges pulled from the graph.
Config:
###config file.
### Anything marked with UNASSIGNED needs to be set for at least one of the steps
### If it is marked as OPTIONAL, it will only need to be set if you want to run specific steps.
host=localHost
user=sqlite
password=sqlite
DB=phg/srww_phg_v2dot1.db
DBtype=sqlite
outputDir=phg/outputDir
##Step 1B
# Load genome intervals parameters
referenceFasta=/90daydata/genolabswheatphg/SRWW_PHG_3/phg/inputDir/reference/iwgsc_refseqv2.1_assembly_chr_split.fa
anchors=/90daydata/genolabswheatphg/SRWW_PHG_3/phg/anchors.bed
genomeData=phg/inputDir/reference/load_genome_data.txt
localGVCFFolder=phg/outputDir/GVCF_local
###Not included in example config
refServerPath=Atlas-dtn.hpc.msstate.edu;/project/genolabswheatphg/srww_phg/ref
liquibaseOutdir=phg/outputDir
#System parameters. Xmx is the java heap size and numThreads will be used to set threads available for multithreading components.
Xmx=100G
numThreads=20
##Keyfile location.
keyFile=phg/loadHapsGVCF_keyfile_BAMS.txt
#keyFile=phg/loadHaps_fasta_keyfile.txt
asmMethodName=mummer4
wgsMethodName=GATK_PIPELINE
consensusMethodName=CONSENSUS
inputConsensusMethods=GATK_PIPELINE
fastqFileDir=phg/inputDir/loadDB/fastq/
dedupedBamDir=phg/inputDir/loadDB/bam/dedup/
#dedupedBamDir=phg/inputDir/BAMs/
#gvcfFileDir=phg/inputDir/loadDB/gvcf/
gvcfDir=phg/inputDir/loadDB/gvcf/
#localGVCFFolder=phg/outputDir/GVCF_local
filteredBamDir=phg/inputDir/BAMs_filtered/
wgsKeyFile=phg/loadHapsGVCF_keyfile_BAMS.txt
mapQ=48
refRangeMethods=FocusRegion,FocusComplement
extendedWindowSize=1000
haplotypeMethodName=TEST_PARENT_LOAD
gvcfFileDir =phg/inputDir/loadDB/gvcf/
tempFileDir =phg/inputDir/loadDB/bam/temp/
filteredOutputBAMDir=phg/inputDir/loadDB/bam/mapqFiltered/
dedupedBAMDir=phg/inputDir/loadDB/bam/dedup/
intervalsFile=phg/anchors.bed
generatedFile=phg/inputDir/loadDB/bam/temp/intervals.bed
###Assembly from alignment using anchorwave settings
#AssemblyMAFFromAnchorWavePlugin.outputDir=phg/outputDir
#AssemblyMAFFromAnchorWavePlugin.keyFile=phg/anchorwave_keyfile.txt
#AssemblyMAFFromAnchorWavePlugin.gffFile=phg/anchors.gff3
#AssemblyMAFFromAnchorWavePlugin.refFasta=phg/inputDir/reference/iwgsc_refseqv2.1_assembly_chr_split.fa
#AssemblyMAFFromAnchorWavePlugin.threadsPerRun=4
#AssemblyMAFFromAnchorWavePlugin.numRuns=2
# WGS Haplotype Filtering criteria. These are the defaults.
GQ_min=50
QUAL_min=200
DP_poisson_min=.01
DP_poisson_max=.99
filterHets=true
##Consensus Plugin Parameters
minFreq=0.5
maxClusters=30
minSite=30
minCoverage=0.1
maxThreads=10
minTaxa=1
mxDiv=0.01
#This sets the type of clustering mode.
#Valid params are: upgma, upgma_assembly, and kmer_assembly
#The two assembly parameters are designed for assembly haplotypes and will choose a representative haplotype as the consensus instead of attempting to merge calls like with upgma.
clusteringMode=kmer_assembly
#If you want to use an assembly clusteringMode, you must have a ranking file.
#The ranking file must be a tab separated list of taxon\trankingScore where higher numbers are a better rank. This file is used to chose the representative haplotype
rankingFile=phg/ranking_file.txt
##Optional if you want to use kmer_assembly as the clusteringMode. Otherwise is ignored
kmerSize=7
distanceCalculation=Euclidean
##Graph building parameters
includeVariants=true
You should not need to run CreateValidIntervalsFilePlugin. Instead, use the same BED file you used during step 1 to populate the DB. Looking at your log, my best guess is that it's the same BED file as here:
Does this file have data in it? Also verify that there were no errors in the log of the Step 1 run.
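One quick way to check whether a BED file has data in it is sketched below. This is only an illustration: bed_interval_count is a made-up helper name, and the default path phg/anchors.bed is an assumption taken from the intervalsFile= line of the config above, so point BED_FILE at whichever file you actually loaded in step 1.

```shell
# Hypothetical helper (not part of PHG): count the interval lines in a BED file,
# skipping blank lines and "#", "track" or "browser" header lines.
bed_interval_count() {
    grep -Ev '^[[:space:]]*$|^(#|track|browser)' "$1" | wc -l | tr -d ' '
}

# Assumed path, copied from intervalsFile= in the config; override as needed.
BED_FILE="${BED_FILE:-phg/anchors.bed}"

# -s is true only if the file exists and is larger than zero bytes.
if [ -s "$BED_FILE" ]; then
    echo "$BED_FILE: $(bed_interval_count "$BED_FILE") interval lines"
else
    echo "$BED_FILE is missing or empty"
fi
```

If the file is read from inside the container, it is worth running the same check through singularity exec with the same -B bind, since a relative path such as phg/anchors.bed may resolve differently there.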