Setup PHG with Singularity
3.8 years ago • petinho86

Hi,

I am trying to set up PHG. I am using Singularity, as I am working on a cluster.

The installation is pretty simple, as described at: https://bitbucket.org/bucklerlab/practicalhaplotypegraph/wiki/UserInstructions/CreatePHG_step0_singularity.md

So I ran:

$ cd phg_singularity 
$ module load singularity
$ singularity pull docker://maizegenetics/phg
$ singularity pull docker://maizegenetics/phg_liquibase
$ singularity build phg_22.simg docker://maizegenetics/phg:0.0.22

And followed up with building the default data structure:

$ cd ../phg_run
$ singularity exec -B /absolute/path/phg_singularity/:/phg/ /absolute/path/phg_singularity/phg_22.simg /tassel-5-standalone/run_pipeline.pl -debug -Xmx1G -MakeDefaultDirectoryPlugin -workingDir /phg/ -endPlugin

This ran without trouble. I then copied files into the proper directories and filled in phg_run/inputDir/load_genome_data.txt and the other key files.
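In case it helps others: load_genome_data.txt is a small tab-separated keyfile. The sketch below follows the wiki template I copied; the column names are taken from that template and the data row is just a placeholder, not my real data:

Genotype  Hapnumber  Dataline   ploidy  genesPhased  chromsPhased  confidence  Method        MethodDetails
Ref       0          Reference  1       true         true          1           Ref_Assembly  reference assembly

(Double-check the column names against the wiki version you use; treat the row as a placeholder.)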

The next step would be the setup of the initial PHG database: https://bitbucket.org/bucklerlab/practicalhaplotypegraph/wiki/UserInstructions/MakeInitialPHGDBPipeline.md

Sadly, all steps from here on are documented only for the Docker installation of PHG.

I changed:

$ WORKING_DIR=local/directory/where/MakeDefaultDirectory/was/run/
$ DOCKER_CONFIG_FILE=/phg/config.txt

$ docker run --name create_initial_db --rm \
    -v ${WORKING_DIR}/:/phg/ \
    -t maizegenetics/phg:latest \
    /tassel-5-standalone/run_pipeline.pl -Xmx100G -debug -configParameters ${DOCKER_CONFIG_FILE} \
    -MakeInitialPHGDBPipelinePlugin -endPlugin

Into this Singularity version (probably not quite right, but it seems to run somehow):

$ cd ../phg_run
$ DOCKER_CONFIG_FILE=config.txt
$ singularity exec \
    -B /absolute/path/phg_singularity/:/phg/ /absolute/path/phg_singularity/phg_22.simg \
    /tassel-5-standalone/run_pipeline.pl -Xmx100G -debug -configParameters ${DOCKER_CONFIG_FILE} \
    -MakeInitialPHGDBPipelinePlugin -endPlugin

While this appears to run, it crashes on the **UNASSIGNED** values in the config.txt file.

An example of config.txt is given on: https://bitbucket.org/bucklerlab/practicalhaplotypegraph/wiki/UserInstructions/MakeInitialPHGDBPipeline.md

Yet the example file:

host=localHost
user=sqlite
password=sqlite
DB=/phg/phg_db_name.db
DBtype=sqlite
# Load genome intervals parameters
referenceFasta=/phg/inputDir/reference/Ref.fa
anchors=/phg/anchors.bed
genomeData=/phg/inputDir/reference/load_genome_data.txt
refServerPath=irods:/ibl/home/assemblies/
#liquibase results output directory, general output directory
outputDir=/phg/outputDir
liquibaseOutdir=/phg/outputDir

looks very different from the config file which was created in my file system:

########################################
#Required Parameters:
########################################
HaplotypeGraphBuilderPlugin.methods=**UNASSIGNED**
HaplotypeGraphBuilderPlugin.configFile=**UNASSIGNED**
CreateIntervalBedFilesPlugin.dbConfigFile=**UNASSIGNED**
CreateIntervalBedFilesPlugin.refRangeMethods=**UNASSIGNED**
GetDBConnectionPlugin.create=**UNASSIGNED**
GetDBConnectionPlugin.config=**UNASSIGNED**
LoadAllIntervalsToPHGdbPlugin.genomeData=**UNASSIGNED**
LoadAllIntervalsToPHGdbPlugin.outputDir=**UNASSIGNED**
LoadAllIntervalsToPHGdbPlugin.ref=**UNASSIGNED**
LoadAllIntervalsToPHGdbPlugin.anchors=**UNASSIGNED**
LoadHaplotypesFromGVCFPlugin.wgsKeyFile=**UNASSIGNED**
LoadHaplotypesFromGVCFPlugin.bedFile=**UNASSIGNED**
LoadHaplotypesFromGVCFPlugin.haplotypeMethodName=**UNASSIGNED**
LoadHaplotypesFromGVCFPlugin.gvcfDir=**UNASSIGNED**
LoadHaplotypesFromGVCFPlugin.referenceFasta=**UNASSIGNED**
FilterGVCFSingleFilePlugin.inputGVCFFile=**UNASSIGNED**
FilterGVCFSingleFilePlugin.outputGVCFFile=**UNASSIGNED**
FilterGVCFSingleFilePlugin.configFile=**UNASSIGNED**
RunHapConsensusPipelinePlugin.collapseMethod=**UNASSIGNED**
RunHapConsensusPipelinePlugin.dbConfigFile=**UNASSIGNED**
AssemblyHaplotypesMultiThreadPlugin.outputDir=**UNASSIGNED**
AssemblyHaplotypesMultiThreadPlugin.keyFile=**UNASSIGNED**
referenceFasta=**UNASSIGNED**

########################################
#Defaulted parameters:
########################################
HaplotypeGraphBuilderPlugin.includeSequences=true
HaplotypeGraphBuilderPlugin.includeVariantContexts=false
CreateIntervalBedFilesPlugin.windowSize=1000
CreateIntervalBedFilesPlugin.bedFile=intervals.bed
LoadHaplotypesFromGVCFPlugin.queueSize=30
LoadHaplotypesFromGVCFPlugin.mergeRefBlocks=false
LoadHaplotypesFromGVCFPlugin.numThreads=3
LoadHaplotypesFromGVCFPlugin.maxNumHapsStaged=10000
RunHapConsensusPipelinePlugin.minTaxa=1
RunHapConsensusPipelinePlugin.distanceCalculation=Euclidean
RunHapConsensusPipelinePlugin.minFreq=0.5
RunHapConsensusPipelinePlugin.minCoverage=0.1
RunHapConsensusPipelinePlugin.mxDiv=0.01
RunHapConsensusPipelinePlugin.clusteringMode=upgma
RunHapConsensusPipelinePlugin.maxClusters=30
RunHapConsensusPipelinePlugin.minSites=30
RunHapConsensusPipelinePlugin.maxThreads=1000
RunHapConsensusPipelinePlugin.kmerSize=7
AssemblyHaplotypesMultiThreadPlugin.mummer4Path=/mummer/bin/
AssemblyHaplotypesMultiThreadPlugin.loadDB=true
AssemblyHaplotypesMultiThreadPlugin.minInversionLen=7500
AssemblyHaplotypesMultiThreadPlugin.assemblyMethod=mummer4
AssemblyHaplotypesMultiThreadPlugin.entryPoint=all
AssemblyHaplotypesMultiThreadPlugin.numThreads=3
AssemblyHaplotypesMultiThreadPlugin.clusterSize=250
numThreads=10
Xmx=10G
picardPath=/picard.jar
gatkPath=/gatk/gatk
tasselLocation=/tassel-5-standalone/run_pipeline.pl
fastqFileDir=/tempFileDir/data/fastq/
tempFileDir=/tempFileDir/data/bam/temp/
dedupedBamDir=/tempFileDir/data/bam/DedupBAMs/
filteredBamDir=/tempFileDir/data/bam/filteredBAMs/
gvcfFileDir=/tempFileDir/data/gvcfs/
extendedWindowSize=1000
mapQ=48

#Sentieon Parameters.  Uncomment and set to use sentieon:
#sentieon_license=**UNASSIGNED**
#sentieonPath=/sentieon/bin/sentieon


########################################
#Optional Parameters With No Default Values:
########################################
HaplotypeGraphBuilderPlugin.chromosomes=null
HaplotypeGraphBuilderPlugin.haplotypeIds=null
CreateIntervalBedFilesPlugin.extendedBedFile=null
LoadHaplotypesFromGVCFPlugin.haplotypeMethodDescription=null
RunHapConsensusPipelinePlugin.referenceFasta=null
RunHapConsensusPipelinePlugin.rankingFile=null
RunHapConsensusPipelinePlugin.collapseMethodDetails=null
AssemblyHaplotypesMultiThreadPlugin.gvcfOutputDir=null


#FilterGVCF Parameters.  Adding any of these will add more filters.
#exclusionString=**UNASSIGNED**
#DP_poisson_min=0.0
#DP_poisson_max=1.0
#DP_min=**UNASSIGNED**
#DP_max=**UNASSIGNED**
#GQ_min=**UNASSIGNED**
#GQ_max=**UNASSIGNED**
#QUAL_min=**UNASSIGNED**
#QUAL_max=**UNASSIGNED**
#filterHets=**UNASSIGNED**

This is where I get stuck: I cannot figure out what to fill into the **UNASSIGNED** sections. I assume the file looks different because of the Singularity setup I ran? Can you give me an example of this file?

It would also be very helpful if you could add a Singularity version of the commands shown in the wiki, as they do seem very different from the Docker commands...
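As far as I can tell, the general mapping between the two is as follows (a sketch, assuming the image was pulled as above; -v becomes -B, and the image tag becomes the pulled .simg file):

# Docker form used throughout the wiki:
# docker run --rm -v /host/dir/:/phg/ -t maizegenetics/phg:latest <command> <args>
# Singularity equivalent:
$ singularity exec -B /host/dir/:/phg/ /absolute/path/phg_singularity/phg_22.simg <command> <args>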

Cheers Jakob


Hello,

The main reason it looks so different is that MakeDefaultDirectoryPlugin attempts to create a config file based on all of the parameters available to all the plugins that are run for the pipelines, so it has a lot of extra parameters. The parameters marked with **UNASSIGNED** are not automatically filled, as they are mostly parameters you should set yourself or names of your specific files. That being said, the documentation should be a lot better with respect to the configuration parameters and have clearer examples.

If you replace all of the **UNASSIGNED** parameters in your config with the following, it should work better.

#General Graph Building parameters
#Should depend on either asmMethod, LoadHaplotypesFromGVCFPlugin.haplotypeMethodName or RunHapConsensusPipelinePlugin.collapseMethod depending on what you are running.
HaplotypeGraphBuilderPlugin.methods=mummer4

LoadAllIntervalsToPHGdbPlugin.genomeData=/phg/inputDir/reference/Ref_Assembly_load_data.txt
LoadAllIntervalsToPHGdbPlugin.outputDir=/phg/outputDir/align/
LoadAllIntervalsToPHGdbPlugin.anchors=/phg/anchors.bed

#These two need to match. And the Ref.fa should be replaced with your reference.
referenceFasta=/phg/inputDir/reference/Ref.fa
LoadAllIntervalsToPHGdbPlugin.ref=/phg/inputDir/reference/Ref.fa

#Loading WGS Haplotype Parameters.
LoadHaplotypesFromGVCFPlugin.wgsKeyFile=/phg/keyFile.txt

## This should be assigned an informative method name to be loaded into the db.  
#It should represent the samples and types of data creating the haplotypes.
LoadHaplotypesFromGVCFPlugin.haplotypeMethodName=GATK_PIPELINE
LoadHaplotypesFromGVCFPlugin.gvcfDir=/phg/inputDir/loadDB/gvcf/

#Set this if you want to store a different name
RunHapConsensusPipelinePlugin.collapseMethod=CONSENSUS

AssemblyHaplotypesMultiThreadPlugin.outputDir=/phg/outputDir/align/
AssemblyHaplotypesMultiThreadPlugin.keyFile=/phg/asm_keyFile.txt

If you are loading in assemblies only as haplotypes, you should remove all of the lines starting with LoadHaplotypesFromGVCFPlugin; an assemblies-only subset is sketched below. And if you are creating haplotypes based on WGS only, remove the lines starting with AssemblyHaplotypesMultiThreadPlugin. You will need to set referenceFasta and LoadAllIntervalsToPHGdbPlugin.ref to match the names of your files, otherwise it will throw more errors.
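For example, an assemblies-only config would keep just this subset of the lines above (same placeholder paths as before):

HaplotypeGraphBuilderPlugin.methods=mummer4
LoadAllIntervalsToPHGdbPlugin.genomeData=/phg/inputDir/reference/Ref_Assembly_load_data.txt
LoadAllIntervalsToPHGdbPlugin.outputDir=/phg/outputDir/align/
LoadAllIntervalsToPHGdbPlugin.anchors=/phg/anchors.bed
referenceFasta=/phg/inputDir/reference/Ref.fa
LoadAllIntervalsToPHGdbPlugin.ref=/phg/inputDir/reference/Ref.fa
RunHapConsensusPipelinePlugin.collapseMethod=CONSENSUS
AssemblyHaplotypesMultiThreadPlugin.outputDir=/phg/outputDir/align/
AssemblyHaplotypesMultiThreadPlugin.keyFile=/phg/asm_keyFile.txt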

I will update the documentation here to include a simple suggested config file. We will also look into adding Singularity instructions.

Thanks, Zack Miller


Hi Zack,

Thanks a lot.

I started to fill in the blanks, but there are still plenty of **UNASSIGNED** sections left.

I figured that for the initial run I mainly need to set up the required parameters (this is how my config.txt looks now):

########################################
#Required Parameters:
########################################
HaplotypeGraphBuilderPlugin.methods=mummer4
HaplotypeGraphBuilderPlugin.configFile=**UNASSIGNED**
CreateIntervalBedFilesPlugin.dbConfigFile=**UNASSIGNED**
CreateIntervalBedFilesPlugin.refRangeMethods=**UNASSIGNED**
GetDBConnectionPlugin.create=**UNASSIGNED**
GetDBConnectionPlugin.config=**UNASSIGNED**
LoadAllIntervalsToPHGdbPlugin.genomeData=/phg/inputDir/reference/load_genome_data.txt
LoadAllIntervalsToPHGdbPlugin.outputDir=/phg/outputDir/align/
LoadAllIntervalsToPHGdbPlugin.ref=/phg/inputDir/reference/Ref.fa
LoadAllIntervalsToPHGdbPlugin.anchors=/phg/anchors.bed
LoadHaplotypesFromGVCFPlugin.wgsKeyFile=/phg/wgs_keyfile.txt
LoadHaplotypesFromGVCFPlugin.bedFile=**UNASSIGNED**
LoadHaplotypesFromGVCFPlugin.haplotypeMethodName=GATK_PIPELINE
LoadHaplotypesFromGVCFPlugin.gvcfDir=/phg/inputDir/loadDB/gvcf/
LoadHaplotypesFromGVCFPlugin.referenceFasta=**UNASSIGNED**
FilterGVCFSingleFilePlugin.inputGVCFFile=**UNASSIGNED**
FilterGVCFSingleFilePlugin.outputGVCFFile=**UNASSIGNED**
FilterGVCFSingleFilePlugin.configFile=**UNASSIGNED**
RunHapConsensusPipelinePlugin.collapseMethod=CONSENSUS
RunHapConsensusPipelinePlugin.dbConfigFile=**UNASSIGNED**
AssemblyHaplotypesMultiThreadPlugin.outputDir=/phg/outputDir/align/
AssemblyHaplotypesMultiThreadPlugin.keyFile=/phg/asm_keyFile.txt
referenceFasta=/phg/inputDir/reference/Ref.fa

Again, it is still not very close to the file you have, but some lines do match. This is probably because of Singularity?

I am mainly struggling to find these plugin configs.

I assume the plugins are sitting somewhere in the Singularity container? Or are they somewhere within the working directory?

Cheers Jakob


Hello,

I meant that you should replace all of the **UNASSIGNED** ones with what I had posted. I removed a lot of the plugin-specific parameters where they share parameters with other plugins.

TASSEL (which the PHG is based on) allows you to just specify the shared parameter name. An example of this is referenceFasta, which is shared by a number of plugins. You could specify Plugin1.referenceFasta, Plugin2.referenceFasta, and so on, but for your use they are all the same. Some of the parameters have good defaults as well, so they can be removed.
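For instance, instead of qualifying the reference for each plugin separately:

LoadHaplotypesFromGVCFPlugin.referenceFasta=/phg/inputDir/reference/Ref.fa
RunHapConsensusPipelinePlugin.referenceFasta=/phg/inputDir/reference/Ref.fa

a single shared line covers both:

referenceFasta=/phg/inputDir/reference/Ref.fa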

The config file has nothing to do with singularity.

The plugins are sitting in the Docker container (and, by extension, the Singularity container) as part of TASSEL's lib folder (/tassel-5-standalone/lib/phg.jar). When you run /tassel-5-standalone/run_pipeline.pl -MakeDefaultDirectoryPlugin..., you are actually running a plugin that is inside the container.
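You can confirm the jar is there from outside the container, e.g. (assuming standard coreutils are present in the image):

$ singularity exec /absolute/path/phg_singularity/phg_22.simg ls /tassel-5-standalone/lib/ | grep -i phg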


Hi Zack,

Yes, Singularity has nothing to do with this. It looks like it all stands or falls with a proper config file.

I hacked the config file a lot and got a few steps further into the pipeline. I replaced the required-parameters section with the part you wrote, and I also added the SQL DB config section to the file, leading to this pretty hacky version:

#host option
host=localHost
user=sqlite
password=sqlite
DB=/phg/phg_db_name.db
DBtype=sqlite
#liquibase results output directory, general output directory
outputDir=/phg/outputDir
liquibaseOutdir=/phg/outputDir
refServerPath=/scratch/pawsey0149/jpetereit/graph/phg/inputDir/reference
numThreads=5

inputConsensusMethods=GATK_PIPELINE
consensusMethodName=CONSENSUS


#General Graph Building parameters
#Should depend on either asmMethod, LoadHaplotypesFromGVCFPlugin.haplotypeMethodName or RunHapConsensusPipelinePlugin.collapseMethod depending on what you are running.
HaplotypeGraphBuilderPlugin.methods=mummer4

LoadAllIntervalsToPHGdbPlugin.genomeData=/phg/inputDir/reference/load_genome_data.txt
LoadAllIntervalsToPHGdbPlugin.outputDir=/phg/outputDir/align/
LoadAllIntervalsToPHGdbPlugin.anchors=/phg/validBedFile-small.bed

#These two need to match. And the Ref.fa should be replaced with your reference.
referenceFasta=/phg/inputDir/reference/Oryza_sativa.IRGSP-1.0.dna.toplevel.fa
LoadAllIntervalsToPHGdbPlugin.ref=/phg/inputDir/reference/Oryza_sativa.IRGSP-1.0.dna.toplevel.fa


## This should be assigned an informative method name to be loaded into the db.
#It should represent the samples and types of data creating the haplotypes.
LoadHaplotypesFromGVCFPlugin.haplotypeMethodName=GATK_PIPELINE
LoadHaplotypesFromGVCFPlugin.gvcfDir=/phg/inputDir/loadDB/gvcf/
#Set this if you want to store a different name
RunHapConsensusPipelinePlugin.collapseMethod=CONSENSUS
gvcfOutputDir=/phg/outputDir/gvcf/
AssemblyHaplotypesMultiThreadPlugin.outputDir=/phg/outputDir/align/
AssemblyHaplotypesMultiThreadPlugin.keyFile=/phg/load_asm_genome_key_file.txt

########################################
#Defaulted parameters:
########################################


...
(remainder of the default options)

Then I created an arbitrary interval range file with 100 x 1000 bp ranges. This let me run MakeInitialPHGDBPipelinePlugin, partially.

These completed successfully: GetDBConnectionPlugin and LoadAllIntervalsToPHGdbPlugin.

This one failed: LiquibaseUpdatePlugin

I initially ran it with the Docker image phg:latest. This ran into an error, "Database too old (v 0.0.10) please update", which then fails because the database is too old to update via Liquibase.

Then I pulled a newer image from the nightly build, phg:0.0.26. It created a newer database (v 0.0.24), but that was then incompatible with PHG v0.0.26 (I assume that's a minor bug?).

So I pulled version 0.0.24 from Docker, which completed MakeInitialPHGDBPipelinePlugin (it still couldn't run Liquibase, but it finished).

But the database still seems corrupt. I made GVCFs, a haplotype keyfile, and a CreateHaplotypesFromGVCF.groovy config file (which is the combination of the host options and the following haplotype options):

REF_DIR=/phg/inputDir/reference/
GVCF_DIR=/phg/inputDir/loadDB/gvcf/
GVCF_DIR_IN_DOCKER=/tempFileDir/data/outputs/gvcfs/
DB=/phg/phg_db_name.db
CONFIG_FILE=/phg/config.txt
CONFIG_FILE_IN_DOCKER=/tempFileDir/data/config.txt
KEY_FILE=/phg/haplotype_key.txt
KEY_FILE_IN_DOCKER=/tempFileDir/data/keyFile.txt
wgsKeyFile=/phg/haplotype_key.txt
gvcfDir=/phg/inputDir/gvcf/
referenceFasta=/phg/inputDir/reference/Oryza_sativa.IRGSP-1.0.dna.toplevel.fa
haplotypeMethodName=GATK_PIPELINE

Then I run the CreateHaplotypesFromGVCF.groovy pipeline using Singularity:

$ singularity exec -B ./:/phg/ /group/pawsey0149/jpetereit/PHG/phg_24.simg /CreateHaplotypesFromGVCF.groovy -config haplotype_config.txt

It runs for a bit, but fails with the following error:

[DefaultDispatcher-worker-2] INFO net.maizegenetics.pangenome.db_loading.LoadHaplotypesFromGVCFPlugin - Done setting up variables for file /phg/inputDir/gvcf/SRR9969480.g.vcf. Moving on to processing reference ranges.
[pool-1-thread-1] DEBUG net.maizegenetics.plugindef.AbstractPlugin - Error writing to the DB:

I assume this is because of too many hacks on my side.

I also get pretty confused by the config file and the parameters; there are just so many.

Can you double-check what happens with these DB versions and the updates?

Would you normally run the whole pipeline with a single config file? Or is it maybe worth making a specific config file for each pipeline step?

Can you upload an example of a complete config file?


Hi,

I'll add the CreateHaplotypesFromGVCF log in the next few days. It did work in the end, and I probably had some mismatching BED files.

The out-of-memory problem seems very odd. I have a 430 Mb genome (rice) and 4 taxa, and I only added 1000 x 1000 bp ranges. Nothing spectacular.

I found the problem.

CreateConsensi.sh looks for Xmx in the config file, which I had not set, so it defaulted back to 1350 MB.

I added

Xmx=20G

to the config, and it no longer breaks.

I'll add more to this thread once everything works :)

Cheers J


Hi everyone,

Newbie here. I'm trying to run the CreateHaplotypesFromGVCF plugin and I keep getting the following error:

Error writing to the DB:
[pool-1-thread-1] DEBUG net.maizegenetics.plugindef.AbstractPlugin - Error writing to the DB:
java.lang.IllegalStateException: Error writing to the DB:
        at net.maizegenetics.pangenome.db_loading.LoadHaplotypesFromGVCFPlugin.processData(LoadHaplotypesFromGVCFPlugin.kt:226)
        at net.maizegenetics.plugindef.AbstractPlugin.performFunction(AbstractPlugin.java:111)
        at net.maizegenetics.plugindef.AbstractPlugin.dataSetReturned(AbstractPlugin.java:2017)
        at net.maizegenetics.plugindef.ThreadedPluginListener.run(ThreadedPluginListener.java:29)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.NoSuchElementException: Collection is empty.
        at kotlin.collections.CollectionsKt___CollectionsKt.first(_Collections.kt:184)
        at net.maizegenetics.pangenome.db_loading.LoadHaplotypesFromGVCFPlugin$processKeyFileEntry$2.invokeSuspend(LoadHaplotypesFromGVCFPlugin.kt:312)
        at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
        at kotlinx.coroutines.DispatchedTask.run(Dispatched.kt:241)
        at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:594)
        at kotlinx.coroutines.scheduling.CoroutineScheduler.access$runSafely(CoroutineScheduler.kt:60)
        at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:740)

I saw user petinho86 mention the error "couldn't write to the db". Was that the same error as mine? If yes, was the only problem the BED file and nothing else? I tried using the same BED file that I used for the DB, but that doesn't seem to help; I get the same error. Please let me know.

Thanks B


Please don't add questions as answers to a pre-existing thread. If your problem is different, then please start a new thread.

3.8 years ago • petinho86

Update

Hi again.

It turns out that having multiple config files makes this a lot easier to grasp. It would probably be good to show that in the description of each step, e.g. makeinitialdb_config.txt.

Here is the way it works for me:

Make the working directory using Singularity:

$ singularity exec -B ./:/phg/ /phg/singularity-image/path/PHG/phg_26.simg /tassel-5-standalone/run_pipeline.pl -debug -Xmx1G -MakeDefaultDirectoryPlugin -workingDir /phg/ -endPlugin
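Judging from the paths used further down in this post, the generated layout includes at least the following (a sketch; your PHG version may create more subfolders):

phg_run/
    config.txt
    inputDir/
        reference/          <- reference fasta + load_genome_data.txt
        loadDB/
            gvcf/           <- GVCFs to load as haplotypes
    outputDir/
        align/
        gvcf/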

Make initial DB:

First, I create a dummy ranges file with 1000 bp ranges. It looks like this:

1       0       1000    chr1_1
1       1000    2000    chr1_2
1       2000    3000    chr1_3
1       3000    4000    chr1_4
1       4000    5000    chr1_5
1       5000    6000    chr1_6
1       6000    7000    chr1_7
1       7000    8000    chr1_8
1       8000    9000    chr1_9
1       9000    10000   chr1_10
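A quick way to generate these dummy ranges, e.g. 100 of them (a sketch; it assumes chromosome "1" is at least 100 kb long):

$ for i in $(seq 0 99); do printf "1\t%d\t%d\tchr1_%d\n" $((i*1000)) $(((i+1)*1000)) $((i+1)); done > intervals.bed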

Then I run the BED file validator on it:

$ singularity exec -B ./:/phg/ /phg/singularity-image/path/PHG/phg_26.simg /tassel-5-standalone/run_pipeline.pl -Xmx50G -debug -CreateValidIntervalsFilePlugin -intervalsFile intervals.bed -referenceFasta inputDir/reference/Oryza_sativa.IRGSP-1.0.dna.toplevel.fa -mergeOverlaps true -generatedFile validBedFile.bed -endPlugin

Then I create a new config file (and ignore the initial config file):

$ vim makeinitialdb_config.txt


#host option
host=localHost
user=sqlite
password=sqlite
DB=/phg/rice.db
DBtype=sqlite
#liquibase results output directory, general output directory
outputDir=/phg/outputDir
liquibaseOutdir=/phg/outputDir
refServerPath=/phg/inputDir/reference
# Load genome intervals parameters
referenceFasta=/phg/inputDir/reference/Oryza_sativa.IRGSP-1.0.dna.toplevel.fa
genomeData=/phg/inputDir/reference/load_genome_data.txt
anchors=/phg/validBedFile.bed

Add haplotypes:

I have HiFi data, and adding it via the WGS route doesn't seem to run properly (I assume BWA doesn't like it), so I made GVCF files with GATK. I haven't figured out how to add GVCF files using PopulatePHGDBPipelinePlugin, so I used the subscript CreateHaplotypesFromGVCF.groovy.

It needs a config file, which I named addhaplotypes_config.txt:

#host option
host=localHost
user=sqlite
password=sqlite
DB=/phg/rice.db
DBtype=sqlite
#liquibase results output directory, general output directory
outputDir=/phg/outputDir
liquibaseOutdir=/phg/outputDir
refServerPath=/phg/inputDir/reference
# Load genome intervals parameters
referenceFasta=/phg/inputDir/reference/Oryza_sativa.IRGSP-1.0.dna.toplevel.fa
anchors=/phg/validBedFile.bed
genomeData=/phg/inputDir/reference/load_genome_data.txt

#Set these properties
REF_DIR=/phg/inputDir/reference/
GVCF_DIR=/phg/inputDir/loadDB/gvcf/
GVCF_DIR_IN_DOCKER=/tempFileDir/inputDir/loadDB/gvcf/
DB=/phg/rice.db
CONFIG_FILE=/phg/sql_config.txt
CONFIG_FILE_IN_DOCKER=/tempFileDir/sql_config.txt
KEY_FILE=/phg/haplotype_key.txt
KEY_FILE_IN_DOCKER=/tempFileDir/haplotype_key.txt
wgsKeyFile=haplotype_key.txt
gvcfDir=/phg/inputDir/loadDB/gvcf/
haplotypeMethodName=GATK_PIPELINE

This also needs a haplotype keyfile and a load_genome_data keyfile, which I set up as described in the wiki. The only confusing part is that the haplotype keyfile parameter is called wgsKeyFile. On top of that, the config file specifies the path to a host-options file, which I called sql_config.txt:

#host option
host=localHost
user=sqlite
password=sqlite
DB=/phg/rice.db
DBtype=sqlite

Now I can add the haplotypes to the DB:

$ singularity exec -B ./:/phg/ /group/pawsey0149/PHG/phg_26.simg /CreateHaplotypesFromGVCF.groovy -config addhaplotypes_config.txt

I encountered this error multiple times:

Error: couldn't write to the database

which appears to happen when validBedFile.bed is not 100% right.
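A quick sanity check is to confirm that the BED file used here is identical to (or a subset of) the one used to create the DB (a sketch; db_creation.bed is a hypothetical placeholder name):

$ diff <(sort -k1,1 -k2,2n validBedFile.bed) <(sort -k1,1 -k2,2n db_creation.bed)    # any output means the ranges differ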

Create consensus:

This is, as I understand it, the last part of step 2:

$ singularity exec -B ./:/phg/ /phg/singularity-image/path/PHG/phg_26.simg /CreateConsensi.sh sql_config.txt Oryza_sativa.IRGSP-1.0.dna.toplevel.fa GATK_PIPELINE CONSENSUS

This step appears to use lots of memory. I ran it once and ran out of memory with 100 G. I am currently retrying with 1 T to see how it goes.

I am sure I still have plenty of unnecessary lines in some of the config files, but at least it's running at the moment. I hope this helps anyone who is stuck with it like me :)

Cheers J


Would you mind sharing the log file from when you are running CreateHaplotypesFromGVCF? You definitely need to use the same BED file as you used to initially make the DB (or at least a subset of the records). If the coordinates are different, it will not work.

Consensus can take a fair bit of RAM, but over 100 G seems like a lot. How many taxa do you have, how long is the genome, and how many reference ranges do you have?
