8 months ago
b.contreras.moreira
Hi, after completing the MakeDefaultDirectoryPlugin step, I am now trying MakeInitialPHGDBPipelinePlugin. Following the documentation, my config file contains at the top:
host=localHost
user=sqlite
password=sqlite
DB=/phg/phg.db
DBtype=sqlite
referenceFasta=/phg/inputDir/reference/ref.fna.gz
anchors=/phg/inputDir/reference/genes.bed
genomeData=/phg/inputDir/reference/load_genome_data.txt
refServerPath=/path/to/whatever
# liquibase results output directory, general output directory
outputDir=/phg/outputDir
liquibaseOutdir=/phg/outputDir
# plus the rest of the lines created automatically during MakeDefaultDirectoryPlugin,
# see the first few ones:
HaplotypeGraphBuilderPlugin.configFile=**UNASSIGNED**
CreateIntervalBedFilesPlugin.dbConfigFile=**UNASSIGNED**
CreateIntervalBedFilesPlugin.refRangeMethods=**UNASSIGNED**
...
The actual Docker command is:
WORKING_DIR=/data/phg/
DOCKER_CONFIG_FILE=/phg/config.txt
docker run --name create_initial_db --rm \
-v ${WORKING_DIR}/:/phg/ \
-t maizegenetics/phg:1.9 \
/tassel-5-standalone/run_pipeline.pl -Xmx100G \
-debug -configParameters ${DOCKER_CONFIG_FILE} \
-MakeInitialPHGDBPipelinePlugin -endPlugin
It seems to populate the SQLite database correctly, but then some errors occur:
[pool-2-thread-1] ERROR net.maizegenetics.util.Utils - getBufferedReader: Error getting reader for: **UNASSIGNED**
...
[pool-2-thread-1] ERROR net.maizegenetics.plugindef.AbstractPlugin - Utils: getBufferedReader: problem getting reader: **UNASSIGNED** (No such file or directory)
...
[pool-2-thread-1] ERROR net.maizegenetics.pangenome.liquibase.LiquibaseUpdatePlugin - LiquibaseUpdatePLugin: File /phg/outputDir/run_yes.txt does not exist. CheckDBVersionPlugin has determined your database is not recent enough to be updated with liquibase. It does not contain the variants table.
...
[pool-2-thread-1] ERROR net.maizegenetics.plugindef.AbstractPlugin - LiquibaseUpdatePlugin::processData: problem running liquibase
I guess I need to add more details to config.txt, but I would need some help.
Thanks, Bruno
Bruno - thanks for sending the log file. MakeInitialPHGDBPipeline calls LoadAllIntervalsToPHGdbPlugin, and that plugin has required parameters.
In the default config file, under the "#Required Parameters" section, you will need to fill out all parameters for the LoadAllIntervalsToPHGdbPlugin plugin. Replace "UNASSIGNED" with values relative to your Docker folders.
LoadAllIntervalsToPHGdbPlugin.genomeData=UNASSIGNED
LoadAllIntervalsToPHGdbPlugin.outputDir=UNASSIGNED
LoadAllIntervalsToPHGdbPlugin.ref=UNASSIGNED
LoadAllIntervalsToPHGdbPlugin.refServerPath=UNASSIGNED
LoadAllIntervalsToPHGdbPlugin.gvcfServerPath=UNASSIGNED  // note - this one is no longer valid - ignore it
LoadAllIntervalsToPHGdbPlugin.anchors=UNASSIGNED
See the example on this page of the documentation (under "Config File Parameters for this step"): https://bitbucket.org/bucklerlab/practicalhaplotypegraph/wiki/UserInstructions/MakeInitialPHGDBPipeline.md
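For example, given the Docker paths in your config at the top of this thread, a filled-in section could look like the following (the values are illustrative, adjust them to your own folder layout; gvcfServerPath is omitted since it is no longer valid):

```
LoadAllIntervalsToPHGdbPlugin.genomeData=/phg/inputDir/reference/load_genome_data.txt
LoadAllIntervalsToPHGdbPlugin.outputDir=/phg/outputDir
LoadAllIntervalsToPHGdbPlugin.ref=/phg/inputDir/reference/ref.fna.gz
LoadAllIntervalsToPHGdbPlugin.refServerPath=/path/to/whatever
LoadAllIntervalsToPHGdbPlugin.anchors=/phg/inputDir/reference/genes.bed
```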
Let me know if you have further questions.
Thanks lcj34. These parameters are already set in config.txt without the "LoadAllIntervalsToPHGdbPlugin." prefix. Should I duplicate them? Any help appreciated, Bruno
No, you shouldn't need to duplicate them. Based on the log file, the specific missing parameter was genomeData. Do you have that set?
Regarding "ref" vs "referenceFasta". If you call LoadAllIntervalsToPHGdbPlugin() directly it looks for the "ref" parameter.
But when called from MakeInitialPHGDBPipeline, the latter looks for the "referenceFasta" parameter in the config file and sends that value. That is what is shown in the documentation page I posted.
I agree - it is confusing that these are different. I think "referenceFasta" was used by multiple plugins, so MakeInitialPHGDBPipeline searched for that value to send as a parameter.
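To make the distinction concrete, here is an illustrative (not official) pairing of the two spellings, using the reference path from the config at the top of the thread:

```
# Calling LoadAllIntervalsToPHGdbPlugin directly: it looks for the
# prefixed "ref" parameter
LoadAllIntervalsToPHGdbPlugin.ref=/phg/inputDir/reference/ref.fna.gz

# Calling via MakeInitialPHGDBPipelinePlugin: the pipeline reads the
# unprefixed "referenceFasta" key and passes its value along
referenceFasta=/phg/inputDir/reference/ref.fna.gz
```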
Thanks lcj34, it seems to be running now; I will see if it completes and will post here if needed. I did need to set the following params, despite the fact that they were already set in config.txt without the "LoadAllIntervalsToPHGdbPlugin." prefix (see above).
Note my interval file contains merged gene regions, as the original ones overlapped in some cases (mostly on different strands) and I was warned about that. In case this helps anyone, I did it with:
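(The exact command is not shown above. As a sketch, overlapping or book-ended BED intervals are usually merged with `bedtools merge` on a sorted file; the portable sort + awk equivalent below does the same thing, with hypothetical file names and coordinates.)

```shell
# Hypothetical input: three intervals on chr1, two of them overlapping.
printf 'chr1\t200\t400\nchr1\t100\t250\nchr1\t500\t600\n' > genes.bed

# Sort by chromosome then start, and merge overlapping or book-ended
# intervals (the same default behaviour as `bedtools merge`).
sort -k1,1 -k2,2n genes.bed |
awk 'BEGIN{OFS="\t"}
     NR==1 {c=$1; s=$2; e=$3; next}
     $1==c && $2<=e { if ($3>e) e=$3; next }   # overlaps/abuts: extend
     { print c, s, e; c=$1; s=$2; e=$3 }       # gap: emit merged interval
     END { print c, s, e }' > genes.merged.bed

cat genes.merged.bed
```

With the input above this produces two merged intervals, chr1:100-400 and chr1:500-600.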
Eventually LoadAllIntervalsToPHGdbPlugin loaded all intervals from the input BED file, but then it tried to load variants as well and failed. I thought I was only loading intervals; how can I state that I don't want to load any gVCF files yet? Please see the output:
java.lang.IllegalArgumentException: LoadAllIntervalsToPHGdbPlugin : error processing/loading intervals bgzipAndIndexGVCFfile: error bgzipping and/or tabix'ing file /phg/inputDir/reference/Morex.gvcf
In case it helps, this is how my file inputDir/reference/load_genome_data.txt looks:
Note I added the last column, as it complained otherwise. Thanks again lcj34, Bruno
The code is only processing the reference. Once it finishes loading the reference, it creates a gvcf file of the reference data, then tries to compress and index it with bgzip and tabix. The error indicates there was a problem indexing this file, and I assume this is due to sequence size. From the tabix (samtools) man page: "The tabix (.tbi) and BAI index formats can handle individual chromosomes up to 512 Mbp (2^29 bases) in length. If your input file might contain data lines with begin or end positions greater than that, you will need to use a CSI index."
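As a quick check (a sketch, not part of the PHG pipeline), you can scan a samtools-faidx .fai index for sequences longer than the 2^29-base limit; the index contents below are mock values for illustration:

```shell
# A .fai index lists, per sequence: name, length, and offset columns.
# These barley-style names and lengths are made up for the example.
printf 'chr1H\t558535432\nchr2H\t768075024\nchr7H\t245697919\n' > ref.fna.fai

# 536870912 = 2^29, the maximum position a .tbi index can handle.
awk '$2 > 536870912 { print $1 " exceeds the .tbi limit (" $2 " bp)" }' ref.fna.fai
```

Any sequence flagged here will make tabix's default .tbi indexing fail, which matches the error you are seeing.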
We've seen this problem with large genomes, e.g. wheat. In phgv1 we use htsjdk to read and query gvcfs at later stages of the pipeline. This is where the indexed files are required, and htsjdk does not support CSI indices, only .tbi. We have solved this in phgv2 by using TileDB and AGC, and by indexing with CSI.
The solution for you would be to split your chromosomes and/or try running with phgv2.
OK two more questions:
Thanks so much for your help lcj34
Yes, you will need to merge chr arms; the v1 code does not handle that.
Regarding phg_v2: yes, that is the GitHub repository. I think the documentation should be sufficient. We are not using Docker - only conda. Also note that currently this only runs on Linux. We will support macOS soon, but have no plans to support Windows. However, you can run on Windows if you install Windows Subsystem for Linux (WSL). We have users who have done that successfully.