Errors running PHG MakeInitialPHGDBPipelinePlugin (**UNASSIGNED**)
1
0
Entering edit mode
8 months ago

Hi, after completing the MakeDefaultDirectoryPlugin step, I am now trying MakeInitialPHGDBPipelinePlugin. Following the documentation my config file contains at the top:

host=localHost
user=sqlite
password=sqlite
DB=/phg/phg.db
DBtype=sqlite

referenceFasta=/phg/inputDir/reference/ref.fna.gz
anchors=/phg/inputDir/reference/genes.bed
genomeData=/phg/inputDir/reference/load_genome_data.txt
refServerPath=/path/to/whatever
# liquibase results output directory, general output directory
outputDir=/phg/outputDir
liquibaseOutdir=/phg/outputDir

# plus the rest of the lines created automatically during MakeDefaultDirectoryPlugin, 
# see the first few ones:
HaplotypeGraphBuilderPlugin.configFile=**UNASSIGNED**
CreateIntervalBedFilesPlugin.dbConfigFile=**UNASSIGNED**
CreateIntervalBedFilesPlugin.refRangeMethods=**UNASSIGNED**
...

The actual Docker command is:

WORKING_DIR=/data/phg/
DOCKER_CONFIG_FILE=/phg/config.txt

docker run --name create_initial_db --rm \
  -v ${WORKING_DIR}/:/phg/ \
  -t maizegenetics/phg:1.9 \
  /tassel-5-standalone/run_pipeline.pl -Xmx100G \
  -debug -configParameters ${DOCKER_CONFIG_FILE} \
  -MakeInitialPHGDBPipelinePlugin -endPlugin

It seems to populate the SQLite database correctly, but then some errors occur:

[pool-2-thread-1] ERROR net.maizegenetics.util.Utils - getBufferedReader: Error getting reader for: **UNASSIGNED**
...
[pool-2-thread-1] ERROR net.maizegenetics.plugindef.AbstractPlugin - Utils: getBufferedReader: problem getting reader: **UNASSIGNED** (No such file or directory)
...
[pool-2-thread-1] ERROR net.maizegenetics.pangenome.liquibase.LiquibaseUpdatePlugin - LiquibaseUpdatePLugin: File /phg/outputDir/run_yes.txt does not exist.  CheckDBVersionPlugin has determined your database is not recent enough to be updated with liquibase.  It does not contain the variants table.
...
[pool-2-thread-1] ERROR net.maizegenetics.plugindef.AbstractPlugin - LiquibaseUpdatePlugin::processData: problem running liquibase

I guess I need to add more details to config.txt, but I would need some help,

thanks, Bruno

pangenome plants PHG • 1.1k views
ADD COMMENT
1
Entering edit mode
8 months ago
lcj34 ▴ 420

Yes, you are missing some default parameters. Will you post, or send me directly (lcj34@cornell.edu), your full log file? I need to see exactly which file it is trying to open that is undefined.

ADD COMMENT
0
Entering edit mode

Bruno - thanks for sending the log file. MakeInitialPHGDBPipelinePlugin calls LoadAllIntervalsToPHGdbPlugin, and that plugin has its own required parameters.

In the default config file, under the "#Required Parameters" section, you will need to fill out all parameters for the plugin LoadAllIntervalsToPHGdbPlugin. Replace "UNASSIGNED" with a value relative to your Docker folders.

LoadAllIntervalsToPHGdbPlugin.genomeData=UNASSIGNED
LoadAllIntervalsToPHGdbPlugin.outputDir=UNASSIGNED
LoadAllIntervalsToPHGdbPlugin.ref=UNASSIGNED
LoadAllIntervalsToPHGdbPlugin.refServerPath=UNASSIGNED
LoadAllIntervalsToPHGdbPlugin.gvcfServerPath=UNASSIGNED   // note - this one is no longer valid - ignore it
LoadAllIntervalsToPHGdbPlugin.anchors=UNASSIGNED
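
For illustration only, with the directory layout from your config above, the filled-in values might look something like this (adjust the paths to match your own Docker mounts):

LoadAllIntervalsToPHGdbPlugin.genomeData=/phg/inputDir/reference/load_genome_data.txt
LoadAllIntervalsToPHGdbPlugin.outputDir=/phg/outputDir
LoadAllIntervalsToPHGdbPlugin.ref=/phg/inputDir/reference/ref.fna.gz
LoadAllIntervalsToPHGdbPlugin.refServerPath=/path/to/whatever
LoadAllIntervalsToPHGdbPlugin.anchors=/phg/inputDir/reference/genes.bed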

See the example on this page of the documentation (under "Config File Parameters for this step"): https://bitbucket.org/bucklerlab/practicalhaplotypegraph/wiki/UserInstructions/MakeInitialPHGDBPipeline.md

Let me know if you have further questions.

ADD REPLY
0
Entering edit mode

Thanks lcj34

  • Some of those parameters were already set in config.txt without the "LoadAllIntervalsToPHGdbPlugin." prefix; should I duplicate them?
  • I can't see an example for '.ref' at the URL you mention; is it the same as 'reference'?

Any help appreciated, Bruno

ADD REPLY
0
Entering edit mode

No, you shouldn't need to duplicate them. The specific parameter missing based on the log file was the genomeData parameter. Do you have that set?

Regarding "ref" vs "referenceFasta". If you call LoadAllIntervalsToPHGdbPlugin() directly it looks for the "ref" parameter.

But when it is called from MakeInitialPHGDBPipelinePlugin, the pipeline looks for the "referenceFasta" parameter in the config file and passes that value along. That is what is shown in the documentation page I posted.

I agree - it is confusing that these are different. I think "referenceFasta" was used by multiple plugins, so MakeInitialPHGDBPipeline searched for that value to send as a parameter.
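
To illustrate (a minimal sketch using the reference path from the config above):

# Calling LoadAllIntervalsToPHGdbPlugin directly - the plugin expects "ref":
LoadAllIntervalsToPHGdbPlugin.ref=/phg/inputDir/reference/ref.fna.gz

# Running MakeInitialPHGDBPipelinePlugin - the pipeline reads "referenceFasta" and forwards that value as "ref":
referenceFasta=/phg/inputDir/reference/ref.fna.gz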

ADD REPLY
0
Entering edit mode

Thanks lcj34, it seems to be running now; I will see if it completes and post here if needed. I did need to set the following parameters, even though they were already set in config.txt without the LoadAllIntervalsToPHGdbPlugin. prefix (see above):

LoadAllIntervalsToPHGdbPlugin.genomeData=/phg/inputDir/reference/load_genome_data.txt
LoadAllIntervalsToPHGdbPlugin.outputDir=/phg/outputDir
LoadAllIntervalsToPHGdbPlugin.anchors=/phg/inputDir/reference/merged.bed

Note that my interval file contains merged gene regions, as some of the original ones overlapped (mostly on different strands) and I was warned about that. In case this helps anyone, I did the merging with:

bedtools merge -i original.bed -c 4 -o collapse > merged.bed
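
Note that bedtools merge expects input sorted by chromosome and then by start position, so if original.bed is not already sorted, something along these lines should work (a minimal sketch):

sort -k1,1 -k2,2n original.bed > original.sorted.bed
bedtools merge -i original.sorted.bed -c 4 -o collapse > merged.bed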
ADD REPLY
0
Entering edit mode

Eventually LoadAllIntervalsToPHGdbPlugin loaded all intervals from the input BED file, but then it tried to load variants as well and failed. I thought I was only loading intervals; how can I specify that I don't want to load any gVCF files yet? Please see the output:

....
[pool-2-thread-1] INFO net.maizegenetics.pangenome.db_loading.PHGdbAccess - putRefRangeRefRangeMethod: method_id 35628, total count loaded : 1
[pool-2-thread-1] INFO net.maizegenetics.pangenome.db_loading.LoadAllIntervalsToPHGdbPlugin - writeRefRangeRefRangeMethodTable finished
[pool-2-thread-1] INFO net.maizegenetics.pangenome.db_loading.LoadAllIntervalsToPHGdbPlugin - createLoadREfRanges: calling putRefAnchorData, hapMethodId= 1 size of anchorsToLoad 35627
[pool-2-thread-1] INFO net.maizegenetics.pangenome.db_loading.VariantLoadingUtils - bgzipping  file /phg/inputDir/reference/Morex.gvcf
[pool-2-thread-1] WARN net.maizegenetics.pangenome.db_loading.VariantLoadingUtils - 
ERROR 1 creating tabix indexed  version of file: /phg/inputDir/reference/Morex.gvcf.gz
[pool-2-thread-1] DEBUG net.maizegenetics.plugindef.AbstractPlugin - LoadAllIntervalsToPHGdbPlugin : error processing/loading intervals bgzipAndIndexGVCFfile: error bgzipping and/or tabix'ing file /phg/inputDir/reference/Morex.gvcf

java.lang.IllegalArgumentException: LoadAllIntervalsToPHGdbPlugin : error processing/loading intervals bgzipAndIndexGVCFfile: error bgzipping and/or tabix'ing file /phg/inputDir/reference/Morex.gvcf

In case it helps, this is what my file inputDir/reference/load_genome_data.txt looks like:

Genotype    Hapnumber   Dataline    ploidy  genesPhased chromsPhased    Confidence  Method  MethodDetails   gvcfServerPath
Morex   0   MorexV3 1   false   false   1   noChrUn ChrUn split in INSDC contigs    xxx.xxx.xxx.xxx;/data/phg/gvcf

Note that I added the last column because it complained otherwise. Thanks again lcj34, Bruno

ADD REPLY
1
Entering edit mode

The code is only processing the reference. Once it finishes loading the reference, it creates a gvcf file of the reference data, then tries to compress and index it with bgzip and tabix. The error indicates there was a problem indexing this file, and I assume this is due to sequence size. From the tabix (samtools) man page: "The tabix (.tbi) and BAI index formats can handle individual chromosomes up to 512 Mbp (2^29 bases) in length. If your input file might contain data lines with begin or end positions greater than that, you will need to use a CSI index."

We've seen this problem with large genomes, e.g. wheat. In phgv1 we use htsjdk to read and query gVCFs at later stages of the pipeline. This is where the indexed files are required, and htsjdk does not support CSI indices, only tbi. We have solved this in phgv2 by using TileDB and AGC and indexing with CSI.

The solution for you would be to split your chromosomes and/or try running with phgv2.
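
If you want to check whether any of your sequences exceed the .tbi limit, one quick way (a minimal sketch; assumes samtools is available and ref.fna is your uncompressed or bgzipped reference) is to look at the lengths in the FASTA index:

samtools faidx ref.fna
awk '$2 > 536870912 {print $1"\t"$2}' ref.fna.fai   # sequences longer than 2^29 bases (~512 Mbp)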

ADD REPLY
0
Entering edit mode

OK two more questions:

  • If we split large chromosomes, then it will be up to us to merge chr arms from the results output by phg, or does the v1 code help with that?
  • I guess you mean https://github.com/maize-genetics/phg_v2; is there a Docker image of v2? Is the documentation enough to run it?

Thanks so much for your help lcj34

ADD REPLY
1
Entering edit mode

Yes, you will need to merge chr arms; the v1 code does not handle that.

Regarding phg_v2: yes, that is the GitHub repository. I think the documentation should be sufficient. We are not using Docker, only conda. Also note that currently this only runs on Linux. We will support macOS soon, but have no plans to support Windows. However, you can run on Windows if you install Windows Subsystem for Linux (WSL). We have users who have done that successfully.

ADD REPLY
