Question

PHG database building for very large genomes

23 months ago

twrl8 • 0

Hello all!

I am trying to use the Practical Haplotype Graph to create a new PHG database and use it later on. I am using PHG version 1.2. Currently I am stuck at the second step, MakeInitialPHGDBPipelinePlugin. Looking at the -debug output, GetDBConnectionPlugin completes successfully, as do the first few steps of LoadAllIntervalsToPHGdbPlugin. However, once the GVCF file is supposed to be indexed, an error occurs.

Last successful operation:

[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.LoadAllIntervalsToPHGdbPlugin - writeRefRangeRefRangeMethodTable finished
[pool-1-thread-1] INFO net.maizegenetics.pangenome.db_loading.LoadAllIntervalsToPHGdbPlugin - createLoadREfRanges: calling putRefAnchorData, hapMethodId= 1 size of anchorsToLoad 422710

Followed by the error:

Dec 19, 2022 12:09:19 PM net.maizegenetics.pangenome.db_loading.VariantLoadingUtilsKt bgzipAndIndexGVCFfile
INFO: bgzipping  file /xxx/projects/P003_PHG_genome_build/data/PHG/inputDir/reference/ref.gvcf
Dec 19, 2022 12:09:20 PM net.maizegenetics.pangenome.db_loading.VariantLoadingUtilsKt bgzipAndIndexGVCFfile
WARNING:
ERROR 1 creating tabix indexed  version of file: /xxx/projects/P003_PHG_genome_build/data/PHG/inputDir/reference/ref.gvcf.gz
[pool-1-thread-1] DEBUG net.maizegenetics.plugindef.AbstractPlugin - LoadAllIntervalsToPHGdbPlugin : error processing/loading intervals bgzipAndIndexGVCFfile: error bgzipping and/or tabix'ing file /xxx/projects/P003_PHG_genome_build/data/PHG/inputDir/reference/ref.gvcf
java.lang.IllegalArgumentException: LoadAllIntervalsToPHGdbPlugin : error processing/loading intervals bgzipAndIndexGVCFfile: error bgzipping and/or tabix'ing file /xxx/projects/P003_PHG_genome_build/data/PHG/inputDir/reference/ref.gvcf
        at net.maizegenetics.pangenome.db_loading.LoadAllIntervalsToPHGdbPlugin.createLoadRefRanges(LoadAllIntervalsToPHGdbPlugin.kt:346)
        at net.maizegenetics.pangenome.db_loading.LoadAllIntervalsToPHGdbPlugin.processData(LoadAllIntervalsToPHGdbPlugin.kt:170)
        at net.maizegenetics.plugindef.AbstractPlugin.performFunction(AbstractPlugin.java:111)
        at net.maizegenetics.pangenome.pipeline.MakeInitialPHGDBPipelinePlugin.loadGenomeIntervals(MakeInitialPHGDBPipelinePlugin.kt:83)
        at net.maizegenetics.pangenome.pipeline.MakeInitialPHGDBPipelinePlugin.processData(MakeInitialPHGDBPipelinePlugin.kt:36)
        at net.maizegenetics.plugindef.AbstractPlugin.performFunction(AbstractPlugin.java:111)
        at net.maizegenetics.plugindef.AbstractPlugin.dataSetReturned(AbstractPlugin.java:2017)
        at net.maizegenetics.plugindef.ThreadedPluginListener.run(ThreadedPluginListener.java:29)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin -
Usage:
LoadAllIntervalsToPHGdbPlugin <options>
-ref <Reference Genome File> : Referemce Genome File for aligning against  (required)
-anchors <Anchors File> : Tab-delimited file containing Chrom, StartPosition, EndPosition, Type (required)
-genomeData <Genome Data File> : Path to tab-delimited file containing genome specific data with header line:
Genotype Hapnumber Dataline Ploidy Reference GenePhased ChromPhased Confidence Method MethodDetails gvcfServerPath
The gvcfServerPath column should hold a semi-colon separated servername and path where gvcf files will be uploaded, e.g. 128.9.9.9;/path/to/gvcfs/  (required)
-outputDir <Output Directory> : Directory to write liquibase changeLogSync output  (required)
-refServerPath <Reference Server Path> : String that contains a server name or ip address, followed by a semi-colon, then the file path where the reference genome will be stored for future access.  This ia a more permanent location, not where the genome file lives for processing via this plugin. (required)
-isTestMethod <true | false> : Indication if the data is to be loaded against a test method. Data loaded with test methods are not cached with the PHG ktor server (Default: false)

[pool-1-thread-1] ERROR net.maizegenetics.plugindef.AbstractPlugin - LoadAllIntervalsToPHGdbPlugin : error processing/loading intervals bgzipAndIndexGVCFfile: error bgzipping and/or tabix'ing file /xxx/projects/P003_PHG_genome_build/data/PHG/inputDir/reference/ref.gvcf
[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.MakeInitialPHGDBPipelinePlugin - Done loading Genome Intervals step.
[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.MakeInitialPHGDBPipelinePlugin - Checking if Liquibase can be run.
[pool-1-thread-1] INFO net.maizegenetics.pangenome.pipeline.MakeInitialPHGDBPipelinePlugin - Liquibase can be run.  Setting it up using changelogsync.
[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin - Starting net.maizegenetics.pangenome.liquibase.LiquibaseUpdatePlugin: time: Dec 19, 2022 12:09:20
[pool-1-thread-1] ERROR net.maizegenetics.plugindef.AbstractPlugin - -outputDir is required.

[pool-1-thread-1] INFO net.maizegenetics.plugindef.AbstractPlugin -
Usage:
LiquibaseUpdatePlugin <options>
-outputDir <Output Directory> : Directory path to write any liquibase output files. (required)
-command <Liquibase command> : Command for liquibase to execute: must be update or changeLogSync, defaults to update. (Default: update)

So far I have been unable to find the bgzipAndIndexGVCFfile() function in the Bitbucket repository, but I would guess that it tries to create a tabix index for the GVCF. The trouble is that my genome(s) have chromosomes longer than the ~512 Mbp (2^29 bp) per-sequence limit of tabix-style (.tbi) indexing, so they would have to be indexed in the CSI style.

Is this supported with PHG? Does anyone have any experiences with this?

Many thanks in advance!!

score 1 · Accepted Answer · 2022-12-20

The older versions of PHG stored variants in a variant table in the db. We found this to be problematic. When the dbs got large, and these tables grew, the PHG would sometimes hang trying to pull variants.

Because of this, the new schema (PHG version 1.0 and greater) stores variants externally in GVCF files. This has proved to be much more performant, and has the added benefit that the variants are stored in a standard format.

As a result, we use htsjdk to read and query the GVCFs. This is where the indexed files are required, and htsjdk does not support CSI indexes, only TBI. There have been requests for htsjdk to add CSI support, but to our knowledge there are currently no plans for it.
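
To see which sequences are affected, you can compare the lengths in the reference FASTA index against the TBI limit. The following is a minimal sketch in Python, not part of PHG. It assumes a standard `samtools faidx` style `.fai` file (tab-delimited, sequence name in column 1 and length in column 2) and a per-sequence TBI limit of 2^29 bp. The function name and the example data are hypothetical.

```python
# Assumption: the .tbi binning scheme covers coordinates up to 2^29 bp per sequence.
TBI_MAX_LENGTH = 2 ** 29  # 536,870,912 bp


def oversized_sequences(fai_text, limit=TBI_MAX_LENGTH):
    """Return (name, length) pairs for sequences longer than `limit`.

    `fai_text` is the content of a FASTA index (.fai) file:
    one tab-delimited line per sequence, name first, length second.
    """
    oversized = []
    for line in fai_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        name, length = fields[0], int(fields[1])
        if length > limit:
            oversized.append((name, length))
    return oversized


# Hypothetical two-chromosome index; only chr1 exceeds the limit.
example_fai = (
    "chr1\t600000000\t6\t60\t61\n"
    "chr2\t400000000\t610000006\t60\t61\n"
)
print(oversized_sequences(example_fai))
```

Any sequence reported here would need to be split before its GVCF can be tabix-indexed.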

To load your data into the PHG successfully, you'll need to split your chromosomes so that each piece is short enough for TBI indexing. We are working on updates to the code that will remove the need for indexed GVCFs, but that update won't be available until sometime in the new year. We understand this is an issue that must be addressed, and we hope to have a resolution in our code soon.
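
As a rough illustration of planning such a split, the Python sketch below divides a chromosome into the fewest near-equal pieces that each stay within an assumed 2^29 bp per-sequence limit. It is not PHG code: the function name and the `_part` naming scheme are hypothetical, and it ignores biology entirely. In practice you would likely want to move each breakpoint so it does not fall inside a reference range, and the anchors file and reference GVCF would need to use the same split coordinates.

```python
import math

# Assumption: pieces must not exceed the 2^29 bp per-sequence limit of .tbi indexes.
TBI_MAX_LENGTH = 2 ** 29  # 536,870,912 bp


def plan_splits(name, length, max_piece=TBI_MAX_LENGTH):
    """Return a list of (piece_name, start, end) with 1-based inclusive coordinates.

    Sequences that already fit are returned unchanged as a single piece.
    """
    n_pieces = max(1, math.ceil(length / max_piece))
    if n_pieces == 1:
        return [(name, 1, length)]
    piece_len = math.ceil(length / n_pieces)  # near-equal piece size
    pieces = []
    for i in range(n_pieces):
        start = i * piece_len + 1
        end = min((i + 1) * piece_len, length)
        pieces.append((f"{name}_part{i + 1}", start, end))
    return pieces


# Hypothetical 600 Mbp chromosome: split into two 300 Mbp pieces.
for piece in plan_splits("chr1", 600_000_000):
    print(piece)
```

Splitting into near-equal pieces rather than filling each piece to the limit leaves headroom to shift breakpoints into intergenic regions.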